In practice, this proved impractical due to the large number of classes and features.
However, both the Decision Tree and the Random Forest classifiers achieved a relatively high accuracy (see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#evaluation">Evaluation</a>), with the latter predictably being the best of the two on the test set. Since the problem is in effect relatively simple and the number of classes high, a Random Forest classifier is a very good choice.
With its flexibility and capability of highlighting certain feature combinations, a Multi-Layer Perceptron was also trained on the data. As explained later, after optimising the parameters (see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#development">Development</a>), it led to the best results on the test set, despite the small size of the dataset. However, its training time was noticeably longer.
For comparison, a Naive Bayes model was also trained on the data to use as a baseline. For a discussion of the results of the different algorithms see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#evaluation">Evaluation</a>.
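As a rough illustration of this setup, the following sketch trains the four classifiers with scikit-learn. The feature matrices `X_train`/`X_test` and label vectors `y_train`/`y_test` are assumed to already exist, and the hyperparameters and Naive Bayes variant shown here are assumptions, not necessarily those used in the project.

```python
# Illustrative sketch only: training the four classifiers compared in this project.
# Assumes X_train, X_test, y_train, y_test have already been built from the dataset.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB  # NB variant is an assumption

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
    "Naive Bayes (baseline)": MultinomialNB(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # train on the feature matrix
    print(name, clf.score(X_test, y_test))     # accuracy on the held-out test set
```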
With this table it becomes clear that the characters (alphabets) are more informative than the 10 most frequent words; however, the Decision Tree, Random Forest and MLP algorithms are most accurate when the two feature groups are combined. It is interesting to note, however, that the accuracy of Naive Bayes decreases with the introduction of the BoW features. This could be due to the fact that Naive Bayes assumes that all features are independent of each other (<a href="https://nlp.stanford.edu/IR-book/html/htmledition/properties-of-naive-bayes-1.html">the Naive Bayes assumption</a>). In this case the assumption clearly does not hold: as the results show, the most informative signal comes from the combination of word and character information, which a Naive Bayes classifier cannot exploit. Introducing the BoW features therefore only adds noise, and the accuracy decreases.
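As a purely illustrative sketch (the project's real feature extraction is defined elsewhere in this repository; the word list and character inventory below are invented), the combined feature vector can be thought of as counts of the most frequent words concatenated with counts of language-specific characters:

```python
# Illustrative sketch only -- the project's actual feature extraction may differ.
# Combines two feature groups: counts of the 10 most frequent words (BoW)
# and counts of language-specific characters ("alphabet" features).
from collections import Counter

TOP_WORDS = ["и", "в", "на", "не", "с", "по", "за", "і", "у", "до"]   # hypothetical list
ALPHABET_CHARS = list("ѣѫєіїґўћђљњčšžō")                              # hypothetical inventory

def extract_features(text):
    words = Counter(text.lower().split())
    chars = Counter(text.lower())
    bow = [words[w] for w in TOP_WORDS]          # word-level (BoW) features
    alpha = [chars[c] for c in ALPHABET_CHARS]   # character-level (alphabet) features
    return bow + alpha                           # combined feature vector
```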
After the above discussion on which features to include in the model, it is of great interest to see exactly which features are the most "important" for the model. <a href="https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html">Using the sklearn built-in attribute `feature_importances_`</a> and the code given <a href="development/feature_importance.py">here</a>, this graph was plotted to show the MDI (Mean Decrease in Impurity) of each feature for a Random Forest classifier. Roughly the first half of the features represents the BoW features, and the second half the alphabets.
<img src="images/feature_importances_NOLABELS.png" width=500>
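For reference, a minimal sketch of how such a plot can be produced; it assumes an already fitted Random Forest `forest` and a matching list `feature_names`, and is not the repository's own script (which is linked above):

```python
# Sketch: plotting MDI-based feature importances of a fitted Random Forest.
# Assumes `forest` has been fitted and `feature_names` matches its input columns.
import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(forest.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).plot.bar(figsize=(12, 4))
plt.ylabel("Mean decrease in impurity (MDI)")
plt.tight_layout()
plt.savefig("feature_importances.png")
```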
# Performance
## Baselines
Two baselines were written from scratch: a **Random baseline** and a **Majority baseline** (which can be found <a href="evaluation/dummy_classifiers.py">here</a>). The Random baseline randomly picks a label based on the class probabilities in the training data. The Majority baseline, however, simply predicts the most common class every time. In the evaluation, the Naive Bayes algorithm was also used as a baseline.
If the model has a lower accuracy than the Random baseline, then it predicts worse than randomly assigning a class. This could mean that the model has learned the wrong features (for example noise). In a binary classification task, performance could then be improved simply by predicting the opposite of whatever the model outputs; in this case, with 16 classes, that is impossible. Instead, the dataset itself must be cleaned to make sure the algorithm learns the right features.
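A minimal from-scratch sketch of two such baselines (illustrative only; the project's own implementations are in `evaluation/dummy_classifiers.py` and may differ):

```python
# Illustrative sketch of the two baselines written from scratch.
import random
from collections import Counter

class MajorityBaseline:
    def fit(self, y_train):
        self.majority = Counter(y_train).most_common(1)[0][0]  # most frequent class
    def predict(self, X):
        return [self.majority for _ in X]                      # always the same label

class RandomBaseline:
    def fit(self, y_train):
        counts = Counter(y_train)
        self.labels = list(counts)
        total = sum(counts.values())
        self.weights = [counts[l] / total for l in self.labels]  # class probabilities
    def predict(self, X):
        return random.choices(self.labels, weights=self.weights, k=len(X))
```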
After generating the dataset, Cross-Validation with 5 folds was carried out immediately.
As expected, MLP, Random Forest and Decision Tree all performed better than the Naive Bayes baseline, with an accuracy well over the initial goal of 90%. It is also clear that MLP and Random Forest achieve a better accuracy than Decision Tree, which was likewise expected. The result of the Neural Network is not surprising either, as such networks are renowned for their ability to achieve high accuracy by combining features in more complex ways than 'simpler' algorithms such as Decision Trees. Although the corpus was small, the MLP performed better than the other classifiers used, and thus the <a href="final_model.zip">final model</a> uses an optimised MLP classifier as well as a Random Forest for comparison.
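A minimal sketch of such a 5-fold evaluation with scikit-learn, assuming the feature matrix `X` and label vector `y` already exist (shown here for a Random Forest only):

```python
# Sketch: 5-fold cross-validation; X and y (features and labels) are assumed to exist.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, cv=5)            # accuracy on each of the 5 folds
print(f"mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```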
The learning curve shows how the accuracy increases as the size of the training data grows. For this the sklearn function `learning_curve()` was used, which computes the scores needed to plot such a curve. The training-set sizes were the following: `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000]`. The graph also shows the comparison with two baselines, Naive Bayes and the Random baseline, as mentioned above. The results show a steady increase in accuracy up to around 1000 training examples, from which point the increase slows but is still present.
<img src="images/learning_curve_w_mlp_fade.png" width=600>
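For reference, a sketch of how these scores can be computed and plotted; `X`, `y` and a classifier `clf` are assumed to exist, and the sizes are those listed above (absolute sizes require a large enough dataset):

```python
# Sketch: computing and plotting a learning curve; X, y and clf are assumed to exist.
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

sizes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
         200, 300, 400, 500, 600, 700, 800, 900, 1000,
         2000, 3000, 4000, 5000]
train_sizes, train_scores, test_scores = learning_curve(clf, X, y,
                                                        train_sizes=sizes, cv=5)

plt.plot(train_sizes, test_scores.mean(axis=1), label="cross-validation accuracy")
plt.plot(train_sizes, train_scores.mean(axis=1), label="training accuracy")
plt.xlabel("Number of training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("learning_curve.png")
```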
1000 data points, roughly 63 data points per class, is evidently enough to learn the most important features (as we can see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">above</a>, in most cases this means particular letters). It is also interesting to note that Decision Tree and Random Forest both reach 100% accuracy on the training set, whereas Naive Bayes only reaches around 90%.
In order to obtain a more reliable measure of how well the model works, here is a breakdown of the precision, recall and F1-score per class using a Random Forest classifier. This was created using the `classification_report()` function in `sklearn.metrics` (source code <a href="evaluation/plot_classification_report.py">here</a>).
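A minimal sketch of generating such a report, assuming a fitted Random Forest `forest` and held-out test data (the project's own plotting script is linked above):

```python
# Sketch: per-class precision, recall and F1 for a fitted Random Forest.
# Assumes forest, X_test and y_test already exist.
from sklearn.metrics import classification_report

y_pred = forest.predict(X_test)
print(classification_report(y_test, y_pred))   # one row of metrics per language
```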
However, the above metrics are generated by testing the model on the Wikipedia corpus.
> Все люди рождаются свободными и равными в своем достоинстве и правах. Они наделены разумом и совестью и должны поступать в отношении друг друга в духе братства.
Article 1 is comparatively short - the English text is `30` words long, the Russian `26` - so the model works less well, with both algorithms classifying 13 of the 16 languages correctly. Using the somewhat longer Article 2 - `79` words in Russian - the accuracy increased. The Article 1 translations were retrieved from <a href="https://omniglot.com/udhr/index.htm">Omniglot</a>, and the Article 2 texts from the <a href="https://www.ohchr.org/EN/UDHR/Pages/SearchByLang.aspx">United Nations official translations</a>.
| Supported languages: | Article 1 (MLP) | Article 1 (RF) | Article 2 (MLP) | Article 2 (RF) |
| --- | --- | --- | --- | --- |
| Rusyn | :white_check_mark: | :white_check_mark: | N/A | N/A |
| Old Church Slavonic: Cyrillic | :white_check_mark: | :white_check_mark: | N/A | N/A |
To further test the languages for which Article 2 was not available, paragraphs were taken from other sources, which was sometimes tricky as these 5 languages have little to no internet presence. Below is an example, the first two lines of the Lord's Prayer, which was correctly classified as Silesian, together with the output.
```
Enter text to be classified: Ôjcze nŏsz, kery jeżeś we niebie, bydź poświyncōne miano Twoje.
...
Ukrainian: 0.0
Would you like to enter another text? (y/n)
```
The next highest probability was Polish, which is to be expected due to the similarity of the two languages (<a href="https://cadmus.eui.eu/bitstream/handle/1814/1351/HEC03-01.pdf">some regard Silesian as a dialect of Polish</a>). The remaining probability is assigned mostly to other languages written in the Latin script, which is also unsurprising.
It is therefore clear that, while no precise accuracy can be given for real-world data, the model works very reliably, provided the document is longer than a certain length (around 50 words). One possible way of calculating more reliable metrics here, without manual annotation, would be to take several single-language corpora, whose labels are known in advance, and compute the metrics on that data, as sketched below.
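A sketch of that idea, assuming a hypothetical `corpora/<language>.txt` layout (one plain-text file per supported language), a trained classifier `clf`, and a feature-extraction helper like the `extract_features()` sketch above:

```python
# Sketch: estimating real-world accuracy from single-language corpora.
# Hypothetical layout: corpora/<language>.txt, one file per supported language.
from pathlib import Path

correct, total = 0, 0
for path in Path("corpora").glob("*.txt"):
    language = path.stem                                   # the file name is the gold label
    for paragraph in path.read_text(encoding="utf-8").split("\n\n"):
        if len(paragraph.split()) < 50:                    # skip texts below ~50 words
            continue
        prediction = clf.predict([extract_features(paragraph)])[0]
        correct += (prediction == language)
        total += 1
print(f"Accuracy on single-language corpora: {correct / total:.3f}")
```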