@@ -503,7 +501,7 @@ As expected, MLP, Random Forest and Decision Tree all performed better than the
The learning curve shows how the accuracy increases as the size of the training data grows. For this, the sklearn method `learning_curve()` was used, which can compute a learning curve automatically. The training set sizes were the following: `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000]`. The graph also shows the comparison with the two baselines mentioned above: Naive Bayes and the random class classifier. The results show a steady increase in accuracy up to around 1000 datapoints, from which point the increase in accuracy slows but is still present.
<img src="images/learning_curve_w_mlp_fade.png" width=800>
1000 datapoints is roughly 63 datapoints per class (assuming a uniform distribution). This is evidently enough to learn the most important features (as we can see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">above</a>: the alphabets). It is also interesting to note that Decision Tree and Random Forest both reach 100% accuracy on the training set, whereas Naive Bayes only reaches around 90%.
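For context, the following is a minimal sketch of how such a curve can be produced with `learning_curve()`. It is not the project's actual code: the synthetic data from `make_classification` is only a placeholder for the real feature matrix and language labels, and `GaussianNB` and a uniform `DummyClassifier` stand in for the two baselines.

```python
# Minimal sketch (placeholder data, not the project's features):
# plot test accuracy against training-set size with sklearn's learning_curve().
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Placeholder stand-in for the real BoW feature matrix and language labels.
X, y = make_classification(n_samples=7000, n_features=100, n_informative=30,
                           n_classes=16, random_state=0)

train_sizes = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500,
               600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000]

models = [("Random Forest", RandomForestClassifier(random_state=0)),
          ("Naive Bayes (baseline)", GaussianNB()),
          ("Random Class (baseline)", DummyClassifier(strategy="uniform"))]

for name, clf in models:
    # learning_curve() refits the classifier for each training size and CV fold.
    sizes, _, test_scores = learning_curve(clf, X, y, train_sizes=train_sizes,
                                           cv=5, scoring="accuracy")
    plt.plot(sizes, test_scores.mean(axis=1), label=name)

plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("learning_curve_sketch.png")
```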
...
...
@@ -603,12 +601,16 @@ It is therefore clear that, while no precise accuracy can be given for real-worl
This model focussed purely on Slavic languages, but it could easily be extended to other language groups that have a Wikipedia. To do this, one would have to update the alphabet, and it might be necessary to change the number of n most common words in the BoW model. A more complex model could also use N-grams to capture morphological information in a more sophisticated way.
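As a rough illustration of the N-gram idea (not part of the submitted model), character N-grams could be combined with the word-level BoW features via a `FeatureUnion`; the sentences, labels and parameter values below are purely hypothetical.

```python
# Hedged sketch: adding character N-gram features alongside a word-level BoW
# model when extending to a new language group. All data here is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

texts = ["ein Beispielsatz", "une phrase d'exemple"]   # placeholder sentences
labels = ["de", "fr"]                                  # placeholder language labels

features = FeatureUnion([
    # word-level BoW, limited to the n most common words
    # (n would likely need re-tuning for a new language group)
    ("bow", CountVectorizer(max_features=500)),
    # character N-grams capture sub-word / morphological cues
    ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

clf = Pipeline([("features", features), ("nb", MultinomialNB())])
clf.fit(texts, labels)
print(clf.predict(["noch ein deutscher Satz"]))
```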
For details on how to download and use the model, see the <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/README.md">introductory README for the seminar</a>.