Merge branch 'master' of https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1

bc590932 · Samuel Innes · dd4f0590 · b2187da8 · bc590932
Commit bc590932 authored 3 years ago by Samuel Innes
--- a/project/README.md
+++ b/project/README.md
@@ -8,7 +8,7 @@ This project takes as input a text in a particular Slavic language and returns w
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#motivation">Motivation</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#similar-projects">Similar Projects</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#the-slavic-language-family">The Slavic language family</a>
- <a href="">Model</a>
+- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#model">Model</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#supported-languages">Supported languages</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#corpus">Corpus</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#features">Features</a> 
@@ -18,7 +18,7 @@ This project takes as input a text in a particular Slavic language and returns w
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#parameters">Parameters</a>
      - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#decision-tree-and-random-forest">Decision Tree and Random Forest</a>
      - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#neural-network">Neural Network</a>
-      - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.mdnaive-bayes">Naive Bayes</a>
+      - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#naive-bayes">Naive Bayes</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">Feature importance</a>
 - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#performance">Performance</a>
    - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#baselines">Baselines</a>
@@ -26,8 +26,6 @@ This project takes as input a text in a particular Slavic language and returns w
 - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#technical-details">Technical details</a>
    - <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#system-requirements">System Requirements</a>
    - <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#installation">Installation</a>
-    - <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#todos">TODOs</a>
-    - <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#common-problems">Common Problems</a>
    - <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#author">Author</a>
 - <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#references">References</a>

@@ -503,7 +501,7 @@ As expected, MLP, Random Forest and Decision Tree all performed better than the

 The learning curve shows how the accuracy increases as the size of the training data increases. For this the sklearn function `learning_curve()` was used, which computes cross-validated training and test scores for increasing training set sizes; these scores were then plotted. The training set sizes were the following: `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000]`. The graph also shows the comparison with two baselines, Naive Bayes and the Random Class classifier, as mentioned above. The results show a steady increase in accuracy up to around 1000 data points; beyond that point the accuracy keeps increasing, but more slowly.

-<img src="images/learning_curve_w_mlp_fade.png" width=800><!--IS THIS TOO CHAOTIC?!-->
+<img src="images/learning_curve_w_mlp_fade.png" width=600>

 UPDATE THIS DESCRIPTION
 1000 data points is roughly 63 data points per class (assuming a uniform distribution). This is evidently enough to learn the most important features, namely the alphabets (as we can see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">above</a>). It is also interesting to note that Decision Tree and Random Forest both reach 100% accuracy on the training set, whereas Naive Bayes only reaches around 90%.
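
The learning-curve computation described in this hunk can be sketched roughly as follows. This is only a minimal illustration, not the project's actual script: the data is a synthetic stand-in generated with `make_classification`, the class count of 16 is an assumption inferred from the "63 data points per class" figure, and the list of training set sizes is shortened. Note that `learning_curve()` returns the scores rather than drawing a figure, so plotting is left out here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

N_CLASSES = 16  # assumption: inferred from "1000 points is roughly 63 per class"

# Synthetic stand-in data (assumption); the real project uses its own features.
X, y = make_classification(
    n_samples=2000,
    n_features=30,
    n_informative=12,
    n_classes=N_CLASSES,
    n_clusters_per_class=1,
    random_state=0,
)

# Absolute training set sizes; a shortened version of the list in the text.
sizes = [100, 200, 500, 1000]

train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    train_sizes=sizes,
    cv=5,
    shuffle=True,
    random_state=0,
)

# Average over the cross-validation folds to get one point per training size.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)

for n, tr, te in zip(train_sizes, mean_train, mean_test):
    print(f"n={n:5d}  per class={n / N_CLASSES:6.1f}  train={tr:.3f}  test={te:.3f}")
```

An unpruned decision tree can typically memorise its training data, which is consistent with the 100% training accuracy noted above; the gap between `mean_train` and `mean_test` is what the learning curve makes visible.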
@@ -603,12 +601,16 @@ It is therefore clear that, while no precise accuracy can be given for real-worl
 This model focussed purely on Slavic languages, but it could easily be extended to other language groups which have a Wikipedia. To do this, one would have to update the alphabet, and it might be necessary to change the number n of most common words used in the BoW model. A more complex model could also use N-grams to capture morphological information in a more sophisticated way.

 <img src="images/birch-forest-crop.jpg" width=2000/>
+# Technical details

 ## System Requirements
 Computer<br>
 Python 3.9 (or similar)<br>
 Scikit-learn<br>

+## Installation
+For details on how to download and use the model, see the <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/README.md">introductory README for the seminar</a>.
+
 ## Author
 Samuel Innes: dd257@stud.uni-heidelberg.de