Commit bc590932 authored by Samuel Innes
parents dd4f0590 b2187da8
......@@ -8,7 +8,7 @@ This project takes as input a text in a particular Slavic language and returns w
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#motivation">Motivation</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#similar-projects">Similar Projects</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#the-slavic-language-family">The Slavic language family</a>
- <a href="">Model</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#model">Model</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#supported-languages">Supported languages</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#corpus">Corpus</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#features">Features</a>
......@@ -18,7 +18,7 @@ This project takes as input a text in a particular Slavic language and returns w
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#parameters">Parameters</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#decision-tree-and-random-forest">Decision Tree and Random Forest</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#neural-network">Neural Network</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.mdnaive-bayes">Naive Bayes</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#naive-bayes">Naive Bayes</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">Feature importance</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#performance">Performance</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#baselines">Baselines</a>
......@@ -26,8 +26,6 @@ This project takes as input a text in a particular Slavic language and returns w
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#technical-details">Technical details</a>
- <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#system-requirements">System Requirements</a>
- <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#installation">Installation</a>
- <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#todos">TODOs</a>
- <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#common-problems">Common Problems</a>
- <a href = "https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#author">Author</a>
- <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#references">References</a>
......@@ -503,7 +501,7 @@ As expected, MLP, Random Forest and Decision Tree all performed better than the
The learning curve shows how accuracy increases with the size of the training data. It was computed with the sklearn function `learning_curve()`, which returns training and cross-validation scores for a range of training-set sizes. The training-set sizes were the following: `[10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 2000, 3000, 4000, 5000]`. The graph also shows a comparison with two baselines, Naive Bayes and a Random Class classifier, as mentioned above. The results show a steady increase in accuracy up to around 1000 data points, after which the increase slows but is still present.
<img src="images/learning_curve_w_mlp_fade.png" width=800><!--IS THIS TOO CHAOTIC?!-->
<img src="images/learning_curve_w_mlp_fade.png" width=600>
UPDATE THIS DESCRIPTION
1000 data points is roughly 63 data points per class (assuming a uniform distribution). This is evidently enough to learn the most important features (as we can see <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/project/README.md#feature-importance">above</a>: the alphabets). It is also worth noting that Decision Tree and Random Forest both reach 100% accuracy on the training set, whereas Naive Bayes only reaches around 90%.
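
The learning-curve computation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the synthetic dataset from `make_classification`, the helper name `compute_learning_curve`, the 5-fold cross-validation and the short list of training-set sizes are all assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier


def compute_learning_curve(train_sizes, cv=5, seed=0):
    """Return mean training and validation accuracy per training-set size.

    Hypothetical helper: the data is synthetic and only stands in for the
    real feature matrix and language labels.
    """
    X, y = make_classification(
        n_samples=600,
        n_features=20,
        n_informative=10,
        n_classes=4,
        random_state=seed,
    )
    # learning_curve() only computes scores; plotting is a separate step.
    sizes, train_scores, val_scores = learning_curve(
        DecisionTreeClassifier(random_state=seed),
        X,
        y,
        train_sizes=train_sizes,
        cv=cv,
        scoring="accuracy",
    )
    # Average the scores over the cross-validation folds.
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)


sizes, train_acc, val_acc = compute_learning_curve([10, 50, 100, 200, 400])
for n, tr, va in zip(sizes, train_acc, val_acc):
    print(f"n={n:4d}  train accuracy={tr:.2f}  validation accuracy={va:.2f}")
```

An unpruned decision tree fits its training folds perfectly, which mirrors the 100% training accuracy noted above; the validation column is the one that corresponds to the plotted curve.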
......@@ -603,12 +601,16 @@ It is therefore clear that, while no precise accuracy can be given for real-worl
This model was purely focussed on Slavic languages, but it could easily be extended to other language groups that have a Wikipedia. To do this, one would have to update the alphabet, and it might be necessary to change the number of most common words (n) used in the BoW model. A more complex model could also use N-grams to capture morphological information in a more sophisticated way.
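
As a rough sketch of the N-gram idea, character n-grams can be extracted with sklearn's `CountVectorizer`. The README does not specify how N-grams would be implemented, so the choice of character-level n-grams, the range of 2 to 3 characters, the helper name `char_ngram_features` and the two sample strings are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer


def char_ngram_features(texts, n_min=2, n_max=3):
    """Count character n-grams of length n_min..n_max in each text.

    Hypothetical helper, not part of the project.
    """
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(n_min, n_max))
    counts = vectorizer.fit_transform(texts)  # sparse matrix: texts x n-grams
    return vectorizer, counts


# Two made-up sample strings, one in Cyrillic and one in Latin script.
vectorizer, counts = char_ngram_features(["привет мир", "dobrý den"])
print(counts.shape)
print(sorted(vectorizer.vocabulary_)[:10])
```

The resulting count matrix could be concatenated with the existing alphabet and BoW features, at the cost of a much larger feature space.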
<img src="images/birch-forest-crop.jpg" width=2000/>
# Technical details
## System Requirements
Computer<br>
Python 3.9 (or similar)<br>
Scikit-learn<br>
## Installation
For details on how to download and use the model, see the <a href="https://gitlab.cl.uni-heidelberg.de/innes/exp-ml-1/-/blob/master/README.md">introductory README for the seminar</a>.
## Author
Samuel Innes: dd257@stud.uni-heidelberg.de