Commit 3e00e318 authored by innes

Clean data_preparation_WORKFLOW.txt

parent 73b268fc
Workflow:
- find out if number of articles is sufficient
- download all corpora onto computer, one document after another (create file in repository for corpora)
- train_test_split
- generate features with 10 most common words and with combined alphabet
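A minimal sketch of the split step, assuming each corpus is held as a list of article strings. The workflow's train_test_split presumably refers to the scikit-learn function of that name; the standard-library version below is a stand-in so the sketch has no dependencies. The names split_articles, test_fraction and seed are hypothetical.

```python
import random


def split_articles(articles, test_fraction=0.2, seed=0):
    """Shuffle a copy of the articles and return (train, test)."""
    shuffled = list(articles)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]
```

Splitting before any word counting keeps the test articles out of the most-common-word statistics.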
FOR language IN languages:
- TRAINING: tokenise (create new file with list of all tokenised words)
- Counter for n most common words
FOR article IN articles:
- count occurrences of n most common words
@@ -14,4 +14,4 @@ FOR language IN languages:
- label with language
- add to list of cleaned training data
#TRAIN DATA FINISHED
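
The training-data loop above could be sketched roughly as follows. This is an illustration under assumptions, not the repository's code: the training articles are assumed to arrive as a dict mapping language to a list of strings, the per-language word lists are assumed to be merged into one shared feature vocabulary, and the steps hidden between the diff hunks are not reproduced. The names tokenise, most_common_words and build_training_data are hypothetical.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and return its runs of letters (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def most_common_words(articles, n=10):
    """Return the n most frequent words across a list of articles."""
    counts = Counter()
    for article in articles:
        counts.update(tokenise(article))
    return [word for word, _ in counts.most_common(n)]


def build_training_data(train_by_language, n=10):
    """Return (vocab, rows); each row is (feature_counts, language_label)."""
    # Assumption: the top-n words of every language form one shared vocabulary.
    vocab = []
    for articles in train_by_language.values():
        for word in most_common_words(articles, n):
            if word not in vocab:
                vocab.append(word)
    rows = []
    for language, articles in train_by_language.items():
        for article in articles:
            counts = Counter(tokenise(article))
            # One count per vocabulary word, labelled with the language.
            rows.append(([counts[word] for word in vocab], language))
    return vocab, rows
```

Each row pairs the word counts of one article with its language label, which matches the "label with language" and "add to list of cleaned training data" steps.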