Clean data_preparation_WORKFLOW.txt

Commit 3e00e318 authored 3 years ago by innes
--- a/project/data/data_preparation_WORKFLOW.txt
+++ b/project/data/data_preparation_WORKFLOW.txt
 Workflow:
- find out if no of articles is sufficient (research papers)
+- find out if number of articles is sufficient
 - download all corpora onto computer (create file in repository for corpora) as one document after another
- train_test_split!!!!!
+- train_test_split
+- generate features with 10 most common words and with combined alphabet

 FOR language IN languages:

-	- TRAINING: tokenise (create new file with list of all tokenised words),
-	- Counter for n most common words
+	- TRAINING: tokenise (create new file with list of all tokenised words)

 	FOR article IN articles:
 		- count occurrences of n most common words
@@ -14,4 +14,4 @@ FOR language IN languages:
 		- label with language
 		- add to list of cleaned training data

-#TRAIN DATA FINISHED
\ No newline at end of file
+#TRAIN DATA FINISHED
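
For reference, the training part of the workflow after this commit could be sketched in Python roughly as follows. This is only an illustration, not the code used in the repository: the function names (tokenise, split_train_test, most_common_words, build_training_rows), the regex tokeniser, the 20% test fraction and the dict-of-lists layout for the corpora are all assumptions, and a plain shuffle-based split stands in for whatever train_test_split implementation the project actually uses. The "combined alphabet" features are left out.

```python
import random
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and return its runs of letters as tokens (assumed tokeniser)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def split_train_test(articles, test_fraction=0.2, seed=0):
    """Shuffle a copy of the articles and split it into (train, test).

    Hypothetical stand-in for the workflow's train_test_split step.
    """
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def most_common_words(articles, n=10):
    """Return the n most frequent tokens across the given articles."""
    counts = Counter()
    for article in articles:
        counts.update(tokenise(article))
    return [word for word, _ in counts.most_common(n)]


def build_training_rows(corpora, n=10):
    """Turn {language: [article, ...]} into (vocabulary, labelled feature rows)."""
    # Split first so the test articles never influence the features.
    train = {lang: split_train_test(arts)[0] for lang, arts in corpora.items()}

    # Vocabulary: the n most common training words of every language, combined.
    vocabulary = []
    for articles in train.values():
        for word in most_common_words(articles, n):
            if word not in vocabulary:
                vocabulary.append(word)

    # FOR language IN languages / FOR article IN articles:
    rows = []
    for language, articles in train.items():
        for article in articles:
            counts = Counter(tokenise(article))
            # Count occurrences of each vocabulary word, then label with language.
            features = [counts[word] for word in vocabulary]
            rows.append((features, language))
    return vocabulary, rows
```

Calling build_training_rows on a dict such as {"en": [...], "de": [...]} yields one (features, language) pair per training article, which corresponds to the "list of cleaned training data" in the workflow.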