innes / exp-ml-1
Commit 3e00e318, authored 3 years ago by innes
Clean data_preparation_WORKFLOW.txt
Parent: 73b268fc
Showing 1 changed file: project/data/data_preparation_WORKFLOW.txt (5 additions, 5 deletions)
 Workflow:
-- find out if no of articles is sufficient (research papers)
+- find out if number of articles is sufficient
 - download all corpora onto computer (create file in repository for corpora) as one document after another
-- train_test_split!!!!!
+- train_test_split
 - generate features with 10 most common words and with combined alphabet
 FOR language IN languages:
-- TRAINING: tokenise (create new file with list of all tokenised words),
-- Counter for n most common words
+- TRAINING: tokenise (create new file with list of all tokenised words)
 FOR article IN articles:
 - count occurrences of n most common words
@@ -14,4 +14,4 @@ FOR language IN languages:
 - label with language
 - add to list of cleaned training data
-#TRAIN DATA FINISHED
\ No newline at end of file
+#TRAIN DATA FINISHED
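
The training-data loop described in this workflow file could be sketched in Python roughly as follows. This is an illustrative sketch only, not code from the repository: the function names (tokenise, most_common_words, build_training_rows), the regex tokeniser, and the choice to merge each language's top-n words into one shared vocabulary are all assumptions. The train/test split and the "combined alphabet" features are left out.

```python
import re
from collections import Counter


def tokenise(text):
    """Lowercase the text and return its runs of letters (assumed tokeniser)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def most_common_words(articles, n=10):
    """Return the n most frequent tokens across all given articles."""
    counts = Counter()
    for article in articles:
        counts.update(tokenise(article))
    return [word for word, _ in counts.most_common(n)]


def build_training_rows(corpora, n=10):
    """corpora maps a language label to its list of training article strings.

    Returns the shared vocabulary and a list of (feature_vector, language) rows.
    """
    # Assumption: the top-n words of every language are merged into one
    # vocabulary so that all feature vectors have the same length.
    vocab = []
    for articles in corpora.values():
        for word in most_common_words(articles, n):
            if word not in vocab:
                vocab.append(word)

    rows = []
    for language, articles in corpora.items():
        for article in articles:
            counts = Counter(tokenise(article))
            # Count occurrences of each vocabulary word in this article,
            # then label the row with the article's language.
            features = [counts[word] for word in vocab]
            rows.append((features, language))
    return vocab, rows
```

Calling build_training_rows({"en": [...], "de": [...]}, n=10) returns the vocabulary plus one labelled row per article, which corresponds to the "list of cleaned training data" step in the workflow.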