From 4702cf04a18282f694f63f66dd5555cf02edc4eb Mon Sep 17 00:00:00 2001
From: Maximilian Blunck <blunck@cl.uni-heidelberg.de>
Date: Thu, 8 Feb 2018 22:38:59 +0100
Subject: [PATCH] Updated readme

---
 README.md | 66 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 54 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index 2d54a88..ee9a7d5 100644
--- a/README.md
+++ b/README.md
@@ -14,13 +14,7 @@ We suggest running the `setup.sh` file. This creates a virtual python environmen
 
     $ bash setup.sh
 
-After running the setup, you will need to activate the virtualenv
-
-    $ source sopro_env/bin/activate
-
-Alternatively, you can manually install the following requirements.
-
-## Requirements
+Alternatively, you can manually install the following requirements:
 
 The program requires NLTK, NumPy, SciPy, SciKit Learn, requests, textblob and matplotlib.
 Please note that SciPy and NumPy need to be installed before SciKit Learn.
@@ -36,17 +30,65 @@ Please note that SciPy and NumPy need to be installed before SciKit Learn.
 
 ## Run
 
-To run the main programm run `main.py`
+ If not already activated, activate the virtualenv
+
+    $ source sopro_env/bin/activate
+
+To run the main programm run `main.py`.
 
     $ cd src/
     $ python3 main.py
 
-With the default settings, several classifiers will be trained on 80% of the data and tested on the other 20%. Results will be then printed out and also saved to the `results/` directory. In this setting, a certain feature-combination is used which generated the best scores in various experiments.
+With the default settings, several classifiers will be trained on 80% of the data and tested on the other 20%. Results will be then printed out and also saved to the `results/` directory. In this setting, a certain feature-combination is used, which generated the best scores in prior experiments.
+
+Changes can be made in `config.py`. Examples:
 
-Changes can be made in `config.py`. 
-To generate cross-validation scores which can be compared to [Buschmeier et al.](http://acl2014.org/acl2014/W14-26/pdf/W14-2608.pdf), change the following variables:
+To generate cross-validation scores which can be compared to [Buschmeier et al.](http://acl2014.org/acl2014/W14-26/pdf/W14-2608.pdf), change the following variables to:
 
     split_ratio = 1.0
     validate = True
 
-See `config.py` itself for further options.
\ No newline at end of file
+To choose a different combination of Features, modify the following variable:
+
+    feature_selection = ['f1', 'f4', 'f7']
+
+If you'd like to run the programm for all possible combinations of the selected features, change the following variable to:
+
+    use_all_variants = True
+
+Feature specific options like the n-parameter of the bag-of-n-grams feature can also be adjusted. Changing the following variable as shown will make the feature extract uni- and bigrams:
+
+    n_range_words = (1,2)
+
+
+See `config.py` itself for further options.
+
+## App Structure
+
+### Main Programm
+ - main.py > entry point to App, calls machine_learning.py's run()-function
+
+### Feature Related Files
+ - feature.py > provides an abstract Feature class
+ |- ngram_feature.py > inherites from Feature, offers method for extracting F1 feature
+ |- surface_patterns.py > inherites from NGramFeature, offers method for extracting F3 feature
+ |- pos_feature.py > inherites from Feature, offers method for extracting F2 feature
+ |- sent_rating_feature.py > inherites from Feature, offers method for extracting F4 feature
+ |- punctuation_feature.py > inherites from Feature, offers method for extracting F5 feature
+ |- contrast_feature.py > inherites from Feature, offers method for extracting F6 feature
+ |- stars_feature.py > inherites from Feature, offers method for extracting F7 feature
+ - feature_extraction.py > provides functions for extracting and concatenating feature vectors
+
+### Machine Learning
+ - machine_learning.py > includes run-function, which incorperates all ML related steps (training,testing,..)
+
+### Other
+ - corpus.py > contains a reading function to load corpus, can also be run to convert raw corpus
+ - utilities.py > collection of functions & helpers used throughout the app
+ - config.py > file for adjusting setting and options
+
+### Directories
+ - src/ > holds all the source code above
+ - results/ > default location where test/validation results are saved
+ - corpus/ > contains complete corpus in a single csv-file (shuffled)
+
-- 
GitLab
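
The README text added by this patch describes `use_all_variants` as running the program "for all possible combinations of the selected features". The patch does not show how the project enumerates those combinations, so the following is only a minimal sketch under one assumption: that "all possible combinations" means every non-empty subset of `feature_selection`. The helper `feature_variants` is hypothetical and not part of the repository; only the two variable names come from the README.

```python
from itertools import combinations

# Stand-ins for the values set in config.py (variable names taken from the README).
feature_selection = ['f1', 'f4', 'f7']
use_all_variants = True


def feature_variants(features, all_variants):
    """Return the feature combinations to evaluate (hypothetical helper).

    Assumption: "all possible combinations" means every non-empty subset
    of the selected features; the project itself may define it differently.
    """
    if not all_variants:
        # Only the selection exactly as configured.
        return [tuple(features)]
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


variants = feature_variants(feature_selection, use_all_variants)
for combo in variants:
    print(combo)
```

Under this assumption a selection of n features yields 2^n - 1 runs, so the number of runs grows quickly as more features are selected.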