@@ -14,13 +14,7 @@ We suggest running the `setup.sh` file. This creates a virtual python environmen
$ bash setup.sh
After running the setup, you will need to activate the virtualenv
$ source sopro_env/bin/activate
Alternatively, you can manually install the following requirements.
## Requirements
Alternatively, you can manually install the following requirements:
The program requires NLTK, NumPy, SciPy, SciKit Learn, requests, textblob and matplotlib.
Please note that SciPy and NumPy need to be installed before SciKit Learn.
...
...
@@ -36,17 +30,65 @@ Please note that SciPy and NumPy need to be installed before SciKit Learn.
## Run
To run the main program, run `main.py`
If the virtualenv is not already activated, activate it:
$ source sopro_env/bin/activate
To run the main program, run `main.py`.
$ cd src/
$ python3 main.py
With the default settings, several classifiers are trained on 80% of the data and tested on the remaining 20%. The results are then printed and also saved to the `results/` directory. This setting uses the feature combination that produced the best scores in various experiments.
With the default settings, several classifiers are trained on 80% of the data and tested on the remaining 20%. The results are then printed and also saved to the `results/` directory. This setting uses the feature combination that produced the best scores in prior experiments.
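As a rough illustration of the 80/20 split described above, here is a minimal sketch. The function name `split_data` and its `seed` argument are hypothetical and not taken from the project; it also assumes that `split_ratio` in `config.py` denotes the fraction of data used for training, which the README does not state explicitly.

```python
import random


def split_data(samples, split_ratio=0.8, seed=0):
    """Shuffle a copy of `samples` and split it into train and test parts.

    Hypothetical helper: assumes `split_ratio` is the training fraction.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten dummy samples: 8 go to training, 2 to testing.
train, test = split_data(range(10))
print(len(train), len(test))
```

Under this reading, `split_ratio = 1.0` leaves no held-out test portion, which fits the cross-validation setting where all data is used.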
Changes can be made in `config.py`. Examples:
Changes can be made in `config.py`.
To generate cross-validation scores which can be compared to [Buschmeier et al.](http://acl2014.org/acl2014/W14-26/pdf/W14-2608.pdf), change the following variables:
To generate cross-validation scores which can be compared to [Buschmeier et al.](http://acl2014.org/acl2014/W14-26/pdf/W14-2608.pdf), change the following variables to:
split_ratio = 1.0
validate = True
See `config.py` itself for further options.
\ No newline at end of file
To choose a different combination of features, modify the following variable:
feature_selection = ['f1', 'f4', 'f7']
If you'd like to run the program for all possible combinations of the selected features, change the following variable to:
use_all_variants = True
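To give an idea of what "all possible combinations" can mean, the sketch below enumerates every non-empty subset of the selected features. This is an assumption about the behaviour of `use_all_variants`; the helper `all_feature_variants` is hypothetical and not part of the project.

```python
from itertools import combinations


def all_feature_variants(feature_selection):
    """Return every non-empty combination of the given feature names."""
    variants = []
    for size in range(1, len(feature_selection) + 1):
        for combo in combinations(feature_selection, size):
            variants.append(list(combo))
    return variants


variants = all_feature_variants(['f1', 'f4', 'f7'])
for variant in variants:
    print(variant)
```

With n selected features this yields 2^n - 1 variants, so the run time can grow quickly as more features are selected.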
Feature-specific options, such as the n parameter of the bag-of-n-grams feature, can also be adjusted. Setting the following variable as shown makes the feature extract unigrams and bigrams:
n_range_words = (1,2)
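The effect of such a range can be sketched in plain Python. The function `word_ngrams` below is hypothetical and assumes that both bounds are inclusive, as in scikit-learn's `ngram_range`; how the project actually consumes `n_range_words` is not stated here.

```python
def word_ngrams(tokens, n_range=(1, 2)):
    """Return all word n-grams for every n in the inclusive range `n_range`."""
    min_n, max_n = n_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


grams = word_ngrams("not bad at all".split(), n_range=(1, 2))
print(grams)
# ['not', 'bad', 'at', 'all', 'not bad', 'bad at', 'at all']
```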
See `config.py` itself for further options.
## App Structure
### Main Program
- main.py > entry point to the app, calls the run() function of machine_learning.py
### Feature Related Files
- feature.py > provides an abstract Feature class
|- ngram_feature.py > inherits from Feature, offers a method for extracting the F1 feature
|- surface_patterns.py > inherits from NGramFeature, offers a method for extracting the F3 feature
|- pos_feature.py > inherits from Feature, offers a method for extracting the F2 feature
|- sent_rating_feature.py > inherits from Feature, offers a method for extracting the F4 feature
|- punctuation_feature.py > inherits from Feature, offers a method for extracting the F5 feature
|- contrast_feature.py > inherits from Feature, offers a method for extracting the F6 feature
|- stars_feature.py > inherits from Feature, offers a method for extracting the F7 feature
- feature_extraction.py > provides functions for extracting and concatenating feature vectors
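The relationship between an abstract feature class, its subclasses, and a concatenation step can be sketched as follows. This is only an illustration of the pattern: the method name `extract`, the classes `PunctuationCount` and `WordCount`, and the function `extract_features` are hypothetical and may differ from what `feature.py` and `feature_extraction.py` actually define.

```python
from abc import ABC, abstractmethod


class Feature(ABC):
    """Abstract base class: every feature turns a text into a list of numbers."""

    @abstractmethod
    def extract(self, text):
        """Return the feature vector for a single text."""


class PunctuationCount(Feature):
    """Toy feature: counts exclamation and question marks."""

    def extract(self, text):
        return [text.count("!"), text.count("?")]


class WordCount(Feature):
    """Toy feature: number of whitespace-separated tokens."""

    def extract(self, text):
        return [len(text.split())]


def extract_features(features, text):
    """Concatenate the vectors of all given features into one flat vector."""
    vector = []
    for feature in features:
        vector.extend(feature.extract(text))
    return vector


vector = extract_features([PunctuationCount(), WordCount()], "Great product!! Really?")
print(vector)  # [2, 1, 3]
```

Keeping every feature behind the same interface means the concatenation step does not need to know which features were selected.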
### Machine Learning
- machine_learning.py > includes the run() function, which incorporates all ML-related steps (training, testing, ...)
### Other
- corpus.py > contains a function for loading the corpus, can also be run to convert the raw corpus
- utilities.py > collection of functions & helpers used throughout the app
- config.py > file for adjusting settings and options
### Directories
- src/ > holds all the source code above
- results/ > default location where test/validation results are saved
- corpus/ > contains the complete corpus in a single CSV file (shuffled)