From 764e7387f8448581cf6764d609853defbb060350 Mon Sep 17 00:00:00 2001
From: chernenko <chernenko@cl.uni-heidelberg.de>
Date: Fri, 30 Mar 2018 14:20:54 +0200
Subject: [PATCH] Update README.md

---
 README.md | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index 9bd38f2..633d5d1 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,20 @@
 # CHERTOY
 
-This is an implementation of the CHERTOY system for the Word Sense Induction task. This project also contains an implementation of the baseline and 40 experiments with it.
+This is an implementation of the CHERTOY system for the Word Sense Induction task (Task 11 at SemEval 2013).
+This project also contains an implementation of the baseline and 40 experiments with it.
 
-CHERTOY performs the example solution for the WSI (word sense induction) task (the Task 11 at SemEval 2013). For introduction to the task you can use the ...
-The whole description of the system you can find here ...
+We experiment with language models, specific features and clustering algorithms based on the sense2vec and sent2vec systems.
+A detailed study across 40 experiments gave us insight into the effects of several
+feature combinations, which resulted in our WSI system CHERTOY.
 
-The system creates semantic related clusters from the given snippets (the text fragments get back from the search engine) for each pre-defined ambiguous topic. It makes the pre-processing of the input data, creates a language model using vector representations for each snippet with sense2vec and vector misture model (BOW representation with summarization for each snippet) and creates semantic clusters with the Mean Shift clustering algorithm.
+The system creates semantically related clusters from the given snippets (the text fragments returned by the search engine) for each pre-defined ambiguous topic.
+It preprocesses the input data, builds a language model using vector representations for each snippet with sense2vec and a vector mixture model (a BOW representation obtained by summation over each snippet), and creates semantic clusters with the Mean Shift clustering algorithm.
 
 ## RUNNING INSTRUCTIONS
 
 Dependences:
 
 - sense2vec (paper: https://arxiv.org/abs/1511.06388, code: https://github.com/explosion/sense2vec)
-- sklearn
+- scikit-learn
 
 ### Input files:
 
@@ -55,7 +58,7 @@ id description
 
 ### Output files:
 
-CHERTOY produces the output file ( output.txt) that is formatted as follows:
+CHERTOY produces the output file (output.txt) that is formatted as follows:
 
 subTopicID \t resultID
 
@@ -80,7 +83,7 @@ Create a folder with your projects:
 
 After running the system you'll have the output file in your project folder.
 
-### TO RUN THE SYSTEM:
+### RUN THE SYSTEM:
 
 git clone https://gitlab.cl.uni-heidelberg.de/semantik_project/wsd_chernenko_schindler_toyota.git
 
@@ -93,13 +96,14 @@ _Example:_
 
 python3 chertoy.py /home/tatiana/Desktop/FSemantik_Projekt /test /topics.txt /results.txt /output_chertoy.txt
 
-(TODO...Add Citations from sense2vec;
-add folder structure)
 
 ### Other files:
+
+* Appendix.pdf - a performance table with F1, RI, ARI and JI values of the baseline and 40 experiments (incl. CHERTOY) on the trial data.
+
 * bin
 
-The folder contains the implementation of the CHERTOY system and code for pre-processing Wikipedia 2017 Corpus, that we used for the experiments.
+The folder contains the implementation of the CHERTOY system.
 * experiments
@@ -107,19 +111,26 @@ The folder experiments contains an implementation of the baseline and 40 differe
 
 * lib
 
-The folder contains two trained models of the Wikipedia 2017 Corpus, that we used in our experiments with sent2vec (...) and pre-processed Wikipedia 2017 Corpus.
-Other models that we used during our experiments you can find in sense2vec and sent2vec repositories.
+The folder contains the code for preprocessing the Wikipedia dataset to train our own sent2vec models for the experiments,
+the preprocessed Wikipedia 2017 dataset,
+two self-trained models of the Wikipedia 2017 dataset that we used in our experiments with sent2vec,
+and a README file.
+Other models that we used during our experiments can be found in the sense2vec and sent2vec repositories.
+
+* experiments
+
+Implementation of the baseline and 40 experiments with it.
 
 * output
 
-* outputs_experiments ...?
-* output_test ---?
-...
+
+outputs\_trial\_data - output files for the experiments on the trial data
+output\_test\_data - the output file for the test data
 
 ### LICENSES
 
-This software uses ... License.
+This software is distributed under the MIT License.
 
-The part of the system uses sense2vec, which is distributed under the following License:
+Part of the system uses sense2vec, which is distributed under the following license:
 
 The MIT License (MIT)
--
GitLab
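
For context on the pipeline the patched README describes in prose (sum token vectors per snippet into a BOW-style mixture vector, cluster the snippet vectors with Mean Shift, write tab-separated subTopicID/resultID lines), here is a minimal sketch. It is not the actual `chertoy.py` implementation; the `token_vectors` lookup, the vector dimension, and the sub-topic ID scheme are illustrative assumptions.

```python
# Illustrative sketch only, not the actual chertoy.py code.
# Assumes token_vectors is a dict mapping tokens to numpy vectors
# (e.g. loaded from a sense2vec model); the ID scheme below is hypothetical.
import numpy as np
from sklearn.cluster import MeanShift


def snippet_vector(snippet, token_vectors, dim):
    """BOW-style mixture vector: sum the vectors of all known tokens in the snippet."""
    vec = np.zeros(dim)
    for token in snippet.lower().split():
        if token in token_vectors:
            vec += token_vectors[token]
    return vec


def cluster_snippets(snippets, token_vectors, dim):
    """Cluster snippet vectors with Mean Shift (bandwidth estimated by scikit-learn)."""
    X = np.array([snippet_vector(s, token_vectors, dim) for s in snippets])
    return MeanShift().fit_predict(X)


def write_output(path, topic_id, result_ids, labels):
    """Append one tab-separated subTopicID / resultID line per snippet."""
    with open(path, "a", encoding="utf-8") as out:
        for result_id, label in zip(result_ids, labels):
            # Hypothetical sub-topic ID scheme: "<topic>.<cluster index>".
            out.write(f"{topic_id}.{label + 1}\t{result_id}\n")
```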