Update README.md

97fefc95 · chernenko · d1894ee4 · 97fefc95
Commit 97fefc95 authored 7 years ago by chernenko
--- a/README.md
+++ b/README.md
-README
\ No newline at end of file
+# CHERTOY
+
+This is an implementation of the CHERTOY system for the Word Sense Induction task. The project also contains an implementation of the baseline and 40 experiments with it.
+
+CHERTOY provides an example solution for the WSI (word sense induction) task (Task 11 at SemEval 2013). For an introduction to the task, see ...
+A full description of the system can be found here ...
+
+The system creates semantically related clusters from the given snippets (the text fragments returned by a search engine) for each pre-defined ambiguous topic. It pre-processes the input data, creates a language model with a vector representation for each snippet, using sense2vec and a vector mixture model (a BOW representation with summarization for each snippet), and creates semantic clusters with the Mean Shift clustering algorithm.
+
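The pipeline described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the project's actual code: the toy 2-d word vectors stand in for sense2vec vectors, averaging is one possible reading of the "vector mixture model", and the `bandwidth` value is picked by hand for this toy data. Only `sklearn.cluster.MeanShift` is a real library class.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical 2-d word vectors; CHERTOY obtains real vectors from sense2vec.
TOY_VECTORS = {
    "camera": np.array([1.0, 0.0]),
    "film": np.array([0.9, 0.1]),
    "eyewear": np.array([0.0, 1.0]),
    "sunglasses": np.array([0.1, 0.9]),
}


def snippet_vector(tokens):
    """Average the vectors of all known tokens (bag-of-words mixture)."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return np.zeros(2)  # no known word: fall back to the zero vector
    return np.mean(known, axis=0)


# Already tokenized toy snippets for one ambiguous topic.
snippets = [
    ["polaroid", "camera", "film"],
    ["instant", "film", "camera"],
    ["polaroid", "eyewear", "sunglasses"],
    ["sunglasses"],
]
X = np.vstack([snippet_vector(tokens) for tokens in snippets])

# The bandwidth is an assumed value that suits this toy data only.
labels = MeanShift(bandwidth=0.5).fit_predict(X)

for tokens, label in zip(snippets, labels):
    print(label, " ".join(tokens))
```

Mean Shift does not need the number of clusters in advance, which suits sense induction, where the number of senses per topic is unknown.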
+## RUNNING INSTRUCTIONS
+
+Dependencies:
+- sense2vec (paper: https://arxiv.org/abs/1511.06388, code: https://github.com/explosion/sense2vec)
+- scikit-learn (sklearn)
+
+### Input files:
+
+The input data must consist of two text files: results.txt and topics.txt.
+
+* results.txt
+
+A file with snippets in the following format: 
+
+ID \t url \t title \t snippet
+
+There are no empty lines between the snippets.
+
+_Example:_
+
+ID \t url \t title \t snippet
+
+1.1 \t http://www.polaroid.com/ \t Polaroid | Home | 74.208.163.206 \t Create and share like never before at <b>Polaroid</b>.com. Find instant film and cameras reinvented for the digital age. Plus, digital cameras, digital camcorders, LCD <b>...</b>
+
+1.2 \t http://www.polaroid.com/products \t products | www.polaroid.com \t Come check out a listing of Polaroid products, by category.
+
+1.3 \t http://en.wikipedia.org/wiki/Polaroid_Corporation \t Polaroid Corporation - Wikipedia, the free encyclopedia \t <b>Polaroid</b> Corporation is an American-based international consumer electronics and eyewear company, originally founded in 1937 by Edwin H. Land. It is most <b>...</b>
+
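As a rough illustration of the results.txt format, the following sketch reads tab-separated lines into records. The names `Snippet` and `parse_results` are hypothetical and not taken from the project, and the sketch assumes the file starts with the header line shown in the example.

```python
from typing import NamedTuple


class Snippet(NamedTuple):
    result_id: str
    url: str
    title: str
    text: str


def parse_results(lines):
    """Parse lines of the form: ID <tab> url <tab> title <tab> snippet."""
    snippets = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the assumed header line.
        if not line.strip() or line.startswith("ID\t"):
            continue
        # Split on the first three tabs only, so the snippet text stays whole.
        result_id, url, title, text = line.split("\t", 3)
        snippets.append(Snippet(result_id, url, title, text))
    return snippets


sample = [
    "ID\turl\ttitle\tsnippet\n",
    "1.2\thttp://www.polaroid.com/products\tproducts | www.polaroid.com\t"
    "Come check out a listing of Polaroid products, by category.\n",
]
parsed = parse_results(sample)
print(parsed[0].result_id, parsed[0].title)
```

With a real file, the same function could be called as `parse_results(open("results.txt", encoding="utf-8"))`.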
+
+* topics.txt
+
+A file with ambiguous topics. The system will create clusters for each of these topics.
+
+_Example:_
+
+id	description
+
+1	polaroid
+
+2	kangaroo
+
+3	shakira
+
+4	kawasaki
+
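A small sketch of how topics.txt could be loaded. The function name `parse_topics` is hypothetical, and the lookup at the end assumes, as the examples suggest, that the part of a result ID before the dot is the topic ID.

```python
def parse_topics(lines):
    """Parse lines of the form: id <tab> description into a dict."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        topic_id, description = line.split("\t", 1)
        if topic_id == "id":
            continue  # skip the header line
        topics[topic_id] = description
    return topics


sample = ["id\tdescription\n", "1\tpolaroid\n", "2\tkangaroo\n"]
topics = parse_topics(sample)

# Assumed convention: result ID "1.2" belongs to topic "1".
topic_for_result = topics["1.2".partition(".")[0]]
print(topic_for_result)
```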
+
+### Output files:
+
+CHERTOY produces an output file (output.txt) that is formatted as follows:
+
+subTopicID \t resultID
+
+Here the subTopicID consists of the topic ID from the file topics.txt and the number of the cluster (meaning).
+The resultID is the ID number of the snippet from the file results.txt.
+
+_Example:_
+
+subTopicID \t resultID
+
+45.0 \t 45.1
+
+45.0 \t 45.10
+
+45.1 \t 45.20
+ 
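The output format can be sketched as follows. The function `format_output` is a hypothetical helper, not part of the project. It assumes that each snippet of a topic has been assigned an integer cluster label and that the subTopicID is built as `<topic ID>.<cluster number>`.

```python
def format_output(topic_id, result_ids, labels):
    """Build output lines of the form: subTopicID <tab> resultID."""
    lines = []
    for result_id, label in zip(result_ids, labels):
        sub_topic_id = f"{topic_id}.{label}"  # topic ID plus cluster number
        lines.append(f"{sub_topic_id}\t{result_id}")
    return lines


header = "subTopicID\tresultID"
rows = format_output("45", ["45.1", "45.10", "45.20"], [0, 0, 1])

for line in [header] + rows:
    print(line)
```

The three rows produced here correspond to the example above: snippets 45.1 and 45.10 fall into cluster 0 of topic 45, and snippet 45.20 into cluster 1.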
+### TO RUN THE SYSTEM:
+
+git clone 
+
+cd 
+
+python3 model.py path model etc. ...