This is an implementation of the CHERTOY system for the Word Sense Induction task. This project also contains an implementation of the baseline and 40 experiments with it.
CHERTOY is an example solution for the WSI (Word Sense Induction) task (Task 11 at SemEval 2013). For an introduction to the task, see the ...
A full description of the system can be found here ...
The system creates semantically related clusters from the given snippets (the text fragments returned by a search engine) for each predefined ambiguous topic. It preprocesses the input data, creates a language model using vector representations for each snippet with sense2vec and a vector mixture model (a BOW representation with summarization for each snippet), and creates semantic clusters with the Mean Shift clustering algorithm.
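The pipeline above can be illustrated with a minimal, dependency-free sketch. It is not the CHERTOY code: the toy 2-d word vectors stand in for sense2vec vectors, the vector mixture is read here as a plain sum of word vectors (an assumption), and `snippet_vector`, `mean_shift`, and the `bandwidth` value are hypothetical names chosen for this example. The Mean Shift variant shown uses a flat kernel.

```python
import math

# Hypothetical 2-d word vectors; a real run would take them from sense2vec.
TOY_VECTORS = {
    "camera": (1.0, 0.1), "film": (0.9, 0.0), "photo": (1.1, 0.2),
    "glasses": (0.0, 1.0), "eyewear": (0.1, 0.9), "lens": (0.2, 1.1),
}


def snippet_vector(tokens):
    """Vector mixture (assumed additive): sum the vectors of known tokens."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    return tuple(sum(dim) for dim in zip(*known)) if known else (0.0, 0.0)


def mean_shift(points, bandwidth, iterations=50):
    """Flat-kernel Mean Shift: move each mode to the mean of nearby points."""
    modes = [tuple(p) for p in points]
    for _ in range(iterations):
        shifted = []
        for m in modes:
            # Fall back to the mode itself if no point lies inside the window.
            near = [p for p in points if math.dist(m, p) <= bandwidth] or [m]
            shifted.append(tuple(sum(d) / len(near) for d in zip(*near)))
        modes = shifted
    # Modes that ended up close to each other share one cluster label.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


snippets = [
    ["camera", "film"], ["photo", "camera"],
    ["glasses", "eyewear"], ["lens", "eyewear"],
]
vectors = [snippet_vector(s) for s in snippets]
labels = mean_shift(vectors, bandwidth=1.0)
print(labels)
```

With these toy vectors the two camera-related snippets and the two eyewear-related snippets fall into separate clusters. Mean Shift does not need the number of clusters in advance, which suits a task where the number of senses per topic is unknown.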
The input data must consist of two text files: results.txt and topics.txt.
* results.txt
A file with snippets in the following format:
ID \t url \t title \t snippet
There are no empty lines between the snippets.
_Example:_
ID \t url \t title \t snippet
1.1 \t http://www.polaroid.com/ \t Polaroid | Home | 74.208.163.206 \t Create and share like never before at <b>Polaroid</b>.com. Find instant film and cameras reinvented for the digital age. Plus, digital cameras, digital camcorders, LCD <b>...</b>
1.2 \t http://www.polaroid.com/products \t products | www.polaroid.com \t Come check out a listing of Polaroid products, by category.
1.3 \t http://en.wikipedia.org/wiki/Polaroid_Corporation \t Polaroid Corporation - Wikipedia, the free encyclopedia \t <b>Polaroid</b> Corporation is an American-based international consumer electronics and eyewear company, originally founded in 1937 by Edwin H. Land. It is most <b>...</b>
* topics.txt
A file with ambiguous topics. The system will create clusters for each of these topics.
_Example:_
id description
1 polaroid
2 kangaroo
3 shakira
4 kawasaki
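
As an illustration of the two input formats, the following sketch reads both files into plain Python structures. It is not taken from the CHERTOY code base: the function names are hypothetical, and it assumes that real tab characters separate the four fields of results.txt, that topics.txt separates ID and description with whitespace, that a header line may be present in either file, and that the part of a result ID before the dot is the topic ID.

```python
from collections import defaultdict


def parse_topics(text):
    """Map topic ID -> topic description, skipping a header line if present."""
    topics = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)  # split on the first whitespace run
        if len(parts) != 2 or parts[0].lower() == "id":
            continue
        topics[parts[0]] = parts[1]
    return topics


def parse_results(text):
    """Group snippet records by topic ID; each record keeps its four fields."""
    by_topic = defaultdict(list)
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 4 or fields[0].lower() == "id":
            continue  # header line or malformed line
        result_id, url, title, snippet = fields
        topic_id = result_id.split(".", 1)[0]  # assumed: "1.3" -> topic "1"
        by_topic[topic_id].append(
            {"id": result_id, "url": url, "title": title, "snippet": snippet}
        )
    return dict(by_topic)


topics = parse_topics("id\tdescription\n1\tpolaroid\n2\tkangaroo\n")
results = parse_results(
    "ID\turl\ttitle\tsnippet\n"
    "1.2\thttp://www.polaroid.com/products\tproducts | www.polaroid.com\t"
    "Come check out a listing of Polaroid products, by category.\n"
)
print(topics["1"], len(results["1"]))
```

For real files, the same functions can be fed with the contents of `open("topics.txt", encoding="utf-8").read()` and `open("results.txt", encoding="utf-8").read()`.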
### Output files:
CHERTOY produces an output file (output.txt) that is formatted as follows:
subTopicID \t resultID
Here, the subTopicID consists of the topic ID from the file topics.txt and the number of the cluster (meaning).
The resultID is the ID number of the snippet from the file results.txt.
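
A short sketch of how such lines could be produced from cluster assignments is given below. The helper `format_output` is hypothetical, and the way the topic ID and the cluster number are joined (with a dot, as in `1.2`) is an assumption, since the description above does not state the separator. No header line is written.

```python
def format_output(labels_by_topic):
    """Build output lines from {topic_id: [(result_id, cluster_number), ...]}."""
    lines = []
    for topic_id, assignments in labels_by_topic.items():
        for result_id, cluster in assignments:
            # Assumed subTopicID layout: "<topic ID>.<cluster number>".
            lines.append(f"{topic_id}.{cluster}\t{result_id}")
    return "\n".join(lines) + "\n"


# Topic 1: snippets 1.1 and 1.3 share cluster 1, snippet 1.2 is in cluster 2.
text = format_output({"1": [("1.1", 1), ("1.2", 2), ("1.3", 1)]})
print(text, end="")
```

The returned string can be written to disk with `open("output.txt", "w", encoding="utf-8").write(text)`.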