Commit 97fefc95 authored by chernenko

Update README.md

parent d1894ee4
# CHERTOY
This is an implementation of the CHERTOY system for the Word Sense Induction task. The project also contains an implementation of the baseline and 40 experiments with it.
CHERTOY provides an example solution for the WSI (word sense induction) task (Task 11 at SemEval 2013). For an introduction to the task, see ...
A full description of the system can be found here ...
For each pre-defined ambiguous topic, the system groups the given snippets (text fragments returned by a search engine) into semantically related clusters. It pre-processes the input data, builds a vector representation of each snippet using sense2vec and a vector mixture model (a bag-of-words representation in which the word vectors of each snippet are summed), and creates the semantic clusters with the Mean Shift clustering algorithm.
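The pipeline described above can be illustrated with a minimal sketch. It is not the code of this repository: the tiny `TOY_VECTORS` table is a hypothetical stand-in for sense2vec vectors, the helper names `snippet_vector` and `cluster_snippets` are invented for the example, and the `bandwidth` value is chosen only to suit the toy data. Only `sklearn.cluster.MeanShift` is a real library class.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical 2-dimensional toy embeddings (assumption: stand-in for sense2vec).
TOY_VECTORS = {
    "camera": np.array([1.0, 0.0]),
    "film": np.array([0.9, 0.1]),
    "photo": np.array([0.8, 0.0]),
    "glasses": np.array([0.0, 1.0]),
    "eyewear": np.array([0.1, 0.9]),
    "lens": np.array([0.0, 0.8]),
}


def snippet_vector(tokens, vectors, dim=2):
    """Vector mixture: sum the vectors of all known tokens of one snippet."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known token: fall back to the zero vector.
        return np.zeros(dim)
    return np.sum(known, axis=0)


def cluster_snippets(snippets, vectors, bandwidth=1.0):
    """Return one Mean Shift cluster label per snippet."""
    # Very simple pre-processing (assumption): lowercase and split on whitespace.
    tokenized = [s.lower().split() for s in snippets]
    X = np.vstack([snippet_vector(t, vectors) for t in tokenized])
    return MeanShift(bandwidth=bandwidth).fit_predict(X)


snippets = ["camera film", "photo camera", "glasses eyewear", "eyewear lens"]
labels = cluster_snippets(snippets, TOY_VECTORS)
print(labels)
```

With these toy vectors the two camera-related snippets end up in one cluster and the two eyewear-related snippets in another. Mean Shift does not need the number of clusters in advance, which suits a task where the number of senses per topic is unknown.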
## RUNNING INSTRUCTIONS
Dependencies:
- sense2vec (paper: https://arxiv.org/abs/1511.06388, code: https://github.com/explosion/sense2vec)
- sklearn
### Input files:
The input data must consist of two text files: results.txt and topics.txt.
* results.txt
A file with snippets in the following format:
ID \t url \t title \t snippet
There are no empty lines between the snippets.
_Example:_
ID \t url \t title \t snippet
1.1 \t http://www.polaroid.com/ \t Polaroid | Home | 74.208.163.206 \t Create and share like never before at <b>Polaroid</b>.com. Find instant film and cameras reinvented for the digital age. Plus, digital cameras, digital camcorders, LCD <b>...</b>
1.2 \t http://www.polaroid.com/products \t products | www.polaroid.com \t Come check out a listing of Polaroid products, by category.
1.3 \t http://en.wikipedia.org/wiki/Polaroid_Corporation \t Polaroid Corporation - Wikipedia, the free encyclopedia \t <b>Polaroid</b> Corporation is an American-based international consumer electronics and eyewear company, originally founded in 1937 by Edwin H. Land. It is most <b>...</b>
* topics.txt
A file with ambiguous topics. The system will create clusters for each of these topics.
_Example:_
id description
1 polaroid
2 kangaroo
3 shakira
4 kawasaki
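
As an illustration of the two input formats, here is a small reading sketch. It is not taken from model.py: the function names `parse_results` and `parse_topics` are hypothetical, and it assumes that both files start with a header line and that the fields of topics.txt are tab-separated like those of results.txt.

```python
from collections import defaultdict


def parse_results(lines):
    """Group snippet rows (ID, url, title, snippet) by their topic ID."""
    by_topic = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the header row (assumption: header starts with "ID").
        if not line.strip() or line.lower().startswith("id\t"):
            continue
        parts = [p.strip() for p in line.split("\t", 3)]
        if len(parts) != 4:
            raise ValueError(f"expected 4 tab-separated fields: {line!r}")
        result_id, url, title, snippet = parts
        # A result ID such as "1.3" belongs to topic "1".
        topic_id = result_id.split(".", 1)[0]
        by_topic[topic_id].append((result_id, url, title, snippet))
    return dict(by_topic)


def parse_topics(lines):
    """Map each topic ID to its description."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lower().startswith("id\t"):
            continue
        topic_id, description = line.split("\t", 1)
        topics[topic_id.strip()] = description.strip()
    return topics


results = parse_results([
    "ID\turl\ttitle\tsnippet\n",
    "1.1\thttp://www.polaroid.com/\tPolaroid | Home\tCreate and share\n",
    "1.2\thttp://www.polaroid.com/products\tproducts\tCome check out\n",
])
topics = parse_topics(["id\tdescription\n", "1\tpolaroid\n", "2\tkangaroo\n"])
```

To read real files, the same functions can be called with an open file object, for example `parse_results(open("results.txt", encoding="utf-8"))`.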
### Output files:
CHERTOY produces an output file (output.txt) that is formatted as follows:
subTopicID \t resultID
Here the subTopicID consists of the topic ID from the file topics.txt and the number of the cluster (meaning).
The resultID is the ID of the snippet from the file results.txt.
_Example:_
subTopicID \t resultID
45.0 \t 45.1
45.0 \t 45.10
45.1 \t 45.20
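
The output format can be sketched as follows. The helper `format_output` is hypothetical and not part of the repository; it assumes that cluster labels are integers starting at 0 and that the file begins with the header line shown in the example.

```python
def format_output(topic_id, result_ids, labels):
    """Build output lines of the form subTopicID<TAB>resultID."""
    lines = ["subTopicID\tresultID"]
    for result_id, label in zip(result_ids, labels):
        # subTopicID = topic ID + "." + cluster number
        lines.append(f"{topic_id}.{int(label)}\t{result_id}")
    return lines


output_lines = format_output("45", ["45.1", "45.10", "45.20"], [0, 0, 1])
print("\n".join(output_lines))
```

The lines can then be written to output.txt with an ordinary `open("output.txt", "w")` and `write` call.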
### TO RUN THE SYSTEM:
git clone
cd
python3 model.py path model etc. ...