The pipeline system implements variant 36 of the pipeline for the WSI (word sense induction) task (Task 11 at SemEval 2013).
For each predefined ambiguous topic, the system builds semantically related clusters from the given snippets (the text fragments returned by the search engine).
------------------- METHODS -------------------
For WSI it uses the following methods:
- Pre-processing: tokenization + punctuation removal
- Language model: sent2vec with the pretrained wiki_unigrams model
- Compositional semantics: vector mixture model (a BOW (bag-of-words) representation, summed over each snippet)
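As a minimal sketch, the pre-processing step above (tokenization plus punctuation removal) could look like this; `preprocess` is a hypothetical helper name, not part of the original pipeline:

```python
import string

def preprocess(snippet):
    """Hypothetical helper: lowercase, strip punctuation, split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return snippet.lower().translate(table).split()

# Example: tokenize one search-engine snippet
tokens = preprocess("Apple Inc. designs, manufactures, and markets smartphones.")
# tokens == ["apple", "inc", "designs", "manufactures", "and", "markets", "smartphones"]
```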
The core composition step sums sent2vec word vectors into sentence vectors, then sums the sentence vectors into a single snippet vector:

    # Compose one snippet vector: sum the word vectors of each sentence,
    # then sum the sentence vectors into one vector for the whole snippet.
    vector_paragr = np.zeros(len_vector)           # accumulator for the snippet vector
    for sent in paragr[1:]:                        # each sentence in the snippet
        vector_sent = []                           # word vectors of this sentence
        for word in sent:
            try:
                query_vector = model.embed_sentence(word)  # sent2vec vector for a single word
                vector_sent.append(query_vector)   # BOW: collect every word vector
            except Exception:
                continue                           # skip words the model cannot embed
        summe = np.zeros(len_vector)               # sum of the word vectors -> sentence vector
        for vector in vector_sent:
            summe += vector
        par_list.append(summe)                     # store the sentence vector
    for sentence in par_list:                      # sum all sentence vectors
        vector_paragr += sentence
    paragr.append(vector_paragr)                   # attach the snippet vector to the snippet
    compos_vectors = prepr_data
    # print(compos_vectors["45"])                  # example of the output for the topic "45"
    return compos_vectors
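The additive BOW composition above can be isolated into a small self-contained function. Here `embed` and `compose_snippet` are illustrative stand-ins (the stub dictionary replaces the pretrained wiki_unigrams model), so the vector values are toy examples only:

```python
import numpy as np

def compose_snippet(sentences, embed, dim):
    """Vector mixture model: snippet vector = sum of all word vectors."""
    snippet_vec = np.zeros(dim)
    for sent in sentences:
        sent_vec = np.zeros(dim)
        for word in sent:
            try:
                sent_vec += embed(word)  # sent2vec would supply this vector
            except Exception:
                continue                 # skip words the model cannot embed
        snippet_vec += sent_vec
    return snippet_vec

# Stub embedding in place of model.embed_sentence (an assumption, not the real model)
toy = {"apple": np.array([1.0, 0.0]), "pie": np.array([0.0, 1.0])}
vec = compose_snippet([["apple", "pie"], ["apple"]], lambda w: toy[w], 2)
# vec == array([2., 1.])
```

Unknown words simply raise inside `embed` and are skipped, mirroring the `try/except` in the pipeline code.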
# Create a dictionary for subtopics, with topics as keys and lists of subtopics with IDs as values:
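A minimal sketch of that subtopic dictionary; the topic names and IDs below are invented placeholders (the real pipeline fills the lists from the task data):

```python
# topic -> list of (subtopic label, subtopic ID); the entries added below are
# placeholder examples, not real SemEval 2013 Task 11 data
subtopics = {}

def add_subtopic(topic, label, subtopic_id):
    """Append one (label, id) pair to the topic's subtopic list."""
    subtopics.setdefault(topic, []).append((label, subtopic_id))

add_subtopic("apple", "apple (fruit)", "45.1")
add_subtopic("apple", "Apple Inc.", "45.2")
# subtopics == {"apple": [("apple (fruit)", "45.1"), ("Apple Inc.", "45.2")]}
```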