This repository is part of our project for the course `Formale Semantik` at Heidelberg University. The project task can be summarized as the classification of lexical semantic relations between the components of nominal compounds. If this topic has caught your interest, the [Project Report](documents/Project_Report.pdf) offers detailed insight into the project and its outcomes.
## Task 📝
A system is trained on noun compounds of the form NC = noun1 noun2 together with paraphrases describing the relation between noun1 and noun2. We then test whether the semantic relations between the two components of a noun compound, head and modifier, have been learned and can be reproduced. To this end, we examine to what extent the components masked in the paraphrases - the verbs - can be completed by a machine, and how well a fine-tuned model can predict the relation for a nominal compound occurring in a sentence.
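To make the probing idea concrete, here is a minimal sketch of verb masking with a pretrained masked language model; the compound, the paraphrase, and the model name are illustrative placeholders of ours, not the project's actual setup:

```
from transformers import pipeline

# Illustrative probe: mask the verb of a paraphrase and let a pretrained
# masked language model complete it (model and example are placeholders).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("olive oil is oil that is [MASK] from olives."):
    print(prediction["token_str"], prediction["score"])
```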
## Prerequisites 🗂
...
...
```
pip install -r requirements.txt
```

| subdirectory | content | README |
| ---- | ---- | ---- |
| data | contains all data needed for probing and fine-tuning | [README](data/README.md) |
| documents | contains our initial [**Project Outline**](documents/Gruppe_9__NC-RC_-_Outline.pdf) and the final [**Project Report**](documents/Project_Report.pdf) | |
| fine_tuning | contains code to fine-tune models, the fine-tuned models, test results and evaluation | [README](fine_tuning/README.md) |
| probing | contains code for probing and its evaluation | [README](probing/README.md) |
Since our project focuses on "breaking" / analyzing a neural system that tries to predict the semantic relations of nominal compounds, we decided to create a set of sentences containing said compounds. To gather a base dataset for later fine-tuning and testing sessions, [Wortschatz Leipzig](https://wortschatz.uni-leipzig.de/de) was used to search through news and Wikipedia snippets released between 2016 and 2020, adding up to roughly 8M unique sentences. The search centers on a set of compounds for both fine- and coarse-grained relations, taken from [Tratz and Hovy (2010)](https://github.com/vered1986/panic/tree/master/classification/data).
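Loading the compound list could look like the following sketch; it assumes the Tratz and Hovy files are tab-separated with modifier and head as the first two columns, which should be verified against the linked repository:

```
def load_compounds(path):
    # read "modifier<TAB>head<TAB>..." lines into "modifier head" strings
    compounds = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 2:
                compounds.append("{} {}".format(parts[0], parts[1]))
    return compounds
```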
The search itself was done by iterating over the sentences with a regex pattern like `r"\b({})".format(noun_compound)` (note the raw-string prefix, so that `\b` is interpreted as a word boundary). To reduce the number of iterations, several compounds were batched into one pattern with a join:
```
import re

# let step be 10 - or an integer of choice
step = 10
for i in range(0, len(compounds), step):
    # slicing past the end of the list is safe in Python,
    # so no separate bounds check is needed
    pattern = re.compile(r"\b({})".format("|".join(compounds[i:i + step])))
```
This search can also be accelerated by parallelizing it over chunks of the corpus, as in the sketch below.
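A minimal sketch of such a parallel scan, assuming the sentences fit into memory; the function names and the use of worker processes (which sidestep Python's GIL for this CPU-bound scan) are our choices, not taken from the repository:

```
import re
from concurrent.futures import ProcessPoolExecutor

def find_matches(sentences, pattern):
    # keep every sentence that contains one of the compounds
    regex = re.compile(pattern)
    return [s for s in sentences if regex.search(s)]

def parallel_search(sentences, pattern, workers=4):
    # split the corpus into one chunk per worker and scan them concurrently
    chunk = len(sentences) // workers + 1
    parts = [sentences[i:i + chunk] for i in range(0, len(sentences), chunk)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(find_matches, parts, [pattern] * len(parts))
    return [s for part in results for s in part]
```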
## Compound variety for fine relations <!-- omit in TOC -->
