Software Projekt - Data Augmentation for Metonymy Resolution
Members of the project:
- Margareta Anna Kulscar kulscar@cl.uni-heidelberg.de
- Sandra Friebolin friebolin@cl.uni-heidelberg.de
- Mira Umlauf umlauf@cl.uni-heidelberg.de
Table of contents
- 📚 Project documents
- 🔎 Metonymy Resolution
- 📈 Data Augmentation
- 💡 Methods
  - 📝 Backtranslation
  - 🍸 MixUp
- 🗃️ Data
- 🛠️ Set Up
- ⚙️ Usage
- 🏯 Code Structure
- 📑 References
📚 Project documents
This README gives a rough overview of the project. The full documentation and additional information can be found in the documents listed below.
🔎 Metonymy Resolution
A metonymy is the substitution of an expression by another one that is closely related to it. Metonymies exploit a contiguity relation between two domains.
🖊️ Example:
- 🇧🇷 BRAZIL lost the finals.
  - The term Brazil stands for a sports team of the country and is thus to be classified as a metonymy.
- 💻 GOOGLE pays its employees poorly.
  - This is a metonymy where the keyword Google stands for the owner of the company, not for the literal sense of the term.
- 🇧🇷 BRAZIL has a beautiful coast.
  - Here, Brazil stands for the country and thus carries the literal meaning of the term.
- 💻 GOOGLE is an American company.
  - The term has literal meaning: Google refers to the company.
Metonymy resolution is about determining whether a potentially metonymic word is used metonymically in a particular context. In this project we focus on metonymic and literal readings for locations and organizations.

ℹ️ Sentences that allow for mixed readings, where both a literal and a metonymic sense are evoked, are considered metonymic in this project. This is true of the following sentence, in which the term Nigeria prompts both a metonymic and a literal reading.

- "They arrived in Nigeria, hitherto a leading critic of [...]" (SemEval-2007 Task 08: Metonymy Resolution at SemEval-2007)

➡️ Hence, we use the two classes `non-literal` and `literal` for this binary classification task.
📈 Data Augmentation
Data augmentation is the generation of synthetic training data for machine learning through transformations of existing data.
It is an important component for:
- ➕ increasing the generalization capabilities of a model
- ➕ overcoming data scarcity
- ➕ regularizing the target
- ➕ limiting the amount of data used to protect privacy
Consequently, it is a vital technique for evaluating the robustness of models and for improving the diversity and volume of training data to achieve better performance.
💡 Methods
When selecting methods for our task, the main goal was to find a tradeoff between label-preserving methods and diversifying our dataset. Since the language models BERT [1] and RoBERTa [2] have not been found to profit from very basic augmentation strategies (e.g. case changing of single characters or embedding replacements) [citation needed], we chose more innovative and challenging methods.
To be able to compare the influence of augmentations in different spaces, we select one method for the data space and two methods for the feature space.
📝 1. Backtranslation (Data Space)
As a comparatively safe (= label-preserving) data augmentation strategy, we selected backtranslation using the machine translation toolkit fairseq [3] (Ott et al., 2019). Similar to Chen et al. [4], we use the pre-trained single models:
- [`transformer.wmt19.en-de.single_model`](https://huggingface.co/facebook/wmt19-en-de)
- [`transformer.wmt19.de-en.single_model`](https://huggingface.co/facebook/wmt19-de-en)
- 🔄 Each original sentence is back-translated 4 times to generate paraphrases that are slightly different from the original sentence, but still close enough to preserve the class.
- ✅ For each sentence, the top 5 paraphrases are kept, using nucleus (top-p) sampling, likewise for diversity reasons (see the sketch below).
- 🔥 We test two versions: generating paraphrases with a lower (0.8) and a higher (1.2) temperature. This hyperparameter determines how creative the translation model becomes: a higher temperature leads to more linguistic variety, a lower temperature to results closer to the original sentence.
- 🌈 The diversity of the paraphrases is evaluated via the Longest Common Subsequence (LCS) score in comparison to their respective original sentence.
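The following is a minimal sketch of one EN→DE→EN round trip with the two models listed above, loaded through fairseq's `torch.hub` interface. The top-p value of 0.9 is an illustrative assumption; the temperature argument corresponds to the 0.8/1.2 settings described above.

```python
import torch

# Load the pre-trained WMT19 single models via fairseq's torch.hub interface
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")

def backtranslate(sentence, temperature=0.8, topp=0.9, n_best=5):
    """EN -> DE -> EN round trip; returns n_best sampled paraphrases."""
    german = en2de.translate(sentence)  # forward translation
    # Nucleus (top-p) sampling on the way back; kwargs are passed to fairseq's generator
    hypotheses = de2en.generate(
        de2en.encode(german),
        beam=n_best,
        sampling=True,
        sampling_topp=topp,
        temperature=temperature,
    )
    return [de2en.decode(h["tokens"]) for h in hypotheses]

print(backtranslate("BMW and Nissan launch electric cars.", temperature=1.2))
```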
🖊️ Example:
- Original: BMW and Nissan launch electric cars.
- EN - DE: BMW und Nissan bringen Elektroautos auf den Markt.
- DE - EN: BMW and Nissan are bringing electric cars to the market.
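To quantify how far a paraphrase drifts from its source, a word-level LCS score can be computed roughly as follows; the word-level granularity and the normalization by the longer sentence are assumptions of this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (standard DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_score(original, paraphrase):
    """Normalized LCS: 1.0 = identical wording and order, lower = more diverse."""
    orig, para = original.lower().split(), paraphrase.lower().split()
    return lcs_length(orig, para) / max(len(orig), len(para))

print(lcs_score("BMW and Nissan launch electric cars.",
                "BMW and Nissan are bringing electric cars to the market."))
```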
🚮 Filtering:
- All paraphrases that did not contain the original (metonymic) target word, or that contained only a syntactic variation of it, were filtered out.
- Paraphrases that contained the target word more than once were also filtered out.
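In essence, a paraphrase survives filtering only if the unchanged target word reappears exactly once. A simplified check, which ignores punctuation handling and the detection of syntactic variants, could look like this:

```python
def keep_paraphrase(paraphrase, target):
    """Keep a paraphrase only if the unchanged target word occurs exactly once."""
    tokens = paraphrase.replace(",", " ").replace(".", " ").split()
    return tokens.count(target) == 1

candidates = [
    "BMW and Nissan are bringing electric cars to the market.",
    "Nissan is bringing electric cars to the market.",  # target "BMW" missing -> filtered out
]
filtered = [p for p in candidates if keep_paraphrase(p, target="BMW")]
```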
🍸 2. MixUp (Feature Space)
Our method adopts the framework of the mixup-transformer proposed by Sun et al. [5]. This approach interpolates the representations of two instances on the last hidden state of the transformer model (in our case, BERT-base-uncased [1]).
To derive the interpolated hidden representation and the corresponding label, we apply the following formulas to the representations of two data samples:
🔡 Instance interpolation:

$$\tilde{h} = \lambda \cdot h_i + (1 - \lambda) \cdot h_j$$

🏷️ Label interpolation:

$$\tilde{y} = \lambda \cdot y_i + (1 - \lambda) \cdot y_j$$

Here, $h_i$ and $h_j$ denote the last-hidden-state representations of the two instances, $y_i$ and $y_j$ their labels, and $\lambda \in [0, 1]$ is the mixing ratio that controls the contribution of each instance.
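A minimal sketch of this interpolation with Hugging Face Transformers is shown below. Mixing the [CLS] vector of the last hidden state and using λ = 0.5 are assumptions of this sketch, not necessarily the exact configuration of our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mixup_hidden(sent_a, sent_b, label_a, label_b, lam=0.5, num_classes=2):
    """Interpolate the [CLS] vectors of the last hidden state and the one-hot labels."""
    batch = tokenizer([sent_a, sent_b], return_tensors="pt", padding=True)
    hidden = encoder(**batch).last_hidden_state[:, 0, :]  # [CLS] vector per instance
    mixed_h = lam * hidden[0] + (1 - lam) * hidden[1]     # interpolated representation
    labels = torch.nn.functional.one_hot(torch.tensor([label_a, label_b]), num_classes).float()
    mixed_y = lam * labels[0] + (1 - lam) * labels[1]     # interpolated (soft) label
    return mixed_h, mixed_y

h, y = mixup_hidden("BRAZIL lost the finals.", "BRAZIL has a beautiful coast.",
                    label_a=1, label_b=0)
```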
🗃️ Data
The datasets used in this project are taken from Li et al. [6]. We confine ourselves to the following three:

| 1. SemEval: Locations | 2. SemEval: Companies & Organizations | 3. ReLocaR: Locations |
|---|---|---|
| (Markert and Nissim, 2007) | (Markert and Nissim, 2007) | (Gritta et al., 2017) |
🖊️ Data Point Example:
- 0️⃣ literal: `{"sentence": ["The", "radiation", "doses", "received", "by", "workers", "in", "the", "UK", "are", "analysed."], "pos": [8, 9], "label": 0}`
- 1️⃣ non-literal: `{"sentence": ["Finally,", "we", "examine", "the", "UK", "privatization", "programme", "in", "practice."], "pos": [4, 5], "label": 1}`
✂️ We split off 10% of the training set to use as a development set. The following shows the final absolute and relative class distribution:

🛠️ Set Up
Creating a virtual environment to keep the dependencies of different projects separated is a recommended first step:

```bash
python3 -m venv mrda-venv
source mrda-venv/bin/activate
```

Next, install all necessary requirements:

```bash
pip install -r requirements.txt
```

[To be decided: possibly a separate requirements file/environment for backtranslation because of the torch version.]
⚙️ Usage
🚀 Launch our application by following the steps below:
[Which arguments exactly?]

```bash
./main.py <COMMAND> <ARGUMENTS>...
```

For `<COMMAND>` you must enter one of the commands from the list below, where you can also find an overview of the necessary `<ARGUMENTS>`.
| Command | Functionality | Arguments |
|---|---|---|
| `train` | ? | ? |
| `evaluate` | ? | ? |
| `demo` | ? | ? |

- `train`: explain more ...
- `evaluate`: ...
[ADD screenshot of demo?]
🏯 Code Structure
- ⚙️ `requirements.txt`: All necessary modules to install.
- 📱 `main.py`: Our main code file which does ...
- 💻 `code`: Here you can find all code files for our different models and data augmentation methods.
- 📀 `data`: Find all datasets in this folder.
  - 🗂️ `original_datasets`: Semeval_loc, Semeval_org, Relocar in their original form.
  - 🗂️ `backtranslation`: Contains unfiltered generated paraphrases.
  - 🗂️ `paraphrases`: Contains only filtered paraphrases.
  - 🗂️ `fused_datasets`: Contains the original datasets fused with the filtered paraphrases, ready to be used for training the models.
- 📝 `documentation`: Contains our organizational data and visualizations.
  - 🗂️ `organization`: Our research plan, presentation, and final reports.
  - 🗂️ `images`: Contains all relevant visualizations.
  - 🗂️ `results`: Find tables of our results here.
📑 References

- GitHub repository of Li et al.: "Target Word Masking for Location Metonymy Resolution"

1. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
2. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
3. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., & Auli, M. (2019). fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv preprint arXiv:1904.01038.
4. Backtranslation paper (Chen et al.).
5. Sun, L., Xia, C., Yin, W., Liang, T., Yu, P. S., & He, L. (2020). Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks. arXiv preprint arXiv:2010.02394.
6. Li, H., Vasardani, M., Tomko, M., & Baldwin, T. (2020). Target Word Masking for Location Metonymy Resolution. Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).