
Software Project - Data Augmentation for Metonymy Resolution

Members of the project:

Table of contents

  1. 📚 Project documents
  2. 🔎 Metonymy Resolution
  3. 📈 Data Augmentation
  4. 💡 Methods
    1. 📝 Backtranslation
    2. 🍸 MixUp
  5. 🗃️ Data
  6. 🛠️ Set Up
  7. ⚙️ Usage
  8. 🏯 Code Structure
  9. 📑 References

📚 Project documents

This README gives a rough overview of the project. The full documentation and additional information can be found in the documents listed below.


🔎 Metonymy Resolution

A metonymy is the replacement of an expression by another expression that is closely related to it.

Metonymies exploit a contiguity relation between two domains.

🖊️ Example:

  1. 🇧🇷 BRAZIL lost the finals.

    • The term Brazil stands for the country's national sports team and is thus to be classified as a metonymy.
  2. 💻 GOOGLE pays its employees poorly.

    • This is a metonymy: the keyword Google stands for the owner of the company, not for the literal sense of the term.
  3. 🇧🇷 BRAZIL has a beautiful coast.

    • Here, Brazil stands for the country itself and thus carries the literal meaning of the term.
  4. 💻 GOOGLE is an American company.

    • The term has literal meaning: Google refers to the company.

Metonymy resolution is about determining whether a potentially metonymic word is used metonymically in a particular context. In this project we focus on metonymic and literal readings for locations and organizations.

ℹ️ Sentences that allow for mixed readings, where both a literal and a metonymic sense are evoked, are treated as metonymic in this project. This holds for the following sentence, in which the term Nigeria prompts both a metonymic and a literal reading.

  • "They arrived in Nigeria, hitherto a leading critic of [...]" [ZITAT: SemEval-2007 Task 08: Metonymy Resolution at SemEval-2007]

➡️ Hence, we use the two classes non-literal and literal for this binary classification task.


📈 Data Augmentation

Data augmentation is the generation of synthetic training data for machine learning through transformations of existing data.

It is an important component for:

  • ➕ increasing the generalization capabilities of a model
  • ➕ overcoming data scarcity
  • ➕ regularizing the model
  • ➕ reducing the amount of (potentially sensitive) data needed, which helps protect privacy

Consequently, it is a vital technique for evaluating the robustness of models and for improving the diversity and volume of training data to achieve better performance.


💡 Methods

When selecting methods for our task, the main goal was to find a trade-off between preserving labels and diversifying our dataset. Since the language models BERT [1] and RoBERTa [2] have not been found to profit from very basic augmentation strategies (e.g., changing the case of single characters or replacing embeddings [citation needed]), we chose more innovative and challenging methods.

To be able to compare the influence of augmentations in different spaces, we select one method operating in the data space and one in the feature space.

📝 1. Backtranslation (Data Space)

As a comparatively safe (i.e., label-preserving) data augmentation strategy, we selected backtranslation using the machine translation toolkit fairseq [3] (Ott et al., 2019). Similar to Chen et al. [4], we use the pre-trained single models:

- [`transformer.wmt19.en-de.single_model`](https://huggingface.co/facebook/wmt19-en-de)
- [`transformer.wmt19.de-en.single_model`](https://huggingface.co/facebook/wmt19-de-en)
  • 🔄 Each original sentence is back-translated 4 times to generate paraphrases that differ slightly from the original sentence but remain close enough to preserve the class.

  • ✅ For each sentence, the top 5 paraphrases are kept, using nucleus (top-p) sampling, likewise for diversity reasons.

  • 🔥 We test two versions: generating paraphrases with a lower (0.8) and a higher (1.2) temperature. This hyperparameter determines how creative the translation model becomes: a higher temperature leads to more linguistic variety, a lower temperature to results closer to the original sentence (see the sketch after the example below).

  • 🌈 The diversity of the paraphrases is evaluated via their Longest Common Subsequence (LCS) score with respect to the original sentence, as sketched directly below.
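To make the diversity measure concrete, the following is a minimal sketch of a token-level LCS score; normalizing by the length of the longer sentence is an assumption made here for illustration, not a detail taken from our implementation.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            if tok_a == tok_b:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def lcs_score(original: str, paraphrase: str) -> float:
    """LCS score in [0, 1]; 1.0 means the paraphrase repeats the original verbatim."""
    a, b = original.split(), paraphrase.split()
    return lcs_length(a, b) / max(len(a), len(b))
```

A lower score therefore indicates a more diverse paraphrase.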

🖊️ Example:

  • Original: BMW and Nissan launch electric cars.
  • EN → DE: BMW und Nissan bringen Elektroautos auf den Markt.
  • DE → EN: BMW and Nissan are bringing electric cars to the market.
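To illustrate the generation step, here is a minimal round-trip sketch using fairseq's torch.hub interface with the two single models listed above. The `sampling_topp` value of 0.9 and the helper name `backtranslate` are illustrative assumptions, not settings taken from our experiments.

```python
import torch

# Load the pre-trained WMT19 single models listed above via torch.hub.
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")

def backtranslate(sentence: str, temperature: float = 0.8, n: int = 4) -> list[str]:
    """Round-trip EN -> DE -> EN with nucleus (top-p) sampling."""
    paraphrases = []
    for _ in range(n):
        german = en2de.translate(sentence, sampling=True, sampling_topp=0.9,
                                 temperature=temperature)
        english = de2en.translate(german, sampling=True, sampling_topp=0.9,
                                  temperature=temperature)
        paraphrases.append(english)
    return paraphrases

print(backtranslate("BMW and Nissan launch electric cars."))
```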

🚮 Filtering:

  • All paraphrases that did not contain the original (potentially metonymic) target word, or that contained only a syntactic variation of it, were filtered out.

  • Paraphrases that contained the target word more than once were also filtered out (a minimal sketch of this filter follows).
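A minimal sketch of the second filter, assuming a simple whitespace tokenization; the actual filter may involve additional syntactic checks.

```python
def keep_paraphrase(paraphrase: str, target: str) -> bool:
    """Keep a paraphrase only if the original target word occurs exactly once."""
    return paraphrase.split().count(target) == 1

assert keep_paraphrase("BMW and Nissan are bringing electric cars to the market.", "Nissan")
assert not keep_paraphrase("Nissan says Nissan will launch electric cars.", "Nissan")
```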

🍸 2. MixUp (Feature Space)

Our method adopts the framework of the mixup-transformer proposed by Sun et al. [5]. This approach interpolates the representations of two instances on the last hidden state of the transformer model (in our case, BERT-base-uncased [1]).

To derive the interpolated hidden representation and the corresponding label, we apply the following formulas to the representations of two data samples:

🔡 Instance interpolation:

\hat{x} = \lambda T(x_i) + (1- \lambda)T(x_j)

🏷️ Label interpolation :

\hat{y} = \lambda y_i + (1- \lambda) y_j

Here, T(x_i) and T(x_j) denote the hidden representations of the two instances, y_i and y_j their corresponding labels, and \lambda is a mixing coefficient that determines the degree of interpolation. We use a fixed \lambda that is set once for the entire training process. The derived instance \hat{x}, with the derived label \hat{y} as its new true label, is then passed to the classifier to generate a prediction. The MixUp process can be applied dynamically during training at any epoch.
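The following is a minimal sketch of the interpolation step, assuming the [CLS] vector of the last hidden state serves as the instance representation and that labels are one-hot vectors; both choices are illustrative assumptions rather than confirmed details of our implementation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mixup(sent_a: str, sent_b: str, y_a: torch.Tensor, y_b: torch.Tensor,
          lam: float = 0.5) -> tuple[torch.Tensor, torch.Tensor]:
    """Interpolate two instances on the encoder's last hidden state."""
    enc_a = tokenizer(sent_a, return_tensors="pt")
    enc_b = tokenizer(sent_b, return_tensors="pt")
    h_a = encoder(**enc_a).last_hidden_state[:, 0]  # T(x_i): [CLS] vector
    h_b = encoder(**enc_b).last_hidden_state[:, 0]  # T(x_j): [CLS] vector
    x_hat = lam * h_a + (1 - lam) * h_b             # instance interpolation
    y_hat = lam * y_a + (1 - lam) * y_b             # label interpolation
    return x_hat, y_hat                             # passed on to the classifier head

# One-hot labels for the classes literal (0) and non-literal (1).
x_hat, y_hat = mixup("BRAZIL has a beautiful coast.", "BRAZIL lost the finals.",
                     torch.tensor([[1.0, 0.0]]), torch.tensor([[0.0, 1.0]]))
```

With \lambda = 0.5, the derived label is [0.5, 0.5], i.e., the synthetic instance counts as half literal and half non-literal.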


🗃️ Data

The datasets used in this project are taken from Li et al. [6]. We confine ourselves to the following three:

  1. SemEval: Locations (Markert and Nissim, 2007)
  2. SemEval: Companies & Organizations (Markert and Nissim, 2007)
  3. ReLocaR: Locations (Gritta et al., 2017)

🖊️ Data Point Example:

0️⃣ literal: {"sentence": ["The", "radiation", "doses", "received", "by", "workers", "in", "the", "UK", "are", "analysed."], "pos": [8, 9], "label": 0}

1️⃣ non-literal: {"sentence": ["Finally,", "we", "examine", "the", "UK", "privatization", "programme", "in", "practice."], "pos": [4, 5], "label": 1}
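To make the format concrete, here is a minimal sketch that parses a data point and splits off the development set described below; the file path, the stratification by label, and the random seed are illustrative assumptions. The "pos" field is a half-open token span [start, end) marking the target word (consistent with both examples above).

```python
import json
from sklearn.model_selection import train_test_split

# Each line of a dataset file is one JSON object (file path is assumed).
with open("data/original_datasets/semeval_loc_train.json", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

ex = examples[0]
# "pos" is a half-open token span [start, end) marking the target word.
target = ex["sentence"][ex["pos"][0]:ex["pos"][1]]  # e.g. ['UK']

# Split off 10% of the training set as the development set.
train_set, dev_set = train_test_split(
    examples, test_size=0.10,
    stratify=[e["label"] for e in examples], random_state=42)
```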

✂️ We split off 10% of the training set to use as a development set (as sketched above). The following shows the final absolute and relative class distribution:


🛠️ Set Up

As a first step, we recommend creating a virtual environment so that the dependencies of different projects stay separated:

python3 -m venv mrda-venv
source mrda-venv/bin/activate

Install all necessary requirements next:

pip install -r requirements.txt

[still to be decided: possibly two separate requirements files/environments for backtranslation because of the torch version]


⚙️ Usage

🚀 Launch our application by following the steps below:

[which arguments exactly?]

./main.py <COMMAND> <ARGUMENTS>...

For <COMMAND>, enter one of the commands from the table below, which also gives an overview of the required <ARGUMENTS>.

| Command  | Functionality | Arguments |
|----------|---------------|-----------|
| train    | ?             | ?         |
| evaluate | ?             | ?         |
| demo     | ?             | ?         |

  • train: explain more ...
  • evaluate: ...
  • ...

[ADD screenshot of demo?]


🏯 Code Structure

  • ⚙️ requirements.txt: All necessary modules to install.
  • 📱 main.py: Our main code file which does ...
  • 💻 code: Here, you can find all code files for our different models and data augmentation methods.
  • 📀 data: Find all datasets in this folder.
    • 🗂️ original_datasets: Semeval_loc, Semeval_org, Relocar in their original form.
    • 🗂️ backtranslation: Contains unfiltered generated paraphrases.
    • 🗂️ paraphrases: Contains only filtered paraphrases.
    • 🗂️ fused_datasets: Contains original datasets fused with filtered paraphrases. Ready to be used for training the models.
  • 📝 documentation: Contains our organizational data and visualizations.
    • 🗂️ organization: Our research plan, presentation, final reports.
    • 🗂️ images: Contains all relevant visualizations.
    • 🗂️ results: Find tables of our results.

📑 References

  1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.

  2. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

  3. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., & Auli, M. (2019). fairseq: A Fast, Extensible Toolkit for Sequence Modeling. arXiv preprint arXiv:1904.01038.

  4. Chen et al. Backtranslation paper.

  5. Sun, L., Xia, C., Yin, W., Liang, T., Yu, P. S., & He, L. (2020). Mixup-transformer: dynamic data augmentation for NLP tasks. arXiv preprint arXiv:2010.02394.

  6. Li, H., Vasardani, M., Tomko, M., & Baldwin, T. (2020). Target Word Masking for Location Metonymy Resolution. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020).