
Software Projekt - Data Augmentation for Metonymy Resolution

Members of the project:

Margareta Anna Kulscar kulscar@cl.uni-heidelberg.de
Sandra Friebolin friebolin@cl.uni-heidelberg.de
Mira Umlauf umlauf@cl.uni-heidelberg.de

Table of contents

Project documents
Metonymy Resolution
Data Augmentation
Methods
Data
Set Up
Usage
Code Structure
References

Project documents

This README gives a rough overview of the project. The full documentation and additional information can be found in the documents listed below.

Metonymy Resolution

A metonymy is the replacement of the actual expression by another one that is closely associated with it ¹.

Metonymies use a contiguity relation between two domains.

Example:

BRAZIL lost the finals.
- The term Brazil stands for the country's sports team and is therefore classified as a metonymy.
GOOGLE pays its employees poorly.
- This is a metonymy in which the keyword Google stands for the owner of the company rather than for the literal sense of the term.
BRAZIL has a beautiful coast.
- Here, Brazil stands for the country and thus carries the literal meaning of the term.
GOOGLE is an American company.
- The term is used literally: Google refers to the company.

Metonymy resolution is about determining whether a potentially metonymic word is used metonymically in a particular context. In this project we focus on metonymic and literal readings for locations and organizations.

Sentences that allow for mixed readings, in which both a literal and a metonymic sense are evoked, are considered non-literal in this project. This applies to the following sentence, in which the term Nigeria prompts both a metonymic and a literal reading.

"They arrived in Nigeria, hitherto a leading critic of [...]" ²

Hence, we use the two classes non-literal and literal for our binary classification task.

Data Augmentation

Data augmentation is the generation of synthetic training data for machine learning through transformations of existing data.

It is an important component for:

increasing the generalization capabilities of a model
overcoming data scarcity
regularizing the target
limiting the amount of data used to protect privacy

Consequently, it is a vital technique for evaluating the robustness of models, and improving the diversity and volume of training data to achieve better performance.

Methods

When selecting methods for our task, the main goal was to find a tradeoff between preserving labels and diversifying our dataset. Since the language models BERT ³ and RoBERTa ⁴ have not been found to benefit from very basic augmentation strategies (e.g. case changing of single characters or embedding replacements ⁵), we chose more innovative and challenging methods.

To be able to compare the influence of augmentations in different spaces, we select one method for the data space and two methods for the feature space.

1. Backtranslation (Data Space)

As a comparatively safe (= label-preserving) data augmentation strategy, we selected backtranslation using the machine translation toolkit Fairseq ⁶. Adapting the approach of Chen et al. ⁷, we use the pre-trained single models:

- [`transformer.wmt19.en-de.single_model`](https://huggingface.co/facebook/wmt19-en-de)
- [`transformer.wmt19.de-en.single_model`](https://huggingface.co/facebook/wmt19-de-en)

Each original sentence is back-translated 4 times to generate a paraphrase that is slightly different from the original sentence, but still close enough to preserve the class.
For each sentence, the top 5 paraphrases are kept. We use nucleus (top-p) sampling, likewise for diversity reasons.
We test two versions: generating paraphrases with a lower (0.8) and a higher (1.2) temperature. This hyperparameter determines how creative the translation model becomes: a higher temperature leads to more linguistic variety, a lower temperature to results closer to the original sentence.
The diversity of the paraphrases is evaluated via the Longest Common Subsequence (LCS) score in comparison to their respective original sentence.
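To illustrate how temperature and nucleus sampling interact, here is a minimal, library-free sketch. It is not the Fairseq implementation: the function names, the made-up scores, and the `top_p` value of 0.9 are assumptions chosen for illustration (the README only states the temperatures 0.8 and 1.2).

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        kept.append(idx)
        mass += probs[idx]
        if mass >= top_p:
            break
    return kept


def sample_token(logits, temperature, top_p, rng):
    """Sample one token index from the nucleus of the tempered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    kept = nucleus(probs, top_p)
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, 0.1]  # made-up scores for four candidate tokens
for temperature in (0.8, 1.2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], nucleus(probs, top_p=0.9))
print(sample_token(logits, 0.8, 0.9, random.Random(42)))
```

With the higher temperature, the probability mass is spread more evenly, so less likely tokens are sampled more often, which matches the greater linguistic variety described above.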

Example:

Original: BMW and Nissan launch electric cars.
EN - DE: BMW und Nissan bringen Elektroautos auf den Markt.
DE - EN: BMW and Nissan are bringing electric cars to the market.
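Using the example pair above, an LCS-based diversity score could be computed roughly as follows. The README does not say how the score is tokenized or normalized, so this sketch assumes a word-level LCS divided by the number of words in the original sentence; `lcs_length` and `lcs_score` are hypothetical names.

```python
import re


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the tokens of `a` seen so far and b[:j].
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(original, paraphrase):
    """LCS length normalized by the original's word count (assumed normalization)."""
    orig_tokens = re.findall(r"\w+", original.lower())
    para_tokens = re.findall(r"\w+", paraphrase.lower())
    if not orig_tokens:
        return 0.0
    return lcs_length(orig_tokens, para_tokens) / len(orig_tokens)


original = "BMW and Nissan launch electric cars."
paraphrase = "BMW and Nissan are bringing electric cars to the market."
print(lcs_score(original, paraphrase))
```

Under this normalization, a score close to 1 means the paraphrase keeps almost all words of the original in order, while lower scores indicate more diverse paraphrases.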

Filtering:

All paraphrases that did not contain the original (metonymic) target word or had syntactic variations were filtered out.
Those that contained the target word more than once were also filtered out.
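The target-word part of this filter can be sketched as below. This is only an illustration, not the project's filtering code: `keep_paraphrase` is a hypothetical helper, matching is assumed to be case-sensitive and on whole words, and the check for syntactic variations is not covered.

```python
import re


def keep_paraphrase(paraphrase, target):
    """Keep a paraphrase only if the target word occurs exactly once."""
    # Whole-word, case-sensitive match (an assumption of this sketch).
    pattern = r"\b" + re.escape(target) + r"\b"
    return len(re.findall(pattern, paraphrase)) == 1


candidates = [
    "BMW and Nissan are bringing electric cars to the market.",  # kept
    "Nissan says that Nissan cars sell well.",                    # target twice
    "The carmakers are launching electric cars.",                 # target missing
]
filtered = [c for c in candidates if keep_paraphrase(c, "Nissan")]
print(filtered)
```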

2. MixUp (Feature Space)

Our method adopts the framework of the MixUp transformer proposed by Sun et al. ⁸. This approach involves interpolating the representation of two instances on the last hidden state of the transformer model (in our case, BERT-base-uncased).

To derive the interpolated hidden representation and the corresponding label, we apply the following formulas to the representations of two data samples:

Instance interpolation:

$$\hat{x} = \lambda T(x_i) + (1 - \lambda) T(x_j)$$

Label interpolation:

$$\hat{y} = \lambda T(y_i) + (1 - \lambda) T(y_j)$$

Here, $T(x_i)$ and $T(x_j)$ represent the hidden representations of the two instances, and $T(y_i)$ and $T(y_j)$ represent their corresponding labels. $\lambda$ is a mixing coefficient that determines the degree of the interpolation.

We used a fixed $\lambda$, which was set for the entire training process. The derived instances $\hat{x}$, with the derived label $\hat{y}$ as the new true label, are then passed to the classifier to generate a prediction. The MixUp process can be applied dynamically during training at any epoch.
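As a worked illustration of the two interpolation formulas, the following sketch mixes two toy hidden vectors and their one-hot labels. The vectors, the one-hot label encoding, and the helper name `mixup` are assumptions for illustration; the value 0.4 is the default lambda listed in the Usage section.

```python
def mixup(h_i, h_j, y_i, y_j, lam=0.4):
    """Interpolate two hidden representations and their label vectors."""
    x_hat = [lam * a + (1 - lam) * b for a, b in zip(h_i, h_j)]
    y_hat = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_hat, y_hat


# Toy 3-dimensional "hidden states" (real BERT-base states have 768 dimensions).
h_literal = [1.0, 0.0, 2.0]
h_nonliteral = [0.0, 1.0, 4.0]
# One-hot labels: index 0 = literal, index 1 = non-literal.
y_literal = [1.0, 0.0]
y_nonliteral = [0.0, 1.0]

x_hat, y_hat = mixup(h_literal, h_nonliteral, y_literal, y_nonliteral, lam=0.4)
print(x_hat)  # roughly [0.4, 0.6, 3.2]
print(y_hat)  # roughly [0.4, 0.6]
```

The mixed label is a soft label, so a classifier trained on it would need a loss that accepts class probabilities rather than a single class index.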

3. TMix (Feature Space)

We use the same fixed $\lambda$, but in contrast to MixUp, TMix is applied in all epochs ⁹. It can be applied dynamically in any layer, and we focus our experiments on the transformer layers 7 and 9 for interpolation, since they have been found to contain the syntactic and semantic information.
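The idea of interpolating inside the encoder rather than at its output can be sketched with a toy stack of layers. This is a simplified illustration, not the project's model: the three toy layers stand in for the twelve transformer layers, `tmix_forward` is a hypothetical name, and the convention that mixing happens on the output of layer `mix_layer` is an assumption.

```python
def tmix_forward(layers, x_i, x_j, mix_layer, lam=0.4):
    """Run two inputs separately up to mix_layer, interpolate, then continue jointly."""
    h_i, h_j = x_i, x_j
    for layer in layers[: mix_layer + 1]:
        h_i, h_j = layer(h_i), layer(h_j)
    # Interpolate the hidden states produced by layer `mix_layer`.
    h = [lam * a + (1 - lam) * b for a, b in zip(h_i, h_j)]
    for layer in layers[mix_layer + 1:]:
        h = layer(h)
    return h


# Three toy, partly non-linear "layers" operating on lists of floats.
toy_layers = [
    lambda h: [2 * v for v in h],
    lambda h: [max(0.0, v - 1.0) for v in h],
    lambda h: [v * v for v in h],
]

x_i, x_j = [1.0, 2.0], [3.0, 0.0]
print(tmix_forward(toy_layers, x_i, x_j, mix_layer=1, lam=0.5))  # [9.0, 2.25]
print(tmix_forward(toy_layers, x_i, x_j, mix_layer=0, lam=0.5))  # [9.0, 1.0]
```

Because the layers are non-linear, the result depends on where the interpolation takes place, which is why the choice of mixing layer is a hyperparameter. The labels would be interpolated with the same lambda as in MixUp.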

Data

The datasets used in this project are taken from Li et al. ¹⁰ We confine ourselves to the following three:

SemEval: Locations and SemEval: Companies & Organizations	ReLocar: Locations
3800 sentences from the BNC corpus ²	Wikipedia-based dataset containing 2026 sentences ¹¹

Data Point Example:

literal: {"sentence": ["The", "radiation", "doses", "received", "by", "workers", "in", "the", "UK", "are", "analysed."], "pos": [8, 9], "label": 0}

non-literal: {"sentence": ["Finally,", "we", "examine", "the", "UK", "privatization", "programme", "in", "practice."], "pos": [4, 5], "label": 1}
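A data point in this format can be read with the standard `json` module. The sketch below assumes that `pos` is a half-open `[start, end)` token span, which is consistent with both examples above (in each, it selects the token "UK"); `target_span` and `LABELS` are hypothetical names.

```python
import json

LABELS = {0: "literal", 1: "non-literal"}

line = (
    '{"sentence": ["Finally,", "we", "examine", "the", "UK", "privatization", '
    '"programme", "in", "practice."], "pos": [4, 5], "label": 1}'
)


def target_span(example):
    """Return the tokens of the potentially metonymic target word."""
    start, end = example["pos"]  # assumed half-open span [start, end)
    return example["sentence"][start:end]


example = json.loads(line)
print(target_span(example), LABELS[example["label"]])
```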

We split 10% of the training set to use as development set. The following shows the final absolute and relative class distribution:

Set Up

As a first step, we recommend creating a virtual environment so that the dependencies of different projects stay separated:

python3 -m venv mrda-venv
source mrda-venv/bin/activate

Install all necessary requirements next:

pip install -r requirements.txt

[Still to be decided: possibly two separate requirements files/environments for backtranslation because of the torch version]

Usage

Launch our application by following the steps below:

[Which arguments exactly?]

./main.py <COMMAND> <ARGUMENTS>...

For <COMMAND>, enter one of the commands from the list below, which also gives an overview of the necessary <ARGUMENTS>.

Command	Functionality	Arguments
General
`--architecture`	Defines which model is used.	Choose `bert-base-uncased` or `roberta`
`--model_type`	How to initialize the Classification Model	Choose `separate` or `one`
`--mixlayer`	Specify in which `layer` the interpolation takes place. Only select one layer at a time.	Choose from {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
`--tokenizer`	Which tokenizer to use when preprocessing the datasets.	Choose `swp` for our tokenizer, `li` for the tokenizer of Li et al. ¹⁰, or `salami` for the tokenizer used by another student project
`-max`/`--max_length`	Defines the maximum sequence length when tokenizing the sentences.	Always choose 256 for TMix and 512 for the other models.
`--train_loop`	Defines which train loop to use.	Choose `swp` for our train loop implementation and `salami` for the one of the salami student project.
`-e`/`--epochs`	Number of epochs for training.
`-lr`/`--learning_rate`	Learning rate for training.	`type=float`
`-rs`/`--random_seed`	Random seed for initialization of the model.	Default is 42.
`-sd`/`--save_directory`	This option specifies the destination directory for the output results of the run.
`-msp`/`--model_save_path`	This option specifies the destination directory for saving the model.	We recommend saving models in Code/saved_models.
`-tc`/`--tcontext`	Whether or not to preprocess the training set with context.
`--masking`	Whether or not to mask the target word.
`-lambda`/`--lambda_value`	Specifies the lambda value for interpolation of MixUp and TMix.	Default is 0.4, `type=float`
	MixUp specific
`-mixup`/`--mix_up`	Whether or not to use MixUp. If yes, please specify `-lambda` and `-mixepoch`.
`-mixepoch`/`--mixepoch`	Specifies the epoch(s) in which to apply MixUp.	Default is `None`
	TMix specific
`--tmix`	Whether or not to use TMix. If yes, please specify `--mixlayer` and `-lambda`.
	Datasets specific
`-t`/`--train_dataset`	Defines which dataset is chosen for training.	Choose any of the datasets from original_datasets, fused_datasets or paraphrases
`-v`/`--test_dataset`	Defines which dataset is chosen for testing.	Choose from "semeval_test.txt", "companies_test.txt" or "relocar_test.txt"
`--imdb`	Whether or not to use the IMDB dataset. Note that this is only relevant for validating our TMix implementation.
`-b`/`--batch_size`	Defines the batch size for the training process.	Default is 32.
`-tb`/`--test_batch_size`	Specifies the batch size for the test process.	Default is 16.
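To show how a subset of these options and defaults fits together, here is a hedged `argparse` sketch. It is not the project's actual `main.py`: only the flag names, choices, and defaults are taken from the table above, while `build_parser` and the decision to model `--tmix` and `-mixup` as boolean switches are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring part of the option table above."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI options")
    parser.add_argument("--architecture", choices=["bert-base-uncased", "roberta"])
    parser.add_argument("--mixlayer", type=int, choices=range(12))
    parser.add_argument("-lambda", "--lambda_value", type=float, default=0.4)
    parser.add_argument("-rs", "--random_seed", type=int, default=42)
    parser.add_argument("-b", "--batch_size", type=int, default=32)
    parser.add_argument("-tb", "--test_batch_size", type=int, default=16)
    # Assumed to be on/off switches; the README only says "whether or not".
    parser.add_argument("-mixup", "--mix_up", action="store_true")
    parser.add_argument("--tmix", action="store_true")
    return parser


# Example: a TMix run that interpolates in layer 7 with the default lambda.
args = build_parser().parse_args(
    ["--architecture", "bert-base-uncased", "--tmix", "--mixlayer", "7"]
)
print(args.architecture, args.tmix, args.mixlayer, args.lambda_value)
```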

extra: BT and inference

[ADD screenshot of demo?]

Code Structure

requirements.txt: All necessary modules to install.
main.py: Our main code file which does ...
Code: Here, you can find all code files for our different models and data augmentation methods.
data: Find all datasets in this folder.
- backtranslations: Contains unfiltered generated paraphrases.
- fused_datasets: Contains original datasets fused with filtered paraphrases. Ready to be used for training the models.
- original_datasets: Semeval_loc, Semeval_org, Relocar in their original form.
- paraphrases: Contains only filtered paraphrases.
documentation: Contains our organizational data and visualizations.
- images: Contains all relevant visualizations.
- organization: Our research plan, presentation, final reports.
- results: Find tables of our results.

References

1. English Oxford Dictionary. "Metonymy".
2. Markert, Katja & Nissim, Malvina. "SemEval-2007 task 08: Metonymy resolution at SemEval-2007." Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.
3. Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton & Toutanova, Kristina. "BERT: pre-training of deep bidirectional transformers for language understanding." CoRR, 2018.
4. Liu, Yinhan, Ott, Myle, Goyal, Naman, Du, Jingfei, Joshi, Mandar, Chen, Danqi, Levy, Omer, Lewis, Mike, Zettlemoyer, Luke & Stoyanov, Veselin. "RoBERTa: A robustly optimized BERT pretraining approach." CoRR, 2019.
5. Bayer, Markus, Kaufhold, Marc-André & Reuter, Christian. "A survey on data augmentation for text classification." CoRR, 2021.
6. Ott, Myle, Edunov, Sergey, Baevski, Alexei, Fan, Angela, Gross, Sam, Ng, Nathan, Grangier, David & Auli, Michael. "fairseq: A fast, extensible toolkit for sequence modeling." Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
7. Chen, Jiaao, Wu, Yuwei & Yang, Diyi. "Semi-Supervised Models via Data Augmentation for Classifying Interactive Affective Responses." 2020.
8. Sun, Lichao, Xia, Congying, Yin, Wenpeng, Liang, Tingting, Yu, Philip S. & He, Lifang. "Mixup-transformer: dynamic data augmentation for NLP tasks." 2020.
9. Chen, Jiaao, Wu, Yuwei & Yang, Diyi. "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification." 2020.
10. Li, Haonan, Vasardani, Maria, Tomko, Martin & Baldwin, Timothy. "Target word masking for location metonymy resolution." Proceedings of the 28th International Conference on Computational Linguistics, December 2020.
11. Gritta, Milan, Pilehvar, Mohammad Taher, Limsopatham, Nut & Collier, Nigel. "Vancouver welcomes you! minimalist location metonymy resolution." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.