
Software Projekt - Data Augmentation for Metonymy Resolution

Members of the project:

Table of contents

  1. :books: Project documents
  2. :mag_right: Metonymy Resolution
  3. :chart_with_upwards_trend: Data Augmentation
  4. :bulb: Methods
    1. :pencil: Backtranslation
    2. :cocktail: MixUp
    3. :globe_with_meridians: TMix
  5. :card_box: Data
  6. :tools: Set Up
  7. :gear: Usage
  8. :japanese_castle: Code Structure
  9. :bookmark_tabs: References

:books: Project documents

This README gives a rough overview of the project. The full documentation and additional information can be found in the documents listed below.


:mag_right: Metonymy Resolution

A metonymy is the replacement of the actual expression by another one that is closely associated with it [1].

Metonymies use a contiguity relation between two domains.

:pen_ballpoint: Example:

  1. :flag_br: BRAZIL lost the finals.

    • The term Brazil stands for a sports team of the country and is thus to be classified as a metonymy.
  2. :computer: GOOGLE pays its employees poorly.

    • This is a metonymy where the keyword Google stands for the owner of the company, but not for the literal sense of the term.
  3. :flag_br: BRAZIL has a beautiful coast.

    • Here, Brazil stands for the country and thus carries the literal meaning of the term.
  4. :computer: GOOGLE is an American company.

    • The term has literal meaning: Google refers to the company.

Metonymy resolution is about determining whether a potentially metonymic word is used metonymically in a particular context. In this project we focus on metonymic and literal readings for locations and organizations.

:information_source: Sentences that allow for mixed readings, where a literal and metonymic sense is evoked, are considered non-literal in this project. This is true of the following sentence, in which the term Nigeria prompts both a metonymic and a literal reading.

  • "They arrived in Nigeria, hitherto a leading critic of [...]" 2

:arrow_right: Hence, we use the two classes non-literal and literal for our binary classification task.


:chart_with_upwards_trend: Data Augmentation

Data augmentation is the generation of synthetic training data for machine learning through transformations of existing data.

It is an important component for:

  • :heavy_plus_sign: increasing the generalization capabilities of a model
  • :heavy_plus_sign: overcoming data scarcity
  • :heavy_plus_sign: regularizing the target
  • :heavy_plus_sign: limiting the amount of data used to protect privacy

Consequently, it is a vital technique for evaluating the robustness of models and for improving the diversity and volume of training data to achieve better performance.


:bulb: Methods

When selecting methods for our task, the main goal was to find a tradeoff between label-preserving methods and diversifying our dataset. Since the language models BERT [3] and RoBERTa [4] have not been found to profit from very basic augmentation strategies (e.g. case changing of single characters or embedding replacements [5]), we chose more innovative and challenging methods.

To be able to compare the influence of augmentations in different spaces, we selected one method for the data space and two methods for the feature space.

:pencil: 1. Backtranslation (Data Space)

As a comparatively safe (= label-preserving) data augmentation strategy, we selected backtranslation using the machine translation toolkit fairseq [6]. Adapting the approach of Chen et al. [7], we use the pre-trained single models:

- [`transformer.wmt19.en-de.single_model`](https://huggingface.co/facebook/wmt19-en-de)
- [`transformer.wmt19.de-en.single_model`](https://huggingface.co/facebook/wmt19-de-en)
  • :arrows_counterclockwise: Each original sentence is back-translated 4 times to generate paraphrases that are slightly different from the original sentence, but still close enough to preserve the class.

  • :white_check_mark: For each sentence, the top 5 paraphrases are kept; we use nucleus (top-p) sampling, likewise for diversity reasons.

  • :fire: We test two versions: generating paraphrases with a lower (0.8) and a higher (1.2) temperature. This hyperparameter determines how creative the translation model becomes: a higher temperature leads to more linguistic variety, a lower temperature to results closer to the original sentence.

  • :rainbow: The diversity of the paraphrases is evaluated via the Longest Common Subsequence (LCS) score in comparison to their respective original sentence.

:pen_ballpoint: Example:

  • Original: BMW and Nissan launch electric cars.
  • EN - DE: BMW und Nissan bringen Elektroautos auf den Markt.
  • DE - EN: BMW and Nissan are bringing electric cars to the market.
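The round trip above can be reproduced with fairseq's `torch.hub` interface. The following is a minimal sketch assuming the two WMT19 single models listed earlier; the top-p value of 0.9 is a placeholder rather than our tuned setting:

```python
import torch

# Load the pre-trained WMT19 single models via torch.hub.
en2de = torch.hub.load("pytorch/fairseq", "transformer.wmt19.en-de.single_model",
                       tokenizer="moses", bpe="fastbpe")
de2en = torch.hub.load("pytorch/fairseq", "transformer.wmt19.de-en.single_model",
                       tokenizer="moses", bpe="fastbpe")

def backtranslate(sentence: str, temperature: float = 0.8) -> str:
    """One EN -> DE -> EN round trip with nucleus (top-p) sampling."""
    german = en2de.translate(sentence, sampling=True, sampling_topp=0.9,
                             temperature=temperature)
    return de2en.translate(german, sampling=True, sampling_topp=0.9,
                           temperature=temperature)

def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists,
    used to score how far a paraphrase diverges from its original."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i + 1][j], dp[i][j + 1]))
    return dp[-1][-1]

original = "BMW and Nissan launch electric cars."
paraphrase = backtranslate(original)
print(paraphrase, lcs_length(original.split(), paraphrase.split()))
```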

:put_litter_in_its_place: Filtering:

  • All paraphrases that did not contain the original (metonymic) target word, or that only contained a syntactic variation of it, were filtered out.

  • Those that contained the target word more than once were also filtered out.
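A filter along these lines could look like the sketch below (a simplification that matches whitespace-separated tokens only, ignoring punctuation attached to the target word; the function name is ours):

```python
def keep_paraphrase(paraphrase: str, target: str) -> bool:
    """Keep a paraphrase only if the (metonymic) target word
    occurs in it exactly once."""
    return paraphrase.split().count(target) == 1

# Example: the target "Nissan" appears exactly once, so the paraphrase is kept.
keep_paraphrase("BMW and Nissan are bringing electric cars to the market.", "Nissan")
```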

:cocktail: 2. MixUp (Feature Space)

Our method adopts the framework of the MixUp transformer proposed by Sun et al. [8]. This approach interpolates the representations of two instances on the last hidden state of the transformer model (in our case, `bert-base-uncased`).

To derive the interpolated hidden representation and corresponding label, we use the following formulas on the representation of two data samples:

:abcd: Instance interpolation:

```math
\hat{x} = \lambda T(x_i) + (1 - \lambda) T(x_j)
```

:label: Label interpolation:

```math
\hat{y} = \lambda y_i + (1 - \lambda) y_j
```

Here, $`T(x_i)`$ and $`T(x_j)`$ denote the hidden representations of the two instances, $`y_i`$ and $`y_j`$ their corresponding labels, and $`\lambda`$ is a mixing coefficient that determines the degree of the interpolation.

We use a fixed $`\lambda`$ that is set for the entire training process. The derived instance $`\hat{x}`$, with the derived label $`\hat{y}`$ as its new true label, is then given to the classifier to generate a prediction. The MixUp process can be applied dynamically during training in any epoch.
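For illustration, the sketch below implements this interpolation on the last hidden state, assuming `bert-base-uncased` from Hugging Face `transformers`; the helper name and the soft-label cross-entropy are our own framing, not the project's exact training code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # literal vs. non-literal

def mixup_loss(sents_i, sents_j, y_i, y_j, lam=0.4):
    """Interpolate two batches on the last hidden state and train
    against the interpolated (soft) label, following Sun et al. [8]."""
    enc_i = tokenizer(sents_i, return_tensors="pt", padding=True, truncation=True)
    enc_j = tokenizer(sents_j, return_tensors="pt", padding=True, truncation=True)
    h_i = encoder(**enc_i).last_hidden_state[:, 0]   # T(x_i): [CLS] representation
    h_j = encoder(**enc_j).last_hidden_state[:, 0]   # T(x_j)
    x_hat = lam * h_i + (1 - lam) * h_j              # instance interpolation
    y_hat = (lam * F.one_hot(y_i, 2).float()         # label interpolation
             + (1 - lam) * F.one_hot(y_j, 2).float())
    logits = classifier(x_hat)
    # Cross-entropy against the soft interpolated label.
    return -(y_hat * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```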

:globe_with_meridians: 3. TMix (Feature Space)

We use the same fixed $`\lambda`$, but in contrast to MixUp, TMix is applied in all epochs [9]. It can dynamically be applied in any layer; we focus our experiments on transformer layers 7 and 9 for interpolation, since these have been found to encode syntactic and semantic information.
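A minimal sketch of this layer-level interpolation, assuming Hugging Face's `BertModel` (attention masks are omitted for brevity, and the function name is ours):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def tmix_encode(enc_i, enc_j, mix_layer=7, lam=0.4):
    """Run BERT up to `mix_layer` for both batches, interpolate the
    hidden states there, then continue through the remaining layers."""
    h_i = bert.embeddings(input_ids=enc_i["input_ids"])
    h_j = bert.embeddings(input_ids=enc_j["input_ids"])
    for layer in bert.encoder.layer[:mix_layer]:
        h_i = layer(h_i)[0]
        h_j = layer(h_j)[0]
    h = lam * h_i + (1 - lam) * h_j              # interpolate at the mix layer
    for layer in bert.encoder.layer[mix_layer:]:
        h = layer(h)[0]
    return h[:, 0]                               # mixed [CLS] representation
```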


:card_box: Data

The datasets used in this project are taken from Li et al. [10]. We confine ourselves to the following three:

| Dataset | Description |
|---|---|
| SemEval: Locations & SemEval: Companies & Organizations | 3,800 sentences from the BNC corpus [2] |
| ReLocar: Locations | Wikipedia-based dataset containing 2,026 sentences [11] |

:pen_ballpoint: Data Point Example:

:zero: literal: {"sentence": ["The", "radiation", "doses", "received", "by", "workers", "in", "the", "UK", "are", "analysed."], "pos": [8, 9], "label": 0}

:one: non-literal: {"sentence": ["Finally,", "we", "examine", "the", "UK", "privatization", "programme", "in", "practice."], "pos": [4, 5], "label": 1}
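Each line is one JSON object, and `pos` is a token span indexing into the tokenized sentence. A small reader sketch (the file path is a placeholder):

```python
import json

# Read the JSON-lines format shown above; the path is a placeholder.
with open("data/original_datasets/semeval_train.txt") as f:
    for line in f:
        example = json.loads(line)
        start, end = example["pos"]              # target word span
        target = example["sentence"][start:end]  # e.g. ["UK"]
        label = example["label"]                 # 0 = literal, 1 = non-literal
```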

:scissors: We split off 10% of the training set to use as a development set. The following shows the final absolute and relative class distribution:


:tools: Set Up

As a recommended first step, create a virtual environment to keep the dependencies of different projects separated:

```sh
python3 -m venv mrda-venv
source mrda-venv/bin/activate
```

Install all necessary requirements next:

```sh
pip install -r requirements.txt
```

[still to be decided: possibly two separate requirements files/environments for backtranslation because of the torch version]


:gear: Usage

:rocket: Launch our application by following the steps below:

[which arguments exactly?]

```sh
./main.py <COMMAND> <ARGUMENTS>...
```

For `<COMMAND>`, enter one of the commands from the list below, which also gives an overview of the required `<ARGUMENTS>`.

**General**

| Argument | Functionality |
|---|---|
| `--architecture` | Defines which model is used. Choose `bert-base-uncased` or `roberta`. |
| `--model_type` | How to initialize the classification model. Choose `separate` or `one`. |
| `--mixlayer` | Specifies in which layer the interpolation takes place. Only select one layer at a time. Choose from {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. |
| `--tokenizer` | Which tokenizer to use when preprocessing the datasets. Choose `swp` for our tokenizer, `li` for the tokenizer of Li et al. [10], or `salami` for the tokenizer used by another student project. |
| `-max`/`--max_length` | Defines the maximum sequence length when tokenizing the sentences. :warning: Always choose 256 for TMix and 512 for the other models. |
| `--train_loop` | Defines which train loop to use. Choose `swp` for our train loop implementation and `salami` for the one of the salami student project. |
| `-e`/`--epochs` | Number of epochs for training. |
| `-lr`/`--learning_rate` | Learning rate for training (`float`). |
| `-rs`/`--random_seed` | Random seed for initialization of the model. Default: `42`. |
| `-sd`/`--save_directory` | Destination directory for the output results of the run. |
| `-msp`/`--model_save_path` | Destination directory for saving the model. We recommend saving models in `Code/saved_models`. |
| `-tc`/`--tcontext` | Whether or not to preprocess the training set with context. |
| `--masking` | Whether or not to mask the target word. |
| `-lambda`/`--lambda_value` | Specifies the lambda value for the interpolation of MixUp and TMix (`float`). Default: `0.4`. |

**MixUp specific**

| Argument | Functionality |
|---|---|
| `-mixup`/`--mix_up` | Whether or not to use MixUp. If yes, please specify `-lambda` and `-mixepoch`. |
| `-mixepoch`/`--mixepoch` | Specifies the epoch(s) in which to apply MixUp. Default: `None`. |

**TMix specific**

| Argument | Functionality |
|---|---|
| `--tmix` | Whether or not to use TMix. If yes, please specify `--mixlayer` and `-lambda`. |

**Dataset specific**

| Argument | Functionality |
|---|---|
| `-t`/`--train_dataset` | Defines which dataset is chosen for training. Choose any of the datasets from `original_datasets`, `fused_datasets` or `paraphrases`. |
| `-v`/`--test_dataset` | Defines which dataset is chosen for testing. Choose from `semeval_test.txt`, `companies_test.txt` or `relocar_test.txt`. |
| `--imdb` | Whether or not to use the IMDB dataset. Note that this is only relevant for validating our TMix implementation. |
| `-b`/`--batch_size` | Defines the batch size for the training process. Default: `32`. |
| `-tb`/`--test_batch_size` | Specifies the batch size for the test process. Default: `16`. |
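As an illustration, a training run with MixUp could be launched roughly as follows; `<COMMAND>` and the training file name are placeholders, so consult the tables above for the actual choices:

```sh
./main.py <COMMAND> --architecture bert-base-uncased --tokenizer swp \
    --train_loop swp -e 5 -lr 2e-5 -b 32 \
    -t semeval_train.txt -v semeval_test.txt \
    -mixup -lambda 0.4 -mixepoch 3
```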

[extra: BT and inference]

[ADD screenshot of demo?]


:japanese_castle: Code Structure

  • :gear: requirements.txt: All necessary modules to install.
  • :iphone: main.py: Our main code file which does ...
  • :computer: Code: Here, you can find all code files for our different models and data augmentation methods.
  • :dvd: data: Find all datasets in this folder.
    • :dividers: backtranslations: Contains unfiltered generated paraphrases.
    • :dividers: fused_datasets: Contains original datasets fused with filtered paraphrases. Ready to be used for training the models.
    • :dividers: original_datasets: Semeval_loc, Semeval_org, Relocar in their original form.
    • :dividers: paraphrases: Contains only filtered paraphrases.
  • :pencil: documentation: Contains our organizational data and visualizations.
    • :dividers: images: Contains all relevant visualizations.
    • :dividers: organization: Our research plan, presentation, final reports.
    • :dividers: results: Find tables of our results.

:bookmark_tabs: References

  1. Oxford English Dictionary. "Metonymy".

  2. Markert, Katja & Nissim, Malvina. "SemEval-2007 task 08: Metonymy resolution at SemEval-2007." Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.

  3. Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton & Toutanova, Kristina. "BERT: pre-training of deep bidirectional transformers for language understanding." CoRR, 2018.

  4. Liu, Yinhan, Ott, Myle, Goyal, Naman, Du, Jingfei, Joshi, Mandar, Chen, Danqi, Levy, Omer, Lewis, Mike, Zettlemoyer, Luke & Stoyanov, Veselin. "RoBERTa: A robustly optimized BERT pretraining approach." CoRR, 2019.

  5. Bayer, Markus, Kaufhold, Marc-André & Reuter, Christian. "A survey on data augmentation for text classification." CoRR, 2021.

  6. Ott, Myle, Edunov, Sergey, Baevski, Alexei, Fan, Angela, Gross, Sam, Ng, Nathan, Grangier, David & Auli, Michael. "fairseq: A fast, extensible toolkit for sequence modeling." Proceedings of NAACL-HLT 2019: Demonstrations, 2019.

  7. Chen, Jiaao, Wu, Yuwei & Yang, Diyi. "Semi-Supervised Models via Data Augmentation for Classifying Interactive Affective Responses." 2020.

  8. Sun, Lichao, Xia, Congying, Yin, Wenpeng, Liang, Tingting, Yu, Philip S. & He, Lifang. "Mixup-transformer: dynamic data augmentation for NLP tasks." 2020.

  9. Chen, Jiaao, Wu, Yuwei & Yang, Diyi. "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification." 2020.

  10. Li, Haonan, Vasardani, Maria, Tomko, Martin & Baldwin, Timothy. "Target word masking for location metonymy resolution." Proceedings of the 28th International Conference on Computational Linguistics, December 2020.

  11. Gritta, Milan, Pilehvar, Mohammad Taher, Limsopatham, Nut & Collier, Nigel. "Vancouver welcomes you! Minimalist location metonymy resolution." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.