Consequently, it is a vital technique for evaluating the robustness of models.
***
## 💡 Methods <a name="methods"></a>
When selecting methods for our task, the main goal was to find a trade-off between label-preserving methods and diversifying our dataset. Since the language models BERT [^6] and RoBERTa [^7] have not been found to profit from very basic augmentation strategies (e.g. case changing of single characters or embedding replacements [citation to be added]), we chose more innovative and challenging methods.
To compare the influence of augmentations in different spaces, we selected one method for the data space and two methods for the feature space.
As a comparatively safe (= label-preserving) data augmentation strategy, we selected backtranslation.
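As a rough illustration of the idea only, the sketch below performs a round-trip translation with Hugging Face MarianMT models. The model names, the pivot language (German), and the helper functions are example choices for this sketch and not the Fairseq-based setup [^1] referenced in this project.

```python
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    # Example helper: translate a batch of sentences with a pretrained MarianMT model.
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def backtranslate(texts):
    # English -> German -> English; the original label of each instance is kept unchanged.
    # The pivot language and model checkpoints are illustrative choices.
    german = translate(texts, "Helsinki-NLP/opus-mt-en-de")
    return translate(german, "Helsinki-NLP/opus-mt-de-en")

if __name__ == "__main__":
    print(backtranslate(["The movie was surprisingly good."]))
```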
### 🍸 2. MixUp (Feature Space)<a name="mixup"></a>
Our method adopts the framework of the Mixup-Transformer proposed by Sun et al. [^4]. This approach interpolates the representations of two instances at the last hidden state of the transformer model (in our case, BERT-base-uncased [^6]).
To derive the interpolated hidden representation and the corresponding label, we apply the following formulas to the representations of two data samples:
🔡 **Instance interpolation:**
$$\hat{x} = \lambda T(x_i) + (1- \lambda)T(x_j)$$
🏷️ **Label interpolation:**
$$\hat{y} = \lambda T(y_i) + (1- \lambda)T(y_j)$$
Here, $T(x_i)$ and $T(x_j)$ represent the hidden representations of the two instances, $T(y_i)$ and $T(y_j)$ represent their corresponding labels, and $\lambda$ is a mixing coefficient that determines the degree of interpolation.
We used a fixed $\lambda$ that was set once for the entire training process. The derived instance $\hat{x}$, with the interpolated label $\hat{y}$ as its new true label, is then passed to the classifier to generate a prediction.
The MixUp process can be used dynamically during training at any epoch.
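A minimal sketch of this interpolation step is shown below, assuming PyTorch tensors for the last hidden state (e.g. the [CLS] vectors) and one-hot label vectors. The function name and the way instance pairs are formed (a random permutation of the batch) are illustrative assumptions, not the project's actual implementation.

```python
import torch

def mixup_hidden(hidden, labels, lam=0.4):
    """Interpolate hidden representations and one-hot labels within a batch.

    hidden: tensor of shape (batch, hidden_dim), e.g. BERT's last-hidden-state [CLS] vectors
    labels: tensor of shape (batch, num_classes), one-hot encoded
    lam:    fixed mixing coefficient, kept constant over training
    """
    # Pair each instance i with a randomly chosen instance j from the same batch.
    perm = torch.randperm(hidden.size(0))
    x_hat = lam * hidden + (1.0 - lam) * hidden[perm]   # \hat{x}
    y_hat = lam * labels + (1.0 - lam) * labels[perm]   # \hat{y}
    return x_hat, y_hat

# Usage idea: feed x_hat to the classification head and train against y_hat,
# e.g. with a cross-entropy loss that accepts soft targets.
```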
***
## 🗃️ Data <a name="data"></a>
For `<COMMAND>` you must enter one of the commands you find in the list below.
[^1]: Fairseq Tool.
[^2]: Backtranslation paper.
[^3]: Zhang, H., Cissé, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[^4]: Sun, L., Xia, C., Yin, W., Liang, T., Yu, P. S., & He, L. (2020). Mixup-transformer: dynamic data augmentation for NLP tasks. arXiv preprint arXiv:2010.02394.
[^5]: Li et al.
[^6]: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
[^7]: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.