To transform representations of our data in the feature space, we use MixUp, proposed by Zhang et al.[^3], which is based on linear interpolation of input data and their corresponding labels. We take our inspiration from Sun et al.[^4], who apply this method to the last hidden layer of transformer models. Their dynamic MixUp approach can be restricted to later training epochs, allowing the model to learn good representations first.
Our method adopts the framework of the mixup transformer proposed by Sun et al.[^4]. This approach interpolates the representations of two instances on the last hidden state of the transformer model (in our case, BERT-base-uncased \cite{BERT}).
To derive the interpolated hidden representation and corresponding label, we use the following formulas on the representation of two data samples:
🔡 *Instance interpolation:*
$$\hat{x} = \lambda T(x_i) + (1- \lambda)T(x_j)$$
🏷️ *Label interpolation:*
$$\hat{y} = \lambda y_i + (1- \lambda)y_j$$
Here, $T(x_i)$ and $T(x_j)$ denote the hidden representations of the two instances, $y_i$ and $y_j$ their corresponding labels, and $\lambda \in [0, 1]$ is a mixing coefficient that determines the degree of interpolation. Note that the labels are interpolated directly; only the instances pass through the transformer $T$.
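The two interpolation formulas can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `mixup_hidden` is hypothetical, and the hidden vectors and labels are assumed to be NumPy arrays (e.g., a pooled BERT hidden state and one-hot label vectors).

```python
import numpy as np

def mixup_hidden(h_i, h_j, y_i, y_j, lam=0.5):
    """Feature-space MixUp: convex combination of two hidden
    representations and of their labels, weighted by lam."""
    x_hat = lam * h_i + (1.0 - lam) * h_j  # instance interpolation
    y_hat = lam * y_i + (1.0 - lam) * y_j  # label interpolation
    return x_hat, y_hat
```

For example, with $\lambda = 0.7$ the mixed instance lies 70% of the way toward the first sample, and the mixed label carries the same proportions.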
We use a fixed $\lambda$ for the entire training process. The interpolated instance $\hat{x}$, with the interpolated label $\hat{y}$ as its new true label, is then fed into the classifier to generate a prediction.
The MixUp process can be enabled dynamically at any training epoch, e.g., only after an initial warm-up phase.
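The dynamic, epoch-gated use of MixUp described above can be sketched as follows. This is a hedged illustration under assumed names: `mix_batch`, `MIXUP_START_EPOCH`, and `LAM` are hypothetical, and pairing is done by mixing each instance in a batch with a randomly permuted partner, a common batch-level MixUp strategy.

```python
import numpy as np

MIXUP_START_EPOCH = 3  # hypothetical: activate MixUp only in later epochs
LAM = 0.5              # fixed mixing coefficient for the whole run

def mix_batch(epoch, hidden, labels, lam=LAM):
    """Apply feature-space MixUp to a batch once the start epoch
    is reached; earlier epochs train on unmixed representations."""
    if epoch < MIXUP_START_EPOCH:
        return hidden, labels  # learn good representations first
    perm = np.random.permutation(len(hidden))  # random partner per instance
    x_hat = lam * hidden + (1.0 - lam) * hidden[perm]
    y_hat = lam * labels + (1.0 - lam) * labels[perm]
    return x_hat, y_hat
```

Gating on the epoch index keeps early training on clean representations, matching the motivation from Sun et al.[^4] stated above.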