The work "[MLP-Mixer: An all-MLP Architecture for Vision](https://arxiv.org/pdf/
<img src="./mlp-mixer.png" width="500px" />
The aims of this project are:
* reimplement the network in TensorFlow (instead of JAX/Flax)
* adapt the MLP-Mixer architecture for NLP tasks (➜ NLP-Mixer)
* create a comparable (convolutional) Baseline
* conduct experiments on datasets that are also used by other efficient transformer approaches
## 2. Instructions :information_source:
### Installing the required modules
```
...
...
To compare against results from other publications ([Big Bird](https://proceedin
| | imdb | sst2 | arxiv |
|---|---:|---:|---:|
| #classes | 2 | 2 | 2 |
| #examples | 25000 | 67349 | 6007 |
| avg. words | 233 | 9 | 8096 |
| max. words | 2470 | 52 | 104698 |
## 4. Experimental Setup :microscope:
For all experiments the same parameters (epochs, learning rate, layer dimensionality, etc.) are used and can be accessed in the [result files](/results/); only the embedding type (default / pretrained) is changed. The pretrained embeddings are *Bert-uncased* from TensorFlow Hub.
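A rough sketch of how such token-level embeddings can be obtained via `hub.KerasLayer`; the hub handles, the helper name, and the way the embeddings are wired into the model are illustrative assumptions, not the repository's exact code:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 -- registers the ops needed by the BERT preprocessor

# Hub handles are assumptions; any matching bert_en_uncased preprocessor/encoder pair works.
PREPROCESS_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_HANDLE = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4"

def pretrained_token_embeddings(texts):
    """Return per-token BERT embeddings with shape (batch, seq_len, 768)."""
    preprocess = hub.KerasLayer(PREPROCESS_HANDLE)
    encoder = hub.KerasLayer(ENCODER_HANDLE, trainable=False)  # frozen pretrained weights
    encoder_inputs = preprocess(tf.constant(texts))
    outputs = encoder(encoder_inputs)
    return outputs["sequence_output"]  # token-level embeddings, not the pooled output

# Example: embeddings = pretrained_token_embeddings(["a short example sentence"])
```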
### 4.1 Baseline
Instead of MLP-Mixer blocks, 1D-convolution blocks are used; otherwise the architecture stays the same.
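A minimal sketch of what such a block could look like in Keras (kernel size, dropout, and the residual connection are illustrative assumptions rather than the repository's exact settings):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, kernel_size=3, dropout=0.1):
    """Baseline block: a 1D convolution over the token axis in place of a mixer block.

    x has shape (batch, tokens, embedding_dim).
    """
    skip = x
    x = layers.LayerNormalization()(x)
    x = layers.Conv1D(filters, kernel_size, padding="same", activation="gelu")(x)
    x = layers.Dropout(dropout)(x)
    return x + skip  # residual connection; assumes filters == embedding_dim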
### 4.2 NLP-Mixer
Adaptation of the MLP-Mixer architecture:
Instead of image patches, we feed the representations (embeddings) of the text into the network. An image *channel* now corresponds to the *embedding dimension*, and *patches* correspond to *tokens*.
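A minimal sketch of one such block in Keras, operating on a tensor of shape (batch, tokens, embedding_dim); the hidden sizes and exact layer layout are illustrative assumptions, not the repository's settings:

```python
import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, hidden_dim, out_dim):
    """Two-layer MLP applied along the last axis."""
    x = layers.Dense(hidden_dim, activation="gelu")(x)
    return layers.Dense(out_dim)(x)

def mixer_block(x, tokens, embedding_dim, token_hidden=256, channel_hidden=512):
    """One NLP-Mixer block on an input of shape (batch, tokens, embedding_dim)."""
    # Token mixing: MLP across the token axis (mixes information between words).
    y = layers.LayerNormalization()(x)
    y = layers.Permute((2, 1))(y)   # (batch, embedding_dim, tokens)
    y = mlp(y, token_hidden, tokens)
    y = layers.Permute((2, 1))(y)   # back to (batch, tokens, embedding_dim)
    x = x + y                       # skip connection

    # Channel mixing: MLP across the embedding axis (mixes within each token).
    y = layers.LayerNormalization()(x)
    y = mlp(y, channel_hidden, embedding_dim)
    return x + y                    # skip connection
```

Stacking several of these blocks and following them with global average pooling over the token axis plus a Dense classification head mirrors the original vision architecture.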
## 5. Results :bar_chart:
...
...
Accuracy for the text classification datasets:
\* *all 6 labels*
## 6. Analysis & Conclusion :bulb:
For the sake of comparison, the parameters for both models were kept the same throughout the experiments. The Baseline started overfitting quite fast (in contrast to the NLP-Mixer), so only a rather small number of epochs was trained. No extensive hyperparameter tuning was performed, in order to stay within a limited timeframe. On the imdb dataset, both the Baseline and the NLP-Mixer performed worse than other transformer architectures by quite a margin, while coming close on sst2. A real comparison on the arxiv dataset would require a run with all 6 classes (classification naturally gets more difficult as the number of classes increases), but this was left out for reasons of computing time and resources.
Even though the NLP-Mixer performs better overall than the Baseline (especially with pretrained embeddings), it is surprising how well the simple Baseline performs. It is therefore not entirely clear whether the mixing can approximate self-attention between the tokens. There is also no way to nicely visualize the connections between words, as can be done with the attention mechanism. Transformer approaches still perform better in terms of accuracy, but if runtime is a concern, one of the approaches introduced in this project could be a valid alternative.