From d6717bcd023a0915dc7b77041be2895aa8ff306c Mon Sep 17 00:00:00 2001
From: liang <liang@cl.uni-heidelberg.de>
Date: Tue, 12 May 2020 23:09:34 +0200
Subject: [PATCH] Updated Discussion

---
 lrp.ipynb | 84 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 61 insertions(+), 23 deletions(-)

diff --git a/lrp.ipynb b/lrp.ipynb
index d8d6cde..8d3cf41 100644
--- a/lrp.ipynb
+++ b/lrp.ipynb
@@ -45,7 +45,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "As it is showed in the figure, both branches in the network have the same architecture:  each branch is consist of two linear transformation layers with ReLU activation between them. The final layer outputs will be normalized by their L2 norm. The embedding loss states that the Embedding Network learns to reduce the distance between the transformed image and text features from positive image-text pairs and increase the distance between the negative pairs. \n",
+    "As it is showed in the figure, both branches in the network have the same architecture:  each branch consists of two linear transformation layers with ReLU activation between them. The final layer outputs will be normalized by their L2 norm. The embedding loss states that the Embedding Network learns to reduce the distance between the transformed image and text features from positive image-text pairs and increase the distance between the negative pairs. \n",
     "\n",
     "Because [Wang et al. (2017)](#1) focused on investigating the behavior of the proposed two-branch networks, the inputs `X` and `Y` are pre-computed features extracted from other trained models, i.e. word embedding model for text features and image classification model for image features. But let's first have a view of the data and understand the difficulties of the task before we get to the motivation of investigating the network."
    ]
@@ -170,21 +170,20 @@
     "    - [Load trained Embedding Network](#model)  \n",
     "    - [Prepare sentence features](#set)\n",
     "    - [Prepare image features](#im)\n",
+    "    \n",
     "2. [Compute embedded representations and obtain relevance](#evaluate)\n",
     "    - [Text branch](#textbranch)\n",
     "    - [Image branch](#imagebranch)\n",
-    "    -[The Eucldean distance between the normalized text and image representations](#eudldist)\n",
+    "    - [The Eucldean distance between the normalized representations](#eudldist)\n",
     "    - [Obtain the relevance to be propagated](#rel)\n",
     "    \n",
     "3. [Layer-wise relevance propagation](#LRP)\n",
-    "    - [Introduction of multiway attention networks](#mwan)\n",
-    "    - [The attention functions](#mwanatt) \n",
-    "    - [The input representations and attention aggregation](#inputmwan)\n",
-    "    - [Simple Implementation in PyTorch](#impmwan)   \n",
+    "    - [Linear layer](#linear)\n",
+    "    - [Batch normalization layer](#batchnorm)\n",
+    "    - [Propagate the relevance from layer to layer](#propagate)\n",
+    "    - [Obtain word-level relevance](#wordrel)\n",
     "    \n",
-    "3. [Explaining some examples of distribution functions(Samuel Kiegeland)](#task3)\n",
-    "    - [Soft vs hard attention as in Show, Attend and Tell ](#sat)\n",
-    "    - [Global vs local attention](#globalvslocal)\n",
+    "4. [Word-level relevance visualization and discussion](#result)\n",
     "    \n"
    ]
   },
@@ -336,7 +335,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": 1,
    "metadata": {},
    "outputs": [
     {
@@ -345,7 +344,7 @@
        "torch.Size([2, 3, 224, 224])"
       ]
      },
-     "execution_count": 7,
+     "execution_count": 1,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -380,7 +379,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
@@ -870,7 +869,7 @@
        ")"
       ]
      },
-     "execution_count": 8,
+     "execution_count": 2,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -888,7 +887,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -944,7 +943,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 4,
    "metadata": {},
    "outputs": [
     {
@@ -966,6 +965,35 @@
     "    X[i+1]=(m.forward(X[i]))"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "bottleneck = [name for name in modules[-3]._modules.keys()]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "36"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(bottleneck)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -975,7 +1003,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 5,
    "metadata": {},
    "outputs": [
     {
@@ -996,7 +1024,7 @@
    "metadata": {},
    "source": [
     "<a id=\"evaluate\"> </a>\n",
-    "## Compute embedded representations \n",
+    "## Compute embedded representations and obtain relevance\n",
     "After the image and text features for the above two examples are prepared and the trained parameters are loaded into the model, we can project the text and image features into one joint latent space through two branches of the network and compute the multiplication between two representations at the end. The similarity scores can be considered as relevance which can be propagated back to the input features. According to the redistribution rules of LRP, the elements from the input which are relevant to the similarity result should receive high relevance [(Bach et al. 2015)](#2). \n",
     "\n",
     "Firstly we can decompose each branch of the Embedding Network and extract all the layers from each branch to obtain the output of each layer."
@@ -1163,7 +1191,7 @@
    "metadata": {},
    "source": [
     "<a id=\"eucldist\"> </a>\n",
-    "#### The Eucldean distance between the normalized text and image representations \n",
+    "#### The Eucldean distance between the normalized representations \n",
     "The Euclidean distance between $\\mathbf{p}$ ( img\\_feats_normalized) and $\\mathbf{p}$ (text\\_feats_normalized) is obtained through: \n",
     "$ \\left\\| \\mathbf{q} - \\mathbf{p} \\right\\| = \\sqrt{ \\left\\| \\mathbf{p} \\right\\|^2 + \\left\\| \\mathbf{q} \\right\\| ^2 - 2 \\mathbf{p}\\cdot\\mathbf{q}} $. Since both text and image features representations are normalized by their L2 norm, $\\left\\| \\mathbf{p} \\right\\|^2$ and $\\left\\| \\mathbf{q} \\right\\|^2$ are equal to 1, as a result, $ \\left\\| \\mathbf{q} - \\mathbf{p} \\right\\| = \\sqrt{ 2 - 2 \\mathbf{p}\\cdot\\mathbf{q}} $\n",
     "\n",
@@ -1423,7 +1451,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<a id=\"propagation\"> </a>\n",
+    "<a id=\"propagate\"> </a>\n",
     "#### Propagate the relevance from layer to layer"
    ]
   },
@@ -1547,13 +1575,13 @@
    "metadata": {},
    "source": [
     "<a id=\"result\"></a>\n",
-    "## Result visualization and discussion\n",
+    "## Word-level relevance visualization and discussion\n",
     "\n",
-    "In this post, we have just propagated the relevance in the text branch, as it is mentioned at the beginning, the relevance propagation for the image branch involves further propagation in the vision model, which is not covered in this post. But we still can investigate the token relevance since the visualization of each individual token according their score values supplies evidence of the matching results of the Embedding Network. In order to compare the word level relevance to the image relevant features, we can use the image classification results as reference, because the classification results show that what the image features are representing for.\n",
+    "In this post, we have just propagated the relevance in the text branch. As it is mentioned at the beginning, the relevance propagation for the image branch involves further propagation in the vision model, which is not covered in this post. But we still can investigate the token relevance since the visualization of each individual token according their score values supplies evidence of the matching results of the Embedding Network. In order to compare the word level relevance to the image relevant features, we can use the image classification results as reference, because the classification results show that what the image features are representing for.\n",
     "\n",
     "As it is mentioned above, sentence-image retrieval is different to classification task, we obtain the relevance from the similarity scores, hence, we will investigate the relevant features of all captions given one image, not just the positive ones. In order to obtain the word-level visualization for the example captions written in HTML format please check the heatmap functions in the `utils` file and import the assisting function `html_table` and return the HTML table which is the visualization of relevant tokens.\n",
     "\n",
-    "The retrieval score is basically the very first relevance from the matching results. The higher the score, the more confident the network matches the caption to the given image. The following word-level relevances represent that, tokens highlighted in red are the positive matches to the given image, in blue then in the opposite side. The color opacity is computed through normalization with the maximum and minimum absolute relevance scores in each sentence following the instruction from [(Arras, et al. 2016)](#3). "
+    "The retrieval score is basically the very first relevance from the matching results. The higher the score, the more confident the network matches the caption to the given image. The following word-level relevances represent that, tokens highlighted in red are the positive matches to the given image, in blue then in the opposite side. The color opacity is computed through normalization with the maximum and minimum absolute relevance scores in each sentence following the instruction from [(Arras, et al., 2016)](#3). "
    ]
   },
   {
@@ -1731,6 +1759,15 @@
     "</table>"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The matching scores have shown us that the best match for the first image is the second caption, and the relevance visualization provides us with the evidence for the score with the most relevant token \"toy\". And this relevant token can be also found in the fifth caption that also received a higher score than the other matches. Of course, it is difficult to look into every example and find out specific patterns from the language side to explain the sentence-image matching results. But from the relevance result, we obtain an intuitive view about the reasons for the similarity results.\n",
+    "\n",
+    "There are several possible quantitative analysis methods to validate the relevant tokens or using the the relevance for evaluating the working of the network depending on the task and data. For example, if we have marked phrases annotated in the data set, we will be able to calculate the overlaps between the annotated key phrases and the found relevant tokens. To validate the relevant tokens, we can move out the negative tokens (in blue) and just keep the positive relevant tokens in the captions and execute the testing process. If the matching results doesn't change much that can proof the found relevant tokens."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -1883,7 +1920,8 @@
    "metadata": {},
    "source": [
     "#### Future work \n",
-    "In this post, we go through the whole procedures about how to compute the layer outputs of both branches in the Embedding Network and compute the relevance in the Embeddrelevance Apply the whole procedure again on the examples from on Flickr30K dataset."
+    "In this post, we go through the whole procedures about how to compute the layer outputs of both branches in the Embedding Network and obtain the relevance to be propagated. After that, we have learned the knowledge about how to apply the layer-wise relevance propagation approach to redistribute the relevance back to the input word embeddings for obtaining the word-level evidence of the sentence-image matching results. What can we in the future further pro is that we can try is to experiment the approach on the Flickr30K dataset and propagate the image relevance back to vision model and further to get the evidence from the original image features. In this way, we have to  the matched relevant features from both side and \n",
+    "compute the relevance in the Embeddrelevance Apply the whole procedure again on the examples from on Flickr30K dataset."
    ]
   },
   {
-- 
GitLab