Commit 4554b2aa authored by wernicke
parents e5363de5 76d16e0c
@@ -2,7 +2,7 @@
## Introduction 👋🏼
- This repository is part of our project for the course `Formale Semantik` at Heidelberg University. The project task can be summarized as the classification of lexical semantic relations between the components of nominal compounds. If this topic caught your interest, the [Project Report]() offers a detailed insight into the project and its outcomes.
+ This repository is part of our project for the course `Formale Semantik` at Heidelberg University. The project task can be summarized as the classification of lexical semantic relations between the components of nominal compounds. If this topic caught your interest, the [Project Report](documents/Project_Report.pdf) offers a detailed insight into the project and its outcomes.
## Task 📝
A system is trained on noun compounds of the form NC = noun1 noun2 together with paraphrases describing the relation between noun1 and noun2. We then tested whether semantic relations between the two components of a noun compound, head and modifier, have been learned and can be reproduced. For this purpose, we measured to what extent the components masked in the paraphrases - the verbs - can be completed by a machine, and how well relations can be predicted for a nominal compound occurring in a sentence by a fine-tuned model.
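As a minimal sketch of the masked-completion idea (our illustration, not the project's actual setup; it assumes Hugging Face's `transformers` fill-mask pipeline with `bert-base-uncased`, and the paraphrase is an invented example):

```python
from transformers import pipeline

# illustrative probe: can the model fill in the masked relation verb?
fill = pipeline("fill-mask", model="bert-base-uncased")
paraphrase = "A street protest is a protest that [MASK] in the street."
for pred in fill(paraphrase, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```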
## Prerequisites 🗂
@@ -14,7 +14,7 @@ pip install -r requirements.txt
| subdirectory | content | README |
| ---- | ---- | ---- |
| data | contains all data needed for probing and fine-tuning | [README](data/README.md) |
- | documents | contains the first plan for our [**Project Outline**]() and the final [**Project Report**](documents/Gruppe_9__NC-RC_-_Outline.pdf) | |
+ | documents | contains the first plan for our [**Project Outline**](documents/Gruppe_9__NC-RC_-_Outline.pdf) and the final [**Project Report**](documents/Project_Report.pdf) | |
| fine_tuning | contains code to fine-tune models, the fine-tuned models, test results and evaluation | [README](fine_tuning/README.md) |
| probing | contains code for probing and its evaluation | [README](probing/README.md) |
@@ -16,6 +16,18 @@
# Searching for data
Since our project focuses on "breaking" / analyzing a neural system that tries to predict semantic relations of nominal compounds, we decided to create a set of sentences containing said compounds. To gather a base dataset for later fine-tuning and testing sessions, [Wortschatz Leipzig](https://wortschatz.uni-leipzig.de/de) was used to search through news and Wikipedia snippets released between 2016 and 2020, adding up to roughly 8M unique sentences. The search centers on a set of compounds for both fine- and coarse-grained relations, taken from [Tratz and Hovy (2010)](https://github.com/vered1986/panic/tree/master/classification/data).
The search itself was done by iterating over the sentences with a regex pattern of the form `r"\b({})".format(noun_compound)`. To reduce the number of passes over the corpus, several compounds were joined into one alternation pattern per batch:
```python
# let step be 10 - or an integer of choice
for i in range(0, len(compounds), step):
    # list slicing clamps at the end, so the last (shorter) batch needs no special case
    p = r"\b({})".format("|".join(compounds[i:i + step]))
```
This can also be easily accelerated by searching the batches in parallel.
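A minimal sketch of one way to do this, assuming a hypothetical `search_batch` helper and Python's `concurrent.futures`; worker processes are used rather than threads, since pure-Python regex matching is CPU-bound and threads would be serialized by the GIL:

```python
from concurrent.futures import ProcessPoolExecutor
import re

def search_batch(batch, sentences):
    # one alternation pattern per batch of compounds (hypothetical helper)
    p = re.compile(r"\b({})".format("|".join(batch)))
    return [(m, line) for line in sentences for m in p.findall(line)]

def parallel_search(compounds, sentences, step=10, workers=4):
    batches = [compounds[i:i + step] for i in range(0, len(compounds), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_batch, batches, [sentences] * len(batches))
    return [hit for hits in results for hit in hits]
```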
## Compound variety for fine relations <!-- omit in TOC -->
![Compound variety for fine relations](media/compounds_fine.png)
%% Cell type:code id: tags:
``` python
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:
``` python
def read_corpus(data):
    # Leipzig corpus files are tab-separated: <sentence id>\t<sentence>
    with open(data, "r", encoding="utf-8") as f:
        return [line.split("\t")[1].strip("\n") for line in f]

def search_bigrams(data, compounds, keys, step=10, join=False, join_frame=None):
    sr = []  # relation names
    la = []  # integer labels
    nc = []  # matched compounds
    ex = []  # example sentences
    for i in range(0, len(compounds), step):
        # batch several compounds into one alternation pattern
        if i + step < len(compounds):
            p = r"\b({})".format("|".join(compounds[i:i+step]))
        else:
            p = r"\b({})".format("|".join(compounds[i:]))
        for line in data:
            result = re.findall(p, line)
            for r in result:
                sr.append(keys["Relation"][r])
                la.append(keys["Label"][keys["Relation"][r]])
                nc.append(r)
                ex.append(line)
    df = pd.DataFrame({
        "Relation": sr,
        "Label": la,
        "NC": nc,
        "Sentence": ex
    })
    df["Sentence"] = df["Sentence"].str.lower()
    if join:
        # merge with previously found sentences and drop duplicates
        tdf = pd.concat([df, join_frame])
        return tdf.drop_duplicates().sort_values(by=["Relation", "NC"]).reset_index().drop(columns="index")
    else:
        return df.sort_values(by=["Relation", "NC"]).reset_index().drop(columns="index")

def limit_occurences(frame, n=400):
    # cap each compound at n sentences to dampen frequency skew
    df = frame.copy()
    temp = pd.DataFrame({
        "Count": frame["NC"].value_counts()
    })
    temp.reset_index(inplace=True)
    temp.rename(columns={"index": "NC"}, inplace=True)
    temp = temp[temp["Count"] > n]
    for nc in temp["NC"]:
        df.drop(df[df["NC"] == nc].iloc[n:].index, inplace=True)
    return df
```
%% Cell type:code id: tags:
``` python
# Grab compounds
compounds_fine = pd.read_csv("compounds_fine.csv")
compounds_coarse = pd.read_csv("compounds_coarse.csv")
# Grab existing set of data
sentences_fine = pd.read_csv("sentences_fine.csv")
sentences_coarse = pd.read_csv("sentences_coarse.csv")
sentences_fine_50 = pd.read_csv("limited/fine_50.csv")
sentences_coarse_50 = pd.read_csv("limited/coarse_50.csv")
# Grab dictionaries for fine-grained relations
dict_fine = compounds_fine[["NC", "Relation"]].set_index("NC").to_dict()
dict_fine["Label"] = compounds_fine[["Relation", "Label"]].set_index("Relation").to_dict()["Label"]
# Grab dictionaries for coarse-grained relations
dict_coarse = compounds_coarse[["NC", "Relation"]].set_index("NC").to_dict()
dict_coarse["Label"] = compounds_coarse[["Relation", "Label"]].set_index("Relation").to_dict()["Label"]
```
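%% Cell type:markdown id: tags:
The resulting lookups have the nested shape `search_bigrams` expects: `dict_fine["Relation"]` maps a compound to its relation name, and `dict_fine["Label"]` maps a relation name to its integer label. For example (exact values depend on the CSVs):
%% Cell type:code id: tags:
``` python
# e.g. 'OBJECTIVE' for "climate change", then the integer label of that relation
relation = dict_fine["Relation"]["climate change"]
dict_fine["Label"][relation]
```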
%% Cell type:code id: tags:
``` python
# Check for 3-grams
temp = compounds_fine["NC"].str.split(expand=True)
temp[temp[2].notna()].drop_duplicates()
```
%% Output
0 1 2
366 fast food chain
467 real estate business
948 couch potato jock
1422 coffee shop chatter
1581 nursing home equipment
2677 farm machine maker
2755 health care overhaul
2762 heavy truck maker
2862 life insurance policyholder
2982 palm tree rustling
3097 real estate gain
3172 small business supplier
3186 space station funding
3768 floating rate note
3914 real estate bargain
4160 stock fund asset
4169 working girl mink
4199 credit card portfolio
4660 home improvement loan
4724 law enforcement grant
5041 stock appreciation right
5235 foreign exchange market
5236 foster child system
5261 mountain bike shoe
5281 real estate agent
5282 real estate market
5383 brokerage house stock
5393 capital gains rate
5510 home building stock
5627 real estate share
6259 reagan era neglect
6383 bond trading conversation
6547 health care issue
6892 health care expert
%% Cell type:markdown id: tags:
# Search for both fine and coarse relations
- Read in the data twice, once with the fine-grained and once with the coarse-grained relation compounds
- Iterate over the remaining files, searching for unseen compounds
- (duplicates will be dropped)
%% Cell type:code id: tags:
``` python
data = read_corpus("corpus_data_news20/eng_news_2020_100K-sentences.txt")
bigrams_fine = search_bigrams(data, compounds_fine["NC"], dict_fine)
bigrams_coarse = search_bigrams(data, compounds_coarse["NC"], dict_coarse)
```
%% Cell type:code id: tags:
``` python
# Grab data files
with os.scandir("corpus_data") as d:
    data_files = []
    for data in d:
        if data.name.endswith(".txt") and data.is_file():
            data_files.append(f"corpus_data/{data.name}")
```
%% Cell type:code id: tags:
``` python
i = 1
for d in data_files:
    data = read_corpus(d)
    bigrams_fine = search_bigrams(data, compounds_fine["NC"], dict_fine, join=True, join_frame=bigrams_fine)
    print(f"finished {i}/{len(data_files)}: {d}")
    i += 1
```
%% Cell type:code id: tags:
``` python
i = 1
for d in data_files:
    data = read_corpus(d)
    bigrams_coarse = search_bigrams(data, compounds_coarse["NC"], dict_coarse, join=True, join_frame=bigrams_coarse)
    print(f"finished {i}/{len(data_files)}: {d}")
    i += 1
```
%% Cell type:code id: tags:
``` python
bigrams_fine.to_csv("sentences_fine.csv", index=False)
bigrams_coarse.to_csv("sentences_coarse.csv", index=False)
```
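%% Cell type:markdown id: tags:
The limited datasets used below (`limited/fine_50.csv`, `limited/coarse_50.csv`) were presumably produced with `limit_occurences`; an illustrative call, assuming the "50" in the file names means a cap of 50 sentences per compound:
%% Cell type:code id: tags:
``` python
# cap each compound at 50 sentences (assumption based on the file names)
limit_occurences(bigrams_fine, n=50).to_csv("limited/fine_50.csv", index=False)
limit_occurences(bigrams_coarse, n=50).to_csv("limited/coarse_50.csv", index=False)
```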
%% Cell type:markdown id: tags:
# One-vs-rest Classification - Datasets
... based on the fine / coarse grained sentences, limited to 50 occurrences per compound within a relation
- read in the dataset for fine / coarse relations
- for each relation:
  - apply the onevs function to the dataframe ...
  - ... to binarize the label column into 1 if relation, 0 else
- split the datasets into train / val / test (a sketch follows below)
- fine-tune a BERT model on the train set and validate it
- run a simple test of the fine-tuned model on the test set
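A minimal sketch of the split step (our illustration, assuming scikit-learn's `train_test_split`, an 80/10/10 ratio, and stratification on the binarized label):
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split

# illustrative 80/10/10 split, stratified on the binarized label (assumption)
train, rest = train_test_split(dff, test_size=0.2, stratify=dff["Label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["Label"], random_state=42)
```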
%% Cell type:code id: tags:
``` python
fine_50 = pd.read_csv("limited/fine_50.csv")
coarse_50 = pd.read_csv("limited/coarse_50.csv")
datasets = [fine_50, coarse_50]
names = ["fine", "coarse"]

def onevs(label, n):
    # binarize: 1 if the label matches relation n, 0 otherwise
    return 1 if label == n else 0

for df, name in zip(datasets, names):
    rdict = df.set_index("Label").to_dict()["Relation"]
    for n in range(len(df["Relation"].unique())):
        dff = df.copy()
        dff["Label"] = dff["Label"].apply(lambda x: onevs(x, n+1))
        dff.to_csv(f"onevs_{name}/onevs_{name}_50_{rdict[n+1]}.csv", index=False)
```
%% Cell type:markdown id: tags:
# Visualizations
## fine-grained
%% Cell type:code id: tags:
``` python
sentences_fine.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226369 entries, 0 to 226368
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Relation 226369 non-null object
1 Label 226369 non-null int64
2 NC 226369 non-null object
3 Sentence 226369 non-null object
dtypes: int64(1), object(3)
memory usage: 6.9+ MB
%% Cell type:code id: tags:
``` python
# Used to shorten some names of relations
rDict = {
    "WHOLE+ATTRIBUTE&FEATURE&QUALITY_VALUE_IS_CHARACTERISTIC_OF": "IS_CHARACTERISTIC_OF",
    "PART&MEMBER_OF_COLLECTION&CONFIG&SERIES": "MEMBER_OF_COLLECTION"
}
```
%% Cell type:code id: tags:
``` python
# Exemplary overview of a relation
sentences_fine[sentences_fine["Relation"] == "VARIETY&GENUS_OF"]["NC"].value_counts()
```
%% Output
plant species 84
cigarette brand 8
grape variety 6
fauna species 3
Name: NC, dtype: int64
%% Cell type:code id: tags:
``` python
# Exemplary overview of another relation
sentences_fine[sentences_fine["Relation"] == "OBJECTIVE"]["NC"].value_counts()
```
%% Output
climate change 3977
state government 2276
business owner 1208
service provider 813
property owner 521
...
right advocate 1
newspaper circulation 1
house demolition 1
utilization review 1
market tremor 1
Name: NC, Length: 925, dtype: int64
%% Cell type:code id: tags:
``` python
# Found sentences for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (11, 6)
bar1 = plt.bar(
    x=sentences_fine["Relation"].replace(rDict).value_counts().index,
    height=sentences_fine["Relation"].value_counts(),
    label="unlimited"
)
bar2 = plt.bar(
    x=sentences_fine_50["Relation"].replace(rDict).value_counts().index,
    height=sentences_fine_50["Relation"].value_counts(),
    label="limited to 50"
)
plt.title("found sentences for each Relation (fine)")
plt.xticks(rotation=90)
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x211bd2f41f0>
%% Cell type:code id: tags:
``` python
# Compound variety for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (11,6)
bar = plt.bar(x=compounds_fine["Relation"].replace(rDict).value_counts().index, height=compounds_fine["Relation"].value_counts())
plt.title("compound variety (fine)")
plt.bar_label(bar, padding=2.7)
plt.xticks(rotation=90)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
## coarse-grained
%% Cell type:code id: tags:
``` python
sentences_coarse.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160238 entries, 0 to 160237
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Relation 160238 non-null object
1 Label 160238 non-null int64
2 NC 160238 non-null object
3 Sentence 160238 non-null object
dtypes: int64(1), object(3)
memory usage: 4.9+ MB
%% Cell type:code id: tags:
``` python
# Found sentences for each Relation
plt.rcParams["figure.figsize"] = (8,6)
bar1 = plt.bar(x=sentences_coarse["Relation"].value_counts().index, height=sentences_coarse["Relation"].value_counts(), label="unlimited")
bar2 = plt.bar(x=sentences_coarse_50["Relation"].value_counts().index, height=sentences_coarse_50["Relation"].value_counts(), label="limited to 50")
plt.title("found sentences for each Relation (coarse)")
plt.xticks(rotation=90)
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x211befcb100>
%% Cell type:code id: tags:
``` python
# Compound variety for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (8,6)
bar = plt.bar(x=compounds_coarse["Relation"].replace(rDict).value_counts().index, height=compounds_coarse["Relation"].value_counts())
plt.title("compound variety (coarse)")
plt.bar_label(bar, padding=2.7)
plt.xticks(rotation=90)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Word Frequencies
%% Cell type:code id: tags:
``` python
# Word count for fine relations
wcount = 0
for sent in sentences_fine["Sentence"].unique():
    wcount += len(sent.split())
# Build DataFrame with frequency information
freq_fine = pd.DataFrame({
    "freq": sentences_fine["NC"].value_counts(),
    "rank": range(1, len(sentences_fine["NC"].value_counts()) + 1),
    "prob": sentences_fine["NC"].value_counts() / wcount,
})
freq_fine = freq_fine.reset_index().rename(columns={"index": "NC"})

# Word count for coarse relations
wcount = 0
for sent in sentences_coarse["Sentence"].unique():
    wcount += len(sent.split())
# Build DataFrame with frequency information
freq_coarse = pd.DataFrame({
    "freq": sentences_coarse["NC"].value_counts(),
    "rank": range(1, len(sentences_coarse["NC"].value_counts()) + 1),
    "prob": sentences_coarse["NC"].value_counts() / wcount
})
freq_coarse = freq_coarse.reset_index().rename(columns={"index": "NC"})
```
%% Cell type:code id: tags:
``` python
freq_coarse.head()
```
%% Output
NC freq rank prob
0 police officer 4580 1 0.001243
1 research report 3114 2 0.000845
2 research analyst 1910 3 0.000518
3 health official 1382 4 0.000375
4 company report 1126 5 0.000306
%% Cell type:code id: tags:
``` python
plt.rcParams["figure.figsize"] = (5,4)
plt.plot(freq_fine["rank"], freq_fine["freq"])
plt.title("Word Frequencies for fine-grained Relations")
plt.xlabel("Compound Rank")
plt.ylabel("Compound Frequency")
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
plt.rcParams["figure.figsize"] = (5,4)
plt.plot(freq_coarse["rank"], freq_coarse["freq"])
plt.title("Word Frequencies for coarse-grained Relations")
plt.xlabel("Compound Rank")
plt.ylabel("Compound Frequency")
plt.show()
```
%% Output