Commit 4554b2aa authored by wernicke
parents e5363de5 76d16e0c
@@ -2,7 +2,7 @@
## Introduction 👋🏼
- This repository is part of our project for the course `Formale Semantik` at Heidelberg University. The project task can be summarized as the classification of lexical semantic relations between the components of nominal compounds. If this topic caught your interest, the [Project Report]() offers a detailed insight into the project and its outcomes.
+ This repository is part of our project for the course `Formale Semantik` at Heidelberg University. The project task can be summarized as the classification of lexical semantic relations between the components of nominal compounds. If this topic caught your interest, the [Project Report](documents/Project_Report.pdf) offers a detailed insight into the project and its outcomes.
## Task 📝
A system is trained on noun compounds of the form NC = noun1 noun2 together with paraphrases describing the relation between noun1 and noun2. We then tested whether semantic relations between the two components of a noun compound, head and modifier, have been learned and can be reproduced. For this purpose, we measured to what extent the components masked in the paraphrases - the verbs - can be completed by a machine, and how well relations can be predicted for a nominal compound occurring in a sentence by a fine-tuned model.
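As a minimal sketch of the masked-completion idea (our illustration, not the project's actual setup; it assumes Hugging Face's `transformers` fill-mask pipeline with `bert-base-uncased`, and the paraphrase is an invented example):

```python
from transformers import pipeline

# illustrative probe: can the model fill in the masked relation verb?
fill = pipeline("fill-mask", model="bert-base-uncased")
paraphrase = "A street protest is a protest that [MASK] in the street."
for pred in fill(paraphrase, top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```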
## Prerequisites 🗂
@@ -14,7 +14,7 @@ pip install -r requirements.txt
| subdirectory | content | README |
| ---- | ---- | ---- |
| data | contains all data needed for probing and fine-tuning | [README](data/README.md) |
- | documents | contains the first plan for our [**Project Outline**]() and the final [**Project Report**](documents/Gruppe_9__NC-RC_-_Outline.pdf) | |
+ | documents | contains the first plan for our [**Project Outline**](documents/Gruppe_9__NC-RC_-_Outline.pdf) and the final [**Project Report**](documents/Project_Report.pdf) | |
| fine_tuning | contains code to fine-tune models, the fine-tuned models, test results and evaluation | [README](fine_tuning/README.md) |
| probing | contains code for probing and its evaluation | [README](probing/README.md) |
@@ -16,6 +16,18 @@
# Searching for data
Since our project focuses on "breaking" / analyzing a neural system that tries to predict semantic relations of nominal compounds, we decided to create a set of sentences containing said compounds. To gather a base dataset for later fine-tuning and testing sessions, [Wortschatz Leipzig](https://wortschatz.uni-leipzig.de/de) was used to search through news and Wikipedia snippets released between 2016 and 2020, adding up to roughly 8M unique sentences. The search centers on a set of compounds for both fine- and coarse-grained relations, taken from [Tratz and Hovy (2010)](https://github.com/vered1986/panic/tree/master/classification/data).
The search itself was done by iterating over the sentences with a regex pattern of the form `r"\b({})".format(noun_compound)`. To reduce the number of passes over the corpus, several compounds were joined into one alternation pattern per batch:
```python
# let step be 10 - or an integer of choice
for i in range(0, len(compounds), step):
    # list slicing clamps at the end, so the last (shorter) batch needs no special case
    p = r"\b({})".format("|".join(compounds[i:i + step]))
```
This can also be easily accelerated by searching the batches in parallel.
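A minimal sketch of one way to do this, assuming a hypothetical `search_batch` helper and Python's `concurrent.futures`; worker processes are used rather than threads, since pure-Python regex matching is CPU-bound and threads would be serialized by the GIL:

```python
from concurrent.futures import ProcessPoolExecutor
import re

def search_batch(batch, sentences):
    # one alternation pattern per batch of compounds (hypothetical helper)
    p = re.compile(r"\b({})".format("|".join(batch)))
    return [(m, line) for line in sentences for m in p.findall(line)]

def parallel_search(compounds, sentences, step=10, workers=4):
    batches = [compounds[i:i + step] for i in range(0, len(compounds), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_batch, batches, [sentences] * len(batches))
    return [hit for hits in results for hit in hits]
```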
## Compound variety for fine relations <!-- omit in TOC -->
![Compound variety for fine relations](media/compounds_fine.png)
%% Cell type:code id: tags:
``` python
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:
``` python
def read_corpus(data):
    # Leipzig corpus files are tab-separated: <sentence id>\t<sentence>
    with open(data, "r", encoding="utf-8") as f:
        return [line.split("\t")[1].strip("\n") for line in f]

def search_bigrams(data, compounds, keys, step=10, join=False, join_frame=None):
    sr = []  # relation names
    la = []  # integer labels
    nc = []  # matched compounds
    ex = []  # example sentences
    for i in range(0, len(compounds), step):
        # batch several compounds into one alternation pattern
        if i + step < len(compounds):
            p = r"\b({})".format("|".join(compounds[i:i+step]))
        else:
            p = r"\b({})".format("|".join(compounds[i:]))
        for line in data:
            result = re.findall(p, line)
            for r in result:
                sr.append(keys["Relation"][r])
                la.append(keys["Label"][keys["Relation"][r]])
                nc.append(r)
                ex.append(line)
    df = pd.DataFrame({
        "Relation": sr,
        "Label": la,
        "NC": nc,
        "Sentence": ex
    })
    df["Sentence"] = df["Sentence"].str.lower()
    if join:
        # merge with previously found sentences and drop duplicates
        tdf = pd.concat([df, join_frame])
        return tdf.drop_duplicates().sort_values(by=["Relation", "NC"]).reset_index().drop(columns="index")
    else:
        return df.sort_values(by=["Relation", "NC"]).reset_index().drop(columns="index")

def limit_occurences(frame, n=400):
    # cap each compound at n sentences to dampen frequency skew
    df = frame.copy()
    temp = pd.DataFrame({
        "Count": frame["NC"].value_counts()
    })
    temp.reset_index(inplace=True)
    temp.rename(columns={"index": "NC"}, inplace=True)
    temp = temp[temp["Count"] > n]
    for nc in temp["NC"]:
        df.drop(df[df["NC"] == nc].iloc[n:].index, inplace=True)
    return df
```
%% Cell type:code id: tags:
``` python
# Grab compounds
compounds_fine = pd.read_csv("compounds_fine.csv")
compounds_coarse = pd.read_csv("compounds_coarse.csv")
# Grab existing set of data
sentences_fine = pd.read_csv("sentences_fine.csv")
sentences_coarse = pd.read_csv("sentences_coarse.csv")
sentences_fine_50 = pd.read_csv("limited/fine_50.csv")
sentences_coarse_50 = pd.read_csv("limited/coarse_50.csv")
# Grab dictionaries for fine-grained relations
dict_fine = compounds_fine[["NC", "Relation"]].set_index("NC").to_dict()
dict_fine["Label"] = compounds_fine[["Relation", "Label"]].set_index("Relation").to_dict()["Label"]
# Grab dictionaries for coarse-grained relations
dict_coarse = compounds_coarse[["NC", "Relation"]].set_index("NC").to_dict()
dict_coarse["Label"] = compounds_coarse[["Relation", "Label"]].set_index("Relation").to_dict()["Label"]
```
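%% Cell type:markdown id: tags:
The resulting lookups have the nested shape `search_bigrams` expects: `dict_fine["Relation"]` maps a compound to its relation name, and `dict_fine["Label"]` maps a relation name to its integer label. For example (exact values depend on the CSVs):
%% Cell type:code id: tags:
``` python
# e.g. 'OBJECTIVE' for "climate change", then the integer label of that relation
relation = dict_fine["Relation"]["climate change"]
dict_fine["Label"][relation]
```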
%% Cell type:code id: tags:
``` python
# Check for 3-grams
temp = compounds_fine["NC"].str.split(expand=True)
temp[temp[2].notna()].drop_duplicates()
```
%% Output
0 1 2
366 fast food chain
467 real estate business
948 couch potato jock
1422 coffee shop chatter
1581 nursing home equipment
2677 farm machine maker
2755 health care overhaul
2762 heavy truck maker
2862 life insurance policyholder
2982 palm tree rustling
3097 real estate gain
3172 small business supplier
3186 space station funding
3768 floating rate note
3914 real estate bargain
4160 stock fund asset
4169 working girl mink
4199 credit card portfolio
4660 home improvement loan
4724 law enforcement grant
5041 stock appreciation right
5235 foreign exchange market
5236 foster child system
5261 mountain bike shoe
5281 real estate agent
5282 real estate market
5383 brokerage house stock
5393 capital gains rate
5510 home building stock
5627 real estate share
6259 reagan era neglect
6383 bond trading conversation
6547 health care issue
6892 health care expert
%% Cell type:markdown id: tags:
# Search for both fine and coarse relations
- Read in the data twice, once with the fine-grained and once with the coarse-grained relation compounds
- Iterate over the remaining files, searching for unseen compounds
- (duplicates will be dropped)
%% Cell type:code id: tags:
``` python
data = read_corpus("corpus_data_news20/eng_news_2020_100K-sentences.txt")
bigrams_fine = search_bigrams(data, compounds_fine["NC"], dict_fine)
bigrams_coarse = search_bigrams(data, compounds_coarse["NC"], dict_coarse)
```
%% Cell type:code id: tags:
``` python
# Grab data files
with os.scandir("corpus_data") as d:
    data_files = []
    for data in d:
        if data.name.endswith(".txt") and data.is_file():
            data_files.append(f"corpus_data/{data.name}")
```
%% Cell type:code id: tags:
``` python
i = 1
for d in data_files:
    data = read_corpus(d)
    bigrams_fine = search_bigrams(data, compounds_fine["NC"], dict_fine, join=True, join_frame=bigrams_fine)
    print(f"finished {i}/{len(data_files)}: {d}")
    i += 1
```
%% Cell type:code id: tags:
``` python
i = 1
for d in data_files:
    data = read_corpus(d)
    bigrams_coarse = search_bigrams(data, compounds_coarse["NC"], dict_coarse, join=True, join_frame=bigrams_coarse)
    print(f"finished {i}/{len(data_files)}: {d}")
    i += 1
```
%% Cell type:code id: tags:
``` python
bigrams_fine.to_csv("sentences_fine.csv", index=False)
bigrams_coarse.to_csv("sentences_coarse.csv", index=False)
```
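%% Cell type:markdown id: tags:
The limited datasets used below (`limited/fine_50.csv`, `limited/coarse_50.csv`) were presumably produced with `limit_occurences`; an illustrative call, assuming the "50" in the file names means a cap of 50 sentences per compound:
%% Cell type:code id: tags:
``` python
# cap each compound at 50 sentences (assumption based on the file names)
limit_occurences(bigrams_fine, n=50).to_csv("limited/fine_50.csv", index=False)
limit_occurences(bigrams_coarse, n=50).to_csv("limited/coarse_50.csv", index=False)
```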
%% Cell type:markdown id: tags:
# One-vs-rest Classification - Datasets
... based on the fine / coarse grained sentences, limited to 50 occurrences per compound within a relation
- read in the dataset for fine / coarse relations
- for each relation:
  - apply the onevs function to the dataframe ...
  - ... to binarize the label column into 1 if relation, 0 else
- split the datasets into train / val / test (a sketch follows below)
- fine-tune a BERT model on the train set and validate it
- run a simple test of the fine-tuned model on the test set
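A minimal sketch of the split step (our illustration, assuming scikit-learn's `train_test_split`, an 80/10/10 ratio, and stratification on the binarized label):
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split

# illustrative 80/10/10 split, stratified on the binarized label (assumption)
train, rest = train_test_split(dff, test_size=0.2, stratify=dff["Label"], random_state=42)
val, test = train_test_split(rest, test_size=0.5, stratify=rest["Label"], random_state=42)
```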
%% Cell type:code id: tags:
``` python
fine_50 = pd.read_csv("limited/fine_50.csv")
coarse_50 = pd.read_csv("limited/coarse_50.csv")
datasets = [fine_50, coarse_50]
names = ["fine", "coarse"]

def onevs(label, n):
    # binarize: 1 if the label matches relation n, 0 otherwise
    return 1 if label == n else 0

for df, name in zip(datasets, names):
    rdict = df.set_index("Label").to_dict()["Relation"]
    for n in range(len(df["Relation"].unique())):
        dff = df.copy()
        dff["Label"] = dff["Label"].apply(lambda x: onevs(x, n+1))
        dff.to_csv(f"onevs_{name}/onevs_{name}_50_{rdict[n+1]}.csv", index=False)
```
%% Cell type:markdown id: tags:
# Visualizations
## fine-grained
%% Cell type:code id: tags:
``` python
sentences_fine.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226369 entries, 0 to 226368
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Relation 226369 non-null object
1 Label 226369 non-null int64
2 NC 226369 non-null object
3 Sentence 226369 non-null object
dtypes: int64(1), object(3)
memory usage: 6.9+ MB
%% Cell type:code id: tags:
``` python
# Used to shorten some names of relations
rDict = {
    "WHOLE+ATTRIBUTE&FEATURE&QUALITY_VALUE_IS_CHARACTERISTIC_OF": "IS_CHARACTERISTIC_OF",
    "PART&MEMBER_OF_COLLECTION&CONFIG&SERIES": "MEMBER_OF_COLLECTION"
}
```
%% Cell type:code id: tags:
``` python
# Exemplary overview of a relation
sentences_fine[sentences_fine["Relation"] == "VARIETY&GENUS_OF"]["NC"].value_counts()
```
%% Output
plant species 84
cigarette brand 8
grape variety 6
fauna species 3
Name: NC, dtype: int64
%% Cell type:code id: tags:
``` python
# Exemplary overview of another relation
sentences_fine[sentences_fine["Relation"] == "OBJECTIVE"]["NC"].value_counts()
```
%% Output
climate change 3977
state government 2276
business owner 1208
service provider 813
property owner 521
...
right advocate 1
newspaper circulation 1
house demolition 1
utilization review 1
market tremor 1
Name: NC, Length: 925, dtype: int64
%% Cell type:code id: tags:
``` python
# Found sentences for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (11, 6)
bar1 = plt.bar(
    x=sentences_fine["Relation"].replace(rDict).value_counts().index,
    height=sentences_fine["Relation"].value_counts(),
    label="unlimited"
)
bar2 = plt.bar(
    x=sentences_fine_50["Relation"].replace(rDict).value_counts().index,
    height=sentences_fine_50["Relation"].value_counts(),
    label="limited to 50"
)
plt.title("found sentences for each Relation (fine)")
plt.xticks(rotation=90)
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x211bd2f41f0>
%% Cell type:code id: tags:
``` python
# Compound variety for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (11,6)
bar = plt.bar(x=compounds_fine["Relation"].replace(rDict).value_counts().index, height=compounds_fine["Relation"].value_counts())
plt.title("compound variety (fine)")
plt.bar_label(bar, padding=2.7)
plt.xticks(rotation=90)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
## coarse-grained
%% Cell type:code id: tags:
``` python
sentences_coarse.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160238 entries, 0 to 160237
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Relation 160238 non-null object
1 Label 160238 non-null int64
2 NC 160238 non-null object
3 Sentence 160238 non-null object
dtypes: int64(1), object(3)
memory usage: 4.9+ MB
%% Cell type:code id: tags:
``` python
# Found sentences for each Relation
plt.rcParams["figure.figsize"] = (8,6)
bar1 = plt.bar(x=sentences_coarse["Relation"].value_counts().index, height=sentences_coarse["Relation"].value_counts(), label="unlimited")
bar2 = plt.bar(x=sentences_coarse_50["Relation"].value_counts().index, height=sentences_coarse_50["Relation"].value_counts(), label="limited to 50")
plt.title("found sentences for each Relation (coarse)")
plt.xticks(rotation=90)
plt.legend()
```
%% Output
<matplotlib.legend.Legend at 0x211befcb100>
%% Cell type:code id: tags:
``` python
# Compound variety for each relation
plt.style.use("seaborn")
plt.rcParams["figure.figsize"] = (8,6)
bar = plt.bar(x=compounds_coarse["Relation"].replace(rDict).value_counts().index, height=compounds_coarse["Relation"].value_counts())
plt.title("compound variety (coarse)")
plt.bar_label(bar, padding=2.7)
plt.xticks(rotation=90)
plt.show()
```
%% Output
%% Cell type:markdown id: tags:
# Word Frequencies
%% Cell type:code id: tags:
``` python
# Word count for fine relations
wcount = 0
for sent in sentences_fine["Sentence"].unique():
    wcount += len(sent.split())
# Build DataFrame with frequency information
freq_fine = pd.DataFrame({
    "freq": sentences_fine["NC"].value_counts(),
    "rank": range(1, len(sentences_fine["NC"].value_counts()) + 1),
    "prob": sentences_fine["NC"].value_counts() / wcount,
})
freq_fine = freq_fine.reset_index().rename(columns={"index": "NC"})

# Word count for coarse relations
wcount = 0
for sent in sentences_coarse["Sentence"].unique():
    wcount += len(sent.split())
# Build DataFrame with frequency information
freq_coarse = pd.DataFrame({
    "freq": sentences_coarse["NC"].value_counts(),
    "rank": range(1, len(sentences_coarse["NC"].value_counts()) + 1),
    "prob": sentences_coarse["NC"].value_counts() / wcount
})
freq_coarse = freq_coarse.reset_index().rename(columns={"index": "NC"})
```
%% Cell type:code id: tags:
``` python
freq_coarse.head()
```
%% Output
NC freq rank prob
0 police officer 4580 1 0.001243
1 research report 3114 2 0.000845
2 research analyst 1910 3 0.000518
3 health official 1382 4 0.000375
4 company report 1126 5 0.000306
%% Cell type:code id: tags:
``` python
plt.rcParams["figure.figsize"] = (5,4)
plt.plot(freq_fine["rank"], freq_fine["freq"])
plt.title("Word Frequencies for fine-grained Relations")
plt.xlabel("Compound Rank")
plt.ylabel("Compound Frequency")
plt.show()
```
%% Output
%% Cell type:code id: tags:
``` python
plt.rcParams["figure.figsize"] = (5,4)
plt.plot(freq_coarse["rank"], freq_coarse["freq"])
plt.title("Word Frequencies for coarse-grained Relations")
plt.xlabel("Compound Rank")
plt.ylabel("Compound Frequency")
plt.show()
```
%% Output