Commit 2446fb61 authored by engel

Add header to readme

parent 89e1f55e
# Data
## Table of Contents
- [Data](#data)
- [Table of Contents](#table-of-contents)
- [Searching for data](#searching-for-data)
- [Forcing BERT's Attention on the compound](#forcing-berts-attention-on-the-compound)
- [Limiting NC Occurrences](#limiting-nc-occurrences)
......
%% Cell type:code id: tags:
``` python
import random
import pandas as pd
```
%% Cell type:code id: tags:
``` python
df = pd.read_csv("sentences_fine_200.csv")
```
%% Cell type:code id: tags:
``` python
# Target number of sentences per split: 60 % train, 25 % test, 15 % validation
sent_amount = len(df)
perc_train = sent_amount*0.6
perc_test = sent_amount*0.25
perc_val = sent_amount*0.15
print(perc_train)
print(perc_val)
print(perc_test)
```
%% Output
104350.8
26087.7
43479.5
%% Cell type:code id: tags:
``` python
# Create different DataFrames for each set
train_set = pd.DataFrame(columns = ['Relation', 'Label', 'NC', 'Sentence'])
val_set = pd.DataFrame(columns = ['Relation', 'Label', 'NC', 'Sentence'])
test_set = pd.DataFrame(columns = ['Relation', 'Label', 'NC', 'Sentence'])
```
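The cells below grow these frames by repeatedly `pd.concat`-ing slices into an initially empty DataFrame, which is quadratic and triggers deprecation warnings about empty-frame concatenation in recent pandas versions. A common alternative, sketched here with hypothetical toy data in place of the real per-NC slices, is to collect the pieces in a list and concatenate once:

``` python
import pandas as pd

# Hypothetical pieces standing in for the per-NC sentence slices
pieces = [
    pd.DataFrame({"NC": ["mass destruction"], "Sentence": ["s1"]}),
    pd.DataFrame({"NC": ["wing tip"], "Sentence": ["s2"]}),
]

# One concat at the end instead of one per slice
train_set = pd.concat(pieces, axis=0, ignore_index=True)
```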
%% Cell type:code id: tags:
``` python
grouped_NC = df.groupby(['Label', 'NC']).size().reset_index(name="Count")
grouped_rel = df.groupby(['Label']).size().reset_index(name="Count_rel")
sum = 0
# Iterate over the relations to get the number of sentences for each relation
for index, row in grouped_rel.iterrows():
    count = row['Count_rel']
    amount = count*0.6
    print(f"sum {sum}")
    print("Label {}".format(row['Label']))
    print(f" not over {amount}")
    # Iterate over the NCs of this relation to collect sentences without splitting
    # an NC across sets; the number of sentences should stay below `amount`
    if sum < perc_train:
        sum2 = 0
        # Keep only the NCs of the current relation
        NC_rel = grouped_NC.where(grouped_NC['Label'] == row['Label']).dropna().reset_index(drop=True)
        for index2, row2 in NC_rel.iterrows():
            count2 = row2['Count']
            temp = sum2 + count2
            if temp < amount and sum + sum2 < perc_train:
                # Note: str.contains does substring matching, so sentences of a
                # longer compound (e.g. "wing tip") also match a shorter NC ("tip")
                df1 = df[df['NC'].str.contains(row2['NC'])]
                train_set = pd.concat([train_set, df1], axis=0)
                sum2 += count2
        sum = len(train_set)
train_set.to_csv('./train_data/train_set.csv')
```
%% Output
sum 0
Label 1
not over 1210.2
sum 1197
Label 2
not over 1104.6
sum 2301
Label 3
not over 930.5999999999999
sum 3231
Label 4
not over 4637.4
sum 7868
Label 5
not over 1587.0
sum 9453
Label 6
not over 3437.4
sum 12890
Label 7
not over 6597.599999999999
sum 19797
Label 8
not over 297.59999999999997
sum 20094
Label 9
not over 103.8
sum 20192
Label 10
not over 1199.3999999999999
sum 21390
Label 11
not over 5962.8
sum 27352
Label 12
not over 1728.0
sum 29079
Label 13
not over 6036.599999999999
sum 35115
Label 14
not over 731.4
sum 35846
Label 15
not over 15852.0
sum 51706
Label 16
not over 1271.3999999999999
sum 52977
Label 17
not over 1659.6
sum 54732
Label 18
not over 4556.4
sum 59292
Label 19
not over 2518.7999999999997
sum 61824
Label 20
not over 1667.3999999999999
sum 63489
Label 21
not over 385.2
sum 63871
Label 22
not over 13704.0
sum 77574
Label 23
not over 2521.2
sum 80098
Label 24
not over 6627.599999999999
sum 86737
Label 25
not over 3777.0
sum 90523
Label 26
not over 1959.6
sum 92483
Label 27
not over 2916.6
sum 95400
Label 28
not over 819.6
sum 96217
Label 29
not over 5308.8
sum 100331
Label 30
not over 156.6
sum 100392
Label 31
not over 319.8
sum 100708
Label 32
not over 789.0
sum 101496
Label 33
not over 60.599999999999994
sum 101513
Label 34
not over 474.0
sum 101975
Label 35
not over 1441.2
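The key idea of the loop above is that whole compounds, never individual sentences, are assigned to a split. That greedy assignment can be sketched more compactly; the toy frame and the 60 % target below are hypothetical, and exact `isin` matching is used instead of `str.contains`, which avoids substring hits (a compound "tip" also matching "wing tip" sentences):

``` python
import pandas as pd

# Hypothetical toy frame standing in for the real sentence data
df = pd.DataFrame({
    "NC": ["mass destruction"] * 3 + ["wing tip"] * 3 + ["mass extinction"] * 2,
    "Sentence": [f"s{i}" for i in range(8)],
})

target = len(df) * 0.6            # desired train size in sentences
counts = df["NC"].value_counts()  # sentences per compound

train_ncs, taken = [], 0
for nc, n in counts.items():
    if taken + n <= target:       # a whole compound goes in, or not at all
        train_ncs.append(nc)
        taken += n

train = df[df["NC"].isin(train_ncs)]   # exact match, no substring hits
rest = df[~df["NC"].isin(train_ncs)]

# No compound appears in both sets
assert set(train["NC"]).isdisjoint(set(rest["NC"]))
```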
%% Cell type:code id: tags:
``` python
# Anti-join: keep only the rows of df that were not taken into the train set
merge_set = pd.merge(df, train_set, how='outer', indicator=True)
reduced = merge_set.loc[merge_set['_merge'] == 'left_only', ['Relation','Label','NC','Sentence']]
print(reduced.head())
```
%% Output
<bound method NDFrame.head of Relation Label NC \
1173 ADJ-LIKE_NOUN 1 mass destruction
1174 ADJ-LIKE_NOUN 1 mass destruction
1175 ADJ-LIKE_NOUN 1 mass destruction
1176 ADJ-LIKE_NOUN 1 mass destruction
1177 ADJ-LIKE_NOUN 1 mass destruction
... ... ... ...
173965 WHOLE+PART_OR_MEMBER_OF 35 wing tip
173966 WHOLE+PART_OR_MEMBER_OF 35 wing tip
173967 WHOLE+PART_OR_MEMBER_OF 35 wing tip
173968 WHOLE+PART_OR_MEMBER_OF 35 wing tip
173969 WHOLE+PART_OR_MEMBER_OF 35 wing tip
Sentence
1173 one used a primitive revolver; the other a wea...
1174 american obligations to come to israel’s defen...
1175 archbishop jose h. gomez of los angeles in a j...
1176 as robert draper recently us, those in the adm...
1177 didn’t powell say that iraq had ‘weapons of ma...
... ...
173965 snip out the bird’s backbone and add it to the...
173966 the arrest was made following tip-off received...
173967 the incident occurred when police, including g...
173968 apical meristem or growing tip.
173969 lesser is a smaller bird, with slimmer build, ...
[70748 rows x 4 columns]>
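The `indicator=True` anti-join used above is worth isolating: an outer merge adds a `_merge` column whose value is `'left_only'` for rows present only in the left frame. A minimal sketch with hypothetical toy frames:

``` python
import pandas as pd

# Hypothetical frames: `full` is all sentences, `train` the rows already taken
full = pd.DataFrame({"NC": ["a", "a", "b", "c"],
                     "Sentence": ["s1", "s2", "s3", "s4"]})
train = pd.DataFrame({"NC": ["a", "a"], "Sentence": ["s1", "s2"]})

# Outer merge on all shared columns; _merge marks each row's origin
merged = full.merge(train, how="outer", indicator=True)

# 'left_only' rows exist only in `full`, i.e. the remainder
reduced = merged.loc[merged["_merge"] == "left_only", ["NC", "Sentence"]]
```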
%% Cell type:code id: tags:
``` python
val_set = pd.DataFrame(columns = ['Relation', 'Label', 'NC', 'Sentence'])
grouped_NC = reduced.groupby(['Label', 'NC']).size().reset_index(name="Count")
grouped_rel = reduced.groupby(['Label']).size().reset_index(name="Count_rel")
sum = 0
# Iterate over the relations to get the number of sentences for each relation
for index, row in grouped_rel.iterrows():
    count = row['Count_rel']
    amount = count*0.15
    print(f"sum {sum}")
    print("Label {}".format(row['Label']))
    print(f" not over {amount}")
    # Iterate over the NCs of this relation to collect sentences without splitting
    # an NC across sets; the number of sentences should stay below `amount`
    if sum < perc_val:
        sum2 = 0
        # Keep only the NCs of the current relation
        NC_rel = grouped_NC.where(grouped_NC['Label'] == row['Label']).dropna().reset_index(drop=True)
        for index2, row2 in NC_rel.iterrows():
            count2 = row2['Count']
            temp = sum2 + count2
            if temp < amount and sum + sum2 < perc_val:
                df1 = reduced[reduced['NC'].str.contains(row2['NC'])]
                val_set = pd.concat([val_set, df1], axis=0)
                sum2 += count2
        sum = len(val_set)
val_set.to_csv('./val_data/val_set.csv')
```
%% Output
sum 0
Label 1
not over 123.0
sum 116
Label 2
not over 110.55
sum 224
Label 3
not over 93.14999999999999
sum 317
Label 4
not over 463.79999999999995
sum 780
Label 5
not over 159.0
sum 937
Label 6
not over 343.8
sum 1280
Label 7
not over 613.35
sum 1893
Label 8
not over 29.849999999999998
sum 1919
Label 9
not over 11.25
sum 1919
Label 10
not over 120.14999999999999
sum 2036
Label 11
not over 595.9499999999999
sum 2631
Label 12
not over 172.95
sum 2803
Label 13
not over 603.75
sum 3405
Label 14
not over 73.2
sum 3477
Label 15
not over 1585.35
sum 5062
Label 16
not over 127.19999999999999
sum 5180
Label 17
not over 166.04999999999998
sum 5346
Label 18
not over 455.7
sum 5801
Label 19
not over 252.14999999999998
sum 6052
Label 20
not over 167.4
sum 6216
Label 21
not over 39.0
sum 6251
Label 22
not over 1356.1499999999999
sum 7607
Label 23
not over 252.14999999999998
sum 7859
Label 24
not over 662.85
sum 8521
Label 25
not over 376.34999999999997
sum 8897
Label 26
not over 196.04999999999998
sum 9093
Label 27
not over 291.75
sum 9382
Label 28
not over 82.35
sum 9460
Label 29
not over 710.1
sum 10170
Label 30
not over 30.0
sum 10170
Label 31
not over 32.55
sum 10200
Label 32
not over 79.05
sum 10279
Label 33
not over 12.6
sum 10279
Label 34
not over 49.199999999999996
sum 10279
Label 35
not over 174.45
%% Cell type:code id: tags:
``` python
# Everything not used for train or val becomes the test set
merge_set2 = pd.merge(reduced, val_set, how='outer', indicator=True)
test_set = merge_set2.loc[merge_set2['_merge'] == 'left_only', ['Relation','Label','NC','Sentence']]
print(test_set.head())
test_set.to_csv('./test_data/test_set.csv')
```
%% Output
<bound method NDFrame.head of Relation Label NC \
116 ADJ-LIKE_NOUN 1 mass extinction
117 ADJ-LIKE_NOUN 1 mass extinction
118 ADJ-LIKE_NOUN 1 mass extinction
119 ADJ-LIKE_NOUN 1 mass extinction
120 ADJ-LIKE_NOUN 1 mass extinction
... ... ... ...
70743 WHOLE+PART_OR_MEMBER_OF 35 wing tip
70744 WHOLE+PART_OR_MEMBER_OF 35 wing tip
70745 WHOLE+PART_OR_MEMBER_OF 35 wing tip
70746 WHOLE+PART_OR_MEMBER_OF 35 wing tip
70747 WHOLE+PART_OR_MEMBER_OF 35 wing tip
Sentence
116 earth is at the start of a sixth mass extincti...
117 the heisei era indicates that godzilla was a d...
118 the impact would have thrown trillions of tons...
119 across england and wales, towns and villages a...
120 geographically widespread organisms fare bette...
... ...
70743 snip out the bird’s backbone and add it to the...
70744 the arrest was made following tip-off received...
70745 the incident occurred when police, including g...
70746 apical meristem or growing tip.
70747 lesser is a smaller bird, with slimmer build, ...
[60295 rows x 4 columns]>
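Once the three CSVs are written, a quick sanity check that no compound leaked across splits might look like the following sketch; the toy frames are hypothetical stand-ins for reading the written files back:

``` python
import pandas as pd

# Hypothetical stand-ins for the train/val/test frames written above
train = pd.DataFrame({"NC": ["a", "a"], "Sentence": ["s1", "s2"]})
val = pd.DataFrame({"NC": ["b"], "Sentence": ["s3"]})
test = pd.DataFrame({"NC": ["c", "c"], "Sentence": ["s4", "s5"]})

splits = {"train": train, "val": val, "test": test}
ncs = {name: set(s["NC"]) for name, s in splits.items()}

# Every pair of splits should share no compound
names = list(ncs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        overlap = ncs[a] & ncs[b]
        assert not overlap, f"{a}/{b} share NCs: {overlap}"
```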