Steffen Knapp / softwareprojektws17 · Commits

Commit 00a46f0d, authored 7 years ago by Maximilian Blunck:

    added tokens to corpus file. the tokenized review can now be directly accessed via 'TOKENS' key
parent f9848244

Changes: 2 changed files, with 1270 additions and 1264 deletions
- corpus.csv: +1255 −1255
- corpus.py: +15 −9
corpus.csv  +1255 −1255  (diff collapsed; not shown)
corpus.py  +15 −9
 import os, os.path
 import re
 import csv
+from nltk.tokenize import word_tokenize

 def read_corpus(csv_corpus_path):
     """
     Reads a csv-file and returns a list of dicts.
     Each dict represents one corpus file.
-    Keys: ['LABEL', 'FILENAME', 'STARS', 'TITLE', 'DATE', 'AUTHOR', 'PRODUCT', 'REVIEW']
+    Keys: ['LABEL', 'FILENAME', 'STARS', 'TITLE', 'DATE', 'AUTHOR', 'PRODUCT', 'REVIEW', 'TOKENS']
     """
     corpus = []
...
@@ -38,7 +39,7 @@ def convert_corpus(corpus_path, out):
     with open(out, 'w') as csvfile:
-        fieldnames = ['LABEL', 'FILENAME', 'STARS', 'TITLE', 'DATE', 'AUTHOR', 'PRODUCT', 'REVIEW']
+        fieldnames = ['LABEL', 'FILENAME', 'STARS', 'TITLE', 'DATE', 'AUTHOR', 'PRODUCT', 'REVIEW', 'TOKENS']
         writer = csv.DictWriter(csvfile, fieldnames)
         writer.writeheader()
...
@@ -60,9 +61,13 @@ def convert_corpus(corpus_path, out):
             data[fieldnames[1]] = file_path.split("/")[-1]
-            for tag in fieldnames[2:]:
+            for tag in fieldnames[2:-1]:
                 data[tag] = get_tag_content(tag, s)
+            # tokenization
+            tokens = word_tokenize(data['REVIEW'])
+            data["TOKENS"] = tokens
             writer.writerow(data)
     print("Corpus written to: " + out)
...
@@ -83,13 +88,14 @@ def get_tag_content(tag, text):
 if __name__ == '__main__':
-    """
-    corpus_path = "../corpus/SarcasmAmazonReviewsCorpus"
-    convert_corpus(corpus_path, "corpus.csv")
-    """
+    #corpus_path = "../corpus/SarcasmAmazonReviewsCorpus"
+    #convert_corpus(corpus_path, "corpus.csv")
+    #corpus = read_corpus("corpus.csv")
+    #print("Corpus size: "+str(len(corpus)))
+    corpus = read_corpus("corpus.csv")
+    print("Corpus size: "+str(len(corpus)))
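A side note on the loop-bound change in the diff: once 'TOKENS' is appended to fieldnames, the tag-extraction loop has to stop before the last entry, because 'TOKENS' is computed from the review text rather than read from a tag in the source file. A minimal stdlib-only check of the two slices (the fieldnames list is copied from the diff):

```python
fieldnames = ['LABEL', 'FILENAME', 'STARS', 'TITLE', 'DATE',
              'AUTHOR', 'PRODUCT', 'REVIEW', 'TOKENS']

# Old bound: fieldnames[2:] now ends with 'TOKENS', which has no
# corresponding tag in the corpus files and would make
# get_tag_content look for one.
print(fieldnames[2:])

# New bound: fieldnames[2:-1] stops before the trailing 'TOKENS'
# entry, so only real tag names are looked up.
print(fieldnames[2:-1])
```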
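One caveat when the tokenized review is "directly accessed" via the 'TOKENS' key: csv stores every cell as text, so csv.DictWriter serializes the token list as its repr string, and a reader has to parse that string back into a list (e.g. with ast.literal_eval). A minimal sketch of the round trip, stdlib-only, with a hypothetical one-row corpus and str.split standing in for nltk's word_tokenize so it stays self-contained:

```python
import ast
import csv
import io

# Hypothetical single-row corpus; str.split stands in for word_tokenize.
row = {"LABEL": "Ironic", "REVIEW": "Best toaster ever made"}
row["TOKENS"] = row["REVIEW"].split()

# Write: DictWriter stringifies the list as "['Best', 'toaster', ...]".
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["LABEL", "REVIEW", "TOKENS"])
writer.writeheader()
writer.writerow(row)

# Read back: the TOKENS cell is a plain string, so parse it into a list.
buf.seek(0)
read_row = next(csv.DictReader(buf))
tokens = ast.literal_eval(read_row["TOKENS"])
print(tokens)
```

A read_corpus caller that wants real token lists (rather than repr strings) would apply the same ast.literal_eval step to each row's 'TOKENS' cell.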