Commit 7e12a7d4, authored 5 years ago by nwarslan (parent 972b55b1): added README
Changed file: README.md (150 additions, 5 deletions)
spektrum_links:
---------------
directory with the original input data
3 files, each in csv and json format (see the "_Sample" files for an example),
containing the ID, title, date, keywords, and source of each German summary and url links to the English articles
(date and source are not used further)
de_spektrum_summaries:
----------------------
directory to German summaries
1 json file collecting all summaries
Example:
{ID: {"DeTitle": "Junge Dinger",
"DeUnderTitle": null,
"DeTeaser": null,
"DeSummary": "..., beobachten Astronomen im fernen China,..."},
ID2: ...
}
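All summaries live in one JSON dictionary keyed by article ID; a minimal sketch of reading one entry, using the structure shown above (the ID "12345" is made up for illustration):

```python
import json

# Minimal sketch of the de_summaries.json structure shown above; in the real
# pipeline the dict is read from '../data/de_spektrum_summaries/de_summaries.json'.
# The ID "12345" is a made-up placeholder.
raw = '''{"12345": {"DeTitle": "Junge Dinger",
                    "DeUnderTitle": null,
                    "DeTeaser": null,
                    "DeSummary": "..., beobachten Astronomen im fernen China, ..."}}'''
summaries = json.loads(raw)

entry = summaries["12345"]
print(entry["DeTitle"])  # → Junge Dinger
```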
spektrum_keywords:
------------------
directory to keywords of spektrum_links
files with around 400 lines each:
200 lines of German keywords, translated via Google Translate, and 200 lines of English keywords
(around 200 lines is the input limit of the translator)
spektrum_keyword_dict.json:
Dictionary mapping German keywords to English keywords
Example:
{..., "alterungsprozess": "aging process", "alt werden": "to become old", "alt sein": "be old",...}
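The dictionary is used to look up English equivalents for a summary's German keywords; a sketch under the entries shown above (the skipping of untranslated keywords is an assumption, not the actual filter_data.py logic):

```python
# Entries taken from the spektrum_keyword_dict.json example above
keyword_dict = {"alterungsprozess": "aging process",
                "alt werden": "to become old",
                "alt sein": "be old"}

# Translate the German keywords of one summary, skipping keywords the
# translator did not cover (assumed behavior, for illustration only)
de_keywords = ["alterungsprozess", "alt sein", "unbekannt"]
en_keywords = [keyword_dict[k] for k in de_keywords if k in keyword_dict]
print(en_keywords)  # → ['aging process', 'be old']
```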
code:
=====
filter_data.py:
--------------
filters and annotates data
IN:
INPUT = '../data/spektrum_links/json/'
KEYWORDS = '../data/spektrum_keywords/spektrum_keyword_dict.json'
OUT:
OUTPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
WIKI = '../output/spektrum_links_output/wiki_links'
ERRORS = '../output/spektrum_links_output/error_links'
INVALID = '../output/spektrum_links_output/invalid_links'
SM = '../output/spektrum_links_output/social_media_links'
filter:
- urls to wikipedia
- urls to social media
- urls to error pages
- invalid urls
- urls to German websites
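The filter step amounts to a URL classifier; a minimal sketch of the idea (the domain lists and the helper name are illustrative assumptions, not the actual filter_data.py implementation):

```python
from urllib.parse import urlparse

# Hypothetical domain list; the real script's filters may differ
SOCIAL_MEDIA = ("twitter.com", "facebook.com", "instagram.com")

def classify_url(url):
    """Sort a url into the filter categories listed above (sketch)."""
    parsed = urlparse(url)
    if not parsed.scheme or not parsed.netloc:
        return "invalid"
    host = parsed.netloc.lower()
    if "wikipedia.org" in host:
        return "wikipedia"
    if any(host.endswith(sm) for sm in SOCIAL_MEDIA):
        return "social_media"
    if host.endswith(".de"):   # crude stand-in for "German website"
        return "german"
    return "keep"

print(classify_url("https://en.wikipedia.org/wiki/Star"))  # → wikipedia
print(classify_url("not a url"))                           # → invalid
```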
annotation:
- En_title: str, title of each url page
- Abstract: bool, True if the url text has an "Abstract" or "Introduction" section
- Structure: int, 0: no structure, 6: scientific structure
- Keyword: str, number of translated German keywords matching in En_title / total number of German keywords of the summary
- Pdfs: list, links to pdf files on the url page
Example OUTPUT:
{ID: {"Title": "German title",
      "Keywords": ["German", "keywords"],
      "Urls": {url: {"En_title": "English title",
                     "Abstract": true,
                     "Structure": 6,
                     "Keyword": "4/9",
                     "Pdfs": ["links", "to", "pdfs"]
                     },
               url_2: {...}
               }
      },
 ID_2: {...}
}
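The Keyword annotation is a match ratio over the translated keywords; a sketch of how such a score could be computed (the helper name and substring matching are assumptions, not the actual filter_data.py code):

```python
def keyword_score(en_title, de_keywords, keyword_dict):
    """Count how many translated German keywords occur in the English title,
    returned as the 'matches/total' string used in the OUTPUT example."""
    title = en_title.lower()
    matches = sum(
        1 for kw in de_keywords
        if kw in keyword_dict and keyword_dict[kw].lower() in title
    )
    return f"{matches}/{len(de_keywords)}"

# Made-up dictionary entries and title for illustration
kd = {"alterungsprozess": "aging process", "sterne": "stars"}
score = keyword_score("Aging process of distant stars",
                      ["alterungsprozess", "sterne", "alt werden"], kd)
print(score)  # → 2/3
```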
main_extractor.py:
------------------
Decides which url or pdf to link to each German summary
and calls url_extractor.py or pdf_extractor.py accordingly
IN:
INPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
FAILS:
TIKA_FAIL = '../output/extracted_articles/tika_fails.txt'
-> IDs of articles where tika can't parse the pdf file
DOWNLOAD_FAILS = '../output/extracted_articles/download_fails.txt'
-> name and source of pdf files that failed to download
FAILS = '../output/extracted_articles/extraction_fails.txt'
-> IDs of articles that could not be extracted
OUT:
PDFS = '../output/extracted_articles/pdf_extraction/pdfs/'
-> directory to pdf downloads
PDF_DICT = '../output/extracted_articles/pdf_extraction/pdfs/pdf_dict.json'
-> dictionary to map source of pdf to a number (name of pdf file)
PDF_EXTRACT = '../output/extracted_articles/pdf_extraction/'
-> directory to html and text files of the pdf extraction
URL_EXTRACT = '../output/extracted_articles/url_extraction/'
-> directory to html and text files of the url extraction
DONE = '../output/extracted_articles/done.json'
-> list of IDs that have already been extracted or failed; used to resume in case of a crash
DE_EN = '../output/extracted_articles/de_en_articles.json'
-> path to merged German summaries and English articles
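The DONE file is what makes the extraction resumable after a crash; a minimal sketch of that pattern (the path is shortened and the helper names are assumptions, not the actual main_extractor.py code):

```python
import json
import os

DONE = 'done.json'   # '../output/extracted_articles/done.json' in the pipeline

if os.path.exists(DONE):
    os.remove(DONE)  # start fresh for this demo only

def load_done(path=DONE):
    """Return the set of IDs already processed (empty set on first run)."""
    if os.path.exists(path):
        with open(path, encoding='utf-8') as f:
            return set(json.load(f))
    return set()

def mark_done(done, article_id, path=DONE):
    """Record an ID as finished so a restart can skip it."""
    done.add(article_id)
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(sorted(done), f)

done = load_done()
for article_id in ["a1", "a2"]:
    if article_id in done:
        continue             # skip work already finished before the crash
    # ... extract the url or pdf here ...
    mark_done(done, article_id)
```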
url_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the url link, extracts the title, sections, and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
pdf_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the pdf link, extracts the title, sections, and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
article_to.py:
--------------
adds article to de_en_articles.json and
writes article to .html and .txt file (to OUT+'url_extraction/' or OUT+'pdf_extraction/')
IN:
DE_SUM = '../data/de_spektrum_summaries/de_summaries.json'
-> path to German summaries
OUT:
OUT ='../output/extracted_articles/'
running instructions:
=====================
1.) translate German keywords and add them to '/data/spektrum_keywords/spektrum_keyword_dict.json'
2.) run filter_data.py
3.) run main_extractor.py
If you don't need to extract html or text data, make changes in the function extract(soup,ID) in url_extractor.py and pdf_extractor.py.
documentation:
==============
If you want to extract an article from any html file, run the function extract(soup, ID) in url_extractor.py or pdf_extractor.py,
and adjust article_to.py if there is no matching summary in '/data/de_spektrum_summaries/de_summaries.json'.
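extract(soup, ID) expects an already-parsed document, so extracting from an arbitrary html file means building the soup first and passing it in with the matching summary ID. As a stdlib-only stand-in for the title-extraction part of that step, a sketch (the class name and sample html are made up; the project itself works on a parsed soup tree):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Stand-in for the soup step: pull the <title>, one of the pieces
    extract(soup, ID) collects (title, sections, text)."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleParser()
p.feed("<html><head><title>Aging stars</title></head><body>...</body></html>")
print(p.title)  # → Aging stars
```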