spektrum_links:
---------------
directory containing the original input data
3 files, each in csv and json format (see the "_Sample" files for examples)
each file maps an ID to title, date, keywords and source of a German summary and to url links to the English articles
(date and source are not used any further)
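A minimal sketch of loading the json variant (the file name is an assumption, and 'Title' and 'Urls' are placeholder field names mirroring the filter_data.py example output below; adjust to the actual files):

    import json

    # assumed file name inside ../data/spektrum_links/json/
    with open('../data/spektrum_links/json/spektrum_links.json', encoding='utf-8') as f:
        links = json.load(f)

    # inspect one entry; field names are placeholders
    for article_id, entry in links.items():
        print(article_id, entry['Title'], entry['Urls'])
        break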
de_spektrum_summaries:
----------------------
directory containing the German summaries
3 pickle files
1 json file aggregating all summaries
Example:
    {ID:  {"DeTitle": "Junge Dinger",
           "DeUnderTitle": null,
           "DeTeaser": null,
           "DeSummary": "..., beobachten Astronomen im fernen China,..."},
     ID2: ...
    }
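A minimal sketch of reading one entry from the aggregated json file (using the de_summaries.json path that article_to.py reads, see below):

    import json

    with open('../data/de_spektrum_summaries/de_summaries.json', encoding='utf-8') as f:
        summaries = json.load(f)

    some_id = next(iter(summaries))        # pick any ID
    print(summaries[some_id]['DeTitle'])   # e.g. "Junge Dinger"
    print(summaries[some_id]['DeSummary'])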
spektrum_keywords:
------------------
directory containing the keywords of spektrum_links
files with around 400 lines each:
200 lines of German keywords and 200 lines of English keywords translated via Google Translate
(around 200 lines is the limit of the translator)
spektrum_keyword_dict.json:
Dictionary mapping German keywords to their English translations
Example:
{..., "alterungsprozess": "aging process", "alt werden": "to become old", "alt sein": "be old",...}
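A minimal sketch of a lookup in this dictionary (assuming the flat German-to-English mapping shown above):

    import json

    with open('../data/spektrum_keywords/spektrum_keyword_dict.json', encoding='utf-8') as f:
        keyword_dict = json.load(f)

    # keys are lowercased German keywords, so normalize before the lookup
    print(keyword_dict.get('alterungsprozess'))  # -> 'aging process'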
code:
=====
filter_data.py:
--------------
filters and annotates data
IN:
INPUT = '../data/spektrum_links/json/'
KEYWORDS = '../data/spektrum_keywords/spektrum_keyword_dict.json'
OUT:
OUTPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
WIKI = '../output/spektrum_links_output/wiki_links'
ERRORS = '../output/spektrum_links_output/error_links'
INVALID = '../output/spektrum_links_output/invalid_links'
SM = '../output/spektrum_links_output/social_media_links'
filter:
- urls to Wikipedia
- urls to social media
- urls to error pages
- invalid urls
- urls to German websites
annotation:
- En_title: str, title of each url page
- Abstract: bool, True if the url text has an "Abstract" or "Introduction" section
- Structure: int, from 0 (no structure) to 6 (full scientific structure)
- Keyword: str, ratio of translated German keywords found in En_title to the total number of German keywords of the summary, e.g. "4/9" (see the sketch after the example output)
- Pdfs: list, links to pdf files found on the url page
Example OUTPUT:
    {ID : { Title : 'German title',
            Keywords : ['German', 'keywords'],
            Urls : { url : { En_title : 'English title',
                             Abstract : True,
                             Structure : 6,
                             Keyword : '4/9',
                             Pdfs : ['links', 'to', 'pdfs']
                           },
                     url_2 : {...}
                   }
          },
     ID_2 : {...}
    }
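A minimal sketch of how the Keyword ratio could be computed (an illustrative helper, not the actual filter_data.py code):

    def keyword_ratio(de_keywords, keyword_dict, en_title):
        # count how many of the summary's German keywords have a translation
        # that occurs in the English title, and report 'matches/total'
        title = en_title.lower()
        matches = 0
        for kw in de_keywords:
            translation = keyword_dict.get(kw.lower())
            if translation and translation.lower() in title:
                matches += 1
        return '{}/{}'.format(matches, len(de_keywords))

    # keyword_ratio(['altern', 'alterungsprozess'], keyword_dict, 'The aging process') -> '1/2'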
main_extractor.py:
------------------
Decides which url or pdf gets linked to each German summary
and calls url_extractor.py or pdf_extractor.py accordingly
IN:
INPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
FAILS:
TIKA_FAIL = '../output/extracted_articles/tika_fails.txt'
-> IDs of articles where Tika can't parse the pdf file
DOWNLOAD_FAILS = '../output/extracted_articles/download_fails.txt'
-> name and source of pdf files that failed to download
FAILS = '../output/extracted_articles/extraction_fails.txt'
-> IDs of articles that could not be extracted
OUT:
PDFS = '../output/extracted_articles/pdf_extraction/pdfs/'
-> directory containing the pdf downloads
PDF_DICT = '../output/extracted_articles/pdf_extraction/pdfs/pdf_dict.json'
-> dictionary mapping the source of each pdf to a number (used as the pdf file name)
PDF_EXTRACT = '../output/extracted_articles/pdf_extraction/'
-> directory containing the html and text files of the pdf extraction
URL_EXTRACT = '../output/extracted_articles/url_extraction/'
-> directory containing the html and text files of the url extraction
DONE = '../output/extracted_articles/done.json'
-> list of IDs already extracted or failed, used to resume after a crash
DE_EN = '../output/extracted_articles/de_en_articles.json'
-> path to merged German summaries and English articles
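A minimal sketch of the resume behaviour via done.json (names and structure are assumptions based on the description above, not the actual main_extractor.py code):

    import json, os

    INPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
    DONE = '../output/extracted_articles/done.json'

    with open(INPUT, encoding='utf-8') as f:
        filtered_links = json.load(f)

    done = []
    if os.path.exists(DONE):
        with open(DONE, encoding='utf-8') as f:
            done = json.load(f)

    for article_id in filtered_links:
        if article_id in done:
            continue  # already extracted or failed in a previous run
        # ... decide here between url_extractor.extract and pdf_extractor.extract ...
        done.append(article_id)
        with open(DONE, 'w', encoding='utf-8') as f:
            json.dump(done, f)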
url_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the url page, extracts the title, sections and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
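A minimal sketch of what extract(soup, ID) does on the url side (an illustrative skeleton only; the real function also detects sections and hands the result to article_to.py):

    def extract(soup, ID):
        # soup is a BeautifulSoup object of the downloaded page
        title = soup.title.get_text(strip=True) if soup.title else ''
        paragraphs = [p.get_text(' ', strip=True) for p in soup.find_all('p')]
        text = '\n'.join(paragraphs)
        return title, text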
pdf_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the pdf content, extracts the title, sections and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
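The pdf side depends on Tika for parsing (see TIKA_FAIL above). A minimal sketch using the tika-python bindings, assuming that is the binding in use and '0.pdf' as a placeholder file name:

    from tika import parser
    from bs4 import BeautifulSoup

    # xmlContent=True returns XHTML, which can be fed to BeautifulSoup
    parsed = parser.from_file(
        '../output/extracted_articles/pdf_extraction/pdfs/0.pdf', xmlContent=True)
    if parsed.get('content'):
        soup = BeautifulSoup(parsed['content'], 'lxml')
    # pdfs that Tika cannot parse end up in tika_fails.txt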
article_to.py:
--------------
adds the article to de_en_articles.json and
writes the article to .html and .txt files (in OUT+'url_extraction/' or OUT+'pdf_extraction/')
IN:
DE_SUM = '../data/de_spektrum_summaries/de_summaries.json'
-> path to German summaries
OUT:
OUT ='../output/extracted_articles/'
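A minimal sketch of the merging step (an illustrative helper; the field names 'de' and 'en' are placeholders, not the actual article_to.py layout):

    import json

    DE_EN = '../output/extracted_articles/de_en_articles.json'

    def add_article(ID, de_summary, en_article):
        # merge the German summary with the extracted English article
        try:
            with open(DE_EN, encoding='utf-8') as f:
                merged = json.load(f)
        except FileNotFoundError:
            merged = {}
        merged[ID] = {'de': de_summary, 'en': en_article}
        with open(DE_EN, 'w', encoding='utf-8') as f:
            json.dump(merged, f, ensure_ascii=False)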
running instructions:
=====================
1.) translate German keywords and add them to '/data/spektrum_keywords/spektrum_keyword_dict.json'
2.) run filter_data.py
3.) run main_extractor.py
If you do not need the html or text output, adjust the function extract(soup, ID) in url_extractor.py and pdf_extractor.py accordingly.
documentation:
==============
To extract an article from an arbitrary html file, call the function extract(soup, ID) in url_extractor.py or pdf_extractor.py
(a minimal sketch follows below) and adjust article_to.py if there is no matching summary in '/data/de_spektrum_summaries/de_summaries.json'.
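A minimal sketch of that standalone use (the file name and ID are placeholders):

    from bs4 import BeautifulSoup
    import url_extractor

    with open('article.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # pick any unique ID; adjust article_to.py if no matching summary exists
    url_extractor.extract(soup, 'my_custom_id')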