spektrum_links:
---------------
directory containing the original input data
3 files, each in csv and json format (see the "_Sample" files for examples)
each file maps an ID to title, date, keywords and source of a German summary and to url links to the English articles
(date and source are not used any further)
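A minimal sketch of loading the json variant (the file name is an assumption, and 'Title' and 'Urls' are placeholder field names mirroring the filter_data.py example output below; adjust to the actual files):

    import json

    # assumed file name inside ../data/spektrum_links/json/
    with open('../data/spektrum_links/json/spektrum_links.json', encoding='utf-8') as f:
        links = json.load(f)

    # inspect one entry; field names are placeholders
    for article_id, entry in links.items():
        print(article_id, entry['Title'], entry['Urls'])
        break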
de_spektrum_summaries:
----------------------
directory containing the German summaries
3 pickle files
1 json file aggregating all summaries
Example:
    {ID:  {"DeTitle": "Junge Dinger",
           "DeUnderTitle": null,
           "DeTeaser": null,
           "DeSummary": "..., beobachten Astronomen im fernen China,..."},
     ID2: ...
    }
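A minimal sketch of reading one entry from the aggregated json file (using the de_summaries.json path that article_to.py reads, see below):

    import json

    with open('../data/de_spektrum_summaries/de_summaries.json', encoding='utf-8') as f:
        summaries = json.load(f)

    some_id = next(iter(summaries))        # pick any ID
    print(summaries[some_id]['DeTitle'])   # e.g. "Junge Dinger"
    print(summaries[some_id]['DeSummary'])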
spektrum_keywords:
------------------
directory containing the keywords of spektrum_links
files with around 400 lines each:
200 lines of German keywords and 200 lines of English keywords translated via Google Translate
(around 200 lines is the limit of the translator)
spektrum_keyword_dict.json:
Dictionary mapping German keywords to their English translations
Example:
{..., "alterungsprozess": "aging process", "alt werden": "to become old", "alt sein": "be old",...}
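A minimal sketch of a lookup in this dictionary (assuming the flat German-to-English mapping shown above):

    import json

    with open('../data/spektrum_keywords/spektrum_keyword_dict.json', encoding='utf-8') as f:
        keyword_dict = json.load(f)

    # keys are lowercased German keywords, so normalize before the lookup
    print(keyword_dict.get('alterungsprozess'))  # -> 'aging process'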
code:
=====
filter_data.py:
--------------
filters and annotates data
IN:
INPUT = '../data/spektrum_links/json/'
KEYWORDS = '../data/spektrum_keywords/spektrum_keyword_dict.json'
OUT:
OUTPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
WIKI = '../output/spektrum_links_output/wiki_links'
ERRORS = '../output/spektrum_links_output/error_links'
INVALID = '../output/spektrum_links_output/invalid_links'
SM = '../output/spektrum_links_output/social_media_links'
filter:
- urls to Wikipedia
- urls to social media
- urls to error pages
- invalid urls
- urls to German websites
annotation:
- En_title: str, title of each url page
- Abstract: bool, True if the url text has an "Abstract" or "Introduction" section
- Structure: int, from 0 (no structure) to 6 (full scientific structure)
- Keyword: str, ratio of translated German keywords found in En_title to the total number of German keywords of the summary, e.g. "4/9" (see the sketch after the example output)
- Pdfs: list, links to pdf files found on the url page
Example OUTPUT:
    {ID : { Title : 'German title',
            Keywords : ['German', 'keywords'],
            Urls : { url : { En_title : 'English title',
                             Abstract : True,
                             Structure : 6,
                             Keyword : '4/9',
                             Pdfs : ['links', 'to', 'pdfs']
                           },
                     url_2 : {...}
                   }
          },
     ID_2 : {...}
    }
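A minimal sketch of how the Keyword ratio could be computed (an illustrative helper, not the actual filter_data.py code):

    def keyword_ratio(de_keywords, keyword_dict, en_title):
        # count how many of the summary's German keywords have a translation
        # that occurs in the English title, and report 'matches/total'
        title = en_title.lower()
        matches = 0
        for kw in de_keywords:
            translation = keyword_dict.get(kw.lower())
            if translation and translation.lower() in title:
                matches += 1
        return '{}/{}'.format(matches, len(de_keywords))

    # keyword_ratio(['altern', 'alterungsprozess'], keyword_dict, 'The aging process') -> '1/2'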
main_extractor.py:
------------------
Decides which url or pdf gets linked to each German summary
and calls url_extractor.py or pdf_extractor.py accordingly
IN:
INPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
FAILS:
TIKA_FAIL = '../output/extracted_articles/tika_fails.txt'
-> IDs of articles where Tika can't parse the pdf file
DOWNLOAD_FAILS = '../output/extracted_articles/download_fails.txt'
-> name and source of pdf files that failed to download
FAILS = '../output/extracted_articles/extraction_fails.txt'
-> IDs of articles that could not be extracted
OUT:
PDFS = '../output/extracted_articles/pdf_extraction/pdfs/'
-> directory containing the pdf downloads
PDF_DICT = '../output/extracted_articles/pdf_extraction/pdfs/pdf_dict.json'
-> dictionary mapping the source of each pdf to a number (used as the pdf file name)
PDF_EXTRACT = '../output/extracted_articles/pdf_extraction/'
-> directory containing the html and text files of the pdf extraction
URL_EXTRACT = '../output/extracted_articles/url_extraction/'
-> directory containing the html and text files of the url extraction
DONE = '../output/extracted_articles/done.json'
-> list of IDs already extracted or failed, used to resume after a crash
DE_EN = '../output/extracted_articles/de_en_articles.json'
-> path to merged German summaries and English articles
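A minimal sketch of the resume behaviour via done.json (names and structure are assumptions based on the description above, not the actual main_extractor.py code):

    import json, os

    INPUT = '../output/spektrum_links_output/filtered_Spektrum_Links.json'
    DONE = '../output/extracted_articles/done.json'

    with open(INPUT, encoding='utf-8') as f:
        filtered_links = json.load(f)

    done = []
    if os.path.exists(DONE):
        with open(DONE, encoding='utf-8') as f:
            done = json.load(f)

    for article_id in filtered_links:
        if article_id in done:
            continue  # already extracted or failed in a previous run
        # ... decide here between url_extractor.extract and pdf_extractor.extract ...
        done.append(article_id)
        with open(DONE, 'w', encoding='utf-8') as f:
            json.dump(done, f)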
url_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the url page, extracts the title, sections and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
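A minimal sketch of what extract(soup, ID) does on the url side (an illustrative skeleton only; the real function also detects sections and hands the result to article_to.py):

    def extract(soup, ID):
        # soup is a BeautifulSoup object of the downloaded page
        title = soup.title.get_text(strip=True) if soup.title else ''
        paragraphs = [p.get_text(' ', strip=True) for p in soup.find_all('p')]
        text = '\n'.join(paragraphs)
        return title, text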
pdf_extractor.py:
-----------------
The function extract(soup, ID) is the main function.
It filters the soup of the pdf content, extracts the title, sections and text, then
calls article_to.py to save the article as .html, .txt and/or in de_en_articles.json
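The pdf side depends on Tika for parsing (see TIKA_FAIL above). A minimal sketch using the tika-python bindings, assuming that is the binding in use and '0.pdf' as a placeholder file name:

    from tika import parser
    from bs4 import BeautifulSoup

    # xmlContent=True returns XHTML, which can be fed to BeautifulSoup
    parsed = parser.from_file(
        '../output/extracted_articles/pdf_extraction/pdfs/0.pdf', xmlContent=True)
    if parsed.get('content'):
        soup = BeautifulSoup(parsed['content'], 'lxml')
    # pdfs that Tika cannot parse end up in tika_fails.txt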
article_to.py:
--------------
adds the article to de_en_articles.json and
writes the article to .html and .txt files (in OUT+'url_extraction/' or OUT+'pdf_extraction/')
IN:
DE_SUM = '../data/de_spektrum_summaries/de_summaries.json'
-> path to German summaries
OUT:
OUT ='../output/extracted_articles/'
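A minimal sketch of the merging step (an illustrative helper; the field names 'de' and 'en' are placeholders, not the actual article_to.py layout):

    import json

    DE_EN = '../output/extracted_articles/de_en_articles.json'

    def add_article(ID, de_summary, en_article):
        # merge the German summary with the extracted English article
        try:
            with open(DE_EN, encoding='utf-8') as f:
                merged = json.load(f)
        except FileNotFoundError:
            merged = {}
        merged[ID] = {'de': de_summary, 'en': en_article}
        with open(DE_EN, 'w', encoding='utf-8') as f:
            json.dump(merged, f, ensure_ascii=False)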
running instructions:
=====================
1.) translate German keywords and add them to '/data/spektrum_keywords/spektrum_keyword_dict.json'
2.) run filter_data.py
3.) run main_extractor.py
If you do not need the html or text output, adjust the function extract(soup, ID) in url_extractor.py and pdf_extractor.py accordingly.
documentation:
==============
To extract an article from an arbitrary html file, call the function extract(soup, ID) in url_extractor.py or pdf_extractor.py
(a minimal sketch follows below) and adjust article_to.py if there is no matching summary in '/data/de_spektrum_summaries/de_summaries.json'.
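A minimal sketch of that standalone use (the file name and ID are placeholders):

    from bs4 import BeautifulSoup
    import url_extractor

    with open('article.html', encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')

    # pick any unique ID; adjust article_to.py if no matching summary exists
    url_extractor.extract(soup, 'my_custom_id')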