https://wiki.cl.uni-heidelberg.de/bin/view/Main/Resources/WaCkypedia
A 2009 dump of the English Wikipedia (about 800 million tokens), in the same format as PukWaC, including POS/lemma information, as well as a full dependency parse (parsing performed with the MaltParser).
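A minimal sketch of how such a file could be read, assuming the usual WaCky vertical layout (one token per line, tab-separated word/POS/lemma/index/head/deprel fields, with <s>...</s> marking sentence boundaries); the exact column order and the file name are assumptions, so check them against the actual release:

# Assumed WaCky-style column order; adjust if the release differs.
COLUMNS = ("word", "pos", "lemma", "index", "head", "deprel")

def read_wacky(path):
    """Yield one sentence at a time as a list of token dicts."""
    sentence = []
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("<"):            # <text>, <s>, </s> markup lines
                if line.startswith("</s") and sentence:
                    yield sentence
                    sentence = []
                continue
            fields = line.split("\t")
            if len(fields) >= len(COLUMNS):
                sentence.append(dict(zip(COLUMNS, fields)))
    if sentence:                                 # file may end without a closing </s>
        yield sentence

if __name__ == "__main__":
    # "wackypedia_en.txt" is a hypothetical file name for illustration.
    for sent in read_wacky("wackypedia_en.txt"):
        print(" ".join(tok["word"] for tok in sent))
        break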
@toyota I like your proposal of using the Wikipedia dump.
( https://wiki.cl.uni-heidelberg.de/bin/view/Main/Resources/WaCkypedia )
We could also use English Gigaword, which is even larger than the Wikipedia dump, or the British National Corpus (100 million words), but the diversity of the Wikipedia dump should be higher.
We could also use the dataset from SemEval-2013 Task 11, which we are already working on. It is already labelled and includes gold-standard annotations for Web search results: