Dataset | OpenDataMonitor

CKAN

Odm ID	5fef26e7-e127-4267-86f4-1ad3c3027789
Title	Magyar webkorpusz (Hungarian Webcorpus)
Notes	With over 1.48 billion words unfiltered (589 million words after full filtering), this is by far the largest Hungarian-language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license. The Hungarian Webcorpus was created in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre. The corpus consists of 18 million pages downloaded from the .hu domain and therefore represents common written language fairly extensively. Duplicate texts and files that contained no usable text were filtered out. We stratified the remainder into four sections according to the proportion of words on each page that a spellchecker accepted.
Author
Author Email
Catalogue Url	http://opendata.hu/
Dataset Url	http://www.opendata.hu//dataset/magyar-webkorpusz
Metadata Updated	2015-09-23 10:10:36
Tags
Date Released
Date Updated
Update Frequency
Organisation
Country
State
Platform	ckan
Language	hu
Version	(not set)
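
The filtering and stratification described in the Notes field can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure the corpus authors actually ran: the record does not name the spellchecker or the stratum boundaries, so a plain set of known words (`lexicon`) stands in for the spellchecker, the cut-offs in `STRATUM_BOUNDS` are hypothetical, and duplicates are detected by exact match on the normalised word sequence.

```python
import hashlib
import re

# Runs of letters only; digits, punctuation and underscores are ignored.
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)

# Hypothetical cut-offs: the record does not state the real stratum boundaries.
# A page lands in the first stratum whose bound its accepted-word ratio reaches.
STRATUM_BOUNDS = (0.96, 0.92, 0.60)


def tokenize(text):
    """Return the lowercased words of a page."""
    return [w.lower() for w in WORD_RE.findall(text)]


def accepted_ratio(words, lexicon):
    """Proportion of words accepted by the stand-in spellchecker."""
    if not words:
        return 0.0
    return sum(w in lexicon for w in words) / len(words)


def stratum(ratio, bounds=STRATUM_BOUNDS):
    """Map a ratio to a stratum index: 0 is the cleanest, len(bounds) the noisiest."""
    for index, bound in enumerate(bounds):
        if ratio >= bound:
            return index
    return len(bounds)


def stratify(pages, lexicon):
    """Drop empty and duplicate pages, then group the rest into four strata."""
    seen = set()
    strata = {i: [] for i in range(len(STRATUM_BOUNDS) + 1)}
    for text in pages:
        words = tokenize(text)
        if not words:
            continue  # no usable text
        digest = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # text already present
        seen.add(digest)
        strata[stratum(accepted_ratio(words, lexicon))].append(text)
    return strata
```

With a toy lexicon such as `{"a", "kutya", "ugat"}`, calling `stratify(pages, lexicon)` puts a fully recognised page in stratum 0 and a page of unrecognised strings in stratum 3. A real pipeline would more likely use an actual Hungarian spellchecker and near-duplicate detection in place of the exact-match hash.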