Magyar webkorpusz

Dataset Profile

Odm ID
5fef26e7-e127-4267-86f4-1ad3c3027789
Title
Magyar webkorpusz
Notes
With over 1.48 billion words unfiltered (589 million words when fully filtered), this is by far the largest Hungarian-language corpus. Unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license. The Hungarian webcorpus was created in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.
The corpus consists of 18 million pages downloaded from the .hu domain, so it represents common written language fairly extensively. Duplicate texts and files that contained no usable text were filtered out. We stratified the remainder into four sections according to the proportion of words on each page that were accepted by a spellchecker.
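The filtering and stratification described above can be sketched roughly as follows. This is an illustrative sketch only, not the project's actual pipeline: the function names, the exact-match duplicate check via hashing, the tokenizer, and the cutoff values (0.95, 0.85, 0.60) are all assumptions made for the example, and the spellchecker is stood in for by any callable that says whether a word is accepted.

```python
import hashlib
import re

# Runs of letters only; accented Hungarian letters match because
# Python 3 regular expressions are Unicode-aware by default.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Split a page into lowercase word tokens."""
    return WORD_RE.findall(text.lower())


def accepted_ratio(text, is_accepted):
    """Return the share of words accepted by the spellchecker, or None if no words."""
    words = tokenize(text)
    if not words:
        return None
    return sum(1 for w in words if is_accepted(w)) / len(words)


def stratify(pages, is_accepted, cutoffs=(0.95, 0.85, 0.60)):
    """Drop duplicates and empty pages, then bin the rest into four strata.

    The cutoffs are hypothetical; three cutoffs yield four sections,
    ordered from the cleanest pages to the noisiest.
    """
    strata = [[] for _ in range(len(cutoffs) + 1)]
    seen = set()
    for text in pages:
        # Assumption: exact-duplicate detection by content hash.
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        ratio = accepted_ratio(text, is_accepted)
        if ratio is None:
            continue  # no usable text on this page
        for i, cutoff in enumerate(cutoffs):
            if ratio >= cutoff:
                strata[i].append(text)
                break
        else:
            strata[-1].append(text)
    return strata
```

Here `is_accepted` could be, for instance, the membership test of a word list (`lexicon.__contains__`) or a thin wrapper around a real spellchecker. The first stratum would then correspond to the most strictly filtered part of the corpus.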
Author
Author Email
Catalogue Url
Dataset Url
Metadata Updated
2015-09-23 10:10:36
Tags
Date Released
Date Updated
Update Frequency
Organisation
Country
State
Platform
ckan
Language
hu
Version
(not set)