Corpora of the Slovene Language Gigafida, KRES, ccGigafida and ccKRES: Compilation, Content, Use
Keywords:
reference corpora, text reception, text production, internet texts, language technologies
Synopsis
One of the aims of the Communication in Slovene project (2008-2013) was to compile a reference corpus of written Slovene. The outcome was the Gigafida corpus, which contains over 1 billion words and is an upgrade of two earlier corpora of Slovene: the FIDA corpus (2000) and the FidaPLUS corpus (2006).
All the collected texts were included in the Gigafida corpus (in addition to the texts from the FIDA and FidaPLUS corpora); a more balanced distribution of genres, however, was planned and realized in a 100-million-word corpus called KRES. In addition, we built two subcorpora that are available under a Creative Commons licence (“Attribution-NonCommercial-ShareAlike”): the first subcorpus (ccGigafida) contains 9% of Gigafida, and the second (ccKRES) contains 9% of KRES.