Corpora of the Slovene Language Gigafida, KRES, ccGigafida and ccKRES: Compilation, Content, Use (Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba)


Nataša Logar Berginc, University of Ljubljana, Faculty of Social Sciences, Slovenia; Miha Grčar; Marko Brakus; Tomaž Erjavec, Jožef Stefan Institute, Ljubljana, Slovenia; Špela Arhar Holdt, University of Ljubljana, Faculty of Computer and Information Science, Slovenia; Simon Krek, Jožef Stefan Institute, Ljubljana, Slovenia


reference corpora, text reception, text production, internet texts, language technologies


One of the aims of the Communication in Slovene project (2008-2013) was the compilation of a reference corpus of written Slovene. The outcome was the Gigafida corpus, which contains over 1 billion words and is an upgrade of two earlier corpora of Slovene: the FIDA corpus (2000) and the FidaPLUS corpus (2006).

All the collected texts were included in the Gigafida corpus (in addition to the texts from the FIDA and FidaPLUS corpora). A more balanced distribution of genres, however, was planned and realized in a 100-million-word corpus called KRES. In addition, we built two subcorpora that are available under a Creative Commons licence (“Attribution-NonCommercial-ShareAlike”): the first subcorpus (ccGigafida) contains 9% of Gigafida, and the second (ccKRES) contains 9% of KRES.





August 28, 2020

How to Cite

Logar Berginc, N., Grčar, M., Brakus, M., Erjavec, T., Arhar Holdt, Š., & Krek, S. (2020). Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. University of Ljubljana Press. https://doi.org/10.4312/9789610603542