A Guide to Frequency Lists from the Gigafida 2.0 and GOS 1.0 Corpora


Jaka Čibej
University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Špela Arhar Holdt
University of Ljubljana, Faculty of Computer and Information Science, Slovenia
Kaja Dobrovoljc
University of Ljubljana, Faculty of Arts, Slovenia
Simon Krek
Jožef Stefan Institute, Ljubljana, Slovenia

Keywords:

written Slovene, spoken Slovene, LIST program, CLARIN.SI repository, language corpora

Abstract

The research project “The New Grammar of Modern Standard Slovene: Resources and Methods” was carried out by researchers from the Jožef Stefan Institute and from the Faculty of Arts and the Faculty of Computer and Information Science of the University of Ljubljana. The goal of the project was to define a linguistic and methodological basis for the computational analysis of written and spoken Slovene as represented in modern Slovene language corpora. Based on these new methods, a series of open-access corpus-based databases was generated. These databases can serve as a basis for preparing an empirical grammatical description of modern Slovene and for developing language technologies for Slovene.

The purpose of this publication is to provide a quick overview of the data available in the CLARIN.SI repository and to demonstrate the functions and uses of the LIST program, which can also be used to extract similar frequency lists from other corpora. The guide features short excerpts of all available frequency lists, i.e. the table header and approximately 30 lines. Each table also includes a link to the corresponding data in the repository. Each subsection of a chapter begins with a short description of the conditions used in the extraction. The guide is available in Slovene and English.





December 30, 2020