Efficient n-gram, skipgram and flexgram modelling with Colibri Core

Abstract

Counting n-grams lies at the core of any frequentist corpus analysis and is often considered a trivial matter. Going beyond consecutive n-grams to patterns such as skipgrams and flexgrams increases the demand for efficient solutions. The need to operate on big corpus data does so even more. Lossless compression and non-trivial algorithms are needed to lower the memory demands, yet retain good speed. Colibri Core is software for the efficient computation and querying of n-grams, skipgrams and flexgrams from corpus data. The resulting pattern models can be analysed and compared in various ways. The software offers a programming library for C++ and Python, as well as command-line tools.
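
To make the pattern types concrete, the following is a minimal sketch in plain Python of naive n-gram and skipgram counting. It does not use the Colibri Core API and performs none of the compression the abstract refers to; the function names, the `GAP` placeholder, and the assumption that a skipgram keeps its first and last token while inner tokens may be replaced by fixed-size gaps are illustrative choices made here, not taken from the software.

```python
from collections import Counter
from itertools import combinations

GAP = "{*}"  # hypothetical placeholder marking one skipped token


def count_ngrams(tokens, n):
    """Count every consecutive n-gram of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def count_skipgrams(tokens, n):
    """Count skipgrams of total length n.

    Assumption for this sketch: the first and last token of each window
    are kept, and any non-empty subset of the inner tokens is replaced
    by GAP. Windows shorter than 3 tokens yield no skipgrams.
    """
    counts = Counter()
    inner = range(1, n - 1)  # positions that may become gaps
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        for k in range(1, n - 1):  # number of gapped positions
            for positions in combinations(inner, k):
                pattern = tuple(
                    GAP if j in positions else tok
                    for j, tok in enumerate(window)
                )
                counts[pattern] += 1
    return counts


tokens = "to be or not to be".split()
bigrams = count_ngrams(tokens, 2)
skipgrams = count_skipgrams(tokens, 3)
```

Storing every pattern as a tuple of strings, as this sketch does, is the kind of memory cost that grows quickly on large corpora, which is the motivation the abstract gives for lossless compression and more careful algorithms.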

Publication
Journal of Open Research Software
Date