Proycon's Homepage
2009-12-01

Ucto

Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, all of which you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation.
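To make the two core steps concrete, here is a minimal Python sketch of separating words from punctuation and then splitting sentences. It is not Ucto's implementation and does not use Ucto's rule files; the pattern `TOKEN_RE` and the functions `tokenize` and `split_sentences` are hypothetical names made up for this illustration.

```python
import re

# Hypothetical pattern (not an Ucto rule): a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Assumption for this sketch: only these marks end a sentence.
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each end mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were not closed by an end mark.
        sentences.append(current)
    return sentences


tokens = tokenize("Hello, world! This is Ucto.")
sentences = split_sentences(tokens)
```

A naive splitter like this would wrongly break on abbreviations, dates, and times, since every period closes a sentence. Language-specific rules are needed to handle those cases.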

Ucto comes with tokenization rules for several languages and can easily be extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor.

Features

  • Comes with tokenization rules for English, Dutch, French, Italian, and Swedish; easily extendible to other languages.
  • Recognizes dates, times, units, currencies, abbreviations.
  • Recognizes paired quote spans, sentences, and paragraphs.
  • Produces UTF-8 encoded, NFC-normalized output; optionally accepts other encodings as input.
  • Optional conversion to all lowercase or uppercase.
  • Supports FoLiA XML.
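The encoding and case features listed above can be illustrated with a short Python sketch using only the standard library. It shows what UTF-8 output with NFC normalization and optional lowercasing means in practice. It is not Ucto's code, and the function name `normalize` and its parameters are hypothetical.

```python
import unicodedata


def normalize(raw, encoding="utf-8", lowercase=False):
    """Decode bytes, apply NFC normalization, and re-encode as UTF-8.

    Hypothetical helper for illustration only; not part of Ucto.
    """
    text = raw.decode(encoding)
    # NFC composes sequences such as "e" + U+0301 into the single
    # precomposed character U+00E9.
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    return text.encode("utf-8")


# Decomposed input: "e" followed by a combining acute accent.
composed = normalize("Cafe\u0301".encode("utf-8"))

# Latin-1 input converted to lowercase UTF-8 output.
lowered = normalize(b"Caf\xe9", encoding="latin-1", lowercase=True)
```

Normalizing to a single form matters because the same visible word can otherwise be stored as different byte sequences, which would make identical tokens compare as unequal in later processing steps such as indexing.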

Ucto was written by Maarten van Gompel and Ko van der Sloot. Work on Ucto was funded by NWO, the Netherlands Organisation for Scientific Research, under the Implicit Linguistics project and the CLARIN-NL program.