FoLiA: A practical XML Format for Linguistic Annotation - a descriptive and comparative study


In this paper we present FoLiA, a Format for Linguistic Annotation, and conduct a comparative study with other annotation schemes, including the Linguistic Annotation Framework (LAF), the Text Encoding Initiative (TEI) and Text Corpus Format (TCF). An additional point of focus is the interoperability between FoLiA and metadata standards such as the Component MetaData Infrastructure (CMDI), as well as data category registries such as ISOcat. The aim of the paper is to present a clear image of the capabilities of FoLiA and how it relates to other formats. This should open discussion and aid users in their decision for a particular format.

FoLiA is a practically-oriented XML-based annotation format for the representation of language resources, explicitly supporting a wide variety of annotation types. It introduces a flexible and uniform paradigm and a representation independent of language or label set. It is designed to be highly expressive, generic, and formalised, whilst at the same time focussing on being as practical as possible to ease its adoption and implementation. The aspiration is to offer a generic format for storage, exchange, and machine-processing of linguistically annotated documents, preventing users as well as software tools from having to cope with a wide variety of different formats, which in the field regularly causes convertibility issues and proliferation of ad-hoc formats. FoLiA emerged from such a practical need in the context of Computational Linguistics in the Netherlands and Flanders. It has been successfully adopted by numerous projects within this community. FoLiA was developed in a bottom-up fashion, with special emphasis on software libraries and tools to handle it.

Computational Linguistics in the Netherlands Journal