T-Scan: a new tool for analyzing Dutch text

Abstract

T-Scan is a new tool for analyzing Dutch text. It aims at extracting text features that are theoretically interesting, in that they relate to genre and text complexity, as well as practically interesting, in that they enable users and text producers to make text-specific diagnoses. T-Scan derives its features from tools such as Frog and Alpino, and resources such as SoNaR, SUBTLEX-NL and Referentie Bestand Nederlands. This paper offers a qualitative discussion of a number of T-Scan features, based on a minimal demonstration corpus of six texts, three of them scientific articles and three of them drawn from a women’s magazine. We discuss features concerning lexical complexity, sentence complexity, referential cohesion and lexical diversity, lexical semantics and personal style. For all these domains we examine the construct validity as well as the reliability of a number of important features. We conclude that T-Scan offers a number of promising lexical and syntactic features, while the interpretation of referential cohesion/lexical diversity features and personal style features is less clear. Further developing the application and analyzing authentic text need to go hand in hand.

Publication
Computational Linguistics in the Netherlands Journal
Date