Text processing on the command line - sharing my tools
Introduction
I'm quite fond of the command-line and spend a larger chunk of my life in a
terminal emulator than I dare admit. I try to embrace the unix philosophy of
using tools that "do one thing only, and do it well" and "interconnect
largely via plain text formats". Following this philosophy, the output of one
program can typically be the input to the next, for instance through the pipe
operator (|) in a command-line shell.
Chaining multiple heterogeneous tools in this way gives a great amount of power and flexibility, something that's much harder to achieve through complex monolithic GUIs. It allows you to quickly automate things in shell scripts and do batch processing, even if the underlying tools are written in different languages.
For text processing and data science, the unix shell and environment form a kind of lingua franca; a common foundation upon which we can build our data processing pipelines. This environment is usually a unix-like system such as Linux or macOS that offers a (POSIX-compliant) shell like bash and a set of core utilities such as those provided by GNU coreutils and friends, by FreeBSD/OpenBSD/NetBSD/macOS itself, or by busybox. All of these are different implementations of the same core utilities, following a common standard specification (POSIX). Even Windows users have access to such a command-line environment via the WSL, or alternatively via Cygwin.
I'll first mention some of these standard unix tools, then some additional tools, and finally I'll move on to the main subject of this writing: my own tools, the command-line text processing tools which I developed (or co-developed) myself and want to share with you.
Basic standard tools
Out of the box, a unix environment usually gives you text processing tools like:
- awk - pattern scanning and processing language
- cat - print or concatenate text (or tac to do it in reverse)
- cut - extract/remove columns from lines
- column - format input into multiple columns
- colrm - remove columns from a file
- comm - compare two sorted files line by line
- dos2unix - convert DOS/Windows line endings to UNIX (\r\n -> \n)
- diff - compare files line by line and show where they differ
- echo/printf - print text
- expand - convert tabs to spaces
- fold - wrap each input line to fit in a specified width
- head/tail - extract the first/last lines of a text
- grep - print lines that match patterns (regular expressions)
- nl - assign numbers to lines in a file
- paste - merge lines from multiple files into one
- rev - reverse lines character-wise
- tr - translate or delete characters (here translation is just a mapping between characters, not between natural languages)
- sed - stream editor for filtering and transforming text
- shuf - shuffle lines in a random order
- split - split a file into several equally-sized pieces
- sort - sort lines
- uniq - remove duplicate lines
- wc - word count, line count, byte count
They shine when used together. Take the example of having a plain text file, and wanting to extract a top 1000 frequency list of the words in it, all lower-cased. All this can be captured in the following one-liner:
$ sed -E 's/\W+/\n/g' text.txt | tr '[:upper:]' '[:lower:]' | tr -s '\n' | sort | uniq -c | sort -rn | head -n 1000
An analysis of the many possibilities of these common unix tools for text processing and data analysis is well out of scope for this writing. I do not intend to write a tutorial here. If you are interested in such a thing, I can warmly recommend the O'Reilly book Data Science at the Command Line by Jeroen Janssens, freely accessible on-line. It also covers some of the tools I mention in the next section, and more.
Common additional tools
In addition to these standard unix tools that are often part of the core system, there are more specialised tools I can recommend for text & data processing. The following few work with structured data like JSON, CSV or XML and overlap partially in functionality:
- jq - like sed but specifically for extracting and manipulating JSON data, written in C.
- Miller (mlr) - like awk, sed, cut, join, and sort for name-indexed data such as CSV and tabular JSON. Written in Go.
- xsv - CSV command line toolkit, written in Rust.
- dasel - to query and modify JSON/CSV/YAML/XML, written in Go.
- csvkit - suite of utilities for working with CSV (csvlook, csvcut, csvgrep, csvjson, csvsql, csvstat), written in Python.
- xmllint in libxml2 - to validate and query XML files (xpath), written in C.
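To give a small taste: extracting a single field from JSON with jq, and converting CSV to JSON with Miller (data.json and data.csv are just placeholders):
$ jq '.name' data.json
$ mlr --icsv --ojson cat data.csv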
For visualisation purposes, I first have to give an honourable mention to less and more, both of which do more or less the same (pun intended); they serve as so-called pagers. For a richer experience, I can recommend the following:
- bat - a cat clone and/or less alternative with fancy features like syntax highlighting; great for viewing all kinds of files, written in Rust.
- glow - a Markdown viewer, written in Go.
- csvlook from csvkit - viewer for CSV files, written in Python.
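For instance (any files of your own will do here):
$ bat src/main.rs
$ glow README.md
$ csvlook data.csv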
When it comes to conversion of texts, I recommend the following:
- pandoc - A universal document converter. Converts between many document formats such as LaTeX/Markdown/ReStructuredText/plain text/Asciidoc/Word/Epub/Roff/etc. Written in Haskell.
- iconv - convert plain text from one character encoding to another. Written in C.
- uconv from icu - convert plain text from one character encoding to another. Written in C.
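For instance, converting a Markdown document to HTML with pandoc, or re-encoding a Latin-1 text file to UTF-8 with iconv (the filenames are placeholders):
$ pandoc -o report.html report.md
$ iconv -f ISO-8859-1 -t UTF-8 legacy.txt > legacy-utf8.txt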
To combine these and all kinds of other data processing tools into larger automated pipelines, the most basic solution is to write a shell script. Alternatively, you can write a Makefile that builds targets using make. The latter also offers a certain degree of parallelisation using the -j parameter. Another useful tool to consider for parallelisation is parallel and/or xargs with the -P parameter. The latter, xargs, is also a basic standard tool useful when building sequential pipelines.
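As an illustration, here is a sketch of converting a batch of Markdown files to HTML, four jobs at a time, with either xargs or GNU parallel (the filenames are placeholders):
$ find . -name '*.md' -print0 | xargs -0 -P 4 -I{} pandoc -o {}.html {}
$ parallel pandoc -o {.}.html {} ::: *.md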
Remembering how to invoke all these command line tools may be rather daunting, and nobody expects you to remember everything anyway. Of course, a manual page should be available by simply running man followed by the command you need help on. You should also be able to get usage information from the tool itself by passing -h and/or --help. If you're lazy like me and prefer to quickly get some very short concise usage examples for a number of common use cases for a tool, then I can recommend tldr (internet slang for "too long; didn't read").
Intermezzo: What about Python/Julia/R etc...?
Bear with me before I finally get to show you my own tools. At this point some readers may wonder: all this command-line stuff is fine, but why not just use Python, or R, or Julia, or whatever other open-source language implementation you prefer? (This excludes proprietary ecosystems like Matlab and Wolfram, which I explicitly condemn.) Languages like Python/Julia/R come with a solid standard library and, on top of that, a wide range of third-party libraries to accommodate all kinds of text processing needs or whatever else a data scientist might desire. Furthermore, they too come with an interactive shell, or can be used in the web browser in the form of notebooks (Jupyter/Pluto/RStudio). Some may find the latter more appealing than a terminal, especially for data stories and teaching.
Indeed, I say in response, these are all perfectly fine solutions and great projects. Such language ecosystems tend to give you tighter coupling between components and cleaner abstractions than the unix shell can. The shell often has more archaic syntax and interoperability is not always optimal because each tool can be very different. The shell, however, does provide a kind of lowest common denominator that no other environment can. It offers a great deal of flexibility with regard to the tools you use. It has been around for decades and will likely last a while longer. Whether your command-line interface is written in C, C++, Rust, Go, Zig, Hare, R, Python, Perl, Haskell, Lisp, Scheme, or God forbid, even Java, it can all be readily mixed, just by virtue of having a command-line interface that reads from a file or standard input and writes to a file or standard output. Committing yourself to a higher-level language, on the other hand, adds an extra layer, often in the form of a high-level language interpreter which has its own overhead and limits certain choices. This has both benefits and drawbacks. The trade-off is often between diversity and uniformity.
This is not a question of one method being inherently superior to the other, it all depends on your use case, the complexity thereof, the technologies you and those working with you are familiar and comfortable with, and your intended audience; the users. You then choose the form of interface that fits these conditions best; whether it is invoking tools via command-line interfaces, calling functions in software libraries via an API, or WebAPI, or even clicking buttons in GUIs.
My tools
In this writing, I would like to introduce some of the tools I wrote over the years for text processing, often in the course of my work. As mentioned in the introduction, these tools are largely specialised in doing one thing and doing it well, as per the unix philosophy. As the years go by, I'm more and more drawn to simpler solutions, where simple entails:
- keeping the scope of a tool limited, constraining the amount of features
- keeping a codebase maintainable by not letting it grow too large
- limiting the number of dependencies (and in doing so limiting security vulnerabilities as well)
- getting the job done with optimal performance, reducing the amount of unnecessary overhead.
Rust is my language of choice nowadays, as it compiles to highly performant native code on a variety of platforms and processor architectures. It offers important safety guarantees that prevent a whole range of common memory bugs that are prevalent in other systems programming languages such as C and C++. You will therefore find that a lot of my tools are written in Rust, often with a Python binding on top for accessibility to researchers and developers coming from Python. All software I write is free and open source software, almost always available under the GNU General Public License v3.
I will list the tools in approximate order from simpler to more complex:
- charfreq - Tiny tool that just counts unicode character frequencies from text it receives on standard input.
- hyphertool - Hyphenation tool, performs wordwrap at morphologically sensible places for a number of languages.
- ssam - Split sampler to draw train/development/test samples from plain text data using random sampling.
- lingua-cli - Language detection.
- sesdiff - Computes a shortest edit script that shows the difference between two strings on a character level.
- lexmatch - Matching texts against one or more lexicons.
The above tools were all fairly small; now we're moving on to bigger software projects, mostly developed in the scope of my work at the KNAW Humanities Cluster and Radboud University Nijmegen, often under the umbrella of the CLARIN-NL and CLARIAH projects.
- stam - Toolkit for stand-off annotation on text
- ucto - Unicode tokeniser
- frog - NLP suite for Dutch
- analiticcl - Fuzzy string-matching system used e.g. for spelling correction, text normalisation or post-OCR correction.
- colibri-core - Efficient pattern extraction (n-grams/skipgrams/flexgrams) and modelling from text.
- folia-tools - Toolkit for working with the FoLiA XML format.
Last, a small tool that is a bit of an odd-one-out in this list, but which I wanted to include anyway:
- vocage - Vocabulary training with flashcards (spaced-repetition system aka Leitner)
I'll briefly discuss each of the mentioned programs. Click the above links to quickly jump to the relevant section.
charfreq
charfreq
is a very simple tiny CLI tool that just computes (unicode) character frequencies from text received via standard input. Written in Rust.
You can find charfreq
on Sourcehut or Github.
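Usage is about as minimal as the tool itself; something along these lines (with text.txt being any plain text file, and the exact output columns possibly differing):
$ charfreq < text.txt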
hyphertool
hyphertool
does hyphenation and builds upon the third party
hypher library, which in turn uses rules from
the TeX hyphenation library. It is written in Rust.
$ hyphertool --language nl --width 15 test.txt
Dit is een test-
bestand. Kan je
dit bestand mooi
voor mij verwer-
ken?
Ik hoop op
een positief re-
sultaat.
It can also output all syllables using hyphenation rules:
$ hyphertool --language nl test.txt
Dit is een test-be-stand. Kan je dit be-stand mooi voor mij ver-wer-ken?
Ik hoop op een po-si-tief re-sul-taat.
Or with character offsets:
$ hyphertool --language nl --standoff test.txt
Text BeginOffset EndOffset
...
Ik 68 70
hoop 71 75
op 76 78
een 79 82
po 83 85
si 85 87
tief 87 91
re 92 94
sul 94 97
taat 97 101
For more information, see hyphertool
on
Sourcehut,
Github or
Crates.io.
ssam
I wrote ssam, short for Split sampler, as a simple program that splits one or more text-based input files into multiple sets using random sampling. This is useful for splitting data into training, test and development sets, or whatever sets you desire. It is written in Rust.
It works nicely with multiple files when entries on the same lines correspond (such as sentence-aligned parallel corpora). Suppose you have a sentences.txt and a sätze.txt that contains the same sentences in German (i.e. the same line numbers correspond and contain translations). You can then make a dependent split as follows:
$ ssam --shuffle --sizes "0.1,0.1,*" --names "test,dev,train" sentences.txt sätze.txt
For more information, see ssam
on
Sourcehut,
Github or
Crates.io.
sesdiff
sesdiff
is a small and fast command-line tool and Rust library that reads a
two-column tab separated input from standard input and computes the shortest
edit script (Myers' diff algorithm) to go from the string in column A to the
string in column B. In other words, it computes how strings differ. It also computes the edit distance (aka Levenshtein distance). It builds upon the
dissimilar library by David Tolnay for
the bulk of the computations.
$ sesdiff < input.tsv
hablaron hablar =[hablar]-[on] 2
contaron contar =[contar]-[on] 2
pidieron pedir =[p]-[i]+[e]=[di]-[eron]+[r] 6
говорим говорить =[говори]-[м]+[ть] 3
The output is in a four-column tab separated format (reformatted for legibility here, the first two columns correspond to the input).
It can also do the reverse: given a word and an edit recipe, compute the other string.
It was initially designed to compute training data for a lemmatiser.
For more information, see sesdiff
on
Sourcehut,
Github or
Crates.io.
lingua-cli
lingua-cli
is a command-line tool for language detection. Given a text,
it will predict what language the text is in. It supports many languages. It can
also predict per line of the input, as the underlying algorithm is particularly
suited to deal with shorter text, or it can actively search the text for
languages and return offsets. This program is mostly a wrapper around the
lingua-rs library by Peter M. Stahl. It is written in Rust. The
3rd party library is also available for
python,
go, and Javascript via
WASM.
$ echo -e "bonjour à tous\nhola a todos\nhallo allemaal" | lingua-cli --per-line --languages "fr,de,es,nl,en"
fr 0.9069164472389637 bonjour à tous
es 0.918273871035807 hola a todos
nl 0.988293648761749 hallo allemaal
Output is TSV and consists of an iso-639-1 language code, confidence score, and in line-by-line mode, a copy of the line.
In --multi
mode, it will detect languages in running text and return UTF-8 byte offsets:
$ lingua-cli --multi --languages fr,de,en < /tmp/test.txt
0 23 fr Parlez-vous français?
23 73 de Ich spreche ein bisschen spreche Französisch ja.
73 110 en A little bit is better than nothing.
For more information, see lingua-cli
on
Sourcehut,
Github or
Crates.io.
lexmatch
lexmatch
is a simple lexicon matching tool that, given a lexicon of words or
phrases, identifies all matches in a given target text, returning their exact
positions. It can be used to compute a frequency list for a lexicon on a target
corpus. The implementation uses either suffix arrays or hash tables. The
lexicon is in the form of one word or phrase per line. Multiple lexicons are supported. It is written in Rust.
$ lexmatch --lexicon lexicon.lst corpus.txt
Rather than take a lexicon, you can also directly supply lexical entries (words/phrases) on the command line using --query:
$ lexmatch --query good --query bad /tmp/republic.short.txt
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 4 193 3307 3480 278
bad 3 201 3315 3488
By default, you get a TSV file with a column for the text, the occurrence
count, and one with the begin position (UTF-8 byte position) for each match
(dynamic columns). With --verbose you get cleaner TSV with separate UTF-8 byte offsets for each match:
$ lexmatch --verbose --query good --query bad /tmp/republic.short.txt
Text BeginUtf8Offset EndUtf8Offset
Reading text from /tmp/republic.short.txt...
Building suffix array (this may take a while)...
Searching...
good 193 197
good 3307 3311
good 3480 3484
good 278 282
bad 201 204
bad 3315 3318
bad 3488 3491
For more information, see lexmatch
on
Sourcehut,
Github or
Crates.io.
stam
STAM Tools is a collection of command-line programs for working with stand-off annotations on plain text. These all make use of the STAM data model and accompanying library for representing annotations, which can also be exported as simple CSV or TSV. They are written in Rust.
Each can be accessed through the executable stam and a subcommand. I mention just a few for brevity:
- stam align - Align two similar texts, mapping their coordinate spaces. This looks for similar strings in each text using the Needleman-Wunsch or Smith-Waterman algorithm.
- stam fromxml - Turn text and annotations from XML-based formats (like xHTML, TEI) into plain text with separate stand-off annotations following the STAM model. This effectively 'untangles' text and annotations.
- stam query or stam export - Query an annotation store and export the output in tabular form to a simple TSV (Tab Separated Values) format. This is not lossless but provides a decent view on the data. It provides a lot of flexibility by allowing you to configure the output columns as you see fit.
- stam validate - Validate a STAM model, i.e. test whether all annotation offsets still point to the intended text selections.
- stam tag - Regular-expression based tagger on plain text; it also serves as a good tokeniser.
- stam transpose - Given an alignment between texts (as computed by stam align), converts annotations from one coordinate space to another.
- stam view - View annotations as queried by outputting to HTML (or ANSI coloured text).
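To give a quick taste, validating an annotation store from the shell looks roughly like this (a sketch only; the store filename is a placeholder and the documentation linked below gives the authoritative syntax):
$ stam validate my.store.stam.json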
A demo video is available that demonstrates all these and more.
See STAM for more information about the
data model and the general project, and
stam-tools for information
specifically about these command-line tools. A python
binding is also available (pip install stam).
ucto
Ucto is a regular-expression based tokeniser. It comes with tokenisation rules for about a dozen languages. It is written in C++, initially by me but then in much larger part by my colleague Ko van der Sloot.
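A typical invocation looks something like this, tokenising a Dutch plain-text file and writing the result to a new file (a minimal sketch from memory; -L selects the language and the filenames are placeholders):
$ ucto -L nld input.txt output.txt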
See the Ucto website for more information. A Python binding
is also available (pip install ucto).
frog
Frog is a tool that integrates various NLP modules for Dutch, such as a tokeniser (ucto), part-of-speech tagger, lemmatiser, named-entity recogniser, dependency parser and more. It is written in C++ by Ko van der Sloot as lead developer, and by myself, Antal van den Bosch and, in its initial stages, Bertjan Busser. It has a long history, with various components also made possible by Walter Daelemans, Jakub Zavrel, Sabine Buchholz, Sander Canisius, Gert Durieux and Peter Berck.
Though parts of Frog have been superseded by more recent advancements in NLP, it is still a useful tool in many use cases.
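A typical invocation on a plain-text file looks something like this (a minimal sketch from memory; -t points Frog at an input file and the analysis is written to standard output):
$ frog -t input.txt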
I also wrote a python binding so you
can use Frog directly from Python (pip install frog).
See the Frog website for more information.
analiticcl
Analiticcl is an approximate string matching system. It can be used for spelling correction or text normalisation and post-OCR correction. It does comparisons against one or more lexicons. This may sound similar to lexmatch which I introduced earlier. However, analiticcl goes a lot further than lexmatch. It does not require exact matches with the lexicon but is designed to detect spelling variants, and to do so efficiently. It is written in Rust and builds upon earlier research by Martin Reynaert.
$ analiticcl query --lexicon examples/eng.aspell.lexicon --alphabet examples/simple.alphabet.tsv --output-lexmatch --json < input.tsv > output.json
[
{ "input": "seperate", "variants": [
{ "text": "separate", "score": 0.734375, "dist_score": 0.734375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "desperate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "operate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "temperate", "score": 0.6875, "dist_score": 0.6875, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "serrate", "score": 0.65625, "dist_score": 0.65625, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "separated", "score": 0.609375, "dist_score": 0.609375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] },
{ "text": "separates", "score": 0.609375, "dist_score": 0.609375, "freq_score": 1, "lexicons": [ "examples/eng.aspell.lexicon" ] }
] }
]
See the analiticcl source repository for more information. A python
binding is also available (pip install analiticcl).
colibri-core
Colibri Core is software to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns. The employed notion of pattern or construction encompasses ngrams, skipgrams and flexgrams (an abstract pattern with one or more gaps of variable-size).
N-gram extraction may seem fairly trivial at first: with a few lines in your favourite scripting language, you can move a simple sliding window of size n over your corpus and store the results in some kind of hashmap. This trivial approach, however, makes an unnecessarily high demand on memory resources, which often becomes prohibitive when unleashed on large corpora. Colibri Core tries to minimise these space requirements.
Colibri Core is written in C++; a Python binding is also available.
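In practice this involves two steps: the corpus is first class-encoded into a compact binary representation, and a pattern model is then trained on that. Roughly along these lines (a sketch; the threshold -t and maximum length -l values are just examples, consult the documentation for the exact options):
$ colibri-classencode corpus.txt
$ colibri-patternmodeller -f corpus.colibri.dat -t 2 -l 3 -o corpus.colibri.patternmodel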
A demo video is available.
See the Colibri Core website for more information. A journal article "Efficient n-gram, Skipgram and Flexgram modelling with colibri-core" was published in the Journal of Open Research Software (2016), and was also included in my PhD dissertation.
folia-tools
FoLiA tools is a set of numerous
command-line tools I wrote for working with the FoLiA format.
FoLiA is an XML-based Format for Linguistic
Annotation. It is written in Python (pip install folia-tools). My colleague
Ko van der Sloot wrote complementary C++ tooling to work with the same format,
these are called foliautils.
Both contain a wide variety of programs such as validators, converters from and to other document and annotation formats (TEI, PageXML, Abby, plain text, ReStructuredText, CONLL, STAM) etc.
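To give an impression, validating a FoLiA document and extracting its plain text can be done along these lines (the filename is a placeholder):
$ foliavalidate document.folia.xml
$ folia2txt document.folia.xml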
vocage
Vocage may be a bit of an odd-one-out in this list, as it is not really a text processing tool as such, but rather a TUI (Terminal User Interface) for studying language vocabulary (which still counts as text, so that is excuse enough for inclusion here). I wrote it in Rust.
Vocage works by presenting flashcards using a spaced-repetition algorithm, which means it shows cards (e.g. words) you know well less frequently than cards you do not know yet.
The difference with more common software like Anki is that vocage is fully terminal-based, has vim keybindings, and stores its data in a simple TSV format (both the word lists and the learning progress go in the same TSV file).
It was featured in a video by Brodie Robertson in 2021.
For more information, see vocage
on
Sourcehut,
Github or
Crates.io.
Conclusion
I hope I have been able to convey some of my enthusiasm for the unix command-line for text processing and natural language processing. Moreover, I hope you find some of my software useful for some of your own projects.
You can publicly comment on this post on the fediverse (e.g. using mastodon).