I am Maarten van Gompel, also known as proycon on the internet. Welcome to my homepage, where I share some of my work and interests. I live in Eindhoven, the Netherlands, with my boyfriend Hans, our cat Sam and our dog Jaiko. I am a research software engineer (just a fancy term for computer programmer) in the field of Natural Language Processing at Radboud University in Nijmegen.
I am very passionate about both languages and technology, and especially those areas where they meet!
PhD Candidate in Natural Language Processing, 2011-2019
Radboud University, Nijmegen
MA in Human Aspects of Information Technology, 2009
BSc in Cognitive Artificial Intelligence, 2008
I’m an avid supporter of open-source software, also known as free software (free as in free speech), and have published a lot of it. Most of my software is hosted on GitHub. I also maintain packages for Arch Linux and Debian, and am active in promoting good research software quality & sustainability.
Labirinto is a virtual laboratory portal that makes a collection of software browsable and searchable for the end-user. Labirinto presents the software’s metadata following the CodeMeta specification in an intuitive way and allows the user to filter and perform a limited search. The portal gives access to the software itself where it offers web-based interfaces. This system is specifically geared towards research software, and for instance allows linking to relevant scientific publications for each tool.
LaMachine is a unified software distribution for Natural Language Processing. We integrate numerous open-source NLP tools, programming libraries, web-services, and web-applications in a single Virtual Research Environment that can be installed on a wide variety of machines.
FoLiA Document Server - HTTP webservice backend for serving and annotating FoLiA documents using the FoLiA Query Language (FQL). Used by FLAT.
Gecco is a generic, modular and distributed framework for spelling correction. It is aimed at building a complete context-aware spelling correction system from your own data set.
FLAT is a web-based linguistic annotation environment based around the FoLiA format, a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations. A wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
Colibri Core is software to quickly and efficiently count and extract patterns from large corpus data, to extract various statistics on the extracted patterns, and to compute relations between the extracted patterns.
FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA’s intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory.
PyNLPl, pronounced as ‘pineapple’, is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).
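To give an idea of what n-gram extraction and frequency lists involve, here is a minimal sketch in plain Python. It deliberately does not use PyNLPl itself: the function names `ngrams` and `frequency_list` are my own illustrative choices here and should not be read as the library's API.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def frequency_list(tokens, n=1):
    """Count how often each n-gram occurs in the token sequence."""
    return Counter(ngrams(tokens, n))


# Whitespace splitting stands in for a real tokenizer in this sketch.
tokens = "to be or not to be".split()
bigram_counts = frequency_list(tokens, 2)
```

A simple language model can be built on counts like these, for example by dividing each bigram count by the count of its first word.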
Quickly turn command-line applications into RESTful webservices with a web-application front-end. You provide a specification of your command-line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
Valkuil.net is an automatic spelling corrector for Dutch that detects ordinary typos as well as grammatical errors and confusions between existing words.
Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation.
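The steps described above (separating words from punctuation, splitting sentences, optionally changing case) can be illustrated with a deliberately naive sketch in plain Python. This is not how Ucto is implemented and not its interface: the regular expression, the set of sentence-ending marks, and the function names are all assumptions made for illustration only.

```python
import re

# Assumption: a word is a run of word characters, optionally joined by
# hyphens or apostrophes; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Separate words from punctuation."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Start a new sentence after each sentence-ending punctuation token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack a final punctuation mark
        sentences.append(current)
    return sentences


def preprocess(text, lowercase=False):
    """Tokenize, optionally lowercase, and split into sentences."""
    tokens = tokenize(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return split_sentences(tokens)


sentences = preprocess("Hello, world! It's fine.", lowercase=True)
```

A sketch like this breaks on abbreviations, ellipses and URLs, which is exactly the kind of case a dedicated tokenizer needs to handle.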
Through Nederlab, researchers and students can search and analyse large numbers of digitised Dutch-language texts, dating from ca. 800 to the present, together, using user-friendly text analysis software developed within Nederlab. Nederlab thus offers a laboratory for research into the patterns of change in Dutch language and culture.
Spreek2Schrijf is a feasibility study into the possibility of automatically converting the spoken language of the plenary sessions of the Tweede Kamer (the Dutch House of Representatives) into the written language currently used in the Handelingen (the official proceedings). It was commissioned by the Dienst Verslag en Redactie (DVR) of the Dutch Parliament.
A Dutch-Frisian machine translation project in collaboration with the Fryske Akademy.
Colibri is my PhD project; it investigates the role of source-language context information in machine translation.
The goal of DutchSemCor was to deliver a one-million-word Dutch corpus that is fully tagged with senses and domain tags.
The CLARIN infrastructure is a research infrastructure intended for humanities researchers who work with language data and tools.
UniLang is a language community on the internet for people interested in learning languages. I co-founded UniLang in October 2000 and it has been around ever since. UniLang has a wide variety of language resources and a community forum to meet like-minded people and learn together.
A phrasebook of various grammar constructions, intended for translation into many different languages as a parallel learning resource.
This is a translation and English adaptation of an Esperanto course by Wil van Ganswijk that was originally in Dutch. It contains 20 lessons.
Short story using basic vocabulary, intended for translation into many different languages as a parallel learning resource.
This vocabulary trainer covers the vocabulary of the course Taalverwerving Arabisch (Arabic Language Acquisition), lessons 1 through 29, at Utrecht University.
This word list enumerates all the vocabulary of the course Taalverwerving Arabisch (Arabic Language Acquisition), lessons 1 through 29, at Utrecht University.
A basic phrasebook, containing commonly used phrases, intended for translation into many different languages as a parallel learning resource.