NLP Resources

From the LDC Language Resource Wiki

Revision as of 00:25, 14 May 2010 by Mamandel (Talk | contribs)



(Ftyers 19:13, 22 April 2010 (UTC))

This page is for language-independent NLP resources.



A free/open-source rule-based machine translation platform offering free linguistic data (morphological analysers, bilingual dictionaries, etc.) in XML formats for a range of languages.


An Crúbadán

Home page for An Crúbadán, web-crawling software by Kevin P. Scannell designed for building corpora for minority languages. From the site: [Mamandel 00:25, 14 May 2010 (UTC)]

Statistical techniques are a key part of most modern natural language processing systems. Unfortunately, such techniques require the existence of large bodies of text, and in the past corpus development has proved to be quite expensive. As a result, substantial corpora exist primarily for languages like English, French, German, etc. where there is a market-driven need for NLP tools.
My software is designed to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources. Initially it was deployed for the six Celtic languages, but more recently I've added support for a number of other languages from all parts of the world. You can find an up-to-date list of languages and the corpus statistics for each on the Status Page. There is also information on tools developed using these corpora on the Applications Page.



The Helsinki finite-state toolkit is a free/open-source rewrite of the Xerox finite-state tools. It provides implementations of both the lexc and twolc formalisms.


Machine Translation Archive

Electronic repository and bibliography of articles, books and papers on machine translation, computer translation systems, and computer-based translation tools, with more than 6,400 items. It aims to be comprehensive for English-language publications since 1990; earlier papers and books are being added to provide partial coverage from the 1950s. [Mamandel 20:53, 22 April 2010 (UTC)]


Online Bibliography of Electronic Lexicography (OBELEX). Covers all relevant articles, monographs, anthologies and reviews since 2000, plus some older relevant works, with a focus on online lexicography. Dictionaries themselves are not included; they are covered in a supplementary database now under construction. Searchable by full text, keyword, person, analysed languages, or publication year. (Mamandel 22:26, 28 April 2010 (UTC))


An XML-based format for translation memories.
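
To give a feel for what such a format looks like in practice, here is a minimal Python sketch that reads translation units from a small XML translation memory. It assumes the entry refers to TMX (Translation Memory eXchange) and follows the usual TMX 1.4 layout of tu, tuv and seg elements; the sample document, the header values and the function name read_units are hypothetical and made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, modelled on the TMX 1.4 layout (tu > tuv > seg).
SAMPLE = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0.1"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="ga"><seg>Maidin mhaith</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you</seg></tuv>
      <tuv xml:lang="ga"><seg>Go raibh maith agat</seg></tuv>
    </tu>
  </body>
</tmx>
"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(text):
    """Return one dict per translation unit, mapping language code to segment text."""
    root = ET.fromstring(text)
    units = []
    for tu in root.iter("tu"):
        unit = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                unit[tuv.get(XML_LANG)] = "".join(seg.itertext())
        units.append(unit)
    return units


if __name__ == "__main__":
    for unit in read_units(SAMPLE):
        print(unit)
```

Each translation unit holds the same segment in two or more languages, so a dictionary keyed by language code is a natural in-memory shape for lookups. A real file may carry inline markup and metadata that this sketch ignores.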


Universal Networking Language

[from the home page]: The Universal Networking Language (UNL) is an artificial language for representing, describing, summarizing, refining, storing and disseminating information in a natural-language-independent format. It is a kind of mark-up language which represents not the formatting but the core information of a text. As HTML annotations can be realized differently in the context of different applications, machines, displays, etc., so UNL expressions can have different realizations in different human languages. [Mamandel 20:26, 6 May 2010 (UTC)]


VISL Constraint Grammar

A free/open-source software reimplementation and extension of Fred Karlsson's Constraint Grammar formalism.

