NLP Resources

From the LDC Language Resource Wiki

{{Under construction}}
{{si|[[User:Mamandel|Mamandel]] 14:18, 22 May 2011 (UTC)}}
__TOC__

This page is for language-independent resources for computational natural language processing. <br>
Language-independent [[General Meta-resources]] that are not specific to NLP have their own page. <br>
For metadata standards and infrastructure see the [[General Meta-resources#Metadata_standards_and_infrastructure|General Meta-resources]] page.
==Software==

* [http://borel.slu.edu/crubadan/index.html An Crúbadán]: Corpus building for minority languages. Web crawling software by Kevin P. Scannell, {{Hq|designed to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources.}} {{si|[[User:Mamandel|Mamandel]] 00:25, 14 May 2010 (UTC)}}
* [http://www.apertium.org Apertium]. A free/open-source rule-based machine translation platform offering free linguistic data (morphological analysers, bilingual dictionaries, etc.) in XML formats for a range of languages.
* [http://sourceforge.net/projects/foma/ Foma]. {{hq|a compiler, programming language, and C library for constructing finite-state automata and transducers for various uses. It has specific support for many natural language processing applications such as producing morphological analyzers.}}
**[http://www.aclweb.org/anthology-new/E/E09/E09-2008.pdf Foma: a finite-state compiler and library]. Hulden, Mans. 2009. ''Proceedings of the EACL 2009 Demonstrations Session'', pages 29–32, Athens, Greece, 3 April 2009. PDF.
* [http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/ Helsinki Finite-State Transducer Technology (HFST)]. A free/open-source rewrite of the Xerox finite-state tools. It provides an implementation of both the <code>lexc</code> and <code>twolc</code> formalisms.
* [http://www.unlweb.net/unlweb/ Universal Networking Language (UNL)]. {{hq|an artificial language for representing, describing, summarizing, refining, storing and disseminating information in a natural-language-independent format. It is a kind of mark-up language which represents not the formatting but the core information of a text. As HTML annotations can be realized differently in the context of different applications, machines, displays, etc., so UNL expressions can have different realizations in different human languages.}}
* [http://beta.visl.sdu.dk/constraint_grammar.html VISL Constraint Grammar]. A free/open-source software reimplementation and extension of Fred Karlsson's Constraint Grammar formalism.
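Both Foma and HFST can compile lexicons written in the <code>lexc</code> formalism mentioned above. As a rough sketch of what that formalism looks like (the entries and tags below are invented for illustration, not taken from any distributed lexicon):

```lexc
! Minimal lexc sketch: two nouns and a plural suffix.
! Invented for illustration; compile with foma ("read lexc file.lexc")
! or with HFST's hfst-lexc tool.

Multichar_Symbols +N +Sg +Pl

LEXICON Root
cat   Noun ;
dog   Noun ;

LEXICON Noun
+N+Sg:0   # ;
+N+Pl:s   # ;
```

Applied in the analysis direction, the compiled transducer would map the surface string ''cats'' to the analysis ''cat+N+Pl''.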
==NLP Literature==

* [http://www.mt-archive.info/ Machine Translation Archive]. {{hq|Electronic repository and bibliography of articles, books and papers on topics in machine translation, computer translation systems, and computer-based translation tools. Latest update: 30 April 2011 [now containing over 7700 items]}} {{si|2011-05-10}} <br>{{hq|aims to cover comprehensively English-language publications since 1990. Papers and books from previous years are being added in order to provide good coverage from the beginnings of MT in the 1950s to 1990.}}
* Probabilistic tagging of minority language data: a case study using Qtag. Christopher Cox. 2010. In ''[http://www.rodopi.nl/senj.asp?BookId=LC+71 Corpus-linguistic applications]'', ed. Stefan Th. Gries, Stefanie Wulff, and Mark Davies. [http://www.rodopi.nl/ Rodopi]. Electronic: ISBN 9789042028012; hardback: ISBN 9789042028005. <br>Reviewed in [http://linguistlist.org/issues/21/21-3318.html LINGUIST List 21.3318] (2010-08-17) by Andrew Caines.
* [http://hypermedia.ids-mannheim.de/pls/lexpublic/bib_en.ansicht OBELEX: Online Bibliography of Electronic Lexicography]. {{hq|Articles, monographs, anthologies, and reviews from the field of electronic lexicography with a special focus on online lexicography.}} Search by full text, keyword, person, analysed languages, or publication year. {{hq|c. 600 entries}} {{si|2011-05-10}} ([http://hypermedia.ids-mannheim.de/pls/lexpublic/bib.ansicht German home page])
[[Category:Non-language-specific]]

Latest revision as of 14:18, 22 May 2011
