Sandbox
From the LDC Language Resource Wiki
Revision as of 20:35, 10 May 2011
The Sandbox is a place to play. Use this page for practicing wiki editing, making links, anything! Don't expect anything you put here to last.
- Learn how to manipulate the Wiki.
- What Can I Do?
- I can make things bold ('''bold''').
- I can italicize (''italicize'').
- I can timestamp and sign: Mamandel 14:52, 22 April 2010 (UTC) (four tildes: ~~~~)
- or just timestamp: 14:52, 22 April 2010 (UTC) (five tildes: ~~~~~)
- or just sign: Mamandel (three tildes: ~~~)
- I can make an external link ([http://ldc.upenn.edu external link] -- space between URL and text).
- I can make an internal link ([[Bengali/Bengali|internal link]] -- pipe character '|' between page title and text).
I can make text preformatted and in a box by starting the line with white space (note: no auto-wrapping).
Some magic words and what they produce:
- {{SERVER}}: http://lrwiki.ldc.upenn.edu
- {{PAGENAME}}: Sandbox
For much, much more info see Mediawiki's editing help.
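The inline markup listed above (bold, italics, external and internal links) can be illustrated with a small sketch. This is not MediaWiki's parser: it is a minimal regex-based approximation covering only those four constructs, the function name `render_inline` is hypothetical, and the `/wiki/` path prefix for internal links is an assumption about how the site builds page URLs.

```python
import re


def render_inline(wikitext: str) -> str:
    """Convert a small subset of inline wiki markup to HTML (illustrative only)."""
    html = wikitext
    # Bold is handled before italics because ''' contains ''.
    html = re.sub(r"'''(.+?)'''", r"<b>\1</b>", html)
    html = re.sub(r"''(.+?)''", r"<i>\1</i>", html)
    # Internal link: [[Page title|label]] (pipe between page title and text).
    # The "/wiki/" prefix is an assumed URL layout, not taken from the page.
    html = re.sub(
        r"\[\[([^\]|]+)\|([^\]]+)\]\]",
        r'<a href="/wiki/\1">\2</a>',
        html,
    )
    # External link: [URL label] (space between URL and text).
    html = re.sub(
        r"\[(https?://\S+) ([^\]]+)\]",
        r'<a href="\1">\2</a>',
        html,
    )
    return html


if __name__ == "__main__":
    print(render_inline("I can make things '''bold''' and ''italic''."))
    print(render_inline("[http://ldc.upenn.edu external link]"))
    print(render_inline("[[Bengali/Bengali|internal link]]"))
```

Signatures (tildes), magic words, and preformatted blocks are expanded by the wiki itself when a page is saved or rendered, so they are deliberately left out of this sketch.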
FEEL FREE TO DELETE ANYTHING BELOW THE DOUBLE LINE,
BUT DON'T TOUCH THE DOUBLE LINE OR ANYTHING ABOVE IT. THANKS.
-- The Mgt.
WELCOME TO THE SANDBOX
UNDER CONSTRUCTION
[Mamandel 20:19, 10 May 2011 (UTC)]
This page is for language-independent resources for computational natural language processing.
Language-independent General Meta-resources that are not specific to NLP have their own page.
Software
- An Crúbadán: Corpus building for minority languages. Web crawling software “designed to exploit the vast quantities of text freely available on the web as a way of bringing the benefits of statistical NLP to languages with small numbers of speakers and/or limited computational resources.” Kevin P. Scannell. [Mamandel 00:25, 14 May 2010 (UTC)]
- Apertium. A free/open-source rule-based machine translation platform offering free linguistic data (morphological analysers, bilingual dictionaries, etc.) in XML formats for a range of languages.
- Foma: a finite-state compiler and library. Hulden, Mans. 2009. Proceedings of the EACL 2009 Demonstrations Session, pages 29–32, Athens, Greece, 3 April 2009. PDF
- Helsinki Finite-State Transducer Technology (HFST). A free/open-source rewrite of the Xerox finite-state tools. It provides an implementation of both the lexc and twolc formalisms.
- Universal Networking Language (UNL). “an artificial language for representing, describing, summarizing, refining, storing and disseminating information in a natural-language-independent format. It is a kind of mark-up language which represents not the formatting but the core information of a text. As HTML annotations can be realized differently in the context of different applications, machines, displays, etc., so UNL expressions can have different realizations in different human languages.”
- VISL Constraint Grammar. A free/open-source software reimplementation and extension of Fred Karlsson's Constraint Grammar formalism.
NLP Literature
- Machine Translation Archive. Electronic repository and bibliography of articles, books and papers on topics in machine translation, computer translation systems, and computer-based translation tools. More than 6,400 items. Aims to be comprehensive on English-language publications since 1990; earlier papers and books are being added to provide partial coverage from the 1950s.
- Corpus-linguistic applications, ed. Stefan Th. Gries, Stefanie Wulff, and Mark Davies. Rodopi. Electronic: ISBN 9789042028012; hardback: ISBN 9789042028005.
  - Probabilistic tagging of minority language data: a case study using Qtag. Christopher Cox. 2010.
  - Reviewed in LINGUIST List 21.3318 by Andrew Caines (2010-08-17):
    - “Cox's theme is corpus planning. He considers the tagging process, and evaluates the time-accuracy trade-off in using (a) normalized/unnormalized orthography; (b) various chunk sizes for rounds of iterative, interactive tagging; (c) tagset size. He does so in the context of corpus building for minority languages which are on the whole associated with more modest resources than major language projects.
    - “Cox considers what is required to tag a minority-language corpus. He finds that orthographically normalized data is 20% more accurate but more expensive to prepare, that smaller chunks are preferable for iterative interactive tagging, and that a less elaborate tagset is more accurate and efficient. Cox notes that these observations must be set against the purpose of the corpus and the requirements of the researchers who will be using it. This is a well-written paper with well-defined research questions and conclusions which are explicitly linked back to them -- an attribute which cannot be taken for granted in academic literature.”
- OBELEX: Online Bibliography of Electronic Lexicography. All relevant articles, monographs, anthologies and reviews since 2000 and some older relevant works. Focus is on online lexicography. Dictionaries are not included here; they are covered in a supplementary database now under construction. Search by full text, keyword, person, analysed languages, or publication year. [Mamandel 22:26, 28 April 2010 (UTC)]
  - Home page in German.
  - Announcement on LINGUIST List [19-Apr-2010]