General Meta-resources

From the LDC Language Resource Wiki


[Mamandel 17:06, 4 May 2011 (UTC)]

This page is for meta-resources that are applicable to many languages and are not specifically for computational natural language processing.
Language-independent NLP Resources have their own page.


Resource Organizations

  • Crowdsourcing the Development of Underserved Language Resources. At Random Hacks of Kindness. This project aims at ... leveraging open content, mobile technologies and crowd-sourcing to create language resources for the underserved world languages and make them available under open licenses to stimulate research and development in the area of Human Language Technologies (HLT). The project will use existing open text repositories (such as Wikipedia) in languages such as Swahili, Arabic and Urdu, and will create a crowd-sourcing mechanism for developing these text repositories into language corpora. [121120: last modified 071024]
  • Documentation of Endangered Languages (DOBES: Dokumentation Bedrohter Sprachen). [DoBeS Programme] In 2000 the Volkswagen Foundation started the DOBES programme in order to document languages that are potentially in danger of becoming extinct within a few years' time. In 2000 the pilot phase was started with seven documentation teams and one archiving team, with the intention of coming up with recommendations on how language documentation can work and how the archiving can best be done. Since then, a few new documentation teams have been selected each year to carry out significant documentation work within 3 years. Until now, 50 documentation projects have been funded and there will be calls for concrete documentation projects until 2011. In 2006 the first documentation teams finished their contractual phase, but many teams still carry on with the documentation work. Yearly workshops are held in which all past and present documentation projects meet in order to exchange experiences and results. [121120: last modified 120123]
  • Endangered Language Alliance*: A trio of poet, professor, and field linguist have combined forces in the heart of New York City to document, support, and protect one of the most precious stores of cultural, scientific, and creative human knowledge: living languages. The Endangered Language Alliance (ELA, pronounced ay-la) is a new organization whose goal “is to further the documentation, description, maintenance, and revitalization of threatened and endangered languages, and to educate the public about the causes and consequences of language extinction.” In a small office on West 18th Street known as the Urban Fieldstation, endangered languages are being spoken, recorded, and translated before they possibly recede further into the margins. [130116]
    *not to be confused with the Endangered Languages Project, below
  • Endangered Languages Project*: About half of the world’s approximately 7,000 languages are at risk of disappearing in the next 100 years. ... The Endangered Languages Project puts technology at the service of the organizations and individuals working to confront language endangerment by documenting, preserving and teaching these languages. Through this website, users can not only access the most up-to-date and comprehensive information on endangered languages, as well as samples provided by partners, but also play an active role in putting their languages online by submitting information or samples in the form of text, audio or video files. In addition, users will be able to share best practices and case studies through a knowledge sharing section and through joining relevant Google Groups. [120924]
    *not to be confused with the Endangered Language Alliance, above
  • Glottolog/Langdoc: comprehensive bibliographical and other reference information for the world's languages, especially the lesser known languages. (Max Planck Institute for Evolutionary Anthropology) [121126]
    • Glottolog is a comprehensive catalogue of the world's languages, language families and dialects. It provides a conservative genealogical classification (the Glottolog tree) which assigns a unique and stable ID number for (potentially) all languoids, i.e. all families, languages, and dialects (and in the future also sociolects). Any variety that a linguist works on should eventually get its own entry.
      • [272 families, comprising 33399 languoids, of which 10175 have ISO 639-3 codes]
    • Langdoc is a comprehensive collection of bibliographical data for the world's lesser known languages. It provides access to more than 180,000 references of descriptive works such as grammars, dictionaries, word lists, texts etc.
    • Both Glottolog and Langdoc will be continuously expanded and improved with the help of their users. The input of expert linguists is crucial.
  • The IMDI Metadata Domain allows you to browse and search the whole domain of linked IMDI metadata descriptions as they are registered at the IMDI portal at the MPI for Psycholinguistics. All metadata descriptions are openly accessible; for many of the resources themselves, however, one needs to ask for access permission. [See also IMDI below under Metadata standards and infrastructure.] [121127]
  • Language Description Heritage Open Access Library. The goal of the Language Description Heritage (LDH) Open Access Digital Library is to provide easy access to descriptive material about the world’s languages. This collection is being compiled at the Max Planck Society in Germany as an open access digital repository of existing scientific contributions describing the world-wide linguistic diversity, focussing on traditionally difficult to obtain works. [121120]
  • LDC: Linguistic Data Consortium (University of Pennsylvania). Supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards. [121126]
  • LLMap: designed to integrate language information with data from the physical and social sciences by means of a Geographical Information System (GIS). ... relates geographical information (topography, political boundaries, demographics, climate, vegetation, and wildlife) ... to data on resources relevant to the language. A link to the Multi-Tree project provides information on all proposed genetic relationships of the languages, viewable in a geographic context. ... The system encourages collaboration between linguists, historians, archaeologists, ethnographers, and geneticists.
    • Search page. Alternatively, insert the language's ISO 639-3 code in http://llmap.org/languages/___.html
    • Each language's page has tabs for: Maps, Family Trees, Books, Papers, Dissertations, Researchers, OLAC References, Links
    • [This project is fairly new, and maps for many languages, e.g., Indonesian, are not available.] [121120]
  • Multi-Tree: a searchable database of hypotheses on language relationships
    • compare language trees and access bibliographical information on them
    • see a graphical representation of every scholarly hypothesis on language relationships
    • view information on every language
    • share comments on hypotheses and add new hypotheses (as a registered user)
    • access an interactive map of the language or family of your choice through LLMap
  • National Council of Less Commonly Taught Languages. This website supports NCOLCTL in its efforts to address the issue of national capacity in the LCTLs by facilitating communications among member organizations and with the governmental, private, heritage, and overseas sectors of the language community. Its ultimate goal is to increase the collective impact of LCTL constituencies on America's ability to communicate with peoples from all parts of the world. Most resources available only to members. [121126]
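
The LLMap entry above notes that a language's page can be reached by inserting its ISO 639-3 code into a URL. A minimal sketch of that substitution follows, assuming the template quoted in the entry is still current; the helper name `llmap_url` is hypothetical, and the check covers only the three-letter shape of the code, not whether it is actually registered or whether the page exists.

```python
import re

# Assumption: URL template as quoted in the LLMap entry, with the blank
# replaced by the language code. It is not verified against the live site.
LLMAP_TEMPLATE = "http://llmap.org/languages/{code}.html"

# ISO 639-3 identifiers are three lowercase ASCII letters.
_ISO_639_3_SHAPE = re.compile(r"[a-z]{3}")


def llmap_url(code: str) -> str:
    """Build the LLMap page URL for an ISO 639-3 code (hypothetical helper)."""
    normalized = code.strip().lower()
    # Shape check only: this does not consult the ISO 639-3 registry.
    if not _ISO_639_3_SHAPE.fullmatch(normalized):
        raise ValueError(f"not a three-letter ISO 639-3 style code: {code!r}")
    return LLMAP_TEMPLATE.format(code=normalized)


print(llmap_url("ind"))  # "ind" is the ISO 639-3 code for Indonesian
```

Normalizing case and whitespace before the check keeps the helper tolerant of codes pasted from other catalogs, while anything that is not exactly three letters is rejected rather than silently turned into a broken link.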

  • OLAC: Open Language Archives Community: an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. [121120]
    • Search. The search field at the top of the page does a full text search. To search for a specific language, language family, geographic region, or any of many other categories, use the "Browse by:" menu down the right side of the page.
    • Also listed under Metadata standards and infrastructure below.
  • PRO-Signs: Signed languages for professional purposes. The PRO-Signs project aims to establish European standards for signed language proficiency for professional purposes, focusing specifically on sign language teaching in Deaf Studies and Sign Language Interpreting programmes. The project will provide definitions of Common European Framework of Reference for Languages (CEFR) proficiency levels for signed languages and develop a sample assessment kit for signed language competency at the C1/C2 level indicating the qualification of professional interpreters. [121218]
  • The Rosetta Project: a global collaboration of language specialists and native speakers working to build a publicly accessible digital library of human languages. [121120]
  • SIL International. SIL serves language communities worldwide, building their capacity for sustainable language development, by means of research, translation, training and materials development. [121126]
    • Ethnologue. Aims to provide a comprehensive listing of the known living languages of the world.... intended more as a catalog than as an encyclopedia and so provides summary data rather than more extensive descriptions of identified languages. (Introduction)
    • Non-Roman Script Initiative to enable ethnic minorities to bridge the digital divide. NRSI participates in the work of the Unicode Consortium
    • ScriptSource. a dynamic, collaborative reference to the writing systems of the world, with detailed information on scripts, characters, languages - and the remaining needs for supporting them in the computing realm. It currently contains only a skeleton of information, and so depends on your participation in order to grow and assist others. [apparent last update March 2012]

Multilingual resources

Sites that publish data in many languages

Metadata standards and infrastructure

  • e-linguistics. a cyber-infrastructure for linguistics ... meant to promote a paradigm shift within the field of linguistics where data are: interoperable -- shared -- open [last modified Aug 26, 2009 - accessed 121126]
  • E-MELD: Electronic Metastructure for Endangered Languages Data. a 5-year project with a dual objective: 1) To aid in the preservation of endangered languages data and documentation. 2) To aid in the development of the infrastructure necessary for effective collaboration among electronic archives. A LINGUIST List project. [121126]
  • GOLD Community ("General Ontology of Linguistic Description"). The purpose of the GOLD Community is to bring together scholars interested in best-practice encoding of linguistic data.
    • Lexicon Enhancement via the GOLD Ontology (LEGO): designed to develop "building blocks" for the interoperability of lexicon data. ... LEGO is creating a data interoperability network by mapping lexical items from multiple lexicons and wordlists to concepts in the General Ontology for Linguistic Description (GOLD). [This mapping] allows cross-lexicon search at a fine-grained level: users can search by morphosyntactic information as well as by definition, language, spelling, etc. LEGO is also committed to developing a set of "low-barrier" data requirements which lexicon creators can implement in order to join the interoperability network. [121126]
      • The LEGO project is undertaking to digitize a number of lexicons provided by their creators. The lexicons will be tagged with terms from GOLD and put into a database, so that they can be searched through a free online interface. These lexicons come from 16 different projects and cover over 300 languages, so the LEGO project will be a significant resource for typologists, semanticists, lexicographers, translators and other researchers.
  • IMDI (ISLE Meta Data Initiative): a proposed metadata standard to describe multi-media and multi-modal language resources. The standard provides interoperability for browsable and searchable corpus structures and resource descriptions with the help of specific tools. The full specifications can be found in the IMDI documentation. The web-based Browsable Corpus at the Max Planck Institute for Psycholinguistics allows you to browse through IMDI corpora and search for language resources. [See also the IMDI Metadata Domain above under Resource Organizations.] [121120 -- home page last updated 11 Oct 2007]
  • ISO 639-3: a code that aims to define three-letter identifiers for all known human languages. [121120]
  • Open Road: a resource for discussing and exploring electronic multicultural library services, multilingual public internet access services, language enablement, and community language web publishing initiatives. (from the Open Road blog) [121120]
    • NOTE: The site is currently being revised, and many links are missing. See the blog. [121120 -- Last update 2011-11-14]
    • Open Road languages page: The Open Road will explore Unicode language support issues in minority and emerging community languages within Australia. There are also developing sections for African, Southeast Asian, and Canadian aboriginal languages.
    • Maintained by Vicnet, a division of the State Library of Victoria (Australia).
