Urdu/Urdu

From the LDC Language Resource Wiki

Revision as of 22:02, 25 May 2010

اردو


Urdu


[Mamandel April 2010]

Contents

General

Language summary

(Information largely from Ethnologue, 2010-04-07)

  • ISO 639-3 code: urd
  • Population: 10,700,000 in Pakistan (1993), 48,100,000 in India (1997; esp. in Jammu and Kashmir), 250,000 in Bangladesh (2003 SIL). Population total all countries: 60,586,800.
  • Also spoken in: Afghanistan and diaspora.
  • Alternate names: Bihari, Islami, Undri, Urudu.
  • Dialects: (Urdu is mutually intelligible with Hindi, but its formal vocabulary is borrowed from Arabic and Persian.)
    • Dakhini (دکّنی) (Dakani, Dakkhini, Deccan, Desia, Mirgan). Used in India, has fewer Persian and Arabic loans.
    • Pinjari
    • Rekhta or Rekhti (ریختہ). Used in poetry.
  • Classification: Indo-European, Indo-Iranian, Indo-Aryan, Central zone, Western Hindi, Hindustani

Linguistic notes

Syntax is head-final, with word order fairly free. Pro-drop.

Nouns come in two genders, masculine and feminine. They are inflected for two cases, usually referred to as direct and indirect: the indirect case is used when the noun is governed by a postposition, and the direct case elsewhere. Adjectives agree in gender and number with the noun they modify. Case-marking is split ergative, depending on verb aspect.

Verbs are inflected for tense and aspect, and for agreement with the subject in person, number, and (in the second person) politeness. (Some tense marking is inflectional, and some use auxiliary verbs.) There are many auxiliary verbs, some of which provide further aspectual information.

Writing

Urdu is usually written in Arabic script, in the Nastaliq (نستعلیق) style rather than the Naskh (نسخ) style commonly used for Arabic and Persian. See Encoding. All resources listed here are in Unicode unless noted otherwise.

Linguistic resources

Overview

  • Kachru, Yamuna. 1987. Hindi-Urdu. Transcription only. In The World's Major Languages, ed. Bernard Comrie, 1990, Oxford University Press; Chapter 22, pages 470-489. ISBN 978-0195065114.

Grammar

  • Barz, Richard Keith and Yogendra Yadav (1993). An introduction to Hindi and Urdu. 5th edn. New Delhi: Munshiram Manoharlal. 330 pp. ISBN 8121506050.
  • McGregor, Ronald Stuart (1995). Outline of Hindi Grammar, with exercises (3rd rev. enl. ed.) Oxford University Press. ISBN 0198700083.
  • Schmidt, Ruth Laila (1999). Urdu : an essential grammar. London, New York: Routledge. 300 pp.
  • Schmidt, Ruth Laila (2003). "Urdu". In The Indo-Aryan Languages, Cardona, George, and Jain, Dhanesh, eds. Routledge, pp. 286–350. ISBN 9780415772945.
  • Williams, Poul (1996-2000). A Short Introduction to Hindi. A short but useful HTML grammar of Hindi (but apart from the alphabet, it should be applicable to Urdu).

Lexicon


Topical word lists

Names
  • Islamic names. Thousands of names with meanings. Romanization only.
  • Urduseek.com: Muslim names. About 1500 names with meanings. Indexed by romanization, with Urdu script. (muslimname.com is the same.)
Other

Linguistic portals and bibliographies

Encoding and Fonts

Encoding

Urdu uses the Unicode Arabic range U+0600-U+06FF, or Unicode Arabic Presentation Forms-A (U+FB50-U+FDFF) and Unicode Arabic Presentation Forms-B (U+FE70-U+FEFF). See Unicode Standard 5.0, chapter 8 (Middle Eastern Scripts).

Note that in some cases the Unicode specification says that Urdu uses a different glyph from Arabic at the same Unicode code point, e.g. U+06F4-U+06F7 (digits '4'-'7'; Unicode Standard pp. 272-273, PDF pp. 12-13). Also, some of the code points in this range are not used in Arabic, e.g. U+06D2 ے, which is not only distinctive of Urdu texts but also very common in them, providing a quick check that a given text is in Urdu.
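
The quick check described above can be sketched in a few lines of Python. This is only an illustrative heuristic, not a language identifier: the function names, the 0.5 threshold, and the decision to count only alphabetic characters are all assumptions made for this example. The code point ranges and the use of U+06D2 as the marker come from the notes above.

```python
# Code point ranges listed in the Encoding notes above.
ARABIC_BLOCKS = (
    (0x0600, 0x06FF),  # Arabic
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)

# U+06D2 ARABIC LETTER YEH BARREE, the letter used here as an Urdu marker.
YEH_BARREE = "\u06D2"


def in_arabic_blocks(ch):
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_BLOCKS)


def arabic_script_ratio(text):
    """Fraction of alphabetic characters that lie in the Arabic ranges."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(in_arabic_blocks(ch) for ch in letters) / len(letters)


def looks_like_urdu(text, min_ratio=0.5):
    """Heuristic: mostly Arabic-script text that contains U+06D2.

    min_ratio is an arbitrary threshold chosen for this sketch.
    """
    return YEH_BARREE in text and arabic_script_ratio(text) >= min_ratio


if __name__ == "__main__":
    urdu_word = "\u06C1\u06D2"                # the word "hai", ending in U+06D2
    arabic_word = "\u0643\u062A\u0627\u0628"  # an Arabic word without U+06D2
    print(looks_like_urdu(urdu_word))
    print(looks_like_urdu(arabic_word))
```

A short text may happen not to contain U+06D2 at all, so a negative result on a single sentence is weak evidence; the check is more useful on whole documents.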

Older encodings

See Urdu/Other encodings.

Fonts and font utilities

Data Sources

Monolingual Text

News

See also

Blog sites

(Individual blogs in Urdu are too numerous and too evanescent to list.)

Religious

Parallel Text

(See also News. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
  • NIST Open Machine Translation 2008 Evaluation (MT08). Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs (Urdu-to-English, Arabic-to-English, Chinese-to-English, and English-to-Chinese). The Urdu-to-English data comprises 128 documents of newswire and web data with 1794 segments, output from 12 machine translation systems. See webpage for price and licensing.
  • SMURF Project (Sustainable Management of Urban Rivers and Floodplains).

Speech

  • ARL Urdu Speech Database. Recorded prompted speech (400 prompts: sentences, place names, and person names), with transcriptions, from 200 adult native Urdu speakers from seven dialect regions in Pakistan and Northern India. ISBN 1-58563-421-3. See webpage for price and licensing. [updated: Mamandel 16:12, 5 May 2010 (UTC)]
  • BBC. See also news page.
  • CRI. China Radio International; see Webradio.
  • EMILLE corpus. Approximately 512k words. Free license for non-profit research use. Documentation is available.
  • IRIB. Islamic Republic of Iran Broadcasting.
  • Voice of America.

Video

Portals

Tools and Other NLP Resources

  • CRULP Linguistic Resources. Includes
    • Word List. 149466 words.
    • 5000 Most Frequently Used Words List
    • Closed Class Words List
    • Valid Ligatures of Urdu
    • Urdu Grammar