From the LDC Language Resource Wiki


Urdu



[Mamandel April 2010]



Language summary

(Information largely from Ethnologue, 2010-04-07)

  • ISO 639-3 code: urd
  • Population: 10,700,000 in Pakistan (1993), 48,100,000 in India (1997; esp. in Jammu and Kashmir), 250,000 in Bangladesh (2003 SIL). Population total all countries: 60,586,800.
  • Also spoken in: Afghanistan and diaspora.
  • Alternate names: Bihari, Islami, Undri, Urudu.
  • Dialects: (Urdu is mutually intelligible with Hindi, but its formal vocabulary is borrowed from Arabic and Persian.)
    • Dakhini (دکّنی) (Dakani, Dakkhini, Deccan, Desia, Mirgan). Used in India, has fewer Persian and Arabic loans.
    • Pinjari
    • Rekhta or Rekhti (ریختہ). Used in poetry.
  • Classification: Indo-European, Indo-Iranian, Indo-Aryan, Central zone, Western Hindi, Hindustani

Linguistic notes

Syntax is head-final, with word order fairly free. Pro-drop.

Nouns come in two genders, masculine and feminine. They are inflected for two cases, usually referred to as direct and indirect: the indirect case is used when the noun is governed by a postposition, and the direct case elsewhere. Adjectives agree in gender and number with the noun they modify. Case-marking is split ergative, depending on verb aspect.

Verbs are inflected for tense and aspect, and for agreement with the subject in person, number, and (in the second person) politeness. (Some tense marking is inflectional, and some uses auxiliary verbs.) There are many auxiliary verbs, some of which provide further aspectual information.


Urdu is usually written in Arabic script, in the Nastaliq (نستعلیق) style rather than the Naskh (نسخ) style commonly used for Arabic and Persian. See Encoding. All resources listed here are in Unicode unless noted otherwise.

Linguistic resources


  • Kachru, Yamuna. 1987. Hindi-Urdu. Transcription only. In The World's Major Languages, ed. Bernard Comrie, 1990, Oxford University Press; Chapter 22, pages 470-489. ISBN 978-0195065114.


  • Barz, Richard Keith and Yogendra Yadav (1993). An Introduction to Hindi and Urdu. 5th edn. New Delhi: Munshiram Manoharlal. 330 pp. ISBN 8121506050.
  • McGregor, Ronald Stuart (1995). Outline of Hindi Grammar, with exercises (3rd rev. enl. ed.) Oxford University Press. ISBN 0198700083.
  • Schmidt, Ruth Laila (1999). Urdu: An Essential Grammar. London, New York: Routledge. 300 pp.
  • Schmidt, Ruth Laila (2003). "Urdu". In The Indo-Aryan Languages, Cardona, George, and Jain, Dhanesh, eds. Routledge, pp. 286–350. ISBN 9780415772945.
  • Williams, Poul (1996-2000). A Short Introduction to Hindi. A short but useful HTML grammar of Hindi (but apart from the alphabet, it should be applicable to Urdu).


Topical word lists

  • Islamic names. Thousands of names with meanings. Romanization only.
  • Urduseek.com: Muslim names. About 1500 names with meanings. Indexed by romanization, with Urdu script. (muslimname.com is the same.)

Linguistic portals and bibliographies

Encoding and Fonts


Urdu uses the Unicode Arabic range 0600-06FF, or Unicode Arabic Presentation Forms-A FB50-FDFF and Unicode Arabic Presentation Forms-B FE70-FEFF. See Unicode Standard 5.0, chapter 8 (Middle Eastern Scripts).
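As a rough illustration of how these ranges might be used when inspecting a text file, the following Python sketch counts how many characters of a string fall in each of the three blocks. The block boundaries assume the basic Arabic block is U+0600-U+06FF; the names ARABIC_BLOCKS and block_counts are made up for this example and do not come from any LDC tool.

```python
# Code point ranges for the three Unicode blocks mentioned above.
# (Assumption: the basic Arabic block runs from U+0600 to U+06FF.)
ARABIC_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}


def block_counts(text):
    """Return a dict mapping each block name to the number of
    characters in `text` whose code point lies in that block."""
    counts = {name: 0 for name in ARABIC_BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in ARABIC_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break  # the blocks do not overlap
    return counts
```

A text that shows many characters in the two Presentation Forms blocks may have been produced by an older tool that stored shaped glyphs rather than plain letters, so it can be worth normalizing before further processing.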

Note that for some code points the Unicode Standard shows Urdu using a different glyph from the one Arabic uses at the same code point, e.g. U+06F4 - 06F7 (digits '4'-'7'; Unicode Standard pp. 272-273, PDF pp. 12-13). Also, some of the code points in this range are not used in Arabic, e.g. U+06D2 ے, which is not only unique to Urdu texts, but also very common, providing a quick check that a given text is in Urdu.
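The quick check described above could be sketched in Python as follows. This is only a heuristic under the assumption stated in the text (that U+06D2 is frequent in Urdu and absent from Arabic); the function name looks_like_urdu and the min_count threshold are invented for this example.

```python
# U+06D2 ARABIC LETTER YEH BARREE, the letter singled out above.
YEH_BARREE = "\u06D2"


def looks_like_urdu(text, min_count=1):
    """Heuristic: report True if `text` contains at least `min_count`
    occurrences of U+06D2. A positive result only suggests Urdu;
    it does not prove it."""
    return text.count(YEH_BARREE) >= min_count
```

For very short strings the letter may simply not occur, so a negative result is more informative on whole documents than on single words. Raising min_count makes the check stricter for long texts.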

Older encodings

See Urdu/Other encodings.

Fonts and font utilities

Data Sources

Monolingual Text


See also

Blog sites

(Individual blogs in Urdu are too numerous and too evanescent to list.)


Parallel Text

(See also News. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
  • NIST Open Machine Translation 2008 Evaluation (MT08). Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs (Urdu-to-English, Arabic-to-English, Chinese-to-English, and English-to-Chinese). The Urdu-to-English data comprises 128 documents of newswire and web data with 1794 segments, output from 12 machine translation systems. See webpage for price and licensing.
  • SMURF Project (Sustainable Management of Urban Rivers and Floodplains).


  • ARL Urdu Speech Database. Recorded prompted speech (400 prompts: sentences, place names, and person names), with transcriptions, from 200 adult native Urdu speakers from seven dialect regions in Pakistan and Northern India. ISBN 1-58563-421-3. See webpage for price and licensing. [updated: Mamandel 16:12, 5 May 2010 (UTC)]
  • BBC. See also news page.
  • CRI. China Radio International; see Webradio.
  • EMILLE corpus. Approximately 512k words. Free license for non-profit research use. Documentation
  • IRIB. Islamic Republic of Iran Broadcasting.
  • Voice of America.



Tools and Other NLP Resources

  • CRULP Linguistic Resources. Includes
    • Word List. 149466 words.
    • 5000 Most Frequently Used Words List
    • Closed Class Words List
    • Valid Ligatures of Urdu
    • Urdu Grammar
  • NIST 2008 Open Machine Translation (OpenMT) Evaluation: a package containing source data, reference translations and scoring software used in the NIST 2008 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2008 task was to evaluate translation from Arabic to English, Chinese to English, English to Chinese (newswire only) and Urdu to English. LDC2010T21. ISBN:1-58563-567-7. [Mamandel 00:25, 2 March 2011 (UTC)]
  • NIST 2009 Open Machine Translation (OpenMT) Evaluation: a package containing source data, reference translations and scoring software used in the NIST 2009 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2009 task was to evaluate translation from Arabic to English and Urdu to English. LDC2010T23. ISBN:1-58563-570-7. [Mamandel 00:25, 2 March 2011 (UTC)]