Urdu/Urdu

From the LDC Language Resource Wiki

Revision as of 22:02, 25 May 2010

اردو

Urdu

[Mamandel April 2010]

General

Language summary

(Information largely from Ethnologue, 2010-04-07)

ISO 639-3 code: urd
Population: 10,700,000 in Pakistan (1993), 48,100,000 in India (1997; esp. in Jammu and Kashmir), 250,000 in Bangladesh (2003 SIL). Population total all countries: 60,586,800.
Also spoken in: Afghanistan and diaspora.
Alternate names: Bihari, Islami, Undri, Urudu.
Dialects: (Urdu is mutually intelligible with Hindi, but its formal vocabulary is borrowed from Arabic and Persian.)
- Dakhini (دکّنی) (Dakani, Dakkhini, Deccan, Desia, Mirgan). Used in India; has fewer Persian and Arabic loans.
- Pinjari
- Rekhta or Rekhti (ریختہ). Used in poetry.
Classification: Indo-European, Indo-Iranian, Indo-Aryan, Central zone, Western Hindi, Hindustani

Linguistic notes

Syntax is head-final, with word order fairly free. Pro-drop.

Nouns come in two genders, masculine and feminine. They are inflected for two cases, usually referred to as direct and indirect: the indirect case is used when the noun is governed by a postposition, and the direct case elsewhere. Adjectives agree in gender and number with the noun they modify. Case-marking is split ergative, depending on verb aspect.

Verbs are inflected for tense and aspect, and for agreement with the subject in person, number, and (in the second person) politeness. (Some tense marking is inflectional, and some uses auxiliary verbs.) There are many auxiliary verbs, some of which provide further aspectual information.

Writing

Urdu is usually written in Arabic script, in the Nastaliq (نستعلیق) style rather than the Naskh (نسخ) style commonly used for Arabic and Persian. See Encoding. All resources listed here are in Unicode unless noted otherwise.

Linguistic resources

Overview

Kachru, Yamuna. 1987. Hindi-Urdu. Transcription only. In The World's Major Languages, ed. Bernard Comrie, 1990, Oxford University Press; Chapter 22, pages 470-489. ISBN 978-0195065114.

Grammar

Barz, Richard Keith and Yogendra Yadav (1993). An introduction to Hindi and Urdu. 5th edn. New Delhi : Munshiram Manoharlal. 330 pp. ISBN 8121506050.
- 4th edn, 1991. Canberra : Faculty of Asian Studies Australian National University. ISBN 0708110517.
McGregor, Ronald Stuart (1995). Outline of Hindi Grammar, with exercises (3rd rev. enl. ed.) Oxford University Press. ISBN 0198700083.
Schmidt, Ruth Laila (1999). Urdu : an essential grammar. London, New York: Routledge. 300 pp.
Schmidt, Ruth Laila (2003). "Urdu". In The Indo-Aryan Languages, Cardona, George, and Jain, Dhanesh, eds. Routledge, pp. 286–350. ISBN 9780415772945.
Williams, Poul (1996-2000). A Short Introduction to Hindi. A short but useful HTML grammar of Hindi (but apart from the alphabet, it should be applicable to Urdu).

Lexicon

Brinkster Urdu-English Dictionary. Mid-sized lexicon in documented transliteration. Gives etymology and POS.
CRULP Online Urdu Dictionary. 56k? words, some with English glosses.
Digital Dictionaries of South Asia, U. of Chicago. Searchable. Various rights reserved; see the linked pages.
- Platts, John T. 1884. A dictionary of Urdu, classical Hindi, and English. Urdu and Hindi script and diacritized romanization. London: W. H. Allen & Co.
- Shakespear, John. 1834. A dictionary, Hindustani and English: with a copious index, fitting the work to serve, also, as a dictionary of English and Hindustani. Urdu and Hindi script and diacritized romanization. 3rd ed., much enl. London: Printed for the author by J.L. Cox and Son: Sold by Parbury, Allen, & Co.
- Turner, R. L. (Ralph Lilley). 1962-1985. A comparative dictionary of Indo-Aryan languages. Diacritized romanization. London: Oxford University Press, 1962-1966. Includes three supplements, published 1969-1985. ISBN 9788120816657.
Ijunoon.com English-Urdu Dictionary. Over 47000 words. Input English, Urdu, or romanized Urdu. English-to-Urdu output is image of Urdu text.
Mujtahidī, Yaʻqūb Mīrān (Mujtahedi, Yakoob Miran). Urdū-Angrezī lug̲h̲at-i Mujtahidī / [muʼallif], Yaʻqūb Mīrān Mujtahidī = Mujtahedi’s Urdu-English dictionary : the new age dictionary. Hyderabad : Dictionary House, 2007. 3 vol. ISBN 9788190643504, ISBN 8190643509. OCLC 276515369.
Prabhu, Dinesh. Urdu-English Word List. 1439 headwords. Roman transliteration, language of origin, POS, gloss. ASCII text. Posted at many locations, with no mention of rights restrictions, including:
Siddiqi, Waseem. 1997, rev. 2006. Non-commercial use only. Documented transliteration.
Urduword.com. Transcription, and Urdu Naskh script in JPEG image format. Searchable in both English and Urdu, and browsable alphabetically.
Wiktionary. Unicode. Monolingual. 112,963 entries (CC-BY-SA),(GFDL) [Mamandel 16:23, 3 May 2010 (UTC)]

Topical word lists

Names

Islamic names. Thousands of names with meanings. Romanization only.
Urduseek.com: Muslim names. About 1500 names with meanings. Indexed by romanization, with Urdu script. (muslimname.com is the same.)

Other

Siddiqi, Waseem. 1997, rev. 2006. Non-commercial use only. Documented transliteration (Main page).
- Kinships and Relationships. 124 kinship terms.
- Traditional Weights. 9 units of weight as subunits (analogous to "16 ounces = 1 pound").

Linguistic portals and bibliographies

SIL Bibliography
CRULP: Center for Research in Urdu Language Processing
- CRULP publications
UCLA Language Materials Project

Encoding and Fonts

Encoding

Urdu uses the Unicode Arabic range 0600-06FF, or the Unicode Arabic Presentation Forms-A (FB50-FDFF) and Arabic Presentation Forms-B (FE70-FEFF) ranges. See Unicode Standard 5.0, chapter 8 (Middle Eastern Scripts).
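
As a rough illustration, the sketch below tallies which of these blocks the characters of a string fall into. The Arabic block is taken as U+0600 to U+06FF, its range in the Unicode Standard. The names `ARABIC_BLOCKS`, `block_of`, and `block_counts` are hypothetical helpers introduced here, not part of any resource listed on this page.

```python
from collections import Counter

# Block boundaries (inclusive). The Arabic block is U+0600..U+06FF.
ARABIC_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}


def block_of(ch):
    """Return the name of the Arabic-script block containing ch, or None."""
    cp = ord(ch)
    for name, (lo, hi) in ARABIC_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def block_counts(text):
    """Count the characters of text that fall in each Arabic-script block."""
    return Counter(b for b in map(block_of, text) if b is not None)
```

A text whose counts fall mostly in the two Presentation Forms blocks was probably produced by an older tool or a font-specific conversion and may need normalizing before further processing.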

Note that in some cases the Unicode specification says that Urdu uses a different glyph at the same Unicode code point from what Arabic uses, e.g. U+06F4 - 06F7 (digits '4'-'7'; Unicode Standard pp. 272-273, PDF pp. 12-13). Also, some of the code points in this range are not used in Arabic, e.g. U+06D2 ے, which is characteristic of Urdu and very common in Urdu text, so its presence provides a quick check that a given text is in Urdu rather than Arabic.
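
The quick check based on U+06D2 can be sketched as follows. This is a heuristic only: the function names and the 1% threshold are assumptions made for illustration, and a few other languages written in a similar script may also use this letter, so a positive result suggests Urdu rather than proving it.

```python
YEH_BARREE = "\u06D2"  # ے, the letter named in the note above


def yeh_barree_ratio(text):
    """Share of Arabic-block characters (U+0600..U+06FF) that are U+06D2."""
    arabic = [ch for ch in text if "\u0600" <= ch <= "\u06FF"]
    if not arabic:
        return 0.0
    return arabic.count(YEH_BARREE) / len(arabic)


def looks_like_urdu(text, threshold=0.01):
    """Heuristic: True if U+06D2 is at least `threshold` of the Arabic-block letters.

    The default threshold is an arbitrary illustrative value, not a measured one.
    """
    return yeh_barree_ratio(text) >= threshold
```

On a real collection the threshold would need tuning against known Urdu and non-Urdu samples, and very short strings should be treated with caution.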

Older encodings

See Urdu/Other encodings.

Fonts and font utilities

Penn State's Urdu info page, including advice on Mac and Windows keyboard entry.
The South Asia Language Resource Center's Urdu page (U. of Chicago) has links to
- Urdu fonts, most of them available for free download
- Input Schemes and Keyboard Layouts
- information about Mac vs. PC vs. Linux rendering issues
Wazu Japan's Gallery of Urdu Unicode fonts
- Wazu Japan Unicode test pages
Alan Wood’s Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
- test page for Arabic Unicode support

Data Sources

Monolingual Text

EMILLE corpus. Approximately 1,640,000 words. Free license for non-profit research use. Documentation
Wikipedia. Unicode. 13,261 articles (CC-BY-SA),(GFDL) [Mamandel 16:26, 3 May 2010 (UTC)]

News

APS Associated Press Service. Mostly Urdu, some English. A Pakistan news agency, evidently not affiliated with the Associated Press (AP).
BBC
CRI. China Radio International.
Deutsche Welle.
IRIB. Islamic Republic of Iran Broadcasting.
Jang.net. (Same as urdunews.net, urduseek.com.)
Mehr News.
Online International News Network. Urdu and English.
PakTribune. English, Urdu and Arabic; requires registration.
Palestine Information Center. Also English, Indonesian, Turkish, Arabic, Russian, French, Farsi.
Voice of America.

Blog sites

(Individual blogs in Urdu are too numerous and too evanescent to list.)

Religious

Tebyan. News, religious, and cultural (About us). Iran-based? Urdu, English, Farsi, Arabic, French, Russian, Turkish.
Understanding Islam. Also English.

Parallel Text

(See also News. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)

EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
NIST Open Machine Translation 2008 Evaluation (MT08). Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs (Urdu-to-English, Arabic-to-English, Chinese-to-English, and English-to-Chinese). The Urdu-to-English data comprises 128 documents of newswire and web data with 1794 segments, output from 12 machine translation systems. See webpage for price and licensing.
SMURF Project (Sustainable Management of Urban Rivers and Floodplains).

Speech

ARL Urdu Speech Database. Recorded prompted speech (400 prompts: sentences, place names, and person names), with transcriptions, from 200 adult native Urdu speakers from seven dialect regions in Pakistan and Northern India. ISBN 1-58563-421-3. See webpage for price and licensing. [updated: Mamandel 16:12, 5 May 2010 (UTC)]
BBC. See also news page.
CRI. China Radio International; see Webradio.
EMILLE corpus. Approximately 512k words. Free license for non-profit research use. Documentation
IRIB. Islamic Republic of Iran Broadcasting.
Voice of America.

Video

Voice of America.

Portals

Urdu Web. Home page in English and Urdu.

Tools and Other NLP Resources

CRULP Linguistic Resources. Includes
- Word List. 149466 words.
- 5000 Most Frequently Used Words List
- Closed Class Words List
- Valid Ligatures of Urdu
- Urdu Grammar
