Urdu/Urdu
From the LDC Language Resource Wiki
Revision as of 19:18, 14 April 2010
General
Language summary
(Information largely from Ethnologue, 2010-04-07)
- ISO 639-3 code: urd
- Population: 10,700,000 in Pakistan (1993), 48,100,000 in India (1997; esp. in Jammu and Kashmir), 250,000 in Bangladesh (2003 SIL). Population total all countries: 60,586,800.
- Also spoken in: Afghanistan and diaspora.
- Alternate names: Bihari, Islami, Undri, Urudu.
- Dialects: (Urdu is mutually intelligible with Hindi, but its formal vocabulary is borrowed from Arabic and Persian.)
- Dakhini (دکّنی) (Dakani, Dakkhini, Deccan, Desia, Mirgan). Used in India, has fewer Persian and Arabic loans.
- Pinjari
- Rekhta or Rekhti (ریختہ). Used in poetry.
- Classification: Indo-European, Indo-Iranian, Indo-Aryan, Central zone, Western Hindi, Hindustani
Linguistic notes
Syntax is head-final, with word order fairly free. Pro-drop.
Nouns come in two genders, masculine and feminine. They are inflected for two cases, usually referred to as direct and oblique (also called indirect): the oblique is used when the noun is governed by a postposition, the direct elsewhere. Adjectives agree in gender and number with the noun they modify. Case-marking is split ergative, depending on verb aspect.
Verbs are inflected for tense and aspect, and for agreement with the subject in person, number, and (in the second person) politeness. (Some tense marking is inflectional, and some uses auxiliary verbs.) There are many auxiliary verbs, some of which provide further aspectual information.
Writing
Urdu is usually written in Arabic script, in the Nastaliq (نستعلیق) style rather than the Naskh (نسخ) style commonly used for Arabic and Persian. See Encoding. All resources listed here are in Unicode unless noted otherwise.
Linguistic resources
Overview
- Kachru, Yamuna. 1987. Hindi-Urdu. Transcription only. In The World's Major Languages, ed. Bernard Comrie, 1990, Oxford University Press; Chapter 22, pages 470-489. ISBN 978-0195065114.
Grammar
- Barz, Richard Keith and Yogendra Yadav (1993). An introduction to Hindi and Urdu. 5th edn. New Delhi: Munshiram Manoharlal. 330 pp. ISBN 8121506050.
- 4th edn, 1991. Canberra : Faculty of Asian Studies Australian National University. ISBN 0708110517.
- McGregor, Ronald Stuart (1995). Outline of Hindi Grammar, with exercises (3rd rev. enl. ed.) Oxford University Press. ISBN 0198700083.
- Schmidt, Ruth Laila (1999). Urdu : an essential grammar. London, New York: Routledge. 300 pp.
- Schmidt, Ruth Laila (2003). "Urdu". In The Indo-Aryan Languages, Cardona, George, and Jain, Dhanesh, eds. Routledge, pp. 286–350. ISBN 9780415772945.
- Williams, Poul (1996-2000). A Short Introduction to Hindi. A short but useful HTML grammar of Hindi; apart from the alphabet, it should also be applicable to Urdu.
Lexicon
- Brinkster Urdu-English Dictionary. Mid-sized lexicon in documented transliteration. Gives etymology and POS.
- CRULP Online Urdu Dictionary. 56k? words, some with English glosses. "An online Urdu to Urdu and English dictionary with comprehensive information including pronunciation, etymology, meanings, examples, synonyms, compounds and other word relations."
- Digital Dictionaries of South Asia, U. of Chicago. Searchable. Various rights reserved, see the linked pages.
- Platts, John T. 1884. A dictionary of Urdu, classical Hindi, and English. Urdu and Hindi script and diacritized romanization. London: W. H. Allen & Co.
- Shakespear, John. 1834. A dictionary, Hindustani and English: with a copious index, fitting the work to serve, also, as a dictionary of English and Hindustani. Urdu and Hindi script and diacritized romanization. 3rd ed., much enl. London: Printed for the author by J.L. Cox and Son: Sold by Parbury, Allen, & Co.
- Turner, R. L. (Ralph Lilley). 1962-1985. A comparative dictionary of Indo-Aryan languages. Diacritized romanization. London: Oxford University Press, 1962-1966. Includes three supplements, published 1969-1985. ISBN 9788120816657.
- Ijunoon.com English-Urdu Dictionary. Over 47000 words. Input English, Urdu, or romanized Urdu. English-to-Urdu output is image of Urdu text.
- Mujtahidī, Yaʻqūb Mīrān (Mujtahedi, Yakoob Miran). Urdū-Angrezī lug̲h̲at-i Mujtahidī / [muʼallif], Yaʻqūb Mīrān Mujtahidī = Mujtahedi’s Urdu-English dictionary : the new age dictionary. Hyderabad : Dictionary House, 2007. 3 vol. ISBN 9788190643504, ISBN 8190643509. OCLC 276515369.
- Prabhu, Dinesh. Urdu-English Word List. 1439 headwords. Roman transliteration, language of origin, POS, gloss. ASCII text; posted at many locations, including
- Siddiqi, Waseem. 1997, rev. 2006. Non-commercial use only. Documented transliteration.
- Urduword.com. Transcription, and Urdu Naskh script in JPEG image format. Searchable in both English and Urdu, and browsable alphabetically.
- Wiktionary. Monolingual.
Topical word lists
Names
- Islamic names. Thousands of names with meanings. Romanization only.
- Urduseek.com: Muslim names. About 1500 names with meanings. Indexed by romanization, with Urdu script. (muslimname.com is the same.)
Other
- Siddiqi, Waseem. 1997, rev. 2006. Non-commercial use only. Documented transliteration (Main page).
- Kinships and Relationships. 124 kinship terms.
- Traditional Weights. 9 units of weight, each given in terms of its subunits (analogous to "16 ounces = 1 pound").
Monographs
Linguistic portals and bibliographies
- SIL Bibliography
- CRULP: Center for Research in Urdu Language Processing
- UCLA Language Materials Project
Encoding and Fonts
Encoding
Urdu uses the Unicode Arabic range 0600-06FF, or Unicode Arabic Presentation Forms-A (FB50-FDFF) and Unicode Arabic Presentation Forms-B (FE70-FEFF). See Unicode Standard 5.0, chapter 8 (Middle Eastern Scripts).
Note that in some cases the Unicode specification says that Urdu uses a different glyph at the same Unicode code point from what Arabic uses, e.g. U+06F4 - 06F7 (digits '4'-'7'; Unicode Standard pp. 272-273, PDF pp. 12-13). Also, some of the code points in this range are not used in Arabic, e.g. U+06D2 ے, which is not only unique to Urdu texts, but also very common, providing a quick check that a given text is in Urdu.
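The quick check described above can be sketched in Python. This is only an illustration, not a tested language identifier: the function names, the `min_ratio` threshold, and the choice to fold presentation forms with NFKC normalization are all assumptions made for the example. The block boundaries take the main Arabic block to be U+0600 to U+06FF.

```python
import unicodedata

# Code point ranges named in the text (inclusive), taking the main
# Arabic block as U+0600..U+06FF.
ARABIC_BLOCKS = [
    ("Arabic", 0x0600, 0x06FF),
    ("Arabic Presentation Forms-A", 0xFB50, 0xFDFF),
    ("Arabic Presentation Forms-B", 0xFE70, 0xFEFF),
]

YEH_BARREE = "\u06D2"  # the letter singled out in the text as an Urdu marker


def arabic_block(ch):
    """Return the name of the Arabic block containing ch, or None."""
    cp = ord(ch)
    for name, start, end in ARABIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


def looks_like_urdu(text, min_ratio=0.005):
    """Heuristic: True if U+06D2 makes up at least min_ratio of the
    Arabic-block characters in text. min_ratio is an arbitrary
    illustrative threshold, not an empirically tuned value."""
    # NFKC folds presentation-form characters back to their base letters,
    # so text stored in either encoding style is counted the same way.
    folded = unicodedata.normalize("NFKC", text)
    arabic = [ch for ch in folded if arabic_block(ch) == "Arabic"]
    if not arabic:
        return False
    return arabic.count(YEH_BARREE) / len(arabic) >= min_ratio


if __name__ == "__main__":
    urdu = "\u06CC\u06C1 \u0627\u0631\u062F\u0648 \u06C1\u06D2"
    arabic = "\u0627\u0644\u0639\u0631\u0628\u064A\u0629"
    print(looks_like_urdu(urdu), looks_like_urdu(arabic))
```

A heuristic like this is only a first-pass filter: very short texts may contain no ے at all, so a negative result on a short text says little.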
Older encodings
See Urdu/Other encodings.
Fonts and font utilities
- Penn State's Urdu info page, including advice on Mac and Windows keyboard entry.
- The South Asia Language Resource Center's Urdu page (U. of Chicago) has links to
- Urdu fonts, most of them available for free download
- Input Schemes and Keyboard Layouts
- information about Mac vs. PC vs. Linux rendering issues
- Wazu Japan's Gallery of Urdu Unicode fonts
- Alan Wood’s Unicode and Multilingual Support in HTML, Fonts, Web Browsers and Other Applications
Data Sources
Monolingual Text
- EMILLE corpus. Approximately 1,640,000 words. Free license for non-profit research use. Documentation
- Wikipedia
News
- APS Associated Press Service. Mostly Urdu, some English. A Pakistan news agency, evidently not affiliated with the Associated Press (AP).
- BBC
- CRI. China Radio International.
- Deutsche Welle.
- IRIB. Islamic Republic of Iran Broadcasting.
- Jang.net. (Same as urdunews.net, urduseek.com.)
- Mehr News.
- Online International News Network. Urdu and English.
- PakTribune. English, Urdu and Arabic; requires registration.
- Palestine Information Center. Also English, Indonesian, Turkish, Arabic, Russian, French, Farsi.
- Voice of America.
See also
Blog sites
(Individual blogs in Urdu are too numerous and too evanescent to list.)
Religious
- Tebyan. News, religious, and cultural (About us). Iran-based? Urdu, English, Farsi, Arabic, French, Russian, Turkish.
- Understanding Islam. Also English.
Parallel Text
(See also News. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)
- EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
- NIST Open Machine Translation 2008 Evaluation (MT08). Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs. The Urdu-to-English data comprises 128 documents with 1794 segments, output from 12 machine translation systems.
- SMURF Project (Sustainable Management of Urban Rivers and Floodplains).
Speech
- ARL Urdu Speech Database. Recorded prompted speech from 200 adult native Urdu speakers from Pakistan and Northern India.
- BBC. See also news page.
- CRI. China Radio International; see Webradio.
- EMILLE corpus. Approximately 512k words. Free license for non-profit research use. Documentation
- IRIB. Islamic Republic of Iran Broadcasting.
- Voice of America.
Video
Portals
Tools and Other NLP Resources
- CRULP Linguistic Resources. Includes
- Word List. 149466 words.
- 5000 Most Frequently Used Words List
- Closed Class Words List
- Valid Ligatures of Urdu
- Urdu Grammar