Urdu/Urdu

From the LDC Language Resource Wiki

<span style="font-size:300%; text-transform:uppercase">Urdu</span></center>
{{si|[[User:Mamandel|Mamandel]] April 2010}}
==General==
===Lexicon===
* [http://www.urdu-dictionary.info/ Brinkster Urdu-English Dictionary]. Mid-sized lexicon in documented transliteration. Gives etymology and POS.
* [http://www.crulp.org/oud/default.aspx CRULP Online Urdu Dictionary.] 56k? words, some with English glosses. {{hq|An online Urdu to Urdu and English dictionary with comprehensive information including pronunciation, etymology, meanings, examples, synonyms, compounds and other word relations.}}
* [http://dsal.uchicago.edu/dictionaries/ Digital Dictionaries of South Asia, U. of Chicago]. Searchable. Various rights reserved, see the linked pages.
**Platts, John T.  1884. [http://dsal.uchicago.edu/dictionaries/platts/ ''A dictionary of Urdu, classical Hindi, and English.''] Urdu and Hindi script and diacritized romanization.  London: W. H. Allen & Co.
* [http://biphost.spray.se/tracker/dict/index.html Siddiqi, Waseem]. 1997, rev. 2006. Non-commercial use only. Documented transliteration.
* [http://www.urduword.com/Home/index.cgi Urduword.com]. Transcription, and Urdu Naskh script in '''JPEG image format'''. Searchable in both English and Urdu, and browsable alphabetically.
* [http://ur.wiktionary.org/ Wiktionary]. Unicode. Monolingual. 112,963 entries {{CC-BY-SA}},{{GFDL}} {{si|[[User:Mamandel|Mamandel]] 16:23, 3 May 2010 (UTC)}}

====Topical word lists====
===Monolingual Text===
*[http://www.ling.lancs.ac.uk/corplang/emille/ EMILLE] corpus. Approximately 1,640,000 words. Free license for non-profit research use. [http://www.emille.lancs.ac.uk/manual.pdf Documentation]
* [http://ur.wikipedia.org Wikipedia]. Unicode. 13,261 articles {{CC-BY-SA}},{{GFDL}} {{si|[[User:Mamandel|Mamandel]] 16:26, 3 May 2010 (UTC)}}
====News====
See also
* [http://en.wikipedia.org/wiki/List_of_newspapers_in_Pakistan#Urdu-language_newspapers Wikipedia: List of Urdu-language newspapers in Pakistan] {{CC-BY-SA}},{{GFDL}} {{si|[[User:Mamandel|Mamandel]] 06:46, 14 May 2010 (UTC)}}
====Blog sites====
(See also [[#News|News]]. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)
* [http://www.ling.lancs.ac.uk/corplang/emille/ EMILLE] corpus.  200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T01 NIST Open Machine Translation 2008 Evaluation (MT08)]. Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs (Urdu-to-English, Arabic-to-English, Chinese-to-English, and English-to-Chinese). The Urdu-to-English data comprises 128 documents of newswire and web data with 1794 segments, output from 12 machine translation systems.  See webpage for price and licensing.
* [http://www.smurf-project.info/urdu/index.html SMURF Project] (Sustainable Management of Urban Rivers and Floodplains).
===Speech===
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007S03 ARL Urdu Speech Database]. Recorded prompted speech (400 prompts: sentences, place names, and person names), with transcriptions, from 200 adult native Urdu speakers from seven dialect regions in Pakistan and Northern India. ISBN 1-58563-421-3. See webpage for price and licensing. {{si|updated: [[User:Mamandel|Mamandel]] 16:12, 5 May 2010 (UTC)}}
* [http://www.bbc.co.uk/urdu/radio/2009/03/090305_fm_bulletins.shtml BBC]. See also [http://www.bbc.co.uk/urdu/ news page].
* [http://urdu.cri.cn/ CRI]. China Radio International; see Webradio.
** Valid Ligatures of Urdu
** Urdu Grammar
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation]: {{hq|a package containing source data, reference translations and scoring software used in the NIST 2008 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2008 task was to evaluate translation from Arabic to English, Chinese to English, English to Chinese (newswire only) and Urdu to English.}} LDC2010T21. ISBN:1-58563-567-7. {{si|[[User:Mamandel|Mamandel]] 00:25, 2 March 2011 (UTC)}}
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation]: {{hq|is a package containing source data, reference translations and scoring software used in the NIST 2009 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2009 task was to evaluate translation from Arabic to English and Urdu to English.}} LDC2010T23. ISBN:1-58563-570-7. {{si|[[User:Mamandel|Mamandel]] 00:25, 2 March 2011 (UTC)}}
[[Category:Urdu]]

Latest revision as of 12:21, 10 May 2011

اردو


Urdu


[Mamandel April 2010]

General

Language summary

(Information largely from Ethnologue, 2010-04-07)

  • ISO 639-3 code: urd
  • Population: 10,700,000 in Pakistan (1993), 48,100,000 in India (1997; esp. in Jammu and Kashmir), 250,000 in Bangladesh (2003 SIL). Population total all countries: 60,586,800.
  • Also spoken in: Afghanistan and diaspora.
  • Alternate names: Bihari, Islami, Undri, Urudu.
  • Dialects: (Urdu is mutually intelligible with Hindi, but its formal vocabulary is borrowed from Arabic and Persian.)
    • Dakhini (دکّنی) (Dakani, Dakkhini, Deccan, Desia, Mirgan). Used in India, has fewer Persian and Arabic loans.
    • Pinjari
    • Rekhta or Rekhti (ریختہ). Used in poetry.
  • Classification: Indo-European, Indo-Iranian, Indo-Aryan, Central zone, Western Hindi, Hindustani

Linguistic notes

Syntax is head-final, with word order fairly free. Pro-drop.

Nouns have two genders, masculine and feminine. They are inflected for two cases, usually called direct and indirect: the indirect case is used when the noun is governed by a postposition, and the direct case is used elsewhere. Adjectives agree in gender and number with the noun they modify. Case-marking is split ergative, depending on verb aspect.

Verbs are inflected for tense and aspect, and for agreement with the subject in person, number, and (in the second person) politeness. (Some tense marking is inflectional, and some uses auxiliary verbs.) There are many auxiliary verbs, some of which provide further aspectual information.

Writing

Urdu is usually written in Arabic script, in the Nastaliq (نستعلیق) style rather than the Naskh (نسخ) style commonly used for Arabic and Persian. See Encoding. All resources listed here are in Unicode unless noted otherwise.

Linguistic resources

Overview

  • Kachru, Yamuna. 1987. Hindi-Urdu. Transcription only. In The World's Major Languages, ed. Bernard Comrie, 1990, Oxford University Press; Chapter 22, pages 470-489. ISBN 978-0195065114.

Grammar

  • Barz, Richard Keith and Yogendra Yadav (1993). An introduction to Hindi and Urdu. 5th edn. New Delhi: Munshiram Manoharlal. 330 pp. ISBN 8121506050.
  • McGregor, Ronald Stuart (1995). Outline of Hindi Grammar, with exercises (3rd rev. enl. ed.) Oxford University Press. ISBN 0198700083.
  • Schmidt, Ruth Laila (1999). Urdu: an essential grammar. London, New York: Routledge. 300 pp.
  • Schmidt, Ruth Laila (2003). "Urdu". In The Indo-Aryan Languages, Cardona, George, and Jain, Dhanesh, eds. Routledge, pp. 286–350. ISBN 9780415772945.
  • Williams, Poul (1996-2000). A Short Introduction to Hindi. A short but useful HTML grammar of Hindi (but apart from the alphabet, it should be applicable to Urdu).

Lexicon


Topical word lists

Names
  • Islamic names. Thousands of names with meanings. Romanization only.
  • Urduseek.com: Muslim names. About 1500 names with meanings. Indexed by romanization, with Urdu script. (muslimname.com is the same.)
Other

Linguistic portals and bibliographies

Encoding and Fonts

Encoding

Urdu uses the Unicode Arabic range 0600-06FF, or Unicode Arabic Presentation Forms-A FB50-FDFF and Unicode Arabic Presentation Forms-B FE70-FEFF. See Unicode Standard 5.0, chapter 8 (Middle Eastern Scripts).

Note that in some cases the Unicode specification says that Urdu uses a different glyph at the same Unicode code point from what Arabic uses, e.g. U+06F4 - 06F7 (digits '4'-'7'; Unicode Standard pp. 272-273, PDF pp. 12-13). Also, some of the code points in this range are not used in Arabic, e.g. U+06D2 ے, which is not only unique to Urdu texts, but also very common, providing a quick check that a given text is in Urdu.
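The quick check described above can be sketched in a few lines of Python. This is only an illustrative heuristic, not a tested language identifier: the function names and the min_ratio threshold are assumptions made for this example, and a hit on U+06D2 does not by itself rule out other languages that may share the character.

```python
# Unicode blocks named in the Encoding section (inclusive code point ranges).
ARABIC_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Arabic Presentation Forms-A": (0xFB50, 0xFDFF),
    "Arabic Presentation Forms-B": (0xFE70, 0xFEFF),
}

YEH_BARREE = "\u06D2"  # the character cited above as a quick Urdu indicator


def arabic_block_counts(text):
    """Count the characters of `text` that fall in each Arabic-script block."""
    counts = {name: 0 for name in ARABIC_BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in ARABIC_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def looks_like_urdu(text, min_ratio=0.005):
    """Heuristic: Arabic-script text in which U+06D2 is reasonably frequent.

    `min_ratio` is a hypothetical threshold (share of Arabic-script
    characters that are U+06D2); tune it on real data before relying on it.
    """
    arabic_total = sum(arabic_block_counts(text).values())
    if arabic_total == 0:
        return False  # no Arabic-script characters at all
    return text.count(YEH_BARREE) / arabic_total >= min_ratio
```

For example, looks_like_urdu("\u06C1\u06D2") (the word ہے) returns True, while a string of plain Arabic letters with no U+06D2 returns False. Text stored in the presentation-forms blocks may encode this letter with different code points, so such text would need normalizing (for instance with unicodedata.normalize("NFKC", text)) before this check is applied.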

Older encodings

See Urdu/Other encodings.

Fonts and font utilities

Data Sources

Monolingual Text

News

See also

Blog sites

(Individual blogs in Urdu are too numerous and too evanescent to list.)

Religious

Parallel Text

(See also News. Many of those sources produce text in multiple languages, but we have not been able to check their current parallelism.)

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Urdu translation. Free license for non-profit research use.
  • NIST Open Machine Translation 2008 Evaluation (MT08). Human reference translations and corresponding machine translations for the NIST Open MT08 test sets in four language pairs (Urdu-to-English, Arabic-to-English, Chinese-to-English, and English-to-Chinese). The Urdu-to-English data comprises 128 documents of newswire and web data with 1794 segments, output from 12 machine translation systems. See webpage for price and licensing.
  • SMURF Project (Sustainable Management of Urban Rivers and Floodplains).

Speech

  • ARL Urdu Speech Database. Recorded prompted speech (400 prompts: sentences, place names, and person names), with transcriptions, from 200 adult native Urdu speakers from seven dialect regions in Pakistan and Northern India. ISBN 1-58563-421-3. See webpage for price and licensing. [updated: Mamandel 16:12, 5 May 2010 (UTC)]
  • BBC. See also news page.
  • CRI. China Radio International; see Webradio.
  • EMILLE corpus. Approximately 512k words. Free license for non-profit research use. Documentation
  • IRIB. Islamic Republic of Iran Broadcasting.
  • Voice of America.

Video

Portals

Tools and Other NLP Resources

  • CRULP Linguistic Resources. Includes
    • Word List. 149466 words.
    • 5000 Most Frequently Used Words List
    • Closed Class Words List
    • Valid Ligatures of Urdu
    • Urdu Grammar
  • NIST 2008 Open Machine Translation (OpenMT) Evaluation: a package containing source data, reference translations and scoring software used in the NIST 2008 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2008 task was to evaluate translation from Arabic to English, Chinese to English, English to Chinese (newswire only) and Urdu to English. LDC2010T21. ISBN:1-58563-567-7. [Mamandel 00:25, 2 March 2011 (UTC)]
  • NIST 2009 Open Machine Translation (OpenMT) Evaluation: is a package containing source data, reference translations and scoring software used in the NIST 2009 OpenMT evaluation. It is designed to help evaluate the effectiveness of machine translation systems. ... The 2009 task was to evaluate translation from Arabic to English and Urdu to English. LDC2010T23. ISBN:1-58563-570-7. [Mamandel 00:25, 2 March 2011 (UTC)]