Tamil/Tamil

From the LDC Language Resource Wiki

Revision as of 15:43, 18 May 2011 by Mamandel (Talk | contribs)

தமிழ்


TAMIL


General

Language summary

(Information based on Ethnologue, 2010-02-25)

  • ISO 639-3 code: tam
  • Population: 61,500,000 in India (1997). Population total all countries: 65,675,200.
  • Also spoken in: Malaysia (Peninsular), Mauritius, Réunion, Singapore, Sri Lanka
  • Alternate names: Damulian, Tamal, Tamalsan, Tambul, Tamili
  • Dialects: Adi Dravida, Aiyar, Aiyangar, Arava, Burgandi, Kongar, Madrasi, Madurai, Pattapu Bhasha, Tamil, Sri Lanka Tamil, Malaya Tamil, Burma Tamil, South Africa Tamil, Tigalu, Harijan, Sanketi, Hebbar, Mandyam Brahmin, Secunderabad Brahmin. (See Ethnologue for notes.)
  • Classification: Dravidian, Southern, Tamil-Kannada, Tamil-Kodagu, Tamil-Malayalam, Tamil

In addition, there is diglossia between Literary Tamil (centamil /centamiẓ/) and the various spoken dialects (kotuntamil /koṭuntamiẓ/). Spoken Tamil varies widely with geography and caste. A "Standard Spoken Tamil" has emerged to some extent, used by educated speakers from different regions when they meet, but it is not truly standardized.

Linguistic notes

There is only one noun declension, with eight cases and two numbers (singular and plural), all marked by endings. Postpositions are also used. There are two genders, "rational" and "irrational" (roughly, ±human); the rational gender is subdivided into honorific, masculine, and feminine.

Verbs are inflected for person, number, gender (third person only), mood, and tense.

Writing

Tamil is written in a Brahmi-derived script. See also Encoding and Fonts.

Linguistic resources

Overview


Grammar

  • Agesthialingom, S. 1967. A generative grammar of Tamil. Annamalai University Publications.
  • Asher, R.E. Tamil. (Croom Helm Descriptive Grammar.) Routledge. Reprint edition (April 1989). ASIN: 0415036828. Out of print.
  • Schiffman, Harold F. 1979. A grammar of spoken Tamil. Madras: Christian Literature Society. 108p. Online; requires the Tamilnet font, downloadable from the site.
  • Schiffman, Harold F. 1999. A reference grammar of spoken Tamil. Cambridge: Cambridge University Press. 254p. ISBN 0521640741.

Lexicon

Many of the following are in the Digital South Asia Library (DSAL) of the University of Chicago.


Topical word lists

  • Babynology: List of Tamil baby names in Roman transliteration
  • Tamilcube Tamil-Hindi-English word lists (number of entries) [Accessed 2010-03-11]:
    • Indian spices and pulses (71)
    • Indian herbs and plants (807)
    • Indian vegetables (55)
    • Fruits (30)
    • Flowers (43)
    • Birds (21)
    • Animals (42)
    • Fishes (27)


Linguistic portals and bibliographies

  • OLAC list of Resources in and about the Tamil language.
  • Penn Language Center Web Assisted Learning and Teaching of Tamil. (Much of this site requires the Tamilnet font, which apparently uses the TAB encoding; downloadable, with keyboard and other tools.)
  • SIL Bibliography

Encoding and Fonts

Before Unicode was developed and widely adopted, computer use of Tamil and other South Asian languages required special fonts with one-byte (8-bit) encodings. Many websites used such fonts, often with idiosyncratic encodings. Some still do, including some listed on this page, and many corpora still use such fonts. We therefore list some resources for other encodings, as well as for fonts and encoding conversion.

Encodings

The International Forum for Information Technology in Tamil (INFITT) is a non-governmental organization that appears to have the support of the state of Tamil Nadu as well as of the various other countries with large Tamil-speaking populations. At the TAMIL INTERNET 2001 conference it decided to recommend that software use the Unicode encoding, but that where an 8-bit encoding is necessary, either the TAB or the TSCII encoding be used.

Unicode

The Unicode range for Tamil is 0B80-0BFF. The Unicode encoding is based on the ISCII Standard. (Unicode Standard, v5.2, p.290)
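The block range above gives a simple way to check whether a piece of text is actually encoded as Unicode Tamil. The following is a minimal sketch in Python; the function names `is_tamil` and `tamil_ratio` are illustrative and not taken from any resource listed on this page.

```python
# Tamil block as stated above: U+0B80 through U+0BFF.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def is_tamil(ch: str) -> bool:
    """Return True if the single character ch lies in the Tamil block."""
    return TAMIL_START <= ord(ch) <= TAMIL_END


def tamil_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil(c) for c in chars) / len(chars)


if __name__ == "__main__":
    # The five code points of the word "Tamil" in Tamil script, plus Latin text.
    sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd abc"
    print(tamil_ratio(sample))
```

A page that relies on a legacy 8-bit font would be expected to score near zero on such a check, since its characters typically decode into the Latin ranges rather than the Tamil block. That is only a rough heuristic and does not identify which legacy encoding is in use.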

Other standard encodings

Nonstandard and idiosyncratic encodings

The pages listed below are on a weblog that, as of 2011-05-03, is no longer maintained, although the pages are still up. They are summarized in this wiki at Tamil/Nonstandard encodings.

  • Nonstandard fonts (there called "Very Special Scheme Fonts")
  • Idiosyncratic fonts (there called "Odd Scheme Fonts"): described there as fonts that could not be placed in a single category, each following an entirely new coding scheme.

Fonts

Resources for fonts

Multiple encodings

Unicode

Non-Unicode

Conversion

  • Murasu Anjal. Software tool suite for creating, editing, converting and publishing Tamil content. Windows and Mac OS X. Converts legacy documents composed in older encodings such as TSCII, TAB and Murasu-6 to Unicode.
  • Padma. A Firefox add-in. "Padma transforms Indic text encoded in proprietary formats (ex: dynamic fonts) automatically to Unicode. Padma also has support for transforming from ISCII and transliteration schemes like ITRANS and RTS (Telugu only)." More details on Padma homepage.
  • Unicodify: From Lancaster University, producers of the Emille corpus. For Windows; ANSI C source code also available.
  • Text conversion from TSCII 1.7 to Unicode. Muthu Nedumaran. 2007. PDF.
  • Visai Tamil 2008. Seems to be an input tool. TAM, TAB, TSCII, Unicode, ASCII; keyboard layouts, spell check; dictionary (135k+ Tamil entries, 65k English)
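Converting from a legacy encoding is generally more than a byte-for-byte lookup. Glyph-based encodings such as TSCII store the left-side vowel signs in visual order (before the consonant), whereas Unicode stores them in logical order (after it). The sketch below illustrates only that reordering step, in Python. It assumes the legacy bytes have already been mapped to Unicode code points by an encoding-specific table, which is not shown. The names `PREFIX_SIGNS`, `is_tamil_consonant` and `visual_to_logical` are hypothetical and are not taken from any of the tools listed above. Conjuncts and other special cases are ignored.

```python
import unicodedata

# Vowel signs written to the left of the consonant: e, ee, ai.
PREFIX_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}


def is_tamil_consonant(ch: str) -> bool:
    """Rough test: code point within the Tamil consonant range KA..HA."""
    return "\u0b95" <= ch <= "\u0bb9"


def visual_to_logical(text: str) -> str:
    """Move a left-side vowel sign to follow the consonant it precedes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in PREFIX_SIGNS and nxt and is_tamil_consonant(nxt):
            out.append(nxt)  # consonant first (logical order)
            out.append(ch)   # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    # NFC joins the halves of two-part vowel signs, e.g. U+0BC6 + U+0BBE
    # becomes U+0BCA.
    return unicodedata.normalize("NFC", "".join(out))


if __name__ == "__main__":
    # Visual order: sign E, then KA. Logical order: KA, then sign E.
    print(visual_to_logical("\u0bc6\u0b95") == "\u0b95\u0bc6")
```

A real converter would also need the full mapping table for the specific source encoding. The tools and the TSCII 1.7 document listed above are the places to look for those details.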


Data Sources

All these resources use Unicode unless otherwise described.

Monolingual Text

News and magazines

Literature

  • GRETIL Tamil. Göttingen Register of Electronic Texts in Indian Languages. Romanized in various schemes; each document is headed with a table of transliteration.
  • Project Madurai. Etexts of ancient literary works, in TSCII and (since 2004) in Unicode as well. All-volunteer effort. Homepage "last updated on 31 Jan 2005". English description is on lower half of home page; "Homepage in English" link is "File not found". [2010-04-02]
  • Tamil Library. Partly Unicode, partly TAB encoding, with a section of romanized texts; private use area characters are used for at least some headings. Part of Tamil Virtual University.

Blogs

Parallel Text

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Tamil translation. Free license for non-profit research use.

Speech

Broadcast

Telephone

  • CALLFRIEND Tamil. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers). [Mamandel 20:45, 1 March 2011 (UTC)]
  • CSLU: Multilanguage Telephone Speech Version 1.2: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. LDC2006S35. The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. Tamil: 149 speakers; 2.82 hours of speech (total). [Mamandel 15:38, 18 May 2011 (UTC)]

Video


Portals

Tools and Other NLP Resources

Morphological analyzer

  • Hunspell. The default spell checker of OpenOffice.org and Mozilla Firefox 3 & Thunderbird. ... Unicode character encoding, compounding and complex morphology ... Morphological analysis, stemming and generation.
  • TAGTAMIL part-of-speech tagger cum spell checker. (Listed under "MSDOS", not "Windows 3.x")

Input in older encodings

  • Adhawin. Input tool: type in romanization, output in Adhawin 8-bit font (included). Windows.
  • Anjal text editor for Windows.

