Language summary

(Information based on Ethnologue, 2010-02-25)

  • ISO 639-3 code: tam
  • Population: 61,500,000 in India (1997). Population total all countries: 65,675,200.
  • Also spoken in: Malaysia (Peninsular), Mauritius, Réunion, Singapore, Sri Lanka
  • Alternate names: Damulian, Tamal, Tamalsan, Tambul, Tamili
  • Dialects: Adi Dravida, Aiyar, Aiyangar, Arava, Burgandi, Kongar, Madrasi, Madurai, Pattapu Bhasha, Tamil, Sri Lanka Tamil, Malaya Tamil, Burma Tamil, South Africa Tamil, Tigalu, Harijan, Sanketi, Hebbar, Mandyam Brahmin, Secunderabad Brahmin. (See Ethnologue for notes.)
  • Classification: Dravidian, Southern, Tamil-Kannada, Tamil-Kodagu, Tamil-Malayalam, Tamil

In addition, there is diglossia between Literary Tamil (centamil /centamiẓ/) and the various spoken dialects (kotuntamil /koṭuntamiẓ/). Spoken Tamil varies widely with geography and caste. To a certain extent there has arisen a "Standard Spoken Tamil" used by educated people from different regions when they come together, but this is not really standardized.

Linguistic notes

There is only one noun declension, with eight cases and two numbers (singular and plural), all marked by endings. Postpositions are also used. There are two genders, "rational" and "irrational" (roughly, ±human); the rational gender is subdivided into honorific, masculine, and feminine.

Verbs are inflected for person, number, gender (third person only), mood, and tense.


Tamil is written in a Brahmi-derived script. See also Encoding and Fonts.

Many of the following are in the Digital South Asia Library of the University of Chicago (DSAL).

Topical word lists

Linguistic portals and bibliographies

  Penn Language Center Web Assisted Learning and Teaching of Tamil.

Encoding and Fonts

Before the development and general use of Unicode, computer use of Tamil and other South Asian languages required special one-byte fonts. Many websites used such fonts, often with idiosyncratic encodings. Some still do, including some listed on this page, and many corpora still use them. We therefore list some resources for other encodings, as well as for fonts and encoding conversion.


The International Forum for Information Technology in Tamil (INFITT) is a non-governmental organization that appears to have the support of the state of Tamil Nadu as well as of the various other countries with large Tamil-speaking populations. At the TAMIL INTERNET 2001 conference it decided to recommend that software use the Unicode encoding, and that where an 8-bit encoding is necessary, either the TAB or the TSCII encoding be used.


The Unicode range for Tamil is 0B80-0BFF. The Unicode encoding is based on the ISCII Standard. (Unicode Standard, v5.2, p.290)
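When checking whether a resource really is in Unicode, a quick test against the block range above can help. The following is a minimal sketch; the function names `is_tamil` and `tamil_ratio` are made up for illustration and do not come from any tool listed on this page.

```python
# The Tamil block, U+0B80 through U+0BFF inclusive.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def is_tamil(ch: str) -> bool:
    """Return True if the single character ch lies in the Tamil block."""
    return ord(ch) in TAMIL_BLOCK


def tamil_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil(c) for c in chars) / len(chars)
```

A ratio near zero on text that is displayed as Tamil suggests that the page depends on a legacy font encoding rather than on Unicode.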

These pages are on a weblog which is no longer maintained as of 2011-05-03, although the pages are still up. They are summarized in this wiki at Tamil/Nonstandard encodings.

  Nonstandard fonts
  Idiosyncratic fonts


  • Murasu Anjal. Software tool suite for creating, editing, converting and publishing Tamil content. Windows and Mac OS X. Conversion to Unicode of legacy documents composed with older encoding formats like TSCII, TAB and Murasu-6.
  • Padma. A Firefox add-in. "Padma transforms Indic text encoded in proprietary formats (ex: dynamic fonts) automatically to Unicode. Padma also has support for transforming from ISCII and transliteration schemes like ITRANS and RTS (Telugu only)." More details on Padma homepage.
  • Unicodify: From Lancaster University, producers of the Emille corpus. For Windows; ANSI C source code also available.
  • Text conversion from TSCII 1.7 to Unicode. Muthu Nedumaran. 2007. PDF.
  • Visai Tamil 2008. Seems to be an input tool. TAM, TAB, TSCII, Unicode, ASCII; keyboard layouts, spell check; dictionary (135k+ Tamil entries, 65k English)
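At its simplest, converting a one-byte legacy encoding to Unicode is a table lookup. The sketch below illustrates only that idea. The byte values in `TOY_TABLE` are invented for the example and are not taken from TSCII, TAB, TAM, or any other real encoding, and `legacy_to_unicode` is a hypothetical name, not the interface of any tool listed above.

```python
from typing import Mapping

# HYPOTHETICAL table: these byte values are invented for illustration only.
TOY_TABLE = {
    0xA1: "\u0B85",  # TAMIL LETTER A
    0xA2: "\u0B95",  # TAMIL LETTER KA
    0xA3: "\u0BBE",  # TAMIL VOWEL SIGN AA
}


def legacy_to_unicode(data: bytes, table: Mapping[int, str]) -> str:
    """Map each byte to Unicode text using the given lookup table."""
    out = []
    for b in data:
        if b < 0x80:
            # Assume the lower half is plain ASCII and pass it through.
            out.append(chr(b))
        else:
            # Unknown bytes become U+FFFD REPLACEMENT CHARACTER.
            out.append(table.get(b, "\uFFFD"))
    return "".join(out)
```

A real converter needs a complete table for the specific encoding, and may have to handle more than one-to-one lookups (for example, a single glyph that corresponds to several code points), which this sketch ignores.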

Data Sources

All these resources use Unicode unless otherwise described.

Monolingual Text

News and magazines


  • GRETIL Tamil. Göttingen Register of Electronic Texts in Indian Languages. Romanized in various schemes; each document is headed with a table of transliteration.
  • Project Madurai. Etexts of ancient literary works, in TSCII and (since 2004) in Unicode as well. All-volunteer effort. Homepage "last updated on 31 Jan 2005". English description is on lower half of home page; "Homepage in English" link is "File not found". [2010-04-02]
  • Tamil Library. Partly Unicode, partly TAB encoding, and a section of romanized texts; private use area characters for at least some headings. Part of Tamil Virtual University.


Parallel Text

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Tamil translation. Free license for non-profit research use.




  • CALLFRIEND Tamil. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. The corpus consists of 60 unscripted telephone conversations, each lasting between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers). [Mamandel 20:45, 1 March 2011 (UTC)]
  • CSLU: Multilanguage Telephone Speech Version 1.2: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. LDC2006S35. The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. Tamil: 149 speakers; 2.82 hours of speech (total). [Mamandel 15:38, 18 May 2011 (UTC)]



Tools and Other NLP Resources

Morphological analyzer

  • Hunspell. The default spell checker of OpenOffice.org and Mozilla Firefox 3 & Thunderbird. ... Unicode character encoding, compounding and complex morphology ... Morphological analysis, stemming and generation.
  • TAGTAMIL part-of-speech tagger cum spell checker. (Listed under "MSDOS", not "Windows 3.x")

