Tamil/Tamil

From the LDC Language Resource Wiki

தமிழ்

TAMIL

General

Language summary

(Information based on Ethnologue, 2010-02-25)

ISO 639-3 code: tam
Population: 61,500,000 in India (1997). Population total all countries: 65,675,200.
Also spoken in: Malaysia (Peninsular), Mauritius, Réunion, Singapore, Sri Lanka
Alternate names: Damulian, Tamal, Tamalsan, Tambul, Tamili
Dialects: Adi Dravida, Aiyar, Aiyangar, Arava, Burgandi, Kongar, Madrasi, Madurai, Pattapu Bhasha, Tamil, Sri Lanka Tamil, Malaya Tamil, Burma Tamil, South Africa Tamil, Tigalu, Harijan, Sanketi, Hebbar, Mandyam Brahmin, Secunderabad Brahmin. (See Ethnologue for notes.)
Classification: Dravidian, Southern, Tamil-Kannada, Tamil-Kodagu, Tamil-Malayalam, Tamil

In addition, there is diglossia between Literary Tamil (centamil /centamiẓ/) and the various spoken dialects (kotuntamil /koṭuntamiẓ/). Spoken Tamil varies widely with geography and caste. To a certain extent there has arisen a "Standard Spoken Tamil" used by educated people from different regions when they come together, but this is not really standardized.

Linguistic notes

There is only one noun declension, with eight cases and two numbers (singular and plural), all marked by endings. Postpositions are also used. There are two genders, "rational" and "irrational" (roughly, ±human); the rational gender is subdivided into honorific, masculine, and feminine.

Verbs are inflected for person, number, gender (third person only), mood, and tense.

Writing

Tamil is written in a Brahmi-derived script. See also Encoding and Fonts.

Omniglot
The Unicode Standard, Version 5.2. Ch. 9.6: Tamil, pp. 289-296 (PDF: 30-37). ISBN 9781936213009

Linguistic resources

Overview

Steever, Sanford B. 1987. Tamil and the Dravidian Languages. In The World's Major Languages, edited by Bernard Comrie, 1990, Oxford University Press; chapter 36, pages 725-746. ISBN 9780195065114

Grammar

Agesthialingam, S. (1967) A Generative grammar of Tamil. Annamalai University Publications.
Asher, R.E. Tamil. (Croom Helm Descriptive Grammar.) Routledge. Reprint edition (April 1989) ASIN: 0415036828 Out of print.
Schiffman, Harold F. 1979. A grammar of spoken Tamil. Madras: Christian Literature Society. 108p. Online; requires Tamilnet font, downloadable from site
Schiffman, Harold F. 1999. A reference grammar of spoken Tamil. Cambridge: Cambridge University Press. 254p. ISBN 0521640741.

Lexicon

Many of the following are in the Digital South Asia Library of the University of Chicago (DSAL).

Cologne Online Tamil Lexicon. 117,000+ entries. "All main entries in the Madras Tamil Lexicon (TL) and Supplement (TLS), and their English meanings." Roman transliteration. (See Tamil lexicon below.)
Fabricius, Johann Philipp. 1972. J. P. Fabricius's Tamil and English dictionary. 4th ed., rev. and enl. Tranquebar: Evangelical Lutheran Mission Pub. House. (DSAL). Reprinted by Laurier Books Ltd. January 1998. ISBN 8120602641
Kadirvelu Pillai, Na. 1928. Moli Akarathi. N. Kathiraiver Pillai's Tamil Moli Akarathi: Tamil-Tamil dictionary = Na. Kathiraiver Pillayin Tamil Moliyakarati: Tamil-Tamil akarathi. 6th ed., rev. and enl. Cennai: Pi. Ve. Namacivaya Mutaliyar. (DSAL).
Kapruka. Sinhala / Tamil Online Dictionary. Online lookup, English to Tamil (and Sinhala) only. 28,335 words. Provides image of text and sound file.
McAlpin, David W. 1981. A core vocabulary for Tamil. Rev. ed. Philadelphia, Pa.: Dept. of South Asia Regional Studies, University of Pennsylvania. (DSAL).
Raghavan, Raamesh Gowri. 2002-2005. Freelang Tamil-English dictionary. T-E: 2,770 words; E-T: 2,763 words.
Schiffman, Harold, and Renganathan, Vasu. 2009. An English Dictionary of the Tamil Verb, Second Edition. LDC2009L01. “contains translations for 6597 English verbs and defines 9716 Tamil verbs. ... two formats: Adobe PDF and XML. ... The main goal of this dictionary is to get an English-knowing user to a Tamil verb, irrespective of whether he or she begins with an English verb or some other item, such as an adjective; this is because what may be a verb in Tamil may in fact not be a verb in English, and vice versa.” [Mamandel 20:40, 1 March 2011 (UTC)]
Subramanian, Pavoorchatram Rajagopal. 1992. Kriyāvin̲ tar̲kālat Tamil̲ akarāti : Tamil̲-Tamil̲-Āṅkilam. 1st ed. Madras: Kriyā. 979p. Reprinted with corrections 1992. ISBN 8185602573. (DSAL: searchable database under construction.)
Tamilcube Tamil-English-Tamil. 200k+ entries. Online lookup. Unicode. Also digits to Tamil words, e.g., "516" → "ஐநூற்று பதினாறு"
Tamil lexicon. 1924-1936. [Madras]: University of Madras. (DSAL). (See also Cologne Online Tamil Lexicon above.)
tamildict.com English-Tamil-German dictionary. Online lookup. "18,826 translation pairs" [accessed 2010-02-25]
Tar̲kālat Tamil̲ maraputtoṭar akarāti : Tamil̲-Tamil̲-Aṅkilam. 1997. Chennai : Mol̲i. 404p. ISBN 8190069403. (A dictionary of idioms and phrases in contemporary Tamil; Tamil-Tamil-English.) (DSAL: searchable database under construction.)
Visvanatha, Pillai. 1988. Tamil-English Dictionary. Laurier Books Ltd. 731p. ISBN 8120604377
Wiktionary. Unicode. Monolingual. 112,963 entries (CC-BY-SA),(GFDL) [Mamandel 16:35, 3 May 2010 (UTC)]
Winslow, Miron. 1862. A comprehensive Tamil and English dictionary of high and low Tamil. Madras: P.R. Hunt. (DSAL). 8th Rep edition (1998) Laurier Books Ltd. 967p. ISBN 8120600002

Topical word lists

Babynology: List of Tamil baby names in Roman transliteration

Tamilcube Tamil-Hindi-English word lists (number of entries) [Accessed 2010-03-11]:
- Indian spices and pulses (71)
- Indian herbs and plants (807)
- Indian vegetables (55)
- Fruits (30)
- Flowers (43)
- Birds (21)
- Animals (42)
- Fishes (27)

Linguistic portals and bibliographies

OLAC list of Resources in and about the Tamil language.
Penn Language Center Web Assisted Learning and Teaching of Tamil. (Much of this site requires the Tamilnet font, which apparently uses the TAB encoding; downloadable, with keyboard and other tools.)
SIL Bibliography

Encoding and Fonts

Before Unicode was developed and came into general use, computer use of Tamil and other South Asian languages required special single-byte fonts. Many websites used such fonts, often with idiosyncratic encodings, and some still do, including some listed on this page. Many corpora also still use such fonts, so we list resources for other encodings as well as for fonts and encoding conversion.

Encodings

The International Forum for Information Technology in Tamil (INFITT) is a non-governmental organization that appears to have the support of the state of Tamil Nadu as well as of the various other countries with large Tamil-speaking populations. At the TAMIL INTERNET 2001 conference it decided to recommend that software use the Unicode encoding, but that where an 8-bit encoding is necessary, either the TAB or TISCII encoding be used.

Unicode

The Unicode range for Tamil is 0B80-0BFF. The Unicode encoding is based on the ISCII Standard. (Unicode Standard, v5.2, p.290)
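The block range above can be used for a quick check of whether a piece of text is actually Unicode Tamil. The following is a minimal sketch, not a full script detector; the function names `in_tamil_block` and `tamil_ratio` are illustrative and do not come from any of the resources listed here. Text that only displays as Tamil through a legacy 8-bit font will usually score near zero, because its underlying characters lie outside this block.

```python
# Tamil block boundaries as given above (U+0B80 to U+0BFF).
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF


def in_tamil_block(ch: str) -> bool:
    """Return True if the single character ch lies in the Tamil block."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END


def tamil_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(in_tamil_block(ch) for ch in chars) / len(chars)


if __name__ == "__main__":
    print(tamil_ratio("தமிழ்"))  # Unicode Tamil text
    print(tamil_ratio("tamil"))  # romanized text
```

A threshold on `tamil_ratio` (the cut-off is a judgment call) may help sort pages from the sources below into Unicode and non-Unicode before further processing.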

Penn State info page; Penn State chart of Unicode Entity Codes for the Tamil Script. "These charts show basic characters only. Check the latest Unicode charts to look for any additions to this block." (including OS X and Windows keyboard entry)

Other standard encodings

INSFOC (Indian Standard Font Code)
ISCII
TSCII
TAB and TAM: bilingual and monolingual encodings

Nonstandard and idiosyncratic encodings

The following pages are on a weblog that was no longer being maintained as of 2011-05-03, although the pages are still up. They are summarized in this wiki at Tamil/Nonstandard encodings.

Nonstandard fonts (there called "Very Special Scheme Fonts")
Idiosyncratic fonts (there called "Odd Scheme Fonts"): “I couldnt categorise following set of fonts into one category. Each of them follow totally new type of coding scheme.”

Fonts

Resources for fonts

Multiple encodings

SIL's list of Fonts in Cyberspace. Guide and symbols used at top of page. Tamil is under T.
The South Asia Language Resource Center of the University of Chicago has links to
- Tamil fonts, most of them available for free download
- Input Schemes and Keyboard Layouts
- information about Mac vs. Windows rendering issues

Unicode

Alan Wood’s Unicode Resources
- Large, multi-script Unicode fonts for Windows computers
- Test for Unicode support in Web browsers
- Alan Wood's South Asian Unicode fonts for Windows computers. Includes Multiscript Indic fonts and Tamil fonts.
OpenType Fonts for Tamil. Microsoft doc on Unicode 3.1 for Tamil. Contains useful info on the Tamil writing system.
South Asia Language Resource Center of the University of Chicago.
Wazu Japan's Gallery of Unicode Tamil fonts, and Tamil test page

Non-Unicode

TAB and TAM
- Pathippu-250. 200 TAM fonts - 50 TAB fonts. Source code available.
- Tamil Electronic Library: fonts and software tools
- Tamil Virtual University Fonts
TSCII
- Sarma. TSCII Fonts For Free Download
See above, Nonstandard and idiosyncratic encodings

Conversion

Murasu Anjal Software tool suite for creating, editing, converting and publishing Tamil content. Windows and Mac OS X. Conversion to Unicode of legacy documents composed with older encoding formats like TSCII, TAB and Murasu-6.
Padma. A Firefox add-in. "Padma transforms Indic text encoded in proprietary formats (ex: dynamic fonts) automatically to Unicode. Padma also has support for transforming from ISCII and transliteration schemes like ITRANS and RTS (Telugu only)." More details on Padma homepage.
Unicodify: From Lancaster University, producers of the EMILLE corpus. For Windows; ANSI C source code also available.
Text conversion from TSCII 1.7 to Unicode. Muthu Nedumaran. 2007. PDF.
Visai Tamil 2008. Seems to be an input tool. TAM, TAB, TSCII, Unicode, ASCII; keyboard layouts, spell check; dictionary (135k+ Tamil entries, 65k English)
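
One step that conversion from a legacy font encoding to Unicode may involve is reordering. Unicode stores a Tamil vowel sign after its consonant (logical order), even for the signs that are drawn to the left of the consonant. Visual-order font encodings such as TSCII store those signs before the consonant. The sketch below assumes the byte-to-character mapping has already been done and shows only the reordering; `visual_to_logical` is a hypothetical helper, not the method of any tool listed above, and two-part vowel signs need additional handling that is not shown.

```python
# Vowel signs drawn to the LEFT of the consonant: e, ee, ai.
LEFT_VOWEL_SIGNS = {"\u0BC6", "\u0BC7", "\u0BC8"}


def is_tamil_consonant(ch: str) -> bool:
    """Rough test: Tamil consonants lie between KA (U+0B95) and HA (U+0BB9)."""
    return "\u0B95" <= ch <= "\u0BB9"


def visual_to_logical(text: str) -> str:
    """Move each left-side vowel sign to after the consonant that follows it.

    Apply this only to text known to be in visual order; text that is
    already in logical order would be corrupted by the swap.
    """
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in LEFT_VOWEL_SIGNS and nxt and is_tamil_consonant(nxt):
            out.append(nxt)  # consonant first
            out.append(ch)   # then the vowel sign
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    visual = "\u0BC6\u0B95"  # sign E stored before KA
    print(visual_to_logical(visual) == "\u0B95\u0BC6")
```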

Data Sources

All these resources use Unicode unless otherwise described.

Monolingual Text

EMILLE corpus. Free license for non-profit research use.
Wikipedia. Unicode. 22,270 articles (CC-BY-SA),(GFDL) [Mamandel 16:34, 3 May 2010 (UTC)]

News and magazines

China Radio International.
Daily Thanthi. Idiosyncratic eight-bit encoding, Elango TML Panchali font (download).
Dinakaran.
Dinamalar.
Dinamani.
Kalki. The site says it requires TSCII, but it seems to use Unicode.
Thats Tamil. (Also listed under Portals. Now part of OneIndia.in.)
Thenee. [Mamandel 2010-06-14]
Thinnai.
Thinaboomi. Tamil Nadu. Partly in Unicode.
Uthayan.
Vikatan.
Virakesari Online. Sri Lanka news.
Webdunia.
Newspaper portals:
- Indiapress: Tamil Newspapers. The page itself is in English.
News blogs
- tamilmaNam.net news blogs

Literature

GRETIL Tamil. Göttingen Register of Electronic Texts in Indian Languages. Romanized in various schemes; each document is headed with a table of transliteration.
Project Madurai. Etexts of ancient literary works, in TSCII and (since 2004) in Unicode as well. All-volunteer effort. Homepage "last updated on 31 Jan 2005". English description is on lower half of home page; "Homepage in English" link is "File not found". [2010-04-02]
Tamil Library. Partly Unicode, partly TAB encoding, and a section of romanized texts; private use area characters for at least some headings. Part of Tamil Virtual University.

Blogs

Blogspot Tamil Bloggers List
tamilmaNam.net Tamil Blogs Aggregator
- today's blogs
- news blogs
- resources, apparently including search engine links

Parallel Text

EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Tamil translation. Free license for non-profit research use.

Speech

Broadcast

China Radio International.
Sooriyan Radio. Site for a network in Sri Lanka. Nonstandard encoding, TM-TTKapilan font. Has a "Live Radio" link which may require Internet Explorer and Windows Media Player, and/or may be active only during broadcast hours. May be music only.

Telephone

CALLFRIEND Tamil. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. “The corpus consists of 60 unscripted telephone conversations, lasting between 5-30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).” [Mamandel 20:45, 1 March 2011 (UTC)]
CSLU: Multilanguage Telephone Speech Version 1.2: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. LDC2006S35. “The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese.” Tamil: 149 speakers; 2.82 hours of speech (total). [Mamandel 15:38, 18 May 2011 (UTC)]

Video

Tamil Library: Cultural Gallery. Partly Unicode, partly TAB encoding, with private use area characters for at least some headings. Part of Tamil Virtual University.

Portals

Suratha: Links to newspapers and magazines in Tamil and English, Tamil music, movies, TV, Tamil computer tools, and other portals.
tamilmaNam.net Tamil Blogs Aggregator. Includes:
- today's blogs
- news blogs
- resources, apparently including search engine links
- and several other sections whose URLs are in romanized Tamil
Tamil Virtual University. “An autonomous Institution, established by the Government of Talminadu. ... TAB to Unicode Conversion is undergoing, core content and part of the lessons are available in Unicode”. Downloadable fonts. At least some of the headings are encoded in the Unicode private use area.
Thats Tamil. Now part of OneIndia.in.
Vikatan. Requires login
WebTamilan
Yahoo India in Tamil.
"Yarl" Tamil Search Machine. Claims to use Unicode for Google search, but requires TSCII input (“Please insert/type your phrase in plain English(ex:-ammaa)”). Also has a page with searches for "Madurai Project" and "Forum Hub" (portals?) as well as "TSCII Font", "Bamini Font", and "Tab Font".

Tools and Other NLP Resources

Morphological analyzer

Hunspell. “the default spell checker of OpenOffice.org and Mozilla Firefox 3 & Thunderbird. ... Unicode character encoding, compounding and complex morphology ... Morphological analysis, stemming and generation.”
TAGTAMIL part-of-speech tagger cum spell checker. (Listed under "MSDOS", not "Windows 3.x")

Input in older encodings

Adhawin. Input tool: type in romanization, output in Adhawin 8-bit font (included). Windows.
Anjal text editor for Windows.
