Tamil/Tamil
From the LDC Language Resource Wiki
General
Language summary
(Information based on Ethnologue, 2010-02-25)
- ISO 639-3 code: tam
- Population: 61,500,000 in India (1997). Population total all countries: 65,675,200.
- Also spoken in: Malaysia (Peninsular), Mauritius, Réunion, Singapore, Sri Lanka
- Alternate names: Damulian, Tamal, Tamalsan, Tambul, Tamili
- Dialects: Adi Dravida, Aiyar, Aiyangar, Arava, Burgandi, Kongar, Madrasi, Madurai, Pattapu Bhasha, Tamil, Sri Lanka Tamil, Malaya Tamil, Burma Tamil, South Africa Tamil, Tigalu, Harijan, Sanketi, Hebbar, Mandyam Brahmin, Secunderabad Brahmin. (See Ethnologue for notes.)
- Classification: Dravidian, Southern, Tamil-Kannada, Tamil-Kodagu, Tamil-Malayalam, Tamil
In addition, there is diglossia between Literary Tamil (centamil /centamiẓ/) and the various spoken dialects (kotuntamil /koṭuntamiẓ/). Spoken Tamil varies widely with geography and caste. To a certain extent there has arisen a "Standard Spoken Tamil" used by educated people from different regions when they come together, but this is not really standardized.
Linguistic notes
There is only one noun declension, with eight cases and two numbers (singular and plural), all marked by endings. Postpositions are also used. There are two genders, "rational" and "irrational" (roughly, ±human); the rational gender is subdivided into honorific, masculine, and feminine.
Verbs are inflected for person, number, gender (third person only), mood, and tense.
Writing
Tamil is written in a Brahmi-derived script. See also Encoding and Fonts.
- Omniglot
- The Unicode Standard, Version 5.2. Ch. 9.6: Tamil, pp. 289-296 (PDF: 30-37). ISBN 9781936213009
Linguistic resources
Overview
- Steever, Sanford B. 1987. Tamil and the Dravidian Languages. In The World's Major Languages, edited by Bernard Comrie, 1990, Oxford University Press; chapter 36, pages 725-746. ISBN 9780195065114
Grammar
- Agesthialingam, S. (1967) A Generative grammar of Tamil. Annamalai University Publications.
- Asher, R.E. Tamil. (Croom Helm Descriptive Grammar.) Routledge. Reprint edition (April 1989) ASIN: 0415036828 Out of print.
- Schiffman, Harold F. 1979. A grammar of spoken Tamil. Madras: Christian Literature Society. 108p. Online; requires Tamilnet font, downloadable from site
- Schiffman, Harold F. 1999. A reference grammar of spoken Tamil. Cambridge:Cambridge University Press. 254p. ISBN 0521640741.
Lexicon
Many of the following are in the Digital South Asia Library of the University of Chicago (DSAL).
- Cologne Online Tamil Lexicon. 117,000+ entries. "All main entries in the Madras Tamil Lexicon (TL) and Supplement (TLS), and their English meanings." Roman transliteration. (See Tamil lexicon below.)
- Fabricius, Johann Philipp. 1972. J. P. Fabricius's Tamil and English dictionary. 4th ed., rev.and enl. Tranquebar: Evangelical Lutheran Mission Pub. House. (DSAL). Reprinted by Laurier Books Ltd. January 1998. ISBN 8120602641
- Kadirvelu Pillai, Na. 1928. Moli Akarathi. N. Kathiraiver Pillai's Tamil Moli Akarathi: Tamil-Tamil dictionary = Na. Kathiraiver Pillayin Tamil Moliyakarati: Tamil-Tamil akarathi. 6th ed., rev. and enl. Cennai: Pi. Ve. Namacivaya Mutaliyar. (DSAL).
- Kapruka. Sinhala / Tamil Online Dictionary. Online lookup, English to Tamil (and Sinhala) only. 28,335 words. Provides image of text and sound file.
- McAlpin, David W. 1981. A core vocabulary for Tamil. Rev. ed. Philadelphia, Pa.: Dept. of South Asia Regional Studies, University of Pennsylvania. (DSAL).
- Raghavan, Raamesh Gowri. 2002-2005. Freelang Tamil-English dictionary. T-E: 2,770 words; E-T: 2,763 words.
- Schiffman, Harold, and Renganathan, Vasu. 2009. An English Dictionary of the Tamil Verb, Second Edition. LDC. “contains translations for 6597 English verbs and defines 9716 Tamil verbs. ... two formats: Adobe PDF and XML. ... The main goal of this dictionary is to get an English-knowing user to a Tamil verb, irrespective of whether he or she begins with an English verb or some other item, such as an adjective; this is because what may be a verb in Tamil may in fact not be a verb in English, and vice versa.” [Mamandel 20:40, 1 March 2011 (UTC)]
- Subramanian, Pavoorchatram Rajagopal. 1992. Kriyāvin̲ tar̲kālat Tamil̲ akarāti : Tamil̲-Tamil̲-Āṅkilam. 1st ed. Madras: Kriyā. 979p. Reprinted with corrections 1992. ISBN 8185602573. (DSAL: searchable database under construction.)
- Tamilcube Tamil-English-Tamil. 200k+ entries. Online lookup. Unicode. Also digits to Tamil words, e.g., "516" → "ஐநூற்று பதினாறு"
- Tamil lexicon. 1924-1936. [Madras]: University of Madras. (DSAL). (See also Cologne Online Tamil Lexicon above.)
- tamildict.com English-Tamil-German dictionary. Online lookup. "18,826 translation pairs" [accessed 2010-02-25]
- Tar̲kālat Tamil̲ maraputtoṭar akarāti : Tamil̲-Tamil̲-Aṅkilam. 1997. Chennai : Mol̲i. 404p. ISBN 8190069403. (A dictionary of idioms and phrases in contemporary Tamil; Tamil-Tamil-English.) (DSAL: searchable database under construction.)
- Visvanatha, Pillai. 1988. Tamil-English Dictionary. Laurier Books Ltd. 731p. ISBN 8120604377
- Wiktionary. Unicode. Monolingual. 112,963 entries (CC-BY-SA),(GFDL) [Mamandel 16:35, 3 May 2010 (UTC)]
- Winslow, Miron. 1862. A comprehensive Tamil and English dictionary of high and low Tamil. Madras: P.R. Hunt. (DSAL). 8th Rep edition (1998) Laurier Books Ltd. 967p. ISBN 8120600002
Topical word lists
- Babynology: List of Tamil baby names in Roman transliteration
- Tamilcube Tamil-Hindi-English word lists (number of entries) [Accessed 2010-03-11]:
- Indian spices and pulses (71)
- Indian herbs and plants (807)
- Indian vegetables (55)
- Fruits (30)
- Flowers (43)
- Birds (21)
- Animals (42)
- Fishes (27)
Linguistic portals and bibliographies
- OLAC list of Resources in and about the Tamil language.
- Penn Language Center Web Assisted Learning and Teaching of Tamil. (Much of this site requires the Tamilnet font, which apparently uses the TAB encoding; downloadable, with keyboard and other tools.)
- SIL Bibliography
Encoding and Fonts
Before Unicode was developed and came into general use, working with Tamil and other South Asian languages on computers required special fonts with one-byte (8-bit) encodings. Many websites used such fonts, often with idiosyncratic encodings. Some still do, including some listed on this page, and many corpora still use such fonts. We therefore list resources for other encodings, as well as for fonts and encoding conversion.
Encodings
The International Forum for Information Technology in Tamil (INFITT), a non-governmental organization that appears to have the support of the state of Tamil Nadu as well as of various countries with large Tamil-speaking populations, decided at the TAMIL INTERNET 2001 conference to recommend that software use the Unicode encoding but that, where an 8-bit encoding is necessary, either the TAB or the TSCII encoding be used.
Unicode
The Unicode range for Tamil is 0B80-0BFF. The Unicode encoding is based on the ISCII Standard. (Unicode Standard, v5.2, p.290)
- Penn State info page; Penn State chart of Unicode Entity Codes for the Tamil Script. "These charts show basic characters only. Check the latest Unicode charts to look for any additions to this block." (including OS X and Windows keyboard entry)
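The block range given above can be used for a rough check of whether a text is actually in Unicode Tamil, which may help when screening sites that still use legacy 8-bit fonts. The following is only a minimal sketch: the function names `is_tamil` and `tamil_ratio` are hypothetical, and the check looks at code points only, so it says nothing about which legacy encoding a non-Unicode page uses.

```python
# Tamil block boundaries as stated in the Unicode Standard (0B80-0BFF).
TAMIL_BLOCK_START = 0x0B80
TAMIL_BLOCK_END = 0x0BFF


def is_tamil(ch: str) -> bool:
    """Return True if the single character ch lies in the Tamil block."""
    return TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END


def tamil_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Tamil block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil(c) for c in chars) / len(chars)


# The word "Tamil" in Tamil script, written with escapes so the
# code points are explicit: U+0BA4 U+0BAE U+0BBF U+0BB4 U+0BCD.
sample = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(tamil_ratio(sample))          # share of Tamil-block characters
print(len(sample.encode("utf-8")))  # UTF-8 byte length of the sample
```

A page set in a legacy 8-bit font typically decodes to Latin-range characters, so its ratio would be near zero even though it displays as Tamil with the right font installed. The threshold for calling a text "Unicode Tamil" is left to the user.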
Other standard encodings
- INSFOC (Indian Standard Font Code)
- ISCII
- TSCII
- TAB and TAM: bilingual and monolingual encodings
Nonstandard and idiosyncratic encodings
- These pages are on a weblog that may no longer be maintained [2010-03-24]. They are summarized in this wiki at Tamil/Nonstandard encodings.
- Nonstandard fonts (there called "Very Special Scheme Fonts")
- Idiosyncratic fonts (there called "Odd Scheme Fonts"): "I couldnt categorise following set of fonts into one category. Each of them follow totally new type of coding scheme."
Fonts
Resources for fonts
Multiple encodings
- SIL's list of Fonts in Cyberspace. Guide and symbols used at top of page. Tamil is under T.
- The South Asia Language Resource Center of the University of Chicago has links to
- Tamil fonts, most of them available for free download
- Input Schemes and Keyboard Layouts
- information about Mac vs. Windows rendering issues
Unicode
- Alan Wood’s Unicode Resources
- Large, multi-script Unicode fonts for Windows computers
- Test for Unicode support in Web browsers
- Alan Wood's South Asian Unicode fonts for Windows computers. Includes Multiscript Indic fonts and Tamil fonts.
- OpenType Fonts for Tamil. Microsoft doc on Unicode 3.1 for Tamil. Contains useful info on the Tamil writing system.
- South Asia Language Resource Center of the University of Chicago.
- Wazu Japan's Gallery of Unicode Tamil fonts, and Tamil test page
Non-Unicode
- TAB and TAM
- Pathippu-250. 200 TAM fonts - 50 TAB fonts. Source code available.
- Tamil Electronic Library: fonts and software tools
- Tamil Virtual University Fonts
- TSCII
- Sarma. TSCII Fonts For Free Download
- See above, Nonstandard and idiosyncratic encodings
Conversion
- Murasu Anjal. Software tool suite for creating, editing, converting and publishing Tamil content. Windows and Mac OS X. Converts legacy documents composed in older encodings such as TSCII, TAB and Murasu-6 to Unicode.
- Padma. A Firefox add-in. "Padma transforms Indic text encoded in proprietary formats (ex: dynamic fonts) automatically to Unicode. Padma also has support for transforming from ISCII and transliteration schemes like ITRANS and RTS (Telugu only)." More details on Padma homepage.
- Unicodify: From Lancaster University, producers of the EMILLE corpus. For Windows; ANSI C source code also available.
- Text conversion from TSCII 1.7 to Unicode. Muthu Nedumaran. 2007. PDF.
- Visai Tamil 2008. Seems to be an input tool. TAM, TAB, TSCII, Unicode, ASCII; keyboard layouts, spell check; dictionary (135k+ Tamil entries, 65k English)
Data Sources
All these resources use Unicode unless otherwise described.
Monolingual Text
- EMILLE corpus. Free license for non-profit research use.
- Wikipedia. Unicode. 22,270 articles (CC-BY-SA),(GFDL) [Mamandel 16:34, 3 May 2010 (UTC)]
News and magazines
- China Radio International.
- Daily Thanthi. Idiosyncratic eight-bit encoding, Elango TML Panchali font (download).
- Dinakaran.
- Dinamalar.
- Dinamani.
- Kalki. The site says it requires TSCII, but it seems to use Unicode.
- Thats Tamil. (Also listed under Portals. Now part of OneIndia.in.)
- Thenee. [Mamandel 2010-06-14]
- Thinnai.
- Thinaboomi. Tamil Nadu. Partly in Unicode.
- Uthayan.
- Vikatan.
- Virakesari Online. Sri Lanka news.
- Webdunia.
- Newspaper portals:
- Indiapress: Tamil Newspapers. The page itself is in English.
- News blogs
Literature
- GRETIL Tamil. Göttingen Register of Electronic Texts in Indian Languages. Romanized in various schemes; each document is headed with a table of transliteration.
- Project Madurai. Etexts of ancient literary works, in TSCII and (since 2004) in Unicode as well. All-volunteer effort. Homepage "last updated on 31 Jan 2005". English description is on lower half of home page; "Homepage in English" link is "File not found". [2010-04-02]
- Tamil Library. Partly Unicode, partly TAB encoding, and a section of romanized texts; private use area characters for at least some headings. Part of Tamil Virtual University.
Blogs
- Blogspot Tamil Bloggers List
- tamilmaNam.net Tamil Blogs Aggregator
- today's blogs
- news blogs
- resources, apparently including search engine links
Parallel Text
- EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Tamil translation. Free license for non-profit research use.
Speech
Broadcast
- China Radio International.
- Sooriyan Radio. Site for a network in Sri Lanka. Nonstandard encoding, TM-TTKapilan font. Has a "Live Radio" link which may require Internet Explorer and Windows Media Player, and/or may be active only during broadcast hours. May be music only.
Telephone
- CALLFRIEND Tamil. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. “The corpus consists of 60 unscripted telephone conversations, lasting between 5-30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).” [Mamandel 20:45, 1 March 2011 (UTC)]
Video
- Tamil Library: Cultural Gallery. Partly Unicode, partly TAB encoding, with private use area characters for at least some headings. Part of Tamil Virtual University.
Portals
- Suratha: Links to newspapers and magazines in Tamil and English, Tamil music, movies, TV, Tamil computer tools, and other portals.
- tamilmaNam.net Tamil Blogs Aggregator. Includes:
- today's blogs
- news blogs
- resources, apparently including search engine links
- and several other sections whose URLs are in romanized Tamil
- Tamil Virtual University. “An autonomous Institution, established by the Government of Tamilnadu. ... TAB to Unicode Conversion is undergoing, core content and part of the lessons are available in Unicode”. Downloadable fonts. At least some of the headings are encoded in the Unicode private use area.
- Thats Tamil. Now part of OneIndia.in.
- Vikatan. Requires login
- WebTamilan
- Yahoo India in Tamil.
- "Yarl" Tamil Search Machine. Claims to use Unicode for Google search, but requires TSCII input (“Please insert/type your phrase in plain English(ex:-ammaa)”). Also has a page with searches for "Madurai Project" and "Forum Hub" (portals?) as well as "TSCII Font", "Bamini Font", and "Tab Font".
Tools and Other NLP Resources
Morphological analyzer
- Hunspell. “the default spell checker of OpenOffice.org and Mozilla Firefox 3 & Thunderbird. ... Unicode character encoding, compounding and complex morphology ... Morphological analysis, stemming and generation.”
- TAGTAMIL part-of-speech tagger cum spell checker. (Listed under "MSDOS", not "Windows 3.x")
Input in older encodings
- Adhawin. Input tool: type in romanization, output in Adhawin 8-bit font (included). Windows.
- Anjal text editor for Windows.