Tamil/Tamil

From the LDC Language Resource Wiki

(Difference between revisions)
m (Tools and Other NLP Resources)
m (Telephone)
 
(12 intermediate revisions not shown)
Line 59: Line 59:
*McAlpin, David W. 1981. ''[http://dsal.uchicago.edu/dictionaries/mcalpin/ A core vocabulary for Tamil].'' Rev. ed. Philadelphia, Pa.: Dept. of South Asia Regional Studies, University of Pennsylvania. (DSAL).
*Raghavan, Raamesh Gowri. 2002-2005. ''[http://www.freelang.net/dictionary/tamil.php Freelang Tamil-English dictionary].'' T-E: 2,770 words; E-T: 2,763 words.
 +
* Schiffman, Harold, and Renganathan, Vasu. 2009. [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2009L01 ''An English Dictionary of the Tamil Verb, Second Edition'']. LDC2009L01. {{hq|contains translations for 6597 English verbs and defines 9716 Tamil verbs. ... two formats: Adobe PDF and XML. ... The main goal of this dictionary is to get an English-knowing user to a Tamil verb, irrespective of whether he or she begins with an English verb or some other item, such as an adjective; this is because what may be a verb in Tamil may in fact not be a verb in English, and vice versa.}} {{si|[[User:Mamandel|Mamandel]] 20:40, 1 March 2011 (UTC)}}
*Subramanian, Pavoorchatram Rajagopal. 1992. ''Kriyāvin̲ tar̲kālat Tamil̲ akarāti : Tamil̲-Tamil̲-Āṅkilam.'' 1st ed. Madras: Kriyā. 979p. Reprinted with corrections 1992. ISBN 8185602573. ''([http://dsal.uchicago.edu/dictionaries/list.html#tamil DSAL]: searchable database under construction.)''
*[http://www.dictionary.tamilcube.com/ Tamilcube] Tamil-English-Tamil. 200k+ entries. Online lookup. Unicode. Also digits to Tamil words, e.g., "516" → "ஐநூற்று பதினாறு"
Line 109: Line 110:
====Nonstandard and idiosyncratic encodings====
-
* These pages are on a [http://www.angelfire.com/indie/tamilezhutthuru3 weblog] which may be no longer maintained ''<small>[2010-03-24]</small>''. They are summarized in this wiki at [[Tamil/Nonstandard encodings]].  
+
These pages are on a [http://www.angelfire.com/indie/tamilezhutthuru3 weblog] which is no longer maintained as of 2011-05-03, although the pages are still up. They are summarized in this wiki at [[Tamil/Nonstandard encodings]].  
-
**[http://www.angelfire.com/indie/tamilezhutthuru3/1/aanGilam/piRa_pkm.html#thuvakkam Nonstandard fonts] (there called "Very Special Scheme Fonts")
+
*[http://www.angelfire.com/indie/tamilezhutthuru3/1/aanGilam/piRa_pkm.html#thuvakkam Nonstandard fonts] (there called "Very Special Scheme Fonts")
-
**[http://www.angelfire.com/indie/tamilezhutthuru3/1/aanGilam/innapiRa_pkm.html#thuvakkam Idiosyncratic fonts] (there called "Odd Scheme Fonts"): "I couldnt categorise following set of fonts into one category. Each of them follow totally new type of coding scheme."
+
*[http://www.angelfire.com/indie/tamilezhutthuru3/1/aanGilam/innapiRa_pkm.html#thuvakkam Idiosyncratic fonts] (there called "Odd Scheme Fonts"): {{hq|I couldnt categorise following set of fonts into one category. Each of them follow totally new type of coding scheme.}}
===Fonts===
Line 166: Line 167:
* [http://www.kalkiweekly.com/ Kalki]. The site [http://www.kalkionline.com/downloads/downloadfonts.asp says] it requires [[#Other standard encodings|TSCII]], but it seems to use Unicode.
* [http://thatstamil.oneindia.in/ Thats Tamil]. (Also listed under [[#Portals|Portals]]. Now part of OneIndia.in.)
 +
* [http://www.thenee.com/ Thenee]. {{si|[[User:Mamandel|Mamandel]] 2010-06-14}}
* [http://www.thinnai.com/ Thinnai].  
-
* [http://www.thinaboomi.com/ Thinaboomi]. Tamil Nadu. Partly in Unicode. '''Malware warning''' <small>''[2010-04-2]''</small>  [http://safebrowsing.clients.google.com/safebrowsing/diagnostic?client=Firefox&hl=en-US&site=http://www.thinaboomi.com/ Google Safe Browsing]
+
* [http://www.thinaboomi.com/ Thinaboomi]. Tamil Nadu. Partly in Unicode. <!-- '''Malware warning''' <small>''[2010-04-2]''</small>  [http://safebrowsing.clients.google.com/safebrowsing/diagnostic?client=Firefox&hl=en-US&site=http://www.thinaboomi.com/ Google Safe Browsing] ###  This site is not currently listed as suspicious. Part of this site was listed for suspicious activity 2 time(s) over the past 90 days. ... The last time Google visited this site was on 2010-05-25, and the last time suspicious content was found on this site was on 2010-03-25. ### mam 100526  -->
* [http://www.uthayan.com/ Uthayan].
* [http://www.vikatan.com/ Vikatan].
Line 193: Line 195:
===Speech===
 +
====Broadcast====
* [http://ta1.chinabroadcast.cn/ China Radio International].
* [http://www.sooriyanradio.com Sooriyan Radio]. Site for a network in Sri Lanka. [[#Nonstandard and idiosyncratic encodings|Nonstandard encoding]], [[Tamil/Nonstandard_encodings#Other_Scheme_Fonts_3_.28Monolingual.29|TM-TTKapilan font]]. Has a "Live Radio" link which may require Internet Explorer and Windows Media Player, and/or may be active only during broadcast hours. May be music only.
 +
 +
====Telephone====
 +
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96S59 CALLFRIEND Tamil]. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. {{hq|The corpus consists of 60 unscripted telephone conversations, lasting between 5-30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).}} {{si|[[User:Mamandel|Mamandel]] 20:45, 1 March 2011 (UTC)}}
 +
* [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006S35 CSLU: Multilanguage Telephone Speech Version 1.2]: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. LDC2006S35. {{hq|The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese.}} Tamil: 149 speakers; 2.82 hours of speech (total). {{si|[[User:Mamandel|Mamandel]] 15:38, 18 May 2011 (UTC)}}
===Video===
Line 208: Line 215:
** [http://www.tamilmanam.net/resources.php resources], apparently including search engine links
** and several other sections whose URLs are in romanized Tamil
-
* [http://www.tamilvu.org/ Tamil Virtual University]. {{hq|An autonomous Institution, established by the Government of Talminadu. ... [[#Other standard encodings|TAB]] to Unicode Conversion is undergoing, core content and part of the lessons are available in Unicode}}. Downloadable fonts. At least some of the headings are in Unicode private use area.
+
* [http://www.tamilvu.org/ Tamil Virtual University]. {{hq|An autonomous Institution, established by the Government of Talminadu. ... [[#Other standard encodings|TAB]] to Unicode Conversion is undergoing, core content and part of the lessons are available in Unicode}}. Downloadable fonts. At least some of the headings are encoded in the Unicode private use area.
* [http://thatstamil.oneindia.in/ Thats Tamil]. Now part of OneIndia.in.
* [http://www.vikatan.com/ Vikatan]. Requires login
* [http://webtamilan.com/ WebTamilan]
*[http://in.tamil.yahoo.com/ Yahoo India in Tamil].
-
*[http://www.jaffnalibrary.com/tools/google.htm "Yarl" Tamil Search Machine]. Claims to use Unicode for Google search, but requires [[#Other standard encodings|TSCII]] input ("Please insert/type your phrase in plain English(ex:-ammaa)"). Also has a [http://www.jaffnalibrary.com/tools/google1.htm page] with searches for "Madurai Project" and "Forum Hub" (portals?) as well as [[#Other standard encodings|"TSCII Font", "Bamini Font", and "Tab Font"]].
+
*[http://www.jaffnalibrary.com/tools/google.htm "Yarl" Tamil Search Machine]. Claims to use Unicode for Google search, but requires [[#Other standard encodings|TSCII]] input ({{hq|Please insert/type your phrase in plain English(ex:-ammaa)}}). Also has a [http://www.jaffnalibrary.com/tools/google1.htm page] with searches for "Madurai Project" and "Forum Hub" (portals?) as well as [[#Other standard encodings|"TSCII Font", "Bamini Font", and "Tab Font"]].
==Tools and Other NLP Resources==

Latest revision as of 15:43, 18 May 2011


தமிழ்


TAMIL



General

Language summary

(Information based on Ethnologue, 2010-02-25)

  • ISO 639-3 code: tam
  • Population: 61,500,000 in India (1997). Population total all countries: 65,675,200.
  • Also spoken in: Malaysia (Peninsular), Mauritius, Réunion, Singapore, Sri Lanka
  • Alternate names: Damulian, Tamal, Tamalsan, Tambul, Tamili
  • Dialects: Adi Dravida, Aiyar, Aiyangar, Arava, Burgandi, Kongar, Madrasi, Madurai, Pattapu Bhasha, Tamil, Sri Lanka Tamil, Malaya Tamil, Burma Tamil, South Africa Tamil, Tigalu, Harijan, Sanketi, Hebbar, Mandyam Brahmin, Secunderabad Brahmin. (See Ethnologue for notes.)
  • Classification: Dravidian, Southern, Tamil-Kannada, Tamil-Kodagu, Tamil-Malayalam, Tamil

In addition, there is diglossia between Literary Tamil (centamil /centamiẓ/) and the various spoken dialects (kotuntamil /koṭuntamiẓ/). Spoken Tamil varies widely with geography and caste. To a certain extent there has arisen a "Standard Spoken Tamil" used by educated people from different regions when they come together, but this is not really standardized.

Linguistic notes

There is only one noun declension, with eight cases and two numbers (singular and plural), all marked by endings. Postpositions are also used. There are two genders, "rational" and "irrational" (roughly, ±human); the rational gender is subdivided into honorific, masculine, and feminine.

Verbs are inflected for person, number, gender (third person only), mood, and tense.

Writing

Tamil is written in a Brahmi-derived script. See also Encoding and Fonts.

Linguistic resources

Overview


Grammar

  • Agesthialingam, S. (1967) A Generative grammar of Tamil. Annamalai University Publications.
  • Asher, R.E. Tamil. (Croom Helm Descriptive Grammar.) Routledge. Reprint edition (April 1989) ASIN: 0415036828 Out of print.
  • Schiffman, Harold F. 1979. A grammar of spoken Tamil. Madras: Christian Literature Society. 108p. Online; requires Tamilnet font, downloadable from site
  • Schiffman, Harold F. 1999. A reference grammar of spoken Tamil. Cambridge: Cambridge University Press. 254p. ISBN 0521640741.

Lexicon

Many of the following are in the Digital South Asia Library of the University of Chicago (DSAL).


Topical word lists

  • Babynology: List of Tamil baby names in Roman transliteration
  • Tamilcube Tamil-Hindi-English word lists (number of entries) [Accessed 2010-03-11]:
    • Indian spices and pulses (71)
    • Indian herbs and plants (807)
    • Indian vegetables (55)
    • Fruits (30)
    • Flowers (43)
    • Birds (21)
    • Animals (42)
    • Fishes (27)


Linguistic portals and bibliographies

  • OLAC list of Resources in and about the Tamil language.
  • Penn Language Center Web Assisted Learning and Teaching of Tamil. (Much of this site requires the Tamilnet font, which apparently uses the TAB encoding; downloadable, with keyboard and other tools.)
  • SIL Bibliography

Encoding and Fonts

Before Unicode was developed and came into general use, computer use of Tamil and other South Asian languages required special fonts using only one byte per character. Many websites used such fonts, often with idiosyncratic encodings. Some still do, including some listed on this page, and many corpora still use such fonts. We therefore list some resources for other encodings, as well as for fonts and encoding conversion.

Encodings

The International Forum for Information Technology in Tamil (INFITT) is a non-governmental organization that appears to have the support of the state of Tamil Nadu as well as of various countries with large Tamil-speaking populations. At the TAMIL INTERNET 2001 conference it decided to recommend that software use the Unicode encoding, but that where an 8-bit encoding is necessary, either the TAB or the TSCII encoding be used.

Unicode

The Unicode range for Tamil is 0B80-0BFF. The Unicode encoding is based on the ISCII Standard. (Unicode Standard, v5.2, p.290)
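
As a rough illustration of the range given above, the following Python sketch checks whether characters fall inside the Tamil block and estimates how much of a string is Tamil-script text. The function names `is_tamil` and `tamil_ratio` are made up for this example and do not come from any tool listed on this page. Note that the block also contains unassigned code points, so this is only a range test, not a validity check.

```python
# Tamil block as stated above: U+0B80 through U+0BFF inclusive.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def is_tamil(ch: str) -> bool:
    """Return True if the single character ch lies in the Tamil block."""
    return ord(ch) in TAMIL_BLOCK


def tamil_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that lie in the Tamil block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_tamil(ch) for ch in chars) / len(chars)


if __name__ == "__main__":
    # Hypothetical usage: a low ratio on a page expected to be Tamil may
    # suggest a legacy 8-bit font encoding rather than Unicode.
    print(tamil_ratio("\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"))
    print(tamil_ratio("Tamil"))
```

A check of this kind may help when sorting sources into "Unicode" and "other encoding", since text in a legacy 8-bit font usually shows up as Latin-range characters rather than code points in this block.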

Other standard encodings

Nonstandard and idiosyncratic encodings

These pages are on a weblog which is no longer maintained as of 2011-05-03, although the pages are still up. They are summarized in this wiki at Tamil/Nonstandard encodings.

  • Nonstandard fonts (there called "Very Special Scheme Fonts")
  • Idiosyncratic fonts (there called "Odd Scheme Fonts"): "I couldnt categorise following set of fonts into one category. Each of them follow totally new type of coding scheme."

Fonts

Resources for fonts

Multiple encodings

Unicode

Non-Unicode

Conversion

  • Murasu Anjal. Software tool suite for creating, editing, converting and publishing Tamil content. Windows and Mac OS X. Converts legacy documents composed in older encodings such as TSCII, TAB and Murasu-6 to Unicode.
  • Padma. A Firefox add-in. "Padma transforms Indic text encoded in proprietary formats (ex: dynamic fonts) automatically to Unicode. Padma also has support for transforming from ISCII and transliteration schemes like ITRANS and RTS (Telugu only)." More details on Padma homepage.
  • Unicodify: From Lancaster University, producers of the Emille corpus. For Windows; ANSI C source code also available.
  • Text conversion from TSCII 1.7 to Unicode. Muthu Nedumaran. 2007. PDF.
  • Visai Tamil 2008. Seems to be an input tool. TAM, TAB, TSCII, Unicode, ASCII; keyboard layouts, spell check; dictionary (135k+ Tamil entries, 65k English)


Data Sources

All these resources use Unicode unless otherwise described.

Monolingual Text

News and magazines

Literature

  • GRETIL Tamil. Göttingen Register of Electronic Texts in Indian Languages. Romanized in various schemes; each document is headed with a table of transliteration.
  • Project Madurai. Etexts of ancient literary works, in TSCII and (since 2004) in Unicode as well. All-volunteer effort. Homepage "last updated on 31 Jan 2005". English description is on lower half of home page; "Homepage in English" link is "File not found". [2010-04-02]
  • Tamil Library. Partly Unicode, partly TAB encoding, with a section of romanized texts; private use area characters for at least some headings. Part of Tamil Virtual University.
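
Private use area characters, as noted for the Tamil Library headings, carry no standard meaning and render correctly only with the site's own font. A minimal Python sketch for spotting them in harvested text follows. It relies on the standard `unicodedata` module, which reports the general category `Co` for private use code points. The function name `private_use_chars` is an assumption made for this example.

```python
import unicodedata


def private_use_chars(text: str) -> list[tuple[int, str]]:
    """List (position, code point) for every private use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = private use
    ]


if __name__ == "__main__":
    # Hypothetical sample: one private use character between two letters.
    sample = "a\ue000b"
    print(private_use_chars(sample))
```

Any hits would need the site's font or a site-specific mapping to be interpreted; the sketch only locates them.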

Blogs

Parallel Text

  • EMILLE corpus. 200,000 words of text in English (information leaflets from the UK Government and various local authorities) with Tamil translation. Free license for non-profit research use.

Speech

Broadcast

Telephone

  • CALLFRIEND Tamil. Alexandra Canavan and George Zipperlen. 1996. LDC96S59. The corpus consists of 60 unscripted telephone conversations, lasting between 5-30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers). [Mamandel 20:45, 1 March 2011 (UTC)]
  • CSLU: Multilanguage Telephone Speech Version 1.2: Yeshwant Muthusamy, Ron Cole, and Beatrice Oshika. 2006. LDC2006S35. The Multilanguage Telephone Speech corpus consists of telephone speech from 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, Vietnamese. Tamil: 149 speakers; 2.82 hours of speech (total). [Mamandel 15:38, 18 May 2011 (UTC)]

Video


Portals

Tools and Other NLP Resources

Morphological analyzer

  • Hunspell. "the default spell checker of OpenOffice.org and Mozilla Firefox 3 & Thunderbird. ... Unicode character encoding, compounding and complex morphology ... Morphological analysis, stemming and generation."
  • TAGTAMIL part-of-speech tagger cum spell checker. (Listed under "MSDOS", not "Windows 3.x")

Input in older encodings

  • Adhawin. Input tool: type in romanization, output in Adhawin 8-bit font (included). Windows.
  • Anjal text editor for Windows.

