Chair of the Workshop

Aravind K. Joshi

Introduction and Motivation

South Asia comprises Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan and Sri Lanka. Southeast Asia consists of Brunei, Burma, Cambodia, East Timor, Indonesia, Laos, Malaysia, the Philippines, Singapore, Thailand and Vietnam. Table 1 lists the population and the number of living languages of each country in the two regions.

Sr.  Country      Population     Living Languages
1    India        1,134,403,000  438
2    Indonesia    248,496,420    719
3    Pakistan     158,081,000    72
4    Bangladesh   153,281,000    42
5    Vietnam      85,029,000     106
6    Philippines  84,566,000     171
7    Thailand     63,003,000     74
8    Burma        47,967,000     111
9    Nepal        27,094,000     124
10   Malaysia     25,653,000     137
11   Sri Lanka    19,094,000     7
12   Cambodia     13,511,970     23
13   Afghanistan  12,164,970     52
14   Singapore    4,327,000      21
15   Laos         2,796,000      84
16   East Timor   1,067,000      19
17   Bhutan       637,000        25
18   Brunei       374,000        15
19   Maldives     359,000        1
     Total        2,081,904,360  2241
Source: Lewis (2009)
Table 1: Population and Number of Living Languages of South and Southeast Asia

The 2241 languages counted in Table 1 belong to different language families, such as Indo-Aryan, Indo-Iranian, Dravidian, Sino-Tibetan, Austro-Asiatic, Kradai and Hmong-Mien. Together, South and Southeast Asia account for 34.94% of the world's population. Some of the languages of these regions have a large number of native speakers: Hindi (5th largest by number of native speakers), Bengali (6th), Punjabi (12th), Tamil (18th), Urdu (20th), etc.

These languages are under-resourced, yet their words show rich morphological variation. Moreover, many of them are heavily agglutinative and synthetic, which makes segmentation an important issue. The intellectual motivation for this workshop comes from the need to explore ways of harnessing the morphology of these languages for higher-level processing. For these languages, however, the task of morphological analysis is intimately linked with segmentation.

Morphology and Segmentation

South Asia is linguistically very diverse: its languages belong to several language families and are morphologically rich. All Dravidian languages, such as Tamil, Kannada and Telugu, are agglutinative. Hindi exhibits conjoining, in which two words are joined to form a new word; e.g., राजा [rɑʤɑ] (king) and ईश्वर [iʃvər] (God) combine to form राजेश्वर [rɑʤeʃvər] (a proper noun, the name of a person). In this example, the last vowel of the first word, ा [ɑ], and the first vowel of the second word, ई [i], are replaced by a single vowel, े [e], through a phonological process called sandhi. Reduplication is another phenomenon found in languages such as Hindi and Urdu, e.g., (Hindi) छोटे छोटे, (Urdu) چھوٹے چھوٹے [ʧʰoʈe ʧʰoʈe] (multiple small items).

The inflectional categories for verbs in most South Asian languages are quite complex. In addition to tense, aspect, mood, modality, voice and honorificity, verbs in these languages are also marked for gender, number, person and honor to show agreement with the subject or object. Tamil, a Dravidian language, has very complex morphology. Consider, for example, the Tamil form paaṭikkoNṭiruntavanai (பாடிக்கொண்டிருந்தவனை), meaning 'the person (masc.) who was singing' in its object form:

Morph      Gloss
paaṭu      Sing
i          Verbal participle
koNṭiru    Auxiliary verb
nt         Past tense marker
a          Relative participle marker
avan       Pronoun (3rd person, singular, masculine)
ai         Accusative marker

Here, we start with a verbal root. The auxiliary verb 'koNṭiru' denotes the continuous aspect. The 3rd person singular masculine pronoun 'avan' is then agglutinated, and the accusative suffix attaches to it. These examples show that South Asian languages are morphologically very rich and that modelling them computationally is challenging. Since storing every possible form in a lexicon is not advisable (though one can attempt it), one must exploit the regularities and irregularities of word formation and build techniques that automatically derive morphological rules from a given corpus.
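The Hindi and Tamil examples can be made concrete in a few lines of Python. This is a minimal sketch under stated assumptions, not an analyzer: `join_with_sandhi`, `TAMIL_ANALYSIS` and `naive_concat` are hypothetical names introduced here, the sandhi function encodes only the single vowel merger described in the text, and the morph list is transcribed from the Tamil analysis. It shows that simply concatenating the morphs does not yield the attested surface form, which is one reason a lexicon of stems and affixes alone is insufficient.

```python
# Toy illustration only: the rule and the morph list are transcribed from
# the examples in the text; nothing here is a general-purpose analyzer.

VOWEL_SIGN_AA = "\u093e"  # Devanagari vowel sign AA (word-final vowel of the first word)
LETTER_II = "\u0908"      # Devanagari letter II (word-initial vowel of the second word)
VOWEL_SIGN_E = "\u0947"   # Devanagari vowel sign E (the merged vowel)


def join_with_sandhi(first: str, second: str) -> str:
    """Join two words, applying the one vowel merger described in the text.

    Any other pair of words is simply concatenated; that fallback is an
    assumption made for brevity, not a claim about Hindi.
    """
    if first.endswith(VOWEL_SIGN_AA) and second.startswith(LETTER_II):
        return first[:-1] + VOWEL_SIGN_E + second[1:]
    return first + second


# Morph-by-morph analysis of the Tamil form, as (morph, gloss) pairs.
TAMIL_ANALYSIS = [
    ("paaṭu", "sing"),
    ("i", "verbal participle"),
    ("koNṭiru", "auxiliary verb"),
    ("nt", "past tense marker"),
    ("a", "relative participle marker"),
    ("avan", "pronoun, 3rd person singular masculine"),
    ("ai", "accusative marker"),
]
TAMIL_SURFACE = "paaṭikkoNṭiruntavanai"


def naive_concat(analysis):
    """Concatenate the morphs with no boundary adjustments at all."""
    return "".join(morph for morph, _gloss in analysis)


if __name__ == "__main__":
    print(join_with_sandhi("राजा", "ईश्वर"))
    print(naive_concat(TAMIL_ANALYSIS))
    # The naive string differs from the attested surface form.
    print(naive_concat(TAMIL_ANALYSIS) == TAMIL_SURFACE)
```

The mismatch in the last line comes from changes at morph boundaries, which any rule-based or corpus-derived model would have to capture in addition to the morph inventory itself.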

Word segmentation is another challenging problem for South Asian languages. The task is very complex in Dravidian languages, and equally so in South Asian languages written in scripts derived from the Perso-Arabic script, such as Punjabi, Sindhi and Urdu. Marathi, a member of the Indo-European family, incorporates many morphological phenomena from Dravidian and Arabic languages and is often considered an agglutinative language[1]. Like Turkish, Hungarian and Finnish, it can stack a large number of affixes (up to nine) together. See the example below:

Marathi: घरासमोरचा बंगला
Hindi: घर के सामने का बँगला
Gloss: House of-GEN front of bungalow
(Bungalow in front of the house)

Marking the segment boundaries becomes difficult when a large number of suffixes are stacked together. One might think of segmenting घरासमोरचा as घरासमोर-चा ((front of the house)-of) or as घर-ा-समोर-चा (in-front of-(the house)).
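The ambiguity can be reproduced with a small exhaustive segmenter. The sketch below is illustrative only: `segmentations` and `TOY_MORPHS` are hypothetical names, and the lexicon contains just the pieces named in the text rather than a real Marathi morph inventory. It works on Unicode code points, so the dependent vowel sign ा is treated as a separate unit, mirroring the hyphenation used above.

```python
from typing import List, Set


def segmentations(word: str, morphs: Set[str]) -> List[List[str]]:
    """Return every way to split `word` into a sequence of items from `morphs`."""
    if not word:
        return [[]]  # one way to segment the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in morphs:
            # Extend each segmentation of the remainder with this prefix.
            for rest in segmentations(word[end:], morphs):
                results.append([prefix] + rest)
    return results


# Hypothetical toy lexicon: only the pieces mentioned in the text.
# "\u093e" is the Devanagari vowel sign AA, written on its own.
TOY_MORPHS = {"घर", "\u093e", "समोर", "चा", "घरासमोर"}

for seg in segmentations("घरासमोरचा", TOY_MORPHS):
    print("-".join(seg))
```

With this toy lexicon the function returns both analyses discussed above. Choosing between them needs information beyond the lexicon, such as morphotactic constraints or corpus statistics, and plain enumeration can grow exponentially with word length on a realistic lexicon.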

Southeast Asian languages share a number of morphological rules with South Asian ones. Deeply influenced by Sanskrit and Pali, their vocabularies often include words from those Indic languages (e.g. राजा [rɑʤɑ] becomes ราชา [rɑ:sɑ:] in Thai or ລາດຊະ [lɑ:tsɑ; formerly ຣາຊ / rɑ:tsɑ] in Lao), often modified by aphaeresis or apocope. Borrowed terms also often follow the derivation rules of the source languages (Indic derivation). However, large parts of the Southeast Asian vocabulary are not Indic-derived and follow other morphological rules (Khmer derivation, Thai-Lao derivation, square composition, etc.) that should also be modelled computationally.

Since morphology and segmentation are fundamental to NLP, the current workshop is proposed with the following goals:

  1. Providing a platform for the linguistic and NLP communities to share and discuss ideas and work on South and Southeast Asian languages, and to combine efforts.
  2. Developing useful, high-quality computational resources for under-resourced South and Southeast Asian languages.
  3. Developing a well-connected and motivated NLP community in the region.