South Asia comprises Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan and Sri Lanka. Southeast Asia, on the other hand, consists of Brunei, Burma, Cambodia, East Timor, Indonesia, Laos, Malaysia, the Philippines, Singapore, Thailand and Vietnam. Table 1 gives the population and the number of living languages in South and Southeast Asia.
|Source (Lewis, 2009)|
|Table 1: Population and Number of Living Languages of South and Southeast Asia|
The 2241 languages described in Table 1 belong to different language families such as Indo-Aryan, Indo-Iranian, Dravidian, Sino-Tibetan, Austro-Asiatic, Kradai and Hmong-Mien. Together, South Asia and Southeast Asia account for 34.94% of the total population of the world. Some of the languages of these regions have a large number of native speakers: Hindi (5th largest by number of native speakers), Bengali (6th), Punjabi (12th), Tamil (18th), Urdu (20th), etc.
These languages are under-resourced, yet their words show rich morphological variation. Moreover, they are often heavily agglutinative and synthetic, which makes segmentation an important issue. The intellectual motivation for this workshop comes from the need to explore ways of harnessing the morphology of these languages for higher-level processing. For these languages, however, morphological analysis is intimately linked with segmentation.
South Asia is very rich in linguistic diversity. South Asian languages belong to various language families and are rich in their morphology. All Dravidian languages, such as Tamil, Kannada and Telugu, are agglutinative. Hindi exhibits conjoining, i.e. two words are joined together to form another word: राजा [rɑʤɑ] (king) and ईश्वर [iʃvər] (God) can be combined to build the new word राजेश्वर [rɑʤeʃvər] (proper noun, the name of a person). In this example, the last vowel of the first word ा [ɑ] and the first vowel of the second word ई [i] are replaced by another vowel े [e], a phonological process called sandhi. Reduplication is another phenomenon found in languages like Hindi and Urdu, e.g., (Hindi) छोटे छोटे (Urdu) چھوٹے چھوٹے [ʧʰoʈe ʧʰoʈe] (multiple small items). The inflectional categories for verbs in most South Asian languages are quite complex. In addition to tense, aspect, mood, modality, voice, and honorificity, verbs in these languages are also marked for gender, number, person and honor to show agreement with the subject or object. Tamil, a Dravidian language, has very complex morphology. Consider, for example, Tamil paaṭikkoNṭiruntavanai (பாடிக்கொண்டிருந்தவனை), meaning "the person (masc.) who was singing" in object form:
|paaṭu||i||koNṭiru||nt||a||avan||ai|
|Sing||Verbal participle||Auxiliary verb||Past tense marker||Relative participle marker||Pronoun – 3rd person, singular, masculine||Accusative marker|
Here, we start with a verbal root. The auxiliary verb 'koNṭiru' denotes the continuous aspect. The 3rd person singular masculine pronoun 'avan' agglutinates, and the accusative suffix attaches to it. The above examples show that South Asian languages are morphologically very rich and that developing computational models of them is highly complex. Since it is not advisable to store all possible forms in a lexicon (though one can attempt to do so), one must exploit the regularities and irregularities of word formation and build processing techniques that automatically derive morphological rules from a given corpus.
Word segmentation is another challenging problem for South Asian languages. In Dravidian languages this task is very complex. The same holds for South Asian languages written in scripts derived from Perso-Arabic, such as Punjabi, Sindhi and Urdu. Marathi, a member of the Indo-European family, incorporates many morphological phenomena from Dravidian and Arabic languages and is often considered an agglutinative language. Like Turkish, Hungarian and Finnish, it stacks a large number of affixes (up to nine) together. See the example below:
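As a rough illustration of what a single hand-written morphological rule looks like, the Hindi sandhi example given earlier (राजा + ईश्वर → राजेश्वर) can be sketched in Python. The function name `join_with_sandhi` and the constants are hypothetical, and the sketch covers only this one vowel-merging rule (final ा followed by initial इ or ई becomes े); it is not a general sandhi engine.

```python
# Minimal sketch of one sandhi rule, assuming Devanagari input in Unicode.
# All names here are illustrative, not part of any existing library.

AA_MATRA = "\u093e"              # dependent vowel sign ा
I_VOWELS = ("\u0907", "\u0908")  # independent vowels इ and ई
E_MATRA = "\u0947"               # dependent vowel sign े


def join_with_sandhi(first: str, second: str) -> str:
    """Join two words, merging final ा with initial इ/ई into े."""
    if first.endswith(AA_MATRA) and second.startswith(I_VOWELS):
        # Drop the final ा and the initial vowel, insert े in their place.
        return first[:-1] + E_MATRA + second[1:]
    # No rule applies in this sketch: fall back to plain concatenation.
    return first + second


print(join_with_sandhi("राजा", "ईश्वर"))  # राजेश्वर
```

Even this single rule has to be stated in terms of script-level units (vowel signs versus independent vowels), which hints at why learning such rules automatically from a corpus is attractive.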
|Marathi: घरासमोरचा बंगला||Hindi: घर के सामने का बँगला||House of-GEN front of bungalow|
|(Bungalow in front of the house)|
Marking the segment boundaries becomes difficult when a large number of suffixes are stacked together. One might think of segmenting घरासमोरचा as घरासमोर-चा ((front of the house)-of) or as घर-ा-समोर-चा (in-front of-(the house)).
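The ambiguity described above can be made concrete with a small sketch that enumerates every way of splitting a word into entries of a morpheme lexicon. The lexicon below is a hypothetical toy list built only from the pieces named in the example, and `segmentations` is an illustrative helper, not an established tool.

```python
# Minimal sketch: exhaustive lexicon-based segmentation.
# LEXICON is a toy assumption containing only the units from the example.

LEXICON = {"घर", "ा", "समोर", "चा", "घरासमोर"}


def segmentations(word: str, lexicon: set[str]) -> list[list[str]]:
    """Return every split of `word` into consecutive lexicon entries."""
    if not word:
        return [[]]  # one way to segment the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            # Extend each segmentation of the remainder with this prefix.
            for rest in segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


for pieces in segmentations("घरासमोरचा", LEXICON):
    print("-".join(pieces))
# घर-ा-समोर-चा
# घरासमोर-चा
```

Both analyses from the text are returned, and nothing in the lexicon alone says which is preferable; choosing between them requires additional linguistic or statistical knowledge.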
Southeast Asian languages share a number of morphological rules with the South Asian ones. Deeply influenced by Sanskrit and Pali, their vocabulary often includes words from those Indic languages (e.g. राजा [rɑʤɑ] becomes ราชา [rɑ:sɑ:] in Thai or ລາດຊະ [lɑ:tsɑ; formerly ຣາຊ / rɑ:tsɑ] in Lao), though often modified by aphaeresis or apocope. Borrowed terms also often follow the derivation rules of the source languages (Indic derivation). However, large parts of the Southeast Asian vocabulary are not Indic-derived and follow other morphological rules (Khmer derivation, Thai-Lao derivation, square composition, etc.) that should also be modeled computationally.
Because morphology and segmentation are fundamental to NLP, the current workshop is proposed with the goals of: