Urdu is the mother tongue of 80 million people, most of whom live in Pakistan and India. It is the national language of Pakistan and one of the officially recognized languages of India, where it is also an official language of several states, such as Uttar Pradesh and Indian-controlled Jammu and Kashmir. It is spoken and understood in many countries of Asia, Africa and Europe, and large numbers of speakers also live in the USA and Australia. In total, more than 200 million people can speak and understand Urdu. By number of native speakers, it ranks 20th among the world's languages.

Urdu belongs to the Indo-Aryan family and is thus an Indo-European language. It developed into its present form under the strong influence of Persian, Arabic, Turkish, Punjabi and other indigenous languages of the Indian subcontinent during the Delhi Sultanate and the Mughal Empire, from the 12th to the 18th century.

Features of Urdu Script

Urdu is written in a derivative of the Persian alphabet, which is itself a derivative of the Arabic alphabet. It is read from right to left. The shape a character assumes within a word is context sensitive. The script contains 40 characters, 17 diacritical marks, 10 digits, etc. In Unicode, Arabic and the related languages written in the same script, such as Urdu, Punjabi and Sindhi, have been allocated 1,200 code points (0600h - 06FFh, FB50h - FEFFh).
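As a quick check of the figures above, the following Python sketch counts the code points in the two quoted ranges and confirms that Bey (assumed here to be U+0628, ARABIC LETTER BEH) falls inside them and carries a right-to-left bidirectional class. The names ARABIC_RANGES, range_size and in_arabic_ranges are illustrative and not part of any standard.

```python
import unicodedata

# The two ranges quoted in the text, as inclusive (start, end) pairs.
ARABIC_RANGES = [(0x0600, 0x06FF), (0xFB50, 0xFEFF)]


def range_size(start, end):
    """Number of code points in an inclusive range."""
    return end - start + 1


def in_arabic_ranges(ch):
    """True if the character's code point lies in one of the quoted ranges."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ARABIC_RANGES)


total = sum(range_size(start, end) for start, end in ARABIC_RANGES)
bey = "\u0628"  # assumption: Bey is encoded as ARABIC LETTER BEH

print(total)                           # 256 + 944 = 1200
print(in_arabic_ranges(bey))           # True
print(unicodedata.bidirectional(bey))  # 'AL' (right-to-left, Arabic letter)
```

The bidirectional class 'AL' is what tells a rendering engine to lay the character out from right to left.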

The distinguishing characteristics of Urdu are discussed here for the benefit of readers unfamiliar with the language. Urdu is greatly influenced by Arabic and Persian. Figure 1 shows the Urdu alphabet.

Figure 1: Alphabet of Urdu

Further, the shape assumed by a character in a word is context sensitive, i.e. the shape differs depending on whether the character appears at the beginning, in the middle or at the end of the word. This gives three shapes; the fourth is the independent shape of the character. Figure 2 shows these four shapes for the character Bey.

Figure 2: Context sensitive Shapes of Bey
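The four shapes of Bey can also be looked up in the Unicode character database. The sketch below uses Python's standard unicodedata module and assumes that Bey corresponds to U+0628 (ARABIC LETTER BEH); it scans the Arabic Presentation Forms-B block for glyph variants that decompose to that letter. The function name presentation_forms is illustrative. Unicode calls the independent shape "isolated" and the middle shape "medial".

```python
import unicodedata


def presentation_forms(base):
    """Map shape name -> code point for the positional forms of a letter.

    Scans U+FE70..U+FEFF and keeps entries whose compatibility
    decomposition is '<shape> XXXX' with XXXX equal to the base letter.
    """
    target = f"{ord(base):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = cp
    return forms


bey_forms = presentation_forms("\u0628")  # assumption: Bey is U+0628
for shape, cp in bey_forms.items():
    print(f"{shape:9s} U+{cp:04X} {chr(cp)}")
```

In practice, text is stored using the base letter only, and the rendering engine selects the shape; the presentation-form code points exist mainly for compatibility.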

To be precise, the above is true of all but eleven characters. Ten of these have only two shapes, the independent shape and the final shape; they are shown in Figure 3. Such a character takes its independent shape at the beginning of a word and its final shape in the middle or at the end of a word.

Figure 3: Characters having only Independent and Final Shape
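The positional rules can be sketched as a small shape-selection routine. The Python example below is a simplified illustration of standard Arabic-script joining logic, not a full shaping engine: it assumes the input contains only letters (no diacritics, spaces or Hamza), and the set TWO_SHAPE_LETTERS is an illustrative subset (Alif, Dal, Re, Wao), not the complete list in Figure 3. One refinement over the summary above: a letter that follows a two-shape letter has nothing to join to on its right, so it starts a new joined group.

```python
# Illustrative subset of letters that have only independent and final shapes.
TWO_SHAPE_LETTERS = {"\u0627", "\u062F", "\u0631", "\u0648"}  # Alif, Dal, Re, Wao


def contextual_forms(word):
    """Return the shape name chosen for each letter of the word."""
    forms = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        # A letter joins to the previous one only if that one can join forward.
        joins_prev = i > 0 and word[i - 1] not in TWO_SHAPE_LETTERS
        # A letter joins to the next one only if it can join forward itself.
        joins_next = i < last and ch not in TWO_SHAPE_LETTERS
        if joins_prev and joins_next:
            forms.append("middle")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("independent")
    return forms


kitab = "\u06A9\u062A\u0627\u0628"  # Kaf, Te, Alif, Bey
print(contextual_forms(kitab))
```

For this word the routine selects the initial shape for Kaf, the middle shape for Te, the final shape for Alif and, because Alif cannot join forward, the independent shape for the closing Bey.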

Hamza never appears at the beginning of a word, but it can appear at the beginning of a ligature. In addition, it takes the independent shape rather than the final shape at the end of a word. As a result, it has initial, middle and independent shapes, as illustrated in Figure 4.

Figure 4: Shapes of Hamza (circled, right to left): Independent, Initial and Middle

Urdu is traditionally written in Nastaleeq, a script rich in calligraphic content. Because of the complexities of rendering Nastaleeq, the basic shapes identified above are not sufficient to render the language in an acceptable form. Urdu characters also need diacritics to indicate the correct pronunciation of a word. Diacritics appear above or below a character to define a vowel or to emphasize a particular sound. These diacritical marks are the basis of the vowel system in Urdu. There are a number of diacritics, the most common being Zabar, Zer and Pesh. Figure 5 shows the character Bey marked with these diacritics.

Figure 5: Bey with Diacritics
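In encoded text, a diacritic is stored as a separate combining character placed after its base letter. The Python sketch below builds the three combinations of Figure 5, assuming the usual Unicode equivalents: Bey as U+0628, Zabar as U+064E (FATHA), Zer as U+0650 (KASRA) and Pesh as U+064F (DAMMA). The helper name mark is illustrative.

```python
import unicodedata

BEY = "\u0628"  # assumption: Bey is ARABIC LETTER BEH

# Assumed mapping from the Urdu names to Unicode combining marks.
DIACRITICS = {
    "Zabar": "\u064E",  # written above the letter
    "Zer": "\u0650",    # written below the letter
    "Pesh": "\u064F",   # written above the letter
}


def mark(base, diacritic):
    """Attach a diacritic by appending it after the base letter."""
    return base + diacritic


for urdu_name, d in DIACRITICS.items():
    marked = mark(BEY, d)
    # Category 'Mn' (nonspacing mark) means the mark takes no width of its own.
    print(urdu_name, marked, unicodedata.name(d), unicodedata.category(d), len(marked))
```

Each marked form is two code points long even though it displays as a single character cell.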

Figure 6 shows Urdu text in Nastaleeq script with diacritics placed on the respective characters.

Figure 6: Urdu Text in Nastaleeq

Diacritics, though part of the language, are used sparingly. They are nevertheless essential for resolving ambiguities, and for natural language processing and speech synthesis.
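Because most text omits diacritics, software that compares or searches Urdu text often has to treat marked and unmarked spellings as the same string. The following Python sketch shows one possible approach under simplifying assumptions: it strips only the three marks named above (assumed to be U+064E, U+0650 and U+064F) and leaves all other characters untouched. The function names are illustrative.

```python
# Assumed code points for Zabar, Zer and Pesh; other marks are not handled.
SHORT_VOWEL_MARKS = {"\u064E", "\u0650", "\u064F"}


def strip_diacritics(text):
    """Remove the three short-vowel marks, keeping every other character."""
    return "".join(ch for ch in text if ch not in SHORT_VOWEL_MARKS)


def same_skeleton(a, b):
    """True if two strings are identical once the marks are removed."""
    return strip_diacritics(a) == strip_diacritics(b)


bey_zabar = "\u0628\u064E"  # Bey with Zabar
bey_zer = "\u0628\u0650"    # Bey with Zer

print(strip_diacritics(bey_zabar) == "\u0628")
print(same_skeleton(bey_zabar, bey_zer))
```

Two differently marked strings collapsing to the same unmarked form is exactly the ambiguity that diacritics resolve, which is why a speech synthesizer cannot simply discard them.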


