Inconsistent Character Mapping In Nepali Language

It’s well known that keyboards were designed for the English language. As computers spread to other uses, people started typing other languages on them too. Here, I’m going to talk about the Nepali language and how its poor keyboard design has caused various problems.

(I might use “Devanagari script” and “Nepali language” interchangeably.)

Whenever a keyboard layout or a mapping to Unicode is designed, it should follow two basic rules:

Easy to understand/intuitive
Consistent

Well, Nepali typing has neither.

I am not going to talk about what a keyboard layout should look like, because optimizing the layout is a distant goal when we haven’t even got the basics right.

Easy to understand/intuitive.

First and foremost, anyone who knows Nepali should feel at home when typing. They should not be left thinking, “Well, that doesn’t make sense.”

I am going to explain my points with the examples below.

Let’s take a character:

आ = अ + ा

Well, that’s fine, and we pronounce it similarly to how we write it.

Let’s take another character:

ओ = अ + ा + े

I’m pretty sure that anyone who speaks Nepali would agree that we don’t pronounce ओ as the combination of those three characters. We pronounce it as a single vowel, like an English ‘O’. The three-character sequence (अ, ा, े) and the single character (ओ) are rendered identically in many places, but they are not stored as the same thing. They are fundamentally different character sequences that merely look the same.

Let me demonstrate my point with a simple Python script.
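A minimal sketch of such a check, assuming the two spellings are entered as explicit code points so that no editor can silently change them (`code_points` is a helper name made up for this example):

```python
import unicodedata

# Single code point: DEVANAGARI LETTER O (ओ).
single = "\u0913"
# Three separate code points: अ (U+0905) + ा (U+093E) + े (U+0947).
typed = "\u0905\u093e\u0947"


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(single, code_points(single))
print(typed, code_points(typed))
print("Equal?", single == typed)  # False

# Unicode defines no canonical composition for this sequence,
# so NFC normalization leaves the typed form unchanged.
print("Equal after NFC?", unicodedata.normalize("NFC", typed) == single)  # False
```

The last line suggests that standard Unicode normalization alone does not merge the two forms, which is why a hand-written mapping is needed.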

As we can see, even though they look the same, the two are not the same characters.

Consistent:

My second point was that typing should be consistent. Some mappings might not be intuitive, but if they were consistent, we would learn them over time and they would eventually stick in our memory.

Well, Nepali typing is not consistent either.

Typing क + ा + े together gives काे in one editor. In another editor, the same keystrokes give something else entirely, as I showed in the example above (typed in VSCode).

So, even between two Markdown editors there is no consistency.

Someone could type an entire document with one key mapping, and when it is opened under another key mapping, the whole thing could fail to render properly.

(The same thing happened to me.)
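To find out which form a given document actually contains, one option is to scan it for the two-sign pairs. This is only a sketch: `count_split_signs` and `SPLIT_SIGNS` are hypothetical names, and the pairs listed are just the ones discussed in this post.

```python
# Pairs of vowel signs that can be typed in place of a single sign:
# ा (U+093E) + े (U+0947) looks like ो (U+094B), and
# ा (U+093E) + ै (U+0948) looks like ौ (U+094C).
SPLIT_SIGNS = ("\u093e\u0947", "\u093e\u0948")


def count_split_signs(text):
    """Count how often each split vowel-sign pair occurs in text."""
    return {pair: text.count(pair) for pair in SPLIT_SIGNS}


split_form = "\u0915\u093e\u0947"   # क + ा + े (three code points)
single_sign_form = "\u0915\u094b"   # क + ो (two code points)

print(count_split_signs(split_form))
print(count_split_signs(single_sign_form))
```

A non-zero count means the text was typed with the split form somewhere.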

Where does it matter:

For most people, a document being unreadable at a glance is bad enough. You’d have to find the font it was originally typed in, convert the entire document to that font, and pray that the software you’re using renders your text properly.

In my case there was another problem.

I mostly work with neural networks and their training. I have to map words to numbers before feeding them to a network. Typing out a large Nepali dataset myself is not something I’m fond of, so my data comes from multiple input sources, which increases the inconsistency.

For the same image or audio, my model would sometimes be trained (backpropagated) against one character sequence and sometimes against another. And that is a problem. For example, if the text was written as:

	मेरो नाम संगम हो।
	Is it
	[म, े, र, ो, न, ा, म, स, ं, ग, म, ह, ो]
	or is it
	[म, े, र, ा, े, न, ा, म, स, ं, ग, म, ह, ा, े]

And which one is the correct mapping?
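To make the effect on training data concrete, here is a toy character-to-index vocabulary. It is an illustration only, not the actual pipeline: `build_vocab` and `encode` are made-up helpers, and the two strings are the first word of the sentence above in its two spellings.

```python
# मेरो typed with the single sign ो (U+094B).
spelling_a = "\u092e\u0947\u0930\u094b"
# मेरो typed as ा (U+093E) followed by े (U+0947).
spelling_b = "\u092e\u0947\u0930\u093e\u0947"


def build_vocab(texts):
    """Assign each distinct character an index in order of first appearance."""
    vocab = {}
    for text in texts:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab):
    """Map every character of text to its vocabulary index."""
    return [vocab[ch] for ch in text]


vocab = build_vocab([spelling_a, spelling_b])
print(encode(spelling_a, vocab))  # [0, 1, 2, 3]
print(encode(spelling_b, vocab))  # [0, 1, 2, 4, 1]
```

The same visible word turns into two label sequences of different length, so the model would see two different targets for one input.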

Based on nothing more than the fact that I am Nepali, I decided to take the mapping that sounds correct (literally), i.e. the first mapping.

First, I had to divide the characters into various types:

Consonants
Independent Vowels
Dependent Vowels

(Everything that is not an independent vowel or a dependent vowel is treated as a consonant.) So,

independent_vowels = ["अ", "आ", "इ", "ई", "उ", "ऊ", "ए", "ऐ", "ओ", "औ"]

dependent_vowels = ["ा", "ि", "ी", "ु", "ू", "े", "ै", "ो", "ौ"]
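Since look-alike sequences are the whole problem here, it may be worth checking that every entry in these lists is a single code point of the expected kind. The sketch below rebuilds the two lists from explicit code points and cross-checks them against Python's `unicodedata` names; `is_vowel_sign` is a hypothetical helper.

```python
import unicodedata

# The same characters as in the lists above, written as code points.
independent_vowels = [chr(cp) for cp in (
    0x0905, 0x0906, 0x0907, 0x0908, 0x0909, 0x090A, 0x090F, 0x0910, 0x0913, 0x0914)]
dependent_vowels = [chr(cp) for cp in (
    0x093E, 0x093F, 0x0940, 0x0941, 0x0942, 0x0947, 0x0948, 0x094B, 0x094C)]


def is_vowel_sign(ch):
    """True if Unicode names the character as a (dependent) vowel sign."""
    return "VOWEL SIGN" in unicodedata.name(ch, "")


for ch in independent_vowels + dependent_vowels:
    # Each entry must be exactly one code point, or lookups will never match.
    assert len(ch) == 1
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), is_vowel_sign(ch))
```

Every dependent vowel should report `True` and every independent vowel `False`.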

Further,

Independent vowels can combine with dependent vowels.
Dependent vowels can combine with each other.

dependent_vowels_comb = {
    ('ा', 'े'): 'ो', 
    ('ा', 'ै'): 'ौ',
}

independent_vowels_comb = {
    ('अ', 'ा'): 'आ',
    ("अ", 'ो'):'ओ',
    ('अ', 'ौ'):'औ',
    ('ए', 'े'):'ऐ',
    ('आ', 'े'):'ओ',
    ('आ', 'ै'):'औ',
}

Now, all that was left was to write an algorithm that applies these rules.

So, I created a character type dict.

character_type_dict = {
    "independent_vowel":1,
    "dependent_vowel":2,
    "consonant":3,
}

First, I would check the type of the character. So I created a function to do that.

def check_type(character):
    """For any Given Character (Presumably a nepali character) Check the type of the character.

    Args:
        character (str): input character (string because python doesn't have a char datatype)

    Returns:
        character_type(int): returns the type of character that it was found.
    """
    if(character in independent_vowels):
        return character_type_dict["independent_vowel"]
    elif (character in dependent_vowels):
        return character_type_dict["dependent_vowel"]
    return character_type_dict["consonant"]

I created a function to join two characters if possible.

def combine(char1, char2):
    """
        For two given strings of single Nepali characters, tries to combine the two characters based on the rules of the Nepali language.

    Args:
        char1 (str): single nepali character
        char2 (str): single nepali character
        
    Returns:
        combined_character (str): combined character if the two characters can be combined, else returns the first character
        combination_status: True if the two characters were combined, else False
    """
    type_char1 = check_type(char1)
    type_char2 = check_type(char2)
    # Only a dependent vowel can be merged into the character before it.
    if type_char2 != character_type_dict["dependent_vowel"]:
        return char1, False
    if type_char1 == character_type_dict["independent_vowel"]:
        rules = independent_vowels_comb
    elif type_char1 == character_type_dict["dependent_vowel"]:
        rules = dependent_vowels_comb
    else:
        return char1, False
    # Pairs without a rule are left alone instead of raising a KeyError.
    if (char1, char2) in rules:
        return rules[(char1, char2)], True
    return char1, False

Finally, a function to correct a word that was mistyped (from my perspective):

def correct_mistyped_word(word):
    """For a given word, tries to correct the mistakes in the word. That is, it tries to combine the characters based on the Nepali language.

    Args:
        word (string): Nepali Word

    Returns:
        final_word: Nepali word with the mistakes corrected
    """
    # An empty word has nothing to correct (and word[0] would fail).
    if not word:
        return ""
    final_word = ""
    current_char = word[0]
    for next_char in word[1:]:
        # Try to merge the next character into the current one.
        current_char, combination_status = combine(current_char, next_char)
        if not combination_status:
            # No merge: the current character is final, move on to the next.
            final_word += current_char
            current_char = next_char
    # Flush the last (possibly combined) character.
    return final_word + current_char

Call the function correct_mistyped_word() and you’re good to go.
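For a quick, self-contained check of the same rules, here is a compact sketch that applies them as string replacements. It is a stand-in written for illustration, not the function above: `RULES` and `normalize_text` are made-up names, and the rule list simply restates the two combination tables.

```python
# (split sequence, single character) pairs, taken from the tables above.
RULES = [
    ("\u093e\u0947", "\u094b"),  # ा + े -> ो
    ("\u093e\u0948", "\u094c"),  # ा + ै -> ौ
    ("\u0905\u093e", "\u0906"),  # अ + ा -> आ
    ("\u0905\u094b", "\u0913"),  # अ + ो -> ओ
    ("\u0905\u094c", "\u0914"),  # अ + ौ -> औ
    ("\u090f\u0947", "\u0910"),  # ए + े -> ऐ
    ("\u0906\u0947", "\u0913"),  # आ + े -> ओ
    ("\u0906\u0948", "\u0914"),  # आ + ै -> औ
]


def normalize_text(text):
    """Apply the rules repeatedly until the text stops changing."""
    previous = None
    while previous != text:
        previous = text
        for split, single in RULES:
            text = text.replace(split, single)
    return text


# The example sentence typed with ा + े in place of ो ...
mistyped = ("\u092e\u0947\u0930\u093e\u0947 \u0928\u093e\u092e "
            "\u0938\u0902\u0917\u092e \u0939\u093e\u0947\u0964")
# ... and the same sentence typed with the single sign ो.
expected = ("\u092e\u0947\u0930\u094b \u0928\u093e\u092e "
            "\u0938\u0902\u0917\u092e \u0939\u094b\u0964")

print(normalize_text(mistyped))
print(normalize_text(mistyped) == expected)  # True
```

Every replacement shortens the text, so the loop always terminates. Text that already uses the single signs passes through unchanged.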

This might not be the best approach, but I had to get my work done and I am no grammarian.

Maybe there are proper standard rules for how the characters should be mapped. Who knows!

And yeah, feel free to copy the code and do anything you want with it.

Sangam Khanal
