We use auto correct every day on our cell phone and computer. It has become an integral part of our daily typing experience to such an extent that we no longer see it as a feature. But have you ever wondered what really goes behind your screen that magically converts “wrogn” to “wrong” ?
As you type the below sentence in google search bar :
“ how to learn to play footgall ”.
chances are very high that you intended to write the word “ football ” instead of “ footgall ” but ended up writing the wrong spelling. See that the search suggestions have corrected the wrong word and given us a few suggestions based on the correct word that is “ football ”.
The auto correct model that we are going to make will help us understand how does our phone understands that the word written by us is incorrect and suggests a more appropriate word for the same. Our model is not the mirror image of the one present on your phone, but works quite well!
Prerequisites : Python basic syntax, loops, data structures(strings, list, set, tuples), libraries (pandas, numpy, collections).
- Edit Distance :
Let us first understand what edit distance actually is.
When we say that 2 words are at ‘x’ edit distance from each other, it means that we need to edit or change one of the words ‘x’ times to form the other word.
For example consider the words ‘rat’ and ‘cat’. If we replace the letter ‘r’ with ‘c’ in the word ‘rat’ then it becomes ‘cat’. So it means that the words ‘rat’ and ‘cat’ are at 1 edit distance from each other. Similarly, the words ‘speed’ and ‘greed’ are 2 edit distance from each other ( replace ‘s’ and ‘p’ with ‘g’ and ‘r’ respectively in the word ‘speed’ to get ‘greed’ ).
Edit in any word / string could of the following forms:
The switch operation is done by swapping any 2 letters in a word.
For example :
‘boy’ -> ‘yob’, ‘ybo’, ‘byo’, ’oyb’, ‘oby’.
The replace operation is done by replacing 1 letter by another in a word.
For example :
‘cat’ -> ‘mat’, ‘hat’, ‘bat’, ‘rat’.
The delete operation is done by deleting 1 letter in a word without replacement.
For example :
‘cow’ -> ‘ow’, ‘cw’, ‘co’.
The insert operation is done by adding 1 letter to a word.
‘ate’ -> ‘gate’, ‘mate’, ‘hate’, ‘fate’.
Now as you’ve understood the 4 operations, we will use them to build our auto correct model, and to do that we have to compute the probability of a word being correct, based on a given input. For calculating probability we will make use of the baye’s theorem.
The mathematical equation of baye’s theorem is as follows:
P(A|B) = P(B|A) * P(A) / P(B)
The above equation says that the probability of a word ‘A’ being correct , given a word ‘B’ (LHS) is equal to the probability of a word ‘B’ being correct , given a word ‘A’ , multiplied with the probability of the word ‘A’ being correct divided by the probability of the word ‘B’ being correct (RHS).
To compute the above formula we would be required to read our dataset and compute all the probabilities that we will need to to build our model.
2. Preprocessing our data:
The next step is to pre process the dataset that we will be using to build our model. Python offers a range of libraries that makes data pre processing super easy. We will import such libraries using the keyword import as shown below.
After importing the libraries we will start with pre processing our data.
We will make a function process_data() that will take our data set as the input, convert all the sentences into lower case and split it into a list of words and return this list named ‘words’. Note that the dataset ‘shakespeare.txt’ is a file containing large number of sentences. When we read it, it will be saved as a long string.
This list of words will be our corpus. But it may contain some duplicate words as well. We will remove duplicate words from the corpus and save it in a variable called vocab.
Notice that we have used the set() function to remove duplicate words from our corpus, and hence we have 6116 unique words in our vocabulary.
3. Count of words
Now we need to determine how many times does each word appears in our corpus. We will store this data in a python dictionary. The keys of the dictionary will be the words present in our corpus and the values will be the number of times the given word appears in the corpus. For this we will make a get_count() function as shown below.
Notice that we have used the counter function from ‘collections’ module that we imported earlier. As you can see below the word ‘thee’ appears 240 times in our corpus. Similarly you can check for other words too.
4. Probability of words
Since we now have a dictionary ‘word_count_dict’ that contains the number of times each words appears in the corpus, we can calculate the probability that each word could appear, if a word is randomly chosen from the corpus. For any given word say ‘alex’, the probability of ‘alex’ appearing if we select a word randomly from the corpus is given by the formula :
P(‘alex’) = No of times alex is present in the corpus / length of corpus
Likewise we can calculate the probability for each word in the corpus.
We have a defined a function ‘get_probs’ to compute the probabilities.
As you can see below, the probability of selecting the word ‘thee’ from the corpus is equal to 0.0045. Similarly you can check for other words too.
5. String manipulations:
As discussed earlier, there are 4 operations (switch, replace, delete, insert) through which we will edit a given string. In this section we will make functions to perform these operations. To make these functions we will make use a very powerful feature in python that is list comprehension.
List comprehension is one the easiest ways to create a new list by performing operations on an existing list in a single line of code.
Let us make use of this feature and build our functions.
The function delete_letter() will take a word (of type string) as a input and delete a single letter from the word one by one and return us a list of words formed after deleting each letter.
For example : ‘cow’ -> ‘ow’, ‘cw’, ‘co’.
Now we will generate all the words from the list that result from deleting one character from the given word.
For example the word ‘cake’ is split as : [‘ake’, ‘cak’, ‘ake’]
As you can see below we’ve implemented our function on the word ‘cans’, which returns [‘ans’, ‘cns’, ‘cas’, ‘can’] after removing ‘c’, ‘a’, ’n’, ‘s’ from the word ‘cans’ respectively, as shown below.
We will make a function switch_letter that will take a word(of type string) as a input and switch adjacent letters from our string one by one and return us a list of words formed after swapping adjacent letters.
For example : ‘boy’ -> ‘oby’ , ‘byo’
As discussed we have used list comprehension to compute switch_letter, as shown below. Notice that in line 17 i have written a if condition to handle the case where our input word has duplicate letters.
For example, if the input is ‘zoo’, the function will return [‘ozo’, ‘zoo’], which contains the input word ‘zoo’ as well. To exclude this i have included the if condition.
We will make a function replace_letter() that will take a string as an input and replace each letter in it by every letter of the English alphabet and return us a list of words formed after replacing each letter.
As discussed we have used list comprehension to compute replace_letter, as shown below. For example, if the input is ‘can’, the function will return
['aan', 'ban', 'caa', 'cab', 'cac', 'cad', 'cae', 'caf', 'cag', 'cah', 'cai', 'caj', 'cak', 'cal', 'cam', 'cao', 'cap', 'caq', 'car', 'cas', 'cat', 'cau', 'cav', 'caw', 'cax', 'cay', 'caz', 'cbn', 'ccn', 'cdn', 'cen', 'cfn', 'cgn', 'chn', 'cin', 'cjn', 'ckn', 'cln', 'cmn', 'cnn', 'con', 'cpn', 'cqn', 'crn', 'csn', 'ctn', 'cun', 'cvn', 'cwn', 'cxn', 'cyn', 'czn', 'dan', 'ean', 'fan', 'gan', 'han', 'ian', 'jan', 'kan', 'lan', 'man', 'nan', 'oan', 'pan', 'qan', 'ran', 'san', 'tan', 'uan', 'van', 'wan', 'xan', 'yan', 'zan'].
Notice that first we will replace the first letter in ‘can’ that is ‘c’ by all the other 25 alphabets, similarly for the letter ‘a’ and ’n’ as well. Note that on line 15 i have used list comprehension to replace each letter in our input word by each alphabet.
As you can see below we run the function with the input word ‘can’ and it returns all the words after replacing each letter in our input word by each alphabet.
We will make a function insert_letter() that will take a word(of type string) as an input and returns a list of words formed by inserting an alphabet at every offset, which means inserting alphabets at the start, end and also between consecutive alphabets of the input word.
For example :
If the input word is ‘at’, the function will return the following :
['aat', 'bat', 'cat', 'dat', 'eat', 'fat', 'gat', 'hat', 'iat', 'jat', 'kat', 'lat', 'mat', 'nat', 'oat', 'pat', 'qat', 'rat', 'sat', 'tat', 'uat', 'vat', 'wat', 'xat', 'yat', 'zat', 'aat', 'abt', 'act', 'adt', 'aet', 'aft', 'agt', 'aht', 'ait', 'ajt', 'akt', 'alt', 'amt', 'ant', 'aot', 'apt', 'aqt', 'art', 'ast', 'att', 'aut', 'avt', 'awt', 'axt', 'ayt', 'azt', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz']
first it will insert all the alphabets before ‘at’,
then it will insert all the alphabets between ‘a’ and ‘t’,
then it will insert all the alphabets after ‘at’,
at last we will append all these words in a list.
As discussed i have used list comprehension to compute insert_letter(), as shown below.
As you can see below we run the function with the input word ‘at’ and it returns all the words after inserting alphabets at each offset.
6. Combining the string edits
After doing the string manipulations, we will make 2 functions namely edit_one_letter() and edit_two_letters() that will take a string as an input, and return all the possible single and double edits on that string.
Let us understand and implement both these functions one by one.
This function will take a string as an input and return a set of all possible edits that are at one edit distance away from the input string.
As you can see above, the function makes use of all the 4 functions that we implemented earlier. Also we have used set to ensure that there are no duplicate values in the final list.
Now we will run our function and check the output for the input string “at”.
The output is :
['a', 'aa', 'aat', 'ab', 'abt', 'ac', 'act', 'ad', 'adt', 'ae', 'aet', 'af', 'aft', 'ag', 'agt', 'ah', 'aht', 'ai', 'ait', 'aj', 'ajt', 'ak', 'akt', 'al', 'alt', 'am', 'amt', 'an', 'ant', 'ao', 'aot', 'ap', 'apt', 'aq', 'aqt', 'ar', 'art', 'as', 'ast', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz', 'au', 'aut', 'av', 'avt', 'aw', 'awt', 'ax', 'axt', 'ay', 'ayt', 'az', 'azt', 'bat', 'bt', 'cat', 'ct', 'dat', 'dt', 'eat', 'et', 'fat', 'ft', 'gat', 'gt', 'hat', 'ht', 'iat', 'it', 'jat', 'jt', 'kat', 'kt', 'lat', 'lt', 'mat', 'mt', 'nat', 'nt', 'oat', 'ot', 'pat', 'pt', 'qat', 'qt', 'rat', 'rt', 'sat', 'st', 't', 'ta', 'tat', 'tt', 'uat', 'ut', 'vat', 'vt', 'wat', 'wt', 'xat', 'xt', 'yat', 'yt', 'zat', 'zt']
This function will take a string as an input and return a set of all possible words that are at two edit distance away from the input string, which means it will first get all the possible edits on the input word and then for each such word it will again find all the possible edits. To implement this we simply need to run our string manipulation functions first on the input word and then again on the output of the first function. Then we combine the results of all the functions to get our final results as shown below.
As you can see above, the function makes use of all the 4 string manipulation functions twice. Also we have used set to ensure that there are no duplicate values in the final list.
As you can see there are 2654 words that are at two edit distance from the word ‘a’. I’ve also printed 20 such words as shown above.
7. Spelling Suggestions
As you saw, we created a list of all the words that are at one and two distance away from our input word. Now we will use these words to get the highest probable word, which is the actual word we intended to write but mistakenly wrote it wrong.
To do this we will make a function get_corrections() which will take a given word as the input and then returns a list of 0 to n number of suggestion tuples of the form (word, probability_of_word). Where ‘word’ is a possible suggestion and ‘probability_of_word’ is the probability that the ‘word’ is the correct spelling suggestion for the input word.
We will incorporate a 2 step process to generate the most probable spelling suggestion as follows :
In this step we will simply generate a set of suggestions for a given word using string manipulations that we discussed earlier. We will follow the 5 step logic as below :
a) If the word is already present in the vocabulary, suggest it.
b) Otherwise, if there are words returned by edit_one_letter() function which are present in the vocabulary, suggest them.
c) Otherwise, if there are words returned by edit_two_letters() function which are present in the vocabulary, suggest them.
d) Otherwise, suggest the input word.
The idea is that the words generated from fewer edits are more likely to be correct than words with more edits.
Step : 2
In this step we will make a dictionary whose key is a word and value is the probability of that particular word in our vocabulary. In case the word is not present in the vocabulary we will assign 0 probability to it.
Step : 3
In this step we select the ’n’ best suggestions.
notice that in line 25 i have sorted our best_words dictionary based on the descending order of probability so that most probable words will appear at the start of the list.
As you can see below i have used the get_correction function on few words and our model gives relevant words related to it.
8. Minimum edit distance
Now that we have implemented our auto correct model, the next step is to determine how to evaluate the similarity between two strings (say ‘yef’ and ‘yes’) , and to efficiently find the shortest path to go from one word to another. Minimum edit distance can be used to do this. Given 2 strings, minimum edit distance is the minimum cost of operations required to transform a given string into other. For calculating minimum edit distance we will use 3 string manipulation operations that are Insert, Delete and Replace. We will make use of a dynamic programming logic that will tell us the minimum number of edits required to convert a string into another.
For example consider 2 strings ‘hobby’ and ‘soggy’. Now to convert hobby to soggy we need to replace ‘h’ by ‘s’, ‘b’ by ‘g’ and the next ‘b’ by ‘g’ as well.
Each edit operation has a cost associated to it. As you saw above, it requires 3 replace operations to convert hobby to soggy. And if we consider cost of one replace operation equal to 2 then the minimum edit distance from ‘hobby’ to ‘soggy’ will be equal to 3*2 = 6. Similarly the cost of 1 insert operation is 1 and the cost of 1 delete operation is also 1. Notice that replace operation can be considered as one delete + one insert operation and hence the cost of one replace operation is equal to 2.
This method of manually converting one string to another using the replace, insert and delete operations is not suitable for long strings as it will be computationally expensive and time consuming as well.
Alternatively, we can follow a tabular approach based on dynamic programming which will be much efficient than this approach.
As you can see above, we have to convert the source word ‘play’ to target word ‘stay’. Note that we have appended an empty string ‘#’ at the start of both the words, which will make sense once you will perform the further steps. Our aim is to fill this matrix in a way that each value at row, column represents the cost of transforming the the source word till row to target word till the column.
For example the value at D[3,2] represents the cost required to convert [‘pla’] to [‘st’] ( ‘play’[ : 3] = ‘pla’ , ‘stay’[ : 2] = ‘st’ ). Similarly, the value at D[2,2] represents the cost required to convert [‘pl’] to [‘st’] ( ‘play’[ : 2] = ‘pl’ , ‘stay’[ : 2] = ‘st’ ). This cost will denote the minimum edit distance.
D[i,j] = source[ : i] -> target[: j]
D[i,j] will denote the minimum edit distance between source[:i] and target[:j].
Note that we will start with the empty strings at (0,0) (top left corner) and go all the way down to (4,4) (bottom right corner).
As discussed we are trying to solve sub problems and use them to build the solution for our main problem. First we start with finding minimum edit distance between smaller strings and keep on increasing the string size to reach the end solution. All this will get more clear once we start filling our matrix.
Lets start with the empty strings (‘#’) first.
Since the source empty string and target empty strings are both ‘#’ hence the minimum edit distance between them is 0.
Now lets move forward to the first letter ‘p’ in the source string ‘play’. To go from ‘p’ to an empty string ‘#’ we just need one delete operation. Hence the value at (1,0) is equal to the cost of one delete operation that is 1.
Now lets move forward to the first letter ‘s’ in the source string ‘stay’. To go from an empty string ‘#’ to ‘s’ we just need one insert operation. Hence the value at (0,1) is equal to the cost of one insert operation that is 1.
Now lets go from ‘p’ to ‘s’. This can be done by in 3 ways.
a) Replacing ‘p’ by ‘s’ directly.
We can replace ‘p’ by ‘s’ using one replace operation as discussed before.
Cost = 2.
b) Using insert + insert
To go from ‘p’ to ‘s’ we can insert ‘s’ to ‘p’ to get ‘ps’. Then we can delete ‘p’ to get a staring ‘s’. Notice that the cost of inserting one ‘s’ to a string is already calculated at (0,1) which is equal to one. And the cost of deleting one character is already known which is equal to 1.Hence the total cost will be equal to:
(cost of inserting one ‘s’ + cost of deleting one‘p’) = 1+1 = 2.
c) Using delete + insert
To go from ‘p’ to ‘s’ , we can delete ‘p’ to get an empty string. And then we can insert ‘s’ to the empty string to get ‘p’. Notice that the cost of deleting one ‘p’ to get an empty string is already calculated at (1,0) which is equal to one.And the cost of inserting one character is already known which is equal to 1. Hence the total cost will be equal to:
( cost of deleting ‘p’ + cost of inserting ‘s’) = 1+1 = 2.
Ideally, we would compare the cost of all the 3 options and assign the least cost to (1,1). But in this case all the 3 options have the same cost that is equal to 2, hence we assign 2 at (1,1). Therefore the minimum edit distance from ‘p’ to ‘s’ is equal to 2.
Hence to fill any cell in the matrix, it would be helpful to know the values at left, upper and adjacent upper left positions respectively. Using this we can make use of some calculations that have been already performed without having the need to perform them again.
To fill out the first column, we can use the below formula :
D[i , j] = D[i-1 ,j] + deletion cost
which is equal to :
D[i , j] = D[i-1 ,j] + 1.
So to fill any value in the first column, just add 1 to the previous value. After filling all the values in the first column, our matrix looks like this :
Similarly, to fill out the first row, we can use the below formula :
D[i , j] = D[i , j+1] + insertion cost
which is equal to :
D[i , j] = D[i +1 , j] + 1.
So to fill any value in the first row, just add 1 to the previous value. After filling all the values in the first row, our matrix looks like this :
Now we will generalise this method to fill out the rest of our matrix. Remember that we can reach a cell through the left, upper cell or adjacent upper left cell. We will calculate the cost of reaching the cell through these 3 ways and select the lowest of the 3 values for our cell. We will generalise the computations done before to form one final formula as shown below.
The first formula is through the left cell, the second formula is through the right cell and the third formula is through the adjacent upper cell. For the 3rd formula, if the source[i] and target[j] is same, then the formula would be
D[i-1, j-1] +0
If the source[i] and target[j] is not same, the formula would be
D[i-1, j-1] + replace cost
D[i-1, j-1] + 2 (since replace cost = 2).
Now you can use the generalised formula to get the final matrix.
The value at (4, 4) which is equal to 4, denotes the minimum edit distance between the source word ‘play’ and target word ‘stay’. One thing worth noticing here is that the value at (2,2) is equal to 4. And after that the diagonal values (3,3) and (4,4) does not change. It is because of the fact that source[2:4] and target[2:4] is the same that is ‘ay’, and hence there are no more edits needed, which is the reason why the value does not change.
Now we will write a function min_edit_distance which will compute the minimum edit distance between our source word and target word. The function takes 5 arguments : source, target, ins_cost = 1, del_cost = 1, rep_cost = 2. source is the source string, target is the target string, ins_cost, del_cost, rep_cost are the deletion, insertion and replace cost respectively.
Now we will run this function on few strings and see the results.
As you can see our function correctly returns the cost matrix for the pair of words given as the input. You can cross check the output by manually calculating the minimum edit distance for the pair of words.
That’s all about this simple yet effective auto correct model in python. If you have made it till here, i strongly believe you must have found it interesting.
If there are suggestions or modifications in the code that can optimise the model or any general feedback, feel free to leave a few sentences in the comment section and i will be more than happy to work upon it.
Good Luck :)