Wolf Garbe
2 min readOct 24, 2020

--

A great and very detailed overview of spelling correction in search engines.

I would add three points:

1. Performance

I think spelling correction performance (corrections per second) is important too, especially for search engines that need to serve many (thousands) users in parallel.

While Peter Norvig's article is very educational, the solution itself is not very well suited for search engine spelling correction in production for three reasons:

a. His way for candidate generation using insertions and replaces is not language independent and is infeasible for CJK languages, with Chinese 70,000 having Unicode Han characters.

b. Even for languages with the Latin alphabet it is quite slow, as it requires generating the candidates at query time. An average 5 letter word has about 3 million possible spelling errors (candidates) within a maximum edit distance of 3. If you are doing this for thousand users concurrently you need a lot of computing power an time.

c. Python despite being very popular, is not well suited for performance-critical/real-time applications at scale as it is an interpreted language.

I have benchmarked different spelling correction algorithms:

https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078

2. Precision, recall, and F-measure (F-score)

In "Presenting suggestions" you describe an automatic spelling correction and instant search approach.

The goal is always to correct more mistakes than introducing new ones by an imperfect spelling correction algorithm.

Whether this is the case, we can only assess if we know how noisy the input channel is and if we know what the precision, recall, and F-measure scores of the used spelling correction algorithms are.

3. Combining auto-correction, query completion, and instant search

Auto-correction, query completion, and instant search are related tasks, aiming at guessing the user intent to assist him, by either correcting his mistakes or saving him from typing work, and with instant search by automatically choosing the best suggestion and automatically rewriting the query.

It is the same user input that the three different algorithms have to work within a collaboration. The combination and the order of the above algorithms, and the decision whether some substring is the correct word, a mistyping of another word, or just the prefix of yet another word or even a misspelled prefix are sometimes tricky.

--

--

Responses (1)