The search API offers web-scale, real-time, full-text, instant search for your data and documents.
A high-performance, focused crawler turns any website into JSON docs with structured data.
Turnkey, affordable, scalable, high-performance search.
Search is omnipresent in today’s Information Age. The vast amount of data being produced makes search a core part of almost every solution stack.
Whether your customers are searching for products or information on your website, or research papers, patents, court records or patient records need to be searched, the search solution decides whether your business or your research succeeds or turns into a frustrating experience.
Three options for search
The Pruning Radix Trie is a novel data structure, derived from a radix trie — but 3 orders of magnitude faster.
After I published SymSpell, a very fast spelling correction algorithm, I was frequently asked whether it could be used for auto-completion as well. Unfortunately, despite its speed, SymSpell is a poor choice for auto-complete. The radix trie seemed to be a natural fit for auto-complete. But looking up a short prefix in a large dictionary, which yields a huge number of candidates, was too slow. …
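The short-prefix problem can be illustrated with a minimal sketch. It is not the author's implementation: it uses a plain character trie rather than a radix trie, and it assumes that each node stores the highest term frequency found in its subtree, so that subtrees that cannot reach the current top-k are skipped. The names `PruningTrie`, `add` and `top_k` are hypothetical.

```python
import heapq


class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.count = 0       # frequency if a term ends here, else 0
        self.max_count = 0   # highest frequency anywhere in this subtree


class PruningTrie:
    def __init__(self):
        self.root = Node()

    def add(self, term, count):
        node = self.root
        node.max_count = max(node.max_count, count)
        for ch in term:
            node = node.children.setdefault(ch, Node())
            node.max_count = max(node.max_count, count)
        node.count = count

    def top_k(self, prefix, k):
        """Return up to k completions of prefix, most frequent first."""
        if k <= 0:
            return []
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        best = []  # min-heap of (count, term), at most k entries

        def visit(n, term):
            # Prune: nothing in this subtree can beat the current k-th best.
            if len(best) == k and n.max_count <= best[0][0]:
                return
            if n.count > 0:
                if len(best) < k:
                    heapq.heappush(best, (n.count, term))
                elif n.count > best[0][0]:
                    heapq.heapreplace(best, (n.count, term))
            # Visit the most promising children first so pruning kicks in early.
            for ch, child in sorted(n.children.items(),
                                    key=lambda kv: -kv[1].max_count):
                visit(child, term + ch)

        visit(node, prefix)
        return [term for _, term in sorted(best, reverse=True)]


trie = PruningTrie()
for term, count in [("micro", 10), ("microsoft", 500), ("microphone", 50),
                    ("mice", 5), ("apple", 1000)]:
    trie.add(term, count)
print(trie.top_k("mic", 2))
```

Without the `max_count` check, every term below the prefix node would have to be visited and ranked, and that full scan is what makes short prefixes slow.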
Exploring the implications of Artificial Intelligence, Consciousness and Free will
AI will replace most jobs. Human labor becomes worthless, the biggest devaluation in history. Democracy will crumble as people lose their negotiating power. Consciousness will spontaneously emerge in AI by Darwinian evolution, not by human engineering. Superintelligence will surpass humans and replace our species, in this millennium.
NASA is working on project HAMMER to protect the Earth from an asteroid that has a 1 in 2,700 chance of hitting us in 2175. …
Faster Word Segmentation by using a Triangular Matrix instead of Dynamic Programming. The integrated Spelling correction allows noisy input text. C# source code on GitHub.
For people in the West it seems obvious that words are separated by spaces, while in Chinese, Japanese, Korean (CJK languages), Thai and Javanese words are written without spaces between them.
Even Classical Greek and late Classical Latin were written without those spaces. This is known as scriptio continua.
And it seems we haven’t yet lost this ability: we can easily decipher
thequickbrownfoxjumpsoverthelazydog
as
the quick brown fox jumps over the lazy dog
Our brain does this somehow intuitively and unconsciously. Our reading speed slows down just a bit, caused by all the background processing our brain has to do. How much that really is we will see if we attempt to do it programmatically. …
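To give a feel for what doing it programmatically involves, here is a minimal sketch of the textbook dynamic programming baseline. It is not the triangular-matrix method this post describes, and it does no spelling correction. The tiny `WORD_COUNTS` dictionary and the function name `segment` are hypothetical; a real system would load a large word frequency dictionary.

```python
import math

# Hypothetical toy unigram counts, for illustration only.
WORD_COUNTS = {
    "the": 500, "quick": 20, "brown": 15, "fox": 10, "jumps": 8,
    "over": 60, "lazy": 5, "dog": 25, "a": 300, "he": 100, "own": 30, "do": 80,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def segment(text):
    """Split text into the most probable sequence of dictionary words.

    Returns an empty list if text cannot be fully covered by dictionary words.
    """
    n = len(text)
    # best[i] = (log-probability, word list) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in WORD_COUNTS and best[start][0] > -math.inf:
                score = best[start][0] + math.log(WORD_COUNTS[word] / TOTAL)
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]


print(" ".join(segment("thequickbrownfoxjumpsoverthelazydog")))
```

Even this simple version checks up to `MAX_WORD_LEN` candidate words at every position, which hints at how much background processing is involved.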
Conventional wisdom and textbooks say BK-trees are especially suited for spelling correction and fuzzy string search. But does this really hold true?
In the comments on my blog post on spelling correction, the BK-tree was also mentioned as a superior data structure for fuzzy search.
So I decided to compare and benchmark the BK-tree to other options.
Approximate string search looks up a string in a list of strings and returns those strings that are close to it according to a specific string metric.
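As a minimal sketch of that definition, the following uses Levenshtein edit distance as the string metric and a plain linear scan over the list. That is the naive baseline that index structures such as the BK-tree try to beat, not any of the benchmarked implementations. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def fuzzy_lookup(query, words, max_distance):
    """Return (distance, word) pairs within max_distance, closest first."""
    hits = ((levenshtein(query, w), w) for w in words)
    return sorted((d, w) for d, w in hits if d <= max_distance)


print(fuzzy_lookup("helo", ["hello", "help", "world", "hero"], 1))
```

The linear scan computes one distance per dictionary entry, so its cost grows with the size of the list. That cost is the reason to compare specialized data structures at all.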
Sub-millisecond compound-aware automatic spelling correction
Recently I was pointed to two interesting posts about spelling correction. They applied a deep learning approach, the philosopher’s stone of modern times. It is really fascinating how universal deep learning is, from AlphaGo winning Go championships and Watson winning Jeopardy to fighting fake news and threatening mankind with the Singularity.
The question is whether the deep learning multi-tool is going to excel and replace highly specialized algorithms and data structures in every domain, whether both deserve their place, or whether they shine when their complementary strengths are combined. …
This post explores Elias-Fano encoding, which allows very efficient compression of sorted lists of integers, in the context of information retrieval (IR).
Elias-Fano encoding is quasi-succinct, which means it is almost as good as the theoretically best possible compression scheme for sorted integers. While it can be used to compress any sorted list of integers, we will use it for compressing posting lists of inverted indexes.
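A minimal sketch of the basic scheme, assuming the standard textbook formulation: each value is split into low bits, stored verbatim at a fixed width, and high bits, stored as positions in a bit vector. The function names are hypothetical, and Python lists stand in for packed bit arrays, so this shows the layout rather than a space-efficient implementation.

```python
def ef_encode(values, universe):
    """Encode a sorted list of non-negative ints, each smaller than universe."""
    n = len(values)
    # Low bits per element: floor(log2(universe / n)), never negative.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    mask = (1 << low_bits) - 1
    lows = [v & mask for v in values]
    # High parts: element i with high part h sets bit h + i in the bit vector.
    highs = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        highs[(v >> low_bits) + i] = 1
    return low_bits, lows, highs


def ef_decode(low_bits, lows, highs):
    """Rebuild the original sorted list from its Elias-Fano parts."""
    values, i = [], 0
    for pos, bit in enumerate(highs):
        if bit:
            high = pos - i  # subtract the number of elements seen so far
            values.append((high << low_bits) | lows[i])
            i += 1
    return values


doc_ids = [2, 3, 5, 7, 11, 13, 24]  # a toy posting list
low_bits, lows, highs = ef_encode(doc_ids, universe=32)
assert ef_decode(low_bits, lows, highs) == doc_ids
```

Packed into real bit arrays, the two parts take about `n * low_bits` plus `len(highs)` bits, roughly 2 + log2(universe / n) bits per element.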
While gap compression has been around for over 30 years, and some of the foundations of Elias-Fano encoding even date back to a 1972 publication by Peter Elias, Elias-Fano encoding itself was published in 2012. As it is a rather recent development, there is not much actual implementation code available beyond the papers. …
The correction of product names, company names, street names and addresses is a frequent task in data cleaning and deduplication. Often those names are misspelled, either due to OCR errors or mistakes made by the human data collectors.
The difference is that those names often consist of multiple words, white space and punctuation. For large data or even Big Data applications, speed is also very important.
After my blog post on 1000x faster spelling correction got more than 50,000 views, I revisited both the algorithm and the implementation to see if they could be further improved.
While the basic idea of the Symmetric Delete spelling correction algorithm remains unchanged, the implementation has been significantly improved to unleash the full potential of the algorithm.
Compared to Peter Norvig’s algorithm it is now 1,000,000 times faster for edit distance=3 and 10,000 times faster for edit distance=2. …
Lex Google from a search engine’s perspective: a German law threatening the internet as we know it.
Ancillary copyright for press publishers (Leistungsschutzrecht für Presseverlage) through the Eighth Act Amending the Copyright Act
Here are the key passages:
§ 87f (1) The producer of a press product (press publisher) has the exclusive right to make the press product or parts thereof publicly accessible for commercial purposes, unless these are individual words or very short text excerpts. If the press product was produced within a company, the owner of the company is deemed to be the producer.
§ 87g (2) The right expires one year after publication of the press product.
§ 87g (4) Making press products or parts thereof publicly accessible is permitted, provided this is not done by commercial providers of search engines or commercial providers of services that process content in a similar way. …