I have spent the past month working on an AI-based search engine.

When you go on the darktable subreddit, you will find the question “why do lighttable’s thumbnails look different from the darkroom preview” asked every other week. The question has been answered many times on that subreddit, on various forums, and I even put the answer in the main Readme file, displayed on the GitHub front page of the software. To no avail. In the big picture, this is of course indicative of bad software design, but let’s discard that for a moment and focus on the information-retrieval part of the issue.

This situation leads some benevolent people, on forums, to answer the same things in a loop. The problem is, they get a bit less nice each time they have to repeat themselves. Who can blame them? But that makes open-source forums notoriously unwelcoming to beginners, who are told to RTFM (Read The Fucking Manual) at length while they don’t even know what to look for. So people go away to platforms where folks are nicer, but typically less accurate about the second-hand information they relay.

In the meantime, I started using ChatGPT. Not to actually produce content, but as a kind of meta search engine. One of the great features of this chatbot (when used as a search engine) is that it returns thematic results instead of just keyword-driven ones. In other words, it is able to suggest results related to your search but not limited to your exact keywords, which may broaden your perspective to things you didn’t think of while still staying on-topic. In comparison, Google and in-website search engines only fetch pages containing your exact keywords, which means you have to know what you are looking for. Over the past years, Google has also contracted this infuriating habit of discarding from your query the keywords it deems irrelevant, which makes technical searches more difficult on top of frustrating (or am I the only one getting infuriated when stupid machines second-guess my instructions?).

I feel like Google has an underlying database-like approach based on a key -> value data model (which is the foundation of the JSON data-exchange model, as well as the raison d’être of the Open Graph protocol), like:

  • word -> definition
  • in-person service ->
    • opening hours -> 9 to 5
    • address -> Sesame street, 25. NYC.
  • commercial product ->
    • rating -> 4 stars
    • description -> best stuff ever
    • price -> that many $
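
That key -> value model maps directly onto a nested dictionary. A minimal sketch in Python (the values are the same hypothetical placeholders as above, just to make the shape of the lookup explicit):

```python
# A hypothetical slice of the key -> value model Google seems to work with.
# Keys identify an entity; values are the "answer" to be displayed directly.
index = {
    "word": "definition",
    "in-person service": {
        "opening hours": "9 to 5",
        "address": "Sesame street, 25. NYC.",
    },
    "commercial product": {
        "rating": "4 stars",
        "description": "best stuff ever",
        "price": "that many $",
    },
}

# A query then boils down to matching the best key and returning its value:
print(index["in-person service"]["opening hours"])  # -> 9 to 5
```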

So it feels like whatever search query you input is only meant to identify the best-matching key and then give you its value. Again, great if you are looking for a store selling hair-dryers in your neighbourhood or for the deadline to pay your taxes. Not so much for “how to” and “why” searches.

For this use case, ChatGPT sits half-way between a recommendation algorithm (as seen on all streaming platforms nowadays) and a typical keyword search engine. Before, to achieve that, you needed to do a sort of state-of-the-art survey: find some introductory web page listing the things to consider (name-dropping is enough there), then use it as an entry point to investigate each item through the keywords you found. Researchers are used to doing that, but in my experience the general audience already struggles to find the opening hours of stores and services, so don’t hold your breath. In any case, it’s time-consuming.

So I had the idea of implementing my own natural-language, AI-based search engine to index stuff I had already written somewhere on the web, hoping people would stop sending me emails asking for things they could now easily find. Turns out it’s really easy:

  • recursive web crawler: 106 lines of Python code, including comments and docstrings, with HTML parsing done by the BeautifulSoup package,
  • sentence-splitting AI model: 50 lines of Python code to train one if you want to, using the NLTK package, and 3 lines to use yours or an NLTK factory-trained one,
  • automatic language detection: 10 lines of Python code, using a basic (and surprisingly accurate) stopword-counting scheme with the NLTK stopwords corpus,
  • word tokenization AI model: 2 lines of Python code to use the pre-trained NLTK models (11 European languages supported out of the box, and you can train your own),
  • word vectorization model: 44 lines of Python code to train a multi-threaded, self-loading and self-saving reusable model, using the Gensim package,
  • cosine-similarity-based search-engine ranker: 134 lines of Python code, including comments and docstrings, using the Numpy linear-algebra package, with multi-threaded internal tokenization and vectorization,
  • web app: 75 lines of Python code, including comments, using the Flask microframework to process the GET/POST requests feeding queries to the previous ranker, plus a 100-line HTML template displaying the results with Bootstrap CSS.
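
The ranking principle behind the last-but-one item is easy to sketch. The real thing uses NLTK tokenization, Gensim word vectors and Numpy; here is a stdlib-only toy version that swaps embeddings for raw word counts, just to show the cosine-similarity idea (all function names and URLs below are mine, not the project’s):

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words. The real engine uses NLTK tokenizers
    and Gensim word vectors instead of raw counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query: str, pages: dict) -> list:
    """Return (url, score) pairs, best match first."""
    q = tokenize(query)
    scores = [(url, cosine(q, tokenize(text))) for url, text in pages.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

pages = {
    "https://example.org/thumbnails": "why lighttable thumbnails differ from darkroom preview",
    "https://example.org/install": "how to install the software on linux",
}
print(rank("thumbnails preview", pages)[0][0])  # best match: .../thumbnails
```

The point of cosine similarity over plain keyword matching is that, once words are replaced by learned vectors, pages using related-but-different vocabulary still score well.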

Put all that in a script that can update everything automatically while you sleep, and you get a grand total of roughly 500 lines of computer code (read: easy to maintain and debug), producing something that can run even on a shared hosting server and query an index of 15,000 pages in a few tenths of a second (wall clock) without even needing a database. The whole thing (AI model and search-engine ranker) is a memory dump of the numeric matrices and of a dictionary mapping URLs to page snippets, stored in a 140 MB binary file on the server, which can simply be overwritten with a newer one to update the application data. Bottom line: you can definitely bake your own AI at home in 2023.
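
The persistence scheme can be sketched with the stdlib pickle module (the file layout and names here are my assumptions, not the project’s actual format): dump the matrices and the URL -> snippet dictionary into one binary file, and overwrite it to update the app.

```python
import os
import pickle
import tempfile

def save_index(path: str, vectors, url_snippets: dict) -> None:
    """Dump the whole engine state into one binary file.
    Writing to a temp file then os.replace() makes the overwrite atomic,
    so a running app never reads a half-written index."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"vectors": vectors, "snippets": url_snippets}, f)
    os.replace(tmp, path)

def load_index(path: str):
    """Load the dumped state back; returns (vectors, url -> snippet dict)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return state["vectors"], state["snippets"]
```

The temp-file-then-rename dance is the one design choice worth keeping even in a toy: a plain `open(path, "wb")` would leave a truncated file if the update crashes midway.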

But… (you felt it coming…)

That’s assuming:

  • websites use a sitemap indexing their relevant “content” pages (that is, not the login and archive pages) with their date of last modification. If they don’t, then you need:
    • to write another web crawler recursively following all the links from the homepage (72 more lines),
    • a way to retrieve the date from within the page content, since there is no standard way of doing it (23 more lines to try the most common ones),
    • to regex your way through a filtering mechanism, to prevent the crawler from leaving the website and keep it focused on the relevant sections,
    • a website-specific way to identify and exclude “non-content” pages, like profile, login and archive pages,
    • a duplicates remover, to keep only one canonical page/URL when the same content appears under different URLs (10 more lines for something really basic),
  • websites use UTF-8 encoding. If they don’t, then you need:
    • to find out what encoding they declare, and if the encoding they use is actually the one they declared (15 more lines),
    • to translate Unicode characters to their closest ASCII equivalents, since ASCII is the least common denominator and the AIs rely on generality to find patterns (151 more lines and a major performance penalty),
  • websites don’t use HTML minification, which removes line breaks; otherwise the plain-text conversion of the HTML markup improperly merges the content of different tags into the same sentence and makes the sentence-detection AI fail. If they do, then you need:
    • to regex-find block-level HTML elements and insert line breaks after them (3 more lines of code and a serious performance penalty),
    • to regex-find inline-level HTML elements and insert spaces after them (same),
    • to regex your way into contracting multiple white spaces,
  • websites put machine language inside the proper <code> and <pre> HTML tags, so you can remove it before training your natural-language model. Many people on forums paste terminal output and computer code without markup, so you are screwed. In this case, you need to enlarge your training dataset with valid content, hoping to improve the signal-to-noise ratio, and fine-tune the hyper-parameters of your language model to discard the noise,
  • forum threads wrap quotes and requotes of previous replies in the proper <blockquote> HTML tags, so you can remove them before ranking the page. Many people on forums just paste previous messages prefixed with >, so the quoted content can’t be systematically removed, which means the weight of the quoted keywords artificially increases on the page and lures the search engine into thinking that page is very relevant (more so than authoritative documentation pages, for example, where the frequency of the same keywords is usually lower).
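
The de-minification workaround above really is just a couple of regexes. A minimal sketch, with the tag lists trimmed to a few common elements (a real version would enumerate many more):

```python
import re

BLOCK_TAGS = r"p|div|h[1-6]|li|ul|ol|br|tr|table|blockquote"

def unminify(html: str) -> str:
    """Re-insert the line breaks that HTML minification stripped, so that
    plain-text extraction doesn't glue separate blocks into one 'sentence'."""
    # Newline after every closing (or self-closing) block-level tag.
    html = re.sub(rf"(</(?:{BLOCK_TAGS})>|<br\s*/?>)", r"\1\n", html, flags=re.I)
    # Space after inline-level tags, so adjacent words don't get fused.
    html = re.sub(r"(</(?:a|span|em|strong|b|i)>)", r"\1 ", html, flags=re.I)
    # Contract runs of spaces/tabs (keeping the newlines we just added).
    html = re.sub(r"[ \t]+", " ", html)
    return html
```

Running every page through three global regex substitutions before parsing is where the “serious performance penalty” mentioned above comes from.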

When you are done with all the tweaks and fixes required to clean up internet junk, you have doubled your code base just to handle the most common patterns.

Take the website of the International Color Consortium (ICC), https://color.org , an authoritative source on color management in computers (stuff you might be concerned with if you use screens to display images): they have no sitemap, no RSS feed, they don’t use any kind of timestamping on their pages (you can’t know the date of the info you are looking at), they declare an iso-8859-1 encoding in the headers but if you actually use it you get many decoding errors, and to top it all off, most of their actual content is contained in… PDF files. That is legitimately the crappiest website I have ever seen.

Now, take the website of the International Electrotechnical Commission, https://www.electropedia.org/iev/ : it’s another flavor of shit. You see, in the photography world, people use terms like “luminance”, “brightness” and “lightness” lightly, in an interchangeable way. They are not interchangeable, and Electropedia provides the exact technical definitions in English and French, along with the exact translations for most European languages, plus Chinese, Japanese and Arabic. But the website is properly impossible to index, since it not only lacks a sitemap, but its pages don’t even have actual URLs. You have to query the terms from the index and get redirected to the definition through URL parameters, like https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=845-21-002. The word being defined is not even in the URL parameters: you only get its ID, 845-21-002, which stands for the term “optical radiation”, which you can only know by querying their database…
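
You can see the problem by parsing one of those URLs with the Python stdlib: the only machine-readable thing identifying the page is an opaque ID.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=845-21-002"
# keep_blank_values=True keeps the valueless 'openform' flag in the result.
params = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(params)  # {'openform': [''], 'ievref': ['845-21-002']}
# 'ievref' is all a crawler gets: an opaque ID, not the term being defined.
```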

So if you ever wonder why levelling down is the norm on the internet, there you have it: authoritative sources are much harder to index, query and fetch than the Joe Blows opening WordPress blogs to say shit, because technical institutions and organizations are run by people who use the internet to advertise the date and place of their next yearly symposium. I simply dare you to bake a Google query that puts Electropedia on the first page without explicitly restricting the search to electropedia.org… I couldn’t.

On the other hand, WordPress websites are no-brainers to index: UTF-8 by default, standard use of sitemaps through SEO plugins (which also tend to support Open Graph meta tags natively), and hierarchical taxonomies and post types letting you know what you are looking at. Forums are not. And that’s why investing time answering questions on most forums is a dead loss: they don’t contribute to any lasting accumulation of information. They are just chats with permalinks.

And don’t get me started on the mailing-list archiving websites that put email bodies into <pre> tags just to get the text displayed in a monospace font, so a machine can’t tell whether it’s looking at machine language or natural language.

I have heard we are in the era of information. I think we are in the era of noise and wasted CPU cycles.