It has been a bit more than a year since I put Chantal AI online, with 3 major design iterations so far. It’s time to compile what I have learned from that.

Introduction : information retrieval

So you have documents and pages containing information and knowledge (HTML, plain text and PDF) useful to your business. That’s better than not having documents. The problem is that, as the number of documents increases, information gets more and more buried… The best place to hide a tree is in a forest. There are a few things you can do to alleviate that :

  1. organize the content linearly, like chapters in a book, and have a top-level table of contents (index page),
  2. tag pages with keywords, categories, topics (or any other form of taxonomy) and have an index page listing all terms for each taxonomy (kind of a glossary),
  3. put a date on pages and have chronological archive,
  4. have a list of related pages on each page (only possible for HTML websites).

As you can see, the main question to answer is : how do you expect your reader to enter the information system ? Very few people read documentation linearly, for example ; they want shortcuts, especially when they only need to double-check something they already roughly know. Entering through a topic index implies a kind of “state-of-the-art” research.

Which brings another question : does the reader need a piece of operational information or a full presentation of a topic ? Operational information is usually short and needed urgently, to accomplish a specific task (phone number to call, address to ship to, configuration key to set, form to fill in, etc.). Topical overviews are needed at a different pace, “offline”, for individual training or knowledge updates.

On a website, it is customary to mix all those transverse ways of browsing information (website-level TOC, page-level TOC, tag clouds and tag archives, chronological archives). When documents are PDFs on a filesystem, that is not possible anymore. Either way, you need a search engine over the content to provide fast access to large volumes of information.

Your typical search engine still uses TF-IDF methods such as Okapi BM25 and its descendants : that is, computing keyword frequency statistics over all documents and ranking them with “black magic” weighting rational functions, fine-tuned empirically. To account for typos and discrepancies in suffixes (plural/singular, feminine/masculine, conjugated verbs), it is then customary to stem words, which is in itself a complex and risky procedure.1
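
For reference, here is a minimal, from-scratch sketch of BM25 scoring over tokenized documents (the constants k1 and b are the usual textbook defaults, not a tuning recommendation) :

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Okapi BM25 : one keyword-frequency score per document.
    `docs` is a list of tokenized documents (lists of words)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    doc_freq = Counter()                           # in how many documents each term appears
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```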

The problem is that keyword-based approaches don’t allow broadening search queries with synonyms. Users have to provide those synonyms themselves, provided they know them, which is a fair expectation only when the target audience of your search engine is experts.

This is where AI comes in handy : by its ability to infer meaning through statistical relationships between words commonly found in the same syntactic neighbourhood.

Introduction 2 : Word2Vec in a nutshell

Word2Vec is an unsupervised machine learning algorithm that aims at pinning words (or any kind of token occurring in a sequence)2 in a virtual multidimensional space. In this virtual space, the relative distance between each pair of words is optimized to mirror how often those words are found in the same immediate vicinity (where the engineer controls the extent of that vicinity : the search window). The coordinates of each word are represented by a vector.
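
Here is roughly what training such a model looks like with the gensim library (used here only as an example, not necessarily what Chantal AI runs ; the corpus and parameters are toy values) :

```python
from gensim.models import Word2Vec

# Toy tokenized corpus : in practice, one list of lowercase tokens per document.
corpus = [
    ["the", "search", "engine", "ranks", "documents", "by", "meaning"],
    ["the", "index", "stores", "one", "vector", "per", "word"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # number of dimensions of the virtual space
    window=9,          # neighbourhood distance (the search window discussed later)
    sg=0,              # 0 = CBOW, 1 = Skip-Gram (more on that further down)
    min_count=1,       # keep rare words so this toy example actually trains
    epochs=10,
)

print(model.wv["search"])                          # the coordinates of one word
print(model.wv.most_similar("search", topn=3))     # its closest neighbours
```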

Since each word is turned into a vector, it is possible to compute the centroid of each document (by averaging all its words’ vectors) and of each search query. With simple algebra, we can then compute the angular distance between documents and queries, and use it as a ranking algorithm. Therefore, searching for a word or for what the language model identified as a close synonym yields the same result.
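
A minimal sketch of that ranking step, using cosine similarity as a stand-in for the angular distance and assuming word vectors from a gensim-style model as above (the helper names are mine, not a reference implementation) :

```python
import numpy as np

def centroid(tokens, model):
    """Average the vectors of the words the model knows about."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else None

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank(query_tokens, tokenized_docs, model):
    """Return (similarity, document) pairs, best match first."""
    q = centroid(query_tokens, model)
    scored = []
    for doc in tokenized_docs:
        d = centroid(doc, model)
        if q is not None and d is not None:
            scored.append((cosine_similarity(q, d), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```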

As you have guessed, the quality of the ranking is closely tied to the quality of the language model inferred by Word2Vec training.

GIGO : more than you think

Garbage In, Garbage Out. It’s a well-known principle of computer science, but AI, because it relies on very large volumes of data, makes it really difficult to (manually) assess the quality of the input data. Bias in the training set will bias the language model, therefore biasing the search engine.

In particular, you need to ensure that your input is free of machine and programming languages if you are training for natural language. You also need to avoid biasing the language model with redundant content, as it will skew the resulting co-occurrence statistics.

A naive way of removing machine language was, for me, to discard <code> and <pre> markup from HTML pages. The problem is that a lot of websites and authors use them for their graphical appearance (monospace font) rather than for their semantics, a use case for which the <tt> tag was designed. The solution I resorted to was to try to load web pages as JSON files and discard them if they loaded without error, then to discard all GitHub URLs containing /blob/. From there, one can only hope that the signal-to-noise ratio will favor valid natural content.
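
In code, that heuristic looks roughly like this (a sketch ; the function name is illustrative) :

```python
import json
from urllib.parse import urlparse

def looks_like_machine_content(url, raw_text):
    """Crude heuristics : skip GitHub file views and pages that are pure JSON payloads."""
    parsed = urlparse(url)
    if "github.com" in parsed.netloc and "/blob/" in parsed.path:
        return True
    try:
        json.loads(raw_text)
        return True        # the whole page parsed as JSON : machine content, not prose
    except ValueError:
        return False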

In the same spirit, I had to strip out all the <aside>, <nav>, <footer>, etc. tags, which contain off-topic content that can be found on every page of a given website (navigation menus, “recent posts” columns, page footers, TOC, etc.).
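
A sketch of that cleanup, here with BeautifulSoup as an example parser (the exact tag list is illustrative and has to be adapted to the corpus) :

```python
from bs4 import BeautifulSoup

# Tags whose content is either machine language or site-wide boilerplate.
DISCARDED_TAGS = ["code", "pre", "script", "style", "aside", "nav", "footer", "header"]

def extract_main_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(DISCARDED_TAGS):
        tag.decompose()                      # drop the element and everything inside it
    return soup.get_text(separator=" ", strip=True)
```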

Then I had to write a duplicate finder, detecting exact and near text duplicates (using Levenshtein distance) and keeping only the most recent or longest version. This is especially important for pages that keep a history of changes/commits linked from the up-to-date version.
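
A naive sketch of such a deduplication pass, using the python-Levenshtein package as an example (the 0.9 threshold is illustrative, and this version keeps the longest variant rather than the most recent one) :

```python
import Levenshtein   # python-Levenshtein package, used here as an example

def deduplicate(documents, threshold=0.9):
    """Keep only one representative per group of near-duplicates.
    `documents` is a list of plain-text strings ; O(n²), so fine for modest corpora only."""
    kept = []
    for doc in sorted(documents, key=len, reverse=True):   # longest versions first
        if all(Levenshtein.ratio(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept
```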

The last big problem is the recent trend of AJAX-generated content, where content is loaded asynchronously (usually through a REST API) and rendered client-side in JavaScript. Those websites need to be parsed in a headless Chromium browser, which libraries in most scripting languages make possible, but at a terrible computational cost. The main culprits are GitHub and YouTube, but also all the selling platforms (Amazon and the like). To keep the crawler relatively fast, I resorted to discarding these pages entirely, based on the placeholder content they show when JavaScript is disabled (which needs to be tailored manually for each website).
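
The discard heuristic can be as dumb as matching per-site placeholder strings (the strings below are hypothetical examples, not the real ones ; they have to be curated manually) :

```python
# Hypothetical no-JavaScript placeholder strings, one per site.
JS_PLACEHOLDERS = {
    "github.com": "You need to enable JavaScript",
    "youtube.com": "JavaScript is required",
}

def should_discard(domain, html):
    """Discard AJAX-rendered pages based on their no-JavaScript placeholder."""
    placeholder = JS_PLACEHOLDERS.get(domain)
    return placeholder is not None and placeholder in html
```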

Short or long content, not both

The literature on using Word2Vec to analyze Twitter posts reports that the best-performing window size (neighbourhood distance) is 5 words. For my application, I found that 9 works best (11 may be better in some contexts, but it’s not as robust). Trying to index both real pages (long texts) and GitHub commit messages yielded very strange results, with unrelated Git commits showing up in almost every search result, and usually at the best ranks.

My hypothesis is that Git commit messages are usually very short, written in bad English, and use a vocabulary very far from natural language (function and file names). On top of skewing the Word2Vec language model, the centroid of those documents is often the average of only 1 to 3 word vectors. For some reason, that makes them angularly close to virtually any search query.

This suggests that a general-purpose language model is not possible, and that the model needs to be fine-tuned for each corpus.

More accurate is not better

Word2Vec has 2 variants : Continuous Bag of Words (CBOW) and Skip-Gram (SG). The Word2Vec authors report better accuracy for the SG approach at inferring semantics and syntax. However, it yields weird results in the context of information retrieval, akin to overfitting. CBOW infers topics better, that is, it generalizes content better. Indeed, CBOW infers a word given a context (a sequence of words), while SG infers a context given a word. The word inferred by CBOW might be seen as a latent topic of the neighbouring context.

This loss of semantic accuracy is actually desirable in the task of finding relationships between queries and documents. It’s a win for us too, as CBOW is much faster to compute than SG.
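
For reference, in a library like gensim the choice between the two variants is a single flag (toy corpus again, just so the snippet runs on its own) :

```python
from gensim.models import Word2Vec

corpus = [["a", "toy", "tokenized", "document", "so", "the", "snippet", "runs"]]

cbow = Word2Vec(corpus, vector_size=300, window=9, sg=0, min_count=1)       # CBOW : generalizes topics
skipgram = Word2Vec(corpus, vector_size=300, window=9, sg=1, min_count=1)   # SG : sharper, overfit-like here
```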

Stopword removal vs. penalization

Word2Vec has a negative sampling parameter that can be used to penalize frequent words, under the assumption that they don’t carry meaning (that is true of all auxiliaries : “be”, “will”, “do”, etc.). In the context of a specialized AI where the input data is limited to a few hundred thousand documents, it performs worse than manually removing stopwords. In a multilingual system, it’s outright impossible, since stopword frequencies may not be balanced between languages, so the negative sampling might penalize one language more than the others.

I resorted to a list of manually-curated stopwords that are automatically removed after tokenization. To design it, I dump the full word frequency table of the training sample and pick the words that appear the most while not carrying meaning. This method has its limits, for example with one-letter words. Indeed, “A” might be a determiner (a cat) but also part of a commercial name (Sony A7). Same with “I” : “I am”, but also “Canon 5D Mark I”. Depending on how the single letter was typed in the source document, but also on how the tokenizer splits it (or not), the outcome is unpredictable and removing those is risky.
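
The workflow boils down to something like this (the stopword list below is a tiny illustrative sample, not my curated list) :

```python
from collections import Counter

# Tiny illustrative sample : the real list is curated by hand from the frequency dump.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "be", "is", "are", "will", "do"}

def dump_frequencies(tokenized_docs, top=200):
    """Return the most frequent words so they can be reviewed manually."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return counts.most_common(top)

def remove_stopwords(tokens):
    # Case-sensitive on purpose : see the "Sony A7" / "I" caveat above.
    return [t for t in tokens if t not in STOPWORDS]
```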

This shows that, in my opinion, every information retrieval system needs a way to enforce manual, grep-like filtering based on full keyword or regex pattern matching, to narrow down whatever results the clever methods (stats, AI, or both combined) may yield.
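
Such a grep-like safety net can be as simple as a regex pass over the ranked results (a sketch ; the result format is assumed) :

```python
import re

def keyword_filter(ranked_results, pattern):
    """Keep only results whose text matches the regex.
    `ranked_results` is assumed to be a list of (score, text) pairs."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(score, text) for score, text in ranked_results if regex.search(text)]

# e.g. keyword_filter(results, r"\bsony a7\b") keeps only documents that
# literally mention the product name, whatever the vector ranking says.
```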

Measuring quality

An objective metric of the accuracy of an information retrieval system is difficult to design. In my use case, given that I know the dataset to be indexed, I have a couple of test queries that need to return some expected results. I use the results of those queries as a rough estimator of the quality of the language model. Ultimately, it’s all trial and error, based on the designer’s arbitrary expectations.
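
That rough estimator can look like this (the queries and expected document ids below are placeholders, not my actual test set) :

```python
# Placeholder test set : queries mapped to the document ids I expect to see.
TEST_QUERIES = {
    "example query one": {"doc-12", "doc-45"},
    "example query two": {"doc-7"},
}

def recall_at_k(search, k=10):
    """Average fraction of expected documents found in the top-k results.
    `search(query)` is assumed to return a ranked list of document ids."""
    scores = []
    for query, expected in TEST_QUERIES.items():
        top_k = set(search(query)[:k])
        scores.append(len(expected & top_k) / len(expected))
    return sum(scores) / len(scores)
```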

Return on investment

There are numerous open-source libraries that you can use to create your own AI at home, without having to code one yourself. It’s all fine and dandy, but that doesn’t get your AI done. There is a lot of fine-tuning that goes into it :

  • adjusting the hyper-parameters of the AI algo (like noise filtering, window size, SG vs. CBOW, dimensions of the vectors, number of epochs, etc.),
  • adjusting the pre-filtering and cleaning steps (text pre-processing, HTML cleanup, special characters and stop-words removal, etc.).

This alone is very time-consuming, but what is even more time-consuming is waiting for the computations to complete before you can see the results of your changes and keep on tuning.

I am suspicious of “all-purpose” and “one-size-fits-all” AIs. The latest iterations of ChatGPT and similar large language models show that they perform worse than their earlier variants on some tasks. The basic issue with any kind of AI is to gauge the generality vs. accuracy trade-off in the presence of noise, because too much accuracy (that is, overfitting) will be difficult to generalize beyond the training dataset, but too little accuracy makes us lose meaning, until results are too inaccurate to be useful. So I’m a big believer in specialized AI, tuned for one job in one industry.

But then, this bespoke tailoring needs to be done for each client/application. And given the amount of man-hours and computational power it needs to be production-ready, I’m not sure it’s cost-efficient to replace help desks and assistants with AI bots. It seems like a natural evolution for search engines, because the ability to understand synonyms or translations is a huge feat, and topic inference helps to generalize a search query beyond keywords. It is definitely a nice improvement if it can be done at a reasonable energy cost (CPUs and GPUs need a lot of electricity to run large AI models). But I’m not seeing it as a more efficient and less costly way of doing what we already did with people : replacing front-desk clerks with engineers fine-tuning AIs to follow usage is probably not going to cost less, but it will certainly hurt if not enough resources are put into making those AIs sufficiently robust to use.


  1. Market and Marketing share the same stem, but their meaning is quite different, and reducing both to the same market- stem is an overly aggressive meaning generalization. ↩︎

  2. Word2Vec has also been used in song and movie recommendation algorithms, trained on user listening sessions. ↩︎