This is a follow-up to the previous post, Websites suck, which covered the preliminary information retrieval step.

# Introduction

On the open-source planet, in the 2020’s, information is scattered over many websites : scientific journals for theory, specification sheets for standards and protocols, software documentation for “how to use tool”, blogs and YouTube tutorials for “how to achieve goal”, forums and support for “how to solve problems”, Github for “what is known to break” and “why design (or lack thereof) was done this way”, source code for implementation details, and books for everything considered worthy of payment for access.

For developers, engineers, designers, this kind of scattering is already a productivity hindrance, but at least one could argue they know (or have a chance to learn) how to navigate the network graph of information relationships. For end-users, forget it : all that remains is typical Google or Bing, which are way too broad and too noisy, since they infer content relevance from domain reputation and SEO, rather than from the authority of the source.

Luckily, software vendors like Algolia have indexing and information retrieval solutions that can work across domains and websites, provided you control them, through centralized REST APIs and decentralized client-side JavaScript. However, they don’t work on PDF content, which still is the (stupid) sweetheart format of anything technical (scientific papers, standards and protocols specifications), and will remain so for decades to come.

It remains that the recurring problem of anything technical is technical jargon : in science and technology, there are no synonyms. Knowledge is built by drawing logical relationships between precise concepts, labelled with precise words. Each word has a precise meaning that cannot be replaced by any other word, because changing the word changes the referred concept, which changes the reasoning. Worse : some scientific words also exist in common language but mean something else.1 But end-users will usually not know the accurate jargon, and even developers are often sloppy on proper terminology, which is a big issue when you are retrieving information by keyword search…

“You don’t know what you don’t know” is worse than what people usually think, because you don’t even know what it’s called, which means you have no entry point in a typical information retrieval system (IRS) based on keyword matching. Geeks from the open-source planet often forget it when they send users to RTFM2 or, worse, to Google, as a spiteful answer to a “stupid” or redundant question on a forum or mailing list. On the other hand, it is tiresome and uninteresting to answer the same questions over and over again, especially if it’s unpaid work. But nothing has been done so far to provide an actionable solution to this problem, which keeps resentment growing on both ends of the information seeker-provider spectrum.

All this is calling for an IRS understanding “synonyms”, that is :

  • real synonyms (lexical),
  • misnomers that end-users use when they want to mean something roughly technical.

Again, IRS like Algolia implement synonym-aware AIs, but they only suggest keywords to improve queries, they don’t seem to actually index documents using synonyms, that is internally represent documents in a synonym-agnostic way. Not to mention, Algolia aggressively second-guesses queries and infers typos in rare words, as Google has done since around 2020, which is incompatible with technical jargon — necessarily less used than common language.

We can also generalize the notion of synonyms to translations. In a multilingual project, not all information may be available in all languages. On the other hand, automatic machine translation nowadays produces fair-enough results for simple sentences. Directing users to auto-translated pages when nothing useful or up-to-date was found in their language is useful, and a language-agnostic IRS will help here.

# Synonym-agnostic representation of information

In 2013, Mikolov et al. introduced Word2Vec,3 a vector representation of words in a hyperspace, where the relative distance between words is driven by their probability of co-occurrence in natural language samples. Two architectures are proposed : Skip-Gram (SG), based on conditional probability and better suited for semantic and syntactic tasks (analogies derived from simple arithmetic on vectors, or even content recommendations4), and Continuous Bag of Words (CBOW), better suited for “aboutness” and topic modelling, as it can predict a missing word from its context.5

Word2Vec CBOW has been used by Mitra et al.5 to design a dual embedding space model (DESM), used to rank documents by relevance against a query. For a document $D$ containing $k$ words, all words are converted to n-dimensional vectors $\vec{d_i}$ using a trained Word2Vec CBOW model, then the document is represented by the centroid of all individual word vectors, normalized :

$$\bar{D} = \frac{1}{k} \sum_{i = 1}^{k} \frac{\vec{d_i}}{||\vec{d_i}||}$$

This representation leads to a natural similarity metric between a document and a vectorized query $Q$ (which can itself be another document’s centroid), using the cosine similarity ($\cdot$ denotes the scalar “dot” product of two vectors, $|| \quad ||$ the Euclidean L² norm):

$$similarity = \frac{\vec{Q} \cdot \bar{D}}{||\vec{Q}|| ||\bar{D}||}$$

This representation of a document as a vector is not only synonym-aware (through the Word2Vec CBOW internals), it is also extremely efficient as it turns the online ranking problem into simple vector algebra and allows pre-computing $\bar{D}$ offline.
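The centroid and similarity computations reduce to a few lines of NumPy. This is an illustrative sketch, not the actual core.nlp code:

```python
import numpy as np

def document_centroid(word_vectors: np.ndarray) -> np.ndarray:
    """Centroid of the L2-normalized word vectors (one row per word)."""
    norms = np.linalg.norm(word_vectors, axis=1, keepdims=True)
    return (word_vectors / norms).mean(axis=0)

def cosine_similarity(query_vec: np.ndarray, centroid: np.ndarray) -> float:
    """Cosine similarity between a vectorized query and a document centroid."""
    return float(query_vec @ centroid
                 / (np.linalg.norm(query_vec) * np.linalg.norm(centroid)))
```

Since all $\bar{D}$ are pre-computed offline, ranking the whole index against one query vector is a single matrix-vector product.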

But Mitra et al. introduce a twist here. The Word2Vec neural architecture uses 3 layers : input ($I$), hidden ($H$) and output ($O$). The algorithm goes from layer to layer through multiplication by an embedding matrix ($[W_{in}]$ and $[W_{out}]$), each containing the list of embedding vectors for every word, which is optimized by the learning process :

$$\begin{cases} \vec{I} \cdot [W_{in}] &= \vec{H}\\ \vec{H} \cdot [W_{out}] &= \vec{O} \end{cases}$$

The objective of the CBOW optimizer is to push the $W_{in}$ vectors of words closer to the $W_{out}$ vectors of other words they commonly co-occur with. The original Word2Vec algorithm makes use only of the $W_{in}$ vectors once the language model is trained, which makes cosine similarities maximal between words typically similar, that is usually found embedded in the same structures. Mitra et al. note that the cosine similarity between $W_{in}$ and $W_{out}$ vectors is maximal for words topically similar, that is usually co-occurring in the same vicinity. Therefore, they propose a modified ranking algorithm, based on cosine similarities using vectors from both matrices, for a query $Q$ containing $l$ words:

$$DESM_{in-out}(Q, D) = \frac{1}{l} \sum_{i = 1}^{l} \frac{\vec{q}_{in,i} \cdot \bar{D}_{out}}{||\vec{q}_{in, i}|| \; ||\bar{D}_{out}||}$$

Mitra et al. observe that a typical $DESM_{in-in}$ vector-based ranking (using only $W_{in}$ for document and query vectors) is only marginally better than the old BM25 algorithm6 (which turns keyword statistics in documents into relevance scores through probabilities), but $DESM_{in-out}$ performs significantly better. In addition, $DESM_{in-out}$ is more robust to keyword stuffing, the traditional shortcoming of methods based on probabilities inferred from keyword statistics : since it is much more sensitive to context than to individual keywords, it is less deceived by unrelated keywords added to an otherwise typical context, and will still rank the right context better than the wrong context for the stuffed keywords.5
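In code, $DESM_{in-out}$ boils down to averaging cosine similarities between each query word’s IN vector and the document’s OUT centroid. A minimal sketch, assuming the vectors were already pulled from the trained model:

```python
import numpy as np

def desm_in_out(query_in_vectors: np.ndarray, doc_centroid_out: np.ndarray) -> float:
    """DESM_in-out: mean cosine similarity between each query word's W_in
    vector (one row per word) and the document centroid built from W_out."""
    q = query_in_vectors / np.linalg.norm(query_in_vectors, axis=1, keepdims=True)
    d = doc_centroid_out / np.linalg.norm(doc_centroid_out)
    return float((q @ d).mean())
```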

However, the best accuracy is achieved when word embeddings are computed on queries instead of documents. This setup is unavailable when building a search engine, but it suggests user queries will have to be stored for future training updates.

They also note that the BM25 algorithm has a better separating effect than $DESM$ and propose a mixed model :

$$MM(Q, D) = (1 - \alpha) DESM_{in-out}(Q, D) + \alpha BM25(Q, D)$$

with BM25 parameters $k_1 = 1.7$, $b = 0.95$. In my implementation, I will use $\alpha = 0.03$. In place of the old BM25, which is known to favor short documents, I use the BM25+ algorithm6 with parameters $k_1 = 1.7$, $b = 0.3$, $\delta = 0.65$, which were found to be optimal. I am unable to perform systematic tests of accuracy/relevance in my setup, for lack of query/results/user feedback datasets in my field.
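For illustration, here is my reading of the BM25+ term score and of the mixing formula, with the parameters quoted above. This is a simplified sketch, not the exact core.nlp implementation; in particular, the $\delta$ bonus is only applied to matched terms here:

```python
import math
from collections import Counter

def bm25_plus(query, doc, corpus, k1=1.7, b=0.3, delta=0.65):
    """BM25+ score of a tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue  # in this sketch, delta lower-bounds matched terms only
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        score += idf * (norm + delta)
    return score

def mixed_model(desm_score, bm25_score, alpha=0.03):
    """MM(Q, D) = (1 - alpha) * DESM_in-out(Q, D) + alpha * BM25+(Q, D)."""
    return (1 - alpha) * desm_score + alpha * bm25_score
```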

The DESM logic sits on the line between an actual information retrieval system (keyword-based) and a recommendation algorithm (similarity-based).

I have implemented the DESM in the Virtual Secretary core.nlp module.

# Accounting for word patterns

Both the above document representation and the subsequent relevance ranking, whether based on BM25 TF-IDF or on cosine similarity, treat documents as unordered bags of words. This is relevant for a topical approach but rules out exact-sentence search.

Implementation-wise, one could think of regular expression patterns to cover this use case; however, regexes are too rigid : they match exact keywords (including typos) in the exact order (discarding permutations). They can be relevant for expert use cases, but not for the general audience.

Pang et al.7 introduced an image-processing-inspired neural network based on turning text into a 2D interaction matrix between a document and a query. Without going all the way to the neural network, I reuse this intuition here to add a pattern-matching term to the above $MM(Q, D)$ ranking coefficient, using a simple convolution product.

We can build the 2D interaction matrix $IT$ of a document $D$ and a query $Q$ by setting:

$$IT_{i, j} = \begin{cases} 1 &\text{ if } Q_i = D_j \\ 0 &\text{ else } \end{cases}$$

Interaction matrix between the document “I think Word2Vec is rocking hard but it is rocking into my headache too Rocking Word2Vec is but it is hard to think” and the query “Word2Vec is rocking hard”
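Building such an interaction matrix is a one-liner with NumPy broadcasting (a sketch; in practice the tokens would already be stemmed and meta-tokenized):

```python
import numpy as np

def interaction_matrix(query_tokens: list, doc_tokens: list) -> np.ndarray:
    """IT[i, j] = 1 where query token i equals document token j, else 0."""
    q = np.array(query_tokens)
    d = np.array(doc_tokens)
    return (q[:, None] == d[None, :]).astype(float)
```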

We then convolve the interaction matrix with a normalized identity matrix $I_N / N$, whose dimension $N$ is the number of query keywords, using a circular boundary condition, that is :

$$IT_{direct} = IT * \begin{bmatrix} 1/N & 0 & 0 & 0 & .\\ 0 & 1/N & 0 & 0 & .\\ 0 & 0 & 1/N & 0 & .\\ 0 & 0 & 0 & 1/N & .\\ . & . & . & . & .\end{bmatrix}$$

The resulting $IT_{direct}$ is expected to display a row-wise cumulative sum of $1$ where we have exact matches between the document content and the query, that is at the center (or its closest neighbour) of the word pattern found.

Interaction matrix convolved by $I_N / N$

The above operation will find exact patterns, in the exact order of declaration of the query keywords. It will also, to some extent, find keyword permutations and partial patterns, but only because of the circular boundary condition and only if the query is relatively short. To improve permutation detection, we convolve $IT$ again with a rotated normalized identity matrix, that is:

$$IT_{reverse} = IT * \begin{bmatrix} 0 & 0 & 1 / 3\\ 0 & 1 / 3 & 0\\ 1 / 3 & 0 & 0\end{bmatrix}$$

Interaction matrix convolved by rotated 3×3 identity matrix

That last result shows cumulative sums of $1$ for the exact match, but also for partial reversed matches (“rocking Word2Vec is”), which is oversensitive. As a final result, we will then average both methods : $IT_{correlation} = \frac{1}{2}(IT_{direct} + IT_{reverse})$

Average of convolutions by the direct and reverted kernels of the interaction matrix

Now the cumulative sum is $1$ only for the exact match, and we have peaks over $0.5$ for the partial matches, whose relative magnitude reflects the number of keywords found. We can also find the position of the match in the document, which could be used in a GUI to return the content snippet around the match. This was quickly tested but proved unreliable on webpages containing a lot of “aside” content : sidebars or footer content would often get the best match, and the snippets adjusted to the position of the best match were rarely more relevant than the typical page description/summary (often found in the <meta name="description"> HTML tag in the page <head>).

The issue with this method is that it is a lot slower than the DESM or the BM25+. As such, we will have to run it only on a subset of the best results found by the previous methods.

The final ranking algorithm will be:

$$rank(D, Q) = (1 - \alpha) DESM_{in-out}(Q, D) + \alpha BM25(Q, D) + \left|\left|\frac{1}{2}(IT_{direct} + IT_{reverse})\right|\right|_\infty \label{ranking}$$

where $||\frac{1}{2}(IT_{direct} + IT_{reverse})||_\infty$ is the infinity norm, that is the maximum of the row-wise sums over the average of both convolution results.

The ranking score is supposed to be capped at 1 (or 100 %). The current logic is equivalent to saturating the score when the exact keyword pattern is found, therefore bypassing any topical or similar-terms search. This method obviously makes no sense for queries containing fewer than 3 keywords.
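The whole interaction-matrix term can be sketched with SciPy’s 2D convolution in circular (“wrap”) mode. This is illustrative code, not the actual core.nlp implementation:

```python
import numpy as np
from scipy.signal import convolve2d

def pattern_score(it: np.ndarray) -> float:
    """Infinity norm of the average of the direct and reverse circular
    convolutions of an interaction matrix (query words x document words)."""
    n = it.shape[0]               # kernel dimension = number of query keywords
    direct = np.eye(n) / n        # identity kernel I_N / N
    reverse = np.fliplr(direct)   # rotated (anti-diagonal) kernel
    conv_direct = convolve2d(it, direct, mode="same", boundary="wrap")
    conv_reverse = convolve2d(it, reverse, mode="same", boundary="wrap")
    average = 0.5 * (conv_direct + conv_reverse)
    # infinity norm: maximum of the row-wise sums
    return float(np.abs(average).sum(axis=1).max())
```

An empty interaction matrix scores 0, while an exact pattern match saturates the score at 1.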

I have implemented the full ranking algorithm in the Virtual Secretary core.nlp module.

# Prefiltering the training set

In the original Word2Vec paper,3 the authors achieve very good accuracy when training on the Google News dataset (6 billion words). This amounts to “cheating” the signal/noise ratio through big data, which they can afford. Correct accuracy is achieved at roughly 700 million words. In my parallelized implementation, I use around 135 million words, which almost saturates the 32 GB of RAM of my computer.

Performance aside, this is way above what small businesses, individuals and open-source projects can generate, in terms of text, during their lifetime. In any case, the required computational power is still unaffordable, though the Word2Vec algorithm, once implemented in C and parallelized, achieves very reasonable runtimes : 100 epochs over 135 million words, with a vector size of 512 dimensions, run in 4 hours on my laptop (Intel Xeon mobile, 8 × 4.1 GHz).

A good AI model needs to generalize well, that is, discard noise and peculiarities. Noise is diluted by big data. Without big data, we will have to clean up inputs to generalize. This is usually the most overlooked aspect of AI : it was roughly 80 % of my effort on this project, and around 95 % of the source code written.8

# Filtering meta-tokens

As far as Word2Vec is concerned, 1 is not the same thing as 5 or 2563. Each number will therefore be identified as a different token, and the neural network will try to find their hidden meanings separately. We could therefore remove all numbers or, better, replace them with a _NUMBER_ meta-token that restores some generality to help the optimizer.

The problem is numbers happen in:

  • software versions: v1.0, v2, or sometimes just 3.0,
  • dates and time,
  • file paths and URLs,
  • commercial brands and products.

That means we need to extract URLs, paths, emails, dates, times, etc. before filtering numbers, which calls for a full range of meta-tokens.

I have implemented such a list of pattern/meta-token replacements using regular expressions in a dedicated Python module : Virtual Secretary core.patterns.

Some examples:

  • 2023-08-01 (year-month-day), 2023/08/10, March 25, 25 mars, 25 mars 2021, 2021 March 25 will all be replaced by _DATE_,
  • 12h15, 12:15, 12:15:00, 6:00, 12am, 12 am, 12 h, 6 h, 12:15:00Z, 12:15:00+01, 12:15:00 UTC+1 will all be replaced by _TIME_,
  • http://domain.ext , https://domain.ext , https://subdomain.domain.ext/page , //domain.ext, http://domain.ext/?search=query&sort=asc  will all be replaced by _URL_,
  • ~/folder, ~/.folder/, ./folder, C:\folder\file, /test/file will all be replaced by _PATH_,
  • $15, 15€, 15.5 €, 5 USD, EUR 5, 12k€, £12K will be replaced by _PRICE_,
  • 1.5in, 12 inches, 12’, 12 ft, 12.5 feet, 12,000’’, 5 km, 25 µm, 25µm, 25 micrometers, 2,5 cm, 12 m will be replaced by _DISTANCE_,
  • +2 °C, -5.2°C, 200 K, 250°F, 2.5 degC, 272 kelvin, 25 degree C will be replaced by _TEMPERATURE_,
  • @me, me@here, me@domain.ext, user1234, user6 will be replaced by _USER_,
  • 123456, 12.456, 12,456, 12_45, 12/45, 0-2, .1, 2. will be replaced by _NUMBER_,
  • etc.
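As an illustration of the approach, here is a heavily simplified sketch with a handful of hypothetical patterns (the real list in core.patterns is far more exhaustive). Order matters : URLs and dates must be consumed before the generic number filter:

```python
import re

# Hypothetical, simplified meta-token patterns; applied in order.
META_PATTERNS = [
    (re.compile(r"https?://\S+|//\w[\w.-]*\.\w+\S*"), "_URL_"),
    (re.compile(r"\b\d{4}[-/]\d{1,2}[-/]\d{1,2}\b"), "_DATE_"),
    (re.compile(r"\b\d{1,2}[:h]\d{2}(:\d{2})?\b"), "_TIME_"),
    (re.compile(r"\b\d+([.,/]\d+)?\b"), "_NUMBER_"),
]

def replace_meta_tokens(text: str) -> str:
    """Substitute meta-tokens for raw patterns, before word tokenization."""
    for pattern, token in META_PATTERNS:
        text = pattern.sub(token, text)
    return text
```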

This is an issue for some commercial products using numbers, especially cameras (Nikon D 300, Canon 5 D) if the numbers are spelled separately from the letter. The subsequent loss of specificity calls here for a bypass or an additional filter in the topical/similar information retrieval system, to allow users to input exact keywords to match.

Meta-tokenization prior to tokenization also makes tokenization more reliable. Using the NLTK word_tokenize method, which relies on pre-trained unsupervised machine learning models, I noticed inconsistencies in token splitting on some punctuation signs, especially dots used in version numbers or as decimal separators, or parentheses. The “supervised” pre-filtering step based on regexes makes it more reliable and consistent.

However, the performance penalty of applying a bank of regex filters to each document to index is quite noticeable. I had to parallelize the loop, and use the regex Python package in concurrent mode (releasing the Python GIL).

# Building Dumbrish

Dumbrish stands for DUMB FRench englISH, because my IRS needs to work in a bilingual setup for French and English, and artificial intelligence is dumb anyway.

Stemming is a process that has long been used in natural language processing, consisting of removing word suffixes to find their roots (or stems). Stems were the first attempt at modelling synonyms in automated language processing.

Many French and English words share the same stem :

  • labor / labeur,
  • profile / profil,
  • decide / décider,
  • education / éducation.

Many other exist in both languages, as-is :

  • action,
  • film,
  • catalogue,
  • classification,
  • profit.

Some other words differ by their double consonants :

  • professional / professionnel,
  • literature / littérature,
  • environment / environnement.

With this intuition, we can forge a synthetic language in which enough words will be reduced to the same stem across French and English, allowing the Word2Vec learning process to infer translations (as synonyms, found in the same context) for the words that don’t share the same stems. The “stemmable” words will therefore act as anchors for the language model, which will then build inference on “non-stemmable” words from these anchoring points.

The rules of Dumbrish are simple:

  1. Consonants can’t be doubled.
  2. Accented characters are removed. Since people often misspell them in French (and some don’t use them at all), and they only change the phonetic sound (except for “à”), there is no point in keeping them.
  3. Dumbrish uses American spelling : British -our words (“behaviour”, “colour”) are transformed to their American spelling (“behavior”, “color”) to match the Latin roots (“color”) and be consistent with French stems (“coloriste”, “coloriser”).
  4. Plural and feminine forms in -s, -e, -ée and -ses are removed for words longer than 3 letters.
  5. The following word suffixes are removed in order:
    • -ity, -ité, -ite (quantity, quantité, activity, activité), for words of more than 4 letters,
    • -tor, -teur, -trice, replaced by -t (aviator, aviateur, aviatrice)
    • -ed (managed, relieved), for words of more than 4 letters,
    • -ment, -ement (management, immédiatement, indictment), for words of more than 4 letters,
    • -sion, -tion, replaced by -s and -t respectively (action, commission), for words of more than 4 letters,
    • -isme, -ism, -iste, -ist (socialist, socialism), for words of more than 3 letters,
    • -at (reliquat, predicat, postulat), for words of more than 4 letters,
    • -tif, -tiv, -tive, replaced by -t (active, actif), for words longer than 2 letters,
    • -y, replaced by -i (apply, really), for words longer than 3 letters,
    • -er (installer, chanter), for words longer than 4 letters,
    • -iz, -ize, -yze, replaced by -is (analyze, optimize, size), for words longer than 4 letters,
    • -eur, replaced by -or (couleur, chaleur, candeur), for words longer than 3 letters,
    • -ique, -ic replaced by -i (politique, arithmétique), for words longer than 3 letters.

Short words are not stemmed automatically, as too many of them would wrongly be reduced to the same stem. The following shows an example of the stemming process, step by step :

  graph TD;
      subgraph "Natural language"
        A_3[acte];
        A_1[actionné];
        A_2[activity];
        A_4[acteur];
        A_5[acting];
        A_6[activated];
        A_7[action];
      end

      A_3----->E_2;
      A_1-->B_1(actionne);
      B_1-->C_1(actione);
      C_1-->D_1(action);
      D_1-->E_4;
      A_2---->B_2(activ);
      B_2-->E_1;
      A_4----->E_5;
      A_5----->E_3;
      A_7----->E_4;
      A_6-->B_6(activat);
      B_6-->C_6(activ);
      C_6-->E_1

      subgraph "Dumbrish"
        E_1[act];
        E_2[acte];
        E_3[acting];
        E_4[action];
        E_5[actor];
      end

The stemming process, once again, introduces generality by discarding specificity, and therefore by losing meaning. But the DESM cares more about context than about individual syntactic and semantic meaning of word vectors, so this should be of little consequence.

# Stopwords removal

Word2Vec deals with frequent words (stopwords) by applying a negative penalty, to reduce their weight in the optimization process. Again, the Google guys (Mikolov et al.) can afford to do that, with their 6 billion word training sample. The penalty is set as a hyper-parameter. I have empirically found that no hyper-parameter value gives good results for training samples between 300k and 75 million words : too much meaning is lost, or too much noise is accepted (probably leading to over-fitting).

It is all the more problematic here because my corpus is bilingual, and I have no guarantee that the languages are evenly sampled. Frequent words are less likely to be spotted in the minority language.

I therefore manually remove 869 stopwords, using a list generated from word statistics when preparing the training corpus. Stopwords are mostly grammatical operators, auxiliaries, quantifiers and adverbs that don’t help define topics.

# N-grams

N-grams are associations of words that become atomic words in themselves (like New York, or Paris Saint Germain). It is worth mentioning that some N-grams use dashes to separate their members (New-York) while others don’t (name/surname for persons), and web writers don’t always follow the proper standard (which is language-dependent anyway). Therefore, we can’t rely on proper observance of grammatical rules to infer meaning.

The original Word2Vec paper starts with a prefiltering step that collapses n-grams into single tokens. However, this is already the result of an artificial intelligence that should be trained on its own. Again, in the context of a specialized IRS trained on a limited dataset, this seems more dangerous than useful.

In addition, we don’t use Word2Vec here to achieve semantic tasks, where implicitly identifying nouns and separating them from verbs and adverbs might be crucial. On the contrary, here we handle context and try to guess the missing word (which can be seen as a topic) from it. It doesn’t seem useful or relevant to collapse n-grams.

# Prefiltering queries

Spell-checking on user queries, using Peter Norvig’s Python algorithm and the Levenshtein distance, was attempted, but the runtimes are too bad to be useful online. This might become worth it if a C library with Python bindings appears in the future, but the pure Python implementation is far from usable.

# Scraping text content

We can’t stress enough the importance of having massive amounts of data before even dreaming of training AIs. As I mentioned here and in my previous post, Websites suck, most of the work on this AI-driven search engine has been scraping websites and cleaning their garbage HTML.

One of the most unexpected problems was extracting the date of publication from websites… Through all the HTML standards, in a time when we dream of virtual reality, the internet of things, web 3.0 and counting, there is still no standard, unified way to declare and retrieve the date of publication of a webpage. Here are the 8 different attempts my crawler makes for each page to retrieve a date :

  • from sitemap.xml (if any). Not systematically used, but at least there is no guessing, though some CMS apparently fake dates here to force page re-indexation by bots (or perhaps the date is the one of the latest blog comment…).
  • <meta name="date" content="...">, stupid simple and self-explanatory, which is why nobody uses it.
  • <meta name="publishDate" content="...">, apparently only used by Adobe documentations. I didn’t find which standard/semantics it belongs to.
  • <meta property="article:published_time" content="..."> and <meta property="article:modified_time" content="...">, Facebook’s own OpenGraph protocol.
  • <meta name="dc.date" content="...">, Dublin Core metadata (used only on some scientific journals that invested on the Internet through open edition).
  • <time datetime="...">, standard HTML 5.0 microdata, but doesn’t necessarily apply to the whole current page (it can be comments or replies, or anything).
  • <relative-time datetime="...">, Microsoft GitHub’s own way of saying “fuck you standards”.
  • <div class="dateline">...</div>, used on PhpBB forums. Yes, I was desperate.

It should go without saying that all information is contextual and therefore has an implied expiration date, which makes the publication date a very basic feature to implement early in any IRS. Regardless, even with all those attempts, no date is found on many pages.

Another challenge here is to discard the “aside” content that doesn’t belong (or is not specific) to the current page : the “new posts/new comments” sidebars, menus, breadcrumbs, advertising, footer links, newsletter subscription forms, etc., as well as archive pages that partially duplicate individual posts. Those have the potential to bias the language model and to mislead the content-based search engine.

The crawler and web scraper are implemented in the Virtual Secretary module core.crawler. It prepares data in a form that is immediately reusable by the natural language processing and search engine modules. It can crawl websites from sitemaps or recursively. It can also read PDFs and perform OCR on scanned PDFs. Long PDFs containing outlines (summaries built from section titles) are broken into individual sections which are indexed on their own.

# Performance

The implementation uses an index of 49,273 documents, a dictionary of 99,481 words, and a Word2Vec vector size of 512. The training step is done on a Thinkpad P51 laptop from 2018, with 32 GB RAM and an Intel(R) Xeon(R) CPU E3-1505M v6 @ 3.00GHz × 8 cores. No GPU is used.

# Offline training

The Word2Vec training step uses a larger sample : 415k documents, for a total of 134 million words. Tokenizing the whole sample and training Word2Vec through 80 epochs takes 4 h 30 on average, and can only use a maximum of 3 cores (each of them using around 7 GB of RAM, given that Python multiprocessing forks processes and fully duplicates objects from the main process). The uncompressed (pickled) trained model uses 194 MiB of storage.

The index pre-computation, including tokenization and vectorization of documents, takes 37 minutes, and can use all 8 cores since it only needs around 2.5 GB of RAM per core. The slightly compressed pre-computed index, which embeds its Word2Vec model, uses 558 MiB of storage.

# Online searching

Once loaded, the index uses 2 GB of RAM, including the language model. I compared performance on search requests composed of 3 to 5 keywords, with an average at exactly 4, on the same laptop and on a shared hosting webserver (designed mostly for Linux/Apache/MySQL web CMS, 80 €/year) where the present website is hosted too.

| Platform | DESM | BM25+ | Convolutions | Total |
|----------|------|-------|--------------|-------|
| Laptop   | 0.02 | 0.08  | 0.83         | 1.00  |
| Server   | 0.03 | 0.06  | 0.82         | 0.97  |

Runtimes (wallclock) in seconds, for each method and for the whole process

Each method is a term in the ranking equation $\eqref{ranking}$, so they all have to be computed (at least for requests having 3 keywords or more). The convolutions are run only on the 400 best results, sorted from the output of DESM and BM25+. The total includes an extra user-defined filtering step (by document origin URL), which is not presented here (simple text match).

Though DESM is based on unsupervised artificial intelligence, it is actually 3 times faster than the old BM25+ : once the vector representation of documents is pre-computed (offline), all that remains to compute online is a simple 512×49,273 dot product (the numerator of the cosine similarity), for which we use Numpy, which internally calls optimized BLAS routines.

The runtimes shown here are a lot worse than what commercial search engines achieve. However, they are still very reasonable, especially given the kind of hardware used to run the search, and equivalent to what typical web CMS search engines achieve for internal database lookups. Removing the convolutions part, whose real benefit remains to be proven (early results show they tend to agree with the other methods, and barely reweigh the relevance scores), the server is able to serve a request in about 10-15 ms.

# Conclusion

The above methods have been implemented in an open-source Python framework : Virtual Secretary. A server-side implementation is provided with Chantal for image processing and color science topics.

The following graph shows a 2D projection of the language model for the query “camera ISO noise”. The black dots represent the keywords found closest to the whole query centroid (by cosine similarity). The green dots represent arbitrary keywords added to the graph to show where they would end up : “color”, “camera”, “sensor”, “vibrant”, “iso”, “remov”, “cast”, “gamut”, “nois”, “highlight”, “shadow”, “social”, “photograph”, “politi”, “franc”. The blue dots represent the 5 closest keywords to each of the previous keywords, taken individually.

2D projection by T-distributed Stochastic Neighbor Embedding of the 512D embedding vectors from Chantal v1.1 Word2Vec CBOW model

This shows that synonyms are well understood but antonyms are not separated properly. Again, since we work with context and topics, instead of semantics or syntax, this is of no consequence for the current application.

All this is obviously a side-project that went too far. Having worked on the Darktable project since 2018, and being the only developer involved in user education as well (through YouTube tutorials, typical HTML documentation and user forums), I became the visible face of the project without planning for it, nor having the staff to shield me from solicitations. In 2022, the volume of communications I had to handle reached the point of no return, on top of the usual yearly burn-out caused by the (lack of) project management that has become the signature of the Darktable project.

I started the Virtual Secretary project to automatically sort my emails, which was the first step toward digital sanity. That did help to sort out spam and newsletters, but most of the human-sent emails were basically turning me into a human information retrieval system for content I already wrote, sometimes in French and in English. This made me realize the next problem.

Many projects, even businesses, have problems caused by the lack of documentation. The natural solution is to write some documentation. What I overlooked in the past is that having documentation is merely a preliminary step :

  • documentation describes the tool more than how to use it. How to use it to achieve a defined goal is actually the scope of a course.
  • writing courses is incredibly difficult, as changes in the software tools can quickly make them outdated.
  • nobody reads documentation linearly (not even me). People pick the parts they are interested in, meaning they need easy “non-linear” entry points : search engines. As explained in the introduction, keyword-based search engines only work for experts who know the proper terminology.
  • documentations need to be somehow tied to tutorials and more “hands on” approaches, which are complementary. And the other way around, ideally.

Knowledge is more than the sum of individual pieces of information : it’s the relationships between them. Beyond the current hype of generative artificial intelligence, whose ethics and actual results are disputable, drawing relationships between objects is where AI shines. As documentation is not enough, IRS are the next logical step to make documentation actually discoverable. General-purpose search engines (Google, Bing and the rest) do a pretty poor job here on specialized and technical topics. The need for specialized – jargon-aware – language models and AI-based IRS is real. Apart from Apache Lucene, which is written in Java (and, as such, badly suited for server-side implementation and public APIs), there is still no real solution to this problem. Virtual Secretary provides an efficient Python framework that offers very good performance, since most of it only calls C libraries through Python bindings.

The next step of this work would be to use Doc2Vec to find related pages within the document index.


  1. “Weight” is carelessly used for “mass” in common language, whereas physics tells us “mass” is a quantity of matter, measured in kilograms, and “weight” is the force resulting from gravity applied to some mass, measured in newtons. ↩︎

  2. Read The Fucking Manual ↩︎

  3. MIKOLOV, Tomas, CHEN, Kai, CORRADO, Greg, et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. URL ↩︎ ↩︎

  4. CASELLES-DUPRÉ, Hugo, LESAINT, Florian, et ROYO-LETELIER, Jimena. Word2vec applied to recommendation: Hyperparameters matter. In : Proceedings of the 12th ACM Conference on Recommender Systems. 2018. p. 352-356. URL  ↩︎

  5. MITRA, Bhaskar, NALISNICK, Eric, CRASWELL, Nick, et al. A dual embedding space model for document ranking. arXiv preprint arXiv:1602.01137, 2016. URL ↩︎ ↩︎ ↩︎

  6. ROBERTSON, Stephen, ZARAGOZA, Hugo, et al. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 2009, vol. 3, no 4, p. 333-389. URL  ↩︎ ↩︎

  7. PANG, Liang, LAN, Yanyan, GUO, Jiafeng, et al. Text matching as image recognition. In : Proceedings of the AAAI Conference on Artificial Intelligence. 2016. URL  ↩︎

  8. Granted, the Python ecosystem of machine learning packages (Gensim, Scikit-learn, NLTK, etc.) makes it really fast to whip up your own AI with minimal code to write and very reasonable performance, since they all mostly provide Pythonic bindings to lower-level C libraries. ↩︎