BERT stands for 'Bidirectional Encoder Representations from Transformers' and is a language representation model that was trained on an English corpus of roughly 3.3 billion words.
The huge difference between BERT and earlier language models is that BERT 'understands' the context in which a word is used. To make this a bit clearer, let's look at the following two example sentences.
- "Bob is running a marathon."
- "Bob is running a company."
It is easy for a human reader to understand that 'running' has a completely different meaning in these examples, but for machines this task is far from trivial.
Earlier models, widely used until about 2018, would have put both occurrences of 'running' into the same semantic box of 'walking fast', because that is the most common usage.
BERT, on the other hand, looks at the context of a word before encoding its meaning into ones and zeros. This makes it a much more accurate and powerful approach to encoding natural language.
For BERT, it is clear that 'running' in the second example implies that Bob leads a company and has nothing to do with physical movement.
After understanding what makes BERT so special, it is easy to see its value for Google's search algorithms.
Not only does it improve the understanding of individual words in user inputs, it also helps Google handle more natural user queries. In 2022, you no longer have to explain keyword-based search to your grandparents.
With the help of BERT, Google learns to understand whole sequences and their semantic connections.
Nowadays, you can ask a question like 'What is the name of the movie where the little boy sees ghosts?' and get 'The Sixth Sense' as the first result, which is a perfectly fitting, and in this case correct, answer.
As good as it sounds, BERT is still not a solution for every kind of semantic search problem.
Although it is a superb way to encode the meaning of words in a query, it doesn't perform well when it comes to comparing the similarity of whole sentences.
Nils Reimers and Iryna Gurevych, the authors of the Sentence-BERT paper, realized this early on and describe the problem as follows:
"Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity […]."
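The ~50 million figure follows directly from the combinatorics: comparing every pair among n sentences takes n(n-1)/2 comparisons, and with plain BERT every single comparison is its own full forward pass through the network. A quick sanity check:

```python
# Number of unique sentence pairs among n sentences: n * (n - 1) / 2.
# With plain BERT, each pair requires a separate inference computation.
n = 10_000
pairs = n * (n - 1) // 2
print(pairs)  # 49995000 -- roughly the "50 million inference computations"
```

SBERT sidesteps this quadratic blow-up by encoding each sentence only once (10,000 forward passes instead of ~50 million) and comparing the resulting fixed-size vectors with a cheap similarity measure.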
A few architectural modifications and some additional fine-tuning for semantic textual similarity let Sentence-BERT (SBERT) cut those 65 hours down to about 5 seconds, which is an incredible performance boost.
The difference from regular BERT is that SBERT goes one abstraction level further and encodes the semantic meaning of the whole sentence instead of only the individual words.
The system's core still builds on the standard pre-trained BERT algorithm and derives its semantic power from it.
So, in a scenario where you want to find the most similar headlines to a given one, you are better off utilizing the SBERT algorithm.
At first glance this information might not seem valuable in the scope of SEO. But with new algorithms on the rise like SMITH, which works similarly to BERT but looks at longer contexts at document scale, getting a feel for SBERT is a good investment.
To give you some insights and shed some light on the capabilities of BERT-powered architectures, I conducted a small experiment with SBERT.
Using SBERT to find similar questions
Now that we know SBERT is much faster at comparing sentences, we will use it to identify questions in a database that are similar to a new question asked by a user.
To do this, I downloaded a huge data dump from Reddit's well-known subreddit 'Explain Like I'm Five' (ELI5) using a script published by Facebook. On ELI5, users can ask questions about all kinds of things and get layman-friendly answers. I filtered out the questions and encoded them with SBERT.
In the next step, I built an algorithm that uses the common vector similarity measure cosine similarity to retrieve the 4 questions from my data dump that are most similar to my input question. To have a baseline for comparison, I ran the same experiment with an encoding strategy called TF-IDF.
TF-IDF encodings reflect how important a word is to a document in a collection or corpus. It is a rather naïve statistical method that operates on the word level and doesn't take semantics into account.
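To make the baseline concrete, here is a minimal TF-IDF retriever written from scratch (my own simplified illustration, not the exact code from the experiment): each question becomes a sparse bag-of-words vector weighted by inverse document frequency, and candidates are ranked by cosine similarity.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn each document into a {word: tf-idf weight} dict."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "why is there racism",
    "how high can a fly fly",
    "do animals get embarrassed or feel shame",
]
query = "is there racism in animals"

# Vectorize the query together with the corpus so they share one vocabulary.
vecs = tfidf_vectors(corpus + [query])
query_vec = vecs[-1]
scores = sorted(
    ((cosine(query_vec, v), doc) for v, doc in zip(vecs[:-1], corpus)),
    reverse=True,
)
print(scores[0][1])  # "why is there racism" -- pure word overlap wins
```

Note how even this toy example reproduces the effect discussed below: the word-overlap model ranks 'why is there racism' first and pushes the animal-related question down, because it has no notion of what the query is actually about.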
In the following section I will discuss the results for two questions I asked and explain the underlying reasoning process.
Example 1 – 'Is there racism in animals?'
Resulting similar questions:

| TF-IDF | SBERT |
| --- | --- |
| Why is there racism? | Do animals express discrimination / racism based on the colour of fur / skin? |
| Why is there racism? | Do animals get embarrassed or feel shame? |
| Why is racism still a thing? | Why is incest breeding bad for animals? |
| Why is racism so common? | Are there murderers, psychopaths or other behavioral deviations commonly associated with human beings among animals? |
The top SBERT result gives the user exactly what they asked for, even though the wording of the question is different.
Here we can observe the power of SBERT: the sentences are compared on a deeper semantic level, which lets the system understand the query in a much more advanced way than a simple TF-IDF model can.
The naïve system fixates on the word 'racism' and provides two top answers containing 'is there racism'. The TF-IDF search mechanism obviously drops the important phrase 'in animals' in order to maximize the word overlap between the sequences.
This makes it clear that the baseline representation works on the word level and doesn't take context into account.
Furthermore, we can see that SBERT even understands that racism has a negative connotation. This is why it also lists questions like 'Do animals get embarrassed or feel shame?'.
This behavior is out of scope for a simple statistical model which was fitted on short questions.
Example 2 – 'Why didn’t we already fly to Mars?'
Resulting similar questions:

| TF-IDF | SBERT |
| --- | --- |
| Could we see "someone" on Mars? | Why haven’t we been able to land on Mars? |
| Why not mars? | Why haven’t we put a man on Mars yet? |
| How high can a fly fly? | Why are we making our expedition to mars a one way trip? |
| Why does it sometimes cost more to fly from A → B than it does to fly from A → B → C? | Why do we need to go to Mars? |
Once again the baseline model concentrates on the simple word-level concepts of 'flying' and 'Mars', while SBERT grasps the intention of the question and provides the user with fitting results on the topic 'travel of mankind to Mars'.
It’s interesting to observe the interchangeable usage of 'already' and 'yet' between the second SBERT-provided question and the user query. This again shows the semantic power of the BERT architecture.
As we can see, SBERT can encode even fine-grained semantic information at the sentence level and provides meaningful results when used for sentence similarity tasks.
Given that algorithms like SMITH are currently being developed, a sentence-level language model like SBERT could well find its way into Google's algorithm too. But what does this power of BERT and BERT-like technologies mean for SEO?
Cutting Edge Language Technologies and SEO
Natural Language Processing made a huge leap forward with the introduction of BERT, and there is no sign of it slowing down.
New papers are being released on a regular basis and the field is getting pushed more than ever by companies like Google, Facebook and Amazon.
With this in mind, the challenge of SEO will shift from caring about backlinks, keyword optimization, meta descriptions, and so on, to simply generating quality content for users.
People want to find precise and compact information when they search for something, and you need to be able to deliver exactly that.
The era in which we optimized for machines is slowly coming to an end. In the future, machines will optimize for us.