What is SMITH from Google?
SMITH (Siamese Multi-depth Transformer-based Hierarchical encoder) is a deep-learning model that, like BERT, is designed to understand the meaning of text.
It was first published in October 2020 in the research paper "Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching" and is notable for its ability to predict how a block of text will continue within the context of the entire document.
This predictive capability makes SMITH very powerful at understanding long documents; according to the paper's authors, it is even better than BERT itself at understanding long content.
How does SMITH affect positioning?
According to internal Google sources, SMITH does not yet affect ranking, as it has not been implemented. After the Google Core Update of December 2020, rumors arose in forums such as Blackhatseoword suggesting that the algorithm was launched as part of that update, mainly because one of the update's strong points was the indexing of text passages, in which SMITH could play a very important role.
However, there is no further evidence that SMITH has gone into production, and on Twitter Danny Sullivan, Google's Search Liaison, confirmed that SMITH is not yet live, noting that not every paper Google publishes ends up producing tools that are used in the search engine.
Why is it important?
Thanks to its ability to understand the content of long texts, SMITH could improve features such as related-news and related-article suggestions and, in general, anything involving the grouping of content or documents.
It could also affect ranking if implemented in Google's search algorithm, allowing it to answer users' search intent with longer content. However, there still seems to be quite a way to go before that happens, and we cannot know whether it will ever come into effect.
In any case, it is a topic worth paying attention to: few publications claim to improve on the performance of the most cutting-edge tools (in this case BERT), and those few articles are the ones most likely to end up materializing in how the Google search engine works.
We will have to keep an eye on how SMITH evolves so we can stay one step ahead, improving our content and positioning before changes occur and catch us off guard.
How is SMITH different from BERT?
BERT (Bidirectional Encoder Representations from Transformers) is one of the most recent and well-known additions to Google's search algorithm.
Unlike SMITH, BERT is limited to the analysis and prediction of short text, such as a paragraph or a few sentences. According to the authors of the paper that introduces SMITH:
"In recent years, self-attention based models like Transformers... and BERT ...have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length."
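The "quadratic computational complexity" the authors mention means the attention score matrix has one entry per pair of tokens, so its cost grows with the square of the input length. A minimal arithmetic sketch (the function name is ours, for illustration only):

```python
# Illustration of why self-attention limits input length: the attention
# score matrix holds one score per token pair, so it grows quadratically.

def attention_matrix_entries(seq_len: int) -> int:
    """Number of pairwise attention scores for a sequence of seq_len tokens."""
    return seq_len * seq_len

# Going from BERT's 512-token limit to SMITH's 2048 tokens multiplies
# the per-layer attention cost by 16, not by 4:
print(attention_matrix_entries(512))   # 262144
print(attention_matrix_entries(2048))  # 4194304
print(attention_matrix_entries(2048) // attention_matrix_entries(512))  # 16
```

This is why simply feeding a long document to a standard Transformer quickly becomes impractical, and why a different approach is needed for long-form matching.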
Beyond the technical details that differentiate the architectures of SMITH and BERT, the former, as we have already mentioned, is capable of analyzing longer texts, increasing the maximum text input from 512 tokens to 2048. According to the paper's authors:
"Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention..., multi-depth attention-based hierarchical recurrent neural network..., and BERT.
Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048."
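The "hierarchical" part of SMITH's name points to how it sidesteps the quadratic limit: the document is split into blocks, each block is encoded on its own, and the document representation is then built from the block representations, so no single attention step spans the whole document. A toy sketch of that two-level idea follows; it is an illustration, not the real model (we use a trivial stand-in encoder instead of learned Transformers, and the block size is an arbitrary choice):

```python
# Toy sketch of two-level (hierarchical) document encoding, the idea
# behind SMITH. Not the actual architecture: the block encoder here is
# a trivial stand-in, and block_size is an arbitrary assumption.

from typing import List


def split_into_blocks(tokens: List[str], block_size: int = 4) -> List[List[str]]:
    """Level 1: partition the document into fixed-size token blocks."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def encode_block(block: List[str]) -> float:
    """Stand-in for a learned block encoder: here, just the mean token length."""
    return sum(len(t) for t in block) / len(block)


def encode_document(tokens: List[str], block_size: int = 4) -> List[float]:
    """Level 2: build the document representation from block representations,
    so no encoding step ever attends over the full document at once."""
    return [encode_block(b) for b in split_into_blocks(tokens, block_size)]


doc = "smith encodes long documents block by block then combines them".split()
print(encode_document(doc))  # [6.25, 4.0, 6.0]
```

Because the expensive per-block work only ever sees `block_size` tokens at a time, the cost grows roughly linearly in the number of blocks rather than quadratically in the document length.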
Despite its ability to analyze longer stretches of text, SMITH is presented by its authors as a complement to BERT, not a replacement. The reason is left to guesswork: it is probably lower performance on short text, which would limit SMITH's use to the specific cases where Google decides it could significantly improve results over BERT.
We will keep an eye on the ongoing experiments and on the changes in the next Core Update to see whether we are in for a surprise.
In this article we have quoted the following paper on several occasions:
Liu Yang, Mingyang Zhang, Cheng Li, Mike Bendersky & Marc Najork (2020). Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching.