Spam is a problem almost as old as the Internet itself, and no matter how much progress is made against it, it never fully goes away. Attackers and cybersecurity companies are locked in a ‘race’ in which each side constantly tries to outpace the other: spammers develop new types of spam, and defenders develop new detection methods in response. Despite that, it’s very likely that your Gmail inbox still receives a lot of spam.
The techniques used by spammers are now so advanced that filters struggle to keep up. It is easy to filter emails that contain certain words or come from certain servers, but it is much harder when the malicious email looks more ‘real’ than a genuine one.
Now, Google may have made a vital breakthrough in the fight against spam. According to its own figures, the new approach improves spam detection by 38% compared with the previous system, while reducing the number of false positives by 19.4% and the processing required by no less than 83%. Is it magic?
Gmail against spam
No, it’s not magic, it’s something more intelligent. Google has focused on a technique spammers use to bypass filters and convince the user that an email is legitimate: modifying the text so that it looks like something it is not. Many spam emails now use homoglyphs, characters that look similar to others; for example, the number ‘0’ in place of the letter ‘O’, which in some fonts are almost indistinguishable to the naked eye. There are also many special characters, such as those used in mathematics, that look like letters unless you look closely. Other methods of bypassing filters include invisible characters and keyword stuffing, in which extra words are added to mislead detection algorithms. Some emails even include deliberate spelling mistakes, which the reader corrects mentally without noticing.
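To make the problem concrete, here is a minimal Python sketch (not Google’s code) of how these tricks slip past a naive keyword filter, and how a simple normalization step catches a few of them. The blocklist, the look-alike table and the function names are hypothetical, chosen for illustration only.

```python
import unicodedata

# Hypothetical blocklist and look-alike table, for illustration only.
BLOCKED_WORDS = {"free", "prize"}
# Mapping digits to letters would also mangle real numbers; a real filter
# would need something far more careful than this.
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "3": "e"})


def naive_filter(text: str) -> bool:
    """Flag text only if a blocked word appears exactly as written."""
    return any(word in BLOCKED_WORDS for word in text.lower().split())


def normalize(text: str) -> str:
    """Undo a few common obfuscations before matching."""
    # NFKC folds compatibility characters (such as mathematical bold
    # letters) into their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible format characters, e.g. the zero-width space (category Cf).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Lowercase, then replace the digit look-alikes from the table above.
    return text.lower().translate(LOOKALIKES)


def normalized_filter(text: str) -> bool:
    """Apply the same blocklist after normalization."""
    return naive_filter(normalize(text))


if __name__ == "__main__":
    sample = "Claim your fr\u200bee pr1ze"  # zero-width space plus a digit swap
    print(naive_filter(sample), normalized_filter(sample))
```

Hand-written rules like these only cover the tricks someone thought of in advance, which is why a learned approach is attractive.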
To combat these attacks, Google uses a new type of text vectorizer called RETVec, which has been trained to withstand these techniques, including the insertion and deletion of characters, spelling errors, homoglyphs, the substitution of one letter for another, and more. The model is built on top of a new text encoder that can encode any character and works with more than 100 languages, including Spanish.
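One way an encoder can cover every character without keeping a vocabulary is to turn each character’s Unicode code point into a fixed-width list of bits. The sketch below illustrates that general idea only; the 24-bit width, the 16-character limit and the function names are assumptions made for this example, not details given in this article.

```python
from typing import List


def encode_char(ch: str, bits: int = 24) -> List[int]:
    """Return the code point of `ch` as `bits` binary digits, lowest bit first."""
    code = ord(ch)
    return [(code >> i) & 1 for i in range(bits)]


def encode_word(word: str, max_chars: int = 16, bits: int = 24) -> List[List[int]]:
    """Encode a word as a fixed-size grid of bits, truncating or zero-padding."""
    rows = [encode_char(ch, bits) for ch in word[:max_chars]]
    # Pad short words with all-zero rows so every word has the same shape.
    rows += [[0] * bits for _ in range(max_chars - len(rows))]
    return rows


def decode_char(row: List[int]) -> str:
    """Inverse of encode_char, to show that no information is lost."""
    return chr(sum(bit << i for i, bit in enumerate(row)))


if __name__ == "__main__":
    grid = encode_word("señal")
    print(len(grid), len(grid[0]))
    print("".join(decode_char(row) for row in grid[:5]))
```

Because every Unicode code point fits in 21 bits, a 24-bit row can represent any character in any language with no lookup table.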
The key is that this model does not rely on a list of millions of words to check against, the approach used until now, which is very demanding. Instead, RETVec uses only around 200,000 parameters and works in a way closer to how humans read: using machine learning, it relies on how ‘similar’ words look rather than on the exact letters written in the email. Thanks to this, it does not need a large server to run, and in fact Google has released the RETVec source code so that anyone can use it on their own servers; for example, to combat spam in comments on web pages.
The effect of RETVec has already been noticed in some Gmail accounts, since Google has been testing the model for a year, but it is only now beginning to reach all accounts.
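The idea of comparing words by how similar they look, rather than by exact match, can be illustrated with a much simpler stand-in than RETVec: character trigrams and cosine similarity. This is not how RETVec works internally (it is a trained model); the sketch below, with hypothetical function names, only shows why a misspelled word can still land close to the original.

```python
import math
from collections import Counter


def trigrams(word: str) -> Counter:
    """Count overlapping 3-character pieces of a word, with '#' as padding."""
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def similarity(a: str, b: str) -> float:
    """Cosine similarity between the trigram counts of two words (0.0 to 1.0)."""
    va, vb = trigrams(a), trigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # A deliberate misspelling stays close to the real word...
    print(round(similarity("account", "acount"), 2))
    # ...while an unrelated word shares no trigrams with it.
    print(round(similarity("account", "weather"), 2))
```

An exact-match word list would treat ‘acount’ as an unknown word, whereas a similarity score still places it next to ‘account’.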