Gmail’s Spam Filtering Just Got a Whole Lot Better

By Jesse Hollington

3 Min Read Published: Dec 5th, 2023

While not everyone is a fan of Google services such as Gmail, it’s hard to deny that one area where the email provider excels is preventing spam from reaching your inbox.

It’s a stark contrast to Apple’s iCloud, which has one of the weakest anti-spam filters among major providers. While Apple Mail on your Mac can make up for some of iCloud’s server-side deficiencies, those who deal with larger volumes of unsolicited bulk email often need to turn to third-party tools such as SpamSieve (which has thankfully been updated to work with the new Mail plug-in restrictions in macOS Sonoma).

Now, Gmail is taking its already powerful spam filters to the next level with a new technology that should close many loopholes spammers use to get around classic text-based and Bayesian spam filtering.

In a recent post on the Google Security Blog, Elie Bursztein, Cybersecurity & AI Research Director, and Software Engineer Marina Zhang explain how Google has implemented a new technology known as RETVec that will protect Gmail inboxes from the emoji-laden emails that often make it past many traditional spam filters.

The Google team refers to these as “adversarial text manipulations,” which are deliberate attempts by spammers to stuff special characters, emojis, and other junk into emails that are readable by humans but difficult for machine algorithms to identify as spam.

In covering the news over at Ars Technica, Ron Amadeo shares an example of what a spam message such as this looks like. The trick lies in using homoglyphs, which are “obscure characters that look like they’re part of the normal Latin alphabet but actually aren’t.” The manipulations range from simple things like swapping zeros for the letter “O” to inserting periods, underscores, and strange underlined characters to confuse the machines.

Gmail homoglyph spam example (image: Ron Amadeo / Ars Technica)

The result is that a spam filter looks at this hot mess of an email and basically gives up.
Ron Amadeo, Ars Technica
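
To see why exact text matching struggles here, consider a minimal Python sketch. The keyword list and the `naive_is_spam` function are hypothetical and exist only for illustration; they are not how Gmail or any real filter works.

```python
import unicodedata

# Hypothetical keyword list, for illustration only
BLOCKLIST = {"free", "prize"}

def naive_is_spam(text: str) -> bool:
    """Flag text if any whitespace-separated token exactly matches a blocked keyword."""
    return any(token in BLOCKLIST for token in text.lower().split())

plain = "claim your free prize"
# Looks nearly identical to a human, but uses Cyrillic 'е' (U+0435) and the digit 1
disguised = "claim your fr\u0435\u0435 pr1ze"

print(naive_is_spam(plain))        # the plain message is caught
print(naive_is_spam(disguised))    # the disguised one is not
print(unicodedata.name("\u0435"))  # the look-alike is a different character entirely
```

The two messages read the same to a person, yet the disguised one contains no token the keyword list recognizes.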

The biggest challenge in developing anti-spam algorithms to deal with these character manipulations is finding a way to do so efficiently. Gmail processes hundreds of billions of emails per day, and nobody wants their messages needlessly delayed while complex algorithms chew through them to make sure they’re okay to land in your inbox.

After all, consider how many possible combinations there would be for common words and phrases once you factor in characters that can be swapped out for numbers, math symbols, emojis, and foreign-language character sets like Cyrillic and Hebrew. Building lookup tables to analyze all of those permutations is complex and resource-intensive.
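
A rough back-of-the-envelope count shows how quickly those combinations pile up. The sketch below assumes a small, made-up table of look-alike characters (`LOOKALIKES` is hypothetical and far smaller than what spammers actually have available) and multiplies the number of choices for each letter in a word.

```python
from math import prod

# Hypothetical look-alike table: each letter maps to variants a spammer might use
LOOKALIKES = {
    "o": ["o", "0", "\u043e"],        # Latin o, zero, Cyrillic o
    "e": ["e", "3", "\u0435"],        # Latin e, three, Cyrillic ie
    "i": ["i", "1", "!", "\u0456"],   # Latin i, one, exclamation mark, Cyrillic i
    "a": ["a", "@", "4", "\u0430"],   # Latin a, at sign, four, Cyrillic a
}

def variant_count(word: str) -> int:
    """Count the spellings of `word` reachable by swapping in look-alikes."""
    return prod(len(LOOKALIKES.get(ch, [ch])) for ch in word.lower())

print(variant_count("free"))             # 3 * 3 = 9
print(variant_count("congratulations"))  # 3 * 4 * 4 * 4 * 3 = 576
```

Even with only four substitutable letters and no inserted periods or underscores, a single long word already has hundreds of spellings, and a lookup table would need an entry for each one.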

Google’s answer is RETVec, short for “Resilient & Efficient Text Vectorizer.” It’s designed to work across languages and character sets as quickly as possible, using machine learning to recognize manipulated text much as a human reader would, by treating look-alike spellings as the same word rather than merely comparing the exact characters that make them up.
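
One way a vectorizer can cover every language and character set without a lookup table is to work directly from each character’s Unicode code point. The sketch below illustrates that general idea only; the function names, the 24-bit width, and the 16-character word length are assumptions for illustration, and this is not RETVec’s actual code.

```python
def char_to_bits(ch: str, width: int = 24) -> list[int]:
    """Return the character's Unicode code point as `width` bits, least significant first."""
    code_point = ord(ch)
    return [(code_point >> i) & 1 for i in range(width)]

def word_to_matrix(word: str, max_len: int = 16, width: int = 24) -> list[list[int]]:
    """Encode up to `max_len` characters, padding short words with all-zero rows."""
    rows = [char_to_bits(ch, width) for ch in word[:max_len]]
    rows += [[0] * width for _ in range(max_len - len(rows))]
    return rows

latin = word_to_matrix("free")
mixed = word_to_matrix("fr\u0435\u0435")  # same word with Cyrillic look-alikes

print(len(latin), len(latin[0]))  # every word becomes a fixed-size grid of bits
print(latin[2] == mixed[2])       # look-alike characters still encode differently
```

An encoding like this gives any character a representation without a vocabulary, but it does not by itself make look-alikes similar. In a design along these lines, that job would fall to a trained model sitting on top of the encoding.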

It’s a similar idea to the machine learning that both Apple and Google use to identify objects within photos, scaled to work on the millions of email messages that pass through Gmail’s filters every second.

Google RETVec Gmail Spam Filter Performance — Google

Over the past year, we battle-tested RETVec extensively inside Google to evaluate its usefulness and found it to be highly effective for security and anti-abuse applications. In particular, replacing the Gmail spam classifier’s previous text vectorizer with RETVec allowed us to improve the spam detection rate over the baseline by 38% and reduce the false positive rate by 19.4%.
Google

According to Google’s RETVec page on GitHub, this approach allows it to work with only 200,000 parameters instead of the millions that traditional text classification models would require. That also makes it lightweight enough to be deployed on devices rather than requiring farms of high-powered cloud servers.
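
Some quick arithmetic gives a sense of what that parameter count means in practice. The sketch below assumes each parameter is stored as a 32-bit float, and the 10-million-parameter comparison model is a made-up stand-in rather than a figure from Google.

```python
BYTES_PER_PARAM = 4  # assumption: parameters stored as 32-bit floats

def model_size_mb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate size of a model's weights in megabytes (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000

small_mb = model_size_mb(200_000)      # the parameter count cited above
large_mb = model_size_mb(10_000_000)   # hypothetical larger model for comparison

print(small_mb)  # 0.8
print(large_mb)  # 40.0
```

Under those assumptions the weights come to less than a megabyte, which is small enough to ship inside an app.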

Google’s security team says that RETVec is “one of the largest defensive upgrades” made in the past few years. The company has been testing RETVec with Gmail over the past year and has more recently begun rolling it out to end users, providing better protection against the craftier spam emails that previously slipped into inboxes.