From Stanford to Speech AI: 8kHz Audio Watermarking for the Real World

It’s not every day that two students get the chance to turn a class project into technology that could secure billions of phone calls. Our final project in Stanford’s CS224S: Spoken Language Processing began as a collaboration with Sanas, a pioneer in real-time speech AI, and ended with a breakthrough in audio watermarking for low-bandwidth telephony, which powers billions of conversations worldwide every day.
We came into the course from different backgrounds. Ryota brought years of experience in jazz guitar and a deep interest in audio signal processing. Rushank came from a computer science track focused on AI, with prior research experience applying machine learning to drug development and disease diagnosis.
Sanas offered us a challenge that aligned perfectly with our combined strengths: watermarking 8kHz audio, which is the standard for most of the world’s calls yet rarely addressed in published research. In just a few weeks, we combined Ryota’s audio expertise with Rushank’s machine learning skills to create a watermark that’s imperceptible to listeners yet resilient against compression, transmission, and even removal, ultimately surpassing leading industry benchmarks.
A Stanford Project Meets Sanas Innovation
Our paths to this project were as different as our backgrounds.
For Rushank, the spark came while deciding whether to take the class. The concept of speech-to-speech models stood out and sparked the idea that such technology could be used to translate accents in real time. Curious whether this had already been invented, he ran a quick Google search, and Sanas came up as the top result. That discovery, along with the realization that our professor, Dr. Andrew Maas, a well-recognized figure in the field, was connected to the company, stayed in his mind during the first weeks of class. When it was time to choose a final project, Rushank asked for an introduction, and Dr. Maas kindly connected him to former Stanford student and Sanas CTO Shawn Zhang.
Ryota’s route was a little less direct. With years of experience in music and a background in signal processing from the electrical engineering department, the world of audio already felt familiar. When Rushank pitched the Sanas opportunity, the challenge of watermarking low-bandwidth audio — a space where audio DSP expertise could combine with new machine learning techniques — was too compelling to pass up.
We kicked off the project by meeting with Shawn, who shared a problem in an area Sanas was actively working on and pointed us toward existing industry benchmarks and resources to help us get started. From there, Sanas paired us with engineer Alvaro Escudero, whose weekly calls helped us workshop ideas, test assumptions, and keep the project moving forward.
It quickly became clear that this wasn’t a solved problem we could simply adapt; it was an open challenge that would require pulling together insights from multiple disciplines and pushing past the limits of existing research.
The Challenge: Why 8kHz Audio Watermarking Is So Difficult
The challenge Sanas gave us had been largely overlooked in published research: how to create a robust, imperceptible 8kHz audio watermarking solution for the format still used in most modern telephony.
While watermarking for high-fidelity speech has been studied extensively, lower-bandwidth audio presents a tougher problem. At 8kHz, almost no part of the spectrum is unused: every frequency is already doing critical work carrying the voice signal. That meant our watermark needed to vanish to the human ear yet remain intact after heavy compression, transmission, and even deliberate attempts at removal.
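To make the basic tension concrete, here is a minimal spread-spectrum-style sketch — emphatically not Sanas's actual method, and every name, key, and amplitude in it is hypothetical. A keyed pseudorandom sequence is added at an amplitude far below the voice, and detection correlates the received audio against the same keyed sequence:

```python
import numpy as np

SAMPLE_RATE = 8_000  # narrowband telephony rate discussed above

def embed_watermark(signal: np.ndarray, key: int, strength: float = 0.005) -> np.ndarray:
    """Add a keyed, low-amplitude pseudorandom sequence to the signal.

    `strength` keeps the mark far below typical speech amplitudes so it is
    hard to hear; only a holder of `key` can regenerate the sequence later.
    """
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=signal.shape)
    return signal + strength * mark

def detect_watermark(signal: np.ndarray, key: int) -> float:
    """Return the normalized correlation with the keyed sequence.

    Scores near zero suggest no watermark; a score near the embedding
    strength suggests the mark is present.
    """
    rng = np.random.default_rng(key)
    mark = rng.choice([-1.0, 1.0], size=signal.shape)
    return float(np.dot(signal, mark) / len(signal))

# Demo on five seconds of synthetic "speech": a 300 Hz tone plus noise.
t = np.arange(5 * SAMPLE_RATE) / SAMPLE_RATE
rng = np.random.default_rng(0)
voice = 0.3 * np.sin(2 * np.pi * 300 * t) + 0.01 * rng.standard_normal(t.shape)

marked = embed_watermark(voice, key=42)
print("unmarked score:", detect_watermark(voice, key=42))
print("marked score:  ", detect_watermark(marked, key=42))
```

A production system like the one described in this post must also survive codec compression and shape the mark psychoacoustically; this toy sketch does neither, which is exactly why the real problem is hard.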
Solving this required fluency across a surprising range of disciplines: audio signal processing, machine learning, psychoacoustics, and the infrastructure behind modern telephony. Speech AI lives in a rare space where mathematics, engineering, linguistics, neuroscience, and human perception all converge — and success depends on making them work in harmony.
Over just a few weeks, we combined our expertise in a focused sprint of experimentation and evaluation. The result was a model that surpassed the performance of Meta’s AudioSeal, widely recognized as the state-of-the-art (SOTA) architecture for audio watermarking, despite operating in a far more constrained audio domain.
In the figure below, the circled concentration of watermark energy in Meta’s example produces audible artifacts, while our 8kHz prototype embeds faint, structured patterns at low amplitudes, in frequency bands that don’t interfere with perceptual quality.
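Spectrogram comparisons like this one come from a short-time Fourier transform. A minimal sketch of how one might inspect where a watermark's energy sits — on a synthetic signal, with arbitrary window and hop sizes, not the project's actual analysis:

```python
import numpy as np

def spectrogram(x: np.ndarray, n_fft: int = 256, hop: int = 128) -> np.ndarray:
    """Magnitude STFT with a Hann window: rows are frames, columns are bins."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# Synthetic 8kHz example: one second of a tone as "voice", plus a faint mark.
rng = np.random.default_rng(0)
voice = 0.3 * np.sin(2 * np.pi * 300 * np.arange(8_000) / 8_000)
marked = voice + 0.005 * rng.choice([-1.0, 1.0], size=voice.shape)

# A spectrogram of the residual shows where the mark's energy actually lands.
residual_db = 20 * np.log10(spectrogram(marked - voice) + 1e-12)
print("peak watermark energy (dBFS):", residual_db.max())
```

Plotting `residual_db` as an image (time on one axis, frequency on the other) is how figures like the one above are typically produced.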

Our work also earned recognition as the highest-scoring paper in our class, a graduate-level CS course with more than 200 students. And most meaningfully, we contributed directly to Sanas’ ongoing R&D in a field with real-world, global significance.
Why It Matters: Disciplines Converge for Real-World Impact
This project showed us firsthand how breakthroughs often happen at the intersection of different disciplines.
For Rushank, it was a reminder that even well-studied research problems like audio watermarking, an area with a steady stream of new publications and innovation, can leave major gaps when it comes to practical, real-world needs. Protecting voice communication in the age of deepfakes and fraud is one such opportunity. For Ryota, it showed how his passions for music and audio translate directly into speech AI, right down to the psychoacoustic details of how humans perceive sound.
Through this process, we both found ourselves drawn to speech AI as a field we might want to pursue further in our studies and careers, one that blends our existing skills with new technical challenges and the potential for global impact.
Partnering with Sanas gave us a front-row seat to the value of collaboration between academia and industry: the speed of a startup, the rigor of a research university, and the shared goal of building something that matters.
In just a few weeks, we saw an idea go from a conversation in class to a prototype that can contribute to the trustworthiness of billions of phone calls, addressing a critical gap in today’s voice tech, audio AI, and cybersecurity landscapes. And perhaps most importantly, it was a reminder that innovation really does thrive when curiosity meets collaboration.
This is just the beginning for audio watermarking. Keep an eye out for future announcements, and sign up to be the first to know when we share more details.