SFrame: End-to-End Encryption for Group Video Calls (RFC 9605)

A one-to-one video call can be genuinely end-to-end encrypted with the standard browser plumbing. Add a third person and a group server, and most of that protection quietly evaporates: the server in the middle gets to decrypt every frame so it can route them. SFrame, published as RFC 9605 in 2024, closes that gap by encrypting the media frame itself, so the server forwards bytes it can never read.

When you join a two-person WebRTC call, the media stream is protected by SRTP, the Secure Real-time Transport Protocol. SRTP encrypts the audio and video packets between the two endpoints, and for a direct call that is genuinely end-to-end. The problem starts the moment a call grows past two people, because peer-to-peer meshes do not scale. In a mesh, a four-person call needs each device to send its video to three others, and a fifty-person webinar would need forty-nine simultaneous uploads from every laptop.

The Server in the Middle

To make group calls work, conferencing systems route everyone's media through a central server called a Selective Forwarding Unit, or SFU. Each participant sends one upload to the SFU, and the SFU forwards the appropriate streams to everyone else. It decides who sees whose video, drops layers when a connection is weak, and switches the active speaker. This is the architecture behind essentially every scalable video product.
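The scaling argument is simple arithmetic, and a few lines make it concrete. This is a minimal sketch under a simplifying assumption of ours: every participant sends exactly one stream and receives everyone else's, with no simulcast or layered encoding. The function names are illustrative, not part of any library.

```python
def mesh_uploads_per_participant(n: int) -> int:
    """In a full mesh, each participant sends its stream to every other participant."""
    return max(n - 1, 0)


def mesh_total_streams(n: int) -> int:
    """Total one-way media streams crossing the network in a full mesh."""
    return n * max(n - 1, 0)


def sfu_uploads_per_participant(n: int) -> int:
    """With an SFU, each participant uploads once, whatever the call size."""
    return 1


if __name__ == "__main__":
    for n in (2, 4, 50):
        print(
            f"{n} people: mesh uploads each = {mesh_uploads_per_participant(n)}, "
            f"mesh streams total = {mesh_total_streams(n)}, "
            f"SFU uploads each = {sfu_uploads_per_participant(n)}"
        )
```

Per-participant upload cost grows linearly in a mesh and stays constant behind an SFU, which is why the server in the middle is so hard to avoid.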

Here is the catch. SRTP is hop-by-hop. Your phone encrypts to the SFU, the SFU decrypts, then re-encrypts to each recipient. Between the decrypt and the re-encrypt, the server holds your camera and microphone feeds in the clear. The link is encrypted, but the server is not blind. If that server is compromised, subpoenaed, or simply curious, the content is right there. Marketing copy that calls this "encrypted video" is technically true and practically misleading, because the threat model most people care about is exactly the operator in the middle.

The distinction that matters

"Encrypted in transit" means each network hop is protected, but intermediaries decrypt along the way. "End-to-end encrypted" means only the participants hold the keys, and every server is a blind relay. For group calls, the gap between these two is the conferencing server, and closing it is the entire point of SFrame. The same distinction we draw for end-to-end encryption in general applies with extra force to real-time media.

Encrypt the Frame, Not the Packet

The earlier attempt to fix this, a framework called PERC (Privacy-Enhanced RTP Conferencing, RFC 8871), tried to layer a second encryption on top of SRTP while keeping the headers the SFU needed. It worked on paper but was complicated enough that almost nobody shipped it. SFrame took a different angle: encrypt at the level of the media frame, which is the unit the video codec produces, sitting one layer above the network packets.

When your encoder produces a video frame, SFrame encrypts that frame with an authenticated cipher before it is split into transport packets. The SFU still sees the packet headers it needs to do its job (sizes, timing, sequence numbers), but the actual payload, the pixels and the audio samples, is ciphertext it has no key for. It forwards opaque bytes. The receiving client reassembles the frame and decrypts it. The codec never knows anything happened.
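The order of operations (encrypt the frame, then packetize, forward, reassemble, decrypt) can be sketched with the Python standard library. Two loud assumptions. First, the cipher here is a toy encrypt-then-MAC construction, with an HMAC-SHA-256 keystream standing in for AES-CTR because the standard library has no AES; it is not an SFrame cipher suite and should not be used to protect anything. Second, the 8-byte counter header, the packet dictionaries, and every function name are ours, chosen for illustration only.

```python
import hashlib
import hmac
import os

HEADER_LEN = 8         # toy header: an 8-byte counter, NOT the real SFrame header
TAG_LEN = 16           # truncated HMAC tag length in bytes (assumed)
PAYLOAD_BUDGET = 1200  # assumed payload bytes per transport packet


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: HMAC-SHA-256 in counter mode, a stand-in for AES-CTR."""
    out = b""
    block = 0
    while len(out) < length:
        out += hmac.new(key, nonce + block.to_bytes(4, "big"), hashlib.sha256).digest()
        block += 1
    return out[:length]


def _tag(auth_key: bytes, header: bytes, ciphertext: bytes) -> bytes:
    # The header has a fixed length here, so header + ciphertext is unambiguous.
    return hmac.new(auth_key, header + ciphertext, hashlib.sha256).digest()[:TAG_LEN]


def seal_frame(enc_key: bytes, auth_key: bytes, counter: int, frame: bytes) -> bytes:
    """Encrypt one whole media frame; returns header + ciphertext + tag."""
    header = counter.to_bytes(HEADER_LEN, "big")
    nonce = counter.to_bytes(12, "big")
    ciphertext = _xor(frame, _keystream(enc_key, nonce, len(frame)))
    return header + ciphertext + _tag(auth_key, header, ciphertext)


def open_frame(enc_key: bytes, auth_key: bytes, blob: bytes) -> bytes:
    """Verify the tag first, then decrypt; raises ValueError on any tampering."""
    header, ciphertext, tag = blob[:HEADER_LEN], blob[HEADER_LEN:-TAG_LEN], blob[-TAG_LEN:]
    if not hmac.compare_digest(tag, _tag(auth_key, header, ciphertext)):
        raise ValueError("authentication failed; frame rejected")
    nonce = int.from_bytes(header, "big").to_bytes(12, "big")
    return _xor(ciphertext, _keystream(enc_key, nonce, len(ciphertext)))


def packetize(blob: bytes, seq_start: int) -> list:
    """Split the already-encrypted frame into sequence-numbered packets."""
    return [
        {"seq": seq_start + i, "payload": blob[off:off + PAYLOAD_BUDGET]}
        for i, off in enumerate(range(0, len(blob), PAYLOAD_BUDGET))
    ]


def sfu_forward(packets: list) -> list:
    """The SFU holds no key: it copies each packet unchanged."""
    return [dict(p) for p in packets]


def reassemble(packets: list) -> bytes:
    return b"".join(p["payload"] for p in sorted(packets, key=lambda p: p["seq"]))


if __name__ == "__main__":
    enc_key, auth_key = os.urandom(32), os.urandom(32)
    frame = os.urandom(5000)  # stands in for one encoded video frame
    packets = sfu_forward(packetize(seal_frame(enc_key, auth_key, 1, frame), seq_start=100))
    print(len(packets), "packets forwarded;",
          "round trip ok:", open_frame(enc_key, auth_key, reassemble(packets)) == frame)
```

Note that sfu_forward never receives a key argument. Everything it could act on is in the sequence numbers and payload lengths, which is the property the design is after.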

Encrypting whole frames rather than individual packets is also more efficient. A single video frame can span dozens of packets, and adding an end-to-end authentication tag to every one of them, on top of the per-packet SRTP protection already there, wastes bandwidth on overhead. SFrame attaches its encryption metadata once per frame, which keeps the cost low enough to be practical on a phone.
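A rough back-of-the-envelope calculation shows the size of the saving. The numbers below are assumptions chosen for illustration (a 36,000-byte keyframe, 1,200 bytes of payload per packet, a 16-byte tag, a 3-byte header), not measurements, and the per-packet figure counts only tags, so it understates what a real per-packet scheme would cost.

```python
import math


def packets_per_frame(frame_bytes: int, payload_budget: int) -> int:
    """How many transport packets one encoded frame is split into."""
    return math.ceil(frame_bytes / payload_budget)


def per_packet_overhead(frame_bytes: int, payload_budget: int, tag_len: int) -> int:
    """Extra bytes if every packet carried its own end-to-end tag."""
    return packets_per_frame(frame_bytes, payload_budget) * tag_len


def per_frame_overhead(header_len: int, tag_len: int) -> int:
    """Extra bytes if the whole frame carries one header and one tag."""
    return header_len + tag_len


if __name__ == "__main__":
    frame_bytes, budget, tag_len, header_len = 36_000, 1_200, 16, 3  # all assumed
    print("packets:", packets_per_frame(frame_bytes, budget))
    print("per-packet tagging:", per_packet_overhead(frame_bytes, budget, tag_len), "bytes")
    print("per-frame tagging:", per_frame_overhead(header_len, tag_len), "bytes")
```

Under these assumptions the frame spans 30 packets, so per-packet tagging adds 480 bytes where per-frame tagging adds 19.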

What SFrame Uses Under the Hood

SFrame relies on authenticated encryption, the AEAD model, so that every frame is both kept secret and protected from tampering. A receiver that gets a flipped bit or a forged frame rejects it rather than rendering garbage or, worse, attacker-controlled content. The specification defines cipher suites from the standard modern families: AES-GCM with 128-bit and 256-bit keys, and a construction built on AES-128 in CTR mode with HMAC-SHA-256 that allows shorter authentication tags. These are the same families used elsewhere in well-built protocols.

Each frame carries a small SFrame header with a key identifier (KID) and a counter (CTR). The counter feeds the cipher's nonce: SFrame combines it with a salt derived alongside the key to form the nonce for that frame. This is critical, because reusing a nonce with the same key is catastrophic for AES-GCM, and the monotonic counter is what keeps every frame's nonce unique. This is the same discipline we cover in our piece on nonce-reuse vulnerabilities, applied to a stream of thousands of frames per minute.
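To make the header and nonce concrete, here is a minimal sketch of both, following the layout in RFC 9605 as we read it: one configuration byte whose two halves either hold a small KID or CTR value directly (0 to 7) or give the length of the value that follows, and a nonce formed by XORing the big-endian counter into a salt of the cipher's nonce length. The function names are ours, and the sketch should be checked against the RFC's test vectors before anyone relies on it.

```python
def _min_bytes(value: int) -> bytes:
    """Shortest big-endian encoding of a non-negative integer (at least one byte)."""
    return value.to_bytes(max(1, (value.bit_length() + 7) // 8), "big")


def encode_header(kid: int, ctr: int) -> bytes:
    """Encode KID and CTR as: config byte, then KID bytes, then CTR bytes."""
    if not (0 <= kid < 2**64 and 0 <= ctr < 2**64):
        raise ValueError("KID and CTR must fit in 64 bits")
    config, tail = 0, b""
    if kid < 8:
        config |= kid << 4                      # small KID stored inline
    else:
        kid_bytes = _min_bytes(kid)
        config |= 0x80 | ((len(kid_bytes) - 1) << 4)  # flag bit + (length - 1)
        tail += kid_bytes
    if ctr < 8:
        config |= ctr                           # small CTR stored inline
    else:
        ctr_bytes = _min_bytes(ctr)
        config |= 0x08 | (len(ctr_bytes) - 1)   # flag bit + (length - 1)
        tail += ctr_bytes
    return bytes([config]) + tail


def frame_nonce(salt: bytes, ctr: int) -> bytes:
    """XOR the counter, as a big-endian integer of the salt's length, into the salt."""
    counter_bytes = ctr.to_bytes(len(salt), "big")
    return bytes(s ^ c for s, c in zip(salt, counter_bytes))


if __name__ == "__main__":
    print(encode_header(0x0123, 0x4567).hex())
    print(frame_nonce(bytes(12), 1).hex())
```

Because XOR with a fixed salt is a bijection, two different counter values under the same key can never produce the same nonce. The sender's only job is to never repeat or rewind the counter.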

SFrame Does Not Solve Key Distribution, and That Is Deliberate

The honest boundary of SFrame is this: it tells you how to encrypt a frame once everyone in the call shares the right keys. It says almost nothing about how those keys get there. That is intentional separation of concerns, and it is where the design becomes genuinely interesting, because the obvious partner is already standardized.

SFrame handles the media. A group key-management protocol handles the membership. Pair them and you get a video call where the server routes ciphertext, and the keys rotate correctly every time someone joins or leaves.

That partner is MLS, the Messaging Layer Security protocol (RFC 9420). MLS was built to give a group of participants a shared, continuously updated secret, with the cryptographic property that adding or removing a member rotates the key so that newcomers cannot read old messages and departed members cannot read new ones. That is precisely the key schedule a group call needs. The combination, MLS for the group secret and SFrame for the media, gives forward secrecy and post-compromise security to live audio and video, not just to text. We cover why that membership math is hard in our discussion of post-compromise security.
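One place the two protocols visibly meet is the key identifier. RFC 9605 describes a mapping along the following lines for MLS-keyed calls: the KID packs the sender's index in the group together with the low-order bits of the MLS epoch, so a receiver can tell from the header alone which sender and which epoch's key to use. The sketch below is our reading of that mapping, with a hypothetical 4-bit epoch field and illustrative function names. It does not derive any keys, which in a real system come from the MLS key schedule.

```python
def mls_kid(epoch: int, sender_index: int, epoch_bits: int) -> int:
    """Pack the sender index above the low-order epoch bits."""
    return (sender_index << epoch_bits) | (epoch % (1 << epoch_bits))


def split_kid(kid: int, epoch_bits: int) -> tuple:
    """Recover (sender_index, low-order epoch bits) from a KID."""
    return kid >> epoch_bits, kid & ((1 << epoch_bits) - 1)


if __name__ == "__main__":
    EPOCH_BITS = 4  # assumed; the width is an application choice

    # A receiver that only holds keys for epoch 18 (placeholder key bytes).
    known_keys = {mls_kid(18, 3, EPOCH_BITS): b"key-for-sender-3-epoch-18"}

    # Someone leaves, MLS advances to epoch 19, and sender 3 starts using a new KID.
    next_kid = mls_kid(19, 3, EPOCH_BITS)
    print(split_kid(next_kid, EPOCH_BITS), next_kid in known_keys)
```

A participant removed at the epoch change never learns the new epoch's secret, so frames tagged with the new KID are undecryptable to them even though the SFU keeps forwarding.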

What SFrame Still Leaks

No honest description of a protocol is complete without its limits. SFrame protects the content of the media. It does not hide everything.

Metadata remains visible to the SFU. The server still sees who is connected, who is speaking, frame sizes, and timing. Frame sizes alone can leak information through traffic analysis, for example distinguishing speech from silence or a static slide from motion.
Endpoint compromise still wins. If an attacker controls a participant's device, end-to-end encryption ends at that device by definition. SFrame protects the wire and the server, not a malicious or backdoored client.
Key management is only as strong as its implementation. SFrame's guarantees rest entirely on the keys being distributed correctly. A weak or shortcut key exchange undermines the whole thing, no matter how sound the frame encryption is.
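The frame-size leak in the first item is easy to underestimate, so here is a small sketch of why it works. Authenticated encryption adds a roughly constant number of bytes per frame, which means ciphertext length tracks plaintext length. The sizes, the 19-byte overhead, and the threshold below are all invented for illustration. A real observer would need to calibrate against the actual codec.

```python
ASSUMED_OVERHEAD = 3 + 16  # hypothetical header + tag bytes added to every frame


def classify_audio_frames(observed_sizes, threshold_bytes=40):
    """Guess speech vs. silence from encrypted frame sizes alone, with no key."""
    labels = []
    for size in observed_sizes:
        plaintext_len = size - ASSUMED_OVERHEAD  # length survives encryption
        labels.append("speech" if plaintext_len > threshold_bytes else "silence")
    return labels


if __name__ == "__main__":
    # Synthetic sizes, as a variable-bitrate audio codec might plausibly produce.
    observed = [29, 31, 110, 125, 98, 30]
    print(classify_audio_frames(observed))
```

Nothing in this function touches a key or a ciphertext byte. Padding frames to fixed sizes can blunt this kind of analysis, at the cost of bandwidth.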

Layer	Who can read the media
Plain SRTP, group call via SFU	Participants and the SFU
SFrame over SRTP	Participants only; SFU sees ciphertext
SFrame plus MLS group keys	Participants only, with keys that rotate on join and leave

Why This Matters Beyond Video

SFrame is a small, focused standard, and that focus is its strength. It does one job, encrypt a media frame with an authenticated cipher and a unique nonce, and it leaves key management to a protocol built for that. This is the same engineering philosophy that produces durable cryptographic systems generally: narrow, composable pieces with clearly stated boundaries, rather than a single monolith that claims to do everything and audits to nothing.

At Haven we use MLS for group state in our chat, the same protocol that pairs naturally with SFrame for media. We are deliberate about what each layer protects and what it does not, and we would rather tell you exactly where the server can and cannot see than wave the phrase "encrypted calls" and let you assume the strongest reading. The next time a product promises encrypted group video, the question worth asking is simple: when there are three people on the call, can the server in the middle still decode the frames? With SFrame done right, the answer is no.