Private Set Intersection: Finding What Two Parties Share Without Revealing the Rest

You install a messaging app and it offers to find which of your contacts already use it. To answer that, somebody has to compare your address book against the app's user list. The lazy version uploads your entire address book to a server. Private Set Intersection is the cryptographic version: both sides learn exactly the overlap and nothing else.

Private Set Intersection, usually shortened to PSI, is a protocol for two parties who each hold a set of items. They want to compute the items the two sets have in common. The requirement is strict: neither party should learn anything about the other party's items that are not in the intersection. If your contact list has 400 numbers and three of them are app users, you learn those three, and the server learns those three, and nothing about the other 397 leaves your device in usable form.

That sounds like it should be impossible. How do you compare two lists without showing them to each other? The trick is that you never compare the plaintext items. You compare cryptographic transformations of them, chosen so that equal inputs collide and unequal inputs reveal nothing.

Why the Naive Approaches Fail

The first idea most people have is hashing. Both sides hash their items and compare the hashes. Equal items produce equal hashes, so the intersection falls out. The problem is that hashes of low-entropy data are reversible by brute force. Phone numbers live in a space of maybe ten billion possibilities. A server that receives the SHA-256 of a phone number can simply hash every possible number and build a lookup table. Hashing a phone number hides nothing from a motivated adversary.

The second idea is to add a secret salt. But for two parties to get matching hashes, they would need to share the same salt, and once they share it the brute-force attack works again for whoever holds it. A salt that both parties know is not a secret. This is the wall that makes PSI a genuine cryptographic problem rather than a hashing exercise.

The defining property

PSI is not "hash and compare." Hashing identifiers like phone numbers or emails leaks them to brute force because the input space is small. PSI uses keyed operations where neither party ever holds a value that lets them reverse the other party's non-matching items.

The Diffie-Hellman Style Construction

One of the cleanest PSI designs builds on the same hardness assumption as Diffie-Hellman key exchange. The idea is double encryption with two secret keys, where the order of application does not matter.

Suppose the client has a secret exponent a and the server has a secret exponent b. Each item is first mapped onto a point in a group where the discrete logarithm problem is hard. The client raises each of its items to the power a and sends them over. The server raises those to the power b, producing items locked under both keys, and sends them back. Separately, the server raises its own items to the power b, the client raises those to the power a, and now both sets are locked under both a and b.

Because exponentiation commutes, an item that appears in both sets ends up at exactly the same doubly-locked value regardless of which side applied which key first. The client compares the two doubly-locked sets and reads off the matches. For any item the client did not have, it only ever saw a value locked under the server's secret b, which it cannot strip off, so it learns nothing about those.

The Faster Modern Family: OPRFs and OT

The Diffie-Hellman construction is elegant but uses public-key operations on every item, which is slow at scale. Most high-performance PSI today is built from two other primitives:

Oblivious Pseudorandom Functions (OPRF). The server holds a secret key for a pseudorandom function. The client gets the function applied to its inputs without the server learning those inputs, and without the client learning the key. The client ends up with keyed, non-reversible tags it can match against a list of the server's tags.
Oblivious Transfer (OT). A primitive where a receiver picks one of several values a sender holds, the receiver learns only the value it chose, and the sender never learns which one was chosen. Modern OT extension makes millions of these cheap, which is what lets OT-based PSI process large sets quickly.

These constructions can intersect sets of millions of items in seconds, which is why PSI moved from a theoretical curiosity to something deployable in consumer apps over the last decade.

Variants That Matter in Practice

Variant	What each side learns
Plain PSI	The intersection itself, nothing else.
PSI cardinality	Only the size of the intersection, not which items. Useful for analytics like "how many of my contacts use this."
PSI sum / analytics	An aggregate over matched items, such as a total spend, without revealing the matched identities. This underpinned conversion-measurement systems.
Unbalanced PSI	Optimized for the common case where one set is huge (server user base) and the other is small (your contacts), shifting work off the small client.

Where the Threat Model Gets Subtle

PSI protects the inputs during the computation. It does not protect you from a party that simply lies about its set. A malicious server can run the protocol with a set crafted to probe for specific targets. If the server wants to know whether you have one particular number, it can run a PSI where its "set" is just that number, and the result tells it yes or no. PSI guarantees nobody learns your non-matching items; it does not guarantee the other side is asking an honest question.

This is why private contact discovery in real systems pairs PSI-style cryptography with rate limiting, abuse detection, and sometimes trusted hardware enclaves. Some designs deliberately avoid server-side contact discovery altogether because even a perfect PSI still tells the server the size and rate of your queries. The cryptography is one layer; the deployment around it carries the rest of the weight.

Why This Belongs in Your Mental Model

PSI is a good example of a broader truth in privacy engineering: the question is rarely "is it encrypted." It is "who learns what, and under what assumptions." A feature can be cryptographically private in the narrow technical sense while still leaking metadata about how often you use it and how large your contact graph is. When you evaluate a tool that promises private matching, contact discovery, or breach checking, the useful questions are what exactly is computed, what each party walks away knowing, and whether a dishonest party can turn the protocol into a probe.

For builders, PSI and its cousins, like private information retrieval and secure multiparty computation, are the toolkit for computing on data you are not allowed to see. They are slower and more constrained than plaintext, but the gap keeps closing, and the privacy they buy is real.