8 MIN READ · Pedro Thomaz

How to build a recommender system without user tracking

A recommender system without user tracking is built on content features and session-local signals, not cross-user profiling. Here is how we ship one — and the accuracy we trade away.

You can build a genuinely useful recommender system without tracking a single user. The trick is to stop trying to learn who someone is across visits, and instead rank a finite catalogue using what the items themselves contain plus the handful of signals the current session hands you for free. You lose the last few points of accuracy that come from cross-user profiling. For most products, winning those points back is not worth a surveillance liability.

What "a recommender system without user tracking" actually means

A recommender without user tracking is a ranking system that produces personalised-feeling suggestions without building a persistent, cross-session profile of an identifiable person. No cookie that follows you, no device fingerprint, no user ID stitched across visits, no behavioural data shipped to a third party. The model never needs to answer "who is this?" — only "given these items and what just happened in this session, what should I show next?".

It's worth being precise, because the industry blurs this on purpose. There are three broad families of recommenders:

Privacy-first recommendation lives almost entirely in the second and third families, glued together with whatever the live session legitimately reveals. That's the whole design space, and it's bigger than people assume.

The short version

1. Recommendation is mostly ranking, and ranking is mostly content

We made this argument at length in our case for ML over AI: most "AI features" are ranked suggestion lists wearing a chatbot costume. The same reframing dissolves the privacy problem. If the job is "order this known set of items", then the richest signal available is almost always the items themselves — not a dossier on the user.

Take Delicious Diamonds, the luxury chocolate house we build for. "You might also like" on a praline page does not need to know your purchase history across the open web. It needs to know that the praline you're looking at is single-origin Madagascar, 72% cocoa, in the gifting price band, with a citrus note — and that three other items in the catalogue share most of those attributes. That's a content-based neighbour lookup over a few hundred SKUs. It runs in a millisecond and it's correct on a first visit, with an empty cookie jar, for a logged-out browser behind a VPN.

Cross-user collaborative filtering is the expensive, invasive way to approximate something content features often give you directly. When your catalogue is small and well-described, you don't need the crowd.

Building content features without surveillance

The work moves from data collection to data description. For each item you store a feature vector built from things you already own:

Item-to-item similarity is then the cosine similarity of those vectors, optionally re-weighted by business rules. None of it references a person. You compute it when content changes, cache it, and serve the precomputed neighbours as a static lookup.
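The neighbour lookup can be sketched in a few lines. This is a minimal illustration in Python, not our production code: the four-number feature layout, the item IDs, and the function names are all hypothetical, and a real feature vector would also carry a text embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_neighbours(catalogue, item_id, n=3):
    """Return the n most similar other items as (item_id, similarity) pairs."""
    target = catalogue[item_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in catalogue.items()
        if other_id != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical features: [cocoa fraction, gifting price band, citrus note, single origin]
CATALOGUE = {
    "madagascar-72": [0.72, 1.0, 1.0, 1.0],
    "ecuador-70": [0.70, 0.0, 0.0, 1.0],
    "milk-hazelnut": [0.35, 0.0, 0.0, 0.0],
    "yuzu-praline": [0.64, 1.0, 1.0, 0.0],
}

# Precompute once when content changes, then serve as a static lookup.
NEIGHBOURS = {
    item_id: top_neighbours(CATALOGUE, item_id, n=2) for item_id in CATALOGUE
}
```

Nothing in the lookup takes a user or session argument, which is the point: the table is identical for every visitor and can be cached as aggressively as the content itself.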

2. Session-local signals: personalisation that forgets on purpose

Pure content similarity is impersonal. The honest middle ground is the session — the one unit of context you can use without becoming a tracker, because it begins and ends with the visit.

Within a single session you legitimately know: the pages visited this visit, items added to a cart or favourited, the search terms typed, the declared interface language, and coarse context like viewport class or time of day. From that you can maintain a session interest vector — a running average of the feature vectors of items the visitor has engaged with this visit — and rank candidates by closeness to it. Browse three smoky high-cocoa bars and the page quietly tilts toward intensity over sweetness, with no idea who you are and no memory of you tomorrow.
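A session interest vector of this kind is just a running mean. The sketch below assumes a hypothetical two-number feature layout (intensity, sweetness) and made-up item IDs; the class and method names are ours for illustration only.

```python
import math


def _cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SessionInterest:
    """Running mean of the feature vectors engaged with during one visit."""

    def __init__(self, dimensions):
        self.mean = [0.0] * dimensions
        self.count = 0

    def observe(self, item_vector):
        # Incremental mean: no list of viewed items is kept, only the average.
        self.count += 1
        self.mean = [
            m + (x - m) / self.count for m, x in zip(self.mean, item_vector)
        ]

    def rank(self, candidates):
        """Order candidate item IDs by closeness to the session vector."""
        return sorted(
            candidates,
            key=lambda item_id: _cosine(self.mean, candidates[item_id]),
            reverse=True,
        )


# Hypothetical features: [intensity, sweetness]
session = SessionInterest(dimensions=2)
for viewed in ([0.9, 0.1], [0.8, 0.2], [1.0, 0.0]):  # three smoky high-cocoa bars
    session.observe(viewed)

ranked = session.rank({"milk-caramel": [0.2, 0.9], "dark-85": [0.95, 0.05]})
```

Because only the mean and a count are held, the state cannot be replayed into a browsing history even if it leaks, and it disappears when the session does.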

The discipline that makes this privacy-first rather than tracking-by-another-name:

This mirrors how we run the site you're reading. As we wrote in "We track no users", our analytics is one anonymous beacon per session with no IP retention beyond seven days. The recommender follows the same rule: the session is the unit, and nothing survives it.

3. On-device and aggregate approaches for the harder cases

Sometimes content plus session genuinely isn't enough — you want the model to learn. There are two ways to get learning without building per-user dossiers on a server.

On-device personalisation

Ship the ranker to the client and let the personal state never leave it. The catalogue's feature vectors are public; the user's preference weights are computed and stored locally (IndexedDB, on-device storage) and applied at render time. The server sends the same neutral candidate set to everyone; the device reorders it. This is exactly how Jofit's ranker can adapt to one person's adherence history — that history is the user's own, on their own device, and our servers never see the trail.
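The reordering step is small enough to sketch. What follows is an assumption-laden illustration in Python of the logic only: in a browser it would run client-side with the weights held in local storage, and the linear scoring, the update rule, and every name here are hypothetical rather than a description of Jofit's ranker.

```python
def update_weights(weights, item_features, liked, learning_rate=0.1):
    """Nudge local preference weights toward (or away from) an item's features."""
    direction = 1.0 if liked else -1.0
    return [
        w + direction * learning_rate * x for w, x in zip(weights, item_features)
    ]


def rerank_on_device(neutral_candidates, weights):
    """Reorder the server's neutral list using weights that never leave the device."""

    def score(entry):
        _, features = entry
        return sum(w * x for w, x in zip(weights, features))

    # sorted() is stable, so with all-zero weights the server order is preserved.
    ordered = sorted(neutral_candidates, key=score, reverse=True)
    return [item_id for item_id, _ in ordered]


# The same neutral candidate set is sent to every visitor.
neutral = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.5, 0.5])]

fresh_device = rerank_on_device(neutral, [0.0, 0.0])
learned = update_weights([0.0, 0.0], [0.0, 1.0], liked=True)
returning_device = rerank_on_device(neutral, learned)
```

A device with no history simply shows the server's order, so a first visit degrades to the neutral ranking rather than to nothing.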

The cost is real and you should name it: you can't compute things that require the population, like true trending or cold-start defaults, from a single device. Those you derive from aggregates.

Aggregate-only signals

You can use the crowd without tracking individuals if you only ever read the crowd in aggregate, and only above a threshold. "This item was viewed by enough distinct sessions this week to count as popular" is a privacy-safe global signal, provided you store counters, not events — increment a tally, never an identity-linked row. Anything below a minimum count (say, fewer than 50 distinct sessions) is suppressed, so no aggregate can ever single out a person. This is the same k-anonymity instinct that makes our beacon analytics safe.
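A thresholded counter store might look like the sketch below. It assumes the client reports each item at most once per session, so a tally approximates distinct sessions; the class name is hypothetical and the threshold of 50 is simply the example figure from the paragraph above.

```python
from collections import Counter

MIN_SESSIONS = 50  # example threshold from the text; tune for your traffic


class PopularityCounters:
    """Stores one integer per item. No session ID, timestamp, or event row is kept."""

    def __init__(self, min_sessions=MIN_SESSIONS):
        self.min_sessions = min_sessions
        self._tallies = Counter()

    def record_view(self, item_id):
        # Assumption: called at most once per item per session (client-side dedupe).
        self._tallies[item_id] += 1

    def popular_items(self):
        """Item IDs at or above the threshold, most viewed first."""
        visible = {
            item_id: count
            for item_id, count in self._tallies.items()
            if count >= self.min_sessions
        }
        return sorted(visible, key=visible.get, reverse=True)


counters = PopularityCounters()
for _ in range(120):
    counters.record_view("madagascar-72")
for _ in range(50):
    counters.record_view("yuzu-praline")
for _ in range(49):
    counters.record_view("limited-edition")  # stays suppressed

popular = counters.popular_items()
```

The suppression happens at read time, so a low-traffic item becomes visible on its own once enough sessions have seen it, without anyone having stored who those sessions were.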

For teams who want to go further, federated learning and differential privacy formalise this: train on-device, send only noised model updates, never raw behaviour. We'd reach for that on a large B2C product. For a studio site or a few-hundred-SKU shop, it's over-engineering — content plus session plus thresholded aggregates already gets you 90% of the felt quality.
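To make "noised model updates" concrete, here is a sketch of the clip-then-add-noise step a device could apply before sending anything. It is illustrative only: the function name and parameters are hypothetical, and the noise scale here is not calibrated to any privacy budget, which a real differential-privacy deployment would need along with proper accounting.

```python
import math
import random


def privatise_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip an update to a maximum L2 norm, then add Gaussian noise per coordinate."""
    rng = rng or random.Random()
    norm = math.sqrt(sum(x * x for x in update))
    # Clipping bounds how much any single device can move the shared model.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale + rng.gauss(0.0, noise_std) for x in update]


# With noise switched off, only the clipping is visible.
clipped_only = privatise_update([3.0, 4.0], clip_norm=1.0, noise_std=0.0)

# With noise on, the server sees a perturbed vector, never the raw one.
noised = privatise_update([3.0, 4.0], clip_norm=1.0, noise_std=0.1, rng=random.Random(7))
```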

4. The accuracy tradeoff, stated honestly

Here is the part most vendors won't print. Dropping cross-user tracking costs you accuracy, and the size of the cost depends entirely on your catalogue.

What you give up:

What you keep, or gain:

The decision rule we use: if the catalogue is small and well-described, content-based wins outright and the accuracy gap is negligible. The bigger and blander the catalogue, and the more your margin depends on squeezing the last percent of conversion, the more collaborative filtering would have bought you — and the more honestly you have to weigh that against the data liability it creates.

5. A reference architecture you can ship this sprint

Concretely, on our stack — PHP 8.3, Cockpit CMS, no build step, server-rendered — a privacy-first recommender is three modest pieces:

  1. Offline feature build. On content publish, compute each item's feature vector (attributes + text embedding) and the top-N content neighbours. Store them as JSON next to the content. No runtime cost, no database join on the hot path.
  2. Session vector. A small client-side accumulator keeps a running mean of the feature vectors of items touched this visit, in sessionStorage. It is sent to the ranking endpoint as a short opaque array — { session_vec: [...], context: { lang, viewport } } — never an event log.
  3. Ranker. A server endpoint blends content similarity, session closeness, and thresholded aggregate popularity into one score, applies hard business rules, and returns the ordered list. Stateless. Logs nothing identity-linked.
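The blending step in the third piece can be sketched as a weighted sum followed by a hard filter. This is written in Python for illustration rather than in the PHP of our stack, and the weights, the field names, and the in-stock rule are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str
    content_similarity: float  # from the precomputed neighbour table
    session_closeness: float   # similarity to the session vector
    popularity: float          # 0.0 when suppressed below the threshold
    in_stock: bool = True      # example of a hard business rule


def rank(candidates, weights=(0.5, 0.4, 0.1)):
    """Blend the three signals into one score and return ordered item IDs."""
    w_content, w_session, w_popularity = weights

    def score(c):
        return (
            w_content * c.content_similarity
            + w_session * c.session_closeness
            + w_popularity * c.popularity
        )

    # Hard rules filter first, so no score can override them.
    eligible = [c for c in candidates if c.in_stock]
    return [c.item_id for c in sorted(eligible, key=score, reverse=True)]


ordered = rank(
    [
        Candidate("a", content_similarity=0.9, session_closeness=0.1, popularity=0.0),
        Candidate("b", content_similarity=0.5, session_closeness=0.9, popularity=1.0),
        Candidate("c", 1.0, 1.0, 1.0, in_stock=False),
    ]
)
```

The function takes only item-level scores and a session vector's similarity, so it has nothing identity-linked to log even if logging were switched on by accident.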

That's it. It fits in a sprint, runs on shared hosting, and there is no part of it you'd be uncomfortable explaining in a privacy notice.

The takeaway

Most products don't need to track users to recommend well — they need to describe their catalogue well and respect the boundary of the session. Reach for collaborative filtering only when your catalogue is too large and too generic for content features to carry the load, and only after you've priced in the surveillance it requires. For everyone else, a content-based, session-local ranker is faster to build, cheaper to run, legal without a consent dance, and honest enough to explain to the person it's serving. The few accuracy points you leave on the table are the cheapest insurance you'll ever buy.