7 MIN READ · Pedro Thomaz

Machine learning vs LLM recommender systems: when classic ML wins

Machine learning vs an LLM recommender system: for ranked suggestions, classic ML usually wins on cost, latency, explainability and cold-start. Here's how to decide.

For most recommendation problems, a machine learning recommender system — collaborative filtering, content-based scoring, or matrix factorization — beats throwing an LLM at the task on every axis that ships products: cost per request, p99 latency, explainability, and cold-start behaviour. Reach for an LLM only when the output is genuinely novel text the user must read. Everything else is ranking known items, and classic ML ranks better, cheaper, and in a way you can defend.

We build recommenders for wellness products — the kind that surface a meal swap, a sleep cue, an exercise, or a supplement on a screen a user opens every morning. We've shipped both architectures. This is the decision we actually make, term by term, with the numbers that drive it.

Machine learning vs LLM recommender system: the definitions first

Half the confusion in these debates comes from sloppy vocabulary. So, precisely:

The short version: a recommendation is almost never a text-generation problem. It is a ranking problem over a finite candidate set you already own. Once you see it that way, the LLM stops looking like the obvious tool.

Axis 1 — Cost: the per-request line item

A trained gradient-boosted ranker or a matrix factorization model costs effectively nothing per request. It is a dot product, or a few hundred tree splits, running in the same process that served the page. No network hop, no token meter.
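
To make "it is a dot product" concrete, here is a minimal sketch of scoring with a matrix factorization model that has already been trained. The latent vectors, item names, and the `rank_items` helper are made up for illustration; they are not taken from a real model.

```python
# Minimal sketch: per-request scoring with pre-trained latent factors.
# All vectors below are hypothetical placeholders, not real model output.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_items(user_vec, item_vecs, top_k=3):
    """Score every item against the user vector and return the best ids."""
    scored = [(dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items()]
    scored.sort(reverse=True)  # highest score first
    return [item_id for _, item_id in scored[:top_k]]

user_vec = [0.9, 0.1, 0.4]  # hypothetical user factors
item_vecs = {               # hypothetical item factors
    "meal_swap": [0.8, 0.0, 0.3],
    "sleep_cue": [0.1, 0.9, 0.2],
    "exercise": [0.4, 0.2, 0.9],
    "supplement": [0.2, 0.1, 0.1],
}

top = rank_items(user_vec, item_vecs, top_k=2)
```

Everything here runs in the process that serves the page: no network call, no tokens, just arithmetic over vectors that were learned offline.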

An LLM recommender is a recurring metered cost on every screen load. Take a wellness app with 30,000 daily-active users who each see suggestions five times a day: that's 150,000 inferences a day, 4.5 million a month. Even at a fraction of a cent per call once you count input context, output tokens, and the retries you'll need for malformed JSON, you are now paying monthly for something a one-file model does for free. The classic model's marginal cost is rounding error; the LLM's marginal cost scales linearly with your success. Growth should not be a cost incident.
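
The arithmetic above can be sketched in a few lines. The user and view counts come from the example in the text; the per-call price is a placeholder assumption (a fraction of a cent), not a quote from any provider, so substitute your own number.

```python
# Worked version of the volume and cost arithmetic.
DAILY_ACTIVE_USERS = 30_000
VIEWS_PER_USER_PER_DAY = 5
DAYS_PER_MONTH = 30
ASSUMED_USD_PER_CALL = 0.002  # hypothetical: 0.2 cents per call, all-in

def monthly_inferences(dau, views_per_day, days=DAYS_PER_MONTH):
    """Number of recommendation calls per month."""
    return dau * views_per_day * days

def monthly_llm_cost_usd(dau, views_per_day, usd_per_call, days=DAYS_PER_MONTH):
    """Metered cost per month: linear in call volume."""
    return monthly_inferences(dau, views_per_day, days) * usd_per_call

calls = monthly_inferences(DAILY_ACTIVE_USERS, VIEWS_PER_USER_PER_DAY)
cost_now = monthly_llm_cost_usd(
    DAILY_ACTIVE_USERS, VIEWS_PER_USER_PER_DAY, ASSUMED_USD_PER_CALL
)
cost_doubled = monthly_llm_cost_usd(
    2 * DAILY_ACTIVE_USERS, VIEWS_PER_USER_PER_DAY, ASSUMED_USD_PER_CALL
)
```

Under that assumed price, 4.5 million calls come to roughly 9,000 USD a month, and doubling the user base doubles the bill. The in-process model has no equivalent term to multiply.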

Axis 2 — Latency: the budget you can't claw back

A wellness dashboard wants to render in well under 100 ms. Our in-process rankers return scored, sorted candidates in under a millisecond, so the recommendation is not on the critical path; it's noise in the render budget.

An LLM call takes hundreds of milliseconds to several seconds, and the tail is brutal: p50 might be fine while p99 blows your budget entirely. You can stream tokens to hide it, but you can't stream a ranked list the user needs before they decide what to tap. So you cache. Now you're maintaining a cache invalidation strategy for recommendations that change with every logged meal, solving a hard problem you created by picking the wrong tool. The classic model never had the problem.
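
The p50-versus-p99 gap is easy to check on your own traces. The sketch below uses synthetic latency samples and a hypothetical 1,000 ms budget purely for illustration; neither is a measurement from a real system. It relies on `statistics.quantiles` from the Python standard library (3.8+).

```python
import statistics

BUDGET_MS = 1000  # hypothetical latency budget for the suggestion call

# Synthetic samples: 97 ordinary calls plus three slow outliers.
latencies_ms = [250 + (i % 10) * 20 for i in range(97)] + [3500, 4200, 6000]

def percentile(samples, p):
    """Return the p-th percentile (1..99) using 100-quantile cut points."""
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points
    return cuts[p - 1]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)

median_ok = p50 <= BUDGET_MS  # the typical call fits the budget
tail_ok = p99 <= BUDGET_MS    # the slowest 1% does not
```

In this synthetic set the median sits comfortably inside the budget while the 99th percentile is several times over it, which is the shape described above: the average looks fine and the tail is what users remember.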

Axis 3 — Explainability: the question every user eventually asks

"Why am I being shown this?" In a regulated or health-adjacent context this isn't a nicety — it's the difference between a defensible product and a liability.

A scoring recommender answers with numbers. Our wellness ranker scores each candidate on a few explicit axes — evidence weight (annotated once by a clinician, RCT > cohort > case report), personal delta (how far the user is from the clinical target for the metric this addresses), and adherence cost (the friction the suggestion adds). Multiply, sort, take the top. Every suggestion decomposes into three numbers we can show, log, and replay.
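
A minimal sketch of that three-axis score follows. The numeric evidence weights, the 0-to-1 scales, and the candidate values are illustrative assumptions. One further assumption to flag: so that more friction lowers the score, the sketch multiplies by an "ease" factor of `1 - adherence_cost` rather than by the cost itself; the text does not specify how the cost enters the product.

```python
from dataclasses import dataclass

# Hypothetical weights that only preserve the ordering RCT > cohort > case report.
EVIDENCE_WEIGHT = {"rct": 1.0, "cohort": 0.6, "case_report": 0.3}

@dataclass(frozen=True)
class Candidate:
    name: str
    evidence: str          # key into EVIDENCE_WEIGHT
    personal_delta: float  # 0..1, distance from the target for this metric
    adherence_cost: float  # 0..1, friction the suggestion adds

def explain(c):
    """The three numbers behind a suggestion, ready to show, log, or replay."""
    return {
        "evidence_weight": EVIDENCE_WEIGHT[c.evidence],
        "personal_delta": c.personal_delta,
        "ease": 1.0 - c.adherence_cost,  # assumption: friction reduces the score
    }

def score(c):
    parts = explain(c)
    return parts["evidence_weight"] * parts["personal_delta"] * parts["ease"]

def top_suggestions(candidates, k=1):
    return sorted(candidates, key=score, reverse=True)[:k]

candidates = [
    Candidate("earlier_bedtime", "rct", 0.8, 0.3),
    Candidate("evening_walk", "cohort", 0.8, 0.1),
    Candidate("new_supplement", "case_report", 0.8, 0.2),
]
ranked = [c.name for c in top_suggestions(candidates, k=3)]
```

Because `explain` returns exactly the terms that `score` multiplies, the explanation cannot drift away from the decision. That property is what makes the output unit-testable and diffable across versions.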

Ask an LLM the same question and you get a fluent paragraph that sounds like a reason but isn't the actual cause of the output — it's a post-hoc rationalisation generated by the same stochastic process. You cannot unit-test it. You cannot diff it across model versions. You cannot put it in front of a clinician and have them sign off on the logic, because there is no logic, only weights you don't own. For the accessible and Class I medical work we do under EU MDR 2017/745, "the model said so" is not an answer.

Axis 4 — Cold-start: the problem everyone underestimates

Cold-start is the recommender's hardest moment: a brand-new user with no history, or a brand-new item nobody has touched. It's where naive collaborative filtering falls apart — with no interactions, there's nothing to factorize.

Here's the nuance that decides the architecture:

The honest takeaway: the part of cold-start that tempts you toward an LLM is solved better by embeddings plus content features, which is classic ML with a modern feature extractor — not by prompting a chat model on every request.
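
As a sketch of the embeddings-plus-content-features route for a brand-new item: compare the item's content embedding with a profile built from items the user already engaged with. The three-dimensional vectors here are hand-written stand-ins for what a sentence-embedding model would produce, and the item names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def user_profile(liked_vecs):
    """Average the embeddings of items the user has engaged with."""
    n = len(liked_vecs)
    return [sum(col) / n for col in zip(*liked_vecs)]

def score_new_items(profile, new_items):
    """Rank items that have no interaction history by content similarity."""
    scored = [(cosine(profile, vec), name) for name, vec in new_items.items()]
    return sorted(scored, reverse=True)

# Stand-in embeddings (a real pipeline would compute these once, offline).
liked = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
new_items = {
    "wind_down_routine": [0.9, 0.2, 0.0],
    "hiit_session": [0.0, 0.1, 0.9],
}

profile = user_profile(liked)
ranked_new = score_new_items(profile, new_items)
```

The embedding is computed once per item when it enters the catalogue; at request time the work is still a handful of multiplications, with no chat model in the loop.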

The toolbox, ordered by reach-for-first

  1. Weighted scoring with hand-tuned coefficients. Boring, effective, two weekends to ship. Maximally explainable. Start here.
  2. Gradient-boosted trees (XGBoost / LightGBM) on engagement labels. The best ROI in applied ML over the past decade. Trains on a laptop, handles non-linear feature interactions your hand-tuned weights miss.
  3. Matrix factorization once you have real user-item history. The classic collaborative-filtering engine; still state-of-practice for "users like you" recall.
  4. Embeddings / sentence-transformers for semantic similarity and new-item cold-start. This is your content-based recall layer.
  5. Contextual bandits when you need to explore as well as exploit and learn online from clicks.
  6. An LLM — only when the output is genuinely novel prose the user must read.
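
For item 5, here is a deliberately simplified sketch of the explore/exploit loop. It is an epsilon-greedy bandit, which ignores context; a true contextual bandit would condition the choice on user features. The arm names and click rates in the simulation are invented for illustration.

```python
import random

class EpsilonGreedy:
    """Non-contextual epsilon-greedy bandit over a fixed set of arms."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.shows = {arm: 0 for arm in arms}
        self.clicks = {arm: 0 for arm in arms}

    def click_rate(self, arm):
        shown = self.shows[arm]
        return self.clicks[arm] / shown if shown else 0.0

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best arm so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.shows))
        return max(self.shows, key=self.click_rate)

    def update(self, arm, clicked):
        self.shows[arm] += 1
        self.clicks[arm] += int(clicked)

# Simulated traffic with made-up true click rates.
true_rate = {"meal_swap": 0.05, "sleep_cue": 0.20}
bandit = EpsilonGreedy(true_rate, epsilon=0.1, seed=42)
world = random.Random(7)
for _ in range(1000):
    arm = bandit.choose()
    bandit.update(arm, world.random() < true_rate[arm])
```

The state is two counters per arm, so the policy stays inspectable: you can always report how often an arm was shown and how often it was clicked.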

A mature recommender is usually candidate generation (collaborative filtering + content-based recall pulls a few hundred candidates) followed by a ranking stage (a gradient-boosted model scores them). Two-stage, both classic ML, both explainable, both fast.
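
The two-stage shape can be sketched as below. The recall lists and the score table are hypothetical stand-ins: in a real system the first would come from collaborative-filtering and content-based retrieval, and the second from a trained gradient-boosted model.

```python
def generate_candidates(cf_recall, content_recall, limit=300):
    """Stage 1: merge recall sources, de-duplicate, keep first-seen order."""
    seen = set()
    merged = []
    for item in [*cf_recall, *content_recall]:
        if item not in seen:
            seen.add(item)
            merged.append(item)
    return merged[:limit]

def rank(candidates, score_fn, top_k=3):
    """Stage 2: score each candidate and keep the best top_k."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_k]

# Hypothetical recall outputs.
cf_recall = ["meal_swap", "sleep_cue", "exercise"]
content_recall = ["sleep_cue", "breathing_drill"]

# Stand-in for a trained ranker's predictions.
model_scores = {
    "meal_swap": 0.31,
    "sleep_cue": 0.72,
    "exercise": 0.18,
    "breathing_drill": 0.55,
}

pool = generate_candidates(cf_recall, content_recall)
final = rank(pool, lambda item: model_scores.get(item, 0.0), top_k=2)
```

Keeping the stages separate means recall can be cheap and broad while the ranker spends its effort on a few hundred items rather than the whole catalogue.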

So when does the LLM actually win?

Three legitimate cases, and they share a shape — the LLM produces text, not a decision:

Notice the pattern: in every winning case the LLM is downstream of, or orthogonal to, the ranking decision. It is never the recommender.

The shipping order

If you're holding a spec that says "AI suggestions", do this:

  1. List the candidates. If you can't enumerate them, you don't have a recommendation problem — you have a content problem. Fix that first.
  2. Pick three scoring axes. Hand-tune the weights for a week. Ship it.
  3. Log every suggestion shown and every one acted on. This is your training set; you only get it by shipping.
  4. After four to six weeks of data, train a gradient-boosted model on the logs and retire the hand-tuned weights. Add matrix factorization and embeddings as your history and catalogue grow.
  5. Only then, if you still need novel prose, add an LLM — to explain, not to decide.
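
Step 3 is the one teams skip, so here is a minimal sketch of what "log everything shown and everything acted on" can look like, and how it becomes labelled rows for step 4. The `SuggestionLog` class, its field names, and the in-memory storage are assumptions for illustration; a real system would write to durable storage.

```python
from dataclasses import dataclass, field

@dataclass
class SuggestionLog:
    """In-memory log of impressions and actions (illustrative only)."""
    shown: list = field(default_factory=list)  # (suggestion_id, user_id, item_id, features)
    acted: set = field(default_factory=set)    # suggestion_ids the user acted on

    def log_shown(self, suggestion_id, user_id, item_id, features):
        self.shown.append((suggestion_id, user_id, item_id, dict(features)))

    def log_acted(self, suggestion_id):
        self.acted.add(suggestion_id)

    def training_rows(self):
        """One row per impression: the features at serve time plus a 0/1 label."""
        return [
            (features, 1 if suggestion_id in self.acted else 0)
            for suggestion_id, _user, _item, features in self.shown
        ]

log = SuggestionLog()
log.log_shown("s1", "u1", "sleep_cue", {"evidence": 1.0, "delta": 0.8, "ease": 0.7})
log.log_shown("s2", "u1", "meal_swap", {"evidence": 0.6, "delta": 0.4, "ease": 0.9})
log.log_acted("s1")

rows = log.training_rows()
```

Logging the features as they were at serve time matters: the later model should train on what the ranker actually saw, not on values recomputed weeks afterwards.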

You'll ship faster, spend less, and sleep better. And, the part that matters, when a user asks why they got that suggestion you'll be able to give the only answer that counts: here are the three numbers, and here is what they mean. That is the whole case for machine learning over an LLM recommender, in one sentence.

This is the same discipline behind the wellness recommenders we build, the privacy-first analytics we run, and the chocolate-house storefront we shipped for Delicious Diamonds: pick the simplest model whose answers you can defend when asked. If you're staring at a recommendation feature and the spec says "let AI handle it", we're happy to argue you out of it.