7 MIN READ · Pedro Thomaz

Machine learning vs LLM recommender systems: when classic ML wins

Machine learning vs an LLM recommender system: for ranked suggestions, classic ML usually wins on cost, latency, explainability and cold-start. Here's how to decide.

For most recommendation problems, a machine learning recommender system — collaborative filtering, content-based scoring, or matrix factorization — beats throwing an LLM at the task on every axis that ships products: cost per request, p99 latency, explainability, and cold-start behaviour. Reach for an LLM only when the output is genuinely novel text the user must read. Everything else is ranking known items, and classic ML ranks better, cheaper, and in a way you can defend.

We build recommenders for wellness products — the kind that surface a meal swap, a sleep cue, an exercise, or a supplement on a screen a user opens every morning. We've shipped both architectures. This is the decision we actually make, term by term, with the numbers that drive it.

Machine learning vs LLM recommender system: the definitions first

Half the confusion in these debates comes from sloppy vocabulary. So, precisely:

Collaborative filtering recommends items by finding users who behaved like you and surfacing what they liked. "People who logged this workout also logged that one." It needs an interaction history and nothing about the items themselves.
Content-based filtering recommends items similar to ones you already engaged with, using item features — a recipe's macros, an article's tags, a supplement's mechanism. It needs item metadata and nothing about other users.
Matrix factorization is the workhorse behind modern collaborative filtering. You have a giant, mostly-empty user-by-item matrix of ratings or engagements; factorization decomposes it into two small dense matrices (user vectors and item vectors) whose product predicts the missing cells. The Netflix Prize (2006 to 2009) popularised it, and it is still excellent.
An LLM recommender uses a large language model — by prompting it with a user profile and asking for suggestions, or by embedding items and queries into a shared vector space for semantic retrieval. The first is generative; the second is really just content-based filtering wearing a transformer.
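
To make the matrix factorization definition concrete, here is a minimal sketch in plain Python. It is not our production code: the toy ratings, the hyperparameters, and the function names `factorize` and `predict` are all assumptions chosen for illustration. It learns one small vector per user and per item by stochastic gradient descent, then predicts a missing cell with a dot product.

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.02, reg=0.01, seed=0):
    """Learn user and item vectors whose dot products approximate known ratings.

    ratings: list of (user_index, item_index, value) for observed cells only.
    All hyperparameters here are illustrative assumptions, not tuned values.
    """
    rng = random.Random(seed)
    users = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            pred = sum(a * b for a, b in zip(users[u], items[i]))
            err = r - pred
            for f in range(k):
                uf, vf = users[u][f], items[i][f]
                # Gradient step on squared error with L2 regularisation
                users[u][f] += lr * (err * vf - reg * uf)
                items[i][f] += lr * (err * uf - reg * vf)
    return users, items

def predict(users, items, u, i):
    """Predicted engagement is the dot product of the two learned vectors."""
    return sum(a * b for a, b in zip(users[u], items[i]))

# Toy data: 3 users x 3 items; the cell (user 2, item 2) is unobserved.
observed = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
    (1, 0, 4.0), (1, 1, 5.0), (1, 2, 1.0),
    (2, 0, 5.0), (2, 1, 4.0),
]
users, items = factorize(observed, n_users=3, n_items=3)
missing = predict(users, items, 2, 2)
print(round(missing, 2))
```

User 2 behaves like user 0 on the items both have rated, so the predicted value for the missing cell should land close to user 0's rating for that item. That is the "users like you" effect, expressed as arithmetic.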

The short version: a recommendation is almost never a text-generation problem. It is a ranking problem over a finite candidate set you already own. Once you see it that way, the LLM stops looking like the obvious tool.

Axis 1 — Cost: the per-request line item

A trained gradient-boosted ranker or a matrix factorization model costs effectively nothing per request. It is a dot product, or a few hundred tree splits, running in the same process that served the page. No network hop, no token meter.

An LLM recommender is a recurring metered cost on every screen load. Take a wellness app with 30,000 daily-active users who each see suggestions five times a day: that's 150,000 inferences a day, 4.5 million a month. Even at a fraction of a cent per call once you count input context, output tokens, and the retries you'll need for malformed JSON, you are now paying monthly for something a one-file model does for free. The classic model's marginal cost is rounding error; the LLM's marginal cost scales linearly with your success. Growth should not be a cost incident.
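
The arithmetic above is easy to rerun with your own numbers. In this sketch the per-call price of 0.002 USD and the 5% retry rate are placeholders we made up to stand in for "a fraction of a cent" and "the retries you'll need"; substitute your provider's real figures.

```python
def monthly_inferences(daily_active_users, views_per_user_per_day, days=30):
    """Total recommendation requests in a month."""
    return daily_active_users * views_per_user_per_day * days

def monthly_llm_cost(inferences, cost_per_call_usd, retry_rate=0.0):
    """Metered cost; retries for malformed output are billed like any other call."""
    return inferences * (1 + retry_rate) * cost_per_call_usd

calls = monthly_inferences(30_000, 5)  # 30,000 users x 5 views x 30 days
# Hypothetical price and retry rate, for illustration only
cost = monthly_llm_cost(calls, cost_per_call_usd=0.002, retry_rate=0.05)
print(calls, round(cost, 2))
```

Under those assumed figures the bill is roughly 9,450 USD a month, and it doubles when your user base does.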

Axis 2 — Latency: the budget you can't claw back

A wellness dashboard wants to render in well under 100 ms. Our in-process rankers return scored, sorted candidates in under a millisecond, so the recommendation is noise in the render budget rather than something that shapes it.

An LLM call is hundreds of milliseconds to several seconds, and the tail is brutal: p50 might be fine while p99 blows your budget entirely. You can stream tokens to hide it, but you can't stream a ranked list the user needs before they decide what to tap. So you cache. Now you're maintaining a cache invalidation strategy for recommendations that change with every logged meal — solving a hard problem you created by picking the wrong tool. The classic model never had the problem.
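
If you want to see the tail problem in your own logs, compute percentiles rather than averages. The sketch below uses the nearest-rank method on made-up latency samples; the 500 ms budget is a hypothetical figure for a suggestion slot, not a number from our systems.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 97 ordinary calls and 3 slow ones
latencies_ms = [300] * 97 + [4000] * 3
budget_ms = 500  # assumed budget, for illustration

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50, p50 <= budget_ms)
print(p99, p99 <= budget_ms)
```

The median passes the budget and the 99th percentile misses it by a wide margin, which is the situation that pushes you toward caching.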

Axis 3 — Explainability: the question every user eventually asks

"Why am I being shown this?" In a regulated or health-adjacent context this isn't a nicety — it's the difference between a defensible product and a liability.

A scoring recommender answers with numbers. Our wellness ranker scores each candidate on a few explicit axes — evidence weight (annotated once by a clinician, RCT > cohort > case report), personal delta (how far the user is from the clinical target for the metric this addresses), and adherence cost (the friction the suggestion adds). Multiply, sort, take the top. Every suggestion decomposes into three numbers we can show, log, and replay.
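
A minimal sketch of that kind of scorer follows. The candidate names and values are invented, and one detail is our assumption for the example: friction enters the product as `1 - adherence_cost`, so that more friction lowers the score. The original formula may combine the axes differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    evidence_weight: float  # annotated once, higher for stronger evidence
    personal_delta: float   # distance from the target, normalised to 0..1
    adherence_cost: float   # friction the suggestion adds, 0..1

def score(c):
    """Multiply the three axes. Assumption: friction counts against the candidate."""
    return c.evidence_weight * c.personal_delta * (1.0 - c.adherence_cost)

def rank(candidates, top_n=3):
    """Score, sort, take the top; keep the breakdown so it can be shown and logged."""
    rows = [
        {
            "name": c.name,
            "score": score(c),
            "evidence_weight": c.evidence_weight,
            "personal_delta": c.personal_delta,
            "adherence_cost": c.adherence_cost,
        }
        for c in candidates
    ]
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows[:top_n]

# Hypothetical candidates for illustration
candidates = [
    Candidate("magnesium_supplement", 1.0, 0.8, 0.5),
    Candidate("earlier_bedtime_cue", 0.6, 0.9, 0.1),
    Candidate("evening_stretch", 0.3, 0.2, 0.1),
]
top = rank(candidates)
for row in top:
    print(row["name"], round(row["score"], 3))
```

Because each row carries its own three inputs, replaying a past suggestion is a matter of re-reading the log, and a unit test can pin the ordering.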

Ask an LLM the same question and you get a fluent paragraph that sounds like a reason but isn't the actual cause of the output — it's a post-hoc rationalisation generated by the same stochastic process. You cannot unit-test it. You cannot diff it across model versions. You cannot put it in front of a clinician and have them sign off on the logic, because there is no logic, only weights you don't own. For the accessible and Class I medical work we do under EU MDR 2017/745, "the model said so" is not an answer.

Axis 4 — Cold-start: the problem everyone underestimates

Cold-start is the recommender's hardest moment: a brand-new user with no history, or a brand-new item nobody has touched. It's where naive collaborative filtering falls apart — with no interactions, there's nothing to factorize.

Here's the nuance that decides the architecture:

New-user cold-start. Content-based scoring handles it gracefully: you score the candidate set against what little you know (onboarding answers, a single logged metric) and improve as data arrives. An LLM can also reason from a sparse profile, which is the one place it has a genuine edge. But a hand-tuned content-based ranker gets you most of that benefit at near-zero marginal cost, and you control exactly what it weighs.
New-item cold-start. This is where embeddings shine and pure collaborative filtering can't: a new recipe with zero engagement still has features (ingredients, macros, tags). Embed it, and content-based similarity places it immediately. You don't need a generative model for this; a compact sentence-transformer of roughly 90 MB produces embeddings that are typically good enough for this kind of similarity retrieval, at a small fraction of the cost of prompting a frontier LLM.

The honest takeaway: the part of cold-start that tempts you toward an LLM is solved better by embeddings plus content features, which is classic ML with a modern feature extractor — not by prompting a chat model on every request.
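
Here is a sketch of new-item placement by content similarity. To keep it runnable without dependencies, the vectors are hand-written feature triples (protein, carbs, vegan tag) standing in for real embeddings; in practice they would come from an embedding model. The item names and values are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def profile_from(engaged_vectors):
    """A user profile as the mean of the vectors of items they engaged with."""
    n = len(engaged_vectors)
    return [sum(column) / n for column in zip(*engaged_vectors)]

def rank_new_items(profile, new_items):
    """Order brand-new items by similarity to the profile; no engagement data needed."""
    return sorted(new_items, key=lambda name: cosine(profile, new_items[name]), reverse=True)

# Hypothetical feature vectors: [protein, carbs, vegan_tag]
engaged = [[0.9, 0.2, 1.0], [0.8, 0.3, 1.0]]
new_items = {
    "butter_pastry": [0.1, 0.9, 0.0],
    "lentil_bowl": [0.85, 0.25, 1.0],
}
ordered = rank_new_items(profile_from(engaged), new_items)
print(ordered)
```

Neither new item has a single interaction, yet the one that resembles the user's history ranks first. Swapping the hand-written vectors for model embeddings changes the feature extractor, not the ranking logic.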

The toolbox, ordered by reach-for-first

Weighted scoring with hand-tuned coefficients. Boring, effective, two weekends to ship. Maximally explainable. Start here.
Gradient-boosted trees (XGBoost / LightGBM) on engagement labels. The best ROI in applied ML over the past decade. Trains on a laptop, handles non-linear feature interactions your hand-tuned weights miss.
Matrix factorization once you have real user-item history. The classic collaborative-filtering engine; still state-of-practice for "users like you" recall.
Embeddings / sentence-transformers for semantic similarity and new-item cold-start. This is your content-based recall layer.
Contextual bandits when you need to explore as well as exploit and learn online from clicks.
An LLM — only when the output is genuinely novel prose the user must read.

A mature recommender is usually candidate generation (collaborative filtering + content-based recall pulls a few hundred candidates) followed by a ranking stage (a gradient-boosted model scores them). Two-stage, both classic ML, both explainable, both fast.
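
The two-stage shape can be sketched in a few lines. This is a simplified illustration, not a reference implementation: candidate generation uses item co-occurrence counts as a stand-in for collaborative filtering, and the ranking stage uses a hypothetical score lookup where a gradient-boosted model would sit. All names and data are invented.

```python
from collections import Counter

def co_occurrence(histories):
    """Count how often each pair of items appears in the same user's history."""
    counts = {}
    for items in histories.values():
        for a in items:
            for b in items:
                if a != b:
                    counts.setdefault(a, Counter())[b] += 1
    return counts

def generate_candidates(user_items, counts, limit=200):
    """Stage 1: pull items that co-occur with what the user already did."""
    votes = Counter()
    for item in user_items:
        for other, n in counts.get(item, {}).items():
            if other not in user_items:
                votes[other] += n
    return [item for item, _ in votes.most_common(limit)]

def rank_stage(candidates, score_fn, top_n=3):
    """Stage 2: score each candidate and keep the best few."""
    return sorted(candidates, key=score_fn, reverse=True)[:top_n]

# Hypothetical interaction histories
histories = {
    "ana": ["yoga", "walk", "stretch"],
    "ben": ["yoga", "run"],
    "cy": ["walk", "stretch", "swim"],
}
# Hypothetical ranker output; a trained model would produce these scores
model_scores = {"walk": 0.7, "stretch": 0.4, "run": 0.9}

candidates = generate_candidates(["yoga"], co_occurrence(histories))
top = rank_stage(candidates, lambda item: model_scores.get(item, 0.0), top_n=2)
print(top)
```

The split matters for cost: the cheap recall stage narrows thousands of items to a few hundred, so the more expensive ranker only ever scores a short list.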

So when does the LLM actually win?

Three legitimate cases, and they share a shape — the LLM produces text, not a decision:

Open-ended generation. "Write me a week of workouts given these constraints." There's no finite candidate set to pre-author, so ranking can't apply.
Natural-language interface. The user asks in prose; the LLM translates it into a query against a deterministic system; the answer comes from that system. The LLM is the translator, not the oracle.
Grounded explanation. The classic ranker picks the suggestion; the LLM writes the sentence explaining why, grounded in the ranker's three numbers. A hallucination now costs you a clumsy sentence, not a clinical recommendation.
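
For the grounded-explanation case, one way to keep the LLM downstream is to hand it only the ranker's numbers and keep a deterministic sentence in reserve. The sketch below builds the prompt and the fallback; it makes no model call, and the wording, thresholds, and function names are assumptions for illustration.

```python
def grounded_prompt(item, evidence_weight, personal_delta, adherence_cost):
    """Build a prompt that exposes only the ranker's three numbers to the model."""
    facts = (
        f"item: {item}\n"
        f"evidence_weight: {evidence_weight:.2f}\n"
        f"personal_delta: {personal_delta:.2f}\n"
        f"adherence_cost: {adherence_cost:.2f}\n"
    )
    return (
        "Explain in one sentence why this suggestion was shown.\n"
        "Use only the facts below and add no medical claims.\n"
        + facts
    )

def fallback_sentence(item, evidence_weight, personal_delta, adherence_cost):
    """Deterministic template to use when the generated sentence is rejected."""
    # Hypothetical thresholds, chosen only to make the example readable
    strength = "strong" if evidence_weight >= 0.8 else "moderate"
    effort = "little" if adherence_cost <= 0.3 else "some"
    return (
        f"We suggested {item} because the evidence is {strength}, "
        f"you are {personal_delta:.0%} away from your target, "
        f"and it takes {effort} effort."
    )

prompt = grounded_prompt("earlier_bedtime_cue", 0.6, 0.9, 0.1)
sentence = fallback_sentence("earlier_bedtime_cue", 0.6, 0.9, 0.1)
print(prompt)
print(sentence)
```

The decision was made before the prompt was built, so whatever the model writes can only change the phrasing.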

Notice the pattern: in every winning case the LLM is downstream of, or orthogonal to, the ranking decision. It is never the recommender itself.

FAQ

When should you use classic ML instead of an LLM for a recommender?

Use classic ML (matrix factorization, gradient-boosted trees, two-tower retrieval) when you have structured interaction data, need sub-50 ms latency, and serve recommendations at scale. These models are cheaper to run per request, deterministic, and easy to evaluate offline. An LLM only earns its place when the input is unstructured text or you need natural-language reasoning over sparse context.

Is an LLM cheaper or more expensive than a classic ML recommender?

An LLM is almost always more expensive per recommendation, often by two or three orders of magnitude. A trained ranking model serves a request in-process on commodity CPU at a marginal cost close to zero, while an LLM call carries token costs, GPU inference, and higher latency. Reserve the LLM for low-volume, high-value paths rather than the hot serving loop.

How do you handle the cold-start problem with each approach?

It depends on which cold-start you mean. For new items, content features and embeddings place an item with zero engagement immediately, which pure collaborative filtering cannot do. For new users, an LLM can reason from a sparse profile, but a content-based ranker scored against onboarding answers gets most of that benefit at far lower cost. A common production pattern is to use a content model (or, on low-volume paths, an LLM) until enough behavioural data accumulates, then hand off to a collaborative-filtering model once signal exists.

Which is more explainable, ML or an LLM, for recommendations?

Classic ML is more auditable: feature importances, SHAP values, and ranking scores give you a defensible reason for every recommendation. An LLM can produce fluent natural-language explanations, but those are post-hoc rationalisations, not the true cause of the ranking — for regulated or health-adjacent contexts we prefer a model whose decisions can be traced and reproduced.

When should you NOT use AI for recommendations at all?

Skip both ML and LLMs when simple heuristics already satisfy the need — popularity sorting, recency, editorial curation, or rule-based filters often beat a model on a small catalogue or low-traffic site. A recommender is only worth the data pipeline, monitoring, and retraining cost once you have enough users and items that hand-tuned rules visibly fail.

The shipping order

If you're holding a spec that says "AI suggestions", do this:

List the candidates. If you can't enumerate them, you don't have a recommendation problem — you have a content problem. Fix that first.
Pick three scoring axes. Hand-tune the weights for a week. Ship it.
Log every suggestion shown and every one acted on. This is your training set; you only get it by shipping.
After four to six weeks of data, train a gradient-boosted model on the logs and retire the hand-tuned weights. Add matrix factorization and embeddings as your history and catalogue grow.
Only then, if you still need novel prose, add an LLM — to explain, not to decide.
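
The logging step is the one teams skip, so here is a sketch of the minimum. It appends one JSON line per event and later joins "shown" to "acted" to produce labelled rows for a ranker. A plain list stands in for a log file, timestamps and sessions are left out for brevity, and the field names are our invention.

```python
import json

def log_event(log, kind, user_id, item, features=None):
    """Append one JSON line; kind is 'shown' or 'acted'."""
    record = {"kind": kind, "user": user_id, "item": item}
    if features is not None:
        record["features"] = features
    log.append(json.dumps(record, sort_keys=True))

def training_rows(log):
    """Label each shown suggestion 1 if the same user acted on it, else 0."""
    events = [json.loads(line) for line in log]
    acted = {(e["user"], e["item"]) for e in events if e["kind"] == "acted"}
    return [
        (e["features"], 1 if (e["user"], e["item"]) in acted else 0)
        for e in events
        if e["kind"] == "shown"
    ]

# Hypothetical events; features are the three scoring axes at the time of display
log = []
log_event(log, "shown", "u1", "earlier_bedtime_cue", [0.6, 0.9, 0.1])
log_event(log, "shown", "u1", "evening_stretch", [0.3, 0.2, 0.1])
log_event(log, "acted", "u1", "earlier_bedtime_cue")

rows = training_rows(log)
print(rows)
```

Logging the features as they were at display time matters: if you recompute them later, the model trains on values the user never actually saw.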

You'll ship faster, spend less, sleep better, and — the part that matters — you'll be able to answer the only question that counts when a user asks why they got that suggestion: here are the three numbers, and here is what they mean. That is the whole case for machine learning over an LLM recommender, in one sentence.

This is the same discipline behind the wellness recommenders we build, the privacy-first analytics we run, and the chocolate-house storefront we shipped for Delicious Diamonds: pick the simplest model that answers the question you can be asked to defend. If you're staring at a recommendation feature and the spec says "let AI handle it", we're happy to argue you out of it.