A case for not using AI — and using ML instead
Every product meeting in 2026 ends with the same line: "and then GPT could suggest…". We've been on the other side of that sentence enough times to write down what we actually do instead — and why, for the kind of products we build, it almost always wins.
The LLM tax nobody itemises
An LLM call is not a feature. It's a recurring cost, a latency budget, a regulatory exposure, and an evaluation problem — all rolled into one API key. On a daily-active wellness screen with 30 000 users, even a small model becomes the line item your CFO learns to spell. And that's before you count the streaming infrastructure, the prompt caching, the jailbreak monitoring, and the "as an AI language model" embarrassments your support inbox forwards back to you.
None of which is an argument against LLMs in general. It's an argument against using them for ranked suggestion lists — which is what most "AI features" actually are once you peel the wrapper off.
What a "suggestion" usually needs
Take a step back. A suggestion in a product is almost always answering the same three questions:
- Which candidates exist? The candidate set is finite and yours — recipes, exercises, articles, supplements, sounds, products. You wrote them.
- How relevant is each one for this user, right now?
- What's the cost of being wrong?
Notice what isn't on the list: generating new text. Most of the time you don't need novel prose — you need to rank known things well.
The boring model that beats the chatbot
For Jofit — our fitness and wellness app — we ship a small ranker that scores every candidate intervention (a supplement, a meal swap, a sleep cue, a training tweak) on three axes:
- Evidence weight. Each intervention is annotated with its source paper and a study-quality score (RCT > cohort > case report). The annotation is one-time work done by a clinician, not by a model.
- Personal delta. How far the user is from the clinical target for the metric this intervention addresses. A vitamin D level of 10 ng/mL scores a higher delta than one of 28 ng/mL.
- Adherence cost. A small habit-history model estimates how much friction the suggestion adds: time, money, social cost, whether the user has rejected similar suggestions before.
Multiply, sort, take the top. Anything in the URGENT clinical tier bypasses the ranker entirely. The whole thing runs in under a millisecond and fits in a single file.
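Here is a minimal sketch of that multiply-sort-take-the-top pipeline. It is not Jofit's production code. The numeric evidence weights, the 0 to 1 scale for adherence cost, the `Candidate` fields, and every function name are assumptions made for illustration. Two further assumptions deserve a flag: adherence cost is inverted into an "ease" factor so that more friction lowers the score, and personal delta only handles metrics where the user sits below the target.

```python
from dataclasses import dataclass

# Hypothetical weights. The article only fixes the ordering RCT > cohort > case report.
EVIDENCE_WEIGHT = {"rct": 1.0, "cohort": 0.6, "case_report": 0.3}


@dataclass(frozen=True)
class Candidate:
    name: str
    study_type: str        # key into EVIDENCE_WEIGHT
    current: float         # user's current value for the target metric
    target: float          # clinical target for that metric
    adherence_cost: float  # assumed scale: 0.0 (effortless) to 1.0 (very high friction)
    urgent: bool = False   # member of the URGENT clinical tier


def personal_delta(current: float, target: float) -> float:
    """Relative shortfall from the target, clamped to [0, 1]."""
    if target <= 0:
        return 0.0
    return max(0.0, min(1.0, (target - current) / target))


def score(c: Candidate) -> float:
    # Assumption: cost becomes an "ease" factor so that friction reduces the score.
    ease = 1.0 - c.adherence_cost
    return EVIDENCE_WEIGHT[c.study_type] * personal_delta(c.current, c.target) * ease


def suggest(candidates: list[Candidate], k: int = 3) -> list[Candidate]:
    """URGENT items skip scoring and come first; the rest are ranked and cut to k."""
    urgent = [c for c in candidates if c.urgent]
    ranked = sorted((c for c in candidates if not c.urgent), key=score, reverse=True)
    return urgent + ranked[:k]
```

Because `score` is a pure function of the candidate's fields, each factor can be logged next to the suggestion and shown to the user on request.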
Why a chatbot would do this worse
An LLM could generate the same recommendation. But:
- You can't reliably make it cite its sources. The links are hallucinated 4–10% of the time even on the best models.
- You can't unit-test it. The output shifts between model versions.
- You can't explain a score. "Why did you suggest this?" returns a paragraph, not a number.
- You can't run an A/B test cleanly. Variance from prompt drift swamps the variance from your change.
The ranker has none of those problems. Every score is a function of inputs we control. We can replay last week's data against this week's model and quantify the lift. We can ask "what would happen if we doubled the weight on adherence cost?" and answer in an afternoon.
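A replay of that kind can be sketched in a few lines. The sketch assumes a hypothetical log format (one entry per session: the candidates' three factors plus the name of the item the user acted on) and treats each weight as an exponent, since in a multiplicative score a plain coefficient would not change the ordering. `Features`, `Weights`, and `hit_rate_at_k` are illustrative names, not an existing API.

```python
from typing import NamedTuple


class Features(NamedTuple):
    evidence: float  # 0..1
    delta: float     # 0..1
    ease: float      # 0..1, assumed to be 1 - adherence cost


class Weights(NamedTuple):
    evidence: float = 1.0
    delta: float = 1.0
    ease: float = 1.0


def score(f: Features, w: Weights) -> float:
    # In a product of factors, a weight acts as an exponent on its axis.
    return (f.evidence ** w.evidence) * (f.delta ** w.delta) * (f.ease ** w.ease)


def hit_rate_at_k(sessions, w: Weights, k: int = 1) -> float:
    """Share of logged sessions whose acted-on item lands in the top k under weights w."""
    hits = 0
    for candidates, acted_on in sessions:
        ranked = sorted(candidates, key=lambda name: score(candidates[name], w), reverse=True)
        hits += acted_on in ranked[:k]
    return hits / len(sessions)


# Two made-up sessions: {candidate name: Features}, acted-on name.
sessions = [
    ({"a": Features(1.0, 0.6, 0.5), "b": Features(0.6, 0.5, 0.9)}, "b"),
    ({"c": Features(1.0, 0.8, 0.9), "d": Features(0.3, 0.2, 0.9)}, "c"),
]
baseline = hit_rate_at_k(sessions, Weights())
doubled = hit_rate_at_k(sessions, Weights(ease=2.0))
```

One caveat on reading the result: users could only act on what the old ranker showed them, so a replay like this tends to flatter the logged policy. Treat the number as a filter for which changes deserve a live A/B test.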
The toolbox, in the order we reach for it
- Weighted scoring with hand-tuned coefficients. Boring. Effective. Two weekends to ship.
- Gradient-boosted trees (XGBoost / LightGBM) on engagement labels. The single best ROI in applied ML over the past decade. Trains on a laptop.
- Matrix factorisation for collaborative recommendations once you have a user-item history. The same family of models that dominated the Netflix Prize, which ended in 2009. Still excellent.
- Contextual bandits when you want to explore as well as exploit. Vowpal Wabbit will run them on the edge.
- A small transformer or embedding model for semantic similarity. Sentence-transformers ships models of roughly 90 MB that run on a CPU and are often good enough for in-product retrieval, with no API call.
- LLM — only when the output is genuinely novel text the user needs to read, and you have a budget line for it.
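To make one entry in that list concrete, here is a toy epsilon-greedy contextual bandit in plain Python. It is a deliberately simplified stand-in that keeps one running average per (context, arm) pair. It is not how Vowpal Wabbit implements contextual bandits and does not use its API. The class name, the context strings, and the reward convention (1.0 for "acted on", 0.0 otherwise) are assumptions.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    """Tracks a running mean reward for every (context, arm) pair."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def choose(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore: pick any arm
        # Exploit: pick the arm with the best observed mean in this context.
        return max(self.arms, key=lambda arm: self.means[(context, arm)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental mean, so no reward history needs to be stored.
        self.means[key] += (reward - self.means[key]) / self.counts[key]
```

A typical loop calls `choose` when the screen renders and `update` when the user acts on the suggestion or dismisses it. With `epsilon=0.1`, roughly one suggestion in ten is spent on exploration.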
When to actually reach for an LLM
Three legitimate cases:
- Open-ended generation. "Write me a workout plan for next week given these constraints." There's no candidate set you could pre-author.
- Natural-language interface to a structured system. The user asks in prose; you translate to a query; the answer comes from a deterministic system. The LLM is the translator, not the oracle.
- Long-form explanation. The ranker picked the suggestion. The LLM writes the paragraph that explains why, grounded in the ranker's inputs. Now hallucinations cost you a sentence, not a clinical recommendation.
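The "translator, not oracle" pattern can be sketched as a validation step between the model and the deterministic system. The snippet assumes the LLM has been asked to return a small JSON object. The schema (`category`, `max_minutes`), the allowed categories, and the catalogue shape are all hypothetical, and the model call itself is left out.

```python
import json

# Hypothetical schema: the only fields and values the deterministic side accepts.
ALLOWED_FIELDS = {"category", "max_minutes"}
ALLOWED_CATEGORIES = {"sleep", "training", "nutrition"}


def parse_query(llm_output: str) -> dict:
    """Validate the translator's output; raise ValueError on anything unexpected."""
    try:
        query = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise ValueError("translator did not return valid JSON") from exc
    if not isinstance(query, dict) or not query:
        raise ValueError("expected a non-empty JSON object")
    unknown = set(query) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    category = query.get("category")
    if "category" in query and not (isinstance(category, str) and category in ALLOWED_CATEGORIES):
        raise ValueError(f"unknown category: {category!r}")
    minutes = query.get("max_minutes")
    if "max_minutes" in query and not (type(minutes) is int and minutes > 0):
        raise ValueError(f"max_minutes must be a positive integer, got {minutes!r}")
    return query


def run_query(query: dict, catalogue: list[dict]) -> list[dict]:
    """Deterministic lookup: results come from the catalogue, never from the model."""
    results = []
    for item in catalogue:
        if "category" in query and item["category"] != query["category"]:
            continue
        if "max_minutes" in query and item["minutes"] > query["max_minutes"]:
            continue
        results.append(item)
    return results
```

If `parse_query` raises, the product can fall back to a default list or ask the user to rephrase. Either way, nothing the model wrote reaches the user as an answer.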
FAQ
Why use a small ML ranker instead of an LLM for a suggestion feature?
Because a suggestion list answers a finite, rankable question (which of your own candidates is most relevant for this user right now) rather than needing new text generated. An LLM does this worse: it can't reliably cite sources (links are hallucinated 4–10% of the time), can't be unit-tested, can't explain a score, and can't be A/B tested cleanly. An LLM call is also a recurring cost, a latency budget, and a regulatory exposure.
How does the Jofit ranker score each candidate intervention?
It scores each candidate on three axes — evidence weight (an annotated source paper plus a study-quality score where RCT beats cohort beats case report, rated once by a clinician), personal delta (how far the user is from their clinical target), and adherence cost (a habit-history model estimating friction). The three are multiplied, sorted, and the top results are taken.
What happens with urgent clinical cases?
Anything in the URGENT clinical tier bypasses the ranker entirely, so it is not subject to the normal scoring and sorting.
Is this an argument against LLMs in general?
No — the case is specifically against using LLMs for ranked suggestion lists, which is what most "AI features" actually are. The point is that you don't need to generate new text; you need to rank known things well.
Which ML techniques should we reach for first?
Start with weighted scoring using hand-tuned coefficients, then gradient-boosted trees such as XGBoost or LightGBM on engagement labels, and matrix factorisation once you have user-item history. The Jofit ranker runs in under a millisecond and fits in one file.
The shipping order
If you're staring at a feature spec that says "AI suggestions", do this:
- List the candidates. If you can't, you don't have a recommendation problem — you have a content problem. Fix that first.
- Pick three scoring axes. Hand-tune the weights for a week.
- Ship it. Log every suggestion shown and every one acted on.
- After 4–6 weeks of data, train a gradient-boosted model on the logs. Replace the hand-tuned weights.
- Only then, if you still need novel prose, add an LLM — to explain, not to decide.
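The logging step is the one that makes the later model possible, so it is worth sketching. The snippet assumes a JSON Lines log with one record per event and a unique `impression_id` per suggestion shown. The field names and the two event kinds are hypothetical, and a real log would also carry a timestamp and a user identifier.

```python
import json
from typing import Optional


def event_line(kind: str, impression_id: str, features: Optional[list] = None) -> str:
    """Serialise one event as a JSON line. kind is 'shown' or 'acted'."""
    record = {"kind": kind, "impression_id": impression_id}
    if features is not None:
        record["features"] = features  # the scoring inputs at the time of display
    return json.dumps(record)


def training_rows(lines):
    """Join 'shown' and 'acted' events into feature rows X and 0/1 labels y."""
    shown = {}
    acted = set()
    for line in lines:
        event = json.loads(line)
        if event["kind"] == "shown":
            shown[event["impression_id"]] = event["features"]
        elif event["kind"] == "acted":
            acted.add(event["impression_id"])
    X = list(shown.values())
    y = [1 if impression_id in acted else 0 for impression_id in shown]
    return X, y
```

`X` and `y` are in the shape that scikit-learn style estimators expect, so with LightGBM installed a first model could be as short as `LGBMClassifier().fit(X, y)`. That training call is deliberately left out of the sketch.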
You'll ship faster, spend less, sleep better, and — most importantly — be able to answer the one question that matters when a user asks why they got that suggestion: here are the three numbers, and here is what they mean.