He asked AI to count carbs 27,000 times. It couldn't give the same answer twice


TLDR

  • A preprint study sent 13 food photos to GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Pro, and Gemini 3.1 Pro 500+ times each; all four models produced dangerously inconsistent carb estimates even at minimum temperature.

Key Takeaways

  • 26,904 total queries using a real iAPS production prompt; Gemini 2.5 Pro's median coefficient of variation (11%) was roughly 4.6x Claude's (2.4%).
  • Gemini 2.5 Pro’s paella estimates ranged from 55g to 484g across 500 queries – a 429g spread, equivalent to 42.9 units of insulin at a 1:10 ICR.
  • Claude was the only model with zero queries landing in the clinically dangerous zone (>2U insulin error); GPT-5.4 hit that threshold on 37% of individual queries.
  • High consistency does not equal accuracy: all four models converged on 28g for a 40g cheese sandwich – a persistent 12g underestimate across 510 Claude queries at 0.3% CV.
  • Model-reported confidence scores are uncalibrated; Claude’s confidence correlates at r=-0.01 with actual accuracy, and its high-confidence estimates are measurably worse than low-confidence ones.
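The two metrics above are easy to reproduce. A minimal sketch of the arithmetic, assuming the sample coefficient of variation (stdev/mean) and the standard carbs-to-insulin conversion; the function names are illustrative, not from the study:

```python
import statistics

def coefficient_of_variation(estimates):
    """CV = sample stdev / mean, expressed as a percent."""
    return statistics.stdev(estimates) / statistics.mean(estimates) * 100

def insulin_error_units(carb_error_g, icr=10):
    """Convert a carb-estimate error (grams) to insulin units at a 1:icr ICR."""
    return carb_error_g / icr

# The paella spread reported in the study: 484g - 55g = 429g
print(insulin_error_units(484 - 55))  # -> 42.9 units at a 1:10 ICR
```

At a typical 1:10 insulin-to-carb ratio, every 10g of carb-estimate spread translates directly into one unit of dosing error, which is why a 429g spread is clinically alarming.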

Hacker News Comment Review

  • Consensus is that photo-based carb estimation is fundamentally limited by information physics – olive oil content, hidden ingredients, and depth are not recoverable from RGB pixels alone, regardless of model.
  • Several commenters noted that the confidence-score failure is unsurprising, and that repeated same-prompt queries with majority-vote aggregation are a practical workaround already known to some practitioners.
  • The iAPS production integration drew sharp reactions: commenters found it alarming that a real open-source AID system is shipping this prompt path for insulin-dosing decisions.
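The majority-vote workaround commenters describe can be sketched simply: re-run the same prompt, bucket the answers, and treat the level of agreement as the confidence signal the model itself cannot provide. This is an illustrative sketch, not code from iAPS or the study; the bucket size and input numbers are assumptions:

```python
from collections import Counter
from statistics import median

def aggregate_estimates(estimates, tolerance_g=5):
    """Aggregate repeated carb estimates (grams) from same-prompt queries.

    Buckets each estimate to the nearest `tolerance_g` grams, takes the
    majority bucket, and reports the agreement rate as a rough
    confidence signal.
    """
    buckets = Counter(round(e / tolerance_g) * tolerance_g for e in estimates)
    winner, votes = buckets.most_common(1)[0]
    return {
        "estimate": winner,                    # majority-vote answer
        "agreement": votes / len(estimates),   # fraction voting for it
        "median": median(estimates),           # robust fallback
    }

# e.g. 10 repeated queries for one photo (illustrative numbers)
print(aggregate_estimates([42, 45, 44, 43, 60, 44, 46, 45, 41, 44]))
```

A low agreement rate flags exactly the high-variance photos (like the paella) where a single-shot estimate would be most dangerous.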

Notable Comments

  • @jaccola: “Inside that sandwich could be drenched with olive oil or it could be hollow cheese” – photons physically cannot resolve caloric density.
  • @a-dub: Multiple same-prompt queries with short-answer requests yield a working confidence signal the model itself cannot provide.
  • @voidUpdate: Flags that the iAPS production prompt is already live in an app people use for real health decisions.
