ESSAY
When AI Lets You Build Things You Don't Understand
A case study in vibe-coded misinformation, and what the “good UI, dogwater everything else” problem actually looks like in 2026.
The post
A few days ago someone posted on r/ClaudeCode under the title “I built this in 5 hours with opus 4.7”. The pitch was familiar. Stayed up until 4 a.m., used about 20% of their weekly $200 plan, built a tracker for the hantavirus outbreak. The screenshot showed a dark, professional dashboard. Global map, severity filters, live event feed, a big “168 cases identified, of which 26 confirmed” counter, and a red “RISK 7.4, ELEVATED” badge in the corner.
It looked great. Honestly, it looked like the kind of thing the Johns Hopkins team shipped in early 2020 and got praised for. The post collected upvotes. People said nice things.
Then a commenter with a biotech background showed up and dismantled it.
His comment is worth reading in full, but the core argument fits in one sentence: this tracker is a masterclass in making 6 confirmed cases look like a global pandemic.
He laid out the receipts.
By the time he checked, the tracker showed 278 “cases identified,” 47 confirmed, “RISK 8.4 HIGH,” 13 affected countries, and a pulsing red “LIVE” indicator; the counters had climbed since the original screenshot. The actual WHO, ECDC, and CDC numbers as of that day were 6 to 7 confirmed cases (later revised to 7 to 9), all traceable to a single cruise ship (the MV Hondius), three deaths, and zero confirmed secondary cases outside the ship. Every public health agency assessed risk to the general population as “low” or “very low.”
The dashboard inflated the count by merging the Andes virus cluster, which is the actually worrying cruise-ship outbreak, with routine annual hantavirus cases from completely different viruses on completely different continents. Puumala in Europe. Sin Nombre in the US. Hantaan and Seoul in Asia. China alone reports 10,000 to 15,000 cases of hemorrhagic fever with renal syndrome every year. None of that is related to the ship. None of it spreads person-to-person. On the map, it all bled together into one scary picture.
“Cases identified” was doing enormous work. 278 vs 47 confirmed. What were the other 231? Suspected, monitored, news-mention “signals,” anything that could be scraped and counted.
The “RISK 8.4 HIGH” score had no published methodology. WHO said low. ECDC said very low. CDC said extremely low. The tracker invented its own number, painted it red, and put it where your eyes land first.
And the live feed showed events dated May 11 and May 12 on a May 10 screenshot. Either fabricated or scraped from predictive text. Disqualifying either way for something that calls itself a tracker.
The OP’s first reply was “good chatgpt analysis, bad luck all the source are verified and public,” which got hammered to -62. A few hours later he conceded: “This was a mistake I am fixing it to stop the désinformation.” And then later: “The mistake was in fact exagerating the case of the virus, I changed that, but the informations were and are real and sources are public.”
Then a different commenter, master-mik, who’d been carrying the thread, dropped the line that I think actually matters:
“My concern was never the sources. You can mislead people perfectly well using verified data. […] That’s worse than bad data, because it’s harder to spot.”
And:
“It’s not built on lies. It’s built on real data arranged to create a false picture.”
That’s the thing worth writing about. Not the cruise ship. Not even the tracker. The pattern.
What this actually is
It’s tempting to file this under “AI hallucination,” but that misses the point. The model didn’t invent the hantavirus. The data really is public. The sources really are real. The bug isn’t in any single fact. It’s in everything around the facts. Definitions, framing, color choices, what gets combined, what counts as a “case,” what a risk score even means.
That’s a domain knowledge problem. And it’s the one the AI-building-things era keeps running into.
It usually goes something like this. You have an idea for a niche product, often a real, valid idea. You don’t have working expertise in the niche. The AI doesn’t either, but it has surface familiarity with everything and it will produce extremely confident output anyway. You ship the thing. It looks indistinguishable from a thing built by someone who knows what they’re doing. Because the UI is polished and the prose is clean, the work inherits credibility it didn’t earn. And the errors aren’t in the obvious places. They’re in the load-bearing assumptions nobody is auditing.
That last part is what makes this different from old-school misinformation. Old-school misinformation looked sketchy. The fonts were wrong, the grammar was off, the source was a WordPress blog with a misspelled URL. The new thing looks like Bloomberg.
Vibe coding and the expertise gap
The trend has a name now. Andrej Karpathy coined vibe coding in February 2025 for the practice of describing a project in a prompt and accepting whatever the LLM produces. Going with the vibes, not reading the code. Merriam-Webster added it as a slang entry within weeks. It became the mode of building for a particular cohort of people, especially those who never wrote much code before.
In its proper place, vibe coding is genuinely useful. Prototypes, weekend projects, generating boilerplate for things you already understand, exploring outside your home domain. All good uses. The problem isn’t the technique. The problem is when it leaks into domains where the user has no way to evaluate whether the output is actually correct.
A few things stack up at once.
The model is overconfident where it’s weakest. An October 2025 paper from Microsoft Research found that LLMs exhibit a Dunning-Kruger pattern in coding: they’re most confident exactly when they’re operating in unfamiliar territory, in rare languages and low-resource domains. The less the model knows, the more certain it sounds. This isn’t a character flaw, it’s a calibration failure that falls out of how the systems are trained. But it interacts badly with the next point.
The user is overconfident because the output looks polished. Researchers Steyvers and colleagues showed that verbose, well-structured LLM explanations cause users to over-trust the output even when it’s wrong. Polished prose carries epistemic weight that it shouldn’t. Add a clean Tailwind UI and a serif font and the credibility transfer is enormous.
Security and correctness are downstream of the same gap. Veracode’s 2025 GenAI Code Security Report found that roughly 45% of AI-generated code contains security flaws, and that when models can choose between a secure and an insecure implementation, they pick the insecure one nearly half the time. Kaspersky’s writeup on vibe coding describes a startup founder who bragged that 100% of his platform was written by Cursor with no hand-written code, and who shut it down days later when newbie-level flaws let anyone access paid features.
The hantavirus tracker is just the version of this where the load-bearing wrong assumption is epidemiological instead of cryptographic. You could rebuild the same dashboard with perfect React code, zero security vulnerabilities, and beautifully typed TypeScript, and it would still be misinformation, because the bug isn’t in the code. It’s in the definition of “case,” in the choice of what to count as a “cluster,” in the decision to put endemic Puumala in the same counter as Andes.
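To make that concrete, here’s a minimal sketch, in hypothetical TypeScript with invented names, of what the tracker’s counting logic plausibly looks like. The code is correct by every test a compiler or linter can run. The misinformation lives entirely in the definition it encodes.

```typescript
// Hypothetical sketch of a tracker's counting logic. Types check,
// tests pass, and the headline number is still misleading: an endemic
// Puumala case in Finland and an Andes case from the cruise-ship
// cluster become identical ticks on the same counter.

type HantavirusCase = {
  virus: "Andes" | "Puumala" | "Sin Nombre" | "Hantaan" | "Seoul";
  status: "confirmed" | "suspected" | "signal"; // "signal" = a scraped news mention
  country: string;
};

// The scary headline number: every virus, at every confidence level,
// merged into one integer.
function casesIdentified(cases: HantavirusCase[]): number {
  return cases.length;
}

// The honest number requires a domain decision the model won't
// volunteer: the outbreak cluster is Andes only, confirmed only.
function cruiseShipConfirmed(cases: HantavirusCase[]): number {
  return cases.filter((c) => c.virus === "Andes" && c.status === "confirmed").length;
}
```

No amount of code review catches the bug in `casesIdentified`, because there is no bug. There’s a case definition, and it’s wrong.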
Two gaps, not one
Here’s the part I think gets undersold whenever people write about this kind of failure. Everyone talks about domain knowledge, which matters, obviously. But there’s a second gap that runs alongside it and gets just as much weight in any real product: the engineering gap.
A polished tracker isn’t only failing the epidemiology test. It’s almost certainly failing every basic production-engineering test too. Where does the data come from? How is it refreshed? What happens when a source 404s? What happens when two sources disagree? Is there a confidence interval on each number? Is there an audit log for when the methodology changes? Is there a privacy review for the data being ingested? Is the deploy reproducible? Is there a rollback? Is the database backed up? Is “RISK 8.4” stored anywhere, or is it computed at render time from a function nobody can review?
These aren’t questions a vibe-coded weekend build asks. They’re the questions that separate something that demos well from something that can actually run in production. And the answers to them require engineering experience the same way the case-counting question requires biotech experience. Both gaps are real. Both gaps are masked by a good-looking UI.
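Here’s a minimal sketch, again with hypothetical types and an assumed staleness threshold, of what taking even two of those questions seriously looks like in code: every number carries its source and fetch time, stale data is flagged instead of hidden, and disagreement between sources is shown rather than silently resolved.

```typescript
// Hypothetical sketch: a displayed count that knows where it came from.

type SourcedCount = {
  value: number;
  source: "WHO" | "ECDC" | "CDC";
  fetchedAt: Date;
};

// Assumed policy: refuse to present data older than six hours as "LIVE".
const MAX_STALENESS_MS = 6 * 60 * 60 * 1000;

function reconcile(counts: SourcedCount[]): { display: string; stale: boolean } {
  const now = Date.now();
  const stale = counts.some((c) => now - c.fetchedAt.getTime() > MAX_STALENESS_MS);
  const values = counts.map((c) => c.value);
  const min = Math.min(...values);
  const max = Math.max(...values);
  // When sources disagree, show the range. Picking one silently is a
  // methodology decision, and methodology decisions get published.
  const display = min === max ? String(min) : `${min} to ${max} (sources differ)`;
  return { display, stale };
}
```

None of this is hard to write. It’s hard to know you need it.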
A way I’ve started thinking about it: a product has two surfaces. The visible surface is the one users see, and AI is now genuinely excellent at producing that surface in hours. The invisible surface is everything else, the data pipeline, the deployment, the monitoring, the assumptions, the math, the privacy posture, the failure modes. AI can produce that surface too, but if you don’t already understand what should be on it, you have no way to tell whether what’s there is right.
For a Tetris clone, the invisible surface is small and forgiving. For a tracker that calls itself authoritative, it’s the actual product. And that’s the part that takes years to learn, not five hours.
A useful asymmetry: AI is good when you ask, less good when you don’t
There’s a small detail in how these models behave that I think matters here.
If you stop building for a second and just ask Claude, or any frontier model, things like “what are the production engineering practices I should follow for a public-facing data dashboard?” or “how should I handle source verification for a real-time health tracker?” or “what are the security best practices for an app that scrapes news feeds?”, the answer is going to be genuinely good. The model knows. It’s read every postmortem, every OWASP doc, every Google SRE chapter, every CDC field guide.
It will tell you to publish your methodology. It will tell you not to merge incomparable categories into one counter. It will tell you to rate-limit, to validate inputs, to handle rotation of API keys, to add a footer disclosing data freshness, to never display computed risk scores without sourcing. It will tell you to write integration tests, to log inputs, to back up state. Ask any of those questions and you’ll get a checklist that, if you actually followed it, would catch most of what went wrong in the tracker.
But that’s not how vibe coding works. The model only volunteers that knowledge when you ask. When you just say “build me a hantavirus tracker,” it builds you the dashboard. It doesn’t pause to tell you that real epidemiology trackers publish their case definitions on a methodology page. It doesn’t refuse to put a risk score on the screen until you’ve defined one. It doesn’t insist on a sources panel. It does what you asked, plus some reasonable-sounding defaults, and ships.
That’s the asymmetry. The knowledge is in the model. The discipline of applying that knowledge is not. The model defers to your framing of the task. If you frame the task as “make me a slick dashboard,” you get a slick dashboard. If you frame the task as “what should a slick dashboard for an active outbreak actually be doing, and then build that,” you get something very different.
This is the practical lesson I’d want people to take from the thread. You don’t need to be an expert to ship something serious with AI. But you do need to know which questions to ask the AI before you ask it to write code. The hardest part of using these tools well is not the prompting, it’s the prompting order. Audit first, scope second, build third.
The OP didn’t do that. He couldn’t have, really, because he didn’t know enough about either the domain or the engineering to know what to audit for. Which is the core thing master-mik was pointing at the whole time.
And here’s the part that makes this genuinely hard, not just for vibe coders but for anyone using these tools in unfamiliar territory: nobody can ask a question they don’t know needs to be asked. The OP couldn’t ask Claude “should I separate Andes from Puumala in my counter” because he didn’t know those were different viruses. He couldn’t ask “what’s the methodology page convention for real outbreak trackers” because he didn’t know one existed. The same goes for the engineering side. He couldn’t ask “what’s the right caching strategy for a public dashboard during an active news cycle” because he didn’t know caching was a question. The model is sitting there with all the answers, but the user has to know enough about the shape of the problem to even reach for them. The unknown-unknowns gap is the one AI is least equipped to close, because asking for help requires already knowing roughly where help is needed.
Why public health is a particularly bad place for this
Misinformation researchers have been worrying about exactly this scenario since COVID. The WHO famously called the pandemic an “infodemic.” Generative AI didn’t cause it, but it has supercharged the supply side. NewsGuard’s AI tracking project has identified more than 3,000 AI-generated content farm sites in 16 languages as of early 2026. The Royal Society and Sage have both published recent work documenting how LLMs make fabricated experts, fake citations, and convincing-looking statistical claims trivially easy to produce.
A health dashboard sits at the worst end of this spectrum, for a few reasons.
Numbers carry authority. “168 cases” feels like a fact. It looks like the kind of thing a CDC press officer would say. Most readers don’t have the time or training to ask what “case” means or who’s counting.
Maps are persuasive. A glowing dot on a country implies presence, transmission, threat. A reader can’t tell from a dot whether it represents a confirmed cluster, a single repatriated patient, or just a news mention.
The format reads as official. Pulse animations, “LIVE” badges, ticker feeds, severity color codes. This is the visual grammar Johns Hopkins established in 2020. Any dashboard that uses it inherits the trust without inheriting any of the discipline that earned the trust.
And there are real consequences downstream. Hantavirus dashboards that overstate spread don’t just live in browser tabs. They get screenshotted, shared on WhatsApp, quoted by smaller news outlets, picked up by people who are already anxious. The cost of a panic cascade, cancelled flights, hospital crowding, demand for tests that aren’t necessary, is paid by people who never saw the dashboard.
The “we expect more cases” line from the WHO Director-General refers to expected ship-linked cases as repatriation proceeds. On the tracker, it could very easily get rendered as escalating global spread.
What the OP did right (and where it went wrong)
I don’t want this to read as a takedown of one Reddit user. There are things in the original post that are reasonable. He was excited about a tool, he shipped quickly, he mentioned the site was being updated continuously, and to his credit, when challenged with specifics, he eventually conceded and said he was fixing it.
But look at the sequence:
- Build a tracker for an outbreak that’s actively unfolding.
- Push it live within hours.
- First defense when challenged: “good chatgpt analysis, bad luck all the source are verified and public.” (Sources real, therefore product valid.)
- Second defense: “This was a mistake I am fixing it.” (Implicit admission that the first defense was wrong.)
- Third position: the sources are still real and public, the only problem was case exaggeration.
The reason master-mik’s reply landed, “I am sorry but trust is gone 100%. No matter what fix will come,” is that the iteration order is wrong for this kind of product. For a Tetris clone, you ship and patch. For a public health dashboard during an active outbreak, you can’t. The damage is done at the screenshot, not at the codebase.
None of the questions the dashboard implicitly answered, what counts as a case, how do you combine cases from different viruses, what does this risk score mean, should an annual endemic Puumala case in Finland appear next to an Andes evacuee in Switzerland, can be answered by Claude in a five-hour session. Those are domain questions. They take expertise, or close partnership with someone who has it.
And the questions the dashboard didn’t answer, where does the data come from, how often is it refreshed, what happens when a source disagrees with another, who reviews methodology changes, are engineering questions. They take experience, or a strong template, or the discipline to ask the model what production looks like before asking it to ship.
The “good UI, dogwater data” failure mode
A second commenter, blackrack, put it more bluntly than I would:
“This is pretty much every vibecoded project or AI-generated anything these days, just pure slop. This is exactly what slop means, yeah the UI looks nice etc but the data/functionality/idea/sources are dogwater.”
The “slop” framing is harsh but it points at something real. There’s a whole category of AI-built product where every visible surface, typography, layout, animation, copy, is competent, and every invisible surface, definitions, source selection, edge cases, deploy hygiene, what’s actually being counted, is wrong. The thing scans well and breaks under any pressure.
I don’t think this is the fault of AI. I think it’s a predictable result of using AI without a feedback loop from the domain or from production. When you write Tetris with vibes, the game tells you when it’s wrong. Pieces don’t stack, scoring doesn’t update, you can see it. When you write an epidemiological dashboard with vibes, nothing tells you it’s wrong. The dashboard renders. The map shows dots. The counter ticks up. There’s no error surface unless someone with domain knowledge happens to look, or someone with engineering experience asks where the data actually comes from.
What “using AI well” actually looks like
I want to end with something master-mik himself said, because it’s the line nobody quotes enough from this thread. When someone accused him of using AI to write his takedown, he replied:
“Did I use AI to polish my comment? Yes. Did I wrote the initial comment myself? Yes. Did I actually verify what I was writing? Yes. Do I actually know what I am talking about? Yes.”
“It is not about AI it is about who used it and how.”
That’s the whole article in five lines. The good use of AI in his comment and the bad use of AI in the tracker are not different technologies. They’re different relationships between the human and the tool.
In his case, the human had a thesis, verified it against five public health sources, structured the argument himself, then used the AI as a final polish layer. The expertise is upstream of the AI. The AI just smoothed the text.
In the tracker’s case, the AI is upstream of everything. The human had an idea and let the model decide what a “case” was, how to combine sources, what to call a risk score, what colors imply severity, how the data should be fetched, where it should live, who’s responsible when it’s wrong. The expertise, both domain and engineering, is missing from the loop. So the polish at the end isn’t smoothing anything. It’s plating an empty dish.
This is the distinction that actually matters going forward. Not “AI vs no AI.” Not “vibe coding vs spec coding.” But: is there expertise somewhere in the loop, or has the polish replaced the substance entirely?
For weekend projects, prototypes, personal tools, internal scripts, this doesn’t matter much. For anything that touches public information, health, finance, or law, it matters enormously. And the cost of the polish-without-substance failure mode is paid by people who never had a chance to evaluate the thing themselves. They just saw the map, saw the red badge, and made a decision.
The fix isn’t to stop using AI. The fix is to start using it the way master-mik did. Ask first, build second. Lean on the model for the parts you don’t know, then verify those parts against real sources before you wire them into something other people will trust. The model knows what production looks like, what an honest dashboard looks like, what a risk score should require to be on screen. It just won’t volunteer any of it unless you ask.
That’s not a problem with Claude. It’s not a problem with Opus 4.7. It’s a problem with treating “I shipped it in 5 hours” as the metric. For some products, that’s a flex. For other products, it’s the warning.
One more thing, on this article
I should be transparent about something. I used AI to help write this. I researched the broader topic with it, I asked it to find sources, I had it help structure the argument, and I used it to polish the prose. Most of what you just read passed through a model at some point.
I’m telling you that because the whole point of the article would be undercut if I pretended otherwise. I’m not against using AI. I use it daily. I think it’s one of the most useful tools I’ve ever had access to. But I also don’t want to pretend I’ve figured out the right way to use it. I haven’t. Nobody has yet. I tried to keep myself in the loop here, read the original thread, picked the angle, checked the outbreak numbers against WHO, ECDC and CDC pages, and pushed back where the model’s framing felt off. That’s a process I’m still iterating on, and I don’t doubt parts of this piece have weaknesses I haven’t spotted. If you find one, I’d genuinely like to hear about it.
There’s also a deeper trap I want to name, because I think it sits underneath this whole story.
When you ask AI to “build a good app,” or “write a good article,” or “design a good dashboard,” you’ve already smuggled in a question the model can’t answer for you: what counts as good? What’s a good case definition for a hantavirus tracker? What’s a good caching strategy for a real-time public dashboard? What’s a good way to present uncertainty to a non-technical audience? Every one of those questions has multiple reasonable answers, and the answers are contested even among experts. In epidemiology, there are real arguments about how to count probable vs confirmed cases. In engineering, smart people will fight for hours about REST vs GraphQL, monoliths vs microservices, when to use a queue, where to put the cache, whether to normalize the database. None of these debates have a single correct answer, and the model has read every side. Which means when it has to choose, it picks something plausible and ships it, often without telling you it chose.
If you don’t already have a position of your own, you can’t evaluate whether the model picked the right one for your situation. You won’t even know there was a choice. That, more than anything else, is what real expertise gives you. Not the answers, but the ability to recognise when an answer is being made on your behalf, by someone or something that doesn’t have to live with the consequences.
That’s the part I keep coming back to. Use the tools. Use them aggressively. But somewhere in the chain there has to be a human who can read the output and tell whether it’s defensible, and that requires actually having opinions about what good looks like. The model can do almost everything except be accountable for the result, and that one thing is the only thing that ever really mattered.
References
The Reddit thread that started this
Hantavirus outbreak figures (verified against primary sources)
- World Health Organization, “Hantavirus cluster linked to cruise ship travel, Multi-country,” Disease Outbreak News (DON599).
- European Centre for Disease Prevention and Control, “Andes hantavirus outbreak in cruise ship, 11 May 2026”.
- US Centers for Disease Control and Prevention, “CDC Provides Update on Hantavirus Outbreak Linked to M/V Hondius Cruise Ship,” 8 May 2026.
- UN News, “Passengers leave hantavirus-hit cruise ship in Tenerife as WHO says outbreak ‘not another COVID,’” 11 May 2026.
Vibe coding, origin and definitions
- Andrej Karpathy, original “vibe coding” post on X, 2 February 2025.
- Wikipedia, “Vibe coding” for the broader timeline, Collins Word of the Year selection, and Merriam-Webster entry.
The Dunning-Kruger pattern in LLMs
- Mukul Singh, Somya Chatterjee, Arjun Radhakrishna, Sumit Gulwani (Microsoft), “Do Code Models Suffer from the Dunning-Kruger Effect?” arXiv:2510.05457, October 2025. A plain-English writeup is at Unite.AI.
Verbosity and user over-trust in LLM output
- Mark Steyvers et al., “What large language models know and what people think they know,” Nature Machine Intelligence (2025). Coverage from UC Irvine School of Social Sciences.
AI-generated code security
- Veracode, “2025 GenAI Code Security Report”, with key findings summarised on the Veracode blog.
- Kaspersky, “Security risks of vibe coding and LLM assistants for developers”, which documents the EnrichLead case.
- Snyk, “The Highs and Lows of Vibe Coding” for additional context on EnrichLead and related failures.
AI-supercharged misinformation
- NewsGuard, “Tracking AI-enabled Misinformation: 3,006 AI Content Farm sites (and Counting)”.
- Spearing et al., “Countering AI-Generated Misinformation With Pre-Emptive Source Discreditation and Debunking,” Royal Society Open Science, 12(6):242148, 25 June 2025.
- “Generalization bias in large language model summarization of scientific research,” Royal Society Open Science, 12(4):241776, April 2025 on how prominent chatbots exaggerate study findings.