I know it's against the rules but I thought this transcript in Google Search was a hoot:
so i heard there is some question about a car wash that most ai agents
get wrong. do you know anything about that? do you do better?
which gets the answer:
Yes, I am familiar with the "Car Wash Test," which has gone viral recently
for highlighting a significant gap in AI reasoning.
The question is: "I want to wash my car and the car wash is 50 meters away. Should I
walk or drive?"
Why Most AIs Get It Wrong
Many AI models, particularly those optimized for fast, helpful, and concise answers,
focus on the "short distance" (50 meters) and "walking vs. driving" as a lifestyle/efficiency
question. They often recommend walking to save gas, avoid engine wear, or get exercise.
However, this answer is wrong because it misses the core objective: you need the car to be
physically present at the car wash to get it cleaned.
Do I Do Better?
Yes, I do. The key is understanding the objective (cleaning the car) rather than just
processing the keywords (distance/transportation).
My answer:
You should drive the car there. Even though it is a very short distance, your car
cannot wash itself. If you walk, the car stays in your driveway, and you cannot wash it.
Why This Matters
The test highlights a key limitation in current AI: the difference between "pattern matching"
(predicting the next likely word based on training data) and "true, grounded reasoning". It
shows that models sometimes lack a "world model" that understands physical realities, such as
the fact that a car must be present at a car wash.
I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.
This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.
I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything htat makes up current frontier models.
It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
Because they don't yet know how to "just stop emitting so much hot air" without also removing their ability to do anything like "thinking" (or whatever you want to call the transcript mode), which is hard because knowing which tokens are hot air is the hard problem itself.
They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
IMO it supports the framing that it's all just a "make document longer" problem, where our human brains are primed for a kind of illusion, where we perceive/infer a mind because, traditionally, that's been the only thing that makes such fitting language.
The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.
Related are "reasoning" models, where there's a stream of "hot air" that's not being shown to the end-user.
I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
I feel like this has gotten much worse since they were introduced. I guess they're optimizing for verbosity in training so they can charge for more tokens. It makes chat interfaces much harder to use IMO.
I tried using a custom instruction in chatGPT to make responses shorter but I found the output was often nonsensical when I did this
Yeah, ChatGPT has gotten so much worse about this since the GPT-5 models came out. If I mention something once, it will repeatedly come back to it every single message after regardless of if the topic changed, and asking it to stop mentioning that specific thing works, except it finds a new obsession. We also get the follow up "if you'd like, I can also..." which is almost always either obvious or useless.
I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
It's similar for me, it generates so much content without me asking. if I just ask for feedback or proofreading smth it just tends to regenerate it in another style. Anything is barely good to go, there's always something it wants to add
Presumably it did an actual search and summarized the results and neither answered "off the cuff" by following gradients to reproduce the text it was trained on nor by following gradients to reproduce the "logic" of reasoning. [1]
Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Gemini:
….
That is a classic “efficiency vs. logic” dilemma.
Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
...
* The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
When should you walk?
…
3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.
A few years ago if you asked an LLM what the date was, it would tell you the date it was trained, weeks-to-months earlier. Now it gives the correct date.
What you've proven is that LLMs leverage web search, which I think we've known about for a while.
> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
Yep, just a little more context and I'm sure all/most of the models would do just fine. And sure, most average+ intelligence adults whose first language is English (probably) don't need this, but they're not the target audience for the instructions :)
"The 'car wash' is a building I need to drive through."
or
"The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."
The reason that those questions are asked, though, is that the answer to the actual question is obvious, so a human will start to wonder if it's some kind of trick.
I think most people would say "drive?" and wonder when the punchline is coming, but (IMO) I don't think they'd start asking for clarification right away.
I agree. If the LLM were truly an intelligence, it would be able to ask about this nonsense question. It would be able to ask "Why is walking even an option? Can you please explain how you imagine that would work? Do you mean hand-washing the car at home, instead?" (etc, etc)
Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
That's a fair point, but if you would see it as a riddle, which I don't really think it is, and you had to answer either or, I'd still assume it's most logical to chose drive isn't it?
I don’t agree that the question as written would qualify as a riddle. If anything, the riddle is what the intention of the asker is. One can always ask stupid questions with an artificially limited set of answering options; that doesn’t mean it makes sense.
the failure pattern is interesting -- 'walk because it's only 50 meters and better for environment' is almost certainly what shows up most in training data for similar prompts. so models are pattern-matching to socially desirable answers rather than the actual spatial logic (you need a car at the destination to wash it). not really a reasoning failure, more a distribution shift: the training signal for 'short distance = walk' is way stronger than edge cases where the destination requires the vehicle.
Would be interesting to see Sonnet (4.6*). It's fair bit smaller than Opus but scores pretty high on common sense, subjectively.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
I tested this with Opus the day 4.6 came out and it failed then, still fails now. There were a lot of jokes I've seen related to some people getting a 'dumber' model, and while there's probably some grain of truth to that I pay for their highest subscription tier so at the very least I can tell you it's not a pay gate issue.
This is a beautiful example of a little prompt engineering going a long way
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
To me the only acceptable answer would be “what do you mean?” or “can you clarify?” if we were to take the question seriously to begin with. People don’t intentionally communicate with riddles and subliminal messages unless they have some hidden agenda.
If you want to argue that, then you could also argue that everything needed to challenge the questions’ motives and its validity is also contained therein.
This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
If you were forced to answer either or, which one would you pick? I think that's where the interesting dynamic comes from. Most humans would pick drive, also seen in the human control, even if it is lower that I thought it'd be
Sure, though then we’re in la la land. What’s a real life example of being forced to answer an absurd question other than riddles, games, etc? No longer a valid question through normal discourse at that point, and if context isn’t provided then I think the expected outcome still is to ask for clarification.
How is that a "subliminal message"? It's just a simple example of common sense, which LLMs fail because they can't reason, not because they are "overthinking". If somebody asks, "What's 2+2?", they might be insulting you, but that doesn't mean the answer is anything other than 4.
It’s common sense to ask a question in riddle format? What’s the goal of the person asking the question? To challenge the other person? In what way? See if they get the obvious? Asking for clarification isn’t valid?
It's common sense to know that you need to have your car with you to wash it. Asking the question is a challenge in the obvious yes. If you asked an AI "what's 2+2" and it said 3, would you argue that the question was a trick question?
No. I would expect it to say 4 given that has an objective answer. For the other, without any context whatsoever, I would prefer the answer of clarifying. I would be okay if the way it asked for clarification came with:
“What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
That human baseline is wild. Either the rapid data test is methodologically flawed or the entire premise of the question is invalid and people are much stupider than even I, a famed misanthrope, think.
Well, it is a trick question. The question itself implies that both options are valid, and that one is superior. So the brain pattern-matches to "short distance, not worth driving." (LLMs appear to be doing the same thing here!)
If you framed it as "hint: trick question", I expect score would improve. Let's find out!
--
EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:
Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)
grok-4.1-non-reasoning (previously scored 0/10)
>Drive.
>Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!
--
EDIT 2: The hint doesn't help Haiku!
>Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
We were surprise ourselfes, but if you walk around and randomly ask people in the street, I think you would be surprised what you would find. Its a trick question.
When this first came up on HN, I had commented that Opus 4.6 told me to drive there when I asked it the first time, but when I switched to "Incognito Mode," it told me to walk there.
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.
It’s interesting that all the humans critiquing this assume the car isn’t at the car to be washed already, but the problem doesn’t say that.
I think if surveyed at least 90% of native English speakers would understand "I want to wash my car" to mean a full size automobile. The next largest group would probably ask a clarifying question, rather than assume a toy car.
Yes, but you're speaking to a computer, not a person. It, of course, runs into the same limitations that every computer system runs into. In this case, it's undefined/inconsistent behavior when inputs are ambiguous.
> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.
Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m parking, paying, and then getting back in the car to go to the wash itself versus instead of walking a little bit further to pay the agent and then moving your car to the car wash.
That's a great point, you actually reminded me of when I used to live in this small city and they had a valet style car wash. It was not unheard of to head there walking with your keys and tell the guy running shop where you parked around the block then come back later to pick it up.
EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever seen a drive-by Car Wash after emigrating
You drive up to the car wash, there's a little terminal with a screen and a card reader. You pick the program, pay for it and drive into the machine. Can't remember the last time I got out of my car when getting it washed.
I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
A perfectly fine, sycophantic response, that doesn't question the premises in any way, would be "That's a great question! While normally walking is better for such a short distance, you'd need to drive in this case, since you need to get the car to the car wash anyway. Do you want me to help with detailed information for other cases where the car is optional?" or some such.
Gemini is the only AI that seems to really push back and somewhat ignores what I say. I also think it's a total dick, and never use it, so maybe the motivation to make them a bit sycophants is justified, from a user engagement perspective.
I know it's against the rules but I thought this transcript in Google Search was a hoot:
which gets the answer:LLMs sure do love to burn tokens. It’s like a high schooler trying to meet the minimum word length on a take home essay.
I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.
This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.
I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.
Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything htat makes up current frontier models.
It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).
https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988
Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.
Because they don't yet know how to "just stop emitting so much hot air" without also removing their ability to do anything like "thinking" (or whatever you want to call the transcript mode), which is hard because knowing which tokens are hot air is the hard problem itself.
They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.
IMO it supports the framing that it's all just a "make document longer" problem, where our human brains are primed for a kind of illusion, where we perceive/infer a mind because, traditionally, that's been the only thing that makes such fitting language.
To an extent. Even though they're clearly improving*, they also definitely look better than they actually are.
* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried
because for API users they get to charge for 3x the tokens for the same requests
The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.
Related are "reasoning" models, where there's a stream of "hot air" that's not being shown to the end-user.
I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.
I feel like this has gotten much worse since they were introduced. I guess they're optimizing for verbosity in training so they can charge for more tokens. It makes chat interfaces much harder to use IMO.
I tried using a custom instruction in chatGPT to make responses shorter but I found the output was often nonsensical when I did this
Yeah, ChatGPT has gotten so much worse about this since the GPT-5 models came out. If I mention something once, it will repeatedly come back to it every single message after regardless of if the topic changed, and asking it to stop mentioning that specific thing works, except it finds a new obsession. We also get the follow up "if you'd like, I can also..." which is almost always either obvious or useless.
I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.
It's similar for me, it generates so much content without me asking. if I just ask for feedback or proofreading smth it just tends to regenerate it in another style. Anything is barely good to go, there's always something it wants to add
well, they probably have quite a lot of text from high schoolers trying to meet the minimum word length on a take home essay in the training data
I wonder to what extent the Google search LLM is getting smarter, or simply more up-to-date on current hot topics.
It seems like the search ai results are generally misunderstood, I also misunderstood them for the first weeks/months.
They are not just an LLM answer, they are an (often cached) LLM summary of web results.
This is why they were often skewed by nonsensical Reddit responses [0].
Depending on the type of input it can lean more toward web summary or LLM answer.
So I imagine that it can just grab the description of the „car wash” test from web results and then get it right because of that.
[0] https://www.bbc.com/news/articles/cd11gzejgz4o
Presumably it did an actual search and summarized the results and neither answered "off the cuff" by following gradients to reproduce the text it was trained on nor by following gradients to reproduce the "logic" of reasoning. [1]
[1] e.g. trained on traces of a reasoning process
It's almost certainly just RAG powered by their crawler.
Proving that RAG still matters.
Gemini was a good laugh as well:
A few years ago if you asked an LLM what the date was, it would tell you the date it was trained, weeks-to-months earlier. Now it gives the correct date.
What you've proven is that LLMs leverage web search, which I think we've known about for a while.
Gemini now "knows the time", I was using it in December and it was still lost about dates/intervals...
Yeah, the chat log they saved had the correct date. What's your point?
> This is a trivial question. There's one correct answer and the reasoning to get there takes one step: the car needs to be at the car wash, so you drive.
I don’t think it’s that easy. An intelligent mind will wonder why the question is being asked, whether they misunderstood the question, or whether the asker misspoke, or some other missing context. So the correct answer is neither “walk” nor “drive”, but “Wat?” or “I’m not sure I understand the question, can you rephrase?”, or “Is the vehicle you would drive the same as the car that you want to wash?”, or “Where is your car currently located?”, and so on.
Yep, just a little more context and I'm sure all/most of the models would do just fine. And sure, most average+ intelligence adults whose first language is English (probably) don't need this, but they're not the target audience for the instructions :)
"The 'car wash' is a building I need to drive through."
or
"The 'car wash' is a bottle of cleaning fluid that I left at the end of my driveway."
https://i5.walmartimages.com/seo/Rain-x-Foaming-Car-Wash-Con...
The reason that those questions are asked, though, is that the answer to the actual question is obvious, so a human will start to wonder if it's some kind of trick.
The answer wasn’t obvious to me, it was more like “parse error”.
I think most people would say "drive?" and wonder when the punchline is coming, but (IMO) I don't think they'd start asking for clarification right away.
I agree. If the LLM were truly an intelligence, it would be able to ask about this nonsense question. It would be able to ask "Why is walking even an option? Can you please explain how you imagine that would work? Do you mean hand-washing the car at home, instead?" (etc, etc)
Real people can ask for clarification when things are ambiguous or confusing. Once something is clarified, they can work that into their understanding of how someone communicates about a given topic. An LLM can't.
That's a fair point, but if you would see it as a riddle, which I don't really think it is, and you had to answer either or, I'd still assume it's most logical to chose drive isn't it?
I don’t agree that the question as written would qualify as a riddle. If anything, the riddle is what the intention of the asker is. One can always ask stupid questions with an artificially limited set of answering options; that doesn’t mean it makes sense.
I don't think it qualifies as a stupid question either, it does make sense
Same energy: https://youtu.be/8ERyTfm1Dxw
the failure pattern is interesting -- 'walk because it's only 50 meters and better for environment' is almost certainly what shows up most in training data for similar prompts. so models are pattern-matching to socially desirable answers rather than the actual spatial logic (you need a car at the destination to wash it). not really a reasoning failure, more a distribution shift: the training signal for 'short distance = walk' is way stronger than edge cases where the destination requires the vehicle.
The human baseline seems flawed.
1. There is no initial screening that would filter out garbage responses. For example, users who just pick the first answer.
2. They don't ask for reasoning/rationale.
My favorite example of this was the Pew Research study: https://www.pewresearch.org/short-reads/2024/03/05/online-op...
They found that ~15% of US adults under 30 claim to have been trained to operate a nuclear submarine.
RE 1, they actually do have a pre-screening screening of the participants in general, you can check how they do it in detail: https://www.rapidata.ai/
Lizardman's Constant is famously 4%. https://en.wikipedia.org/wiki/Slate_Star_Codex#Lizardman's_C...
I agree. I wonder what the human baseline is for ”what is 1 + 1” on Rapidata.
We try a bit harder than that my friend.
Would be interesting to see Sonnet (4.6*). It's fair bit smaller than Opus but scores pretty high on common sense, subjectively.
I'm also curious about Haiku, though I don't expect it to do great.
--
EDIT: Opus 4.6 Extended Reasoning
> Walk it over. 50 meters is barely a minute on foot, and you'll need to be right there at the car anyway to guide it through or dry it off. Drive home after.
Weird since the author says it succeeded for them on 10/10 runs. I'm using it in the app, with memory enabled. Maybe the hidden pre-prompts from the app are messing it up?
I tested Sonnet 4.5 first, which answered incorrectly.. maybe the Claude app's memory system is auto-injecting it into the new context (that's how one of the memory systems works, injects relevant fragments of previous chats invisibly into the prompt).
i.e. maybe Opus got the garbage response auto-injected from the memory feature, and it messed up its reasoning? That's the only thing I can think of...
--
EDIT 2: Disabled memories. Didn't help. But disabling the biographical information too, gives:
>Opus 4.6 Extended Reasoning
>Drive it — the whole point is to get the car there!
--
EDIT 3: Yeah, re-enabling the bio or memories, both make it stupid. Sad! Would be interesting to see if other pre-prompts (e.g. random Wikipedia articles) have an effect on performance. I suspect some types of pre-prompts may actually boost it.
I tested this with Opus the day 4.6 came out and it failed then, still fails now. There were a lot of jokes I've seen related to some people getting a 'dumber' model, and while there's probably some grain of truth to that I pay for their highest subscription tier so at the very least I can tell you it's not a pay gate issue.
You mean Sonnet 4.6? I ran 9 claude models including Haiku, swipe through the gallery in the link to see their responses.
I don't see Sonnet 4.6 in the screenshots. I see the other Claude models though.
Edit: Found Haiku. Alas!
Yea good catch Sonnet 4.6 is not part of the test.
This is a beautiful example of a little prompt engineering going a long way
I asked Gemini and it got it wrong, then on a fresh chat I asked it again but this time asked it to use symbolic reasoning to decide.
And it got it!
The same applies to asking models to solve problems by scripting or writing code. Models won’t use techniques they know about unprompted - even when it’ll result in far better outcomes. Current models don’t realise when these methods are appropriate, you still have to guide them.
Interesting, which Gemini model? And how did you ask for symbolic reasoning, just added it to the prompt?
To me the only acceptable answer would be “what do you mean?” or “can you clarify?” if we were to take the question seriously to begin with. People don’t intentionally communicate with riddles and subliminal messages unless they have some hidden agenda.
Thing is, it's not a riddle or a subliminal message. Everything needed to answer the question is contained therein.
If you want to argue that, then you could also argue that everything needed to challenge the questions’ motives and its validity is also contained therein.
This reminds me of people who answer with “Yes” when presented with options where both can be true but the expected outcome is to pick one. For example, the infamous: “Will you be paying with cash or credit sir?” then the humorous “Yes.”
If you were forced to answer either or, which one would you pick? I think that's where the interesting dynamic comes from. Most humans would pick drive, also seen in the human control, even if it is lower that I thought it'd be
Sure, though then we’re in la la land. What’s a real life example of being forced to answer an absurd question other than riddles, games, etc? No longer a valid question through normal discourse at that point, and if context isn’t provided then I think the expected outcome still is to ask for clarification.
How is that a "subliminal message"? It's just a simple example of common sense, which LLMs fail because they can't reason, not because they are "overthinking". If somebody asks, "What's 2+2?", they might be insulting you, but that doesn't mean the answer is anything other than 4.
It’s common sense to ask a question in riddle format? What’s the goal of the person asking the question? To challenge the other person? In what way? See if they get the obvious? Asking for clarification isn’t valid?
It's common sense to know that you need to have your car with you to wash it. Asking the question is a challenge in the obvious yes. If you asked an AI "what's 2+2" and it said 3, would you argue that the question was a trick question?
No. I would expect it to say 4 given that has an objective answer. For the other, without any context whatsoever, I would prefer the answer of clarifying. I would be okay if the way it asked for clarification came with:
“What do you mean walk or drive? I don’t understand the options given you would need your car at the car wash. Is there something else I should know?”
"What do you mean two plus two? I don't understand the question given that it's basic math. Is there something else I should know?"
I fail to see how these things are one and the same. I get the point you are making, I just don't agree with it.
2+2 is a complete expression, the other is grammatically correct but logically flawed. Where is the logical fallacy in 2+2?
That human baseline is wild. Either the rapid data test is methodologically flawed or the entire premise of the question is invalid and people are much stupider than even I, a famed misanthrope, think.
Well, it is a trick question. The question itself implies that both options are valid, and that one is superior. So the brain pattern-matches to "short distance, not worth driving." (LLMs appear to be doing the same thing here!)
If you framed it as "hint: trick question", I expect score would improve. Let's find out!
--
EDIT: As suspected! Adding "(Hint: trick question)" to the end of the prompt allows small, non-reasoning models to answer correctly. e.g.:
Prompt: I want to wash my car. The car wash is 50 meters away. Should I walk or drive? (Hint: trick question)
grok-4.1-non-reasoning (previously scored 0/10)
>Drive.
>Walking gets you to the car wash just fine—but leaves your dirty car 50 meters behind. Can't wash what isn't there!
--
EDIT 2: The hint doesn't help Haiku!
>Walk! 50 meters is only about a block away—driving would waste more fuel than it's worth for such a short trip. Plus, you're going to get wet washing the car anyway, so you might as well save the gas.
We were surprise ourselfes, but if you walk around and randomly ask people in the street, I think you would be surprised what you would find. Its a trick question.
The test is rigged because they used non thinking models.
These are reasoning / thinking models
Source?
When this first came up on HN, I had commented that Opus 4.6 told me to drive there when I asked it the first time, but when I switched to "Incognito Mode," it told me to walk there.
I just repeated that test and it told me to drive both times, with an identical answer: "Drive. You need the car at the car wash."
I mean the n is only 10, so it could still be different for you
Since the conclusion is that context is important, I expected you’d redo the experiment with context. Just add the sentence “The car I want to wash is here with me.” Or possibly change it to “should I walk or drive the dirty car”.
It’s interesting that all the humans critiquing this assume the car isn’t at the car to be washed already, but the problem doesn’t say that.
Agreed, even for humans, context-free logic is a challenge.
The question does not specify what kind of car it is. Technically speaking, a toy car (Hot wheels or a scaled model) could be walked to a car wash.
Now why anyone would wash a toy car at a car wash is beyond comprehension, but the LLM is not there to judge the user's motives.
I think if surveyed at least 90% of native English speakers would understand "I want to wash my car" to mean a full size automobile. The next largest group would probably ask a clarifying question, rather than assume a toy car.
Yes, but you're speaking to a computer, not a person. It, of course, runs into the same limitations that every computer system runs into. In this case, it's undefined/inconsistent behavior when inputs are ambiguous.
Yes, but part of the value of LLMs is that they are supposed to work by talking to them like a human, not like a computer.
I could already talk to a computer before LLMs, via programming or query languages.
Gemini 2.0 Flash Lite very randomly punches above its weight there.
Also, the summary of the Gemini model says: "Gemini 3 models nailed it, all 2.x failed", but 2.0 Flash Lite succeeded, 10/10 times?
> The funniest part: Perplexity's Sonar and Sonar Pro got the right answer for completely wrong reasons. They cited EPA studies and argued that walking burns calories which requires food production energy, making walking more polluting than driving 50 meters. Right answer, insane reasoning.
I mean, Sam Altman was making the same calorie-based arguments this weekend https://www.cnbc.com/2026/02/23/openai-altman-defends-ai-res...
I feel like I'm losing grasp of what really is insane anymore.
This was a weird one for sure.
Except for a few models, the selected ones were non-reasoning models. Naturally, without reasoning enabled, the reasoning performance will be poor. This is not a surprising result.
I asked GPT-5.2 10x times with thinking enabled and it got it right every time.
Thinking or extended thinking?
Now do a set of queries and try to deduce by statistics which model are you seeing through Rapidata ;)
I'm going to test this on my kids.
Ha please do and report back!
What I find odd about all the discourse on this question is that no one points out that you have to get out of the car to pay a desk agent at least in most cases. Therefore there's a fundamental question of whether it's worth driving 50m parking, paying, and then getting back in the car to go to the wash itself versus instead of walking a little bit further to pay the agent and then moving your car to the car wash.
That's a great point, you actually reminded me of when I used to live in this small city and they had a valet style car wash. It was not unheard of to head there walking with your keys and tell the guy running shop where you parked around the block then come back later to pick it up.
EDIT: I actually think this is very common in some smaller cities and outside of North America. I only ever seen a drive-by Car Wash after emigrating
You pay at the car wash where I live.
Are you referring to one that is more like a drive-thru where you literally pay while you're in line?
You drive up to the car wash, there's a little terminal with a screen and a card reader. You pick the program, pay for it and drive into the machine. Can't remember the last time I got out of my car when getting it washed.
IMO it's not just intelligence.
I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made. They are horrible at telling you that you are solving the wrong problem, and I think this is a consequence of their design.
They are meant to get "upvotes" from the person asking the question, so they don't want to imply you are making a fundamental mistake, even if it leads you into AI induced psychosis.
Or maybe they are just that dumb - fuzzy recall and the eliza effect making them seem smart?
A perfectly fine, sycophantic response, that doesn't question the premises in any way, would be "That's a great question! While normally walking is better for such a short distance, you'd need to drive in this case, since you need to get the car to the car wash anyway. Do you want me to help with detailed information for other cases where the car is optional?" or some such.
Gemini is the only AI that seems to really push back and somewhat ignores what I say. I also think it's a total dick, and never use it, so maybe the motivation to make them a bit sycophants is justified, from a user engagement perspective.
I think there's also an "alignment blinkers" effect. There is an ethical framework bolted on.
EDIT: Though it could simply reflect training data. Maybe Redditors don't drive.