
I let 3 LLMs argue over the famous AI "Car wash: Walk or Drive" problem to prove a point.

As we rely on AI more heavily, we've started putting blind trust in it. I've seen people take medication, make career choices and even make relationship decisions after discussing them with AI. But when AI fails basic reasoning questions, things any human understands instantly, it's distressing. What if the LLM you trust more and more each day isn't giving you the best answer, but the laziest one possible?

I was using a single LLM for a good chunk of my work myself, but later I realized one thing: never put all your trust in one LLM, let them argue with each other instead. When I took output from ChatGPT and told Gemini that it was generated by ChatGPT, Gemini became more critical of the question and answered it in more depth. The same goes for the other models. They are all lazy until you push them and challenge them to be better.

That is when I noticed I was constantly hopping back and forth between tabs to get the most out of LLMs, so I decided to build a debate platform where you can make LLMs argue about anything and get the best output possible. We have seven debate formats. In all of them the models argue until the set number of back-and-forth rounds is done or they reach consensus.
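For anyone curious what "argue until the rounds run out or they reach consensus" looks like as a loop, here is a minimal sketch. It is not the platform's actual code. The `run_debate` function, the `Debater` type and the two stub debaters are names I made up for illustration, and the stubs stand in for real model API calls. Consensus here simply means every debater gave the same verdict in a round, which is an assumption. The order shuffle after the first round mirrors what the write-up further down describes for Round 2.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: one transcript entry is (name, verdict, argument).
# A debater gets the question plus the transcript so far and returns
# (verdict, argument). Real debaters would wrap model API calls.
Turn = Tuple[str, str, str]
Debater = Callable[[str, List[Turn]], Tuple[str, str]]


def run_debate(
    question: str,
    debaters: Dict[str, Debater],
    max_rounds: int = 3,
    seed: Optional[int] = None,
) -> Tuple[Optional[str], List[Turn]]:
    """Run rounds until all verdicts in a round match or rounds run out."""
    rng = random.Random(seed)
    transcript: List[Turn] = []
    order = list(debaters)

    for round_no in range(max_rounds):
        if round_no > 0:
            rng.shuffle(order)  # change who speaks first after round one
        verdicts: Dict[str, str] = {}
        for name in order:
            verdict, argument = debaters[name](question, transcript)
            transcript.append((name, verdict, argument))
            verdicts[name] = verdict
        if len(set(verdicts.values())) == 1:  # everyone agrees: stop early
            return verdicts[order[0]], transcript

    return None, transcript  # no consensus within the round limit


def stubborn(verdict: str) -> Debater:
    """Stub debater that never changes its verdict."""
    def debater(question: str, transcript: List[Turn]) -> Tuple[str, str]:
        return verdict, f"I say {verdict}."
    return debater


def follower(default: str) -> Debater:
    """Stub debater that adopts the most recent verdict it has heard."""
    def debater(question: str, transcript: List[Turn]) -> Tuple[str, str]:
        if transcript:
            return transcript[-1][1], "I agree with the previous speaker."
        return default, f"I say {default}."
    return debater


if __name__ == "__main__":
    question = "The car wash is 100m away should I walk or drive there"
    panel = {"B": follower("walk"), "A": stubborn("drive")}
    answer, log = run_debate(question, panel, max_rounds=3, seed=0)
    print(answer, len(log))
```

With these stubs, the follower opens with "walk", hears "drive" from the stubborn debater, and switches in the second round, so the loop stops early. Two stubborn debaters with different verdicts would run all rounds and return `None`, which is the case where a human should read the transcript instead of trusting a summary.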

So what happened when I ran the question "The car wash is 100m away should I walk or drive there"? Something very interesting. Gemini opened with the conclusion "You should walk", a funny response, but understandable after watching other LLMs fail this test. Then Deepseek took over and said: "I must strongly disagree with the conclusion that walking is the better choice here, because the argument commits a fundamental error: it treats the question as a pure transportation optimization problem, ignoring the explicit goal of the trip." Furthermore, it quickly caught that this is a famous LLM riddle and said: "This is not just my opinion. It is the exact finding of the recently viral 'car wash test,' which has been run systematically on over 53 leading AI models". A fine response next to a very lazy and funny one from Gemini.

The next turn went to GPT, which essentially played it safe: it restated both arguments from Gemini and Deepseek and said it agreed with both but tilted slightly towards Deepseek's. So round 1 ends, and we have correctly identified that we have to drive to get the car washed, because the car itself has to be at the car wash. Asking only Gemini would have produced the wrong conclusion.

Time for Round 2. We shuffle the AIs this time to keep the arguments fair, so Deepseek went first in this round and said: "First, the claim that the 'car wash test' is 'not credible evidence' and 'acts like a prompt-specific meta-joke' is empirically wrong". Deepseek fired some strong shots here, which further consolidated its argument: "Gemini's engine wear argument: yes, cold starts increase emissions, but that is completely secondary. If you walk, the car sits unwashed, and the emissions from the trip are zero but the task is zero." Very amusing to read, and it concluded with: "The correct answer is unequivocally drive."

Next, GPT folded and agreed with Deepseek's position, with one small caveat: if you don't want to wash your car, then you can walk. My friend Gemini, on the other hand, was stubborn as hell in that last round. Gemini literally said: "GPT, while you correctly identify the need for conditional logic, you are both missing the forest for the tree". And after that came the most amusing argument of all: "If you are 100m away, you should walk to your car, start it, and pull it into the wash. The 'walk' is not an alternative to the 'drive' - it is the necessary first step of the 'drive.'" I laughed out loud reading this.

A very important note here: this doesn't mean Deepseek is the best LLM out there. This test, like every other benchmark in the world, tests LLMs on one and only one thing. It may well be that Gemini fails on questions 1, 3 and 4 while Deepseek fails on 2, 5 and 6. The point is that you cannot trust a single LLM; you have to use multiple LLMs. It bothers me when people argue about which LLM is the best and then use only that one. This is not a search engine problem where Google alone is the best (I know there are people who disagree); it is a logic problem. Let's say you have a really important feature to deliver and you want to discuss it with engineers. You don't just ask the single best engineer and ship the feature; you get several top engineers and collect their feedback. So why trust a single LLM? Let them argue with each other to get the best possible response.

Link to the debate: https://debate.tellodb.com/share/walk-or-drive-to-carwash

Posted to AI Tools on July 5, 2026
  1. 1

    Man I do this all the time with my gym routine and what to eat. I just ask one app and follow it blindly without thinking twice. Never realized the answers were probably lazy until I read your piece. Gonna start asking different apps and see what they say about each other instead of just picking one and sticking with it.

    1. 1

      Or let them argue with each other on this using Debate TelloDB!

  2. 2

    The car wash example is a perfect illustration of why “LLM as oracle” is dangerous and “LLMs as a panel” is way more robust, especially for non-trivial decisions. Beyond debates, have you explored ways to surface where models diverge most (e.g. highlight the exact assumptions they don’t share) so users can inspect that directly?

    1. 1

      There is a chairman at the end who consolidates the responses from all the LLMs, including where the models diverged or converged. And yes, you are right: you should always consult multiple LLMs as a panel to get the most out of them.

  3. 2

    What stood out to me isn't that multiple models debate—it's that you're treating disagreement as useful evidence instead of something to eliminate.

    A lot of AI products optimize for producing an answer quickly. In higher-stakes decisions, understanding where capable models disagree can be just as valuable as the final answer itself.

    1. 1

      Yes, this is not for someone who wants the answer fast. It's for someone who wants the correct answer.

      1. 2

        Glad it resonated.

        Your reply made me think there's one strategic decision sitting underneath that tradeoff which becomes much more significant as the product grows, but I don't think I can explain the reasoning properly in a thread without oversimplifying it.

        If you're interested, what's the best email to reach you on?

  4. 2

    I never trusted a single LLM; they always hallucinate. This would improve the trust. Good product!

  5. 1

    The interesting part here isn't just getting a better answer, it's surfacing disagreement before people over-trust the first polished response. A useful next layer would be showing where the models split before the final synthesis, because that's usually where the real judgment call lives. I ran into something similar building DictaFlow, the fastest AI cleanup often sounds confident while quietly changing meaning, so we lean hard toward preserving the original words instead of "improving" them. Same idea here, speed is cheap, trust is the product.

    1. 1

      Speed is of no use if you can't trust the answer. Good work on DictaFlow as well!

  6. 1

    Feel free to ask any questions, I would be happy to answer. A like would go a long way!
