I let 3 LLMs argue over the famous "Car wash: Walk or Drive" AI problem to prove a point.

As we rely on AI more heavily, we've started putting blind trust in it. I've seen people take medication, make career choices and even make relationship decisions after discussing them with AI. So when AI fails basic reasoning questions, things any human understands instantly, it's distressing. What if the LLM you trust more and more each day isn't giving you the best answer, but the laziest one possible?

I used a single LLM for a good chunk of my own work, but later I realized one thing: never put all your trust in one LLM; let them argue with each other instead. I noticed that when I took output from ChatGPT and told Gemini it was generated by ChatGPT, Gemini became more critical of it and answered in more depth. The same goes for other models. They are all lazy until you push them and challenge them to do better.

That's when I realized I was constantly hopping back and forth between tabs to get the most out of LLMs, so I decided to build a debate platform where you can make LLMs argue about anything and get the best output possible. We have seven debate formats; in each, the models argue until the set number of back-and-forth rounds is done or they reach consensus.
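To make the stopping rule concrete, here is a minimal sketch of a debate loop that ends on a round cap or on consensus. It is not the platform's actual code: `run_debate`, `stubborn`, and `follower` are hypothetical names, the debaters are plain stub functions standing in for real LLM calls, and "consensus" is assumed to mean that every debater ends a round with the same stance.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One transcript entry: (debater name, stance, argument text).
Turn = Tuple[str, str, str]
# A debater sees the question and the transcript so far and returns
# (stance, argument). In a real system this would wrap an LLM API call.
Debater = Callable[[str, List[Turn]], Tuple[str, str]]


def run_debate(
    question: str,
    debaters: Dict[str, Debater],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int, List[Turn]]:
    """Run rounds until all stances match or max_rounds is reached.

    Returns (agreed stance or None, rounds played, full transcript).
    """
    transcript: List[Turn] = []
    for round_no in range(1, max_rounds + 1):
        stances: Dict[str, str] = {}
        for name, speak in debaters.items():
            stance, argument = speak(question, transcript)
            transcript.append((name, stance, argument))
            stances[name] = stance
        # Assumed consensus rule: everyone holds the same stance this round.
        if len(set(stances.values())) == 1:
            return next(iter(stances.values())), round_no, transcript
    return None, max_rounds, transcript


def stubborn(stance: str) -> Debater:
    """Stub debater that never changes its mind."""
    def speak(question: str, transcript: List[Turn]) -> Tuple[str, str]:
        return stance, f"I maintain: {stance}"
    return speak


def follower(initial: str) -> Debater:
    """Stub debater that adopts the previous speaker's stance, if any."""
    def speak(question: str, transcript: List[Turn]) -> Tuple[str, str]:
        if transcript:
            return transcript[-1][1], "I agree with the previous speaker"
        return initial, f"My first instinct: {initial}"
    return speak
```

With a `follower("walk")` speaking first and a `stubborn("drive")` speaking second, the two disagree in round 1 and converge on "drive" in round 2; two stubborn debaters with different stances run out the round cap and the function returns `None` as the agreed stance.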

So what happened when I ran the question "The car wash is 100m away should I walk or drive there"? Something very interesting. Gemini opened with the conclusion "You should walk", a funny response but an understandable one after watching other LLMs fail this test. Then Deepseek took over and said: "I must strongly disagree with the conclusion that walking is the better choice here, because the argument commits a fundamental error: it treats the question as a pure transportation optimization problem, ignoring the explicit goal of the trip." Furthermore, it quickly caught that this is a famous LLM riddle and said: "This is not just my opinion. It is the exact finding of the recently viral 'car wash test,' which has been run systematically on over 53 leading AI models". A fine response compared with Gemini's very lazy and funny one.

Next up was GPT, which essentially played it safe: it restated both arguments from Gemini and Deepseek and said it agreed with both, but tilted slightly towards Deepseek's. So Round 1 ends, and we have correctly identified that we have to drive to get the car washed. Had I asked only Gemini, I would have ended up with the wrong conclusion.

Time for Round 2. We shuffle the speaking order this time to keep the arguments fair, so Deepseek went first and said: "First, the claim that the 'car wash test' is 'not credible evidence' and 'acts like a prompt-specific meta-joke' is empirically wrong". Deepseek fired some strong shots here and further consolidated its argument: "Gemini's engine wear argument: yes, cold starts increase emissions, but that is completely secondary. If you walk, the car sits unwashed, and the emissions from the trip are zero but the task is zero." Very amusing to read, and it concluded with: "The correct answer is unequivocally drive."
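Shuffling the speaking order between rounds could look roughly like the sketch below. The function name `speaking_orders`, the fixed seed, and the rule that a new round never opens with the same model as the previous round are all assumptions made for illustration; the post only says that the order is shuffled.

```python
import random
from typing import List


def speaking_orders(names: List[str], rounds: int, seed: int = 0) -> List[List[str]]:
    """Return one speaking order per round.

    Round 1 keeps the given order; each later round is a random shuffle.
    Assumption: reshuffle until the opener differs from the previous
    round's opener, so no model gets the first word twice in a row.
    """
    rng = random.Random(seed)  # seeded so a debate can be replayed
    orders = [list(names)]
    for _ in range(rounds - 1):
        order = list(names)
        rng.shuffle(order)
        while len(order) > 1 and order[0] == orders[-1][0]:
            rng.shuffle(order)
        orders.append(order)
    return orders
```

Calling `speaking_orders(["Gemini", "Deepseek", "GPT"], rounds=2)` yields the original order for round 1 and a reshuffled order for round 2 in which Gemini no longer speaks first.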

Next, GPT folded and agreed with Deepseek's position, with one slight note of disagreement: if you don't want to wash your car, you can walk. My friend Gemini, on the other hand, was stubborn as hell in that last round. Gemini literally said: "GPT, while you correctly identify the need for conditional logic, you are both missing the forest for the tree". And after that came the most amusing argument of all: "If you are 100m away, you should walk to your car, start it, and pull it into the wash. The 'walk' is not an alternative to the 'drive'—it is the necessary first step of the 'drive.'". I laughed out loud reading this.

A very important note: this doesn't mean Deepseek is the best LLM out there. This test, like every other benchmark, tests LLMs on one thing and one thing only. It may well be that Gemini fails on questions 1, 3 and 4 while Deepseek fails on 2, 5 and 6. The point is that you cannot trust a single LLM; you have to use multiple LLMs. I feel very strongly about people who argue over which LLM is the best and then use only that one. This is not a search engine problem where only Google is the best (I know some people disagree); it is a logic problem. Say you have a really important feature to deliver and want to discuss it with engineers. You don't just pick the single best engineer and ship the feature; you get several top engineers and ask for their feedback. So why trust a single LLM? Let them argue with each other to get the best possible response.

Link to the debate: https://debate.tellodb.com/share/walk-or-drive-to-carwash

Sharjeel Abbas, posted to AI Tools on July 5, 2026

1

Man I do this all the time with my gym routine and what to eat. I just ask one app and follow it blindly without thinking twice. Never realized the answers were probably lazy until I read your piece. Gonna start asking different apps and see what they say about each other instead of just picking one and sticking with it.

lenchpes · an hour ago
1. 1
  
  Or let them argue with each other on this using Debate TelloDB!
  
  sharjeelabbas · an hour ago
2

The car wash example is a perfect illustration of why “LLM as oracle” is dangerous and “LLMs as a panel” is way more robust, especially for non-trivial decisions. Beyond debates, have you explored ways to surface where models diverge most (e.g. highlight the exact assumptions they don’t share) so users can inspect that directly?

toanconquers · 4 hours ago
1. 1
  
  There is a chairman at the end who consolidates the responses from all the LLMs, including where the models diverged or converged. And yes, you are right: you should always consult multiple LLMs as a panel to get the most out of them.
  
  sharjeelabbas · 4 hours ago
2

What stood out to me isn't that multiple models debate—it's that you're treating disagreement as useful evidence instead of something to eliminate.

A lot of AI products optimize for producing an answer quickly. In higher-stakes decisions, understanding where capable models disagree can be just as valuable as the final answer itself.

aryan_sinh · 7 hours ago
1. 1
  
  Yes, this is not for someone who wants the answer fast. It's for someone who wants the correct answer.
  
  sharjeelabbas · 6 hours ago
  1. 2
    
    Glad it resonated.
    
    Your reply made me think there's one strategic decision sitting underneath that tradeoff which becomes much more significant as the product grows, but I don't think I can explain the reasoning properly in a thread without oversimplifying it.
    
    If you're interested, what's the best email to reach you on?
    
    aryan_sinh · 5 hours ago
    1. 1
      
      Feel free to reach me at [email protected]
      
      sharjeelabbas · 4 hours ago
2

I never trusted a single LLM; they always hallucinate. This would improve trust. Good product!

TehArooj · 9 hours ago
1. 1
  
  Thanks! 🦾
  
  sharjeelabbas · 5 hours ago
1

The interesting part here isn't just getting a better answer; it's surfacing disagreement before people over-trust the first polished response. A useful next layer would be showing where the models split before the final synthesis, because that's usually where the real judgment call lives. I ran into something similar building DictaFlow: the fastest AI cleanup often sounds confident while quietly changing meaning, so we lean hard toward preserving the original words instead of "improving" them. Same idea here: speed is cheap, trust is the product.

ryanshrott · 2 hours ago
1. 1
  
  Speed is of no use if you can't trust the answer. Good work on DictaFlow as well!
  
  sharjeelabbas · 2 hours ago
1

Nice article

234deals · 2 hours ago
1. 1
  
  Thanks 💪
  
  sharjeelabbas · 2 hours ago
1

Feel free to ask any questions; I would be happy to answer. A like would go a long way!

sharjeelabbas · 10 hours ago