The history of digital commerce has been written in characters: the browser, the search bar, the keyboard, every step of the journey from intent to purchase typed out by hand. That assumption is now under pressure. The global voice commerce market reached $49.6 billion in 2024 and is projected to grow to $147.9 billion by 2030 at a compound annual rate of 20%, and roughly half of all digital searches are now expected to be voice-based by the end of the year. Underneath those numbers sits a harder fact. Two billion workers, more than 60% of the world's adult labor force, earn their living entirely inside the informal economy, where typing on a screen is often a barrier rather than a tool. The question facing product leaders building for the next billion users has shifted. It is no longer whether voice will become a primary interface for commerce, but who gets included when it does.
Varanjot Kaur, a Senior Member of the IEEE, has spent more than fifteen years building products at the intersection of monetization, business messaging, and generative AI. She currently leads product management for a global messaging platform that serves billions of users, with her focus on the conversational systems that let small and medium businesses reach customers at scale. Earlier in her tenure she led product strategy for the platform's monetization arm, and before that spent nearly a decade at McKinsey advising Fortune 500 technology firms on growth and go-to-market.
We spoke with Kaur about the architecture of voice-native commerce, the populations it finally lets into the digital economy, and what the industry is still getting wrong about audio as a transactional medium.
Most of the conversation about commerce assumes a literate, sedentary user typing on a screen. Who does that assumption leave out?
A lot of people, and most of them are the ones building actual economies. About two billion workers, more than 60% of the global adult labor force, work informally. In low and middle income countries, the informal sector accounts for roughly 35% of GDP. These are not edge cases. Street vendors, market sellers, delivery riders, beauticians who run their business out of messaging chats, mechanics taking orders during repairs. They are the dominant economic actors in entire regions, and the standard digital commerce stack quietly excludes them.
The exclusion is a literacy problem, a mobility problem, and a UI problem at once. If your day keeps your hands or your eyes busy, typing into a form is friction you can't afford. If you read at a level below the seventh grade because your education stopped early, every additional menu and confirmation screen is a failure point. There is a name for what gets lost in that gap. People sometimes call it dark GDP: the economic activity that exists, generates real income, and never shows up in formal numbers because the people creating it cannot get into the digital channels that would count it. Voice changes that math.
Conversational commerce isn't new. We had chatbots a decade ago. Why did that wave fall short?
Conversational Commerce 1.0 was text. That was the entire problem. The first wave of chatbots assumed that if you replaced a form with a text interface, you had solved accessibility. You hadn't. You had moved the same literacy and dexterity requirements into a different visual frame. A user who couldn't navigate a checkout flow on a website couldn't navigate a scripted text bot either. The bot was just a smaller box with the same prerequisites.
The second failure of that wave was that it treated conversation as a search problem. Ask a question, get an answer, end. Real commerce conversations don't work that way. They are multi-turn, ambient, sometimes interrupted for hours, and they carry context the platform has to remember. Conversational Commerce 2.0 isn't an incremental upgrade. It is a different paradigm where audio becomes the core medium, where the system holds state across long gaps, and where the burden of structure shifts from the user to the platform. Most of the value the industry will create over the next five years sits in that shift.
Voice-native is more than speech-to-text. What does the underlying stack actually have to do?
Three things, and each of them is harder than it sounds. First, language coverage that matches how people actually talk, not how textbooks describe a language. In most of the world, people code-switch. They mix English with Hindi, Spanish with Quechua, Mandarin with regional dialects in the same sentence. A voice system that works in lab conditions and falls apart on real speech is useless to the populations that need it most. Startups like Navana Tech in India are now building SDKs that let users navigate banking apps in their local tongue, and that work matters disproportionately to commerce because the same speech patterns show up in any business conversation.
Second, persistent context across turns. A business conversation is not a single command. Someone records a voice note about a price, sends another about delivery, sends a third confirming the order. The system has to hold the thread.
Third, ambient noise robustness. The market vendor isn't sitting in a quiet office. There are 8.4 billion active voice assistants in the world right now, and most of them perform well only under conditions that don't match where the next billion users actually live. The technical bar in 2026 has moved fast. We are seeing latency targets around 250 milliseconds for end-to-end voice interactions, which is roughly the threshold below which conversation feels like conversation rather than a delayed exchange. Hitting that target inside a noisy market in a tier-three city is the real engineering work.
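To make the 250 millisecond target concrete, one way to reason about it is as a budget across pipeline stages. The component numbers below are purely illustrative assumptions, not measurements from any real system, and real voice stacks overlap these stages through streaming rather than running them serially:

```python
# Illustrative end-to-end latency budget for one voice turn,
# against the ~250 ms "feels like conversation" threshold.
budget_ms = {
    "capture_and_endpointing": 40,    # detect that the user stopped speaking
    "speech_recognition": 80,         # streaming ASR on the final audio chunk
    "understanding_and_response": 70, # intent handling + response generation
    "speech_synthesis": 40,           # first audio chunk of the reply
    "network_round_trips": 20,        # mobile network overhead
}
total = sum(budget_ms.values())
assert total <= 250, f"over budget: {total} ms"
```

Framed this way, the engineering difficulty Kaur points to is visible: noise-robust recognition in a crowded market eats into a budget that leaves only tens of milliseconds per stage.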
Where is voice-native already showing up at scale, and what do those use cases actually look like?
It splits cleanly into the consumer side and the business side, and both are moving fast. On the consumer side, the dominant pattern is frictionless, hands-busy interaction: household staples reordered by voice while someone is cooking, packages tracked and rerouted in real time while driving. Amazon's Rufus, the generative shopping assistant, is a clear signal of where the category is going. People can ask layered, contextual product questions and get a curated answer instead of scrolling through hundreds of reviews. That is not a chatbot upgrade. It is a fundamentally different interaction model.
On the business side, the use cases get more interesting because the economics are sharper. Logistics platforms like Aramex are using voice APIs to let delivery drivers update statuses and reroute hands-free, which is a direct safety and productivity win. AI voice agents in customer support are now deflecting up to 90% of routine WISMO queries, the "where is my order" volume that has historically eaten frontline support capacity. And in markets like Brazil and India, voice notes are functioning as de facto contracts between buyers and sellers. The prosody, the pauses, the emphasis on a price or a delivery date, all of that carries a kind of trust that a printed receipt cannot. The text-first stack cannot capture any of it.
What's the hardest problem inside this category? Where does it nearly break?
Distinguishing organic communication from commercial activity. When platforms have to behave differently for businesses than for individual users, they need a way to identify which is which without forcing everyone to self-identify. People don't. Most small businesses use the same channels their customers use, and any classifier built on top has to operate at scale, with extremely low false positive rates, because the cost of misidentifying an actual person as a business is high: that person loses access to the channel they use to talk to their family.
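One way to see why the false positive rate, rather than overall accuracy, is the binding constraint is to pick the decision threshold directly from it. The sketch below is an illustrative toy, with invented scores and an invented 10% cap, not any platform's actual method: given held-out "business-likeness" scores for accounts known to be individuals, it finds the lowest threshold that keeps misflagged individuals under the cap.

```python
def threshold_for_fpr(person_scores: list[float], max_fpr: float) -> float:
    """Lowest threshold at which at most max_fpr of known individual
    accounts would score above it (and so be wrongly flagged)."""
    ranked = sorted(person_scores, reverse=True)
    allowed = int(max_fpr * len(ranked))  # misflags we can tolerate
    if allowed >= len(ranked):
        return min(ranked)  # cap never binds
    # Place the cutoff just above the first score we refuse to flag.
    return ranked[allowed] + 1e-9

# Hypothetical held-out scores for accounts known to be individuals.
person_scores = [0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.71, 0.93]
t = threshold_for_fpr(person_scores, max_fpr=0.10)  # tolerate 1 in 10
flagged = sum(s > t for s in person_scores)         # → exactly 1 misflag
```

Notice the asymmetry Kaur describes: tightening the cap pushes the threshold up and lets more real spam through, which is exactly the mutation pressure that makes thresholds legible to adversaries over time.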
The other failure mode that breaks systems is what happens when bad actors get sophisticated. Spam doesn't stay where you classified it. It mutates. The moment thresholds become legible, even implicitly through observed behavior, the spam evolves. Gartner is forecasting that more than 40% of agentic AI projects will be canceled or abandoned by the end of 2027, and a lot of the failures are not going to be technical. They are going to be governance and adversarial behavior failures. The product has to keep working when someone is actively trying to break it.
You're a Joseph Wharton Fellow, and your training spans business strategy as much as it does product. How does that show up in how you think about voice commerce?
Honestly, the most useful thing the Joseph Wharton Fellowship gave me was the license to treat market structure as part of product design. Most product training is about features and user experience, which is fine but incomplete. If a category like voice commerce is going to lower the barrier to digital trade for hundreds of millions of micro-businesses, you are not designing a feature. You are designing a market. You are deciding who gets to participate, what counts as fair pricing across very different buying powers, and what the long-term incentives look like for the ecosystem you are building. That kind of question doesn't fit cleanly into a product backlog.
The other thing the fellowship signaled, mostly to me, was that thinking and building don't have to be different jobs. There is a tendency in tech to assume strategists don't ship and shippers don't strategize. I have never found that to be true in the work that matters. The most interesting voice work happening right now is coming out of strategic questions, not UI questions. People are asking which segments of users the existing text product is structurally failing, and the answer reorganizes the engineering work behind it.
Before this, you co-authored a McKinsey Quarterly piece on resilience in downturns. Does that analytical work translate to how you read the voice commerce category now?
Directly. The resilience study was a long exercise in separating signal from noise across decades of company performance, more than fifty financial metrics, twenty years of data. The work was about not getting fooled by short-term patterns that don't actually generalize. That muscle is exactly the muscle you need reading a category like voice commerce, because the temptation in early-stage markets is to over-index on the loudest demo or the cleanest pilot. Most patterns don't survive contact with a billion users.
What the McKinsey work taught me, and I think this is what I bring to industry calls now, is to ask what the failure case looks like before celebrating the success case. The most resilient companies in our research didn't outperform during the downturn because they did something dramatic. They outperformed because they had already done the unsexy work of being capital-efficient and operationally clean before the downturn arrived. Voice products are similar. The platforms that will lead the next decade are the ones doing the unglamorous infrastructure work right now, not the ones with the most impressive demo.
What's still unsolved? What's the question you most want the industry to get right?
Two things. First, voice as a primary commerce surface, not a secondary one. Most platforms still treat voice as an accessibility feature you add to a text product. I think that gets it backwards. Build voice-first, and text becomes the alternative. That is a different product organization, a different research agenda, and a different definition of who counts as the primary user. I don't have a clear answer yet for what that looks like at the scale of platforms serving billions of people across hundreds of languages, but it's the question I'm most interested in, and one I believe the industry will spend the rest of this decade trying to solve.
Second, the problem of identity in commerce conversations. Right now there is no standard way for a small business in Lagos or Mumbai or Recife to prove they are a real business rather than a sophisticated spam operation. Without that, every platform builds its own ad-hoc classifier, which is expensive and inconsistent. There is a real research and product opportunity in cross-platform business identity for the informal economy. Gartner is forecasting that 40% of enterprise applications will include task-specific AI agents by the end of 2026, but most of that conversation is happening in enterprise contexts. The same infrastructure question matters even more for the long tail of micro-businesses who can't afford a custom integration.