
ACE-Step: Building A Fast, Open Foundation Model For Scalable Music Generation

In music, every new tool competes with decades of craft. Composers and producers want AI that can keep up with their sessions, respect licensing, and respond in the same fluid way a trusted collaborator would. Generative models can now sketch melodies, shape timbres, and suggest arrangements, but most still behave like opaque engines that produce sound without showing how or why a result came to be. For many creators, the real promise of generative audio is not replacing their ear, but lowering the barrier to professional-grade results so a laptop session can yield mixes that once required expensive studio time.

That gap between raw capability and trustworthy instruments is the problem Wenxiao Zhao works on. As CTO and co-founder of TimeDomain, Inc., an IEEE Senior Member, and a leading voice in generative audio technology, he leads the team behind ACE Studio, an AI-native music workstation supported by multiple models, with ACE-Step powering its music-generation tools. His operating principle is direct: design models so that artists stay in control, then scale performance and openness around that standard.

From Research Prototypes To Composer-Ready Models

That focus on controllable instruments matters because foundation models are quickly moving from lab demos to the center of real music workflows. The global market for generative AI in music was valued at about $569.7 million in 2024 and is projected to reach roughly $2.795 billion by 2030, as more artists, labels, and platforms adopt AI for composition, arrangement, and sound design. Under that growth curve, the systems that win will be the ones that turn research models into predictable, session-ready instruments that can adapt to a wide range of tasks, from vocal lines and backing tracks to remix stems, without forcing creators to change how they work.

Zhao led the integration of ACE-Step into ACE Studio’s multi-model architecture, positioning it as a fast, open foundation model that supports music generation workflows alongside dedicated singing and instrument models. Starting in 2020, he directed foundational research on neural singing synthesis, then expanded the architecture to combine diffusion models, autoencoders, and lightweight transformers that can generate minutes of coherent audio while staying responsive enough for interactive use. Over successive releases from 2020 through 2025, that work took ACE Studio from an early prototype to a globally deployed workstation.
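
To make that combination concrete, here is a minimal, hypothetical sketch of how such a stack can fit together: an autoencoder compresses audio into a compact latent sequence, a lightweight transformer iteratively denoises those latents, and a decoder reconstructs the waveform. The class, layer sizes, and denoising rule below are illustrative placeholders, not ACE-Step's published architecture.

```python
import torch
import torch.nn as nn

class LatentMusicSketch(nn.Module):
    """Toy latent-diffusion music generator: encode audio to latents,
    denoise latents with a small transformer, decode back to a waveform.
    All dimensions are illustrative, not ACE-Step's real configuration."""

    def __init__(self, latent_dim: int = 64, model_dim: int = 256, steps: int = 20):
        super().__init__()
        self.latent_dim, self.steps = latent_dim, steps
        # Autoencoder: compress mono audio roughly 512x along time.
        self.encoder = nn.Conv1d(1, latent_dim, kernel_size=1024, stride=512)
        self.decoder = nn.ConvTranspose1d(latent_dim, 1, kernel_size=1024, stride=512)
        # Lightweight transformer acting as the diffusion denoiser.
        layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.denoiser = nn.TransformerEncoder(layer, num_layers=4)
        self.proj_in = nn.Linear(latent_dim, model_dim)
        self.proj_out = nn.Linear(model_dim, latent_dim)

    @torch.no_grad()
    def generate(self, frames: int) -> torch.Tensor:
        # Start from pure noise in latent space and iteratively denoise.
        z = torch.randn(1, frames, self.latent_dim)
        for _ in range(self.steps):  # few steps keep generation interactive
            update = self.proj_out(self.denoiser(self.proj_in(z)))
            z = z - 0.1 * update  # toy update rule; real samplers differ
        # Decode (batch, latent, frames) latents to a mono waveform.
        return self.decoder(z.transpose(1, 2))

audio = LatentMusicSketch().generate(frames=200)  # roughly 200 * 512 samples
```

Working in a compressed latent space is what makes the interactivity claim plausible: the denoiser touches a few hundred latent frames per pass instead of millions of raw samples.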

Within that system, ACE-Step plays a focused role inside a broader model stack, contributing to how ACE Studio reasons about musical structure and progression. That influence is most directly felt in features such as singing-to-accompaniment, while other generation tools rely on models purpose-built for their own tasks. Together, those capabilities help creators move from sketch to arrangement faster without replacing their taste or authorship. The result is a system that behaves less like a black box and more like a familiar instrument, helping producers at different experience levels reach professional-grade outcomes inside a single session.

“When we build a foundation model for music, I expect it to behave like a trusted instrument, not a research experiment. If it can keep sessions stable while adapting to new tasks, composers stop thinking about the model and focus on their ideas. That is the bar ACE Studio is designed to clear every day,” states Zhao.

Latency Lessons From Games: Making Generative Audio Feel Instant

Once the backbone is stable, the next challenge is making it feel instant in interactive workflows. The global mobile gaming market was valued at about $146.33 billion in 2024 and is projected to reach roughly $336.57 billion by 2029, reflecting how billions of players have grown used to high-fidelity, always-available experiences on handheld devices. Industry analysis also notes that around 75% of gamers report frustration with slow or unreliable games, effectively setting a strict latency budget for any interactive system they touch. Those expectations now spill into creative software too, so a generative audio workstation that makes producers wait feels outdated next to games and real-time tools that respond in milliseconds.

Zhao learned many of those latency lessons earlier in his career as a core developer on Tencent Billiards, a 3D mobile billiards game built by a small ten-person team inside Tencent's Photon Studio Group. He designed a custom C++ physics engine integrated with Unity, implemented deterministic simulation and state reconciliation for real-time multiplayer, and introduced a weak-network reconnection mechanism that reduced online match dropouts by more than 28% while keeping play at a steady 30 frames per second. When the WeChat Mini Game version launched, the title reached a peak of 5.2 million daily active users and generated more than 10 million yuan in monthly revenue, ultimately attracting over 50 million cumulative players across its lifecycle. Its long-term retention and monetization made it a reference example of Tencent's small-team, high-impact development model, reinforcing how tightly engineered responsiveness translates into user trust. That mindset now carries forward into how ACE Studio is built, so composers experience vocal renders and music previews as immediate rather than delayed batch outputs.
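
For readers curious what those mechanisms look like in practice, here is a minimal sketch of deterministic simulation combined with client-side prediction and reconciliation, written in Python for brevity rather than the C++ used in the actual engine; the state, input shapes, and constants are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class BallState:
    x: float = 0.0
    vx: float = 0.0

def step(state: BallState, impulse: float, dt: float = 1 / 30) -> BallState:
    # Deterministic fixed-timestep physics: the same inputs always
    # produce the same state, so client and server stay in sync.
    vx = (state.vx + impulse) * 0.98  # toy friction coefficient
    return BallState(x=state.x + vx * dt, vx=vx)

@dataclass
class Client:
    state: BallState = field(default_factory=BallState)
    pending: list = field(default_factory=list)  # (seq, impulse) not yet acked

    def apply_input(self, seq: int, impulse: float) -> None:
        # Predict locally so play feels instant despite network latency.
        self.pending.append((seq, impulse))
        self.state = step(self.state, impulse)

    def reconcile(self, server_state: BallState, last_acked: int) -> None:
        # Rewind to the authoritative snapshot, drop acknowledged inputs,
        # then replay the rest so prediction converges without a visible snap.
        self.pending = [(s, i) for s, i in self.pending if s > last_acked]
        self.state = server_state
        for _, impulse in self.pending:
            self.state = step(self.state, impulse)
```

The same pattern translates to generative audio: render a fast local preview immediately, then reconcile with the full-quality result when it arrives, so the user never waits on the slowest link in the pipeline.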

“Fast models are only useful if the rest of the system keeps up with human timing. Years of tuning billiards physics for millions of players taught me that perceived smoothness comes from the whole pipeline, from algorithms to UI. That mindset carries into how ACE Studio is built, so a composer can drag a note or tweak a phrase and feel the result instantly, not wait for a progress bar,” says Zhao.

Open Foundations For A Creator Ecosystem

As ACE-Step matures, openness has become a second pillar alongside raw speed. The open-source ACE-Step repository has attracted around 3,345 stars and 391 forks on GitHub in 2025, reflecting meaningful traction among researchers, plugin developers, and creative coders who want to inspect and extend the model rather than accept a sealed service. When a music foundation model is visible at this level, it behaves less like a proprietary effect and more like shared infrastructure that different tools, companies, and independent creators can build on together.

Zhao, who holds a granted patent in music generation systems, pushed for ACE-Step to be released as a fast, open foundation model rather than kept entirely behind ACE Studio's interface. He led the architecture for training and inference pipelines, packaging the model so developers can run it on their own infrastructure, integrate it into custom tools, or contribute improvements back through GitHub. Inside ACE Studio, the same design shows up as artist-supervised, feedback-looped workflows, where every phrase can be regenerated, sliced, or rebalanced with clear controls instead of hidden parameters, so creators see and feel how the model responds to their decisions. That combination of open code for developers and explainable behavior for musicians turns ACE-Step into a modular foundation that other creative tech products can adopt without asking artists to give up control or authorship.
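
As a flavor of what "regenerate with clear controls" can look like, here is a hypothetical sketch of that supervised loop; `render` and `approve` are stand-in callables, and none of the names reflect ACE-Step's actual API.

```python
import random

def regenerate_phrase(render, approve, phrase, controls, max_takes=5):
    """Hypothetical artist-supervised loop: regenerate one phrase under
    explicit controls until a take is approved, or keep the original.
    `render` stands in for the model call, `approve` for the human decision."""
    for _ in range(max_takes):
        take = render(phrase, controls)  # only exposed controls vary
        if approve(take):                # the human keeps final say
            return take
        controls = {**controls, "seed": controls.get("seed", 0) + 1}
    return phrase                        # never silently overwrite the sketch

# Toy usage with stand-in callables:
if __name__ == "__main__":
    render = lambda p, c: f"{p} (seed={c['seed']})"
    approve = lambda take: random.random() > 0.5
    print(regenerate_phrase(render, approve, "verse melody", {"seed": 0}))
```

The point of the shape is authorship: the loop only varies parameters the artist can see, and it degrades to the original sketch rather than to an unexplained result.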

“An open foundation model is a promise that we are willing to let other people inspect and extend our work. When developers can see how ACE-Step is built, adapt it, and feed improvements back, everyone benefits. I want composers, researchers, and plugin makers to feel they are building on a shared backbone, not fighting a black box,” explains Zhao.

Guardrails For AI Voices And Music Rights

As adoption accelerates, rights and attribution sit at the center of any credible AI audio platform. The market for generative AI music and audiovisual content is expected to grow from about €3 billion (~$3.49 billion) today to roughly €64 billion (~$74.2 billion) by 2028, while economic analysis suggests that around 24 percent of music creators' revenues could be at risk over the same period as unlicensed AI content competes with human works. Those numbers make it clear that training data choices, consent, and licensing structures are now core product decisions in generative audio, not just legal fine print.

Inside ACE Studio, Zhao's team treats rights and transparency as part of the product surface, not an afterthought. The workstation ships with more than eighty royalty-free AI vocalists and clear usage terms, so composers can build demos and final tracks without guessing whether a take is safe to publish. Custom voice training requires explicit configuration and separation of private models, reducing the risk that a client voicebank leaks into general use. The Asset Community gives creators a governed way to share and monetize their own models, while preserving attribution and usage controls that fit professional work. In practice, those guardrails help ACE Studio expand access to professional-grade vocals without turning human singers, writers, or rights holders into expendable inputs.
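
A hedged sketch of what "explicit configuration and separation of private models" might look like in code; the field names and defaults below are assumptions for illustration, not ACE Studio's real schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoicebankConfig:
    """Hypothetical scoping of a custom voicebank; fields are
    illustrative assumptions, not ACE Studio's actual schema."""
    owner_id: str
    consent_recorded: bool           # explicit consent precedes any training
    visibility: str = "private"      # private models never enter shared pools
    license: str = "client-only"     # usage terms travel with the model

def can_share(cfg: VoicebankConfig) -> bool:
    # A voicebank only reaches a shared community when the owner has
    # consented and explicitly opted in; private is the default state.
    return cfg.consent_recorded and cfg.visibility == "community"
```

The design choice worth noting is the default: sharing is something an owner turns on, never something a pipeline forgets to turn off.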

“Generative audio has to respect the same ethical and legal lines that protect human performers. If we help creators understand what a model is trained on, what they are allowed to do with it, and how their own voices are protected, AI becomes a reliable part of the studio instead of a liability. This is the standard the system is designed to meet,” notes Zhao.

Looking Ahead: Where Creator-Centric AI Scales

As generative audio matures, the trajectories in both music and AI markets point to larger stakes rather than a passing trend. The global market for AI in music is expected to reach about $3.58 billion by 2030, while the broader global music market is projected to grow to roughly $41.99 billion by 2030, and some forecasts suggest AI-driven music revenues could approach $60.44 billion by 2034. Those projections favor creator-centric, explainable systems that keep artists in the loop, document how models are trained and used, and lower the barrier to professional workflows instead of turning catalogs into anonymous training data.

Zhao's path reflects that direction. Beyond ACE-Step and ACE Studio, he serves as a judge for the Globee Awards for Impact, reinforcing his role in shaping how responsible AI products are evaluated across industries. His focus remains consistent: using fast, open models and supervised production loops to show that generative audio can widen opportunity for composers and producers rather than trying to replace them.

“Scalable AI in music is not only about bigger models or more presets. It is about earning the trust of composers, labels, and rights holders one session at a time. If we keep creators at the center of every design decision, the next decade of generative audio can widen opportunity instead of narrowing it,” says Zhao.

on December 30, 2025

    The "trusted instrument vs research experiment" framing really nails the adoption barrier for generative audio. Musicians have decades of muscle memory with their tools - they need AI that fits into existing workflows, not one that demands they learn a new creative paradigm.

    The latency lesson from game dev is underrated. Interactive creative tools live and die by perceived responsiveness. If there's a noticeable delay between "I want to try this" and hearing the result, the creative flow breaks. Sounds like the billiards physics background gave you intuition for that pipeline-level thinking.

    Curious about the open-source strategy: with 3.3k stars, are you seeing more academic/research forks or production-ready plugins? And how do you think about the moat when the model is open - is it the integrated ACE Studio experience that retains users?
