Here's the version of events I'm a little embarrassed to tell. I was finishing up a co-op survival game, maybe six months into development. Players needed to coordinate. I'd been putting off voice chat because it felt like a stretch goal, not a core feature. Then a playtest group started complaining that the ping wheel wasn't cutting it for complex situations. Fine. Push-to-talk. How hard can it be?
I budgeted a weekend. I spent two weeks, and that was before moderation became a problem. Day three after a small public beta, a player messaged me with a screenshot of a chat log they'd somehow captured. Someone had been harassing a kid in voice. I had no tools to handle it. No reporting flow. No recording. No way to ban someone from voice specifically. I had a Unity project and a Photon integration that let people talk to each other, and nothing else.
That was the moment I understood that voice chat is not a feature. It's a product decision with ongoing costs.
The Deceptive Simplicity Problem
Voice chat looks simple because the networking problem is solved. WebRTC has been around since 2011. Unity has a built-in Vivox integration. You can get two people talking in a game in an afternoon. The gotchas aren't technical, they're operational.
Text chat is hard to moderate. Voice chat is harder by an order of magnitude. With text, you have a record. You can run it through a filter, log it, search it. With voice, you're dealing with real-time audio streams that are expensive to store, difficult to analyze automatically, and legally complicated to record without user consent in some jurisdictions. The moderation tooling that exists for text, built over twenty years of online gaming, mostly doesn't apply to voice.
Every platform that added voice chat at scale (Xbox Live, PlayStation Network, Discord) had to build entire trust-and-safety teams around it. You're not doing that. So the question becomes: how much of this problem can you outsource?
Third-Party SDKs vs Rolling Your Own
You have two real paths: integrate an existing voice SDK, or build on raw WebRTC. Let me be direct about when each makes sense.
Use a third-party SDK if: you're shipping to consoles, you need voice in a native mobile build, you have more than a few hundred concurrent users, or you want to avoid running your own signaling servers. The SDKs handle NAT traversal, TURN/STUN infrastructure, and platform certification requirements, and they give you moderation controls (mute, ban, channel management) out of the box.
Build on WebRTC yourself if: you're building a browser-based game, your player count is small and predictable, you have backend engineering resources, and you want complete control over the architecture. The DIY path is real work, but it's not impossible, and for some games it's the right answer.
Vivox: The Enterprise Option
Vivox is what Fortnite uses. It's what Call of Duty uses. It's what most AAA games on Unity use because Unity acquired it in 2019 and baked it into the engine. If you've played a big multiplayer game in the last five years, you've probably talked through Vivox without knowing it.
The Unity integration is genuinely good. You initialize it with your credentials, create a channel, and users can join. Positional audio is built in. The SDK handles all the NAT traversal and relay infrastructure. For a studio with a real budget shipping on multiple platforms including consoles, Vivox makes sense.
The catch is pricing and access. Vivox pricing is not public. You negotiate contracts, and the costs scale with concurrent users. For an indie game expecting a few thousand players at launch, the pricing can be workable. For a small studio with uncertain player counts, the uncertainty itself is a problem. I've talked to developers who got quotes ranging from "effectively free at small scale" to numbers that made them reconsider the feature entirely.
Vivox also has an approval process. You need to apply, get approved, and there's overhead there that a weekend project can't absorb. If you're shipping a commercial game with real funding, it's worth the conversation. If you're a solo dev making your first multiplayer game, look elsewhere first.
Agora: The Practical Middle Ground
Agora is where I ended up, and it's what I'd recommend to most indie developers today. The pricing model is transparent: you pay per minute of audio usage. As of 2025, it's roughly $0.99 per 1,000 audio minutes for the basic tier, with a free tier of 10,000 minutes per month. For a game where 100 players each spend an hour a day in voice, that's about 180,000 minutes a month, or roughly $170 after the free tier. The free tier covers a small playtest; a live game outgrows it quickly.
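Usage-based pricing is worth multiplying out before you commit. Here's a minimal sketch of that arithmetic, using the rate and free tier quoted above. The function name, the 30-day month, and the idea that every player-minute is billed at the same rate are my simplifying assumptions, not Agora's billing rules.

```python
def monthly_voice_cost(players, minutes_per_player_per_day,
                       rate_per_1000_min=0.99, free_minutes=10_000, days=30):
    """Return (total minutes, estimated monthly bill in dollars).

    Assumes a flat per-minute rate and a single monthly free allowance.
    """
    total_minutes = players * minutes_per_player_per_day * days
    billable_minutes = max(0, total_minutes - free_minutes)
    cost = billable_minutes * rate_per_1000_min / 1000
    return total_minutes, cost


# 100 players, one hour of voice each per day.
total, cost = monthly_voice_cost(players=100, minutes_per_player_per_day=60)
print(f"{total} minutes, about ${cost:.2f} per month")
```

Run it with your own playtest numbers before you trust anyone's "it's basically free," including mine.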
The docs are actually good. That sounds like a low bar, but voice SDK documentation is famously bad. Agora's Unity SDK comes with working sample projects, the initialization flow is clear, and their dashboard shows you real usage. I've debugged Agora integrations at 2am and been able to figure out what was wrong. That matters.
Agora also has a moderation layer called Agora Content Moderation, which does real-time audio stream analysis for certain types of content. It's not magic, it won't catch everything, but it's something you can turn on without building it yourself. Combined with a user reporting flow, it gives you the minimum viable trust-and-safety setup.
The weaknesses are real. Agora has deep roots in China (its headquarters are split between Shanghai and Santa Clara, California), which matters for games targeting certain markets. Data residency questions come up. The SDK size is not trivial. And for advanced spatial audio with dynamic positioning, you're doing more manual work than Vivox.
Daily.co and LiveKit: The Newer Options
LiveKit is the one I'd be looking at seriously if I were starting a new project today. It's open-source (Apache 2.0 licensed), you can self-host it, and their cloud tier is competitively priced. The project has been growing fast since 2021. Their Unity SDK is relatively new but functional, and the WebRTC implementation underneath is solid.
What LiveKit does well is flexibility. You can run your own LiveKit server, which means no vendor dependency, no usage-based pricing surprises, and full control over data. For a game studio that already has backend infrastructure, adding a LiveKit server to your fleet isn't a huge ask. The tradeoff is that you're now responsible for operating it, including the TURN relay infrastructure for players behind restrictive NATs.
Daily.co is primarily designed for video calling, not games. You can use it for voice, and some browser-based games do, but you're fighting the SDK's assumptions about use cases. The pricing is clear and reasonable, but the fit isn't natural. I wouldn't recommend it unless you're building something closer to a social space than a game.
For browser games specifically, LiveKit is genuinely competitive. If you're building a WebSocket-based multiplayer game that runs in the browser, dropping in LiveKit for voice is less friction than most alternatives.
Building on WebRTC Yourself
I'll be honest: I've done this once, for a browser game, and I wouldn't attempt it for a native game. WebRTC itself is well-specified and browsers implement it consistently. The problem is signaling and NAT traversal.
WebRTC needs a signaling channel to exchange connection information before peers can talk directly. You're building that. It also needs STUN servers (for discovering public IPs) and TURN servers (for relaying when direct connection fails). Running TURN at any real scale isn't free. Coturn is the standard self-hosted option, but it burns bandwidth proportional to your concurrent users who can't establish direct connections, which in practice is around 15-20% of connections depending on your player base's network setup.
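To get a feel for what "burns bandwidth" means, here's a back-of-envelope sketch. The 20% relay fraction is the upper end of the range above. The 32 kbps per voice stream is my assumption, and the model counts each relayed stream once coming into the relay and once going out. It ignores fan-out to multiple listeners and protocol overhead, so treat the result as a floor.

```python
def turn_relay_mbps(concurrent_users, relay_fraction=0.20, stream_kbps=32):
    """Rough sustained bandwidth through a TURN relay, in Mbit/s.

    relay_fraction: share of users who cannot connect directly.
    stream_kbps: assumed bitrate of one voice stream (an assumption, tune it).
    """
    relayed_streams = concurrent_users * relay_fraction
    # The relay receives every relayed stream and sends it back out,
    # so each stream counts twice against the server's bandwidth.
    return relayed_streams * stream_kbps * 2 / 1000


for users in (50, 500, 5000):
    print(users, "users ->", round(turn_relay_mbps(users), 1), "Mbit/s")
```

The number itself matters less than the shape: relay cost grows linearly with concurrent users, and you pay it whether or not anyone is saying anything useful.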
Where DIY WebRTC makes sense: small browser game, you already have WebSocket infrastructure, you want to understand the technology deeply, and your player count is bounded. Where it doesn't: native games on mobile or console, you need scalable relay infrastructure, you're a solo dev without backend time.
The real risk with DIY isn't the initial implementation. It's the maintenance. Browsers update their WebRTC stacks. New NAT traversal edge cases appear. You're now on the hook for debugging "voice works for 90% of players but not these specific network configurations" forever.
Proximity Audio: Three Different Problems
Proximity audio (where your voice volume changes based on in-game distance) sounds like one feature. It's actually three different product problems, and which one you're solving changes the implementation entirely.
The Among Us model: During a round, crewmates can't talk. After a death, the ghost hears both living and dead players but living players can't hear ghosts. In meetings, everyone hears everyone. This isn't spatial audio at all. It's discrete channel membership that changes based on game state. The technical implementation is just channel switching. When a player dies, move them to the ghost channel. When a meeting starts, merge channels. Groups playing Among Us get this behavior without any audio spatialization, because the game doesn't need it.
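The channel-switching rules above fit in one small function. This is a sketch of the state logic only. The channel names, the `Phase` enum, and the return shape are mine, and the actual join/leave calls depend on whichever voice SDK you picked.

```python
from enum import Enum


class Phase(Enum):
    ROUND = "round"
    MEETING = "meeting"


def voice_state(alive, phase):
    """Return (channel the player may speak in or None, channels they hear)."""
    if phase is Phase.MEETING:
        # Meetings merge everyone into one shared channel.
        return "meeting", {"meeting"}
    if alive:
        # Living players are silent during a round, so nobody is
        # transmitting on a "living" channel for anyone to listen to.
        return None, set()
    # Dead players talk among themselves; the living never join this channel.
    return "ghosts", {"ghosts"}


print(voice_state(alive=True, phase=Phase.ROUND))
print(voice_state(alive=False, phase=Phase.ROUND))
print(voice_state(alive=False, phase=Phase.MEETING))
```

Re-evaluate it whenever a player dies or the phase changes, and diff the result against the channels the player is currently in.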
The FPS squad model: In games like Squad or Hell Let Loose, there's local voice (150m range, anyone nearby hears you), squad voice (your squad always, regardless of distance), and command voice (squad leaders only). Volume attenuates with distance on local chat. This needs actual audio spatialization: calculate distance between speaker and listener server-side or client-side, send that distance to the SDK, let the SDK adjust volume. Vivox handles this natively. With Agora, you use their spatial audio extension. With LiveKit, you're computing it yourself and adjusting track volumes.
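Before any volume math, the squad model is a question of who is allowed to hear whom on each channel. A minimal sketch of that routing check follows. The `Player` fields, channel names, and 2D positions are hypothetical, and the 150m constant comes from the local range mentioned above.

```python
import math
from dataclasses import dataclass

LOCAL_RANGE_M = 150.0  # local voice range in meters


@dataclass
class Player:
    name: str
    squad: str
    is_leader: bool
    pos: tuple  # (x, y) world position in meters


def can_hear(listener, speaker, channel):
    """Decide whether listener should receive speaker's audio on a channel."""
    if channel == "squad":
        return listener.squad == speaker.squad
    if channel == "command":
        return listener.is_leader and speaker.is_leader
    if channel == "local":
        return math.dist(listener.pos, speaker.pos) <= LOCAL_RANGE_M
    raise ValueError(f"unknown channel: {channel}")


lead_a = Player("lead_a", "alpha", True, (0.0, 0.0))
rifleman = Player("rifleman", "alpha", False, (100.0, 0.0))
lead_b = Player("lead_b", "bravo", True, (1000.0, 0.0))

print(can_hear(rifleman, lead_a, "local"))   # same area, within range
print(can_hear(lead_b, lead_a, "command"))   # both squad leaders
print(can_hear(rifleman, lead_a, "command")) # not a leader
```

Keeping this check separate from the attenuation code makes it easy to test, and easy to change when players find an exploit.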
The open world model: This is the hard one. In MMOs with large zones, or open world games where hundreds of players might be in the same area, naive proximity audio creates chaos. World of Warcraft's built-in voice chat (yes, it exists, almost nobody uses it) avoids the problem by tying voice to parties and communities rather than to position. Final Fantasy XIV players, who run voice outside the game client, organize it around parties and linkshells rather than proximity. The lesson from every MMO that tried pure spatial voice chat: it doesn't work at population density. You need to constrain who can hear whom, and the constraints should follow social structures, not just geometry.
For most games you'll actually build, the FPS squad model is what you're implementing. The math isn't hard: get the distance between two players every tick, clamp it to your max range, convert it to a volume scalar, update the audio track. The gotcha is that doing this server-side (where you know authoritative positions) and communicating it efficiently to the audio SDK adds latency. Most games do it client-side with client-known positions and accept that proximity audio is slightly laggy when players are at the edge of range.
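Here's that math as a minimal sketch. The linear falloff curve is my choice (many games prefer something closer to inverse-distance), the 150m default borrows the local range from earlier, and the function name is hypothetical. How you hand the resulting scalar to your voice SDK depends on the SDK, so that part is left out.

```python
import math


def proximity_volume(speaker_pos, listener_pos, max_range_m=150.0):
    """Linear falloff: 1.0 at zero distance, 0.0 at or beyond max range."""
    distance = math.dist(speaker_pos, listener_pos)
    clamped = min(distance, max_range_m)
    return 1.0 - clamped / max_range_m


# Listener at the origin, speaker moving away along the x axis.
for meters in (0, 75, 150, 300):
    print(meters, proximity_volume((float(meters), 0.0), (0.0, 0.0)))
```

Call it once per tick per audible speaker, and skip the SDK update when the value hasn't changed by more than a small threshold, so you aren't hammering the audio layer for differences nobody can hear.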
Moderation: The Real Cost Nobody Budgets For
Let me tell you what happened after that harassment incident I mentioned. I added a reporting button. Players could tap it and file a report against a user. I got 40 reports in the first week. I had to listen to audio clips. I had no audio clips, because I wasn't recording. I had timestamps and player IDs, and that was it. Every report was unverifiable. I banned two players based on corroborating reports from multiple sources. The person who got harassed quit the game.
This is the reality of voice moderation without tooling. You're flying blind.
The tool worth knowing about is ToxMod, built by Modulate (modulate.ai). It does real-time audio analysis specifically for gaming contexts: it listens to voice streams and flags toxic behavior such as harassment, slurs, and doxxing attempts. It doesn't require you to record and store all audio for review, which is important for privacy reasons.
Activision began rolling ToxMod out in Call of Duty in 2023, and it remains available to other studios. It integrates alongside your voice platform via Modulate's SDK, and it costs money on top of your voice infrastructure costs. Pricing leans enterprise, though Modulate has been approachable for indie developers, with costs that scale by usage.
Is it worth it? If you have a game where minors play, or where voice is a primary communication channel, yes. If you're building a small game with a tight-knit community and manual moderation is feasible, maybe not at launch. The problem is that harassment incidents happen before you expect them. The first incident that gets posted to social media is the one that defines your game's reputation for a year.
At minimum, build a reporting flow before you launch voice. It doesn't have to be sophisticated. A button that captures a timestamp and the reported user's ID, emails it to you, and lets you act on it. That's the floor. From there you add tooling as the community grows.
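The floor really is that small. Here's a sketch of the record the report button could produce. The field names are hypothetical, and delivery (email, webhook, a database row) is left out because it depends on your backend.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class VoiceReport:
    reporter_id: str
    reported_id: str
    channel: str
    reason: str
    created_at: str  # ISO 8601 timestamp in UTC


def make_report(reporter_id, reported_id, channel, reason="", now=None):
    """Capture who reported whom, where, and when."""
    now = now or datetime.now(timezone.utc)
    return VoiceReport(reporter_id, reported_id, channel, reason,
                       now.isoformat())


def to_json(report):
    """Serialize a report for whatever delivery mechanism you use."""
    return json.dumps(asdict(report), sort_keys=True)


report = make_report("player_17", "player_42", "in-game", "harassment")
print(to_json(report))
```

Store the channel alongside the IDs. When several reports name the same player in the same channel within a few minutes, that corroboration is the only evidence you have if you aren't recording.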
Recording and Legal Considerations
Don't skip this. In some jurisdictions, California for example, recording voice communications without consent from all parties is illegal. If your game is played internationally, you need to know what you're recording and where you're storing it.
The standard approach is an explicit consent notice during onboarding: "Voice chat may be recorded for moderation purposes." If you're not recording, say that too. Players should know what to expect. Your ToS should cover this. I am not a lawyer; get actual legal advice before you record voice at any scale.
My Honest Take for Indie Devs
If I were starting a new multiplayer game today and it needed voice, here's what I'd actually do:
- Small browser game or early prototype: LiveKit self-hosted or cloud. Free tier gets you started. If it works, you grow from there without vendor lock-in.
- Unity or Unreal game, mid-size indie: Agora. Predictable pricing, good docs, available spatial audio extension. Budget $50-200/month for a moderately active player base and don't be surprised when moderation takes more time than the integration did.
- Serious commercial game, console ports, large player base: Talk to Vivox. Yes, the pricing conversation is annoying. Yes, the integration overhead is real. But at scale, having purpose-built gaming voice infrastructure with platform certification support is worth it.
- DIY WebRTC: Only if you're building browser-based and want control, or if you have backend engineering hours to spend and a specific reason not to use a managed solution.
The one thing I'd tell every developer adding voice for the first time: start with fewer channels than you think you need. One lobby channel, one in-game channel. Don't build zone-based spatial audio from the start. Get voice working and stable, watch how players use it, then add complexity. Voice chat architecture that looks elegant on a whiteboard often creates weird player behaviors in practice. Proximity audio especially, because players will immediately try to find the edge of the range and use it tactically in ways you didn't expect.
Voice changes the social texture of a multiplayer game more than almost any other feature. A game without voice feels like a game. A game with voice becomes a place. That's worth building. Just don't assume it's a weekend project.
