Nobody talks about lobby systems. Every tutorial walks you through tick rate and lag compensation, real-time sync and netcode, rollback and prediction. The lobby system, that thing players sit in for 90 seconds before every match, gets three paragraphs in a wiki somewhere and a footnote in the docs. And then you ship, and the lobby breaks, and you realize it was never a waiting room at all. It's the most complex state machine in your entire game.
I learned this in 2023 while building an indie battle royale. We had solid netcode. We had a working matchmaking service. We had host migration logic that I was genuinely proud of. Then we ran our first real playtest with 40 people, and within an hour the lobbies were freezing, parties were splitting mid-queue, and a player with 600ms latency got elected host because our migration algorithm picked whoever responded first to the ping. Everyone in that lobby lagged out. It was a bad day.
This post is what I wish existed before I started that project.
A Lobby Is a State Machine, Not a Waiting Room
The first mistake most developers make is treating a lobby as a passive container. You put players in, they click "Ready," the match starts. That model falls apart the moment you have to handle real users.
A lobby is a state machine with at least four distinct states, and the transitions between them are where bugs live. Here's the minimal version:
- Assembling: Players are joining. The party is incomplete. You're accepting new connections and filling slots.
- Staging: The party is full (or at threshold). You're waiting for ready confirmations.
- Queued: The lobby is in matchmaking, waiting for server assignment. No new players can join.
- Launching: A server has been assigned. You're handing off state and transitioning players to the game session.
Each transition has failure modes. What happens if a player disconnects during Launching? Do you abort and return to Assembling? Do you launch without them? What if the matchmaking server returns an error during Queued? You need explicit error states and rollback logic for every transition. Most developers don't build these until a bug forces them to.
I'd recommend modeling this with a proper state machine library rather than a pile of boolean flags. XState (JavaScript) and Stateless (C#/.NET) both give you explicit state definitions, transition guards, and entry/exit actions. When your lobby has eight boolean properties controlling behavior, you've got 2^8 = 256 possible combinations, most of them invalid. A real state machine makes illegal states unrepresentable.
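To make the idea concrete without tying it to any particular library, here is a minimal sketch of the four states as an explicit transition table. The event names (party_full, all_ready, and so on) are hypothetical, and the rollback targets are just one possible policy; your game may want different ones.

```python
from enum import Enum


class LobbyState(Enum):
    ASSEMBLING = "assembling"
    STAGING = "staging"
    QUEUED = "queued"
    LAUNCHING = "launching"


# (current state, event) -> next state. Anything not listed is illegal.
# Event names are hypothetical; the failure events encode one possible
# rollback policy, not the only reasonable one.
TRANSITIONS = {
    (LobbyState.ASSEMBLING, "party_full"): LobbyState.STAGING,
    (LobbyState.STAGING, "player_left"): LobbyState.ASSEMBLING,
    (LobbyState.STAGING, "all_ready"): LobbyState.QUEUED,
    (LobbyState.QUEUED, "matchmaking_error"): LobbyState.STAGING,
    (LobbyState.QUEUED, "server_assigned"): LobbyState.LAUNCHING,
    (LobbyState.LAUNCHING, "allocation_failed"): LobbyState.ASSEMBLING,
}


class IllegalTransition(Exception):
    """Raised when an event is not allowed in the current state."""


class Lobby:
    def __init__(self) -> None:
        self.state = LobbyState.ASSEMBLING

    def fire(self, event: str) -> LobbyState:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise IllegalTransition(f"{event!r} is not valid in {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Because every legal move is a row in one table, a "player joined while Queued" bug shows up as an IllegalTransition exception instead of a lobby silently drifting into a state nobody designed.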
Party Formation: Keeping Players Together
Party formation sounds simple. Player A invites Player B, Player B accepts, they're in a party. Done. In practice, you need to handle:
- Invite flows via friend list, in-game ID, or shareable code
- Join-by-code (a short alphanumeric code that maps to a lobby ID)
- Keeping the party together across queue transitions (when you enter matchmaking and get bounced back, does the party stay intact?)
- Party size limits relative to team size
- What happens when one party member is in a different game mode
The "keep together through queue" problem is nastier than it sounds. Your lobby service and your matchmaking service are separate systems. When you hand a party to matchmaking, you need a way to track them as a unit and return them as a unit if the queue fails. The common pattern is a persistent party ID that survives lobby transitions. The party exists independently of any specific lobby. Lobbies are created and destroyed; the party persists.
For join-by-code: don't generate codes on the client. Generate them server-side, store the mapping in Redis with a short TTL (I used 30 minutes), and validate on join. A six-character, case-insensitive alphanumeric code gives you 36^6 combinations, about 2.18 billion. That makes collisions rare at any realistic indie scale, though it's still worth checking for an existing mapping before you hand a code out.
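Here is a rough sketch of server-side code generation with expiry. It uses an in-memory dictionary as a stand-in for Redis so that it runs on its own; the class name, method names, and injected clock are all assumptions made for illustration. With real Redis, the create step would typically be a single SET with the NX and EX options.

```python
import secrets
import string
import time

ALPHABET = string.ascii_uppercase + string.digits  # 36 symbols
CODE_LENGTH = 6
TTL_SECONDS = 30 * 60


class JoinCodeStore:
    """In-memory stand-in for a Redis code -> lobby mapping with a TTL."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._codes = {}  # code -> (lobby_id, expires_at)

    def create(self, lobby_id: str) -> str:
        while True:
            code = "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))
            entry = self._codes.get(code)
            # Only claim a code that is unused or already expired.
            if entry is None or entry[1] <= self._now():
                self._codes[code] = (lobby_id, self._now() + TTL_SECONDS)
                return code

    def resolve(self, code: str):
        """Return the lobby ID for a live code, or None."""
        entry = self._codes.get(code.strip().upper())
        if entry is None or entry[1] <= self._now():
            return None
        return entry[0]
```

Normalizing the input to uppercase on resolve means players can type the code in any case, which is why the alphabet only needs 36 symbols.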
Host Migration: The Hard Part
Host migration is the problem that turns "this should be straightforward" into a two-week debugging session.
In a peer-to-peer lobby, one client acts as the host. They maintain the lobby state, relay it to other players, and initiate the handoff to the game server. When that player disconnects mid-lobby, you need a new host elected immediately. Every second without a host is a second the lobby state is inconsistent.
The naive approach: elect the player with the lowest ping. This is what I did in 2023. It doesn't work. Ping measurements in a lobby are noisy. They're round-trip to your relay server, not peer-to-peer latency, and they don't account for packet loss or jitter. We elected a player with 600ms average latency as host because they happened to have a low ping at the exact moment of the election. The lobby transferred. Everyone's experience degraded immediately. We killed the session and started over.
Better approach: elect based on connection stability over time, not a single ping snapshot. Track a rolling average of the last 10 pings per player. Weight by packet loss rate. Break ties by connection duration (players who've been in the lobby longer tend to be more stable).
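A minimal sketch of that election, assuming you already collect ping samples and a packet loss fraction per player. The PeerStats type and the loss weighting factor of 10 are illustrative assumptions, not tuned values.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PeerStats:
    player_id: str
    connected_seconds: float
    packet_loss: float = 0.0  # fraction in [0, 1]
    pings_ms: deque = field(default_factory=lambda: deque(maxlen=10))

    def record_ping(self, ms: float) -> None:
        self.pings_ms.append(ms)  # the deque drops the oldest sample past 10

    def score(self) -> float:
        """Lower is better. The loss weighting is an illustrative assumption."""
        if not self.pings_ms:
            return float("inf")  # never prefer a peer we have no data for
        avg = sum(self.pings_ms) / len(self.pings_ms)
        return avg * (1.0 + 10.0 * self.packet_loss)


def elect_host(peers):
    """Pick the best-scoring peer; longest-connected wins a tie."""
    return min(peers, key=lambda p: (p.score(), -p.connected_seconds))
```

With a rolling window, a player who averages 600ms but lands one lucky 20ms sample still scores far worse than a steady 80ms connection, which is exactly the case a single snapshot gets wrong.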
Best approach: don't use P2P lobby hosting at all. Use a server-authoritative lobby service.
Server-authoritative lobbies eliminate host migration entirely. The lobby state lives on your servers, not on a player's client. Players connect to the lobby service, not to each other. When a player disconnects, the server removes them from the state and continues. No election, no state transfer, no migration.
The tradeoff is cost and operational overhead. But this is 2024: Photon Lobby, PlayFab Lobby, and Nakama all offer hosted lobby services with per-connection or per-MAU pricing. For most indie games, the cost is negligible compared to the engineering time you'd spend debugging P2P host migration in production.
The Ready-Check Problem
Not everyone clicks "Ready." Some players go AFK. Some are still on Discord. Some are apparently testing whether they can hold the lobby hostage indefinitely. You need timeout logic, and it needs to be designed with intent.
Here's what I recommend:
- Display a visible countdown once all required slots are filled. 60 seconds is standard. Show it prominently so players know the clock is running.
- Auto-start threshold: If 75% of players are ready and 45 seconds have elapsed, start the match. Don't wait for 100%. In a 10-player lobby, holding everyone hostage for one AFK player is not acceptable UX.
- AFK detection: Track last interaction time per player. If a player hasn't sent any input in 60 seconds, mark them as potentially AFK and deprioritize their ready state in your threshold calculation. Don't kick them silently.
- Kick on timeout: If a player isn't ready after 90 seconds in the staging state, remove them and open the slot. Send them a clear notification. Silent drops kill retention.
The auto-start threshold is genuinely opinionated, and you should tune it for your game. A competitive 5v5 might require 100% ready since an uneven match is broken by definition. A battle royale with 40 players can comfortably start at 90% fill. Know your game mode's tolerance for partial fills before you hardcode anything.
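The timeout rules above can be sketched as one pure function that the lobby calls on every tick. The defaults mirror the numbers in this section; the function name, the config type, and the return shape are assumptions for illustration, and AFK weighting is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadyCheckConfig:
    auto_start_ratio: float = 0.75   # tune per game mode; 1.0 for a strict 5v5
    auto_start_after_s: float = 45.0
    kick_after_s: float = 90.0


def evaluate_ready_check(ready, elapsed_s, cfg=ReadyCheckConfig()):
    """Return ("start", []), ("kick", [player_ids]) or ("wait", []).

    `ready` maps player ID -> whether that player has clicked Ready.
    `elapsed_s` is the time spent in the Staging state.
    """
    if not ready:
        return ("wait", [])
    not_ready = sorted(pid for pid, is_ready in ready.items() if not is_ready)
    if not not_ready:
        return ("start", [])
    ratio = (len(ready) - len(not_ready)) / len(ready)
    if elapsed_s >= cfg.auto_start_after_s and ratio >= cfg.auto_start_ratio:
        return ("start", [])
    if elapsed_s >= cfg.kick_after_s:
        return ("kick", not_ready)
    return ("wait", [])
```

Keeping the thresholds in a config object rather than in the function body makes it cheap to run a strict mode (ratio 1.0) and a permissive mode side by side.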
Queue Management and Lobby Serialization
When your lobby transitions to Queued state, you need to serialize lobby state and hand it to your matchmaking system. This serialization is easy to get wrong.
At minimum, your serialized lobby should include: the party ID, a list of player IDs with their ready states, any lobby settings (region preference, game mode, skill bracket), and a schema version number. The schema version matters when you're rolling updates. If your matchmaking service expects schema v2 and your lobby service is still sending v1, you'll get silent mismatches that are genuinely unpleasant to debug.
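As a sketch of what a versioned payload might look like, here is a small snapshot type with a strict version check on the receiving side. The field names and the version number 2 are assumptions for illustration; the point is that a mismatch fails loudly instead of being misread.

```python
import json
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = 2  # hypothetical current version


@dataclass
class LobbySnapshot:
    party_id: str
    players: list  # e.g. [{"id": "p1", "ready": True}, ...]
    settings: dict = field(default_factory=dict)  # region, mode, skill bracket
    schema_version: int = SCHEMA_VERSION


def serialize(snapshot: LobbySnapshot) -> str:
    return json.dumps(asdict(snapshot), sort_keys=True)


def deserialize(payload: str) -> LobbySnapshot:
    data = json.loads(payload)
    version = data.get("schema_version")
    if version != SCHEMA_VERSION:
        # Fail loudly instead of silently misreading an older payload.
        raise ValueError(f"unsupported lobby schema version: {version!r}")
    return LobbySnapshot(**data)
```

During a rolling update you would likely accept a small set of versions and migrate old payloads forward, but rejecting unknown versions is a reasonable starting point.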
If you'd rather not build the lobby service yourself, these are the three implementations I see referenced most often:
- Photon Lobby: Part of Photon Realtime, it provides lobby listing, filtering, and property syncing out of the box. Strong Unity integration. Lobby properties are key-value pairs you define, synced to all members automatically. Good for games already on the Photon stack.
- PlayFab Lobby: Microsoft's gaming platform now has first-class Lobby support that's generally available. It's server-authoritative by default, handles member management, and integrates with PlayFab Matchmaking for clean queue transitions. The SDK handles a significant portion of the state machine complexity for you.
- Nakama: Open source, self-hostable. Nakama provides match listing and lobby-like functionality through its matchmaker and real-time multiplayer APIs. If you need full control and want to avoid vendor lock-in, Nakama is worth the operational overhead.
For the handoff to the game server: send the full serialized lobby state as a JSON payload to your game server allocator (Agones, Unity Multiplay, or whatever you're using), then send each player a connection token with the server IP and port. The game server validates the token, confirms the player is on the expected list, and admits them. This keeps the handoff verifiable: nobody gets into the session unless the lobby explicitly sent them there. If game server allocation fails, return players to Assembling state with a clear error message.
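One way to build such a token is an HMAC-signed payload, sketched below with only the standard library. This is an assumption about the token format, not what any particular allocator expects; the shared secret, claim names, and 60-second lifetime are placeholders.

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-real-shared-secret"  # placeholder, not a real key


def issue_token(player_id, host, port, ttl_s=60, now=None):
    """Lobby side: sign the connection details for one player."""
    now = time.time() if now is None else now
    claims = {"player_id": player_id, "host": host, "port": port, "exp": now + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def validate_token(token, expected_players, now=None):
    """Game server side: return the claims, or None if the token is rejected."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None  # no signature separator at all
    expected_sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; check the signature before decoding anything.
    if not hmac.compare_digest(sig.encode(), expected_sig.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < now or claims["player_id"] not in expected_players:
        return None
    return claims
```

The expected_players set comes from the serialized lobby state the allocator received, so a valid signature alone is not enough: the player also has to be on the list for this specific match.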
Implementation Tips That Actually Help
A few things I'd do differently if I started over:
Use webhook callbacks for lobby events, not polling. Have your lobby service POST to a webhook endpoint when state changes: player joined, player left, lobby ready, queue started. This decouples your lobby service from whatever consumes lobby events, and it's far easier to debug than a polling loop with a five-second interval.
Handle disconnects at every state. Don't assume a player who disconnected during Staging is gone forever. They might reconnect within 10 seconds on a mobile connection. Store their player ID and lobby token in Redis with a 30-second TTL. If they reconnect before expiry, restore their slot. In my experience, this one change noticeably improves early-session retention.
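A rough sketch of that grace window, again with an in-memory dictionary standing in for Redis keys with a TTL. The class and method names are hypothetical.

```python
import time

RECONNECT_GRACE_S = 30


class SlotReservations:
    """In-memory stand-in for Redis keys with a 30-second TTL."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._held = {}  # player_id -> (lobby_token, expires_at)

    def on_disconnect(self, player_id: str, lobby_token: str) -> None:
        self._held[player_id] = (lobby_token, self._now() + RECONNECT_GRACE_S)

    def on_reconnect(self, player_id: str, lobby_token: str) -> bool:
        """True if the slot was still held and is now restored."""
        entry = self._held.get(player_id)
        if entry is None:
            return False
        held_token, expires_at = entry
        if self._now() >= expires_at or held_token != lobby_token:
            return False  # a wrong token must not consume the reservation
        del self._held[player_id]
        return True

    def expired(self):
        """Player IDs whose slots can now be released to new joiners."""
        now = self._now()
        gone = [pid for pid, (_, exp) in self._held.items() if exp <= now]
        for pid in gone:
            del self._held[pid]
        return gone
```

Calling expired() from the lobby tick gives you a single place to open slots and notify the remaining players, rather than scattering timeout checks across handlers.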
Log state transitions with full context. When your lobby transitions from Assembling to Staging, log the lobby ID, player list, timestamp, and trigger (full party vs. threshold met). When things go wrong in production, these logs are the only way to reconstruct what happened. Without them, you're guessing.
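A minimal sketch of such a log line, emitted as one JSON object per transition so it can be filtered by lobby ID later. The logger name and field names are assumptions.

```python
import json
import logging
import time

logger = logging.getLogger("lobby")  # hypothetical logger name


def log_transition(lobby_id, from_state, to_state, player_ids, trigger, now=None):
    """Emit one structured log line per state transition and return it."""
    record = {
        "event": "lobby_transition",
        "lobby_id": lobby_id,
        "from": from_state,
        "to": to_state,
        "players": sorted(player_ids),
        "trigger": trigger,  # e.g. "party_full" vs. "threshold_met"
        "ts": time.time() if now is None else now,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line
```

For example, log_transition("lobby-9", "ASSEMBLING", "STAGING", ["p2", "p1"], "party_full") records who was in the lobby and why it moved, which is usually the first thing you need when reconstructing a bad session.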
Test with artificial latency and disconnects before you think you're done. Use tc netem on Linux to simulate 300ms latency and 5% packet loss during lobby transitions. Disconnect the host mid-staging. Disconnect half the party during queue. Once I started simulating real network conditions, I was able to reproduce every lobby bug I'd found in production in our staging environment.
The lobby system won't get you on the app store feature page. Nobody reviews a game and says "the lobby was great." But it's the first real multiplayer experience your players have. Get it wrong and they'll quit before they see your actual game. Get it right and it disappears, which is exactly what you want.
