Users expect conversational AI to respond instantly. A 100ms delay feels snappy. A 500ms delay feels slow. Ad serving cannot add latency to the conversation experience. This constraint fundamentally changes how you architect an auction system. Traditional real-time bidding works with a 300ms latency budget. A conversational ad auction needs to complete end-to-end in under 100ms. Here is how we think about that problem.
Why Latency Matters for Conversational Ads
In display advertising, latency is a cost problem and a user experience problem, but it is not a functional constraint. A web page can take 2 seconds to load and still be usable. The ad loads asynchronously.
In conversational AI, ads are served inline with the response text. When a user asks a question, the AI generates an answer. If we need to run an ad auction, it happens synchronously. The user waits. The perception of "instant" response breaks at roughly 100-150ms of added latency. Beyond that, the conversation feels laggy.
This is not just about user experience. It is about the fundamental value proposition of conversational AI. Users choose these tools because they are fast. If we slow them down significantly to inject ads, we break the product.
Traditional RTB systems sidestep the latency problem by loading the ad asynchronously and reporting on the auction after the fact. That escape hatch does not exist for inline conversational ads. The alternative is to make the auction itself fast enough to run synchronously, then handle telemetry asynchronously.
The Latency Budget Breakdown
To hit a sub-100ms target, the auction orchestration needs to be surgical. Several components each consume time: network communication to and from the exchange, validation and enrichment of the bid request, concurrent queries to demand partners, scoring and selection logic, and response serialization. The critical path is typically the time spent waiting for bidder responses; everything else must be kept as lean as possible and overlapped with that wait wherever it can be. Network latency varies by geography: requests from distant regions carry inherent delays that you cannot eliminate, only mitigate through regional deployment.
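One way to make the budget concrete is to write it down as numbers and check that the stages actually fit under the target. The stage names and millisecond figures below are illustrative assumptions for the sketch, not measured values:

```python
# Illustrative latency budget for a sub-100ms auction. The stage names and
# millisecond figures are assumptions, not measurements.
BUDGET_MS = {
    "ingress_and_validation": 5,    # parse and enrich the bid request
    "bidder_wait": 70,              # concurrent demand-partner queries (critical path)
    "scoring_and_selection": 10,    # rank bids, apply floors, pick a winner
    "serialization_and_egress": 5,  # encode and return the response
}

def remaining_headroom(budget: dict, target_ms: int = 100) -> int:
    """Headroom left under the end-to-end target after all budgeted stages."""
    return target_ms - sum(budget.values())

headroom = remaining_headroom(BUDGET_MS)  # absorbs network jitter and outliers
```

Whatever numbers you choose, the exercise forces the key decision: the bidder wait dominates, so every other stage has to be trimmed or overlapped with it.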
The constraints are real and non-negotiable. The conversation cannot wait for a slow auction. This constraint pushes you toward aggressive optimizations: pre-computation of data that would normally be fetched at request time, timeouts on bidder queries so slow responders don't block faster ones, and architectural choices that eliminate unnecessary round-trips.
Architecture: Edge, Pre-Computation, and Parallelization
Three core techniques make this latency target achievable:
1. Edge Deployment
The auction coordinator runs on edge nodes geographically distributed near the clients. Deploy to major cloud regions and use latency-based routing. A request from San Francisco hits a West Coast node. A request from London hits an EU node. This eliminates one round-trip to a centralized exchange.
Each edge node keeps a replicated copy of critical data: bid floor lookups, partner health status, and context classification results. This data is synced from the main exchange every 30 seconds. Staleness is acceptable because the data changes slowly and consistency is not required across all nodes.
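A minimal sketch of that node-local replica, assuming a hypothetical `fetch_snapshot` callable that pulls the latest data from the main exchange:

```python
import time

class ReplicatedLookup:
    """Node-local replica of slow-changing exchange data (bid floors,
    partner health). `fetch_snapshot` is a hypothetical callable that
    returns the latest snapshot as a dict."""

    def __init__(self, fetch_snapshot, sync_interval_s: float = 30.0):
        self._fetch = fetch_snapshot
        self._interval = sync_interval_s
        self._data = fetch_snapshot()
        self._last_sync = time.monotonic()

    def get(self, key, default=None):
        # Serve from the local copy; staleness of up to one sync interval
        # is acceptable because the underlying data changes slowly.
        if time.monotonic() - self._last_sync >= self._interval:
            self._data = self._fetch()
            self._last_sync = time.monotonic()
        return self._data.get(key, default)
```

A production version would refresh in a background thread rather than on the request path, but the trade-off is the same: reads never block on the network.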
2. Pre-Computed Bid Floors
One major latency sink in a traditional auction is fetching bid floors from a database at request time. The approach we've explored is to pre-compute bid floors for common context patterns and load them into memory on each edge node. Bid floors can be computed and published periodically, allowing the request-time lookup to be a simple in-memory operation instead of a database query.
This technique trades freshness for speed. Your bid floors are slightly stale—reflecting recent trends rather than real-time signals—but they're available instantly without network I/O. For many use cases, this is acceptable. The question to ask is: how fresh do your floors really need to be?
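A sketch of what the request-time lookup reduces to, assuming a hypothetical floor table keyed by context category and geo, published periodically by an offline job:

```python
# Hypothetical pre-computed floor table published by an offline job.
# Keys are (category, geo); values are CPM floors in USD. Wildcards
# provide fallbacks for patterns with no specific entry.
PRECOMPUTED_FLOORS = {
    ("travel", "US"): 4.50,
    ("travel", "*"): 3.00,
    ("*", "*"): 1.00,  # global default
}

def lookup_floor(category: str, geo: str) -> float:
    """In-memory floor lookup: most specific key first, then wildcards.
    No network I/O on the request path."""
    for key in ((category, geo), (category, "*"), ("*", "*")):
        if key in PRECOMPUTED_FLOORS:
            return PRECOMPUTED_FLOORS[key]
    return 0.0
```

The lookup is a few dictionary probes instead of a database round-trip, which is the entire point of the technique.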
3. Async Bidding with Timeout
The approach we've found effective is to query multiple bidders concurrently but not wait indefinitely for all of them to respond. You set a time window—long enough for most bidders to respond but short enough that you don't break the user experience. If a bidder doesn't respond in time, you move forward with the bids you have.
This is a hard trade-off. Fast bidders get consistently selected in auctions because they always make the cutoff. Slower bidders miss impressions. But you've chosen to optimize for user experience over participation equity. The alternative—waiting for all bidders—breaks the product.
The timeout window itself becomes a negotiation point with demand partners. If you're too aggressive, bidders complain about missing impressions. If you're too generous, your latency target becomes unachievable. Finding the right balance is an empirical question for your market.
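The pattern above can be sketched with asyncio: query everyone concurrently, keep whatever lands inside the window, cancel the rest. The bidder names, latencies, and the window size here are hypothetical:

```python
import asyncio
import random

async def query_bidder(name: str, latency_s: float) -> dict:
    """Stand-in for an HTTP call to a demand partner."""
    await asyncio.sleep(latency_s)
    return {"bidder": name, "cpm": round(random.uniform(1.0, 5.0), 2)}

async def run_auction(bidders: dict, timeout_s: float = 0.07) -> list:
    """Query all bidders concurrently; collect responses that arrive
    inside the window and cancel the stragglers."""
    tasks = [asyncio.create_task(query_bidder(n, lat)) for n, lat in bidders.items()]
    done, pending = await asyncio.wait(tasks, timeout=timeout_s)
    for t in pending:
        t.cancel()  # slow responders are dropped, not awaited
    return [t.result() for t in done]

# Two fast bidders make the ~70ms cutoff; the slow one is cancelled.
bids = asyncio.run(run_auction({"fast_a": 0.005, "fast_b": 0.02, "slow_c": 0.5}))
```

Note that the timeout caps the whole wait, not each bidder individually, which is what keeps the critical path bounded.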
4. Connection Pooling and Pipelining
Another latency killer in request-response systems is connection setup. Establishing a fresh connection to each bidder for every request adds meaningful overhead: TCP handshakes, TLS negotiation, and so on. The technique we've explored is to maintain persistent connections and pipeline requests through them, amortizing the setup cost across many requests until the per-request connection cost is negligible.
In practice, this means managing connection state—detecting when connections drop, reconnecting with appropriate backoff, and monitoring pool health. The operational complexity is worth it for the latency savings.
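The pool mechanics can be sketched independently of any particular HTTP client. Here `factory` is any callable that opens a connection (for example, an `http.client.HTTPSConnection` to a bidder endpoint); it is left abstract so the pattern stands on its own:

```python
import queue

class ConnectionPool:
    """Minimal persistent-connection pool sketch. `factory` opens a new
    connection; the setup cost is paid once per slot, up front."""

    def __init__(self, factory, size: int = 4):
        self._factory = factory
        self._pool = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        return self._pool.get()  # blocks if all connections are in use

    def release(self, conn, broken: bool = False):
        # Replace broken connections instead of returning them to the pool,
        # so a dropped connection never gets handed to the next request.
        self._pool.put(self._factory() if broken else conn)
```

A real pool would add health checks and reconnect backoff, which is exactly the operational complexity the paragraph above refers to.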
Graceful Degradation Under Load
In a system optimized for speed, degradation is inevitable. Components fail. Partners get slow. Your challenge is to degrade gracefully without breaking the user experience.
Some patterns worth exploring: circuit breakers that temporarily pause requests to chronically slow partners, preventing them from dragging down the entire auction; fallback paths that serve pre-computed or cached results when real-time computation would be too slow; and smart defaults that do not require full auction execution in edge cases. The goal is to have multiple quality tiers available: try the full auction first, but be ready to serve something simpler if the full path times out.
The key insight is that perfect is the enemy of good. A fast auction with 80% of potential bidders beats a slow auction with 100% of them. A served ad, even if it's not the optimal winner, beats no ad at all. Build in these escape hatches early.
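A minimal sketch of the circuit-breaker pattern mentioned above, with hypothetical threshold and cooldown values: after a run of consecutive failures, a partner is skipped for a cooldown window, then allowed one probe request.

```python
import time

class CircuitBreaker:
    """Per-partner circuit breaker sketch: after `threshold` consecutive
    failures, skip the partner for `cooldown_s` seconds."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: let one request probe
            self.failures = 0
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The auction coordinator checks `allow()` before including a partner in the fan-out, so a chronically slow bidder stops consuming slots in the timeout window.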
OpenRTB 2.6 Compliance
Despite aggressive optimization, maintaining full OpenRTB 2.6 compliance is non-negotiable. Bid requests and responses should follow the spec exactly. This allows demand partners to drop in their existing RTB stack without custom integration.
The speed comes from how we orchestrate the protocol, not from deviating from it. This is important for ecosystem interoperability.
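For illustration, here is a minimal OpenRTB-style bid request. `id`, `imp`, `bidfloor`, `at`, `tmax`, and `cur` are standard spec fields; the values are hypothetical, and a real request would carry many more fields (device, user, and placement objects among them):

```python
import json

# Minimal OpenRTB 2.x-style bid request sketch; values are illustrative.
bid_request = {
    "id": "req-123",                  # unique request ID
    "imp": [{
        "id": "1",
        "bidfloor": 2.50,             # pre-computed floor, CPM
        "bidfloorcur": "USD",
    }],
    "at": 2,                          # second-price auction
    "tmax": 70,                       # ms the exchange will wait for bids
    "cur": ["USD"],
}
payload = json.dumps(bid_request)
```

Note how `tmax` is where the latency budget surfaces in the protocol itself: the bidder-wait window is communicated to partners rather than being a private implementation detail.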
The Latency Spectrum and Trade-Offs
In practice, hitting your target consistently is harder than hitting it on average. Latency is not uniform. Requests from some regions are inherently slower due to geographic distance. Some bidders are consistently faster than others. Some queries are easier than others. Building a sub-100ms system means optimizing the common case while accepting that outliers will miss the target.
The question becomes: what happens when you exceed your target? If the auction takes 150ms instead of 100ms, the conversation still works. It feels slightly slower, but not broken. You need to understand the perceptual threshold for your use case and be realistic about what percentage of requests will exceed it. Then decide if that's acceptable or if you need to make different trade-offs.
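Answering that question starts with measuring the distribution rather than the average. A small sketch over hypothetical latency samples, using a nearest-rank percentile:

```python
# Hypothetical end-to-end latency samples in ms for one edge node.
samples = [62, 71, 68, 74, 90, 65, 80, 150, 69, 73, 66, 77, 95, 70, 72, 210]

def percentile(data, p: float) -> float:
    """Nearest-rank percentile; p in [0, 100]."""
    data = sorted(data)
    k = max(0, min(len(data) - 1, round(p / 100 * len(data)) - 1))
    return data[k]

p50 = percentile(samples, 50)
p95 = percentile(samples, 95)
# Fraction of requests that miss the 100ms target outright.
over_target = sum(1 for s in samples if s > 100) / len(samples)
```

Here the median is comfortably under budget while the tail is not, which is the typical shape: the decision is about how much tail you tolerate, not about the mean.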
Trade-Offs: Depth vs. Speed
Building for latency forces explicit trade-offs. A traditional display ad exchange queries many bidders in parallel, betting that the marginal bidder might carry a higher CPM. Under a tight latency constraint, querying a long tail of bidders and waiting for them is no longer viable. You have to pre-select a smaller set: the bidders most likely to win based on historical patterns and relevance signals.
This is a conscious choice with real consequences. Fewer bidders may mean lower average CPMs. But it also means more predictable latency and a smaller blast radius when things go wrong. The question is whether you can make up the revenue difference through other mechanisms: higher conversion rates on better-targeted ads, premium positioning, or other value props that don't rely on exhaustive competition.
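One simple form of that pre-selection: rank bidders by historical win rate, filtered to those whose tail latency fits the window. The bidder names and statistics below are hypothetical:

```python
# Hypothetical per-bidder history: name -> (historical win rate, p95 latency ms).
HISTORY = {
    "alpha": (0.30, 45),
    "beta":  (0.22, 60),
    "gamma": (0.15, 120),  # too slow for a ~70ms window
    "delta": (0.10, 50),
}

def preselect(history: dict, k: int = 2, max_p95_ms: int = 70) -> list:
    """Top-k bidders by win rate, among those that fit the latency window."""
    eligible = [(name, wr) for name, (wr, p95) in history.items() if p95 <= max_p95_ms]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible[:k]]
```

In practice the ranking signal would blend win rate with relevance to the current context, but the shape of the decision is the same: a bounded fan-out chosen before the auction starts.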
Looking Forward
The interesting frontier for conversational ad auctions is predictive bidding: pre-scoring likely winners based on historical patterns before the request even arrives. In theory, this could compress latency significantly. The challenge is handling cold requests and novel contexts where historical data is sparse. It is an open problem worth watching.
More broadly, conversational advertising demands a different engineering mindset than display. The constraints are synchronous, the latency budgets are tight, and the traditional RTB playbook does not map cleanly. The teams that build well here will be the ones that treat latency as a first-class design constraint, not an afterthought.