Observability and the Three Dashboards We Actually Live With
Metrics, traces, and the training / infra / serving dashboard layout that keeps an eight-specialist C++ ensemble debuggable at 3am.

Every serving system eventually gets an observability story. Most are bad: panels accrete one at a time, each answering a question that mattered once, until forty charts fight for a single page and nobody knows which to trust. We have been through that cycle twice, once on an earlier research cluster and once on the current serving stack for the eight-specialist C++ ensemble. This post is what we landed on after both rounds of pruning.
Why this matters
Observability is the part of the system that decides how long an incident takes and whether the post-mortem is grounded. Get it wrong and on-call stares at forty charts that all look slightly off. Get it right and the right owners see the right signal first, with a drill-down path that does not require asking the original author for a screenshot. The whole observability layer here reduces to three dashboards, four categories of metric, two kinds of trace, and a small set of rules about what is allowed on each surface. The rules matter more than the panels.
1. The three dashboards
We have training, infra, and serving. Each has one owner, one SLO story, and a hard rule against bleeding into the others.
Training dashboard
Audience: the pretraining and post-training engineers running the weekly specialist refresh. Time horizon: the duration of one training run, hours to days. Primary signals:
- Step loss and evaluation loss per specialist, per data mix, per phase of the curriculum (the simple short-context mix, the context-graph 16K mix, the repo-graph 64K mix, the structure-aware enriched mix).
- Tokens per second per device, effective MFU, and the FLOP breakdown per component (attention, SSM, MoE experts, norms, embedding).
- Gradient norms, parameter update norms, optimizer state statistics, and the loss-weight schedule (mtp_lambda, moe_aux_loss_weight, top_lambda, stp_lambda, gateskip_lambda).
- FP8 amax history where FP8 is in play, NVFP4 scale statistics where NVFP4 is in play.
- MoE load-balance telemetry: per-expert token counts, aux loss, router z-loss, capacity utilization.
- Data pipeline health: shard-read lag, packing ratio, document-mask statistics, FIM-rate effective values after runtime mutation.
- Checkpoint cadence, latest successful checkpoint, staleness.
Not on the training dashboard: serving latency, infra health of the serving cluster, and cost. The training dashboard has one SLO that leaves it: the weekly specialist refresh must produce a checkpoint that clears its declared correctness threshold.
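The MoE load-balance signals listed above (per-expert token counts, capacity utilization) can be sketched as follows. This is a minimal illustration, not the in-tree telemetry: the function name, the top-1 routing assumption, and the `capacity_factor` default are all hypothetical.

```python
from collections import Counter


def moe_load_balance(expert_ids, num_experts, capacity_factor=1.25):
    """Summarize one step of router assignments.

    expert_ids: one expert index per routed token (top-1 routing assumed).
    capacity_factor: hypothetical knob; per-expert capacity is taken to be
        capacity_factor * tokens / num_experts.
    """
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    counts = Counter(expert_ids)
    tokens = len(expert_ids)
    per_expert = [counts.get(e, 0) for e in range(num_experts)]
    capacity = capacity_factor * tokens / num_experts
    # Fraction of each expert's capacity in use, clipped at 1.0.
    utilization = [min(c, capacity) / capacity if capacity else 0.0
                   for c in per_expert]
    even_share = tokens / num_experts
    # 1.0 is perfectly balanced; 2.0 means the busiest expert sees twice its share.
    max_skew = max(per_expert) / even_share if even_share else 0.0
    return {
        "per_expert_tokens": per_expert,
        "capacity_utilization": utilization,
        "max_skew": max_skew,
    }
```

A panel would plot `max_skew` over steps; a sustained climb is the kind of drift toward a few overloaded experts that the load-balance panel exists to catch.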
Infra dashboard
Audience: the cluster engineers running the actual silicon. Time horizon: ongoing, with a rolling 7-day default and the ability to zoom to one minute. Primary signals:
- GPU utilization, HBM utilization, HBM ECC error rates, SM clock behavior, thermal throttling.
- NVLink/NVSwitch/PCIe bandwidth counters and link error counts.
- Node health: memory pressure, page cache behavior, network throughput per interface, NIC error counts, kernel message anomalies.
- Storage: filesystem throughput on the dataset mounts, latency on the checkpoint store, space headroom on each tier.
- Power and thermal envelope at the rack-class level (we track "rack class" and "position within a rack class", never identifiers).
- Schedule churn: which accelerator pool is holding which class of workload, eviction and preemption rates at the scheduler level, how often a replica was moved.
- Cost: GPU-hours by category of workload (training, eval, serving, research), dollars per specialist-training-run at the category level, serving cost per million tokens delivered.
- Software inventory: CUDA driver version, PyTorch version, TE version, kernel library versions per deployment class. Never individual machines, only the classes.
Not on the infra dashboard: model quality or loss, per-request serving traces, or anything identifying a specific host, rack row, or region. Categories only.
Infra SLOs are simple and boring on purpose: capacity headroom on each accelerator pool above a threshold, ECC error rate below a threshold, checkpoint-write p99 under a threshold, cost variance inside a monthly band.
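The "categories only" rule amounts to dropping the host label before anything reaches a panel. A minimal sketch, assuming samples arrive as dictionaries with hypothetical `host`, `accelerator_class`, and `gpu_util` keys:

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_class(samples):
    """Collapse per-host utilization samples into per-class summaries.

    samples: iterable of dicts with 'host', 'accelerator_class', 'gpu_util'.
    The host identifier is read but never emitted.
    """
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample["accelerator_class"]].append(sample["gpu_util"])
    return {
        cls: {
            "mean_util": mean(values),
            "min_util": min(values),
            "max_util": max(values),
            "samples": len(values),
        }
        for cls, values in sorted(buckets.items())
    }
```

The per-host rows still exist in the data store for drill-down; only the aggregated view is what a dashboard panel would query.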
Serving dashboard
Audience: the product and serving engineers; the on-call rotation. Time horizon: now, with a rolling 24-hour trend and the ability to zoom to a single second. Primary signals, per specialist:
- Time-to-first-token (TTFT) p50/p95/p99.
- Inter-token latency (ITL) p50/p95/p99.
- Queue depth p95, admission-to-first-token p95.
- Preemption rate, preemption-depth distribution, fraction of responses carrying the preempted_once flag.
- Paged-KV block utilization, block-pool pressure, prefix-cache hit rate per adapter.
- Per-adapter swap rate inside the scheduler.
- Speculative-decode acceptance where enabled; rejection-driven rollback overhead.
- Router signals: primary-specialist confidence distribution, shadow dispatch rate, circuit-breaker state per specialist.
- Correctness SLO: rolling 24-hour correctness score per specialist, marked degraded when it drifts below its declared pass rate.
- Error surfaces: 4xx/5xx rate to callers, tool-call failure rate, token-stream disconnect rate.
Not on the serving dashboard: training loss, raw GPU counters from a specific machine (drill-down starts on infra), or aggregate ensemble latency as a single number. We do not publish one, and the dashboard does not either.
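To make the TTFT and ITL panels concrete, here is a small sketch that derives both from per-request token timestamps and summarizes them per specialist. The input shape, the function names, and the nearest-rank percentile are assumptions made for illustration; a production stack would more likely use histograms in the metrics backend.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def request_latencies(arrival_ms, token_times_ms):
    """Return (TTFT, list of inter-token gaps) for one request, in ms."""
    ttft = token_times_ms[0] - arrival_ms
    itl = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    return ttft, itl


def per_specialist_summary(requests):
    """requests: iterable of (specialist, arrival_ms, token_times_ms)."""
    ttfts, gaps = defaultdict(list), defaultdict(list)
    for specialist, arrival_ms, token_times_ms in requests:
        ttft, itl = request_latencies(arrival_ms, token_times_ms)
        ttfts[specialist].append(ttft)
        gaps[specialist].extend(itl)
    summary = {}
    for specialist in ttfts:
        summary[specialist] = {
            "ttft": {f"p{q}": percentile(ttfts[specialist], q)
                     for q in (50, 95, 99)},
            # A request with a single token contributes no inter-token gaps.
            "itl": {f"p{q}": percentile(gaps[specialist], q)
                    for q in (50, 95, 99)} if gaps[specialist] else {},
        }
    return summary
```

The summary is keyed by specialist and deliberately has no ensemble-wide roll-up, in line with the rule that no single aggregate latency number is published.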
Dashboard split, at a glance
| Dashboard | Owner | Time horizon | Primary contract |
|---|---|---|---|
| Training | Training leads | One run (hours to days) | Weekly checkpoint passes its eval floor |
| Infra | Infrastructure owners | Rolling 7d, zoom to 1m | Pools healthy, costs in band, no identifying labels |
| Serving | Serving leads + on-call | Now + 24h, zoom to 1s | Per-specialist TTFT/ITL/correctness SLOs |
2. The four metric categories
Every emitted metric falls in one of four buckets. The bucket dictates which dashboard is allowed to consume it.
- SLO metrics. The named contract numbers. Few, stable, alerting attached.
- Component metrics. The component signals that drive an SLO: KV pool pressure, MoE load balance, scheduler queue depth.
- Health metrics. Liveness and saturation signals: GPU utilization, NIC errors, page-cache stalls.
- Outcome metrics. Eval scores, compilation-pass rate, end-to-end correctness on canary traffic.
SLO metrics are public. Component metrics are dashboard panels behind a drill-down. Health metrics live almost entirely on infra. Outcome metrics span training and serving and are the only category that is allowed on more than one dashboard, in a fixed strip we call the "outcome bar".
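One way to enforce the bucket rule is a lint over panel definitions. The sketch below assumes each metric declares a bucket and a single home dashboard; the names `Bucket`, `allowed_dashboards`, and `panel_violations` are hypothetical and not part of any described tooling.

```python
from enum import Enum

DASHBOARDS = frozenset({"training", "infra", "serving"})


class Bucket(Enum):
    SLO = "slo"
    COMPONENT = "component"
    HEALTH = "health"
    OUTCOME = "outcome"


def allowed_dashboards(bucket, home):
    """Dashboards that may show a metric whose home dashboard is `home`."""
    if home not in DASHBOARDS:
        raise ValueError(f"unknown dashboard: {home}")
    if bucket is Bucket.OUTCOME:
        # Outcome metrics are the only ones allowed everywhere (the outcome bar).
        return set(DASHBOARDS)
    return {home}


def panel_violations(panels):
    """panels: iterable of (metric_name, bucket, home, shown_on).

    Returns the names of metrics displayed on a dashboard they do not belong to.
    """
    return [name for name, bucket, home, shown_on in panels
            if shown_on not in allowed_dashboards(bucket, home)]
```

Run as a check on dashboard changes, a lint like this would turn the owner's job of rejecting cross-domain panels into a failing build instead of a review comment.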
That bucket rule also protects the metrics backend from becoming the next incident. MoE routing, adapter-heavy serving, and specialist fan-out all create more possible labels than a live dashboard should carry. We keep the raw (layer, expert, specialist, adapter) detail in receipts and post-incident drill-downs, then publish low-cardinality live surfaces such as per-specialist router z-loss distributions, capped capacity-skew summaries, and queue-depth bands. "MoE routing we actually shipped" and "Profiler and performance reports" are the adjacent posts because one defines the routing signals and the other defines where the heavy evidence belongs.
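The reduction from raw labels to low-cardinality surfaces might look like the sketch below. The band edges, the skew cap, and both function names are illustrative assumptions, not the values used in production.

```python
from collections import defaultdict


def queue_depth_band(depth, edges=(4, 16, 64)):
    """Map a raw queue depth to a coarse band label (edges are hypothetical)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    lower = 0
    for upper in edges:
        if depth <= upper:
            return f"{lower}-{upper}"
        lower = upper + 1
    return f"{lower}+"


def capped_capacity_skew(tokens_by_layer_expert, cap=4.0):
    """Reduce {(layer, expert): tokens} to a single number per specialist.

    Skew is the busiest expert's load divided by the mean load within a layer;
    the worst layer wins, and the value is capped so one outlier cannot
    stretch the panel's axis.
    """
    per_layer = defaultdict(list)
    for (layer, _expert), tokens in tokens_by_layer_expert.items():
        per_layer[layer].append(tokens)
    worst = 0.0
    for loads in per_layer.values():
        mean_load = sum(loads) / len(loads)
        if mean_load > 0:
            worst = max(worst, max(loads) / mean_load)
    return min(worst, cap)
```

The live series then carries one band label and one skew value per specialist, while the full (layer, expert) table stays in the receipts.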
3. Two kinds of trace
Request traces. OpenTelemetry-style spans from router admission through scheduler dispatch, prefill, decode, optional tool calls, and out to the caller (with draft/verify markers when speculative decoding is on). Sampling is adaptive: every error, every preempted request, every SLO breach, plus a low base rate of healthy traffic.
In practice the keep-or-drop decision belongs beside the collector, not inside the hot decode loop. The worker should emit the cheap span envelope, then let the downstream collector keep every error, preemption, and SLO-breach trace after the request finishes. That preserves the interesting tail without asking the serving path to re-evaluate a heavyweight sampling policy on every healthy token stream.
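The collector-side decision can be sketched as a pure function over a finished trace. The `FinishedTrace` fields and the 1% base rate are assumptions for illustration; this is not the API of any particular collector.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FinishedTrace:
    trace_id: str
    had_error: bool
    preempted: bool
    slo_breached: bool


def keep_trace(trace, base_rate=0.01):
    """Decide after the request finishes whether to retain its trace."""
    # The interesting tail is always kept.
    if trace.had_error or trace.preempted or trace.slo_breached:
        return True
    # Healthy traffic: hash the trace id so every collector replica makes the
    # same decision for the same trace without coordinating.
    digest = hashlib.sha256(trace.trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(base_rate * 2**64)
```

Hashing the trace id rather than drawing a random number keeps the decision reproducible, which helps when several collector replicas see spans from the same request.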
Kernel traces. Per-step GPU traces captured through the in-tree telemetry hooks and, for deeper questions, Nsight Systems on demand. Heavyweight; off by default. The serving and training dashboards have a button that says "capture a kernel trace for the next N steps on this specialist's replica class". Captured traces land in a time-bounded store and can be attached to incident tickets. In practice this should be a bounded dormant capture that operators arm for the next window, not a mid-incident process restart.
The division matters. A request trace tells you that a specific request spent 340 ms in admission, 120 ms in prefill, and produced a first token 460 ms after arrival; it cannot tell you that the prefill was slow because an SSM kernel took an unexpected compile path. A kernel trace tells you exactly that and can be unintelligible without the request context. Keeping them separate, with cross-links, keeps each trace type legible.
4. The rules
A small set of rules does more work than any individual panel.
- One owner per dashboard. The owner is responsible for pruning, for adding panels, and for rejecting cross-domain panels. When we did not have this rule, every dashboard accreted an "other" section that became a third of the surface area.
- Every panel has an SLO or it has a reason. Either the panel is an SLO, a direct drill-down from one, or a component metric with a specific incident in its history. "Nice to have" is not a reason. Reviewed quarterly with teeth.
- Category granularity in the dashboard layer. Metrics are emitted with full labels; dashboards consume category-level aggregations. A GPU-utilization panel on infra shows utilization by accelerator class, not a per-host heatmap. Host-level views live in the data store, not the dashboard.
- No screenshots in tickets, only links. The link includes the time range and the specialist filter. Screenshots rot.
- No Christmas-tree correlation panels. We do not put loss curves next to serving latency next to GPU utilization to look for correlations. Cross-domain analysis goes through the outcome bar and explicit post-incident work, not a unified dashboard.
- Alert on SLO breach, not metric threshold, with a minimum duration. Dashboards display thresholds where useful; alerting fires only on the SLO. Every duplicated-threshold system eventually drifts.
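The last rule, breach plus minimum duration, can be illustrated with a small evaluator over a time series of one SLO value. This is a sketch of the idea only; the function name and the sample shape are assumptions, and a real deployment would leave this to the alerting system.

```python
def breach_alerts(samples, threshold, min_duration_s):
    """Return the timestamps at which an alert would fire.

    samples: time-ordered (timestamp_s, slo_value) pairs; a value below
    `threshold` counts as a breach. An alert fires once per sustained breach,
    at the first sample where the breach has lasted at least min_duration_s.
    """
    fired = []
    breach_start = None
    alerted = False
    for timestamp_s, value in samples:
        if value < threshold:
            if breach_start is None:
                breach_start = timestamp_s
                alerted = False
            if not alerted and timestamp_s - breach_start >= min_duration_s:
                fired.append(timestamp_s)
                alerted = True
        else:
            # Recovery resets the clock, so brief dips never page anyone.
            breach_start = None
    return fired
```

With a 120-second minimum, a single bad sample followed by recovery produces no alert, while a breach that persists across three consecutive minute-spaced samples fires exactly once.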
5. What breaks and how the dashboards help
A few representative incidents, sketched at the category level, show the three-dashboard split doing its job.
A specialist's p95 ITL jumps
The serving dashboard spikes on ITL, paged-KV block-pool pressure rises, prefix-cache hit rate falls, and the rest of the ensemble does not move. Component drill-down shows adapter-swap rate climbing. The incident is a caller sending a high-variance adapter stream; the fix is a caller-side circuit breaker.
The weekly training run plateaus
Training dashboard shows flat loss on one specialist; the MoE load-balance panel drifts toward two overloaded experts; z-loss climbs. Data-pipeline drill-down shows the FIM rate was mutated mid-run and the effective value is not what the run config claims. Fix: a data-pipeline validation guard. Infra and serving stay untouched.
HBM ECC error rate rises on a class of accelerator
Infra fires first. Serving stays green because the scheduler is already moving workload off the degraded pool; the preemption-rate panel ticks up briefly and settles. Training shows nothing because the affected pool is not carrying a training workload this week. Infra handles it; the other teams see it only through the outcome bar.
In each case, the right dashboard fires, the right team drills down, and the other two stay out of the way. That is what the split is for.
6. The outcome bar
The one place the three dashboards share is the outcome bar at the top: a single thin strip with the rolling correctness pass rate per specialist, the rolling end-to-end correctness on canary traffic, and the latest published evaluation scores from the most recent checkpoints. It is the only cross-domain surface we kept, and it exists so that all three teams open their dashboard and see the same product-truth number first. If the outcome bar is green and the rest is loud, on-call calms down; if the outcome bar is red and the rest is green, on-call escalates regardless. It costs little to render and resolves a surprising number of arguments.
A minimal Prometheus surface to back the bar:
# Correctness pass rate per specialist, rolled to 1h
- record: project:correctness_pass_rate:1h
  expr: |
    sum by (specialist) (rate(project_eval_pass_total[1h]))
    / sum by (specialist) (rate(project_eval_total[1h]))
# Canary correctness, rolled to 15m
- record: project:canary_correctness:15m
  expr: avg by (specialist) (project_canary_pass_ratio)
# Outcome-bar SLO: alert only on the bar, never on the panels
- alert: OutcomeBarRed
  expr: project:correctness_pass_rate:1h < 0.85
  for: 30m
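The on-call reading of the bar (red bar escalates regardless, green bar with loud panels means a calm drill-down) fits in a few lines. The sketch reuses the 0.85 floor from the alert rule above; the function name and the returned labels are hypothetical.

```python
def triage(bar_pass_rates, floor=0.85, panels_loud=False):
    """Decide the on-call posture from the outcome bar.

    bar_pass_rates: {specialist: rolling correctness pass rate in [0, 1]}.
    Returns (posture, specialists below the floor).
    """
    red = sorted(name for name, rate in bar_pass_rates.items() if rate < floor)
    if red:
        # A red bar escalates even when every other panel is green.
        return "escalate", red
    if panels_loud:
        # Green bar, noisy panels: investigate without paging further.
        return "calm-drill-down", []
    return "ok", []
```

The ordering of the checks is the point: panel noise is only consulted once the bar is known to be green.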
What we kept and what we threw away
Kept: three dashboards with one owner each, four metric categories, two trace systems with cross-links, adaptive trace sampling biased toward interesting traffic, an outcome bar as the only cross-dashboard surface, category-granularity labels in the dashboard layer, SLO-gated alerting.
Threw away: a unified "everything" dashboard, correlation panels across domains, host-level heatmaps in the dashboard layer, duplicated threshold logic between dashboards and alerts, per-engineer dashboards that did not survive a 30-day review, screenshots in incident tickets.
The observability story, like the serving stack it watches, gets better the more it respects boundaries. Training, infra, and serving are three different teams asking three different questions on three different time horizons. Forcing them to share a single pane of glass was, in retrospect, the most common mistake we made and the one the current layout specifically refuses to make again.
Terms used in this article
Start here for quick definitions, then follow the linked posts for deeper context.
- SLO: A single service-level objective such as deadline miss rate or first-token latency floor that the serving stack publishes per specialist rather than as one blended ensemble number.
- Paged attention: The decode-side cache contract where attention reads keys and values through fixed-size block indirection instead of one contiguous per-sequence buffer.
- Prefix cache: The reuse policy above paged KV that lets later requests point at already-filled cache blocks when they begin with the same validated token prefix.
- Scheduler: The per-specialist serving control loop that admits, batches, preempts, and commits work after routing but before the decode kernel touches KV state.
- SLOs: The small set of per-specialist service-level objectives that the router and operator dashboards use to decide admission, shadowing, and rollback behavior.
- Serving: How we actually serve an eight-specialist C++ ensemble: a top-level router, per-specialist continuous-batch schedulers, paged KV per model,…
- NVFP4: NVIDIA's four-bit floating-point inference/training format family used when the lane can tolerate more aggressive quantization than FP8.
- FP8: Eight-bit floating-point training and inference formats used to trade precision for throughput and memory on recent accelerator lanes.
- MoE: Token Choice vs Expert Choice, null-expert debugging, gating stability, and the production routing decisions behind the MegaCpp SLM Ensemble.
- Packed rows: Why packed rows are the real boundary between the data pipeline and the model, and why MegaCpp treats row packing as a schema contract rather than a…
- CUDA: NVIDIA's GPU programming stack: compiler, runtime, driver, libraries, and kernel toolchain used by CUDA training and inference lanes.