# MegaCpp Model Wiring Examples

This directory is the checked-in example index for MegaCpp-specific model,
launcher, and runtime wiring. Use it when an article needs a public-safe anchor
for NAM56R hybrid patterns, GB10 bring-up, TileLang TMA legality, MLA adapter
seams, or the compact runtime receipts that sit above the lower-level kernel
and XLA example trees.

Quick paths:
- [Hybrid architecture patterns (NAM56R)](#nam56r-hybrid-architecture-patterns)
- [GB10 / Blackwell bring-up](#gb10-blackwell-bring-up)
- [TileLang TMA and shared-memory legality](#tilelang-tma-and-shared-memory-legality)
- [MLA integration and Sparse MLA](#mla-integration-and-sparse-mla)
- [Runtime and loss-contract receipts](#runtime-and-loss-contract-receipts)
- [Related example catalogs](#related-example-catalogs)

Reading order:
- start with the recipe and launch surfaces to understand the public NAM56R contract
- next, use the compact runtime receipts to see each issue in a smaller, teaching-sized sample
- use the near-copy receipts when you need a shape or layout contract that stays closer to the MegaCpp proof surface
- for a full step-by-step GB10 reproduction bundle with CUDA/C++ source, cubin patchers, and a separate, deeper driver lane, use `gb10_repro_bundle/README.md`

<a id="nam56r-hybrid-architecture-patterns"></a>
## Hybrid architecture patterns (NAM56R)

These files are the best entry points when a post needs a checked-in decoder for
MegaCpp-specific hybrid notation, launch policy, and recipe shaping.

- `nam56r_nemo_recipe_sample.py`: authoritative NAM56R recipe values and CLI emission
- `nam56r_nemo_recipe_contract_sample.py`: recipe-object contract for NAM56R-style authoring
- `nam56r_block_taxonomy_sample.py`: decoded block-letter taxonomy for `A / E / M / R`
- `nam56r_pattern_composition_sample.py`: expanded pattern counts and layer-rank map
- `nam56r_feature_placement_sample.py`: where the main feature families attach in NAM56R
- `nam56r_megatron_plan_sample.py`: explicit hybrid-plan translation into Megatron-native roles
- `nam56r_megatron_recipe_nearcopy.py`: fail-closed near-copy translation for the same plan
- `nam56r_launch_contract_sample.py`: split between generated native args and fixed launch policy
- `nam56r_launch_recipe_nearcopy.py`: near-copy launcher split when the exact launch contract matters
- `nam56r_launcher_profile_sample.py`: grouped launcher env/profile controls
- `nam56r_runtime_patch_surface_sample.py`: runtime patch surfaces layered on top of the recipe
- `nam56r_cuda_graph_launcher_sample.sh`: shell-side CUDA-graph launcher example
- `fail_closed_pattern_translation_sample.py`: fail-closed translator for hybrid block strings
- `nemotron_recipe_to_megatron_sample.py`: compact Nemotron-style recipe lowered into Megatron-native args
- `megatron_args_sample.py`: argument shaping for Megatron-style launch flows

Use this group when an article needs a public-safe anchor for NAM56R, pattern
strings such as `AEMEAEMEAEMR`, or the boundary between recipe declaration and
launch/runtime policy.
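
As a rough illustration of what "fail-closed" means for hybrid block strings, the sketch below validates a pattern such as `AEMEAEMEAEMR` against the four block letters listed above and counts each letter. It is not the checked-in translator: `ALLOWED_BLOCKS`, `parse_pattern`, and `block_counts` are hypothetical names, and the meaning of each letter is deliberately left unmodeled.

```python
from collections import Counter

# Block letters taken from the taxonomy bullet above (A / E / M / R).
# Assumption: any other character is an error, never a silent default.
ALLOWED_BLOCKS = frozenset("AEMR")


def parse_pattern(pattern: str) -> list[str]:
    """Return the per-layer block letters, or raise on anything unexpected."""
    if not pattern:
        raise ValueError("empty hybrid pattern")
    unknown = sorted(set(pattern) - ALLOWED_BLOCKS)
    if unknown:
        # Fail closed: reject the whole pattern instead of skipping letters.
        raise ValueError(f"unknown block letters: {unknown}")
    return list(pattern)


def block_counts(pattern: str) -> dict[str, int]:
    """Count how many layers use each block letter."""
    return dict(Counter(parse_pattern(pattern)))


if __name__ == "__main__":
    print(block_counts("AEMEAEMEAEMR"))
```

Raising on an unknown letter, rather than mapping it to a fallback block, keeps a typo in the recipe from silently changing the layer layout.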

<a id="gb10-blackwell-bring-up"></a>
## GB10 / Blackwell bring-up

Use these files when an article talks about GB10, Blackwell bring-up, arch
rewrites, tcgen05 gates, or the difference between driver-visible hints and
execution-grade proof.

- `gb10_arch_patch_probe_sample.py`: narrow probe showing what an `sm_100a -> sm_121a` arch-field rewrite does and does not prove
- `gb10_driver_signal_vs_runtime_proof_sample.py`: compact rule-of-thumb sample separating driver-visible hints from execution proof
- `gb10_tcgen05_gate_matrix_nearcopy.py`: staged tcgen05 gate matrix for alloc/load/mma-style probe paths
- `gb10_repro_bundle/README.md`: full reproduction pack overview
- `gb10_repro_bundle/README_walkthrough.md`: step-by-step reproduction path
- `gb10_repro_bundle/README_gates.md`: focused gate walkthrough covering patched metadata and later integrity checks
- `gb10_repro_bundle/kernel_baseline.cu`: minimal baseline CUDA kernel for the bundle
- `gb10_repro_bundle/kernel_alloc_only.cu`: alloc-only tensor-path probe
- `gb10_repro_bundle/kernel_sm100a.cu`: source cubin surface for the arch-patch lane
- `gb10_repro_bundle/loader.cpp`: module-load harness
- `gb10_repro_bundle/query_attrs.cpp`: device-attribute query harness
- `gb10_repro_bundle/patch_elf.py`, `patch_symbols.py`, `patch_nvinfo.py`: public-safe cubin metadata patchers

Use this group for GB10 / Blackwell bring-up articles first; move into the
separated `driver_patch_lane/` only when the article is explicitly about the
deeper `libcuda` helper lane.

<a id="tilelang-tma-and-shared-memory-legality"></a>
## TileLang TMA and shared-memory legality

These examples are the checked-in anchors for TileLang TMA bulk copy,
shared-memory layout legality, and the Mamba3-style layout rewrites that sit
adjacent to that lane.

- `tilelang_tma_bulk_copy_smem_sample.py`: compact TileLang TMA and shared-memory lowering sample
- `tilelang_tma_bulk_copy_smem_nearcopy.py`: near-copy lowering contract when the exact layout/legality surface matters
- `mamba3_mimo_3d_to_2d_smem_sample.py`: compact shared-memory legality sample for the Mamba3 layout rewrite
- `mamba3_mimo_3d_to_2d_smem_nearcopy.py`: near-copy refactor for the same layout issue

Use this group when a post explains TMA bulk copy, shared-memory legality,
layout rewrites, or why a lowering failure is a compiler-contract issue rather
than a math-correctness issue.

<a id="mla-integration-and-sparse-mla"></a>
## MLA integration and Sparse MLA

Use these files when an article needs a checked-in anchor for MLA adapters,
shared MLA compatibility, or Sparse MLA dispatch and dimension assumptions.

- `mla_integration_pattern_sample.py`: narrow adapter seam for MLA integration
- `mla_shared_adapter_sample.py`: shared MLA compatibility adapter contract
- `sparse_mla_fp8_dispatch_nearcopy.py`: FP8-aware Sparse MLA dispatch surface
- `sparse_mla_dimension_generalization_nearcopy.py`: hardcoded-vs-generalized Sparse MLA dimension comparison

This group is the right target for articles about MLA integration boundaries,
Sparse MLA dispatch, and the difference between compact adapter seams and the
heavier near-copy receipts.

<a id="runtime-and-loss-contract-receipts"></a>
## Runtime and loss-contract receipts

These files are the compact and near-copy receipts for runtime correctness,
memory-shape fixes, output-layer parity, recurrent-mixer seams, and
structure-aware contracts.

- `dsa_cuda_graph_safety_sample.py`: compact CUDA-graph-safe DSA mask-update sample
- `dsa_cuda_graph_safety_nearcopy.py`: detailed reproducer for the same contract
- `dsa_indexer_memory_sample.py`: compact memory-shape sample for the DSA score-materialization issue
- `dsa_indexer_memory_nearcopy.py`: detailed reproducer for the fp32 DSA score-intermediate blow-up
- `mamba_linear_ce_parity_sample.py`: compact output-layer and CE-loss parity surface
- `mamba_linear_ce_parity_nearcopy.py`: detailed class-contract reproducer for Mamba linear-CE parity
- `liger_flce_reduction_none_nearcopy.py`: loss-contract sample for the broken `reduction="none"` FLCE path
- `megatron_flce_hopper_nearcopy.py`: Hopper-ready fused linear cross entropy contract sample
- `author_mamba3_spec_nearcopy.py`: explicit RMSNorm seam in Mamba3 author integration
- `m2rnn_mixer_spec_sample.py`: recurrent-style mixer spec surface for Megatron integration
- `mamba3_psiv_cache_scaffold.py`: scaffold-style checked-in example for the fail-closed PsiV cache gate
- `index_cache_patch_nearcopy.py`: cache-lifecycle sample for full/shared DSA index reuse
- `structure_embedding_contract_sample.py`: validated structure-input normalization before embedding fusion
- `parquet_to_megatron_indexed_dataset_sample.py`: parquet-token-shard to indexed-dataset bridge
- `prepare_format_megacpp_sample.py`: thin public wrapper for naming and split policy in Megatron-ready data prep

Use this group when a post is about runtime contracts, loss-path correctness,
output-layer parity, cache lifecycles, or structure-aware data/model seams.
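
To make the idea of a loss-path contract concrete, here is a minimal pure-Python sketch of the invariant a `reduction="none"` cross-entropy path is generally expected to satisfy: one loss per token, with the mean and sum reductions recoverable from those per-token values. This is a reference calculation under that assumption, not the Liger or Megatron fused kernel; `cross_entropy` and `check_reduction_contract` are hypothetical names.

```python
import math


def cross_entropy(logits, targets, reduction="mean"):
    """Reference CE over a list of logit rows and integer target ids."""
    losses = []
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(lse - row[target])
    if reduction == "none":
        return losses
    if not losses:
        raise ValueError("no tokens to reduce over")
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unsupported reduction: {reduction!r}")


def check_reduction_contract(logits, targets, tol=1e-9):
    """Raise if the per-token path disagrees with the reduced paths."""
    per_token = cross_entropy(logits, targets, "none")
    if len(per_token) != len(targets):
        raise AssertionError("reduction='none' must return one loss per token")
    total = cross_entropy(logits, targets, "sum")
    mean = cross_entropy(logits, targets, "mean")
    if abs(sum(per_token) - total) > tol:
        raise AssertionError("sum reduction disagrees with per-token losses")
    if abs(sum(per_token) / len(per_token) - mean) > tol:
        raise AssertionError("mean reduction disagrees with per-token losses")
    return True


if __name__ == "__main__":
    logits = [[0.0, 0.0], [2.0, -1.0, 0.5]]
    print(check_reduction_contract(logits, [0, 2]))
```

A parity check against a fused implementation would swap the fused call in for one side of the comparison and use a tolerance suited to its dtype.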

<a id="related-example-catalogs"></a>
## Related example catalogs

This directory is the MegaCpp-specific wiring catalog. The adjacent example
trees cover the lower-level kernel and TPU/XLA stacks that many articles also
need to reference.

- `../kernels/README.md`: FlashAttention-4 (FA4), fused residual helpers, fused RoPE QK, mHC, dense/sparse attention receipts, and other kernel-level examples
- `../xla/README.md`: PJRT, Pallas, XLA runtime dispatch, Splash-adjacent TPU examples, startup calibration, and clustered sparse TPU receipts

Use `examples/megacpp/README.md` for MegaCpp-specific recipe and bring-up
contracts first, then jump to the sibling catalogs when the article needs the
lower-level kernel or TPU/XLA proof surfaces.