
adding variable length attention to llama3 8b #2000

Merged
tianyu-l merged 11 commits into main from test_varlen
Nov 21, 2025

Conversation

Contributor

@liangel-02 liangel-02 commented Nov 7, 2025

Summary
This PR adds variable length attention (varlen) support to the Llama 3 8B model in torchtitan. We replace `use_flex_attn` with `attn_type` (one of `"sdpa"`, `"varlen"`, `"flex"`). If `attn_type = "varlen"`, the attention module calls a compiled `varlen_attn` defined [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).
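Conceptually, varlen attention packs multiple documents into one flat sequence and tracks document boundaries with cumulative sequence lengths instead of padding. A minimal pure-Python sketch of that bookkeeping (illustrative only; not torchtitan's actual code, which calls into the compiled kernel):

```python
# Illustrative sketch of the cu_seqlens bookkeeping behind varlen attention.
# Real varlen kernels take packed q/k/v plus these offsets instead of a
# padded [batch, seq] layout, so no compute is wasted on padding tokens.

def build_cu_seqlens(doc_lengths):
    """Cumulative sequence lengths: cu_seqlens[i] is where document i starts
    in the packed sequence; the last entry is the total token count."""
    cu = [0]
    for n in doc_lengths:
        cu.append(cu[-1] + n)
    return cu

def unpack(packed_tokens, cu_seqlens):
    """Recover per-document slices from the packed sequence."""
    return [packed_tokens[cu_seqlens[i]:cu_seqlens[i + 1]]
            for i in range(len(cu_seqlens) - 1)]

docs = [[1, 2, 3], [4, 5], [6]]
packed = [t for d in docs for t in d]
cu = build_cu_seqlens([len(d) for d in docs])
assert cu == [0, 3, 5, 6]
assert unpack(packed, cu) == docs
```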

Testing
Ran loss and performance tests against flex attention. Loss is on par.

[Screenshot 2025-11-19: loss curves, varlen vs. flex]

Varlen is slightly slower than Flex due to CUDA kernel speeds (varlen calls into `flash_attention_forward`/`flash_attention_backward` today).

|  | Varlen | Flex |
| :---: | :---: | :---: |
| Forward | 774us 357ns | 722us 317ns |
| Backward | 1ms 955us 916ns | 1ms 558us 747ns |

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 7, 2025
@liangel-02 liangel-02 force-pushed the test_varlen branch 3 times, most recently from eeecb63 to cad97e5 on November 12, 2025 22:49
@liangel-02 liangel-02 changed the title from "Test varlen" to "adding variable length attention to llama 3 8b" Nov 12, 2025
@liangel-02 liangel-02 changed the title from "adding variable length attention to llama 3 8b" to "adding variable length attention to llama3 8b" Nov 12, 2025
@liangel-02 liangel-02 requested a review from drisspg November 12, 2025 23:18
Contributor

@fegin fegin left a comment


This implementation won't work with PP and is too model-intrusive. The pack logic should be hidden inside the inner attention.
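The suggestion above, keeping packing details inside the inner attention module so the rest of the model stays agnostic to `attn_type`, could be sketched as follows (hypothetical names, not the PR's actual code):

```python
# Hypothetical sketch of the reviewer's suggestion: select the backend (and
# hide any packing/unpacking) inside the attention module itself, so callers
# never branch on attn_type.

class InnerAttention:
    def __init__(self, attn_type="sdpa"):
        backends = {
            "sdpa": self._sdpa,
            "flex": self._flex,
            "varlen": self._varlen,
        }
        if attn_type not in backends:
            raise ValueError(f"unknown attn_type: {attn_type}")
        self._run = backends[attn_type]

    def __call__(self, q, k, v, masks=None):
        # Callers see one uniform interface; each backend owns its details.
        return self._run(q, k, v, masks)

    def _sdpa(self, q, k, v, masks):
        return "sdpa-out"

    def _flex(self, q, k, v, masks):
        return "flex-out"

    def _varlen(self, q, k, v, masks):
        # cu_seqlens packing would happen here, invisibly to the model.
        return "varlen-out"

assert InnerAttention("varlen")(None, None, None) == "varlen-out"
```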

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from 55352a5 to 066ca02 on November 14, 2025 18:11
@liangel-02 liangel-02 requested a review from fegin November 14, 2025 18:11
@liangel-02 liangel-02 marked this pull request as ready for review November 14, 2025 18:14
Contributor

@fegin fegin left a comment


LGTM, thanks for the update. Leave some other comments, after the comments are addressed, this PR should be ready.

@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from a902cbe to de416f9 on November 17, 2025 18:05
@liangel-02 liangel-02 requested a review from fegin November 17, 2025 18:05
Contributor

@tianyu-l tianyu-l left a comment


Thanks! Left some comments, please see if they make sense to you.

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from caafc81 to 4d36560 on November 18, 2025 21:49
@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from 9380847 to 42c0c85 on November 19, 2025 22:33
@liangel-02 liangel-02 requested review from fegin and tianyu-l November 19, 2025 22:34
Contributor

@tianyu-l tianyu-l left a comment


Left some more comments. If you'd like to focus on Llama 3 in this PR, that's fine with me too.

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from 5528029 to 31c1c77 on November 20, 2025 17:35
Contributor

@fegin fegin left a comment


LGTM, we can leave other models to other PR(s).

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from b717da3 to 9c99fcb on November 20, 2025 19:11
@liangel-02 liangel-02 requested a review from tianyu-l November 20, 2025 19:46
```python
xv,
self.head_dim,
attention_masks,
is_causal=True,
```
Contributor


This would fail? I think is_causal is no longer accepted.

Btw, it seems varlen is not tested in CI, can we add one test similar to https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/features.py#L336
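A varlen CI entry would presumably mirror the flag-override shape of the existing integration tests; a hedged sketch (the flag name `--model.attn_type` and the helper below are assumptions, not the repo's actual API):

```python
# Hypothetical shape of a varlen integration test entry, modeled loosely on
# the override lists in torchtitan's tests/integration_tests/features.py.
# Flag names and structure are illustrative.

varlen_test = {
    "name": "varlen_attn",
    "overrides": [
        [
            "--model.attn_type=varlen",
            "--parallelism.data_parallel_shard_degree=4",
        ],
    ],
}

def flatten_overrides(entry):
    """Turn each override list into one flat CLI argument string per run."""
    return [" ".join(run) for run in entry["overrides"]]

assert flatten_overrides(varlen_test) == [
    "--model.attn_type=varlen --parallelism.data_parallel_shard_degree=4"
]
```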

@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from 1af38e5 to df22636 on November 21, 2025 16:45
@liangel-02 liangel-02 requested a review from tianyu-l November 21, 2025 18:03
Contributor

@tianyu-l tianyu-l left a comment


LGTM.

We need to modify the `save_list` of SAC to save the result of varlen attn, to be consistent with other attn implementations. This can be done in the next PR.
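The SAC follow-up described above amounts to adding the varlen attention op to the set of ops whose outputs are saved rather than recomputed; a schematic sketch (op names illustrative, not the actual torch op identifiers):

```python
# Schematic sketch of selective activation checkpointing (SAC) policy: ops
# in save_list keep their outputs for backward; everything else is
# recomputed. The follow-up would append the varlen attention op so it gets
# the same treatment as sdpa/flex. Names here are illustrative.

save_list = {"scaled_dot_product_attention", "flex_attention"}
save_list.add("varlen_attn")  # proposed follow-up

def sac_policy(op_name):
    """Return 'save' for ops in save_list, 'recompute' otherwise."""
    return "save" if op_name in save_list else "recompute"

assert sac_policy("varlen_attn") == "save"
assert sac_policy("mm") == "recompute"
```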

```python
[
    [
        "--parallelism.data_parallel_shard_degree=4",
        "--activation_checkpoint.mode='full'",
```
Contributor


let's use per_op_sac like the test above.

@tianyu-l tianyu-l merged commit f8fa21e into main Nov 21, 2025
10 of 12 checks passed
@tianyu-l tianyu-l deleted the test_varlen branch November 21, 2025 22:46
kiansierra added a commit to kiansierra/torchtitan-modal that referenced this pull request Nov 22, 2025
Contributor

wwwjn commented Nov 24, 2025

The test on AMD hardware failed, but works on the NVIDIA setup. Do you know why? https://github.com/pytorch/torchtitan/actions/runs/19644468781/job/56256208915 cc @liangel-02 @drisspg

xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026
**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to the cuda kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
dmahan93 added a commit to NousResearch/torchtitan that referenced this pull request Mar 13, 2026
* [TorchComms] add testing badge at experiments readme (#2010)

* [compiler toolkit] specify passes through config (#2006)

We should be able to control what passes to run in the compiler. This PR
uses the config `compile.passes` to specify a list of graph passes to
apply on the captured gm.

By default, no pass is applied. Users can specify what passes to apply.

Currently there are `autobucketing_reordering_pass` and
`regional_inductor_pass`.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes autobucketing_reordering,regional_inductor
```

Also updated CI to include this new config
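The config-driven pass mechanism described above can be sketched as a small registry keyed by pass name (mechanics illustrative; only the pass names come from the PR):

```python
# Illustrative sketch of config-driven graph passes: compile.passes names a
# list of registered pass functions applied in order to the captured graph.
# Here 'gm' is a stand-in list that records which passes ran.

PASS_REGISTRY = {}

def register(name):
    def deco(fn):
        PASS_REGISTRY[name] = fn
        return fn
    return deco

@register("autobucketing_reordering")
def autobucketing_reordering_pass(gm):
    return gm + ["autobucketing_reordering"]

@register("regional_inductor")
def regional_inductor_pass(gm):
    return gm + ["regional_inductor"]

def apply_passes(gm, pass_names):
    for name in pass_names:  # by default the config list is empty: no pass
        gm = PASS_REGISTRY[name](gm)
    return gm

assert apply_passes([], []) == []
assert apply_passes([], ["autobucketing_reordering", "regional_inductor"]) == [
    "autobucketing_reordering", "regional_inductor"
]
```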

* [simplefsdp] fix region ac in zero2-style FSDP (#1970)

After some offline discussion, we've concluded that life would be easier
if we can put simplefsdp's checkpoint logic for `reshard_after_forward`
to compiler. The ac annotation part is borrowed from AP:
[LINK](https://github.com/meta-pytorch/autoparallel/blob/main/autoparallel/activation_checkpointing.py#L69).

**Trace and Loss Check** (all with torch.compile enable)

reshard_after_fwd = False
1. SAC + llama3
([trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-05-06_rank0_trace.json))
<img width="768" height="115" alt="Screenshot 2025-10-30 at 4 28 59 PM"
src="https://github.com/user-attachments/assets/e4e22335-2e3f-46c8-8def-a60d592fee0a"
/>

<img width="689" height="512" alt="Screenshot 2025-11-05 at 9 02 30 PM"
src="https://github.com/user-attachments/assets/40a71316-a457-4e72-9002-cc8beea8f32c"
/>


2. Full AC + llama3 [(trace)]()

<img width="729" height="105" alt="Screenshot 2025-10-30 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/e8d63460-579b-4f0a-8504-851480e5b548"
/>

<img width="789" height="763" alt="Screenshot 2025-11-05 at 9 11 34 PM"
src="https://github.com/user-attachments/assets/1a13d09e-04c4-4db9-99fe-cf10d24bf7f5"
/>


3. No AC + llama3
[[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-03-50_rank0_trace.json)]

<img width="748" height="115" alt="Screenshot 2025-10-30 at 4 32 05 PM"
src="https://github.com/user-attachments/assets/20104d24-9d45-4eba-b694-815e133b88d0"
/>

<img width="800" height="764" alt="Screenshot 2025-11-05 at 9 07 46 PM"
src="https://github.com/user-attachments/assets/55b104ce-8ec1-4ed6-95e7-300e96ad55af"
/>


reshard_after_fwd = True

1. SAC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-34-24_rank0_trace.json))

<img width="795" height="108" alt="Screenshot 2025-10-31 at 11 34 47 AM"
src="https://github.com/user-attachments/assets/a3988f72-7e87-4e52-90f9-8bee840cd6f4"
/>


2. Full AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-36-27_rank0_trace.json))

<img width="593" height="110" alt="Screenshot 2025-10-31 at 11 38 02 AM"
src="https://github.com/user-attachments/assets/5ee61b2b-9600-4af8-9a24-61b3564f93ca"
/>


3. No AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-02-44_rank0_trace.json))


<img width="701" height="109" alt="Screenshot 2025-10-31 at 11 43 04 AM"
src="https://github.com/user-attachments/assets/576b28f6-dae4-4ff7-b005-57b0cf9ad7cc"
/>

* [SimpleFSDP] Add typing to simple_fsdp.py (#2001)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #2002
* __->__ #2001

Add typing, credit to Claude.

* [Full DTensor][Reland] Add full_dtensor flag (#2013)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2013

When full_dtensor is True, the compute_placement will be preserved. This
means that `to_local()` won't be called for the fsdp-only case. The nD
parallelism case (fsdp + tp) will error out, as we have not implemented
this case.

This argument doesn't affect the current simple_fsdp. We have verified
the `full_dtensor=True` case with the full dtensor skeleton PR, which
will be published once it is ready.

**This is a reland PR of
https://github.com/pytorch/torchtitan/pull/2002. The previous one was
broken during rebase.**

* set pg names (#1986)

Summary:
- we need to pass the global rank information to pytorch so that the pg
name can include the pg information
- this is necessary to differentiate the default pg's on different
replicas
- these need to be different because flight recorder matches collectives
based on pg name as well
- add ft training to experiments folder, we'll move remaining pieces of
ft to this gradually but make new features only available through this
folder

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1986).
* #1988
* #1987
* __->__ #1986

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>

* Fix the error message of maybe_enable_async_tp() (#2011)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.

* Add dry run mode (#2012)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2012
* #2011

Summary:
The current configuration validation requires torchx and GPUs. It can
waste time, resources, and energy. Polar bears are crying. Let's fix
this by providing a dry run mode. This PR doesn't verify everything; in
theory, we should be able to verify parallelism settings as well. This
PR is just a start, but it at least lets us catch typos quickly.
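The dry-run idea, validating the config and exiting before any GPU or distributed setup is touched, can be sketched as follows (config keys and function names are hypothetical, not torchtitan's actual JobConfig):

```python
# Hypothetical sketch of a dry-run mode: parse and validate the job config,
# then return before any training resources are allocated.

def validate(config):
    """Collect human-readable config errors; empty list means valid."""
    errs = []
    if config.get("steps", 0) <= 0:
        errs.append("training.steps must be positive")
    if config.get("mode") not in {"none", "full", "selective"}:
        errs.append("unknown activation_checkpoint.mode")
    return errs

def run(config, dry_run=False):
    errors = validate(config)
    if errors:
        raise ValueError("; ".join(errors))  # catch typos early, no GPU needed
    if dry_run:
        return "config ok (dry run, no training)"
    return "training started"

assert run({"steps": 10, "mode": "none"}, dry_run=True).startswith("config ok")
```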

* [easy] [compiler toolkit] Clean up unused function (#2014)

As titled. `_clear_traced_params_buffers` is no longer being used as we
have switched the dynamo graph capture API.

* Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2016)

Addressing following issues in this PR-

- Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Revert PR-2016 & Redo "Run Torchtitan ROCm workflow on cron schedule & push to Main branch only" (#2017)

Reverts PR: https://github.com/pytorch/torchtitan/pull/2016
Addressing following issues in this PR-
- Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* [compiler toolkit] Add tests and scripts for numerics check (#2015)

This PR adds the utils to automatically check the training numerics
(losses, grad norms) of two runs to verify if they have bitwise
equivalence.

The added script triggers two runs with user defined configs. Then it
loads metrics saved during training and compare the numerics to verify
bitwise equivalence. Currently we check for losses and grad norms during
training steps

For example, we want to compare the numerics between compiler toolkit
with aot_eager backend and eager on llama3-8B.
```
python torchtitan/experiments/compiler_toolkit/scripts/check_numerics.py --ngpu 4 --config-file torchtitan/models/llama3/train_configs/llama3_8b.toml --dp-shard-degree 2 --tp-degree 2
```
It'll run `simple_fsdp` experiment without `torch.compile` as the eager
baseline, and `compile_toolkit` experiment as the compiled run. Then it
compares the training numerics of these two runs to verify bitwise
equivalence.

When it is bitwise equivalent, we'll see the following output
```
Starting training: simple_fsdp.llama3
✓ Training completed: simple_fsdp.llama3

Starting training: compiler_toolkit.llama3
✓ Training completed: compiler_toolkit.llama3
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
✓ SUCCESS: All metrics are bitwise equivalent
```

Also added unit-tests in `compiler_toolkit/tests/test_numerics.py` so
that we can guard working parallelism combinations that already have
bitwise equivalence in CI.
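The equivalence check itself reduces to an exact, no-tolerance comparison per training step; a minimal sketch (not the script's actual code):

```python
# Minimal sketch of a bitwise-equivalence check: two runs are equivalent
# only if the per-step metrics (losses, grad norms) match exactly, with
# deliberately no floating-point tolerance.

def check_bitwise(baseline, test):
    if len(baseline) != len(test):
        return False, "step count mismatch"
    for step, (a, b) in enumerate(zip(baseline, test), start=1):
        if a != b:  # exact comparison: bitwise means bitwise
            return False, f"mismatch at step {step}: {a} vs {b}"
    return True, f"All {len(baseline)} steps match exactly (bitwise equivalent)"

ok, msg = check_bitwise([2.31, 2.25], [2.31, 2.25])
assert ok
ok, _ = check_bitwise([2.31], [2.3100001])
assert not ok
```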

* Add .claude to .gitignore (#2026)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title

* Fix dry run mode (#2027)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry run mode works, but it doesn't exit gracefully in all cases. This PR
fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh   --training.steps=10 --activation_checkpoint.mode="none"
--debug.deterministic --debug.seed=42
```

* [Compiler Toolkit] Make compiler toolkit work with checkpoint (#2030)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule will result in an "inner" prefix for
everything. This
PR fixes it by overloading the methods.

Also merge https://github.com/pytorch/torchtitan/pull/2028 to this PR.
Something wrong with ghstack.

* [Flux] Update integration test badge in README.md (#2019)

Fixes the badge in the `README.md` file

* Print device and stride when print module (#2045)

Before:
<img width="978" height="93" alt="image"
src="https://github.com/user-attachments/assets/48dc39d9-e897-4396-ac62-025574303403"
/>


After:
<img width="1318" height="82" alt="image"
src="https://github.com/user-attachments/assets/47b4771a-aaf9-4f61-80bc-757f3a08c1d2"
/>

* [SimpleFSDP] add manual bucketing pass (#1881)

This PR adds support for aten-level manual bucketing in
SimpleFSDP+`aot_eager` backend. Dependent on PyTorch
[PR](https://github.com/pytorch/pytorch/pull/165487)

TODO List:
- [ ] We should have a better way of handling region info other than a
list of str FQNs in the current `manual_bucketed_modules`. It would be very
easy to miss some of the model modules. (cc. @xmfan @SherlockNoMad )
- [ ] Currently, the reordering happens under the hood and overlaps with
the last/next compute. We should allow users to specify which modules they
want to reorder.
- [ ] Loss difference on multi-node training
- [ ] DSV3 manual bucketing

I'll address the TODO items in follow up PRs. Let's start with this
simple FSDP+TP+llama3 PR.

1. Performance (FSDP2 under eager mode, SimpleFSDP uses `aot_eager`
backend)

**Llama 3-8B**

* Performance (All Batch_size = 1). (The slower TPS on Single Node is
sort of as expected, since FSDP2 handles copy-in/out in two different
streams, whereas SimpleFSDP handles copy-in/out in the same stream)

|Node| Method | Parallelism | Memory | TPS | Trace|
|---------|---------|-----------|----------|------|------|
|1-Node (8H100)|SimpleFSDP | FSDP=8| 40.96GiB(43.12%) | 7,227|
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-48-48_rank0_trace.json)|
|1-Node (8H100)|FSDP2-eager| FSDP=8| 47.82GiB(50.35%) | 7,380 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-54-14_rank0_trace.json)|
|8-Node (64H100)|SimpleFSDP| FSDP=64  | 29.37GiB | 4,984| |
|8-Node (64H100)|FSDP2| FSDP=64 | 31.41GiB  |5,097 | |
|1-Node (8H100)|SimpleFSDP| FSDP=4 TP=2 | 28.28GiB(29.77%) | 5,881 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-18-00-18_rank0_trace.json)
|
|1-Node (8H100)|FSDP2| FSDP=4 TP=2 | 35.33GiB(37.20%) | 5,898 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-15-35-47_rank0_trace.json)
|
|8-Node (64H100)|SimpleFSDP| FSDP=8 TP=8  |   |||
|8-Node (64H100)|FSDP2| FSDP=8 TP=8 |   |||

Example SimpleFSDP 1D overlapping trace:

<img width="1127" height="127" alt="Screenshot 2025-10-16 at 10 49
55 AM"
src="https://github.com/user-attachments/assets/2d9e3ff8-8e9b-40a7-a666-3c0a0975186e"
/>

Example SimpleFSDP 2D overlapping trace:
<img width="1162" height="166" alt="Screenshot 2025-10-26 at 6 00 51 PM"
src="https://github.com/user-attachments/assets/bc5cc031-5b6c-4e4d-a9da-70c43114f49a"
/>


- Bitwise Loss:

FSDP-only:
<img width="1266" height="837" alt="Screenshot 2025-10-17 at 10 41
56 AM"
src="https://github.com/user-attachments/assets/30f83d95-1eca-4f10-9e7e-47c45278cd8d"
/>

FSDP+TP:
<img width="1259" height="808" alt="Screenshot 2025-10-26 at 9 03 58 PM"
src="https://github.com/user-attachments/assets/b75b452b-adb9-4078-9412-ee9e584ffe15"
/>

* Add export_dtype parameter to `convert_to_hf` function (#2041)

The current `convert_to_hf.py` does not support `export_dtype`, which
makes it `float32` by default. This PR adds support for export dtypes of
`["float16", "bfloat16", "float32"]`.
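The dtype plumbing amounts to validating the requested string against the supported set while keeping `float32` as the default; a sketch (function name hypothetical):

```python
# Hypothetical sketch of the export_dtype handling: validate the requested
# dtype string, defaulting to float32 as before. In real code the string
# would be mapped to a torch dtype (e.g. via getattr(torch, name)).

SUPPORTED_EXPORT_DTYPES = ("float16", "bfloat16", "float32")

def resolve_export_dtype(name=None):
    if name is None:
        return "float32"  # previous default behavior
    if name not in SUPPORTED_EXPORT_DTYPES:
        raise ValueError(f"export_dtype must be one of {SUPPORTED_EXPORT_DTYPES}")
    return name

assert resolve_export_dtype() == "float32"
assert resolve_export_dtype("bfloat16") == "bfloat16"
```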

* [compiler toolkit] Port joint_ac_pass from simplefsdp (#2051)

This PR integrates the changes in #1970 into the compiler toolkit (applying
`joint_ac_pass` on the joint graph to tag nodes based on the
`reshard_after_forward` flag)

Also did some refactor for applying graph passes in compiler toolkit
experiments. We will have two kinds of passes

1. joint_custom_passes: these are passes to be applied on the captured
joint graph before the partitioner. By default we apply
`validate_flex_attn_annotation_pass` and `fsdp_reshard_after_fwd_pass`

2. compiler_passes: these are passes to be applied on the partitioned fwd
and bwd graphs as backend optimizations. By default there is none. We
can enable `autobucketing_reordering_pass` and
`regional_inductor_pass` using configs.

* [compiler toolkit] Port manual bucketing from SimpleFSDP experiment (#2056)

This PR integrates the manual bucketing pass (transformer block
bucketing) added in SimpleFSDP experiment (#1881) to compiler toolkit

So now compiler toolkit can also run manual bucketing pass by specifying
the config

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes transformer_block_bucketing
``` 

Also updated README and integration test to include the newly ported
pass

* Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2018)

Addressing following issues in this PR-

Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Add a loss comparison script (#2029)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029


## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits
with deterministic training
- Configuration Comparison: Compare different training configurations on
the same commit
- Reproducibility: Automatically enables deterministic mode and seed
checkpointing for reproducible
  comparisons
- Real-time Output: Streams training output to both console and log
files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and
summary statistics
- CI Testing: Includes --assert-equal flag for automated testing to
verify identical losses
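The log-parsing step behind such comparisons can be sketched with a regex over training logs (the log format shown is an assumption, not torchtitan's exact output):

```python
# Sketch of the log-parsing step in a loss-comparison script: pull
# (step, loss) pairs out of training logs with a regex, then diff them
# per step. The "step: N loss: X" format is an assumed example.

import re

LINE_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")

def extract_losses(log_text):
    """Map step number -> loss value for every matching log line."""
    return {int(m.group(1)): float(m.group(2))
            for m in LINE_RE.finditer(log_text)}

log = "step: 1  loss: 8.1250\nstep: 2  loss: 7.9922\n"
assert extract_losses(log) == {1: 8.125, 2: 7.9922}
```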

## Usage Examples

#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration 
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" 
--baseline-options="--parallelism.tensor_parallel_degree=2"  \
```

#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" 
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1" \
```

#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```


## Real Use Cases
Compare full dtensor simple fsdp with fsdp2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"'  \
--test-train-file='torchtitan.experiments.full_dtensor.train' \ 
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"'  \
 --assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```

* Fix integration test gpu_arch_type field (#2060)

All tests in experiments are broken due to the `gpu_arch_type` field
added in #2018.

* [compiler toolkit] Add Trainer subclass for compiler toolkit (#2064)

Adding CudaGraph pass (https://github.com/pytorch/torchtitan/pull/2050)
would require some custom logic in Trainer's close() method.

So we create a Trainer subclass in compiler toolkit

* Let loss_compare.py check the repo cleanliness (#2062)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing `git checkout`.

* CUDAGraph support for SimpleFSDP and TP (#2050)

## Features
- [x] Support SimpleFSDP and TP
- [x] Support static input indices to reduce copy
- [x] Support memory reuse to reduce memory consumption
- [x] Cleanup cudagraph when training finishes to avoid nccl hang from
destroy_process_group

Command:
```
NCCL_GRAPH_REGISTER=0 NGPU=8 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph
```


Note: we use `NCCL_GRAPH_REGISTER=0` due to a known issue that nccl +
cudagraphs + expandable segments result in IMA.
https://github.com/pytorch/pytorch/issues/158029


[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces%2Ftree%2Fshared_trace%2Fboyuan_e1ef464b-ee61-4c61-82e5-f7a485e561bf_rank0_trace.json)

## Result

**Numerics:**
Achieved bitwise equivalence w/ and w/o cudagraph pass on llama3.1-8B
AND llama3.1-70B.

**Performance:**
<img width="560" height="90" alt="image"
src="https://github.com/user-attachments/assets/9d54c461-0eb1-4f7e-9652-3d52043ad74f"
/>

Raw log:
[llama3-8b](https://www.internalfb.com/phabricator/paste/view/P2045444190),
[llama3-70b](https://www.internalfb.com/phabricator/paste/view/P2045567416)

**Memory:**
On llama3.1-70b, cudagraph takes 6% more memory consumption (143 GiB vs
153 GiB).

A few tricks to reduce memory consumption (use llama3.1-70b w/ cudagraph
as an example):
- Start: 161 GiB
- \+ use the same stream for warmup and graph capture of both fwd and
bwd: 160 GiB
- \+ warmup in cudagraph memory pool instead of eager memory pool: 153
GiB


**static input copy:**
On llama3.1-70B, for forward, we copy 1 tensor of 128 bytes; for
backward, we copy 1 tensor of 0.98 GB. This shows static input indices
is handled correctly.


## Followup PR
In the followup PR, I will enable fx graph partition for deepseek v3
https://github.com/pytorch/pytorch/pull/165945.

* compiler_toolkit: fix args access (#2067)

This PR fixes access to args; it's an attribute, not a variable in the
scope.
The method itself though would not be used because
`should_check_address` seems to be always `False` and there doesn't seem
to be a command line argument for it.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* 3outeille/transformers backend (Dense model only) (#2048)

# Context
Reference PR: https://github.com/huggingface/torchtitan/pull/1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP
(and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF model weights initialisation. Without this, the `loss`
and `grad_norm` start very high

# Usage

- Requirements `transformers==4.57.1`
- Config:
`torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`
```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```
- Train: `LOG_RANK=7
CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml
./run_train.sh
--job.custom_config_module=torchtitan.experiments.transformers_backend.job_config
--compile.enable`

<img width="1334" height="453" alt="image"
src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c"
/>

# Testing methodology

<img width="2672" height="2018" alt="image"
src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7"
/>

- Following the
[converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md)
guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other
//-ism>`
- More precisely, `test_hf_integration.py` is going to produce:

```bash
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
```
- Here is the grid search used to test the HF modeling:
```shell
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR https://github.com/huggingface/torchtitan/pull/3)
	- Missing `build_optimizers_with_moe_load_balancing` support for MoE
	- Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test, the
`loss` and `grad_norm` are not bitwise matching (but converging), while
they do match with the Torchtitan modeling. (Issue tracked in
https://github.com/huggingface/torchtitan/pull/4)
- Add convergence tests to CI by doing tiny model + gloo backend (once
PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `import torch._dynamo.config;
torch._dynamo.config.cache_size_limit = 128` to avoid graph
recompilation when using `torch.compile` with `activation checkpointing`

* adding variable length attention to llama3 8b   (#2000)

**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to CUDA kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
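As a rough illustration of what the varlen path consumes, here is a hedged pure-Python sketch of building the cumulative sequence-length boundaries (`cu_seqlens`) for a packed batch of documents; the helper name is hypothetical, and the real `varlen_attn` takes these offsets as int32 tensors:

```python
def build_cu_seqlens(doc_lens):
    """Cumulative boundaries [0, l0, l0+l1, ...] for packed documents.

    Hypothetical helper: varlen attention kernels use these offsets to
    restrict attention to within each document, with no padding.
    """
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

# three documents of lengths 5, 3, and 8 packed into one sequence of 16
print(build_cu_seqlens([5, 3, 8]))  # → [0, 5, 8, 16]
```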

* remove scatter_add in MoE implementation (#1974)

PR for removing `scatter_add` in the MoE implementation. `scatter_add`
is somewhat problematic as it is non-deterministic on CUDA due to the
[atomic
adds](https://discuss.pytorch.org/t/why-does-index-add-and-scatter-add-induce-non-deterministic-behavior-on-the-cuda-backend/45544/2)
needed for correctness.
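The non-determinism comes from atomic adds committing in an unspecified order while floating-point addition is not associative; a minimal pure-Python illustration (not torchtitan code):

```python
# Summing the same three values in two different orders gives two
# different float results -- exactly what happens when CUDA atomic adds
# land in a different order on each run.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False
```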

Determinism, correctness, and performance tests using scripts under
`torchtitan/moe_bench_and_test`:

```
# Determinism: run same forward 100x and compute standard deviations
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_determinism

out_old_std=tensor(0.0297, device='cuda:0', dtype=torch.bfloat16)
out_std=tensor(0., device='cuda:0', dtype=torch.bfloat16)
out_old_std/out_moe_old.abs().mean()=tensor(0.0006, device='cuda:0', dtype=torch.bfloat16)
out_std/out_moe.abs().mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
```

```
# Accuracy: compare MoE outputs to FFN outputs, with weights set such that outputs should be the same
# Relative error decreased by 3x
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_moe_ffn_equivalence

moe_old_rel_err=0.009754068047048696
moe_rel_err=0.002507858727736454
moe_old_rel_err/moe_rel_err=3.8894009216589858
```

```
# Timing: triton do_bench for DSv3 16B layer fwd + bwd. ~3% faster runtime
python torchtitan/moe_bench_and_test/moe_timing.py moe_old && python torchtitan/moe_bench_and_test/moe_timing.py moe

args=Namespace(cls='moe_old', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.712812881469727

args=Namespace(cls='moe', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.03301840562087

```

```
# Memory: for DSv3 16B layer fwd + bwd. ~15% reduction in active mem, ~18% in reserved mem.
python torchtitan/moe_bench_and_test/moe_memory.py moe_old && python torchtitan/moe_bench_and_test/moe_memory.py moe

args=Namespace(cls='moe_old', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.926029682159424
peak_stats.max_reserved_gib=7.224609375

args=Namespace(cls='moe', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.051033020019531
peak_stats.max_reserved_gib=5.91015625
```

Testing fwd + bwd correctness for `tp_degree=ep_degree=world_size=8` and
`etp=1`
```
# Similar relative errors
torchrun --nproc-per-node 8 torchtitan/moe_bench_and_test/test_tp.py

args=Namespace(seqlen=256, bsz=4, tol=0.01), world_size=8, tp=8, ep=8, etp=1

err_ratio_fsdp_ep_old=0.0028211805268959435
err_ratio_fsdp_ep=0.002805679534989922
err_ratio_ep_ep_old=0.0022941468020912068

kl_fsdp_ep_old=tensor(2.4915e-05, device='cuda:0', dtype=torch.bfloat16)
kl_fsdp_ep=tensor(2.0981e-05, device='cuda:0', dtype=torch.bfloat16)
kl_ep_ep_old=tensor(2.1458e-05, device='cuda:0', dtype=torch.bfloat16)
```

Everything under `torchtitan/moe_bench_and_test` consists of temporary
testing utilities and will be deleted prior to merging.

* Update transformers backend name (#2075)

Following Hugging Face's efforts in vLLM (cf.
https://github.com/vllm-project/vllm/pull/28725), we would like to
unify the naming and make it clear that this backend uses the HF
models only

* Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses (#2063)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2063

This PR allows us to check that the loss is consistent across commits/PRs.
1. This PR contains a pre-tested loss results file.
2. This PR improves loss_compare.py by adding --import and --export
options.
3. In CI, use --import to get the previous losses and compare them with
the current PR. If any of the 10 steps mismatch, the CI will fail.

* Print out the version number (#2083)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2083

This PR and https://github.com/pytorch/torchtitan/pull/2070 can resolve
https://github.com/pytorch/torchtitan/issues/2043.

This should not affect `.github/scripts/update_version.sh` as
`.github/scripts/update_version.sh` will append the version at the end
of the file, which will overwrite the value.

* Autoparallel as an experiment in main (#2054)

Experiments like SimpleFSDP/Compiler Toolkit/Autoparallel are all being
developed at the same time, and SimpleFSDP/Compiler Toolkit both run
into issues with PP that requires the PP utilities from Autoparallel. We
want to land the Autoparallel experiment into main to facilitate that
sharing.

---------

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Co-authored-by: ruisizhang123 <ruisizhang123@gmail.com>
Co-authored-by: Brian Hirsh <briandhirsh@gmail.com>
Co-authored-by: Will Constable <willconstable@gmail.com>

* skip varlen integration test on rocm (#2085)

As titled, since varlen attention is not supported on ROCm

* [Local Tensor] Replace dry_run.py with fake mode implementation (#2057)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2057

Replaces `dry_run.py` implementation with fake PG mode for DRY_RUN
configuration validation. This PR also adds support for Local Tensor
mode to provide deeper validation coverage.

**Note:** Currently returns early before `init_weights()` if using local
tensor mode due to some limitation of local tensor, which will be fixed
by https://github.com/pytorch/pytorch/pull/166540 .

* add varlen attention for qwen 3 (#2084)

As titled.

**Testing**

<img width="469" height="431" alt="Screenshot 2025-11-24 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/6b9a362d-de36-48b7-b465-d91ae24f4cbf"
/>

performance and loss on par

* [FLUX] Add FLUX inference test in CI (#1969)

* Improve logging by formatting the dict as JSON. (#2094)

We use Slurm to run jobs, and I just noticed that job configs and model
args were being logged on a single line by default, which makes the logs
hard to read.

This PR improves readability by formatting these dictionaries with
`json.dumps` before logging, so the configs are formatted nicely and
easier for humans to read.

before:
<img width="2594" height="640" alt="image"
src="https://github.com/user-attachments/assets/c3c07b09-d12c-484d-aa90-a626cd25c6d2"
/>

after:
<img width="2252" height="1032" alt="image"
src="https://github.com/user-attachments/assets/4cbde979-c34c-4fc5-aa55-f280f39cf9ef"
/>
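The change boils down to passing the dict through `json.dumps` before handing it to the logger; a minimal sketch (the config dict here is made up, not the real job config):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# hypothetical config dict standing in for the real job config
config = {"model": "llama3_8b", "parallelism": {"fsdp": 8, "tp": 1}}

# before: one unreadable line; after: pretty-printed, one key per line
logging.info("Job config:\n%s", json.dumps(config, indent=2))
```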

* add all SDPA backends to op_sac_save_list (#2095)

As we discussed in https://github.com/pytorch/torchtitan/issues/2091, we
should add all `scaled_dot_product_attention` backends to
`op_sac_save_list` to avoid recomputing attention during the backward pass.

* modify save list for varlen attn (#2082)

adding varlen attention ops to ac save list

**Testing**

Used DebugMode() to print out the op list and verified that the forward
is not being recomputed in the backward step.

```
[rank0]:forward ops
[rank0]:varlen_attn in forward: True
...
[rank0]:varlen_attn recomputed in backward: False
[rank0]:saved correctly
```

* Make sure log after distributed initialized. (#2102)

There is a distributed-initialization condition check in the config
logging, so the config logging has to happen after distributed
has been initialized.

Co-authored-by: Zhiqiang Zang <zzq@fb.com>

* [mxfp8] [docs] [BE] add MXFP8 usage documentation and benchmarks (#2096)

Fixes #1998

* Mark input tokens to routed experts as dynamic to avoid a recompile (#2007)

Stacked PRs:
 * __->__#2007


--- --- ---

Mark input tokens to routed experts as dynamic to avoid a recompile


This saves 1 recompile, and you can see the input tokens are dynamic
from the first graph compiled:
```python
class GraphModule(torch.nn.Module):
    def forward(...s77: "Sym(s77)", L_x_: "bf16[s77, 5120][5120, 1]cuda:0"...
```

I verified that this also fixes the AC recompile issue of:
https://github.com/pytorch/torchtitan/issues/1971. But I'm keeping
`torch._C._dynamo.eval_frame._set_lru_cache(False)`, as there could be
other recompile reasons popping up.

* fix mxfp8 loss image (#2104)

In the original PR i moved the image location without updating the
markdown pointing to it by accident. This fixes that.

* Update hf_assets_path for llama4 (#2110)

Fix typo in train_config, hf asset should be for maverick, see:

https://huggingface.co/meta-llama/models?search=128e

* Enables parsing of --compile.components through CLI (#2115)

Without this PR, I'm not able to pass `--compile.components=model,loss`.
Tested using `python -m torchtitan.config.manager
--compile.components=model,loss`.

* fix `ForgeEngine` compatibility issue with (#2121)

Summary:
Fix backward incompatible changes introduced in 


https://github.com/pytorch/torchtitan/commit/ff078526d1b9a51a3507cd234715ac3c61291e85

Differential Revision: D88572518

* Remove the hack for SAC + FlexAttention (#2118)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2118

PyTorch can now support torch.compile inside the SAC region even if
torch.compile is not used to wrap SAC. This PR removes the workaround
that was needed to make torch.compile work with Flex.

* Add warning to run_tests (#2123)

Small addition since right now running a test that doesn't exist just
outputs nothing, e.g.

`python -m tests.integration_tests.run_tests ./test-out --test_name
does_not_exist`

Now the output is:

`
WARNING:root:No tests were run for --test_name 'does_not_exist' in test
suite 'features'.
Available test names in 'features' suite: ['default', '1d_compile',
'1d_compile_sac_op', '2d_eager', '2d_compile', 'full_checkpoint',
'model_only_hf_checkpoint', 'last_save_model_only_fp32',
'last_save_model_only_bf16', 'pp_looped_zero_bubble', 'pp_zbv',
'pp_1f1b', 'pp_gpipe', 'pp_dp_1f1b', 'pp_dp_gpipe', 'pp_tp', 'pp_dp_tp',
'3d_compile', 'pp_looped_1f1b', 'pp_custom_csv', 'optimizer_foreach',
'ddp', 'hsdp', 'fsdp+flex_attn', 'fsdp+flex_attn+per_op_sac',
'fsdp+varlen_attn+per_op_sac', 'cp_allgather', 'cp_alltoall', 'hsdp+tp',
'fsdp+cp', 'hsdp+cp_without_dp_shard', 'hsdp+cp_with_dp_shard',
'fsdp+tp+cp', 'cpu_offload+opt_in_bwd+TP+DP+CP', 'test_generate',
'fsdp_reshard_always', 'optional_checkpoint', 'float8_emulation',
'gradient_accumulation', 'validation_tp_cp_pp']
`

* [compiler toolkit] Disable CUDAGraph integration test (#2127)

As titled. We'll enable when it is fixed.

* Add CI for Autoparallel experiment llama3 on 4 GPUs (#2105)

* Support rope cache indexing using positions (#2112)

Add support for indexing the rope cache using `position_ids`; this might
be needed during
1. inference, where we pass `position_ids` into the transformer forward
2. CP load balancing, where we need to index the rope cache given
position ids

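A hedged pure-Python sketch of what "indexing the rope cache given position ids" means; the function names are illustrative, not the actual torchtitan API:

```python
import math

def build_rope_cache(max_seq_len, head_dim, base=10000.0):
    # precompute (cos, sin) pairs per position and rotary frequency
    half = head_dim // 2
    freqs = [1.0 / base ** (2 * i / head_dim) for i in range(half)]
    return [[(math.cos(p * f), math.sin(p * f)) for f in freqs]
            for p in range(max_seq_len)]

def index_rope_cache(cache, position_ids):
    # gather rows by explicit positions instead of assuming 0..seq_len-1,
    # as needed for inference and CP load balancing
    return [cache[p] for p in position_ids]

cache = build_rope_cache(max_seq_len=8, head_dim=4)
rows = index_rope_cache(cache, [3, 1, 4])  # out-of-order positions
```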
Test: 
running dpskv3 16b base
<img width="489" height="286" alt="image"
src="https://github.com/user-attachments/assets/6f463d65-a0de-413d-ab19-770db9983dbb"
/>

also tested in https://github.com/wwwjn/torchtitan/pull/1/files when
passing position_ids
<img width="665" height="269" alt="image"
src="https://github.com/user-attachments/assets/70e4bddc-0334-4dbf-b00d-6e4b49a94655"
/>

---------

Co-authored-by: JessicaZhong <zhengjesszhong@gmail.com>

* [forge] allow torchforges to set checkpoint base folder (#2131)

This PR
1) allows Torchforge to decide where to put the checkpoint, wandb,
etc., instead of the "current" folder
~~allowing Torchforge to decide to print / log the configs~~

* Rename auto_parallel experiment to autoparallel (#2128)

* PyTorch depends on psutil (#2132)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2132

TorchTitan should also depend on psutil.

* Remove caching for attention masks (#2117)

We remove the lru_cache for attention masks: in the
get_attention_mask() function, `and_masks(*mask_mods)` returns a new
object on every call, and since `create_attention_mask` uses all
parameters as the cache key, the fresh object id always causes a cache miss.
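The failure mode is easy to reproduce with a plain `functools.lru_cache` (a toy stand-in, not the torchtitan code):

```python
from functools import lru_cache

def and_masks(*fns):
    # returns a brand-new closure on every call, so its id (and hash)
    # differs each time -- just like the real and_masks
    return lambda *args: all(f(*args) for f in fns)

@lru_cache(maxsize=None)
def create_attention_mask(mask_fn, seq_len):
    return (id(mask_fn), seq_len)  # placeholder for real mask creation

causal = lambda q, k: k <= q
create_attention_mask(and_masks(causal), 128)
create_attention_mask(and_masks(causal), 128)
print(create_attention_mask.cache_info())  # 2 misses, 0 hits
```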

Before the change: (llama3 debugmodel_flex_attn)
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 27 45 PM"
src="https://github.com/user-attachments/assets/e9af2597-9d94-4478-8136-8b9b8c35d9e6"
/>

After the change:
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 29 56 PM"
src="https://github.com/user-attachments/assets/756a7d09-b47f-434f-8ff6-40098b265a03"
/>

* Clarify contribution guidelines. (#2134)

* Enable PP and EP overlap for MoE (#1721)

Option 2 of https://github.com/pytorch/torchtitan/issues/1682

These changes add a custom `overlap_callback` function to replace the
OVERLAP_F_B action that is run during the schedule execution. In the
custom function, we write `run_forward()` and `run_backward()`.
`run_backward()` is run as a separate thread so that we can have both
forward and backward running together side by side. Looks like this:

<img width="1321" height="443" alt="image"
src="https://github.com/user-attachments/assets/911f3637-1afa-4537-989a-a325ba558957"
/>

In order for these changes to work with Expert Parallel, we also need to
add custom autograd functions to act as the boundary points at which we
do communication. We added hooks before and after expert parallel
dispatch and combine to signal boundary points, so our figure from
before now turns into:

<img width="1382" height="388" alt="image"
src="https://github.com/user-attachments/assets/3991749d-7d67-4098-81a4-4efcfd1c75ca"
/>

Now in each of these red blocks, we use a global coordinator. We need
`threading.Barrier(2).wait()` so that the comm and compute from our
forward and backward steps are scheduled in lock-step before continuing.
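The lock-step scheduling described above can be sketched with a plain `threading.Barrier`; this is a toy sketch, not the actual `overlap_callback`:

```python
import threading

barrier = threading.Barrier(2)
order = []

def forward_step():
    order.append("fwd-compute")
    barrier.wait()  # rendezvous: don't issue comm until backward is here too
    order.append("fwd-comm")

def backward_step():
    order.append("bwd-compute")
    barrier.wait()
    order.append("bwd-comm")

t = threading.Thread(target=backward_step)
t.start()
forward_step()
t.join()
# both compute entries are guaranteed to precede both comm entries
```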

DSv3 16B run command:
```
TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=8  CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh
```

Trace examples:

<img width="2409" height="1889" alt="image"
src="https://github.com/user-attachments/assets/923efc8f-9241-4646-aba0-ccc846d3932b"
/>

Test command:

`python -m tests.integration_tests.run_tests ./test-out --test_name
pp_dualpipev --test_suite models`

---------

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* Fix apply_compile called multiple times in PP initialization (#2135)

Stacked PRs:
 * __->__#2135


--- --- ---

PP initialization calls apply_compile multiple times, once per PP stage.
But apply_compile does some global patching, so I add `already_patched`
to avoid patching the same method multiple times.

If we patch multiple times, the second time will wrap
`_run_experts_grouped_mm_dynamic` in a torch.compile(fullgraph=True)
leading to the error in the issue below.

FIXES https://github.com/pytorch/torchtitan/issues/2124

* Enable static type checking with Pyrefly (#2136)

Enables static type checking of torchtitan with
[pyrefly](https://github.com/facebook/pyrefly). Type checking the code
helps catch bugs earlier in the development cycle.

* Adds pyrefly to CI, as part of the linting workflow.
* Addresses ~100 type errors that can be fixed via local code changes
and updates to type annotations, and silences the rest with `# pyrefly:
ignore` suppression comments. Note that
https://github.com/pytorch/torchtitan/commit/325efd946f1cbea85e503f9e684b8c879891fc1a
contains all of the non-comment changes.

* [Autoparallel] Add local_map variant of DSv3 and 2D mesh AP (#2129)

Stacked PRs:
 * __->__#2129


--- --- ---

[Autoparallel] Add local_map variant of DSv3 and 2D mesh AP

Currently, the AP experiment monkey patches Titan's main DSv3
implementation. But this is prone to breakage from both model definition
changes in titan and from HOP/partitioner related changes in core. When
these breaks happen, people are usually blocked until I find the root
cause.

I'm going on PTO for the rest of the year, so I'm adding an integration
to AP's DSv3 model in an attempt to make the development more stable for
the upcoming PP integration.

Test: https://gist.github.com/xmfan/db15fda1e1bc1df7cd523005fe0baf33

* Implement ciflow/rocm on Torchtitan (#2114)

In this PR, I implemented ciflow/rocm on Torchtitan. The changes are
part of integration_test_8gpu_features.yaml. The workflow still supports
running on pull_request (without any PR label) for CUDA. However, along
with push to main and cron schedule, with the ciflow/8gpu label added to
PR, the workflow runs for both CUDA & ROCm.

---------

Co-authored-by: Huy Do <huydhn@gmail.com>

* [MoE] Add node limited routing support (#2111)

As titled, added node-limited routing support via two-layer routing.
First, group experts into `num_groups` groups; experts in the same
group should reside on the same node to utilize fast intra-node
communication. Second, pick the `top_k_group` groups by the sum of the
top-2 expert scores in each group. Third, pick `top_k` experts within
the selected `top_k_groups`.
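Assuming equal-sized groups, the three-step routing can be sketched in plain Python for a single token (illustrative only; the real router works on batched score tensors):

```python
def node_limited_topk(scores, num_groups, top_k_group, top_k):
    group_size = len(scores) // num_groups

    # rank groups by the sum of their top-2 expert scores
    def group_score(g):
        members = scores[g * group_size:(g + 1) * group_size]
        return sum(sorted(members, reverse=True)[:2])

    kept = sorted(range(num_groups), key=group_score, reverse=True)[:top_k_group]

    # pick top_k experts among experts in the kept groups only
    candidates = [e for g in kept
                  for e in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]

# same shape as the debug-model test: 8 experts, 4 groups,
# top_k_group=2, top_k=3
experts = node_limited_topk(
    [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.4, 0.35], 4, 2, 3)
```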

Reference:
https://github.com/huggingface/transformers/blob/4c9fde2a2a3aece0bcf1be93f696e88297da9397/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L212

Test on one node using DeepSeek V3 debug model with MoE arguments
`num_experts=8,
            num_shared_experts=2,
            num_groups=4,
            top_k_group=2,
            top_k=3`.
<img width="1196" height="465" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/63fd8414-1761-4efe-acff-154b1f46a16d"
/>

* Upgrade GitHub Actions to latest versions (#2152)

## Summary

Upgrade GitHub Actions to their latest versions for improved features,
bug fixes, and security updates.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `pypa/gh-action-pypi-publish` |
[`release/v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/release/v1)
| [`v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
|
[Release](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
| release.yml |

## Why upgrade?

Keeping GitHub Actions up to date ensures:
- **Security**: Latest security patches and fixes
- **Features**: Access to new functionality and improvements
- **Compatibility**: Better support for current GitHub features
- **Performance**: Optimizations and efficiency improvements

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Upgrade GitHub Actions for Node 24 compatibility (#2151)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3),
[`v4`](https://github.com/actions/checkout/releases/tag/v4) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
docker-builds.yml, release.yml |
| `actions/download-artifact` |
[`v4`](https://github.com/actions/download-artifact/releases/tag/v4) |
[`v7`](https://github.com/actions/download-artifact/releases/tag/v7) |
[Release](https://github.com/actions/download-artifact/releases/tag/v7)
| release.yml |
| `actions/setup-python` |
[`v5`](https://github.com/actions/setup-python/releases/tag/v5) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
release.yml |
| `actions/upload-artifact` |
[`v4`](https://github.com/actions/upload-artifact/releases/tag/v4) |
[`v6`](https://github.com/actions/upload-artifact/releases/tag/v6) |
[Release](https://github.com/actions/upload-artifact/releases/tag/v6) |
release.yml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Improve the loss_compare.sh logic (#2143)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2145
* #2144
* __->__ #2143

1. Accept a single "." (meaning the current commit) to simplify the
command line.
2. Ignore untracked files.

* [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints (#2021)

As titled, this PR adds an HF state dict adapter to support loading from
the GPT-OSS HF checkpoint. The GPT-OSS checkpoint is quantized in MXFP4
format. The de-quantization steps are offloaded to the
`QuantizedHuggingFaceStorageReader` in `dcp`, so this feature depends on
this PR to update `QuantizedHuggingFaceStorageReader`
(https://github.com/pytorch/pytorch/pull/167672).

1. Test 1. We use `dcp.load(hf_state_dict,
storage_reader=QuantizedHuggingFaceStorageReader(path=input_dir))` to
load from the GPT-OSS HF checkpoint, and map the `hf_state_dict` back to
the TorchTitan state dict. We build one test input and compare two
outputs: 1. using the `transformers` library to load the GPT-OSS HF
checkpoint and run inference on the test input; 2. using the converted
TorchTitan model to run inference on the test input. We compare the
outputs via the KL divergence of the two output probability
distributions. The result shows the two models are very similar.
<img width="1191" height="191" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/bb6a75e9-3dd7-43fa-847e-3f5f4fb5fd93"
/>

2. Test 2. We load the model directly from quantized GPT-OSS HF
checkpoint, and do a test training.
<img width="1198" height="408" alt="Pasted Graphic 1"
src="https://github.com/user-attachments/assets/49ab42ff-0115-4e79-b069-c556e0dd23f6"
/>

* Add local built pytorch path for pyrefly (#2155)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the local built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors for moe.py

* Run vLLM inference using torchtitan model definition (single GPU) (#2119)

As titled, put it in deterministic RL folder

* [RELAND] Let CUDA and ROCm read different loss result (#2157)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm have different loss results. So we need to read from
different loss result files.
The loss results of FSDP and HSDP start to diverge after the 5th step
when running with ROCm, so we also need to adjust this. But this is more
of an unknown issue that AMD people may want to root-cause or confirm
is expected behavior.

**This PR is a reland PR of
https://github.com/pytorch/torchtitan/pull/2156** due to some landing
issue of the previous PR.

* Use new DeviceMesh unflatten to rewrite parallel_dims (#1660)

**Summary**
This PR utilizes the latest APIs provided by DeviceMesh to simplify the
creation of all different meshes.

The design philosophy is as follows:

1. Create one world mesh with the shape [world_size,].
2. Create all 1-D submeshes by either 1) unflattening from the world
mesh, or 2) slicing and flattening from other derived meshes.
3. ParallelDims now provides two APIs, get_mesh() and
get_optional_mesh(), which accept str or list[str]. When the argument is
a str, the API directly returns the corresponding 1-D submesh. If the
argument is a list[str], the dim names are concatenated to form an n-D
device mesh. The main difference between the two APIs is that the former
raises a ValueError if the resulting mesh is None, while the latter just
returns None.

* Integrate DeepEP to torchtitan (#2107)

## Summary
This initial version integrates DeepEP into TorchTitan, focusing on
correctness and compatibility rather than maximal performance tuning.

- Functional DeepEP-backed MoE + Expert Parallelism
- User-controlled configuration
- Compatible with torch.compile and SAC
-  Intended as a first unblocker for benchmarking and iteration

## Perf: DeepSeek-V3 671B on 64 nodes × H100 (512 GPUs total)

<details> <summary><strong>Training config (click to
expand)</strong></summary>

```
config_path="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml",
command_args=[
    "--training.dataset_path=/lustre/fsw/portfolios/sw/users/elfieg/hf_datasets/c4",
    "--training.seq_len=4096",
    "--training.steps=120",
    "--metrics.log_freq=10",
    "--profiling.no-enable-profiling",
    "--comm.init_timeout_seconds=2000",
    "--comm.train_timeout_seconds=300",
    "--metrics.disable_color_printing",

    # Parallelism
    "--parallelism.data_parallel_replicate_degree=1",
    "--parallelism.data_parallel_shard_degree=64",
    "--parallelism.fsdp_reshard_after_forward=default",
    "--parallelism.tensor_parallel_degree=1",
    "--parallelism.expert_parallel_degree=32",
    "--parallelism.expert_tensor_parallel_degree=1",
    "--parallelism.pipeline_parallel_degree=8",
    "--parallelism.pipeline_parallel_schedule=Interleaved1F1B",

    # Training
    "--training.local_batch_size=16",
    "--activation_checkpoint.mode=full",

    # Compilation
    "--compile.enable",
    "--compile.components=model",
    "--compile.components=loss",

    # MoE / DeepEP
    "--debug.moe_force_load_balance",
    "--parallelism.expert_parallel_comm_backend=deepep",
],
```
</details>

After:
```
memory: 56.75GiB(71.74%)  tps: 579  tflops: 162.82  mfu: 16.46%
```
Before:
```
memory: 60.18GiB(76.07%)  tps: 346  tflops: 97.24  mfu: 9.83%
```

## Loss Curve:
<img width="877" height="380" alt="Screenshot 2025-12-16 at 11 30 02 PM"
src="https://github.com/user-attachments/assets/b2f15297-2f05-4f4b-b4d5-b2747a30b2fa"
/>



Shout out to my colleagues @gekurian @syed-ahmed @aazzolini for internal
support!

* Fix pypa/gh-action-pypi-publish version to use SHA pinning (#2161)

## Summary

Fix incorrect version reference for `pypa/gh-action-pypi-publish`.

## Problem

A previous PR incorrectly changed the action reference from `release/v1`
(valid branch) to `v1` (non-existent tag). The `v1` tag doesn't exist in
the pypa/gh-action-pypi-publish repository.

## Solution

Updated to use SHA pinning for release/v1.13:
```yaml
uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e  # release/v1.13
```

This follows [GitHub's security best
practices](https://docs.github.com/en/actions/reference/security/secure-use#using-third-party-actions)
for third-party actions by pinning to an immutable SHA.

## Files Changed

- `.github/workflows/release.yml`

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Upgrade GitHub Actions for Node 24 compatibility (#2164)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
lint.yaml |
| `actions/setup-python` |
[`v4`](https://github.com/actions/setup-python/releases/tag/v4) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
lint.yaml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Expose common dataloader args (#2097)

This diff introduces common dataloader args, which are supported by
StatefulDataLoader (and the torch.utils.data DataLoader). Users should be
able to use them in their config files.

I was thinking about introducing a catch-all kwargs argument to make it
easier to specify args, but that can easily complicate things (validation
checks, duplication, clashes with named args already defined in function
signatures, etc.).

* Replace `logger.warn()` with `logger.warning()`, allow `log_validation` to log `extra_metrics`, and expose common wandb args (#2166)

1. Replace `logger.warn()` with `logger.warning()`
2. Allow `log_validation` to log `extra_metrics`
3. Expose common wandb init args, which is useful when resuming training.
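`Logger.warn` has been a deprecated alias of `Logger.warning` since Python 3.3, so the rename is mechanical; a minimal sketch (the logger name is illustrative):

```python
import logging

logger = logging.getLogger("torchtitan.example")  # hypothetical logger name

# Before: logger.warn(...)  -- deprecated alias of warning() since Python 3.3
# After:
logger.warning("resuming from checkpoint step %d", 100)
```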

* Add Dependabot for GitHub Actions updates (#2163)

## Summary

Add Dependabot configuration to automatically keep GitHub Actions up to
date.

Here's some more information about Dependabot:
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot

## Changes

- Added `.github/dependabot.yml` with weekly checks for GitHub Actions
updates
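A minimal `dependabot.yml` for this purpose might look like the following (this is the documented shape for the `github-actions` ecosystem; the actual file in the PR may differ):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```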

## Context

As discussed in #2161
([comment](https://github.com/pytorch/torchtitan/pull/2161#issuecomment-3667526716)),
adding Dependabot to automatically manage GitHub Actions updates going
forward.

## Why

Dependabot will automatically create PRs when new versions of GitHub
Actions are available, helping to:
- Keep CI/CD workflows secure with the latest patches
- Get new features and improvements
- Maintain compatibility with GitHub's infrastructure

Each action update will be proposed as a separate PR for individual
review and testing.

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Bump tj-actions/changed-files from d6e91a2266cdb9d62096cebf1e8546899c6aa18f to e0021407031f5be11a464abee9a0776171c79891 in the github-actions group (#2167)

Bumps the github-actions group with 1 update:
[tj-actions/changed-files](https://github.com/tj-actions/changed-files).

Updates `tj-actions/changed-files` from
d6e91a2266cdb9d62096cebf1e8546899c6aa18f to
e0021407031f5be11a464abee9a0776171c79891
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tj-actions/changed-files/blob/main/HISTORY.md">tj-actions/changed-files's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h1><a
href="https://github.com/tj-actions/changed-files/compare/v46.0.5...v47.0.0">47.0.0</a>
- (2025-09-13)</h1>
<h2><!-- raw HTML omitted -->🚀 Features</h2>
<ul>
<li>Add any_added to outputs (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2567">#2567</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c260d49a827b5eb266673bed7871c5d3ee9b5aef">c260d49</a>)
- (Jellyfrog)</li>
</ul>
<h2><!-- raw HTML omitted -->➖ Remove</h2>
<ul>
<li>Commit and push step from build job (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2538">#2538</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/be393a90381e27c9fec2c8c2e02b00f005710145">be393a9</a>)
- (Tonye Jack)</li>
</ul>
<h2><!-- raw HTML omitted -->🔄 Update</h2>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2592">#2592</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/3dbc1e181273d808ccff822a6e00cf18b6628ef0">3dbc1e1</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2591">#2591</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/b1ccff8c0892ad141d7d2de6f31e526a9dad931f">b1ccff8</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2574">#2574</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/050a3d3360d29711ee9d8210fc639d902d23ad07">050a3d3</a>)
- (github-actions[bot])</p>
<h2><!-- raw HTML omitted -->📚 Documentation</h2>
<ul>
<li>Update link to glob patterns (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2590">#2590</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/a892f50f7a7187bc288633c09230b09ce7ad8fd0">a892f50</a>)
- (Tonye Jack)</li>
<li>Add Jellyfrog as a contributor for code, and doc (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2573">#2573</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/f000a9b97f254f9590ff26f651cccde827ad36da">f000a9b</a>)
- (allcontributors[bot])</li>
</ul>
<h2><!-- raw HTML omitted -->🧪 Testing</h2>
<ul>
<li>Manual triggered workflows (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2637">#2637</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c2ca2493190021783138cb8aac49bcee14b4bb89">c2ca249</a>)
- (Tonye Jack)</li>
</ul>
<h2><!-- raw HTML omitted -->⚙️ Miscellaneous Tasks</h2>
<ul>
<li><strong>deps-dev:</strong> Bump jest from 30.0.5 to 30.1.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2655">#2655</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/9a6755550a331fdcc8ec45443738933f8fa22eea">9a67555</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.1.0 to 2.2.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2660">#2660</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/b67e30df88f43e244f4e83775e5ad8335114fb95">b67e30d</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.30.2 to
3.30.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2661">#2661</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/62aef422ffa195474d80d73387535cf4622b2824">62aef42</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.11 to
3.30.2 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2659">#2659</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e874f3cddd0f54ae776e6995ae6dae4cf40fd3d3">e874f3c</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump actions/setup-node from 4.4.0 to 5.0.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2656">#2656</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/8c14441336bb3d84fd6b7fa83b6d7201c740baf5">8c14441</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.3.0 to 24.3.1 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2657">#2657</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e995ac4be5be2bcb6e29556edc51fb63aca6b49b">e995ac4</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.2.1 to 24.3.0 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2649">#2649</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/3b04099b21072562f07469c10deb182b24236ca9">3b04099</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.9 to
3.29.11 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2651">#2651</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e7b6c977e51984988e3cc1d6b18abe2a3ba8daaa">e7b6c97</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.0.2 to 2.1.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2648">#2648</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/765d62bc041415a5b494ef13d02d566128b25973">765d62b</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.8 to
3.29.9 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2647">#2647</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/2036da178f85576f1940fedb74bb93a36cd89ab7">2036da1</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-ac…
mormio pushed a commit to NousResearch/torchtitan that referenced this pull request Mar 26, 2026
* [TorchComms] add testing badge at experiments readme (#2010)

* [compiler toolkit] specify passes through config (#2006)

We should be able to control what passes to run in the compiler. This PR
uses the config compile.passes to indicate in a list of graph passes to
apply on the captured gm.

By default, no pass is applied. Users can specify what passes to apply.

Currently there are `autobucketing_reordering_pass` and
`regional_inductor_pass`.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes autobucketing_reordering,regional_inductor
```

Also updated CI to include this new config

* [simplefsdp] fix region ac in zero2-style FSDP (#1970)

After some offline discussion, we've concluded that life would be easier
if we put SimpleFSDP's checkpoint logic for `reshard_after_forward`
into the compiler. The AC annotation part is borrowed from AP:
[LINK](https://github.com/meta-pytorch/autoparallel/blob/main/autoparallel/activation_checkpointing.py#L69).

**Trace and Loss Check** (all with torch.compile enable)

reshard_after_fwd = False
1. SAC + llama3
([trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-05-06_rank0_trace.json))
<img width="768" height="115" alt="Screenshot 2025-10-30 at 4 28 59 PM"
src="https://github.com/user-attachments/assets/e4e22335-2e3f-46c8-8def-a60d592fee0a"
/>

<img width="689" height="512" alt="Screenshot 2025-11-05 at 9 02 30 PM"
src="https://github.com/user-attachments/assets/40a71316-a457-4e72-9002-cc8beea8f32c"
/>


2. Full AC + llama3 [(trace)]()

<img width="729" height="105" alt="Screenshot 2025-10-30 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/e8d63460-579b-4f0a-8504-851480e5b548"
/>

<img width="789" height="763" alt="Screenshot 2025-11-05 at 9 11 34 PM"
src="https://github.com/user-attachments/assets/1a13d09e-04c4-4db9-99fe-cf10d24bf7f5"
/>


3. No AC + llama3
[[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-03-50_rank0_trace.json)]

<img width="748" height="115" alt="Screenshot 2025-10-30 at 4 32 05 PM"
src="https://github.com/user-attachments/assets/20104d24-9d45-4eba-b694-815e133b88d0"
/>

<img width="800" height="764" alt="Screenshot 2025-11-05 at 9 07 46 PM"
src="https://github.com/user-attachments/assets/55b104ce-8ec1-4ed6-95e7-300e96ad55af"
/>


reshard_after_fwd = True

1. SAC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-34-24_rank0_trace.json))

<img width="795" height="108" alt="Screenshot 2025-10-31 at 11 34 47 AM"
src="https://github.com/user-attachments/assets/a3988f72-7e87-4e52-90f9-8bee840cd6f4"
/>


2. Full AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-36-27_rank0_trace.json))

<img width="593" height="110" alt="Screenshot 2025-10-31 at 11 38 02 AM"
src="https://github.com/user-attachments/assets/5ee61b2b-9600-4af8-9a24-61b3564f93ca"
/>


3. No AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-02-44_rank0_trace.json))


<img width="701" height="109" alt="Screenshot 2025-10-31 at 11 43 04 AM"
src="https://github.com/user-attachments/assets/576b28f6-dae4-4ff7-b005-57b0cf9ad7cc"
/>

* [SimpleFSDP] Add typing to simple_fsdp.py (#2001)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #2002
* __->__ #2001

Add typing, credit to Claude.

* [Full DTensor][Reland] Add full_dtensor flag (#2013)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2013

When full_dtensor is True, the compute_placement will be preserved. This
means that `to_local()` won't be called for the fsdp-only case. The nD
parallelism case (fsdp + tp) will error out, as we have not implemented
that case.

This argument doesn't affect the current simple_fsdp. We have verified
the `full_dtensor=True` case with the full dtensor skeleton PR, which will
be published once it is ready.

**This is a reland PR of
https://github.com/pytorch/torchtitan/pull/2002. The previous one was
broken during rebase.**

* set pg names (#1986)

Summary:
- we need to pass the global rank information to pytorch so that the pg
name can include the pg information
- this is necessary to differentiate the default pg's on different
replicas
- these need to be different because flight recorder matches collectives
based on pg name as well
- add ft training to experiments folder, we'll move remaining pieces of
ft to this gradually but make new features only available through this
folder

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1986).
* #1988
* #1987
* __->__ #1986

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>

* Fix the error message of maybe_enable_async_tp() (#2011)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.

* Add dry run mode (#2012)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2012
* #2011

Summary:
The current configuration validation requires torchx and GPUs. It can
waste time, resources, and energy. Polar bears are crying. Let's fix
this by providing a dry run mode. This PR doesn't verify everything. In
theory, we should be able to verify parallelism settings as well. This
PR is just a start, but it at least lets us catch typos quickly.

* [easy] [compiler toolkit] Clean up unused function (#2014)

As titled. `_clear_traced_params_buffers` is no longer being used as we
have switched the dynamo graph capture API.

* Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2016)

Addressing the following issues in this PR:

- Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Revert PR-2016 & Redo "Run Torchtitan ROCm workflow on cron schedule & push to Main branch only" (#2017)

Reverts PR: https://github.com/pytorch/torchtitan/pull/2016
Addressing the following issues in this PR:
- Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* [compiler toolkit] Add tests and scripts for numerics check (#2015)

This PR adds the utils to automatically check the training numerics
(losses, grad norms) of two runs to verify if they have bitwise
equivalence.

The added script triggers two runs with user-defined configs. Then it
loads metrics saved during training and compares the numerics to verify
bitwise equivalence. Currently we check losses and grad norms during
training steps.

For example, we want to compare the numerics between compiler toolkit
with aot_eager backend and eager on llama3-8B.
```
python torchtitan/experiments/compiler_toolkit/scripts/check_numerics.py --ngpu 4 --config-file torchtitan/models/llama3/train_configs/llama3_8b.toml --dp-shard-degree 2 --tp-degree 2
```
It'll run the `simple_fsdp` experiment without `torch.compile` as the eager
baseline, and the `compiler_toolkit` experiment as the compiled run. Then it
compares the training numerics of these two runs to verify bitwise
equivalence.
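The bitwise-equivalence check amounts to exact float comparison per training step; an illustrative sketch (not the actual script, names are hypothetical):

```python
def assert_bitwise_equal(baseline_losses, test_losses):
    """Raise AssertionError unless every step's value matches exactly."""
    assert len(baseline_losses) == len(test_losses), "step counts differ"
    mismatches = [
        (step, a, b)
        for step, (a, b) in enumerate(zip(baseline_losses, test_losses))
        if a != b  # exact comparison: no tolerance, so any drift fails
    ]
    assert not mismatches, f"first mismatch at step {mismatches[0][0]}"
    return len(baseline_losses)
```

Note that `!=` follows IEEE-754 semantics (`-0.0 == 0.0`, `NaN != NaN`), so a stricter check could compare raw bit patterns via `struct.pack` instead.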

When it is bitwise equivalent, we'll see the following output
```
Starting training: simple_fsdp.llama3
✓ Training completed: simple_fsdp.llama3

Starting training: compiler_toolkit.llama3
✓ Training completed: compiler_toolkit.llama3
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
✓ SUCCESS: All metrics are bitwise equivalent
```

Also added unit-tests in `compiler_toolkit/tests/test_numerics.py` so
that we can guard working parallelism combinations that already have
bitwise equivalence in CI.

* Add .claude to .gitignore (#2026)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title

* Fix dry run mode (#2027)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry run mode works, but it doesn't exit gracefully in all cases. This PR
fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh   --training.steps=10 --activation_checkpoint.mode="none"
--debug.deterministic --debug.seed=42
```

* [Compiler Toolkit] Make compiler toolkit work with checkpoint (#2030)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule will result in an "inner" prefix for
everything. This
PR fixes it by overloading the methods.

Also merge https://github.com/pytorch/torchtitan/pull/2028 to this PR.
Something wrong with ghstack.

* [Flux] Update integration test badge in README.md (#2019)

Fixes the badge in the `README.md` file

* Print device and stride when print module (#2045)

Before:
<img width="978" height="93" alt="image"
src="https://github.com/user-attachments/assets/48dc39d9-e897-4396-ac62-025574303403"
/>


After:
<img width="1318" height="82" alt="image"
src="https://github.com/user-attachments/assets/47b4771a-aaf9-4f61-80bc-757f3a08c1d2"
/>

* [SimpleFSDP] add manual bucketing pass (#1881)

This PR adds support for aten-level manual bucketing in
SimpleFSDP+`aot_eager` backend. Dependent on PyTorch
[PR](https://github.com/pytorch/pytorch/pull/165487)

TODO List:
- [ ] We should have a better way of handling region info other than a
list of str FQNs in the current `manual_bucketed_modules`. It would be very
easy to miss some model modules. (cc. @xmfan @SherlockNoMad )
- [ ] Currently, the reordering happens under the hood and overlaps with
the last/next compute. We should allow users to specify which modules they
want to reorder.
- [ ] Loss difference on multi-node training
- [ ] DSV3 manual bucketing

I'll address the TODO items in follow up PRs. Let's start with this
simple FSDP+TP+llama3 PR.

1. Performance (FSDP2 under eager mode, SimpleFSDP uses `aot_eager`
backend)

**Llama 3-8B**

* Performance (All Batch_size = 1). (The slower TPS on Single Node is
sort of as expected, since FSDP2 handles copy-in/out in two different
streams, whereas SimpleFSDP handles copy-in/out in the same stream)

|Node| Method | Parallelism | Memory | TPS | Trace|
|---------|---------|-----------|----------|------|------|
|1-Node (8H100)|SimpleFSDP | FSDP=8| 40.96GiB(43.12%) | 7,227|
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-48-48_rank0_trace.json)|
|1-Node (8H100)|FSDP2-eager| FSDP=8| 47.82GiB(50.35%) | 7,380 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-54-14_rank0_trace.json)|
|8-Node (64H100)|SimpleFSDP| FSDP=64  | 29.37GiB | 4,984| |
|8-Node (64H100)|FSDP2| FSDP=64 | 31.41GiB  |5,097 | |
|1-Node (8H100)|SimpleFSDP| FSDP=4 TP=2 | 28.28GiB(29.77%) | 5,881 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-18-00-18_rank0_trace.json)
|
|1-Node (8H100)|FSDP2| FSDP=4 TP=2 | 35.33GiB(37.20%) | 5,898 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-15-35-47_rank0_trace.json)
|
|8-Node (64H100)|SimpleFSDP| FSDP=8 TP=8  |   |||
|8-Node (64H100)|FSDP2| FSDP=8 TP=8 |   |||

Example SimpleFSDP 1D overlapping trace:

<img width="1127" height="127" alt="Screenshot 2025-10-16 at 10 49
55 AM"
src="https://github.com/user-attachments/assets/2d9e3ff8-8e9b-40a7-a666-3c0a0975186e"
/>

Example SimpleFSDP 2D overlapping trace:
<img width="1162" height="166" alt="Screenshot 2025-10-26 at 6 00 51 PM"
src="https://github.com/user-attachments/assets/bc5cc031-5b6c-4e4d-a9da-70c43114f49a"
/>


- Bitwise Loss:

FSDP-only:
<img width="1266" height="837" alt="Screenshot 2025-10-17 at 10 41
56 AM"
src="https://github.com/user-attachments/assets/30f83d95-1eca-4f10-9e7e-47c45278cd8d"
/>

FSDP+TP:
<img width="1259" height="808" alt="Screenshot 2025-10-26 at 9 03 58 PM"
src="https://github.com/user-attachments/assets/b75b452b-adb9-4078-9412-ee9e584ffe15"
/>

* Add export_dtype parameter to `convert_to_hf` function (#2041)

The current `convert_to_hf.py` does not support `export_dtype`, which
makes it `float32` by default. This PR adds support for export dtypes of
`["float16", "bfloat16", "float32"]`.
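A plausible CLI shape for such a flag (hypothetical names; the actual `convert_to_hf.py` arguments may differ), where the default preserves the previous `float32` behavior:

```python
# Hypothetical sketch of the new flag; actual argument names may differ.
import argparse

EXPORT_DTYPES = ["float16", "bfloat16", "float32"]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--export_dtype",
    choices=EXPORT_DTYPES,
    default="float32",  # keeps the pre-existing default behavior
    help="dtype used when writing the HF checkpoint",
)

args = parser.parse_args(["--export_dtype", "bfloat16"])
print(args.export_dtype)  # bfloat16
```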

* [compiler toolkit] Port joint_ac_pass from simplefsdp (#2051)

This PR integrates the changes in #1970 into the compiler toolkit (applying
`joint_ac_pass` on the joint graph to tag nodes based on the
`reshard_after_forward` flag).

Also did some refactoring for applying graph passes in the compiler toolkit
experiments. We will have two kinds of passes:

1. joint_custom_passes: these are passes to be applied on the captured
joint graph before the partitioner. By default we apply
`validate_flex_attn_annotation_pass` and `fsdp_reshard_after_fwd_pass`.

2. compiler_passes: these are passes to be applied on the partitioned fwd
and bwd graphs as backend optimizations. By default there are none. We
can enable `autobucketing_reordering_pass` and
`regional_inductor_pass` via configs.

* [compiler toolkit] Port manual bucketing from SimpleFSDP experiment (#2056)

This PR integrates the manual bucketing pass (transformer block
bucketing) added in SimpleFSDP experiment (#1881) to compiler toolkit

So now the compiler toolkit can also run the manual bucketing pass by
specifying the config

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes transformer_block_bucketing
``` 

Also updated README and integration test to include the newly ported
pass

* Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2018)

Addressing the following issues in this PR:

Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Add a loss comparison script (#2029)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029


## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits
with deterministic training
- Configuration Comparison: Compare different training configurations on
the same commit
- Reproducibility: Automatically enables deterministic mode and seed
checkpointing for reproducible
  comparisons
- Real-time Output: Streams training output to both console and log
files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and
summary statistics
- CI Testing: Includes --assert-equal flag for automated testing to
verify identical losses

## Usage Examples

#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration 
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2"
```

#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1"
```

#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```


## Real Use Cases
Compare full dtensor simple fsdp with fsdp2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"' \
--test-train-file='torchtitan.experiments.full_dtensor.train' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```

* Fix integration test gpu_arch_type field (#2060)

All tests in experiments are broken due to the `gpu_arch_type` field
added in #2018.

* [compiler toolkit] Add Trainer subclass for compiler toolkit (#2064)

Adding CudaGraph pass (https://github.com/pytorch/torchtitan/pull/2050)
would require some custom logic in Trainer's close() method.

So we create a Trainer subclass in compiler toolkit

* Let loss_compare.py check the repo cleaness (#2062)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing git checkout

* CUDAGraph support for SimpleFSDP and TP (#2050)

## Features
- [x] Support SimpleFSDP and TP
- [x] Support static input indices to reduce copy
- [x] Support memory reuse to reduce memory consumption
- [x] Cleanup cudagraph when training finishes to avoid nccl hang from
destroy_process_group

Command:
```
NCCL_GRAPH_REGISTER=0 NGPU=8 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph
```


Note: we use `NCCL_GRAPH_REGISTER=0` due to a known issue that nccl +
cudagraphs + expandable segments result in IMA.
https://github.com/pytorch/pytorch/issues/158029


[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces%2Ftree%2Fshared_trace%2Fboyuan_e1ef464b-ee61-4c61-82e5-f7a485e561bf_rank0_trace.json)

## Result

**Numerics:**
Achieved bitwise equivalence w/ and w/o cudagraph pass on llama3.1-8B
AND llama3.1-70B.

**Performance:**
<img width="560" height="90" alt="image"
src="https://github.com/user-attachments/assets/9d54c461-0eb1-4f7e-9652-3d52043ad74f"
/>

Raw log:
[llama3-8b](https://www.internalfb.com/phabricator/paste/view/P2045444190),
[llama3-70b](https://www.internalfb.com/phabricator/paste/view/P2045567416)

**Memory:**
On llama3.1-70b, cudagraph consumes ~6% more memory (143 GiB vs
153 GiB).

A few tricks to reduce memory consumption (use llama3.1-70b w/ cudagraph
as an example):
- Start: 161 GiB
- \+ use the same stream for warmup and graph capture of both fwd and
bwd: 160 GiB
- \+ warmup in cudagraph memory pool instead of eager memory pool: 153
GiB


**static input copy:**
On llama3.1-70B, for forward, we copy 1 tensor of 128 bytes; for
backward, we copy 1 tensor of 0.98 GB. This shows static input indices
are handled correctly.


## Followup PR
In the followup PR, I will enable fx graph partition for deepseek v3
https://github.com/pytorch/pytorch/pull/165945.

* compiler_toolkit: fix args access (#2067)

This PR fixes access to args; it's an attribute, not a variable in the
scope.
The method itself, though, would not be used, because
`should_check_address` seems to always be `False` and there doesn't seem
to be a command-line argument for it.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* 3outeille/transformers backend (Dense model only) (#2048)

# Context
Reference PR: https://github.com/huggingface/torchtitan/pull/1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP
(and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF models' weights initialisation. Without this, the
`loss` and `grad_norm` start very high.

# Usage

- Requirements `transformers==4.57.1`
- Config:
`torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`
```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```
- Train: `LOG_RANK=7
CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml
./run_train.sh
--job.custom_config_module=torchtitan.experiments.transformers_backend.job_config
--compile.enable`

<img width="1334" height="453" alt="image"
src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c"
/>

# Testing methodology

<img width="2672" height="2018" alt="image"
src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7"
/>

- Following the
[converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md)
guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other
//-ism>`
- More precisely, `test_hf_integration.py` produces the following layout:

```bash
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
```
- Here is the grid search to test the HF modelling
```shell
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR https://github.com/huggingface/torchtitan/pull/3)
	- Missing `build_optimizers_with_moe_load_balancing` support for MoE
	- Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test the
`loss` and `grad_norm` do not match bitwise (though they converge),
while they do with the Torchtitan modeling (issue tracked in
https://github.com/huggingface/torchtitan/pull/4)
- Add convergence tests to CI using a tiny model + gloo backend (once
PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `import torch._dynamo.config;
torch._dynamo.config.cache_size_limit = 128` to avoid graph
recompilation when using `torch.compile` with activation checkpointing

* adding variable length attention to llama3 8b   (#2000)

**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to CUDA kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
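Beyond q/k/v, the key extra input a varlen attention kernel needs is the cumulative sequence-length offsets that mark where each packed document starts. A minimal sketch of that bookkeeping in plain Python (the helper name is illustrative, not torchtitan's API):

```python
def cu_seqlens(doc_lens):
    """Build cumulative sequence-length offsets for a packed batch.

    For documents of lengths [3, 5, 2] packed back to back into one
    sequence of 10 tokens, varlen-style kernels expect boundary offsets
    [0, 3, 8, 10]: document i spans tokens [cu[i], cu[i+1]).
    """
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

print(cu_seqlens([3, 5, 2]))  # [0, 3, 8, 10]
```

In the real kernel these offsets let attention stay within each document's boundaries instead of attending across packed documents.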

* remove scatter_add in MoE implementation (#1974)

PR for removing `scatter_add` in the MoE implementation. `scatter_add`
is somewhat problematic as it is non-deterministic due to the necessity
of [atomic
adds](https://discuss.pytorch.org/t/why-does-index-add-and-scatter-add-induce-non-deterministic-behavior-on-the-cuda-backend/45544/2)
for correctness.

Determinism, correctness, and performance tests using scripts under
`torchtitan/moe_bench_and_test`:

```
# Determinism: run same forward 100x and compute standard deviations
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_determinism

out_old_std=tensor(0.0297, device='cuda:0', dtype=torch.bfloat16)
out_std=tensor(0., device='cuda:0', dtype=torch.bfloat16)
out_old_std/out_moe_old.abs().mean()=tensor(0.0006, device='cuda:0', dtype=torch.bfloat16)
out_std/out_moe.abs().mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
```

```
# Accuracy: compare MoE outputs to FFN outputs, with weights set such that outputs should be the same
# Relative error decreased by 3x
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_moe_ffn_equivalence

moe_old_rel_err=0.009754068047048696
moe_rel_err=0.002507858727736454
moe_old_rel_err/moe_rel_err=3.8894009216589858
```

```
# Timing: triton do_bench for DSv3 16B layer fwd + bwd. ~3% faster runtime
python torchtitan/moe_bench_and_test/moe_timing.py moe_old && python torchtitan/moe_bench_and_test/moe_timing.py moe

args=Namespace(cls='moe_old', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.712812881469727

args=Namespace(cls='moe', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.03301840562087

```

```
# Memory: for DSv3 16B layer fwd + bwd. ~15% reduction in active mem, ~18% in reserved mem.
python torchtitan/moe_bench_and_test/moe_memory.py moe_old && python torchtitan/moe_bench_and_test/moe_memory.py moe

args=Namespace(cls='moe_old', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.926029682159424
peak_stats.max_reserved_gib=7.224609375

args=Namespace(cls='moe', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.051033020019531
peak_stats.max_reserved_gib=5.91015625
```

Testing fwd + bwd correctness for `tp_degree=ep_degree=world_size=8` and
`etp=1`
```
# Similar relative errors
torchrun --nproc-per-node 8 torchtitan/moe_bench_and_test/test_tp.py

args=Namespace(seqlen=256, bsz=4, tol=0.01), world_size=8, tp=8, ep=8, etp=1

err_ratio_fsdp_ep_old=0.0028211805268959435
err_ratio_fsdp_ep=0.002805679534989922
err_ratio_ep_ep_old=0.0022941468020912068

kl_fsdp_ep_old=tensor(2.4915e-05, device='cuda:0', dtype=torch.bfloat16)
kl_fsdp_ep=tensor(2.0981e-05, device='cuda:0', dtype=torch.bfloat16)
kl_ep_ep_old=tensor(2.1458e-05, device='cuda:0', dtype=torch.bfloat16)
```

Everything under `torchtitan/moe_bench_and_test` consists of temporary
testing utilities and will be deleted prior to merging.
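The non-determinism being removed comes from floating-point addition not being associative: atomic adds can land in any order, so the same reduction can give different results run to run. A tiny standalone illustration (not the MoE code itself):

```python
# Floating-point addition is not associative, so a reduction whose
# summation order varies run to run (as with GPU atomic adds) is
# non-deterministic.
a, b, c = 0.1, 1e20, -1e20

left_to_right = (a + b) + c  # 0.1 is absorbed by 1e20, then cancelled: 0.0
small_first = a + (b + c)    # the large terms cancel first, so 0.1 survives

assert left_to_right != small_first
```

Replacing `scatter_add` with an order-stable combine fixes the summation order, which is why the new implementation's 100x-repeated forward shows zero standard deviation.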

* Update transformers backend name (#2075)

following Huggingface efforts in VLLM (cf
https://github.com/vllm-project/vllm/pull/28725), we would like to
uniformize the naming and make sure that people think we use the HF
models only

* Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses (#2063)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2063

This PR allows us to check whether the loss is consistent across commits/PRs.
1. This PR contains a pre-tested loss result file.
2. This PR improves loss_compare.py to add --import and --export
options.
3. In CI, --import is used to get the previous losses and compare them
with the current PR's. If anything mismatches over the 10 steps, the CI
will fail.

* Print out the version number (#2083)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2083

This PR and https://github.com/pytorch/torchtitan/pull/2070 can resolve
https://github.com/pytorch/torchtitan/issues/2043.

This should not affect `.github/scripts/update_version.sh` as
`.github/scripts/update_version.sh` will append the version at the end
of the file, which will overwrite the value.

* Autoparallel as an experiment in main (#2054)

Experiments like SimpleFSDP/Compiler Toolkit/Autoparallel are all being
developed at the same time, and SimpleFSDP/Compiler Toolkit both run
into issues with PP that requires the PP utilities from Autoparallel. We
want to land the Autoparallel experiment into main to facilitate that
sharing.

---------

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Co-authored-by: ruisizhang123 <ruisizhang123@gmail.com>
Co-authored-by: Brian Hirsh <briandhirsh@gmail.com>
Co-authored-by: Will Constable <willconstable@gmail.com>

* skip varlen integration test on rocm (#2085)

As title, since varlen attention is not supported on ROCm.

* [Local Tensor] Replace dry_run.py with fake mode implementation (#2057)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2057

Replaces `dry_run.py` implementation with fake PG mode for DRY_RUN
configuration validation. This PR also adds support of Local tensor mode
to provide deeper validation coverage.

**Note:** Currently returns early before `init_weights()` if using local
tensor mode due to some limitation of local tensor, which will be fixed
by https://github.com/pytorch/pytorch/pull/166540 .

* add varlen attention for qwen 3 (#2084)

As title 

**Testing**

<img width="469" height="431" alt="Screenshot 2025-11-24 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/6b9a362d-de36-48b7-b465-d91ae24f4cbf"
/>

Performance and loss are on par.

* [FLUX] Add FLUX inference test in CI (#1969)

* Improve logging by formatting the dict as JSON. (#2094)

We use Slurm to run jobs, and I just noticed that job configs and model
args were being logged on a single line by default, which makes the logs
hard to read.

This PR improves readability by formatting these dictionaries with
`json.dumps` before logging, so the configs are formatted nicely and
easier for humans to read.

before:
<img width="2594" height="640" alt="image"
src="https://github.com/user-attachments/assets/c3c07b09-d12c-484d-aa90-a626cd25c6d2"
/>

after:
<img width="2252" height="1032" alt="image"
src="https://github.com/user-attachments/assets/4cbde979-c34c-4fc5-aa55-f280f39cf9ef"
/>
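The pattern is simply formatting the dict with `json.dumps` before handing it to the logger; a minimal sketch (not the exact torchtitan call site):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

config = {"model": {"name": "llama3", "flavor": "debugmodel"},
          "training": {"steps": 10}}

# Before: the whole dict lands on one unreadable line.
logger.info("Job config: %s", config)

# After: pretty-printed, one key per line.
logger.info("Job config:\n%s", json.dumps(config, indent=2))
```

For configs containing non-JSON-serializable values, `json.dumps(..., default=str)` is a common fallback.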

* add all SDPA backends to op_sac_save_list (#2095)

As we discussed in https://github.com/pytorch/torchtitan/issues/2091, we
should add all `scaled_dot_product_attention` backends to
`op_sac_save_list` to avoid recomputing attention during backward.

* modify save list for varlen attn (#2082)

adding varlen attention ops to ac save list

**testing**

Used DebugMode() to print out the op list and verified that the forward
is not recomputed in the backward step.

```
[rank0]:forward ops
[rank0]:varlen_attn in forward: True
...
[rank0]:varlen_attn recomputed in backward: False
[rank0]:saved correctly
```

* Make sure log after distributed initialized. (#2102)

There is a condition check in config logging for distributed
initialization, so the config logging has to happen after distributed
has been initialized.

Co-authored-by: Zhiqiang Zang <zzq@fb.com>

* [mxfp8] [docs] [BE] add MXFP8 usage documentation and benchmarks (#2096)

Fixes #1998

* Mark input tokens to routed experts as dynamic to avoid a recompile (#2007)

Stacked PRs:
 * __->__#2007


--- --- ---

Mark input tokens to routed experts as dynamic to avoid a recompile


This saves 1 recompile, and you can see the input tokens are dynamic
from the first graph compiled:
```python
class GraphModule(torch.nn.Module):
    def forward(...s77: "Sym(s77)", L_x_: "bf16[s77, 5120][5120, 1]cuda:0"...
```

I verified that this also fixes the AC recompile issue of:
https://github.com/pytorch/torchtitan/issues/1971. But I'm keeping
`torch._C._dynamo.eval_frame._set_lru_cache(False)`, as there could be
other recompile reasons popping up.

* fix mxfp8 loss image (#2104)

In the original PR i moved the image location without updating the
markdown pointing to it by accident. This fixes that.

* Update hf_assets_path for llama4 (#2110)

Fix typo in train_config, hf asset should be for maverick, see:

https://huggingface.co/meta-llama/models?search=128e

* Enables parsing of --compile.components through CLI (#2115)

Without this PR, I'm not able to pass `--compile.components=model,loss`.
Tested using `python -m torchtitan.config.manager
--compile.components=model,loss`.
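A common way to make argparse accept a comma-separated list on a single flag is a `type` callable that splits the string; a generic sketch of the pattern (not torchtitan's actual config manager):

```python
import argparse

parser = argparse.ArgumentParser()
# Split "model,loss" into ["model", "loss"] at parse time.
parser.add_argument(
    "--compile.components",
    dest="compile_components",  # avoid a dotted attribute name
    type=lambda s: s.split(","),
    default=[],
)

args = parser.parse_args(["--compile.components=model,loss"])
print(args.compile_components)  # ['model', 'loss']
```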

* fix `ForgeEngine` compatibility issue with (#2121)

Summary:
Fix backward incompatible changes introduced in 


https://github.com/pytorch/torchtitan/commit/f29828bbc8018c9374861aff142c658e2e08e8b4

Differential Revision: D88572518

* Remove the hack for SAC + FlexAttention (#2118)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2118

PyTorch can now support torch.compile inside the SAC region even if
torch.compile is not used to wrap SAC. This PR removes the workaround to
ensure torch.compile works with Flex

* Add warning to run_tests (#2123)

Small addition since right now running a test that doesn't exist just
outputs nothing, e.g.

`python -m tests.integration_tests.run_tests ./test-out --test_name
does_not_exist`

Now the output is:

`
WARNING:root:No tests were run for --test_name 'does_not_exist' in test
suite 'features'.
Available test names in 'features' suite: ['default', '1d_compile',
'1d_compile_sac_op', '2d_eager', '2d_compile', 'full_checkpoint',
'model_only_hf_checkpoint', 'last_save_model_only_fp32',
'last_save_model_only_bf16', 'pp_looped_zero_bubble', 'pp_zbv',
'pp_1f1b', 'pp_gpipe', 'pp_dp_1f1b', 'pp_dp_gpipe', 'pp_tp', 'pp_dp_tp',
'3d_compile', 'pp_looped_1f1b', 'pp_custom_csv', 'optimizer_foreach',
'ddp', 'hsdp', 'fsdp+flex_attn', 'fsdp+flex_attn+per_op_sac',
'fsdp+varlen_attn+per_op_sac', 'cp_allgather', 'cp_alltoall', 'hsdp+tp',
'fsdp+cp', 'hsdp+cp_without_dp_shard', 'hsdp+cp_with_dp_shard',
'fsdp+tp+cp', 'cpu_offload+opt_in_bwd+TP+DP+CP', 'test_generate',
'fsdp_reshard_always', 'optional_checkpoint', 'float8_emulation',
'gradient_accumulation', 'validation_tp_cp_pp']
`

* [compiler toolkit] Disable CUDAGraph integration test (#2127)

As titled. We'll enable when it is fixed.

* Add CI for Autoparallel experiment llama3 on 4 GPUs (#2105)

* Support rope cache indexing using positions (#2112)

Add support for indexing the rope cache using `position_ids`; this may
be needed during
1. inference, where we pass `position_ids` into the transformer forward
2. CP load balancing, where we need to index the rope cache given
position ids
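Conceptually, the change replaces a contiguous slice of the rope cache with a gather on explicit positions; in plain Python terms (illustrative, not the actual model code):

```python
# A precomputed rope "cache": one entry per absolute position.
rope_cache = [f"freqs@{pos}" for pos in range(8)]

# Default training path: positions are just 0..seq_len-1,
# so a contiguous slice suffices.
seq_len = 4
contiguous = rope_cache[:seq_len]

# Inference / CP load balancing: gather by explicit position ids,
# which need not be contiguous or start at 0.
position_ids = [5, 6, 7, 2]
gathered = [rope_cache[p] for p in position_ids]

print(contiguous)  # ['freqs@0', 'freqs@1', 'freqs@2', 'freqs@3']
print(gathered)    # ['freqs@5', 'freqs@6', 'freqs@7', 'freqs@2']
```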

Test: 
running dpskv3 16b base
<img width="489" height="286" alt="image"
src="https://github.com/user-attachments/assets/6f463d65-a0de-413d-ab19-770db9983dbb"
/>

also tested in https://github.com/wwwjn/torchtitan/pull/1/files when
passing position_ids
<img width="665" height="269" alt="image"
src="https://github.com/user-attachments/assets/70e4bddc-0334-4dbf-b00d-6e4b49a94655"
/>

---------

Co-authored-by: JessicaZhong <zhengjesszhong@gmail.com>

* [forge] allow torchforges to set checkpoint base folder (#2131)

This PR
1) allows Torchforge to decide where to put the checkpoint, wandb,
etc., instead of the "current" folder
~~allowing Torchforge to decide to print / log the configs~~

* Rename auto_parallel experiment to autoparallel (#2128)

* PyTorch depends on psutil (#2132)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2132

TorchTitan should also depend on psutil.

* Remove caching for attention masks (#2117)

We remove the lru_cache for attention masks because, in the
get_attention_mask() function, `and_masks(*mask_mods)` returns a new
object (with a new id) on every call. `create_attention_mask` uses all
parameters as the cache key, so the new object id always causes a cache
miss.
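The failure mode is easy to reproduce with `functools.lru_cache` alone: if one argument is a freshly created object on each call, the cache key differs every time and every lookup misses. A standalone illustration (the names mimic the real ones but the code is a toy, not torchtitan's):

```python
import functools

def and_masks(*mask_mods):
    # Returns a *new* function object on every call, like the real and_masks.
    return lambda *args: all(m(*args) for m in mask_mods)

@functools.lru_cache
def create_attention_mask(mask_mod, seq_len):
    # mask_mod is part of the cache key; functions hash by identity.
    return f"mask(seq_len={seq_len})"

causal = lambda *a: True
for _ in range(3):
    create_attention_mask(and_masks(causal), 128)

info = create_attention_mask.cache_info()
print(info.hits, info.misses)  # 0 3 -- every call missed the cache
```

Since the cache never hits, it only costs memory, which is why removing it is the right fix.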

Before the change: (llama3 debugmodel_flex_attn)
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 27 45 PM"
src="https://github.com/user-attachments/assets/e9af2597-9d94-4478-8136-8b9b8c35d9e6"
/>

After the change:
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 29 56 PM"
src="https://github.com/user-attachments/assets/756a7d09-b47f-434f-8ff6-40098b265a03"
/>

* Clarify contribution guidelines. (#2134)

* Enable PP and EP overlap for MoE (#1721)

Option 2 of https://github.com/pytorch/torchtitan/issues/1682

These changes add a custom `overlap_callback` function to replace the
OVERLAP_F_B action that is run during the schedule execution. In the
custom function, we write `run_forward()` and `run_backward()`.
`run_backward()` is run as a separate thread so that we can have both
forward and backward running together side by side. Looks like this:

<img width="1321" height="443" alt="image"
src="https://github.com/user-attachments/assets/911f3637-1afa-4537-989a-a325ba558957"
/>

In order for these changes to work with Expert Parallel, we also need to
add custom autograd functions to act as the boundary points at which we
do communication. We added hooks before and after expert parallel
dispatch and combine to signal boundary points, so our figure from
before now turns into:

<img width="1382" height="388" alt="image"
src="https://github.com/user-attachments/assets/3991749d-7d67-4098-81a4-4efcfd1c75ca"
/>

Now in each of these red blocks, we use a global coordinator. We need
`threading.Barrier(2).wait()` so that the comm and compute from our
forward and backward steps are scheduled in lock-step before continuing.

DSv3 16B run command:
```
TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=8  CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh
```

Trace examples:

<img width="2409" height="1889" alt="image"
src="https://github.com/user-attachments/assets/923efc8f-9241-4646-aba0-ccc846d3932b"
/>

Test command:

`python -m tests.integration_tests.run_tests ./test-out --test_name
pp_dualpipev --test_suite models`

---------

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* Fix apply_compile called multiple times in PP initialization (#2135)

Stacked PRs:
 * __->__#2135


--- --- ---

PP initialization calls apply_compile multiple times, once per pp stage.
But apply_compile does some global patching. So I add `already_patched`
to avoid patching the same method multiple times.

If we patch multiple times, the second time will wrap
`_run_experts_grouped_mm_dynamic` in a torch.compile(fullgraph=True)
leading to the error in the issue below.

FIXES https://github.com/pytorch/torchtitan/issues/2124
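The guard is the standard idempotent-monkey-patch pattern: record that a method has been wrapped and skip on repeat calls. A minimal sketch (class and method names are illustrative stand-ins):

```python
def expensive_wrap(fn):
    # Stand-in for torch.compile(fullgraph=True) in the real code.
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapped

class Experts:
    def _run_experts_grouped_mm_dynamic(self, x):
        return x * 2

def apply_compile(cls):
    # Without the guard, calling this once per PP stage would wrap the
    # already-wrapped method again (compile-inside-compile).
    if getattr(cls, "_already_patched", False):
        return
    cls._run_experts_grouped_mm_dynamic = expensive_wrap(
        cls._run_experts_grouped_mm_dynamic
    )
    cls._already_patched = True

apply_compile(Experts)
first = Experts._run_experts_grouped_mm_dynamic
for _stage in range(3):  # PP init calls apply_compile once per stage
    apply_compile(Experts)

# The method was wrapped exactly once, not once per stage.
assert Experts._run_experts_grouped_mm_dynamic is first
assert Experts()._run_experts_grouped_mm_dynamic(3) == 6
```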

* Enable static type checking with Pyrefly (#2136)

Enables static type checking of torchtitan with
[pyrefly](https://github.com/facebook/pyrefly). Type checking the code
helps catch bugs earlier in the development cycle.

* Adds pyrefly to CI, as part of the linting workflow.
* Addresses ~100 type errors that can be fixed via local code changes
and updates to type annotations, and silences the rest with `# pyrefly:
ignore` suppression comments. Note that
https://github.com/pytorch/torchtitan/commit/325efd946f1cbea85e503f9e684b8c879891fc1a
contains all of the non-comment changes.

* [Autoparallel] Add local_map variant of DSv3 and 2D mesh AP (#2129)

Stacked PRs:
 * __->__#2129


--- --- ---

[Autoparallel] Add local_map variant of DSv3 and 2D mesh AP

Currently, the AP experiment monkey patches Titan's main DSv3
implementation. But this is prone to breakage from both model definition
changes in titan and from HOP/partitioner related changes in core. When
these breaks happen, people are usually blocked until I find the root
cause.

I'm going on PTO for the rest of the year, so I'm adding an integration
to AP's DSv3 model in an attempt to make the development more stable for
the upcoming PP integration.

Test: https://gist.github.com/xmfan/db15fda1e1bc1df7cd523005fe0baf33

* Implement ciflow/rocm on Torchtitan (#2114)

In this PR, I implemented ciflow/rocm on Torchtitan. The changes are
part of integration_test_8gpu_features.yaml. The workflow still supports
running on pull_request (without any PR label) for CUDA. However, on
push to main, on the cron schedule, and when the ciflow/8gpu label is
added to a PR, the workflow runs for both CUDA & ROCm.

---------

Co-authored-by: Huy Do <huydhn@gmail.com>

* [MoE] Add node limited routing support (#2111)

As titled, added node-limited routing support via two-layer routing.
First, group experts into `num_groups` groups; experts in the same
group should reside on the same node to utilize fast intra-node
communication. Second, pick the top `top_k_group` groups by the sum of
the top-2 expert scores in each group. Third, pick the `top_k` experts
within the selected groups.
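The two-layer selection can be sketched in plain Python with toy scores (an illustration of the routing logic, not the actual torch implementation):

```python
def node_limited_routing(scores, num_groups, top_k_group, top_k):
    """Two-layer routing: pick groups first, then experts within them."""
    n = len(scores)
    group_size = n // num_groups
    groups = [scores[g * group_size:(g + 1) * group_size]
              for g in range(num_groups)]

    # Score each group by the sum of its top-2 expert scores.
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = sorted(range(num_groups),
                  key=lambda g: group_scores[g], reverse=True)[:top_k_group]

    # Pick top_k experts, considering only experts in the kept groups.
    candidates = [i for i in range(n) if i // group_size in set(kept)]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]

# 8 experts, 4 groups of 2; keep 2 groups, route to 3 experts.
scores = [0.9, 0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.5]
print(node_limited_routing(scores, num_groups=4, top_k_group=2, top_k=3))
# [0, 4, 5]: groups 0 and 2 win, then the 3 best experts inside them
```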

Reference:
https://github.com/huggingface/transformers/blob/4c9fde2a2a3aece0bcf1be93f696e88297da9397/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L212

Test on one node using DeepSeek V3 debug model with MoE arguments
`num_experts=8,
            num_shared_experts=2,
            num_groups=4,
            top_k_group=2,
            top_k=3`.
<img width="1196" height="465" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/63fd8414-1761-4efe-acff-154b1f46a16d"
/>

* Upgrade GitHub Actions to latest versions (#2152)

## Summary

Upgrade GitHub Actions to their latest versions for improved features,
bug fixes, and security updates.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `pypa/gh-action-pypi-publish` |
[`release/v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/release/v1)
| [`v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
|
[Release](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
| release.yml |

## Why upgrade?

Keeping GitHub Actions up to date ensures:
- **Security**: Latest security patches and fixes
- **Features**: Access to new functionality and improvements
- **Compatibility**: Better support for current GitHub features
- **Performance**: Optimizations and efficiency improvements

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Upgrade GitHub Actions for Node 24 compatibility (#2151)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3),
[`v4`](https://github.com/actions/checkout/releases/tag/v4) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
docker-builds.yml, release.yml |
| `actions/download-artifact` |
[`v4`](https://github.com/actions/download-artifact/releases/tag/v4) |
[`v7`](https://github.com/actions/download-artifact/releases/tag/v7) |
[Release](https://github.com/actions/download-artifact/releases/tag/v7)
| release.yml |
| `actions/setup-python` |
[`v5`](https://github.com/actions/setup-python/releases/tag/v5) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
release.yml |
| `actions/upload-artifact` |
[`v4`](https://github.com/actions/upload-artifact/releases/tag/v4) |
[`v6`](https://github.com/actions/upload-artifact/releases/tag/v6) |
[Release](https://github.com/actions/upload-artifact/releases/tag/v6) |
release.yml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Improve the loss_compare.sh logic (#2143)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2145
* #2144
* __->__ #2143

1. Accept one "." (meaning the current commit) case to simplify the
command line.
2. Ignore the untracked files.

* [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints (#2021)

As titled, this PR adds HF state dict adapter to support loading from
GPT-OSS HF checkpoint. GPT-OSS checkpoint is quantized in MXFP4 format.
The de-quantization steps are offloaded to the
`QuantizedHuggingFaceStorageReader` in `dcp`, so this feature depends on
this PR to update `QuantizedHuggingFaceStorageReader`
(https://github.com/pytorch/pytorch/pull/167672).

1. Test 1. We use `dcp.load(hf_state_dict,
storage_reader=QuantizedHuggingFaceStorageReader(path=input_dir))` to
load from GPT-OSS HF checkpoint, and map the `hf_state_dict` back to
TorchTitan state dict. We build one test input and compare two outputs:
(1) using the `transformers` library to load the GPT-OSS HF checkpoint
and run inference on the test input; (2) using the converted TorchTitan
model to run inference on the same input. We compare the outputs via the
KL divergence of the two output probability distributions. The result
shows the two models are very similar. <img width="1191" height="191" alt="Pasted
Graphic"
src="https://github.com/user-attachments/assets/bb6a75e9-3dd7-43fa-847e-3f5f4fb5fd93"
/>

2. Test 2. We load the model directly from quantized GPT-OSS HF
checkpoint, and do a test training.
<img width="1198" height="408" alt="Pasted Graphic 1"
src="https://github.com/user-attachments/assets/49ab42ff-0115-4e79-b069-c556e0dd23f6"
/>

* Add local built pytorch path for pyrefly (#2155)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the local built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors for moe.py

* Run vLLM inference using torchtitan model definition (single GPU) (#2119)

As titled, put it in deterministic RL folder

* [RELAND] Let CUDA and ROCm read different loss result (#2157)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm have different loss results. So we need to read from
different loss result files.
The loss results of FSDP and HSDP start to diverge after the 5th step
when running with ROCm, so we also need to adjust for this. This is more
of an unknown issue; AMD folks may want to find the root cause or
confirm that it is expected behavior.

**This PR is a reland PR of
https://github.com/pytorch/torchtitan/pull/2156** due to some landing
issue of the previous PR.

* Use new DeviceMesh unflatten to rewrite parallel_dims (#1660)

**Summary**
This PR utilizes the latest APIs provided by DeviceMesh to simplify the
creation of all the different meshes.

The design philosophy is as follows:

1. Create one world mesh with shape [world_size,].
2. Create all 1-D submeshes by 1) unflattening from the world mesh, or
2) slicing and flattening from other derived meshes.
3. ParallelDims now provides two APIs, get_mesh() and
get_optional_mesh(), which accept a str or list[str]. When the argument
is a str, the API directly returns the corresponding 1-D submesh. When
it is a list[str], the dim names are concatenated to form an n-D device
mesh. The main difference between the two APIs is that the former raises
a ValueError if the resulting mesh is None, while the latter just
returns None.
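The difference between the two accessors is just error-handling policy; schematically (a sketch of the described contract, not the real ParallelDims):

```python
class ParallelDims:
    def __init__(self, meshes):
        # meshes: dim-name tuple -> mesh object (absent if unused)
        self._meshes = meshes

    def get_optional_mesh(self, dims):
        key = (dims,) if isinstance(dims, str) else tuple(dims)
        return self._meshes.get(key)  # may be None

    def get_mesh(self, dims):
        mesh = self.get_optional_mesh(dims)
        if mesh is None:
            raise ValueError(f"no mesh for dims {dims!r}")
        return mesh

pd = ParallelDims({("dp",): "dp_mesh", ("dp", "tp"): "dp_tp_mesh"})
assert pd.get_mesh("dp") == "dp_mesh"           # str -> 1-D submesh
assert pd.get_mesh(["dp", "tp"]) == "dp_tp_mesh"  # list -> n-D mesh
assert pd.get_optional_mesh("cp") is None       # optional variant: no raise
```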

* Integrate DeepEP to torchtitan (#2107)

## Summary
This initial version integrates DeepEP into TorchTitan, focusing on
correctness and compatibility rather than maximal performance tuning.

- Functional DeepEP-backed MoE + Expert Parallelism
- User-controlled configuration
- Compatible with torch.compile and SAC
-  Intended as a first unblocker for benchmarking and iteration

## Perf: DeepSeek-V3 671B on 64 nodes × H100 (512 GPUs total)

<details> <summary><strong>Training config (click to
expand)</strong></summary>

```
config_path="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml",
command_args=[
    "--training.dataset_path=/lustre/fsw/portfolios/sw/users/elfieg/hf_datasets/c4",
    "--training.seq_len=4096",
    "--training.steps=120",
    "--metrics.log_freq=10",
    "--profiling.no-enable-profiling",
    "--comm.init_timeout_seconds=2000",
    "--comm.train_timeout_seconds=300",
    "--metrics.disable_color_printing",

    # Parallelism
    "--parallelism.data_parallel_replicate_degree=1",
    "--parallelism.data_parallel_shard_degree=64",
    "--parallelism.fsdp_reshard_after_forward=default",
    "--parallelism.tensor_parallel_degree=1",
    "--parallelism.expert_parallel_degree=32",
    "--parallelism.expert_tensor_parallel_degree=1",
    "--parallelism.pipeline_parallel_degree=8",
    "--parallelism.pipeline_parallel_schedule=Interleaved1F1B",

    # Training
    "--training.local_batch_size=16",
    "--activation_checkpoint.mode=full",

    # Compilation
    "--compile.enable",
    "--compile.components=model",
    "--compile.components=loss",

    # MoE / DeepEP
    "--debug.moe_force_load_balance",
    "--parallelism.expert_parallel_comm_backend=deepep",
],
```
</details>

After:
```
memory: 56.75GiB(71.74%)  tps: 579  tflops: 162.82  mfu: 16.46%
```
Before:
```
memory: 60.18GiB(76.07%)  tps: 346  tflops: 97.24  mfu: 9.83%
```
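As a quick sanity check, the before/after figures above imply a consistent speedup in both tokens/sec and MFU (numbers copied from the logs; the arithmetic below is just a verification, not new data):

```python
# Throughput/MFU figures from the before/after logs above.
tps_before, tps_after = 346, 579
mfu_before, mfu_after = 9.83, 16.46

tps_speedup = tps_after / tps_before  # ~1.67x tokens/sec with DeepEP
mfu_gain = mfu_after / mfu_before     # ~1.67x MFU, matching the tps gain
print(f"{tps_speedup:.2f}x tps, {mfu_gain:.2f}x mfu")
```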

## Loss Curve:
<img width="877" height="380" alt="Screenshot 2025-12-16 at 11 30 02 PM"
src="https://github.com/user-attachments/assets/b2f15297-2f05-4f4b-b4d5-b2747a30b2fa"
/>



Shout out to my colleagues @gekurian @syed-ahmed @aazzolini for their
internal support!

* Fix pypa/gh-action-pypi-publish version to use SHA pinning (#2161)

## Summary

Fix incorrect version reference for `pypa/gh-action-pypi-publish`.

## Problem

A previous PR incorrectly changed the action reference from `release/v1`
(valid branch) to `v1` (non-existent tag). The `v1` tag doesn't exist in
the pypa/gh-action-pypi-publish repository.

## Solution

Updated to use SHA pinning for release/v1.13:
```yaml
uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e  # release/v1.13
```

This follows [GitHub's security best
practices](https://docs.github.com/en/actions/reference/security/secure-use#using-third-party-actions)
for third-party actions by pinning to an immutable SHA.

## Files Changed

- `.github/workflows/release.yml`

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Upgrade GitHub Actions for Node 24 compatibility (#2164)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
lint.yaml |
| `actions/setup-python` |
[`v4`](https://github.com/actions/setup-python/releases/tag/v4) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
lint.yaml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Expose common dataloader args (#2097)

This diff introduces common dataloader args which are supported by
StatefulDataLoader (and the torch.utils.data DataLoader). Users should
be able to use them in their config files.

I considered introducing a catch-all kwargs to make it easier to
specify args, but that can easily complicate things (validation checks,
duplication, clashes with named args already defined in function
signatures, etc.).
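A sketch of how such a config section might forward to the dataloader constructor (the field names below are illustrative common DataLoader knobs, not torchtitan's actual config schema):

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DataLoaderConfig:
    # Common knobs accepted by torch.utils.data.DataLoader / StatefulDataLoader.
    num_workers: int = 0
    pin_memory: bool = False
    persistent_workers: bool = False
    prefetch_factor: Optional[int] = None  # only valid when num_workers > 0

    def to_kwargs(self) -> dict:
        kwargs = asdict(self)
        # DataLoader rejects prefetch_factor when num_workers == 0, so drop it.
        if self.num_workers == 0:
            kwargs.pop("prefetch_factor")
        return kwargs
```

Explicit named fields like these keep validation simple, which is the trade-off the paragraph above makes against a catch-all kwargs.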

* Replace `logger.warn()` with `logger.warning()`, allow `log_validation` to log `extra_metrics`, and expose common wandb args (#2166)

1. Replace `logger.warn()` with `logger.warning()`.
2. Allow `log_validation` to log `extra_metrics`.
3. Expose common wandb init args; this is useful when resuming training.
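For reference, `Logger.warn` is a deprecated alias of `Logger.warning` in Python's standard `logging` module, so item 1 changes no behavior. Item 2 can be sketched as below (the `log_validation` signature here is a hypothetical shape, not the actual torchtitan API):

```python
import logging
from typing import Optional

logger = logging.getLogger("torchtitan")

# Item 1: preferred spelling; `logger.warn(...)` is a deprecated stdlib alias.
logger.warning("dataset has fewer samples than requested")


# Item 2 (hypothetical shape): let callers attach extra metrics to the
# validation log line alongside the built-in ones.
def log_validation(step: int, loss: float,
                   extra_metrics: Optional[dict] = None) -> dict:
    metrics = {"step": step, "val_loss": loss}
    if extra_metrics:
        metrics.update(extra_metrics)
    logger.info("validation %s", metrics)
    return metrics
```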

* Add Dependabot for GitHub Actions updates (#2163)

## Summary

Add Dependabot configuration to automatically keep GitHub Actions up to
date.

Here's some more information about Dependabot:
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot

## Changes

- Added `.github/dependabot.yml` with weekly checks for GitHub Actions
updates
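A weekly GitHub Actions check in `.github/dependabot.yml` typically looks like this (a sketch of the common shape documented by GitHub, not necessarily the exact file added here; the group name matches the "github-actions group" referenced by later Dependabot PRs):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      github-actions:
        patterns:
          - "*"
```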

## Context

As discussed in #2161
([comment](https://github.com/pytorch/torchtitan/pull/2161#issuecomment-3667526716)),
adding Dependabot to automatically manage GitHub Actions updates going
forward.

## Why

Dependabot will automatically create PRs when new versions of GitHub
Actions are available, helping to:
- Keep CI/CD workflows secure with the latest patches
- Get new features and improvements
- Maintain compatibility with GitHub's infrastructure

Each action update will be proposed as a separate PR for individual
review and testing.

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Bump tj-actions/changed-files from d6e91a2266cdb9d62096cebf1e8546899c6aa18f to e0021407031f5be11a464abee9a0776171c79891 in the github-actions group (#2167)

Bumps the github-actions group with 1 update:
[tj-actions/changed-files](https://github.com/tj-actions/changed-files).

Updates `tj-actions/changed-files` from
d6e91a2266cdb9d62096cebf1e8546899c6aa18f to
e0021407031f5be11a464abee9a0776171c79891
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tj-actions/changed-files/blob/main/HISTORY.md">tj-actions/changed-files's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h1><a
href="https://github.com/tj-actions/changed-files/compare/v46.0.5...v47.0.0">47.0.0</a>
- (2025-09-13)</h1>
<h2>🚀 Features</h2>
<ul>
<li>Add any_added to outputs (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2567">#2567</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c260d49a827b5eb266673bed7871c5d3ee9b5aef">c260d49</a>)
- (Jellyfrog)</li>
</ul>
<h2>➖ Remove</h2>
<ul>
<li>Commit and push step from build job (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2538">#2538</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/be393a90381e27c9fec2c8c2e02b00f005710145">be393a9</a>)
- (Tonye Jack)</li>
</ul>
<h2>🔄 Update</h2>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2592">#2592</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/3dbc1e181273d808ccff822a6e00cf18b6628ef0">3dbc1e1</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2591">#2591</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/b1ccff8c0892ad141d7d2de6f31e526a9dad931f">b1ccff8</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2574">#2574</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/050a3d3360d29711ee9d8210fc639d902d23ad07">050a3d3</a>)
- (github-actions[bot])</p>
<h2>📚 Documentation</h2>
<ul>
<li>Update link to glob patterns (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2590">#2590</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/a892f50f7a7187bc288633c09230b09ce7ad8fd0">a892f50</a>)
- (Tonye Jack)</li>
<li>Add Jellyfrog as a contributor for code, and doc (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2573">#2573</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/f000a9b97f254f9590ff26f651cccde827ad36da">f000a9b</a>)
- (allcontributors[bot])</li>
</ul>
<h2>🧪 Testing</h2>
<ul>
<li>Manual triggered workflows (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2637">#2637</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c2ca2493190021783138cb8aac49bcee14b4bb89">c2ca249</a>)
- (Tonye Jack)</li>
</ul>
<h2>⚙️ Miscellaneous Tasks</h2>
<ul>
<li><strong>deps-dev:</strong> Bump jest from 30.0.5 to 30.1.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2655">#2655</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/9a6755550a331fdcc8ec45443738933f8fa22eea">9a67555</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.1.0 to 2.2.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2660">#2660</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/b67e30df88f43e244f4e83775e5ad8335114fb95">b67e30d</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.30.2 to
3.30.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2661">#2661</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/62aef422ffa195474d80d73387535cf4622b2824">62aef42</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.11 to
3.30.2 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2659">#2659</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e874f3cddd0f54ae776e6995ae6dae4cf40fd3d3">e874f3c</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump actions/setup-node from 4.4.0 to 5.0.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2656">#2656</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/8c14441336bb3d84fd6b7fa83b6d7201c740baf5">8c14441</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.3.0 to 24.3.1 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2657">#2657</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e995ac4be5be2bcb6e29556edc51fb63aca6b49b">e995ac4</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.2.1 to 24.3.0 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2649">#2649</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/3b04099b21072562f07469c10deb182b24236ca9">3b04099</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.9 to
3.29.11 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2651">#2651</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e7b6c977e51984988e3cc1d6b18abe2a3ba8daaa">e7b6c97</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.0.2 to 2.1.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2648">#2648</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/765d62bc041415a5b494ef13d02d566128b25973">765d62b</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.8 to
3.29.9 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2647">#2647</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/2036da178f85576f1940fedb74bb93a36cd89ab7">2036da1</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-ac…