
adding variable length attention to llama3 8b #2000

Merged
tianyu-l merged 11 commits into main from test_varlen
Nov 21, 2025

Conversation

Contributor

@liangel-02 liangel-02 commented Nov 7, 2025

Summary
This PR adds variable length attention (varlen) support to the Llama 3 8B model in torchtitan. We replace `use_flex_attn` with `attn_type` (one of `"sdpa"`, `"varlen"`, `"flex"`). If `attn_type = "varlen"`, the attention module calls a compiled `varlen_attn` defined [here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).
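Conceptually, varlen attention packs multiple documents into one flat sequence and tracks document boundaries with cumulative sequence lengths instead of padding. A minimal pure-Python sketch of that bookkeeping (illustrative only; not torchtitan's actual code, which calls into the compiled kernel):

```python
# Illustrative sketch of the cu_seqlens bookkeeping behind varlen attention.
# Real varlen kernels take packed q/k/v plus these offsets instead of a
# padded [batch, seq] layout, so no compute is wasted on padding tokens.

def build_cu_seqlens(doc_lengths):
    """Cumulative sequence lengths: cu_seqlens[i] is where document i starts
    in the packed sequence; the last entry is the total token count."""
    cu = [0]
    for n in doc_lengths:
        cu.append(cu[-1] + n)
    return cu

def unpack(packed_tokens, cu_seqlens):
    """Recover per-document slices from the packed sequence."""
    return [packed_tokens[cu_seqlens[i]:cu_seqlens[i + 1]]
            for i in range(len(cu_seqlens) - 1)]

docs = [[1, 2, 3], [4, 5], [6]]
packed = [t for d in docs for t in d]
cu = build_cu_seqlens([len(d) for d in docs])
assert cu == [0, 3, 5, 6]
assert unpack(packed, cu) == docs
```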

Testing
Ran loss and performance tests against flex attention. Loss is on par.

[Screenshot 2025-11-19: loss curves, varlen vs. flex]

Varlen is slightly slower than Flex due to CUDA kernel speeds (varlen calls into `flash_attention_forward`/`flash_attention_backward` today).

|  | Varlen | Flex |
| :---: | :---: | :---: |
| Forward | 774us 357ns | 722us 317ns |
| Backward | 1ms 955us 916ns | 1ms 558us 747ns |

@meta-cla meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Nov 7, 2025
@liangel-02 liangel-02 force-pushed the test_varlen branch 3 times, most recently from eeecb63 to cad97e5 on November 12, 2025 22:49
@liangel-02 liangel-02 changed the title from "Test varlen" to "adding variable length attention to llama 3 8b" Nov 12, 2025
@liangel-02 liangel-02 changed the title from "adding variable length attention to llama 3 8b" to "adding variable length attention to llama3 8b" Nov 12, 2025
@liangel-02 liangel-02 requested a review from drisspg November 12, 2025 23:18
Contributor

@fegin fegin left a comment


This implementation won't work with PP and is too model-intrusive. The pack logic should be hidden inside the inner attention.
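The suggestion above, keeping packing details inside the inner attention module so the rest of the model stays agnostic to `attn_type`, could be sketched as follows (hypothetical names, not the PR's actual code):

```python
# Hypothetical sketch of the reviewer's suggestion: select the backend (and
# hide any packing/unpacking) inside the attention module itself, so callers
# never branch on attn_type.

class InnerAttention:
    def __init__(self, attn_type="sdpa"):
        backends = {
            "sdpa": self._sdpa,
            "flex": self._flex,
            "varlen": self._varlen,
        }
        if attn_type not in backends:
            raise ValueError(f"unknown attn_type: {attn_type}")
        self._run = backends[attn_type]

    def __call__(self, q, k, v, masks=None):
        # Callers see one uniform interface; each backend owns its details.
        return self._run(q, k, v, masks)

    def _sdpa(self, q, k, v, masks):
        return "sdpa-out"

    def _flex(self, q, k, v, masks):
        return "flex-out"

    def _varlen(self, q, k, v, masks):
        # cu_seqlens packing would happen here, invisibly to the model.
        return "varlen-out"

assert InnerAttention("varlen")(None, None, None) == "varlen-out"
```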

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from 55352a5 to 066ca02 on November 14, 2025 18:11
@liangel-02 liangel-02 requested a review from fegin November 14, 2025 18:11
@liangel-02 liangel-02 marked this pull request as ready for review November 14, 2025 18:14
Contributor

@fegin fegin left a comment


LGTM, thanks for the update. Leave some other comments, after the comments are addressed, this PR should be ready.

@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from a902cbe to de416f9 on November 17, 2025 18:05
@liangel-02 liangel-02 requested a review from fegin November 17, 2025 18:05
Contributor

@tianyu-l tianyu-l left a comment


Thanks! Left some comments, please see if they make sense to you.

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from caafc81 to 4d36560 on November 18, 2025 21:49
@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from 9380847 to 42c0c85 on November 19, 2025 22:33
@liangel-02 liangel-02 requested review from fegin and tianyu-l November 19, 2025 22:34
Contributor

@tianyu-l tianyu-l left a comment


Left some more comments. If you'd like to focus on Llama 3 in this PR, that's fine with me too.

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from 5528029 to 31c1c77 on November 20, 2025 17:35
Contributor

@fegin fegin left a comment


LGTM, we can leave other models to other PR(s).

@liangel-02 liangel-02 force-pushed the test_varlen branch 4 times, most recently from b717da3 to 9c99fcb on November 20, 2025 19:11
@liangel-02 liangel-02 requested a review from tianyu-l November 20, 2025 19:46
```python
xv,
self.head_dim,
attention_masks,
is_causal=True,
```
Contributor


This would fail? I think is_causal is no longer accepted.

Btw, it seems varlen is not tested in CI, can we add one test similar to https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests/features.py#L336
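A varlen CI entry would presumably mirror the flag-override shape of the existing integration tests; a hedged sketch (the flag name `--model.attn_type` and the helper below are assumptions, not the repo's actual API):

```python
# Hypothetical shape of a varlen integration test entry, modeled loosely on
# the override lists in torchtitan's tests/integration_tests/features.py.
# Flag names and structure are illustrative.

varlen_test = {
    "name": "varlen_attn",
    "overrides": [
        [
            "--model.attn_type=varlen",
            "--parallelism.data_parallel_shard_degree=4",
        ],
    ],
}

def flatten_overrides(entry):
    """Turn each override list into one flat CLI argument string per run."""
    return [" ".join(run) for run in entry["overrides"]]

assert flatten_overrides(varlen_test) == [
    "--model.attn_type=varlen --parallelism.data_parallel_shard_degree=4"
]
```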

@liangel-02 liangel-02 force-pushed the test_varlen branch 2 times, most recently from 1af38e5 to df22636 on November 21, 2025 16:45
@liangel-02 liangel-02 requested a review from tianyu-l November 21, 2025 18:03
Contributor

@tianyu-l tianyu-l left a comment


LGTM.

We need to modify the `save_list` of SAC to save the result of varlen attn, to be consistent with other attn implementations. This can be done in the next PR.
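The SAC follow-up described above amounts to adding the varlen attention op to the set of ops whose outputs are saved rather than recomputed; a schematic sketch (op names illustrative, not the actual torch op identifiers):

```python
# Schematic sketch of selective activation checkpointing (SAC) policy: ops
# in save_list keep their outputs for backward; everything else is
# recomputed. The follow-up would append the varlen attention op so it gets
# the same treatment as sdpa/flex. Names here are illustrative.

save_list = {"scaled_dot_product_attention", "flex_attention"}
save_list.add("varlen_attn")  # proposed follow-up

def sac_policy(op_name):
    """Return 'save' for ops in save_list, 'recompute' otherwise."""
    return "save" if op_name in save_list else "recompute"

assert sac_policy("varlen_attn") == "save"
assert sac_policy("mm") == "recompute"
```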

```python
[
    [
        "--parallelism.data_parallel_shard_degree=4",
        "--activation_checkpoint.mode='full'",
```
Contributor


let's use per_op_sac like the test above.

@tianyu-l tianyu-l merged commit f8fa21e into main Nov 21, 2025
10 of 12 checks passed
@tianyu-l tianyu-l deleted the test_varlen branch November 21, 2025 22:46
kiansierra added a commit to kiansierra/torchtitan-modal that referenced this pull request Nov 22, 2025
Contributor

wwwjn commented Nov 24, 2025

The test on AMD hardware failed, but works on the NVIDIA setup. Do you know why? https://github.com/pytorch/torchtitan/actions/runs/19644468781/job/56256208915 cc @liangel-02 @drisspg

xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request Feb 13, 2026
**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to the cuda kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
dmahan93 added a commit to NousResearch/torchtitan that referenced this pull request Mar 13, 2026
* [TorchComms] add testing badge at experiments readme (#2010)

* [compiler toolkit] specify passes through config (#2006)

We should be able to control what passes to run in the compiler. This PR
uses the config `compile.passes` to specify a list of graph passes to
apply on the captured gm.

By default, no pass is applied. Users can specify what passes to apply.

Currently there are `autobucketing_reordering_pass` and
`regional_inductor_pass`.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes autobucketing_reordering,regional_inductor
```

Also updated CI to include this new config
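The config-driven pass mechanism described above can be sketched as a small registry keyed by pass name (mechanics illustrative; only the pass names come from the PR):

```python
# Illustrative sketch of config-driven graph passes: compile.passes names a
# list of registered pass functions applied in order to the captured graph.
# Here 'gm' is a stand-in list that records which passes ran.

PASS_REGISTRY = {}

def register(name):
    def deco(fn):
        PASS_REGISTRY[name] = fn
        return fn
    return deco

@register("autobucketing_reordering")
def autobucketing_reordering_pass(gm):
    return gm + ["autobucketing_reordering"]

@register("regional_inductor")
def regional_inductor_pass(gm):
    return gm + ["regional_inductor"]

def apply_passes(gm, pass_names):
    for name in pass_names:  # by default the config list is empty: no pass
        gm = PASS_REGISTRY[name](gm)
    return gm

assert apply_passes([], []) == []
assert apply_passes([], ["autobucketing_reordering", "regional_inductor"]) == [
    "autobucketing_reordering", "regional_inductor"
]
```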

* [simplefsdp] fix region ac in zero2-style FSDP (#1970)

After some offline discussion, we've concluded that life would be easier
if we can put simplefsdp's checkpoint logic for `reshard_after_forward`
to compiler. The ac annotation part is borrowed from AP:
[LINK](https://github.com/meta-pytorch/autoparallel/blob/main/autoparallel/activation_checkpointing.py#L69).

**Trace and Loss Check** (all with torch.compile enable)

reshard_after_fwd = False
1. SAC + llama3
([trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-05-06_rank0_trace.json))
<img width="768" height="115" alt="Screenshot 2025-10-30 at 4 28 59 PM"
src="https://github.com/user-attachments/assets/e4e22335-2e3f-46c8-8def-a60d592fee0a"
/>

<img width="689" height="512" alt="Screenshot 2025-11-05 at 9 02 30 PM"
src="https://github.com/user-attachments/assets/40a71316-a457-4e72-9002-cc8beea8f32c"
/>


2. Full AC + llama3 [(trace)]()

<img width="729" height="105" alt="Screenshot 2025-10-30 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/e8d63460-579b-4f0a-8504-851480e5b548"
/>

<img width="789" height="763" alt="Screenshot 2025-11-05 at 9 11 34 PM"
src="https://github.com/user-attachments/assets/1a13d09e-04c4-4db9-99fe-cf10d24bf7f5"
/>


3. No AC + llama3
[[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-03-50_rank0_trace.json)]

<img width="748" height="115" alt="Screenshot 2025-10-30 at 4 32 05 PM"
src="https://github.com/user-attachments/assets/20104d24-9d45-4eba-b694-815e133b88d0"
/>

<img width="800" height="764" alt="Screenshot 2025-11-05 at 9 07 46 PM"
src="https://github.com/user-attachments/assets/55b104ce-8ec1-4ed6-95e7-300e96ad55af"
/>


reshard_after_fwd = True

1. SAC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-34-24_rank0_trace.json))

<img width="795" height="108" alt="Screenshot 2025-10-31 at 11 34 47 AM"
src="https://github.com/user-attachments/assets/a3988f72-7e87-4e52-90f9-8bee840cd6f4"
/>


2. Full AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-36-27_rank0_trace.json))

<img width="593" height="110" alt="Screenshot 2025-10-31 at 11 38 02 AM"
src="https://github.com/user-attachments/assets/5ee61b2b-9600-4af8-9a24-61b3564f93ca"
/>


3. No AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-02-44_rank0_trace.json))


<img width="701" height="109" alt="Screenshot 2025-10-31 at 11 43 04 AM"
src="https://github.com/user-attachments/assets/576b28f6-dae4-4ff7-b005-57b0cf9ad7cc"
/>

* [SimpleFSDP] Add typing to simple_fsdp.py (#2001)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #2002
* __->__ #2001

Add typing, credit to Claude.

* [Full DTensor][Reland] Add full_dtensor flag (#2013)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2013

When full_dtensor is True, the compute_placement will be preserved. This
means that `to_local()` won't be called for the fsdp-only case. The nD
parallelism case (fsdp + tp) will error out, as we have not implemented
this case.

This argument doesn't affect the current simple_fsdp. We have verified
the `full_dtensor=True` case with the full dtensor skeleton PR, which
will be published once it is ready.

**This is a reland PR of
https://github.com/pytorch/torchtitan/pull/2002. The previous one was
broken during rebase.**

* set pg names (#1986)

Summary:
- we need to pass the global rank information to pytorch so that the pg
name can include the pg information
- this is necessary to differentiate the default pg's on different
replicas
- these need to be different because flight recorder matches collectives
based on pg name as well
- add ft training to experiments folder, we'll move remaining pieces of
ft to this gradually but make new features only available through this
folder

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1986).
* #1988
* #1987
* __->__ #1986

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>

* Fix the error message of maybe_enable_async_tp() (#2011)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.

* Add dry run mode (#2012)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2012
* #2011

Summary:
The current configuration validation requires torchx and GPUs. It can
waste time, resources, and energy. Polar bears are crying. Let's fix
this by providing a dry run mode. This PR doesn't verify everything; in
theory, we should be able to verify parallelism settings as well. This
PR is just a start, but it at least lets us catch typos quickly.
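The dry-run idea, validating the config and exiting before any GPU or distributed setup is touched, can be sketched as follows (config keys and function names are hypothetical, not torchtitan's actual JobConfig):

```python
# Hypothetical sketch of a dry-run mode: parse and validate the job config,
# then return before any training resources are allocated.

def validate(config):
    """Collect human-readable config errors; empty list means valid."""
    errs = []
    if config.get("steps", 0) <= 0:
        errs.append("training.steps must be positive")
    if config.get("mode") not in {"none", "full", "selective"}:
        errs.append("unknown activation_checkpoint.mode")
    return errs

def run(config, dry_run=False):
    errors = validate(config)
    if errors:
        raise ValueError("; ".join(errors))  # catch typos early, no GPU needed
    if dry_run:
        return "config ok (dry run, no training)"
    return "training started"

assert run({"steps": 10, "mode": "none"}, dry_run=True).startswith("config ok")
```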

* [easy] [compiler toolkit] Clean up unused function (#2014)

As titled. `_clear_traced_params_buffers` is no longer being used as we
have switched the dynamo graph capture API.

* Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2016)

Addressing following issues in this PR-

- Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Revert PR-2016 & Redo "Run Torchtitan ROCm workflow on cron schedule & push to Main branch only" (#2017)

Reverts PR: https://github.com/pytorch/torchtitan/pull/2016
Addressing following issues in this PR-
- Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* [compiler toolkit] Add tests and scripts for numerics check (#2015)

This PR adds the utils to automatically check the training numerics
(losses, grad norms) of two runs to verify if they have bitwise
equivalence.

The added script triggers two runs with user defined configs. Then it
loads metrics saved during training and compare the numerics to verify
bitwise equivalence. Currently we check for losses and grad norms during
training steps

For example, we want to compare the numerics between compiler toolkit
with aot_eager backend and eager on llama3-8B.
```
python torchtitan/experiments/compiler_toolkit/scripts/check_numerics.py --ngpu 4 --config-file torchtitan/models/llama3/train_configs/llama3_8b.toml --dp-shard-degree 2 --tp-degree 2
```
It'll run `simple_fsdp` experiment without `torch.compile` as the eager
baseline, and `compile_toolkit` experiment as the compiled run. Then it
compares the training numerics of these two runs to verify bitwise
equivalence.

When it is bitwise equivalent, we'll see the following output
```
Starting training: simple_fsdp.llama3
✓ Training completed: simple_fsdp.llama3

Starting training: compiler_toolkit.llama3
✓ Training completed: compiler_toolkit.llama3
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
✓ SUCCESS: All metrics are bitwise equivalent
```

Also added unit-tests in `compiler_toolkit/tests/test_numerics.py` so
that we can guard working parallelism combinations that already have
bitwise equivalence in CI.
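The equivalence check itself reduces to an exact, no-tolerance comparison per training step; a minimal sketch (not the script's actual code):

```python
# Minimal sketch of a bitwise-equivalence check: two runs are equivalent
# only if the per-step metrics (losses, grad norms) match exactly, with
# deliberately no floating-point tolerance.

def check_bitwise(baseline, test):
    if len(baseline) != len(test):
        return False, "step count mismatch"
    for step, (a, b) in enumerate(zip(baseline, test), start=1):
        if a != b:  # exact comparison: bitwise means bitwise
            return False, f"mismatch at step {step}: {a} vs {b}"
    return True, f"All {len(baseline)} steps match exactly (bitwise equivalent)"

ok, msg = check_bitwise([2.31, 2.25], [2.31, 2.25])
assert ok
ok, _ = check_bitwise([2.31], [2.3100001])
assert not ok
```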

* Add .claude to .gitignore (#2026)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title

* Fix dry run mode (#2027)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry run mode works, but it doesn't exit gracefully in all cases. This PR
fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh   --training.steps=10 --activation_checkpoint.mode="none"
--debug.deterministic --debug.seed=42
```

* [Compiler Toolkit] Make compiler toolkit work with checkpoint (#2030)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule will result in an "inner" prefix for
everything. This
PR fixes it by overloading the methods.

Also merge https://github.com/pytorch/torchtitan/pull/2028 to this PR.
Something wrong with ghstack.

* [Flux] Update integration test badge in README.md (#2019)

Fixes the badge in the `README.md` file

* Print device and stride when print module (#2045)

Before:
<img width="978" height="93" alt="image"
src="https://github.com/user-attachments/assets/48dc39d9-e897-4396-ac62-025574303403"
/>


After:
<img width="1318" height="82" alt="image"
src="https://github.com/user-attachments/assets/47b4771a-aaf9-4f61-80bc-757f3a08c1d2"
/>

* [SimpleFSDP] add manual bucketing pass (#1881)

This PR adds support for aten-level manual bucketing in
SimpleFSDP+`aot_eager` backend. Dependent on PyTorch
[PR](https://github.com/pytorch/pytorch/pull/165487)

TODO List:
- [ ] We should have a better way of handling region info other than a
list of str FQNs in the current `manual_bucketed_modules`. It would be very
easy to miss some of the model modules. (cc. @xmfan @SherlockNoMad )
- [ ] Currently, the reordering happens under the hood and overlaps with
the last/next compute. We should allow users to specify which modules they
want to reorder.
- [ ] Loss difference on multi-node training
- [ ] DSV3 manual bucketing

I'll address the TODO items in follow up PRs. Let's start with this
simple FSDP+TP+llama3 PR.

1. Performance (FSDP2 under eager mode, SimpleFSDP uses `aot_eager`
backend)

**Llama 3-8B**

* Performance (All Batch_size = 1). (The slower TPS on Single Node is
sort of as expected, since FSDP2 handles copy-in/out in two different
streams, whereas SimpleFSDP handles copy-in/out in the same stream)

|Node| Method | Parallelism | Memory | TPS | Trace|
|---------|---------|-----------|----------|------|------|
|1-Node (8H100)|SimpleFSDP | FSDP=8| 40.96GiB(43.12%) | 7,227|
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-48-48_rank0_trace.json)|
|1-Node (8H100)|FSDP2-eager| FSDP=8| 47.82GiB(50.35%) | 7,380 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-54-14_rank0_trace.json)|
|8-Node (64H100)|SimpleFSDP| FSDP=64  | 29.37GiB | 4,984| |
|8-Node (64H100)|FSDP2| FSDP=64 | 31.41GiB  |5,097 | |
|1-Node (8H100)|SimpleFSDP| FSDP=4 TP=2 | 28.28GiB(29.77%) | 5,881 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-18-00-18_rank0_trace.json)
|
|1-Node (8H100)|FSDP2| FSDP=4 TP=2 | 35.33GiB(37.20%) | 5,898 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-15-35-47_rank0_trace.json)
|
|8-Node (64H100)|SimpleFSDP| FSDP=8 TP=8  |   |||
|8-Node (64H100)|FSDP2| FSDP=8 TP=8 |   |||

Example SimpleFSDP 1D overlapping trace:

<img width="1127" height="127" alt="Screenshot 2025-10-16 at 10 49
55 AM"
src="https://github.com/user-attachments/assets/2d9e3ff8-8e9b-40a7-a666-3c0a0975186e"
/>

Example SimpleFSDP 2D overlapping trace:
<img width="1162" height="166" alt="Screenshot 2025-10-26 at 6 00 51 PM"
src="https://github.com/user-attachments/assets/bc5cc031-5b6c-4e4d-a9da-70c43114f49a"
/>


- Bitwise Loss:

FSDP-only:
<img width="1266" height="837" alt="Screenshot 2025-10-17 at 10 41
56 AM"
src="https://github.com/user-attachments/assets/30f83d95-1eca-4f10-9e7e-47c45278cd8d"
/>

FSDP+TP:
<img width="1259" height="808" alt="Screenshot 2025-10-26 at 9 03 58 PM"
src="https://github.com/user-attachments/assets/b75b452b-adb9-4078-9412-ee9e584ffe15"
/>

* Add export_dtype parameter to `convert_to_hf` function (#2041)

The current `convert_to_hf.py` does not support `export_dtype`, which
makes it `float32` by default. This PR adds support for export dtypes of
`["float16", "bfloat16", "float32"]`.
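The dtype plumbing amounts to validating the requested string against the supported set while keeping `float32` as the default; a sketch (function name hypothetical):

```python
# Hypothetical sketch of the export_dtype handling: validate the requested
# dtype string, defaulting to float32 as before. In real code the string
# would be mapped to a torch dtype (e.g. via getattr(torch, name)).

SUPPORTED_EXPORT_DTYPES = ("float16", "bfloat16", "float32")

def resolve_export_dtype(name=None):
    if name is None:
        return "float32"  # previous default behavior
    if name not in SUPPORTED_EXPORT_DTYPES:
        raise ValueError(f"export_dtype must be one of {SUPPORTED_EXPORT_DTYPES}")
    return name

assert resolve_export_dtype() == "float32"
assert resolve_export_dtype("bfloat16") == "bfloat16"
```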

* [compiler toolkit] Port joint_ac_pass from simplefsdp (#2051)

This PR integrates the changes in #1970 into the compiler toolkit (applying
`joint_ac_pass` on the joint graph to tag nodes based on the
`reshard_after_forward` flag)

Also did some refactor for applying graph passes in compiler toolkit
experiments. We will have two kinds of passes

1. joint_custom_passes: these are passes to be applied on the captured
joint graph before the partitioner. By default we apply
`validate_flex_attn_annotation_pass` and `fsdp_reshard_after_fwd_pass`

2. compiler_passes: these are passes to be applied on the partitioned fwd
and bwd graphs as backend optimizations. By default there is none. We
can enable `autobucketing_reordering_pass` and
`regional_inductor_pass` using configs.

* [compiler toolkit] Port manual bucketing from SimpleFSDP experiment (#2056)

This PR integrates the manual bucketing pass (transformer block
bucketing) added in SimpleFSDP experiment (#1881) to compiler toolkit

So now compiler toolkit can also run manual bucketing pass by specifying
the config

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes transformer_block_bucketing
``` 

Also updated README and integration test to include the newly ported
pass

* Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2018)

Addressing following issues in this PR-

Running Torchtitan ROCm workflow on cron schedule & only when push to
Main branch. CUDA workflow will run as is.
Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Add a loss comparison script (#2029)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029


## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits
with deterministic training
- Configuration Comparison: Compare different training configurations on
the same commit
- Reproducibility: Automatically enables deterministic mode and seed
checkpointing for reproducible
  comparisons
- Real-time Output: Streams training output to both console and log
files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and
summary statistics
- CI Testing: Includes --assert-equal flag for automated testing to
verify identical losses
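The log-parsing step behind such comparisons can be sketched with a regex over training logs (the log format shown is an assumption, not torchtitan's exact output):

```python
# Sketch of the log-parsing step in a loss-comparison script: pull
# (step, loss) pairs out of training logs with a regex, then diff them
# per step. The "step: N loss: X" format is an assumed example.

import re

LINE_RE = re.compile(r"step:\s*(\d+).*?loss:\s*([\d.]+)")

def extract_losses(log_text):
    """Map step number -> loss value for every matching log line."""
    return {int(m.group(1)): float(m.group(2))
            for m in LINE_RE.finditer(log_text)}

log = "step: 1  loss: 8.1250\nstep: 2  loss: 7.9922\n"
assert extract_losses(log) == {1: 8.125, 2: 7.9922}
```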

## Usage Examples

#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration 
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" 
--baseline-options="--parallelism.tensor_parallel_degree=2"  \
```

#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" 
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1" \
```

#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```


## Real Use Cases
Compare full dtensor simple fsdp with fsdp2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"'  \
--test-train-file='torchtitan.experiments.full_dtensor.train' \ 
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"'  \
 --assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```

* Fix integration test gpu_arch_type field (#2060)

All tests in experiments are broken due to the `gpu_arch_type` field
added in #2018.

* [compiler toolkit] Add Trainer subclass for compiler toolkit (#2064)

Adding CudaGraph pass (https://github.com/pytorch/torchtitan/pull/2050)
would require some custom logic in Trainer's close() method.

So we create a Trainer subclass in compiler toolkit

* Let loss_compare.py check the repo cleanliness (#2062)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing `git checkout`.

* CUDAGraph support for SimpleFSDP and TP (#2050)

## Features
- [x] Support SimpleFSDP and TP
- [x] Support static input indices to reduce copy
- [x] Support memory reuse to reduce memory consumption
- [x] Cleanup cudagraph when training finishes to avoid nccl hang from
destroy_process_group

Command:
```
NCCL_GRAPH_REGISTER=0 NGPU=8 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph
```


Note: we use `NCCL_GRAPH_REGISTER=0` due to a known issue that nccl +
cudagraphs + expandable segments result in IMA.
https://github.com/pytorch/pytorch/issues/158029


[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces%2Ftree%2Fshared_trace%2Fboyuan_e1ef464b-ee61-4c61-82e5-f7a485e561bf_rank0_trace.json)

## Result

**Numerics:**
Achieved bitwise equivalence w/ and w/o cudagraph pass on llama3.1-8B
AND llama3.1-70B.

**Performance:**
<img width="560" height="90" alt="image"
src="https://github.com/user-attachments/assets/9d54c461-0eb1-4f7e-9652-3d52043ad74f"
/>

Raw log:
[llama3-8b](https://www.internalfb.com/phabricator/paste/view/P2045444190),
[llama3-70b](https://www.internalfb.com/phabricator/paste/view/P2045567416)

**Memory:**
On llama3.1-70b, cudagraph takes 6% more memory consumption (143 GiB vs
153 GiB).

A few tricks to reduce memory consumption (use llama3.1-70b w/ cudagraph
as an example):
- Start: 161 GiB
- \+ use the same stream for warmup and graph capture of both fwd and
bwd: 160 GiB
- \+ warmup in cudagraph memory pool instead of eager memory pool: 153
GiB


**static input copy:**
On llama3.1-70B, for forward, we copy 1 tensor of 128 bytes; for
backward, we copy 1 tensor of 0.98 GB. This shows static input indices
is handled correctly.


## Followup PR
In the followup PR, I will enable fx graph partition for deepseek v3
https://github.com/pytorch/pytorch/pull/165945.

* compiler_toolkit: fix args access (#2067)

This PR fixes access to args; it's an attribute, not a variable in the
scope.
The method itself though would not be used because
`should_check_address` seems to be always `False` and there doesn't seem
to be a command line argument for it.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* 3outeille/transformers backend (Dense model only) (#2048)

# Context
Reference PR: https://github.com/huggingface/torchtitan/pull/1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP
(and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF model weights initialisation. Without this, the `loss`
and `grad_norm` start very high

# Usage

- Requirements `transformers==4.57.1`
- Config:
`torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`
```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```
- Train: `LOG_RANK=7
CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml
./run_train.sh
--job.custom_config_module=torchtitan.experiments.transformers_backend.job_config
--compile.enable`

<img width="1334" height="453" alt="image"
src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c"
/>

# Testing methodology

<img width="2672" height="2018" alt="image"
src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7"
/>

- Following the
[converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md)
guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other
//-ism>`
- More precisely, `test_hf_integration.py` is going to produce:

```bash
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
```
- Here is the grid search used to test the HF modeling:
```shell
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR https://github.com/huggingface/torchtitan/pull/3)
	- Missing `build_optimizers_with_moe_load_balancing` support for MoE
	- Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test, the
`loss` and `grad_norm` are not bitwise matching (but converging), while
they do match with the Torchtitan modeling. (Issue tracked in
https://github.com/huggingface/torchtitan/pull/4)
- Add convergence tests to CI by doing tiny model + gloo backend (once
PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `import torch._dynamo.config;
torch._dynamo.config.cache_size_limit = 128` to avoid graph
recompilation when using `torch.compile` with `activation checkpointing`

* adding variable length attention to llama3 8b   (#2000)

**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to CUDA kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
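As a rough illustration of what the varlen path consumes, here is a hedged pure-Python sketch of building the cumulative sequence-length boundaries (`cu_seqlens`) for a packed batch of documents; the helper name is hypothetical, and the real `varlen_attn` takes these offsets as int32 tensors:

```python
def build_cu_seqlens(doc_lens):
    """Cumulative boundaries [0, l0, l0+l1, ...] for packed documents.

    Hypothetical helper: varlen attention kernels use these offsets to
    restrict attention to within each document, with no padding.
    """
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

# three documents of lengths 5, 3, and 8 packed into one sequence of 16
print(build_cu_seqlens([5, 3, 8]))  # → [0, 5, 8, 16]
```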

* remove scatter_add in MoE implementation (#1974)

PR for removing `scatter_add` in the MoE implementation. `scatter_add`
is somewhat problematic as it is non-deterministic on CUDA due to the
[atomic
adds](https://discuss.pytorch.org/t/why-does-index-add-and-scatter-add-induce-non-deterministic-behavior-on-the-cuda-backend/45544/2)
needed for correctness.
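The non-determinism comes from atomic adds committing in an unspecified order while floating-point addition is not associative; a minimal pure-Python illustration (not torchtitan code):

```python
# Summing the same three values in two different orders gives two
# different float results -- exactly what happens when CUDA atomic adds
# land in a different order on each run.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False
```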

Determinism, correctness, and performance tests using scripts under
`torchtitan/moe_bench_and_test`:

```
# Determinism: run same forward 100x and compute standard deviations
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_determinism

out_old_std=tensor(0.0297, device='cuda:0', dtype=torch.bfloat16)
out_std=tensor(0., device='cuda:0', dtype=torch.bfloat16)
out_old_std/out_moe_old.abs().mean()=tensor(0.0006, device='cuda:0', dtype=torch.bfloat16)
out_std/out_moe.abs().mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
```

```
# Accuracy: compare MoE outputs to FFN outputs, with weights set such that outputs should be the same
# Relative error decreased by 3x
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_moe_ffn_equivalence

moe_old_rel_err=0.009754068047048696
moe_rel_err=0.002507858727736454
moe_old_rel_err/moe_rel_err=3.8894009216589858
```

```
# Timing: triton do_bench for DSv3 16B layer fwd + bwd. ~3% faster runtime
python torchtitan/moe_bench_and_test/moe_timing.py moe_old && python torchtitan/moe_bench_and_test/moe_timing.py moe

args=Namespace(cls='moe_old', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.712812881469727

args=Namespace(cls='moe', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.03301840562087

```

```
# Memory: for DSv3 16B layer fwd + bwd. ~15% reduction in active mem, ~18% in reserved mem.
python torchtitan/moe_bench_and_test/moe_memory.py moe_old && python torchtitan/moe_bench_and_test/moe_memory.py moe

args=Namespace(cls='moe_old', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.926029682159424
peak_stats.max_reserved_gib=7.224609375

args=Namespace(cls='moe', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.051033020019531
peak_stats.max_reserved_gib=5.91015625
```

Testing fwd + bwd correctness for `tp_degree=ep_degree=world_size=8` and
`etp=1`
```
# Similar relative errors
torchrun --nproc-per-node 8 torchtitan/moe_bench_and_test/test_tp.py

args=Namespace(seqlen=256, bsz=4, tol=0.01), world_size=8, tp=8, ep=8, etp=1

err_ratio_fsdp_ep_old=0.0028211805268959435
err_ratio_fsdp_ep=0.002805679534989922
err_ratio_ep_ep_old=0.0022941468020912068

kl_fsdp_ep_old=tensor(2.4915e-05, device='cuda:0', dtype=torch.bfloat16)
kl_fsdp_ep=tensor(2.0981e-05, device='cuda:0', dtype=torch.bfloat16)
kl_ep_ep_old=tensor(2.1458e-05, device='cuda:0', dtype=torch.bfloat16)
```

Everything under `torchtitan/moe_bench_and_test` consists of temporary
testing utilities and will be deleted prior to merging.

* Update transformers backend name (#2075)

Following Hugging Face's efforts in vLLM (cf.
https://github.com/vllm-project/vllm/pull/28725), we would like to
unify the naming and make it clear that this backend uses the HF
models only

* Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses (#2063)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2063

This PR allows us to check that the loss is consistent across commits/PRs.
1. This PR contains a pre-tested loss results file.
2. This PR improves loss_compare.py by adding --import and --export
options.
3. In CI, use --import to get the previous losses and compare them with
the current PR. If any of the 10 steps mismatch, the CI will fail.

* Print out the version number (#2083)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2083

This PR and https://github.com/pytorch/torchtitan/pull/2070 can resolve
https://github.com/pytorch/torchtitan/issues/2043.

This should not affect `.github/scripts/update_version.sh` as
`.github/scripts/update_version.sh` will append the version at the end
of the file, which will overwrite the value.

* Autoparallel as an experiment in main (#2054)

Experiments like SimpleFSDP/Compiler Toolkit/Autoparallel are all being
developed at the same time, and SimpleFSDP/Compiler Toolkit both run
into issues with PP that requires the PP utilities from Autoparallel. We
want to land the Autoparallel experiment into main to facilitate that
sharing.

---------

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Co-authored-by: ruisizhang123 <ruisizhang123@gmail.com>
Co-authored-by: Brian Hirsh <briandhirsh@gmail.com>
Co-authored-by: Will Constable <willconstable@gmail.com>

* skip varlen integration test on rocm (#2085)

As titled, since varlen attention is not supported on ROCm

* [Local Tensor] Replace dry_run.py with fake mode implementation (#2057)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2057

Replaces `dry_run.py` implementation with fake PG mode for DRY_RUN
configuration validation. This PR also adds support for Local Tensor
mode to provide deeper validation coverage.

**Note:** Currently returns early before `init_weights()` if using local
tensor mode due to some limitation of local tensor, which will be fixed
by https://github.com/pytorch/pytorch/pull/166540 .

* add varlen attention for qwen 3 (#2084)

As titled.

**Testing**

<img width="469" height="431" alt="Screenshot 2025-11-24 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/6b9a362d-de36-48b7-b465-d91ae24f4cbf"
/>

performance and loss on par

* [FLUX] Add FLUX inference test in CI (#1969)

* Improve logging by formatting the dict as JSON. (#2094)

We use Slurm to run jobs, and I just noticed that job configs and model
args were being logged on a single line by default, which makes the logs
hard to read.

This PR improves readability by formatting these dictionaries with
`json.dumps` before logging, so the configs are formatted nicely and
easier for humans to read.

before:
<img width="2594" height="640" alt="image"
src="https://github.com/user-attachments/assets/c3c07b09-d12c-484d-aa90-a626cd25c6d2"
/>

after:
<img width="2252" height="1032" alt="image"
src="https://github.com/user-attachments/assets/4cbde979-c34c-4fc5-aa55-f280f39cf9ef"
/>
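The change boils down to passing the dict through `json.dumps` before handing it to the logger; a minimal sketch (the config dict here is made up, not the real job config):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

# hypothetical config dict standing in for the real job config
config = {"model": "llama3_8b", "parallelism": {"fsdp": 8, "tp": 1}}

# before: one unreadable line; after: pretty-printed, one key per line
logging.info("Job config:\n%s", json.dumps(config, indent=2))
```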

* add all SDPA backends to op_sac_save_list (#2095)

As we discussed in https://github.com/pytorch/torchtitan/issues/2091, we
should add all `scaled_dot_product_attention` backends to
`op_sac_save_list` to avoid recomputing attention during the backward pass.

* modify save list for varlen attn (#2082)

adding varlen attention ops to ac save list

**Testing**

Used DebugMode() to print out the op list and verified that the forward
is not being recomputed in the backward step.

```
[rank0]:forward ops
[rank0]:varlen_attn in forward: True
...
[rank0]:varlen_attn recomputed in backward: False
[rank0]:saved correctly
```

* Make sure log after distributed initialized. (#2102)

There is a distributed-initialization condition check in the config
logging, so the config logging has to happen after distributed
has been initialized.

Co-authored-by: Zhiqiang Zang <zzq@fb.com>

* [mxfp8] [docs] [BE] add MXFP8 usage documentation and benchmarks (#2096)

Fixes #1998

* Mark input tokens to routed experts as dynamic to avoid a recompile (#2007)

Stacked PRs:
 * __->__#2007


--- --- ---

Mark input tokens to routed experts as dynamic to avoid a recompile


This saves 1 recompile, and you can see the input tokens are dynamic
from the first graph compiled:
```python
class GraphModule(torch.nn.Module):
    def forward(...s77: "Sym(s77)", L_x_: "bf16[s77, 5120][5120, 1]cuda:0"...
```

I verified that this also fixes the AC recompile issue of:
https://github.com/pytorch/torchtitan/issues/1971. But I'm keeping
`torch._C._dynamo.eval_frame._set_lru_cache(False)`, as there could be
other recompile reasons popping up.

* fix mxfp8 loss image (#2104)

In the original PR i moved the image location without updating the
markdown pointing to it by accident. This fixes that.

* Update hf_assets_path for llama4 (#2110)

Fix typo in train_config, hf asset should be for maverick, see:

https://huggingface.co/meta-llama/models?search=128e

* Enables parsing of --compile.components through CLI (#2115)

Without this PR, I'm not able to pass `--compile.components=model,loss`.
Tested using `python -m torchtitan.config.manager
--compile.components=model,loss`.

* fix `ForgeEngine` compatibility issue with (#2121)

Summary:
Fix backward incompatible changes introduced in 


https://github.com/pytorch/torchtitan/commit/ff078526d1b9a51a3507cd234715ac3c61291e85

Differential Revision: D88572518

* Remove the hack for SAC + FlexAttention (#2118)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2118

PyTorch can now support torch.compile inside the SAC region even if
torch.compile is not used to wrap SAC. This PR removes the workaround
that was needed to make torch.compile work with Flex.

* Add warning to run_tests (#2123)

Small addition since right now running a test that doesn't exist just
outputs nothing, e.g.

`python -m tests.integration_tests.run_tests ./test-out --test_name
does_not_exist`

Now the output is:

`
WARNING:root:No tests were run for --test_name 'does_not_exist' in test
suite 'features'.
Available test names in 'features' suite: ['default', '1d_compile',
'1d_compile_sac_op', '2d_eager', '2d_compile', 'full_checkpoint',
'model_only_hf_checkpoint', 'last_save_model_only_fp32',
'last_save_model_only_bf16', 'pp_looped_zero_bubble', 'pp_zbv',
'pp_1f1b', 'pp_gpipe', 'pp_dp_1f1b', 'pp_dp_gpipe', 'pp_tp', 'pp_dp_tp',
'3d_compile', 'pp_looped_1f1b', 'pp_custom_csv', 'optimizer_foreach',
'ddp', 'hsdp', 'fsdp+flex_attn', 'fsdp+flex_attn+per_op_sac',
'fsdp+varlen_attn+per_op_sac', 'cp_allgather', 'cp_alltoall', 'hsdp+tp',
'fsdp+cp', 'hsdp+cp_without_dp_shard', 'hsdp+cp_with_dp_shard',
'fsdp+tp+cp', 'cpu_offload+opt_in_bwd+TP+DP+CP', 'test_generate',
'fsdp_reshard_always', 'optional_checkpoint', 'float8_emulation',
'gradient_accumulation', 'validation_tp_cp_pp']
`

* [compiler toolkit] Disable CUDAGraph integration test (#2127)

As titled. We'll enable when it is fixed.

* Add CI for Autoparallel experiment llama3 on 4 GPUs (#2105)

* Support rope cache indexing using positions (#2112)

Add support for indexing the rope cache using `position_ids`; this might
be needed during
1. inference, where we pass `position_ids` into the transformer forward
2. CP load balancing, where we need to index the rope cache given
position ids

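A hedged pure-Python sketch of what "indexing the rope cache given position ids" means; the function names are illustrative, not the actual torchtitan API:

```python
import math

def build_rope_cache(max_seq_len, head_dim, base=10000.0):
    # precompute (cos, sin) pairs per position and rotary frequency
    half = head_dim // 2
    freqs = [1.0 / base ** (2 * i / head_dim) for i in range(half)]
    return [[(math.cos(p * f), math.sin(p * f)) for f in freqs]
            for p in range(max_seq_len)]

def index_rope_cache(cache, position_ids):
    # gather rows by explicit positions instead of assuming 0..seq_len-1,
    # as needed for inference and CP load balancing
    return [cache[p] for p in position_ids]

cache = build_rope_cache(max_seq_len=8, head_dim=4)
rows = index_rope_cache(cache, [3, 1, 4])  # out-of-order positions
```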
Test: 
running dpskv3 16b base
<img width="489" height="286" alt="image"
src="https://github.com/user-attachments/assets/6f463d65-a0de-413d-ab19-770db9983dbb"
/>

also tested in https://github.com/wwwjn/torchtitan/pull/1/files when
passing position_ids
<img width="665" height="269" alt="image"
src="https://github.com/user-attachments/assets/70e4bddc-0334-4dbf-b00d-6e4b49a94655"
/>

---------

Co-authored-by: JessicaZhong <zhengjesszhong@gmail.com>

* [forge] allow torchforges to set checkpoint base folder (#2131)

This PR
1) allows Torchforge to decide where to put the checkpoint, wandb,
etc., instead of the "current" folder
~~allowing Torchforge to decide to print / log the configs~~

* Rename auto_parallel experiment to autoparallel (#2128)

* PyTorch depends on psutil (#2132)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2132

TorchTitan should also depend on psutil.

* Remove caching for attention masks (#2117)

We remove the lru_cache for attention masks: in the
get_attention_mask() function, `and_masks(*mask_mods)` returns a new
object on every call, and since `create_attention_mask` uses all
parameters as the cache key, the fresh object id always causes a cache miss.
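The failure mode is easy to reproduce with a plain `functools.lru_cache` (a toy stand-in, not the torchtitan code):

```python
from functools import lru_cache

def and_masks(*fns):
    # returns a brand-new closure on every call, so its id (and hash)
    # differs each time -- just like the real and_masks
    return lambda *args: all(f(*args) for f in fns)

@lru_cache(maxsize=None)
def create_attention_mask(mask_fn, seq_len):
    return (id(mask_fn), seq_len)  # placeholder for real mask creation

causal = lambda q, k: k <= q
create_attention_mask(and_masks(causal), 128)
create_attention_mask(and_masks(causal), 128)
print(create_attention_mask.cache_info())  # 2 misses, 0 hits
```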

Before the change: (llama3 debugmodel_flex_attn)
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 27 45 PM"
src="https://github.com/user-attachments/assets/e9af2597-9d94-4478-8136-8b9b8c35d9e6"
/>

After the change:
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 29 56 PM"
src="https://github.com/user-attachments/assets/756a7d09-b47f-434f-8ff6-40098b265a03"
/>

* Clarify contribution guidelines. (#2134)

* Enable PP and EP overlap for MoE (#1721)

Option 2 of https://github.com/pytorch/torchtitan/issues/1682

These changes add a custom `overlap_callback` function to replace the
OVERLAP_F_B action that is run during the schedule execution. In the
custom function, we write `run_forward()` and `run_backward()`.
`run_backward()` is run as a separate thread so that we can have both
forward and backward running together side by side. Looks like this:

<img width="1321" height="443" alt="image"
src="https://github.com/user-attachments/assets/911f3637-1afa-4537-989a-a325ba558957"
/>

In order for these changes to work with Expert Parallel, we also need to
add custom autograd functions to act as the boundary points at which we
do communication. We added hooks before and after expert parallel
dispatch and combine to signal boundary points, so our figure from
before now turns into:

<img width="1382" height="388" alt="image"
src="https://github.com/user-attachments/assets/3991749d-7d67-4098-81a4-4efcfd1c75ca"
/>

Now in each of these red blocks, we use a global coordinator. We need
`threading.Barrier(2).wait()` so that the comm and compute from our
forward and backward steps are scheduled in lock-step before continuing.
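The lock-step scheduling described above can be sketched with a plain `threading.Barrier`; this is a toy sketch, not the actual `overlap_callback`:

```python
import threading

barrier = threading.Barrier(2)
order = []

def forward_step():
    order.append("fwd-compute")
    barrier.wait()  # rendezvous: don't issue comm until backward is here too
    order.append("fwd-comm")

def backward_step():
    order.append("bwd-compute")
    barrier.wait()
    order.append("bwd-comm")

t = threading.Thread(target=backward_step)
t.start()
forward_step()
t.join()
# both compute entries are guaranteed to precede both comm entries
```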

DSv3 16B run command:
```
TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=8  CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh
```

Trace examples:

<img width="2409" height="1889" alt="image"
src="https://github.com/user-attachments/assets/923efc8f-9241-4646-aba0-ccc846d3932b"
/>

Test command:

`python -m tests.integration_tests.run_tests ./test-out --test_name
pp_dualpipev --test_suite models`

---------

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* Fix apply_compile called multiple times in PP initialization (#2135)

Stacked PRs:
 * __->__#2135


--- --- ---

PP initialization calls apply_compile multiple times, once per PP stage.
But apply_compile does some global patching, so I add `already_patched`
to avoid patching the same method multiple times.

If we patch multiple times, the second time will wrap
`_run_experts_grouped_mm_dynamic` in a torch.compile(fullgraph=True)
leading to the error in the issue below.

FIXES https://github.com/pytorch/torchtitan/issues/2124

* Enable static type checking with Pyrefly (#2136)

Enables static type checking of torchtitan with
[pyrefly](https://github.com/facebook/pyrefly). Type checking the code
helps catch bugs earlier in the development cycle.

* Adds pyrefly to CI, as part of the linting workflow.
* Addresses ~100 type errors that can be fixed via local code changes
and updates to type annotations, and silences the rest with `# pyrefly:
ignore` suppression comments. Note that
https://github.com/pytorch/torchtitan/commit/325efd946f1cbea85e503f9e684b8c879891fc1a
contains all of the non-comment changes.

* [Autoparallel] Add local_map variant of DSv3 and 2D mesh AP (#2129)

Stacked PRs:
 * __->__#2129


--- --- ---

[Autoparallel] Add local_map variant of DSv3 and 2D mesh AP

Currently, the AP experiment monkey patches Titan's main DSv3
implementation. But this is prone to breakage from both model definition
changes in titan and from HOP/partitioner related changes in core. When
these breaks happen, people are usually blocked until I find the root
cause.

I'm going on PTO for the rest of the year, so I'm adding an integration
to AP's DSv3 model in an attempt to make the development more stable for
the upcoming PP integration.

Test: https://gist.github.com/xmfan/db15fda1e1bc1df7cd523005fe0baf33

* Implement ciflow/rocm on Torchtitan (#2114)

In this PR, I implemented ciflow/rocm on Torchtitan. The changes are
part of integration_test_8gpu_features.yaml. The workflow still supports
running on pull_request (without any PR label) for CUDA. However, along
with push to main and cron schedule, with the ciflow/8gpu label added to
PR, the workflow runs for both CUDA & ROCm.

---------

Co-authored-by: Huy Do <huydhn@gmail.com>

* [MoE] Add node limited routing support (#2111)

As titled, added node-limited routing support via two-layer routing.
First, group experts into `num_groups` groups; experts in the same
group should reside on the same node to utilize fast intra-node
communication. Second, pick the `top_k_group` groups by the sum of the
top-2 expert scores in each group. Third, pick `top_k` experts within
the selected `top_k_groups`.
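Assuming equal-sized groups, the three-step routing can be sketched in plain Python for a single token (illustrative only; the real router works on batched score tensors):

```python
def node_limited_topk(scores, num_groups, top_k_group, top_k):
    group_size = len(scores) // num_groups

    # rank groups by the sum of their top-2 expert scores
    def group_score(g):
        members = scores[g * group_size:(g + 1) * group_size]
        return sum(sorted(members, reverse=True)[:2])

    kept = sorted(range(num_groups), key=group_score, reverse=True)[:top_k_group]

    # pick top_k experts among experts in the kept groups only
    candidates = [e for g in kept
                  for e in range(g * group_size, (g + 1) * group_size)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]

# same shape as the debug-model test: 8 experts, 4 groups,
# top_k_group=2, top_k=3
experts = node_limited_topk(
    [0.1, 0.9, 0.2, 0.3, 0.8, 0.05, 0.4, 0.35], 4, 2, 3)
```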

Reference:
https://github.com/huggingface/transformers/blob/4c9fde2a2a3aece0bcf1be93f696e88297da9397/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L212

Test on one node using DeepSeek V3 debug model with MoE arguments
`num_experts=8,
            num_shared_experts=2,
            num_groups=4,
            top_k_group=2,
            top_k=3`.
<img width="1196" height="465" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/63fd8414-1761-4efe-acff-154b1f46a16d"
/>

* Upgrade GitHub Actions to latest versions (#2152)

## Summary

Upgrade GitHub Actions to their latest versions for improved features,
bug fixes, and security updates.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `pypa/gh-action-pypi-publish` |
[`release/v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/release/v1)
| [`v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
|
[Release](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
| release.yml |

## Why upgrade?

Keeping GitHub Actions up to date ensures:
- **Security**: Latest security patches and fixes
- **Features**: Access to new functionality and improvements
- **Compatibility**: Better support for current GitHub features
- **Performance**: Optimizations and efficiency improvements

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Upgrade GitHub Actions for Node 24 compatibility (#2151)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3),
[`v4`](https://github.com/actions/checkout/releases/tag/v4) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
docker-builds.yml, release.yml |
| `actions/download-artifact` |
[`v4`](https://github.com/actions/download-artifact/releases/tag/v4) |
[`v7`](https://github.com/actions/download-artifact/releases/tag/v7) |
[Release](https://github.com/actions/download-artifact/releases/tag/v7)
| release.yml |
| `actions/setup-python` |
[`v5`](https://github.com/actions/setup-python/releases/tag/v5) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
release.yml |
| `actions/upload-artifact` |
[`v4`](https://github.com/actions/upload-artifact/releases/tag/v4) |
[`v6`](https://github.com/actions/upload-artifact/releases/tag/v6) |
[Release](https://github.com/actions/upload-artifact/releases/tag/v6) |
release.yml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Improve the loss_compare.sh logic (#2143)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2145
* #2144
* __->__ #2143

1. Accept a single "." (meaning the current commit) to simplify the
command line.
2. Ignore untracked files.

* [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints (#2021)

As titled, this PR adds an HF state dict adapter to support loading from
the GPT-OSS HF checkpoint. The GPT-OSS checkpoint is quantized in MXFP4
format. The de-quantization steps are offloaded to the
`QuantizedHuggingFaceStorageReader` in `dcp`, so this feature depends on
this PR to update `QuantizedHuggingFaceStorageReader`
(https://github.com/pytorch/pytorch/pull/167672).

1. Test 1. We use `dcp.load(hf_state_dict,
storage_reader=QuantizedHuggingFaceStorageReader(path=input_dir))` to
load from the GPT-OSS HF checkpoint, and map the `hf_state_dict` back to
the TorchTitan state dict. We build one test input and compare two
outputs: 1. using the `transformers` library to load the GPT-OSS HF
checkpoint and run inference on the test input; 2. using the converted
TorchTitan model to run inference on the test input. We compare the
outputs via the KL divergence of the two output probability
distributions. The result shows the two models are very similar.
<img width="1191" height="191" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/bb6a75e9-3dd7-43fa-847e-3f5f4fb5fd93"
/>

2. Test 2. We load the model directly from quantized GPT-OSS HF
checkpoint, and do a test training.
<img width="1198" height="408" alt="Pasted Graphic 1"
src="https://github.com/user-attachments/assets/49ab42ff-0115-4e79-b069-c556e0dd23f6"
/>

* Add local built pytorch path for pyrefly (#2155)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the local built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors for moe.py

* Run vLLM inference using torchtitan model definition (single GPU) (#2119)

As titled, put it in deterministic RL folder

* [RELAND] Let CUDA and ROCm read different loss result (#2157)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm have different loss results. So we need to read from
different loss result files.
The loss results of FSDP and HSDP start to diverge after the 5th step
when running with ROCm, so we also need to adjust this. But this is more
of an unknown issue that AMD people may want to root-cause or confirm
is expected behavior.

**This PR is a reland PR of
https://github.com/pytorch/torchtitan/pull/2156** due to some landing
issue of the previous PR.

* Use new DeviceMesh unflatten to rewrite parallel_dims (#1660)

**Summary**
This PR utilizes the latest APIs provided by DeviceMesh to simplify the
creation of all different meshes.

The design philosophy is as follows:

1. Create one world mesh with the shape [world_size,].
2. Create all 1-D submeshes by either 1) unflattening from the world
mesh, or 2) slicing and flattening from other derived meshes.
3. ParallelDims now provides two APIs, get_mesh() and
get_optional_mesh(), which accept str or list[str]. When the argument is
a str, the API directly returns the corresponding 1-D submesh. If the
argument is a list[str], the dim names are concatenated to form an n-D
device mesh. The main difference between the two APIs is that the former
raises a ValueError if the resulting mesh is None, while the latter just
returns None.

* Integrate DeepEP to torchtitan (#2107)

## Summary
This initial version integrates DeepEP into TorchTitan, focusing on
correctness and compatibility rather than maximal performance tuning.

- Functional DeepEP-backed MoE + Expert Parallelism
- User-controlled configuration
- Compatible with torch.compile and SAC
-  Intended as a first unblocker for benchmarking and iteration

## Perf: DeepSeek-V3 671B on 64 nodes × H100 (512 GPUs total)

<details> <summary><strong>Training config (click to
expand)</strong></summary>

```
config_path="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml",
command_args=[
    "--training.dataset_path=/lustre/fsw/portfolios/sw/users/elfieg/hf_datasets/c4",
    "--training.seq_len=4096",
    "--training.steps=120",
    "--metrics.log_freq=10",
    "--profiling.no-enable-profiling",
    "--comm.init_timeout_seconds=2000",
    "--comm.train_timeout_seconds=300",
    "--metrics.disable_color_printing",

    # Parallelism
    "--parallelism.data_parallel_replicate_degree=1",
    "--parallelism.data_parallel_shard_degree=64",
    "--parallelism.fsdp_reshard_after_forward=default",
    "--parallelism.tensor_parallel_degree=1",
    "--parallelism.expert_parallel_degree=32",
    "--parallelism.expert_tensor_parallel_degree=1",
    "--parallelism.pipeline_parallel_degree=8",
    "--parallelism.pipeline_parallel_schedule=Interleaved1F1B",

    # Training
    "--training.local_batch_size=16",
    "--activation_checkpoint.mode=full",

    # Compilation
    "--compile.enable",
    "--compile.components=model",
    "--compile.components=loss",

    # MoE / DeepEP
    "--debug.moe_force_load_balance",
    "--parallelism.expert_parallel_comm_backend=deepep",
],
```
</details>

After:
```
memory: 56.75GiB(71.74%)  tps: 579  tflops: 162.82  mfu: 16.46%
```
Before:
```
memory: 60.18GiB(76.07%)  tps: 346  tflops: 97.24  mfu: 9.83%
```

## Loss Curve:
<img width="877" height="380" alt="Screenshot 2025-12-16 at 11 30 02 PM"
src="https://github.com/user-attachments/assets/b2f15297-2f05-4f4b-b4d5-b2747a30b2fa"
/>



Shout out to my colleagues @gekurian @syed-ahmed @aazzolini for internal
support!

* Fix pypa/gh-action-pypi-publish version to use SHA pinning (#2161)

## Summary

Fix incorrect version reference for `pypa/gh-action-pypi-publish`.

## Problem

A previous PR incorrectly changed the action reference from `release/v1`
(valid branch) to `v1` (non-existent tag). The `v1` tag doesn't exist in
the pypa/gh-action-pypi-publish repository.

## Solution

Updated to use SHA pinning for release/v1.13:
```yaml
uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e  # release/v1.13
```

This follows [GitHub's security best
practices](https://docs.github.com/en/actions/reference/security/secure-use#using-third-party-actions)
for third-party actions by pinning to an immutable SHA.

## Files Changed

- `.github/workflows/release.yml`

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Upgrade GitHub Actions for Node 24 compatibility (#2164)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
lint.yaml |
| `actions/setup-python` |
[`v4`](https://github.com/actions/setup-python/releases/tag/v4) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
lint.yaml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Expose common dataloader args (#2097)

This diff introduces common dataloader args, which are supported by
StatefulDataLoader (and the torch.utils.data DataLoader). Users should be
able to use them in their config files.

I was thinking about introducing a catch-all kwargs argument to make it
easier to specify args, but that can easily complicate things (validation
checks, duplication, clashes with named args already defined in function
signatures, etc.).

* Replace `logger.warn()` with `logger.warning()`, allow `log_validation` to log `extra_metrics`, and expose common wandb args (#2166)

1. Replace `logger.warn()` with `logger.warning()`
2. Allow `log_validation` to log `extra_metrics`
3. Expose common wandb init args, which is useful when resuming training.
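`Logger.warn` has been a deprecated alias of `Logger.warning` since Python 3.3, so the rename is mechanical; a minimal sketch (the logger name is illustrative):

```python
import logging

logger = logging.getLogger("torchtitan.example")  # hypothetical logger name

# Before: logger.warn(...)  -- deprecated alias of warning() since Python 3.3
# After:
logger.warning("resuming from checkpoint step %d", 100)
```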

* Add Dependabot for GitHub Actions updates (#2163)

## Summary

Add Dependabot configuration to automatically keep GitHub Actions up to
date.

Here's some more information about Dependabot:
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot

## Changes

- Added `.github/dependabot.yml` with weekly checks for GitHub Actions
updates
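A minimal `dependabot.yml` for this purpose might look like the following (this is the documented shape for the `github-actions` ecosystem; the actual file in the PR may differ):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```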

## Context

As discussed in #2161
([comment](https://github.com/pytorch/torchtitan/pull/2161#issuecomment-3667526716)),
adding Dependabot to automatically manage GitHub Actions updates going
forward.

## Why

Dependabot will automatically create PRs when new versions of GitHub
Actions are available, helping to:
- Keep CI/CD workflows secure with the latest patches
- Get new features and improvements
- Maintain compatibility with GitHub's infrastructure

Each action update will be proposed as a separate PR for individual
review and testing.

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Bump tj-actions/changed-files from d6e91a2266cdb9d62096cebf1e8546899c6aa18f to e0021407031f5be11a464abee9a0776171c79891 in the github-actions group (#2167)

Bumps the github-actions group with 1 update:
[tj-actions/changed-files](https://github.com/tj-actions/changed-files).

Updates `tj-actions/changed-files` from
d6e91a2266cdb9d62096cebf1e8546899c6aa18f to
e0021407031f5be11a464abee9a0776171c79891
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tj-actions/changed-files/blob/main/HISTORY.md">tj-actions/changed-files's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h1><a
href="https://github.com/tj-actions/changed-files/compare/v46.0.5...v47.0.0">47.0.0</a>
- (2025-09-13)</h1>
<h2><!-- raw HTML omitted -->🚀 Features</h2>
<ul>
<li>Add any_added to outputs (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2567">#2567</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c260d49a827b5eb266673bed7871c5d3ee9b5aef">c260d49</a>)
- (Jellyfrog)</li>
</ul>
<h2><!-- raw HTML omitted -->➖ Remove</h2>
<ul>
<li>Commit and push step from build job (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2538">#2538</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/be393a90381e27c9fec2c8c2e02b00f005710145">be393a9</a>)
- (Tonye Jack)</li>
</ul>
<h2><!-- raw HTML omitted -->🔄 Update</h2>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2592">#2592</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/3dbc1e181273d808ccff822a6e00cf18b6628ef0">3dbc1e1</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2591">#2591</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/b1ccff8c0892ad141d7d2de6f31e526a9dad931f">b1ccff8</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2574">#2574</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/050a3d3360d29711ee9d8210fc639d902d23ad07">050a3d3</a>)
- (github-actions[bot])</p>
<h2><!-- raw HTML omitted -->📚 Documentation</h2>
<ul>
<li>Update link to glob patterns (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2590">#2590</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/a892f50f7a7187bc288633c09230b09ce7ad8fd0">a892f50</a>)
- (Tonye Jack)</li>
<li>Add Jellyfrog as a contributor for code, and doc (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2573">#2573</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/f000a9b97f254f9590ff26f651cccde827ad36da">f000a9b</a>)
- (allcontributors[bot])</li>
</ul>
<h2><!-- raw HTML omitted -->🧪 Testing</h2>
<ul>
<li>Manual triggered workflows (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2637">#2637</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c2ca2493190021783138cb8aac49bcee14b4bb89">c2ca249</a>)
- (Tonye Jack)</li>
</ul>
<h2><!-- raw HTML omitted -->⚙️ Miscellaneous Tasks</h2>
<ul>
<li><strong>deps-dev:</strong> Bump jest from 30.0.5 to 30.1.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2655">#2655</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/9a6755550a331fdcc8ec45443738933f8fa22eea">9a67555</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.1.0 to 2.2.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2660">#2660</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/b67e30df88f43e244f4e83775e5ad8335114fb95">b67e30d</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.30.2 to
3.30.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2661">#2661</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/62aef422ffa195474d80d73387535cf4622b2824">62aef42</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.11 to
3.30.2 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2659">#2659</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e874f3cddd0f54ae776e6995ae6dae4cf40fd3d3">e874f3c</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump actions/setup-node from 4.4.0 to 5.0.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2656">#2656</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/8c14441336bb3d84fd6b7fa83b6d7201c740baf5">8c14441</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.3.0 to 24.3.1 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2657">#2657</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e995ac4be5be2bcb6e29556edc51fb63aca6b49b">e995ac4</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.2.1 to 24.3.0 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2649">#2649</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/3b04099b21072562f07469c10deb182b24236ca9">3b04099</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.9 to
3.29.11 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2651">#2651</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e7b6c977e51984988e3cc1d6b18abe2a3ba8daaa">e7b6c97</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.0.2 to 2.1.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2648">#2648</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/765d62bc041415a5b494ef13d02d566128b25973">765d62b</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.8 to
3.29.9 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2647">#2647</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/2036da178f85576f1940fedb74bb93a36cd89ab7">2036da1</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-ac…
mormio pushed a commit to NousResearch/torchtitan that referenced this pull request Mar 26, 2026
* [TorchComms] add testing badge at experiments readme (#2010)

* [compiler toolkit] specify passes through config (#2006)

We should be able to control what passes to run in the compiler. This PR
uses the config compile.passes to indicate in a list of graph passes to
apply on the captured gm.

By default, no pass is applied. Users can specify what passes to apply.

Currently there are `autobucketing_reordering_pass` and
`regional_inductor_pass`.

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes autobucketing_reordering,regional_inductor
```

Also updated CI to include this new config

* [simplefsdp] fix region ac in zero2-style FSDP (#1970)

After some offline discussion, we've concluded that life would be easier
if we put SimpleFSDP's checkpoint logic for `reshard_after_forward`
into the compiler. The AC annotation part is borrowed from AP:
[LINK](https://github.com/meta-pytorch/autoparallel/blob/main/autoparallel/activation_checkpointing.py#L69).

**Trace and Loss Check** (all with torch.compile enable)

reshard_after_fwd = False
1. SAC + llama3
([trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-05-06_rank0_trace.json))
<img width="768" height="115" alt="Screenshot 2025-10-30 at 4 28 59 PM"
src="https://github.com/user-attachments/assets/e4e22335-2e3f-46c8-8def-a60d592fee0a"
/>

<img width="689" height="512" alt="Screenshot 2025-11-05 at 9 02 30 PM"
src="https://github.com/user-attachments/assets/40a71316-a457-4e72-9002-cc8beea8f32c"
/>


2. Full AC + llama3 [(trace)]()

<img width="729" height="105" alt="Screenshot 2025-10-30 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/e8d63460-579b-4f0a-8504-851480e5b548"
/>

<img width="789" height="763" alt="Screenshot 2025-11-05 at 9 11 34 PM"
src="https://github.com/user-attachments/assets/1a13d09e-04c4-4db9-99fe-cf10d24bf7f5"
/>


3. No AC + llama3
[[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-03-50_rank0_trace.json)]

<img width="748" height="115" alt="Screenshot 2025-10-30 at 4 32 05 PM"
src="https://github.com/user-attachments/assets/20104d24-9d45-4eba-b694-815e133b88d0"
/>

<img width="800" height="764" alt="Screenshot 2025-11-05 at 9 07 46 PM"
src="https://github.com/user-attachments/assets/55b104ce-8ec1-4ed6-95e7-300e96ad55af"
/>


reshard_after_fwd = True

1. SAC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-34-24_rank0_trace.json))

<img width="795" height="108" alt="Screenshot 2025-10-31 at 11 34 47 AM"
src="https://github.com/user-attachments/assets/a3988f72-7e87-4e52-90f9-8bee840cd6f4"
/>


2. Full AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-31-11-36-27_rank0_trace.json))

<img width="593" height="110" alt="Screenshot 2025-10-31 at 11 38 02 AM"
src="https://github.com/user-attachments/assets/5ee61b2b-9600-4af8-9a24-61b3564f93ca"
/>


3. No AC + llama3
([Trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-30-17-02-44_rank0_trace.json))


<img width="701" height="109" alt="Screenshot 2025-10-31 at 11 43 04 AM"
src="https://github.com/user-attachments/assets/576b28f6-dae4-4ff7-b005-57b0cf9ad7cc"
/>

* [SimpleFSDP] Add typing to simple_fsdp.py (#2001)

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* #2002
* __->__ #2001

Add typing, credit to Claude.

* [Full DTensor][Reland] Add full_dtensor flag (#2013)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2013

When full_dtensor is True, the compute_placement will be preserved. This
means that `to_local()` won't be called for the fsdp-only case. The nD
parallelism case (fsdp + tp) will error out, as we have not implemented
that case.

This argument doesn't affect the current simple_fsdp. We have verified
the `full_dtensor=True` case with the full dtensor skeleton PR, which will
be published once it is ready.

**This is a reland PR of
https://github.com/pytorch/torchtitan/pull/2002. The previous one was
broken during rebase.**

* set pg names (#1986)

Summary:
- we need to pass the global rank information to pytorch so that the pg
name can include the pg information
- this is necessary to differentiate the default pg's on different
replicas
- these need to be different because flight recorder matches collectives
based on pg name as well
- add ft training to experiments folder, we'll move remaining pieces of
ft to this gradually but make new features only available through this
folder

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1986).
* #1988
* #1987
* __->__ #1986

Co-authored-by: Tushar Jain <tushar00jain@users.noreply.github.com>

* Fix the error message of maybe_enable_async_tp() (#2011)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2012
* __->__ #2011

It is not correct as JobConfig has changed.

* Add dry run mode (#2012)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2012
* #2011

Summary:
The current configuration validation requires torchx and GPUs. It can
waste time, resources, and energy. Polar bears are crying. Let's fix
this by providing a dry run mode. This PR doesn't verify everything. In
theory, we should be able to verify parallelism settings as well. This
PR is just a start, but it at least lets us catch typos quickly.

* [easy] [compiler toolkit] Clean up unused function (#2014)

As titled. `_clear_traced_params_buffers` is no longer being used as we
have switched the dynamo graph capture API.

* Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2016)

Addressing the following issues in this PR:

- Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Revert PR-2016 & Redo "Run Torchtitan ROCm workflow on cron schedule & push to Main branch only" (#2017)

Reverts PR: https://github.com/pytorch/torchtitan/pull/2016
Addressing the following issues in this PR:
- Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
- Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* [compiler toolkit] Add tests and scripts for numerics check (#2015)

This PR adds the utils to automatically check the training numerics
(losses, grad norms) of two runs to verify if they have bitwise
equivalence.

The added script triggers two runs with user-defined configs. Then it
loads metrics saved during training and compares the numerics to verify
bitwise equivalence. Currently we check losses and grad norms during
training steps.

For example, we want to compare the numerics between compiler toolkit
with aot_eager backend and eager on llama3-8B.
```
python torchtitan/experiments/compiler_toolkit/scripts/check_numerics.py --ngpu 4 --config-file torchtitan/models/llama3/train_configs/llama3_8b.toml --dp-shard-degree 2 --tp-degree 2
```
It'll run the `simple_fsdp` experiment without `torch.compile` as the eager
baseline, and the `compiler_toolkit` experiment as the compiled run. Then it
compares the training numerics of these two runs to verify bitwise
equivalence.
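The bitwise-equivalence check amounts to exact float comparison per training step; an illustrative sketch (not the actual script, names are hypothetical):

```python
def assert_bitwise_equal(baseline_losses, test_losses):
    """Raise AssertionError unless every step's value matches exactly."""
    assert len(baseline_losses) == len(test_losses), "step counts differ"
    mismatches = [
        (step, a, b)
        for step, (a, b) in enumerate(zip(baseline_losses, test_losses))
        if a != b  # exact comparison: no tolerance, so any drift fails
    ]
    assert not mismatches, f"first mismatch at step {mismatches[0][0]}"
    return len(baseline_losses)
```

Note that `!=` follows IEEE-754 semantics (`-0.0 == 0.0`, `NaN != NaN`), so a stricter check could compare raw bit patterns via `struct.pack` instead.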

When it is bitwise equivalent, we'll see the following output
```
Starting training: simple_fsdp.llama3
✓ Training completed: simple_fsdp.llama3

Starting training: compiler_toolkit.llama3
✓ Training completed: compiler_toolkit.llama3
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
  ✓ PASS: All 11 steps match exactly (bitwise equivalent)
✓ SUCCESS: All metrics are bitwise equivalent
```

Also added unit-tests in `compiler_toolkit/tests/test_numerics.py` so
that we can guard working parallelism combinations that already have
bitwise equivalence in CI.

* Add .claude to .gitignore (#2026)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* #2027
* __->__ #2026

As title

* Fix dry run mode (#2027)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* #2030
* #2028
* __->__ #2027
* #2026

Dry run mode works, but it doesn't exit gracefully in all cases. This PR
fixes that.

```
DRY_RUN=1 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh   --training.steps=10 --activation_checkpoint.mode="none"
--debug.deterministic --debug.seed=42
```

* [Compiler Toolkit] Make compiler toolkit work with checkpoint (#2030)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2029
* __->__ #2030

The current CompileModule will result in an "inner" prefix for
everything. This
PR fixes it by overloading the methods.

Also merge https://github.com/pytorch/torchtitan/pull/2028 to this PR.
Something wrong with ghstack.

* [Flux] Update integration test badge in README.md (#2019)

Fixes the badge in the `README.md` file

* Print device and stride when print module (#2045)

Before:
<img width="978" height="93" alt="image"
src="https://github.com/user-attachments/assets/48dc39d9-e897-4396-ac62-025574303403"
/>


After:
<img width="1318" height="82" alt="image"
src="https://github.com/user-attachments/assets/47b4771a-aaf9-4f61-80bc-757f3a08c1d2"
/>

* [SimpleFSDP] add manual bucketing pass (#1881)

This PR adds support for aten-level manual bucketing in
SimpleFSDP+`aot_eager` backend. Dependent on PyTorch
[PR](https://github.com/pytorch/pytorch/pull/165487)

TODO List:
- [ ] We should have a better way of handling region info other than a
list of str FQNs in the current `manual_bucketed_modules`. It would be very
easy to miss some model modules. (cc. @xmfan @SherlockNoMad )
- [ ] Currently, the reordering happens under the hood and overlaps with
the last/next compute. We should allow users to specify which modules they
want to reorder.
- [ ] Loss difference on multi-node training
- [ ] DSV3 manual bucketing

I'll address the TODO items in follow up PRs. Let's start with this
simple FSDP+TP+llama3 PR.

1. Performance (FSDP2 under eager mode, SimpleFSDP uses `aot_eager`
backend)

**Llama 3-8B**

* Performance (All Batch_size = 1). (The slower TPS on Single Node is
sort of as expected, since FSDP2 handles copy-in/out in two different
streams, whereas SimpleFSDP handles copy-in/out in the same stream)

|Node| Method | Parallelism | Memory | TPS | Trace|
|---------|---------|-----------|----------|------|------|
|1-Node (8H100)|SimpleFSDP | FSDP=8| 40.96GiB(43.12%) | 7,227|
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-48-48_rank0_trace.json)|
|1-Node (8H100)|FSDP2-eager| FSDP=8| 47.82GiB(50.35%) | 7,380 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-16-10-54-14_rank0_trace.json)|
|8-Node (64H100)|SimpleFSDP| FSDP=64  | 29.37GiB | 4,984| |
|8-Node (64H100)|FSDP2| FSDP=64 | 31.41GiB  |5,097 | |
|1-Node (8H100)|SimpleFSDP| FSDP=4 TP=2 | 28.28GiB(29.77%) | 5,881 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-18-00-18_rank0_trace.json)
|
|1-Node (8H100)|FSDP2| FSDP=4 TP=2 | 35.33GiB(37.20%) | 5,898 |
[LINK](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/ruisizhang123_2025-10-26-15-35-47_rank0_trace.json)
|
|8-Node (64H100)|SimpleFSDP| FSDP=8 TP=8  |   |||
|8-Node (64H100)|FSDP2| FSDP=8 TP=8 |   |||

Example SimpleFSDP 1D overlapping trace:

<img width="1127" height="127" alt="Screenshot 2025-10-16 at 10 49
55 AM"
src="https://github.com/user-attachments/assets/2d9e3ff8-8e9b-40a7-a666-3c0a0975186e"
/>

Example SimpleFSDP 2D overlapping trace:
<img width="1162" height="166" alt="Screenshot 2025-10-26 at 6 00 51 PM"
src="https://github.com/user-attachments/assets/bc5cc031-5b6c-4e4d-a9da-70c43114f49a"
/>


- Bitwise Loss:

FSDP-only:
<img width="1266" height="837" alt="Screenshot 2025-10-17 at 10 41
56 AM"
src="https://github.com/user-attachments/assets/30f83d95-1eca-4f10-9e7e-47c45278cd8d"
/>

FSDP+TP:
<img width="1259" height="808" alt="Screenshot 2025-10-26 at 9 03 58 PM"
src="https://github.com/user-attachments/assets/b75b452b-adb9-4078-9412-ee9e584ffe15"
/>

* Add export_dtype parameter to `convert_to_hf` function (#2041)

The current `convert_to_hf.py` does not support `export_dtype`, which
makes it `float32` by default. This PR adds support for export dtypes of
`["float16", "bfloat16", "float32"]`.
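A plausible CLI shape for such a flag (hypothetical names; the actual `convert_to_hf.py` arguments may differ), where the default preserves the previous `float32` behavior:

```python
# Hypothetical sketch of the new flag; actual argument names may differ.
import argparse

EXPORT_DTYPES = ["float16", "bfloat16", "float32"]

parser = argparse.ArgumentParser()
parser.add_argument(
    "--export_dtype",
    choices=EXPORT_DTYPES,
    default="float32",  # keeps the pre-existing default behavior
    help="dtype used when writing the HF checkpoint",
)

args = parser.parse_args(["--export_dtype", "bfloat16"])
print(args.export_dtype)  # bfloat16
```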

* [compiler toolkit] Port joint_ac_pass from simplefsdp (#2051)

This PR integrates the changes in #1970 into the compiler toolkit (applying
`joint_ac_pass` on the joint graph to tag nodes based on the
`reshard_after_forward` flag).

Also did some refactoring for applying graph passes in the compiler toolkit
experiments. We will have two kinds of passes:

1. joint_custom_passes: these are passes to be applied on the captured
joint graph before the partitioner. By default we apply
`validate_flex_attn_annotation_pass` and `fsdp_reshard_after_fwd_pass`.

2. compiler_passes: these are passes to be applied on the partitioned fwd
and bwd graphs as backend optimizations. By default there are none. We
can enable `autobucketing_reordering_pass` and
`regional_inductor_pass` via configs.

* [compiler toolkit] Port manual bucketing from SimpleFSDP experiment (#2056)

This PR integrates the manual bucketing pass (transformer block
bucketing) added in SimpleFSDP experiment (#1881) to compiler toolkit

So now the compiler toolkit can also run the manual bucketing pass by
specifying the config

```
NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes transformer_block_bucketing
``` 

Also updated README and integration test to include the newly ported
pass

* Re:Run Torchtitan ROCm workflow on cron schedule & push to Main branch only (#2018)

Addressing the following issues in this PR:

Run the Torchtitan ROCm workflow on a cron schedule & only on pushes to
the main branch. The CUDA workflow will run as is.
Refactor Torchtitan test run to address older PR comment
https://github.com/pytorch/torchtitan/pull/1786#discussion_r2476279289

* Add a loss comparison script (#2029)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2049
* __->__ #2029


## Summary
This PR adds `scripts/loss_compare.py` for comparing training losses
between different git commits and/or training configurations.

## Key Features

- Commit Comparison: Compare losses between two different git commits
with deterministic training
- Configuration Comparison: Compare different training configurations on
the same commit
- Reproducibility: Automatically enables deterministic mode and seed
checkpointing for reproducible
  comparisons
- Real-time Output: Streams training output to both console and log
files during execution
- Statistical Analysis: Generates step-by-step loss comparisons and
summary statistics
- CI Testing: Includes --assert-equal flag for automated testing to
verify identical losses

## Usage Examples

#### Compare two commits
```
python3 ./scripts/loss_compare.py main my_branch
```
#### Compare two commits with custom configuration 
```
python3 ./scripts/loss_compare.py main my_branch \
--baseline-config="./custom.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2"
```

#### Compare different parallelization strategies on same commit
```
python3 ./scripts/loss_compare.py . . \
--baseline-config="./llama3_8b.toml" \
--baseline-options="--parallelism.tensor_parallel_degree=2" \
--test-options="--parallelism.tensor_parallel_degree=1"
```

#### Assert equality for CI testing
```
python3 ./scripts/loss_compare.py main my_branch --assert-equal
```


## Real Use Cases
Compare full dtensor simple fsdp with fsdp2:
```
python3 scripts/loss_compare.py . . \
--baseline-options='--activation_checkpoint.mode="none"' \
--test-train-file='torchtitan.experiments.full_dtensor.train' \
--test-options='--model.name full_dtensor.llama3 --activation_checkpoint.mode="none"' \
--assert-equal --no-seed-checkpoint


[LOSS_COMPARE]
[LOSS_COMPARE] Asserting losses are equal...
[LOSS_COMPARE] Baseline log: /tmp/baseline_training.log
[LOSS_COMPARE] Test log: /tmp/test_training.log
[LOSS_COMPARE] Extracted 100 steps from baseline log
[LOSS_COMPARE] Extracted 100 steps from test log
test_losses_equal (__main__.assert_losses_equal.<locals>.LossEqualityTest.test_losses_equal) ... ok
```

* Fix integration test gpu_arch_type field (#2060)

All tests in experiments are broken due to the `gpu_arch_type` field
added in #2018.

* [compiler toolkit] Add Trainer subclass for compiler toolkit (#2064)

Adding CudaGraph pass (https://github.com/pytorch/torchtitan/pull/2050)
would require some custom logic in Trainer's close() method.

So we create a Trainer subclass in compiler toolkit

* Let loss_compare.py check the repo cleaness (#2062)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2063
* __->__ #2062

This will prevent errors when later doing git checkout

* CUDAGraph support for SimpleFSDP and TP (#2050)

## Features
- [x] Support SimpleFSDP and TP
- [x] Support static input indices to reduce copy
- [x] Support memory reuse to reduce memory consumption
- [x] Cleanup cudagraph when training finishes to avoid nccl hang from
destroy_process_group

Command:
```
NCCL_GRAPH_REGISTER=0 NGPU=8 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph
```


Note: we use `NCCL_GRAPH_REGISTER=0` due to a known issue that nccl +
cudagraphs + expandable segments result in IMA.
https://github.com/pytorch/pytorch/issues/158029


[trace](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces%2Ftree%2Fshared_trace%2Fboyuan_e1ef464b-ee61-4c61-82e5-f7a485e561bf_rank0_trace.json)

## Result

**Numerics:**
Achieved bitwise equivalence w/ and w/o cudagraph pass on llama3.1-8B
AND llama3.1-70B.

**Performance:**
<img width="560" height="90" alt="image"
src="https://github.com/user-attachments/assets/9d54c461-0eb1-4f7e-9652-3d52043ad74f"
/>

Raw log:
[llama3-8b](https://www.internalfb.com/phabricator/paste/view/P2045444190),
[llama3-70b](https://www.internalfb.com/phabricator/paste/view/P2045567416)

**Memory:**
On llama3.1-70b, cudagraph consumes ~6% more memory (143 GiB vs
153 GiB).

A few tricks to reduce memory consumption (use llama3.1-70b w/ cudagraph
as an example):
- Start: 161 GiB
- \+ use the same stream for warmup and graph capture of both fwd and
bwd: 160 GiB
- \+ warmup in cudagraph memory pool instead of eager memory pool: 153
GiB


**static input copy:**
On llama3.1-70B, for forward, we copy 1 tensor of 128 bytes; for
backward, we copy 1 tensor of 0.98 GB. This shows static input indices
are handled correctly.


## Followup PR
In the followup PR, I will enable fx graph partition for deepseek v3
https://github.com/pytorch/pytorch/pull/165945.

* compiler_toolkit: fix args access (#2067)

This PR fixes access to args; it's an attribute, not a variable in the
scope.
The method itself, though, would not be used, because
`should_check_address` seems to always be `False` and there doesn't seem
to be a command-line argument for it.

Signed-off-by: Masaki Kozuki <mkozuki@nvidia.com>

* 3outeille/transformers backend (Dense model only) (#2048)

# Context
Reference PR: https://github.com/huggingface/torchtitan/pull/1

This PR enables:
- Llama-like HF models to work with 4D parallelism: FSDP, CP, TP, PP
(and the combinations between them). The following models were tested:
  - `meta-llama/Llama-3.2-1B`
  - `microsoft/phi-2`
  - `Qwen/Qwen2.5-7B`
  - `mistralai/Mistral-7B-v0.1`
  - `ByteDance-Seed/Seed-Coder-8B-Instruct`
  - `Qwen/Qwen3-4B-Instruct-2507`
  - `arcee-ai/AFM-4.5B`
  - `ibm-granite/granite-3b-code-base-2k`
  - `baidu/ERNIE-4.5-0.3B-Base-PT`
  - `kyutai/helium-1-preview-2b`
  - `allenai/OLMo-7B-hf`
  - `mistralai/Ministral-8B-Instruct-2410`
- Patching HF models' weights initialisation. Without this, the
`loss` and `grad_norm` start very high.

# Usage

- Requirements `transformers==4.57.1`
- Config:
`torchtitan/torchtitan/experiments/transformers_backend/configs/qwen3.toml`
```diff
...
[model]
- name = "llama3"
+ name = "transformers_backend"
flavor = "debugmodel"
hf_assets_path = "./tests/assets/tokenizer"

+[hf_transformers]
+model = "Qwen/Qwen3-4B-Instruct-2507"
...
```
- Train: `LOG_RANK=7
CONFIG_FILE=<YOUR_PATH>/torchtitan/experiments/transformers_backend/configs/qwen3.toml
./run_train.sh
--job.custom_config_module=torchtitan.experiments.transformers_backend.job_config
--compile.enable`

<img width="1334" height="453" alt="image"
src="https://github.com/user-attachments/assets/da459448-027b-4af9-8176-6a3e433a272c"
/>

# Testing methodology

<img width="2672" height="2018" alt="image"
src="https://github.com/user-attachments/assets/66d8689d-7ede-47e3-b389-d4fc1bdd70f7"
/>

- Following the
[converging.md](https://github.com/pytorch/torchtitan/blob/main/docs/converging.md)
guidelines, I am comparing the baseline `FSDP=2` vs `FSDP=2 & <other
//-ism>`
- More precisely, `test_hf_integration.py` produces the following layout:

```bash
    results/
        |_ meta-llama
            |_ Llama-3.2-1B
                |_ debugmodel/
                    |_ seed_checkpoint/
                        |_ config.toml
                        |_ seed.slurm
                        |_ step-0/
                           |_ ....
                    |_ fsdp2_tp1_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                    |_ fsdp2_tp2_cp1_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp1_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp1/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                    |_ fsdp2_tp1_cp2_pp2/
                        |_ config.toml
                        |_ nd_parallelism.slurm
                        |_ nd_parallelism.log
                        |_ diff_baseline_vs_nd_parallelism.log
                |_ full/
                ...
```
- Here is the grid search to test the HF modelling
```shell
#!/usr/bin/bash
model_names=(
     "meta-llama/Llama-3.2-1B"
     "microsoft/phi-2" 
     "Qwen/Qwen2.5-7B"
     "mistralai/Mistral-7B-v0.1"
     "ByteDance-Seed/Seed-Coder-8B-Instruct"
     "Qwen/Qwen3-4B-Instruct-2507" 
     "arcee-ai/AFM-4.5B" 
     "ibm-granite/granite-3b-code-base-2k" 
     "baidu/ERNIE-4.5-0.3B-Base-PT" 
     "kyutai/helium-1-preview-2b" 
     "allenai/OLMo-7B-hf"
     "mistralai/Ministral-8B-Instruct-2410" 
)

for model_name in "${model_names[@]}"; do
    rm -rf slurm_results/${model_name}

    python test_hf_integration.py create_configs --model_name "$model_name" --out_dir slurm_results --flavor debugmodel
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel/seed_checkpoint --qos high
    while [ ! -f slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt ] || [ "$(cat slurm_results/${model_name}/debugmodel/seed_checkpoint/status.txt)" != "completed" ]; do
        echo "Waiting for seed checkpoint from ${model_name} to complete ..."
        sleep 1
    done
    python test_hf_integration.py submit_jobs --inp_dir slurm_results/${model_name}/debugmodel --qos high
    echo "================"
done
```

# Further tasks

- MoE (handled in PR https://github.com/huggingface/torchtitan/pull/3)
	- Missing `build_optimizers_with_moe_load_balancing` support for MoE
	- Missing TP/PP/EP support for MoE
- When using the HF modeling, in the `FSDP=2 vs FSDP=2 + PP=2` test the
`loss` and `grad_norm` do not match bitwise (though they converge),
while they do with the Torchtitan modeling (issue tracked in
https://github.com/huggingface/torchtitan/pull/4)
- Add convergence tests to CI using a tiny model + gloo backend (once
PP is bitwise matching)
- The HF modeling has lower MFU than the Torchtitan modeling
- NOTE: set `import torch._dynamo.config;
torch._dynamo.config.cache_size_limit = 128` to avoid graph
recompilation when using `torch.compile` with activation checkpointing

* adding variable length attention to llama3 8b   (#2000)

**Summary**
This PR adds variable length attention (varlen) support to the Llama 3
8b model in torchtitan. We replace `use_flex_attn` with `attn_type`
(either "sdpa", "varlen", "flex"). If `attn_type = "varlen"`, the
attention module calls a compiled `varlen_attn` defined
[here](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/varlen.py).

**Testing**
Ran loss and performance tests against flex attention. Loss is on par.

<img width="947" height="505" alt="Screenshot 2025-11-19 at 3 24 26 PM"
src="https://github.com/user-attachments/assets/d85dfc09-4f5e-4f82-abc9-49b870b34990"
/>

Varlen is slightly slower than Flex due to CUDA kernel speeds
(varlen calls into `flash_attention_forward`/`flash_attention_backward`
today).


| | Varlen | Flex |
| :---: | :------ | :---: |
| Forward  | 774us 357ns | 722us 317ns  |
| Backward   | 1ms 955us 916ns  | 1ms 558us 747ns    |
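Beyond q/k/v, the key extra input a varlen attention kernel needs is the cumulative sequence-length offsets that mark where each packed document starts. A minimal sketch of that bookkeeping in plain Python (the helper name is illustrative, not torchtitan's API):

```python
def cu_seqlens(doc_lens):
    """Build cumulative sequence-length offsets for a packed batch.

    For documents of lengths [3, 5, 2] packed back to back into one
    sequence of 10 tokens, varlen-style kernels expect boundary offsets
    [0, 3, 8, 10]: document i spans tokens [cu[i], cu[i+1]).
    """
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return cu

print(cu_seqlens([3, 5, 2]))  # [0, 3, 8, 10]
```

In the real kernel these offsets let attention stay within each document's boundaries instead of attending across packed documents.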

* remove scatter_add in MoE implementation (#1974)

PR for removing `scatter_add` in the MoE implementation. `scatter_add`
is somewhat problematic as it is non-deterministic due to the necessity
of [atomic
adds](https://discuss.pytorch.org/t/why-does-index-add-and-scatter-add-induce-non-deterministic-behavior-on-the-cuda-backend/45544/2)
for correctness.

Determinism, correctness, and performance tests using scripts under
`torchtitan/moe_bench_and_test`:

```
# Determinism: run same forward 100x and compute standard deviations
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_determinism

out_old_std=tensor(0.0297, device='cuda:0', dtype=torch.bfloat16)
out_std=tensor(0., device='cuda:0', dtype=torch.bfloat16)
out_old_std/out_moe_old.abs().mean()=tensor(0.0006, device='cuda:0', dtype=torch.bfloat16)
out_std/out_moe.abs().mean()=tensor(0., device='cuda:0', dtype=torch.bfloat16)
```

```
# Accuracy: compare MoE outputs to FFN outputs, with weights set such that outputs should be the same
# Relative error decreased by 3x
pytest -rsfP torchtitan/moe_bench_and_test/test_moe.py -k test_moe_ffn_equivalence

moe_old_rel_err=0.009754068047048696
moe_rel_err=0.002507858727736454
moe_old_rel_err/moe_rel_err=3.8894009216589858
```

```
# Timing: triton do_bench for DSv3 16B layer fwd + bwd. ~3% faster runtime
python torchtitan/moe_bench_and_test/moe_timing.py moe_old && python torchtitan/moe_bench_and_test/moe_timing.py moe

args=Namespace(cls='moe_old', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.712812881469727

args=Namespace(cls='moe', perf_reps=1000, perf_warmups=100, seqlen=4096, bsz=4)
moe_time_ms=19.03301840562087

```

```
# Memory: for DSv3 16B layer fwd + bwd. ~15% reduction in active mem, ~18% in reserved mem.
python torchtitan/moe_bench_and_test/moe_memory.py moe_old && python torchtitan/moe_bench_and_test/moe_memory.py moe

args=Namespace(cls='moe_old', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.926029682159424
peak_stats.max_reserved_gib=7.224609375

args=Namespace(cls='moe', iters=1, seqlen=4096, bsz=4)
peak_stats.max_active_gib=5.051033020019531
peak_stats.max_reserved_gib=5.91015625
```

Testing fwd + bwd correctness for `tp_degree=ep_degree=world_size=8` and
`etp=1`
```
# Similar relative errors
torchrun --nproc-per-node 8 torchtitan/moe_bench_and_test/test_tp.py

args=Namespace(seqlen=256, bsz=4, tol=0.01), world_size=8, tp=8, ep=8, etp=1

err_ratio_fsdp_ep_old=0.0028211805268959435
err_ratio_fsdp_ep=0.002805679534989922
err_ratio_ep_ep_old=0.0022941468020912068

kl_fsdp_ep_old=tensor(2.4915e-05, device='cuda:0', dtype=torch.bfloat16)
kl_fsdp_ep=tensor(2.0981e-05, device='cuda:0', dtype=torch.bfloat16)
kl_ep_ep_old=tensor(2.1458e-05, device='cuda:0', dtype=torch.bfloat16)
```

Everything under `torchtitan/moe_bench_and_test` consists of temporary
testing utilities and will be deleted prior to merging.
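The non-determinism being removed comes from floating-point addition not being associative: atomic adds can land in any order, so the same reduction can give different results run to run. A tiny standalone illustration (not the MoE code itself):

```python
# Floating-point addition is not associative, so a reduction whose
# summation order varies run to run (as with GPU atomic adds) is
# non-deterministic.
a, b, c = 0.1, 1e20, -1e20

left_to_right = (a + b) + c  # 0.1 is absorbed by 1e20, then cancelled: 0.0
small_first = a + (b + c)    # the large terms cancel first, so 0.1 survives

assert left_to_right != small_first
```

Replacing `scatter_add` with an order-stable combine fixes the summation order, which is why the new implementation's 100x-repeated forward shows zero standard deviation.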

* Update transformers backend name (#2075)

following Huggingface efforts in VLLM (cf
https://github.com/vllm-project/vllm/pull/28725), we would like to
uniformize the naming and make sure that people think we use the HF
models only

* Enhance loss_compare.py: Add Import/Export Options and Enable CI Comparison with Existing Losses (#2063)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2063

This PR allows us to check whether the loss is consistent across commits/PRs.
1. This PR contains a pre-tested loss result file.
2. This PR improves loss_compare.py to add --import and --export
options.
3. In CI, --import is used to get the previous losses and compare them
with the current PR's. If anything mismatches over the 10 steps, the CI
will fail.

* Print out the version number (#2083)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2083

This PR and https://github.com/pytorch/torchtitan/pull/2070 can resolve
https://github.com/pytorch/torchtitan/issues/2043.

This should not affect `.github/scripts/update_version.sh` as
`.github/scripts/update_version.sh` will append the version at the end
of the file, which will overwrite the value.

* Autoparallel as an experiment in main (#2054)

Experiments like SimpleFSDP/Compiler Toolkit/Autoparallel are all being
developed at the same time, and SimpleFSDP/Compiler Toolkit both run
into issues with PP that requires the PP utilities from Autoparallel. We
want to land the Autoparallel experiment into main to facilitate that
sharing.

---------

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Will Constable <whc@meta.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Francisco Massa <fvsmassa@gmail.com>
Co-authored-by: ruisizhang123 <ruisizhang123@gmail.com>
Co-authored-by: Brian Hirsh <briandhirsh@gmail.com>
Co-authored-by: Will Constable <willconstable@gmail.com>

* skip varlen integration test on rocm (#2085)

As title, since varlen attention is not supported on ROCm.

* [Local Tensor] Replace dry_run.py with fake mode implementation (#2057)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2057

Replaces `dry_run.py` implementation with fake PG mode for DRY_RUN
configuration validation. This PR also adds support of Local tensor mode
to provide deeper validation coverage.

**Note:** Currently returns early before `init_weights()` if using local
tensor mode due to some limitation of local tensor, which will be fixed
by https://github.com/pytorch/pytorch/pull/166540 .

* add varlen attention for qwen 3 (#2084)

As title 

**Testing**

<img width="469" height="431" alt="Screenshot 2025-11-24 at 4 30 53 PM"
src="https://github.com/user-attachments/assets/6b9a362d-de36-48b7-b465-d91ae24f4cbf"
/>

Performance and loss are on par.

* [FLUX] Add FLUX inference test in CI (#1969)

* Improve logging by formatting the dict as JSON. (#2094)

We use Slurm to run jobs, and I just noticed that job configs and model
args were being logged on a single line by default, which makes the logs
hard to read.

This PR improves readability by formatting these dictionaries with
`json.dumps` before logging, so the configs are formatted nicely and
easier for humans to read.

before:
<img width="2594" height="640" alt="image"
src="https://github.com/user-attachments/assets/c3c07b09-d12c-484d-aa90-a626cd25c6d2"
/>

after:
<img width="2252" height="1032" alt="image"
src="https://github.com/user-attachments/assets/4cbde979-c34c-4fc5-aa55-f280f39cf9ef"
/>
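The pattern is simply formatting the dict with `json.dumps` before handing it to the logger; a minimal sketch (not the exact torchtitan call site):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

config = {"model": {"name": "llama3", "flavor": "debugmodel"},
          "training": {"steps": 10}}

# Before: the whole dict lands on one unreadable line.
logger.info("Job config: %s", config)

# After: pretty-printed, one key per line.
logger.info("Job config:\n%s", json.dumps(config, indent=2))
```

For configs containing non-JSON-serializable values, `json.dumps(..., default=str)` is a common fallback.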

* add all SDPA backends to op_sac_save_list (#2095)

As we discussed in https://github.com/pytorch/torchtitan/issues/2091, we
should add all `scaled_dot_product_attention` backends to
`op_sac_save_list` to avoid recomputing attention during backward.

* modify save list for varlen attn (#2082)

adding varlen attention ops to ac save list

**testing**

Used DebugMode() to print out the op list and verified that the forward
is not recomputed in the backward step.

```
[rank0]:forward ops
[rank0]:varlen_attn in forward: True
...
[rank0]:varlen_attn recomputed in backward: False
[rank0]:saved correctly
```

* Make sure log after distributed initialized. (#2102)

There is a condition check in config logging for distributed
initialization, so the config logging has to happen after distributed
has been initialized.

Co-authored-by: Zhiqiang Zang <zzq@fb.com>

* [mxfp8] [docs] [BE] add MXFP8 usage documentation and benchmarks (#2096)

Fixes #1998

* Mark input tokens to routed experts as dynamic to avoid a recompile (#2007)

Stacked PRs:
 * __->__#2007


--- --- ---

Mark input tokens to routed experts as dynamic to avoid a recompile


This saves 1 recompile, and you can see the input tokens are dynamic
from the first graph compiled:
```python
class GraphModule(torch.nn.Module):
    def forward(...s77: "Sym(s77)", L_x_: "bf16[s77, 5120][5120, 1]cuda:0"...
```

I verified that this also fixes the AC recompile issue of:
https://github.com/pytorch/torchtitan/issues/1971. But I'm keeping
`torch._C._dynamo.eval_frame._set_lru_cache(False)`, as there could be
other recompile reasons popping up.

* fix mxfp8 loss image (#2104)

In the original PR i moved the image location without updating the
markdown pointing to it by accident. This fixes that.

* Update hf_assets_path for llama4 (#2110)

Fix typo in train_config, hf asset should be for maverick, see:

https://huggingface.co/meta-llama/models?search=128e

* Enables parsing of --compile.components through CLI (#2115)

Without this PR, I'm not able to pass `--compile.components=model,loss`.
Tested using `python -m torchtitan.config.manager
--compile.components=model,loss`.
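A common way to make argparse accept a comma-separated list on a single flag is a `type` callable that splits the string; a generic sketch of the pattern (not torchtitan's actual config manager):

```python
import argparse

parser = argparse.ArgumentParser()
# Split "model,loss" into ["model", "loss"] at parse time.
parser.add_argument(
    "--compile.components",
    dest="compile_components",  # avoid a dotted attribute name
    type=lambda s: s.split(","),
    default=[],
)

args = parser.parse_args(["--compile.components=model,loss"])
print(args.compile_components)  # ['model', 'loss']
```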

* fix `ForgeEngine` compatibility issue with (#2121)

Summary:
Fix backward incompatible changes introduced in 


https://github.com/pytorch/torchtitan/commit/f29828bbc8018c9374861aff142c658e2e08e8b4

Differential Revision: D88572518

* Remove the hack for SAC + FlexAttention (#2118)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2118

PyTorch can now support torch.compile inside the SAC region even if
torch.compile is not used to wrap SAC. This PR removes the workaround to
ensure torch.compile works with Flex

* Add warning to run_tests (#2123)

Small addition since right now running a test that doesn't exist just
outputs nothing, e.g.

`python -m tests.integration_tests.run_tests ./test-out --test_name
does_not_exist`

Now the output is:

`
WARNING:root:No tests were run for --test_name 'does_not_exist' in test
suite 'features'.
Available test names in 'features' suite: ['default', '1d_compile',
'1d_compile_sac_op', '2d_eager', '2d_compile', 'full_checkpoint',
'model_only_hf_checkpoint', 'last_save_model_only_fp32',
'last_save_model_only_bf16', 'pp_looped_zero_bubble', 'pp_zbv',
'pp_1f1b', 'pp_gpipe', 'pp_dp_1f1b', 'pp_dp_gpipe', 'pp_tp', 'pp_dp_tp',
'3d_compile', 'pp_looped_1f1b', 'pp_custom_csv', 'optimizer_foreach',
'ddp', 'hsdp', 'fsdp+flex_attn', 'fsdp+flex_attn+per_op_sac',
'fsdp+varlen_attn+per_op_sac', 'cp_allgather', 'cp_alltoall', 'hsdp+tp',
'fsdp+cp', 'hsdp+cp_without_dp_shard', 'hsdp+cp_with_dp_shard',
'fsdp+tp+cp', 'cpu_offload+opt_in_bwd+TP+DP+CP', 'test_generate',
'fsdp_reshard_always', 'optional_checkpoint', 'float8_emulation',
'gradient_accumulation', 'validation_tp_cp_pp']
`

* [compiler toolkit] Disable CUDAGraph integration test (#2127)

As titled. We'll enable when it is fixed.

* Add CI for Autoparallel experiment llama3 on 4 GPUs (#2105)

* Support rope cache indexing using positions (#2112)

Add support for indexing the rope cache using `position_ids`; this may
be needed during
1. inference, where we pass `position_ids` into the transformer forward
2. CP load balancing, where we need to index the rope cache given
position ids
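Conceptually, the change replaces a contiguous slice of the rope cache with a gather on explicit positions; in plain Python terms (illustrative, not the actual model code):

```python
# A precomputed rope "cache": one entry per absolute position.
rope_cache = [f"freqs@{pos}" for pos in range(8)]

# Default training path: positions are just 0..seq_len-1,
# so a contiguous slice suffices.
seq_len = 4
contiguous = rope_cache[:seq_len]

# Inference / CP load balancing: gather by explicit position ids,
# which need not be contiguous or start at 0.
position_ids = [5, 6, 7, 2]
gathered = [rope_cache[p] for p in position_ids]

print(contiguous)  # ['freqs@0', 'freqs@1', 'freqs@2', 'freqs@3']
print(gathered)    # ['freqs@5', 'freqs@6', 'freqs@7', 'freqs@2']
```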

Test: 
running dpskv3 16b base
<img width="489" height="286" alt="image"
src="https://github.com/user-attachments/assets/6f463d65-a0de-413d-ab19-770db9983dbb"
/>

also tested in https://github.com/wwwjn/torchtitan/pull/1/files when
passing position_ids
<img width="665" height="269" alt="image"
src="https://github.com/user-attachments/assets/70e4bddc-0334-4dbf-b00d-6e4b49a94655"
/>

---------

Co-authored-by: JessicaZhong <zhengjesszhong@gmail.com>

* [forge] allow torchforges to set checkpoint base folder (#2131)

This PR
1) allows Torchforge to decide where to put the checkpoint, wandb,
etc., instead of the "current" folder
~~allowing Torchforge to decide to print / log the configs~~

* Rename auto_parallel experiment to autoparallel (#2128)

* PyTorch depends on psutil (#2132)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2132

TorchTitan should also depend on psutil.

* Remove caching for attention masks (#2117)

We remove the lru_cache for attention masks because, in the
get_attention_mask() function, `and_masks(*mask_mods)` returns a new
object (with a new id) on every call. `create_attention_mask` uses all
parameters as the cache key, so the new object id always causes a cache
miss.
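The failure mode is easy to reproduce with `functools.lru_cache` alone: if one argument is a freshly created object on each call, the cache key differs every time and every lookup misses. A standalone illustration (the names mimic the real ones but the code is a toy, not torchtitan's):

```python
import functools

def and_masks(*mask_mods):
    # Returns a *new* function object on every call, like the real and_masks.
    return lambda *args: all(m(*args) for m in mask_mods)

@functools.lru_cache
def create_attention_mask(mask_mod, seq_len):
    # mask_mod is part of the cache key; functions hash by identity.
    return f"mask(seq_len={seq_len})"

causal = lambda *a: True
for _ in range(3):
    create_attention_mask(and_masks(causal), 128)

info = create_attention_mask.cache_info()
print(info.hits, info.misses)  # 0 3 -- every call missed the cache
```

Since the cache never hits, it only costs memory, which is why removing it is the right fix.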

Before the change: (llama3 debugmodel_flex_attn)
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 27 45 PM"
src="https://github.com/user-attachments/assets/e9af2597-9d94-4478-8136-8b9b8c35d9e6"
/>

After the change:
<img width="1182" height="275" alt="Screenshot 2025-12-09 at 1 29 56 PM"
src="https://github.com/user-attachments/assets/756a7d09-b47f-434f-8ff6-40098b265a03"
/>

* Clarify contribution guidelines. (#2134)

* Enable PP and EP overlap for MoE (#1721)

Option 2 of https://github.com/pytorch/torchtitan/issues/1682

These changes add a custom `overlap_callback` function to replace the
OVERLAP_F_B action that is run during the schedule execution. In the
custom function, we write `run_forward()` and `run_backward()`.
`run_backward()` is run as a separate thread so that we can have both
forward and backward running together side by side. Looks like this:

<img width="1321" height="443" alt="image"
src="https://github.com/user-attachments/assets/911f3637-1afa-4537-989a-a325ba558957"
/>

In order for these changes to work with Expert Parallel, we also need to
add custom autograd functions to act as the boundary points at which we
do communication. We added hooks before and after expert parallel
dispatch and combine to signal boundary points, so our figure from
before now turns into:

<img width="1382" height="388" alt="image"
src="https://github.com/user-attachments/assets/3991749d-7d67-4098-81a4-4efcfd1c75ca"
/>

Now in each of these red blocks, we use a global coordinator. We need
`threading.Barrier(2).wait()` so that the comm and compute from our
forward and backward steps are scheduled in lock-step before continuing.

DSv3 16B run command:
```
TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_FR_DUMP_TEMP_FILE=./nccl_trace_rank_ NGPU=8  CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh
```

Trace examples:

<img width="2409" height="1889" alt="image"
src="https://github.com/user-attachments/assets/923efc8f-9241-4646-aba0-ccc846d3932b"
/>

Test command:

`python -m tests.integration_tests.run_tests ./test-out --test_name
pp_dualpipev --test_suite models`

---------

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>

* Fix apply_compile called multiple times in PP initialization (#2135)

Stacked PRs:
 * __->__#2135


--- --- ---

PP initialization calls apply_compile multiple times, once per pp stage.
But apply_compile does some global patching. So I add `already_patched`
to avoid patching the same method multiple times.

If we patch multiple times, the second time will wrap
`_run_experts_grouped_mm_dynamic` in a torch.compile(fullgraph=True)
leading to the error in the issue below.

FIXES https://github.com/pytorch/torchtitan/issues/2124
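The guard is the standard idempotent-monkey-patch pattern: record that a method has been wrapped and skip on repeat calls. A minimal sketch (class and method names are illustrative stand-ins):

```python
def expensive_wrap(fn):
    # Stand-in for torch.compile(fullgraph=True) in the real code.
    def wrapped(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapped

class Experts:
    def _run_experts_grouped_mm_dynamic(self, x):
        return x * 2

def apply_compile(cls):
    # Without the guard, calling this once per PP stage would wrap the
    # already-wrapped method again (compile-inside-compile).
    if getattr(cls, "_already_patched", False):
        return
    cls._run_experts_grouped_mm_dynamic = expensive_wrap(
        cls._run_experts_grouped_mm_dynamic
    )
    cls._already_patched = True

apply_compile(Experts)
first = Experts._run_experts_grouped_mm_dynamic
for _stage in range(3):  # PP init calls apply_compile once per stage
    apply_compile(Experts)

# The method was wrapped exactly once, not once per stage.
assert Experts._run_experts_grouped_mm_dynamic is first
assert Experts()._run_experts_grouped_mm_dynamic(3) == 6
```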

* Enable static type checking with Pyrefly (#2136)

Enables static type checking of torchtitan with
[pyrefly](https://github.com/facebook/pyrefly). Type checking the code
helps catch bugs earlier in the development cycle.

* Adds pyrefly to CI, as part of the linting workflow.
* Addresses ~100 type errors that can be fixed via local code changes
and updates to type annotations, and silences the rest with `# pyrefly:
ignore` suppression comments. Note that
https://github.com/pytorch/torchtitan/commit/325efd946f1cbea85e503f9e684b8c879891fc1a
contains all of the non-comment changes.

* [Autoparallel] Add local_map variant of DSv3 and 2D mesh AP (#2129)

Stacked PRs:
 * __->__#2129


--- --- ---

[Autoparallel] Add local_map variant of DSv3 and 2D mesh AP

Currently, the AP experiment monkey patches Titan's main DSv3
implementation. But this is prone to breakage from both model definition
changes in titan and from HOP/partitioner related changes in core. When
these breaks happen, people are usually blocked until I find the root
cause.

I'm going on PTO for the rest of the year, so I'm adding an integration
to AP's DSv3 model in an attempt to make the development more stable for
the upcoming PP integration.

Test: https://gist.github.com/xmfan/db15fda1e1bc1df7cd523005fe0baf33

* Implement ciflow/rocm on Torchtitan (#2114)

In this PR, I implemented ciflow/rocm on Torchtitan. The changes are
part of integration_test_8gpu_features.yaml. The workflow still supports
running on pull_request (without any PR label) for CUDA. However, on
push to main, on the cron schedule, and when the ciflow/8gpu label is
added to a PR, the workflow runs for both CUDA & ROCm.

---------

Co-authored-by: Huy Do <huydhn@gmail.com>

* [MoE] Add node limited routing support (#2111)

As titled, added node-limited routing support via two-layer routing.
First, group experts into `num_groups` groups; experts in the same
group should reside on the same node to utilize fast intra-node
communication. Second, pick the top `top_k_group` groups by the sum of
the top-2 expert scores in each group. Third, pick the `top_k` experts
within the selected groups.
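The two-layer selection can be sketched in plain Python with toy scores (an illustration of the routing logic, not the actual torch implementation):

```python
def node_limited_routing(scores, num_groups, top_k_group, top_k):
    """Two-layer routing: pick groups first, then experts within them."""
    n = len(scores)
    group_size = n // num_groups
    groups = [scores[g * group_size:(g + 1) * group_size]
              for g in range(num_groups)]

    # Score each group by the sum of its top-2 expert scores.
    group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
    kept = sorted(range(num_groups),
                  key=lambda g: group_scores[g], reverse=True)[:top_k_group]

    # Pick top_k experts, considering only experts in the kept groups.
    candidates = [i for i in range(n) if i // group_size in set(kept)]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:top_k]

# 8 experts, 4 groups of 2; keep 2 groups, route to 3 experts.
scores = [0.9, 0.1, 0.2, 0.3, 0.8, 0.7, 0.4, 0.5]
print(node_limited_routing(scores, num_groups=4, top_k_group=2, top_k=3))
# [0, 4, 5]: groups 0 and 2 win, then the 3 best experts inside them
```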

Reference:
https://github.com/huggingface/transformers/blob/4c9fde2a2a3aece0bcf1be93f696e88297da9397/src/transformers/models/deepseek_v3/modeling_deepseek_v3.py#L212

Test on one node using DeepSeek V3 debug model with MoE arguments
`num_experts=8,
            num_shared_experts=2,
            num_groups=4,
            top_k_group=2,
            top_k=3`.
<img width="1196" height="465" alt="Pasted Graphic"
src="https://github.com/user-attachments/assets/63fd8414-1761-4efe-acff-154b1f46a16d"
/>

* Upgrade GitHub Actions to latest versions (#2152)

## Summary

Upgrade GitHub Actions to their latest versions for improved features,
bug fixes, and security updates.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `pypa/gh-action-pypi-publish` |
[`release/v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/release/v1)
| [`v1`](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
|
[Release](https://github.com/pypa/gh-action-pypi-publish/releases/tag/v1)
| release.yml |

## Why upgrade?

Keeping GitHub Actions up to date ensures:
- **Security**: Latest security patches and fixes
- **Features**: Access to new functionality and improvements
- **Compatibility**: Better support for current GitHub features
- **Performance**: Optimizations and efficiency improvements

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Upgrade GitHub Actions for Node 24 compatibility (#2151)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3),
[`v4`](https://github.com/actions/checkout/releases/tag/v4) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
docker-builds.yml, release.yml |
| `actions/download-artifact` |
[`v4`](https://github.com/actions/download-artifact/releases/tag/v4) |
[`v7`](https://github.com/actions/download-artifact/releases/tag/v7) |
[Release](https://github.com/actions/download-artifact/releases/tag/v7)
| release.yml |
| `actions/setup-python` |
[`v5`](https://github.com/actions/setup-python/releases/tag/v5) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
release.yml |
| `actions/upload-artifact` |
[`v4`](https://github.com/actions/upload-artifact/releases/tag/v4) |
[`v6`](https://github.com/actions/upload-artifact/releases/tag/v6) |
[Release](https://github.com/actions/upload-artifact/releases/tag/v6) |
release.yml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

* Improve the loss_compare.sh logic (#2143)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2145
* #2144
* __->__ #2143

1. Accept one "." (meaning the current commit) case to simplify the
command line.
2. Ignore the untracked files.

* [GPT-OSS] Add HF state dict adapter to support loading from HF checkpoints (#2021)

As titled, this PR adds HF state dict adapter to support loading from
GPT-OSS HF checkpoint. GPT-OSS checkpoint is quantized in MXFP4 format.
The de-quantization steps are offloaded to the
`QuantizedHuggingFaceStorageReader` in `dcp`, so this feature depends on
this PR to update `QuantizedHuggingFaceStorageReader`
(https://github.com/pytorch/pytorch/pull/167672).

1. Test 1. We use `dcp.load(hf_state_dict,
storage_reader=QuantizedHuggingFaceStorageReader(path=input_dir))` to
load from GPT-OSS HF checkpoint, and map the `hf_state_dict` back to
TorchTitan state dict. We build one test input and compare two outputs:
(1) using the `transformers` library to load the GPT-OSS HF checkpoint
and run inference on the test input; (2) using the converted TorchTitan
model to run inference on the same input. We compare the outputs via the
KL divergence of the two output probability distributions. The result
shows the two models are very similar. <img width="1191" height="191" alt="Pasted
Graphic"
src="https://github.com/user-attachments/assets/bb6a75e9-3dd7-43fa-847e-3f5f4fb5fd93"
/>

2. Test 2. We load the model directly from quantized GPT-OSS HF
checkpoint, and do a test training.
<img width="1198" height="408" alt="Pasted Graphic 1"
src="https://github.com/user-attachments/assets/49ab42ff-0115-4e79-b069-c556e0dd23f6"
/>

* Add local built pytorch path for pyrefly (#2155)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* #2156
* __->__ #2155

This assumes that the local built version has the same parent folder as
torchtitan.

Also fixes some pyrefly errors for moe.py

* Run vLLM inference using torchtitan model definition (single GPU) (#2119)

As titled, put it in deterministic RL folder

* [RELAND] Let CUDA and ROCm read different loss result (#2157)

Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0)
(oldest at bottom):
* __->__ #2157

CUDA and ROCm have different loss results. So we need to read from
different loss result files.
The loss results of FSDP and HSDP start to diverge after the 5th step
when running with ROCm, so we also need to adjust for this. This is more
of an unknown issue; AMD folks may want to find the root cause or
confirm that it is expected behavior.

**This PR is a reland PR of
https://github.com/pytorch/torchtitan/pull/2156** due to some landing
issue of the previous PR.

* Use new DeviceMesh unflatten to rewrite parallel_dims (#1660)

**Summary**
This PR utilizes the latest APIs provided by DeviceMesh to simplify the
creation of all the different meshes.

The design philosophy is as follows:

1. Create one world mesh with shape [world_size,].
2. Create all 1-D submeshes by 1) unflattening from the world mesh, or
2) slicing and flattening from other derived meshes.
3. ParallelDims now provides two APIs, get_mesh() and
get_optional_mesh(), which accept a str or list[str]. When the argument
is a str, the API directly returns the corresponding 1-D submesh. When
it is a list[str], the dim names are concatenated to form an n-D device
mesh. The main difference between the two APIs is that the former raises
a ValueError if the resulting mesh is None, while the latter just
returns None.
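The difference between the two accessors is just error-handling policy; schematically (a sketch of the described contract, not the real ParallelDims):

```python
class ParallelDims:
    def __init__(self, meshes):
        # meshes: dim-name tuple -> mesh object (absent if unused)
        self._meshes = meshes

    def get_optional_mesh(self, dims):
        key = (dims,) if isinstance(dims, str) else tuple(dims)
        return self._meshes.get(key)  # may be None

    def get_mesh(self, dims):
        mesh = self.get_optional_mesh(dims)
        if mesh is None:
            raise ValueError(f"no mesh for dims {dims!r}")
        return mesh

pd = ParallelDims({("dp",): "dp_mesh", ("dp", "tp"): "dp_tp_mesh"})
assert pd.get_mesh("dp") == "dp_mesh"           # str -> 1-D submesh
assert pd.get_mesh(["dp", "tp"]) == "dp_tp_mesh"  # list -> n-D mesh
assert pd.get_optional_mesh("cp") is None       # optional variant: no raise
```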

* Integrate DeepEP to torchtitan (#2107)

## Summary
This initial version integrates DeepEP into TorchTitan, focusing on
correctness and compatibility rather than maximal performance tuning.

- Functional DeepEP-backed MoE + Expert Parallelism
- User-controlled configuration
- Compatible with torch.compile and SAC
-  Intended as a first unblocker for benchmarking and iteration

## Perf: DeepSeek-V3 671B on 64 nodes × H100 (512 GPUs total)

<details> <summary><strong>Training config (click to
expand)</strong></summary>

```
config_path="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml",
command_args=[
    "--training.dataset_path=/lustre/fsw/portfolios/sw/users/elfieg/hf_datasets/c4",
    "--training.seq_len=4096",
    "--training.steps=120",
    "--metrics.log_freq=10",
    "--profiling.no-enable-profiling",
    "--comm.init_timeout_seconds=2000",
    "--comm.train_timeout_seconds=300",
    "--metrics.disable_color_printing",

    # Parallelism
    "--parallelism.data_parallel_replicate_degree=1",
    "--parallelism.data_parallel_shard_degree=64",
    "--parallelism.fsdp_reshard_after_forward=default",
    "--parallelism.tensor_parallel_degree=1",
    "--parallelism.expert_parallel_degree=32",
    "--parallelism.expert_tensor_parallel_degree=1",
    "--parallelism.pipeline_parallel_degree=8",
    "--parallelism.pipeline_parallel_schedule=Interleaved1F1B",

    # Training
    "--training.local_batch_size=16",
    "--activation_checkpoint.mode=full",

    # Compilation
    "--compile.enable",
    "--compile.components=model",
    "--compile.components=loss",

    # MoE / DeepEP
    "--debug.moe_force_load_balance",
    "--parallelism.expert_parallel_comm_backend=deepep",
],
```
</details>

After:
```
memory: 56.75GiB(71.74%)  tps: 579  tflops: 162.82  mfu: 16.46%
```
Before:
```
memory: 60.18GiB(76.07%)  tps: 346  tflops: 97.24  mfu: 9.83%
```
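As a quick sanity check, the before/after figures above imply a consistent speedup in both tokens/sec and MFU (numbers copied from the logs; the arithmetic below is just a verification, not new data):

```python
# Throughput/MFU figures from the before/after logs above.
tps_before, tps_after = 346, 579
mfu_before, mfu_after = 9.83, 16.46

tps_speedup = tps_after / tps_before  # ~1.67x tokens/sec with DeepEP
mfu_gain = mfu_after / mfu_before     # ~1.67x MFU, matching the tps gain
print(f"{tps_speedup:.2f}x tps, {mfu_gain:.2f}x mfu")
```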

## Loss Curve:
<img width="877" height="380" alt="Screenshot 2025-12-16 at 11 30 02 PM"
src="https://github.com/user-attachments/assets/b2f15297-2f05-4f4b-b4d5-b2747a30b2fa"
/>



Shout out to my colleagues @gekurian @syed-ahmed @aazzolini for their
internal support!

* Fix pypa/gh-action-pypi-publish version to use SHA pinning (#2161)

## Summary

Fix incorrect version reference for `pypa/gh-action-pypi-publish`.

## Problem

A previous PR incorrectly changed the action reference from `release/v1`
(valid branch) to `v1` (non-existent tag). The `v1` tag doesn't exist in
the pypa/gh-action-pypi-publish repository.

## Solution

Updated to use SHA pinning for release/v1.13:
```yaml
uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e  # release/v1.13
```

This follows [GitHub's security best
practices](https://docs.github.com/en/actions/reference/security/secure-use#using-third-party-actions)
for third-party actions by pinning to an immutable SHA.

## Files Changed

- `.github/workflows/release.yml`

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Upgrade GitHub Actions for Node 24 compatibility (#2164)

## Summary

Upgrade GitHub Actions to their latest versions to ensure compatibility
with Node 24, as Node 20 will reach end-of-life in April 2026.

## Changes

| Action | Old Version(s) | New Version | Release | Files |
|--------|---------------|-------------|---------|-------|
| `actions/checkout` |
[`v3`](https://github.com/actions/checkout/releases/tag/v3) |
[`v6`](https://github.com/actions/checkout/releases/tag/v6) |
[Release](https://github.com/actions/checkout/releases/tag/v6) |
lint.yaml |
| `actions/setup-python` |
[`v4`](https://github.com/actions/setup-python/releases/tag/v4) |
[`v6`](https://github.com/actions/setup-python/releases/tag/v6) |
[Release](https://github.com/actions/setup-python/releases/tag/v6) |
lint.yaml |

## Context

Per [GitHub's
announcement](https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/),
Node 20 is being deprecated and runners will begin using Node 24 by
default starting March 4th, 2026.

### Why this matters

- **Node 20 EOL**: April 2026
- **Node 24 default**: March 4th, 2026
- **Action**: Update to latest action versions that support Node 24

### Security Note

Actions that were previously pinned to commit SHAs remain pinned to SHAs
(updated to the latest release SHA) to maintain the security benefits of
immutable references.

### Testing

These changes only affect CI/CD workflow configurations and should not
impact application functionality. The workflows should be tested by
running them on a branch before merging.

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Expose common dataloader args (#2097)

This diff introduces common dataloader args which are supported by
StatefulDataLoader (and the torch.utils.data DataLoader). Users should
be able to use them in their config files.

I considered introducing a catch-all kwargs to make it easier to
specify args, but that can easily complicate things (validation checks,
duplication, clashes with named args already defined in function
signatures, etc.).
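A sketch of how such a config section might forward to the dataloader constructor (the field names below are illustrative common DataLoader knobs, not torchtitan's actual config schema):

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DataLoaderConfig:
    # Common knobs accepted by torch.utils.data.DataLoader / StatefulDataLoader.
    num_workers: int = 0
    pin_memory: bool = False
    persistent_workers: bool = False
    prefetch_factor: Optional[int] = None  # only valid when num_workers > 0

    def to_kwargs(self) -> dict:
        kwargs = asdict(self)
        # DataLoader rejects prefetch_factor when num_workers == 0, so drop it.
        if self.num_workers == 0:
            kwargs.pop("prefetch_factor")
        return kwargs
```

Explicit named fields like these keep validation simple, which is the trade-off the paragraph above makes against a catch-all kwargs.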

* Replace `logger.warn()` with `logger.warning()`, allow `log_validation` to log `extra_metrics`, and expose common wandb args (#2166)

1. Replace `logger.warn()` with `logger.warning()`.
2. Allow `log_validation` to log `extra_metrics`.
3. Expose common wandb init args; this is useful when resuming training.
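For reference, `Logger.warn` is a deprecated alias of `Logger.warning` in Python's standard `logging` module, so item 1 changes no behavior. Item 2 can be sketched as below (the `log_validation` signature here is a hypothetical shape, not the actual torchtitan API):

```python
import logging
from typing import Optional

logger = logging.getLogger("torchtitan")

# Item 1: preferred spelling; `logger.warn(...)` is a deprecated stdlib alias.
logger.warning("dataset has fewer samples than requested")


# Item 2 (hypothetical shape): let callers attach extra metrics to the
# validation log line alongside the built-in ones.
def log_validation(step: int, loss: float,
                   extra_metrics: Optional[dict] = None) -> dict:
    metrics = {"step": step, "val_loss": loss}
    if extra_metrics:
        metrics.update(extra_metrics)
    logger.info("validation %s", metrics)
    return metrics
```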

* Add Dependabot for GitHub Actions updates (#2163)

## Summary

Add Dependabot configuration to automatically keep GitHub Actions up to
date.

Here's some more information about Dependabot:
https://docs.github.com/en/code-security/dependabot/working-with-dependabot/keeping-your-actions-up-to-date-with-dependabot

## Changes

- Added `.github/dependabot.yml` with weekly checks for GitHub Actions
updates
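A weekly GitHub Actions check in `.github/dependabot.yml` typically looks like this (a sketch of the common shape documented by GitHub, not necessarily the exact file added here; the group name matches the "github-actions group" referenced by later Dependabot PRs):

```yaml
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      github-actions:
        patterns:
          - "*"
```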

## Context

As discussed in #2161
([comment](https://github.com/pytorch/torchtitan/pull/2161#issuecomment-3667526716)),
adding Dependabot to automatically manage GitHub Actions updates going
forward.

## Why

Dependabot will automatically create PRs when new versions of GitHub
Actions are available, helping to:
- Keep CI/CD workflows secure with the latest patches
- Get new features and improvements
- Maintain compatibility with GitHub's infrastructure

Each action update will be proposed as a separate PR for individual
review and testing.

---------

Signed-off-by: Salman Muin Kayser Chishti <13schishti@gmail.com>

* Bump tj-actions/changed-files from d6e91a2266cdb9d62096cebf1e8546899c6aa18f to e0021407031f5be11a464abee9a0776171c79891 in the github-actions group (#2167)

Bumps the github-actions group with 1 update:
[tj-actions/changed-files](https://github.com/tj-actions/changed-files).

Updates `tj-actions/changed-files` from
d6e91a2266cdb9d62096cebf1e8546899c6aa18f to
e0021407031f5be11a464abee9a0776171c79891
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/tj-actions/changed-files/blob/main/HISTORY.md">tj-actions/changed-files's
changelog</a>.</em></p>
<blockquote>
<h1>Changelog</h1>
<h1><a
href="https://github.com/tj-actions/changed-files/compare/v46.0.5...v47.0.0">47.0.0</a>
- (2025-09-13)</h1>
<h2>🚀 Features</h2>
<ul>
<li>Add any_added to outputs (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2567">#2567</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c260d49a827b5eb266673bed7871c5d3ee9b5aef">c260d49</a>)
- (Jellyfrog)</li>
</ul>
<h2>➖ Remove</h2>
<ul>
<li>Commit and push step from build job (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2538">#2538</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/be393a90381e27c9fec2c8c2e02b00f005710145">be393a9</a>)
- (Tonye Jack)</li>
</ul>
<h2>🔄 Update</h2>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2592">#2592</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/3dbc1e181273d808ccff822a6e00cf18b6628ef0">3dbc1e1</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2591">#2591</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/b1ccff8c0892ad141d7d2de6f31e526a9dad931f">b1ccff8</a>)
- (github-actions[bot])</p>
<ul>
<li>Updated README.md (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2574">#2574</a>)</li>
</ul>
<p>Co-authored-by: github-actions[bot]
&lt;41898282+github-actions[bot]<a
href="https://github.com/users"><code>@​users</code></a>.noreply.github.com&gt;
(<a
href="https://github.com/tj-actions/changed-files/commit/050a3d3360d29711ee9d8210fc639d902d23ad07">050a3d3</a>)
- (github-actions[bot])</p>
<h2>📚 Documentation</h2>
<ul>
<li>Update link to glob patterns (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2590">#2590</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/a892f50f7a7187bc288633c09230b09ce7ad8fd0">a892f50</a>)
- (Tonye Jack)</li>
<li>Add Jellyfrog as a contributor for code, and doc (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2573">#2573</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/f000a9b97f254f9590ff26f651cccde827ad36da">f000a9b</a>)
- (allcontributors[bot])</li>
</ul>
<h2>🧪 Testing</h2>
<ul>
<li>Manual triggered workflows (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2637">#2637</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/c2ca2493190021783138cb8aac49bcee14b4bb89">c2ca249</a>)
- (Tonye Jack)</li>
</ul>
<h2>⚙️ Miscellaneous Tasks</h2>
<ul>
<li><strong>deps-dev:</strong> Bump jest from 30.0.5 to 30.1.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2655">#2655</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/9a6755550a331fdcc8ec45443738933f8fa22eea">9a67555</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.1.0 to 2.2.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2660">#2660</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/b67e30df88f43e244f4e83775e5ad8335114fb95">b67e30d</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.30.2 to
3.30.3 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2661">#2661</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/62aef422ffa195474d80d73387535cf4622b2824">62aef42</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.11 to
3.30.2 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2659">#2659</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e874f3cddd0f54ae776e6995ae6dae4cf40fd3d3">e874f3c</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump actions/setup-node from 4.4.0 to 5.0.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2656">#2656</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/8c14441336bb3d84fd6b7fa83b6d7201c740baf5">8c14441</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.3.0 to 24.3.1 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2657">#2657</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e995ac4be5be2bcb6e29556edc51fb63aca6b49b">e995ac4</a>)
- (dependabot[bot])</li>
<li><strong>deps-dev:</strong> Bump <code>@​types/node</code> from
24.2.1 to 24.3.0 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2649">#2649</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/3b04099b21072562f07469c10deb182b24236ca9">3b04099</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.9 to
3.29.11 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2651">#2651</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/e7b6c977e51984988e3cc1d6b18abe2a3ba8daaa">e7b6c97</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump tj-actions/git-cliff from 2.0.2 to 2.1.0
(<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2648">#2648</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/765d62bc041415a5b494ef13d02d566128b25973">765d62b</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-action from 3.29.8 to
3.29.9 (<a
href="https://redirect.github.com/tj-actions/changed-files/issues/2647">#2647</a>)
(<a
href="https://github.com/tj-actions/changed-files/commit/2036da178f85576f1940fedb74bb93a36cd89ab7">2036da1</a>)
- (dependabot[bot])</li>
<li><strong>deps:</strong> Bump github/codeql-ac…