Add ROCm support for H100 tests#2202

Merged
tianyu-l merged 4 commits into pytorch:main from akashveramd:av_enable_h100_tests_for_rocm
Jan 28, 2026

@akashveramd
Collaborator

This PR adds ROCm support for H100 tests.

@akashveramd akashveramd self-assigned this Jan 5, 2026
@meta-cla meta-cla bot added the CLA Signed label Jan 5, 2026
@pytorch-bot

pytorch-bot bot commented Jan 5, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/8gpu

Please add the new label to .github/pytorch-probot.yml

@tianyu-l
Contributor

tianyu-l commented Jan 6, 2026

Existing tests are failing; can we fix them first?
https://github.com/pytorch/torchtitan/actions/runs/20724721636

@akashveramd
Collaborator Author

akashveramd commented Jan 6, 2026

> existing tests are failing, can we solve them first? https://github.com/pytorch/torchtitan/actions/runs/20724721636

It seems the ROCm runner ran out of disk space. I've shared this issue with our runner team.

@akashveramd akashveramd marked this pull request as draft January 9, 2026 17:34
@akashveramd
Collaborator Author

The H100 tests are working on ROCm, and CI for ROCm is enabled. However, we also wanted to rename the H100 test to something generic that applies to ROCm as well. The tests were named H100 because some features are only supported on H100, including async TP with symmetric memory and Float8 quantization.

As of now, ROCm supports FP8 quantization but does not support async TP with symmetric memory. We are working on adding that support for ROCm, so I have moved the PR to draft. Once the support is added, we'll reopen the PR for review and think about changing the test name.
cc: @tianyu-l @huydhn
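
To illustrate the feature gap described above, here is a minimal sketch of how tests could be filtered by the features a backend supports. It is not taken from this PR or from torchtitan: the backend keys, feature names, test names, and the `runnable_tests` helper are all hypothetical, and the capability table only restates what the comment above says about H100 and ROCm at the time of writing.

```python
# Hypothetical capability table. The entries restate the comment above
# and do not correspond to any real torchtitan API.
SUPPORTED_FEATURES = {
    "cuda_h100": {"float8", "async_tp_symm_mem"},
    "rocm": {"float8"},  # async TP with symmetric memory not yet supported
}

# Hypothetical test names mapped to the features each one needs.
TEST_REQUIREMENTS = {
    "float8_test": {"float8"},
    "async_tp_test": {"async_tp_symm_mem"},
    "float8_async_tp_test": {"float8", "async_tp_symm_mem"},
}


def runnable_tests(backend: str) -> list[str]:
    """Return the tests whose required features the backend supports."""
    supported = SUPPORTED_FEATURES[backend]
    return sorted(
        name for name, needs in TEST_REQUIREMENTS.items() if needs <= supported
    )
```

Under these assumptions, a test that needs only Float8 would run on both backends, while anything that depends on async TP with symmetric memory would be skipped on ROCm until that support lands.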

@akashveramd akashveramd force-pushed the av_enable_h100_tests_for_rocm branch from f5e56c0 to a58cb19 Compare January 27, 2026 22:27
@akashveramd akashveramd marked this pull request as ready for review January 27, 2026 22:27
Contributor

@tianyu-l tianyu-l left a comment

Looks OK to me.

@tianyu-l tianyu-l merged commit bc4b809 into pytorch:main Jan 28, 2026
31 checks passed

Labels

ciflow/rocm-mi300, ciflow/8gpu, CLA Signed, module: rocm
