fix: add pull to solve the problem that fsdp2 training in shared will timeout by gygdh-001 · Pull Request #68 · OpenMOSS/MOVA

gygdh-001 · 2026-06-29T02:31:28Z

During multi-node multi-GPU FSDP training, checkpoint saving and resume frequently hit timeout errors.

The root cause is that accelerate_trainer.py uses FullStateDictConfig / FullOptimStateDictConfig for FSDP,
which gathers the full state dict from all ranks onto rank 0 during every save_state / load_state call.
Transferring the full parameter set across nodes over the network causes timeouts,
and rank 0 is also at risk of OOM.
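
To make the pressure on rank 0 concrete, here is a rough back-of-envelope estimate comparing what a full-state gather puts on rank 0 with what each rank holds under sharding. The parameter count, the byte widths, and the Adam-style optimizer with two fp32 moment buffers per parameter are illustrative assumptions, not values from this PR. Only the world size of 16 matches the 2 nodes × 8 devices setup reported in the validation section.

```python
def state_bytes(num_params: int, param_bytes: int = 2, optim_bytes_per_param: int = 8) -> int:
    """Bytes for model weights plus optimizer state.

    Assumption (hypothetical): bf16 weights (2 bytes per parameter) and an
    Adam-style optimizer with two fp32 moment buffers (2 * 4 = 8 bytes per parameter).
    """
    return num_params * (param_bytes + optim_bytes_per_param)


def per_rank_bytes(num_params: int, world_size: int, full_gather: bool) -> int:
    """Checkpoint state that one rank has to hold.

    full_gather=True models rank 0 under a full state dict: everything lands on it.
    full_gather=False models an even split across ranks (sharded state dict).
    """
    total = state_bytes(num_params)
    if full_gather:
        return total
    # Ceiling division: the largest shard when the total does not split evenly.
    return -(-total // world_size)


if __name__ == "__main__":
    NUM_PARAMS = 10_000_000_000  # hypothetical model size, not from the PR
    WORLD_SIZE = 16              # 2 nodes x 8 devices
    GIB = 1024 ** 3

    full = per_rank_bytes(NUM_PARAMS, WORLD_SIZE, full_gather=True)
    shard = per_rank_bytes(NUM_PARAMS, WORLD_SIZE, full_gather=False)
    print(f"full gather on rank 0: {full / GIB:.1f} GiB")
    print(f"sharded, per rank:     {shard / GIB:.1f} GiB")
```

Under these assumptions the gathered state is world_size times larger than a single shard, and almost all of it has to cross the network to reach rank 0 on every save.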

This PR applies the following changes:

- Replaces FullStateDictConfig / FullOptimStateDictConfig with ShardedStateDictConfig / ShardedOptimStateDictConfig for FSDP, so each rank holds only its own shard and no cross-node full-state gather takes place.
- Under FSDP, save_checkpoint now saves per-rank files, optimizer_{rank}.bin and scheduler_{rank}.bin, bypassing accelerator.save_state(), which would trigger the full-state gather.
- Under FSDP, _resume_checkpoint now loads the per-rank optimizer and scheduler state dicts independently, bypassing accelerator.load_state() and its full-state gather path.
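
The per-rank save and load path can be sketched roughly as follows. This is a minimal illustration, not the code in accelerate_trainer.py: the helper names rank_file, save_rank_state and load_rank_state are hypothetical, and the standard-library pickle module stands in for whatever serializer the trainer actually uses, so the sketch runs without torch. Only the optimizer_{rank}.bin / scheduler_{rank}.bin naming comes from the description above.

```python
import os
import pickle
from typing import Protocol


class Stateful(Protocol):
    """Anything exposing state_dict() / load_state_dict(), as optimizers and schedulers do."""

    def state_dict(self) -> dict: ...

    def load_state_dict(self, state: dict) -> None: ...


def rank_file(ckpt_dir: str, name: str, rank: int) -> str:
    """Build the per-rank path, e.g. <ckpt_dir>/optimizer_3.bin (hypothetical helper)."""
    return os.path.join(ckpt_dir, f"{name}_{rank}.bin")


def save_rank_state(ckpt_dir: str, rank: int, optimizer: Stateful, scheduler: Stateful) -> None:
    """Each rank writes only its own optimizer and scheduler state; nothing is gathered."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for name, obj in (("optimizer", optimizer), ("scheduler", scheduler)):
        # Assumption: pickle is a stand-in for the trainer's real serializer.
        with open(rank_file(ckpt_dir, name, rank), "wb") as f:
            pickle.dump(obj.state_dict(), f)


def load_rank_state(ckpt_dir: str, rank: int, optimizer: Stateful, scheduler: Stateful) -> None:
    """Each rank restores from its own files and fails loudly if one is missing."""
    for name, obj in (("optimizer", optimizer), ("scheduler", scheduler)):
        path = rank_file(ckpt_dir, name, rank)
        if not os.path.exists(path):
            raise FileNotFoundError(f"no {name} checkpoint for rank {rank}: {path}")
        with open(path, "rb") as f:
            obj.load_state_dict(pickle.load(f))
```

In a real run, every rank would call these helpers with its own rank index and the same checkpoint directory. One likely consequence of this layout is that a resume needs the same number of ranks as the run that wrote the checkpoint, since each file holds a single rank's shard.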

Validation:

Verified on a multi-node (2 nodes × 8 Ascend 910B) FSDP training setup:

- Checkpoint save time dropped from 400s+ to under 300s, with no more timeouts.
- Resume success rate is 100%; each rank correctly restores its optimizer/scheduler state.
- Resume time dropped from 160s to 40s.
- Loss continuity verified: the loss curve after resume continues seamlessly from the original training run.

gygdh-001 · 2026-07-01T01:16:12Z

@yhzx233

modify accelerate_trainer

66807ed

gygdh-001 changed the title ~~add pull to solve the problem that fsdp2 training in shared will timeout~~ fix: add pull to solve the problem that fsdp2 training in shared will timeout Jul 1, 2026

gygdh-001 marked this pull request as draft July 1, 2026 01:11

gygdh-001 marked this pull request as ready for review July 1, 2026 01:12

gygdh-001 wants to merge 1 commit into OpenMOSS:main from gygdh-001:personal/main
