fix: add pull to solve the problem that fsdp2 training in shared will timeout #68

Open
gygdh-001 wants to merge 1 commit into OpenMOSS:main from gygdh-001:personal/main

gygdh-001 commented Jun 29, 2026

During multi-node, multi-GPU FSDP training, checkpoint saving and resuming frequently hit timeout errors.

The root cause is that accelerate_trainer.py uses FullStateDictConfig / FullOptimStateDictConfig for FSDP,
which gathers the full state dict from all ranks onto rank 0 during every save_state / load_state call.
Transferring the full parameter set across nodes over the network causes timeouts,
and rank 0 is also at risk of OOM.
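
To give a rough sense of why gathering everything onto rank 0 is costly, here is a back-of-envelope sketch. The model size, the 4-byte precision, and the two optimizer slots per parameter (as in Adam) are hypothetical values chosen for illustration and are not taken from this PR; only the 16-rank world size matches the 2 × 8 setup used for validation.

```python
def state_bytes(num_params: int, bytes_per_value: int = 4, optim_slots: int = 2) -> int:
    """Bytes for the parameters plus `optim_slots` optimizer values per parameter."""
    return num_params * bytes_per_value * (1 + optim_slots)


def per_rank_bytes(total_bytes: int, world_size: int) -> int:
    """Bytes each rank holds if the state is sharded evenly (ceiling division)."""
    return -(-total_bytes // world_size)


NUM_PARAMS = 7_000_000_000  # hypothetical model size, not from the PR
WORLD_SIZE = 16             # 2 nodes x 8 devices

full = state_bytes(NUM_PARAMS)
shard = per_rank_bytes(full, WORLD_SIZE)

print(f"full state gathered on rank 0: {full / 1e9:.1f} GB")   # 84.0 GB
print(f"one shard kept on each rank:   {shard / 1e9:.2f} GB")  # 5.25 GB
```

Under these assumptions, a full-state save moves most of the 84 GB over the interconnect to a single process, while a sharded save lets each rank write only about 5 GB locally.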

This PR applies the following changes:

  1. Replaces FullStateDictConfig / FullOptimStateDictConfig with
    ShardedStateDictConfig / ShardedOptimStateDictConfig for FSDP,
    so each rank holds only its own shard and no cross-node full-state gather occurs.

  2. Under FSDP, save_checkpoint now saves per-rank files,
    optimizer_{rank}.bin and scheduler_{rank}.bin,
    bypassing accelerator.save_state(), which would trigger a full-state gather.

  3. Under FSDP, _resume_checkpoint now loads the per-rank optimizer and scheduler
    state dicts independently, bypassing accelerator.load_state() and its full-state gather path.
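
The three changes above can be sketched roughly as follows. This is not the PR's diff: the helper names (`build_sharded_fsdp_plugin`, `rank_file`, `save_rank_state`, `load_rank_state`) are hypothetical, and it assumes the trainer configures FSDP through Accelerate's `FullyShardedDataParallelPlugin` and that the plain `state_dict()` of the optimizer and scheduler is what gets written per rank. How the PR handles model weights is not described, so that part is left out.

```python
import os
from typing import Any


def build_sharded_fsdp_plugin() -> Any:
    """Build an FSDP plugin whose state dicts stay sharded (change 1).

    Imports are local so the file helpers below also work where torch
    and accelerate are not installed.
    """
    from accelerate import FullyShardedDataParallelPlugin
    from torch.distributed.fsdp import (
        ShardedOptimStateDictConfig,
        ShardedStateDictConfig,
    )

    return FullyShardedDataParallelPlugin(
        state_dict_config=ShardedStateDictConfig(),
        optim_state_dict_config=ShardedOptimStateDictConfig(),
    )


def rank_file(ckpt_dir: str, kind: str, rank: int) -> str:
    """Return the per-rank file path, e.g. <ckpt_dir>/optimizer_3.bin."""
    return os.path.join(ckpt_dir, f"{kind}_{rank}.bin")


def save_rank_state(ckpt_dir: str, rank: int, optimizer: Any, scheduler: Any) -> None:
    """Write this rank's optimizer and scheduler state to its own files (change 2)."""
    import torch

    os.makedirs(ckpt_dir, exist_ok=True)
    torch.save(optimizer.state_dict(), rank_file(ckpt_dir, "optimizer", rank))
    torch.save(scheduler.state_dict(), rank_file(ckpt_dir, "scheduler", rank))


def load_rank_state(ckpt_dir: str, rank: int, optimizer: Any, scheduler: Any) -> None:
    """Restore this rank's optimizer and scheduler state from its own files (change 3)."""
    import torch

    optimizer.load_state_dict(
        torch.load(rank_file(ckpt_dir, "optimizer", rank), map_location="cpu")
    )
    scheduler.load_state_dict(
        torch.load(rank_file(ckpt_dir, "scheduler", rank), map_location="cpu")
    )
```

In a trainer, `rank` would typically come from `accelerator.process_index`. Because every rank reads back the file it wrote, a resume under this scheme presumably needs the same world size and a checkpoint directory that each rank can reach.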

Validation:

Verified on a multi-node (2 nodes × 8 Ascend 910B) FSDP training setup:

  • Checkpoint save time dropped from 400s+ to under 300s, with no more timeouts.
  • Resume succeeded in 100% of attempts, and each rank correctly restored its optimizer/scheduler state.
  • Resume time dropped from 160s to 40s.
  • Loss continuity verified: the loss curve after resume continues seamlessly from the original training run.

gygdh-001 changed the title from "add pull to solve the problem that fsdp2 training in shared will timeout" to "fix: add pull to solve the problem that fsdp2 training in shared will timeout" Jul 1, 2026
gygdh-001 marked this pull request as draft July 1, 2026 01:11
gygdh-001 marked this pull request as ready for review July 1, 2026 01:12
gygdh-001 (Author) commented

@yhzx233
