feature: add kl divergence with ref model #65

Merged
LovelyBuggies merged 4 commits into main from new on Jun 30, 2026
@LovelyBuggies

Member

No description provided.

@LovelyBuggies

LovelyBuggies commented Jul 1, 2026

Member Author

Why Not Use Ref Model

In previous multi-agent training, I did not use a reference model, and the code did not even support one. Because some other algorithms need it, I have had to add it in. The runs below are only meant to demonstrate the limitations of the reference model: specifically, how the constraint of staying close to it has hurt our training. All tests were run on B200 GPUs.

For these experiments, I kept the default configuration parameters, except where training needed more VRAM or where a reference model was added. For all algorithms, the reference model makes the optimization deviate from the original objective, but it does not dramatically increase VRAM usage because its forward pass runs under no_grad.
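
To illustrate why a reference model shifts the objective, here is a minimal sketch of a KL penalty added to a task loss. It is not the code from this PR: the function names, the `beta` value, and the sample log-probabilities are all hypothetical, and the per-token estimator `exp(d) - d - 1` (with `d = log p_ref - log p_policy`) is just one common choice that I am assuming for illustration. In a PyTorch training loop, the reference log-probabilities would typically be computed inside `torch.no_grad()`, which is why no activations need to be kept for the backward pass.

```python
import math


def k3_kl_estimates(policy_logprobs, ref_logprobs):
    """Per-token estimates of KL(policy || ref) from log-probs of sampled tokens.

    Uses exp(d) - d - 1 with d = log p_ref - log p_policy, which is never negative.
    """
    estimates = []
    for lp_policy, lp_ref in zip(policy_logprobs, ref_logprobs):
        d = lp_ref - lp_policy
        estimates.append(math.exp(d) - d - 1.0)
    return estimates


def kl_penalized_loss(task_loss, policy_logprobs, ref_logprobs, beta=0.04):
    """Return (task_loss + beta * mean KL estimate, mean KL estimate).

    beta is a hypothetical penalty weight, not a value taken from this PR.
    """
    kl = k3_kl_estimates(policy_logprobs, ref_logprobs)
    mean_kl = sum(kl) / len(kl)
    return task_loss + beta * mean_kl, mean_kl


# Hypothetical log-probs of three sampled tokens under each model.
policy_lp = [-0.2, -1.5, -0.7]
ref_lp = [-0.4, -1.0, -0.7]

loss, mean_kl = kl_penalized_loss(1.0, policy_lp, ref_lp)
print(f"mean KL estimate: {mean_kl:.4f}, penalized loss: {loss:.4f}")
```

Whenever the policy and the reference disagree on a sampled token, the penalty is positive, so the optimizer is pulled back toward the reference even when moving away would improve the task loss.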

TLDR

image
  • MAGRPO reached over 8,000 steps within 4 hours, though with some oscillations later on. VRAM usage is about 40 GB with our reward model.
  • MAAC ran for 4,000 steps over 6.5 hours with 70 GB of VRAM, which seems slower than before.
  • IAC ran fewer than 4,000 steps in 6.5 hours. VRAM usage is approximately 48 GB if the actor and critic share parameters; otherwise, it reaches about 77 GB.
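
To put the three runs on a common scale, the step counts and wall-clock times above can be converted to rough throughput. This is only a back-of-the-envelope sketch using the rounded figures from the bullets: the MAGRPO count is a lower bound ("over 8,000") and the IAC count is an upper bound ("fewer than 4,000"), and the helper name is made up.

```python
def steps_per_hour(steps, hours):
    """Average training steps completed per wall-clock hour."""
    return steps / hours


# (steps, hours) taken from the bullets above; see the caveats on bounds.
runs = {
    "MAGRPO": (8000, 4.0),  # lower bound on steps
    "MAAC": (4000, 6.5),
    "IAC": (4000, 6.5),     # upper bound on steps
}

for name, (steps, hours) in runs.items():
    print(f"{name}: about {steps_per_hour(steps, hours):.0f} steps/hour")
```

On these numbers, MAGRPO runs at roughly three times the step rate of MAAC.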

CHE

Unfortunately, I noticed some issues with the expected return metrics in Weights & Biases after merging this PR.

I am leaving the fix for the next PR (#66), but you should still be able to see the general trends from the reward.

image
  • MAGRPO completed half of the steps in just two hours, though the improvement was not particularly significant. Its VRAM usage is over 80 GB.
  • The MAAC and IAC runs mostly took about 5 hours to reach 6,000 to 8,000 steps. VRAM usage: MAAC used over 110 GB; IAC with a separate actor and critic used around 125 GB; IAC with a shared architecture used roughly half as much as the separate version.

Consistent with the results described in the paper, MAGRPO's improvement becomes very slow, and IAC has trouble converging.

HouseBuild

Given the bugs mentioned above, I have decided to stop waiting for this batch of HouseBuild runs to finish completely. However, we can already see a clear trend:

  • MAGRPO performed surprisingly well. It completed nearly 3,000 steps in about six hours and still seems to be improving. Its VRAM usage is approximately 120 GB.
  • MAAC only completed about 1.5k steps in the same amount of time, but it converged very quickly (at around 500 steps), demonstrating high sample efficiency. Its VRAM usage is approximately 146 GB.
  • IAC is clearly ineffective. I used three GPUs for it, but two would have sufficed. Since the third GPU was reserved for the reward model (which was barely utilized), the memory footprint remained largely unchanged. In a separate actor-critic configuration, the VRAM usage across the two active GPUs was roughly 136 GB and 80 GB, even with the reward model included.
image image
