feature: add kl divergence with ref model by LovelyBuggies · Pull Request #65 · OpenMLRL/CoMLRL

LovelyBuggies · 2026-06-30T20:02:48Z

No description provided.

LovelyBuggies · 2026-07-01T03:04:01Z

Why Not Use Ref Model

Previous multi-agent training runs did not use a reference model, and the feature was not even supported. I have had to add it back in to accommodate some other algorithms. The experiments below are only meant to demonstrate the limitations of the reference model: specifically, how the constraint of staying close to it has hurt our training. All tests were run on B200 GPUs.

For these experiments, I mostly kept the default configuration parameters, changing them only where training needed more VRAM or where a reference model was added. For all algorithms, the reference model causes the optimization to deviate from the original objective, but it does not dramatically increase VRAM usage because it runs under no_grad.
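To make the "deviation from the original objective" concrete, here is a minimal sketch of a KL-regularized loss in plain Python. It is not the code from this PR: the function names, the `beta` default, and the choice of the non-negative `exp(d) - d - 1` estimator are all assumptions for illustration. The reference log-probabilities are treated as constants, which mirrors running the reference model under no_grad.

```python
import math

def kl_per_token(logp_policy, logp_ref):
    """Per-token KL estimate for tokens sampled from the policy.

    Uses exp(d) - d - 1 with d = logp_ref - logp_policy, which is always >= 0.
    Assumption: this estimator is illustrative, not necessarily the PR's choice.
    """
    estimates = []
    for lp, lr in zip(logp_policy, logp_ref):
        d = lr - lp
        estimates.append(math.exp(d) - d - 1.0)
    return estimates

def kl_regularized_loss(task_loss, logp_policy, logp_ref, beta=0.04):
    """Add a KL penalty to the task loss.

    beta is a hypothetical coefficient; beta = 0 recovers the original objective.
    Returns (total_loss, mean_kl).
    """
    kl = kl_per_token(logp_policy, logp_ref)
    mean_kl = sum(kl) / len(kl)
    return task_loss + beta * mean_kl, mean_kl
```

When the policy matches the reference, the penalty is zero and the loss is unchanged. Any drift from the reference adds a positive term, so the optimizer trades task reward against staying close to the reference.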

TLDR

The MAGRPO algorithm reached over 8,000 steps within 4 hours, though there were some oscillations later on. VRAM usage is about 40 GB with our reward model.
The MAAC algorithm ran for 4,000 steps over 6.5 hours with 70 GB of VRAM, which seems slower than before.
The IAC algorithm ran fewer than 4,000 steps in 6.5 hours. The VRAM usage is approximately 48 GB if the Actor and Critic share parameters; otherwise, it would reach about 77 GB.
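For a rough comparison of the TLDR runs above, the step counts and wall-clock times can be turned into steps per hour. This is only back-of-the-envelope arithmetic on the figures quoted in this comment; "over 8,000" and "fewer than 4,000" are treated as bounds, so the MAGRPO rate is a lower bound and the IAC rate an upper bound.

```python
# (steps, hours) as quoted above; MAGRPO is a lower bound, IAC an upper bound.
runs = {
    "MAGRPO": (8000, 4.0),
    "MAAC": (4000, 6.5),
    "IAC": (4000, 6.5),
}

def steps_per_hour(steps, hours):
    """Average training throughput over the whole run."""
    return steps / hours

rates = {name: steps_per_hour(steps, hours) for name, (steps, hours) in runs.items()}

for name, rate in rates.items():
    print(f"{name}: about {rate:.0f} steps/hour")
```

On these numbers MAGRPO runs at 2,000 steps per hour or more, roughly three times the MAAC rate of about 615 steps per hour.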

CHE

Unfortunately, I noticed some issues with the expected return metrics in Weights & Biases after merging this PR.

I am leaving the fix to the next PR (PR #66), but the general trends should still be visible from the reward.

MAGRPO completed half of the steps in just two hours, though the improvement wasn't particularly significant. Its VRAM usage is over 80 GB.
MAAC and IAC runs mostly took about 5 hours to reach 6,000 to 8,000 steps. In terms of VRAM usage: MAAC used over 110 GB of VRAM; IAC with a separate actor-critic used around 125 GB of VRAM; IAC with a shared architecture used roughly half the VRAM of the separate version.

Consistent with the results described in the paper, the ascent speed of MAGRPO becomes very slow, and IAC faces convergence issues.

HouseBuild

Given the bugs mentioned above, I have decided to stop waiting for this batch of HouseBuild runs to finish completely. However, we can already see a clear trend:

MAGRPO performed surprisingly well. It completed nearly 3,000 steps in about six hours and still seems to be improving. Its VRAM usage is approximately 120 GB.
MAAC completed only about 1,500 steps in the same amount of time, but it converged very quickly (at around 500 steps), demonstrating high sample efficiency. Its VRAM usage is approximately 146 GB.
IAC is clearly ineffective. I used three GPUs for it, but two would have sufficed. Since the third GPU was reserved for the reward model (which was barely utilized), the memory footprint remained largely unchanged. In a separate actor-critic configuration, the VRAM usage across the two active GPUs was roughly 136 GB and 80 GB, even with the reward model included.

LovelyBuggies added 4 commits June 30, 2026 15:57

add kl

bd94ec3

allow ref on separate devices

86d1bb9

allow ref on different device

d99bde9

ud

dda5661

LovelyBuggies merged commit d5760c6 into main Jun 30, 2026
4 checks passed

LovelyBuggies deleted the new branch June 30, 2026 22:45

This was referenced Jun 30, 2026

feature: add kl divergence with ref model OpenMLRL/LLM_Collab_Writing#5

Merged

feature: add kl divergence with ref model OpenMLRL/LLM_Collab_Code_Generation#31

Merged

feature: add kl divergence with ref model OpenMLRL/LLM_Collab_Minecraft#6

Merged
