feature: add KL divergence with ref model #65
Why Not Use Ref Model

In previous multi-agent training, I did not use a reference model, and the code did not even support one. To accommodate some other algorithms, I have had to add it back in. This PR is mainly meant to demonstrate the limitations of the reference model: specifically, how the constraint of staying close to the reference model has negatively impacted our training.

All tests were run on B200 chips. I generally kept the default configuration parameters, except where I needed more VRAM for training or where a reference model was added. For all algorithms, the reference model causes the optimization to deviate from the original objective, but it does not dramatically increase VRAM usage because of

TLDR
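To make the deviation from the original objective concrete, here is a minimal sketch of a KL-regularized objective. It is not the code in this PR: the function names, the `beta` coefficient, and the choice of the "k3" estimator (exp(d) - d - 1 with d = log-prob under the reference minus log-prob under the policy, a common choice in GRPO-style training) are all assumptions made for illustration.

```python
import math
from typing import Sequence


def k3_kl_estimate(policy_logprobs: Sequence[float],
                   ref_logprobs: Sequence[float]) -> float:
    """Mean per-token estimate of KL(policy || ref).

    Hypothetical helper: uses the non-negative "k3" estimator
    exp(d) - d - 1 with d = ref_logprob - policy_logprob, evaluated
    on tokens sampled from the policy.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    if len(policy_logprobs) == 0:
        raise ValueError("need at least one token")
    total = 0.0
    for lp, lr in zip(policy_logprobs, ref_logprobs):
        d = lr - lp
        total += math.exp(d) - d - 1.0
    return total / len(policy_logprobs)


def regularized_objective(task_objective: float, kl: float, beta: float) -> float:
    """Objective to maximize: task term minus a KL penalty.

    `beta` is an assumed name for the penalty coefficient. With
    beta == 0 this reduces to the original task objective.
    """
    return task_objective - beta * kl


if __name__ == "__main__":
    policy = [-0.5, -1.2, -0.3]  # made-up per-token log-probs
    ref = [-0.9, -1.0, -0.8]
    kl = k3_kl_estimate(policy, ref)
    print("kl estimate:", kl)
    print("beta=0.0 :", regularized_objective(1.0, kl, 0.0))
    print("beta=0.1 :", regularized_objective(1.0, kl, 0.1))
```

Whenever `beta` is positive and the policy has moved away from the reference, the penalty pulls the optimum away from the one the task objective alone would reach, which is the effect this PR sets out to show.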
CHE

Unfortunately, I noticed some issues with the expected return metrics in Weights & Biases after merging this PR. I will leave the fix for the next PR (PR #66), but the general trends should still be visible from the reward.
Consistent with the results described in the paper, MAGRPO's ascent becomes very slow, and IAC faces convergence issues.

HouseBuild

Given the bugs mentioned above, I have decided not to wait for this batch of HouseBuild runs to finish completely. However, we can already see a clear trend: