Adapted the RL loop to run on Brev instances. by ilya-kolchinsky · Pull Request #29 · redhat-et/code-agent

ilya-kolchinsky · 2026-06-26T19:51:58Z

Summary by CodeRabbit

New Features
- Added setup scripts to automate Brev GPU environment preparation, storage relocation, cluster deployment, image building, and dataset generation.
- Training deployments can now control container image pull behavior and include additional runtime settings for caching and debugging.
- Container images now include updated GPU-related dependencies and compatibility fixes for smoother training startup.
Bug Fixes
- Improved local image references for training containers.
- Added compatibility adjustments to avoid missing CUDA/runtime and user पहचान issues during startup.

coderabbitai · 2026-06-26T19:52:12Z

📝 Walkthrough

Walkthrough

The PR updates the SWE-bench RL container images, RayCluster manifest generation, and Brev setup scripts. It adds explicit dependency and runtime settings, configurable image pull policy, storage relocation, cluster deployment, and dataset provisioning steps.

Changes

SWE-bench RL image and Brev bootstrap

Layer / File(s)	Summary
Container image dependency stack `infra/images/Containerfile.ray-swe-rl-base`, `infra/images/Containerfile.ray-swe-rl`	The base image installs `flash-attn==2.8.3`, `openrlhf[vllm_latest]`, evaluation and training dependencies, bleeding-edge `transformers`, `nvidia-cuda-runtime-cu12`, and a `ray` passwd entry; the derived image now starts from `ray-swe-rl-base:latest`.
RayCluster head defaults `train_swe_bench_grpo.sh`	Adds `IMAGE_PULL_POLICY`, uses it for the head container, and adds `HOME`, `USER`, `XDG_CACHE_HOME`, and `TRITON_CACHE_DIR` to the head environment.
RayCluster worker templates `train_swe_bench_grpo.sh`	The colocated, rollout, and training worker templates switch to `IMAGE_PULL_POLICY` and add a `SYS_PTRACE` Linux capability in `securityContext`.
Brev storage relocation `training/swe_bench/deploy/setup-brev-storage.sh`	The new root-only storage script moves Docker and microk8s data into `/ephemeral`, replaces the original paths with symlinks, restarts services, waits for readiness, and prints disk usage.
Bootstrap setup and image push `training/swe_bench/deploy/setup-brev.sh`	The bootstrap entrypoint sets strict shell options, checks host and storage prerequisites, enables microk8s addons, starts the local registry, installs KubeRay, applies RBAC and PVC manifests, and builds and pushes the Ray images.
Cluster deploy and dataset job `training/swe_bench/deploy/setup-brev.sh`	The bootstrap script installs Ray CLI when missing, applies the generated RayCluster, provisions the training dataset with a Kubernetes Job when needed, and prints the completion commands.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

redhat-et/code-agent#28: Introduced the SWE-Bench RL training loop setup and Ray build/deploy flow that this PR extends with image, cluster, and Brev bootstrap changes.

Suggested reviewers

MichaelClifford
TreeinRandomForest

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adapting the RL training loop for Brev instances.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

train_swe_bench_grpo.sh (1)
229-232: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Confirm SYS_PTRACE is required and minimally scoped.

SYS_PTRACE lets the container trace processes (useful for py-spy/gdb/Nsight). For the colocated worker this combines with the pod-level runAsUser: 0 (Line 224), so the container runs as root and with ptrace. The capability stays scoped to the pod's PID namespace by default, so blast radius is limited, but if this is only needed for ad-hoc debugging consider gating it behind a flag rather than baking it into every worker template.

This same securityContext addition also applies to the rollout (Lines 293-296) and training (Lines 348-351) workers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@train_swe_bench_grpo.sh` around lines 229 - 232, The worker pod templates
currently always add SYS_PTRACE under securityContext, which should be minimized
or made optional. Update the colocated worker, rollout worker, and training
worker templates in train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when
needed, ideally behind a debug/diagnostic flag or similar gate, while keeping
the existing securityContext structure intact for the relevant worker specs.
infra/images/Containerfile.ray-swe-rl-base (1)
37-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Pin the transformers git install to a commit/tag.

Installing from main makes the image non-reproducible and exposes builds to upstream-breaking changes landing at any time. Pin to a specific commit SHA (or release tag once Qwen3.5 support is released) so rebuilds are deterministic.
♻️ Suggested pin
 RUN pip install --no-cache-dir \
-    "transformers @ git+https://github.com/huggingface/transformers.git"
+    "transformers @ git+https://github.com/huggingface/transformers.git@<pinned-commit-sha>"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/images/Containerfile.ray-swe-rl-base` around lines 37 - 38, The
Containerfile’s transformers git dependency is unpinned, which makes the image
build non-deterministic. Update the RUN pip install step in
Containerfile.ray-swe-rl-base to reference a specific transformers commit SHA or
release tag instead of the moving git main branch, so rebuilds using the same
image definition always resolve the same code.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 23-24: The flash-attn wheel install in the Containerfile is
pulling from an untrusted personal GitHub release and lacks integrity
verification. Update the image build to use the official
Dao-AILab/flash-attention release asset or a vetted internal mirror, and pin the
wheel with a SHA256 or --require-hashes. Keep the change localized to the pip
install step so the build remains reproducible and safe.

In `@training/swe_bench/deploy/setup-brev-storage.sh`:
- Around line 68-73: The microk8s startup/readiness path is swallowing failures
via the final fallback in the startup block and the readiness check, which can
let the script continue as if storage relocation succeeded. Update the
setup-brev-storage.sh flow around the microk8s start and readiness commands so
failures are surfaced instead of ending in a blanket success fallback, and
ensure the script exits or reports a clear error when microk8s cannot start or
become ready.
- Around line 31-44: The relocate helper currently moves data without checking
whether the destination already exists, which can cause nested directories and
incorrect symlinks when src is a real directory but dst was left behind from a
previous run. Update relocate to first validate any existing dst state, remove
or reconcile stale destinations before mv, and ensure the final symlink at src
points to the expected /ephemeral target. Also verify existing symlinks in
relocate and skip only when they already resolve to the intended destination,
not just when src is any symlink.

In `@training/swe_bench/deploy/setup-brev.sh`:
- Around line 87-97: The registry startup logic in setup-brev.sh only checks
running containers via docker ps, so a stopped registry container can still
block docker run --name registry. Update the registry handling block to detect
an existing container named registry even when it is not running, and start it
instead of creating a new one; use the existing registry check/startup flow
around the docker ps / docker run section to handle both running and stopped
cases cleanly.
- Around line 203-233: Before creating the build-swe-dataset Job in
setup-brev.sh, remove any existing Job with the same name so reruns don’t reuse
a stale failed or interrupted job. Update the section around the kubectl
apply/wait/delete flow to delete or replace build-swe-dataset prior to applying
the new Job manifest, and make sure the cleanup is robust even if the previous
run exited early. Use the existing job name build-swe-dataset and the kubectl
wait/kubectl delete logic to locate the change.
- Around line 62-67: The bootstrap validation in setup-brev.sh only checks
/var/lib/docker, but it needs to verify all relocated storage locations moved by
setup-brev-storage.sh. Update the storage sanity check to confirm
/var/lib/docker, /var/lib/containerd, and the microk8s containerd path are all
correctly relocated to the ephemeral disk before continuing; if any are still on
the root disk, emit a warning and exit like the existing Docker check.

---

Nitpick comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 37-38: The Containerfile’s transformers git dependency is
unpinned, which makes the image build non-deterministic. Update the RUN pip
install step in Containerfile.ray-swe-rl-base to reference a specific
transformers commit SHA or release tag instead of the moving git main branch, so
rebuilds using the same image definition always resolve the same code.

In `@train_swe_bench_grpo.sh`:
- Around line 229-232: The worker pod templates currently always add SYS_PTRACE
under securityContext, which should be minimized or made optional. Update the
colocated worker, rollout worker, and training worker templates in
train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when needed, ideally
behind a debug/diagnostic flag or similar gate, while keeping the existing
securityContext structure intact for the relevant worker specs.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: fe3bdef7-52ff-46aa-8206-4d4a3a87e87b

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1a1f9 and fd7c33c.

📒 Files selected for processing (5)

infra/images/Containerfile.ray-swe-rl
infra/images/Containerfile.ray-swe-rl-base
train_swe_bench_grpo.sh
training/swe_bench/deploy/setup-brev-storage.sh
training/swe_bench/deploy/setup-brev.sh

coderabbitai · 2026-06-26T20:01:50Z

+RUN pip install --no-cache-dir \
+    "https://github.com/lesj0610/flash-attention/releases/download/v2.8.3-cu12-torch2.10-cp312/flash_attn-2.8.3%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"


🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Supply-chain risk: flash-attn wheel pulled from an untrusted personal fork.

The wheel is fetched from github.com/lesj0610/flash-attention, a personal account rather than the official Dao-AILab/flash-attention releases or PyPI, and there is no hash/integrity verification. A compromised or deleted release would silently break or poison the image. Prefer the official release asset (or self-host/mirror the vetted wheel in your own registry) and pin with --require-hashes / a SHA256 to guarantee integrity and reproducibility.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/images/Containerfile.ray-swe-rl-base` around lines 23 - 24, The flash-attn wheel install in the Containerfile is pulling from an untrusted personal GitHub release and lacks integrity verification. Update the image build to use the official Dao-AILab/flash-attention release asset or a vetted internal mirror, and pin the wheel with a SHA256 or --require-hashes. Keep the change localized to the pip install step so the build remains reproducible and safe.

coderabbitai · 2026-06-26T20:01:50Z

+relocate() {
+    local src="$1"
+    local dst="$2"
+    local label="$3"
+
+    if [ -L "$src" ]; then
+        echo "$label: $src is already a symlink, skipping."
+        return
+    fi
+
+    echo "$label: moving $src → $dst ..."
+    mkdir -p "$(dirname "$dst")"
+    mv "$src" "$dst"
+    ln -s "$dst" "$src"


🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Guard against stale destinations before moving container data.

If a previous attempt created $dst but left $src as a real directory, mv "$src" "$dst" nests the data under the existing destination and then symlinks Docker/microk8s to the wrong level. Also verify existing symlinks point at the expected /ephemeral path.

Proposed hardening

relocate() { local src="$1" local dst="$2" local label="$3" if [ -L "$src" ]; then - echo "$label: $src is already a symlink, skipping." + if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then + echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2 + exit 1 + fi + echo "$label: $src already points to $dst, skipping." return fi + + if [ -e "$dst" ]; then + echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2 + exit 1 + fi echo "$label: moving $src → $dst ..."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

relocate() {

local src="$1"

local dst="$2"

local label="$3"

if [ -L "$src" ]; then

echo "$label: $src is already a symlink, skipping."

return

fi

echo "$label: moving $src → $dst ..."

mkdir -p "$(dirname "$dst")"

mv "$src" "$dst"

ln -s "$dst" "$src"

relocate() {

local src="$1"

local dst="$2"

local label="$3"

if [ -L "$src" ]; then

if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then

echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2

exit 1

fi

echo "$label: $src already points to $dst, skipping."

return

fi

if [ -e "$dst" ]; then

echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2

exit 1

fi

echo "$label: moving $src → $dst ..."

mkdir -p "$(dirname "$dst")"

mv "$src" "$dst"

ln -s "$dst" "$src"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 31 - 44, The relocate helper currently moves data without checking whether the destination already exists, which can cause nested directories and incorrect symlinks when src is a real directory but dst was left behind from a previous run. Update relocate to first validate any existing dst state, remove or reconcile stale destinations before mv, and ensure the final symlink at src points to the expected /ephemeral target. Also verify existing symlinks in relocate and skip only when they already resolve to the intended destination, not just when src is any symlink.

coderabbitai · 2026-06-26T20:01:50Z

+echo "=== Starting microk8s ==="
+microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true
+
+echo "=== Waiting for microk8s to be ready ==="
+microk8s status --wait-ready --timeout 120 2>/dev/null || \
+    kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not mask microk8s restart/readiness failures.

The final || true lets the script report “Storage relocated” even when microk8s failed to start or become ready, leaving the bootstrap script to fail later with a less actionable error.

Proposed fix

echo "=== Starting microk8s ===" -microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true +microk8s start 2>/dev/null || snap start microk8s echo "=== Waiting for microk8s to be ready ===" microk8s status --wait-ready --timeout 120 2>/dev/null || \ - kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true + kubectl wait --for=condition=Ready node --all --timeout=120s

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

echo "=== Starting microk8s ==="

microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true

echo "=== Waiting for microk8s to be ready ==="

microk8s status --wait-ready --timeout 120 2>/dev/null || \

kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true

echo "=== Starting microk8s ==="

microk8s start 2>/dev/null || snap start microk8s

echo "=== Waiting for microk8s to be ready ==="

microk8s status --wait-ready --timeout 120 2>/dev/null || \

kubectl wait --for=condition=Ready node --all --timeout=120s

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 68 - 73, The microk8s startup/readiness path is swallowing failures via the final fallback in the startup block and the readiness check, which can let the script continue as if storage relocation succeeded. Update the setup-brev-storage.sh flow around the microk8s start and readiness commands so failures are surfaced instead of ending in a blanket success fallback, and ensure the script exits or reports a clear error when microk8s cannot start or become ready.

coderabbitai · 2026-06-26T20:01:50Z

+if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
+    echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
+    echo "         Run:  sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
+    echo "         Then re-run this script."
+    exit 1
+fi


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate all relocated storage paths, not only Docker metadata.

setup-brev-storage.sh moves /var/lib/docker, /var/lib/containerd, and microk8s containerd. Checking only /var/lib/docker allows the bootstrap to continue with Kubernetes image storage still on the small root disk.

Proposed validation

-if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then - echo "WARNING: /var/lib/docker is not a symlink to /ephemeral." - echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" - echo " Then re-run this script." - exit 1 -fi +for storage_path in \ + /var/lib/docker \ + /var/lib/containerd \ + /var/snap/microk8s/common/var/lib/containerd +do + if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then + echo "ERROR: $storage_path is not relocated to /ephemeral." >&2 + echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2 + exit 1 + fi +done

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then

echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."

echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh"

echo " Then re-run this script."

exit 1

fi

for storage_path in \

/var/lib/docker \

/var/lib/containerd \

/var/snap/microk8s/common/var/lib/containerd

do

if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then

echo "ERROR: $storage_path is not relocated to /ephemeral." >&2

echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2

exit 1

fi

done

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@training/swe_bench/deploy/setup-brev.sh` around lines 62 - 67, The bootstrap validation in setup-brev.sh only checks /var/lib/docker, but it needs to verify all relocated storage locations moved by setup-brev-storage.sh. Update the storage sanity check to confirm /var/lib/docker, /var/lib/containerd, and the microk8s containerd path are all correctly relocated to the ephemeral disk before continuing; if any are still on the root disk, emit a warning and exit like the existing Docker check.

coderabbitai · 2026-06-26T20:01:50Z

+if docker ps --format '{{.Names}}' | grep -q '^registry$'; then
+    echo "  Registry container already running."
+else
+    echo "  Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."
+    docker run -d \
+        --restart=always \
+        -p 5000:5000 \
+        -v /ephemeral/registry:/var/lib/registry \
+        --name registry \
+        registry:2
+fi


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Handle an existing stopped registry container.

docker ps misses stopped containers; after a partial failure or manual stop, docker run --name registry fails because the name already exists.

Proposed fix

-if docker ps --format '{{.Names}}' | grep -q '^registry$'; then +if docker ps --format '{{.Names}}' | grep -qx 'registry'; then echo " Registry container already running." +elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then + echo " Starting existing registry container." + docker start registry else

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if docker ps --format '{{.Names}}' | grep -q '^registry$'; then

echo " Registry container already running."

else

echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."

docker run -d \

--restart=always \

-p 5000:5000 \

-v /ephemeral/registry:/var/lib/registry \

--name registry \

registry:2

fi

if docker ps --format '{{.Names}}' | grep -qx 'registry'; then

echo " Registry container already running."

elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then

echo " Starting existing registry container."

docker start registry

else

echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."

docker run -d \

--restart=always \

-p 5000:5000 \

-v /ephemeral/registry:/var/lib/registry \

--name registry \

registry:2

fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@training/swe_bench/deploy/setup-brev.sh` around lines 87 - 97, The registry startup logic in setup-brev.sh only checks running containers via docker ps, so a stopped registry container can still block docker run --name registry. Update the registry handling block to detect an existing container named registry even when it is not running, and start it instead of creating a new one; use the existing registry check/startup flow around the docker ps / docker run section to handle both running and stopped cases cleanly.

coderabbitai · 2026-06-26T20:01:50Z

+    echo "  Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
+    cat <<JOB_YAML | kubectl apply -f -
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: build-swe-dataset
+spec:
+  backoffLimit: 0
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: builder
+          image: ${RAY_IMAGE}
+          imagePullPolicy: Always
+          command: ["/bin/bash", "-c", "${BUILD_CMD}"]
+          volumeMounts:
+            - name: hf-cache
+              mountPath: /hf-cache
+          env:
+            - name: HF_HOME
+              value: /hf-cache
+      volumes:
+        - name: hf-cache
+          persistentVolumeClaim:
+            claimName: hf-cache-0
+JOB_YAML
+
+    echo "  Waiting for dataset Job to complete..."
+    kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s
+    kubectl delete job build-swe-dataset


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Delete stale dataset Jobs before creating a new one.

A failed or interrupted build-swe-dataset Job is left behind. On rerun, kubectl apply can hit immutable Job-template errors or wait on the stale failed Job instead of starting a fresh dataset build.

Proposed fix

fi echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..." + kubectl delete job build-swe-dataset --ignore-not-found --wait=true cat <<JOB_YAML | kubectl apply -f - apiVersion: batch/v1 kind: Job

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."

cat <<JOB_YAML | kubectl apply -f -

apiVersion: batch/v1

kind: Job

metadata:

name: build-swe-dataset

spec:

backoffLimit: 0

template:

spec:

restartPolicy: Never

containers:

- name: builder

image: ${RAY_IMAGE}

imagePullPolicy: Always

command: ["/bin/bash", "-c", "${BUILD_CMD}"]

volumeMounts:

- name: hf-cache

mountPath: /hf-cache

env:

- name: HF_HOME

value: /hf-cache

volumes:

- name: hf-cache

persistentVolumeClaim:

claimName: hf-cache-0

JOB_YAML

echo " Waiting for dataset Job to complete..."

kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s

kubectl delete job build-swe-dataset

echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."

kubectl delete job build-swe-dataset --ignore-not-found --wait=true

cat <<JOB_YAML | kubectl apply -f -

apiVersion: batch/v1

kind: Job

metadata:

name: build-swe-dataset

spec:

backoffLimit: 0

template:

spec:

restartPolicy: Never

containers:

- name: builder

image: ${RAY_IMAGE}

imagePullPolicy: Always

command: ["/bin/bash", "-c", "${BUILD_CMD}"]

volumeMounts:

- name: hf-cache

mountPath: /hf-cache

env:

- name: HF_HOME

value: /hf-cache

volumes:

- name: hf-cache

persistentVolumeClaim:

claimName: hf-cache-0

JOB_YAML

echo " Waiting for dataset Job to complete..."

kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s

kubectl delete job build-swe-dataset

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@training/swe_bench/deploy/setup-brev.sh` around lines 203 - 233, Before creating the build-swe-dataset Job in setup-brev.sh, remove any existing Job with the same name so reruns don’t reuse a stale failed or interrupted job. Update the section around the kubectl apply/wait/delete flow to delete or replace build-swe-dataset prior to applying the new Job manifest, and make sure the cleanup is robust even if the previous run exited early. Use the existing job name build-swe-dataset and the kubectl wait/kubectl delete logic to locate the change.

Adapted the RL loop to run on Brev instances.

fd7c33c

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Adapted the RL loop to run on Brev instances.#29

Adapted the RL loop to run on Brev instances.#29
ilya-kolchinsky wants to merge 1 commit into
redhat-et:mainfrom
ilya-kolchinsky:nvidia-brev-setup

ilya-kolchinsky commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

coderabbitai Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		RUN pip install --no-cache-dir \
		"https://github.com/lesj0610/flash-attention/releases/download/v2.8.3-cu12-torch2.10-cp312/flash_attn-2.8.3%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

-if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
-    echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
-    echo "         Run:  sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
-    echo "         Then re-run this script."
-    exit 1
-fi
+for storage_path in \
+    /var/lib/docker \
+    /var/lib/containerd \
+    /var/snap/microk8s/common/var/lib/containerd
+do
+    if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then
+        echo "ERROR: $storage_path is not relocated to /ephemeral." >&2
+        echo "       Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2
+        exit 1
+    fi
+done

Uh oh!

Conversation

ilya-kolchinsky commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ilya-kolchinsky commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading