Skip to content

Adapted the RL loop to run on Brev instances.#29

Open
ilya-kolchinsky wants to merge 1 commit into
redhat-et:mainfrom
ilya-kolchinsky:nvidia-brev-setup
Open

Adapted the RL loop to run on Brev instances.#29
ilya-kolchinsky wants to merge 1 commit into
redhat-et:mainfrom
ilya-kolchinsky:nvidia-brev-setup

Conversation

@ilya-kolchinsky

@ilya-kolchinsky ilya-kolchinsky commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features

    • Added setup scripts to automate Brev GPU environment preparation, storage relocation, cluster deployment, image building, and dataset generation.
    • Training deployments can now control container image pull behavior and include additional runtime settings for caching and debugging.
    • Container images now include updated GPU-related dependencies and compatibility fixes for smoother training startup.
  • Bug Fixes

    • Improved local image references for training containers.
    • Added compatibility adjustments to avoid missing CUDA/runtime and user पहचान issues during startup.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The PR updates the SWE-bench RL container images, RayCluster manifest generation, and Brev setup scripts. It adds explicit dependency and runtime settings, configurable image pull policy, storage relocation, cluster deployment, and dataset provisioning steps.

Changes

SWE-bench RL image and Brev bootstrap

Layer / File(s) Summary
Container image dependency stack
infra/images/Containerfile.ray-swe-rl-base, infra/images/Containerfile.ray-swe-rl
The base image installs flash-attn==2.8.3, openrlhf[vllm_latest], evaluation and training dependencies, bleeding-edge transformers, nvidia-cuda-runtime-cu12, and a ray passwd entry; the derived image now starts from ray-swe-rl-base:latest.
RayCluster head defaults
train_swe_bench_grpo.sh
Adds IMAGE_PULL_POLICY, uses it for the head container, and adds HOME, USER, XDG_CACHE_HOME, and TRITON_CACHE_DIR to the head environment.
RayCluster worker templates
train_swe_bench_grpo.sh
The colocated, rollout, and training worker templates switch to IMAGE_PULL_POLICY and add a SYS_PTRACE Linux capability in securityContext.
Brev storage relocation
training/swe_bench/deploy/setup-brev-storage.sh
The new root-only storage script moves Docker and microk8s data into /ephemeral, replaces the original paths with symlinks, restarts services, waits for readiness, and prints disk usage.
Bootstrap setup and image push
training/swe_bench/deploy/setup-brev.sh
The bootstrap entrypoint sets strict shell options, checks host and storage prerequisites, enables microk8s addons, starts the local registry, installs KubeRay, applies RBAC and PVC manifests, and builds and pushes the Ray images.
Cluster deploy and dataset job
training/swe_bench/deploy/setup-brev.sh
The bootstrap script installs Ray CLI when missing, applies the generated RayCluster, provisions the training dataset with a Kubernetes Job when needed, and prints the completion commands.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • redhat-et/code-agent#28: Introduced the SWE-Bench RL training loop setup and Ray build/deploy flow that this PR extends with image, cluster, and Brev bootstrap changes.

Suggested reviewers

  • MichaelClifford
  • TreeinRandomForest
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adapting the RL training loop for Brev instances.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
train_swe_bench_grpo.sh (1)

229-232: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Confirm SYS_PTRACE is required and minimally scoped.

SYS_PTRACE lets the container trace processes (useful for py-spy/gdb/Nsight). For the colocated worker this combines with the pod-level runAsUser: 0 (Line 224), so the container runs as root and with ptrace. The capability stays scoped to the pod's PID namespace by default, so blast radius is limited, but if this is only needed for ad-hoc debugging consider gating it behind a flag rather than baking it into every worker template.

This same securityContext addition also applies to the rollout (Lines 293-296) and training (Lines 348-351) workers.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@train_swe_bench_grpo.sh` around lines 229 - 232, The worker pod templates
currently always add SYS_PTRACE under securityContext, which should be minimized
or made optional. Update the colocated worker, rollout worker, and training
worker templates in train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when
needed, ideally behind a debug/diagnostic flag or similar gate, while keeping
the existing securityContext structure intact for the relevant worker specs.
infra/images/Containerfile.ray-swe-rl-base (1)

37-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Pin the transformers git install to a commit/tag.

Installing from main makes the image non-reproducible and exposes builds to upstream-breaking changes landing at any time. Pin to a specific commit SHA (or release tag once Qwen3.5 support is released) so rebuilds are deterministic.

♻️ Suggested pin
 RUN pip install --no-cache-dir \
-    "transformers @ git+https://github.com/huggingface/transformers.git"
+    "transformers @ git+https://github.com/huggingface/transformers.git@<pinned-commit-sha>"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/images/Containerfile.ray-swe-rl-base` around lines 37 - 38, The
Containerfile’s transformers git dependency is unpinned, which makes the image
build non-deterministic. Update the RUN pip install step in
Containerfile.ray-swe-rl-base to reference a specific transformers commit SHA or
release tag instead of the moving git main branch, so rebuilds using the same
image definition always resolve the same code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 23-24: The flash-attn wheel install in the Containerfile is
pulling from an untrusted personal GitHub release and lacks integrity
verification. Update the image build to use the official
Dao-AILab/flash-attention release asset or a vetted internal mirror, and pin the
wheel with a SHA256 or --require-hashes. Keep the change localized to the pip
install step so the build remains reproducible and safe.

In `@training/swe_bench/deploy/setup-brev-storage.sh`:
- Around line 68-73: The microk8s startup/readiness path is swallowing failures
via the final fallback in the startup block and the readiness check, which can
let the script continue as if storage relocation succeeded. Update the
setup-brev-storage.sh flow around the microk8s start and readiness commands so
failures are surfaced instead of ending in a blanket success fallback, and
ensure the script exits or reports a clear error when microk8s cannot start or
become ready.
- Around line 31-44: The relocate helper currently moves data without checking
whether the destination already exists, which can cause nested directories and
incorrect symlinks when src is a real directory but dst was left behind from a
previous run. Update relocate to first validate any existing dst state, remove
or reconcile stale destinations before mv, and ensure the final symlink at src
points to the expected /ephemeral target. Also verify existing symlinks in
relocate and skip only when they already resolve to the intended destination,
not just when src is any symlink.

In `@training/swe_bench/deploy/setup-brev.sh`:
- Around line 87-97: The registry startup logic in setup-brev.sh only checks
running containers via docker ps, so a stopped registry container can still
block docker run --name registry. Update the registry handling block to detect
an existing container named registry even when it is not running, and start it
instead of creating a new one; use the existing registry check/startup flow
around the docker ps / docker run section to handle both running and stopped
cases cleanly.
- Around line 203-233: Before creating the build-swe-dataset Job in
setup-brev.sh, remove any existing Job with the same name so reruns don’t reuse
a stale failed or interrupted job. Update the section around the kubectl
apply/wait/delete flow to delete or replace build-swe-dataset prior to applying
the new Job manifest, and make sure the cleanup is robust even if the previous
run exited early. Use the existing job name build-swe-dataset and the kubectl
wait/kubectl delete logic to locate the change.
- Around line 62-67: The bootstrap validation in setup-brev.sh only checks
/var/lib/docker, but it needs to verify all relocated storage locations moved by
setup-brev-storage.sh. Update the storage sanity check to confirm
/var/lib/docker, /var/lib/containerd, and the microk8s containerd path are all
correctly relocated to the ephemeral disk before continuing; if any are still on
the root disk, emit a warning and exit like the existing Docker check.

---

Nitpick comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 37-38: The Containerfile’s transformers git dependency is
unpinned, which makes the image build non-deterministic. Update the RUN pip
install step in Containerfile.ray-swe-rl-base to reference a specific
transformers commit SHA or release tag instead of the moving git main branch, so
rebuilds using the same image definition always resolve the same code.

In `@train_swe_bench_grpo.sh`:
- Around line 229-232: The worker pod templates currently always add SYS_PTRACE
under securityContext, which should be minimized or made optional. Update the
colocated worker, rollout worker, and training worker templates in
train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when needed, ideally
behind a debug/diagnostic flag or similar gate, while keeping the existing
securityContext structure intact for the relevant worker specs.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: fe3bdef7-52ff-46aa-8206-4d4a3a87e87b

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1a1f9 and fd7c33c.

📒 Files selected for processing (5)
  • infra/images/Containerfile.ray-swe-rl
  • infra/images/Containerfile.ray-swe-rl-base
  • train_swe_bench_grpo.sh
  • training/swe_bench/deploy/setup-brev-storage.sh
  • training/swe_bench/deploy/setup-brev.sh

Comment on lines +23 to +24
RUN pip install --no-cache-dir \
"https://github.com/lesj0610/flash-attention/releases/download/v2.8.3-cu12-torch2.10-cp312/flash_attn-2.8.3%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift

Supply-chain risk: flash-attn wheel pulled from an untrusted personal fork.

The wheel is fetched from github.com/lesj0610/flash-attention, a personal account rather than the official Dao-AILab/flash-attention releases or PyPI, and there is no hash/integrity verification. A compromised or deleted release would silently break or poison the image. Prefer the official release asset (or self-host/mirror the vetted wheel in your own registry) and pin with --require-hashes / a SHA256 to guarantee integrity and reproducibility.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@infra/images/Containerfile.ray-swe-rl-base` around lines 23 - 24, The
flash-attn wheel install in the Containerfile is pulling from an untrusted
personal GitHub release and lacks integrity verification. Update the image build
to use the official Dao-AILab/flash-attention release asset or a vetted internal
mirror, and pin the wheel with a SHA256 or --require-hashes. Keep the change
localized to the pip install step so the build remains reproducible and safe.

Comment on lines +31 to +44
relocate() {
local src="$1"
local dst="$2"
local label="$3"

if [ -L "$src" ]; then
echo "$label: $src is already a symlink, skipping."
return
fi

echo "$label: moving $src → $dst ..."
mkdir -p "$(dirname "$dst")"
mv "$src" "$dst"
ln -s "$dst" "$src"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Guard against stale destinations before moving container data.

If a previous attempt created $dst but left $src as a real directory, mv "$src" "$dst" nests the data under the existing destination and then symlinks Docker/microk8s to the wrong level. Also verify existing symlinks point at the expected /ephemeral path.

Proposed hardening
 relocate() {
     local src="$1"
     local dst="$2"
     local label="$3"
 
     if [ -L "$src" ]; then
-        echo "$label: $src is already a symlink, skipping."
+        if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then
+            echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2
+            exit 1
+        fi
+        echo "$label: $src already points to $dst, skipping."
         return
     fi
+
+    if [ -e "$dst" ]; then
+        echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2
+        exit 1
+    fi
 
     echo "$label: moving $src → $dst ..."
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
relocate() {
local src="$1"
local dst="$2"
local label="$3"
if [ -L "$src" ]; then
echo "$label: $src is already a symlink, skipping."
return
fi
echo "$label: moving $src$dst ..."
mkdir -p "$(dirname "$dst")"
mv "$src" "$dst"
ln -s "$dst" "$src"
relocate() {
local src="$1"
local dst="$2"
local label="$3"
if [ -L "$src" ]; then
if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then
echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2
exit 1
fi
echo "$label: $src already points to $dst, skipping."
return
fi
if [ -e "$dst" ]; then
echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2
exit 1
fi
echo "$label: moving $src$dst ..."
mkdir -p "$(dirname "$dst")"
mv "$src" "$dst"
ln -s "$dst" "$src"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 31 - 44, The
relocate helper currently moves data without checking whether the destination
already exists, which can cause nested directories and incorrect symlinks when
src is a real directory but dst was left behind from a previous run. Update
relocate to first validate any existing dst state, remove or reconcile stale
destinations before mv, and ensure the final symlink at src points to the
expected /ephemeral target. Also verify existing symlinks in relocate and skip
only when they already resolve to the intended destination, not just when src is
any symlink.

Comment on lines +68 to +73
echo "=== Starting microk8s ==="
microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true

echo "=== Waiting for microk8s to be ready ==="
microk8s status --wait-ready --timeout 120 2>/dev/null || \
kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Do not mask microk8s restart/readiness failures.

The final || true lets the script report “Storage relocated” even when microk8s failed to start or become ready, leaving the bootstrap script to fail later with a less actionable error.

Proposed fix
 echo "=== Starting microk8s ==="
-microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true
+microk8s start 2>/dev/null || snap start microk8s
 
 echo "=== Waiting for microk8s to be ready ==="
 microk8s status --wait-ready --timeout 120 2>/dev/null || \
-    kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true
+    kubectl wait --for=condition=Ready node --all --timeout=120s
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo "=== Starting microk8s ==="
microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true
echo "=== Waiting for microk8s to be ready ==="
microk8s status --wait-ready --timeout 120 2>/dev/null || \
kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true
echo "=== Starting microk8s ==="
microk8s start 2>/dev/null || snap start microk8s
echo "=== Waiting for microk8s to be ready ==="
microk8s status --wait-ready --timeout 120 2>/dev/null || \
kubectl wait --for=condition=Ready node --all --timeout=120s
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 68 - 73, The
microk8s startup/readiness path is swallowing failures via the final fallback in
the startup block and the readiness check, which can let the script continue as
if storage relocation succeeded. Update the setup-brev-storage.sh flow around
the microk8s start and readiness commands so failures are surfaced instead of
ending in a blanket success fallback, and ensure the script exits or reports a
clear error when microk8s cannot start or become ready.

Comment on lines +62 to +67
if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
echo " Then re-run this script."
exit 1
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate all relocated storage paths, not only Docker metadata.

setup-brev-storage.sh moves /var/lib/docker, /var/lib/containerd, and microk8s containerd. Checking only /var/lib/docker allows the bootstrap to continue with Kubernetes image storage still on the small root disk.

Proposed validation
-if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
-    echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
-    echo "         Run:  sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
-    echo "         Then re-run this script."
-    exit 1
-fi
+for storage_path in \
+    /var/lib/docker \
+    /var/lib/containerd \
+    /var/snap/microk8s/common/var/lib/containerd
+do
+    if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then
+        echo "ERROR: $storage_path is not relocated to /ephemeral." >&2
+        echo "       Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2
+        exit 1
+    fi
+done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
echo " Then re-run this script."
exit 1
fi
for storage_path in \
/var/lib/docker \
/var/lib/containerd \
/var/snap/microk8s/common/var/lib/containerd
do
if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then
echo "ERROR: $storage_path is not relocated to /ephemeral." >&2
echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2
exit 1
fi
done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@training/swe_bench/deploy/setup-brev.sh` around lines 62 - 67, The bootstrap
validation in setup-brev.sh only checks /var/lib/docker, but it needs to verify
all relocated storage locations moved by setup-brev-storage.sh. Update the
storage sanity check to confirm /var/lib/docker, /var/lib/containerd, and the
microk8s containerd path are all correctly relocated to the ephemeral disk
before continuing; if any are still on the root disk, emit a warning and exit
like the existing Docker check.

Comment on lines +87 to +97
if docker ps --format '{{.Names}}' | grep -q '^registry$'; then
echo " Registry container already running."
else
echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."
docker run -d \
--restart=always \
-p 5000:5000 \
-v /ephemeral/registry:/var/lib/registry \
--name registry \
registry:2
fi

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Handle an existing stopped registry container.

docker ps misses stopped containers; after a partial failure or manual stop, docker run --name registry fails because the name already exists.

Proposed fix
-if docker ps --format '{{.Names}}' | grep -q '^registry$'; then
+if docker ps --format '{{.Names}}' | grep -qx 'registry'; then
     echo "  Registry container already running."
+elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then
+    echo "  Starting existing registry container."
+    docker start registry
 else
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if docker ps --format '{{.Names}}' | grep -q '^registry$'; then
echo " Registry container already running."
else
echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."
docker run -d \
--restart=always \
-p 5000:5000 \
-v /ephemeral/registry:/var/lib/registry \
--name registry \
registry:2
fi
if docker ps --format '{{.Names}}' | grep -qx 'registry'; then
echo " Registry container already running."
elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then
echo " Starting existing registry container."
docker start registry
else
echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..."
docker run -d \
--restart=always \
-p 5000:5000 \
-v /ephemeral/registry:/var/lib/registry \
--name registry \
registry:2
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@training/swe_bench/deploy/setup-brev.sh` around lines 87 - 97, The registry
startup logic in setup-brev.sh only checks running containers via docker ps, so
a stopped registry container can still block docker run --name registry. Update
the registry handling block to detect an existing container named registry even
when it is not running, and start it instead of creating a new one; use the
existing registry check/startup flow around the docker ps / docker run section
to handle both running and stopped cases cleanly.

Comment on lines +203 to +233
echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
cat <<JOB_YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: build-swe-dataset
spec:
backoffLimit: 0
template:
spec:
restartPolicy: Never
containers:
- name: builder
image: ${RAY_IMAGE}
imagePullPolicy: Always
command: ["/bin/bash", "-c", "${BUILD_CMD}"]
volumeMounts:
- name: hf-cache
mountPath: /hf-cache
env:
- name: HF_HOME
value: /hf-cache
volumes:
- name: hf-cache
persistentVolumeClaim:
claimName: hf-cache-0
JOB_YAML

echo " Waiting for dataset Job to complete..."
kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s
kubectl delete job build-swe-dataset

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Delete stale dataset Jobs before creating a new one.

A failed or interrupted build-swe-dataset Job is left behind. On rerun, kubectl apply can hit immutable Job-template errors or wait on the stale failed Job instead of starting a fresh dataset build.

Proposed fix
     fi
 
     echo "  Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
+    kubectl delete job build-swe-dataset --ignore-not-found --wait=true
     cat <<JOB_YAML | kubectl apply -f -
 apiVersion: batch/v1
 kind: Job
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
cat <<JOB_YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: build-swe-dataset
spec:
backoffLimit: 0
template:
spec:
restartPolicy: Never
containers:
- name: builder
image: ${RAY_IMAGE}
imagePullPolicy: Always
command: ["/bin/bash", "-c", "${BUILD_CMD}"]
volumeMounts:
- name: hf-cache
mountPath: /hf-cache
env:
- name: HF_HOME
value: /hf-cache
volumes:
- name: hf-cache
persistentVolumeClaim:
claimName: hf-cache-0
JOB_YAML
echo " Waiting for dataset Job to complete..."
kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s
kubectl delete job build-swe-dataset
echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
kubectl delete job build-swe-dataset --ignore-not-found --wait=true
cat <<JOB_YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
name: build-swe-dataset
spec:
backoffLimit: 0
template:
spec:
restartPolicy: Never
containers:
- name: builder
image: ${RAY_IMAGE}
imagePullPolicy: Always
command: ["/bin/bash", "-c", "${BUILD_CMD}"]
volumeMounts:
- name: hf-cache
mountPath: /hf-cache
env:
- name: HF_HOME
value: /hf-cache
volumes:
- name: hf-cache
persistentVolumeClaim:
claimName: hf-cache-0
JOB_YAML
echo " Waiting for dataset Job to complete..."
kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s
kubectl delete job build-swe-dataset
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@training/swe_bench/deploy/setup-brev.sh` around lines 203 - 233, Before
creating the build-swe-dataset Job in setup-brev.sh, remove any existing Job
with the same name so reruns don’t reuse a stale failed or interrupted job.
Update the section around the kubectl apply/wait/delete flow to delete or
replace build-swe-dataset prior to applying the new Job manifest, and make sure
the cleanup is robust even if the previous run exited early. Use the existing
job name build-swe-dataset and the kubectl wait/kubectl delete logic to locate
the change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant