Adapted the RL loop to run on Brev instances.#29
Conversation
📝 WalkthroughWalkthroughThe PR updates the SWE-bench RL container images, RayCluster manifest generation, and Brev setup scripts. It adds explicit dependency and runtime settings, configurable image pull policy, storage relocation, cluster deployment, and dataset provisioning steps. ChangesSWE-bench RL image and Brev bootstrap
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
train_swe_bench_grpo.sh (1)
229-232: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low valueConfirm
SYS_PTRACEis required and minimally scoped.
SYS_PTRACElets the container trace processes (useful forpy-spy/gdb/Nsight). For the colocated worker this combines with the pod-levelrunAsUser: 0(Line 224), so the container runs as root and with ptrace. The capability stays scoped to the pod's PID namespace by default, so blast radius is limited, but if this is only needed for ad-hoc debugging consider gating it behind a flag rather than baking it into every worker template.This same
securityContextaddition also applies to the rollout (Lines 293-296) and training (Lines 348-351) workers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@train_swe_bench_grpo.sh` around lines 229 - 232, The worker pod templates currently always add SYS_PTRACE under securityContext, which should be minimized or made optional. Update the colocated worker, rollout worker, and training worker templates in train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when needed, ideally behind a debug/diagnostic flag or similar gate, while keeping the existing securityContext structure intact for the relevant worker specs.infra/images/Containerfile.ray-swe-rl-base (1)
37-38: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winPin the
transformersgit install to a commit/tag.Installing from
mainmakes the image non-reproducible and exposes builds to upstream-breaking changes landing at any time. Pin to a specific commit SHA (or release tag once Qwen3.5 support is released) so rebuilds are deterministic.♻️ Suggested pin
RUN pip install --no-cache-dir \ - "transformers @ git+https://github.com/huggingface/transformers.git" + "transformers @ git+https://github.com/huggingface/transformers.git@<pinned-commit-sha>"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@infra/images/Containerfile.ray-swe-rl-base` around lines 37 - 38, The Containerfile’s transformers git dependency is unpinned, which makes the image build non-deterministic. Update the RUN pip install step in Containerfile.ray-swe-rl-base to reference a specific transformers commit SHA or release tag instead of the moving git main branch, so rebuilds using the same image definition always resolve the same code.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 23-24: The flash-attn wheel install in the Containerfile is
pulling from an untrusted personal GitHub release and lacks integrity
verification. Update the image build to use the official
Dao-AILab/flash-attention release asset or a vetted internal mirror, and pin the
wheel with a SHA256 or --require-hashes. Keep the change localized to the pip
install step so the build remains reproducible and safe.
In `@training/swe_bench/deploy/setup-brev-storage.sh`:
- Around line 68-73: The microk8s startup/readiness path is swallowing failures
via the final fallback in the startup block and the readiness check, which can
let the script continue as if storage relocation succeeded. Update the
setup-brev-storage.sh flow around the microk8s start and readiness commands so
failures are surfaced instead of ending in a blanket success fallback, and
ensure the script exits or reports a clear error when microk8s cannot start or
become ready.
- Around line 31-44: The relocate helper currently moves data without checking
whether the destination already exists, which can cause nested directories and
incorrect symlinks when src is a real directory but dst was left behind from a
previous run. Update relocate to first validate any existing dst state, remove
or reconcile stale destinations before mv, and ensure the final symlink at src
points to the expected /ephemeral target. Also verify existing symlinks in
relocate and skip only when they already resolve to the intended destination,
not just when src is any symlink.
In `@training/swe_bench/deploy/setup-brev.sh`:
- Around line 87-97: The registry startup logic in setup-brev.sh only checks
running containers via docker ps, so a stopped registry container can still
block docker run --name registry. Update the registry handling block to detect
an existing container named registry even when it is not running, and start it
instead of creating a new one; use the existing registry check/startup flow
around the docker ps / docker run section to handle both running and stopped
cases cleanly.
- Around line 203-233: Before creating the build-swe-dataset Job in
setup-brev.sh, remove any existing Job with the same name so reruns don’t reuse
a stale failed or interrupted job. Update the section around the kubectl
apply/wait/delete flow to delete or replace build-swe-dataset prior to applying
the new Job manifest, and make sure the cleanup is robust even if the previous
run exited early. Use the existing job name build-swe-dataset and the kubectl
wait/kubectl delete logic to locate the change.
- Around line 62-67: The bootstrap validation in setup-brev.sh only checks
/var/lib/docker, but it needs to verify all relocated storage locations moved by
setup-brev-storage.sh. Update the storage sanity check to confirm
/var/lib/docker, /var/lib/containerd, and the microk8s containerd path are all
correctly relocated to the ephemeral disk before continuing; if any are still on
the root disk, emit a warning and exit like the existing Docker check.
---
Nitpick comments:
In `@infra/images/Containerfile.ray-swe-rl-base`:
- Around line 37-38: The Containerfile’s transformers git dependency is
unpinned, which makes the image build non-deterministic. Update the RUN pip
install step in Containerfile.ray-swe-rl-base to reference a specific
transformers commit SHA or release tag instead of the moving git main branch, so
rebuilds using the same image definition always resolve the same code.
In `@train_swe_bench_grpo.sh`:
- Around line 229-232: The worker pod templates currently always add SYS_PTRACE
under securityContext, which should be minimized or made optional. Update the
colocated worker, rollout worker, and training worker templates in
train_swe_bench_grpo.sh so SYS_PTRACE is only enabled when needed, ideally
behind a debug/diagnostic flag or similar gate, while keeping the existing
securityContext structure intact for the relevant worker specs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: fe3bdef7-52ff-46aa-8206-4d4a3a87e87b
📒 Files selected for processing (5)
infra/images/Containerfile.ray-swe-rlinfra/images/Containerfile.ray-swe-rl-basetrain_swe_bench_grpo.shtraining/swe_bench/deploy/setup-brev-storage.shtraining/swe_bench/deploy/setup-brev.sh
| RUN pip install --no-cache-dir \ | ||
| "https://github.com/lesj0610/flash-attention/releases/download/v2.8.3-cu12-torch2.10-cp312/flash_attn-2.8.3%2Bcu12torch2.10cxx11abiTRUE-cp312-cp312-linux_x86_64.whl" |
There was a problem hiding this comment.
🔒 Security & Privacy | 🟠 Major | 🏗️ Heavy lift
Supply-chain risk: flash-attn wheel pulled from an untrusted personal fork.
The wheel is fetched from github.com/lesj0610/flash-attention, a personal account rather than the official Dao-AILab/flash-attention releases or PyPI, and there is no hash/integrity verification. A compromised or deleted release would silently break or poison the image. Prefer the official release asset (or self-host/mirror the vetted wheel in your own registry) and pin with --require-hashes / a SHA256 to guarantee integrity and reproducibility.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@infra/images/Containerfile.ray-swe-rl-base` around lines 23 - 24, The
flash-attn wheel install in the Containerfile is pulling from an untrusted
personal GitHub release and lacks integrity verification. Update the image build
to use the official Dao-AILab/flash-attention release asset or a vetted internal
mirror, and pin the wheel with a SHA256 or --require-hashes. Keep the change
localized to the pip install step so the build remains reproducible and safe.
| relocate() { | ||
| local src="$1" | ||
| local dst="$2" | ||
| local label="$3" | ||
|
|
||
| if [ -L "$src" ]; then | ||
| echo "$label: $src is already a symlink, skipping." | ||
| return | ||
| fi | ||
|
|
||
| echo "$label: moving $src → $dst ..." | ||
| mkdir -p "$(dirname "$dst")" | ||
| mv "$src" "$dst" | ||
| ln -s "$dst" "$src" |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Guard against stale destinations before moving container data.
If a previous attempt created $dst but left $src as a real directory, mv "$src" "$dst" nests the data under the existing destination and then symlinks Docker/microk8s to the wrong level. Also verify existing symlinks point at the expected /ephemeral path.
Proposed hardening
relocate() {
local src="$1"
local dst="$2"
local label="$3"
if [ -L "$src" ]; then
- echo "$label: $src is already a symlink, skipping."
+ if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then
+ echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2
+ exit 1
+ fi
+ echo "$label: $src already points to $dst, skipping."
return
fi
+
+ if [ -e "$dst" ]; then
+ echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2
+ exit 1
+ fi
echo "$label: moving $src → $dst ..."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| relocate() { | |
| local src="$1" | |
| local dst="$2" | |
| local label="$3" | |
| if [ -L "$src" ]; then | |
| echo "$label: $src is already a symlink, skipping." | |
| return | |
| fi | |
| echo "$label: moving $src → $dst ..." | |
| mkdir -p "$(dirname "$dst")" | |
| mv "$src" "$dst" | |
| ln -s "$dst" "$src" | |
| relocate() { | |
| local src="$1" | |
| local dst="$2" | |
| local label="$3" | |
| if [ -L "$src" ]; then | |
| if [ "$(readlink -f "$src")" != "$(readlink -m "$dst")" ]; then | |
| echo "ERROR: $label symlink points to $(readlink -f "$src"), expected $dst" >&2 | |
| exit 1 | |
| fi | |
| echo "$label: $src already points to $dst, skipping." | |
| return | |
| fi | |
| if [ -e "$dst" ]; then | |
| echo "ERROR: $label destination $dst already exists; refusing to nest $src under it." >&2 | |
| exit 1 | |
| fi | |
| echo "$label: moving $src → $dst ..." | |
| mkdir -p "$(dirname "$dst")" | |
| mv "$src" "$dst" | |
| ln -s "$dst" "$src" |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 31 - 44, The
relocate helper currently moves data without checking whether the destination
already exists, which can cause nested directories and incorrect symlinks when
src is a real directory but dst was left behind from a previous run. Update
relocate to first validate any existing dst state, remove or reconcile stale
destinations before mv, and ensure the final symlink at src points to the
expected /ephemeral target. Also verify existing symlinks in relocate and skip
only when they already resolve to the intended destination, not just when src is
any symlink.
| echo "=== Starting microk8s ===" | ||
| microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true | ||
|
|
||
| echo "=== Waiting for microk8s to be ready ===" | ||
| microk8s status --wait-ready --timeout 120 2>/dev/null || \ | ||
| kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Do not mask microk8s restart/readiness failures.
The final || true lets the script report “Storage relocated” even when microk8s failed to start or become ready, leaving the bootstrap script to fail later with a less actionable error.
Proposed fix
echo "=== Starting microk8s ==="
-microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true
+microk8s start 2>/dev/null || snap start microk8s
echo "=== Waiting for microk8s to be ready ==="
microk8s status --wait-ready --timeout 120 2>/dev/null || \
- kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true
+ kubectl wait --for=condition=Ready node --all --timeout=120s📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo "=== Starting microk8s ===" | |
| microk8s start 2>/dev/null || snap start microk8s 2>/dev/null || true | |
| echo "=== Waiting for microk8s to be ready ===" | |
| microk8s status --wait-ready --timeout 120 2>/dev/null || \ | |
| kubectl wait --for=condition=Ready node --all --timeout=120s 2>/dev/null || true | |
| echo "=== Starting microk8s ===" | |
| microk8s start 2>/dev/null || snap start microk8s | |
| echo "=== Waiting for microk8s to be ready ===" | |
| microk8s status --wait-ready --timeout 120 2>/dev/null || \ | |
| kubectl wait --for=condition=Ready node --all --timeout=120s |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@training/swe_bench/deploy/setup-brev-storage.sh` around lines 68 - 73, The
microk8s startup/readiness path is swallowing failures via the final fallback in
the startup block and the readiness check, which can let the script continue as
if storage relocation succeeded. Update the setup-brev-storage.sh flow around
the microk8s start and readiness commands so failures are surfaced instead of
ending in a blanket success fallback, and ensure the script exits or reports a
clear error when microk8s cannot start or become ready.
| if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then | ||
| echo "WARNING: /var/lib/docker is not a symlink to /ephemeral." | ||
| echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" | ||
| echo " Then re-run this script." | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Validate all relocated storage paths, not only Docker metadata.
setup-brev-storage.sh moves /var/lib/docker, /var/lib/containerd, and microk8s containerd. Checking only /var/lib/docker allows the bootstrap to continue with Kubernetes image storage still on the small root disk.
Proposed validation
-if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then
- echo "WARNING: /var/lib/docker is not a symlink to /ephemeral."
- echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh"
- echo " Then re-run this script."
- exit 1
-fi
+for storage_path in \
+ /var/lib/docker \
+ /var/lib/containerd \
+ /var/snap/microk8s/common/var/lib/containerd
+do
+ if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then
+ echo "ERROR: $storage_path is not relocated to /ephemeral." >&2
+ echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2
+ exit 1
+ fi
+done📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if [ ! -L /var/lib/docker ] && [ -d /var/lib/docker ]; then | |
| echo "WARNING: /var/lib/docker is not a symlink to /ephemeral." | |
| echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" | |
| echo " Then re-run this script." | |
| exit 1 | |
| fi | |
| for storage_path in \ | |
| /var/lib/docker \ | |
| /var/lib/containerd \ | |
| /var/snap/microk8s/common/var/lib/containerd | |
| do | |
| if [ ! -L "$storage_path" ] || [[ "$(readlink -f "$storage_path")" != /ephemeral/* ]]; then | |
| echo "ERROR: $storage_path is not relocated to /ephemeral." >&2 | |
| echo " Run: sudo bash training/swe_bench/deploy/setup-brev-storage.sh" >&2 | |
| exit 1 | |
| fi | |
| done |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@training/swe_bench/deploy/setup-brev.sh` around lines 62 - 67, The bootstrap
validation in setup-brev.sh only checks /var/lib/docker, but it needs to verify
all relocated storage locations moved by setup-brev-storage.sh. Update the
storage sanity check to confirm /var/lib/docker, /var/lib/containerd, and the
microk8s containerd path are all correctly relocated to the ephemeral disk
before continuing; if any are still on the root disk, emit a warning and exit
like the existing Docker check.
| if docker ps --format '{{.Names}}' | grep -q '^registry$'; then | ||
| echo " Registry container already running." | ||
| else | ||
| echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..." | ||
| docker run -d \ | ||
| --restart=always \ | ||
| -p 5000:5000 \ | ||
| -v /ephemeral/registry:/var/lib/registry \ | ||
| --name registry \ | ||
| registry:2 | ||
| fi |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Handle an existing stopped registry container.
docker ps misses stopped containers; after a partial failure or manual stop, docker run --name registry fails because the name already exists.
Proposed fix
-if docker ps --format '{{.Names}}' | grep -q '^registry$'; then
+if docker ps --format '{{.Names}}' | grep -qx 'registry'; then
echo " Registry container already running."
+elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then
+ echo " Starting existing registry container."
+ docker start registry
else📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if docker ps --format '{{.Names}}' | grep -q '^registry$'; then | |
| echo " Registry container already running." | |
| else | |
| echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..." | |
| docker run -d \ | |
| --restart=always \ | |
| -p 5000:5000 \ | |
| -v /ephemeral/registry:/var/lib/registry \ | |
| --name registry \ | |
| registry:2 | |
| fi | |
| if docker ps --format '{{.Names}}' | grep -qx 'registry'; then | |
| echo " Registry container already running." | |
| elif docker ps -a --format '{{.Names}}' | grep -qx 'registry'; then | |
| echo " Starting existing registry container." | |
| docker start registry | |
| else | |
| echo " Starting registry on ${REGISTRY} (data: /ephemeral/registry)..." | |
| docker run -d \ | |
| --restart=always \ | |
| -p 5000:5000 \ | |
| -v /ephemeral/registry:/var/lib/registry \ | |
| --name registry \ | |
| registry:2 | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@training/swe_bench/deploy/setup-brev.sh` around lines 87 - 97, The registry
startup logic in setup-brev.sh only checks running containers via docker ps, so
a stopped registry container can still block docker run --name registry. Update
the registry handling block to detect an existing container named registry even
when it is not running, and start it instead of creating a new one; use the
existing registry check/startup flow around the docker ps / docker run section
to handle both running and stopped cases cleanly.
| echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..." | ||
| cat <<JOB_YAML | kubectl apply -f - | ||
| apiVersion: batch/v1 | ||
| kind: Job | ||
| metadata: | ||
| name: build-swe-dataset | ||
| spec: | ||
| backoffLimit: 0 | ||
| template: | ||
| spec: | ||
| restartPolicy: Never | ||
| containers: | ||
| - name: builder | ||
| image: ${RAY_IMAGE} | ||
| imagePullPolicy: Always | ||
| command: ["/bin/bash", "-c", "${BUILD_CMD}"] | ||
| volumeMounts: | ||
| - name: hf-cache | ||
| mountPath: /hf-cache | ||
| env: | ||
| - name: HF_HOME | ||
| value: /hf-cache | ||
| volumes: | ||
| - name: hf-cache | ||
| persistentVolumeClaim: | ||
| claimName: hf-cache-0 | ||
| JOB_YAML | ||
|
|
||
| echo " Waiting for dataset Job to complete..." | ||
| kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s | ||
| kubectl delete job build-swe-dataset |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Delete stale dataset Jobs before creating a new one.
A failed or interrupted build-swe-dataset Job is left behind. On rerun, kubectl apply can hit immutable Job-template errors or wait on the stale failed Job instead of starting a fresh dataset build.
Proposed fix
fi
echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..."
+ kubectl delete job build-swe-dataset --ignore-not-found --wait=true
cat <<JOB_YAML | kubectl apply -f -
apiVersion: batch/v1
kind: Job📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..." | |
| cat <<JOB_YAML | kubectl apply -f - | |
| apiVersion: batch/v1 | |
| kind: Job | |
| metadata: | |
| name: build-swe-dataset | |
| spec: | |
| backoffLimit: 0 | |
| template: | |
| spec: | |
| restartPolicy: Never | |
| containers: | |
| - name: builder | |
| image: ${RAY_IMAGE} | |
| imagePullPolicy: Always | |
| command: ["/bin/bash", "-c", "${BUILD_CMD}"] | |
| volumeMounts: | |
| - name: hf-cache | |
| mountPath: /hf-cache | |
| env: | |
| - name: HF_HOME | |
| value: /hf-cache | |
| volumes: | |
| - name: hf-cache | |
| persistentVolumeClaim: | |
| claimName: hf-cache-0 | |
| JOB_YAML | |
| echo " Waiting for dataset Job to complete..." | |
| kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s | |
| kubectl delete job build-swe-dataset | |
| echo " Running build_dataset.py via K8s Job (model: ${MODEL_PATH})..." | |
| kubectl delete job build-swe-dataset --ignore-not-found --wait=true | |
| cat <<JOB_YAML | kubectl apply -f - | |
| apiVersion: batch/v1 | |
| kind: Job | |
| metadata: | |
| name: build-swe-dataset | |
| spec: | |
| backoffLimit: 0 | |
| template: | |
| spec: | |
| restartPolicy: Never | |
| containers: | |
| - name: builder | |
| image: ${RAY_IMAGE} | |
| imagePullPolicy: Always | |
| command: ["/bin/bash", "-c", "${BUILD_CMD}"] | |
| volumeMounts: | |
| - name: hf-cache | |
| mountPath: /hf-cache | |
| env: | |
| - name: HF_HOME | |
| value: /hf-cache | |
| volumes: | |
| - name: hf-cache | |
| persistentVolumeClaim: | |
| claimName: hf-cache-0 | |
| JOB_YAML | |
| echo " Waiting for dataset Job to complete..." | |
| kubectl wait --for=condition=complete job/build-swe-dataset --timeout=600s | |
| kubectl delete job build-swe-dataset |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@training/swe_bench/deploy/setup-brev.sh` around lines 203 - 233, Before
creating the build-swe-dataset Job in setup-brev.sh, remove any existing Job
with the same name so reruns don’t reuse a stale failed or interrupted job.
Update the section around the kubectl apply/wait/delete flow to delete or
replace build-swe-dataset prior to applying the new Job manifest, and make sure
the cleanup is robust even if the previous run exited early. Use the existing
job name build-swe-dataset and the kubectl wait/kubectl delete logic to locate
the change.
Summary by CodeRabbit
New Features
Bug Fixes