Danny1218/quantization-autopsy

Quantization Autopsy

Quantization Autopsy is a paired, capability-level GGUF quantization fragility benchmark. It asks which narrow capability families degrade first as precision is reduced, and whether the resulting death order (the sequence in which capabilities fail) found on one model family transfers to another.

Final Result

GGUF quantization fragility is model-family-specific: Qwen2.5-3B mostly degraded at Q2_K, while SmolLM2 1.7B showed earlier Q4_K_M/Q3_K_M degradation on shared calibrated probes.

The practical deployment conclusion is that Q2_K is broadly risky in both tested models, but Q4_K_M/Q3_K_M safety did not transfer cleanly from Qwen2.5-3B to SmolLM2.

| Model | Q4 degradations | Q3 degradations | Q2 degradations | Conclusion |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | 0 | 1 | 6 | mostly Q2-only |
| SmolLM2 1.7B | 1 | 2 | 4 | earlier Q4/Q3 degradation |

Final Package

  • final_report/FINAL_REPORT.md - full reproducible research report.
  • final_report/ABSTRACT.md - concise abstract.
  • final_report/PUBLIC_SUMMARY.md - public/blog-style summary.
  • final_report/CLAIMS_AND_CAVEATS.md - supported and unsupported claims.
  • final_report/REPRODUCIBILITY.md - commands and artifact assumptions.
  • final_report/LIMITATIONS.md - methodological, scoring, model-selection, and deployment limits.
  • final_report/RESULTS_INDEX.md - source artifact map.
  • final_report/CLAIM_CHECK.csv - generated claim-check table for headline numerical claims.

What This Does Not Claim

  • It is not a universal law over all models.
  • It is not a full seven-level ladder for every model.
  • It is not human semantic grading.
  • It is not a guarantee for arbitrary deployment tasks.

Verify The Final Package

Regenerate the final report package:

.\.venv\Scripts\python.exe src\write_final_report.py

Check final-report claims and required release files:

.\.venv\Scripts\python.exe src\check_final_report_claims.py
.\.venv\Scripts\python.exe src\verify_final_package.py

Hardware Choice

This workspace machine has an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM and about 39 GB free on C:. That falls short of what the 7B path in the project plan needs, so the checked-in config uses the smaller fallback model for all phases:

qwen2.5-3b-instruct

Users with 24 GB VRAM and enough disk can switch MODEL_NAME in src/config.py to qwen2.5-7b-instruct.

Setup

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install llama-cpp-python --prefer-binary --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
$env:HF_HOME = "$PWD\hf_cache"

If the CUDA llama-cpp-python wheel is unavailable for the local Python version, install the CPU wheel and record the fallback in results/notes.md:

python -m pip install llama-cpp-python --prefer-binary

llama.cpp Binaries

Download the latest Windows CUDA 12 x64 llama.cpp release assets and extract both archives into tools\llama.cpp\:

mkdir tools
# Replace bXXXX with the latest build tag from https://github.com/ggml-org/llama.cpp/releases/latest
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/llama-bXXXX-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\llamacpp.zip"
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/cudart-llama-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\cudart.zip"
Expand-Archive tools\llamacpp.zip -DestinationPath tools\llama.cpp
Expand-Archive tools\cudart.zip -DestinationPath tools\llama.cpp
.\tools\llama.cpp\llama-quantize.exe --help

Build Probes

python -m src.build_probes
git add probes
git commit -m "Freeze probe battery"

Validation:

python -c "import json,pathlib; files=list(pathlib.Path('probes').glob('*.jsonl')); assert len(files)==12; [json.loads(line) for f in files for line in f.open(encoding='utf-8')]; assert all(sum(1 for _ in f.open(encoding='utf-8'))==100 for f in files); print('probe files:', len(files))"
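
The one-liner above is compact but reports little when a check fails. A more readable sketch of the same three checks (12 files, 100 lines per file, every line valid JSON) is shown here. The function name `validate_probes` and its parameters are illustrative and are not part of the repository:

```python
import json
from pathlib import Path


def validate_probes(probe_dir="probes", expected_files=12, expected_lines=100):
    """Check file count, per-file line count, and that every line parses as JSON.

    Hypothetical helper; mirrors the checks in the validation one-liner.
    """
    files = sorted(Path(probe_dir).glob("*.jsonl"))
    if len(files) != expected_files:
        raise ValueError(f"expected {expected_files} probe files, found {len(files)}")
    for path in files:
        # Read once, closing the handle, so counting and parsing see the same lines.
        with path.open(encoding="utf-8") as handle:
            lines = handle.readlines()
        if len(lines) != expected_lines:
            raise ValueError(
                f"{path.name}: expected {expected_lines} lines, found {len(lines)}"
            )
        for number, line in enumerate(lines, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path.name}:{number}: invalid JSON ({exc.msg})") from exc
    return len(files)
```

Calling `validate_probes()` from the repository root returns the file count on success and raises a `ValueError` naming the offending file and line otherwise.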

Build Model Ladder

Download the FP16 GGUF:

huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-fp16.gguf --local-dir models

If the FP16 GGUF is unavailable, download HF weights and convert with llama.cpp's convert_hf_to_gguf.py.

Quantize each level in turn, always from the FP16 GGUF rather than from a previously quantized file:

.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q8_0.gguf Q8_0
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q6_k.gguf Q6_K
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q5_k_m.gguf Q5_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q4_k_m.gguf Q4_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q3_k_m.gguf Q3_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q2_k.gguf Q2_K
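
The six commands differ only in the quant type and the output suffix, so they can be generated rather than typed. The sketch below assumes the same file naming as the commands above. The helper name `quantize_commands` is hypothetical, and it only builds argument lists without running anything:

```python
from pathlib import PureWindowsPath

# Quant types taken from the commands above; FP16 is the shared source file.
QUANT_TYPES = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def quantize_commands(model_stem="qwen2.5-3b-instruct", models_dir="models",
                      quantize_exe=r".\tools\llama.cpp\llama-quantize.exe"):
    """Return one llama-quantize argument list per quant type.

    Every command reads the FP16 GGUF, so no level is derived from another
    quantized file.
    """
    models = PureWindowsPath(models_dir)
    source = models / f"{model_stem}-fp16.gguf"
    commands = []
    for quant in QUANT_TYPES:
        target = models / f"{model_stem}-{quant.lower()}.gguf"
        commands.append([quantize_exe, str(source), str(target), quant])
    return commands


for command in quantize_commands():
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` on the Windows machine to stop at the first failed quantization.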

Sanity run each model:

foreach ($q in "fp16","q8_0","q6_k","q5_k_m","q4_k_m","q3_k_m","q2_k") {
  .\tools\llama.cpp\llama-cli.exe -m "models\qwen2.5-3b-instruct-$q.gguf" -p "2+2=" -n 8 -ngl 99
}

Run Evaluation

Smoke test:

python -m src.run_eval --quant q4_k_m --probe char_manipulation
python -m src.run_eval --quant q4_k_m --probe winogrande

Full sweep:

python -m src.run_sweep

Use partial offload if FP16 does not fit:

python -m src.run_eval --quant fp16 --ngl 24

Analyze

python -m src.analyze
python -m src.verify_report_claims

V1 audit

The audit layer reads the frozen v1 artifacts and writes new outputs under results/audit/ and results/plots/audit/. It does not overwrite probes/, results/raw/, or the original v1 report.

python src\run_audit.py
python src\validate_audit.py
python src\sample_failures.py --probe arithmetic --quant q4_k_m --mode fp16_correct_quant_wrong --n 25

Primary audit outputs:

  • results/REPORT_v1_AUDIT.md
  • results/audit/paired_flip_table.csv
  • results/audit/paired_significance.csv
  • results/audit/death_persistence.csv
  • results/audit/calibration_report.csv
  • results/plots/audit/paired_net_loss_heatmap.png
  • results/plots/audit/retention_among_fp16_correct_overlay.png
  • results/plots/audit/baseline_calibration_bar.png
  • results/plots/audit/death_persistence_comparison.png
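
The audit is paired: each prompt is scored at FP16 and again at a quant level, and only the prompts whose outcome changes carry information. The sketch below shows one plausible way to compute the quantities the output names suggest (flips, net loss, retention among FP16-correct items). The function name, the dictionary keys, and the exact definitions are assumptions, not the repository's implementation:

```python
def paired_flip_summary(fp16_correct, quant_correct):
    """Compare per-item correctness for the same prompts at FP16 and at one quant level."""
    if len(fp16_correct) != len(quant_correct):
        raise ValueError("paired comparison needs the same items in the same order")
    pairs = list(zip(fp16_correct, quant_correct))
    # Items FP16 got right and the quantized model got wrong
    # (the fp16_correct_quant_wrong mode of sample_failures.py).
    losses = sum(1 for base, quant in pairs if base and not quant)
    # Items FP16 got wrong and the quantized model got right.
    gains = sum(1 for base, quant in pairs if not base and quant)
    base_right = sum(1 for base, _ in pairs if base)
    retained = base_right - losses
    return {
        "losses": losses,
        "gains": gains,
        "net_loss": losses - gains,
        # Share of FP16-correct items that stay correct; undefined if FP16 got none right.
        "retention": retained / base_right if base_right else None,
    }


# Toy data for illustration only, not taken from the results.
example = paired_flip_summary(
    [True, True, True, False, True],
    [True, False, True, True, False],
)
print(example)
```

Conditioning retention on FP16-correct items keeps a probe with a weak FP16 baseline from looking robust simply because there was little to lose.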

The v2 calibrated follow-up plan lives in docs/V2_PLAN.md.

V2 calibrated probes

Build and validate the calibrated v2 prompt suite without running a model:

.\.venv\Scripts\python.exe src\build_probes_v2.py
.\.venv\Scripts\python.exe src\validate_probes_v2.py
.\.venv\Scripts\python.exe src\run_v2_targeted.py --dry-run

Run FP16 calibration first. This writes v2 raw files under results_v2\raw\ and a baseline report to results_v2\calibration_fp16.csv:

.\.venv\Scripts\python.exe src\calibrate_v2.py --quant fp16

After the FP16 calibration report indicates that the probe set is usable, run the targeted v2 sweep:

.\.venv\Scripts\python.exe src\run_v2_targeted.py
.\.venv\Scripts\python.exe src\analyze_v2.py
.\.venv\Scripts\python.exe src\write_report_v2.py
.\.venv\Scripts\python.exe src\decide_next_phase.py

To inspect v2 paired failures after a sweep:

.\.venv\Scripts\python.exe src\sample_failures.py --experiment v2 --probe arithmetic_easy --quant q4_k_m --mode fp16_correct_quant_wrong --n 25

Expected outputs:

  • results_v2/summary.csv
  • results_v2/paired_significance.csv
  • results_v2/death_persistence.csv
  • results_v2/fragility_index.csv
  • results_v2/REPORT_v2_TARGETED.md
  • results_v2/plots/v2_accuracy_grid.png
  • results_v2/plots/v2_retention_overlay.png
  • results_v2/plots/v2_paired_net_loss_heatmap.png
  • results_v2/plots/v2_death_persistence.png
  • results_v2/plots/v2_fragility_bar.png
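
This README does not state which test produces `paired_significance.csv`. For paired binary outcomes, an exact McNemar test on the discordant pairs is one common choice, sketched below as an assumption rather than as the project's method. The function name `exact_mcnemar_p` is hypothetical:

```python
from math import comb


def exact_mcnemar_p(losses, gains):
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    losses: items correct at FP16 but wrong after quantization.
    gains:  items wrong at FP16 but correct after quantization.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = losses + gains
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(losses, gains)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy counts for illustration only.
print(exact_mcnemar_p(8, 0))
print(exact_mcnemar_p(3, 3))
```

Items that are correct at both precisions, or wrong at both, do not enter the test, which is why a paired design can detect a shift that a comparison of raw accuracies would miss.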

Write Report

After src.analyze has produced the CSVs and plots, fill results/REPORT.md with the death-order table, plots, statistical notes, limitations, and practical recommendations. Every quantitative claim must trace to results/summary.csv or another generated CSV; src.verify_report_claims checks the headline snippets against those generated files.
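
The traceability rule can be pictured as a lookup: each headline number in the report names a row and a column in a generated CSV, and the check fails if the two disagree. The sketch below is a minimal illustration of that idea. The CSV layout, column names, and the `check_claims` helper are hypothetical and do not describe how `src.verify_report_claims` works internally; the sample values are the headline counts from the Final Result table:

```python
import csv
import io


def check_claims(csv_text, claims, key_column="model"):
    """Return the claims whose stated value differs from the generated CSV.

    claims: iterable of (row key, column name, expected value).
    """
    rows = {row[key_column]: row for row in csv.DictReader(io.StringIO(csv_text))}
    mismatches = []
    for key, column, expected in claims:
        # A missing row or column yields None, which also counts as a mismatch.
        actual = rows.get(key, {}).get(column)
        if actual != str(expected):
            mismatches.append((key, column, expected, actual))
    return mismatches


# Hypothetical CSV layout holding the headline degradation counts.
SAMPLE_CSV = """model,q4_degradations,q3_degradations,q2_degradations
Qwen2.5-3B,0,1,6
SmolLM2 1.7B,1,2,4
"""

claims = [
    ("Qwen2.5-3B", "q2_degradations", 6),
    ("SmolLM2 1.7B", "q4_degradations", 1),
]
print(check_claims(SAMPLE_CSV, claims))
```

An empty list means every claim matched; any entry shows the expected and actual values side by side.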

End-to-end reproduction once the model ladder exists:

python -m src.build_probes
python -m src.run_sweep
python -m src.analyze
python -m src.verify_report_claims

Cross-model transfer validation

The Qwen2.5-3B run in results_v2/ is the frozen primary v2 run. Cross-model validation writes separate transfer outputs under results_transfer\{model_key}\ and does not overwrite results_v2\REPORT_v2_TARGETED.md, probes_v2_calibrated\, or primary raw outputs.

The built-in registry key for the primary model is qwen2_5_3b. To add a second local model family, create models\model_registry.local.json using the template:

.\.venv\Scripts\python.exe src\model_registry.py --show-template

After editing the local registry, check the resolved GGUF paths:

.\.venv\Scripts\python.exe src\model_registry.py --model-key YOUR_SECOND_MODEL

Dry-run the transfer matrix before any model load:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --dry-run
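
A dry run amounts to enumerating the quant-by-probe matrix and printing what would be executed, without touching a GGUF file. The sketch below illustrates that idea only. The function name, the probe list, and the output file naming under `results_transfer/{model_key}/` are placeholders, not the behavior of `run_transfer_targeted.py`:

```python
from itertools import product


def plan_transfer_jobs(model_key, quants, probes):
    """List (quant, probe, output path) jobs without loading any model.

    The raw-file naming scheme here is hypothetical.
    """
    return [
        (quant, probe, f"results_transfer/{model_key}/raw/{quant}__{probe}.jsonl")
        for quant, probe in product(quants, probes)
    ]


# Placeholder probe names; the real default transfer probes may differ.
jobs = plan_transfer_jobs(
    "YOUR_SECOND_MODEL",
    quants=["q4_k_m", "q3_k_m", "q2_k"],
    probes=["arithmetic_easy", "char_manipulation"],
)
for quant, probe, path in jobs:
    print(f"would run {probe} at {quant} -> {path}")
```

Reviewing the printed plan first makes it easy to confirm that every output lands under the transfer directory and nothing targets the frozen primary run.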

Run FP16 calibration first. Continue to lower quant levels only if at least five default transfer probes have usable baselines and the script does not print DO_NOT_RUN_TRANSFER_SWEEP_YET:

.\.venv\Scripts\python.exe src\calibrate_transfer.py --model-key YOUR_SECOND_MODEL
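
The gate described above can be sketched as a count of probes whose FP16 baseline is strong enough to measure degradation against. This README does not define "usable", so the accuracy floor below is a placeholder assumption, and the function name `calibration_gate` is hypothetical. Only the five-probe minimum and the `DO_NOT_RUN_TRANSFER_SWEEP_YET` marker come from the text:

```python
def calibration_gate(baseline_accuracy, floor=0.6, min_usable=5):
    """Decide whether enough probes have a usable FP16 baseline.

    baseline_accuracy: mapping of probe name to FP16 accuracy in [0, 1].
    floor: placeholder threshold for a usable baseline (an assumption).
    Returns (passed, sorted list of usable probe names).
    """
    usable = sorted(
        probe for probe, accuracy in baseline_accuracy.items() if accuracy >= floor
    )
    passed = len(usable) >= min_usable
    if not passed:
        print("DO_NOT_RUN_TRANSFER_SWEEP_YET")
    return passed, usable


# Toy baselines for illustration only, with generic probe names.
toy_baselines = {
    "probe_a": 0.91, "probe_b": 0.84, "probe_c": 0.77,
    "probe_d": 0.66, "probe_e": 0.42, "probe_f": 0.95,
}
passed, usable = calibration_gate(toy_baselines)
print(passed, usable)
```

Gating on the baseline first avoids spending GPU time on quant levels for probes the model cannot solve even at full precision.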

After calibration passes, run the targeted transfer sweep one quant at a time:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q4_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q3_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q2_k
.\.venv\Scripts\python.exe src\analyze_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\compare_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\write_transfer_report.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\decide_transfer_next_phase.py --model-key YOUR_SECOND_MODEL

Optional long-context control probe:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --include-controls --dry-run