Quantization Autopsy is a paired, capability-level GGUF quantization fragility benchmark. It asks which narrow capability families degrade first as precision is reduced, and whether a death order found on one model family transfers to another.
GGUF quantization fragility is model-family-specific: Qwen2.5-3B mostly degraded at Q2_K, while SmolLM2 1.7B showed earlier Q4_K_M/Q3_K_M degradation on shared calibrated probes.
The practical deployment conclusion is that Q2_K is broadly risky in both tested models, but Q4_K_M/Q3_K_M safety did not transfer cleanly from Qwen2.5-3B to SmolLM2.
| model | Q4 degradations | Q3 degradations | Q2 degradations | conclusion |
|---|---|---|---|---|
| Qwen2.5-3B | 0 | 1 | 6 | mostly Q2-only |
| SmolLM2 1.7B | 1 | 2 | 4 | earlier Q4/Q3 degradation |
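To make the "earlier degradation" comparison concrete, the following minimal sketch hard-codes the counts from the table above and reports the highest-precision level at which each model shows any degradation. The dictionary layout and the function name `first_degraded_level` are illustrative assumptions and are not part of the repository.

```python
# Degradation counts copied from the summary table above.
DEGRADATIONS = {
    "Qwen2.5-3B": {"Q4": 0, "Q3": 1, "Q2": 6},
    "SmolLM2 1.7B": {"Q4": 1, "Q3": 2, "Q2": 4},
}

# Quant levels ordered from highest to lowest precision.
LADDER = ["Q4", "Q3", "Q2"]


def first_degraded_level(counts):
    """Return the highest-precision level with at least one degradation, or None."""
    for level in LADDER:
        if counts[level] > 0:
            return level
    return None


for model, counts in DEGRADATIONS.items():
    print(model, "first degrades at", first_degraded_level(counts))
```

On these counts the first degraded level is Q3 for Qwen2.5-3B and Q4 for SmolLM2 1.7B, which is the non-transfer the conclusion column summarizes.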
- final_report/FINAL_REPORT.md - full reproducible research report.
- final_report/ABSTRACT.md - concise abstract.
- final_report/PUBLIC_SUMMARY.md - public/blog-style summary.
- final_report/CLAIMS_AND_CAVEATS.md - supported and unsupported claims.
- final_report/REPRODUCIBILITY.md - commands and artifact assumptions.
- final_report/LIMITATIONS.md - methodological, scoring, model-selection, and deployment limits.
- final_report/RESULTS_INDEX.md - source artifact map.
- final_report/CLAIM_CHECK.csv - generated claim-check table for headline numerical claims.
- It is not a universal law over all models.
- It is not a full seven-level ladder for every model.
- It is not human semantic grading.
- It is not a guarantee for arbitrary deployment tasks.
Regenerate the final report package:
.\.venv\Scripts\python.exe src\write_final_report.py
Check final-report claims and required release files:
.\.venv\Scripts\python.exe src\check_final_report_claims.py
.\.venv\Scripts\python.exe src\verify_final_package.py
This workspace machine has an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM and about 39 GB free on C:. That is below the 7B path in the project plan, so the checked-in config uses the smaller fallback model for all phases:
qwen2.5-3b-instruct
Users with 24 GB VRAM and enough disk can switch MODEL_NAME in src/config.py to qwen2.5-7b-instruct.
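The model choice described above can be sketched as a small helper. This is a hypothetical illustration: only `MODEL_NAME` in src/config.py is known from this README, so the function `pick_model_name` and the use of 24 GB as a hard cutoff are assumptions, and free disk space is not checked.

```python
def pick_model_name(vram_gb: float) -> str:
    """Suggest a value for MODEL_NAME based on available VRAM (hypothetical helper)."""
    # 24 GB is the VRAM figure this README gives for switching to the 7B model.
    if vram_gb >= 24:
        return "qwen2.5-7b-instruct"
    # Anything smaller stays on the checked-in fallback model.
    return "qwen2.5-3b-instruct"


# The 8 GB laptop GPU described above stays on the 3B fallback.
print(pick_model_name(8))
```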
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install llama-cpp-python --prefer-binary --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
$env:HF_HOME = "$PWD\hf_cache"
If the CUDA llama-cpp-python wheel is unavailable for the local Python version, install the CPU wheel and record the fallback in results/notes.md:
python -m pip install llama-cpp-python --prefer-binary
Download the latest Windows CUDA 12 x64 llama.cpp release assets and extract both archives into tools\llama.cpp\:
mkdir tools
# Replace bXXXX with the latest build tag from https://github.com/ggml-org/llama.cpp/releases/latest
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/llama-bXXXX-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\llamacpp.zip"
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/cudart-llama-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\cudart.zip"
Expand-Archive tools\llamacpp.zip -DestinationPath tools\llama.cpp
Expand-Archive tools\cudart.zip -DestinationPath tools\llama.cpp
.\tools\llama.cpp\llama-quantize.exe --help
python -m src.build_probes
git add probes
git commit -m "Freeze probe battery"
Validation:
python -c "import json,pathlib; files=list(pathlib.Path('probes').glob('*.jsonl')); assert len(files)==12; [json.loads(line) for f in files for line in f.open(encoding='utf-8')]; assert all(sum(1 for _ in f.open(encoding='utf-8'))==100 for f in files); print('probe files:', len(files))"
Download the FP16 GGUF:
huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-fp16.gguf --local-dir models
If the FP16 GGUF is unavailable, download HF weights and convert with llama.cpp's convert_hf_to_gguf.py.
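The one-line probe validation above is compact but reports little when it fails. A more readable equivalent might look like the sketch below. It performs the same three checks (12 files, every line parses as JSON, 100 lines per file); the function name `validate_probes` and its parameters are assumptions and are not part of the repository.

```python
import json
from pathlib import Path


def validate_probes(probe_dir="probes", expected_files=12, expected_lines=100):
    """Check file count, JSON validity, and line count for the probe battery."""
    files = sorted(Path(probe_dir).glob("*.jsonl"))
    if len(files) != expected_files:
        raise ValueError(f"expected {expected_files} probe files, found {len(files)}")
    for path in files:
        with path.open(encoding="utf-8") as handle:
            lines = handle.readlines()
        # Every line must be a standalone JSON document.
        for number, line in enumerate(lines, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path.name}:{number}: invalid JSON ({exc})") from exc
        if len(lines) != expected_lines:
            raise ValueError(
                f"{path.name}: expected {expected_lines} lines, found {len(lines)}"
            )
    return len(files)


# Usage from the repository root:
#     print("probe files:", validate_probes())
```

Unlike the bare `assert` version, a failure here names the offending file and line.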
Quantize sequentially:
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q8_0.gguf Q8_0
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q6_k.gguf Q6_K
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q5_k_m.gguf Q5_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q4_k_m.gguf Q4_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q3_k_m.gguf Q3_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q2_k.gguf Q2_K
Sanity run each model:
foreach ($q in "fp16","q8_0","q6_k","q5_k_m","q4_k_m","q3_k_m","q2_k") {
    .\tools\llama.cpp\llama-cli.exe -m "models\qwen2.5-3b-instruct-$q.gguf" -p "2+2=" -n 8 -ngl 99
}
Smoke test:
python -m src.run_eval --quant q4_k_m --probe char_manipulation
python -m src.run_eval --quant q4_k_m --probe winogrande
Full sweep:
python -m src.run_sweep
Use partial offload if FP16 does not fit:
python -m src.run_eval --quant fp16 --ngl 24
python -m src.analyze
python -m src.verify_report_claims
The audit layer reads the frozen v1 artifacts and writes new outputs under results/audit/ and results/plots/audit/. It does not overwrite probes/, results/raw/, or the original v1 report.
python src\run_audit.py
python src\validate_audit.py
python src\sample_failures.py --probe arithmetic --quant q4_k_m --mode fp16_correct_quant_wrong --n 25
Primary audit outputs:
- results/REPORT_v1_AUDIT.md
- results/audit/paired_flip_table.csv
- results/audit/paired_significance.csv
- results/audit/death_persistence.csv
- results/audit/calibration_report.csv
- results/plots/audit/paired_net_loss_heatmap.png
- results/plots/audit/retention_among_fp16_correct_overlay.png
- results/plots/audit/baseline_calibration_bar.png
- results/plots/audit/death_persistence_comparison.png
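For readers unfamiliar with paired analysis, the sketch below shows one common way to build a paired flip count and a significance value from per-item correctness under FP16 and under a quantized model. It uses an exact two-sided McNemar test on the discordant pairs. Whether the audit scripts use this particular test is not stated in this README, so treat the choice of test and the function names as assumptions.

```python
from math import comb


def paired_flips(fp16_correct, quant_correct):
    """Count discordant pairs over the same items scored under both models."""
    if len(fp16_correct) != len(quant_correct):
        raise ValueError("both runs must score the same items")
    pairs = list(zip(fp16_correct, quant_correct))
    # "lost" corresponds to the fp16_correct_quant_wrong mode used above.
    lost = sum(1 for fp16, quant in pairs if fp16 and not quant)
    gained = sum(1 for fp16, quant in pairs if quant and not fp16)
    return lost, gained


def mcnemar_exact_p(lost, gained):
    """Two-sided exact McNemar p-value: binomial test on discordant pairs, p = 0.5."""
    n = lost + gained
    if n == 0:
        return 1.0
    k = min(lost, gained)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


lost, gained = paired_flips([True, True, True, False, True],
                            [True, False, False, True, True])
print("lost:", lost, "gained:", gained, "net loss:", lost - gained)
print("p-value:", mcnemar_exact_p(lost, gained))
```

Only items whose outcome changes between the two runs carry information in this test, which is why pairing on identical prompts matters more than raw accuracy differences.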
The v2 calibrated follow-up plan lives in docs/V2_PLAN.md.
Build and validate the calibrated v2 prompt suite without running a model:
.\.venv\Scripts\python.exe src\build_probes_v2.py
.\.venv\Scripts\python.exe src\validate_probes_v2.py
.\.venv\Scripts\python.exe src\run_v2_targeted.py --dry-run
Run FP16 calibration first. This writes v2 raw files under results_v2\raw\ and a baseline report to results_v2\calibration_fp16.csv:
.\.venv\Scripts\python.exe src\calibrate_v2.py --quant fp16
After the FP16 calibration report says the probe set is usable enough to continue, run the targeted v2 sweep:
.\.venv\Scripts\python.exe src\run_v2_targeted.py
.\.venv\Scripts\python.exe src\analyze_v2.py
.\.venv\Scripts\python.exe src\write_report_v2.py
.\.venv\Scripts\python.exe src\decide_next_phase.py
To inspect v2 paired failures after a sweep:
.\.venv\Scripts\python.exe src\sample_failures.py --experiment v2 --probe arithmetic_easy --quant q4_k_m --mode fp16_correct_quant_wrong --n 25
Expected outputs:
- results_v2/summary.csv
- results_v2/paired_significance.csv
- results_v2/death_persistence.csv
- results_v2/fragility_index.csv
- results_v2/REPORT_v2_TARGETED.md
- results_v2/plots/v2_accuracy_grid.png
- results_v2/plots/v2_retention_overlay.png
- results_v2/plots/v2_paired_net_loss_heatmap.png
- results_v2/plots/v2_death_persistence.png
- results_v2/plots/v2_fragility_bar.png
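After a sweep, a quick way to confirm that every expected v2 output was produced is to check the paths listed above. The sketch below is a standalone convenience and not a script shipped with the repository; the helper name `missing_outputs` is an assumption.

```python
from pathlib import Path

# Paths copied from the expected-outputs list above.
EXPECTED_V2_OUTPUTS = [
    "results_v2/summary.csv",
    "results_v2/paired_significance.csv",
    "results_v2/death_persistence.csv",
    "results_v2/fragility_index.csv",
    "results_v2/REPORT_v2_TARGETED.md",
    "results_v2/plots/v2_accuracy_grid.png",
    "results_v2/plots/v2_retention_overlay.png",
    "results_v2/plots/v2_paired_net_loss_heatmap.png",
    "results_v2/plots/v2_death_persistence.png",
    "results_v2/plots/v2_fragility_bar.png",
]


def missing_outputs(root=".", expected=EXPECTED_V2_OUTPUTS):
    """Return the expected paths that do not exist as files under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]


# Usage from the repository root:
#     for rel in missing_outputs():
#         print("missing:", rel)
```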
After src.analyze has produced the CSVs and plots, fill results/REPORT.md with the death-order table, plots, statistical notes, limitations, and practical recommendations. Every quantitative claim must trace to results/summary.csv or another generated CSV; src.verify_report_claims checks the headline snippets against those generated files.
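The traceability rule above can be illustrated with a small check that compares a claimed number against a row in a generated CSV. This is not the implementation of src.verify_report_claims. The column names (`probe`, `quant`, `accuracy`), the sample rows, and the function `check_claim` are all hypothetical placeholders chosen for the example.

```python
import csv
import io


def check_claim(rows, key, column, claimed, tol=1e-6):
    """Return True if exactly one row matches key and its column equals claimed."""
    matches = [row for row in rows if all(row.get(k) == v for k, v in key.items())]
    if len(matches) != 1:
        # A claim that maps to zero or several rows is not traceable.
        return False
    return abs(float(matches[0][column]) - claimed) <= tol


# Hypothetical CSV layout and placeholder numbers, for illustration only.
SAMPLE_CSV = """probe,quant,accuracy
example_probe,fp16,0.90
example_probe,q2_k,0.50
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(check_claim(rows, {"probe": "example_probe", "quant": "q2_k"}, "accuracy", 0.50))
```

Requiring exactly one matching row keeps an ambiguous claim from passing by accident.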
End-to-end reproduction once the model ladder exists:
python -m src.build_probes
python -m src.run_sweep
python -m src.analyze
python -m src.verify_report_claims
The Qwen2.5-3B run in results_v2/ is the frozen primary v2 run. Cross-model validation writes separate transfer outputs under results_transfer\{model_key}\ and does not overwrite results_v2\REPORT_v2_TARGETED.md, probes_v2_calibrated\, or primary raw outputs.
The built-in registry key for the primary model is qwen2_5_3b. To add a second local model family, create models\model_registry.local.json using the template:
.\.venv\Scripts\python.exe src\model_registry.py --show-template
After editing the local registry, check the resolved GGUF paths:
.\.venv\Scripts\python.exe src\model_registry.py --model-key YOUR_SECOND_MODEL
Dry-run the transfer matrix before any model load:
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --dry-run
Run FP16 calibration first. Continue to lower quant levels only if at least five default transfer probes have usable baselines and the script does not print DO_NOT_RUN_TRANSFER_SWEEP_YET:
.\.venv\Scripts\python.exe src\calibrate_transfer.py --model-key YOUR_SECOND_MODEL
After calibration passes, run the targeted transfer sweep one quant at a time:
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q4_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q3_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q2_k
.\.venv\Scripts\python.exe src\analyze_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\compare_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\write_transfer_report.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\decide_transfer_next_phase.py --model-key YOUR_SECOND_MODEL
Optional long-context control probe:
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --include-controls --dry-run
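The calibration gate described earlier for transfer runs (at least five default transfer probes with usable baselines, and no DO_NOT_RUN_TRANSFER_SWEEP_YET in the script output) can be expressed as a short predicate. This README does not define what makes a baseline usable, so the sketch takes that decision as a boolean per probe. The function name and input shapes are assumptions and the probe names in the example are placeholders.

```python
MIN_USABLE_PROBES = 5  # threshold stated in this README
SENTINEL = "DO_NOT_RUN_TRANSFER_SWEEP_YET"  # marker printed by the calibration script


def may_run_transfer_sweep(usable_by_probe, calibration_output):
    """Return True only if both gate conditions from the README hold."""
    usable = sum(1 for is_usable in usable_by_probe.values() if is_usable)
    return usable >= MIN_USABLE_PROBES and SENTINEL not in calibration_output


# Placeholder probe names: five usable baselines and one unusable baseline.
example = {f"probe_{i}": True for i in range(5)}
example["probe_5"] = False
print(may_run_transfer_sweep(example, "calibration finished"))
```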