Quantization Autopsy

Quantization Autopsy is a paired, capability-level GGUF quantization fragility benchmark. It asks which narrow capability families degrade first as precision is reduced, and whether the resulting "death order" (the sequence in which capabilities fail) found on one model family transfers to another.

Final Result

GGUF quantization fragility is model-family-specific: Qwen2.5-3B mostly degraded at Q2_K, while SmolLM2 1.7B showed earlier Q4_K_M/Q3_K_M degradation on shared calibrated probes.

The practical deployment conclusion is that Q2_K is broadly risky in both tested models, but Q4_K_M/Q3_K_M safety did not transfer cleanly from Qwen2.5-3B to SmolLM2.

model	Q4_K_M degradations	Q3_K_M degradations	Q2_K degradations	conclusion
Qwen2.5-3B	0	1	6	mostly Q2_K-only
SmolLM2 1.7B	1	2	4	earlier Q4_K_M/Q3_K_M degradation
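
As a rough illustration of how the table supports the two conclusions, the sketch below finds the highest-precision level at which each model shows any degradation. The counts are copied from the table; the dictionary layout and the function name first_degraded_level are assumptions made for this example and are not taken from the repository code.

```python
# Ladder ordered from higher to lower precision (only the levels in the table).
LADDER = ["Q4_K_M", "Q3_K_M", "Q2_K"]

# Degradation counts copied from the table above; the layout is illustrative.
COUNTS = {
    "Qwen2.5-3B": {"Q4_K_M": 0, "Q3_K_M": 1, "Q2_K": 6},
    "SmolLM2 1.7B": {"Q4_K_M": 1, "Q3_K_M": 2, "Q2_K": 4},
}


def first_degraded_level(counts):
    """Return the highest-precision level with at least one degradation, or None."""
    for level in LADDER:
        if counts[level] > 0:
            return level
    return None


for model, counts in COUNTS.items():
    print(f"{model}: first degradation at {first_degraded_level(counts)}")
```

The first level differs between the two models, which is the sense in which Q4_K_M/Q3_K_M safety did not transfer.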

Final Package

final_report/FINAL_REPORT.md - full reproducible research report.
final_report/ABSTRACT.md - concise abstract.
final_report/PUBLIC_SUMMARY.md - public/blog-style summary.
final_report/CLAIMS_AND_CAVEATS.md - supported and unsupported claims.
final_report/REPRODUCIBILITY.md - commands and artifact assumptions.
final_report/LIMITATIONS.md - methodological, scoring, model-selection, and deployment limits.
final_report/RESULTS_INDEX.md - source artifact map.
final_report/CLAIM_CHECK.csv - generated claim-check table for headline numerical claims.

What This Does Not Claim

It is not a universal law over all models.
It is not a full seven-level ladder for every model.
It is not human semantic grading.
It is not a guarantee for arbitrary deployment tasks.

Verify The Final Package

Regenerate the final report package:

.\.venv\Scripts\python.exe src\write_final_report.py

Check final-report claims and required release files:

.\.venv\Scripts\python.exe src\check_final_report_claims.py
.\.venv\Scripts\python.exe src\verify_final_package.py

Hardware Choice

This workspace machine has an NVIDIA GeForce RTX 4060 Laptop GPU with 8 GB VRAM and about 39 GB free on C:. That falls short of what the 7B path in the project plan requires, so the checked-in config uses the smaller fallback model for all phases:

qwen2.5-3b-instruct

Users with 24 GB VRAM and enough disk can switch MODEL_NAME in src/config.py to qwen2.5-7b-instruct.
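
A back-of-the-envelope check of the hardware reasoning, assuming two bytes per FP16 parameter and nominal parameter counts of 3e9 and 7e9 (the real models differ slightly). It ignores the KV cache and runtime overhead, so treat the result as a lower bound and not as a measurement.

```python
def fp16_weight_gib(n_params: float) -> float:
    """Approximate size of FP16 weights in GiB: two bytes per parameter."""
    return n_params * 2 / 2**30


VRAM_GIB = 8  # RTX 4060 Laptop GPU, as stated above

# Nominal parameter counts; these are assumptions for the estimate.
for name, n_params in [("3B", 3e9), ("7B", 7e9)]:
    size = fp16_weight_gib(n_params)
    fits = size < VRAM_GIB
    print(f"{name}: about {size:.1f} GiB of FP16 weights, under {VRAM_GIB} GiB: {fits}")
```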

Setup

py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install llama-cpp-python --prefer-binary --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
$env:HF_HOME = "$PWD\hf_cache"

If the CUDA llama-cpp-python wheel is unavailable for the local Python version, install the CPU wheel and record the fallback in results/notes.md:

python -m pip install llama-cpp-python --prefer-binary

llama.cpp Binaries

Download the latest Windows CUDA 12 x64 llama.cpp release assets and extract both archives into tools\llama.cpp\:

mkdir tools
# Replace bXXXX with the latest build tag from https://github.com/ggml-org/llama.cpp/releases/latest
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/llama-bXXXX-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\llamacpp.zip"
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/bXXXX/cudart-llama-bin-win-cuda-cu12.4-x64.zip" -OutFile "tools\cudart.zip"
Expand-Archive tools\llamacpp.zip -DestinationPath tools\llama.cpp
Expand-Archive tools\cudart.zip -DestinationPath tools\llama.cpp
.\tools\llama.cpp\llama-quantize.exe --help

Build Probes

python -m src.build_probes
git add probes
git commit -m "Freeze probe battery"

Validation:

python -c "import json,pathlib; files=list(pathlib.Path('probes').glob('*.jsonl')); assert len(files)==12; [json.loads(line) for f in files for line in f.open(encoding='utf-8')]; assert all(sum(1 for _ in f.open(encoding='utf-8'))==100 for f in files); print('probe files:', len(files))"
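
The one-liner above is compact but hard to debug when it fails. The sketch below performs the same three checks (12 files, 100 lines each, every line valid JSON) and reports which file and line broke. The function name validate_probes and its parameters are hypothetical and are not part of src.

```python
import json
from pathlib import Path


def validate_probes(probe_dir="probes", expected_files=12, expected_lines=100):
    """Check file count, per-file line count, and that every line parses as JSON."""
    files = sorted(Path(probe_dir).glob("*.jsonl"))
    if len(files) != expected_files:
        raise ValueError(f"expected {expected_files} probe files, found {len(files)}")
    for path in files:
        with path.open(encoding="utf-8") as handle:
            lines = handle.readlines()
        if len(lines) != expected_lines:
            raise ValueError(
                f"{path.name}: expected {expected_lines} lines, found {len(lines)}"
            )
        for number, line in enumerate(lines, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path.name}:{number}: invalid JSON ({exc})") from exc
    return len(files)


# Usage, from the repository root:
#   print("probe files:", validate_probes())
```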

Build Model Ladder

Download the FP16 GGUF:

huggingface-cli download Qwen/Qwen2.5-3B-Instruct-GGUF qwen2.5-3b-instruct-fp16.gguf --local-dir models

If the FP16 GGUF is unavailable, download HF weights and convert with llama.cpp's convert_hf_to_gguf.py.

Quantize one level at a time, always starting from the FP16 file (never from an already-quantized one):

.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q8_0.gguf Q8_0
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q6_k.gguf Q6_K
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q5_k_m.gguf Q5_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q4_k_m.gguf Q4_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q3_k_m.gguf Q3_K_M
.\tools\llama.cpp\llama-quantize.exe models\qwen2.5-3b-instruct-fp16.gguf models\qwen2.5-3b-instruct-q2_k.gguf Q2_K
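
For a second model it can be easier to generate these six commands than to edit them by hand. The sketch below builds the same argument lists, always reading from the FP16 file. The function name quantize_commands and its parameters are assumptions for illustration; the executable path and file naming follow the commands above.

```python
from pathlib import PureWindowsPath

# Target quantization types, in the same order as the commands above.
QUANT_TYPES = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]


def quantize_commands(
    model_stem="qwen2.5-3b-instruct",
    models_dir="models",
    quantize_exe=r".\tools\llama.cpp\llama-quantize.exe",
):
    """Build one llama-quantize argument list per level, all from the FP16 source."""
    source = PureWindowsPath(models_dir) / f"{model_stem}-fp16.gguf"
    commands = []
    for qtype in QUANT_TYPES:
        target = PureWindowsPath(models_dir) / f"{model_stem}-{qtype.lower()}.gguf"
        commands.append([quantize_exe, str(source), str(target), qtype])
    return commands


for argv in quantize_commands():
    print(" ".join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) to run the levels one after another.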

Sanity-check each model:

foreach ($q in "fp16","q8_0","q6_k","q5_k_m","q4_k_m","q3_k_m","q2_k") {
  .\tools\llama.cpp\llama-cli.exe -m "models\qwen2.5-3b-instruct-$q.gguf" -p "2+2=" -n 8 -ngl 99
}

Run Evaluation

Smoke test:

python -m src.run_eval --quant q4_k_m --probe char_manipulation
python -m src.run_eval --quant q4_k_m --probe winogrande

Full sweep:

python -m src.run_sweep

Use partial GPU offload if the FP16 model does not fit in VRAM:

python -m src.run_eval --quant fp16 --ngl 24

Analyze

python -m src.analyze
python -m src.verify_report_claims

V1 audit

The audit layer reads the frozen v1 artifacts and writes new outputs under results/audit/ and results/plots/audit/. It does not overwrite probes/, results/raw/, or the original v1 report.

python src\run_audit.py
python src\validate_audit.py
python src\sample_failures.py --probe arithmetic --quant q4_k_m --mode fp16_correct_quant_wrong --n 25

Primary audit outputs:

results/REPORT_v1_AUDIT.md
results/audit/paired_flip_table.csv
results/audit/paired_significance.csv
results/audit/death_persistence.csv
results/audit/calibration_report.csv
results/plots/audit/paired_net_loss_heatmap.png
results/plots/audit/retention_among_fp16_correct_overlay.png
results/plots/audit/baseline_calibration_bar.png
results/plots/audit/death_persistence_comparison.png
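
To make the paired outputs concrete, here is a minimal sketch of a paired flip count and an exact McNemar-style test on the discordant pairs. This README does not state which significance test the audit uses, so the test choice, the function names, and the toy data are all assumptions and are not the repository's implementation.

```python
from math import comb


def paired_flips(fp16_correct, quant_correct):
    """Count items that flip between FP16 and a quantized run on the same prompts."""
    if len(fp16_correct) != len(quant_correct):
        raise ValueError("paired runs must score the same items in the same order")
    pairs = list(zip(fp16_correct, quant_correct))
    lost = sum(1 for f, q in pairs if f and not q)    # FP16 correct, quant wrong
    gained = sum(1 for f, q in pairs if q and not f)  # FP16 wrong, quant correct
    return lost, gained


def exact_mcnemar_p(lost, gained):
    """Two-sided exact binomial p-value computed on the discordant pairs only."""
    n = lost + gained
    if n == 0:
        return 1.0
    k = min(lost, gained)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy per-item correctness flags, not real results.
fp16 = [True] * 8 + [False] * 2
quant = [True] * 3 + [False] * 5 + [True, False]

lost, gained = paired_flips(fp16, quant)
print("lost:", lost, "gained:", gained, "net loss:", lost - gained)
print("p-value:", exact_mcnemar_p(lost, gained))
```

Pairing matters because both runs answer the same prompts: only items that change outcome carry information about the quantization effect.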

The v2 calibrated follow-up plan lives in docs/V2_PLAN.md.

V2 calibrated probes

Build and validate the calibrated v2 prompt suite without running a model:

.\.venv\Scripts\python.exe src\build_probes_v2.py
.\.venv\Scripts\python.exe src\validate_probes_v2.py
.\.venv\Scripts\python.exe src\run_v2_targeted.py --dry-run

Run FP16 calibration first. This writes v2 raw files under results_v2\raw\ and a baseline report to results_v2\calibration_fp16.csv:

.\.venv\Scripts\python.exe src\calibrate_v2.py --quant fp16

After the FP16 calibration report says the probe set is usable enough to continue, run the targeted v2 sweep:

.\.venv\Scripts\python.exe src\run_v2_targeted.py
.\.venv\Scripts\python.exe src\analyze_v2.py
.\.venv\Scripts\python.exe src\write_report_v2.py
.\.venv\Scripts\python.exe src\decide_next_phase.py

To inspect v2 paired failures after a sweep:

.\.venv\Scripts\python.exe src\sample_failures.py --experiment v2 --probe arithmetic_easy --quant q4_k_m --mode fp16_correct_quant_wrong --n 25

Expected outputs:

results_v2/summary.csv
results_v2/paired_significance.csv
results_v2/death_persistence.csv
results_v2/fragility_index.csv
results_v2/REPORT_v2_TARGETED.md
results_v2/plots/v2_accuracy_grid.png
results_v2/plots/v2_retention_overlay.png
results_v2/plots/v2_paired_net_loss_heatmap.png
results_v2/plots/v2_death_persistence.png
results_v2/plots/v2_fragility_bar.png

Write Report

After src.analyze has produced the CSVs and plots, fill results/REPORT.md with the death-order table, plots, statistical notes, limitations, and practical recommendations. Every quantitative claim must trace to results/summary.csv or another generated CSV; src.verify_report_claims checks the headline snippets against those generated files.
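
The tracing rule can be pictured as a lookup from each claimed number into a generated CSV. The sketch below is not src.verify_report_claims; the column names probe, quant, and accuracy, the function name check_claims, and the sample values are all hypothetical.

```python
import csv
import io


def check_claims(csv_text, claims, tolerance=1e-6):
    """Return the claims that are missing from, or disagree with, the CSV."""
    rows = {
        (row["probe"], row["quant"]): float(row["accuracy"])
        for row in csv.DictReader(io.StringIO(csv_text))
    }
    failures = []
    for probe, quant, claimed in claims:
        actual = rows.get((probe, quant))
        if actual is None or abs(actual - claimed) > tolerance:
            failures.append((probe, quant, claimed, actual))
    return failures


# Toy CSV content, not real results.
SUMMARY = "probe,quant,accuracy\narithmetic,fp16,0.90\narithmetic,q4_k_m,0.85\n"

claims = [("arithmetic", "q4_k_m", 0.85), ("arithmetic", "fp16", 0.95)]
for failure in check_claims(SUMMARY, claims):
    print("unsupported claim:", failure)
```

An empty result means every listed claim matched a generated value within the tolerance.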

End-to-end reproduction once the model ladder exists:

python -m src.build_probes
python -m src.run_sweep
python -m src.analyze
python -m src.verify_report_claims

Cross-model transfer validation

The Qwen2.5-3B run in results_v2/ is the frozen primary v2 run. Cross-model validation writes separate transfer outputs under results_transfer\{model_key}\ and does not overwrite results_v2\REPORT_v2_TARGETED.md, probes_v2_calibrated\, or primary raw outputs.

The built-in registry key for the primary model is qwen2_5_3b. To add a second local model family, create models\model_registry.local.json using the template:

.\.venv\Scripts\python.exe src\model_registry.py --show-template

After editing the local registry, check the resolved GGUF paths:

.\.venv\Scripts\python.exe src\model_registry.py --model-key YOUR_SECOND_MODEL

Dry-run the transfer matrix before any model load:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --dry-run

Run FP16 calibration first. Continue to lower quant levels only if at least five default transfer probes have usable baselines and the script does not print DO_NOT_RUN_TRANSFER_SWEEP_YET:

.\.venv\Scripts\python.exe src\calibrate_transfer.py --model-key YOUR_SECOND_MODEL
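
The gating rule can be sketched as follows. The minimum of five usable probes and the DO_NOT_RUN_TRANSFER_SWEEP_YET marker come from the text above; the definition of "usable" as FP16 accuracy at or above 0.6, the OK_TO_RUN_TRANSFER_SWEEP label, and the probe names are assumptions for illustration only.

```python
MIN_USABLE_PROBES = 5  # from the text above


def usable_probes(fp16_accuracy, min_accuracy=0.6):
    """Probes whose FP16 baseline is high enough to measure a drop (threshold assumed)."""
    return sorted(name for name, acc in fp16_accuracy.items() if acc >= min_accuracy)


def transfer_gate(fp16_accuracy):
    """Return a status string and the list of usable probes."""
    usable = usable_probes(fp16_accuracy)
    if len(usable) < MIN_USABLE_PROBES:
        return "DO_NOT_RUN_TRANSFER_SWEEP_YET", usable
    return "OK_TO_RUN_TRANSFER_SWEEP", usable  # hypothetical label


# Toy baselines with placeholder probe names, not real results.
baselines = {
    "probe_a": 0.92, "probe_b": 0.75, "probe_c": 0.40,
    "probe_d": 0.88, "probe_e": 0.61, "probe_f": 0.55,
}
status, usable = transfer_gate(baselines)
print(status, usable)
```

A probe whose FP16 baseline is already near chance cannot show a meaningful drop at lower precision, which is why calibration runs before the sweep.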

After calibration passes, run the targeted transfer sweep one quant at a time:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q4_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q3_k_m
.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --quant q2_k
.\.venv\Scripts\python.exe src\analyze_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\compare_transfer.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\write_transfer_report.py --model-key YOUR_SECOND_MODEL
.\.venv\Scripts\python.exe src\decide_transfer_next_phase.py --model-key YOUR_SECOND_MODEL

Optional long-context control probe:

.\.venv\Scripts\python.exe src\run_transfer_targeted.py --model-key YOUR_SECOND_MODEL --include-controls --dry-run
