Snakemake pipeline for post-processing, QC, and basic visualization of ARG tree sequences (.ts, .trees, .tsz). Written with the aid of Codex and Claude.
If you use, please cite:
Ross-Ibarra, J. 2026. ARGtest: tools for QC and validation of ancestral recombination graphs. doi: 10.5281/zenodo.19698118
- Install
- Quick start
- Suggested Workflow
- Snakemake Pipeline
- Scripts
- Auxiliary scripts
- Inputs, formats, defaults & logs
- Repository notes
- Changelog
See NOTES.md for the shared module layout and trim_samples.py sample-ID matching rules. Run any script with --help for its full argument list.
conda env create -f environment.yml
conda activate argtestWhich version to use: work from the most recent tagged release rather than an
arbitrary commit on main. To check out the latest tag:
git fetch --tags
git checkout "$(git tag -l 'v*' | sort -V | tail -1)"See CHANGELOG.md for a per-version breakdown of changes.
New here? After installing, run the whole pipeline end-to-end on the bundled example dataset:
# 1. Generate the example tree sequences. These are not committed; they
# regenerate deterministically from the seed pinned in ground_truth.json
# (3 chromosomes × 8 replicates × 16 diploids × 10 Mb; a few minutes).
python scripts/make_realistic_example.py --out-dir argtest-realistic-example \
--n-chrom 3 --n-reps 8 --n-samples 16 --seq-length 10000000
# 2. Run the pipeline. The committed config/snakemake.yaml already points at
# the example dataset above.
snakemake --cores 16 --configfile config/snakemake.yamlResults land in argtest-realistic-example-out/. The example is a deliberately-flawed dataset (contaminated individuals, per-window sample pruning, an accessibility mask) with a ground-truth scorecard at scoring_report.md grading the pipeline's masks against the injected flaws; see MAKE_REALISTIC_EXAMPLE.md for the generator's CLI and ground-truth schema. For dry-run preview, run flags, and cluster execution, see How To Run below.
One reasonable post-processing workflow for ARG tree sequences is below. The Snakemake pipeline automates exactly these steps; each can also be run by hand via the linked script (--help for full arguments).
- Find low-recombination regions. Regions of low recombination are more affected by linked selection, so for analyses assuming a neutral ARG it can help to remove them first. scripts/hapmap_low_rec_mask.py turns a HapMap recombination map plus a
.faiinto per-chromosome BED masks of the lowest-recombination intervals (set byrec_fraction; see Required Inputs for the intervals-vs-base-pairs caveat). If your ARG came from singer-snakemake withpaircoal-reweight: true, you may be able to skip this step (setrec_fractionto 0). - Find regions of poor alignment. scripts/find_low_access_regions.py inspects a tree sequence's inferred mutation map and writes windows where more than a cutoff fraction of base pairs are masked to a BED file.
- Find aberrant mutational load. All samples should have similar derived-mutation loads after accounting for local mutation-rate variation. scripts/mutload_masks.py simulates mutations once on the input ARG to get a per-window, per-individual expected load, then flags any (window, individual) pair whose observed load deviates from expectation by more than
mutload_cutoff, writing one BED row per outlier window with the affected individuals in column 4. The test is per individual: a diploid/polyploid individual's haploid copies are pooled before testing, so for haploid data it is effectively per sample. scripts/mutload_summary.py gives an HTML diagnostic of residual load after pruning, flagging candidates for whole-sample removal. - Remove affected regions. Combine the step 1–3 BED masks per chromosome and trim those regions from each tree sequence with scripts/trim_regions.py. The mutation-rate ratemap is embedded in each trimmed tree sequence's metadata.
- Remove affected samples. Often only a few samples in a window are problematic — e.g. introgressed (cf. TRACE) or with odd derived-mutation patterns (step 3). scripts/trim_samples.py prunes given individuals over given intervals. The pipeline automatically prunes the step-3 outliers; add your own with
trim_individuals(genome-wide IDs) and/ortrim_remove_bed(per-individual BEDs), or run the script directly with--individuals/--remove. - Validate (optional). scripts/validation_plots_from_ts.py produces compact QC plots (mutational load, diversity, Tajima's D, folded/unfolded SFS) for the cleaned ARG; compare against the same plots on the originals.
- Merge chromosomes per replicate. scripts/merge_treefiles_by_replicate.py concatenates the per-chromosome tree files into one tree sequence per replicate, merging the embedded mutation-rate ratemaps across chromosomes.
- Summarise (optional). scripts/pipeline_summary.py collects genome-retention statistics across all steps (mean ± SD across replicates), per-individual outlier counts, and the validation plots into one self-contained HTML file.
Optionally, set emit_vcf: true to export a per-(chromosome, replicate) VCF from the filtered per-chromosome tree sequences; see VCF export.
For a rule-based version of the workflow above, this repository now includes a Snakemake pipeline in Snakefile.
The Snakemake workflow is designed for a directory layout with one subdirectory per chromosome and one treefile per replicate inside each chromosome directory:
<root>/
chr1/
1.tsz
2.tsz
...
chr2/
1.tsz
2.tsz
...
The Snakemake workflow discovers treefiles with suffixes .ts, .trees, and .tsz. Replicate IDs are taken from the filename stem, so chr1/1.tsz is replicate 1 for chromosome chr1.
If the treefiles live in a subdirectory of each chromosome directory rather than directly inside it — for example SINGER output where each chrN/trees/ holds the replicates — set tree_subdir to that subdirectory name and discovery looks there instead:
<root>/
chr1/
trees/
chr1.1.tsz
chr1.2.tsz
...
The chromosome name still comes from the chromosome directory (chr1), and a leading chrN. prefix on the filename is stripped to get the replicate ID, so chr1/trees/chr1.2.tsz is replicate 2.
The Snakemake config is in config/snakemake.yaml. Edit it for your project and supply it with --configfile. The file has an inline comment for every option.
Required keys:
root_dir: path to the chromosome-subdirectory roothapmap: single HapMap recombination map covering all chromosomes (one combined file, not one per chromosome — rows are grouped by theChromosomecolumn), used for step 1fai: FASTA index used for chromosome lengthsrec_fraction: fraction of recombination-rate intervals (ranked byRate(cM/Mb), lowest first) to include in the low-recombination mask; e.g.0.1masks the bottom 10 % of intervals, while0.0writes empty low-recombination masks. Note this is a fraction of the number of intervals between map markers, not of base pairs — because the lowest-recombination intervals tend to be the longest, masking the bottom 5 % of intervals can remove well over 5 % of the genome by bplow_access_window_size: window size in bp for step 2low_access_cutoff_bp: minimum accessible bp per window for step 2- exactly one of
mutload_window_sizeormutload_snp_windowfor step 3
Optional keys (all have sensible defaults):
tree_pattern: glob for treefiles within each chromosome directory (default:"*"), for example"*.trees"or"*.tsz"tree_subdir: optional subdirectory within each chromosome directory that holds the treefiles (default: unset → files live directly in the chromosome dir); e.g."trees"for SINGER-stylechrN/trees/layoutsmutload_cutoff: outlier cutoff fraction for step 3 (default:0.5)mutation_rate: single scalar mutation rate (per bp per generation), the shared fallback for both step 3 and step 6 when no embedded or sibling ratemap is availablemutload_random_seed: base seed for the per-replicate mutation simulation in step 3 (default:1)mutload_fraction: fraction threshold for writing mutation-masked BED rows in step 3suffix_to_strip: suffix removed from sample IDs before matching in step 3 and step 5 (default:"_anchorwave")trim_individuals: extra individual IDs removed genome-wide in step 5, in addition to the step-3 mutload outliers (e.g. introgressed samples). Comma-separated string ("id1,id2") or a YAML list; matched withsuffix_to_striplike step 3. Default: unset (trim only mutload outliers)trim_remove_bed: extra BED file(s) of per-individual intervals removed in step 5, in addition to the mutload outliers. Column 4 holds comma-separated sample IDs (or the filename stem if absent). Single path or a YAML list of paths; applied identically to every (chrom, rep). Default: unsetmin_samples: minimum number of non-isolated retained sample nodes (haploids, not individuals) a local tree must have; intervals below it are dropped in an optional step 5b that runs after step 5 and before the merge. Dropping usesdelete_intervals, so sequence coordinates are preserved (the removed spans become empty gaps — there is no coordinate compaction) and thekept_intervalsmetadata is intersected with the surviving spans so downstream accessibility is not overestimated. Each(chrom, rep)also gets a diagnostic BED of the dropped intervals (chrom start end retained_samples min_samples) under<out_dir>/step5b_min_samples/. When set, the merge, VCF export, and validation steps automatically consume the step-5b filtered tree sequences instead of the raw step-5 output. Default: unset/null (skip step 5b entirely, so existing configs are unaffected). Sample pruning in step 5 is what creates the low-sample intervals this filter targetsallow_missing_replicates: set totrueto concatenate partial replicate sets (default:false)burnin: number of leading discovered replicates to discard before concatenation (default:0); must be smaller than the number of replicates discovered after applyingtree_patternbase_name: prefix used for merged outputs (default: name ofroot_dir)merged_out_suffix: force a specific output suffix for merged files (.ts,.trees,.tsz); default is to inherit the suffix of the first inputout_dir: output root for Snakemake products (default:snakemake_out; tilde is expanded)run_validation: master switch for step 6 (default:true); setfalseto skip the validation plots while keepingmutation_rateset for step 3. Step 6 also auto-skips when no rate source is available (neithermutation_ratenorvalidation_sim_branch)validation_first_chrom_only: run step 6 only on the first chromosome (default:true)validation_window_size: window size in bp for step-6 diversity, Tajima's D, and segregating-sites validation plots (default:100000); larger windows run faster and use less memory on large ARGs but give coarser QC curvesvalidation_sim_branch: simulate site mutations on each ARG replicate with msprime for a posterior-predictive check instead of scaling branch statistics (default:false); can run without a scalarmutation_ratewhen every validated tree sequence has an embedded/sibling ratemapemit_vcf: iftrue, export one.vcf.gzper (chromosome, replicate) from the filtered per-chromosome tree sequences into<out_dir>/vcf/(default:false); see VCF export belowvcf_reps: restrict VCF output to specific replicate IDs (a subset of the post-burninreplicates); leave unset/null to emit every post-burninreplicate
Where the mutation rate comes from. Steps 3 and 6 resolve a per-bp mutation rate in this order: (1) a ratemap embedded in the tree-sequence metadata, (2) a sibling *.mut_rate.p file near the treefile, (3) the scalar mutation_rate. SINGER output normally provides (1) or (2), so a *.mut_rate.p file is not required if a ratemap is embedded. For ARGs produced by non-SINGER software — which typically carry no embedded ratemap and ship no *.mut_rate.p — just set the scalar mutation_rate and the pipeline runs without any ratemap file. One caveat: step 3's outlier test is designed to correct for local mutation-rate variation, so a flat scalar reduces it to a uniform-rate expectation (no spatial correction); prefer an embedded or sibling ratemap for step 3 when you have one. Step 6 can run in validation_sim_branch mode with ratemaps alone; otherwise it needs a scalar rate for simulation or branch scaling.
The pipeline derives the chromosome label, the replicate ID, and (optionally) the mutation map from your directory layout and filenames. Two of these must line up with the contents of your input files; getting them wrong is the most common setup failure (KeyError: Chromosome '<label>' not found in <file> at step 1, or a "could not infer mutation map" error at step 3/6).
-
Chromosome label — the name of each chromosome subdirectory directly under
root_dir(e.g.chr1in<root>/chr1/...; see the layout diagrams above). This one string is written as the BED chromosome column and is the key looked up in your hapmap and.fai. Keep it short and chromosome-like (1,chr1); do not embed run descriptions in the directory name — a directory calledchr.10.combined.snp.te.sortedbecomes the label verbatim and will not match a normal hapmap. (If your treefiles sit one level deeper, e.g.chrN/trees/, settree_subdirrather than pushingroot_dirdown or lengthening the chromosome directory name.) -
Replicate ID — the treefile stem with a leading
<chromosome-label>.prefix stripped, sochr1/trees/chr1.2.tszis replicate2. One treefile per replicate per chromosome. -
Hapmap
Chromosomecolumn — the hapmap is one combined file given by thehapmap:path; its filename is irrelevant. For each chromosome the pipeline looks up the chromosome label in theChromosomecolumn, trying, in order: the label as-is, the substring after its first., thenchr<that>andchr_<that>. So a labelamaranth.1matches a column value of1,chr1, orchr_1; a bare labelchr1matches onlychr1(there is no dot to strip, so1alone would not match). The same matching is applied to the first column of the.fai. A label likechr.10.combined.snp.te.sortedonly ever reduces to10.combined.snp.te.sorted, which is why such names fail to match. -
*.mut_rate.pfiles (optional — see "Where the mutation rate comes from" above) — unlike the hapmap, these are discovered by name, from the treefile path, independent ofroot_dir. For a treefile<stem>.<rep>.<suffix>the pipeline searches the treefile's own directory and one level up for<base>.mut_rate.p, where<base>is the stem with the trailing.<rep>removed (the full stem is also accepted). Example: for.../trees/foo.bar.12.tsz, name the mapfoo.bar.mut_rate.p(one per chromosome) orfoo.bar.12.mut_rate.p(one per replicate) and place it intrees/or its parent. If two different*.mut_rate.pfiles could match, the run aborts with an "Ambiguous mutation maps" error.
The Snakemake workflow automates the Suggested Workflow steps above. The final merged outputs are named:
<base_name>.combined.<replicate>.<suffix>
and are written under the configured out_dir in a combined/ directory.
Set emit_vcf: true to write one bgzip-less .vcf.gz per (chromosome, replicate) under <out_dir>/vcf/<chrom>/<rep>.vcf.gz, produced by scripts/export_vcf.py from the filtered per-chromosome tree sequences: step 5 output by default, or step 5b output when min_samples is enabled. Coordinates are real per-chromosome positions, not the concatenated coordinates of the merged ARG. Notes:
- Variable sites only — the records are the sites carried on the ARG; the pipeline does not synthesize invariant/monomorphic positions.
- Pruned samples are missing, not dropped. Because
trim_samplesleaves a pruned sample isolated over its intervals, that sample is written as a missing genotype (.) at any site inside those intervals (via tskit'sisolated_as_missing), while remaining present elsewhere. The sample/site is never globally removed. - Ploidy-aware genotypes — a haploid individual (one sample node, e.g. the
admixdata) is coded as a single allele0/1; a diploid individual as0|1-style. ARG genotypes are phased (|). - One VCF per replicate — each replicate is a distinct ARG with its own topology, sites, and per-replicate trimming, so VCFs differ across replicates. The replicate set follows the pipeline
burnin(leading replicates already dropped); restrict further withvcf_reps. To get one genome-wide VCF per replicate,bcftools concatthe per-chromosome files.
From the repo root:
module load conda
conda activate argtestEdit config/snakemake.yaml to point at your data (at minimum set root_dir, hapmap, and fai), then run.
Preview the planned jobs with a dry run, then run for real:
snakemake -n -p --configfile config/snakemake.yaml
snakemake --cores 16 --rerun-incomplete --keep-going --configfile config/snakemake.yaml-n -p prints the planned jobs and their commands without executing; --rerun-incomplete re-runs any jobs a previous interrupted run left half-finished; --keep-going lets independent jobs continue when one fails.
This is fine for the small example dataset. For real datasets, prefer the SLURM route below. The merge_replicates step loads all chromosomes of a replicate into memory at once and can need ~50–128 GB each; with plain --cores N, Snakemake may launch several merges concurrently and OOM the machine. If you must run locally, cap memory with --resources mem_mb=<node_RAM_in_mb> so the merges serialize to fit.
Sandboxed / read-only
~/.cacheenvironments (some HPC or container setups) may need cache and temp-dir redirects — see Sandboxed environments in NOTES.md.
Add --profile profiles/slurm to submit each job to SLURM instead of running locally:
snakemake --profile profiles/slurm --configfile config/snakemake.yamlThis is the recommended way to run real datasets: besides parallelism, it fans the ~per-(chromosome, replicate) jobs out across the cluster, sending light steps to one partition and the memory-heavy merge_replicates / step6_validation_plots steps to a big-mem partition — so each merge gets its own right-sized node and avoids the local-run OOM noted above. All cluster knobs live in your configfile — slurm_account, slurm_partition, and the per-rule resources: block (mem/time/threads/partition). The profile file itself (profiles/slurm/config.yaml) is set-and-forget; you do not edit it.
Notes:
- Partition names are cluster-specific. The shipped config uses
"low"and"high", which are particular to our cluster; changeslurm_partition(and any per-rulepartitionin theresources:block) in your configfile to match the partitions on your own cluster. out_dirmust be on a shared filesystem (not node-local/tmp) — each job runs on a different node, so a/tmpoutput path silently loses results.- The snakemake process here is just a controller (it submits jobs and polls), but it runs for the whole pipeline — launch it inside
tmux/screenor a smallsrunso it survives disconnects. - Per-job SLURM logs are written to
logs/slurm/. - Plain
snakemake --cores N …(above) still runs everything locally; SLURM-only settings such as account, partition, memory, and walltime are ignored, while per-rulethreadsstill affects local scheduling.
By default, Snakemake writes outputs beneath out_dir with subdirectories for each stage:
<out_dir>/
step1_low_rec/
step2_low_access/
step3_mutload/
step4_masks/
step4_trimmed_regions/
step5_trimmed_samples/
step5b_min_samples/ # low-sample intervals dropped, if min_samples is set
combined/
step6_validation/ # step 6 validation plots (original and cleaned), if configured
vcf/ # per-(chromosome, replicate) VCFs, if emit_vcf: true
pipeline_summary.html
logs/
Intermediate filenames include both chromosome and replicate information so they stay unique across the full workflow.
Pipeline scripts (called by the Snakefile). Run any with --help for arguments, defaults, and examples.
hapmap_low_rec_mask.py— per-chromosome BED of the bottom--rec-fractionof HapMap recombination-rate intervals.find_low_access_regions.py— BED of low-accessibility windows, computed from a tree sequence's inferred mutation map.mutload_summary.py— interactive HTML diagnostic for the mutload step: per-individual residual load after window-level pruning (ASCII bar chart, red highlight on individuals still outside the cutoff band, lineage table of flagged counts).mutload_masks.py— outlier and mutation-masked BED files for one tree sequence (pipeline step 3).combine_remove_masks.py— merge the step 1–3 BED masks into a single combined BED per chromosome.trim_regions.py/trim_regions_single.py— apply a BED mask to a directory (or single file) of tree sequences and write trimmed outputs with compacted coordinates.trim_samples.py— remove individuals genome-wide (--individuals) or over BED intervals (--remove). See NOTES.md for the exact sample-ID matching rules.filter_min_samples.py— drop intervals whose local trees retain fewer than--min-samplesnon-isolated sample nodes, viadelete_intervals(coordinates preserved); updateskept_intervalsmetadata and writes a diagnostic BED. Optional pipeline step 5b, driven bymin_samples.validation_plots_from_ts.py— SINGER-style QC plots (mutational load, diversity, Tajima's D, folded/unfolded SFS) across TS replicates; optional observed-vs-simulated overlays.merge_treefiles_by_replicate.py— concatenate chromosome-specific tree-sequence files by replicate; embedded mutation-rate ratemaps are merged and carried forward.export_vcf.py— export a.vcf/.vcf.gzfrom a (filtered) tree sequence: variable sites only, ploidy-aware genotypes, and samples pruned bytrim_sampleswritten as missing (.) viaisolated_as_missing. Driven byemit_vcf; see VCF export.pipeline_summary.py— self-contained HTML report of genome retention, per-individual outlier counts, and embedded validation plots.
Diversity (π), Tajima's D, and SFS in this pipeline are computed with tskit's built-in methods (ts.diversity, ts.Tajimas_D, ts.allele_frequency_spectrum). These normalize by a constant n · (n − 1) / 2 based on the sample set passed in, so when the sample size varies across regions — as it does after trim_samples.py removes individuals over BED intervals — the per-window statistics are not correctly normalized for the locally retained sample count. Treat post-pruning stats with caution, and prefer comparisons on replicates where sample membership is uniform across the genome.
Scripts not called by the Snakemake pipeline.
coalescence_ne_plots_from_ts.py— pair-coalescence and Ne plots from TS replicates. Choose the time grid with either--time-bins-file(explicit edges) or--num-bins N(equal-coalescence-event bins derived frompair_coalescence_quantilesaveraged over post-burnin replicates). Optional Demes-based coalescent simulations (--sim N) produce window-stat and SFS TSVs for observed-vs-sim density plots invalidation_plots_from_ts.py.
⚠️ Per-window caveat (pair-coalescence rescale):compute_pair_coalrescales tskit'spair_coalescence_counts(pair_normalise=True)PDF by a single scalar (n_pairs · seq_length / connected_pair_span) to fix the over-counted denominator when partial trees leave some samples isolated. This is the same bug fixed upstream in nspope/tskitnsp-paircoal-partial-missing(commitbe9f67f2). The scalar rescale is mathematically exact for the script's current global, single-sample-set, time-binned output, but would not be correct if the script were extended to emit spatial-windows=rates or multiple-sample_setsoutput — those need per-window or per-index-pair denominators rather than a single scalar. The rescale will be retired in favor of the in-engine fix once it ships in a releasedtskit.
compare_trees_html.py— render one tree index from each of two tree sequences side-by-side into a single HTML file.trees_gallery_html.py— scrollable HTML gallery of all trees from two tree sequences, useful for quick before/after inspection.simulate_two_bottleneck_demography.py— simulate replicate ARGs under a fixed two-bottleneck demography (35 ka + 9 ka bottlenecks, present-day expansion) for known-truth pipeline tests.make_realistic_example.py— generate a realistic synthetic example dataset (ARGs from the two-bottleneck model + three injected flaws: contaminated individuals, per-window sample pruning, and amut_rate.paccessibility mask) for end-to-end pipeline testing. Emits aground_truth.jsonfor scoring the pipeline's masks. See MAKE_REALISTIC_EXAMPLE.md for details, CLI options, and the ground-truth schema.
- Tree-sequence files: scripts accept
.ts,.trees, and.tszfiles. Loading/writing.tszrequirestszipto be installed; scripts will raise a clear error iftszipis missing when a.tszis used. - BED files: expected as whitespace-separated lines
chrom start end [name].startandendare numeric (half-open intervals[start, end)). If a fourth columnnameis present it may list one or more comma-separated sample IDs; if omitted the BED filename stem is used as the sample name. Lines starting with#and blank lines are ignored. - HapMap recombination maps: when required (e.g.
hapmap_low_rec_mask.py), the script expects the HapMap format used bymsprime.RateMap.read_hapmap. - Glob
--pattern: arguments named--patternaccept shell-style glob patterns (for example "*.tsz") and are matched against filenames in the supplied directory. - Defaults & output locations: many scripts write to a
results/directory or to an output directory under the input tree-directory when--out/--out-dirare not provided. Examples:trim_samples.py: default output isresults/<ts_stem>_trimmed.tszwhen--outis not given.trim_regions.py: default--out-diris<ts-dir>/trimmedand default log is<out-dir>/logs/trim_regions.log.mutload_summary.pywritesresults/<name>.htmlandlogs/<name>.log; no BED files are written (usemutload_masks.pyfor BED output).- Several plotting scripts write PNG files into
results/by default; most have--outor--out-dirflags to override this.
- Logging & errors: scripts write summary logs into
logs/or into the chosen--out-dir(see each script's--log/--out-diroptions). Common failure modes include missingtszipfor.tszI/O, mismatched sequence lengths across chromosome files (checked bytrim_regions.py), and invalid BED line formats (the loader will raiseValueErrorwhen a BED line has fewer than 3 columns). When a script prints anERROR:or raises an exception, check the correspondinglogs/or<out-dir>/log file for the detailed run record.
- Generated
logs/andresults/are git-ignored. .DS_Storeis git-ignored.- Per-release changes are tracked in CHANGELOG.md, keyed to the
v1.xgit tags.
None of this would be possible without the patient help and advice of Nate Pope. Any errors, bad code, or poor interpretations, however, are my responsibility alone. This repo also uses code from Nate Pope's singer-snakemake.