Demystifying PSMC: A Comprehensive Guide to Inferring Ancient Population Size for Biomedical Research

Carter Jenkins Jan 12, 2026 488

This article provides a comprehensive analysis of the Pairwise Sequentially Markovian Coalescent (PSMC) model, a cornerstone method for reconstructing historical population dynamics from a single genome.

Demystifying PSMC: A Comprehensive Guide to Inferring Ancient Population Size for Biomedical Research

Abstract

This article provides a comprehensive analysis of the Pairwise Sequentially Markovian Coalescent (PSMC) model, a cornerstone method for reconstructing historical population dynamics from a single genome. Tailored for researchers and drug development professionals, we explore PSMC's foundational theory, detailed methodological workflow, common pitfalls with optimization strategies, and its validation against alternative approaches. The content bridges population genetics insights with practical applications in understanding demographic history's role in genetic disease risk, pharmacogenomic diversity, and pathogen evolution, offering actionable guidance for implementing PSMC in contemporary genomic studies.

What is PSMC? Unlocking Demographics from a Single Genome

Within the broader thesis on reconstructing ancient demography, the Pairwise Sequentially Markovian Coalescent (PSMC) model represents a foundational computational breakthrough. It enables researchers to infer historical population size fluctuations from a single diploid genome sequence, bypassing the need for multiple contemporary or ancient samples. This application note details the theory, protocol, and practical implementation of PSMC for researchers in population genetics, evolutionary biology, and comparative genomics, with implications for understanding demographic history in disease susceptibility studies.

Core Principle and Data Requirements

PSMC infers population size (Ne) through time by analyzing the mosaic of heterozygosity and homozygosity along a single diploid genome. It models the time to the most recent common ancestor (TMRCA) between the two chromosomal copies at each genomic position using a hidden Markov model (HMM), where the hidden states are coalescent times discretized into time intervals.

Key Quantitative Parameters & Data Input:

  • Input Data: A diploid consensus sequence in FASTA format (or a mask file for unreliable regions).
  • Fundamental Statistic: The density of heterozygous sites within 100bp windows (or similar) along the genome.
  • Model Parameters: Number of atomic time intervals (typically 64), mutation rate (µ), and generation time (g).

Table 1: Critical Parameters for PSMC Inference

Parameter Symbol Typical Value/Range Function in Model
Mutation Rate µ 1.0e-8 to 2.5e-8 per bp/generation Scales observed heterozygosity to coalescent time.
Generation Time g 1-30 years (species-dependent) Converts coalescent time units (2Ne generations) to years.
Number of Intervals n 64 (default) Discretizes the time axis for the HMM.
Recombination Rate r Varies (e.g., 1.0e-8 per bp/gen) Governs the Markovian transition between coalescent states.
Maximum TMRCA tmax 15 (in 2Ne generations) Upper bound for the oldest inferable coalescent event.

Detailed Protocol for PSMC Analysis

Phase 1: Input Preparation (VCF to PSMC Input)

  • Variant Calling: Generate a VCF file for a single diploid individual using a standard pipeline (e.g., BWA-MEM alignment + GATK haplotype caller).
  • Generate Consensus Sequence:

  • Generate PSMC Input (.fq file): The fq format here denotes a binary encoding of heterozygous/homozygous sites.

    Output: A sequence where 'T' indicates a heterozygous site and 'K' or other letters indicate homozygous blocks.

Phase 2: PSMC Model Inference

  • Run PSMC: The core inference step.

    • -p: The pattern of time interval parameters (atomic intervals + number of free parameters).
    • -o: Output file containing the population size history.

Phase 3: Visualization and Interpretation

  • Generate Plot Data: Convert the .psmc output to a plottable format.

    • -g: Generation time in years.
    • -u: Mutation rate per generation per nucleotide. Output: Creates plot_sample.par and plot_sample.txt.

Experimental Workflow Diagram

Title: PSMC Analysis Workflow from Genome to Plot

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Research Reagents & Computational Tools for PSMC

Item Function & Explanation
High-Coverage WGS Data Minimum 20x coverage recommended for accurate heterozygous call detection.
Reference Genome Assembly A high-quality, contiguous reference for accurate read alignment.
Variant Caller (e.g., GATK) To identify heterozygous/homozygous sites from aligned reads.
bcftools Software suite for processing VCF/BCF files and generating consensus.
PSMC Software Package Contains psmc, fq2psmcfa, psmc_plot.pl and other utilities.
Mutation Rate Estimate Species-specific prior from pedigree or phylogenetic studies.
Generation Time Estimate Prior from life history data, critical for converting to real time.
Masking File BED file defining low-complexity or unreliable genomic regions to exclude.

PSMC Model Logic and Coalescent Framework

Title: Logical Framework of the PSMC Model

Limitations and Advanced Considerations

PSMC has limited resolution for very recent (<10 kya) and very ancient (>1 mya) history. It assumes a panmictic population and is sensitive to the quality of input parameters (µ, g). Subsequent models like MSMC and SMC++ extend the approach to multiple genomes for more recent resolution and complex demographic scenarios.

Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size, this application note focuses on the foundational genetic signal: heterozygosity as a function of TMRCA. The PSMC algorithm leverages the distribution of heterozygosity across a single diploid genome to infer coalescent times and, consequently, historical effective population sizes (Ne). This document details the protocol for extracting this signal and its quantitative interpretation.

Core Principle: From Heterozygosity to Coalescence Time

The density of heterozygous sites within a genomic segment is directly related to the TMRCA of the two chromosomal copies in that segment. Longer segments without heterozygosity imply a more recent common ancestor, while segments with higher heterozygosity density suggest a deeper TMRCA. PSMC models this relationship using a hidden Markov model (HMM) where the hidden state is the TMRCA.

Key Quantitative Relationship: The probability of observing a heterozygote per site, given the coalescence time t (in units of 2Ne generations) and mutation rate μ per generation per site, is approximated by: P(het | t) ≈ 1 - exp(-2μt)

For small values, this is roughly 2μt. Thus, heterozygosity is a direct, if noisy, readout of scaled coalescence time.

Key Data & Parameters Table

Table 1: Core Parameters & Their Impact on TMRCA Inference from Heterozygosity

Parameter Symbol Typical Range/Value Role in Heterozygosity-TMRCA Model Source/Calculation Method
Mutation Rate μ 1.2e-8 to 2.5e-8 per gen/site Converts observed heterozygosity to time scale (T = θ/4μ). Calibration is critical for absolute dating. Derived from pedigree studies or fossil-calibrated phylogenetic rates.
Generation Time G 20-30 years for humans Converts generational TMRCA to years (Tyears = Tgenerations * G). Anthropological/historical records, species-specific data.
Recombination Rate r ~1e-8 per gen/site Defines the correlation length between adjacent genomic segments; enables the Markovian process in PSMC. Estimated from genetic maps (e.g., HapMap, 1000 Genomes).
Effective Population Size Ne Inferred (10,000-30,000 for human ancestors) The scaling factor for coalescence times. The primary output of PSMC, inversely related to TMRCA density. Inferred from the distribution of TMRCA intervals across the genome.
Heterozygosity (observed) θ_w ~0.001 per site in human The raw input signal. Sum of heterozygous sites divided by total callable sites in a diploid genome. Direct measurement from high-coverage (>20x) whole-genome sequencing.

Table 2: PSMC Output Interpretation (Example from a Human Genome)

TMRCA Interval (Generations, scaled) Inferred λ (Ne/10^4) Approximate Historical Period (Years ago, assuming G=25) Implied Population Dynamics
0 - 10,000 1.5 0 - 250,000 Stable modern size.
10,000 - 100,000 0.7 250,000 - 2.5 Mya Bottleneck period.
100,000 - 500,000 1.2 2.5 - 12.5 Mya Population expansion.
500,000 - 1,000,000 2.5 12.5 - 25 Mya Large ancestral population.

Experimental Protocol: From Sample to PSMC Plot

Protocol 4.1: Whole Genome Sequencing & Variant Calling for PSMC Input

Objective: Generate a high-quality, consensus sequence (FASTA) of a single diploid individual for PSMC analysis.

Materials & Reagents:

  • High-molecular-weight genomic DNA (≥ 1μg).
  • Library preparation kit (e.g., Illumina TruSeq DNA PCR-Free).
  • Whole-genome sequencing platform (e.g., Illumina NovaSeq, ≥ 30x coverage).
  • High-performance computing cluster.
  • Bioinformatics tools: BWA-MEM, SAMtools, BCFtools, GATK, vcftools.

Procedure:

  • Library Preparation & Sequencing: Prepare library per manufacturer's protocol. Sequence to achieve minimum 30x coverage across the genome.
  • Read Alignment: Align reads to a reference genome (e.g., GRCh38) using bwa mem. Sort and index output with samtools.

  • Variant Calling & Consensus Generation: a. Call variants using bcftools mpileup and call.

    b. Generate a masked consensus sequence where heterozygous sites are represented by the IUPAC ambiguity code (e.g., A/T becomes 'W') and homozygous or low-quality sites are represented by the reference base.

    c. The mask.bed file is crucial. It should exclude low-complexity regions, segmental duplications, and regions with poor mapping quality. Use resources like the 1000 Genomes strict mask.

  • PSMC Input Preparation: Convert the FASTA to the required PSMC fq format using the fa2fq utility from the PSMC package.

Protocol 4.2: Running PSMC and Bootstrapping

Objective: Infer historical Ne from the .psmcfa file and assess confidence.

Procedure:

  • Run PSMC: Execute the psmc command with initial parameters.

  • Generate Bootstrap Replicates: Create 100 bootstrapped sequences to assess variance.

  • Plot Results: Combine results and generate the final plot.

    -g: generation time; -u: mutation rate.

Visualizations

psmc_workflow Start Diploid Whole Genome Sequence VCF Variant Calling & Filtering Start->VCF Consensus Generate Masked Consensus FASTA (Het as IUPAC) VCF->Consensus PSMCFA Convert to PSMC Input (psmcfa) Consensus->PSMCFA PSMCrun PSMC HMM Inference PSMCFA->PSMCrun Boot Bootstrap Resampling PSMCrun->Boot Plot Plot Ne(t) over Time Boot->Plot

Title: PSMC Analysis Workflow from Raw Sequence

heterozygosity_tmrca cluster_genome Single Diploid Genome Scan Seg1 Genomic Segment A Low Heterozygosity TMRCA1 Recent TMRCA (Short branch) Seg1->TMRCA1 Seg2 Genomic Segment B High Heterozygosity TMRCA2 Ancient TMRCA (Long branch) Seg2->TMRCA2 Seg3 Genomic Segment C Low Heterozygosity TMRCA3 Recent TMRCA (Short branch) Seg3->TMRCA3 Coalescent Coalescent Tree Hidden State TMRCA1->Coalescent  HMM Models  Transition via  Recombination TMRCA2->Coalescent  HMM Models  Transition via  Recombination TMRCA3->Coalescent  HMM Models  Transition via  Recombination

Title: Heterozygosity Density Informs TMRCA per Segment

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for PSMC-based TMRCA Studies

Item Function in Protocol Example Product/Source Critical Specification
High-Quality gDNA Kit Provides intact, high-molecular-weight DNA for WGS library prep. Qiagen Gentra Puregene, Promega Wizard. A260/280 ~1.8, A260/230 >2.0, average fragment size >50 kb.
PCR-Free WGS Library Kit Minimizes amplification biases that could create false heterozygous calls. Illumina DNA PCR-Free Prep. Input DNA requirement: ≥ 1μg. Essential for accurate heterozygosity measurement.
Reference Genome Alignment and variant calling baseline. Must be haplotype-resolved if possible. GRCh38 from GENCODE, species-specific assemblies from NCBI. Use primary assembly, not ALT contigs. Mask problematic regions.
Variant Caller Identifies heterozygous positions from aligned reads. BCFtools, GATK HaplotypeCaller (in single-sample mode). Must be tuned for high sensitivity in diploid individuals.
Genomic Mask File Defines callable regions, excluding repeats, duplications, and low-mappability areas. 1000 Genomes strict mask, UCSC mappability tracks. Critical for removing false heterozygosity signals.
PSMC Software Suite Core algorithm for inferring TMRCA and Ne from a single genome. PSMC (https://github.com/lh3/psmc). Must be compiled with zlib support. Use consistent version for reproducibility.
Mutation Rate Estimate Calibrates the time axis from generations to years. Species-specific, from literature (e.g., human: 1.2e-8 per gen/site). The single largest source of uncertainty in absolute dating.

The Pairwise Sequentially Markovian Coalescent (PSMC) method infers historical population sizes by analyzing the distribution of heterozygosity within a single diploid genome sequence. The core principle relies on the fact that a diploid genome is a mosaic of segments from two distinct haplotypes, each with its own coalescent history. The time to the most recent common ancestor (TMRCA) varies across the genome. By examining the sequential patterns of heterozygous (different alleles) and homozygous (identical alleles) sites, PSMC estimates changes in effective population size (Ne) over time. Understanding the precise generation and interpretation of the diploid sequence input is therefore foundational to all subsequent PSMC analysis and its applications in evolutionary biology, conservation genetics, and informing population history for biomedical research.

Core Principles & Quantitative Data

The diploid genome sequence for PSMC is typically a consensus sequence from high-coverage whole-genome sequencing. The key input metric is the heterozygous density.

Table 1: Key Quantitative Parameters for PSMC from a Diploid Sequence

Parameter Typical Value / Range Description & Impact on PSMC Inference
Sequencing Coverage >20X (ideally >30X) Depth of sequencing required to confidently call heterozygous sites. Lower coverage increases false-negative heterozygote calls.
Heterozygosity Rate (θ) ~1x10⁻³ per site (varies by species) Genome-wide average. Directly informs the scaled mutation rate (θ=4Neμ).
Called Genomic Span >1 Gb (longer is better) Total analyzed sequence length. More data improves the resolution and reduces statistical noise.
Missing Data Rate < 5% Fraction of sites with no confident call. High rates can fragment the coalescent hidden Markov model (HMM) chain.
Haplotype Block Length ~10² - 10⁵ bp (function of recombination) Inferred segments between recombination events. Longer blocks provide information about deeper coalescent times.

Table 2: PSMC Output Interpretation Timeline

Time Segment (Generations Ago) Inferred Ne Sensitivity Dictated By Common Biological Events Correlated
Very Recent (0-10³) Low resolution, requires high recombination rate. Recent bottlenecks, expansions.
Intermediate (10³-10⁵) Highest accuracy for many species. Pleistocene climatic cycles, migrations.
Ancient (>10⁵) Relies on rare, deep coalescent events. Speciation events, long-term stable size.

Detailed Protocols

Protocol 1: Generating PSMC Input from Whole-Genome Sequencing Data

Objective: To produce a diploid consensus sequence in the required "masked" FASTA format for PSMC analysis.

Materials & Reagents:

  • High-quality genomic DNA from a single diploid individual.
  • Library prep kit for short-insert, paired-end sequencing (e.g., Illumina TruSeq).
  • High-throughput sequencer (e.g., Illumina NovaSeq).
  • Reference genome assembly for the target species.
  • Computational cluster with adequate storage and memory.

Procedure:

  • Sequence: Perform whole-genome sequencing to a minimum depth of 20X. Use paired-end 150bp reads for optimal mapping.
  • Quality Control: Use FastQC to assess read quality. Trim adapters and low-quality bases using Trimmomatic or fastp.
  • Alignment: Map cleaned reads to the reference genome using BWA-MEM (bwa mem -M). Sort the resulting SAM file and convert to BAM using samtools (samtools sort, samtools view -b).
  • Mark Duplicates: Identify and tag PCR duplicates using Picard Tools MarkDuplicates or samtools markdup.
  • Variant Calling: Use bcftools mpileup and call to perform variant calling (bcftools mpileup -Ou -f ref.fa aln.bam | bcftools call -c -Ov -o var.vcf). The -c option calls consensus, preserving heterozygosity.
  • Generate Consensus Sequence: Use vcfutils.pl from the bcftools package to generate a diploid consensus FASTA:

  • Masking: Convert the consensus FASTA to the PSMC input format, where heterozygous sites are represented as 'K' (for G/T) or 'R' (for A/G), etc., and homozygous sites as the reference base. Use the fq2psmcfa utility provided with the PSMC software:

  • Quality Masking (Optional): Use a bed file of low-complexity or poorly mapped regions (e.g., from samtools depth) to mask sites by converting them to 'N'.

Protocol 2: Executing PSMC Analysis and Bootstrapping

Objective: To run the PSMC model and assess inference robustness.

Procedure:

  • Run PSMC: Execute the core algorithm with tailored parameters.

  • Bootstrap: Generate bootstrapped sequences to estimate confidence intervals.

  • Combine Results: Use the psmc_plot.pl utility to generate the final plot combining the main run and bootstraps.

    Here, -g sets generation time in years, and -u sets the mutation rate per generation.

Visualization: Workflow & Logical Relationships

G WGS Whole-Genome Sequencing Data (Paired-End Reads) QC Quality Control & Adapter Trimming WGS->QC Align Alignment to Reference Genome QC->Align Prep BAM Processing: Sort, Mark Duplicates Align->Prep VCF Variant Calling (bcftools call -c) Prep->VCF FQ Diploid Consensus FASTQ Generation VCF->FQ PSMCfa Format Conversion (fq2psmcfa) FQ->PSMCfa Input PSMC Input File (.psmcfa) PSMCfa->Input Model PSMC-HMM Inference (Coalescent Modeling) Input->Model Bootstrap Bootstrap Resampling Input->Bootstrap splitfa Output Effective Population Size (Ne) over Time Model->Output Plot Final Visualization with Confidence Intervals Output->Plot Bootstrap->Model 100x runs Bootstrap->Plot

Diagram 1: PSMC Analysis Workflow from Raw Sequencing Data

H HapA Haplotype A (Maternal) DiploidSeq Diploid Genome Sequence HapA->DiploidSeq HapB Haplotype B (Paternal) HapB->DiploidSeq Heterozygote Heterozygous Site (K, R, Y, etc.) DiploidSeq->Heterozygote Alleles differ Homozygote Homozygous Site (A, C, G, T) DiploidSeq->Homozygote Alleles match Coalescent1 Coalescent Event ~10,000 gens ago Coalescent2 Coalescent Event ~100,000 gens ago TMRCA_A TMRCA Segment 1 TMRCA_A->Coalescent1 TMRCA_B TMRCA Segment 2 TMRCA_B->Coalescent2 Heterozygote->TMRCA_A Indicates more recent coalescence Homozygote->TMRCA_B Indicates older coalescence

Diagram 2: Diploid Sequence as a Mosaic of Coalescent Histories

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Generating PSMC Input

Item / Reagent Function in Protocol Critical Specification / Note
High Molecular Weight gDNA Kit (e.g., Qiagen Blood & Tissue, PacBio) Provides intact, high-quality genomic DNA for WGS library prep. High purity (A260/280 ~1.8), minimal fragmentation. Essential for long-range information.
Illumina DNA Prep Kit Library preparation for short-insert, paired-end sequencing on Illumina platforms. Optimized for even coverage and minimal GC bias. Crucial for uniform variant calling.
PhiX Control v3 Sequencing run control for cluster density and error rate calibration. Includes unmethylated lambda phage DNA. Improves base calling accuracy, especially for low-diversity libraries.
Reference Genome FASTA Linear reference for read alignment and coordinate system. Should be contiguous (N50 > 1Mb), well-annotated, and from a closely related individual/lineage if possible.
BWA-MEM2 Software Ultra-fast and accurate alignment of sequencing reads to the reference. Successor to BWA-MEM. Significantly faster alignment, reducing compute time for large genomes.
Samtools/Bcftools Suite Processing alignment files (BAM) and performing variant/consensus calling. Must be version 1.10 or higher for full compatibility. The call -c command is non-negotiable for PSMC input.
PSMC Software Package Contains the core inference algorithm (psmc) and all necessary utilities (fq2psmcfa, splitfa, psmc_plot.pl). Requires a POSIX environment (Linux/Mac). The -p parameter must be optimized for the timescale of interest.

Application Notes: PSMC as a Bridge Between Theory and Data

The Pairwise Sequentially Markovian Coalescent (PSMC) model represents a pivotal operationalization of coalescent theory for genomic data analysis. It translates theoretical predictions about the time to the most recent common ancestor (TMRCA) across a genome into an inference engine for historical population size (Ne). The core innovation lies in using the heterozygosity patterns within a single diploid genome to reconstruct a Ne trajectory over time, leveraging the sequentially Markovian approximation to make genome-scale coalescent analysis computationally tractable.

The quantitative outputs of PSMC provide a demographic history scaffold, critical for contextualizing:

  • The timing of evolutionary bottlenecks and expansions.
  • Speciation and divergence events.
  • Background selection and positive selection scans.
  • The interpretation of pathogenic variant age in biomedical research.

Protocol: A Standard PSMC Analysis Workflow

Objective: Infer historical effective population size (Ne) trajectories from a single diploid genome sequence.

Materials & Input Data:

  • Input File: A high-coverage (>20x) whole-genome sequence alignment for one diploid individual (BAM/CRAM format).
  • Reference Genome: The relevant high-quality reference genome assembly.
  • Software: PSMC (https://github.com/lh3/psmc), samtools, bcftools.
  • Key Parameters: Mutation rate (μ), Generation time (g).

Procedure:

Step 1: Variant Calling and Consensus Sequence Generation

  • Use samtools mpileup and bcftools to call variants.

  • Generate the diploid consensus sequence in FASTQ format using the vcf2fq utility (part of samtools). The -d and -D parameters filter out extreme depths.

Step 2: Format Conversion for PSMC

  • Convert the FASTQ to a PSMC input file (.psmcfa).

Step 3: Run PSMC Inference

  • Execute the PSMC algorithm with recommended parameters.

Step 4: Plotting Results

  • Generate a plot using the psmc_plot.pl utility script.

Visualization: The PSMC Coalescent Inference Pipeline

Diagram Title: PSMC Data Processing and Coalescent Inference Flow

psmc_flow Start Diploid WGS (BAM Alignment) VarCall Variant Calling (samtools/bcftools) Start->VarCall mpileup ConsensusFQ Diploid Consensus (FASTQ) VarCall->ConsensusFQ vcf2fq PSMCfa Binary Sequence (.psmcfa) ConsensusFQ->PSMCfa fq2psmcfa PSMCrun Coalescent HMM Inference (PSMC) PSMCfa->PSMCrun Coalescent HMM Output Ne Trajectory (Plot & Data) PSMCrun->Output Plot with μ, g

Item Function & Relevance
High-Coverage WGS Data Essential input. Requires >20x coverage for accurate heterozygous site calling, minimizing sequencing error bias.
Species-Specific μ & g Critical calibration parameters. Mutation rate (μ) and generation time (g) convert coalescent units to real years and population size. Must be sourced from prior literature (e.g., fossil records, pedigree studies).
Reference Genome A contiguous, well-assembled reference is needed for accurate read mapping and variant calling. Fragmented assemblies cause spurious signals.
PSMC Software Suite Core analysis toolkit (psmc, utils/fq2psmcfa, utils/psmc_plot.pl).
Variant Caller (bcftools) Generates the raw variant data from aligned reads to build the consensus sequence.
Computational Resources PSMC is single-threaded but memory-efficient. Runs on standard servers; bootstrapping requires multiple iterations.

Data Presentation: PSMC Inferred Parameters & Interpretation

Table 1: Key Parameters for PSMC Analysis and Their Impact

Parameter Symbol Typical Source Effect on Inferred Trajectory
Mutation Rate μ Literature, pedigree studies Scales time & Ne linearly. Incorrect μ shifts events earlier/later.
Generation Time g Life history data Scales time linearly. Incorrect g compresses/expands timeline.
θ Initialization (-r flag) Estimated from data Influences initial recombination rate estimate; moderate effect.
Time Interval Pattern (-p flag) Heuristic optimization Defines resolution of inferred epochs; must match data quantity.

Table 2: Example PSMC Output Interpretation for Model Organisms

Species Inferred Bottleneck (Ne low point) Inferred Expansion (Ne high point) Key Biological Interpretation Context
Human (European) ~10-20k years ago (Ne~2,000) ~100-200k years ago (Ne~15,000) Out-of-Africa bottleneck, then pre-exodus expansion.
Drosophila melanogaster N/A (Recent expansion) ~10k-50k years ago Continuous expansion associated with human agriculture.
Dog (Canis lupus familiaris) ~10k years ago Pre-domestication (>15k years ago) Domestication bottleneck from larger wolf ancestor population.

Inferences of ancient population sizes, enabled by methods like the Pairwise Sequentially Markovian Coalescent (PSMC), provide a critical evolutionary context for modern human disease. Population bottlenecks and expansions have shaped the allele frequency spectrum, influencing the prevalence and penetrance of genetic variants. Understanding these demographic histories is essential for accurately identifying disease-causing variants, interpreting genome-wide association studies (GWAS), and developing targeted therapeutic strategies. This application note details the integration of PSMC-based demographic insights into contemporary disease research workflows.

Demographic History Data from PSMC Analysis

The PSMC method uses the coalescence times derived from a single diploid genome sequence to infer historical effective population size (Ne) over time. Key demographic events leave distinct signatures.

Table 1: Inferred Ancient Demographic Events in Modern Human Lineage

Evolutionary Event Approximate Time (Years Ago) Inferred Effective Population Size (Ne) Impact on Genetic Load
Out-of-Africa Bottleneck 50,000 - 100,000 ~1,000 - 5,000 Reduced genetic diversity, increased rare alleles due to drift.
Human-Neanderthal Admixture ~50,000 N/A (Introgression) Introduced archaic variants, some conferring adaptive immunity, others disease risk.
Neolithic Expansion ~10,000 Gradual increase Increased population size allowed accumulation of slightly deleterious variants.

Table 2: Disease Research Implications of Demographic History

Demographic Scenario Genetic Consequence Challenge for Disease Research PSMC-Informed Strategy
Severe Bottleneck Loss of rare variants, increased frequency of some deleterious alleles (genetic hitchhiking). False positives in GWAS due to population structure; difficulty mapping rare diseases. Use demographic-aware variant filtering; adjust linkage disequilibrium expectations.
Recent Rapid Expansion Surge in very rare, potentially deleterious de novo variants. Distinguishing pathogenic from benign rare variants. Leverage ancient Ne to calibrate mutation rate models and variant age estimators.
Population Substructuring Divergent allele frequencies across sub-populations. Health disparities; poor generalizability of polygenic risk scores. Construct population-specific demographic models as a baseline for association testing.

Protocols and Methodologies

Protocol 1: Generating a PSMC Inference from High-Coverage WGS Data

Objective: To infer the historical effective population size trajectory from a single individual’s whole-genome sequence. Materials: High-coverage (>30X) WGS BAM file, reference genome (e.g., GRCh38), PSMC software suite, samtools, bcftools. Procedure:

  • Variant Calling and Consensus Generation:
    • Use samtools mpileup and bcftools call to generate a VCF of homozygous/heterozygous sites.
    • Generate a diploid consensus sequence in FASTQ format:

  • PSMC Input Preparation:
    • Convert the consensus FASTQ to a PSMC-friendly FASTA format.
    • Generate the .psmcfa file using the fq2psmcfa utility: fq2psmcfa consensus.fq > output.psmcfa.
  • Run PSMC Inference:
    • Execute PSMC with chosen parameters (e.g., atomic time intervals, number of iterations):

  • Plotting Results:
    • Use the psmc_plot.pl utility with a generation time (e.g., 25 years) and mutation rate (e.g., 2.5e-8 per generation) to generate a Ne vs. Time plot: psmc_plot.pl -g 25 -u 2.5e-08 plot_title output.psmc.

Protocol 2: Integrating PSMC Demography into GWAS Variant Prioritization

Objective: To filter and prioritize GWAS hits using demographic history to account for background selection and drift. Materials: GWAS summary statistics, PSMC-inferred Ne trajectory for the study population, B-statistic or similar background selection estimate tools. Procedure:

  • Calculate Neutral Site Frequency Spectrum (SFS): Simulate the expected SFS under the PSMC-inferred demographic model using software like ms or stdpopsim.
  • Model Background Selection: Use the inferred Ne trajectory to compute a per-site reduction in effective population size due to linked selection (e.g., using BEDTools and genetic maps).
  • Adjust Association P-values: Apply a genomic correction method that incorporates the demographic and background selection model to control for confounding and reduce false positives.
  • Variant Annotation & Prioritization: Annotate candidate variants with their estimated age of origin (using methods like GEVA) calibrated by the PSMC timeline, prioritizing older variants shared across populations for common diseases or younger variants for severe, rare disorders.

Visualizations

psmc_workflow WGS Diploid Whole Genome Sequence Consensus Consensus Sequence (.psmcfa) WGS->Consensus Variant Calling PSMC PSMC Algorithm (Coalescent HMM) Consensus->PSMC Input Model Demographic Model (Ne over Time) PSMC->Model Inference GWAS GWAS/Association Study Model->GWAS Provides Evolutionary Context Filter Demography-Aware Variant Filtering Model->Filter Calibration GWAS->Filter Hits & P-values Target Prioritized High- Confidence Variants Filter->Target

Diagram 1: PSMC Data Informs Modern GWAS Analysis (80 chars)

bottleneck_effect LargeAncPop Large Ancestral Population Bottleneck Population Bottleneck LargeAncPop->Bottleneck SmallPop Small Founding Population Bottleneck->SmallPop Reduced Ne Drift Genetic Drift Intensifies Bottleneck->Drift Causes ModernPop Expanded Modern Population SmallPop->ModernPop Rapid Growth FreqChange Allele Frequency Randomly Shifts SmallPop->FreqChange Drift->FreqChange Load Altered Genetic Load Profile FreqChange->Load Results in

Diagram 2: Bottleneck Alters Genetic Load (58 chars)

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for PSMC-Informed Disease Research

Item/Category Function/Description Example/Tool
High-Quality WGS Data Foundational data for accurate PSMC inference and variant discovery. Requires high coverage and low contamination. Illumina NovaSeq, Ultima Genomics, PacBio HiFi for complex regions.
PSMC Software Suite Core algorithm package for demographic inference from a single genome. PSMC, MSMC, SMC++.
Coalescent Simulation Tools Simulate genetic data under inferred demographic models to generate null distributions for testing. ms, stdpopsim, SLiM.
Variant Age Estimator Estimates the time to most recent common ancestor (TMRCA) of a haplotype carrying a specific variant, calibrated by demography. GEVA, RELATE.
Background Selection Maps Quantifies the reduction in neutral diversity due to linked purifying selection, dependent on local recombination rate and Ne. B-statistic from BEDTools, UCNEbase.
Population-Specific Reference Panels Enables accurate haplotype phasing and frequency estimation within the demographic context of the study population. 1000 Genomes Project, gnomAD, population-specific cohorts like TOPMed.

Step-by-Step PSMC Analysis: From Raw Reads to Population History Curves

Within a thesis utilizing the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size, the preprocessing of whole-genome sequencing data is a critical, foundational step. The accuracy of PSMC's inference of historical effective population size (Ne) trajectories is directly contingent upon the quality of its input file—a diploid consensus sequence in FASTQ format. This document details the application notes and protocols for the three-stage preprocessing pipeline: read alignment to a reference genome, conversion to a sorted BAM/BCF format, and the generation of the final masked consensus FASTA for PSMC analysis. The pipeline ensures that artifacts from sequencing, alignment, and variant calling are minimized to prevent spurious signals in the reconstructed demography.

Core Pipeline Workflow and Visualization

The following diagram illustrates the logical flow and dependencies of the complete preprocessing pipeline.

Diagram Title: PSMC Preprocessing Pipeline Workflow

psmc_preprocess Raw_Reads Raw Sequencing Reads (FASTQ) Align Alignment (BWA mem) Raw_Reads->Align Ref_Genome Reference Genome (FASTA) Ref_Genome->Align Var_Call Variant Calling (bcftools mpileup/call) Ref_Genome->Var_Call Consensus Generate Consensus (bcftools consensus) Ref_Genome->Consensus SAM Aligned Reads (SAM) Align->SAM Sort_Index Sort & Index (samtools) SAM->Sort_Index BAM Sorted BAM File Sort_Index->BAM BAM->Var_Call VCF Variant Call File (VCF) Var_Call->VCF VCF->Consensus Mask Apply Mask (bedtools) Consensus->Mask Final_Input PSMC Input (Masked FASTA) Mask->Final_Input

Detailed Experimental Protocols

Protocol: Read Alignment to Reference Genome

Objective: Map high-throughput sequencing reads to a reference genome to determine their genomic origins. Principle: The BWA-MEM algorithm performs fast and accurate alignment of long reads (70bp-1Mbp) against large reference genomes, handling splice-aware alignment.

Methodology:

  • Reference Genome Indexing: Create an index of the reference genome FASTA file.

  • Read Alignment: Map paired-end reads to the indexed reference.

    • -t: Number of threads.
    • -M: Mark shorter split hits as secondary (important for Picard compatibility).

Quality Control Check: Use samtools flagstat aligned_output.sam to assess mapping rate. A successful mammalian genome alignment typically yields >90% properly paired reads.

Protocol: Format Conversion and Processing (SAM to BAM/BCF)

Objective: Convert the human-readable SAM file to a compressed, sorted BAM file, then call genomic variants. Principle: SAMtools and BCFtools provide efficient utilities for manipulating alignments and calling variants, generating the necessary data for consensus sequence generation.

Methodology:

  • SAM to Sorted BAM: Convert, sort by coordinate, and index the BAM file.

  • Variant Calling: Generate a VCF file containing SNP and indel calls.

    • -c: Consensus caller (suitable for low-coverage data).
    • -v: Output variant sites only.

Data Filtering: It is standard to filter the VCF for quality (e.g., bcftools filter -s LowQual -e '%QUAL<20 || DP>100' raw_variants.vcf > filtered_variants.vcf).

Protocol: Generating the PSMC Input File

Objective: Create a diploid, masked consensus FASTA sequence from the reference and variant calls. Principle: The consensus sequence reflects the genotype of the sequenced individual. Masking low-complexity and unreliable regions prevents misinference of coalescent events in PSMC.

Methodology:

  • Generate Raw Consensus: Use the reference and filtered VCF.

  • Apply Mask File: Exclude low-mappability regions (e.g., from the 1000 Genomes Project strict mask).

  • Final Formatting: Ensure the FASTA contains one sequence per autosome, with heterozygous sites represented as K (G/T), S (G/C), W (A/T), Y (C/T), M (A/C), or R (A/G). The fq2psmcfa script (from PSMC) is then used.

Data Presentation

Table 1: Key Performance Metrics for Preprocessing Pipeline Steps (Example Human Genome Data)

Pipeline Stage Key Metric Typical Target Value Tool/Command for Assessment
Alignment (BWA-MEM) Overall Alignment Rate > 90% samtools flagstat
Properly Paired Rate > 85% samtools flagstat
Variant Calling (BCF) Number of Called SNPs ~3-4 million (for 15X coverage) bcftools stats
Ts/Tv Ratio (Human) ~2.0-2.1 bcftools stats
Consensus Generation Heterozygous Site Frequency ~1 per 1000 bp Custom script
Fraction of Genome Masked 10-15% bedtools coverage

Table 2: Impact of Preprocessing Quality on PSMC Inference

Preprocessing Artifact Potential Effect on PSMC Output Mitigation Strategy
High Rate of Misaligned Reads Artificial inflation of recent population size Use stringent mapping quality filters (e.g., MAPQ>30).
Inadequate Masking of Repeat Regions False, deep bottlenecks in ancient history Apply comprehensive mappability mask (e.g., 1000G strict).
Incorrect Handling of Heterozygotes Distortion of the time scale and Ne magnitude Use bcftools call -c and proper fq2psmcfa conversion.
Residual Sequencing Errors Noise and spurious fluctuations in the Ne trajectory Apply base quality filtering during variant calling.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Tools for the PSMC Preprocessing Pipeline

Item Function / Role in Pipeline Example / Note
Reference Genome Linear coordinate system for aligning reads and calling variants. Must be high-quality and contiguous. Human: GRCh38/hg38. Non-model organisms require a high-quality de novo assembly.
Alignment Software Performs the mapping of short DNA sequences to the reference genome. BWA-MEM, Bowtie2. BWA-MEM is standard for Illumina data >70bp.
Sequence Toolkit Suite of programs for manipulating SAM/BAM/CRAM format files, including sorting, indexing, and statistics. SAMtools. A fundamental dependency.
Variant Caller Identifies positions where the sequenced individual differs from the reference genome (SNPs, indels). BCFtools, GATK. BCFtools is lightweight and effective for consensus generation.
Genome Mask A BED file defining genomic regions with poor mappability or high error rates, which must be excluded from PSMC analysis. 1000 Genomes Project "strict mask" for human data. For other species, RepeatMasker output can be used.
Interval Tool Applies genomic masks or performs operations on genomic intervals. BEDTools. Critical for the masking step.
PSMC Utilities Converts the masked consensus FASTA into the final binary input format for the PSMC program. fq2psmcfa from the PSMC package.
High-Performance Computing (HPC) Resources The alignment and variant calling steps are computationally intensive, requiring significant memory and CPU time. Access to a cluster with 16+ cores and 32+ GB RAM per sample is recommended.

Application Notes: PSMC Parameterization in Demography Inference

Within the broader thesis on inferring ancient population size histories using the Pairwise Sequentially Markovian Coalescent (PSMC) method, precise parameter configuration is paramount. The PSMC model decodes the time to the Most Recent Common Ancestor (TMRCA) between two homologous chromosomes to reconstruct effective population size (Ne) trajectories over time. Key command-line parameters directly control the model's structure, scaling, and data processing, significantly impacting the biological interpretation of results.

Table 1: Core PSMC Parameters and Their Quantitative Impact

Parameter Typical Range/Value Function in Thesis Context Impact on Ne Inference
-p "pattern+interval" e.g., "4+25*2+4+6" Defines the atomic time segment pattern for the hidden Markov model (HMM). Determines temporal resolution and smoothing. Fewer segments (4+30) give smoother, less detailed histories; more complex patterns (4+25*2+4+6) allow higher resolution at recent times.
-t max T<sub>MRCA</sub> Default: 15 Maximum allowed TMRCA in units of 2N0 generations (N0=initial Ne). Truncates coalescent events beyond this time. Too low a value clips ancient population size estimates; too high can introduce noise.
-r mu/lambda e.g., 5 or 2 Ratio of per-generation mutation rate (μ) to recombination rate (c). Scales genetic distance to physical time. Critical for time scaling. -r 5 implies μ/c = 5. Incorrect ratio introduces systematic bias in the absolute timing of events.
-d rho Default: -1 (estimate) Minimum recombination rate (ρ = 4N0c). When set, overrides internal estimation. Used for sensitivity testing of recombination rate assumptions.
-s N/A Skip sites with missing data or multiple alleles. Ensures data quality. Mandatory for diploid genome analysis to avoid phasing errors.

Experimental Protocols

Protocol 1: Standard PSMC Analysis for a Single Diploid Genome

  • Data Preparation: Start with a consensus FASTA sequence for a diploid individual (e.g., from a reference-aligned BAM file). Heterozygous sites must be called and represented as K/k (for G/T) or other IUPAC codes.
  • Generate Input File: Use the fq2psmcfa tool to convert the FASTA to the required PSMC input format (.psmcfa). This encodes the sequence into a binary representation of heterozygous and homozygous sites.

  • Run PSMC with Parameterization: Execute the PSMC algorithm, defining key parameters based on prior knowledge (e.g., known μ and c for the species).

  • Time Scaling: Convert the relative time (2N0 generations) output by PSMC to absolute years using known generation time (g) and mutation rate (μ).

  • Visualization: Plot the sample.psmc file using the provided psmc_plot.pl script.

Protocol 2: Parameter Sensitivity and Robustness Testing A core chapter of the thesis must assess inference robustness.

  • Pattern Sensitivity (-p): Run PSMC on the same .psmcfa file with 5-7 different -p patterns (e.g., "4+20", "4+25*2+4+6", "4+30*3"). Plot results together to identify consistent features vs. artifacts.
  • Recombination-Mutation Scaling (-r): Execute PSMC with -r values spanning plausible bounds (e.g., 1, 2, 5, 10). This tests how uncertainty in the μ/c ratio propagates to uncertainty in the timing of Ne changes.
  • Bootstrap Analysis: Use the psmc utility's -b option to generate 100 bootstrapped sequences from the original .psmcfa file. Run PSMC with fixed optimal parameters on each bootstrap replicate to generate confidence intervals for the Ne trajectory.

Visualization of Workflows and Relationships

G DataPrep Diploid Genome (Consensus FASTA) InputGen fq2psmcfa DataPrep->InputGen PSMCRun PSMC Core Algorithm InputGen->PSMCRun .psmcfa Output .psmc File (Relative Time) PSMCRun->Output ScaledResult Scaled Nₑ(t) Plot (Absolute Time) Output->ScaledResult Apply μ & g ParamBox -p: Pattern -t: Max TMRCA -r: μ/c ratio -s: Skip sites ParamBox->PSMCRun

Title: PSMC Analysis Workflow with Key Parameters

G SegPattern Atomic Segment Pattern (-p) HMMStates HMM Discrete Time States SegPattern->HMMStates Defines CoalescentEvents Coalescent Event Density HMMStates->CoalescentEvents Models TMRCA in NeTrajectory Inferred Nₑ(t) Trajectory CoalescentEvents->NeTrajectory Inversely Proportional to

Title: Logical Relationship of -p to Nₑ Inference

The Scientist's Toolkit: PSMC Analysis Reagents & Materials

Table 2: Essential Research Reagents and Solutions for PSMC-Based Studies

Item Function in PSMC Context Specification Notes
High-Coverage Whole Genome Sequence (Diploid Individual) Raw data source. Provides the heterozygous/homozygous site pattern needed for coalescent inference. Minimum 20x coverage recommended. Must be from a single diploid individual or a phased diploid genome.
Species-Specific Mutation Rate (μ) Converts relative PSMC time to calendar years. A critical scaling parameter. Often derived from pedigree studies or fossil-calibrated phylogenies. A major source of uncertainty.
Generation Time (g) Converts generational time scaled by PSMC to absolute years. Average age of reproduction. Obtained from life history studies.
Recombination Rate (c) Used with μ to set the -r parameter. Scales genetic map. Often inferred from genetic maps or LD-based studies. The μ/c ratio (-r) is directly used.
Reference Genome Assembly Platform for aligning sequence reads to call consensus. Must be contiguous (chromosome-level) to allow accurate inference of long-range coalescent events.
PSMC Software Suite Core analysis toolkit. Includes psmc, fq2psmcfa, psmc2history.pl, psmc_plot.pl. Requires compilation on Unix-like systems. Key parameters are passed to the psmc command.
Computational Cluster Resources Enables multiple runs for bootstrapping and parameter sensitivity tests. PSMC is single-threaded but many independent runs can be parallelized. Memory usage is moderate.

This protocol details the execution of the Pairwise Sequentially Markovian Coalescent (PSMC) algorithm, a core methodological component of a thesis investigating historical population size dynamics from a single diploid genome sequence. The PSMC model infers population size history by analyzing the distribution of heterozygosity across the genome, providing critical insights into ancient demographic events relevant to understanding genetic diversity, selection pressures, and disease susceptibility.

Essential Research Reagent Solutions & Materials

Table 1: Key Computational Tools and Data for PSMC Analysis

Item Function/Description
SAMtools Manipulates alignments in SAM/BAM format; used for variant calling and data preparation.
BCFtools Processes VCF and BCF files; used for generating the consensus sequence input for PSMC.
PSMC Software Implements the PSMC inference algorithm (https://github.com/lh3/psmc).
High-Quality Reference Genome A well-assembled reference for the species of interest for read alignment.
Diploid Genome Sequence Data High-coverage (typically >20x) whole-genome sequencing data from a single individual in BAM format.
Perl & GNU Awk Scripting utilities required for running auxiliary PSMC utility scripts.

Detailed Experimental Protocol

Data Preparation and Consensus Sequence Generation

Objective: Generate a diploid consensus FASTA sequence in a format suitable for PSMC.

Procedure:

  • Variant Calling:

  • Generate PSMC Input (FASTQ to PSMCfa): Convert the FASTQ consensus to the required binary format using the fq2psmcfa utility provided with PSMC.

    • -q20: Filters out bases with quality score below 20.

Core PSMC Inference Execution

Objective: Run the PSMC model to estimate population size history.

Procedure:

  • Execute PSMC: The primary command runs the iterative PSMC algorithm on the prepared .psmcfa file.

  • Parameter Explanation:
    • -N25: Maximum number of iterations (25).
    • -t15: Maximum 2N0 coalescent time (15 in units of 2N0 generations).
    • -r5: Initial theta/rho ratio (5).
    • -p "4+25*2+4+6": Defines the atomic time interval pattern. This string specifies 4 free interval parameters for the most recent epoch, followed by 25 intervals whose parameters are tied in pairs (for older epochs), then 4 and 6 tied parameters for the oldest epochs.

Table 2: Key PSMC Command-Line Parameters and Typical Values

Parameter Typical Value Description
-N 25-30 Maximum number of EM algorithm iterations.
-t 5-15 Upper limit of the TMRCA (time to most recent common ancestor).
-r 1-10 Initial ratio of recombination rate to mutation rate (ρ/θ).
-p "4+25*2+4+6" Pattern of time intervals for parameter estimation.
-o output.psmc Specifies the output file name.

Post-processing and Visualization

Objective: Convert results to a plottable format and generate the population history plot.

Procedure:

  • Generate Plot Data: Use the psmc_plot.pl utility to create a plottable file.

  • Plotting: The command generates a GNUplot script (plot_sample.gp). Execute it to produce the final PDF plot.

    The output plot_sample.pdf displays inferred population size (N_e) over time (log scale).

Workflow and Analysis Diagrams

psmc_workflow BAM Input: Aligned BAM (Single Diploid Genome) VCF Variant Calling (samtools/bcftools) BAM->VCF REF Reference Genome (FASTA) REF->VCF CONSENSUS_FQ Diploid Consensus (FASTQ) VCF->CONSENSUS_FQ PSMC_IN Binary Input (.psmcfa file) CONSENSUS_FQ->PSMC_IN fq2psmcfa PSMC_RUN PSMC Inference (psmc command) PSMC_IN->PSMC_RUN PSMC_OUT Model Output (.psmc file) PSMC_RUN->PSMC_OUT PLOT_DATA Plot Data Generation (psmc_plot.pl) PSMC_OUT->PLOT_DATA PLOT Population History Plot (PDF) PLOT_DATA->PLOT

Diagram 1: PSMC Analysis Main Workflow

Diagram 2: PSMC Algorithm Logic and Data Flow

Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size dynamics, the final visualization step is critical. The PSMC algorithm transforms a single diploid genome sequence into a historical timeline of effective population size (Ne). The psmc_plot.pl Perl script is the standard tool for converting PSMC output into an interpretable plot. This protocol details its application, enabling researchers to visualize population expansions, bottlenecks, and divergence events, which are essential contexts for understanding genetic diversity and informing biomedical research on population-specific genetic risks.

Core Application Notes for psmc_plot.pl

The psmc_plot.pl script post-processes the .psmc result file. It scales the time axis using a mutation rate (µ) and generation time (g), and the Ne axis using the genome size. It can overlay multiple runs for comparison.

Key Parameters and Quantitative Data

Table 1: Essential Scaling Parameters for psmc_plot.pl

Parameter Symbol Typical Range Default in Script Impact on Axis
Mutation Rate µ 1.0e-8 to 2.5e-8 per bp/gen Not Set Scales time axis; higher µ compresses time (makes history look younger).
Generation Time g 1-30 years (species-dependent) Not Set Scales time axis; longer g stretches time (makes history look older).
Genome Size G Species-specific (e.g., 3e9 for human) Inferred from PSMC Scales Ne axis; larger G increases inferred Ne.

Table 2: Common psmc_plot.pl Command-line Options

Option Argument Example Purpose
-u -u 1.2e-8 Specifies the mutation rate per nucleotide per generation.
-g -g 25 Specifies the average generation time in years.
-X -X 10000 Sets the maximum plot time (in scaled generations).
-x -x 5 Sets the minimum plot time (in scaled generations).
-Y -Y 50000 Sets the maximum plot Ne.
-M "Sample1,Sample2" Comma-separated labels for multiple files.
-n -n 30 Number of free interval parameters used in PSMC run.
-T -T "Population History" Plot title.

Table 3: Example Scaled Output Data Points (Human, µ=1.25e-8, g=25)

Time (kYears ago) Scaled Generation Ne (x10^4) Interpretation Epoch
~0.2 10 ~7.5 Holocene (recent)
~10 400 ~2.0 Post-Out-of-Africa bottleneck
~100 4000 ~8.0 Pre-bottleneck expansion
~500 20000 ~1.5 Ancient bottleneck

Detailed Experimental Protocol

Protocol 1: Generating a Basic PSMC Plot

Objective: To generate a standard Ne over time plot from a single .psmc result file.

Materials: See "The Scientist's Toolkit" below.

Procedure:

  • Prepare the PSMC Output File: Ensure your PSMC analysis (psmc -p ...) is complete, yielding a file (e.g., sample.psmc).
  • Construct the Basic Command: Navigate to the directory containing psmc_plot.pl and your data. Use the following command structure:

  • Execute the Command: Run the command in your terminal. The script generates two files:
    • plot_result.par: Parameters used for plotting.
    • plot_result.png: The final plot image.
  • Verify Output: Open the .png file. The x-axis will be time (in thousands of years if using -g), and the y-axis will be effective population size (Ne).

Protocol 2: Plotting Multiple Samples for Comparison

Objective: To visualize and compare the demographic histories of multiple individuals or species.

Procedure:

  • Prepare Multiple PSMC Files: Generate .psmc files for each sample (e.g., human.psmc, chimp.psmc, gorilla.psmc).
  • Construct the Multi-Plot Command: Use the -M option to provide labels and list all files.

  • Customize Axes (Optional): Use -X, -x, -Y to set consistent axis limits across all plots for fair comparison.

  • Generate and Interpret: The script produces one plot with all trajectories overlaid, allowing direct comparison of bottleneck timing and severity.

Protocol 3: Bootstrapping and Confidence Intervals

Objective: To assess the confidence in the inferred demographic trajectory.

Procedure:

  • Generate Bootstrap Files: During PSMC analysis, use the -b option to produce 100 bootstrap replicate files (e.g., sample.bootstrap.psmc).
  • Plot with Confidence Intervals: Provide the bootstrap file to psmc_plot.pl.

  • Output Interpretation: The resulting plot will show the main trajectory (solid line) and a shaded region representing the confidence intervals from bootstrap replicates, highlighting periods of high uncertainty.

Visualizations

G PSMC_Output PSMC Output File (.psmc format) psmc_plot_pl psmc_plot.pl Script PSMC_Output->psmc_plot_pl Parameters Scaling Parameters (µ, g, G) Parameters->psmc_plot_pl Plot_Data Scaled Plot Data (Generations, Ne) psmc_plot_pl->Plot_Data Visualization Final Plot (Ne vs. Time) Plot_Data->Visualization

Title: Workflow for Generating a PSMC Plot

G cluster_axis Impact of Scaling Parameters Mu µ Mutation Rate Time Time Axis Mu->Time Increase µ → Compress Time Gen g Generation Time Gen->Time Increase g → Stretch Time Ne Ne Axis

Title: How Parameters Scale PSMC Plot Axes

The Scientist's Toolkit

Table 4: Essential Research Reagent Solutions for PSMC Plotting

Item Function in Protocol Example/Note
PSMC Software Suite Contains the psmc_plot.pl Perl script and the psmc main program. Download from https://github.com/lh3/psmc.
Perl Interpreter Required to execute the psmc_plot.pl script. Version 5.x or higher.
Mutation Rate (µ) Critical scaling factor to convert coalescent time to years. Species-specific; a major source of uncertainty. E.g., Human: ~1.2e-8.
Generation Time (g) Average time between generations, scales time axis to real years. From life history data. E.g., Human: 25-30 years; Mouse: 1 year.
High-Quality Diploid Genome Sequence The raw input for PSMC inference before plotting. High coverage (>20x), long insert size libraries improve accuracy.
Plotting/Customization Scripts (R, Python) For advanced customization of psmc_plot.pl output (e.g., combining plots, adjusting styles). ggplot2 (R), matplotlib (Python).
Compute Cluster/Server PSMC and bootstrapping are computationally intensive; plotting is lightweight. Needed for the initial PSMC analysis, not for plotting itself.

Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size, this document provides critical application notes for interpreting its primary output: the PSMC plot. This plot reconstructs historical effective population size (Ne) over time, offering insights into evolutionary bottlenecks, expansions, and demographic history. Accurate interpretation and scaling are paramount for deriving biologically meaningful conclusions relevant to evolutionary biology, conservation genetics, and identifying population isolates useful in biomedical research.

Core Principles of PSMC Output Interpretation

The PSMC algorithm infers Ne through time by analyzing the distribution of heterozygosity in a diploid genome sequence. Time is measured in units of generations scaled by mutation rate (τ = μ * t), and population size is scaled by the mutation rate per generation (θ = 4Neμ). The raw output is therefore on a relative, scaled axis.

Key Features and Their Demographic Meaning

  • Declining Segment: Indicates a population bottleneck (decrease in Ne).
  • Rising Segment: Indicates a population expansion (increase in Ne).
  • Plateau: Indicates a period of relatively stable population size.
  • Recent Time Boundary: Inference is unreliable for very recent (last ~104 years) and very ancient times (beyond 1-10 million years) due to insufficient coalescent events or mutations.

Table 1: Critical Parameters for Temporal Scaling and Their Effect on Interpretation

Parameter Symbol Typical Input Role in Scaling Impact on Plot if Mis-specified
Mutation Rate per Generation μ 1.2e-8 to 2.5e-8 Scales both time (τ/μ) and size (θ/4μ). Underestimation of μ inflates both Ne and time.
Generation Time in Years g 2-30 years Converts generational time to years (t = τ * g / μ). Underestimation of g compresses the timeline.
Scaled Time (τ) τ PSMC direct output τ = μ * t / g N/A (Raw PSMC x-axis).
Scaled Population Size (θ) θ PSMC direct output θ = 4Ne * μ N/A (Raw PSMC y-axis).
Effective Population Size Ne Calculated Ne = θ / (4μ) Final inferred demographic history.
Time in Years Before Present t Calculated t = (τ * g) / μ Final inferred timeline of events.

Table 2: Example Scaling Scenarios for a Hypothetical PSMC Output Point

Raw PSMC Output (τ, θ) Assumed μ Assumed g (yrs) Calculated Ne Calculated Time (Years Ago)
τ=0.1, θ=0.001 1.2e-8 20 20,833 166,667
τ=0.1, θ=0.001 2.5e-8 25 10,000 100,000
τ=0.1, θ=0.1 1.2e-8 20 2,083,333 166,667

Detailed Protocols

Protocol 1: Temporal Scaling of PSMC Results

Objective: Convert raw PSMC (τ, θ) coordinates into biologically interpretable values (Ne, Years Ago). Materials: PSMC output file (*.psmc), scaling parameters (μ, g). Procedure:

  • Extract Data: Parse the *.psmc file to obtain the time intervals (tk) and λk values (θ estimates per interval). Use provided scripts (e.g., psmc_plot.pl) or custom code.
  • Calculate Ne: For each interval, compute Ne = λ_k / (4 * μ). μ is the per-generation mutation rate.
  • Calculate Time: For each time interval boundary, compute time in years before present: t = (t_k * g) / μ, where g is the generation time in years.
  • Plot: Generate a final plot with scaled axes (Years Ago on x-axis, Ne on y-axis, both log-scaled). Note: Always run sensitivity analyses using a plausible range of μ and g values (see Table 1).

Protocol 2: Bootstrapping for Confidence Assessment

Objective: Assess robustness and variance of the inferred demographic trajectory. Materials: PSMC bootstrap output files (*.bspsmc). Procedure:

  • Generate Bootstrap Files: During PSMC analysis, use the -b option to generate multiple (e.g., 100) bootstrap replicates from the original consensus sequence.
  • Scale Each Replicate: Apply Protocol 1 to each bootstrap file independently using the same μ and g.
  • Visualize: Plot all scaled bootstrap trajectories on the same axes as the main result. The envelope of these lines indicates confidence intervals; wide spreads suggest uncertainty in timing or magnitude of events.

Mandatory Visualizations

G cluster_inputs Inputs & Parameters cluster_psmc PSMC Core Algorithm cluster_scaling Temporal Scaling cluster_output Interpretation DIP Diploid Genome Sequence PSMC PSMC Inference (Raw Output) DIP->PSMC MU Mutation Rate (μ) SCALE Apply Scaling Formulas MU->SCALE GT Generation Time (g) GT->SCALE RAW Scaled Output: τ (time), θ (size) PSMC->RAW RAW->SCALE FINAL Final Metrics: Nₑ (size), Years Ago SCALE->FINAL PLOT PSMC Plot: Demographic History FINAL->PLOT BOT Identify Bottleneck PLOT->BOT EXP Identify Expansion PLOT->EXP

Diagram 1 Title: PSMC Analysis and Interpretation Workflow (76 chars)

G cluster_axes cluster_legend Plot Elements X Time (Log Scale, Years Ago) Y Effective Population Size, Nₑ (Log Scale) p1 p2 p1->p2 p3 p2->p3 p4 p3->p4 p5 p4->p5 p6 p5->p6 p7 p6->p7 BOT Bottleneck (Declining Nₑ) EXP Expansion (Rising Nₑ) STA Stable Period (Plateau) b1 b2 b1->b2 b3 b2->b3 b4 b3->b4 b5 b4->b5 t1 t2 t1->t2 t3 t2->t3 t4 t3->t4 t5 t4->t5 L1 Inferred Nₑ Trajectory L2 Bootstrap Replicates

Diagram 2 Title: Key Features of a PSMC Plot with Bootstrap (72 chars)

The Scientist's Toolkit: Key Research Reagent Solutions

Table 3: Essential Materials for PSMC-Based Demographic Inference

Item Function in PSMC Analysis Notes for Researchers
High-Coverage Whole Genome Sequence (WGS) Data The primary input. PSMC requires a diploid, phased genome with high coverage (>20X) to accurately call heterozygous sites and avoid sequencing bias. Use from a single individual. Long-read sequencing improves phasing.
PSMC Software Package Implements the core PSMC algorithm for inferring population size history from the genomic data. Available from GitHub (lh3/psmc). Includes utilities for plotting and bootstrapping.
Mutation Rate (μ) Estimate Critical scaling parameter to convert PSMC's scaled output into absolute time and population size. Species-specific rate is ideal. Use a plausible range (e.g., 1.0e-8 - 2.5e-8 for mammals) for sensitivity analysis.
Generation Time (g) Estimate Critical scaling parameter to convert generational time into calendar years. Based on species biology (age at first reproduction, average reproductive lifespan).
Bootstrap Resampling Scripts Generate multiple datasets from the original to assess the confidence and variance of the PSMC inference. Built into the PSMC package (psmc -b).
Plotting & Scaling Scripts Convert raw PSMC output into final scaled plots, often incorporating μ and g. Use psmc_plot.pl or develop custom scripts (e.g., in R/Python) for greater flexibility.
Reference Genome & Annotation For alignment and variant calling prior to PSMC analysis. Ensures accurate mapping and identification of autosomal regions. Avoid sex chromosomes and highly repetitive regions.

Application Notes: Method Evolution and Core Concepts

The Pairwise Sequentially Markovian Coalescent (PSMC) method, foundational for inferring historical population sizes from a single diploid genome, has evolved to address its limitations, particularly the inability to infer recent (<10,000 years) demographic history. This has led to the development of advanced extensions like PSMC-IBD and PSMC' (PSMC2).

PSMC-IBD leverages Identity-by-Descent (IBD) segments shared between individuals within a population. Recent segments are abundant and provide direct information about recent coalescence events, effectively "zooming in" on the last dozens to hundreds of generations.

PSMC' (PSMC2) represents a direct, model-based extension of the original PSMC algorithm. It modifies the underlying hidden Markov model (HMM) and its parameterization to improve resolution in the recent past and can incorporate multiple genomes simultaneously, enhancing precision.

The application of these methods has transcended conservation and evolutionary genetics, finding powerful use cases in cancer genomics—tracing the demographic history of tumor cell populations—and pathogen evolution—reconstructing the fluctuation of effective population size in viruses or bacteria through an epidemic.

Quantitative Comparison of PSMC Methods

Table 1: Comparative Summary of PSMC Method Extensions

Feature PSMC (Original) PSMC-IBD PSMC' (PSMC2)
Primary Input One diploid genome (heterozygosity) Many genomes (IBD segment detection) Multiple haplotypes or diploid genomes
Time Inference Range ~10 kya to 1-5 mya ~10 - 10,000 generations ago Improves recent inference, range similar to PSMC
Key Innovation Coalescence times from heterozygous sites Coalescence times from recent IBD sharing Revised HMM for multiple sequences & recent times
Best For Deep demographic history Very recent population dynamics Enhanced resolution with larger sample sizes
Cancer Genomics Use Limited Clonal evolution, subpopulation timing Tumor effective population size trajectory
Pathogen Evolution Use Historical bottleneck/expansion Transmission dynamics, recent outbreaks Population history from multiple strain alignments

Key Applications in Cancer Genomics and Pathogen Evolution

In Cancer Genomics, somatic mutations act as evolutionary markers. PSMC-derived methods can be applied to bulk or single-cell sequencing data from a tumor to infer the historical size of the cancerous cell population. A rapid increase in "effective population size" may indicate a clonal expansion event, while fluctuations can reflect selective sweeps or therapy-induced bottlenecks.

In Pathogen Evolution, analyzing multiple pathogen genomes (e.g., SARS-CoV-2, Mycobacterium tuberculosis) with these methods can reveal changes in the effective number of infections over time, identifying past bottlenecks (host jumps, drug campaigns) and expansions (epidemic spread).

Detailed Experimental Protocols

Protocol: PSMC-IBD Analysis for Recent Demographic Inference

Objective: To infer population size changes within the last ~10,000 years using IBD segment sharing.

Input: High-coverage whole-genome sequencing data for n individuals (n > 50 recommended) from a population.

Software Requirements: bcftools, PLINK, IBD detection software (e.g., hap-IBD, GERMLINE), PSMC-IBD toolkit.

Procedure:

  • Variant Calling & Phasing:
    • Call variants (SNPs) across all samples using a standard pipeline (e.g., GATK).
    • Phase haplotypes using a reference panel or population-based phaser (e.g., SHAPEIT, Eagle2). Output should be in VCF format.
  • IBD Segment Detection:
    • Convert phased VCF to PLINK format.
    • Run IBD detection software (e.g., hap-IBD) with parameters tuned for recent segments (e.g., min length = 2 cM).
    • Filter output to high-confidence IBD segments shared between pairs of individuals.
  • Coalescent Time Estimation from IBD:
    • For each IBD segment, estimate the time to the most recent common ancestor (TMRCA) using its genetic length (L in Morgans). The approximate formula: T ≈ 1/(2L) generations.
    • Aggregate TMRCA estimates across all segment pairs to form a distribution of coalescent times.
  • PSMC-IBD Inference:
    • The distribution of recent coalescent times is used to fit a population size history model. The PSMC-IBD algorithm uses a composite likelihood framework to estimate population size (N_e) in discrete time intervals from the present going backward.
  • Visualization & Interpretation:
    • Plot inferred N_e(t) over time (generations). Recent bottlenecks appear as peaks (increased coalescence rate), expansions as troughs.

Protocol: Applying PSMC' to Pathogen Genomic Data

Objective: To infer the historical effective population size trajectory of a bacterial or viral pathogen from a multi-strain genomic alignment.

Input: A whole-genome multiple sequence alignment (FASTA format) of m pathogen isolates.

Software Requirements: MAFFT or ClustalW, vcfutils.pl (from SAMtools), PSMC2 (PSMC' implementation).

Procedure:

  • Prepare the "Diploid" Consensus:
    • For each isolate in the alignment, generate a "pseudo-diploid" genome by randomly splitting its haplotype into two artificial homologous chromosomes. This simulates the heterozygous input required by PSMC.
    • Alternatively, directly use the alignment of haploid sequences with the -p flag in PSMC2 which supports multiple haplotypes.
  • Generate the Input File:
    • Convert the alignment(s) to a VCF-like format or the required FASTQ input for PSMC. A common method is to call "variants" from the multi-species alignment against a reference.
    • Use vcfutils.pl vcf2fq to generate a consensus FASTQ sequence for each pseudo-diploid sample.
  • Run PSMC2 Analysis:
    • Run PSMC2 with parameters adjusted for the high mutation rate of pathogens (e.g., -p "4+25*2+4+6"). The -p flag specifies the atomic time intervals.
    • Specify the per-generation mutation rate (μ) and generation time (g) accurately for the pathogen. For example, for SARS-CoV-2, g ≈ 5-10 days.
    • Command example: psmc2 -p "4+30*2+4+6" -t 15 -r 5 -μ 1e-6 sample.psmcfa > sample.psmc2
  • Combine Results & Plot:
    • Use the psmc2_plot.pl utility to generate the final plot of effective population size (Ne) over time in years (calculated using μ and g).
    • Interpret peaks as periods of low Ne (bottlenecks, selective sweeps) and troughs as periods of high N_e (epidemic expansion).

Visualization Diagrams

G PSMC PSMC Limitation Limited Recent Resolution (<10k years) PSMC->Limitation Extension Method Extensions Limitation->Extension PSMC2 PSMC' (PSMC2) Extension->PSMC2 IBD PSMC-IBD Extension->IBD App1 Cancer Genomics Tumor Demography PSMC2->App1 App2 Pathogen Evolution Epidemic History PSMC2->App2 IBD->App1 IBD->App2

Title: Evolution of PSMC Methods to New Applications

G Start Multiple Pathogen Genome Sequences Align Multiple Sequence Alignment Start->Align PseudoDip Create Pseudo-Diploid or Use -p flag Align->PseudoDip PSMC2Input Generate PSMC2 Input (FASTQ) PseudoDip->PSMC2Input RunPSMC2 Run PSMC2 with Pathogen μ & g PSMC2Input->RunPSMC2 Output Inferred N_e(t) Trajectory RunPSMC2->Output Bottleneck Interpret: Bottleneck (e.g., host jump) Output->Bottleneck Expansion Interpret: Expansion (e.g., epidemic) Output->Expansion

Title: PSMC2 Workflow for Pathogen Evolution

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Toolkit for PSMC-IBD and PSMC' Experiments

Item / Reagent Function / Purpose Example or Note
High-Coverage WGS Data Provides the raw variant and haplotype information necessary for accurate coalescent inference. Minimum 30x coverage recommended for human studies.
Phasing Pipeline Resolves alleles onto maternal and paternal haplotypes, critical for IBD detection and PSMC input. SHAPEIT4, Eagle2. Often uses a reference panel (e.g., 1000 Genomes).
IBD Detection Algorithm Identifies genomic segments shared identical-by-descent between individuals. hap-IBD (sensitive for recent segments), GERMLINE, RaPID.
PSMC2 Software The core implementation of the extended PSMC' algorithm for multiple sequences. Available from GitHub (lh3/psmc2). Replaces the original psmc.
Pathogen-Specific μ & g Accurate per-generation mutation rate and generation time for scaling time estimates. e.g., SARS-CoV-2: μ ~1e-6, g ~5-10 days; M. tuberculosis: g ~1 year.
High-Performance Computing (HPC) Cluster Runs computationally intensive steps (phasing, IBD calling, PSMC bootstrapping). Needed for population-scale (n>100) or pathogen pangenome analyses.
Visualization Scripts Plots the inferred population history from the .psmc or .psmc2 output files. psmc_plot.pl, psmc2_plot.pl (customizable in R/Python).

Overcoming PSMC Limitations: Data Quality, Parameter Sensitivity, and Interpretation Pitfalls

Application Notes

The Pairwise Sequentially Markovian Coalescent (PSMC) model is a powerful method for inferring historical effective population size (Ne) from a single diploid genome sequence. Within the broader thesis of reconstructing ancient demography for evolutionary and biomedical insights, the accuracy of PSMC is fundamentally constrained by the quality of the input genomic data. This document details the impact of three critical data parameters and provides protocols for their optimization.

1. Impact of Sequencing Depth and Coverage PSMC relies on accurately identifying heterozygous sites and their underlying coalescent times. Inadequate sequencing depth leads to false homozygous calls, distorting the inferred site frequency spectrum and biasing Ne estimates towards more recent times. Insufficient physical coverage (especially for short-read data) creates gaps in the assembly, disrupting the contiguous haplotype information required for the PSMC's hidden Markov model.

Table 1: Impact of Sequencing Depth on PSMC Inference

Mean Depth (X) Called Heterozygotes Ne Bias Time Depth Reliability
< 10X Severe under-call Strong recent bias Poor (< 10kya)
15-20X (Recommended) Near-optimal Minimal Good to 100kya-1mya
> 30X Diminishing returns Very low Potential for ancient noise

2. Impact of Assembly Quality PSMC is typically run on a consensus sequence derived from a reference-guided alignment. Misassemblies, scaffolding errors, and incorrect base calls create artificial "recombination" breakpoints. The PSMC model interprets these as historical recombination events, leading to spurious fluctuations in inferred Ne. High-quality, contiguous assemblies (e.g., from long-read sequencing) provide more accurate haploid sequences for analysis.

Experimental Protocols

Protocol A: Optimal Sequence Data Generation for PSMC Objective: Generate whole-genome sequencing data suitable for robust PSMC analysis.

  • Sample Selection: Use high-molecular-weight DNA from a single diploid individual.
  • Library Preparation: Construct paired-end Illumina libraries (150bp reads). For superior assembly, supplement with long-read data (PacBio HiFi or Oxford Nanopore).
  • Sequencing: Sequence to a minimum mean depth of 20X on an Illumina NovaSeq platform. Aim for > 30X physical coverage.
  • Quality Control: Assess raw data with FastQC. Trim adapters and low-quality bases using Trimmomatic or Cutadapt.
  • Alignment: Map reads to a high-quality reference genome (preferably from a closely related species) using BWA-MEM. Use samtools to sort, index, and mark duplicates.
  • Variant Calling: Generate a consensus sequence using bcftools mpileup and call, followed by bcftools consensus. Apply a base quality filter (Q>20) and depth filter (e.g., minDP=10, maxDP=50).

Protocol B: PSMC Execution and Diagnostic Assessment Objective: Run PSMC and evaluate sensitivity to data quality.

  • Input Preparation: Generate a diploid consensus FASTA. Create the PSMC input file (a binary sequence of heterozygous/monomorphic sites) using the fq2psmcfa utility from the PSMC package.
  • PSMC Run: Execute the PSMC model: psmc -p "4+25*2+4+6" -o output.psmc input.psmcfa. The -p flag defines the atomic time interval pattern.
  • Coalescent Time Scaling: Scale the results using a mutation rate (μ) and generation time (g): psmc_plot.pl -u μ -g g -p plot_output output.psmc.
  • Diagnostic: Downsampling Test. Downsample your high-depth BAM file to 5X, 10X, and 15X using samtools view -s. Re-run Protocol B steps 1-3. Compare Ne trajectories in a single plot to visualize depth-dependent bias.
  • Diagnostic: Assembly Comparison. Run PSMC on consensus sequences derived from short-read-only and hybrid (short+long read) assemblies. Discrepancies in ancient Ne inferences highlight assembly quality impact.

Visualizations

G Start Raw Sequencing Data A1 Low Depth/ Coverage Start->A1 A2 High-Quality Depth/Coverage Start->A2 B1 Poor Assembly A1->B1 B2 High-Quality Assembly A2->B2 C1 Data Issue Manifestation B1->C1  False Heterozygotes & Artificial Breaks C2 Ideal PSMC Input B2->C2  Accurate Het Sites & True Haplotype Blocks D PSMC Model Inference C1->D C2->D E1 Biased/Noisy Ne(t) Trajectory D->E1 E2 Robust Ne(t) Trajectory D->E2

Data Quality Impact on PSMC Inference Pathway

PSMC Analysis and Diagnostic Workflow

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials and Tools for PSMC-ready Data Generation

Item Function & Relevance to PSMC Example Product/Software
High-Molecular-Weight DNA Kit Ensures long DNA fragments for accurate library prep and reduced assembly gaps, crucial for PSMC's haplotype reconstruction. Qiagen Genomic-tip, MagAttract HMW DNA Kit
Whole-Genome Sequencing Kit Produces the raw sequencing data. Paired-end kits are standard; long-read kits improve assembly contiguity. Illumina TruSeq DNA PCR-Free, PacBio SMRTbell
Reference Genome A high-quality, contiguous reference from a closely related species for accurate read mapping and variant calling. Species-specific consortium assemblies (e.g., GRCh38, GRCm39)
Read Processing Suite Performs essential QC, adapter trimming, and filtering to ensure clean input for alignment. FastQC, Trimmomatic, Cutadapt
Sequence Aligner Precisely maps reads to the reference genome to identify heterozygous sites. BWA-MEM, minimap2 (for long reads)
Variant Caller/Consensus Generator Identifies heterozygous positions and creates the diploid consensus sequence, the direct PSMC input. BCFtools, GATK
PSMC Software Package Core software for executing the PSMC model and generating plots. PSMC (https://github.com/lh3/psmc)

1. Introduction & Thesis Context Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size dynamics, a critical and often overlooked parameter is the genome-wide recombination rate (ρ). The PSMC model uses the density of heterozygous sites in a diploid genome to infer historical coalescent events, scaling real time using the per-generation mutation rate (μ). The recombination rate ρ (4Nₑr, where r is the per-generation recombination rate per base) is a core parameter that governs the transition between different coalescent states along the genome. Mis-specification of ρ does not merely introduce noise; it systematically distorts the inferred time scales of population size changes, compressing or expanding demographic history. These notes detail protocols for assessing and correcting this bias.

2. Quantitative Impact of Recombination Rate Mis-specification Table 1: Effect of Input Recombination Rate (r_input) on PSMC-Inferred Event Times (T) and Effective Population Size (Nₑ)

Simulation Truth Input r / True r Bias in Time (T) Bias in Nₑ Direction of Temporal Distortion
r = 1e-8 per bp/gen 0.5x (5e-9) Inferred T ~ 0.7x True T Inferred Nₑ ~ 1.4x True Nₑ Recent events appear older; history compressed.
r = 1e-8 per bp/gen 1.0x (1e-8) Inferred T = True T (Baseline) Inferred Nₑ = True Nₑ Accurate scaling.
r = 1e-8 per bp/gen 2.0x (2e-8) Inferred T ~ 1.3x True T Inferred Nₑ ~ 0.8x True Nₑ Past events appear more recent; history expanded.
Bottleneck at 10kya 0.5x True r Bottleneck position ~7kya Bottleneck depth overestimated Timing of key events shifted.
Expansion at 100kya 2.0x True r Expansion position ~130kya Magnitude of expansion underestimated

Table 2: Recommended Reagent Solutions for PSMC-based Recombination Rate Studies

Research Reagent / Tool Function in Protocol
High-coverage Whole Genome Sequence (WGS) Data (e.g., >20x) Provides the raw heterozygous site input for PSMC. High coverage reduces genotype error, critical for accurate inference.
PSMC Software (psmc, psmc+) Core algorithm for inferring Nₑ(t) trajectories from the diploid genome's heterozygous site distribution.
IMPUTE2 or SHAPEIT Phasing software. While PSMC can use unphased data, accurate long-range haplotypes from phasing can improve ρ sensitivity.
LDhat or interval Software to estimate population-scaled recombination rate (ρ) from genetic variation data for an independent benchmark.
msprime or SLiM Coalescent/forward simulation software. Used to generate synthetic genomes with known ρ, μ, and demography to calibrate bias.
GENMAP Tool for estimating sex-averaged genetic maps from population data, providing a prior for r.
MCMCcoal or BEAST2 Bayesian phylogenetic frameworks for co-estimating mutation rate and divergence times, providing external time anchors.

3. Core Protocol: Calibrating Recombination Rate for Accurate PSMC Analysis

Protocol 3.1: Empirical Recombination Rate Estimation Objective: Generate an independent, population-specific estimate of r to feed into PSMC.

  • Data Preparation: Start with phased genotype data (VCF) for a population sample (n>20).
  • Interval File Creation: Use LDhat stat function to calculate likelihoods for a grid of ρ values per 50kb-100kb window.
  • MCMC Estimation: Run LDhat interval function on each window, using a block penalty of 5, 10 million MCMC iterations, sampling every 5000.
  • Map Construction: Combine per-window rates using convert function. Smooth using a loess function. Output: a genetic map in cM/Mb.
  • Conversion to r: Convert cM/Mb to per-generation per-base rate (r). (1 cM/Mb ≈ r = 1e-8). Adjust for generation time if known.

Protocol 3.2: Simulation-Based Bias Quantification Objective: Quantify the temporal distortion caused by r mis-specification specific to your study system.

  • Baseline Simulation: Using msprime, simulate a diploid genome with:
    • Sequence length: 100 Mb.
    • Mutation rate (μ): Use your study's assumed rate (e.g., 1.25e-8).
    • Recombination rate (r_true): Use your best estimate from Protocol 3.1.
    • Demography: Implement a known history (e.g., constant size, bottleneck, expansion).
  • PSMC Inference Loop: Run PSMC on the simulated genome multiple times, varying the -r parameter (input r) across a range (e.g., 0.25x, 0.5x, 1x, 2x, 4x of r_true).
  • Bias Calibration: For each run, compare the inferred time of key demographic events (Tinferred) to the known simulation time (Ttrue). Fit a regression model: Bias Factor = Tinferred / Ttrue = f(rinput / rtrue). This model becomes your calibration curve.

Protocol 3.3: Anchored PSMC Analysis with Uncertainty Integration Objective: Perform a final PSMC analysis incorporating recombination rate uncertainty.

  • Parameter Setup: Prepare your empirical diploid genome in FASTA format (.fq mask).
  • Multiple PSMC Runs: Execute PSMC (psmc -p "4+25*2+4+6" -r [r_value]) for a suite of r_values spanning the confidence interval of your empirical r estimate (e.g., 5th, 25th, 50th, 75th, 95th percentiles).
  • Time Scaling: Use your calibration curve from Protocol 3.2 to adjust the time axis of each PSMC output plot. If no calibration is available, scale time as: Years = (Generation Time * TPSMC * μ) / μused, acknowledging that T_PSMC itself is biased by r.
  • Consensus Plotting: Generate a composite plot showing the range of Nₑ(t) trajectories across all r inputs, visually representing the uncertainty in time scaling due to recombination rate.

4. Visualization of Relationships and Workflows

psmc_flow WGS Diploid Whole Genome Sequence Data HetCall Heterozygous Site Calling & Masking WGS->HetCall PSMCInput PSMC Input (.fq format) HetCall->PSMCInput PSMCcore PSMC Algorithm (Coalescent HMM) PSMCInput->PSMCcore ParamR Recombination Rate Parameter (r) ParamR->PSMCcore Critical Input OutputRaw Raw Output Nₑ(τ) in coalescent units PSMCcore->OutputRaw MuScale Time Scaling with Mutation Rate (μ) & Generation Time (g) OutputRaw->MuScale FinalPlot Final Demographic Plot Nₑ(t) in Years MuScale->FinalPlot

Title: PSMC Workflow with Recombination Rate Input

recombination_bias TrueHistory True Demographic History LowR Underestimated r (r_input < r_true) TrueHistory->LowR PSMC run with HighR Overestimated r (r_input > r_true) TrueHistory->HighR PSMC run with Distort1 Temporal Compression Recent events appear older LowR->Distort1 Distort2 Temporal Expansion Past events appear recent HighR->Distort2 Consequence Misleading Correlations with Paleoclimate or Fossil Records Distort1->Consequence Distort2->Consequence

Title: How r Mis-specification Distorts Inferred Time Scales

Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size, a critical step is understanding the robustness of the inferred demographic history, represented as the effective population size over time (Ne(t)) curve. This analysis identifies which input parameters and algorithmic settings exert the strongest influence on the Ne(t) trajectory, guiding researchers towards more reliable interpretations and highlighting potential artifacts.

Key Parameters and Settings in PSMC Analysis

The PSMC model infers Ne(t) from a single diploid genome sequence. Its output is sensitive to several user-defined parameters and inherent model assumptions.

Table 1: Core PSMC Parameters and Their Typical Influence

Parameter/Setting Default/Common Value Function Potential Impact on Ne(t) Curve
Mutation Rate (μ) e.g., 2.5e-8 per generation Scales the per-generation probability of mutation. Strong. Directly scales the time axis (t = generations / μ). Higher μ compresses the timeline, making events appear more recent.
Generation Time (g) e.g., 25 years Converts generational time to years. Strong. Linearly scales the time axis when presenting results in years (tyears = tgenerations * g).
Atomic Time Interval (-p) "4+25*2+4+6" Defines the piecewise constant time intervals (free population size parameters) for inference. Very Strong. Dictates the temporal resolution and smoothing of the curve. Fewer intervals produce smoother histories; more intervals can capture finer fluctuations but increase noise.
Initial θ/ρ Estimate (-t, -r) Calculated from data Initial values for the population-scaled mutation (θ=4Neμ) and recombination (ρ=4Ner) rates. Moderate. Influences convergence speed but final estimates are typically re-scaled via the iterative algorithm. Poor guesses may lead to local optima.
Number of Iterations (-n) 30+ Maximum number of EM algorithm iterations. Low-Moderate. Must be sufficient for convergence. Premature stopping yields an inaccurate likelihood maximum.
Number of Bootstraps 100 For assessing confidence intervals via data resampling. Low on central estimate. Does not change the central inference but quantifies variance due to sampling of the genomic sequence.
Masking of Low-Complexity Regions Recommended Excludes regions prone to mapping/assembly errors. Strong for ancient times. Unmasked low-complexity regions can create spurious, very ancient population size signals.
Quality of Genome Assembly & Mapping N/A Foundational data quality. Foundational. High error rates can mimic diversity, inflating recent Ne; gaps can break haplotype structure, affecting older Ne.

Experimental Protocol for Conducting Sensitivity Analysis

This protocol provides a step-by-step guide to systematically test the sensitivity of the PSMC-inferred Ne(t) curve.

Aim: To quantify the impact of key parameters on the inferred demographic history.

Materials:

  • High-coverage, diploid genome sequence data (BAM/FASTA format).
  • PSMC software (https://github.com/lh3/psmc).
  • Standard bioinformatics tools (bcftools, samtools).
  • High-performance computing cluster (recommended).

Procedure:

Step 1: Baseline Inference

  • Generate a consensus FASTA sequence from your aligned BAM file using bcftools mpileup, call, and vcf2fq, applying a quality filter (e.g., -Q 30).
  • Run PSMC with recommended default parameters to establish a baseline Ne(t) curve.

Step 2: Parameter Perturbation Experiments

  • Mutation Rate (μ) Sensitivity:
    • Run PSMC multiple times, keeping -p constant but varying the μ value used in the downstream psmc_plot.pl script. Generate plots with μ set to 0.5x, 1x (default), and 2x the assumed value. Compare the scaled time axes.
  • Time Interval Pattern (-p) Sensitivity:
    • Run PSMC with different -p patterns.
      • Fewer intervals: -p "4+10*2+4+6"
      • More intervals: -p "4+30*2+4+6"
    • Compare the smoothness and peak/trough structure of the resulting curves.
  • Data Quality Control Sensitivity:
    • Generate a new consensus sequence with more aggressive masking of low-complexity regions (e.g., using bedtools maskfasta with a RepeatMasker track).
    • Re-run the baseline PSMC analysis on this masked genome.
    • Compare the inferred ancient population size (>1 million years ago) with the baseline.

Step 3: Analysis and Visualization

  • Plot all resulting Ne(t) curves on the same log-log scale using psmc_plot.pl.
  • Key Comparison Metrics: Note (a) the timing of major expansion/contraction events, (b) the relative amplitude (Ne) of those events, and (c) the overall trajectory in the most ancient epochs.
  • Synthesize findings by ranking parameters based on the magnitude of deviation they cause from the baseline curve.

Visualizing the Sensitivity Analysis Workflow

PSA_Workflow Data Input Data Diploid Genome ParamGrid Parameter Grid (μ, -p, masking) Data->ParamGrid PSMCrun PSMC Execution ParamGrid->PSMCrun Multiple Runs Results Output Ne(t) Curves PSMCrun->Results Compare Comparative Analysis Results->Compare Rank Parameter Ranking Compare->Rank By Influence

PSMC Sensitivity Analysis Protocol

The Scientist's Toolkit: Key Research Reagents & Materials

Table 2: Essential Toolkit for PSMC Sensitivity Analysis

Item Category Function & Relevance to Sensitivity
High-Quality Reference Genome Assembly Data The foundational coordinate system. Fragmented or erroneous assemblies introduce spurious heterozygosity, directly impacting the initial θ estimate and all downstream Ne(t) inference.
Deep Coverage (>30X) Whole-Genome Sequencing Data Data Provides a confident call of heterozygous sites. Low coverage increases stochastic error, disproportionately affecting the recent (terminal) part of the Ne(t) curve.
RepeatMasker / Tandem Repeat Finder Database Software/Data Identifies low-complexity genomic regions for masking. Critical control for sensitivity of ancient Ne(t); unmasked repeats are a major source of artifactual, deep-time population signals.
PSMC Software Suite (psmc, psmc2history, etc.) Software Core inference engine. Understanding its internal parameters (-p, -t, -r) is the direct target of the sensitivity analysis.
Bootstrapping Scripts (splitfa, psmc) Software Quantifies variance due to the stochastic ordering of genomic segments. Distinguishes parameter-driven effects from inherent algorithmic uncertainty.
Plotting Script (psmc_plot.pl) Software Visualizes results and applies scaling factors (μ, g). Essential for translating generational output into real-time trajectories and comparing runs.
Mutation Rate Estimates from Phylogenetics Literature/Data External prior for μ. The single most influential scaling parameter. Sensitivity analysis must test a plausible range (e.g., 1e-8 to 2.5e-8) from independent studies.
Fossil/Archaeological Calibration Points Literature/Data External validation for timing of events (e.g., bottlenecks). Provides a biological reality check for the time axis scaled by different (μ, g) combinations.

Application Notes

This document addresses the critical temporal boundary limitations of the Pairwise Sequentially Markovian Coalescent (PSMC) model, a foundational tool for inferring historical population sizes from a single diploid genome sequence. While powerful, its inferences are bounded in time, creating blind spots in both recent and deep history that directly impact applications in evolutionary medicine, conservation genetics, and drug target identification.

Table 1: Temporal Boundaries and Resolution of PSMC and Related Methods

Method Effective Time Range (Years Before Present) Minimum Detectable Time (Years) Maximum Reliable Time (Years) Key Limiting Factor
Standard PSMC ~10,000 - 1,000,000 ~10,000 ~1-2 Million Mutation rate, Generation time
PSMC' (improved) ~5,000 - 1,500,000 ~5,000 ~1.5-2 Million Heterozygous site density
MSMC (multi-sample) ~1,000 - 500,000 ~1,000 ~500,000 Sample size, Computational load
SMC++ (composite) ~500 - 2,000,000 ~500 ~2 Million Number of haplotypes
Recent Methods (e.g., IBD) 0 - 10,000 ~50 ~10,000 Segment length detection threshold

Table 2: Impact of Parameter Choice on Inferred Time Boundaries

Parameter Typical Value Effect on Recent Boundary (↑ = older) Effect on Ancient Boundary (↑ = older) Recommended Calibration Strategy
Generation Time (g) 20-30 years Proportional to g Proportional to g Use species-specific life history data
Mutation Rate (μ) 1.25e-8 /site/gen Inversely proportional to μ Inversely proportional to μ Sequence pedigrees or ancient DNA
θ (4Neμ) Estimated Poor resolution if θ too high Poor resolution if θ too low Compare with independent diversity estimates
Tmax (max coalescent time) 15-20 Defines upper time limit Directly sets upper bound Set based on genome size & μ

The Ancient Time Boundary: Loss of Signal

The ancient boundary is primarily constrained by the total number of mutations accumulated since coalescence. As coalescent time increases, the probability of having multiple mutations at the same site (multiple hits) increases, violating the infinite sites model assumption. Furthermore, the limited number of ancestral recombination events within the available sequence length provides insufficient time points for reconstruction.

The Recent Time Boundary: Lack of Lineage Sorting

The recent boundary arises because very recent coalescent events leave few, if any, derived mutations segregating in the single diploid sample. For events within the last ~10-20 generations, lineages have not had sufficient time to coalesce, meaning the haplotype structure reflects the parental generation more than historical demography.

Experimental Protocols

Protocol 1: Assessing and Defining Temporal Boundaries for a PSMC Analysis

Objective: To empirically determine the reliable time range for a PSMC inference from a newly sequenced diploid genome.

Materials:

  • High-coverage (>20x) whole-genome sequencing data for a single diploid individual.
  • A high-quality reference genome for the species.
  • Reliable estimates of generation time (g) and mutation rate (μ).

Procedure:

  • Variant Calling: Map reads to reference genome using BWA-MEM. Call consensus sequence and heterozygous SNPs using bcftools mpileup/call.
  • Input Preparation: Generate a diploid consensus sequence in FASTQ format, then convert to a PSMC input file (.fq.gz to psmcfa) using fq2psmcfa. The number of segregating sites (S) is a key output.
  • Parameter Calculation:
    • Calculate θ = 4Neμ = S / L, where L is the total effective sequence length.
    • Calculate the recent boundary as approximately (2 * g) / θ generations. Convert to years by multiplying by g.
    • Calculate the ancient boundary using the formula: T_max ≈ (1/μ) * log(L * μ) generations. This estimates when multiple hits become frequent.
  • PSMC Run: Execute PSMC with parameters -p "4+25*2+4+6" -t 15 -r 5. The -t parameter sets the maximum coalescent time (Tmax).
  • Bootstrap: Run 100 bootstrap replicates with psmc -b to assess variance in the estimated trajectory, particularly at the boundaries.
  • Visualization: Plot results using psmc_plot.pl using your g and μ. The plotted confidence intervals will diverge dramatically outside the reliable range.

Protocol 2: Using Complementary Methods to Bridge Temporal Gaps

Objective: To combine PSMC with other methods to infer population history across both recent and ancient timescales.

Materials:

  • PSMC results from Protocol 1.
  • Multiple sequence alignments for the same species (for ∂a∂i).
  • Genotype data from multiple individuals (for IBD or SMC++ analysis).

Procedure: Part A: Bridging the Recent Gap (<10,000 years)

  • Perform identity-by-descent (IBD) segment analysis on a cohort of >10 individuals using software like GERMLINE or hap-ibd.
  • The distribution of very long IBD segments (>20 cM) is sensitive to recent population size (last ~100 generations). Fit an exponential model to segment lengths to estimate recent Ne.
  • Use SMC++ with the same multi-sample VCF to generate a demographic model that seamlessly integrates the recent signal from many samples with the deeper PSMC signal.

Part B: Bridging the Ancient Gap (>1 million years)

  • Use the Multi Sequentially Markovian Coalescent (MSMC) on a phased genome of high quality. MSMC can sometimes resolve slightly older events due to better modeling of multiple haplotypes.
  • For deeper inference, apply the site frequency spectrum (SFS)-based method ∂a∂i to genomic data from multiple populations. ∂a∂i models can be fitted to divergence times exceeding 2Ne generations.
  • Synthesize: Overlay the PSMC, IBD/SMC++, and ∂a∂i plots on a log-scaled time axis. The region of overlapping confidence between methods defines the highest-confidence history.

Visualizations

psmc_boundaries Title PSMC Inferential Boundaries Over Time InvisibleStart DataInput Input Data: Single Diploid Genome Heterozygous SNPs InvisibleStart->DataInput InvisibleEnd RecentBoundary Recent Boundary (~10,000 years ago) ReliableRange Reliable PSMC Inference Range RecentBoundary->ReliableRange Limited by Segregating Sites SignalLoss Signal Loss Causes: - Few coalescent events - Lineages not sorted RecentBoundary->SignalLoss MethodGap Inferential Gap RecentBoundary->MethodGap Use IBD/SMC++ AncientBoundary Ancient Boundary (~1-2 Mya) ReliableRange->AncientBoundary Limited by Multiple Hits & Recombination AncientBoundary->InvisibleEnd AncientBoundary->SignalLoss AncientBoundary->MethodGap Use ∂a∂i/MSMC DataInput->RecentBoundary

PSMC Temporal Boundaries and Complementary Methods

psmc_workflow cluster_input Input & Preparation cluster_calc Boundary Calculation cluster_infer Inference & Validation Title Protocol: Defining PSMC Time Boundaries WGS WGS Data (>20x coverage) Step1 1. Map & Call Variants (bwa, bcftools) WGS->Step1 RefGenome Reference Genome RefGenome->Step1 Step2 2. Create PSMC Input (fq2psmcfa) Step1->Step2 Calc1 3. Calculate θ = S / L Step2->Calc1 Params Parameters: g, μ Params->Step2 Calc2 4. Estimate Boundaries: Recent ≈ (2*g)/θ yrs Ancient ≈ (1/μ)*log(L*μ) yrs Calc1->Calc2 OutputBox Output: T_min, T_max Calc2->OutputBox Step3 5. Run PSMC with -t T_max OutputBox->Step3 Step4 6. Bootstrap Replicates (Assess Variance) Step3->Step4 Plot 7. Plot with Confidence Intervals Step4->Plot

Workflow for Empirical PSMC Boundary Determination

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Tools for PSMC and Temporal Boundary Analysis

Item Function & Relevance to Boundary Analysis Example/Format
High-Coverage WGS Data Provides dense heterozygous SNP calls. Low coverage reduces S, shrinking the reliable time range and increasing variance. FASTQ files, >20x coverage.
Species-Specific g & μ Critical for scaling coalescent time to years. Incorrect values systematically shift boundaries. Point estimates with confidence intervals (e.g., μ = 1.25e-8 ± 0.5e-8).
Reference Genome Essential for accurate read mapping and recombination rate estimation. Misassembly creates false recombination breakpoints. FASTA file, chromosome-level assembly preferred.
PSMC Software Suite Core algorithm for inference. Understanding its parameters (-t, -p, -r) is key to boundary control. psmc, psmc_plot.pl, fq2psmcfa.
Bootstrap Scripts Generates confidence intervals. Wide intervals at plot extremes visually indicate boundary limitations. Perl/Python scripts for psmc -b.
Complementary Software Bridges temporal gaps: IBD for recent, ∂a∂i/MSMC for ancient history. GERMLINE, hap-ibd, SMC++, ∂a∂i.
Population Genotype Data Required for complementary methods (IBD, SMC++, ∂a∂i) that overcome single-sample limits. Multi-sample VCF file.
Coalescent Simulator Validates boundaries by testing PSMC's accuracy on simulated data with known history. ms, msprime, cosi2.

1. Introduction This application note provides detailed protocols for optimizing data preprocessing for the Pairwise Sequentially Markovian Coalescent (PSMC) method. The accuracy of PSMC in inferring historical effective population size (Ne) trajectories is critically dependent on the quality of input genotype data and the calibration of the per-generation mutation rate (µ). These protocols are designed for researchers aiming to generate robust demographic histories for downstream applications in evolutionary medicine, conservation genetics, and drug target identification.

2. Key Concepts for PSMC Analysis PSMC infers Ne through time by analyzing the distribution of heterozygosity within a single diploid genome. It leverages the coalescent theory, where the time to the most recent common ancestor (TMRCA) between two homologous sequences varies with past population size. Incorrect genotype calls, systematic sequencing errors, and inaccurate mutation rate assumptions directly distort the inferred Ne trajectory.

3. Genotype Calling and Initial Data Processing

  • Protocol 3.1: High-Depth Whole-Genome Sequencing Alignment

    • Objective: Generate a high-quality BAM file for a single diploid individual.
    • Procedure:
      • Sequence: Obtain high-coverage (typically >20X) whole-genome sequencing data for a single individual.
      • Quality Control: Use FastQC v0.12.1 to assess raw read quality. Trim adapters and low-quality bases using Trimmomatic v0.39.
      • Alignment: Map reads to a high-quality reference genome using BWA-MEM v0.7.17. Use bwa mem -M -R "@RG\tID:sample\tSM:sample\tPL:ILLUMINA" reference.fa read1.fq read2.fq.
      • Post-processing: Sort and index the BAM file with Samtools v1.17. Mark duplicates using Picard Tools v3.0.0 or Sentieon.
  • Protocol 3.2: Genotype Calling for a Single Haplotype

    • Objective: Generate a consensus sequence in FASTQ format, representing one diploid genome's two haplotypes.
    • Procedure:
      • Variant Calling: Use BCFtools mpileup v1.17 (bcftools mpileup -C 50 -q 20 -Q 20 -a FORMAT/DP,AD -f reference.fa sample.bam) and call (bcftools call -c) to produce a VCF file.
      • Consensus Generation: Use BCFtools consensus (bcftools consensus -f reference.fa sample.vcf.gz) to generate a FASTA file.
      • FASTA to FASTQ Conversion: Convert the FASTA to FASTQ format, assigning a uniform high-quality score (e.g., 40) to all sites, using a custom script or seqtk seq -F 'I'.

4. Data Filtering for PSMC Input

The following table summarizes critical filtering steps and their quantitative impact on PSMC signal-to-noise ratio.

Table 1: Quantitative Impact of Filtering Strategies on PSMC Input Data

Filtering Step Typical Parameter Purpose Effect of Over/Under-Filtering
Minimum Mapping Quality MAPQ ≥ 20 Excludes poorly/uniquely mapped reads. Under: Inflates heterozygosity from mismapped reads. Over: Reduces power, especially in repetitive regions.
Minimum Base Quality BaseQ ≥ 20 Excludes low-confidence base calls. Under: Introduces false heterozygous calls. Over: May remove true variants in low-complexity regions.
Minimum Depth DP ≥ 10 Ensures sufficient evidence for genotype call. Under: Increases stochastic calling errors. Over: Biases against regions with low mappability.
Maximum Depth DP ≤ (2 × Mean Depth) Removes regions with copy number variants or collapsed repeats. Under: Retains CNVs, causing false long IBD tracts. Over: Removes normal high-depth regions.
SNP Quality & Distance QUAL ≥ 20, ≥10bp between SNPs Ensures high-confidence variants and prevents clustering. Under: Error-prone SNPs distort the TMRCA distribution. Over: Removes true variants in dense regions.
Missing Data ≤ 10% missing calls per 100kb window Ensures continuity of the genomic sequence. Under: Gaps break the sequential Markov process. Over: Excessively reduces total usable sequence.
  • Protocol 4.1: Comprehensive Site Filtering for VCF
    • Objective: Apply integrated filters to generate a high-confidence call set.
    • Procedure:

5. Mutation Rate Calibration and Generation Time Scaling

The mutation rate (µ) and generation time (g) are scaling parameters that convert PSMC's internal time units (T = λ * µ) to real years.

Table 2: Calibration Parameters Across Model Organisms

Organism Reported µ (per gen) Source Method Generation Time (g) Key Consideration for PSMC
Human 1.25e-8 Trio-based, pedigree 25-30 years Well-established; the de facto standard for calibration.
Great Ape ~1.2e-8 Phylogenetic divergence 15-25 years Requires cross-species alignment quality assessment.
Mouse 5.4e-9 Pedigree (lab strains) 0.2-0.5 years Inbred lab strain history differs from wild populations.
Drosophila 5.8e-9 Mutation Accumulation Lines 0.05-0.1 years High recombination rate improves PSMC resolution.
  • Protocol 5.1: Empirical Mutation Rate Calibration using Known Divergence
    • Objective: Estimate µ using phylogenetic divergence to an outgroup.
    • Procedure:
      • Run PSMC on the target genome (A) and a closely related outgroup genome (B) separately.
      • Calculate the average heterozygosity (θ) from the .psmc file for each genome over a recent time interval (e.g., 0-0.1 TMRCA units).
      • Obtain the species divergence time (T_div) in years from fossil/geological data.
      • Calculate µ: µ = (θ_A + θ_B) / (2 * T_div / g). This assumes the divergence is neutral and clock-like.
    • Protocol 5.2: Bootstrapping for Confidence Intervals
      • Objective: Assess robustness of the PSMC inference to stochastic variation in the input sequence.
      • Procedure:

6. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Computational Tools

Item / Software Function Key Consideration
BCFtools / Samtools Core utilities for processing VCF/BAM files. Industry standard; ensures reproducibility and interoperability.
PSMC (v0.6.5-r67+) The core algorithm for inferring Ne trajectories. Requires input in FASTQ format; sensitive to run parameters (-p flag).
Reference Genome (GRCh38, etc.) The coordinate system for alignment and variant calling. Use a primary assembly, not a pan-genome, for clear haplotype definition.
High-Coverage WGS Data Raw input data. Minimum recommended depth is 20X; >30X preferred for robust heterozygous calls.
Compute Cluster (Linux) High-performance computing environment. PSMC bootstrapping is computationally intensive and easily parallelized.

7. Visualized Workflows

G RawWGS Raw WGS Reads Align Alignment & QC (BWA, Samtools) RawWGS->Align VCF Variant Calling (BCFtools mpileup/call) Align->VCF Filter Comprehensive Filtering (Table 1) VCF->Filter Consensus Consensus FASTA/Q Generation Filter->Consensus PSMCrun PSMC Inference with Bootstrapping Consensus->PSMCrun Plot Plotting & Calibration (Real Years) PSMCrun->Plot RefGenome Reference Genome RefGenome->Align Params PSMC Parameters (-p) Params->PSMCrun MuG µ & Generation Time (Calibration) MuG->Plot

Diagram 1: End-to-end workflow for PSMC analysis.

Diagram 2: Logical relationship of PSMC parameters.

This document provides application notes and protocols for implementing the Pairwise Sequentially Markovian Coalescent (PSMC) model, specifically focusing on computational efficiency when applied to large or complex genomes. The PSMC method infers historical population size trajectories from a single diploid genome sequence by analyzing the distribution of heterozygosity along its chromosomes. Within the broader thesis on refining PSMC for ancient population size inference, these notes address the critical software and hardware considerations required to manage the substantial computational load, especially with vertebrate-sized genomes.

Core Computational Challenges & Performance Benchmarks

The PSMC algorithm scales with genome size and the density of heterozygous sites. Processing large genomes (e.g., >3 Gb) requires significant memory and CPU time. The following table summarizes key performance metrics and bottlenecks.

Table 1: Computational Performance Benchmarks for PSMC on Large Genomes

Genome Size Approx. Input File Size (FASTA) Typical Peak Memory Usage Estimated CPU Time (Single Thread) Primary Bottleneck
~100 Mb (e.g., Drosophila) 200 MB 2-4 GB 2-4 hours Viterbi algorithm iteration
~3 Gb (e.g., Human) 6 GB 12-20 GB 24-48 hours Hidden Markov Model (HMM) computations
>10 Gb (e.g., Salamander) 20 GB 40-70 GB 5-10 days Memory I/O and likelihood calculations

Note: Performance is highly dependent on heterozygosity rate, pattern parameter (-p), and number of iterations (-t). Times are for a single bootstrap replicate. Using multiple bootstrap replicates linearly increases total compute time.

Detailed Protocol for Efficient PSMC Analysis

Protocol 3.1: Data Preprocessing and Input Generation

Objective: Generate the required .psmcfa consensus sequence file from aligned BAM files efficiently.

Materials:

  • High-quality whole-genome sequencing data (BAM format, >20x coverage).
  • Reference genome (FASTA format).
  • SAMtools (v1.10+), BCFtools (v1.10+).
  • The vcfutils.pl script (distributed with BCFtools).
  • The fq2psmcfa tool (from PSMC distribution).

Procedure:

  • Variant Calling (mpileup):

  • Convert to PSMC Input Format:

    • -q20: Filters out bases with quality score <20. This step reduces noise and file size.

Efficiency Note: For massive genomes, consider splitting the BAM by chromosome and processing each independently in parallel. The resulting .psmcfa files can be concatenated (cat chr*.psmcfa > genome.psmcfa).

Protocol 3.2: PSMC Model Execution with Optimized Parameters

Objective: Run the PSMC inference with parameters that balance accuracy and computational cost.

Materials:

  • sample.psmcfa file from Protocol 3.1.
  • PSMC software (v0.6.5-r67+).

Procedure:

  • Initial Run with Aggregated Pattern:

  • Bootstrapping (Parallel Execution):

    • Use xargs -P or a job scheduler (e.g., SLURM) to run multiple bootstrap replicates in parallel across multiple CPU cores or nodes. This is the most effective way to reduce wall-clock time.

Protocol 3.3: Visualization and Plot Generation

Objective: Generate the final population history plot from PSMC output.

Materials:

  • sample_combined.psmc file.
  • psmc_plot.pl script (from PSMC distribution).
  • Perl with GD library.

Procedure:

  • -g 5: Generation time in years.
  • -u 1.2e-8: Mutation rate per nucleotide per generation. Critical: This is species-specific and scales the time axis.
  • -Y 100 -X 1000: Sets the maximum y (population size, N0) and x (time ago) for plotting.

Signaling Pathway and Workflow Visualizations

G Start Start: Diploid Genome Sequence Data BAM Aligned Reads (BAM Format) Start->BAM Alignment VCF Variant Calling (BCFtools mpileup) BAM->VCF samtools/bcftools FQ Diploid Consensus (FASTQ-like) VCF->FQ vcfutils.pl vcf2fq PSMCfa Binary Sequence (.psmcfa file) FQ->PSMCfa fq2psmcfa PSMCrun PSMC HMM Inference (Coalescent Simulation) PSMCfa->PSMCrun psmc -p ... Result Inferred Population History Curve PSMCrun->Result psmc_plot.pl

Title: PSMC Analysis Workflow from Sequence to Plot

Title: PSMC's Hidden Markov Model and Inference Algorithm

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Computational Tools and Resources for PSMC Analysis

Tool/Resource Category Primary Function & Relevance
SAMtools/BCFtools Data Preprocessing Manipulate alignments (BAM) and call variants (VCF). Essential for creating accurate, depth-filtered input for PSMC.
GNU Parallel / xargs Workflow Management Enables parallel execution of bootstraps or per-chromosome analysis, drastically reducing total computation time.
High-Memory Compute Node Hardware PSMC loads the entire .psmcfa into memory. For large genomes (>3Gb), nodes with 64GB-512GB RAM are often necessary.
Job Scheduler (SLURM/PBS) Cluster Computing Manages resource allocation and job queues for running multiple PSMC instances on an HPC cluster.
PSMC Software Suite Core Algorithm Implements the Pairwise Sequentially Markovian Coalescent model. The main executables are psmc, fq2psmcfa, and psmc_plot.pl.
Perl with GD Library Visualization Required dependency for the psmc_plot.pl script to generate PNG plots of the population history.
Mutation Rate (μ) Estimate Species-Specific Parameter Critical scaling parameter for converting model time to years. Must be sourced from prior pedigree or phylogenetic studies (e.g., 1.2e-8 for humans).
Generation Time (g) Estimate Species-Specific Parameter Converts generational timescale to real years. Sourced from life history data (e.g., 25 years for humans).

PSMC vs. Other Methods: Assessing Accuracy and Choosing the Right Tool

This document provides detailed application notes and protocols for four prominent methods used to infer historical population sizes from genomic data: PSMC, MSMC, SMC++, and Stairway Plot 2. The primary thesis context is an expansion and critical evaluation of the foundational Pairwise Sequentially Markovian Coalescent (PSMC) model. While PSMC pioneered the inference of population size history from a single diploid genome, subsequent methods have addressed its limitations, such as its inability to model population structure, leverage multiple genomes effectively, or handle very recent epochs with high resolution. This comparative framework is essential for researchers aiming to select the most appropriate tool for their specific evolutionary or demographic question, particularly in contexts where such inferences can inform population-specific genetic risk factors in drug development.

Core Algorithm Comparison

The following table summarizes the key algorithmic and data requirements of each method.

G Start Input: Genomic Sequences PSMC PSMC (HMM on a single diploid genome) Start->PSMC MSMC MSMC/MSMC2 (Coalescent HMM on multiple haplotypes) Start->MSMC SMCpp SMC++ (SMC on site frequency spectrum) Start->SMCpp Stairway Stairway Plot 2 (Composite likelihood from SFS) Start->Stairway Output Output: Historical Effective Population Size (Ne) PSMC->Output MSMC->Output SMCpp->Output Stairway->Output

Title: Logical Flow from Genomic Input to Demographic Inference

Table 1: Comparative Specifications of Demographic Inference Methods

Feature PSMC MSMC/MSMC2 SMC++ Stairway Plot 2
Min. Samples Required 1 diploid individual (2 haplotypes) 2-8 haplotypes (1-4 diploid individuals) ≥1 diploid individual ≥4 haplotypes (2 diploid individuals)
Max. Samples Utilized 2 haplotypes ~8-10 haplotypes optimally Dozens to hundreds of individuals Dozens of individuals (scalable)
Time Range (generations ago) ~10k - 5M ~10k - 5M (MSMC2 improves recent inference) ~100 - 5M (better recent resolution) User-defined, typically 1k - 5M
Key Data Input Consensus diploid sequence (masked) Phased haplotypes (masked) Unphased or phased genotypes; VCF file Site frequency spectrum (SFS)
Models Population Structure? No No (panmictic assumption) Yes (can integrate multiple populations) No (single population model)
Computational Intensity Low Medium-High High Medium (but bootstrap is intensive)
Primary Statistical Framework Hidden Markov Model (HMM) Sequentially Markovian Coalescent (SMC) Sequential Monte Carlo (SMC) Composite Likelihood (from SFS)
Optimal for Recent Epochs (<10kya)? Poor resolution Good (MSMC2 excellent) Very good Good (depends on SNP density)

Detailed Experimental Protocols

Protocol A: Standard PSMC Analysis Workflow

This protocol forms the baseline method from the core thesis.

1. Data Preparation:

  • Input: High-coverage whole-genome sequencing data from a single diploid individual.
  • Mapping & Variant Calling: Map reads to a reference genome (e.g., using BWA-MEM), call consensus sequence with samtools/bcftools.
  • Masking: Generate a mask for regions of low mappability and high repetitive content. Use bedtools to create a consensus mask. Highly repetitive regions and sites with exceptionally high or low depth must be excluded.

2. PSMC Execution:

  • Convert the consensus FASTA to the required psmcfa format using fq2psmcfa.
  • Run PSMC with recommended parameters: psmc -N25 -t15 -r5 -p "4+25*2+4+6" -o output.psmc input.psmcfa. The -p flag defines the atomic time interval pattern.
  • Perform 100 rounds of bootstrapping to estimate confidence intervals. Use the utils/bootstrap.pl script provided with PSMC to generate bootstrap sequences and run PSMC on each.

3. Visualization:

  • Convert the .psmc output to a plot using psmc_plot.pl. Specify the mutation rate (e.g., -u 2.5e-8) and generation time (e.g., -g 25).

Protocol B: MSMC2 Analysis for Multiple Genomes

1. Data Preparation:

  • Input: Phased haplotypes from 2-4 diploid individuals (4-8 haplotypes).
  • Phasing: Use a reference panel (e.g., SHAPEIT, Eagle) for accurate phasing. Masking of low-confidence regions remains critical.
  • File Generation: Generate a mask.bed file per sample and the multihetsep input file for MSMC using its helper scripts (e.g., generate_multihetsep.py).

2. MSMC2 Execution:

  • Run MSMC2 with command: msmc2 -o output.msmc2 -i 40 -t 6 -p "1*2+15*1+1*2" input.multihetsep. The -i flag sets iterations, -t sets threads.
  • For cross-population analysis, use the --fixedRecombination flag and separate haplotypes into different groups.

3. Visualization:

  • Use the msmc-tools suite (e.g., combinedcrosscoalfile.py, plot_MSMC.py) to combine results and generate plots with scaling factors.

Protocol C: SMC++ Analysis for Large Sample Sizes

1. Data Preparation:

  • Input: A VCF file (phased or unphased) for multiple individuals from one or more populations.
  • Masking: Provide a BED file of accessible genomic regions.
  • SFS Estimation: SMC++ internally computes the joint site frequency spectrum.

2. SMC++ Execution:

  • Estimate Model Parameters: Run smc++ estimate with a pre-defined mutation rate: smc++ estimate 2.5e-8 output_dir input.vcf --cores 8 --knots 15. The --knots parameter controls the flexibility of the population size curve.
  • Cross-Population Analysis: Use smc++ split to infer divergence times and size histories for two populations simultaneously.
  • Covariate Analysis (if applicable): Integrate geographic or environmental data using the --covariates flag in a custom analysis.

3. Visualization:

  • Use smc++ plot to generate PDF plots: smc++ plot plot.pdf output_dir/model.final.json.

Protocol D: Stairway Plot 2 Analysis from SFS

1. Data Preparation:

  • Input: A multi-sample VCF file.
  • SFS Construction: Use tools like easySFS.py to generate the unfolded or folded Site Frequency Spectrum from the VCF. The number of randomly sampled alleles (n) is a critical, user-defined parameter.

2. Stairway Plot 2 Execution:

  • Design a blueprint file specifying parameters: mutation rate, generation time, number of bins for estimation, bootstraps (e.g., 200), and sequence length.
  • Run the Java package: java -Xmx4G -jar stairway_plot_es two.blueprint.
  • The method performs a random search of demographic models and selects the best fit using a cross-validation procedure.

3. Visualization:

  • Use the provided R script (plot.blueprint.R) included with the software to generate the final population history plot with confidence intervals from bootstraps.

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Materials and Tools for Demographic Inference Studies

Item/Category Function & Relevance Example/Note
High-Coverage WGS Data Foundation for all methods. Accuracy scales with coverage (typically >20x). Illumina, PacBio HiFi, or ONT ultralong reads for improved phasing.
Reference Genome & Annotation Essential for read alignment and defining callable regions. Species-specific, high-contiguity assembly (e.g., GRCh38, GRCm39).
Variant Caller Converts aligned reads to genotype data. GATK, bcftools, FreeBayes. Critical for sensitivity/specificity balance.
Phasing Tool Required for MSMC; beneficial for others. SHAPEIT4, Eagle2, WhatsHap. Accuracy is paramount.
Genomic Mask (BED file) Defines "callable" regions, excluding repetitive/low-complexity areas. Generated using bedtools, RepeatMasker, and depth filters.
Mutation Rate Estimate Scalars to convert coalescent time to real time. Species-specific, often derived from pedigree studies (e.g., 1.2e-8 for humans).
Generation Time Estimate Converts generational timescale to years. From life history data (e.g., 25-30 years for humans).
High-Performance Computing (HPC) Computational resource for data processing and model fitting. Linux cluster with ample CPU, memory (≥32GB), and storage.
Visualization & Stats Software For plotting results and performing comparative analysis. R (ggplot2), Python (matplotlib), custom scripts from tool authors.

Application Notes

This document provides a comparative analysis of the Pairwise Sequentially Markovian Coalescent (PSMC) model and its multi-genome extensions, focusing on data requirements and temporal inference resolution. These methods are central to reconstructing historical effective population size (Ne) trajectories from genomic data, a critical component for understanding evolutionary pressures, demographic history, and the context of genetic disease variant emergence.

Core Method Comparison

The PSMC algorithm, introduced by Li and Durbin (2011), infers population size history by analyzing the distribution of heterozygotes (coalescent events) along a single diploid genome. Extensions like MSMC (Multiple Sequentially Markovian Coalescent) and its successor MSMC2 utilize multiple genomes to increase resolution and accuracy.

Key Quantitative Comparisons:

Table 1: Method Specifications and Data Requirements

Feature PSMC MSMC/MSMC2
Minimum Genomes 1 diploid (2 haplotypes) 2-8 haplotypes (1-4 diploids) for MSMC; 2+ for MSMC2
Typical Input Data High-coverage (>20x), long-read or accurately phased short-read WGS Multiple high-coverage, phased haplotypes are critical
Time Range ~10,000 to 1,000,000 years ago ~1,000 to 1,000,000 years ago
Recent Resolution Low (blurred within last ~10-20 kyr) High (can resolve events within last ~1-10 kyr)
Ancient Resolution Moderate High, with less stochastic variance
Computational Load Low (hours per genome) High (scales with number of haplotypes)

Table 2: Inferred Parameter Sensitivity & Limitations

Parameter PSMC Limitation Multi-Genome Improvement
Recent Bottlenecks Often undetected or smoothed Can be accurately timed and scaled
Population Splits Cannot be inferred Explicitly models divergence (e.g., MSMC's rate matrix)
Phasing Errors Robust to small errors Highly sensitive; requires precise phasing
Generation Time & μ Inferences scaled by g/μ; absolute timing uncertain Same scaling issue, but relative timing more precise

Experimental Protocols

Protocol 1: Standard PSMC Workflow for Single-Genome History Inference

Objective: To infer historical effective population size from a single diploid genome sequence.

Research Reagent Solutions & Essential Materials:

Item Function
High-Quality Genomic DNA Source material for sequencing to achieve high coverage.
Whole-Genome Sequencing Kit (e.g., Illumina NovaSeq, PacBio HiFi) Generates long or accurate short reads for robust variant calling.
BWA-MEM2 Aligner Aligns sequencing reads to a reference genome.
SAMtools & BCFtools Process alignments and call consensus sequence/variants.
PSMC Software (v0.6.5-r67+) Core algorithm for generating the Ne trajectory.
Reference Genome (species-specific) Alignment coordinate system.
Mutation Rate Estimate (e.g., 1.25e-8 per gen) Converts coalescent time to years.
Generation Time Estimate (e.g., 25 years) Converts generations to years.

Procedure:

  • Sequence & Align: Generate ≥20x coverage whole-genome sequencing. Align reads to reference genome using BWA-MEM. Sort output with SAMtools.
  • Generate Consensus: Use BCFtools mpileup and call to generate a diploid consensus sequence in FASTQ format. Filter for mapping quality >20 and base quality >20. Command: bcftools mpileup -C50 -uf ref.fa aligned.bam \| bcftools call -c \| vcfutils.pl vcf2fq -d 10 -D 100 > diploid.fq
  • Format Input for PSMC: Convert the FASTQ to the required input format (psmcfa) using the fq2psmcfa utility. Command: fq2psmcfa diploid.fq > sample.psmcfa
  • Run PSMC: Execute the PSMC model with predefined parameters. The -p flag specifies the atomic time interval pattern. Command: psmc -p "4+25*2+4+6" -o sample.psmc sample.psmcfa
  • Plot Results: Use the psmc_plot.pl utility to generate a plot of Ne over time, incorporating mutation rate (μ) and generation time (g). Command: psmc_plot.pl -g 25 -u 1.25e-08 -p plot_title sample.psmc

Protocol 2: MSMC2 Analysis for Multiple Genomes

Objective: To infer population size and separation history from multiple phased genomes.

Research Reagent Solutions & Essential Materials:

Item Function
Phased Haplotype Data (e.g., from trio-based or statistical phasing) Essential input; defines the chromosomal segments for coalescent analysis.
MSMC2 Software Implements the multi-genome sequentially Markovian coalescent model.
Shapeit4 or Eagle2 Software for accurate haplotype phasing if trio data unavailable.
High-Coverage WGS from Multiple Individuals Required to build accurate multi-haplotype input.

Procedure:

  • Variant Calling & Phasing: Perform joint variant calling on multiple samples using GATK or BCFtools. Phase the genotypes using a reliable method (e.g., trio-based phasing with Shapeit4 or population-based with Eagle2).
  • Generate Mask Files: Create mask files denoting callable genomic regions for each sample using BEDTools and sequencing depth filters.
  • Prepare MSMC Input: Use the generate_multihetsep.py script (provided with MSMC) to create a multi-sample heterozygous site file from the phased VCF and mask files. Command: generate_multihetsep.py -mask sample1_mask.bed ... -chr 1 phased.vcf.gz > chr1.multihetsep
  • Run MSMC2: Execute MSMC2 on the input files for all autosomes. Specify the number of haplotypes (e.g., 8 for 4 diploid individuals). Command: msmc2 -o multi_sample_output chr1.multihetsep chr2.multihetsep ...
  • Infer Population Splits (Optional): For two populations, run MSMC2 in cross-coalescence mode by specifying the --fixedRecombination flag and a predefined haplotype grouping to estimate split times.

Mandatory Visualizations

psmc_workflow DNA Single Diploid Genome DNA Seq Whole-Genome Sequencing DNA->Seq Align Alignment to Reference Seq->Align Consensus Generate Diploid Consensus (FASTQ) Align->Consensus Convert Convert to PSMC Input (psmcfa) Consensus->Convert RunPSMC Run PSMC Model Convert->RunPSMC Plot Plot Ne(t) with g & μ parameters RunPSMC->Plot Output Historical Ne Trajectory Plot->Output

Title: PSMC Analysis Workflow for a Single Genome

msmc_workflow MultiDNA Multiple Genomic DNA Samples JointCall Joint Variant Calling MultiDNA->JointCall Masks Generate Callable Masks MultiDNA->Masks Phasing Haplotype Phasing JointCall->Phasing Prepare Prepare Multi-Hetsep Input Files Phasing->Prepare Masks->Prepare RunMSMC2 Run MSMC2 Model (Coalescence Rates) Prepare->RunMSMC2 Results High-Resolution Ne(t) & Split Times RunMSMC2->Results

Title: MSMC2 Analysis Workflow for Multiple Genomes

time_resolution cluster_psmc PSMC Inference cluster_msmc MSMC Inference Past Deep Past (1 Mya) P_Ancient Moderate Resolution Past->P_Ancient M_Ancient High Resolution Past->M_Ancient Recent Recent Past (10 kya) P_Recent Low Resolution Recent->P_Recent M_Recent High Resolution Recent->M_Recent Now Present

Title: Comparative Time Resolution of PSMC vs. MSMC

Application Notes

The Pairwise Sequentially Markovian Coalescent (PSMC) method is a foundational tool for inferring historical population sizes from a single diploid genome sequence. Its application in population genetics, human evolutionary studies, and conservation biology necessitates rigorous validation to ensure the accuracy and reliability of its inferences. Validation strategies primarily involve benchmarking PSMC outputs against known demographic histories, which are either simulated computationally or derived from well-established demographic records. This protocol details the methodologies for designing and executing such validation experiments, framed within the broader thesis that PSMC is a powerful but model-dependent tool whose interpretations require careful calibration.

Core Validation Principles:

  • Simulation-based Validation: Coalescent simulations under a known demographic model (e.g., bottlenecks, expansions) are used to generate synthetic genome sequences. PSMC is then run on this synthetic data, and its inferred demography is compared to the known simulation input.
  • Benchmarking with Known Histories: PSMC results from real genomic data of populations with historically documented size changes (e.g., species with known bottleneck events from paleontological records) are used as a qualitative or semi-quantitative check.

Key Insights from Validation Studies:

  • PSMC is most reliable for inferring population size changes within the last ~1 million years (0.01 to 1-2 Ne generations ago).
  • Its resolution decreases for very recent (last 10,000 years) and very ancient history.
  • The method is sensitive to parameters such as mutation rate, generation time, and the atomic time interval parameter (-p).
  • Phasing errors and low coverage in real data can create spurious signals of population bottlenecks.

Experimental Protocols

Protocol 1: Coalescent Simulation for PSMC Benchmarking

Objective: To generate synthetic diploid genome data under a specified demographic history for validating PSMC inference accuracy.

Materials & Software:

  • High-performance computing cluster.
  • Coalescent simulator (e.g., ms, msprime, cosi2).
  • Post-processor to convert simulator output to FASTA/PSMC input format (e.g., ms2psmcfa.pl from PSMC utils).

Procedure:

  • Define the True Demographic Model (-eN flags in ms format):
    • Specify historical population size changes. Example:
      • -eN 0.0 1.0 (Current size = Nref)
      • -eN 0.1 0.5 (At 0.14Nref generations ago, size = 0.5Nref)
      • -eN 0.5 2.0 (At 0.54Nref generations ago, size = 2.0Nref)
  • Run the Coalescent Simulator:
    • Example using ms:

      • 2: Number of haplotypes (diploid individual).
      • 100: Number of independent sequences to simulate.
      • -t 20: Population mutation parameter θ = 4Nrefμ.
      • -r 20 1000000: Recombination parameter ρ = 4Nrefr.
  • Convert Output to PSMC Input:
    • Use the ms2psmcfa.pl script to convert the ms output to the psmcfa format.

  • Run PSMC on Simulated Data:

  • Plot and Compare:
    • Generate the PSMC plot using the psmc_plot.pl script with the known simulation parameters for scaling (mutation rate μ, generation time g).
    • Overlay the known, simulated demographic history onto the PSMC plot for visual comparison.
    • Calculate quantitative metrics like Mean Squared Error (MSE) between the inferred and true Ne(t) trajectories at defined time intervals.

Protocol 2: Benchmarking PSMC Against Known Demographic Histories

Objective: To assess PSMC's performance using genomic data from species/populations with independently documented population histories.

Materials:

  • High-coverage whole-genome sequencing data from a species with a well-resolved demographic history (e.g., the domestic dog post-bottleneck, Eurasian beaver after near-extinction, humans from isolated populations).
  • Published historical, archaeological, or paleogenomic estimates of past population sizes and timing of events.

Procedure:

  • Data Preparation:
    • Align reads to a reference genome and call consensus sequence for a diploid individual.
    • Generate the .psmcfa input file from the consensus using the fq2psmcfa utility.
  • PSMC Inference:
    • Run PSMC with multiple bootstrap replicates (e.g., 100) to assess confidence intervals.

  • Historical Data Compilation:
    • Create a table of known demographic events from the literature (e.g., "Bottleneck to ~1,000 individuals ~200 years ago").
  • Comparative Analysis:
    • Plot the PSMC inference using the best available estimates for μ and g.
    • Annotate the plot with vertical lines or shaded areas indicating the timing and magnitude of known historical events.
    • Perform a qualitative assessment: Do the direction and relative timing of PSMC-inferred changes correspond to the known history? Note that absolute scaling may be uncertain.

Table 1: Common Parameters for Coalescent Simulations in PSMC Validation

Parameter Symbol Typical Values Purpose in Validation
Reference Population Size Nref 10,000 - 25,000 Scales time and population size in simulations.
Mutation Rate per Gen. per Site μ 1.25e-8 to 2.5e-8 Converts coalescent time to real years. Critical for scaling.
Generation Time g 20 - 30 years Converts generational time to real years.
Recombination Rate per Gen. per Site r 1e-8 to 1e-9 Controls the correlation between adjacent genomic segments.
Sequence Length L 1e6 - 5e7 bp Longer lengths improve inference smoothness and accuracy.
Demographic Scenario Constant, Expansion, Bottleneck, Multiple Changes The "true" history to test if PSMC can recover it.

Table 2: Quantitative Metrics for PSMC Validation Performance

Metric Formula (Conceptual) Interpretation Ideal Value
Mean Squared Error (MSE) Σ (Inferred Ne(t) - True Ne(t))^2 / n Overall deviation of inferred trajectory. Closer to 0.
Time of Event Error |Inferred Time - True Time| / True Time Accuracy in timing key transitions (e.g., bottleneck start). < 0.2 (20% error).
Magnitude Error at Event |Inferred Ne - True Ne| / True Ne Accuracy in estimating population size at a known event. < 0.3 (30% error).
Correlation Coefficient (r) Pearson's r between log(Inferred Ne(t)) and log(True Ne(t)) Linear relationship between inference and truth. Closer to 1.

Mandatory Visualizations

G start Start: Define True Demographic Model sim Run Coalescent Simulator (ms/msprime) start->sim Parameters: Nref, μ, r, history conv Convert Output to PSMCFA Format sim->conv .out/.vcf run Run PSMC on Simulated Data conv->run .psmcfa comp Compare Inferred vs. True History run->comp .psmc plot metrics Calculate Quantitative Metrics (MSE, Error) comp->metrics

Title: PSMC Validation via Simulation Workflow

Title: PSMC Benchmarking Against Known History

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools & Data for PSMC Validation

Item Name Category Function/Brief Explanation
ms / msprime Coalescent Simulator Generates synthetic genomic sequences under a specified coalescent process with a defined demographic history. The ground truth for simulation-based validation.
PSMC Software Suite Inference Tool Core package containing the psmc binary and utility scripts (psmc_plot.pl, fq2psmcfa, ms2psmcfa.pl) for inference, plotting, and format conversion.
High-Coverage WGS Data Genomic Data Real sequencing data from a diploid individual of a species with a known demographic past. Used for empirical benchmarking.
Mutation Rate (μ) Estimates Calibration Parameter Critical for scaling PSMC output from generations to years. Often derived from pedigree studies or phylogenetic comparisons. A major source of uncertainty.
Generation Time (g) Estimates Calibration Parameter Average age of parents at childbirth. Converts generational time to real time. Obtained from life history studies.
BEDTools / SAMtools Data Processing For processing and manipulating alignment files (BAM/FASTQ) to generate the consensus sequence required for PSMC input.
R / Python with ggplot2/matplotlib Analysis & Visualization Used to calculate validation metrics, customize PSMC plots, and overlay true historical data for comparison.

Application Notes and Protocols

Thesis Context: This document details protocols and analytical frameworks for assessing concordance and discordance in demographic histories inferred via the Pairwise Sequentially Markovian Coalescent (PSMC) method. Within the broader thesis on PSMC for inferring ancient population size research, these notes address critical validation, caveats, and integration with complementary genomic signals.

1. Protocol: PSMC Inference and Replication Analysis

Objective: To generate and replicate PSMC trajectories from high-coverage whole-genome sequences, establishing a baseline for concordance assessment.

Materials & Reagents:

  • High-quality DNA from target individuals (≥30x WGS coverage recommended).
  • Reference genome (e.g., GRCh38/hg38).
  • Alignment software (BWA-MEM2, minimap2).
  • SAMtools, BCFtools, and HTSlib suites.
  • PSMC software (v0.6.5-r67 or later).
  • Perl scripts from the PSMC repository (utils/).

Procedure:

  • Alignment & Processing: Align sequencing reads to the reference genome. Convert to BAM format, sort, mark duplicates, and generate a consensus sequence in FASTQ format using samtools mpileup and bcftools call.
  • PSMC Input Preparation: Convert the diploid consensus FASTA to a PSMC input file (.psmcfa) using the fq2psmcfa utility. The default bin size is 100bp.
  • Parameter Estimation: Run PSMC with recommended parameters (e.g., -N25 -t15 -r5 -p "4+25*2+4+6"). The pattern "4+25*2+4+6" defines the atomic time intervals for the hidden Markov model.
  • Bootstrap Replication: Generate 100 bootstrap replicates of the .psmcfa file using splitfa and run PSMC on each to assess inference variance.
  • Plot Generation: Combine results using psmc_plot.pl to generate the final population size (Nₑ) vs. time (g) plot. Set generation time (e.g., -g 25) and mutation rate (e.g., -u 1.25e-8) for scaling.

Key Quantitative Data Outputs:

  • Inferred historical effective population size (Nₑ) at time intervals.
  • Time to the most recent common ancestor (T_MRCA).
  • Bootstrap confidence intervals.

Table 1: Exemplar PSMC Inferences from Public Datasets

Sample ID (Population) Nₑ at 100 kya Nₑ Minimum (Bottleneck) Bottleneck Time (kya) T_MRCA (Mya) Key Reference
NA12878 (CEU) ~9,500 ~3,100 ~20 kya ~1.2 Li & Durbin, 2011
HG004 (CHS) ~12,000 ~4,000 ~25 kya ~1.5 1000 Genomes Project
!Kung (Southern African) ~25,000 ~9,500 ~30 kya ~1.8 Gronau et al., 2011

2. Protocol: Discordance Analysis via Cross-Method Validation

Objective: To identify and quantify discordance by comparing PSMC inferences with those from complementary methods (e.g., MSMC, SMC++, Stairway Plot).

Procedure:

  • Data Preparation: Use the same aligned BAM files as input for all methods to ensure comparability.
  • Parallel Inference:
    • MSMC2: Call diploid genotypes per chromosome, generate mask and multihetsep files, run msmc2 on a single genome or combined cross-coalescence rates for multiple.
    • SMC++: Run smc++ vcf2smc to create input files, then smc++ estimate using a provided species divergence time for scaling.
    • Stairway Plot: Generate site frequency spectrum (SFS) from called SNPs and run the tailored blueprint.
  • Temporal Alignment: Carefully scale all outputs using the same generation time and mutation rate parameters. Note that methods have different sensitivity windows (PSMC: 10kya-1Mya; MSMC2: 1kya-200kya; SMC++: recent to 500kya).
  • Discordance Metric Calculation: For overlapping time intervals, calculate the log2 ratio of inferred Nₑ between PSMC and each alternative method. Visualize trajectories on a single log-scale plot.

Table 2: Comparison of Coalescent-Based Demographic Inference Methods

Method Core Algorithm Optimal Sample Size Best Time Range Key Strength Key Limitation
PSMC Pairwise HMM 1 diploid 10 kya - 1 Mya Deep history from 1 sample Poor resolution <10kya
MSMC2 Multi-SMC 2-8 haplotypes 1 kya - 200 kya More recent, phased data Needs phasing, small n
SMC++ Composite Likelihood Many (≥10) recent - 500 kya Uses SFS, better at recent times Complex model, computation
Stairway Plot SFS + Coalescent Many (≥10) recent - 500 kya Uses SFS, no phasing needed Assumes panmixia

3. Protocol: Interpreting Discordance via Technical and Biological Confounders

Objective: To diagnose whether observed discordance arises from methodological artifacts or real biological signals.

Experimental Workflow:

DiscordanceWorkflow Start Observed Discordance Between Methods TechCheck Technical Artifact Check Start->TechCheck BioCheck Biological Signal Check Start->BioCheck Param Parameter Sensitivity (e.g., μ, g) TechCheck->Param Re-run with bounds Data Data Quality & Coverage TechCheck->Data Subsample/ Simulate Model Model Assumptions TechCheck->Model Compare windows Sel Selection Linkage BioCheck->Sel Check B-statistic Struct Population Structure BioCheck->Struct Compare populations Archaic Archaic Introgression BioCheck->Archaic Scan introgression maps Conclusion Integrated Inference Param->Conclusion Data->Conclusion Model->Conclusion Sel->Conclusion Struct->Conclusion Archaic->Conclusion

Diagram Title: Diagnostic Workflow for Discordant Demographic Inferences

Procedure:

  • Test Technical Artifacts:
    • Parameter Sensitivity: Re-run PSMC with a ±25% range of mutation (μ) and generation time (g) values. Quantify the resultant shift in trajectory.
    • Coverage & Calling: Downsample high-coverage BAMs to 10x, 20x, and re-run the pipeline. Assess variance in Nₑ inference.
    • Genomic Windows: Run PSMC on individual chromosomes or 10Mb windows. Discordance within the PSMC method (e.g., autosomes vs. X chromosome) suggests selection or biased gene conversion.
  • Test Biological Hypotheses:
    • Selection: Overlay maps of B-statistic (background selection) or iHS (positive selection) from tools like SweepFinder2 or CLUES. Correlate regions of extreme PSMC inference with selection signals.
    • Structure: Compare PSMC plots from individuals representing distinct sub-populations (e.g., YRI vs. CEU). Shared ancient history but diverging recent trajectories indicate structure.
    • Introgression: Mask genomic regions known to harbor archaic introgression (using data from Denisovan or Neanderthal maps) and re-run PSMC. A shift in trajectory suggests introgression biases coalescent times.

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Resources for Demographic Inference Studies

Item Function Example/Provider
High-Coverage WGS Data Primary input for PSMC/MSMC. 1000 Genomes (30x), Simons Genome Diversity Project, gnomAD.
Reference Genome Alignment and coordinate basis. GRCh38 from Genome Reference Consortium.
Alignment & Variant Caller Process raw reads to consensus sequences. BWA-MEM2, GATK (for VCFs), Sentieon tools.
Demographic Inference Suite Core software for Nₑ(t) inference. PSMC, MSMC2, SMC++, Stairway Plot 2.
Simulation Framework Validate methods under known histories. msprime, SLiM, stdpopsim.
Selection Scan Tools Identify confounding selective sweeps. SweepFinder2, CLUES, RAiSD.
Introgression Maps Mask non-modern human sequences. Archaic genomes from MPI-EVA, Altai Neanderthal dataset.
Visualization Library Plot and compare trajectories. ggplot2 in R, matplotlib in Python.

Within the broader thesis on the Pairwise Sequentially Markovian Coalescent (PSMC) method for inferring ancient population size, a critical advancement lies in moving beyond demographic curves in isolation. The core thesis posits that PSMC-inferred trajectories gain biological and anthropological meaning only when integrated with independent lines of evidence from archaeology and paleoclimatology. This integration transforms inferred population changes into understandable population responses to documented environmental shifts or cultural innovations.

Foundational Data for Integration

Table 1: Core Paleoclimate Proxy Datasets for Integration with PSMC

Proxy Data Type Temporal Resolution Key Measured Variable(s) Primary Source/Archive Relevance to PSMC Interpretation
δ¹⁸O in Foraminifera (LR04) ~1-10 kyr Global Ice Volume, Temperature Lisiecki & Raymo (2005) Stack Glacial-Interglacial cycles as drivers of habitat change and population connectivity.
Greenland Ice Core (NGRIP, GISP2) Annual to Decadal Temperature (δ¹⁸O), Dust, Aerosols NOAA Paleoclimatology High-resolution Holocene climate events (e.g., 8.2k event, Younger Dryas).
Speleothem Records (e.g., Hulu, Sanbao Caves) Decadal to Centennial Regional Monsoon Intensity, Precipitation SISAL Database Regional aridity/humidity shifts affecting carrying capacity.
Pollen Assemblage Data Centennial to Millennial Vegetation Biome Composition Neotoma Database Direct evidence of habitat availability and quality.
Paleoshoreline & Bathymetry Models Millennial Continental Shelf Exposure (Sea Level) PALEOMAP Project Land area changes and potential migration corridors/barriers.

Table 2: Major Archaeological/Anthropological Transitions for Temporal Calibration

Transition/Event Approximate Date Range (ka) Associated Technological/Cultural Shift Potential PSMC Signal
Out-of-Africa Dispersal 70-50 ka Increased complexity in tools, symbolism Bottleneck followed by expansion in non-African lineages.
Initial Peopling of Eurasia ~45 ka Upper Paleolithic transition Divergence and separate demographic trajectories.
Last Glacial Maximum (LGM) 26-19 ka Refugium occupation, specialized tools Population decline, then post-glacial expansion.
Neolithic Revolution ~12-5 ka (varies) Domestication of plants/animals, sedentism Sustained population growth, altered effective size.
Bronze Age Collapses ~3.2 ka Urban decline, trade disruption Regional population contractions.

Application Notes & Experimental Protocols

Protocol: Temporal Alignment of PSMC Output with Climate Proxies

Objective: To statistically correlate PSMC-inferred Ne (effective population size) trajectories with quantitative paleoclimate time series.

Materials:

  • PSMC output file (*.psmc).
  • Paleoclimate time series data (e.g., δ¹⁸O, temperature anomaly).
  • Software: R with packages paleoTS, ggplot2, changepoint.

Procedure:

  • Rescale PSMC Time Axis: Convert the PSMC generation-based time axis to calendar years. Assume a generation time (g) and mutation rate (μ). Example: For humans, often g=25 years, μ=1.25e-8 per base per generation.

  • Interpolate & Log-Transform: Using R, linearly interpolate the PSMC Ne estimates onto a regular time grid (e.g., 100-year bins). Apply a log10 transformation to Ne to normalize variance.

  • Align Climate Data: Download and preprocess paleoclimate data for the same geographic region. Smooth high-resolution data to a comparable resolution (e.g., 500-yr rolling mean) and interpolate onto the same time grid.

  • Correlation & Lead-Lag Analysis: Perform cross-correlation analysis to identify significant relationships and potential time lags between climate and Ne.

  • Changepoint Detection: Apply changepoint detection algorithms to both time series independently to identify significant shift points. Compare the timing of changepoints in Ne with those in climate.

Protocol: Integrating Archaeological Transitions as PSMC Prior Constraints

Objective: To use well-dated archaeological events as informative priors for PSMC parameter estimation or for post-hoc interpretation.

Materials:

  • High-coverage diploid genome sequence (BAM/FASTQ).
  • Archaeological event timeline (Table 2).
  • Software: PSMC, MSMC, or related tools; custom R/Python scripts.

Procedure:

  • Pre-sequencing Design: If possible, select a genome from a population with a deep, well-studied archaeological record in its geographic region of origin.
  • PSMC Analysis with Time-Scaling: Run standard PSMC analysis (psmc -p "4+25*2+4+6"). During the plotting stage, explicitly test different generation times (g) that align with anthropological estimates for the population of interest.

  • Event Overlay & Hypothesis Testing: Superimpose the archaeological timeline onto the PSMC plot.

    • Hypothesis 1 (Bottleneck): Does a known period of climatic stress (e.g., LGM) or cultural collapse precede a PSMC-inferred decline?
    • Hypothesis 2 (Expansion): Does the onset of a major technological innovation (e.g., Neolithic) precede a sustained PSMC-inferred growth period?
    • Use a moving-window comparison to calculate the mean Ne for 5kyr windows before and after an event date. Perform a permutation test to assess if the difference is significant.

Visualizations

G cluster_inputs Input Data Streams cluster_psmc PSMC Core Analysis cluster_integration Integration & Interpretation A1 High-Coverage Diploid Genome B1 Variant Calling (Heterozygous Sites) A1->B1 A2 Paleoclimate Proxy Time Series (e.g., δ¹⁸O) C1 Temporal Scaling & Alignment A2->C1 A3 Archaeological Event Timeline A3->C1 B2 PSMC Model Inference (Coalescent Simulation) B1->B2 B3 Effective Population Size (Ne) Trajectory B2->B3 B3->C1 C2 Statistical Correlation & Changepoint Analysis C1->C2 C3 Causal Hypothesis Formulation & Testing C2->C3 C4 Refined Demographic History Model C3->C4 C4->B2 Feedback as Prior

Title: Workflow for Integrating PSMC, Climate, and Archaeology Data

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Tools for Integrative PSMC Studies

Item Category Function & Application Notes
High Molecular Weight Ancient DNA Biological Sample For direct sequencing of paleogenomes, allowing PSMC to be run on individuals from known archaeological contexts. Requires strict aDNA protocols.
Illumina NovaSeq X Plus Sequencing Platform Provides ultra-high-throughput sequencing needed for deep coverage (>30X) of diploid genomes for robust PSMC input.
bwa-mem2 / GATK Best Practices Bioinformatics Pipeline For accurate alignment and variant calling from raw sequencing data to generate the diploid consensus sequence for PSMC.
PSMC Software (v0.6.5-r13+) Core Analysis Tool Implements the PSMC algorithm. Must be used with appropriate parameters (-p) for the species ploidy and time span.
IntCal20 / SHCal20 Curves Calibration Data Radiocarbon calibration curves essential for converting archaeological dates into calendar years, enabling precise timeline alignment.
PaleoView / CHELSA-TraCE21k Climate Model Data Provides spatially explicit paleoclimate simulations (temperature, precipitation) for the last 21k years for regional correlation.
R dplyr, ggplot2, paleoTS Statistical Software Core packages for data manipulation, visualization, and time-series analysis of integrated Ne and climate data.
Google Earth Engine Geospatial Platform Enables analysis of modern and past landscape features (from models) to contextualize population movements and refugia.

The Pairwise Sequentially Markovian Coalescent (PSMC) method, introduced by Li and Durbin in 2011, revolutionized the inference of historical population size dynamics from a single diploid genome. Its core thesis is that the distribution of heterozygous sites along a genome reflects the time to the most recent common ancestor (TMRCA), which is influenced by past effective population size (Nₑ). While transformative, PSMC has inherent limitations: it requires high-quality, contiguous haplotype assemblies and loses resolution for recent (<10 kya) and very ancient (>1-2 mya) history.

The advent of telomere-to-telomere (T2T) assemblies, pangenome references, and highly accurate long-read sequencing (e.g., PacBio HiFi, Oxford Nanopore) provides data that can overcome these limitations. This application note details how PSMC analysis must be adapted and future-proofed within this new genomic context, validating its continued relevance for evolutionary biology, conservation genetics, and understanding demographic correlates of disease.

The following tables summarize key performance metrics of PSMC under different data paradigms.

Table 1: Sequencing Technology Comparison for PSMC Input

Technology Read Length (avg) Raw Error Rate Per-base Cost (approx.) Primary PSMC Limitation Addressed
Short-Read Illumina 100-300 bp ~0.1% $ Low Coverage depth, not phasing/contiguity
PacBio CLR 10-30 kb ~10-15% $$ High Contiguity (requires high coverage for correction)
PacBio HiFi 10-25 kb ~0.1% $$ High Phasing & Contiguity
Oxford Nanopore 10-100+ kb ~2-5% (Duplex ~Q30) $ Medium Contiguity & Structural Variant Inclusion

Table 2: PSMC Parameter Recovery Across Genome Assembly Types

Assembly Type Typical Contig N50 Phasing Accuracy Impact on PSMC Inferred Nₑ Curve Key Advantage
Short-Read Scaffolded 1-50 Mb Low (statistical) High noise, recent history obscured Widely available
HiFi-Based Haplotype 20-100+ Mb High (direct) Sharper recent history (<20 kya) Accurate long-range phase
T2T/Complete Genome Chromosome-scale Complete Eliminates assembly gap artifacts No missing regions
Pangenome Graph N/A (graph-based) Multiple haplotypes Enables lineage-specific inference Captures population diversity

Application Notes & Protocols

Protocol 1: Generating PSMC Input from a Haplotype-Resolved T2T Assembly

Objective: To create the high-quality, phased .psmcfa input file from a complete genome assembly.

Materials & Reagents:

  • Haplotype-resolved T2T assembly (FASTA files for hap1 & hap2).
  • Reference genome (standard linear reference, e.g., GRCh38).
  • Software: minimap2, samtools, bcftools, PSMC suite (utils/fq2psmcfa, psmc).

Procedure:

  • Pairwise Alignment: Independently align each haplotype assembly to the linear reference.

  • Variant Calling: Call SNPs relative to the reference for each haplotype.

  • Create Pseudo-Diploid Genome: Merge the haplotype-specific VCFs to simulate a diploid individual.

  • Generate Consensus Sequence: Use the merged VCF and the reference to create a consensus FASTA.

  • Convert to PSMC Format: Generate the final input file.

Protocol 2: Integrating Pangenome Graphs for Lineage-Specific PSMC Analysis

Objective: To infer population history for a specific lineage (e.g., a disease-associated haplotype) embedded within a pangenome graph.

Materials & Reagents:

  • Pangenome graph (e.g., in GFA/VG format).
  • Path information for the lineage of interest within the graph.
  • Software: vg, BCFtools, custom scripts for path extraction.

Procedure:

  • Extract Path Sequence: Isolate the linear sequence of the specific haplotype path from the graph.

  • Create a Pseudo-Reference: Use the extracted path as the reference for subsequent steps.
  • Map Related Sequences: Map short-read or assembly data from multiple individuals to this lineage-specific reference.
  • Call Variants and Generate Input: Follow Protocol 1, Steps 2-5, using the lineage-specific reference. This creates a .psmcfa file representing the diversity within that specific ancestral haplotype background.
  • Comparative Analysis: Run PSMC independently on multiple lineage-specific .psmcfa files to compare the demographic histories of different ancestral backgrounds within the same population.

Visualizations

Diagram 1: PSMC Workflow with Modern Sequencing Data

G cluster_old Legacy Pipeline cluster_new Future-Proofed Pipeline O1 Short-Read WGS O2 Map to Linear Ref O1->O2 O3 Statistical Phasing O2->O3 O4 PSMC Input (.psmcfa) O3->O4 PSMC PSMC Inference O4->PSMC N1 Long-Read Sequencing (PacBio HiFi, ONT) N2 T2T Haplotype Assembly OR Pangenome Graph N1->N2 N3 Direct Haplotype Extraction N2->N3 N4 High-Quality PSMC Input N3->N4 N4->PSMC Output Historical Nₑ Curve (Improved Recency & Accuracy) PSMC->Output

Diagram 2: Pangenome-Enabled Lineage-Specific Analysis

G Pangenome Pangenome Reference Graph HapA Haplotype A Path Pangenome->HapA Extract HapB Haplotype B Path (e.g., Disease Locus) Pangenome->HapB Extract VCF Merged VCF (Variants on Haplotype B) HapB->VCF Variant Calling Seq1 Individual 1 Reads Seq1->HapB Map to Seq2 Individual N Reads Seq2->HapB Map to PSMCfa Lineage-Specific .psmcfa File VCF->PSMCfa History Inferred Demography of Haplotype B Carriers PSMCfa->History PSMC Analysis

The Scientist's Toolkit: Research Reagent Solutions

Item Function in PSMC Future-Proofing Example/Supplier
High Molecular Weight (HMW) DNA Kit To extract ultra-long, intact DNA essential for T2T assembly and accurate phasing. PacBio SRE, Qiagen Genomic-tip, Nanobind CBB.
PacBio HiFi Library Prep Kit Generates highly accurate long reads (Q20+) for base-perfect haplotype assembly. SMRTbell prep kit 3.0 (PacBio).
Oxford Nanopore Duplex Sequencing Kit Produces ultra-long duplex reads with very high accuracy for contiguity. Kit 14, Duplex adapter (ONT).
Hi-C or Strand-Seq Library Kit Provides orthogonal phasing information to validate and scaffold haplotype assemblies. Dovetail Omni-C, s3-seq.
Pangenome Graph Software Suite Tools to construct, query, and extract paths from graph-based references. vg, minigraph, pggb.
PSMC Software Suite The core algorithm and utilities for data conversion and bootstrapping. psmc (https://github.com/lh3/psmc).
Cluster/Cloud Computing Resources Essential for computationally intensive assembly, graph construction, and parameter exploration. AWS, Google Cloud, SLURM HPC.

Conclusion

The PSMC method remains a powerful and accessible tool for inferring ancient population size from individual genomes, offering unique insights into demographic histories that are crucial for interpreting patterns of genetic variation in biomedical research. Mastering its foundational principles, methodological nuances, and limitations—as outlined across the four intents—enables researchers to robustly apply it to studies of human disease susceptibility, pathogen evolution, and comparative genomics. Future directions involve integrating PSMC inferences with deep phenotypic data, leveraging higher-quality genomes from long-read sequencing, and developing hybrid models that combine its strengths with those of multi-sample methods. For drug development professionals, these refined demographic histories can illuminate population-specific selection pressures on drug targets and inform more inclusive clinical trial designs, ultimately bridging deep evolutionary history with precision medicine initiatives.