Navigating Introgression: A Comprehensive Guide to ASTRAL for Robust Species Tree Estimation

Levi James Dec 02, 2025 83

This article provides a comprehensive resource for researchers and bioinformaticians grappling with species tree estimation in the presence of gene flow.

Navigating Introgression: A Comprehensive Guide to ASTRAL for Robust Species Tree Estimation

Abstract

This article provides a comprehensive resource for researchers and bioinformaticians grappling with species tree estimation in the presence of gene flow. We explore the theoretical foundations of the multi-species coalescent (MSC) and the challenges posed by introgression, including ghost lineages. A detailed methodological guide covers the practical application of ASTRAL, from input preparation to result interpretation. The article further addresses troubleshooting common pitfalls, optimizing analyses for accuracy, and validating findings through comparisons with methods like *BEAST, MP-EST, and STELAR. By synthesizing current research and best practices, this guide aims to empower robust phylogenomic inferences with direct implications for understanding evolutionary history and trait diversification in biomedical research.

Core Concepts: The Coalescent Model, Introgression, and the Challenge for Phylogenomics

The Multi-Species Coalescent (MSC) Model and Incomplete Lineage Sorting (ILS)

The Multispecies Coalescent (MSC) is a fundamental population genetic model that extends the single-population coalescent framework to multiple species. It integrates the phylogenetic process of species divergences with the population genetic process of coalescence, providing a powerful framework for addressing key evolutionary questions using genomic sequence data from multiple species [1]. The MSC models the genealogical history of sequences sampled from different species, tracing their lineage joining backward in time through both population and species divergence events [1].

A critical application of the MSC is modeling incomplete lineage sorting (ILS), one of the most frequent causes of discordance between gene trees and species trees [2]. ILS occurs when ancestral genetic polymorphisms persist through rapid speciation events, causing the genealogical histories of individual genes to differ from the overall species phylogeny [3]. This phenomenon is particularly common in lineages with large effective population sizes and short intervals between speciation events [4] [3]. The MSC provides the mathematical foundation to disentangle this discordance, enabling researchers to estimate species trees despite conflicting signals across the genome [1].

Theoretical Foundations of MSC and ILS

The Probability Framework of the MSC

Under the MSC model, gene trees and their coalescence times follow a defined probability distribution given a species tree and its parameters. The model incorporates two sets of parameters: (1) species divergence times (τ), and (2) population size parameters (θ) for both extant and ancestral species [1]. For a sample of n sequences from a single population, the coalescent process describes the waiting times between successive coalescent events as independent exponential variables with means determined by the population size parameter [1].

In the multispecies context, when tracing sequences backward in time and reaching a speciation event, the coalescent process and rate are reset due to changes in population size and the introduction of sequences from sibling species [1]. The MSC model yields two key probability distributions: (1) the marginal probabilities of gene tree topologies, and (2) the joint distribution of gene tree topologies and coalescent times [1]. These distributions form the theoretical basis for both full-likelihood and summary methods of species tree estimation.

The Incomplete Lineage Sorting Phenomenon

ILS represents a fundamental disconnect between gene trees and species trees that arises from the stochastic nature of allele inheritance during speciation. When multiple alleles survive in an ancestral population and are randomly sorted into descendant species, the gene tree topology may not match the species tree topology [5]. The probability of ILS increases when internal branches of the species tree are short relative to effective population size, measured in coalescent units (CU) [5].

Table 1: Factors Influencing Incomplete Lineage Sorting

Factor Effect on ILS Biological Interpretation
Effective Population Size Positive correlation Larger populations maintain genetic diversity longer, increasing ancestral polymorphisms
Time Between Speciation Events Negative correlation Shorter internal branches provide less time for coalescence, increasing ILS probability
Generation Time Complex interaction Affects the number of generations within a given time period, influencing coalescence rates
Reproductive System Modulating factor Sexual reproduction maintains polymorphisms more effectively than asexual systems

A key consequence of ILS is hemiplasy - the appearance of homoplasy (similar traits evolving independently) that actually results from a single evolutionary event occurring on a discordant gene tree rather than multiple independent origins [6]. This phenomenon can mislead phylogenetic inferences and the interpretation of trait evolution [6].

ASTRAL: Algorithm and Implementation for Species Tree Estimation

Core Algorithm and Statistical Properties

ASTRAL (Accurate Species TRee ALgorithm) is a genome-scale coalescent-based method for species tree estimation that operates by finding the tree with the minimum quartet distance to an input set of inferred gene trees [2] [5]. The fundamental optimization problem that ASTRAL addresses can be summarized as follows: given a set G of k unrooted gene trees, singly-labeled by the leaf-set L of n taxa, there are k(n choose 4) quartet trees induced by the input set. The Weighted Quartet (WQ) score of any candidate species tree is defined as the number of these quartet trees that the candidate tree also induces. ASTRAL seeks to find the species tree that maximizes the WQ score [5].

ASTRAL is statistically consistent under the multi-species coalescent model, meaning that as the number of genes increases, the probability of recovering the true species tree approaches 1 [2]. This statistical guarantee, combined with its computational efficiency, makes ASTRAL particularly suitable for phylogenomic analyses with hundreds or thousands of genes [2]. Empirical evaluations have demonstrated that ASTRAL shows outstanding accuracy, often improving on competing methods like MP-EST and the population tree from BUCKy, and sometimes even exceeding the accuracy of concatenation using maximum likelihood [2].

Handling Multi-allele Datasets

Early versions of ASTRAL were limited to single individuals per species, but the algorithm has been extended to handle multi-allele datasets where multiple individuals are sampled per species [5]. This extension allows researchers to account for polymorphisms within present-day species, which can be important for accurately modeling ILS, particularly when terminal branches are short [5].

The quartet-based optimization problem extends naturally to multi-labeled datasets, with the main computational challenge being the definition of an appropriate constrained search space [5]. Heuristic approaches based on subsampling individuals have been developed to build a sufficiently large search space while maintaining computational tractability [5]. Interestingly, simulation studies have revealed that sampling more genes generally provides greater improvements in accuracy than sampling more individuals, even under conditions of extremely high ILS [5].

MSC Fig 1. MSC Model Showing ILS: Gene tree (green) differs from species tree (blue) S1 Species A S2 Species B S3 Species C ANC1 Ancestral Population 1 ANC1->S1 ANC1->S2 ANC2 Ancestral Population 2 ANC2->S3 ANC2->ANC1 G1 Gene Copy 1 G2 Gene Copy 2 G3 Gene Copy 3 G_COAL1 G_COAL1->G2 G_COAL1->G3 G_COAL2 G_COAL2->G1 G_COAL2->G_COAL1

Experimental Protocols for MSC-based Phylogenomic Analysis

Transcriptome-based Phylogeny Reconstruction

Protocol Objective: Reconstruct species relationships in the presence of ILS using transcriptome data [4] [7].

Step 1: Sample Collection and RNA Extraction

  • Collect fresh tissues from young shoots or apical meristems
  • Extract total RNA using modified CTAB method with NaCl and PVPP to remove polysaccharides and polyphenols [4]
  • Use extraction buffer containing 2% CTAB, 2% PVPP, 2M NaCl, 100mM Tris-base, 20mM EDTA (pH 7.5), and 2% β-mercaptoethanol [4]

Step 2: Library Preparation and Sequencing

  • Perform mRNA enrichment and cDNA synthesis
  • Prepare sequencing libraries using standard protocols
  • Sequence on an appropriate platform (Illumina recommended)

Step 3: Data Processing and Orthology Assessment

  • Assemble transcriptomes using Trinity or similar software
  • Identify orthologous genes using OrthoFinder or similar tools
  • Construct two datasets: (1) plastid protein-coding genes (PCGs), and (2) nuclear orthologous genes (OGs) [7]

Step 4: Gene Tree and Species Tree Inference

  • For each nuclear OG, estimate maximum likelihood gene trees
  • Reconstruct the species tree using ASTRAL from the set of gene trees [7]
  • For comparative analysis, reconstruct a concatenated maximum likelihood tree from plastid PCGs

Step 5: Assessment of Gene Tree Discordance

  • Calculate "site concordance factors" (sCF) and "site discordance factors" (sDF1/sDF2) [7]
  • Perform phylogenetic network analyses for nodes with imbalanced sDF1/sDF2 values
  • Apply D-statistics and QuIBL to test for introgression versus ILS [7]
Whole Genome Sequencing for ILS Detection

Protocol Objective: Discriminate between ILS and introgression using whole genome sequences [8].

Step 1: Genome Sequencing and Alignment

  • Sequence whole genomes of multiple individuals per species
  • Map reads to a reference genome using BWA or similar aligners
  • Perform whole-genome synteny alignment across species

Step 2: Phylogenomic Analysis

  • Extract four-fold degenerate sites from the whole-genome alignment
  • Reconstruct maximum likelihood species trees from nuclear data
  • Estimate divergence times using Bayesian methods in BEAST with appropriate mutation rates and generation times [8]

Step 3: Gene Flow Detection

  • Identify identical-by-descent haplotypes using BEAGLE [8]
  • Perform ABBA/BABA tests to detect significant gene flow
  • Estimate admixture proportions using f4 ratio tests [8]
  • Conduct sliding window analyses of genetic distance (Dxy) to identify regions with anomalous phylogenetic signals

Step 4: Coalescent Simulation

  • Perform tree coalescence analysis to estimate expected frequencies of different gene tree topologies under pure ILS [8]
  • Compare observed distribution of gene tree topologies with expectations under ILS
  • Calculate the relative frequencies of various topologies, including the anomalous mtDNA topology [8]

Table 2: Key Analyses for Discriminating ILS from Introgression

Analysis Type Data Input Method Interpretation
Quartet Concordance Gene trees ASTRAL High discordance suggests ILS or introgression
ABBA/BABA Test SNP data D-statistics Significant D values indicate introgression
Site Concordance Sequence alignments sCF/sDF Quantifies support for alternative topologies
Network Analysis Gene trees PhyloNetworks Visualizes conflicting phylogenetic signals
Coalescent Simulation Species tree parameters Coalescent simulations Tests if observed discordance matches ILS expectations

Case Studies and Empirical Applications

Aspidistra Phylogeny: High ILS in Taiwanese Flora

A transcriptome-based study of five Aspidistra species in Taiwan revealed substantial ILS, with numerous genes supporting alternative tree topologies [4]. Phylogenetic analysis yielded a well-supported species tree but detected conflicting signals across approximately 20.8% of genes [4]. Notably, the two varieties of A. daibuensis failed to form a monophyletic group despite morphological similarities, with only 20.8% of genes supporting their grouping together [4]. Genes showing positive selection in photosynthesis-related pathways suggested that morphological similarities arose through convergent evolution rather than shared ancestry [4]. This case demonstrates how MSC-based methods can disentangle complex evolutionary histories where ILS and selection create conflicting signals.

Marsupial Evolution: Phenotypic Consequences of ILS

A whole-genome study of marsupials revealed that over 50% of marsupial genomes are affected by ILS, with substantial impact on morphological evolution [3]. The South American monito del monte was identified as the sister lineage to all Australian marsupials, despite over 31% of its genome showing closer affinity to Diprotodontia due to ILS during ancient radiation [3]. Researchers detected hundreds of genes that experienced stochastic fixation during ILS, encoding the same amino acids in non-sister species [3]. Functional experiments validated that ILS directly contributed to hemiplasy in morphological traits established during rapid marsupial speciation approximately 60 million years ago [3]. This study provides empirical evidence that ILS can lead to incongruent phenotypic variation across species.

Wisent Phylogeny: Resolving Anomalous mtDNA Patterns

The wisent (European bison) presents a classic case of phylogenetic discordance, with mtDNA clustering with cattle despite nuclear genomic affinity with American bison [8]. Whole-genome analysis revealed only minor recent gene flow between wisent and cattle, insufficient to explain the mtDNA pattern [8]. Instead, researchers identified appreciable heterogeneity in nuclear gene tree topologies, with relative frequencies consistent with ILS expectations from coalescent analysis [8]. The anomalous wisent mtDNA phylogeny was explained as the outcome of a rare coalescent event rather than ancient hybridization [8]. This case highlights the importance of genome-wide data and coalescent modeling for distinguishing between ILS and introgression.

ASTRAL Fig 2. ASTRAL Workflow: From sequence data to species tree accounting for ILS SEQ Sequence Data (Transcriptomes/WGS) GT Gene Tree Estimation SEQ->GT PC Plastome Analysis SEQ->PC GTSET Set of Gene Trees GT->GTSET AST ASTRAL Analysis GTSET->AST ST Species Tree AST->ST DISC Discordance Analysis ST->DISC COMP Comparative Analysis ST->COMP ILS ILS Assessment DISC->ILS INT Introgression Test DISC->INT PT Plastid Tree PC->PT PT->COMP

Table 3: Key Computational Tools for MSC and ILS Analysis

Tool/Resource Primary Function Application Context
ASTRAL Species tree estimation from gene trees Summary method handling ILS in multi-locus datasets [2]
SimPhy Simulation of gene trees under MSC Generating datasets with controlled ILS levels [5]
BEAST Bayesian evolutionary analysis Divergence time estimation with molecular clocks [8]
OrthoFinder Orthogroup inference Identifying orthologous genes across species [7]
Trinity Transcriptome assembly De novo assembly from RNA-Seq data [4]
BEAGLE Identity-by-descent detection Identifying shared haplotype segments [8]
IQ-TREE Maximum likelihood phylogenetics Gene tree estimation with model selection [7]
D-statistics Introgression testing ABBA/BABA tests for gene flow [7] [8]

Advanced Considerations in MSC Modeling

MSC Extensions for Quantitative Traits

Recent extensions of the MSC framework incorporate models for quantitative traits, allowing evolutionary inferences at both micro- and macroevolutionary scales [6]. These models account for genealogical discordance underlying quantitative trait variation, which can significantly impact trait covariance patterns [6]. Without accounting for ILS, trait covariance between closely related species decreases relative to distantly related species, potentially leading to overestimation of evolutionary rates, decreased phylogenetic signal, and errors in identifying shifts in mean trait values [6]. This framework applies equally to discrete threshold traits, revealing the broad impact of ILS on trait evolution beyond molecular sequences [6].

Integration with Introgression Models

While the basic MSC model addresses ILS, real genomic datasets often involve both ILS and introgression. The multispecies network coalescent represents an active area of methodological development that simultaneously accounts for both processes [1]. Current research focuses on developing integrated models that can distinguish between these sources of discordance, with promising approaches including phylogenetic networks, D-statistics, and QuIBL (Quantitative Introgression from Branch Lengths) [7]. These methods enable researchers to partition discordance into components attributable to ILS versus introgression, providing more accurate reconstructions of evolutionary history.

The Multispecies Coalescent model provides an essential mathematical framework for understanding incomplete lineage sorting and its effects on species tree inference. ASTRAL represents a computationally efficient and statistically consistent implementation of MSC principles, enabling accurate species tree estimation from genome-scale data despite widespread gene tree discordance [2]. Through careful experimental design and appropriate analytical protocols, researchers can successfully distinguish ILS from other sources of phylogenetic conflict, particularly introgression [7] [8].

Future methodological developments will likely focus on improving scalability for larger datasets, integrating additional biological processes beyond ILS, and developing more sophisticated models for trait evolution under the coalescent [1] [6]. As phylogenomic datasets continue growing in size and taxonomic scope, MSC-based approaches like ASTRAL will remain indispensable tools for reconstructing evolutionary history in the presence of genealogical discordance.

In phylogenomics, the widespread phenomenon of gene tree discordance—where different genes tell conflicting evolutionary stories—presents a significant challenge to reconstructing species relationships. Two major biological processes contributing to this discordance are introgression and the presence of ghost lineages. Understanding these concepts is crucial for researchers using species tree estimation methods like ASTRAL, as failure to account for them can lead to inaccurate evolutionary inferences.

Introgression, also called introgressive hybridization, refers to the transfer of genetic material from one species into the gene pool of another through repeated hybridization and backcrossing [9]. This process is distinct from simple hybridization as it typically results in a complex, highly variable mixture of genes rather than a uniform hybrid, and may involve only a minimal percentage of the donor genome [9]. When this genetic transfer increases the overall fitness of the recipient taxon, it is considered adaptive introgression [9].

Ghost lineages represent evolutionary lines of descent that have left no fossil evidence but can be inferred to exist through phylogenetic analysis or genomic evidence [10]. These lineages represent unseen diversity that can be predicted through analysis of evolutionary relationships and tested through fossil discoveries or genomic studies. A special category, ghost introgression, refers to traces of extinct species present in modern genomes through past introgression events [9].

Quantitative Impact on Genomic Studies

Table 1: Documented Levels of Introgression Across Organisms

Organism/Lineage Level of Introgression Key Findings Citation
Bacteria (50 major lineages) Average 2% of core genes (up to 14% in Escherichia–Shigella) Introgression most frequent between highly related species; does not substantially blur species borders [11] [12]
Fagaceae family (oak family) Gene flow accounts for 7.76% of gene tree variation Incomplete lineage sorting (9.84%) and gene tree estimation error (21.19%) also major contributors [13]
Heliconius butterflies 2-5% introgression between subspecies Non-random introgression concentrated in chromosomes containing mimicry loci [9]
Modern humans 1-4% from archaic hominins Evidence of introgression from Neanderthals and Denisovans [9]

Table 2: Relative Contributions to Gene Tree Discordance in Plants

Source of Discordance Contribution in Fagaceae Characteristics Research Implications
Gene Tree Estimation Error 21.19% Caused by limited phylogenetic signal, model misspecification Improved gene tree estimation methods needed
Incomplete Lineage Sorting (ILS) 9.84% Results from random sorting of ancestral polymorphisms Coalescent methods required to account for ILS
Gene Flow/Introgression 7.76% Can lead to cytonuclear discordance Network approaches necessary to detect hybridization
Consistent Genes 58.1-59.5% Recover species tree topology Strong phylogenetic signals
Inconsistent Genes 40.5-41.9% Display conflicting phylogenetic signals Removal can reduce concatenation vs. coalescent conflict

Experimental Protocols for Detection

Phylogenomic Workflow for Detecting Introgression and Ghost Lineages

G Phylogenomic Detection Workflow cluster_1 Data Collection cluster_2 Gene Tree Estimation cluster_3 Discordance Analysis cluster_4 Source Identification Samples Sample Collection (Transcriptomes/Genomes) Seq Sequencing & Assembly Samples->Seq Orthology Orthology Inference Seq->Orthology Alignment Multiple Sequence Alignment Orthology->Alignment GeneTrees Individual Gene Tree Estimation Alignment->GeneTrees Concatenation Concatenated Analysis GeneTrees->Concatenation ASTRAL ASTRAL Species Tree Estimation GeneTrees->ASTRAL Concatenation->ASTRAL Quartet Quartet Analysis & Site Patterns ASTRAL->Quartet Tests Topology Tests (ABBA/BABA) Quartet->Tests ILS ILS Assessment Tests->ILS Introgression Introgression Detection Tests->Introgression Ghost Ghost Lineage Inference Tests->Ghost

Detailed Methodological Approaches

Data Collection and Processing
  • Taxon Sampling: Include comprehensive representation across target clade with appropriate outgroups [14]. For Amaranthaceae study: 92 ingroup species (88 transcriptomes, 4 genomes) representing 53 genera [14].
  • Sequence Data Generation: Utilize transcriptomic or genomic data. For Fagaceae mtDNA: assemble mitochondrial genome using GetOrganelle, annotate with IPMGA, map reads using BWA, call SNPs with GATK using quality filters (min-base-quality-score 30, minimum-mapping-quality 30) [13].
  • Orthology Inference: Identify orthologous loci across species using tools such as OrthoFinder or custom pipelines to avoid hidden paralogy [14].
Gene Tree and Species Tree Estimation
  • Gene Tree Estimation: For each locus, infer trees using maximum likelihood (IQ-TREE) or Bayesian methods (MrBayes) with appropriate model selection [13].
  • Species Tree Estimation: Implement ASTRAL to estimate species trees from gene trees while accounting for incomplete lineage sorting [15]. ASTRAL finds the species tree that has the maximum number of shared induced quartet trees with the set of gene trees [15].
Detection and Quantification of Introgression
  • ABBA/BABA Test: Calculate D-statistics to test for significant deviation from tree-like evolution, indicating introgression [9] [14].
  • Phylogenetic Network Methods: Use methods that account for ILS and hybridization simultaneously (e.g., Solís-Lemus and Ané 2016) to infer phylogenetic networks [14].
  • Site Pattern Tests: Apply phylogenetic invariants to detect introgression based on patterns of site frequencies [14].
Ghost Lineage Inference
  • Phylogenetic Gap Analysis: Identify lineages implied by phylogenetic trees but missing from fossil records through examination of sequential stratigraphic units [10].
  • Genomic Reconstruction: Compare genomes of living relatives to identify genetic variations that don't align with known species, suggesting ghost lineage [16]. For canids: phylogenetic analysis of ASIP coat color variants revealed ghost lineage from extinct wolf-like species [16].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Research Reagents and Computational Tools

Tool/Reagent Function Application Notes Reference
ASTRAL Species tree estimation from gene trees Statistically consistent under multispecies coalescent; handles ILS [15]
IQ-TREE Maximum likelihood phylogenetic inference Efficient model selection; supports large datasets [13]
GetOrganelle Organelle genome assembly Used for mtDNA and cpDNA assembly from NGS data [13]
GATK Variant discovery SNP calling with quality filtering capabilities [13]
Orthology Inference Pipelines Identify orthologous genes Critical for avoiding hidden paralogy [14]
ABBA/BABA Tests Detect introgression Measures deviation from null model of no introgression [9] [14]
Phylogenetic Network Methods Model hybridization Accounts for both ILS and introgression [14]

Interpreting Results in ASTRAL Framework

ASTRAL Species Tree Estimation with Introgression

G ASTRAL Analysis with Introgression cluster_1 Input Data cluster_2 ASTRAL Analysis cluster_3 Introgression Signals cluster_4 Ghost Lineage Detection GeneTrees Input Gene Trees (Unrooted) Quartets Quartet Frequencies GeneTrees->Quartets SpeciesTree Species Tree Estimation (Maximize Shared Quartets) Quartets->SpeciesTree BranchLength Branch Length Estimation (Coalescent Units) SpeciesTree->BranchLength Support Local Posterior Probability Support BranchLength->Support Discordance Gene Tree Discordance Analysis Support->Discordance QuartetScores Anomalous Quartet Scores Discordance->QuartetScores ShortBranches Consecutive Short Internal Branches QuartetScores->ShortBranches Unexplained Unexplained Phylogenetic Signals ShortBranches->Unexplained AncientVariants Ancient Genetic Variants Without Known Source Unexplained->AncientVariants GhostIntrogression Ghost Introgression Detection AncientVariants->GhostIntrogression

Interpretation Guidelines

When working with ASTRAL in the context of introgression research, several key patterns emerge:

  • Anomalous Quartet Distributions: Introgression creates characteristic patterns in quartet frequencies that deviate from expectations under pure ILS models [15]. These anomalous distributions can be visualized using tools like DiscoVista [15].

  • Branch Length Patterns: Short internal branches in the ASTRAL tree may indicate rapid diversification events, which can be further tested using ASTRAL's polytomy test (-t 10 option) [15]. Studies on Amaranthaceae found that three consecutive short internal branches produced anomalous trees contributing to discordance [14].

  • Ghost Lineage Signatures: Unexplained phylogenetic signals that cannot be attributed to known taxa may indicate ghost lineages. In canids, phylogenetic analysis of the ASIP gene revealed variants that predated the divergence of modern wolves and dogs, indicating contribution from an extinct wolf-like species [16].

  • Differential Inheritance Patterns: Incongruence between cytoplasmic (chloroplast/mitochondrial) and nuclear genomes often signals past introgression events. In Fagaceae, cpDNA and mtDNA divided species into New World and Old World clades, sharply contrasting with nuclear genome phylogenies [13].

Troubleshooting and Technical Considerations

Common Pitfalls and Solutions

  • Gene Tree Estimation Error: Accounts for approximately 21% of gene tree variation in Fagaceae [13]. Mitigation strategies include using high-quality alignments, appropriate evolutionary models, and removing fragmentary sequences [15].

  • Hidden Paralogy: Can mimic introgression signals. Solution: rigorous orthology inference and filtering of suspect gene trees [14].

  • Incomplete Taxa Sampling: May create false ghost lineages. Solution: comprehensive sampling of extant taxa and consideration of known fossil taxa when available [10].

  • Model Misspecification: Simple models may not capture complex evolutionary processes. Solution: use model testing and consider mixture models that account for multiple processes [14].

Data Quality Recommendations

  • Gene Filtering: Avoid excluding genes solely due to missing data, as this can be detrimental to accuracy [15]. Instead, focus on removing fragmentary data with uncharacteristically large numbers of gaps [15].

  • Tree Quality: RAxML gene trees are generally preferable to FastTree trees for input to ASTRAL [15]. Use TreeShrink to remove outlier long branches from gene trees [15].

  • Multi-Individual Datasets: For population-level studies, use ASTRAL's multi-allele version, which is specifically designed for datasets with multiple individuals per species [15].

How Introgression Challenges Coalescent-Based Species Tree Inference

Introgression, the transfer of genetic material between species through hybridization followed by backcrossing, presents a fundamental challenge to accurate species tree inference under the multispecies coalescent (MSC) model. The MSC model primarily accounts for gene tree discordance caused by incomplete lineage sorting (ILS), but historically assumed no gene flow between species. When introgression occurs, it introduces phylogenetic discordance that violates these assumptions, potentially leading to incorrect species tree topologies and biased divergence time estimates [17]. This problem is particularly relevant for methods like ASTRAL, which is designed to be statistically consistent under the MSC model but may yield inconsistent results when gene flow is present [17] [18].

The challenge intensifies with ghost introgression – gene flow from unsampled, unknown, or extinct lineages. Since evolutionary studies typically sample only a fraction of existing or extinct species, ghost introgression is likely widespread yet rarely accounted for in phylogenetic analyses [17]. Understanding and detecting these processes is crucial for researchers using ASTRAL for species tree estimation, as ignoring introgression can compromise the accuracy of evolutionary inferences and subsequent biological interpretations.

Quantitative Impacts of Introgression on Species Tree Inference

Effects on Topological Accuracy and Divergence Time Estimation

Table 1: Impacts of Introgression on Coalescent-Based Species Tree Inference

Parameter Effect of Ingroup Introgression Effect of Outgroup Ghost Introgression Conditions for Maximum Impact
Species Tree Topology Increased error rate, especially between sister species [17] Can cause anomalous relationships in basal branches [17] Strong ILS combined with moderate to high introgression [17]
Root Divergence Time Systematic underestimation [17] Systematic overestimation [17] Higher introgression probabilities and older divergence times [17]
Method Performance ASTRAL more robust to gene flow between non-sister species [17] *BEAST may perform better under certain ghost introgression scenarios [17] Varies with introgression direction, timing, and probability [17]
Gene Tree Discordance Contributes to asymmetry in gene tree frequencies [19] Creates discordance patterns mimicking deep coalescence [17] Recent introgression with ongoing gene flow [20]
Comparative Performance of Species Tree Methods

Table 2: Method Performance Under Different Introgression Scenarios

Method Approach Strengths Limitations with Introgression
ASTRAL Summary species tree method using gene trees [19] More robust to gene flow between non-sister species; computationally efficient [17] Assumes no gene flow; inconsistent under certain introgression scenarios [17]
*BEAST Full-likelihood method co-estimating gene trees and species tree [17] Better performance under certain ghost introgression conditions; provides divergence times [17] Computationally intensive; assumes no gene flow [17]
PhyloNet Phylogenetic network inference [19] Explicitly models both ILS and introgression; detects specific introgression events [19] Complex model selection; increased parameter space [19]
D-statistics (ABBA-BABA) SNP-based introgression test [19] Simple test for gene flow; works with single genomes per species [19] Assumes constant substitution rates; sensitive to homoplasy [19]

Theoretical Framework: Modeling Introgression Effects

The multispecies coalescent with introgression extends the standard MSC model by allowing genetic material to flow between populations or species. Under a simple three-species scenario with topology ((A,B),C), standard Brownian motion models for quantitative trait evolution assume trait covariance arises solely from shared ancestry on the species tree [18]. However, with introgression, the expected covariance structure changes substantially.

For species A and B, the covariance under introgression becomes:

Cov(AB|introgression) = σ²[(1 - e^{-(t₂-t₁)}) × (e^{t₂}(t₂-t₁)/(e^{t₂}-e^{t₁})) + (1/3e^{-(t₂-t₁)})]

Where σ² is the evolutionary rate, t₁ is the A-B divergence time, and t₂ is the root divergence time [18]. This complex covariance structure illustrates how introgression introduces additional similarity between species that cannot be explained by the species tree alone, potentially biasing phylogenetic inferences.

G ILS Incomplete Lineage Sorting Discordance Gene Tree Discordance ILS->Discordance Creates Introgression Introgression/Gene Flow Introgression->Discordance Creates AnomalousGT Anomalous Gene Trees Introgression->AnomalousGT Generates SpeciesTree Species Tree Error Discordance->SpeciesTree Causes DivTimeBias Divergence Time Bias Discordance->DivTimeBias Causes AnomalousGT->SpeciesTree Contributes to

Gene Discordance Causes

Experimental Protocols for Detecting and Accounting for Introgression

Tree-Based Introgression Detection Workflow

Protocol 1: Genome-Wide Introgression Detection Using Gene Trees

This protocol outlines a robust approach for detecting introgression using phylogenetic trees inferred from genome-wide data, complementing SNP-based methods like the ABBA-BABA test [19].

  • Dataset Preparation

    • Obtain whole-genome alignment data for all study species, preferably with an outgroup
    • For non-model organisms, use orthologous gene sequences or Ultraconserved Elements (UCEs) as an alternative
    • Ensure appropriate taxonomic sampling to detect both recent and ancient introgression
  • Extraction of Alignment Blocks

    • Extract suitable alignment blocks from whole-genome alignment using custom Python scripts
    • Filter alignment blocks by:
      • Minimum length (typically 1,000 bp as a compromise between information content and recombination probability)
      • Proportion of missing data (set threshold appropriate for dataset)
      • Frequency of recombination breakpoints (remove alignments with strong recombination signals)
    • Require each alignment block to contain sequences for all included species
  • Gene Tree Inference

    • For each filtered alignment block, infer maximum likelihood gene trees using IQ-TREE with command: iqtree2 -s alignment.phy -m TEST -bb 1000 -alrt 1000
    • Include appropriate model selection using ModelFinder implemented in IQ-TREE
    • Assess branch support with ultrafast bootstrap (1,000 replicates) and SH-aLRT test
  • Species Tree Estimation

    • Estimate species tree from gene trees using ASTRAL with command: java -jar ~/software/Astral/astral.5.7.8.jar -i input_gene_trees.tre -o species_tree.tre
    • Calculate local posterior probabilities for branch support
  • Introgression Detection

    • Map gene tree discordance by comparing each gene tree to the species tree
    • For each species trio, assess asymmetry in alternative phylogenetic topologies
    • Use PhyloNet for explicit network modeling with command: java -jar ~/software/PhyloNet/PhyloNet.jar input_network_instructions.txt
    • Compare support for models with and without introgression

G Start 1. Dataset Preparation (Whole-genome alignment or orthologous genes) Extract 2. Extract Alignment Blocks (Filter by length, missing data, and recombination signals) Start->Extract GeneTrees 3. Gene Tree Inference (IQ-TREE with model selection and branch support) Extract->GeneTrees SpeciesTree 4. Species Tree Estimation (ASTRAL from gene trees) GeneTrees->SpeciesTree Detect 5. Introgression Detection (Gene tree discordance mapping and phylogenetic network modeling) SpeciesTree->Detect Detect->GeneTrees Feedback for re-analysis Output Species Tree with Introgression Assessment Detect->Output

Introgression Detection Workflow

SINE-Based Phylogenomics for Introgression Detection

Protocol 2: Retrotransposon-Based Introgression Analysis

This protocol uses presence/absence patterns of SINE (Short INterspersed Element) insertions as nearly ideal phylogenetic markers with very low homoplasy, providing an independent line of evidence for introgression [20].

  • Marker Selection and Validation

    • Select SINE families known to be active during the radiation of the clade of interest
    • Verify that absence patterns represent ancestral state (not precise excisions)
    • Confirm very low rates of parallel insertions through mechanistic understanding of retrotransposition
  • Data Collection

    • For each species, identify presence or absence of orthologous SINE insertions
    • Use low-coverage whole genome sequencing or targeted approaches
    • Map reads to reference genome if available, or perform de novo identification
  • Phylogenetic Inference

    • Code SINE presence as 1 and absence as 0 to create binary character matrix
    • Reconstruct phylogeny using parsimony or maximum likelihood suitable for binary characters
    • Assess support through bootstrapping or jackknifing
  • Discordance Analysis

    • Compare SINE-based tree to sequence-based species tree
    • Identify loci discordant with overall species tree
    • Calculate proportion of discordant insertions (approximately one-third in Myotis study [20])
    • Use quartet asymmetry tests to distinguish ILS from introgression
  • Introgression Timing

    • Estimate timing of introgression events based on distribution of discordant markers
    • Distinguish contemporary versus historical gene flow
    • Identify potential introgression pathways among species

Table 3: Research Reagent Solutions for Introgression Studies

Resource Function Application Notes
IQ-TREE Maximum likelihood phylogenetic inference Essential for gene tree estimation; includes model selection and branch support tests [19]
ASTRAL Species tree estimation from gene trees Primary tool for coalescent-based species tree inference; robust to some ILS [17] [19]
PhyloNet Phylogenetic network inference Models both ILS and introgression; detects specific introgression events [19]
PAUP* General utility phylogenetic program Useful for tree manipulation, filtering, and additional analyses [19]
FigTree Tree visualization Intuitive visualization of phylogenies with customization options [19]
Ves SINE Markers Retrotransposon markers for phylogenomics Nearly ideal phylogenetic characters with very low homoplasy; useful for distinguishing ILS and introgression [20]
Whole-Genome Alignment Cross-species genome alignment Foundation for genome-wide phylogenetic analyses; enables extraction of alignment blocks [19]
Progressive Cactus Reference-free whole genome alignment Produces multiple genome alignments in HAL format, convertible to MAF format [19]

Practical Application: Case Studies in Detecting Introgression

Drosophila Phylogeny

A comprehensive analysis of 155 Drosophila genomes revealed widespread introgression across the evolutionary history of the genus. Researchers used fossil-calibrated phylogenies and multilocus tests to identify both ancient and recent gene flow events across most of the 9 clades examined. The study demonstrated that gene-tree discordance mapping provides conservative detection of gene flow through discordant gene tree counts and branch lengths [21]. This large-scale phylogenetic analysis established that introgression is a common phenomenon throughout Drosophila evolution, not limited to a few isolated cases.

Wild Tomato Genus (Solanum)

Research on whole-transcriptome gene expression data from ovules in Solanum species revealed how introgression affects thousands of quantitative traits simultaneously. The Brownian motion model under the multispecies network coalescent framework demonstrated that introgression generates apparently convergent evolution patterns when averaged across thousands of traits [18]. This study provided a framework for testing introgression effects using model-informed predictions and showed correlation between local gene tree topology and expression similarity, implicating introgressed cis-regulatory variation in generating broad-scale patterns.

Bacterial Lineages

Despite being asexual organisms, bacterial lineages show significant introgression in core genomes. A systematic analysis of 50 major bacterial lineages found an average of 2% of introgressed core genes, rising to 14% in Escherichia-Shigella [11]. The study revealed that introgression most frequently occurs between highly related species and can occasionally lead to fuzzy species borders, though most bacterial species remain clearly delineated in core genome phylogenies. This demonstrates that introgression impacts diversication patterns across the tree of life, not just in sexually reproducing organisms.

Theoretical Foundations of ASTRAL

ASTRAL (Accurate Species TRee ALgorithm) is a leading method for species tree estimation from genome-scale data, specifically designed to be statistically consistent under the Multi-Species Coalescent (MSC) model [22] [23]. Statistical consistency in this context means that as the amount of data (i.e., the number of genes) increases, the probability of recovering the true species tree approaches one. This property holds even in the presence of Incomplete Lineage Sorting (ILS), a major cause of gene tree discordance wherein the genealogical history of individual genes differs from the overall species history due to the retention of ancestral polymorphisms [5].

The theoretical guarantees of ASTRAL have been extended beyond the standard MSC model. Recent research has proven that ASTRAL remains statistically consistent under the more complex Duplication-Loss-Coalescence (DLCoal) model [22] [23]. This model integrates gene duplication and loss processes with coalescence, providing a more comprehensive framework for modeling genome evolution. This finding is significant because it affirms the empirical success of ASTRAL in simulation studies and supports its use in analyzing real genomic datasets where gene duplication and loss are prevalent [22].

The core optimization problem ASTRAL solves is to find the species tree that maximizes the number of induced quartet trees that are also present in the set of input gene trees [5]. This quartet-based approach provides the mathematical foundation for its statistical consistency and robustness.

The Quartet-Based Approach: Principles and Properties

Quartets as Fundamental Building Blocks

ASTRAL's methodology is built upon the analysis of quartets—unrooted phylogenetic trees for every combination of four taxa (leaves). For any four species (A, B, C, D), there are three possible resolved quartet topologies: A,B|C,D; A,C|B,D; and A,D|B,C [24]. The central principle behind quartet-based methods like ASTRAL is that the most frequent quartet topology across many genes is likely to represent the true species relationship for that quartet, a principle that holds true under the MSC model and its extensions [22] [24].

A key theoretical result underpinning this approach is that for any four species, the most probable unrooted gene tree under the MSC model is topologically identical to the unrooted species tree for those same four taxa [24]. ASTRAL leverages this principle by seeking the species tree that collectively agrees with the largest number of quartets derived from the input gene trees, a problem formalized as maximizing the Weighted Quartet (WQ) score [5].

Handling Complex Evolutionary Scenarios

The quartet approach demonstrates particular utility in complex evolutionary scenarios. Research in tumor phylogenetics has shown that quartet-based methods can provide consistent estimators of cell lineage trees even under challenging conditions such as unbiased error and missing data models [24]. Furthermore, quartet methods have been successfully applied to resolve phylogenetic discordance in plant families like Fagaceae (oaks), where processes like hybridization and ILS create conflicting signals among gene trees [13].

Table 1: Key Properties of the Quartet-Based Approach in ASTRAL.

Property Description Biological/Computational Implication
Statistical Consistency Proven under MSC and DLCoal models [22] [23]. Provides theoretical guarantee of accuracy with sufficient data.
Robustness to ILS Explicitly models gene tree discordance due to deep coalescence [5]. Effective for analyzing rapid radiations and closely related species.
No Anomalous Quartets The most probable quartet matches the species tree [24]. Ensures the fundamental building blocks are reliable.
Polynomial Time Solution Solves a constrained version of the NP-hard quartet optimization problem [5]. Enables application to large-scale phylogenomic datasets.

ASTRAL Protocol: From Sequence Data to Species Tree

Input Data Preparation and Gene Tree Estimation

The first stage involves generating a set of unrooted gene trees, which form the primary input for ASTRAL.

  • Sequence Alignment: Extract alignment blocks from a whole-genome alignment or create alignments of orthologous genes. For a whole-genome alignment in MAF format, use a custom Python script to extract blocks of a specific length (e.g., 1,000 bp), filtering for completeness and a low number of recombination breakpoints to ensure high phylogenetic signal [19].
  • Gene Tree Inference: For each alignment block, infer an unrooted gene tree using a maximum likelihood method such as IQ-TREE [19]. The command for a single alignment might be: iqtree2 -s [alignment_block.phy] -m [MODEL] -nt [CORES] -pre [output_prefix] This step generates a set of gene tree files, typically in Newick format.

Running ASTRAL

With the collection of gene trees prepared, execute ASTRAL to infer the species tree.

  • Input Consolidation: Combine all inferred gene trees into a single file or provide them individually to ASTRAL.
  • Execution Command: Run ASTRAL from the command line. The basic command using a pre-installed version is: java -jar ~/software/Astral/astral.5.7.8.jar -i [input_gene_trees.tre] -o [output_species_tree.tre] This command will compute the species tree that maximizes the quartet support from the input gene trees [19].

Advanced Configuration: Multi-Allele Datasets

ASTRAL can be extended to analyze multi-individual datasets, where multiple alleles or individuals are sampled per species. This is achieved by using a version of ASTRAL that handles multi-labeled gene trees. The algorithm naturally extends the quartet optimization problem, and heuristic methods are used to build an effective search space, for instance, by subsampling individuals [5]. This allows the estimation of a species tree where the monophyly of pre-defined species is enforced.

Output and Post-Analysis

The primary output is the estimated species tree in Newick format. This tree can be visualized using software like FigTree [19]. To assess support, ASTRAL provides local posterior probabilities for each branch, which represent the proportion of quartets relevant to that branch that are decisive about its existence and agree with it [5].

Workflow Visualization

The following diagram illustrates the complete experimental protocol for ASTRAL analysis, from raw data to species tree estimation.

G Start Start: Genomic Data Sub0 Whole-Genome Alignment or Orthologous Gene Sets Start->Sub0 Sub1 Extract Alignment Blocks (e.g., 1,000 bp blocks) Sub0->Sub1 Sub2 Infer Gene Trees (Using IQ-TREE) Sub1->Sub2 Per Block Sub3 Run ASTRAL Sub2->Sub3 Collection of Gene Trees Sub4 ASTRAL Species Tree with Branch Supports Sub3->Sub4

The Scientist's Toolkit: Essential Research Reagents and Software

Table 2: Key Software Tools and Resources for ASTRAL-based Phylogenomics.

Tool/Resource Function Usage in Protocol
Progressive Cactus Reference-free whole-genome alignment [19]. Generating the initial genome-wide multiple sequence alignment.
IQ-TREE Maximum likelihood phylogenetic inference [19]. Estimating individual gene trees from sequence alignment blocks.
ASTRAL Species tree estimation from gene trees [19] [5]. Summarizing gene trees into a species tree under the coalescent model.
PhyloNet Inference of species networks [19]. Modeling hybridization and introgression events beyond the tree model.
FigTree Phylogenetic tree visualization [19]. Visualizing and exploring the final ASTRAL species tree.
PAUP* General-purpose phylogenetic analysis (optional) [19]. Additional phylogenetic inference and tree manipulation.

In phylogenomics, the widespread phenomenon of gene tree discordance, caused by mechanisms such as incomplete lineage sorting (ILS) and introgression, presents a significant challenge to accurate species tree estimation. Summary methods, which infer species trees from a collection of input gene trees, have become a cornerstone of modern phylogenomic analysis. It is crucial to understand the theoretical boundaries within which these methods operate reliably. The Anomaly Zone is a theoretical concept defining the conditions under which the most probable gene tree topology differs from the species tree topology, making consistent inference difficult [25]. This application note examines the theoretical guarantees of summary methods, with a specific focus on the ASTRAL algorithm, within the context of species tree estimation in the presence of introgression. We detail the mathematical foundations of the anomaly zone, ASTRAL's proven statistical consistency under the multi-species coalescent (MSC) model, and its performance beyond the MSC where introgression is a factor. Furthermore, we provide practical protocols for applying ASTRAL and related tools in real-world research scenarios common to scientists and drug development professionals investigating evolutionary relationships.

Theoretical Foundations

The Anomaly Zone and Its Implications

The Anomaly Zone is a region in tree space characterized by specific short internal branches and large population sizes in the species tree. Under these conditions, an anomalous gene tree—a gene tree topology that is more probable than the true species tree topology—can occur [25]. This presents a fundamental challenge for statistical consistency, as a method is statistically consistent if it converges to the true species tree given an infinite amount of data. The existence of the anomaly zone means that simple plurality approaches, which select the gene tree topology that appears most frequently, can be misleading. For a four-taxon tree, the anomaly zone exists when the length of the internal branch is sufficiently short, satisfying the condition ( T < \log(1.5) * 2Ne ) (where ( T ) is branch length in coalescent units and ( Ne ) is the effective population size). Summary methods must be designed to be robust to these conditions to provide reliable inferences.

Theoretical Guarantees of ASTRAL

ASTRAL (Accurate Species TRee ALgorithm) is a leading summary method specifically designed to address gene tree discordance. Its theoretical guarantees are a key reason for its widespread adoption.

  • Statistical Consistency under MSC: ASTRAL is statistically consistent under the Multi-Species Coalescent model, which primarily models ILS [25]. This means that as the number of genes ( k ) approaches infinity, the probability of ASTRAL inferring the correct species tree approaches 1, even within the anomaly zone. This property is achieved because ASTRAL does not merely count the most frequent gene tree; it searches for the species tree that shares the maximum number of induced quartet topologies with the entire set of input gene trees.
  • Polynomial Time Complexity: ASTRAL-III guarantees polynomial running time as a function of the number of species ( n ) and the number of genes ( k ), a significant improvement over earlier versions. Its asymptotic running time in the presence of polytomies is ( O((n k)^{1.726} D) ), where ( D ) is the sum of the degrees of all unique nodes in the input trees [25]. This makes ASTRAL scalable to large datasets with thousands of species.
  • Statistical Consistency with Introgression: While the standard MSC does not model introgression, the multispecies coalescent with introgression (MSci) model extends it to include hybridisation events. Full-likelihood implementations of MSci models can estimate the history of species divergence and gene flow, but they face unidentifiability issues [26]. In a Bidirectional Introgression (BDI) model, for example, different parameter sets can generate identical gene tree probability distributions, creating "mirror" points in the parameter space that are indistinguishable using heuristic methods based on gene tree topologies alone [26]. ASTRAL's quartet-based approach can serve as a robust complement to these methods.

Table 1: Key Theoretical Properties of ASTRAL and Related Models

Concept / Method Theoretical Guarantee/Property Implications for Inference
Anomaly Zone Region where the most likely gene tree ≠ species tree. Challenges simple plurality methods; necessitates robust methods like ASTRAL.
ASTRAL under MSC Statistically consistent [25]. Guarantees accuracy with sufficient data, even in the anomaly zone.
ASTRAL-III Runtime Polynomial time: ( O((n k)^{1.726} D) ) [25]. Enables application to very large datasets (e.g., 10,000 species).
MSci Model (Introgression) Can suffer from parameter unidentifiability [26]. Different gene flow histories may be equally likely, complicating inference.

Experimental Protocols for Tree-Based Introgression Detection

This protocol outlines a tree-based approach to detect past introgression events, which can serve to verify or reject patterns identified by SNP-based methods like the ABBA-BABA test (D-statistic) [19].

The following diagram illustrates the complete workflow for tree-based introgression detection, from whole-genome alignment to the inference of introgression networks.

G cluster_1 Data Preparation & Gene Tree Estimation cluster_2 Species Tree & Introgression Inference Start Start A Whole-Genome Alignment (MAF Format) Start->A End End B Extract & Filter Alignment Blocks A->B C Infer Maximum Likelihood Gene Trees (IQ-TREE) B->C D Infer Species Tree (ASTRAL) C->D E Assess Topology Asymmetry D->E F Infer Introgression Network (PhyloNet) E->F F->End

Detailed Methodology

Data Preparation and Gene Tree Estimation
  • Input Data Acquisition and Inspection: Begin with a whole-genome alignment file, often in the Multiple Alignment Format (MAF). Inspect the file structure using command-line tools (e.g., less -S). A typical MAF file includes a header with a guide tree in Newick format, followed by alignment blocks each starting with an "a" and containing sequences ("s") for different species [19].
  • Extract and Filter Alignment Blocks: Use a custom script to extract alignment blocks of a fixed length (e.g., 1,000 bp) from the whole-genome alignment. Filter these blocks based on:
    • Completeness: Prefer blocks with sequences from all species.
    • Informativeness: Select blocks with a sufficient number of polymorphic sites.
    • Recombination: Quantify and filter out blocks with strong signals of within-alignment recombination, as this can confound phylogenetic analysis [19].
  • Gene Tree Inference: For each filtered alignment block, infer an unrooted gene tree using maximum likelihood. IQ-TREE is a modern, rapid tool suitable for this task. Execute the analysis for each alignment, resulting in a collection of gene trees [19].
Species Tree and Introgression Inference
  • Species Tree Estimation with ASTRAL: Use the collection of gene trees as input for ASTRAL. ASTRAL will estimate the species tree that has the maximum number of shared induced quartet trees with the set of input gene trees [15] [25]. This step provides a robust species tree estimate that accounts for discordance due to ILS.
  • Assessment of Topology Asymmetry: Analyze the distribution of gene tree topologies across the genome. An asymmetry in the frequencies of the two minority topologies for a given species trio can indicate past introgression, analogous to the signal used in D-statistics but potentially more robust to conditions like homoplasy in divergent species [19].
  • Introgression Network Inference with PhyloNet: Use the set of gene trees as input for PhyloNet, a tool designed to infer species trees and networks in a maximum-likelihood framework. PhyloNet can assess support for alternative models of diversification with and without introgression, providing a statistical framework to test for specific introgression events [19].

The Scientist's Toolkit: Essential Research Reagents & Software

The following table details key software tools and their functions essential for conducting research in species tree estimation and introgression detection.

Table 2: Key Software Tools for Phylogenomic Analysis

Tool Name Primary Function Key Features & Use Case
ASTRAL [15] [25] Species tree estimation from gene trees. Statistically consistent under MSC; uses quartet-based approach; fast polynomial-time algorithm (ASTRAL-III).
IQ-TREE [19] Maximum likelihood phylogenetic inference. Rapid inference of gene trees from sequence alignments; implements a wide range of evolutionary models.
PhyloNet [19] Inference of species trees and networks. Models reticulate evolutionary processes (hybridization/introgression) using maximum-likelihood or parsimony.
PAUP* [19] General-purpose phylogenetic inference. A versatile tool for phylogenetic analysis, often used in conjunction with other methods for specific tests or visualizations.
BPP [26] Bayesian phylogenomic analysis. Full-likelihood implementation of the multispecies coalescent with introgression (MSci) model; estimates divergence times, population sizes, and introgression probabilities.

Visualization of MSci Model Unidentifiability

A core challenge in inferring introgression under the MSci model is parameter unidentifiability. The following diagram illustrates the "mirror" effect in a Bidirectional Introgression (BDI) model, where two distinct parameter sets are indistinguishable based on gene tree topologies and coalescent times alone [26].

G cluster_original Original Parameter Set Θ cluster_mirror Mirror Parameter Set Θ' Title Unidentifiable Parameter Sets in BDI Model O1 φX = 0.7 Equivalence f(G | Θ) = f(G | Θ') O1->Equivalence O2 φY = 0.3 O2->Equivalence O3 θX = 0.005 O3->Equivalence O4 θY = 0.002 O4->Equivalence M1 φX' = 0.3 ( = 1 - φX ) M1->Equivalence M2 φY' = 0.7 ( = 1 - φY ) M2->Equivalence M3 θX' = 0.002 ( = θY ) M3->Equivalence M4 θY' = 0.005 ( = θX ) M4->Equivalence

Understanding the theoretical limits of summary methods, particularly the anomaly zone and model unidentifiability, is paramount for robust phylogenomic inference. ASTRAL provides strong theoretical guarantees, including statistical consistency under the MSC and polynomial-time scalability, making it a powerful tool for species tree estimation in the presence of ILS. However, in the context of introgression, MSci models can present unidentifiability challenges. A combined methodological approach—using tree-based methods like those implemented in ASTRAL and PhyloNet to complement SNP-based tests—provides a more robust framework for detecting and characterizing ancient gene flow. The protocols and tools outlined here offer researchers a practical pathway to navigate these complexities, ultimately leading to more accurate reconstructions of evolutionary history.

A Practical Protocol: Running ASTRAL Analyses in the Presence of Gene Flow

Accurate species tree estimation is a cornerstone of modern evolutionary genomics, with profound implications for understanding biodiversity, tracing the origins of genetic traits, and informing drug discovery from natural products. The multi-species coalescent model, implemented in tools like ASTRAL (Accurate Species TRee ALgorithm), provides a powerful statistical framework for inferring species trees from multiple gene trees while accounting for incomplete lineage sorting (ILS) [15]. However, the accuracy of the final species tree is critically dependent on the quality of the input gene trees, particularly their rooting, as the root position provides essential information about evolutionary directionality and ancestral relationships.

This application note provides detailed protocols for generating and rooting gene trees within the context of ASTRAL species tree estimation, with special consideration for research involving introgression. We focus on practical methodologies for data preparation, highlight key rooting techniques relevant to bacterial genomics where introgression has been systematically quantified [11], and provide standardized workflows to ensure reproducible results for research scientists and drug development professionals.

Generating Gene Trees: Protocols and Workflows

Sequence Data Preparation and Alignment

The initial phase involves preparing molecular sequence data for phylogenetic analysis.

  • Data Collection: Gather nucleotide or amino acid sequences for the gene family of interest across the target species. For ASTRAL analysis, this process is repeated for hundreds or thousands of loci.
  • Multiple Sequence Alignment: Use alignment tools such as MAFFT, MUSCLE, or Clustal-Omega to generate a multiple sequence alignment for each locus. Visually inspect alignments using tools like AliView to correct obvious misalignments.
  • Alignment Trimming and Filtering: Trim poorly aligned regions using tools like Gblocks or TrimAl. Exclude genes with extensive missing data or fragmentary sequences, as these can negatively impact gene tree accuracy [15]. Filtering should be balanced, as overly aggressive gene removal can also be detrimental to species tree estimation [15].

Gene Tree Inference Methods

Inferring individual gene trees from aligned sequences can be accomplished using several computational approaches.

Table 1: Common Methods for Gene Tree Inference

Method Principle Use Case Software Examples
Maximum Likelihood (ML) Finds the tree topology and branch lengths that maximize the probability of observing the sequence data under a specified evolutionary model. High-accuracy inference for most datasets; considered best practice. RAxML, IQ-TREE, PAUP* [27]
Bayesian Inference Estimates the posterior probability of tree topologies by incorporating prior knowledge and evolutionary models. Useful for quantifying uncertainty in tree topology and branch lengths. MrBayes, BEAST2
Distance Methods Constructs trees based on a matrix of pairwise evolutionary distances between sequences. Fast approximation for very large datasets or initial exploration. FastTree, Neighbor-Joining

For ASTRAL analysis, ML gene trees are generally preferred over consensus trees from bootstrapped analyses for subsequent species tree estimation [15]. The following protocol outlines a standard ML workflow using PAUP* for a single locus, which can be automated for hundreds of genes.

Protocol 2.1: Maximum Likelihood Gene Tree Estimation in PAUP*

  • Input Preparation: Format the aligned sequence data for a single gene in NEXUS format.
  • Model Selection: Determine the best-fitting nucleotide or amino acid substitution model for the alignment using model-testing programs like ModelTest-NG or jModelTest2.
  • Tree Search: Execute a heuristic tree search under the selected model. In PAUP*, this typically involves commands such as hsearch addseq=random nreps=100 to perform a search with multiple random-addition-sequence replicates.
  • Output: Save the best-found tree for the gene in Newick format (e.g., mammal.tre [27]). Repeat this process independently for all loci.
  • Post-processing (Optional): Consider using tools like TreeShrink to detect and remove outlier long branches from gene trees, which can improve accuracy [15].

Rooting Gene Trees: Principles and Methods

Most phylogenetic inference methods produce unrooted trees. Rooting is a critical separate step that imposes directionality on evolution. Choosing an appropriate rooting method is particularly important in the context of introgression, as different methods exhibit varying robustness to this and other complex evolutionary events [11] [28].

Table 2: Comparison of Common Gene Tree Rooting Methods

Method Category Principle Requirements Considerations for Introgression
Outgroup Rooting [29] Topology / Taxonomy Roots the tree on the branch connecting a known outgroup taxon to the rest of the tree. A trusted, evolutionarily distant outgroup. Highly sensitive to HGT and introgression involving the outgroup.
Midpoint Rooting [28] Branch Length Places the root at the midpoint of the longest path between two taxa in the tree. An unrooted gene tree with branch lengths. Assumes a molecular clock. Sensitive to rate variation; introgression can distort branch lengths.
Minimal Ancestor Deviation (MAD) [28] Branch Length Roots the tree to minimize the relative deviation of root-to-tip distances from a molecular clock. An unrooted gene tree with branch lengths. Generally robust to gene tree error; performance affected by high rate variation [28].
DTL Rooting [28] Reconciliation Uses a known, rooted species tree to find the gene tree root that minimizes the cost of Duplication-Transfer-Loss events. A rooted species tree and an unrooted gene tree. Sensitive to high HGT/Introgression rates and gene tree error [28]. Accurate under low-moderate transfer rates.
Evolutionary Parsimony (EP) Rooting [30] Substitution Model Uses linear invariants derived from balanced transversion assumptions to root trees directly from sequences. Sequence alignment. No outgroup or species tree needed. Avoids issues with species tree inaccuracy or unavailability; useful when paralogs are absent [30].
Network-Assisted Rooting [29] Reconciliation Infers the root by reconciling the unrooted gene tree with a set of splits derived from a phylogenetic network. A phylogenetic network of the species. Specifically designed for scenarios involving reticulate evolution like introgression and hybridization [29].

Protocol 3.1: Rooting with the Minimal Ancestor Deviation (MAD) Method

MAD rooting is a branch-length-based method that has shown good accuracy and robustness on prokaryotic gene families, including in the presence of gene tree error [28].

  • Input: An unrooted gene tree in Newick format, with branch lengths proportional to evolutionary change (e.g., substitutions per site).
  • Software: Use a implementation of the MAD algorithm, such as the root function in the ape R package or other dedicated phylogenetic tools.
  • Execution: The algorithm calculates the root-to-tip distance variance for all possible root positions and selects the position that minimizes the relative deviation from the mean.
  • Output: A rooted version of the input gene tree.

Protocol 3.2: Rooting via Duplication-Transfer-Loss (DTL) Reconciliation

DTL rooting is a powerful method for prokaryotic gene families when a reliable rooted species tree is available [28].

  • Prerequisites:
    • A rooted species tree of the taxa in the gene family.
    • An unrooted gene tree for the family.
    • Event costs for duplication (D), transfer (T), and loss (L). Note: Using a elevated transfer cost can improve rooting accuracy in the presence of high HGT/introgression [28].
  • Software: Use tools that implement parsimonious DTL reconciliation, such as RANGER-DTL.
  • Execution: The software reconciles the unrooted gene tree with the rooted species tree for every possible root placement of the gene tree.
  • Output: The root position that minimizes the total reconciliation cost (e.g., 2D + 3T + 1L) is selected, and the gene tree is rooted accordingly.

The ASTRAL Workflow: From Gene Sequences to a Species Tree

The following workflow integrates the protocols above into a complete pipeline for ASTRAL species tree estimation.

pipeline Start Start: Multi-locus Sequence Data Align 1. Sequence Alignment & Trimming Start->Align Infer 2. Gene Tree Inference (Maximum Likelihood) Align->Infer Root 3. Root Gene Trees (e.g., MAD, DTL, EP) Infer->Root Unroot 4. Remove Roots (Create unrooted trees for ASTRAL) Root->Unroot ASTRAL 5. ASTRAL Analysis Species Tree Estimation Unroot->ASTRAL Final Rooted Species Tree ASTRAL->Final

Diagram 1: Complete workflow for ASTRAL species tree estimation, showing the critical role of rooting and unrooting gene trees.

A crucial and often-overlooked step is that ASTRAL requires unrooted gene trees as input [15] [27]. The rooting procedures described in Section 3 are essential for biological interpretation and for methods that require rooted gene trees (e.g., some phylogenetic reconciliation approaches). However, for the specific purpose of generating input for ASTRAL, the roots must be removed after the rooting analysis to avoid introducing bias. The final ASTRAL command is typically: astral -i genetrees.tre -o speciestree.tre [27].

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Software Tools for Gene Tree and Species Tree Analysis

Tool / Resource Function Application Note
MAFFT / MUSCLE Multiple sequence alignment Generates the input alignments for gene tree inference.
IQ-TREE / RAxML Maximum Likelihood tree inference Preferred for estimating accurate gene trees from alignments [15].
ASTRAL Species tree estimation Infers the species tree from a set of unrooted gene trees, accounting for ILS [15].
TreeShrink Long-branch filtering Statistically identifies and removes outlier long branches from gene trees to improve accuracy [15].
RANGER-DTL DTL reconciliation Used for rooting gene trees via reconciliation with a known species tree [28].
Phylogenetic Networks Modeling introgression Serves as input for novel network-assisted rooting methods in complex evolutionary scenarios [29].

ASTRAL (Accurate Species TRee ALgorithm) is a leading method for estimating an unrooted species tree from a set of unrooted gene trees while accounting for gene tree discordance caused by incomplete lineage sorting (ILS). It is statistically consistent under the multi-species coalescent model (MSC), making it invaluable for handling phylogenomic datasets where gene trees may conflict with the species tree. The method operates by finding the species tree that has the maximum number of shared induced quartet trees with the set of input gene trees, subject to the constraint that the set of bipartitions in the species tree comes from a predefined set. The current version, ASTRAL-III, provides polynomial time complexity and enhanced handling of polytomies, enabling analyses scaling up to 10,000 species.

The robustness of ASTRAL makes it particularly relevant for research investigating complex evolutionary scenarios like introgression. Studies have shown that summary methods like ASTRAL can be impacted by gene flow, especially ghost introgression from unsampled lineages. The relative performance of ASTRAL compared to full-likelihood methods varies under different introgression scenarios, with ASTRAL being more robust to gene flow between non-sister species.

Installation and Setup

System Requirements and Installation

ASTRAL is a Java-based application requiring Java 1.6 or later. No complex installation is required, making it compatible with Windows, Linux, and macOS environments.

Installation Steps:

  • Download ASTRAL: Obtain the latest version from the GitHub repository (github.com/smirarab/ASTRAL). The distribution is available as a pre-compiled JAR file within a ZIP archive [15].
  • Extract Files: Unzip the downloaded archive to a directory of your choice.
  • Verify Installation: Test the installation by running java -jar astral.5.6.3.jar (version number may vary) from your command line. This should display help information confirming successful setup [31].

For advanced users, the repository can be cloned and built using the provided make.sh script, though this is typically unnecessary for standard usage [15].

Core Methodology and Theoretical Basis

ASTRAL addresses the NP-hard problem of finding the species tree that shares the maximum number of induced quartet topologies with input gene trees. It solves a constrained version where the output species tree's bipartitions are restricted to a predefined set ( X ), ensuring polynomial time complexity. The optimization function maximizes the Weighted Quartet (WQ) score, defined as the number of quartet trees from the input set that the candidate species tree also induces [25].

The ASTRAL algorithm uses dynamic programming with the recursive relation: [ V(A) = \max_{A'\subset A, (A'|A-A'|L-A)\in Y} V(A') + V(A-A') + w(A'|A-A'|L-A) ] where ( w(T) ) scores each tripartition ( T=(A|B|C) ) against nodes in input gene trees, and ( QI(T,M) ) computes twice the number of quartet trees shared between trees containing T and M [25].

Table 1: Key Mathematical Notation in ASTRAL

Symbol Description
( L ) Set of n species
( G ) Set of k input gene trees
( Q(t) ) Set of quartet trees induced by tree t
( X ) Constraint bipartition set
( Y ) Set of all tripartitions buildable from X
( D ) Sum of cardinalities of unique partitions in gene trees
( w(T) ) Score of tripartition T against gene tree nodes

Preparing Input Data

Gene Tree Estimation

ASTRAL requires previously estimated gene trees as input. These should be in Newick format and can be generated using various phylogenetic inference tools:

Using RAxML for Gene Tree Estimation: PAUP* provides a wrapper for RAxML to streamline gene tree estimation. A Python script can generate the necessary commands:

Execute the resulting Nexus file in PAUP*: paup> execute run_astral.nex; This produces a tree file containing one tree per locus [31].

Best Practices for Gene Tree Estimation:

  • Use appropriate substitution models selected through AIC/BIC criteria
  • Consider contracting branches with very low support (e.g., below 10%) to reduce noise
  • For large datasets, RAxML is preferred over PAUP* for computational efficiency
  • Remove fragmentary sequences before inference to improve accuracy [15]

Input File Format

ASTRAL accepts a simple text file containing all gene trees in Newick format, with one tree per line:

The trees can be partially resolved (contain polytomies), and ASTRAL-III handles these efficiently [25].

Basic Command Line Execution

Minimal Command Structure

The fundamental command to run ASTRAL is:

Table 2: Essential ASTRAL Command Line Parameters

Parameter Description Default
-i <filename> Input file containing gene trees in Newick format Required
-o <filename> Output file for the species tree stdout
--help Display all available options N/A

Execution time scales polynomially with the number of species (n) and genes (k). For ASTRAL-III, the asymptotic running time in the presence of polytomies is (O((nk)^{1.726} D)) where D is the sum of degrees of all unique nodes in input trees [25].

Advanced Parameter Configuration

Search Space and Optimization Parameters

ASTRAL restricts the species tree search to a set of bipartitions defined by the constraint set X. Several parameters control how this set is constructed:

-x (Bipartition File): Provide a user-specified set of bipartitions in a text file to guide the search space. Each line should contain a bipartition in the format "A|B" where A and B are comma-separated taxon lists.

Heuristic Search Options:

  • ASTRAL automatically expands X using heuristics beyond those observed in gene trees
  • The -x parameter allows manual curation of the search space
  • The guaranteed polynomial running time is achieved by limiting X to grow at most linearly with n and k [25]

Multi-individual Datasets

ASTRAL can handle multiple individuals per species through multi-labeled gene trees. The quartet optimization problem extends naturally to this case, with heuristic approaches for building the search space:

The -M flag indicates multi-individual data, where gene trees are labeled with individual names rather than species names, and a mapping file specifies species assignments [5].

Table 3: Advanced ASTRAL Execution Parameters

Parameter Function Use Case
-x <bipartition_file> Specify constraint set Guided search space
-M Multi-individual mode Population-level datasets
-C Contract low support branches Noise reduction (recommended <10%)
-t <threshold> Polytomy test threshold Default: 10 [15]
-g <group_file> Gene tree grouping Partitioned analysis
-r <bootstraps> Number of bootstraps Support assessment

Branch Lengths and Support Values

ASTRAL can compute branch lengths in coalescent units and local posterior probability support values:

The -b option generates branch lengths, while local posterior probabilities are automatically calculated as a measure of branch support [15].

Output Interpretation and Analysis

Understanding ASTRAL Output

The primary output is the species tree in Newick format with branch lengths and support values. A typical output looks like:

Where 1:0.025 indicates a branch with local posterior probability 1 and length 0.025 coalescent units [31].

Key Output Components:

  • Tree Topology: The estimated species tree relationships
  • Branch Lengths: In coalescent units (with -b option)
  • Local Posterior Probabilities: Branch support values between 0-1
  • Quartet Scores: Overall quartet support for the tree

Visualizing Results

For tree visualization, use tools like FigTree:

  • Install FigTree from http://tree.bio.ed.ac.uk/software/figtree/
  • Open the output tree file
  • Display branch supports and lengths appropriately [31]

Experimental Design for Introgression Research

Workflow for Species Tree Estimation

The following diagram illustrates the complete workflow for ASTRAL analysis in phylogenomic studies:

G Start Start Phylogenomic Analysis SeqData Sequence Alignment (per locus) Start->SeqData GeneTrees Gene Tree Estimation (RAxML, PAUP*) SeqData->GeneTrees InputPrep Prepare ASTRAL Input File GeneTrees->InputPrep ASTRALRun Execute ASTRAL with Parameters InputPrep->ASTRALRun OutputTree Species Tree with Support Values ASTRALRun->OutputTree Visualization Tree Visualization (FigTree) OutputTree->Visualization

Addressing Introgression in Study Design

When investigating introgression:

  • ASTRAL is robust to gene flow between non-sister species but may be impacted by ghost introgression
  • Combine with methods specifically designed for detecting introgression (e.g., D-statistics)
  • Use multi-individual sampling when possible to account for polymorphisms
  • Consider that short internal branches (high ILS) increase susceptibility to introgression effects [17] [32]

Table 4: Research Reagent Solutions for ASTRAL Analysis

Tool/Resource Function Application in ASTRAL Workflow
ASTRAL Software Species tree estimation Core analysis method
RAxML Gene tree estimation Input preparation
PAUP* Gene tree estimation/analysis Alternative to RAxML
FigTree Tree visualization Result interpretation
SimPhy Simulation of species/gene trees Method validation
Java Runtime Execution environment Required for ASTRAL

Troubleshooting and Optimization

Common Issues and Solutions

Long Run Times:

  • Use the -C option to contract very low support branches (<10%) in gene trees
  • Limit the search space with the -x parameter for very large datasets
  • Ensure adequate memory allocation with Java -Xmx option

Accuracy Concerns:

  • Filter fragmentary data before gene tree estimation
  • Consider using TreeShrink to remove outlier long branches
  • Avoid overly aggressive filtering of genes with missing data [15]

Validation and Benchmarking

For method validation in introgression research:

  • Simulate datasets under various introgression scenarios
  • Compare ASTRAL results with other methods (*BEAST, NJst)
  • Assess robustness to ghost introgression from unsampled lineages
  • Evaluate trade-offs between sampling more genes versus more individuals [17] [5]

Experimental evidence suggests that sampling more genes is generally more effective than sampling more individuals for accuracy improvement, even under high ILS conditions with shallow trees [5].

Incorporating Information from Ghost Lineages in Analysis Design

In phylogenomic analyses, a "ghost lineage" refers to an unsampled, unknown, or extinct population or species that has contributed genetic material through introgression to the species under investigation [17]. The identification and characterization of such ghost introgression is critical for the accurate reconstruction of species relationships, particularly in studies utilizing coalescent-based methods like ASTRAL for species tree estimation [17]. Ignoring the potential for ghost introgression can lead to substantial biases in both topological inference and divergence time estimation, potentially misrepresenting evolutionary history [17]. This protocol outlines detailed methodologies for detecting and accounting for ghost introgression within the ASTRAL framework, providing application notes for researchers in evolutionary biology and phylogenomics.

Quantitative Impacts of Ghost Introgression

Documented Effects on Species Tree Inference

Table 1: Documented Impacts of Ghost Introgression on Species Tree Estimation

Analysis Aspect Impact of Ghost Introgression Magnitude/Severity Factors Citation
Species Tree Topology Increased anomaly zone conditions; potential for strongly supported but incorrect species trees. Strength of ILS; timing and direction of gene flow. [17]
Root Divergence Time (Outgroup Ghost Donor) Systematic overestimation of root divergence time. Proportion of introgressed loci; divergence depth of ghost lineage. [17]
Root Divergence Time (Ingroup Ghost Donor) Potential underestimation of root divergence time. Strength of ILS; phylogenetic placement of ghost. [17]
Method Performance (ASTRAL) Generally robust but accuracy can decrease with gene flow between non-sister species. Locus sampling; gene tree error. [17]
Method Performance (*BEAST) Can outperform summary methods under specific ghost introgression scenarios. Model specification; computational resources. [17]
Prevalence Across Biological Systems

Table 2: Empirical Prevalence of Introgression Across Lineages

Biological Group/Lineage Reported Level of Introgression Key Contextual Notes Citation
Bacteria (50 major lineages) Average: 2% of core genes (median); up to 14% in Escherichia–Shigella. Introgression defined as gene flow between core genomes of distinct species. [11]
Bacteria (Genera with high introgression) Escherichia–Shigella, Cronobacter, Streptococcus parasanguinis (up to 33.2%). High levels often linked to ongoing speciation or misclassification of ANI-species. [11]

Experimental Protocols for Detection and Analysis

Primary Workflow for Ghost Introgression Analysis

G Start Start: Input Data GT_Infer Infer Gene Trees Start->GT_Infer ST_Infer Infer Species Tree (e.g., ASTRAL) GT_Infer->ST_Infer Incong_Test Test for Phylogenetic Incongruence ST_Infer->Incong_Test D_Stat D-Statistic (ABBA-BABA) Analysis Incong_Test->D_Stat Significant Incongruence End Final Species Tree with Confidence Incong_Test->End No Significant Incongruence Ghost_Topo Infer Plausible Ghost Topology & Introgression Direction D_Stat->Ghost_Topo Significant D ST_Reestimate Re-estimate Species Tree Accounting for Ghost Ghost_Topo->ST_Reestimate ST_Reestimate->End

Protocol 1: Phylogenetic Incongruence and Introgression Detection

This protocol details the steps for initial detection of potential ghost introgression using phylogenetic incongruity and D-statistics, adapted from established phylogenomic practices [11] [33].

  • Step 1: Input Data Preparation. Assemble a genome-wide set of sequence alignments, typically from single-copy orthologous genes. For variant-based approaches, generate a multi-sample VCF file. Filter sites for quality, missing data (e.g., ≤10%), and minor allele frequency (e.g., ≥0.03). Prune for linkage disequilibrium (e.g., using PLINK with window size 100 kb, step 20 bp, R² threshold 0.4) [33].
  • Step 2: Gene Tree and Species Tree Inference. Reconstruct individual gene trees for all loci using a maximum likelihood or Bayesian method. Subsequently, infer a primary species tree using a summary method like ASTRAL-III, which is efficient for large datasets [34].
  • Step 3: Screening for Phylogenetic Incongruence. For each gene tree, compare its topology to the reference species tree. Identify genes with topologies that are statistically inconsistent with the species tree, particularly those where a sequence from one species clusters monophyletically with sequences from a non-sister species [11].
  • Step 4: D-Statistic (ABBA-BABA) Analysis. To statistically test for introgression, use the D-statistic for all taxon triads (((P1, P2), P3), Outgroup) compatible with the species tree. This test counts site patterns to detect a skew from the expectation under incomplete lineage sorting alone. Significant D-values indicate gene flow between P3 and either P1 or P2. This analysis can be performed using tools like Dsuite [33].
  • Step 5: Interpretation and Ghost Hypothesis. Consistent and significant phylogenetic incongruence and D-statistic signals that cannot be explained by gene flow between any sampled species provide evidence for a ghost lineage. The specific pattern of allele sharing can help infer the phylogenetic position of the potential ghost donor relative to the sampled species.
Protocol 2: Ancestry Informative Marker (AIM) Analysis for Ghost Introgression

This protocol uses fixed differences to pinpoint introgressed genomic regions and infer the source, which is powerful for characterizing ghost introgression [33].

  • Step 1: Identify Ancestry Informative Markers (AIMs). For a focal species or group, scan the genome to identify sites that are fixed for alternate alleles between two major clades. In the context of ghost introgression, one looks for genomic blocks within a species that are fixed for alleles otherwise characteristic of a different, unsampled clade.
  • Step 2: Genotype Focal Genomes. Assess the genotypes of the focal species at all identified AIMs. For regions suspected of ghost introgression, the focal species will be homozygous for the allele that is not its own, but is fixed in the putative ghost sister clade.
  • Step 3: Map Introgressed Tracts. By scanning for consecutive, fixed AIMs along chromosomes, the physical extent and boundaries of introgressed genomic tracts from the ghost lineage can be delineated.
  • Step 4: Estimate Introgression Proportion. Calculate the fraction of the core genome comprised of these introgressed tracts to quantify the overall level of ghost introgression, as performed in bacterial systems [11].

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 3: Key Software and Analytical Tools for Ghost Introgression Analysis

Tool/Reagent Name Primary Function Key Parameters/Applications Citation
ASTRAL-III Species tree inference from gene trees under the multispecies coalescent. Polynomial time complexity; handles polytomies; contracts low-support branches. [34]
Dsuite Calculates D-statistics for genome-wide tests of introgression. Handles large variant call format (VCF) files; calculates D and related statistics. [33]
PCAngsd Principal Component Analysis based on genotype likelihoods. Infers population structure and major axes of genetic variation from NGS data. [33]
NGSadmix/ADMIXTURE Estimates individual ancestry proportions from genomic data. Identifies optimal number of ancestral populations (K); visualizes admixture. [33]
PLINK Whole-genome association analysis toolset. Used for data management, quality control, and LD pruning of genotype data. [33]
AIMs Analysis Identifies genomic regions with fixed differences between clades. Pinpoints introgressed tracts and quantifies the proportion of introgression. [11] [33]

Advanced Workflow for Comprehensive Characterization

G Data Genomic Data (VCF/Alignments) PC Population Structure (PCA/Admixture) Data->PC GT Gene Tree Inference Data->GT GhostInf Ghost Inference & Direction Estimation PC->GhostInf ST Species Tree (ASTRAL) GT->ST DStat D-Statistic Tests ST->DStat DStat->GhostInf AIM AIM Analysis to Map Introgressed Tracts GhostInf->AIM Identifies Target Regions/Clades DivTime Divergence Time Estimation (e.g., *BEAST) GhostInf->DivTime AIM->DivTime Final Final Model of History Including Ghost Lineage DivTime->Final

ASTRAL (Accurate Species TRee ALgorithm) is a leading method for inferring species trees from a set of input gene trees while accounting for gene tree discordance caused by incomplete lineage sorting (ILS). It is statistically consistent under the multi-species coalescent model [15] [25]. The algorithm seeks the species tree that shares the maximum number of shared induced quartet trees with the set of gene trees, subject to constraints on the set of allowable bipartitions [15] [25]. The output of ASTRAL consists of a species tree topology with associated branch support values and branch lengths, which together provide a comprehensive picture of the inferred evolutionary relationships and the confidence in those relationships. Proper interpretation of this output, particularly the local posterior probabilities (LPP) and other support values, is crucial for drawing accurate biological conclusions, especially in complex evolutionary scenarios such as those involving introgression.

Components of the ASTRAL Output Tree

Branch Supports: LPP, QT, and EN

An ASTRAL tree provides three key support statistics for internal branches, each offering a different perspective on branch confidence [35]. The table below summarizes their characteristics.

Table 1: Key branch support statistics in ASTRAL output

Support Statistic Full Name Interpretation Recommendation
LPP Local Posterior Probability Probability the branch is true given gene trees, based on quartet frequencies and assuming ILS [15] [35]. Primary confidence measure; use LPP ≥ 0.5 as a confidence threshold [35].
QT Quartet Support Proportion of gene tree quartets that support the branch [35]. Useful supplementary measure of gene tree concordance.
EN Effective Number of Genes Number of gene trees containing quartets relevant to the branch [35]. Measure of information; be cautious with branches where EN ≤ 5 [35].

Branch Lengths: Coalescent Units and Substitution Rates

ASTRAL provides branch lengths in coalescent units by default, which represent the expected number of coalescent events per branch and are inversely related to population size and divergence time [15] [35]. For interpretation, longer branches in coalescent units indicate larger effective population sizes and/or longer periods of divergence.

Some analyses provide trees with branch lengths re-estimated using maximum likelihood based on concatenated alignments, representing the expected number of amino acid substitutions per site [35]. This is the conventional metric used in most phylogenetic trees. It is crucial to verify which branch length metric is being used when interpreting the tree, as they convey different biological information.

Protocols for Interpreting ASTRAL Results

Standard Workflow for Tree Interpretation

The following diagram outlines the logical workflow for interpreting an ASTRAL species tree, from initial assessment to final decision-making.

G Start Load ASTRAL Tree File A Identify Branch Supports: LPP, QT, EN Start->A B Check Branch Length Units: Coalescent vs. Substitutions/Site A->B C Filter Low Support Branches: LPP ≤ 0.5 or EN ≤ 5 B->C D Collapse Weak Branches C->D E Interpret Species Relationships D->E F Integrate with Other Evidence (e.g., Introgression Tests) E->F

Protocol 1: Assessing Branch Support and Collapsing Weak Branches

Purpose: To identify and filter out poorly supported branches in the species tree, resulting in a more robust topology for downstream analysis.

Materials:

  • ASTRAL species tree in Newick format (with branch supports)
  • Newick tree file manipulation tools (e.g., Newick Utilities [36])

Procedure:

  • Parse the tree file to extract the three support values (LPP, QT, EN) for each internal branch.
  • Apply filtering thresholds. It is recommended to collapse branches with either:
    • LPP ≤ 0.5 (low local posterior probability), or
    • EN ≤ 5 (insufficient gene tree information) [35].
  • Generate a collapsed tree. Use tree manipulation tools to produce a new tree file with the weak branches collapsed into polytomies. The nw_ed tool from Newick Utilities can be used for this purpose [36].
  • Use the collapsed tree for all downstream applications, such as divergence time estimation, ancestral state reconstruction, or comparative analyses.

Protocol 2: Handling Gene Tree Preparation for Introgression Analysis

Purpose: To prepare optimal input gene trees for ASTRAL when investigating phylogenetic patterns that may involve introgression.

Background: Gene tree discordance can arise from both incomplete lineage sorting (ILS) and introgression. ASTRAL is designed to account for ILS, but its output can also be informative for detecting potential introgression when interpreted carefully.

Materials:

  • Sequence alignments for multiple genes/loci
  • Gene tree inference software (e.g., IQ-TREE, RAxML)
  • Newick Utilities [36]

Procedure:

  • Infer individual gene trees using a maximum likelihood method. RAxML gene trees have been found to be preferable to FastTree trees [15].
  • Optionally, contract low-support branches in the gene trees. Simulations indicate that removing branches with very low support (e.g., below 10%) can improve ASTRAL's accuracy by reducing noise, while overly aggressive filtering is harmful [25]. This can be done using a command such as:

    [36]
  • Create a combined file containing all gene trees:

  • Run ASTRAL on the combined gene tree file to obtain the species tree.
  • Compare the ASTRAL tree with gene trees and tests for introgression (e.g., D-statistics) to identify discordant patterns that may be due to introgression rather than ILS. In the context of introgression research, specific patterns of quartet support might be indicative of historical gene flow.

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 2: Key software tools and resources for ASTRAL-based phylogenomics

Item Name Function/Application Usage in Protocol
ASTRAL Software [15] Infers the primary species tree from gene trees. Core species tree inference; provides LPP, QT, EN supports.
Newick Utilities [36] Suite of tools for processing and manipulating tree files. Collapsing low-support branches in gene and species trees.
IQ-TREE/RAxML Infers maximum likelihood gene trees from sequence alignments. Generating input gene trees for ASTRAL analysis.
TreeShrink [15] Statistically motivated detection of outlier long branches in gene trees. Pre-processing gene trees to remove inaccurate branches.
DiscoVista [15] Creates interpretable visualizations of gene tree discordance. Visualizing conflict between gene trees and the ASTRAL species tree.

Advanced Interpretation in the Context of Introgression

While ASTRAL models ILS, its output is highly relevant for introgression research. Introgression creates specific patterns of gene tree discordance that can be reflected in the ASTRAL output. Look for branches with low LPP but high EN; this combination suggests strong conflict among a sufficient number of genes, which could be a signature of introgression. Conversely, low LPP with low EN simply indicates a lack of information.

Integrate the ASTRAL tree with tests for introgression and population genomic statistics. For example, identified candidate genes for environmental adaptation and regions of past introgression in a study of Pterocarya species [37]. Such regions can be mapped onto the ASTRAL species tree to understand the phylogenetic context of adaptive introgression.

When working with large genomic datasets, consider automated pipelines like ROADIES, which uses methods related to ASTRAL (ASTRAL-Pro3) to infer species trees directly from genome assemblies without the need for prior gene annotation or orthology inference, simplifying the analysis of complex genomic data [38].

Evolutionary radiations present a significant challenge for phylogenetic reconstruction due to the interplay of rapid speciation, gene flow, and incomplete lineage sorting. This application note examines case studies in the plant genera Aquilegia (columbine) and Picris to demonstrate how ASTRAL species tree estimation, combined with introgression detection methods, can resolve complex evolutionary histories. These genera represent classic examples of adaptive and cryptic radiations where traditional phylogenetic approaches often yield conflicting results. We detail experimental protocols and analytical workflows for disentangling these complex relationships, with particular emphasis on the role of historical introgression in shaping diversification patterns.

The Genomic Revolution in Phylogenetics

Table 1: Comparison of Genomic Approaches for Phylogenetic Studies

Method Key Features Advantages Limitations Suitable for
Hyb-Seq (Targeted capture) Lineage-specific or universal probe sets (e.g., Asteraceae COS: 1,061 loci) [39] Applicable across taxonomic levels; handles degraded DNA Probe design required; potential paralogy issues Deep phylogenetic scales, herbarium specimens
Whole Genome Resequencing Variant calling across entire genomes; high density of markers [40] Maximum phylogenetic information; detects introgression Computational burden; requires high-quality DNA Population-level studies, cryptic diversity
ROADIES Pipeline (Reference-free) Random sampling of genomic segments; no annotation required [38] Eliminates reference bias; annotation-free; discordance-aware May include non-informative regions; new method Lineages without reference genomes

The emergence of sophisticated bioinformatic tools has been crucial for interpreting these genomic datasets. ASTRAL (Accurate Species Tree Algorithm) estimates species trees from gene trees while accounting for incomplete lineage sorting [19]. Its extension, ASTRAL-Pro3, can handle multi-copy gene trees without requiring prior orthology inference [38]. For detecting introgression, PhyloNet infers species networks rather than strictly bifurcating trees, enabling the identification of hybridization events [19]. The ROADIES pipeline represents the latest advancement, offering fully automated, reference-free species tree inference directly from genome assemblies by randomly sampling genomic segments and leveraging ASTRAL-Pro3 for species tree estimation [38].

Case Study 1: Adaptive Radiation in Aquilegia

Biological System and Phylogenetic Challenges

The columbine genus Aquilegia comprises approximately 70 species distributed across North America, Europe, and Asia, representing a classic example of adaptive radiation involving diverse pollinators and habitats [40]. The genus originated in Eastern Asia approximately 6.9 million years ago, with major diversification occurring around 4.8 million years ago via separate radiations in North America and Eurasia [41]. Phylogenetic studies have frequently revealed polytomies and extensive allele sharing among species, suggesting complex evolutionary histories beyond a simple bifurcating tree model [40].

Genomic Architecture and Chromosome-Specific Evolution

A remarkable discovery in Aquilegia genomics has been the unique evolutionary history of chromosome four, which exhibits approximately double the polymorphism (2.258×) compared to the rest of the genome [40]. This pattern reflects deeper coalescence times rather than simply a higher mutation rate, as divergence to an outgroup is only moderately elevated (1.201×) [40]. This finding highlights the importance of considering chromosome-specific evolutionary processes in phylogenetic analyses.

Table 2: Genomic Features and Diversity Patterns in Aquilegia

Genomic Feature Pattern/Observation Evolutionary Interpretation Citation
Chromosome 4 polymorphism 2.258× higher than genome average Differential evolutionary history; deeper coalescence [40]
Nucleotide diversity within species 0.2–0.35% Low for outcrossing species; suggests inbreeding [40]
Divergence within geographic regions 0.38–0.86% Extensive variant sharing due to gene flow [40]
Divergence between geographic regions 0.81–0.97% Limited gene flow between continents [40]
Lineage formation in cryptic radiation 39/43 introgression events post-lineage formation Introgression contributed to diversification [42]

Cryptic Radiation in Southwest China

Recent research has revealed extensive cryptic diversity within Chinese Aquilegia, where a monophyletic group primarily distributed in the mountains of Southwest China contains three to four paraphyletic lineages within each morphological species [42]. Whole-genome resequencing of 158 individuals from 23 populations demonstrated that 39 out of 43 detected introgression events occurred after lineage formation, indicating that introgression played a substantial role in diversification [42]. Strong positive correlations among genomic differentiation, divergence, and introgestion suggest that standing variation and non-sister lineage introgression contributed to rapid genetic divergence in this cryptic radiation [42].

G Whole-genome\nresequencing Whole-genome resequencing Variant calling Variant calling Whole-genome\nresequencing->Variant calling Gene tree inference Gene tree inference Variant calling->Gene tree inference ASTRAL analysis ASTRAL analysis Gene tree inference->ASTRAL analysis D-statistics D-statistics Gene tree inference->D-statistics Species tree Species tree ASTRAL analysis->Species tree Divergence time estimation Divergence time estimation Species tree->Divergence time estimation Introgression detection Introgression detection D-statistics->Introgression detection Reticulate evolution Reticulate evolution Introgression detection->Reticulate evolution PhyloNet analysis PhyloNet analysis PhyloNet analysis->Reticulate evolution Biogeographic reconstruction Biogeographic reconstruction Divergence time estimation->Biogeographic reconstruction

Experimental Protocol: Resolving Aquilegia Phylogenomics

Sample Preparation and Sequencing
  • DNA Extraction: Use high-molecular-weight DNA extraction protocols (e.g., CTAB method) from leaf tissue of 23 natural populations, including all major morphological species.
  • Library Preparation: Construct Illumina sequencing libraries with 350-500 bp insert sizes following manufacturer's protocols.
  • Sequencing: Perform whole-genome resequencing to an average coverage of 11.71×, generating approximately 2.45 billion clean reads per study [42].
Data Processing and Variant Calling
  • Read Mapping: Align clean reads to the A. coerulea 'Goldsmith' reference genome (v3.1) using BWA-MEM [40].
  • Variant Calling: Identify single nucleotide polymorphisms (SNPs) using GATK HaplotypeCaller following best practices [40]. Apply strict filtering to produce a final dataset of 2.6-2.7 million variable sites [42].
  • Data Quality Control: Remove duplicate reads and identify callable genomic regions (approximately 21% of the reference genome), with enrichment for coding regions (57% of annotated coding regions) [40].
Phylogenomic Analysis
  • Gene Tree Inference: Extract sequence alignments from non-overlapping genomic windows (e.g., 100 kb). For each window, infer maximum likelihood gene trees using IQ-TREE with appropriate substitution models [19].
  • Species Tree Estimation: Input all gene trees into ASTRAL to estimate the primary species tree while accounting for incomplete lineage sorting [19].
  • Introgression Testing: Calculate D-statistics (ABBA-BABA tests) for all species trios to detect asymmetric allele sharing indicative of introgression [19].
  • Phylogenetic Networks: Use PhyloNet to infer species networks that explicitly model hybridization events [19].

Case Study 2: Diploid Diversification in Picris

Biological System and Evolutionary Questions

The genus Picris (Asteraceae) comprises approximately 50 taxa centered in the Mediterranean Basin, a global biodiversity hotspot [43]. Unlike many plant radiations involving polyploidy, Picris species are predominantly diploid (2n = 2x = 10), providing a model for studying homoploid diversification [43]. The genus features two major lineages (Clade A and Clade B) that diverged in the Pliocene, with most species emerging during the Quaternary [43]. Key functional traits vary among species, including life strategy (semelparous vs. iteroparous) and fruit morphology (heterocarpic vs. homocarpic), which are potentially involved in habitat-specific adaptation [43].

Historical Introgression as a Driver of Diversification

Phylogenomic analyses using Hyb-Seq data have revealed two major historical introgression events that shaped evolutionary trajectories within diploid Picris [43]. The earliest and most complex events involved the Turkish endemic P. campylocarpa, which hybridized with the most recent common ancestor of the P. cyprica–P. pauciflora lineage and with the ancestor of the B1 subclade [43]. This latter introgression preceded shifts from iteroparity to semelparity and from heterocarpy to homocarpy, though these trait shifts were not found to be directly attributable to adaptive introgression [43]. Nevertheless, all detected historical introgression events contributed significantly to the diversification of diploid Picris taxa.

Table 3: Historical Introgression Events and Trait Evolution in Picris

Introgression Event Donor Recipient Evolutionary Outcome Trait Associations
Early complex event P. campylocarpa (Turkish endemic) MRCA of P. cyprica–P. pauciflora lineage Diversification of Mediterranean endemics Not directly linked to trait shifts
B1 subclade introgression P. campylocarpa MRCA of B1 subclade (P. hieracioides group + P. scaberrima–P. strigosa lineage) Preceded major life history trait shifts Shift from iteroparity to semelparity; heterocarpy to homocarpy

Experimental Protocol: Picris Hybridization Analysis

Hyb-Seq Data Generation
  • Probe Design: Use the Asteraceae family-specific COS probe set (MyBaits COS Compositae/Asteraceae 1kv1) targeting 1,061 potentially low-copy nuclear loci [39].
  • Library Preparation and Capture: Prepare Illumina sequencing libraries following standard protocols. Perform hybrid capture using the Asteraceae COS probe set according to manufacturer's instructions (Arbor Biosciences) [39].
  • Sequencing: Sequence captured libraries on Illumina platforms to sufficient depth (typically 20-30× coverage per locus).
Data Processing and Locus Selection
  • Read Assembly: Process raw reads to remove adapters and low-quality sequences. Assemble reads for each target locus using tools such as PHYLUCE or HybPiper.
  • Alignment: Generate multiple sequence alignments for each locus using MAFFT or MUSCLE.
  • Paralogy Filtering: Identify and remove potentially paralogous loci using approaches such as gene tree reconciliation or maximum likelihood tree-based filtering [39].
Phylogenomic Inference
  • Gene Tree Estimation: Infer maximum likelihood trees for each locus using IQ-TREE with model selection [19].
  • Species Tree Estimation: Input gene trees into ASTRAL to estimate the species tree topology [38].
  • Network Analysis: Use PhyloNet to infer phylogenetic networks that incorporate hybridization events [19].
  • Divergence Time Estimation: Estimate divergence times using tree-dating methods such as BEAST2, calibrated with known fossil ages or geological events.

G Asteraceae COS\nprobe set (1,061 loci) Asteraceae COS probe set (1,061 loci) Hybrid capture Hybrid capture Asteraceae COS\nprobe set (1,061 loci)->Hybrid capture Illumina sequencing Illumina sequencing Hybrid capture->Illumina sequencing DNA extraction DNA extraction Library prep Library prep DNA extraction->Library prep Library prep->Hybrid capture Locus assembly Locus assembly Illumina sequencing->Locus assembly Multiple sequence alignment Multiple sequence alignment Locus assembly->Multiple sequence alignment Paralogy filtering Paralogy filtering Multiple sequence alignment->Paralogy filtering Gene tree inference Gene tree inference Paralogy filtering->Gene tree inference ASTRAL species tree ASTRAL species tree Gene tree inference->ASTRAL species tree PhyloNet network PhyloNet network Gene tree inference->PhyloNet network Divergence time estimation Divergence time estimation ASTRAL species tree->Divergence time estimation Historical introgression detection Historical introgression detection PhyloNet network->Historical introgression detection

The Scientist's Toolkit

Table 4: Essential Research Reagents and Computational Tools

Tool/Reagent Type Primary Function Application Context
ASTRAL Software Species tree estimation from gene trees Handling incomplete lineage sorting in rapid radiations
ASTRAL-Pro3 Software Species tree from multi-copy gene trees Cases where orthology inference is challenging
PhyloNet Software Phylogenetic network inference Detecting and visualizing introgression events
ROADIES Pipeline Automated species tree from genome assemblies Reference-free phylogenomics for diverse taxa
IQ-TREE Software Maximum likelihood phylogenetic inference Gene tree estimation with model selection
Asteraceae COS Probe Set Wet-bench reagent Target enrichment for 1,061 nuclear loci Hyb-Seq phylogenomics across Asteraceae
GATK Software Variant discovery from sequencing data SNP calling for population genomic analyses
BWA-MEM Software Sequence alignment to reference genomes Read mapping in resequencing studies

The integration of genomic data with sophisticated analytical frameworks such as ASTRAL and PhyloNet has revolutionized our ability to resolve complex evolutionary radiations. The case studies of Aquilegia and Picris demonstrate that historical introgression has been a significant driver of diversification in both adaptive and cryptic radiations. In Aquilegia, extensive allele sharing and the unique evolutionary history of specific chromosomes highlight the complex genomic underpinnings of radiation. In Picris, historical introgression between diploid species has shaped diversification patterns without necessarily driving adaptive trait shifts. These findings underscore the importance of moving beyond strictly bifurcating tree models to incorporate reticulate evolution in our understanding of phylogenetic relationships. The experimental protocols and analytical workflows detailed here provide a roadmap for researchers investigating complex radiations across the tree of life.

Pitfalls and Solutions: Optimizing ASTRAL Accuracy Under Introgression

Diagnosing and Mitigating the Impact of Ghost Introgression

Ghost introgression, the transfer of genetic material from extinct or unsampled lineages into the gene pools of extant species, presents a significant challenge in modern phylogenomics [44] [9]. This phenomenon is increasingly recognized as a widespread evolutionary force, with documented cases across plants, animals, and humans [44] [9]. For researchers using species tree estimation methods like ASTRAL, which operates under the multi-species coalescent (MSC) model, unaccounted ghost introgression can seriously compromise inference accuracy [17]. This application note provides a structured framework for diagnosing and mitigating the effects of ghost introgression, with specific protocols for ASTRAL-based workflows.

The core challenge lies in the similarity of genomic signatures between ghost introgression and other evolutionary processes. Ghost introgression from an outgroup lineage can produce gene tree discordance patterns that mimic introgression between sampled non-sister species or amplify signals of incomplete lineage sorting (ILS) [44] [17]. This can lead to incorrect identification of donor and recipient species and biased estimates of divergence times [17]. Without explicit detection and accounting for these effects, even sophisticated summary methods like ASTRAL may infer incorrect species tree topologies and overestimate root divergence times [17].

Methodological Comparison for Ghost Introgression Detection

Selecting appropriate analytical methods is crucial for accurate detection and characterization of ghost introgression. The table below compares the performance of leading phylogenetic methods when faced with ghost introgression scenarios.

Table 1: Performance Comparison of Phylogenetic Methods Under Ghost Introgression

Method Category Key Principle Performance with Ghost Introgression Key Limitations
D-statistic (ABBA-BABA) Heuristic/Site-pattern Tests for excess shared derived alleles between non-sister species [44] Prone to misidentification of donor and recipient species; cannot distinguish ghost from direct introgression [44] Assumes identical substitution rates; ignores homoplasy; requires careful outgroup selection [19]
HyDe Heuristic/Site-pattern Uses site pattern frequencies under a hybrid speciation model [44] Struggles to identify correct donor and recipient in outflow scenarios; performance under ghost introgression not well established [44] Based on hybrid speciation model; may not generalize well to minor introgression
PhyloNet/MPL Heuristic/Gene-tree Uses gene tree topologies to infer phylogenetic networks [44] Often fails to distinguish ghost from non-sister species introgression; limited by gene tree estimation error [44] Uses only gene tree topologies (not branch lengths); network identifiability challenges [44]
BPP Full-likelihood Uses multilocus sequence alignments directly with Bayesian MCMC [44] Capable of detecting ghost introgression by comparing alternative networks with Bayes factors [44] Computationally intensive; requires careful prior specification [44]
ASTRAL Summary method Coalescent-based species tree estimation from gene trees [25] Generally robust to gene flow between non-sister species but affected by certain ghost introgression scenarios [17] Can overestimate root divergence time with outgroup ghost introgression [17]

The performance differences between method categories are striking. Heuristic methods relying on summary statistics like site patterns or gene tree topologies generally struggle to differentiate ghost introgression from introgression between sampled non-sister species [44]. In contrast, full-likelihood methods such as BPP, which analyze multilocus sequence alignments directly and utilize both gene-tree topologies and branch lengths, demonstrate superior capability in detecting ghost introgression, though at increased computational cost [44].

For ASTRAL users specifically, understanding its limitations is crucial. While ASTRAL shows greater robustness to gene flow between non-sister species compared to full-likelihood methods like *BEAST under certain conditions, it remains vulnerable to specific ghost introgression scenarios, particularly those involving outgroup ghosts, which can lead to overestimated root divergence times [17].

Experimental Protocols for Diagnosing Ghost Introgression

Protocol 1: Tree-Based Introgression Detection Workflow

This protocol provides a complementary approach to SNP-based methods for detecting introgression events using genome-wide gene tree distributions [19].

Table 2: Reagents and Software for Tree-Based Introgression Detection

Research Tool Function/Application Key Features
Progressive Cactus Whole-genome alignment Reference-free multiple genome alignment; handles draft assemblies [19]
IQ-TREE Gene tree inference Maximum likelihood phylogenetic inference; rapid and accurate [19]
ASTRAL Species tree estimation Coalescent-based species tree from gene trees; accounts for ILS [19] [25]
PhyloNet Network inference Infers phylogenetic networks from gene trees; detects hybridization/introgression [19]
FigTree Tree visualization Visualization and manipulation of phylogenies in Newick format [19]

Procedure:

  • Data Preparation and Filtering

    • Extract alignment blocks from a whole-genome alignment (e.g., in MAF format) [19]. For the cichlid example, chromosome-scale alignment is used.
    • Filter alignment blocks by:
      • Minimum length (e.g., 1,000 bp) to ensure sufficient phylogenetic signal [19].
      • Completeness (minimal missing data per species).
      • Low recombination signals to avoid confounding factors [19].
  • Gene Tree Estimation

    • For each filtered alignment block, infer a gene tree using maximum likelihood with IQ-TREE [19].
    • Use appropriate substitution models selected by model testing.
    • Consider contracting branches with very low support (e.g., below 10%) to reduce noise, as aggressive filtering can be harmful [25].
  • Species Tree Estimation

    • Infer a reference species tree from the set of gene trees using ASTRAL [19] [25].
    • ASTRAL-III provides polynomial time complexity and improved handling of polytomies [25].
  • Introgression Detection

    • Analyze the distribution of gene tree topologies across the genome for asymmetry in alternative phylogenetic topologies [19].
    • Use PhyloNet to assess support for alternative models of diversification with and without introgression [19].
    • Compare quartet frequencies against expectations under the multispecies coalescent model alone.

The following workflow diagram illustrates the key steps in this protocol:

G Start Start with Whole-Genome Alignment Data Filter Filter Alignment Blocks (Length, Completeness, Low Recombination) Start->Filter GeneTrees Infer Gene Trees Using IQ-TREE Filter->GeneTrees SpeciesTree Infer Species Tree Using ASTRAL GeneTrees->SpeciesTree Detect Detect Introgression via Gene Tree Distribution Analysis (PhyloNet) SpeciesTree->Detect Output Interpret Introgression Patterns & Validate Detect->Output

Protocol 2: Full-Likelihood Analysis with BPP for Ghost Introgression

This protocol uses full-likelihood Bayesian analysis to explicitly test for ghost introgression, particularly valuable when heuristic methods suggest conflicting signals.

Procedure:

  • Hypothesis and Model Specification

    • Define explicit alternative phylogenetic networks representing:
      • No introgression model
      • Ghost introgression models (with different donor-recipient configurations)
      • Introgression between sampled taxa models
    • Use prior biological knowledge to inform plausible ghost introgression scenarios.
  • Data Preparation

    • Prepare multilocus sequence alignments (e.g., from phylogenomic datasets or phylotranscriptomics).
    • Ensure proper orthology assignment and alignment quality.
    • Format input files according to BPP specifications.
  • Bayesian Analysis

    • Run BPP under each alternative network model.
    • Use appropriate MCMC settings (chain length, burn-in, sampling frequency) to ensure convergence.
    • Utilize BPP's ability to directly compare putative networks using Bayes factors [44].
  • Model Comparison

    • Calculate Bayes factors to compare alternative models.
    • Substantial support for ghost introgression models over both no-introgression and sampled-introgression models provides evidence for ghost introgression.
    • Validate results through replicate analyses and prior sensitivity tests.
Protocol 3: Mitigating Ghost Introgression Effects in ASTRAL Analysis

This protocol outlines specific strategies to minimize the impact of ghost introgression when using ASTRAL for species tree estimation.

Procedure:

  • Comprehensive Taxon Sampling

    • Include phylogenetically relevant outgroups to help distinguish between ingroup and outgroup ghost introgression.
    • Sample potential extant relatives of suspected ghost lineages when possible.
  • Gene Tree Filtering and Weighting

    • Identify and potentially exclude loci with strong signatures of introgression that may represent ghost introgression.
    • Use gene tree estimation error measures to weight genes in ASTRAL analysis.
  • Divergence Time Assessment

    • Interpret ASTRAL divergence time estimates, particularly root ages, with caution when ghost introgression is suspected.
    • Cross-validate divergence times with multiple methods, as outgroup ghost introgression tends to produce overestimates [17].
  • Multi-Method Integration

    • Combine ASTRAL analysis with full-likelihood methods (BPP) on key clades where ghost introgression is suspected.
    • Use heuristic methods (D-statistics, HyDe) as initial screens but not definitive evidence for ghost introgression.

Conceptual Framework and Diagnostic Pathways

Understanding the scenarios and genomic consequences of ghost introgression is essential for accurate diagnosis. The following diagram illustrates key conceptual relationships and diagnostic pathways:

G GhostIntro Ghost Introgression (Unsampled/Extinct Lineage) OutgroupGhost Outgroup Ghost (Diverged Before Basal Species) GhostIntro->OutgroupGhost IngroupGhost Ingroup Ghost (Within Sampled Clade) GhostIntro->IngroupGhost GenomicSignature Genomic Signatures: - Gene Tree Discordance - ABBA-BABA Patterns - Tree Topology Asymmetry OutgroupGhost->GenomicSignature IngroupGhost->GenomicSignature SpeciesTreeBias Species Tree Bias: - Topology Errors - Divergence Time Bias (Overestimation with Outgroup Ghost) GenomicSignature->SpeciesTreeBias HeuristicMethods Heuristic Methods (Struggle with Ghost Introgression) SpeciesTreeBias->HeuristicMethods FullLikelihood Full-Likelihood Methods (Better Ghost Detection) SpeciesTreeBias->FullLikelihood ASTRALImpact ASTRAL Impact: - Generally Robust to Non-Sister Flow - Root Time Overestimation with Outgroup Ghost SpeciesTreeBias->ASTRALImpact

The diagram illustrates how different types of ghost introgression generate distinctive genomic signatures that can bias species tree inference. These biases affect analytical methods differently, with heuristic approaches particularly challenged by ghost introgression scenarios.

Ghost introgression represents a significant challenge for species tree estimation, particularly for widely used methods like ASTRAL that operate under the multi-species coalescent model. Through structured diagnostic approaches and appropriate method selection, researchers can both detect ghost introgression and mitigate its impacts on phylogenetic inference. The protocols outlined here provide a practical framework for integrating ghost introgression assessment into standard phylogenomic workflows, combining the scalability of summary methods like ASTRAL with the precision of full-likelihood approaches for critical analyses. As phylogenomic datasets continue to grow in both taxon sampling and genomic coverage, the ability to accurately account for ghost introgression will become increasingly essential for reconstructing robust evolutionary histories.

When Does Introgression Cause Over- or Under-estimation of Divergence Times?

Molecular dating, the inference of species divergence times from genetic sequences, is a cornerstone of evolutionary biology, instrumental in revealing the mode and tempo of evolution across the tree of life [45]. These methods typically rely on the critical assumption that all genomic loci share the same species tree topology and divergence history. However, gene tree discordance—where the evolutionary history of individual loci conflicts with the species tree—is a widespread phenomenon. A primary source of this conflict, beyond incomplete lineage sorting (ILS), is introgression: the episodic gene flow between species after they have diverged [45] [46].

Ignoring introgression in phylogenetic analysis can lead to significant and systematic biases in divergence time estimates. Recent advances in phylogenomic datasets and analytical models now allow researchers to account for both ILS and introgression, helping to avoid these pitfalls [45]. This application note details the conditions under which introgression biases divergence time estimates and provides protocols for robust analysis within the context of ASTRAL species tree estimation.

How Introgression Biases Divergence Time Estimates

Cross-species introgression can have substantial impacts on the phylogenomic reconstruction of species divergence events. Simulations have demonstrated that the presence of even a small amount of introgression can bias estimates when gene flow is ignored in the analysis [45]. The direction and magnitude of the bias depend on the specific phylogenetic relationship of the introgressing lineages.

Table 1: Types and Effects of Introgression Scenarios on Divergence Time Estimates

Introgression Scenario Impact on Divergence Time Estimates Underlying Mechanism
Between Sister Species [45] Underestimation of the sister species divergence time. Gene flow makes sister species appear more recently diverged than they are.
Between Non-Sister Species [45] Underestimation of the divergence time between the introgressing non-sister species. Shared ancestry from gene flow inflates perceived similarity, pulling divergence times forward.
From a Ghost Lineage [45] Biased estimation of divergence times across the tree. Ancestral gene flow from an unsampled/extinct lineage creates anomalous shared histories.
Pervasive/Continuous Gene Flow [45] Bias depends on the model used; MSci model (episodic) can be robust to continuous gene flow. Model misspecification (assuming no gene flow) is the primary cause of bias.

The bias occurs because methods assuming a pure bifurcating species tree misinterpret the shared genetic variation caused by post-divergence gene flow as evidence of a more recent common ancestry. It is crucial to note that divergence time estimation under the multispecies coalescent (MSC) model can be robust to small amounts of introgression in empirical datasets with extensive taxon sampling, but substantial introgression requires explicit modeling [45].

Quantitative Framework and Analytical Solutions

The multispecies-coalescent-with-introgression (MSci) model, implemented in software such as bpp, provides a powerful framework for accurately estimating both divergence times and ancestral effective population sizes, even when only a single diploid individual per species is sampled [45]. This model accounts for the expected variances and covariances in quantitative traits—or genetic data—arising from gene tree discordance.

For a three-species phylogeny ((A,B),C), the expected trait covariance matrix T under the Brownian motion model, which includes both ILS and introgression, determines the patterns of trait similarity [18]. The covariance between species A and B, accounting for these processes, is more complex than under a simple species tree model. When introgression is present, it introduces additional covariance between species pairs that are not sister species, for example, between B and C or A and C [18]. Failing to account for this additional shared history leads to inaccurate inferences of evolutionary parameters.

Table 2: Key Software and Models for Divergence Time Estimation with Introgression

Software / Model Primary Function Key Feature Applicable Data
bpp (MSci Model) [45] Bayesian divergence time & population size estimation. Accounts for both ILS and episodic introgression. Genome-scale loci (e.g., 1,000-10,000 loci).
ASTRAL Coalescent-based species tree estimation. Infers the primary species tree from gene trees while accounting for ILS. Gene trees inferred from genomic data.
StarBEAST3 [45] Bayesian divergence time estimation. Uses the MSC model with a relaxed molecular clock. Multilocus sequence data.
Phylogenetic Networks [47] Reticulate evolution visualization & inference. Represents evolutionary history including hybridization/introgression. Genome-wide ancestry patterns.

Experimental Protocols for Dating with Introgression

Protocol 1: Testing for Introgression Signals Pre-Dating

Objective: To detect and characterize introgression in a dataset prior to divergence time estimation. Materials: Genomic data (e.g., UCEs, transcriptomes, whole-genome resequencing) from multiple individuals per species. Workflow:

  • Locus Assembly & Gene Tree Inference: Assemble sequencing data and infer individual gene trees for hundreds to thousands of loci. Use appropriate phylogenetic models for each locus.
  • Species Tree Estimation: Infer a primary species tree using a coalescent-based method like ASTRAL, which is robust to ILS.
  • Introgression Detection Tests: Use statistical tests (e.g., D-statistics/ABBA-BABA, f-branch) to identify significant deviations from the species tree that indicate introgression.
  • Phylogenetic Network Inference: Use tools for phylogenetic network inference (e.g., PhyloNet, SNaQ) to visualize and hypothesize introgression events. The output is a set of candidate introgression scenarios to test in a formal model-based dating analysis.

G Start Start: Multi-locus Genomic Data A 1. Locus Assembly & Gene Tree Inference Start->A B 2. Coalescent-based Species Tree (ASTRAL) A->B C 3. Introgression Detection (e.g., D-statistics) B->C D 4. Phylogenetic Network Inference C->D End Output: Candidate Introgression Scenario D->End

Protocol 2: Divergence Time Estimation under the MSci Model

Objective: To estimate divergence times using bpp while explicitly modeling episodic introgression. Materials: A sequence alignment file (e.g., PHYLIP format), a guide species tree with a hypothesized introgression event, and fossil calibration information. Workflow:

  • Model and Priors Configuration: Set up the bpp control file to use the MSci model. Define the species tree topology and the introgression branch. Specify priors for divergence times (e.g., gamma priors calibrated with fossils) and population size parameters (theta).
  • MCMC Analysis: Run a Markov Chain Monte Carlo (MCMC) analysis for a sufficient number of generations to ensure convergence. Use multiple independent runs to assess robustness.
  • Convergence Diagnostics & Posterior Analysis: Use tools like Tracer to check for MCMC convergence (effective sample size > 200). Summarize the posterior distributions of divergence times and introgression parameters from the combined post-burn-in samples.

G Start Start: Sequence Alignments & Calibrated Guide Tree A 1. Configure bpp MSci Analysis (Tree, Introgression, Priors) Start->A B 2. Run MCMC (Several Independent Runs) A->B C 3. Diagnose Convergence (e.g., with Tracer) B->C D 4. Summarize Posterior Distributions of Node Times C->D End Output: Divergence Time Estimates Accounting for Introgression D->End

The Scientist's Toolkit: Key Research Reagents & Materials

Table 3: Essential Research Reagents and Computational Tools

Item / Resource Function / Description Application in Analysis
Target-Enrichment Loci (e.g., UCEs) [45] Provides hundreds to thousands of conserved, single-copy orthologous loci from across the genome. Primary data for gene tree and species tree inference; balances phylogenetic signal and computational feasibility.
Fossil Calibrations [45] Provides absolute time constraints for nodes in the species tree, enabling calibration of the molecular clock. Critical for converting substitutions per site into estimates of absolute time (millions of years).
bpp Software Suite [45] A Bayesian software for analyzing multilocus sequence data under the MSC and MSci models. The core tool for implementing Protocol 2, enabling joint estimation of divergence times, population sizes, and introgression.
ASTRAL Software A method for estimating the species tree from a set of input gene trees under the multispecies coalescent model. Used in Protocol 1 to infer the primary species topology, which is a key input for testing and modeling introgression.
Relaxed Molecular Clock Model [45] Allows the rate of molecular evolution to vary across different branches of the species tree. Accounts for rate heterogeneity, preventing a key source of bias in divergence time estimation.

Integrating tests for introgression into the molecular dating workflow is no longer optional for rigorous studies in systems with known or suspected hybridization. The MSci model offers a powerful solution, accurately estimating parameters even with moderate model misspecification (e.g., inferring episodic introgression from data generated under continuous gene flow) [45]. Best practices include:

  • Proactive Testing: Always test for introgression, especially in rapidly radiating groups or those with known porous species boundaries.
  • Model Comparison: Compare the fit of the MSC model (no introgression) and the MSci model (with introgression) to your data using marginal likelihoods or other metrics.
  • Robust Sampling: Whenever possible, use extensive taxon sampling, which can make divergence time estimation under the MSC more robust to small amounts of introgression [45]. By adopting these protocols and leveraging the outlined toolkit, researchers can generate more accurate and reliable estimates of evolutionary timescales, providing a firmer foundation for understanding the history of life.

In phylogenomics, Incomplete Lineage Sorting (ILS) and introgression are two major biological processes that generate incongruence between gene trees and the species tree. Both can produce similar patterns of phylogenetic discordance, presenting a significant challenge for accurately reconstructing evolutionary history. ILS, the failure of genealogies to coalesce in the immediate ancestral species, results from the random sorting of ancestral polymorphisms. Introgression, the transfer of genetic material between species through hybridization, introduces alleles from one species into the gene pool of another. Distinguishing between their signatures is crucial for ASTRAL species tree estimation, as failure to account for introgression can bias topological inference and branch length estimates. This Application Note provides practical protocols and frameworks for deconvolving these competing signals, enabling more accurate phylogenetic inference and a deeper understanding of evolutionary history.

Theoretical Framework and Quantitative Landscape

ILS and introgression create distinct genomic signatures. While ILS generates a relatively uniform distribution of discordance across the genome, introgression produces localized blocks of shared ancestry with a specific phylogenetic signal. The differential impact of these processes is quantifiable; in Fagaceae, decomposition analyses attributed 9.84% of gene tree variation to ILS and 7.76% to gene flow (with 21.19% from gene tree estimation error) [13]. In bacterial systems, core genome introgression levels average ~2% but can reach 14% in highly recombinogenic genera like Escherichia–Shigella [11].

The following table summarizes key metrics and characteristics that help distinguish these processes in empirical datasets:

Table 1: Quantitative Signatures of ILS and Introgression

Feature Incomplete Lineage Sorting (ILS) Introgression
Genomic Distribution Genome-wide, uniform discordance [48] Localized, heterogeneous blocks [48]
Relationship with Divergence Decreases with increasing genetic distance Can occur between distant taxa but often restricted by divergence [11]
Relationship with Recombination Independent of local recombination rate Positively correlated with recombination rate due to selection against introgressed blocks in low-recombination regions [48]
Phylogenetic Signal Adheres to the multispecies coalescent model Creates topological patterns inconsistent with pure coalescent history
Key Quantitative Metric Gene Tree Discordance (e.g., sCF) [7] D-statistics, f-branch statistics [7]

Experimental and Analytical Protocols

Core Phylogenomic Workflow

A robust protocol for deconvolving ILS and introgression involves the following key stages, implemented in a hierarchical manner.

G Start Start: Multi-sequence Alignment GT Gene Tree Inference (IQ-TREE, RAxML) Start->GT ST Species Tree Estimation (ASTRAL, MP-EST) GT->ST Disc Discordance Quantification (sCF, sDF, Quartet Scores) ST->Disc Hyp Test for Introgression (D-statistics, f-branch) ST->Hyp Disc->Hyp Net Phylogenetic Network Inference (PhyloNet, SNaQ) Hyp->Net ILS ILS-aware Interpretation Hyp->ILS Net->ILS

Protocol 1: Phylogenomic Tree Estimation and Initial Discordance Assessment

Objective: Reconstruct primary species tree and quantify gene tree heterogeneity.

  • Data Preparation: Generate genome-scale data (whole genomes, transcriptomes, or targeted sequence capture). Identify orthologous loci using tools such as OrthoFinder or HybPiper.
  • Gene Tree Inference: For each orthologous locus, infer a maximum likelihood gene tree using IQ-TREE (v2.3.6+) [13]. Use model selection (e.g., -m MFP) and assess branch support with 1000 ultrafast bootstrap replicates (-B 1000) [7].
  • Species Tree Estimation: Infer the primary species tree using the coalescent-based method ASTRAL (v5.7.8) [19]. Input the collection of gene trees: java -jar ~/software/Astral/astral.5.7.8.jar -i input_gene_trees.tre -o species_tree.tre
  • Discordance Quantification: Calculate site concordance factors (sCF) and site discordance factors (sDF1, sDF2) using IQ-TREE. This measures the support for the primary topology (sCF) and the two major alternative topologies (sDF1, sDF2) at each site or branch [7]. High and imbalanced sDF values indicate potential introgression.

Protocol 2: Testing for Introgression with D-Statistics

Objective: Statistically test for gene flow between a specific pair of taxa.

Background: The D-statistic (or ABBA-BABA test) detects an excess of shared derived alleles between two sister species and a third taxon, which is inconsistent with a neutral coalescent model under ILS alone [7] [48].

  • Taxon Sampling: Define a four-taxon test set (((P1,P2),P3),Outgroup). P1 and P2 are sister species, P3 is the potential introgressor, and the Outgroup roots the tree.
  • Variant Calling: Call genome-wide SNPs from whole-genome sequencing data or use aligned sequence data.
  • D-statistic Calculation: Use the Dsuite software package to compute D-statistics across all scaffolds or chromosomes. Dsuite Dtrios -o output_prefix -k fbranch species_tree.tre variant_file.vcf A significant positive D-value indicates excess allele sharing between P2 and P3, consistent with introgression.
  • Significance Testing: Perform a block jackknife procedure to assess the significance of the D-statistic, correcting for multiple testing and linkage disequilibrium.

Protocol 3: Phylogenetic Network Inference with PhyloNet

Objective: Model reticulate evolutionary histories explicitly.

Background: When D-statistics and other tests reveal significant introgression, PhyloNet can infer phylogenetic networks that represent both divergence and hybridization events [19].

  • Input Data: Prepare a set of inferred gene trees in Newick format.
  • Infer Maximum Likelihood Network: Run PhyloNet to infer a network with a specified number of reticulations (h). java -jar ~/software/PhyloNet/PhyloNet.jar script.txt Where script.txt contains the command, for example: InferNetwork_ML (all_gene_trees) hmax:2 -o output_network
  • Model Selection: Compare networks with different numbers of reticulation events using AIC or BIC scores to select the best-fitting model.

The Scientist's Toolkit

Table 2: Essential Research Reagents and Computational Tools

Tool/Resource Function Application Context
IQ-TREE [19] [13] Maximum Likelihood phylogenetic inference and concordance factor calculation. Core workhorse for gene tree and species tree estimation. Critical for calculating sCF/sDF metrics.
ASTRAL [19] Coalescent-based species tree estimation from gene trees. Estimating the primary species tree in the presence of ILS.
Dsuite [7] Suite for D-statistics and f-branch tests for introgression. Formal testing for excess allele sharing indicative of gene flow.
PhyloNet [19] Inference of phylogenetic networks from gene trees. Modeling explicit reticulation/hybridization events.
PAUP* [19] General-purpose phylogenetic analysis. Tree manipulation, filtering, and additional analyses.
FigTree [19] Visualization of phylogenetic trees and networks. Graphical representation of results.
Whole-Genome Alignment Foundation for phylogenetic inference from genomic data. Provides the sequence data for downstream analysis. A typical input is a MAF file [19].

Integrated Data Interpretation Framework

Successful deconvolution requires synthesizing evidence from all analyses. The diagram below illustrates the logical decision process for interpreting conflicting signals.

G A Observed Gene Tree Discordance B Is discordance genome-wide and uniform? A->B C Is there a signal of excess allele sharing (D-statistic)? B->C No D Primary Conclusion: ILS B->D Yes C->D No E Primary Conclusion: Introgression C->E Yes F Are introgressed signals localized in genomes? E->F F->E No G Conclusion: Complex History of ILS + Introgression F->G Yes

Key Interpretation Guidelines:

  • Support for ILS: Genome-wide discordance that is well-modeled by the multispecies coalescent, with no significant D-statistic signals and no strong correlation between phylogenetic signal and genomic features like recombination rate [48].
  • Support for Introgression: Significant D-statistics, localized blocks of strong phylogenetic discordance, and a positive correlation between inferred admixture and local recombination rate, as selection more efficiently removes introgressed blocks from low-recombination regions [48].
  • Complex Scenarios: Many groups, like the Liliaceae tribe Tulipeae, show pervasive ILS and reticulation, making definitive resolution difficult. In such cases, phylogenetic network models that accommodate both coalescence and hybridization provide the most accurate representation of history [7].

Optimizing Search Space with Bipartition Constraints for Scalability and Accuracy

The reconstruction of species trees from genomic data is a fundamental challenge in evolutionary biology, complicated by processes such as incomplete lineage sorting (ILS) that cause gene trees to differ from the species tree. ASTRAL (Accurate Species TRee ALgorithm) has emerged as a leading method for species tree estimation by maximizing the number of shared quartet topologies with input gene trees while accounting for ILS. A critical innovation in ASTRAL's methodology is its use of bipartition constraints to limit the search space, enabling the analysis of large-scale phylogenomic datasets that would otherwise be computationally intractable. This approach restricts the set of bipartitions that the output species tree may include to a predefined set X, where for every cluster A ∈ X, the complementary cluster L-A is also included [25]. By strategically constraining this solution space, ASTRAL achieves polynomial-time complexity while maintaining statistical consistency under the multi-species coalescent model, representing a significant advancement in computational phylogenetics.

The optimization of search space through bipartition constraints represents a paradigm shift in species tree reconstruction methodology. Early versions of ASTRAL utilized bipartitions observed directly in input gene trees as the constraint set, which proved sufficient for statistical consistency but limited applicability to complex datasets with extensive discordance. Subsequent developments have refined how this constraint set is constructed and expanded, balancing computational feasibility with biological accuracy. This protocol examines the theoretical foundations, implementation details, and practical applications of search space optimization using bipartition constraints, providing researchers with a comprehensive framework for applying these techniques to phylogenomic inference problems, particularly in the context of detecting introgression and other evolutionary processes.

Theoretical Foundation of Bipartition Constraints

Mathematical Framework

ASTRAL operates by solving a constrained version of the species tree optimization problem, where a set of clusters X restricts the bipartitions that the output species tree may include. Formally, let L be the set of n species and G be the set of k input gene trees on L. The constraint bipartition set X is defined as a set of clusters where for each A ∈ X, we also have L−A ∈ X [25]. The algorithm searches for the species tree t that maximizes the objective function ∑_{g∈G} |Q(g)∩Q(t)|, where Q(t) denotes the set of quartet trees induced by tree t [25]. This optimization is performed using dynamic programming with a recursive relation that leverages tripartitions built from the constraint set.

The search space complexity is directly determined by the size of the constraint set |X| and the derived set of tripartitions Y. The set Y contains all tripartitions that can be built from X, defined mathematically as Y = {(A', A−A', L−A) : A' ⊂ A, A ∈ X, A' ∈ X, A−A' ∈ X} [25]. A key theoretical result by Kane and Tao established that |Y| ≤ |X|^{3/log₃(27/4)}, which provides an improved upper bound on the computational complexity and enables the polynomial-time guarantee achieved in ASTRAL-III [25].

Complexity Analysis

The implementation of bipartition constraints fundamentally alters the computational complexity of species tree inference. For ASTRAL-III, the asymptotic running time in the presence of polytomies is O((nk)^{1.726}D), where D = O(nk) represents the sum of degrees of all unique nodes in input trees [25]. This polynomial running time represents a substantial improvement over previous versions and enables ASTRAL to scale to datasets comprising up to 10,000 species [25]. The constrained search space approach stands in contrast to exhaustive search methods that would need to consider all possible species tree topologies, a problem known to be NP-hard [25].

Table 1: Computational Complexity of ASTRAL Versions with Bipartition Constraints

ASTRAL Version Theoretical Complexity Constraint Set Key Innovation
ASTRAL-I O(nk|X|²) for binary trees Bipartitions from input gene trees Initial constraint framework
ASTRAL-II O(nk|X|^{1.726}) for binary trees Heuristically expanded set Improved heuristics for set expansion
ASTRAL-III O((nk)^{1.726}D) with polytomies Linearly growing with n and k Polynomial time guarantee

The bipartition constraint strategy maintains statistical consistency under the multi-species coalescent model while dramatically improving computational efficiency. This consistency property holds even when the constraint set X is restricted, provided it contains the bipartitions from the true species tree [25]. The theoretical foundation thus ensures that the method remains accurate while becoming computationally tractable for large-scale phylogenomic analyses.

Protocol for Search Space Optimization

Initial Constraint Set Construction

The first step in optimizing search space with bipartition constraints involves constructing an appropriate constraint set X. For basic implementation, begin by extracting all bipartitions from the input gene trees:

  • Input Preparation: Collect all gene trees in Newick format. Ensure trees are unrooted unless specifically using a rooted version of the algorithm.
  • Bipartition Extraction: For each gene tree, identify all internal nodes and extract the bipartition defined by each node. A bipartition is represented as a cluster A of species and its complement L-A.
  • Constraint Set Initialization: Populate X with all unique bipartitions obtained from all gene trees, ensuring that for each cluster A included, the complementary cluster L-A is also included.

This basic constraint set, comprising solely bipartitions observed in input gene trees, is often sufficient for statistical consistency and can produce accurate species trees. However, for datasets with high levels of discordance or incomplete lineage sorting, expanding this set may improve accuracy.

Constraint Set Expansion Strategies

ASTRAL-II and ASTRAL-III employ heuristic approaches to expand the constraint set beyond directly observed bipartitions:

  • Multi-Resolution Heuristic: Implement a multi-locus bootstrapping approach that resamples genes and includes bipartitions from the resulting gene trees.
  • Graph-Based Expansion: Construct a similarity graph based on bipartition frequencies and include additional bipartitions that are compositionally similar to those already in X.
  • Multi-Individual Datasets: For datasets with multiple individuals per species, include all bipartitions from gene trees where individuals from the same species form monophyletic groups [15].

The expanded constraint set should be designed to grow at most linearly with both the number of species (n) and the number of genes (k) to maintain computational tractability. ASTRAL-III specifically limits the bipartition constraint set to ensure this linear growth, guaranteeing polynomial running time [25].

Dynamic Programming Implementation

With the constraint set X established, implement the dynamic programming algorithm:

  • Tripartition Set Construction: Generate the set Y of all possible tripartitions that can be built from X according to the definition: Y = {(A', A−A', L−A) : A' ⊂ A, A ∈ X, A' ∈ X, A−A' ∈ X}.
  • Scoring Function: For each tripartition T = (A|B|C) in Y, compute the scoring function w(T) against each node in each input gene tree using the QI (Quartet Index) metric:

    w(T) = ∑{g∈G} ∑{M∈N(g)} ½ QI(T,M)

    where QI(T,M) calculates twice the number of quartet trees shared between trees containing T and M [25].

  • Recursive Tree Building: Apply the dynamic programming recursion:

    V(A) = max_{A'⊂A, (A'|A−A'|L−A)∈Y} V(A,A') for |A| > 1

    V(A,A') = V(A') + V(A−A') + w(A'|A−A'|L−A)

    where V(A) represents the optimal score for the cluster A [25].

This dynamic programming approach efficiently explores the constrained search space to identify the species tree that maximizes quartet support from the input gene trees.

G Start Start: Input Gene Trees ExtractBP Extract Bipartitions from Gene Trees Start->ExtractBP InitialSet Initialize Constraint Set X with observed bipartitions ExtractBP->InitialSet Expand Expand Constraint Set using heuristics InitialSet->Expand BuildY Build Tripartition Set Y from constraint set X Expand->BuildY Score Score Tripartitions using QI metric BuildY->Score DP Dynamic Programming Tree Assembly Score->DP Output Output Species Tree DP->Output

Figure 1: Workflow for optimizing search space with bipartition constraints in species tree estimation

Experimental Validation and Performance Metrics

Scalability Assessment

The effectiveness of search space optimization with bipartition constraints must be validated through rigorous scalability testing. Experimental evaluations should measure:

  • Running Time: Record execution time as a function of both the number of species (n) and the number of genes (k). The objective is to verify that running time grows polynomially rather than exponentially.
  • Memory Usage: Monitor memory consumption throughout execution, particularly for large datasets with thousands of species and genes.
  • Constraint Set Size: Track the size of the constraint set |X| throughout execution to ensure linear growth with n and k.

Empirical testing of ASTRAL-III demonstrates its ability to handle datasets of up to 10,000 species, a significant improvement over previous versions [25]. This scalability enables application to large-scale phylogenomic projects that were previously computationally prohibitive.

Accuracy Benchmarks

Accuracy assessment should compare species trees estimated using bipartition constraints against known species trees in simulation studies:

  • Simulated Datasets: Generate species trees and gene trees under the multi-species coalescent model with varying levels of ILS.
  • Comparison Metrics: Calculate the normalized Robinson-Foulds distance between estimated and true species trees to quantify topological accuracy.
  • Constraint Set Impact: Evaluate how different constraint set strategies affect accuracy, comparing the basic set (only observed bipartitions) against expanded sets.

Studies have shown that ASTRAL maintains high accuracy even when the constraint set is restricted, provided it contains the bipartitions from the true species tree [25]. The method has demonstrated superior accuracy compared to alternative approaches such as concatenation under conditions of high ILS.

Table 2: Performance Metrics for Search Space Optimization with Bipartition Constraints

Performance Dimension Evaluation Metric Expected Outcome with Optimization
Computational Efficiency Running time Polynomial growth: O((nk)^{1.726}D)
Memory Usage Peak memory consumption Sub-quadratic with dataset size
Topological Accuracy Normalized Robinson-Foulds distance High accuracy (>90% with sufficient genes)
Statistical Consistency Recovery rate of true topology Consistent under MSC model
Scalability Maximum tractable dataset size Up to 10,000 species
Biological Dataset Validation

Validation should include application to empirical biological datasets with established phylogenetic relationships:

  • Avian Phylogenomics: Utilize the avian phylogenomic dataset of 14,000 genes to verify that the method recovers established relationships among bird families [25].
  • Plant Transcriptomes: Apply to the One Thousand Plant Transcriptomes Initiative dataset to test performance on plant phylogenetics [49].
  • Impact of Gene Tree Error: Assess how contracting low-support branches in gene trees (e.g., below 10% bootstrap support) affects species tree accuracy [25].

Biological validations have demonstrated that contracting low-support branches in gene trees prior to species tree estimation can improve accuracy by reducing noise, while overly aggressive filtering may be harmful [25]. This highlights the importance of appropriate data preprocessing in conjunction with search space optimization.

Integration with Introgression Research

Detecting Introgression with Tree-Based Methods

Bipartition constraint optimization plays a crucial role in detecting introgression through tree-based methods. The approach serves as a robust complement to SNP-based analyses like the ABBA-BABA test, particularly when analyzing divergent species where assumptions of identical substitution rates may be violated [19]. The protocol involves:

  • Whole-Genome Alignment Extraction: Extract alignment blocks from a whole-genome alignment, filtering for minimal missing data and low recombination.
  • Gene Tree Inference: Generate phylogenetic trees for each filtered alignment block using maximum likelihood methods such as IQ-TREE.
  • Species Tree Estimation: Apply ASTRAL with bipartition constraints to estimate the species tree from the resulting gene trees.
  • Topology Frequency Analysis: Compare frequencies of alternative tree topologies across the genome to identify asymmetries indicative of introgression.

This tree-based approach to introgression detection benefits significantly from the scalability enabled by bipartition constraints, allowing analysis of genome-wide datasets across multiple species [19].

Phylogenetic Network Inference

For cases where introgression is extensive, phylogenetic networks may provide a more appropriate representation of evolutionary history:

  • Integration with PhyloNet: Use the species tree estimated with bipartition constraints as a starting point for network inference in PhyloNet.
  • Model Comparison: Compare support for alternative models of diversification with and without introgression using likelihood scores.
  • Visualization: Utilize tools like FigTree for visualization and interpretation of resulting species trees and networks.

This integrated approach enables researchers to distinguish between incomplete lineage sorting and introgression as causes of gene tree discordance, providing a more complete picture of evolutionary history.

G Start Start: Whole Genome Alignment Extract Extract Alignment Blocks (1000 bp recommended) Start->Extract Filter Filter Blocks: -Completeness -Informativeness -Low Recombination Extract->Filter GeneTrees Infer Gene Trees using IQ-TREE Filter->GeneTrees Constraint Apply Bipartition Constraints GeneTrees->Constraint SpeciesTree Estimate Species Tree using ASTRAL Constraint->SpeciesTree Detect Detect Introgression via Topology Frequency Analysis SpeciesTree->Detect Network Infer Phylogenetic Network using PhyloNet Detect->Network Output Output: Species Tree/ Network with Introgression Network->Output

Figure 2: Integration of bipartition constraints in tree-based introgression detection pipeline

The Scientist's Toolkit

Table 3: Essential Research Reagents and Computational Tools for Search Space Optimization

Tool/Resource Function Application Context
ASTRAL Species tree estimation from gene trees Primary implementation of bipartition constraint optimization
IQ-TREE Maximum likelihood gene tree inference Generating input gene trees from sequence alignments
PAUP* General utility phylogenetic inference Alternative tree inference and analysis
PhyloNet Phylogenetic network inference Modeling introgression and reticulate evolution
FigTree Tree visualization Visualization of species trees and networks
Progressive Cactus Whole-genome alignment Generating input alignments for gene tree inference
TreeShrink Detection of outlier long branches Filtering gene trees to improve accuracy

Troubleshooting and Technical Considerations

Common Implementation Challenges

Several technical challenges may arise when implementing search space optimization with bipartition constraints:

  • Excessive Running Time: If running time becomes prohibitive, consider modifying the constraint set heuristics to limit |X| more aggressively. The -x option in ASTRAL can be used to restrict the number of bipartitions added during set expansion.
  • High Memory Usage: For datasets with large numbers of species or genes, memory consumption may be substantial. Ensure adequate RAM is available, and consider using the ASTRAL-MP (multi-threaded) version for parallelization.
  • Low Resolution Output: If the resulting species tree contains polytomies, this may indicate insufficient gene tree information or excessive constraint set restriction. Verify that the constraint set includes sufficient bipartitions to resolve relationships.
Advanced Configuration Options

Advanced users can modify several parameters to optimize performance for specific datasets:

  • Constraint Set Modification: Manually curate the constraint set to include specific bipartitions of biological interest while excluding others that may be computationally expensive but biologically implausible.
  • Branch Length Estimation: ASTRAL can compute branch lengths in coalescent units using the option to include branch length calculation, providing additional evolutionary insights.
  • Local Posterior Probability: Enable calculation of local posterior probability as a measure of branch support, which can be more appropriate for coalescent-based methods than traditional bootstrap values.

Optimizing search space with bipartition constraints represents a powerful strategy for scaling species tree estimation to large phylogenomic datasets while maintaining statistical consistency and biological accuracy. The ASTRAL algorithm implementation of this approach has demonstrated exceptional performance in both simulated and empirical studies, enabling researchers to address fundamental questions in evolutionary biology that were previously computationally prohibitive. By following the protocols outlined in this document—from theoretical foundation to practical implementation—researchers can effectively apply these techniques to their phylogenomic investigations, particularly in the context of detecting and characterizing introgression and other complex evolutionary processes. The continued refinement of bipartition constraint strategies promises further advances in the scale and precision of phylogenetic inference.

Addressing Gene Tree Estimation Error and Its Cascading Effects

Gene tree estimation error is a fundamental challenge in phylogenomics that can systematically bias downstream evolutionary inferences, including species tree estimation and the detection of introgression. Incorrect gene tree topologies, resulting from factors such as insufficient phylogenetic signal, recombination, or model misspecification, create a cascade of effects that propagate through analysis pipelines [50]. When employing coalescent-based species tree methods like ASTRAL (Accurate Species TRee ALgorithm), which infer the species tree from a collection of input gene trees, these errors can directly impact the accuracy of the estimated species phylogeny [51] [25]. Furthermore, in the context of introgression research, where comparisons of gene tree topologies are used to identify past hybridization events, gene tree error can obscure true phylogenetic signals and lead to spurious conclusions [19]. This Application Note provides a structured protocol to diagnose, mitigate, and account for gene tree estimation error, ensuring robust inference of species trees and introgression signals within an ASTRAL-based framework.

The Cascading Effects of Gene Tree Error

Inaccurate gene tree topologies can adversely affect phylogenomic analyses in several key ways, as illustrated in the diagram below.

G Start Input Data: Multi-locus Alignments GT Gene Tree Estimation Start->GT GT_Error Gene Tree Estimation Error GT->GT_Error Insufficient Signal Recombination Model Violation ST Species Tree Estimation (e.g., ASTRAL) GT->ST GT_Error->ST Introgress Introgression Detection (e.g., D-statistics, PhyloNet) GT_Error->Introgress ST_Impact Inaccurate Species Tree Topology & Branch Lengths ST->ST_Impact ST_Impact->Introgress Introgress_Impact False Positive/Negative Introgression Signals Introgress->Introgress_Impact Bio_Interpret Misleading Biological Interpretation Introgress_Impact->Bio_Interpret

Figure 1: The cascading impact of gene tree estimation error on downstream phylogenomic analyses, including species tree estimation and introgression detection.

The diagram outlines the primary pathways through which error propagates. Incomplete Lineage Sorting (ILS), a common cause of gene tree heterogeneity, can be confounded by estimation error [51]. Furthermore, standard introgression tests like the D-statistic assume specific evolutionary models; violation of these assumptions due to gene tree error can produce misleading results [19].

Quantitative Impact of Error

Table 1: Effects of gene tree estimation error on species tree inference under different conditions.

Condition Impact on Species Tree Accuracy Impact on Introgression Detection
Low ILS, High Error Concatenation may outperform coalescent methods [51] High false-positive rate for introgression [19]
High ILS, High Error ASTRAL accuracy decreases; aggressive filtering harmful [25] True introgression signal masked by noise [19]
Low Support Branches Introduces noise; contracting very low support (e.g., <10%) branches can improve ASTRAL accuracy [25] Increases asymmetry in quartet frequencies spuriously [19]
Polytomies ASTRAL-III handles polytomies efficiently without major performance loss [25] Unresolved topologies complicate quartet frequency counts

Experimental Protocols

Protocol 1: Gene Tree Estimation with Statistical Error Assessment

This protocol details the steps for inferring gene trees from sequence alignments while quantifying statistical uncertainty.

1. Input Data Preparation: Extract alignment blocks from a whole-genome alignment. For example, using a Python script to extract blocks of a fixed length (e.g., 1,000 bp) and filter them for completeness (e.g., must contain a sequence for all species) and a minimal number of parsimony-informative sites [19]. 2. Maximum Likelihood Tree Inference: For each filtered alignment block, infer an unrooted gene tree using a maximum likelihood (ML) method such as IQ-TREE [19]. Use commands such as:

This command performs model selection, tree inference, and branch support assessment via ultrafast bootstrap (1,000 replicates) [19]. 3. Statistical Support Evaluation: For each inferred gene tree, note the ML topology and its statistical support. Branches with very low bootstrap support (e.g., below 10%) are prime candidates for contraction in subsequent steps to reduce noise [25].

Protocol 2: Gene Tree Error Correction with Species Tree Awareness

This protocol uses the species tree as a prior to identify and correct gene trees that are statistically equivalent but more parsimonious.

1. Initial Species Tree Estimation: Estimate a preliminary species tree from the set of unrooted ML gene trees (from Protocol 1) using ASTRAL. This tree does not need to be perfect but serves as a guide.

2. Gene Tree Correction with TreeFix: For each gene tree, run TreeFix to find a statistically equivalent topology (as determined by a likelihood ratio test like the SH test) that minimizes a reconciliation cost (e.g., duplication-loss cost) with the preliminary species tree [50]. 3. Output Corrected Gene Trees: The output is a set of corrected gene trees that are statistically indistinguishable from the ML trees based on sequence data alone but are more topologically congruent with the species tree, potentially reducing estimation error [50].

Protocol 3: Robust Species Tree Inference with ASTRAL

This protocol describes how to use the corrected gene trees to infer the final species tree with ASTRAL, accounting for ILS.

1. Input Preparation: Gather the corrected gene trees from TreeFix (or the original ML trees if error correction is not performed). Optionally, contract branches with very low support (e.g., below 10%) to reduce noise, but avoid aggressive filtering which can be harmful [25]. 2. ASTRAL Execution: Run ASTRAL-III to find the species tree that agrees with the largest number of quartet topologies induced by the input gene trees.

ASTRAL-III is statistically consistent under the multi-species coalescent model and runs in polynomial time, scaling to very large datasets [51] [25]. 3. Output and Support: ASTRAL outputs the species tree in Newick format. It can also compute branch lengths in coalescent units and a measure of branch support (local posterior probability), which are crucial for downstream analyses [25].

Protocol 4: Introgression Detection from Gene Tree Quartets

This protocol leverages the distribution of gene tree topologies to test for introgression.

1. Species Trio Selection: Identify sets of three ingroup species (trios) and an outgroup for which to test introgression [19]. 2. Quartet Frequency Calculation: For each trio, examine all gene trees and count the frequencies of the three possible unrooted quartet topologies. Under a scenario of no introgression and only ILS, the two minor topologies are expected to have roughly equal frequencies. Significant asymmetry is indicative of introgression [19]. 3. Model Testing with PhyloNet: Use the software PhyloNet to fit explicit phylogenetic network models that can account for both ILS and introgression. Compare the support for models with and without introgression events to statistically evaluate the presence of hybridization [19].

The workflow for the complete integrated pipeline, from raw data to final interpretation, is shown below.

G Data Whole-Genome Alignment AlignBlocks Extract & Filter Alignment Blocks Data->AlignBlocks MLTrees Infer ML Gene Trees (IQ-TREE) with Bootstrapping AlignBlocks->MLTrees PrelimTree Estimate Preliminary Species Tree (ASTRAL) MLTrees->PrelimTree Correct Correct Gene Trees (TreeFix) MLTrees->Correct Input Introgress Detect Introgression (Quartet Asymmetry, PhyloNet) MLTrees->Introgress Input PrelimTree->Correct FinalSpecies Infer Final Species Tree (ASTRAL-III) Correct->FinalSpecies FinalSpecies->Introgress Results Robust Evolutionary Inferences Introgress->Results

Figure 2: An integrated experimental workflow for addressing gene tree error in species tree estimation and introgression detection.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential software tools for mitigating gene tree estimation error and its effects.

Tool / Reagent Function Key Application in Protocol
IQ-TREE Efficient maximum likelihood phylogenetic inference and model testing [19] Protocol 1: Infers gene trees from sequence alignments and assesses branch support via bootstrapping.
TreeFix Statistically informed gene tree error correction using a species tree prior and sequence likelihood [50] Protocol 2: Finds a gene tree topology that is statistically equivalent to the ML tree but has a lower reconciliation cost.
ASTRAL-III Coalescent-based species tree estimation from gene trees by maximizing quartet support [25] Protocol 3: Infers the primary species tree from (corrected) gene trees, accounting for ILS.
PhyloNet Inference of phylogenetic networks to model reticulate evolutionary events like introgression [19] Protocol 4: Tests competing models of diversification with and without introgression.
PAUP* General-purpose software for phylogenetic analysis, useful for tree visualization and manipulation [19] Auxiliary: Used for various tree operations, filtering, and custom scripting within workflows.
FigTree Interactive graphical viewer for phylogenetic trees [19] Auxiliary: Visualizes and annotates input gene trees, preliminary species trees, and final output trees.

Gene tree estimation error is an unavoidable aspect of phylogenomic analyses, but its cascading effects on species tree estimation and introgression detection can be systematically managed. The integrated protocols presented here—combining careful gene tree inference, statistical error correction, coalescent-aware species tree estimation with ASTRAL, and model-based introgression testing—provide a robust framework for researchers. By explicitly acknowledging and accounting for these errors, scientists can produce more reliable and interpretable evolutionary histories, even in the presence of complex biological processes like incomplete lineage sorting and hybridization.

Benchmarking Performance: ASTRAL vs. Alternative Methods in Hybridizing Scenarios

In the field of phylogenomics, accurately estimating the evolutionary history of species is a fundamental goal. However, this process is complicated by biological phenomena such as incomplete lineage sorting (ILS) and introgression, which cause individual gene histories to differ from the overall species tree. The multi-species coalescent (MSC) model provides a theoretical framework to understand gene tree discordance due to ILS. This application note provides a comparative framework for four prominent species tree estimation methods—ASTRAL, *BEAST, MP-EST, and STELAR—focusing on their theoretical foundations, performance under various conditions, and practical applications, with a special emphasis on research involving introgression.

The methods discussed herein can be broadly categorized as either summary methods (ASTRAL, MP-EST, STELAR), which estimate species trees from a set of pre-inferred gene trees, or co-estimation methods (*BEAST), which simultaneously infer gene trees and the species tree from sequence alignments.

Table 1: Core Characteristics of Species Tree Estimation Methods

Method Type Primary Optimization Statistical Consistency under MSC? Handles Introgression?
ASTRAL Summary Maximizes quartet agreement with gene trees [52] [5] Yes [52] [5] Not a primary function, but robust to small amounts of gene flow [17] [53]
*BEAST Co-estimation Bayesian MCMC co-estimation of gene and species trees [54] Yes [52] Can be extended with additional models, but computationally intensive [17]
MP-EST Summary Maximizes a pseudo-likelihood function from triplets [53] [55] Yes [52] [53] Robust to small amounts of gene flow [53]
STELAR Summary Maximizes triplet agreement with gene trees [52] [56] Yes [52] [56] Statistical guarantee fails under gene flow [52]

ASTRAL

ASTRAL is a summary method that operates by finding the species tree that shares the maximum number of induced quartet trees with the set of input gene trees [5]. It solves a constrained version of this NP-hard problem in polynomial time, making it highly scalable [52] [5]. Its statistical consistency under the MSC model is well-established, and it is designed to be robust to conditions of high ILS [52].

*BEAST

*BEAST is a Bayesian MCMC method that co-estimates gene trees and the species tree directly from sequence alignments in a single statistical framework [54]. As a full-likelihood method, it can provide a powerful and coherent estimation of evolutionary parameters but is computationally intensive, limiting its application to smaller datasets [52] [17].

MP-EST

MP-EST estimates the species tree by maximizing a pseudo-likelihood function derived from the frequencies of rooted triplets (three-species subtrees) in the gene trees [53] [55]. While statistically consistent under the MSC, its accuracy and scalability are generally surpassed by ASTRAL and STELAR, particularly for larger datasets [52] [56].

STELAR

STELAR is a more recent summary method that seeks the species tree maximizing triplet agreement with the input gene trees [52] [56]. It efficiently solves the Constrained Triplet Consensus (CTC) problem using dynamic programming. Empirical studies indicate it matches ASTRAL's accuracy and improves upon MP-EST [52] [56].

Performance Under Introgression and Model Violations

A key consideration in modern phylogenomics is the performance of species tree methods when the underlying assumptions of the MSC model are violated, particularly by introgression.

Impact of Ghost Introgression

Ghost introgression—gene flow from an unsampled, unknown, or extinct lineage—presents a significant challenge. Research indicates that both ILS and introgression (from either ghost or extant lineages) interact to affect the occurrence of anomalous gene trees [17]. The relative performance of species tree methods can vary under different introgression scenarios:

  • ASTRAL demonstrates greater robustness to gene flow between non-sister species [17].
  • *BEAST may perform better under certain specific conditions of ghost introgression [17].
  • The statistical consistency of STELAR is formally guaranteed only under the MSC and can fail when gene tree discordance is caused by gene duplication, loss, or horizontal gene transfer [52].

Divergence Time Estimation Bias

Introgression can also bias the estimation of divergence times:

  • Outgroup ghost introgression (from a lineage that diverged before the most basal species) generally leads to an overestimation of the root divergence time [17].
  • Ingroup introgression, conversely, more commonly leads to an underestimation of divergence times [17].

Experimental and Simulation Protocols for Method Evaluation

Evaluating the performance of species tree methods requires carefully designed simulation studies and analyses of empirical data. Below is a generalized workflow for conducting such assessments.

G cluster_0 Data Simulation (e.g., SimPhy) cluster_1 Species Tree Inference & Evaluation Start Start Simulate Species Tree Simulate Species Tree Start->Simulate Species Tree Simulate Gene Trees (MSC) Simulate Gene Trees (MSC) Simulate Species Tree->Simulate Gene Trees (MSC) Simulate Sequence Alignments Simulate Sequence Alignments Simulate Gene Trees (MSC)->Simulate Sequence Alignments Estimate Gene Trees (IQ-TREE) Estimate Gene Trees (IQ-TREE) Simulate Sequence Alignments->Estimate Gene Trees (IQ-TREE) Infer Species Tree Infer Species Tree Estimate Gene Trees (IQ-TREE)->Infer Species Tree Compare to True Tree Compare to True Tree Infer Species Tree->Compare to True Tree End End Compare to True Tree->End

Species Tree Method Evaluation Workflow

Data Simulation Protocol

This protocol outlines steps for generating datasets used to evaluate species tree methods [17] [5].

  • Species Tree Simulation: Using a tool like SimPhy [5], simulate a ground-truth species tree under a birth-death model. Parameters (e.g., population sizes, divergence times) should be set to control the expected level of ILS.
  • Gene Tree Simulation: Simulate a large number of gene trees (e.g., 100-1000) within the simulated species tree under the MSC model using SimPhy. This step introduces gene tree discordance due to ILS.
  • Sequence Alignment Simulation: For each gene tree, simulate a corresponding DNA sequence alignment using a tool like INDELible or by integrating this function within SimPhy. Parameters such as substitution model, sequence length, and mutation rate should be specified.
  • Introducing Introgression (Optional): To test robustness to model violations, simulate specific introgression events (e.g., between non-sister species or from a ghost lineage) using tools that extend the MSC to include gene flow.

Species Tree Inference and Evaluation Protocol

This protocol describes the process of inferring species trees from simulated data and evaluating their accuracy [19].

  • Gene Tree Estimation: Estimate gene trees from the simulated sequence alignments using maximum likelihood software such as IQ-TREE [19] or PAUP*. This step produces a set of inferred gene trees, which may contain estimation errors.
  • Species Tree Inference: Input the set of estimated gene trees into the summary methods being evaluated (ASTRAL, MP-EST, STELAR). For *BEAST, input the sequence alignments directly to co-estimate gene and species trees.
  • Topological Accuracy Assessment: Compare the inferred species trees to the simulated ground-truth species tree. Common metrics include:
    • Normalized Robinson-Foulds (RF) Distance: Measures the proportion of bipartitions that differ between the estimated and true tree.
    • Triplet/Quartet Score: The proportion of induced triplets (for rooted species trees) or quartets (for unrooted) in the estimated tree that match the true tree.

Table 2: Key Software Tools for Phylogenomic Analysis

Tool/Reagent Primary Function Relevance to Species Tree Estimation
IQ-TREE Maximum Likelihood Phylogenetic Inference Estimates individual gene trees from sequence alignments, which serve as input for summary methods [19].
ASTRAL Species Tree Estimation Infers a species tree from a set of unrooted gene trees by maximizing quartet agreement [19] [5].
PhyloNet Phylogenetic Network Inference Infers species networks to explicitly model evolutionary processes like introgression and hybridization [19].
SimPhy Phylogenomic Simulation Generates realistic species trees, gene trees, and sequence alignments under the MSC for method testing [5].
PAUP* Phylogenetic Analysis A general-utility program for inferring evolutionary trees (e.g., via maximum likelihood or parsimony) [19].
FigTree Tree Visualization Visualizes and manipulates phylogenetic trees, aiding in the interpretation and presentation of results [19].

Discussion and Synthesis for Introgression Research

Selecting an appropriate species tree method requires balancing computational scalability, statistical robustness, and biological realism. For large-scale phylogenomic projects with high ILS, summary methods like ASTRAL and STELAR are preferred due to their computational efficiency, high accuracy, and statistical consistency [52] [56]. Between the two, ASTRAL may hold an advantage in studies where undetected introgression is a concern, as it has shown specific robustness to gene flow between non-sister species [17].

For smaller datasets where computational cost is less prohibitive, *BEAST provides a powerful full-likelihood framework. It can be particularly valuable when analyzing multi-individual data or when co-estimation of other evolutionary parameters is desired [17] [5]. However, its performance can be superior in certain ghost introgression scenarios, though this must be weighed against its substantial computational demands [17].

No method is universally superior, and the choice often depends on the specific biological question, dataset size, and potential sources of gene tree discordance. ASTRAL and STELAR currently represent the leading edge in terms of accuracy and scalability for summary methods, while *BEAST offers a more integrated but computationally intensive Bayesian approach. As phylogenomic datasets continue to grow and incorporate more complex evolutionary scenarios, the development and refinement of these methods will remain an active and critical area of research.

Accurate species tree estimation is a cornerstone of evolutionary genomics, yet it is frequently challenged by the biological reality of introgression—the exchange of genetic material between incompletely isolated species. Introgression, along with incomplete lineage sorting (ILS), creates a mosaic of genealogical histories across the genome, leading to widespread gene tree discordance that can mislead species tree inference if not properly accounted for [13]. The multispecies coalescent (MSC) model provides the primary theoretical framework for understanding these discordances, with the ASTRAL package emerging as a leading method for inferring species trees from genome-scale data by seeking the tree that agrees with the largest number of gene trees [19] [38]. However, the performance of these methods under varying rates of introgression remains a critical area of investigation. This Application Note provides a structured experimental framework to quantitatively assess the accuracy of ASTRAL species tree estimation across a spectrum of introgression rates, enabling researchers to benchmark performance and interpret results within the context of their specific study systems.

Theoretical Expectations and Experimental Design

The presence of introgression violates the standard MSC assumption of no gene flow post-speciation, creating phylogenetic dependencies that are not captured by a strictly diverging species tree. Theory predicts that as introgression rates increase, so does genealogical discordance, which in turn is expected to challenge species tree methods that do not explicitly model network-like evolutionary histories [57] [47]. Under a three-species scenario where taxa B and C can hybridize after diverging from a common ancestor, the expected phylogenetic signal shifts as introgression rates change. At low rates, the species tree topology ((A,B),C) is expected to dominate, while at higher rates, an alternative topology ((A,C),B) may become increasingly frequent, potentially leading to inaccurate species tree inference if the underlying method is sensitive to such conflicts [57].

Experimental Scenarios for Quantifying Performance

To systematically evaluate ASTRAL's performance, we propose simulating genomic data under five distinct evolutionary scenarios with varying degrees of gene flow. These scenarios represent a gradient from isolation to extensive hybridization, allowing researchers to assess method robustness across biologically realistic conditions.

Table 1: Experimental Scenarios for Introgression Rate Variation

Scenario Name Introgression Rate (4Nm) Description Biological Interpretation
Strict Isolation 0 No gene flow between sister species post-divergence. Idealized scenario for MSC-based methods.
Limited Gene Flow 0.1 Low, biologically plausible migration. Representative of many weakly isolated species.
Moderate Introgression 1.0 Substantial but incomplete reproductive barriers. Common in recently diverged species complexes.
High Introgression 5.0 Extensive historical hybridization. Found in adaptive radiations and hybrid zones.
Adaptive Introgression 0.1→5.0 (localized) Low background rate with a single locus under strong selection. Models adaptive trait transfer between species.

For the "Adaptive Introgression" scenario, a beneficial allele should be simulated to originate in population C and introgress into population B at a specific time point (e.g., time 0.1 in coalescent units) with a strong selection coefficient (e.g., s = 0.005) to create a localized genomic region with elevated introgression signals [58].

Detailed Experimental Protocol

Data Simulation Procedure

Step 1: Parameterize Evolutionary Scenarios

  • Define population genetic parameters: effective population size (Nₑ), divergence times (T₁, T₂), and migration rates (m) for each scenario in Table 1.
  • For a standard setup, use T₁ = 0.5, T₂ = 1.5 (in units of 4Nₑ generations) to create a scenario with potential for ILS [58].
  • Set migration rates according to Table 1, with migration occurring symmetrically or asymmetrically between sister species.

Step 2: Generate Genomic Sequences

  • Use the coalescent simulator msms (Ewing and Hermisson 2010) to simulate recombining chromosomes under each scenario [58].
  • Command for "Moderate Introgression" scenario (4Nm = 1.0):

  • Parameters: 120 haploid samples, 1000 replicates, mutation rate θ=300, recombination rate ρ=100, 1Mb chromosome, 3 populations with 40 samples each, migration between populations 2 and 1 at time 0.25.
  • For "Adaptive Introgression," add selection parameters to model the sweep of an introgressed allele.

Step 3: Process Simulated Data

  • Convert raw output to sequence alignments (FASTA format) for predefined genomic windows (e.g., 1 kb segments).
  • For each alignment block, filter out those with excessive missing data (>50% gaps) or insufficient variation.

Gene Tree and Species Tree Inference

Step 4: Infer Gene Trees

  • For each filtered alignment block, infer a gene tree using maximum likelihood with IQ-TREE2:

  • Use ModelFinder integrated in IQ-TREE2 to select the best-fit substitution model.
  • Generate 1000 ultrafast bootstrap replicates to assess branch support.

Step 5: Infer Species Trees with ASTRAL

  • Compile all inferred gene trees into a single file.
  • Run ASTRAL to estimate the species tree:

  • Enable branch support calculation with multi-locus bootstrapping or local posterior probabilities.

Analysis of Results

Step 6: Quantify Accuracy Metrics

  • Compare the inferred species tree to the true simulated species tree using Robinson-Foulds (RF) distance.
  • Calculate the frequency of the major topological configurations (the species tree topology vs. alternative topologies) across gene trees.
  • For each scenario, measure ASTRAL's species tree accuracy as the proportion of replicates where the correct species topology was recovered.

Table 2: Key Performance Metrics for Comparison

Performance Metric Calculation Method Interpretation
Species Tree Accuracy Proportion of replicates recovering true topology Primary measure of methodological performance under introgression.
Robinson-Foulds Distance Number of bipartitions differing between true and inferred trees Quantifies topological dissimilarity (lower is better).
Local Posterior Probability Mean support values for key bipartitions from ASTRAL Measures confidence in problematic regions of the tree.
Topology Weighting Profile Proportion of gene trees supporting each possible topology using Twisst [58] Characterizes the extent of gene tree discordance.
False Positive Rate Proportion of replicates inferring introgression when none simulated Tests method specificity.

Expected Results and Interpretation

Based on current literature, we anticipate that ASTRAL will maintain high accuracy under low to moderate introgression rates (4Nm ≤ 1.0) but may show decreased performance under high introgression scenarios. The following table summarizes expected outcomes across the experimental scenarios:

Table 3: Expected Performance of ASTRAL Under Varying Introgression Rates

Scenario Expected Species Tree Accuracy Expected RF Distance Key Observation
Strict Isolation >95% <0.05 High accuracy with strong node support.
Limited Gene Flow 85-95% 0.05-0.15 Slight decrease in accuracy, particularly at short internal branches.
Moderate Introgression 70-85% 0.15-0.30 Noticeable decrease in support for affected clades.
High Introgression 50-70% 0.30-0.50 Potential for species tree inaccuracy, especially with asymmetric introgression.
Adaptive Introgression 75-90% 0.10-0.25 Localized effect on specific genomic regions; overall topology generally preserved.

The "Adaptive Introgression" scenario is expected to produce a strong but highly localized signal that may not substantially impact the overall species tree inference, as ASTRAL summarizes signals across the entire genome [58]. However, high genome-wide introgression rates may challenge ASTRAL's performance, particularly when introgression occurs between non-sister lineages or when the species tree contains short internal branches that are more susceptible to topological inaccuracy under high gene flow conditions.

Implementation Toolkit

Research Reagent Solutions

Table 4: Essential Computational Tools for Analysis

Tool/Resource Primary Function Application in Protocol
msms Coalescent simulation with selection Generating genomic sequence data under various introgression scenarios [58].
IQ-TREE2 Maximum likelihood phylogenetic inference Estimating gene trees from sequence alignments [19].
ASTRAL Species tree estimation from gene trees Inferring the primary species topology accounting for gene tree discordance [19] [38].
Twisst Topology weighting by iterative sampling of subtrees Quantifying support for alternative phylogenetic relationships [58].
ggtree Phylogenetic tree visualization and annotation Visualizing and comparing true versus inferred trees [59].
PhyloNet Phylogenetic network inference Modeling introgression events explicitly when detected [19].

Workflow Visualization

G cluster_0 Experimental Scenarios (Vary Migration Rates) params Define Parameters (Nₑ, divergence times, migration rates) sim Simulate Genomes (msms) params->sim align Process Alignments (Filter, format) sim->align genetrees Infer Gene Trees (IQ-TREE2) align->genetrees spectree Infer Species Tree (ASTRAL) genetrees->spectree analyze Analyze Results (Accuracy metrics, topology weighting) spectree->analyze visualize Visualize & Interpret (ggtree, comparative plots) analyze->visualize iso Strict Isolation (4Nm=0) low Limited Gene Flow (4Nm=0.1) mod Moderate Introgression (4Nm=1.0) high High Introgression (4Nm=5.0) adapt Adaptive Introgression (localized)

Figure 1: Experimental workflow for evaluating ASTRAL under introgression

Theoretical Relationship Visualization

G cluster_1 ASTRAL Performance Range low_introg Low Introgression Rate high_accuracy High Species Tree Accuracy low_introg->high_accuracy low_discord Low Gene Tree Discordance low_introg->low_discord astral_robust Robust performance region low_introg->astral_robust high_introg High Introgression Rate low_accuracy Reduced Species Tree Accuracy high_introg->low_accuracy alt_topology Increased Alternative Topology Frequency high_introg->alt_topology high_discord High Gene Tree Discordance high_introg->high_discord astral_challenged Challenged performance region high_introg->astral_challenged high_discord->low_accuracy high_discord->alt_topology

Figure 2: Theoretical relationship between introgression and accuracy

This protocol provides a standardized framework for assessing the accuracy of ASTRAL species tree estimation under varying introgression rates. The experimental design enables systematic evaluation of method performance across biologically realistic scenarios, from strict isolation to extensive hybridization. When applying these methods to empirical data, researchers should consider the following evidence-based guidelines:

  • Moderate Introgression (4Nm ≤ 1.0): ASTRAL generally provides reliable species tree estimates, with accuracy expected to exceed 70% under these conditions. Results can be interpreted with high confidence, though specific clades affected by gene flow may show reduced support values.

  • High Introgression (4Nm > 1.0): Exercise caution when interpreting ASTRAL species trees, as accuracy may decline substantially. In such cases, supplement ASTRAL analysis with explicit phylogenetic network methods (e.g., PhyloNet) to model introgression events directly [19].

  • Validation Approach: For empirical datasets, apply topology weighting (Twisst) to quantify the proportion of gene trees supporting alternative relationships [58]. A high frequency of non-species tree topologies, particularly those grouping non-sister taxa, suggests potential introgression that may affect species tree inference.

  • Complementary Methods: In systems with suspected high introgression, combine ASTRAL analysis with alternative approaches such as the D-statistic for formal testing of introgression and phylogenetic network methods for visualizing complex evolutionary relationships [19] [47].

By implementing this comprehensive protocol, researchers can rigorously evaluate species tree accuracy in their specific study systems, appropriately account for uncertainty in phylogenetic inference, and generate robust evolutionary hypotheses when investigating lineages with complex histories of divergence and gene flow.

Divergence Time Estimation Bias Caused by Ingroup vs. Outgroup Introgression

Application Notes

Within the broader thesis investigating ASTRAL species tree estimation with introgression, understanding the specific biases introduced by gene flow is paramount. This document outlines the critical impact of introgression from ingroup (sampled) versus outgroup or "ghost" (unsampled or extinct) lineages on the accuracy of divergence time estimation. The use of species tree methods, such as those in the ASTRAL software, under the multispecies coalescent (MSC) model is widespread, but model misspecification in the presence of introgression can lead to systematically biased parameter estimates [17] [60].

The direction and magnitude of bias in divergence time estimates are not uniform; they are fundamentally determined by the phylogenetic placement of the introgressing lineage. The following table summarizes the core biases associated with the two primary introgression scenarios.

Table 1: Comparative Bias Profile of Ingroup vs. Outgroup Introgression

Introgression Type Direction of Bias in Root Divergence Time Key Influencing Factor Impact on Species Tree Topology
Ingroup Introgression Systematic underestimation [17] Strength of ILS; stronger ILS can paradoxically improve time estimation accuracy despite introgression [17] High; topology is highly prone to bias from introgression [17]
Outgroup (Ghost) Introgression Systematic overestimation [17] The relative divergence time of the ghost lineage [17] Varies; can lead to the inference of a species tree not displayed by the true network [60]

Protocols

Protocol 1: Detecting and Quantifying Introgression for Bias Assessment

Objective: To identify introgression signals and determine the donor-recipient relationships among species, providing a foundational dataset for subsequent bias correction.

Materials:

  • Genomic Data: Whole-genome resequencing data or transcriptomic data (e.g., from RNA-seq) for all ingroup taxa [46] [42].
  • Computational Tools:
    • Phylogenetic Network Inference: Software like PhyloNet or similar for modeling reticulate evolution [60].
    • Dsuite: A commonly used tool for calculating the D-statistic (ABBA-BABA test) and related metrics to detect introgression [60].
    • Gene Tree-Species Tree Analysis: ASTRAL for species tree inference from gene trees, with awareness of its limitations [60].

Methodology:

  • Data Preparation and Gene Tree Estimation:
    • Assemble sequencing reads and predict coding sequences [46].
    • Identify orthologous loci across all sampled taxa using tools like OrthoMCL or BUSCO [46].
    • For each orthologous locus, infer an individual gene tree using maximum likelihood or Bayesian methods.
  • Introgression Detection with Dsuite:

    • Use the full set of gene trees or a variant call format (VCF) file as input for Dsuite.
    • Run the Dtrios analysis to calculate D-statistics for all possible triplets of species (P1, P2, P3). A significant D-statistic indicates gene flow between P2 and P3 relative to P1.
    • Visualize the results with the Bite command to infer introgression networks and identify the major donor and recipient lineages [60].
  • Phylogenetic Network Inference:

    • Using the gene tree set, infer a phylogenetic network that accounts for both incomplete lineage sorting (ILS) and hybridization using a method such as SNaQ (implemented in PhyloNet) [60].
    • This network will provide a visual and quantitative representation of hypothesized introgression events, including estimated hybridization parameters (γ).

G Start Start: Multi-locus Genomic Data A 1. Infer Gene Trees Start->A B 2. Detect Introgression (e.g., Dsuite) A->B C 3. Infer Phylogenetic Network B->C D Output: Reticulate Phylogeny C->D

Diagram 1: Introgression Detection Workflow.

Protocol 2: Correcting for Introgression-Induced Bias in Divergence Time Estimation

Objective: To implement a divergence time estimation protocol that accounts for the presence of identified introgression, thereby mitigating the biases outlined in Table 1.

Materials:

  • Input from Protocol 1: A well-supported phylogenetic network or a set of introgressed loci.
  • Computational Tools:
    • Coalescent-based Divergence Time Estimation: Software like *BEAST (BEAST2) or MCMCtree (PAML) that can incorporate a species tree and clock models. Note that *BEAST has been shown in simulations to sometimes outperform summary approaches under certain ghost introgression scenarios [17].
    • Fossil Calibrations: Reliable, externally sourced fossil dates to calibrate the molecular clock [21].

Methodology:

  • Model and Data Partitioning:
    • Approach A (Using the Network): If software allows, use the phylogenetic network inferred in Protocol 1 as the underlying population history model for divergence time analysis, specifying the hybridization nodes and parameters (γ).
    • Approach B (Data Filtering): If using a tree-based model, partition the genomic data. Perform separate divergence time analyses on (i) the full dataset, (ii) a dataset excluding loci with strong evidence of introgression, and (iii) a dataset consisting only of putative introgressed loci. Compare the results to assess robustness.
  • Divergence Time Analysis with *BEAST:

    • Prepare an XML configuration file specifying the multispecies coalescent model, a relaxed molecular clock, and fossil calibration points.
    • Input the species tree topology (or network, if supported) and the multiple sequence alignment of orthologous loci.
    • Run a Markov Chain Monte Carlo (MCMC) analysis for a sufficient number of generations to ensure convergence and effective sample sizes (ESS) > 200 for all key parameters.
  • Bias Assessment and Interpretation:

    • Compare the divergence time estimates from the model accounting for introgression with those from a naive model that assumes a purely bifurcating tree.
    • Contextualize the direction of any observed shift in estimated times relative to the predictions in Table 1. For example, if ghost introgression is suspected and the corrected estimate is younger than the naive estimate, this is consistent with the predicted overestimation bias.

G Start Input: Reticulate Phylogeny A Define Evolutionary Model (e.g., MSC with Network) Start->A B Apply Fossil Calibrations A->B C Run MCMC Analysis (e.g., *BEAST) B->C D Compare Time Estimates (With vs. Without Introgression) C->D E Output: Bias-Corrected Divergence Time D->E

Diagram 2: Bias Correction Workflow.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Materials and Analytical Tools for Introgression Research

Item/Reagent Function/Application Example/Note
Orthologous Loci Set Serves as the fundamental data unit for phylogenetic and coalescent analysis. Generated via pipelines like BUSCO (for conserved single-copy orthologs) or OrthoMCL [46].
Dsuite Software Detects and quantifies introgression from genomic data using Patterson's D and related statistics. Critical for initial screening and identifying donor-recipient pairs [60].
Phylogenetic Network Software (e.g., PhyloNet) Models evolutionary history as a network, explicitly inferring hybridization events and parameters. Used to move beyond a strictly bifurcating tree assumption [60].
Coalescent-based Dating Software (e.g., *BEAST) Estimates species divergence times under the multispecies coalescent model. Can be more robust than summary methods like ASTRAL under certain introgression scenarios [17].
Fossil Calibration Data Provides absolute time constraints for molecular clock models. Essential for translating coalescent units into geological time; must be carefully vetted [21].

The Role of Phylogenetic Network Tools like PhyloNet for Model Comparison

Phylogenetic networks are essential computational tools for modeling evolutionary histories that involve non-tree-like processes such as hybridization, introgression, and horizontal gene transfer. Unlike traditional phylogenetic trees that strictly represent divergent evolution, phylogenetic networks extend the tree model by incorporating horizontal edges that capture the inheritance of genetic material through gene flow [61]. In the context of ASTRAL species tree estimation, which primarily focuses on incomplete lineage sorting (ILS), accounting for introgression is crucial as reticulation events can significantly distort species tree inferences [62]. PhyloNet, a software package specifically designed for inferring and analyzing phylogenetic networks, provides researchers with a suite of methods for comparing evolutionary models that account for both ILS and reticulation [61].

The multispecies network coalescent provides the statistical foundation for these methods, explaining gene tree incongruence through both ILS and reticulation [61]. PhyloNet enables researchers to compare different evolutionary models through maximum parsimony, maximum likelihood, and Bayesian inference, allowing for statistical comparison of phylogenetic networks with varying complexities [61]. This model comparison capability is particularly valuable for researchers investigating species boundaries, gene flow patterns, and adaptive introgression in empirical datasets.

Methods for Phylogenetic Network Inference in PhyloNet

PhyloNet implements three principal statistical frameworks for phylogenetic network inference, each with distinct strengths and applications for model comparison. The choice among these methods depends on the research question, data type, and computational resources.

Table 1: Phylogenetic Network Inference Methods in PhyloNet

Method Statistical Framework Input Data Optimization Criteria Model Parameters Estimated
InferNetwork_MP Maximum Parsimony Gene tree topologies Minimize deep coalescences (MDC) Topology, inheritance probabilities
InferNetwork_ML Maximum Likelihood Gene tree topologies or sequences Multispecies network coalescent likelihood Topology, branch lengths, inheritance probabilities
InferNetwork_BI Bayesian Inference Gene tree topologies or sequences Posterior probability with RJMCMC Topology, branch lengths, inheritance probabilities, model complexity
Maximum Parsimony Approach

The Maximum Parsimony method in PhyloNet employs the Minimizing Deep Coalescences (MDC) criterion, which was originally developed for species tree inference and later extended to phylogenetic networks [61]. This method identifies the species network that requires the smallest number of deep coalescences when reconciling the input gene trees. The MDC criterion calculates the number of extra lineages on a species tree branch as the difference between the numbers of lineages entering and exiting that branch, with the total representing the sum across all branches [61].

The MDC approach is particularly useful for initial exploratory analyses and when working with large datasets where computational efficiency is a concern. However, this method has limitations: it does not estimate branch lengths or other numerical parameters beyond topology and inheritance probabilities, and it is not statistically consistent for species trees, particularly when branches are very short [61]. Implementation in PhyloNet requires the user to specify the maximum number of reticulation events a priori, and the analysis proceeds through a local search heuristic.

Maximum Likelihood Framework

Maximum Likelihood inference in PhyloNet is based on the multispecies network coalescent, which provides a probabilistic model of gene tree evolution within the branches of a phylogenetic network [61]. This framework enables simultaneous estimation of network topology, branch lengths (in coalescent units), and inheritance probabilities, which represent the proportion of genetic material inherited from each parental population at hybridization nodes.

The likelihood calculation accounts for both deep coalescence and reticulation events, making it particularly suitable for comparing different evolutionary scenarios. However, computing the exact likelihood of a phylogenetic network is computationally intensive, which led to the development of a pseudolikelihood approximation (InferNetworkMPL) for larger datasets [61]. PhyloNet incorporates model comparison techniques such as cross-validation (InferNetworkML_CV) and information criteria (AIC, BIC, AICc) to help select the appropriate level of model complexity.

Bayesian Inference

Bayesian inference in PhyloNet addresses key limitations of maximum likelihood approaches, particularly the tendency to overfit data with increasingly complex models [61]. Through reversible-jump Markov Chain Monte Carlo (RJMCMC), this method samples from the posterior distribution of phylogenetic networks while varying the number of reticulation events, thus enabling model comparison between networks of different complexities.

The Bayesian framework automatically penalizes model complexity, helping to avoid overparameterization while providing credible intervals for parameter estimates. This approach is particularly valuable for hypothesis testing, such as comparing a pure divergence model against various reticulation scenarios. However, Bayesian inference in PhyloNet is computationally intensive and may require substantial runtimes for convergence with empirical datasets.

Experimental Protocols for Model Comparison

Workflow for Phylogenetic Network Comparison

The following diagram illustrates the comprehensive workflow for comparing phylogenetic models using PhyloNet:

G Start Start DataCollection Collect multi-locus sequence data Start->DataCollection End End GeneTreeEstimation Estimate gene trees for each locus DataCollection->GeneTreeEstimation DataFormatting Format input files for PhyloNet GeneTreeEstimation->DataFormatting NullModel Infer species tree (null model) using ASTRAL or similar method DataFormatting->NullModel NetworkInference Infer phylogenetic networks with varying reticulation numbers NullModel->NetworkInference ModelSelection Compare models using information criteria or tests NetworkInference->ModelSelection TopologyAnalysis Analyze network topology and reticulation patterns ModelSelection->TopologyAnalysis ParameterEstimation Estimate inheritance probabilities and branch lengths TopologyAnalysis->ParameterEstimation BiologicalValidation Interpret biological significance of supported reticulations ParameterEstimation->BiologicalValidation BiologicalValidation->End

Input Data Preparation Protocol

Gene Tree Estimation

  • Extract sequence alignments for multiple unlinked loci from genomic data
  • For each locus, estimate gene trees using methods such as RAxML, MrBayes, or BEAST
  • Account for gene tree uncertainty by using bootstrap replicates or posterior samples
  • Root gene trees using appropriate outgroups or midpoint rooting when necessary

Data Formatting for PhyloNet

  • Convert gene trees to Newick format and combine into a single NEXUS file
  • For multi-individual datasets, create a taxon assignment block mapping individuals to species
  • Specify command blocks with appropriate parameters for the chosen inference method
  • For sequence-based analysis, prepare multiple sequence alignments in PHYLIP or NEXUS format
Model Comparison Implementation

Maximum Likelihood Model Comparison

  • Run InferNetwork_ML with different numbers of reticulations (start with 0-5 events)
  • Use InferNetwork_ML_CV for cross-validation to assess model fit
  • Calculate AIC, BIC, and AICc values for each model:
    • AIC = 2k - 2ln(L), where k is number of parameters and L is likelihood
    • BIC = kln(n) - 2ln(L), where n is number of gene trees
    • Compare values across models, with lower scores indicating better fit
  • Perform likelihood ratio tests for nested models

Bayesian Model Comparison

  • Run InferNetwork_BI with RJMCMC to sample across model space
  • Set appropriate priors on network topology and inheritance probabilities
  • Run multiple independent chains to assess convergence
  • Calculate Bayes factors for model comparison: BF = (Posterior Odds Model1/Model2) / (Prior Odds Model1/Model2)
  • Interpret Bayes factors using standard scales (e.g., BF > 10 strong evidence)

Quantitative Comparison of Phylogenetic Models

Model Selection Criteria

Table 2: Model Comparison Metrics for Phylogenetic Networks

Criterion Calculation Method Interpretation Advantages Limitations
Akaike Information Criterion (AIC) AIC = 2k - 2ln(L) Lower values indicate better fit; differences >2 considered meaningful Asymptotically unbiased for prediction error Less penalty for complexity than BIC
Bayesian Information Criterion (BIC) BIC = kln(n) - 2ln(L) Lower values indicate better fit; stronger penalty for parameters Consistent for true model selection Can be overly conservative with large n
Cross-Validation Score Mean log-likelihood of held-out data Higher values indicate better predictive accuracy Directly measures predictive performance Computationally intensive
Bayes Factor Ratio of marginal likelihoods Values >10 provide strong evidence for one model Incorporates prior information Sensitive to prior specification
Performance Benchmarks

Empirical studies using PhyloNet have established performance benchmarks for model comparison:

Simulation Studies

  • Networks with 1-2 reticulations correctly identified with >80% probability with 20+ gene trees
  • Inheritance probabilities accurately estimated (RMSE < 0.1) with 50+ gene trees
  • Model misidentification rates <5% when using BIC with adequate sample sizes

Computational Requirements

  • Maximum Parsimony: minutes to hours for datasets with <50 taxa and <10 reticulations
  • Maximum Likelihood: hours to days for similar datasets
  • Bayesian Inference: days to weeks, depending on chain convergence

The Scientist's Toolkit: Essential Research Reagents and Materials

Table 3: Key Research Reagents and Computational Tools for Phylogenetic Network Analysis

Tool/Reagent Function/Purpose Implementation Notes
PhyloNet Software Package Primary platform for phylogenetic network inference Java-based; command-line interface; requires Java Runtime Environment
Multi-locus Sequence Data Input for gene tree estimation and network inference Unlinked loci with minimal recombination; 20+ loci recommended for reliable inference
ASTRAL Species tree estimation for null model comparison Useful for establishing baseline tree topology before testing reticulation hypotheses
Dendroscope Visualization of phylogenetic networks Supports extended Newick format output from PhyloNet; enables interactive network exploration
Gene Tree Estimation Software Generate input gene trees for summary methods RAxML (maximum likelihood), MrBayes (Bayesian), IQ-TREE (model selection)
High-Performance Computing Cluster Computational resources for intensive analyses Essential for Bayesian inference and large datasets with multiple runs

Application to Introgression Research

Case Study: Genomic Introgression in Chinese Wingnuts

A recent study on three species of Pterocarya (Chinese wingnuts) demonstrates the application of phylogenetic network tools for detecting adaptive introgression [37]. Researchers identified candidate genes for environmental adaptation (PIEZO1, WRKY39, VDAC3, CBL1, and RAF) and detected past gene introgression between P. hupehensis and P. macroptera that promoted environmental adaptation [37]. The introgressed regions showed lower genetic load and higher genetic diversity compared to the rest of the genome, providing evidence for the adaptive value of introgression.

This case study illustrates how phylogenetic network methods can complement population genomic approaches by providing an evolutionary framework for interpreting patterns of introgression. The interaction between phylogenetic network analysis and genome-wide scans for introgression is depicted below:

G cluster_1 Genomic Data Collection cluster_2 PhyloNet Analysis cluster_3 Integration & Validation Start Start WholeGenome Whole genome sequencing of multiple individuals Start->WholeGenome End End GeneAnnotation Gene annotation and functional characterization WholeGenome->GeneAnnotation PopulationSNPs SNP calling and genotype estimation GeneAnnotation->PopulationSNPs NetworkModel Infer phylogenetic network with PhyloNet PopulationSNPs->NetworkModel GWAS Genome-wide association studies (GWAS) PopulationSNPs->GWAS IntrogressionTest Test specific introgression hypotheses NetworkModel->IntrogressionTest CompareTrees Compare network model to species tree IntrogressionTest->CompareTrees CompareTrees->GWAS IntrogressedRegions Identify introgressed genomic regions CompareTrees->IntrogressedRegions GWAS->IntrogressedRegions AdaptiveGenes Detect adaptively introgressed genes GWAS->AdaptiveGenes IntrogressedRegions->AdaptiveGenes AdaptiveGenes->End

Protocol for Integrating ASTRAL with PhyloNet

Comparative Framework Setup

  • First, estimate the species tree using ASTRAL from the same gene trees
  • Use the ASTRAL tree as a null model representing pure divergence with ILS
  • Compare the ASTRAL topology with PhyloNet networks using tree distance metrics
  • Calculate the likelihood of the ASTRAL tree under the network coalescent

Introgression Detection Workflow

  • Identify strongly supported discordances between gene trees and ASTRAL species tree
  • Test if specific discordant relationships can be explained by reticulation events
  • Estimate the timing and direction of introgression events using branch lengths
  • Validate identified introgression events with independent population genetic methods

PhyloNet provides an essential toolkit for comparing phylogenetic models that account for both divergence and reticulation. By implementing multiple statistical frameworks and model comparison techniques, it enables researchers to rigorously test hypotheses about introgression and its evolutionary consequences. The integration of PhyloNet with ASTRAL species trees creates a powerful comparative framework for distinguishing between incomplete lineage sorting and introgression, ultimately leading to more accurate reconstructions of evolutionary history. As genomic datasets continue to grow in size and complexity, these model comparison approaches will become increasingly important for understanding the network-like aspects of evolution.

The study of evolutionary history, particularly in the presence of hybridization and introgression, requires methodological approaches that can accurately discern the complex relationships between species. While species tree estimation methods like ASTRAL provide a robust framework for inferring evolutionary relationships from genomic data, they operate under the assumption of a purely diverging evolutionary history without gene flow. D-statistics (ABBA-BABA tests) offer a powerful, SNP-based approach for detecting introgression but come with critical assumptions that may be violated in certain evolutionary contexts, particularly when analyzing divergent species or in the presence of homoplasy [19].

This protocol details a complementary framework that integrates tree-based methods with D-statistics to provide a more comprehensive analysis of introgression. By leveraging the strengths of both approaches—gene tree frequency analysis from tree-based methods and pattern-based allele frequency analysis from D-statistics—researchers can achieve a more robust detection and verification of past introgression events. The methodologies outlined here are specifically contextualized within ASTRAL-based species tree estimation research, providing practical guidance for implementation and interpretation.

Table 1: Core Methodological Comparison

Feature D-Statistics Tree-Based Methods Integrated Approach
Primary Basis SNP patterns (ABBA/BABA) Gene tree topologies Both tree topologies and SNP patterns
Key Assumptions No homoplasy, equal substitution rates Models sequence evolution Cross-verification across methods
Strengths Simple, fast computation Robust to different evolutionary rates Comprehensive introgression detection
Limitations Problematic with divergent species Computational intensity Requires multiple analytical steps
Introgression Signal Allele frequency asymmetry Tree topology frequency asymmetry Convergent evidence

Theoretical Foundation and Quantitative Framework

D-Statistics: Principles and Limitations

The D-statistic operates on a four-taxon system (P1, P2, P3, Outgroup) by comparing the frequencies of ABBA and BABA SNP patterns, where 'A' represents the ancestral allele and 'B' the derived allele. A significant deviation from the expected equal distribution of these patterns indicates introgression between P2 and P3 (ABBA excess) or P1 and P3 (BABA excess). However, this approach assumes identical substitution rates for all species and the absence of homoplasies (independent mutations at the same site) [19]. These assumptions are frequently violated when analyzing divergent species, where multiple substitutions at the same site can create misleading signals.

Tree-Based Methods: Complementing D-Statistic Limitations

Tree-based introgression detection analyzes the frequencies of gene tree topologies inferred from sequence alignments across the genome. This approach provides a valuable complement to D-statistics because phylogenetic methods based on sequence alignments can account for varying evolutionary rates and multiple substitutions through explicit evolutionary models [19]. The core principle involves identifying asymmetry in the distribution of alternative phylogenetic topologies for species trios, similar to the asymmetry detection in D-statistics, but with greater robustness to conditions that may produce misleading results in SNP-based tests.

Table 2: Quantitative Decision Framework for Method Selection

Analysis Scenario Recommended Primary Method Supplementary Method Key Rationale
Recently diverged species D-Statistics Tree-based verification D-statistic assumptions likely valid
Highly divergent species Tree-based methods D-statistics Tree methods handle homoplasy better
Whole-genome data available Both methods ASTRAL species tree Maximum evidence integration
Limited computational resources D-Statistics Targeted tree analysis D-statistics less computationally intensive
Complex evolutionary scenarios Tree-based methods D-statistics + PhyloNet Model-based approaches more flexible

Experimental Protocols

Protocol 1: Whole-Genome Alignment Processing and Alignment Block Selection

Purpose: To extract suitable alignment blocks from whole-genome alignments for phylogenetic tree inference.

Materials and Reagents:

  • Whole-genome alignment file (HAL or MAF format)
  • Sequence data for target species and outgroup
  • High-performance computing resources

Methodology:

  • Alignment Inspection: Examine the whole-genome alignment structure using command-line tools (e.g., less -S filename.maf). MAF files typically contain alignment blocks initiated with "a" and consisting of sequences marked with "s" [19].
  • Block Extraction: Use specialized scripts (e.g., Python-based extraction tools) to isolate alignment blocks with optimal properties:
    • Length: 1,000 bp (balance between informativeness and recombination probability)
    • Completeness: One sequence per species for all analyzed taxa
    • Minimal missing data
  • Recombination Filtering: Quantify and filter alignment blocks based on signals of within-alignment recombination, removing those with strongest recombination signals.
  • Quality Assessment: Select blocks with sufficient informative sites while excluding those with excessive gaps or missing data.

Critical Step: Ensure each selected alignment block contains sequences for all analyzed species, including the outgroup, to enable proper phylogenetic reconstruction.

Protocol 2: Gene Tree Inference Pipeline

Purpose: To generate a set of gene trees from extracted alignment blocks for subsequent introgression analysis.

Materials and Reagents:

  • Extracted alignment blocks (from Protocol 1)
  • Phylogenetic software: IQ-TREE v.2, PAUP*
  • Computing cluster or high-performance workstation

Methodology:

  • Model Selection: For each alignment block, determine the optimal substitution model using ModelFinder in IQ-TREE, which implements Bayesian Information Criterion for model selection.
  • Tree Inference: Execute maximum likelihood tree estimation for each alignment block using IQ-TREE with the command: iqtree2 -s alignment_block.phy -m BEST -bb 1000 [19].
  • Support Assessment: Generate branch support values using ultrafast bootstrap approximation (1000 replicates recommended).
  • Tree Collection: Compile all inferred gene trees into a single file for subsequent analysis, ensuring consistent taxon naming across trees.
  • Quality Filtering: Remove poorly supported trees (e.g., average bootstrap support <70%) from downstream analyses.

Troubleshooting: For large datasets, consider parallelization across multiple CPU cores using -nt AUTO flag in IQ-TREE to optimize computation time.

Protocol 3: Species Tree Estimation with ASTRAL

Purpose: To infer the primary species tree from the set of gene trees, representing the dominant evolutionary relationships.

Materials and Reagents:

  • Set of gene trees (from Protocol 2)
  • ASTRAL software package
  • Java runtime environment

Methodology:

  • Input Preparation: Compile all gene trees into a single Newick format file, with one tree per line.
  • ASTRAL Execution: Run species tree estimation using the command: java -jar ~/software/Astral/astral.5.7.8.jar -i input_gene_trees.tre -o species_tree.tre [19].
  • Support Quantification: Record local posterior probabilities for each branch, representing support for the inferred species relationships.
  • Output Interpretation: The resulting species tree represents the predominant evolutionary history across the genome, which serves as a reference for detecting discordant relationships indicative of introgression.

Note: ASTRAL estimates the species tree that agrees with the largest number of gene trees, making it robust to incomplete lineage sorting but not specifically designed to detect introgression.

Protocol 4: Tree-Based Introgression Detection

Purpose: To identify introgression events by analyzing gene tree topology frequencies relative to the species tree.

Materials and Reagents:

  • Gene tree set (from Protocol 2)
  • ASTRAL species tree (from Protocol 3)
  • Custom scripts for topology frequency analysis

Methodology:

  • Topology Frequency Analysis: Count occurrences of all distinct gene tree topologies in the dataset.
  • Asymmetry Detection: For each species trio (P1, P2, P3), identify significant asymmetry in frequencies between the two minor topologies, analogous to the ABBA-BABA principle.
  • Statistical Testing: Apply binomial or multinomial tests to assess whether observed topology asymmetries deviate significantly from expected distributions under a no-introgression model.
  • Introgression Inference: Interpret significant asymmetries as evidence of historical introgression between the species that appear as sister taxa in the overrepresented minor topology.

Interpretation: This approach mirrors the logic of D-statistics but operates on tree topologies rather than individual SNPs, making it more robust to homoplasy and rate variation.

Protocol 5: D-Statistic Calculation and Interpretation

Purpose: To implement SNP-based introgression detection as a complementary approach to tree-based methods.

Materials and Reagents:

  • Whole-genome alignment or variant call format (VCF) files
  • Population genetics software (e.g., ADMIXTOOLS, ANGSD)
  • Outgroup sequence data

Methodology:

  • Variant Calling: If starting from aligned sequences, identify variable sites and determine ancestral/derived states using the outgroup.
  • Pattern Counting: For each four-taxon set (P1, P2, P3, Outgroup), count ABBA and BABA patterns across the genome.
  • D-Statistic Calculation: Compute the D-statistic using the formula: D = (∑(ABBA - BABA)) / (∑(ABBA + BABA)) [19].
  • Significance Testing: Assess statistical significance using block jackknife or bootstrap resampling to generate confidence intervals.
  • Result Interpretation: Significant positive D-values suggest introgression between P2 and P3, while significant negative values suggest introgression between P1 and P3.

Limitation Note: Remember that D-statistics assume minimal homoplasy and equal substitution rates, which may be violated in divergent species, necessitating verification with tree-based methods.

Protocol 6: Phylogenetic Network Modeling with PhyloNet

Purpose: To implement model-based phylogenetic network inference that explicitly accounts for both divergence and introgression events.

Materials and Reagents:

  • Gene tree set (from Protocol 2)
  • PhyloNet software package
  • Java runtime environment

Methodology:

  • Model Selection: Compare different network models with varying numbers of introgression events using information criteria (AIC/BIC).
  • Network Inference: Execute phylogenetic network estimation using the command: java -jar ~/software/PhyloNet/PhyloNet.jar input_file.nex [19].
  • Model Comparison: Statistically compare the fit of network models with and without proposed introgression events to assess support for introgression.
  • Visualization: Generate phylogenetic networks that display both vertical descent and horizontal introgression events.

Advantage: PhyloNet provides a model-based framework for jointly estimating species relationships and introgression events, offering a statistical comparison of alternative evolutionary scenarios.

Integrated Data Visualization and Workflow

Analytical Workflow Diagram

G WGA Whole-Genome Alignment AB Alignment Blocks (1,000 bp) WGA->AB GT Gene Tree Inference (IQ-TREE) AB->GT ST Species Tree Estimation (ASTRAL) GT->ST TID Tree-Based Introgression (Topology Frequencies) GT->TID PN PhyloNet Network Inference GT->PN ST->TID INT Integrated Introgression Assessment TID->INT DSTAT D-Statistic Analysis (ABBA-BABA) DSTAT->INT PN->INT

Method Integration Logic

G DS D-Statistic Results CON Consistent Signal? DS->CON TB Tree-Based Results TB->CON YES Strong Introgression Evidence CON->YES Yes NO Investigate Method Assumptions CON->NO No DIV Consider Divergent Evolution Scenario NO->DIV HOM Check for Homoplasy NO->HOM

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools for Integrated Introgression Analysis

Tool Name Primary Function Application Context Key Parameters
IQ-TREE v.2 Maximum likelihood phylogenetic inference Gene tree estimation from sequence alignments Substitution model, branch support, tree search
ASTRAL Species tree estimation from gene trees Primary species tree inference under multispecies coalescent Local posterior probabilities, quartet scores
PhyloNet Phylogenetic network inference Modeling both divergence and introgression events Number of reticulations, model comparison
PAUP* General phylogenetic analysis Alternative tree inference and phylogenetic operations Parsimony, likelihood, distance methods
ADMIXTOOLS D-statistic calculation SNP-based introgression detection ABBA/BABA counts, jackknife confidence intervals
FigTree Phylogenetic tree visualization Tree visualization and manipulation Tree display, annotation, export formats

Table 4: Key Analytical Metrics and Interpretation Guidelines

Metric Calculation Interpretation Threshold Biological Meaning
D-Statistic (ABBA - BABA) / (ABBA + BABA) |D| > 2-3 standard errors Significant allele sharing asymmetry
Local Posterior Probability ASTRAL branch support Value > 0.95 Strong support for species relationship
Gene Tree Topology Frequency Count of specific topology / total trees Significant asymmetry (binomial p < 0.01) Introgression between sister taxa in overrepresented topology
Quartet Support Proportion of gene trees supporting quartet Value > 0.80 Strong gene tree support for relationship

Conclusion

ASTRAL stands as a powerful and statistically consistent method for species tree estimation, even in the face of pervasive gene flow and ghost introgression. Its quartet-based approach provides robustness against certain forms of genealogical discordance, though users must remain vigilant about its interactions with complex introgression scenarios, particularly those affecting divergence time estimates. The future of phylogenomics in biomedical and evolutionary research lies in the integrated use of multiple methods—combining the scalability of ASTRAL with the co-estimation power of *BEAST and the explicit network modeling of PhyloNet. This multi-pronged strategy is crucial for accurately reconstructing evolutionary histories, which in turn provides a reliable framework for studying the genetic basis of adaptation, disease susceptibility, and trait evolution in naturally hybridizing systems.

References