Selecting an appropriate nucleotide substitution model is a critical, yet often overlooked, step in phylogenetic analysis that directly impacts the accuracy of inferred evolutionary relationships, divergence times, and population histories. This article provides a comprehensive guide for researchers and scientists, covering the foundational principles of substitution models, current methodologies and software for model selection, strategies for troubleshooting complex datasets, and advanced techniques for model validation. By synthesizing the latest evidence and best practices, this guide empowers professionals in drug development and biomedical research to make informed decisions, avoid common pitfalls, and enhance the reliability of their molecular evolutionary analyses.
A substitution model is a mathematical description of the process by which one character in a molecular sequence (such as a nucleotide or amino acid) replaces another over evolutionary time. These models are fundamental to probabilistic methods in molecular evolutionary analysis, including phylogenetic tree reconstruction, ancestral sequence reconstruction, and estimating evolutionary parameters [1]. At their core, substitution models are based on Markov processes, which describe systems that transition between states with probabilities dependent only on the current state [2].
A Markov process is a stochastic process that satisfies the Markov property, meaning that the future state of the system depends only on its present state, not on the sequence of events that preceded it [2].
In molecular evolution, the "states" are the character states in a sequence. For nucleotide sequences, the four states are A, C, G, and T/U. The Markov process describes the probability of a nucleotide substituting for another at a given site over a time period [2].
The core of a substitution model is the instantaneous rate matrix (Q). Its off-diagonal elements Q~ij~ represent the instantaneous rate of change from state i to state j, and the diagonal elements Q~ii~ are defined so that each row sums to zero.
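To make the definition concrete, the sketch below (ours, not from the cited sources) builds a K80 rate matrix in numpy and checks the defining property that every row sums to zero; setting kappa to 1 recovers JC69.

```python
import numpy as np

def k80_rate_matrix(kappa=1.0):
    """Instantaneous rate matrix Q for the K80 model (state order A, C, G, T).

    Transitions (A<->G, C<->T) occur at rate kappa relative to
    transversions (rate 1); kappa=1 recovers JC69. Diagonals are set
    so that every row sums to zero.
    """
    A, C, G, T = range(4)
    Q = np.ones((4, 4))           # transversion rate 1 everywhere...
    Q[A, G] = Q[G, A] = kappa     # ...except the two transition pairs
    Q[C, T] = Q[T, C] = kappa
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))   # rows must sum to zero
    return Q

Q = k80_rate_matrix(kappa=4.0)
print(np.allclose(Q.sum(axis=1), 0.0))   # True: a valid rate matrix
```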
Substitution models provide the statistical foundation for calculating the probability of observing sequence data given a phylogenetic tree. They are essential for obtaining accurate phylogenetic inferences, as they directly influence the reliability of resulting trees and downstream analyses [3] [1]. Using an inappropriate model can lead to incorrect evolutionary conclusions.
The statistical selection of a best-fit model is routine in phylogenetics. The standard methodology involves computing the likelihood of the data under each candidate substitution model and comparing the candidates with information criteria such as AIC, AICc, and BIC [4] [3]:
Comparative studies have demonstrated that the Bayesian Information Criterion (BIC) consistently outperforms both AIC and the corrected Akaike Information Criterion (AICc) in accurately identifying the true nucleotide substitution model [3]. BIC more heavily penalizes extra parameters, which helps avoid overfitting. The choice of software (jModelTest2, ModelTest-NG, or IQ-TREE) does not significantly affect the accuracy of model identification [3].
The effect depends on your dataset's genetic diversity [1]. For datasets with low genetic diversity, model selection has less impact because evolutionary trajectories are clearly defined. For datasets with high genetic diversity, model selection plays a more important role in determining past evolutionary states, especially among states with similar probabilities under neutral evolution [1].
Nucleotide substitution models typically have fewer parameters (e.g., transition/transversion ratio, nucleotide frequencies) that are estimated directly from the study data [1]. Protein substitution models are predominantly empirical; they incorporate pre-estimated parameters (like exchangeability matrices and equilibrium amino acid frequencies) derived from large databases of protein sequences [1].
This protocol describes how to statistically select a best-fit model using standard software tools [3].
Report the selected best-fit model (e.g., GTR+I+G) in your methodology.
This methodology is used in simulation studies to assess how well model selection procedures recover true known models [3].
Data from a comprehensive analysis of 88 simulated datasets showing how frequently each criterion correctly identified the true model [3].
| Selection Criterion | Accuracy in Identifying True Model | Key Characteristics |
|---|---|---|
| Bayesian Information Criterion (BIC) | Consistently Highest | Most heavily penalizes extra parameters, most effective at avoiding overfitting [3]. |
| Akaike Information Criterion (AIC) | Lower than BIC | Derived from frequentist probability; less stringent penalty for complexity [3]. |
| Corrected Akaike Information Criterion (AICc) | Lower than BIC | A corrected version of AIC for small sample sizes; performance similar to AIC [3]. |
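The criteria in the table differ only in their penalty terms: AIC = -2 ln L + 2k, AICc adds a small-sample correction, and BIC = -2 ln L + k ln n. The toy comparison below uses invented likelihood scores to show how BIC's heavier penalty (k ln n rather than 2k) can flip the decision toward the simpler model:

```python
import math

def aic(lnL, k):      return -2 * lnL + 2 * k
def aicc(lnL, k, n):  return aic(lnL, k) + (2 * k * (k + 1)) / (n - k - 1)
def bic(lnL, k, n):   return -2 * lnL + k * math.log(n)

# Hypothetical scores: GTR (k=8) fits slightly better than HKY85 (k=4)
n = 10_000                         # alignment sites
hky = dict(lnL=-25_030.0, k=4)
gtr = dict(lnL=-25_025.0, k=8)

for name, crit in [("AIC", aic), ("BIC", lambda l, k: bic(l, k, n))]:
    best = min(("HKY85", crit(hky["lnL"], hky["k"])),
               ("GTR",   crit(gtr["lnL"], gtr["k"])), key=lambda t: t[1])
    print(name, "prefers", best[0])
# AIC prefers GTR; BIC prefers HKY85 (the 5-unit lnL gain does not
# justify 4 extra parameters under BIC's ln(10000) ~ 9.2 penalty)
```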
Key software tools used for statistically selecting best-fit models of nucleotide substitution [3].
| Software | Primary Function | Notable Features |
|---|---|---|
| jModelTest2 | Nucleotide substitution model selection | Implements AIC, AICc, and BIC for model comparison [3]. |
| ModelTest-NG | Nucleotide and protein model selection | One to two orders of magnitude faster than jModelTest2; comprehensive model set [3]. |
| IQ-TREE | Integrated model selection and tree inference | Performs model selection simultaneously with tree search; very fast [3]. |
| Tool / 'Reagent' | Function in Analysis |
|---|---|
| Multiple Sequence Alignment (MSA) | The fundamental input data; the molecular sequences to be analyzed. |
| ModelTest-NG Software | A fast and comprehensive program for selecting the best-fit substitution model [3]. |
| BIC (Bayesian Information Criterion) | The statistical criterion shown to most accurately identify the true model [3]. |
| Rate Matrix (Q) | The core of the model, defining relative substitution rates between character states [1]. |
| Equilibrium Frequency Parameters (π) | Model parameters for the stationary frequencies of nucleotides or amino acids [1]. |
In molecular evolutionary genetics, a nucleotide substitution model is a mathematical description of how DNA sequences change over evolutionary time. It specifies the rates of substitution between all pairs of nucleotides (A, C, G, T) and the frequencies of each nucleotide in the sequence [5]. The selection of an appropriate substitution model is crucial for obtaining accurate phylogenetic inferences, as it directly influences the reliability of resulting evolutionary trees and all downstream analyses [3].
Substitution models range from simple to highly parameterized complex models. A simple model may assume all nucleotide substitutions are equally likely and that all nucleotides have the same frequency, while complex models account for biological realities such as different rates for transitions versus transversions, variation in nucleotide frequencies, and rate variation across sites [3] [6]. This technical guide explores the spectrum of available models and provides practical guidance for researchers navigating model selection in their phylogenetic analyses.
The following table summarizes the most common nucleotide substitution models, ordered by complexity from simplest to most parameter-rich:
Table 1: Common DNA Substitution Models and Their Characteristics
| Model Name | Parameters | Explanation | Key Reference |
|---|---|---|---|
| JC69 (Jukes-Cantor) | 0 | Equal substitution rates and equal base frequencies. | Jukes and Cantor (1969) [7] |
| F81 | 3 | Equal substitution rates but unequal base frequencies. | Felsenstein (1981) [7] |
| K80/K2P (Kimura 2-parameter) | 1 | Distinguishes between transition and transversion rates; equal base frequencies. | Kimura (1980) [7] [8] |
| HKY85 | 4 | Different transition/transversion rates and unequal base frequencies. | Hasegawa et al. (1985) [7] [8] |
| TN93 (Tamura-Nei) | 5 | Extends HKY85 with different rates for the two types of transitions (A↔G vs. C↔T). | Tamura and Nei (1993) [7] [8] |
| SYM | 5 | Symmetric model with unequal substitution rates but equal base frequencies. | Zharkikh (1994) [7] |
| GTR (General Time Reversible) | 8 | Most general time-reversible model with unequal rates and unequal base frequencies. | Tavaré (1986) [5] [7] |
Each substitution model is defined by its instantaneous rate matrix (Q), which describes the relative rates of change between nucleotide states [5] [9]. For the GTR model (with states ordered A, C, G, T), this matrix takes the form:

$$
Q = \begin{pmatrix}
\cdot & a\pi_C & b\pi_G & c\pi_T \\
a\pi_A & \cdot & d\pi_G & e\pi_T \\
b\pi_A & d\pi_C & \cdot & f\pi_T \\
c\pi_A & e\pi_C & f\pi_G & \cdot
\end{pmatrix}
$$

where parameters a through f represent the relative exchangeability rates between nucleotide pairs, and πA through πT represent the equilibrium base frequencies [5]. The diagonal elements (shown as dots) are set such that each row sums to zero, ensuring mathematical consistency [5] [6].
Simpler models place equality constraints on these parameters. For example, the JC69 model assumes all exchangeability parameters (a-f) are equal and all base frequencies are equal (0.25 each) [9]. The HKY85 model distinguishes only between transitions (A↔G, C↔T) and transversions (all other changes), with two different rate parameters [8].
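As a numerical companion to this description (an illustration, not code from the cited sources; the a-f ordering AC, AG, AT, CG, CT, GT is our assumption), the GTR Q matrix can be assembled from exchangeabilities and frequencies, and time reversibility checked via detailed balance:

```python
import numpy as np

def gtr_rate_matrix(exch, pi):
    """GTR Q from exchangeabilities (a..f for AC, AG, AT, CG, CT, GT)
    and equilibrium frequencies pi (A, C, G, T): Q[i,j] = exch[i,j] * pi[j]."""
    a, b, c, d, e, f = exch
    R = np.array([[0, a, b, c],
                  [a, 0, d, e],
                  [b, d, 0, f],
                  [c, e, f, 0]], dtype=float)   # symmetric exchangeabilities
    Q = R * pi                                   # scale column j by pi_j
    np.fill_diagonal(Q, -Q.sum(axis=1))          # rows sum to zero
    return Q

pi = np.array([0.3, 0.2, 0.2, 0.3])
Q = gtr_rate_matrix([1.0, 4.0, 0.7, 0.5, 4.0, 1.0], pi)

# Time reversibility (detailed balance): pi_i * Q_ij == pi_j * Q_ji
flux = pi[:, None] * Q
print(np.allclose(flux, flux.T))   # True
```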
A comprehensive 2025 study analyzed model selection across 34 real datasets and 88 simulated datasets using three widely-used phylogenetic programs: jModelTest2, ModelTest-NG, and IQ-TREE [3]. The experimental protocol involved:
Dataset Preparation: The 34 real datasets contained multilocus DNA alignments from mitochondrial, nuclear, and chloroplast genomes from diverse animals and plants, with taxa ranging from 13 to 2,872 and alignment lengths from 823 to 25,919 sites. The 88 simulated datasets each contained 100 taxa and alignments 10,000 nucleotides long, generated from 88 random trees with the AliSim software under different nucleotide substitution models [3].
Model Selection Execution: For each dataset, statistical selection of the best-fit nucleotide substitution model was performed using AIC, AICc, and BIC criteria in all three programs, testing all substitution models offered by each software [3].
Consistency Evaluation: The researchers evaluated whether different substitution models selected using different criteria within the same software were similar, with models differing by four or fewer parameters considered similar, and those differing by five or more considered dissimilar [3].
Accuracy Assessment: For simulated datasets with known true models, analysis determined whether best-fit models identified by each program and selection criterion were consistent with the known true model [3].
Statistical Analysis: Chi-squared tests of independence were employed to determine if significant associations existed between programs, selection criteria, and model selection consistency. All statistical analyses were conducted in RStudio using various packages including 'rcompanion' for post hoc analysis [3].
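The consistency rule from the evaluation above (models differing by four or fewer parameters counted as similar) can be sketched directly from the parameter counts in Table 1; the helper below is illustrative and not the study's actual code:

```python
# Free-parameter counts for common models (substitution parameters only,
# as in Table 1); the +G and +I extensions each add one parameter.
PARAMS = {"JC69": 0, "K80": 1, "F81": 3, "HKY85": 4,
          "TN93": 5, "SYM": 5, "GTR": 8}

def n_params(model):
    base, *exts = model.split("+")
    return PARAMS[base] + sum(1 for e in exts if e in ("G", "I"))

def similar(m1, m2, threshold=4):
    """The study's rule: models differing by <= 4 parameters are 'similar'."""
    return abs(n_params(m1) - n_params(m2)) <= threshold

print(similar("HKY85+G", "GTR+G"))   # |5 - 9| = 4  -> True
print(similar("JC69", "GTR+I+G"))    # |0 - 10| = 10 -> False
```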
Table 2: Performance of Information Criteria in Identifying True Model (Based on 88 Simulated Datasets)
| Information Criterion | Accuracy in Identifying True Model | Model Complexity Preference | Key Characteristics |
|---|---|---|---|
| BIC (Bayesian Information Criterion) | Consistently outperformed AIC and AICc | Prefers simpler models with fewer parameters | Derived from Bayesian probability; most heavily penalizes extra parameters [3] |
| AIC (Akaike Information Criterion) | Lower accuracy than BIC | Prefers more complex models | Derived from frequentist probability; moderate parameter penalty [3] |
| AICc (Corrected AIC) | Lower accuracy than BIC; nearly identical to AIC | Prefers more complex models | Corrected version of AIC for small sample sizes; similar performance to AIC [3] |
The comparative analysis demonstrated that the choice of phylogenetic program (jModelTest2, ModelTest-NG, or IQ-TREE) did not significantly affect the ability to accurately identify the true nucleotide substitution model [3]. This indicates researchers can confidently rely on any of these programs for model selection, as they offer comparable accuracy without substantial differences [3].
In addition, the study revealed that model selection by AIC and AICc was nearly identical across most datasets, with differences observed in only one real dataset ('Devitt_2013') in ModelTest-NG and IQ-TREE [3].
Q: When the best-fit model selected by BIC is inconsistent with that selected by AIC/AICc, which should I use?
A: Based on empirical evidence, BIC should be preferred when it conflicts with AIC or AICc. The 2025 study found BIC consistently outperformed both AIC and AICc in accurately identifying the true model, regardless of the program used [3]. BIC's heavier penalty for additional parameters helps prevent overparameterization.
Q: Are the best-fit models selected by BIC typically simpler or more complex than those selected by AIC and AICc?
A: BIC typically selects simpler models with fewer parameters because it more heavily penalizes the addition of extra parameters compared to AIC and AICc [3]. This parsimonious approach often leads to better performance in identifying the true underlying model.
Q: When different software programs suggest different best-fit models, which should I trust?
A: The research indicates that choice of program (jModelTest2, ModelTest-NG, or IQ-TREE) does not significantly affect accurate model identification [3]. You can confidently use results from any of these programs. If they disagree, consider using Bayesian model averaging approaches that account for model uncertainty [10] [11].
Q: How important is model selection for phylogenetic analysis?
A: Extremely important. Different substitution models can change the outcome of phylogenetic analyses [3]. Appropriate model selection is crucial for obtaining accurate phylogenetic inferences, as it directly influences the reliability of resulting trees and downstream analyses [3].
Problem: Inconsistent model selection between criteria. Solution: Prefer the model selected by BIC, which most accurately identifies the true model in simulations; validate with multiple criteria where feasible [3].

Problem: Computational limitations with complex models. Solution: Use efficient software (ModelTest-NG, IQ-TREE) and rely on BIC, which favors simpler, less parameter-rich models [3].

Problem: Uncertainty in model selection. Solution: Consider Bayesian model averaging (e.g., bModelTest in BEAST2), which integrates over candidate models instead of committing to a single one [11].

Problem: Rate variation across sites. Solution: Add the +G (gamma-distributed rate variation) and/or +I (proportion of invariable sites) extensions to the base model [12].
An alternative to selecting a single best-fit model is Bayesian model averaging, which treats the substitution model as a nuisance parameter and integrates over all available models [11]. This approach, implemented in BEAST2's bModelTest package, uses reversible jump MCMC (rjMCMC) to jump between states representing different possible substitution models [11].
This method allows simultaneous estimation of phylogeny while integrating over 203 possible time-reversible nucleotide substitution models (or 31 models when considering only transitions/transversions splits) [11]. Parameter estimates are effectively averaged over different substitution models, weighted by each model's support. The proportion of time the Markov chain spends in a particular model state can be interpreted as the posterior support for that model [11].
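A minimal sketch of that interpretation: given a (hypothetical) model-indicator trace from an rjMCMC run, posterior model support is simply the fraction of post-burn-in samples the chain spent in each model.

```python
from collections import Counter

# Hypothetical sequence of substitution-model states visited by an
# rjMCMC chain (e.g., a bModelTest model-indicator trace), after
# discarding burn-in.
trace = (["HKY85"] * 620) + (["TN93"] * 290) + (["GTR"] * 90)

counts = Counter(trace)
total = sum(counts.values())
support = {m: c / total for m, c in counts.items()}

# Posterior support = proportion of samples spent in each model state
for model, p in sorted(support.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {p:.2f}")   # HKY85: 0.62, TN93: 0.29, GTR: 0.09
```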
More advanced approaches account for heterogeneity in the substitution process across sites in an alignment. The Substitution Dirichlet Mixture Model (SDPM) uses Dirichlet process mixture models to simultaneously estimate the number of partitions, assignment of sites to partitions, and substitution model for each partition [10].
This method addresses the limitation of traditional approaches that assume a homogeneous substitution process across all sites or require a priori partitioning of the alignment [10]. Studies have shown that as many as nine partitions may be required to explain heterogeneity in nucleotide substitution processes across sites in single-gene analyses [10].
Table 3: Essential Software Tools for Nucleotide Substitution Model Selection
| Tool/Resource | Function | Key Features |
|---|---|---|
| IQ-TREE | Phylogenetic inference and model selection | Implements ModelFinder for automatic model selection; supports wide range of substitution models including Lie Markov models [7] |
| jModelTest2 | Nucleotide substitution model selection | Statistical selection of best-fit models using AIC, AICc, and BIC [3] |
| ModelTest-NG | Nucleotide substitution model selection | One to two orders of magnitude faster than jModelTest2; uses same statistical criteria [3] |
| BEAST2 + bModelTest | Bayesian phylogenetic analysis with model averaging | Uses reversible jump MCMC to average over 203 time-reversible substitution models [11] |
| RevBayes | Bayesian phylogenetic inference | Flexible platform for implementing various substitution models and conducting MCMC analyses [9] |
| RStudio + 'rcompanion' | Statistical analysis | Post hoc analysis of model selection consistency and performance [3] |
The spectrum of nucleotide substitution models ranges from the highly simplistic JC69 to the parameter-rich GTR model. Empirical research demonstrates that while choice of software platform (jModelTest2, ModelTest-NG, or IQ-TREE) has minimal impact on model selection accuracy, the choice of statistical criterion is crucial, with BIC consistently outperforming AIC and AICc in identifying the true model [3].
For practitioners, this suggests a workflow beginning with automated model selection in established software, prioritizing BIC when criteria conflict, and considering Bayesian model averaging approaches when model uncertainty remains high. As phylogenetic analyses continue to evolve with larger datasets and more complex evolutionary questions, proper model selection remains a fundamental step in ensuring accurate and reliable phylogenetic inference.
In molecular evolutionary genetics, the rate matrix (Q) and equilibrium frequencies (π) are fundamental components of nucleotide substitution models. These parameters define the mathematical framework for describing how DNA sequences evolve over time, influencing the accuracy of phylogenetic inference, detection of natural selection, and estimation of evolutionary distances [6] [12]. Proper specification and understanding of these components are crucial for researchers and drug development professionals selecting best-fit models for their analyses.
| Problem Area | Specific Issue | Possible Causes | Recommended Solutions |
|---|---|---|---|
| Model Selection & Fitting | Inconsistent best-fit model selection | Using different information criteria (AIC, BIC, AICc) or software [3] | Use BIC for model selection; validate with multiple criteria [3]. |
| | Poor model fit to data | Overly simple model; ignoring rate heterogeneity; incorrect equilibrium frequencies [4] | Use model testing software (jModelTest, ModelTest-NG, IQ-TREE); consider +G or +I extensions [3] [12]. |
| Parameter Estimation | Unstable or biased parameter estimates (e.g., κ, π) | Small sample size; insufficient MCMC convergence; model misspecification [13] | Increase sequence data; run MCMC longer; check prior distributions in Bayesian analysis [13]. |
| | Estimates of π do not match observed base frequencies | Violation of model assumptions (e.g., selection, non-stationarity) [14] | Test for stationarity; consider models that do not assume equilibrium [14]. |
| Technical Implementation | Poor phylogenetic inference (e.g., wrong tree) | Model misspecification; inadequate handling of rate variation across sites [12] | Use GTR+G+I model as starting point; perform rigorous model selection [12]. |
| | Computational bottlenecks with complex models | Large datasets; highly parameterized models (e.g., GTR) [3] | For large datasets, use BIC to select simpler models; use efficient software (IQ-TREE, ModelTest-NG) [3]. |
What is the biological interpretation of the f parameter in +gwF models? The f parameter in generalized weighted frequency (+gwF) models determines the relative influence of the starting versus ending nucleotide states on substitution rates [14]. While a value of f=0.5 is expected under a nearly neutral model with weak selection and drift, evolutionary simulations show this parameter is highly sensitive to violations of model assumptions, including population disequilibrium and mutation rate biases [14]. Consequently, f should typically be estimated as a free parameter rather than fixed a priori, as its biological interpretation is complex and context-dependent [14].
How do equilibrium frequencies (π) relate to the Hardy-Weinberg principle? The Hardy-Weinberg principle describes a state of equilibrium in population genetics where allele and genotype frequencies remain constant from generation to generation in the absence of evolutionary forces [15]. In substitution models, the equilibrium frequencies (π) represent a similar steady-state distribution of nucleotides expected under the model over long evolutionary timescales [6]. Both concepts describe stable distributions, though Hardy-Weinberg applies to genotype frequencies in a population, while π applies to nucleotide frequencies in a sequence evolution model.
Which information criterion (AIC, AICc, or BIC) should I use for model selection? A 2025 comparative analysis demonstrates that the Bayesian Information Criterion (BIC) consistently outperforms AIC and AICc in accurately identifying the true underlying nucleotide substitution model [3]. BIC's stronger penalty for extra parameters helps avoid overfitting. While the choice of software (jModelTest2, ModelTest-NG, IQ-TREE) does not significantly impact accuracy, the choice of criterion does [3].
When should I use a reversible versus non-reversible model? Most common substitution models assume time reversibility (π~i~Q~ij~ = π~j~Q~ji~), which simplifies calculations and is often biologically reasonable [6]. However, non-reversible models are crucial for detecting certain evolutionary events like horizontal gene transfer, as they can capture changes in the substitution rate matrix that occur when a gene moves to a new genomic context with different mutational biases [16].
How do I implement and test different substitution models in practice? Practical implementation involves using phylogenetic software that supports model specification and parameter estimation [13] [12]. The workflow typically involves: (1) specifying priors for model parameters (e.g., κ ~ Uniform(0,10)), (2) defining the Q matrix as a deterministic function of these parameters, (3) using MCMC moves to explore the parameter space, and (4) analyzing posterior distributions to obtain parameter estimates and assess convergence [13].
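A minimal, self-contained sketch of that four-step workflow, using the K80 model, a fixed branch length, and invented pairwise site counts (everything here is illustrative; a real analysis would use RevBayes or BEAST):

```python
import numpy as np

def k80_Q(kappa):
    """K80 rate matrix (A, C, G, T), scaled to one expected substitution
    per unit time under equal base frequencies."""
    Q = np.ones((4, 4))
    Q[0, 2] = Q[2, 0] = kappa   # A<->G transitions
    Q[1, 3] = Q[3, 1] = kappa   # C<->T transitions
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q / (kappa + 2.0)    # mean rate -> 1

def Pmat(Q, t):
    # K80's Q is symmetric under equal frequencies, so eigh gives expm(Q t)
    w, V = np.linalg.eigh(Q)
    return (V * np.exp(w * t)) @ V.T

def log_lik(kappa, d, n_same, n_ts, n_tv):
    P = Pmat(k80_Q(kappa), d)                      # transition probabilities P(d)
    return (n_same * np.log(P[0, 0]) + n_ts * np.log(P[0, 2])
            + n_tv * np.log(P[0, 1] + P[0, 3]))

rng = np.random.default_rng(1)
data = dict(d=0.25, n_same=800, n_ts=150, n_tv=50)  # toy pairwise counts

kappa, samples = 2.0, []
cur_ll = log_lik(kappa, **data)
for _ in range(4000):            # random-walk Metropolis, Uniform(0, 20) prior
    prop = kappa + rng.normal(0.0, 0.5)
    if 0.0 < prop < 20.0:
        prop_ll = log_lik(prop, **data)
        if np.log(rng.uniform()) < prop_ll - cur_ll:
            kappa, cur_ll = prop, prop_ll
    samples.append(kappa)
print(round(float(np.mean(samples[1000:])), 1))      # posterior mean of kappa
```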
What are the key extensions to basic substitution models? Two critical extensions are the +I (proportion of invariable sites) and +G (rate variation across sites) parameters [12]. The GTR+G+I model, which incorporates both general time-reversible rates and these extensions, often provides the best fit to real data, reflecting the complexity of actual evolutionary processes [12].
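The +G extension is usually implemented as Yang's discrete gamma: sites fall into k equal-probability rate categories whose rates are the category means of a Gamma(a, 1/a) distribution (mean 1). The sketch below approximates those category rates by Monte Carlo, as an illustration rather than the exact quantile-based computation:

```python
import random
random.seed(0)

a, k = 0.5, 4   # gamma shape (alpha) and number of rate categories
# Monte Carlo version of the "mean" discretisation: draw from
# Gamma(a, scale=1/a) (mean 1), split the sorted draws into k
# equal-probability bins, and use each bin's mean as that category's rate.
draws = sorted(random.gammavariate(a, 1.0 / a) for _ in range(100_000))
size = len(draws) // k
rates = [sum(draws[i * size:(i + 1) * size]) / size for i in range(k)]

print([round(r, 3) for r in rates])   # slowest categories near 0, fastest around 3
```

Small shape values (a < 1) produce strong rate variation: most sites evolve very slowly while a minority evolve several times faster than average.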
Purpose: To systematically identify the most appropriate nucleotide substitution model for a given DNA sequence alignment to ensure accurate phylogenetic inference [4] [3].
Materials:
Procedure:
iqtree2 -s alignment.phy -m TESTONLY

This command performs model selection only, testing the candidate substitution models and reporting AIC, AICc, and BIC scores in the output [3]; in IQ-TREE 2, the ranking criterion can be changed with `--merit AIC` or `--merit AICc` (BIC is the default).

Purpose: To estimate parameters of different substitution models and assess their impact on phylogenetic tree inference using Bayesian methods [13].
Materials:
Procedure:
1. Specify priors for the model parameters (e.g., `kappa ~ dnUniform(0,10)` for T92; `pi ~ dnDirichlet(1,1,1,1)` for HKY) [13].
2. Define the Q matrix as a deterministic function of these parameters (e.g., `Q := fnHKY(kappa=kappa, baseFrequencies=pi)`).
3. Attach MCMC moves (e.g., `mvSlide` for frequencies, `mvScale` for κ) and monitors.
4. Run the analysis and examine the posterior distributions to obtain parameter estimates and assess convergence [13].
| Category | Item/Software | Specific Function in Analysis |
|---|---|---|
| Model Selection Software | jModelTest2, ModelTest-NG, IQ-TREE [3] | Statistically compares fit of different nucleotide substitution models to a given DNA alignment using AIC, AICc, and BIC. |
| Phylogenetic Inference Packages | RevBayes [13], BEAST [12], RAxML [12], PhyML [12] | Performs phylogenetic tree inference under specified substitution models, often with Bayesian or Maximum Likelihood methods. |
| Sequence Simulation Tools | Seq-Gen, INDELible, EVOLVER [12] | Generates synthetic DNA sequence alignments under a known substitution model for method testing and validation. |
| Model Components | Rate Matrix (Q) [6] [13] | Encodes instantaneous rates of change between nucleotides; the core of any substitution model. |
| | Equilibrium Frequencies (π) [6] [14] | Represents the stable, long-term distribution of nucleotide frequencies the model converges to. |
| | Γ-distributed Rate Variation (+G) [12] | Accounts for the fact that different sites in a sequence evolve at different rates. |
| | Proportion of Invariable Sites (+I) [12] | Accounts for a fraction of sites in a sequence that are completely unable to change. |
What do time-reversibility and stationarity mean in molecular evolution models? Time-reversibility describes a substitution model where the probability of changing from state i to state j is identical to that of changing from j to i when considering the equilibrium frequencies of the bases, satisfying the condition π~i~Q~ij~ = π~j~Q~ji~ [5]. Stationarity means that the substitution process parameters, particularly the equilibrium base frequencies (π), remain constant across the entire phylogenetic tree and over evolutionary time [5] [17].
Why are these assumptions practically important for my phylogenetic analysis? These assumptions dramatically simplify the mathematical complexity and computational burden of phylogenetic inference [5]. Time-reversibility makes the identification of ancestral sequences unnecessary, allowing you to use unrooted trees and re-root analyses without changing the likelihood scores [5]. Stationarity enables the application of consistent evolutionary parameters across all lineages, making statistical inference tractable.
What are the consequences of violating these assumptions in real data analysis? Violating stationarity assumptions (non-stationarity) can lead to systematic errors in tree inference, particularly when base composition differs significantly among lineages [17]. Non-stationary processes can create homoplasy that may be misinterpreted as phylogenetic signal. For time-reversibility violations, the major practical consequence is the inability to use standard modeling approaches, requiring more complex non-reversible models that are computationally intensive [5].
How can I detect violations in my dataset? Statistical tests for stationarity include examining base composition homogeneity across taxa using χ² tests or more sophisticated approaches. For time-reversibility, specialized tests exist that compare observed site patterns with those expected under reversible models. Visualization of base composition trends across lineages can also reveal non-stationarity.
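A naive version of the χ² homogeneity check can be run directly on per-taxon base counts (the numbers below are hypothetical). Note that this test ignores the non-independence of taxa due to shared ancestry, so treat it as a screening diagnostic only:

```python
def chi2_statistic(table):
    """Pearson chi-square statistic for an R x C contingency table (taxa x bases)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    stat = 0.0
    for i, r in enumerate(table):
        for j, obs in enumerate(r):
            exp = rows[i] * cols[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# Base counts (A, C, G, T) per taxon for a hypothetical 4-taxon alignment
counts = [
    [250, 240, 260, 250],   # taxon 1: roughly uniform composition
    [255, 245, 250, 250],   # taxon 2
    [248, 252, 251, 249],   # taxon 3
    [400, 100, 110, 390],   # taxon 4: strongly AT-rich -> suspect
]
stat = chi2_statistic(counts)
df = (len(counts) - 1) * (len(counts[0]) - 1)   # 9 degrees of freedom
print(stat > 16.92)   # True: exceeds the chi2 critical value (df=9, alpha=0.05)
```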
Symptoms
Diagnostic Steps
Solutions
Table 1: Comparison of Substitution Model Performance Under Stationary vs. Non-Stationary Conditions
| Data Characteristic | Stationary Model Performance | Non-Stationary Model Performance |
|---|---|---|
| Homogeneous base composition | Optimal | Similar to stationary |
| Heterogeneous base composition | Potentially misleading | More accurate |
| Computational requirements | Lower | Significantly higher |
| Parameter count | Fewer | Increased (more complex) |
| Implementation availability | Widespread | Limited |
Background: A 2020 study surprisingly demonstrated that even extremely simple substitution models (like JC69) can produce divergence time estimates similar to those from complex models (like GTR+Γ) in many practical phylogenomic datasets [18].
Diagnostic Checklist
Solutions and Recommendations
Table 2: Factors Affecting Robustness of Time Estimates to Model Complexity
| Factor | Effect on Time Estimate Robustness | Practical Implication |
|---|---|---|
| Number of calibrations | Increases robustness | Include multiple high-quality calibrations |
| Taxon sampling | Moderate effect | Dense sampling recommended |
| Sequence length | Limited effect beyond sufficient data | Focus on quality over quantity |
| Rate variation among lineages | Significant effect | Use relaxed clock methods |
| Tree depth | Variable effect | Test sensitivity for your data |
Purpose: To empirically evaluate whether your data violates time-reversibility or stationarity assumptions and determine the impact on phylogenetic inference.
Materials and Reagents
Procedure
Stationarity Testing
Model Comparison
Impact Assessment
Troubleshooting Notes
Table 3: Essential Computational Tools for Testing Evolutionary Assumptions
| Tool/Software | Primary Function | Application Context |
|---|---|---|
| RevBayes | Bayesian phylogenetic inference | Flexible model testing including non-reversible models [9] |
| IQ-TREE | Maximum likelihood phylogenetics | Efficient model testing and comparison |
| PhyloBayes | Bayesian phylogenetics | Non-stationary and non-reversible model implementation |
| PAUP* | Phylogenetic analysis | Compositional homogeneity testing |
| R + phangorn | Statistical analysis | Custom tests and visualization |
Workflow for Testing Evolutionary Assumptions
Q1: What are the practical consequences if I use a standard site-independent model on data that has evolved with epistasis?
Using a standard site-independent model on data with underlying epistasis (where the evolutionary change at one site depends on the state of another site) is a form of model misspecification. The consequences are twofold [19]: the inference can be biased, making the estimated tree systematically less accurate, and the variance of the estimate can increase, so that the alignment effectively contains fewer independent informative sites than its length suggests.
Q2: How can I detect if my dataset suffers from model misspecification due to unmodeled epistasis?
A powerful method for detecting unmodeled features like epistasis is the use of Posterior Predictive Checks [19]. This Bayesian methodology involves fitting the site-independent model to the data, simulating replicate alignments under parameter values drawn from the posterior, and comparing a test statistic computed on the real alignment against its distribution across the simulated replicates; a poor match indicates the model fails to capture features of the data.
Researchers have developed alignment-based test statistics that act as diagnostics for pairwise epistasis for this exact purpose [19].
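The comparison step reduces to a posterior predictive p-value: the fraction of simulated replicates whose test statistic is at least as extreme as the observed one. All numbers below are invented for illustration:

```python
# Observed value of an epistasis test statistic computed on the real
# alignment, and values of the same statistic computed on replicate
# alignments simulated under the fitted (site-independent) model.
observed = 7.8
simulated = [4.1, 5.0, 4.7, 5.5, 6.0, 4.9, 5.2, 5.8, 4.4, 5.1,
             5.6, 4.8, 5.3, 6.2, 5.0, 4.6, 5.9, 5.4, 4.3, 5.7]

# Posterior predictive p-value: how often the model reproduces a value
# at least as extreme as the observed one.
ppp = sum(s >= observed for s in simulated) / len(simulated)
print(ppp)   # 0.0 -> the model never reproduces the data; suspect misspecification
```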
Q3: When selecting a best-fit nucleotide substitution model, does the choice of software or statistical criterion significantly impact the result?
A recent comparative analysis provides clear guidance [3] [20]: the choice of software (jModelTest2, ModelTest-NG, or IQ-TREE) does not significantly affect accuracy, whereas the choice of statistical criterion does, with BIC consistently outperforming AIC and AICc in identifying the true model.
Q4: What are the core problems with the current standard protocol for phylogenetic analysis?
The standard protocol is missing two critical steps that allow model misspecification and researcher confirmation bias to influence results [21]: an assessment of whether the data actually satisfy the assumptions of the candidate models before inference, and a test of model adequacy (absolute goodness of fit) after inference.
Q5: What is the "relative worth" of an epistatic site compared to an independent site?
The concept of "relative worth" (r) quantifies the informational value of an epistatic site analyzed under a misspecified (site-independent) model, relative to a site that truly evolved independently [19]. The value of r defines two scenarios: when r > 0, the epistatic site still carries useful information (though less than a truly independent site), so removing it worsens inference; when r < 0, the site actively misleads the analysis, and removing it improves accuracy [19].
The actual value of r depends on the strength of the epistatic interaction and the specific evolutionary context.
Table 1: Impact of Model Selection Criteria on Identification of True Model
| Information Criterion | Penalization of Parameters | Performance in Identifying True Model | Recommended Use Case |
|---|---|---|---|
| BIC (Bayesian Information Criterion) | Heaviest | Consistently outperformed AIC and AICc [3] [20] | Default choice for model selection |
| AICc (Corrected Akaike IC) | Moderate | Similar to AIC, but less accurate than BIC [3] | Consider for very small sample sizes |
| AIC (Akaike Information Criterion) | Lightest | Less accurate than BIC in simulations [3] | Less recommended for phylogenetic model selection |
Table 2: Consequences of Model Misspecification from Unmodeled Pairwise Epistasis
| Scenario | Impact on Phylogenetic Inference | Effect of Removing Epistatic Sites | Relative Worth (r) |
|---|---|---|---|
| Epistasis introduces bias | Inferred tree is less accurate (systematic error) [19] | Inference improves [19] | r < 0 |
| Epistasis increases variance | Estimator is less stable; effective sites are fewer [19] | Inference worsens (information is lost) [19] | r > 0 |
Protocol 1: Assessing Model Adequacy Using Posterior Predictive Checks for Epistasis
This protocol is used to evaluate whether a standard site-independent model is a good fit for your data or if it fails to account for pairwise epistasis [19].
Protocol 2: A Revised Phylogenetic Protocol to Minimize Bias
This updated protocol adds critical steps to the standard workflow to guard against model misspecification and confirmation bias [21].
Diagram: Revised Phylogenetic Protocol
Diagram: Posterior Predictive Check for Epistasis
Table 3: Essential Software and Statistical Tools for Phylogenetic Model Evaluation
| Tool Name | Type / Category | Primary Function in Model Assessment |
|---|---|---|
| jModelTest2 [3] | Software Package | Statistical selection of best-fit nucleotide substitution model using AIC, AICc, and BIC. |
| ModelTest-NG [3] | Software Package | Faster implementation for nucleotide and amino acid substitution model selection. |
| IQ-TREE [3] | Software Package | Integrated model selection and phylogeny inference, also supports AIC, AICc, and BIC. |
| BIC (Bayesian Information Criterion) [3] [20] | Statistical Criterion | Model selection criterion that penalizes complexity; recommended for its accuracy in identifying the true model. |
| Posterior Predictive Checks [19] | Statistical Methodology | A Bayesian method to assess the absolute goodness-of-fit of a model to a specific dataset. |
| Epistasis Diagnostic Test Statistics [19] | Diagnostic Tool | Alignment-based statistics designed to detect signatures of pairwise interactions for use in posterior predictive checks. |
Selecting the best-fit nucleotide substitution model is a critical step in phylogenetic analysis, as the choice of model can directly influence the reliability of the resulting evolutionary trees and downstream conclusions [22] [3]. For molecular evolutionary genetic research, particularly in fields like drug development where accurate phylogenetic inference can inform understanding of pathogen evolution or drug target relationships, this step is indispensable. Three software tools are currently widely used for this purpose: IQ-TREE, jModelTest2, and ModelTest-NG [22] [3] [23]. This technical support center provides a comparative overview of these tools, detailed troubleshooting guides, and frequently asked questions to assist researchers in implementing robust model selection protocols within their research workflows.
Understanding the performance characteristics and output of each software tool is the first step in selecting and troubleshooting them. The following sections provide a detailed comparison based on recent empirical evaluations.
A comprehensive 2025 study analyzing 34 real and 88 simulated datasets provides key quantitative metrics for comparing these tools. The findings indicate that the choice of software itself does not significantly impact the accuracy of identifying the true model, but the choice of statistical criterion is crucial [22] [3].
Table 1: Software Performance and Characteristics Comparison
| Feature | IQ-TREE (ModelFinder) | jModelTest2 | ModelTest-NG |
|---|---|---|---|
| Primary Use Case | Integrated model selection & tree inference | Dedicated model selection | Dedicated model selection |
| Speed (Relative to jModelTest2) | Very Fast (Faster than ModelTest-NG on synthetic data) [23] | Baseline (Slowest) [23] | Very Fast (1-2 orders of magnitude faster than jModelTest2) [23] |
| Best-Fit Model Accuracy (on simulated DNA) | 70% [23] | 81% [23] | 81% [23] |
| Key Strength | High integration & speed; user-friendly [24] | High accuracy [23] | High speed & accuracy; scalable for large datasets [23] |
Table 2: Impact of Statistical Criterion on Model Selection (2025 Study)
| Information Criterion | Model Identification Accuracy | Tendency in Model Selection | Recommendation |
|---|---|---|---|
| Akaike Information Criterion (AIC) | Lower than BIC [22] | Prefers more complex models [22] | -- |
| Corrected AIC (AICc) | Nearly identical to AIC [22] | Prefers more complex models [22] | Can be used interchangeably with AIC |
| Bayesian Information Criterion (BIC) | Consistently outperforms AIC and AICc [22] [3] | Prefers simpler models; heavier penalty for extra parameters [22] | Preferred for accurate model identification |
The following diagram illustrates the general workflow for statistical selection of the best-fit nucleotide substitution model, applicable to all three software tools.
Table 3: Troubleshooting Common Software Issues
| Problem | Possible Cause | Solution |
|---|---|---|
| IQ-TREE refuses to run, citing a finished checkpoint. | A previous analysis completed successfully, and IQ-TREE prevents overwriting outputs [24]. | Use the -redo option: iqtree -s alignment.phy -m MFP -redo [24]. |
| ModelTest-NG generates an invalid IQ-TREE command. | Model naming conventions differ between software (e.g., JTT-DCMUT+G4 vs. JTTDCMut+G4) [25]. | Manually translate the model name to the syntax required by the target software [25]. |
| Model selection fails or is inconsistent on a large dataset. | jModelTest2 is computationally intensive and may not be optimized for very large alignments [23]. | Switch to ModelTest-NG or IQ-TREE, which are designed for better scalability and speed [23] [24]. |
| AIC and AICc select a more complex model than BIC. | This is expected behavior. BIC more heavily penalizes model complexity [22]. | Prefer the model selected by BIC, as it has been shown to more accurately identify the true generating model [22] [3]. |
The flowchart below helps troubleshoot situations where different software or criteria yield different results, guiding you toward the most reliable outcome.
Q1: I have limited computational resources. Which software should I use? For speed and efficiency, ModelTest-NG or IQ-TREE are the best choices. ModelTest-NG is one to two orders of magnitude faster than jModelTest2 [23]. IQ-TREE's ModelFinder is also known for its computational efficiency [24].
Q2: The model selected by BIC is simpler than the one selected by AIC. Which one should I trust for my analysis? You should trust the model selected by BIC. Empirical evidence shows that BIC consistently outperforms AIC and AICc in accurately identifying the true nucleotide substitution model, even though it tends to select simpler models [22] [3].
Q3: Is it necessary to run both AIC and AICc selection? No, it is generally not necessary. Research shows that the best-fit model selected by AIC and AICc is the same in the vast majority of cases. Their results are nearly identical [22].
Q4: My thesis requires the most accurate model selection possible, not just the fastest. What is the recommended strategy? Based on current evidence, the most robust strategy is to use ModelTest-NG (or jModelTest2) with the BIC criterion. These dedicated tools showed an 81% accuracy in recovering the true model on simulated DNA data, compared to 70% for IQ-TREE's ModelFinder in one study [23]. Furthermore, BIC has been independently validated as the most accurate criterion [22] [3].
Q5: How do I handle a situation where ModelTest-NG suggests a model that IQ-TREE does not recognize?
This is a known issue due to differences in model naming conventions between software packages [25]. The solution is to consult the documentation of your phylogenetic inference tool (e.g., IQ-TREE, RAxML) to find the correct model name string and manually adjust the command. For example, translate JTT-DCMUT+G4 to JTTDCMut+G4 for IQ-TREE [25].
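One pragmatic way to avoid repeated manual edits is a small lookup-based translator. The sketch below is illustrative, not an official mapping: `NAME_MAP` holds only the one pair documented above and would need extending for other mismatched model names.

```python
# Illustrative sketch, not an official mapping: NAME_MAP holds only the one
# documented pair (JTT-DCMUT -> JTTDCMut) and would need extending.
NAME_MAP = {
    "JTT-DCMUT": "JTTDCMut",   # ModelTest-NG spelling -> IQ-TREE spelling [25]
}

def to_iqtree(model: str) -> str:
    """Translate a 'BASE+suffix1+suffix2...' model string to IQ-TREE syntax,
    rewriting only the base model name and leaving suffixes (+G4, +I) intact."""
    base, sep, suffixes = model.partition("+")
    return NAME_MAP.get(base, base) + sep + suffixes

print(to_iqtree("JTT-DCMUT+G4"))  # JTTDCMut+G4
```

Suffixes such as +G4 or +I+G4 pass through unchanged, so only the base-name table needs maintenance.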
This protocol is adapted for use with ModelTest-NG but can be easily modified for other tools.
- -i: Input alignment file.
- -t ml: Use maximum-likelihood tree reconstruction.
- -h ugf: Candidate rate-heterogeneity schemes to evaluate (u = uniform, g = +G, f = +I+G).
- -c 4: Number of CPU cores to use.
- -r 100: Number of bootstraps for tree assessment.

Table 4: Key Software and Data "Reagents" for Model Selection Experiments
| Item Name | Function/Description | Example/Note |
|---|---|---|
| Multiple Sequence Alignment (MSA) | The primary input data; a matrix of aligned nucleotide or amino acid sequences. | Can be generated from raw sequences using aligners like MAFFT or ClustalW [24]. |
| ModelTest-NG Software | Dedicated tool for fast and accurate statistical selection of best-fit substitution models. | Preferred for its balance of high speed and high accuracy [23]. |
| IQ-TREE Software | Integrated tool for model selection (ModelFinder) and subsequent tree inference. | Ideal for a seamless workflow from model selection to tree building [24]. |
| Information Criteria (BIC) | Statistical metrics used to compare and select the best-fit model by balancing fit and complexity. | BIC is the recommended criterion due to its superior performance in model identification [22] [3]. |
| Template Configurations | Pre-defined settings that ensure model selection results are compatible with specific tree-building tools. | ModelTest-NG can output commands for RAxML, IQ-TREE, etc., reducing syntax errors [23] [25]. |
Selecting the correct nucleotide substitution model is a critical step in molecular evolutionary analysis. The right model accurately captures how DNA sequences change over time, influencing the reliability of the resulting phylogenetic tree, while a poor model can lead to incorrect evolutionary inferences [3]. Researchers often rely on information criteria like the Akaike Information Criterion (AIC), its small-sample correction (AICc), and the Bayesian Information Criterion (BIC) to choose the best-fit model from a set of candidates [26] [27].
This guide is designed as a technical support center to help you navigate this selection process. A foundational 2025 study by Li et al. demonstrated that BIC consistently outperforms AIC and AICc in accurately identifying the true nucleotide substitution model [3]. We will delve into the evidence behind this conclusion, provide troubleshooting guides for common issues, and outline best practices for your research.
A comprehensive 2025 analysis of 122 datasets (34 real and 88 simulated) provides the strongest recent evidence for the performance of these criteria [3]. The table below summarizes the key findings.
Table 1: Performance of Information Criteria in Identifying the True Model (Li et al., 2025)
| Information Criterion | Performance in True Model Identification | Key Characteristic | Typical Model Selection Tendency |
|---|---|---|---|
| BIC | Consistently outperformed AIC and AICc [3] | Heavier penalty for model complexity, especially with large n [26] [27] | Simpler, more parsimonious models |
| AIC | Less accurate than BIC in model identification [3] | Focuses on predictive accuracy, lighter penalty [26] [28] | More complex models with more parameters |
| AICc | Less accurate than BIC in model identification [3] | Corrects AIC for small sample size; converges with AIC as n increases [29] | More complex models with more parameters |
The study concluded that the choice of software (jModelTest2, ModelTest-NG, or IQ-TREE) did not significantly affect the ability to identify the true model. The critical factor was the choice of information criterion, with BIC providing superior accuracy [3].
The superior performance of BIC in identifying the true model is not accidental; it is rooted in its mathematical formulation and theoretical goal.
Table 2: Core Theoretical Differences Between AIC and BIC
| Aspect | Akaike Information Criterion (AIC) | Bayesian Information Criterion (BIC) |
|---|---|---|
| Mathematical Formula | AIC = 2k - 2ln(L) [26] | BIC = ln(n)k - 2ln(L) [26] |
| Primary Goal | Selects the model that best approximates reality (best predictive accuracy) [28] | Attempts to identify the true data-generating model [28] |
| Penalty Term | 2k (linear penalty for parameters) [26] | ln(n)k (penalty grows with sample size) [26] |
| Implied Assumption | The "true model" is not in the candidate set; all models are approximations [28] | The "true model" is among the candidates being evaluated [28] |
The key is the penalty term. BIC's penalty, which includes the log of the sample size (ln(n)), is more severe than AIC's, especially as your dataset grows larger. This stronger penalty discourages overfitting by making it harder to justify unnecessary parameters. In phylogenetic simulations where the true model is known, this property allows BIC to more reliably arrive at the correct answer [3].
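The penalty arithmetic is easy to verify directly. Below is a minimal Python sketch of the three criteria from the formulas above; the log-likelihoods and parameter counts are hypothetical, chosen so that the two criteria disagree.

```python
import math

def aic(lnL: float, k: int) -> float:
    """AIC = 2k - 2 ln(L)."""
    return 2 * k - 2 * lnL

def aicc(lnL: float, k: int, n: int) -> float:
    """AICc = AIC + 2k(k+1) / (n - k - 1), the small-sample correction."""
    return aic(lnL, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnL: float, k: int, n: int) -> float:
    """BIC = ln(n) k - 2 ln(L); the penalty grows with alignment length n."""
    return math.log(n) * k - 2 * lnL

# Hypothetical fits of a simpler and a richer model to the same n = 1000 sites:
n = 1000
hky = {"lnL": -5020.0, "k": 4}   # fewer free parameters
gtr = {"lnL": -5014.0, "k": 8}   # slightly better fit, four extra parameters

print("AIC prefers GTR:", aic(**gtr) < aic(**hky))        # True
print("BIC prefers HKY:", bic(**gtr, n=n) > bic(**hky, n=n))  # True
```

With n = 1000, each extra parameter costs ln(1000) ≈ 6.9 points under BIC but only 2 under AIC, which is why BIC rejects GTR's four extra parameters here while AIC accepts them.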
Diagram 1: Decision workflow for selecting between AIC and BIC
When conducting nucleotide substitution model selection, you will interact with several key software tools and statistical concepts. The table below lists these essential "research reagents" and their functions.
Table 3: Key Reagents and Tools for Nucleotide Substitution Model Selection
| Tool / Concept | Function in Model Selection | Key Feature |
|---|---|---|
| jModelTest2 | Calculates best-fit model using AIC, AICc, BIC, etc. [3] | User-friendly GUI and command-line interface [30] |
| ModelTest-NG | Calculates best-fit model using AIC, AICc, BIC, etc. [3] | 1-2 orders of magnitude faster than jModelTest2 [3] |
| IQ-TREE | Performs model selection and tree inference simultaneously [3] | Integrated workflow, often faster [3] |
| BIC (Formula) | BIC = ln(n)k - 2ln(L); Selects models by penalizing complexity [26] | Stronger penalty term than AIC, prefers simpler models [27] |
| AIC (Formula) | AIC = 2k - 2ln(L); Selects models balancing fit and complexity [26] | Tends to favor more complex models than BIC [28] |
| AICc (Formula) | AICc = AIC + (2k(k+1))/(n-k-1); Corrects AIC for small n [29] | Recommended when n/k < 40 [29] |
To ensure the reproducibility of your work or the studies you read, here is a detailed methodology based on the pivotal 2025 study by Li et al.
Objective: To determine the impact of software and information criteria on the accuracy of selecting the true nucleotide substitution model [3].
Datasets: 34 real and 88 simulated alignments (122 in total); for the simulated data the true generating model was known, allowing identification accuracy to be scored [3].
Software & Commands: Model selection was run in jModelTest2, ModelTest-NG, and IQ-TREE (ModelFinder), each under the AIC, AICc, and BIC criteria [3].
Analysis: For each simulated dataset, the model chosen by every software/criterion combination was compared against the known generating model; accuracy was then summarized per criterion and per program [3].
Diagram 2: Experimental protocol for comparing model selection criteria
FAQ 1: The best-fit model selected by BIC is different from the one selected by AIC. Which one should I trust for my phylogenetic analysis? Trust the BIC selection: BIC has been shown to identify the true generating model more accurately than AIC or AICc [3].
FAQ 2: My dataset is relatively small. Should I use AICc instead of AIC or BIC? AICc is the appropriate small-sample correction of AIC and is conventionally recommended when n/k < 40 [29]; however, in the phylogenetic comparisons its selections were nearly identical to AIC's, and BIC remained the more accurate criterion overall [3].
FAQ 3: Does the software I use for model selection (jModelTest2, ModelTest-NG, or IQ-TREE) significantly impact the outcome? No. The 2025 comparison found that all three programs perform equivalently; the choice of information criterion, not the software, drives the result [3].
FAQ 4: BIC always selects a simpler model than AIC. Is it prone to underfitting? Not in practice for this task: in simulations where the true model was known, BIC's heavier ln(n)·k penalty improved, rather than hurt, identification accuracy [3].
Q1: The model selection process is taking a very long time for my large alignment. How can I speed it up?
You can reduce the computational burden by restricting the set of models tested. Use the -mset option to specify only a subset of base models for testing. For amino acid data, the -msub option allows you to test only models relevant to your data type (e.g., -msub nuclear for general sequences or -msub viral for viral sequences) [24]. Furthermore, ensure you are using the multicore version of IQ-TREE and specify the number of CPU cores with the -nt option. For large alignments, using more cores significantly speeds up the analysis. You can use -nt AUTO to let IQ-TREE automatically determine the best number of threads [31].
Q2: How do I interpret the support values from ultrafast bootstrap (UFBoot) in my analysis?
UFBoot support values are less biased than standard bootstrap values. A support value of 95% corresponds roughly to a 95% probability that the clade is true. For UFBoot, you should generally rely on branches with support ≥ 95%. It is also recommended to perform the SH-aLRT test (e.g., by adding -alrt 1000). You can be more confident in a clade if it has SH-aLRT ≥ 80% and UFBoot ≥ 95% [31]. Note that these thresholds are recommended for single-gene trees and may not hold for phylogenomic analyses with concatenated data.
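The two-threshold rule above can be encoded as a simple filter. A sketch, assuming branch labels in the combined `SH-aLRT/UFBoot` form that IQ-TREE writes when both tests are requested:

```python
def clade_confident(sh_alrt: float, ufboot: float) -> bool:
    """Apply the recommended single-gene thresholds: trust a clade when
    SH-aLRT >= 80% AND ultrafast bootstrap >= 95% [31]."""
    return sh_alrt >= 80.0 and ufboot >= 95.0

# Example branch annotations in "SH-aLRT/UFBoot" form:
for label in ("98.2/100", "75.0/96", "85.1/91"):
    sh, bb = map(float, label.split("/"))
    print(label, clade_confident(sh, bb))
```

Only the first example clade passes: the second fails the SH-aLRT cutoff, the third fails the UFBoot cutoff.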
Q3: ModelFinder selected a complex model with many parameters. Is it safe to use a simpler model for computational efficiency?
While a simpler model might be computationally faster, your primary goal should be model accuracy for robust phylogenetic inference. Recent research indicates that the Bayesian Information Criterion (BIC), which ModelFinder uses by default, consistently outperforms AIC and AICc in accurately identifying the true underlying nucleotide substitution model [3]. BIC more heavily penalizes unnecessary parameters, helping to avoid overfitting. Therefore, it is recommended to trust the ModelFinder selection based on BIC. If computational resources are a constraint for subsequent analyses, you can consider the model's specific parameters and consult the literature for potential simplifications, but this should be done cautiously.
Q4: What does the composition chi-square test at the beginning of the run mean, and what should I do if sequences "fail" this test?
IQ-TREE performs a composition chi-square test for each sequence to check for significant deviation from the average character composition of the entire alignment [31]. A "failed" sequence has a composition that is statistically different from the average. This test is an explorative tool. You should not automatically remove failing sequences. However, if your final tree shows an unexpected topology, this test might help identify sequences causing potential problems. For phylogenomic protein data, you can also try profile mixture models (e.g., C10 to C60), which account for different amino-acid compositions across the sequences and sites [31].
Q5: Can I use ModelFinder for mixed data types in a partitioned analysis?
Yes. IQ-TREE allows partitioned analysis with mixed data types (e.g., DNA, protein, codon) specified via a NEXUS partition file [31]. In the partition file, you can define subsets (charsets) from different alignment files and assign them different models. When mixing codon and DNA data, note that the branch lengths are interpreted as the number of nucleotide substitutions per nucleotide site [31].
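A partition file like the one described above can be generated programmatically. The sketch below emits a NEXUS `sets` block in the general multi-file style IQ-TREE accepts; the file names, partition names, and models are hypothetical, and the exact syntax should be checked against the IQ-TREE documentation.

```python
# Minimal sketch (file names, partition names, and models are hypothetical);
# verify the resulting syntax against the IQ-TREE partition-file documentation.
partitions = [
    ("part1", "gene1.fasta", "GTR+G"),
    ("part2", "gene2.fasta", "LG+G"),
]

def nexus_partition(parts) -> str:
    lines = ["#nexus", "begin sets;"]
    for name, path, _ in parts:
        # one charset per input alignment; "*" means "all sites of this file"
        lines.append(f"    charset {name} = {path}: *;")
    models = ", ".join(f"{model}:{name}" for name, _, model in parts)
    lines.append(f"    charpartition mine = {models};")
    lines.append("end;")
    return "\n".join(lines)

print(nexus_partition(partitions))
```

Generating the block from one list of tuples keeps the charset and charpartition statements consistent, which is where hand-written partition files most often go wrong.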
Q6: My analysis was interrupted. Do I have to start from the beginning?
No. IQ-TREE periodically writes checkpoint files (e.g., .ckp.gz). Simply re-run the same command, and IQ-TREE will resume from the last checkpoint [24]. If the run successfully completed, re-running the command will result in an error to prevent overwriting outputs. Use the -redo option to force IQ-TREE to overwrite all previous output files and start anew [24].
Q7: I want to perform a more thorough model selection that searches tree-space for each model. How can I do this?
For a more accurate but computationally intensive analysis, you can use the -mtree option with ModelFinder. This instructs IQ-TREE to perform a full tree search for each model considered, rather than comparing all models on a single initial parsimony tree [32] [24]. This can lead to a better fit between the tree, model, and data, especially if the initial tree is suboptimal [32].
The following diagram illustrates the logical workflow for automating model selection with IQ-TREE's ModelFinder, integrating key decision points from the troubleshooting guides.
The table below details key software and methodological "reagents" essential for conducting automated model selection with IQ-TREE.
Table 1: Essential Research Reagents for Model Selection with IQ-TREE
| Item Name | Type | Function / Purpose | Key Implementation Notes |
|---|---|---|---|
| ModelFinder | Algorithm | Automatically determines the best-fit substitution model for a given alignment by comparing models using AIC, AICc, or BIC [32] [24]. | Activated with -m MFP in IQ-TREE. Default is BIC, which has been shown to accurately identify the true model [3]. |
| Probability-Distribution-Free (PDF) Model | Rate Heterogeneity Model | A key feature of ModelFinder that models rate variation across sites without assuming a Gamma distribution, providing a more flexible fit to the data [32]. | Labeled as Rk (e.g., R5) in model output. Often leads to a better fit than traditional +G or +I+G models [32]. |
| Ultrafast Bootstrap (UFBoot) | Branch Support Method | Approximates standard bootstrap supports with much less computational time, yielding less biased support values [31]. | Implemented with -bb option. Support values ≥ 95% are considered significant. |
| SH-aLRT Test | Branch Support Method | A fast likelihood-based test for branch support [31]. | Used with -alrt option. Supports ≥ 80% are considered significant. Often reported alongside UFBoot. |
| NEXUS Partition File | Data Specification Format | Enables complex partitioned analyses, including mixing of different data types (DNA, protein, codon) in a single analysis [31]. | Required for mixing data types. Each partition's site indices are specified independently. |
Detailed Protocol: Automating Model Selection and Tree Inference
This protocol outlines the steps for finding the best-fit model and reconstructing a maximum-likelihood phylogeny for a nucleotide alignment using IQ-TREE.
The -m MFP flag activates ModelFinder Plus.
- -s your_alignment.phy: Specifies the input alignment file.
- -m MFP: Triggers ModelFinder to select the best-fit model and then uses it for the subsequent tree search.
- -bb 1000: Performs 1000 ultrafast bootstrap replicates.
- -alrt 1000: Performs the SH-like approximate likelihood ratio test with 1000 replicates.
- -nt AUTO: Automatically determines the optimal number of CPU threads to use.

The report file (your_alignment.phy.iqtree) contains the selected model, model parameters, log-likelihood scores, and the final tree with support values. The best-fit model (e.g., TIM2+F+I+G4) will be clearly stated in this file [24]. If you ran model selection only (-m MF), or wish to run the final tree inference with a specific model, use the model name identified in the previous step.
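For pipeline scripts, the full invocation from this protocol can be assembled as an argv list, which avoids shell-quoting mistakes when driving IQ-TREE from Python (a sketch; on newer installs the binary may be named `iqtree2`).

```python
def iqtree_cmd(alignment: str, ufboot: int = 1000, alrt: int = 1000) -> list:
    """Assemble the ModelFinder-Plus invocation from the protocol as an argv
    list suitable for subprocess.run; no shell quoting needed."""
    # NOTE: binary name is an assumption; some installs ship "iqtree2" instead.
    return ["iqtree", "-s", alignment, "-m", "MFP",
            "-bb", str(ufboot), "-alrt", str(alrt), "-nt", "AUTO"]

print(" ".join(iqtree_cmd("your_alignment.phy")))
```

Passing the list to `subprocess.run(iqtree_cmd(...), check=True)` would execute the analysis without involving a shell.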
Table 2: Impact of Model Selection Criterion on Identification of True Model (Based on Simulated Data)
| Selection Criterion | Performance in Identifying True Model | Key Characteristics |
|---|---|---|
| BIC (Bayesian Information Criterion) | Consistently outperforms AIC and AICc in accurately identifying the true model, regardless of the software used (IQ-TREE, jModelTest2, ModelTest-NG) [3]. | Heavily penalizes extra parameters, favoring simpler models when appropriate, which helps prevent overfitting. |
| AICc (Corrected Akaike Information Criterion) | Shows a 2-3% bias towards selecting more parameter-rich models of rate heterogeneity (Rk) compared to the true model [32]. | Provides a correction for small sample sizes. Results are often identical to AIC for large datasets [3]. |
| AIC (Akaike Information Criterion) | Performance similar to AICc, but may select overly complex models compared to BIC [3]. | Derived from frequentist probability, it is less punitive against model complexity than BIC. |
What do the 6-digit codes in my model selection output mean?
These codes represent equality constraints on the six relative substitution rates between nucleotide pairs (A-C, A-G, A-T, C-G, C-T, G-T). Identical digits indicate that the corresponding rates are set equal to each other. For example, the code 010010 means the A-G rate equals the C-T rate (both coded as 1), while the other four rates are equal to each other (coded as 0). The code 012345 indicates a General Time Reversible (GTR) model where all six rates are free to vary independently [7].
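Decoding these 6-digit strings by hand is error-prone, so a small helper can group the pairs that share a rate (a quick sketch of the scheme just described):

```python
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")  # fixed order of the six rates

def rate_classes(code: str) -> dict:
    """Group the nucleotide pairs that share a substitution rate under a
    6-digit model code (identical digits = equal rates)."""
    classes = {}
    for pair, digit in zip(PAIRS, code):
        classes.setdefault(digit, []).append(pair)
    return classes

print(rate_classes("010010"))  # K80/HKY pattern: A-G rate equals C-T rate
print(rate_classes("012345"))  # GTR: all six rates free to vary
```

For "010010" the helper returns two classes, the transitions {A-G, C-T} and the four transversions, matching the description above.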
AIC and AICc selected one model, but BIC selected a simpler one. Which should I trust? Your observation is a common and important outcome. Research indicates that the Bayesian Information Criterion (BIC) consistently outperforms both AIC and AICc in accurately identifying the true underlying nucleotide substitution model [3]. BIC more heavily penalizes model complexity, which helps prevent overfitting. For robust phylogenetic analysis, particularly with large datasets, trusting the BIC selection is generally recommended [3].
The same criterion selects different best-fit models in jModelTest2, ModelTest-NG, and IQ-TREE. Is this normal? A comparative analysis has demonstrated that the choice of software (jModelTest2, ModelTest-NG, or IQ-TREE) does not significantly affect the accuracy of model selection [3]. The key factor is the statistical criterion chosen (AIC, AICc, or BIC). You can confidently rely on results from any of these three programs, as they offer comparable performance [3].
What is the practical difference between +F, +FO, and +FQ in model names?
These suffixes specify how base frequencies are handled [7]:
- +F: Uses empirical base frequencies counted directly from your alignment. This is often the default for models that allow unequal base frequencies.
- +FO: Optimizes base frequencies using maximum likelihood, which can provide a better fit but is computationally more intensive.
- +FQ: Forces equal base frequencies.

The table below summarizes common DNA substitution models, their parameters, and their standard 6-digit codes to help you interpret software output [7].
| Model Name | Abbreviation | Degrees of Freedom (df) | Explanation | Standard 6-Digit Code |
|---|---|---|---|---|
| Jukes-Cantor | JC / JC69 | 0 | Equal substitution rates & equal base frequencies | 000000 |
| Kimura 2-Parameter | K80 / K2P | 1 | Unequal transition/transversion rates & equal base frequencies | 010010 |
| Hasegawa-Kishino-Yano | HKY / HKY85 | 4 | Unequal transition/transversion rates & unequal base frequencies | 010010 |
| Tamura-Nei | TN / TN93 | 5 | Like HKY but with unequal purine/pyrimidine rates | 010020 |
| General Time Reversible | GTR | 8 | Unequal rates & unequal base frequencies (most complex common model) | 012345 |
| Symmetric | SYM | 5 | Unequal rates but equal base frequencies | 012345 |
A standard workflow for nucleotide substitution model selection using IQ-TREE is outlined below [7].
Software Commands: The analysis can be performed in several software packages [3]:
- IQ-TREE: iqtree2 -s your_alignment.phy -m TEST -mset GTR,HKY,JC (the -m TEST option performs standard model selection, similar to jModelTest).
- ModelTest-NG: modeltest-ng -i your_alignment.phy -t ml

Output Interpretation: The program will generate a table ranking all tested models by your chosen criterion (AIC, AICc, or BIC). The model with the lowest score is the best-fit. The output will clearly state the selected model name (e.g., GTR+I+G).
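The ranking the software reports can be reproduced for a handful of candidates. The `(lnL, k)` values below are hypothetical; `k` is each model's free-parameter count (the df column in the table above).

```python
import math

# Hypothetical (lnL, k) pairs for three candidates on one alignment of n sites;
# k is the model's free-parameter count (its degrees of freedom).
results = {"JC": (-5100.0, 0), "HKY": (-5025.0, 4), "GTR": (-5020.0, 8)}
n = 1200  # alignment length in sites

def bic(lnL: float, k: int) -> float:
    """BIC = ln(n) k - 2 ln(L)."""
    return math.log(n) * k - 2 * lnL

# Rank exactly as the software does: lowest score wins.
ranking = sorted(results, key=lambda m: bic(*results[m]))
for m in ranking:
    print(m, round(bic(*results[m]), 1))
```

Here HKY wins: its fit improvement over JC dwarfs the penalty for four extra parameters, while GTR's small further gain in likelihood does not justify four more.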
| Item Name | Type | Function in Experiment |
|---|---|---|
| IQ-TREE | Software Package | Performs fast and effective phylogenetic inference and model selection via ModelFinder [7]. |
| ModelTest-NG | Software Package | A dedicated tool for selecting best-fit nucleotide substitution models; known for being 1-2 orders of magnitude faster than jModelTest [3]. |
| jModelTest2 | Software Package | A widely-used program for statistical selection of best-fit nucleotide substitution models [3]. |
| BIC (Bayesian Information Criterion) | Statistical Criterion | A model selection criterion that penalizes complexity more heavily than AIC; shown to most accurately identify the true model [3]. |
| AIC (Akaike Information Criterion) | Statistical Criterion | A model selection criterion that balances model fit and complexity, favoring more parameter-rich models than BIC [3] [33]. |
The assumption that all sites in a DNA sequence evolve at the same rate is biologically unrealistic. Real genomic sequences contain sites under different functional constraints, leading to substantial rate variation across sites. For instance, coding regions may have codon positions under different selection pressures, while RNA stems and loops evolve at different rates. To account for this heterogeneity, two primary mechanisms are incorporated into nucleotide substitution models: Gamma-distributed rate variation (+G) and Invariant Sites (+I).
The Gamma distribution (+G) models sites that evolve at different rates, with the shape parameter (α) determining the distribution of rates across sites. A small α value indicates high rate variation (some sites evolve very slowly while others very rapidly), while a large α value suggests more uniform rates across sites [34] [35]. The Invariant Sites model (+I) accounts for a proportion of sites (p-inv) that are completely resistant to change over evolutionary time [36]. These extensions can be added to any standard nucleotide substitution model (e.g., JC+G, HKY+I, GTR+G+I) to better reflect biological reality and improve the accuracy of phylogenetic inference [9] [7].
Problem: How do I choose between +G, +I, or both (+G+I) for my analysis?
Solution: Do not choose by intuition; run a formal model-selection tool such as ModelTest-NG, jModelTest2, or IQ-TREE's built-in ModelFinder and let the information criterion decide [20].
Problem: My analysis with +G runs extremely slowly or fails to converge.
Solution: Reduce the number of discrete Gamma rate categories (e.g., to the common default of four, +G4 in many programs) to see if stability improves [7].
Problem: The estimated proportion of invariant sites (p-inv) is very high (>0.9) or very low (<0.01).
Problem: I get different results when using different phylogenetic software with the same model (e.g., GTR+G+I).
Solution: A 2025 comparative study found that the choice of software (jModelTest2, ModelTest-NG, or IQ-TREE) did not significantly affect the ability to identify the true model; differences are more likely due to the chosen information criterion [20].
Problem: How do I interpret the Gamma shape parameter (α) and the proportion of invariant sites (p-inv)?
Solution: A small α (well below 1) indicates strong rate variation across sites, while a large α indicates nearly uniform rates [34] [35]; p-inv estimates the proportion of sites that are effectively resistant to change [36].
Problem: The confidence intervals for the Gamma shape parameter (α) are very wide.
Q1: What is the mathematical basis of the Gamma distribution for modeling rate variation? The Gamma distribution is a two-parameter family of continuous probability distributions. In phylogenetics, it is used to model the distribution of rates (r) across sites. The probability density function (PDF) with shape parameter α and rate parameter λ is [34] [35]:

f(r; α, λ) = (λ^α / Γ(α)) · r^(α-1) · e^(-λr)

For convenience, the mean rate is standardized to 1 (which implies λ = α). The shape parameter α entirely controls the distribution's properties: the variance is 1/α, the skewness is 2/√α, and the excess kurtosis is 6/α. In practice, the continuous distribution is approximated by a discrete number of rate categories (e.g., 4) to make likelihood calculations feasible [9] [7].
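The standardization λ = α is easy to check numerically. The sketch below evaluates the density and confirms by crude integration that both the total probability mass and the mean rate are approximately 1 (α = 2 is an arbitrary illustrative choice).

```python
import math

def gamma_pdf(r: float, alpha: float) -> float:
    """Gamma density with the mean standardized to 1 (so lambda = alpha):
    f(r) = alpha^alpha / Gamma(alpha) * r^(alpha-1) * exp(-alpha * r)."""
    lam = alpha
    return (lam ** alpha / math.gamma(alpha)) * r ** (alpha - 1) * math.exp(-lam * r)

# Crude left-Riemann integration over r in (0, 40] to confirm that the
# total mass and the mean rate are both ~1 for the illustrative alpha = 2.
alpha, dr = 2.0, 1e-3
grid = [dr * i for i in range(1, 40001)]
mass = sum(gamma_pdf(r, alpha) * dr for r in grid)
mean = sum(r * gamma_pdf(r, alpha) * dr for r in grid)
print(round(mass, 3), round(mean, 3))  # both close to 1.0
```

The same check with a smaller α would show the variance 1/α growing, i.e., rates spreading out across sites.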
Q2: Can I use +G and +I models for data other than standard DNA (e.g., RNA, proteins)? Yes. The +G and +I extensions are applicable to various data types, including RNA and proteins. For RNA genes with secondary structure, specialized models (7-state or 16-state dinucleotide models) exist that account for paired-site evolution, and these can also be combined with +G and +I to account for rate variation among different stem and loop regions [37]. Protein models (e.g., WAG, LG) are almost always used with a +G component to account for site-specific rate variation [7].
Q3: My model selection test keeps choosing a complex model like GTR+G+I. Is this always the best? Not necessarily. While GTR+G+I is often the best-fitting model for real datasets, it is also the most parameter-rich. It is crucial to use a strict model selection criterion like BIC, which penalizes model complexity more heavily than AIC. A 2025 study confirms that BIC is more reliable at identifying the true model without overfitting [20]. If your dataset is small, a simpler model with fewer parameters (e.g., HKY+G) might be more robust and still provide a good fit.
Q4: What are the key differences between the implementations in different software? The underlying mathematics is consistent, but key practical differences exist, as summarized in the table below.
Table 1: Comparison of +G and +I Model Implementation in Different Software
| Software | Gamma Distribution Discretization | Default Number of Categories | Typical Model Selection Method |
|---|---|---|---|
| IQ-TREE [7] | Discrete approximation via mean | 4 (user can change) | Built-in ModelFinder (BIC, AIC, AICc) |
| RevBayes [9] | MCMC sampling of continuous distribution | N/A (sampled directly) | Stepping-stone sampling for Bayes factors |
| jModelTest2/ ModelTest-NG [20] | Discrete approximation | Fixed (e.g., 4) during testing | Hierarchical Likelihood Ratio Tests (hLRT), BIC, AIC |
Q5: Are there alternatives to +G and +I for modeling rate variation? Yes, though they are less common. These include:
- Profile mixture models such as C10 to C60 in IQ-TREE, which assign sites to a number of predefined categories based on amino acid profiles [7].
- Probability-distribution-free rate models (Rk), which describe rate variation across sites without assuming a Gamma distribution [32].

Objective: To systematically determine the best-fit nucleotide substitution model for a given DNA sequence alignment, including the assessment of Gamma-distributed rate variation (+G) and a proportion of invariant sites (+I).
1. Run an automated model-selection tool (IQ-TREE's -m MFP, ModelTest-NG, or jModelTest2) [20].
2. For example, iqtree -s alignment.phy -m MFP will automatically test a wide range of models, including those with +G and +I [7].
3. Record the best-fit model reported (e.g., GTR+F+I+G4). Note the estimated values for the shape parameter (α) and the proportion of invariant sites (p-inv).

The following diagram visualizes this workflow and the logical relationship between model components:
Objective: To use phylogenetic invariants as an independent check of model fit, particularly for testing symmetry assumptions in models like JC69 and K80 [38].
Table 2: Essential Software and Analytical Tools for Modeling Rate Variation
| Tool Name | Type | Primary Function | Relevance to +G and +I |
|---|---|---|---|
| IQ-TREE [7] | Software Package | Phylogenetic inference | Implements all standard models with +G and +I. Features fast model selection via -m MFP. |
| ModelTest-NG [20] | Standalone Program | Model Selection | Performs statistical comparison of nucleotide substitution models, including +G and +I, using multiple criteria. |
| RevBayes [9] | Software Package | Bayesian Phylogenetics | Provides a flexible environment for specifying custom models with +G and +I using MCMC for parameter estimation. |
| PHASE [37] | Software Package | RNA Phylogenetics | Implements specialized 7- and 16-state RNA substitution models that can be combined with +G and +I. |
| BIC (Statistic) [20] | Information Criterion | Model Selection | The recommended criterion for selecting the best-fit model, as it penalizes complexity and reduces overfitting. |
Q1: Why is data partitioning critical in phylogenetic analysis? Molecular sequence alignments are composed of sites with different evolutionary histories and pressures. Using a single model for all sites can lead to biased and inaccurate phylogenetic estimates. Partitioning involves estimating independent models of molecular evolution for different subsets of sites, which significantly improves the accuracy of phylogenetic reconstruction [39] [40] [41]. For example, in protein-coding genes, codon positions often evolve at different rates, and separate genes may have distinct evolutionary dynamics.
Q2: What are the main types of partitioning strategies? The two primary approaches are a priori partitioning and data-driven partitioning:
A priori partitioning divides the alignment using prior biological knowledge, typically by gene and codon position (e.g., Gene1_codon1, Gene1_codon2, Gene1_codon3, Gene2_codon1, etc.), whereas data-driven partitioning groups sites by their observed evolutionary properties [40] [41].

Q3: My analysis software selected different best-fit models when I used AIC versus BIC. Which criterion should I trust? A recent comprehensive study found that the Bayesian Information Criterion (BIC) consistently outperforms AIC and AICc in accurately identifying the true underlying nucleotide substitution model [3]. BIC more heavily penalizes model complexity, which helps avoid over-parameterization. It is recommended to use BIC for model selection, especially with larger datasets.
Q4: How can I objectively select the best partitioning scheme for my dataset? Software like PartitionFinder and ModelTest-NG are designed for this purpose. They allow you to input a set of candidate a priori partitions (e.g., by gene and codon position) and use statistical criteria (like BIC) to compare millions of partitioning schemes. The program will identify the scheme that best fits your data by potentially merging subsets that share similar model parameters [39]. For a fully data-driven approach, the TIGER-rates/RatePartitions pipeline provides an objective method based on site-specific relative evolutionary rates [40].
Q5: I am getting discordant divergence time estimates from my mitochondrial vs. nuclear DNA data. Could partitioning be the issue? Yes. Different genes can have varying levels of phylogenetic informativeness and saturation over time. A study on ray-finned fishes demonstrated that mitochondrial partitions often peak in informativeness deep in the past and then experience a "rain shadow of noise" (saturation) for more recent nodes, leading to artificially older age estimates. In contrast, nuclear genes often provide more reliable estimates for younger divergences. Applying appropriate partitioning and model selection can help reconcile these discordant estimates [42].
Q6: What is the computational cost of over-partitioning my data? While under-partitioning (too few partitions) is a known source of bias, over-partitioning (too many partitions) also carries risks. Excessively partitioning a dataset can lead to over-parameterization, where each partition has too few sites to reliably estimate its model parameters. This can increase error variance and computational burden. The goal of model selection tools like PartitionFinder is to find a balance, grouping sites only when they share similar substitution processes [39] [41].
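A back-of-envelope parameter count makes the over-parameterization risk concrete. The sketch below assumes each partition gets its own unlinked GTR+G+I model (5 free exchangeabilities, 3 free base frequencies, α, and p-inv, i.e. 10 parameters per partition) while branch lengths are shared across partitions; these are simplifying assumptions for illustration:

```python
def free_parameters(n_partitions, per_partition=10, n_taxa=50):
    """Rough count of free parameters in an unlinked partitioned analysis.
    per_partition=10 corresponds to GTR+G+I (5 exchangeabilities, 3 free
    base frequencies, alpha, p-inv); the 2*taxa - 3 branch lengths of an
    unrooted tree are assumed shared across partitions (a simplification).
    """
    branch_lengths = 2 * n_taxa - 3
    return n_partitions * per_partition + branch_lengths

for k in (1, 6, 30):
    print(f"{k:3d} partitions -> {free_parameters(k)} free parameters")
```

For a 50-taxon tree this gives 107, 157, and 397 free parameters respectively; with 30 partitions, short subsets may have too few sites per parameter for stable estimates.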
Problem: Model selection yields an overly complex model with many parameters.
Problem: Analysis fails to converge or runs extremely slowly after implementing a complex partitioning scheme.
Problem: Disagreement between phylogenetic trees inferred from different genes.
Problem: Suspected saturation in fast-evolving sites is biasing results.
Protocol 1: Selecting a Partitioning Scheme with PartitionFinder
This protocol is used to find the best-fit partitioning scheme and substitution models for a set of candidate partitions [39].
Prepare Data and Configuration:
Run PartitionFinder:
Interpret Output:
Protocol 2: Data-Driven Partitioning using Relative Evolutionary Rates
This protocol uses TIGER-rates and RatePartitions to partition data based on site rates without prior assumptions [40].
Calculate Site Rates:
Generate Partitions:
Validate the New Scheme:
Table 1: Essential Software and Tools for Partitioning and Model Selection
| Tool Name | Function | Key Feature |
|---|---|---|
| PartitionFinder [39] | Combined selection of partitioning schemes and substitution models. | Can compare millions of partitioning schemes in realistic time frames. |
| TIGER / TIGER-rates [40] | Calculates the relative evolutionary rate for each site in an alignment. | Provides a tree-independent, data-driven measure of site heterogeneity. |
| RatePartitions [40] | Partitions alignment sites into subsets based on relative rates from TIGER. | Objective partitioning that avoids pitfalls of other algorithms like k-means. |
| ModelTest-NG [3] | Selects best-fit nucleotide substitution models. | Fast implementation of common model selection criteria (AIC, AICc, BIC). |
| IQ-TREE [7] | Performs phylogenetic inference and model selection. | Integrated ModelFinder (MFP option) for automatic best-fit model selection. |
| RevBayes [41] | Bayesian phylogenetic inference using probabilistic graphical models. | Flexible framework for implementing custom partitioned models and conducting Bayes factor model comparison. |
The following diagram outlines a logical workflow for selecting and implementing a partitioning strategy in a phylogenetic analysis.
In molecular phylogenetics, the model of nucleotide substitution chosen for analysis is not just a technicality—it is a fundamental assumption that can profoundly influence the inferred evolutionary relationships. For years, the General Time-Reversible (GTR) model has dominated phylogenetic analyses due to its mathematical convenience and flexibility. However, growing biological evidence suggests that real evolutionary processes often violate the assumptions of time-reversibility and process homogeneity across lineages. These limitations have spurred the development of more sophisticated models, including Lie Markov models and non-reversible models, which provide a more realistic framework for understanding sequence evolution without sacrificing mathematical rigor.
The standard GTR model, while powerful, possesses a significant mathematical limitation: it lacks multiplicative closure [43] [44]. This means that if a sequence evolves under one set of GTR parameters for a period of time, then under a different set of parameters, the overall evolutionary process cannot be described by a single GTR model. This becomes problematic when modeling real biological scenarios where substitution processes may vary across different branches of a phylogenetic tree. Lie Markov models address this limitation by providing a class of models that remain consistent even when the evolutionary process changes over time [44].
Similarly, traditional reversible models assume that the substitution process looks the same forward and backward in time, an assumption that may not hold true in biological systems where directional pressures influence sequence evolution. Non-reversible models break this symmetry, offering the potential for more accurate rooting of phylogenetic trees and better representation of asymmetric evolutionary processes [45].
Lie Markov models are defined by a critical mathematical property: their rate matrices form a Lie algebra [43] [46]. In practical terms, this means the models are multiplicatively closed. This closure property ensures that if you compose two Markov processes from the same Lie Markov model, the resulting combined process remains within the same model family. This is not true for GTR and some other common models [44].
The formal mathematical requirement is that the set of rate matrices (Q) in a Lie Markov model must be closed under the commutator operation [A,B] = AB - BA. This structure permits the model to handle nonhomogeneous evolution—where substitution rates vary across evolutionary time—while ensuring that the "average" process over time still belongs to the same model class [43]. This property becomes crucial when analyzing deep evolutionary relationships where process heterogeneity is likely.
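The closure property can be demonstrated numerically. The sketch below (NumPy only; the GTR parameter values are arbitrary) composes two Markov processes and recovers the "average" rate matrix via a matrix logarithm: two JC69 matrices commute, so their composition is again JC69 (and hence time-reversible), while two different GTR matrices generally do not, and the recovered rate matrix violates detailed balance — i.e., it lies outside GTR:

```python
import numpy as np

def expm(M):
    """Matrix exponential via eigendecomposition (fine for diagonalizable Q)."""
    w, V = np.linalg.eig(M)
    return np.real(V @ np.diag(np.exp(w)) @ np.linalg.inv(V))

def logm(P):
    """Principal matrix logarithm via eigendecomposition."""
    w, V = np.linalg.eig(P.astype(complex))
    return np.real(V @ np.diag(np.log(w)) @ np.linalg.inv(V))

def gtr_Q(rates, pi):
    """GTR rate matrix: Q[i,j] = s_ij * pi_j for i != j; rows sum to zero.
    `rates` holds the six exchangeabilities (AC, AG, AT, CG, CT, GT)."""
    s = np.zeros((4, 4))
    for (i, j), r in zip([(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)], rates):
        s[i, j] = s[j, i] = r
    Q = s * pi
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def reversibility_residual(Q):
    """Max violation of detailed balance pi_i*Q_ij = pi_j*Q_ji under Q's own
    stationary distribution (zero for any time-reversible process)."""
    w, V = np.linalg.eig(Q.T)
    pi = np.real(V[:, np.argmin(np.abs(w))])
    pi /= pi.sum()
    F = pi[:, None] * Q
    return float(np.abs(F - F.T).max())

uni = np.full(4, 0.25)
jc_a, jc_b = 0.3 * gtr_Q([1] * 6, uni), 0.7 * gtr_Q([1] * 6, uni)
A = gtr_Q([1, 4, 1, 1, 4, 1], np.array([0.1, 0.4, 0.3, 0.2]))
B = gtr_Q([2, 1, 3, 1, 1, 2], np.array([0.4, 0.1, 0.2, 0.3]))

# JC69 rate matrices commute ([A,B] = 0), so the composed process stays JC69:
q_jc = logm(expm(jc_a) @ expm(jc_b))
# Two different GTR processes do not commute; their composition falls outside
# GTR, and the recovered rate matrix is not time-reversible:
q_mix = logm(expm(0.2 * A) @ expm(0.2 * B))
print(reversibility_residual(q_jc), reversibility_residual(q_mix))
```

The first residual is numerically zero while the second is clearly positive, which is precisely the inconsistency that Lie Markov models avoid by construction.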
Lie Markov models can be organized into natural hierarchies based on their number of parameters and symmetry properties. A key advancement was the development of models that recognize the biochemical distinction between purines (A, G) and pyrimidines (C, T), allowing them to accommodate the well-known transition-transversion bias [44].
Table 1: Hierarchy of Selected Lie Markov Models
| Model Name | Parameters | Symmetry | Biological Features Captured |
|---|---|---|---|
| JC69 [44] | 1 | Full symmetry (S₄) | Equal nucleotide frequencies and substitution rates |
| K3ST [44] | 3 | Full symmetry (S₄) | Three substitution types; no transition-transversion bias |
| F81 [44] | 4 | Full symmetry (S₄) | Unequal nucleotide frequencies; equal exchangeability |
| F81+K3ST [44] | 6 | Full symmetry (S₄) | Combination of F81 and K3ST features |
| RY5.6b [44] | 5 | RY symmetry | Distinguishes purine/pyrimidine exchanges; transition-transversion bias |
| General Markov [44] | 12 | Minimal symmetry | Maximum flexibility without constraints |
The RY hierarchy, comprising 37 distinct models, offers researchers a spectrum of choices that balance biological realism against parameter richness [44]. For example, the RY5.6b model can be conceptually understood as the sum of the Kimura 2-substitution-type (K2ST) model and the F81 model [44].
The primary advantage of Lie Markov models is their mathematical consistency in nonhomogeneous modeling scenarios [44]. When evolution proceeds under varying processes across a phylogenetic tree, Lie Markov models ensure that the site pattern distributions observed at the tips remain consistent regardless of taxon sampling. This prevents the potentially confounding situation where adding or removing sequences from an analysis fundamentally changes the inferred pattern of evolution.
From a practical perspective, Lie Markov models offer a principled approach to model selection through a natural hierarchy of increasing complexity. This allows researchers to select models with an appropriate number of parameters for their dataset, avoiding both over-simplification and over-parameterization [44].
Traditional time-reversible models operate under the assumption that the substitution process appears identical forward and backward in time, mathematically expressed as π_i Q_ij = π_j Q_ji for all pairs of nucleotides i and j, where π represents the equilibrium frequencies and Q the rate matrix [6] [5]. While mathematically convenient, this assumption may not reflect biological reality, where directional processes such as mutational biases, selective pressures, and differential repair mechanisms can create inherent asymmetries in evolutionary patterns [45].
Non-reversible models relax this constraint, allowing for a more flexible representation of evolutionary processes. These models are particularly valuable for rooting phylogenetic trees without relying on external outgroups, as the inherent directionality in the substitution process contains information about the root position [45].
It is important to distinguish between two types of non-reversible models: stationary models, in which the base composition remains constant across the tree, and nonstationary models, in which the initial and equilibrium base frequencies are allowed to differ [45].
Research has demonstrated that nonstationary models significantly outperform stationary models in root position inference, suggesting that accounting for evolving base composition is crucial for accurate phylogenetic analysis [45].
Table 2: Troubleshooting Common Implementation Challenges
| Problem | Possible Causes | Solutions |
|---|---|---|
| Poor model convergence | Over-parameterization for dataset size | Use BIC for model selection; switch to simpler Lie Markov model [3] [20] |
| Inaccurate root placement | Assuming reversibility when process is directional | Implement nonstationary non-reversible model [45] |
| Inconsistent results when adding/removing taxa | Using non-closed model (e.g., GTR) on nonhomogeneous process | Switch to Lie Markov model with appropriate symmetry [43] [44] |
| Biased branch length estimates | Model misspecification, particularly regarding RY distinction | Use RY Lie Markov model hierarchy [44] |
| Computational bottlenecks | Complex model with many parameters on large dataset | Use ModelTest-NG for faster analysis; consider lower-parameter Lie Markov model [3] |
Protocol 1: Model Selection Workflow
Protocol 2: Root Placement Using Non-Reversible Models
Model Selection Workflow: This diagram illustrates the systematic process for selecting the most appropriate nucleotide substitution model, from initial data preparation to final statistical assessment.
Q1: Which software packages support Lie Markov and non-reversible models? While standard phylogenetic software may not comprehensively implement all Lie Markov models, specialized packages and development versions are increasingly incorporating these frameworks. For model selection, jModelTest2, ModelTest-NG, and IQ-TREE represent state-of-the-art tools, with IQ-TREE having particular strengths in complex model implementation [3].
Q2: How do I choose between different Lie Markov models? The RY hierarchy of Lie Markov models offers a natural progression from simpler to more complex models. Selection should be based on biological plausibility (e.g., the ability to distinguish transitions from transversions) and statistical criteria, with BIC demonstrating superior performance in identifying the true model [3] [44] [20].
Q3: Are non-reversible models computationally feasible for large datasets? Modern implementations have significantly improved the computational efficiency of non-reversible models. For very large datasets, starting with a simpler Lie Markov model and progressing to more complex ones can be an effective strategy. ModelTest-NG offers performance advantages for large-scale analyses [3].
Q4: When should I prefer Lie Markov models over GTR? Lie Markov models are particularly advantageous when: (1) analyzing data where evolutionary processes may have varied across the tree, (2) working with datasets where taxon sampling may change, and (3) studying deep evolutionary relationships where process heterogeneity is likely [43] [44].
Q5: Can non-reversible models correctly identify the root position without an outgroup? Empirical tests demonstrate that nonstationary non-reversible models significantly outperform both stationary non-reversible and reversible models in root position inference across multiple biological datasets [45]. However, performance depends on the degree of non-reversibility in the actual evolutionary process and the appropriateness of the model for the data.
Table 3: Essential Research Reagents and Computational Tools
| Tool/Reagent | Function/Purpose | Implementation Notes |
|---|---|---|
| jModelTest2 [3] | Statistical selection of best-fit nucleotide substitution models | User-friendly interface; suitable for small to medium datasets |
| ModelTest-NG [3] | Rapid model selection implementation | One to two orders of magnitude faster than jModelTest; ideal for large datasets |
| IQ-TREE [3] | Integrated model selection and tree inference | Implements both model selection and phylogenetic inference in one package |
| BIC (Bayesian Information Criterion) [3] [20] | Model selection criterion | Consistently outperforms AIC and AICc in identifying true model |
| RY Lie Markov Model Hierarchy [44] | Suite of models with purine/pyrimidine distinction | Provides 37 models spanning range of parameter complexity |
| Nonstationary Model Framework [45] | Root inference without outgroups | Allows initial and equilibrium base frequencies to differ |
The expanding toolkit of nucleotide substitution models, particularly Lie Markov models and non-reversible models, offers phylogenetic researchers powerful alternatives to overcome limitations of standard time-reversible frameworks. By providing mathematical consistency in the face of process heterogeneity and biological realism for asymmetric evolutionary processes, these advanced models enable more accurate inference of evolutionary relationships and processes.
Implementation of these models requires careful attention to model selection criteria, with the Bayesian Information Criterion (BIC) demonstrating superior performance for identifying appropriate models [3] [20]. As phylogenetic methods continue to evolve, these sophisticated modeling approaches promise to extract deeper insights from molecular sequence data, ultimately advancing our understanding of evolutionary mechanisms and relationships.
Problem 1: Inconsistent or conflicting results from different model selection criteria.
Problem 2: Model misspecification leading to biased parameter estimates.
Problem 3: Computational limitations with large candidate model sets.
Problem 4: Poor recovery of true models with certain selection criteria.
Problem 1: Difficulty detecting low-magnitude disturbances in time-series data.
Problem 2: Inadequate handling of multiple timescales in ecosystem dynamics.
Problem 3: Constrained computational resources for high-resolution imagery analysis.
Q1: What are the key advantages of Bayesian Model Averaging over traditional model selection methods?
Q2: When different criteria (AIC, AICc, BIC) select different best-fit models, which should I trust?
Q3: How does BEAST handle algorithmic uncertainty in time-series analysis?
Q4: What are the practical implications of selecting different nucleotide substitution models?
Q5: How can I quantify and reduce uncertainty in Bayesian experimental designs?
Purpose: To select the most appropriate nucleotide substitution model while accounting for model uncertainty.
Materials:
Procedure:
Validation: Compare results with simulation studies where true model is known. BIC should recover true model with higher accuracy than AIC or AICc [3] [47].
Purpose: To detect changepoints, trends, and seasonality in satellite time-series data while accounting for algorithmic uncertainty.
Materials:
Procedure:
Validation: Apply to simulated data with known changepoints and trends to verify detection accuracy. BEAST should outperform single-best-model algorithms in detecting low-magnitude disturbances and providing realistic uncertainty measures [49].
Table 1: Performance Comparison of Model Selection Criteria Based on Simulation Studies
| Criterion | Accuracy in Model Recovery | Precision (Number of Different Models Selected) | Preferred Use Cases |
|---|---|---|---|
| BIC | High (Superior to AIC/AICc) | Stable with fewer different models selected | General purpose, especially when true model is simpler |
| AIC | Moderate to Low | Higher (7.79-9.75 different models across replicates) | When complex models with more parameters are acceptable |
| AICc | Similar to AIC | Similar to AIC | Small sample sizes |
| Decision Theory | Similar to BIC | Similar to BIC | Alternative to BIC with similar performance |
| hLRT | Poor for SYM-like models | Variable | Not recommended based on comprehensive studies |
Table 2: Impact of Substitution Model Selection on Phylogenetic Inference
| Aspect | Impact of Model Selection | Magnitude of Effect |
|---|---|---|
| Tree Topology | Different best-fit models can yield different tree topologies | >10% topological difference in ~5% of inferences [30] |
| Branch Lengths | Model choice affects genetic distance estimates | Significant impact on evolutionary rate estimation [50] |
| Bootstrap Support | Model selection influences support values | Can affect statistical confidence in clades [47] |
| Parameter Estimates | Substitution rates and base frequencies vary by model | Direct impact on evolutionary parameter interpretation |
BMA addresses model uncertainty by combining predictions from multiple models weighted by their posterior probabilities. For linear regression with model uncertainty, BMA computes the posterior distribution of a parameter Δ as:
p(Δ | D) = Σ_k p(Δ | M_k, D) p(M_k | D)

where D is the data, M_k are the candidate models, p(Δ | M_k, D) is the posterior distribution of Δ under model M_k, and p(M_k | D) is the posterior probability of model M_k [48].
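In practice the posterior model probabilities p(M_k | D) are often approximated from BIC scores via the standard exp(-BIC/2) approximation to the marginal likelihood, rather than computed exactly. The following sketch (with hypothetical BIC values) turns a set of scores into normalized model weights suitable for averaging:

```python
import math

def bma_weights(bic_scores):
    """Approximate posterior model probabilities p(M_k | D) from BIC scores,
    using exp(-BIC/2) as a stand-in for the marginal likelihood of each model.
    Scores are shifted by the minimum for numerical stability."""
    best = min(bic_scores.values())
    raw = {m: math.exp(-(b - best) / 2.0) for m, b in bic_scores.items()}
    z = sum(raw.values())
    return {m: r / z for m, r in raw.items()}

scores = {"JC69": 10520.3, "HKY+G": 10454.5, "GTR+G+I": 10477.1}  # hypothetical
weights = bma_weights(scores)
for m, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{m:8s} {w:.6f}")
```

Because BIC differences here exceed 20, almost all weight lands on a single model; with closer scores the averaging spreads weight across several candidates, which is where BMA differs materially from picking a single best model.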
For nucleotide substitution model selection, BMA can be implemented using the following workflow:
BEAST implements a Bayesian ensemble approach for time-series decomposition by embracing algorithmic uncertainty rather than selecting a single best model. The algorithm decomposes a time series Y(t) into components:
Y(t) = S(t) + T(t) + e(t)
where S(t) is the seasonal component, T(t) is the trend component, and e(t) is the error term. BEAST computes the posterior probability of changepoints in both seasonal and trend components through Bayesian model averaging across all possible decomposition models [49].
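BEAST itself averages over seasonal and trend changepoint configurations with full Bayesian machinery; the following much-simplified sketch illustrates only the averaging idea, for a trend-only series with a single candidate changepoint per model, using BIC-based weights as a stand-in for posterior model probabilities (all data simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100)
# Piecewise signal: linear trend with an abrupt jump at t = 60, plus noise
# (the seasonal component S(t) is omitted for brevity):
y = np.where(t < 60, 0.02 * t, 0.02 * t + 1.5) + rng.normal(0, 0.3, 100)

def fit_rss(y, cp):
    """RSS and parameter count for a linear trend with an intercept shift at
    candidate changepoint cp (cp=None means a single unbroken trend)."""
    X = np.column_stack([np.ones_like(t, dtype=float), t.astype(float)])
    if cp is not None:
        X = np.column_stack([X, (t >= cp).astype(float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r), X.shape[1]

def bic(rss, k, n):
    """Gaussian-profile BIC for a least-squares fit."""
    return n * np.log(rss / n) + k * np.log(n)

n = len(y)
cands = [None] + list(range(5, 95))
bics = np.array([bic(*fit_rss(y, c), n) for c in cands])
w = np.exp(-(bics - bics.min()) / 2.0)
w /= w.sum()
p_cp = w[1:].sum()                     # averaged P(a changepoint exists)
best = cands[int(np.argmin(bics))]
print(f"P(changepoint) ~ {p_cp:.3f}, most probable location: t={best}")
```

Rather than committing to the single best changepoint, the ensemble spreads probability over nearby locations, which is what yields BEAST's realistic uncertainty measures for low-magnitude disturbances.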
Table 3: Essential Software Tools for Bayesian Analysis of Model Uncertainty
| Tool Name | Primary Function | Application Context | Implementation Details |
|---|---|---|---|
| Rbeast | Bayesian time-series decomposition | Changepoint detection, trend analysis | R package implementing BEAST algorithm for multiple timescale analysis [49] |
| jModelTest2 | Nucleotide substitution model selection | Phylogenetic analysis | Java-based, tests 203 substitution models using AIC, AICc, BIC [3] [30] |
| ModelTest-NG | Nucleotide substitution model selection | Phylogenetic analysis | Faster implementation (1-2 orders magnitude faster than jModelTest), same criteria [3] |
| IQ-TREE | Phylogenetic inference with model selection | Maximum likelihood phylogenetics | Implements model selection alongside tree inference [3] |
| BMA R packages | Bayesian Model Averaging | Linear regression with variable selection | Implements adaptive BMA methods with Zellner's g-prior [48] |
Problem: The General Time-Reversible (GTR) model, a common default in phylogenetic analyses, can sometimes be inapplicable or lead to misleading results. Standard estimation procedures may fail to converge, produce implausible parameter estimates, or yield unreliable tree topologies.
Primary Symptoms:
Diagnostic Procedure:
Step-by-Step Instructions:
Verify Numerical Instability:
Quantify Model Fit Using Information Criteria:
Use model selection software (e.g., jModelTest, ModelTest) to evaluate all 203 possible time-reversible nucleotide substitution models [30] [52].

Assess Topological Impact of Model Choice:
Identify Underlying Data Issues:
Resolution: Based on the diagnostics, implement one or more solutions from the Alternative Protocols section below.
Problem: The use of different model selection criteria (AIC, AICc, BIC) can lead to the selection of different best-fit models, creating uncertainty.
Diagnosis: This is a known issue in phylogenetics. The choice of criterion involves a trade-off between model complexity and fit [47].
Solution: Select the appropriate criterion based on your research goal and data properties.
Table 1: Comparison of Model Selection Criteria Performance
| Criterion | Full Name | Best For | Tendency | Accuracy & Precision | Key Reference |
|---|---|---|---|---|---|
| AIC | Akaike Information Criterion | Prediction, exploratory analysis | Selects more complex models | Lower accuracy, lower precision | [47] |
| BIC | Bayesian Information Criterion | Inference, phylogenetic estimation | Selects simpler models | Higher accuracy, higher precision | [47] |
| AICc | Corrected AIC | Small sample sizes | Similar to AIC but with sample size correction | Varies with sample size | [30] |
| hLRT | Hierarchical Likelihood Ratio Test | Less recommended for general use | Path-dependent, can favor complex models | Poor performance with invariable sites | [47] |
Recommendation: For the goal of phylogenetic inference (the context of this thesis), BIC is generally preferred over AIC and hLRT due to its higher accuracy and precision in selecting the true model in simulation studies [47]. Always report which criterion was used.
FAQ 1: The GTR model is the most general time-reversible model. Why would it ever be inappropriate?
While the GTR model is parameter-rich, it is still a simplification of reality. It assumes evolutionary processes are time-reversible, stationary, and homogeneous across the tree [30] [52]. Your data may violate these assumptions due to:
FAQ 2: My model selection analysis chose a very simple model (e.g., JC69). Does this mean my data is uninformative?
Not necessarily. It can indicate that your data lacks a strong "signal" for more complex model parameters. This often happens with very short sequences or highly conserved genes. However, it can also result from substitution saturation, where multiple hits at the same site have erased the historical signal, making complex patterns undetectable. You should check for saturation by plotting transitions and transversions against genetic distance [52].
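The saturation check above rests on counting transitions (purine↔purine or pyrimidine↔pyrimidine changes) and transversions for each sequence pair. A minimal sketch of the counting step (the plotting against genetic distance is left out):

```python
PURINES = {"A", "G"}

def ti_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences,
    skipping gaps and ambiguity codes. A change is a transition when both
    states are purines or both are pyrimidines; otherwise a transversion."""
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if (a in PURINES) == (b in PURINES):
            ti += 1
        else:
            tv += 1
    return ti, tv

# Toy aligned pair; in practice compute this for every sequence pair and plot
# ti and tv against genetic distance -- a plateau in transitions at large
# distances is the classic signature of saturation.
print(ti_tv_counts("ACGTACGTAC", "GCGTATGTAA"))  # → (2, 1)
```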
FAQ 3: How does the sample size definition affect model selection, and which should I use?
The definition of "sample size" is a critical but often overlooked choice when calculating AICc and BIC. The two main options are the number of alignment sites or the number of sites × taxa [30]. Research has shown that this choice can affect the selected model and, in approximately 5% of cases, lead to substantially different final tree topologies. For a more conservative approach that tends to select simpler models, use the number of alignment sites as the sample size.
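The effect of the sample-size definition is easy to demonstrate: the same pair of models, with the same log-likelihoods, can be ranked differently by BIC depending on whether n is the number of sites or sites × taxa. All numbers below are hypothetical, and only the substitution-model parameters are counted:

```python
import math

def bic(lnL, k, n):
    """BIC with an explicit sample-size argument n."""
    return -2.0 * lnL + k * math.log(n)

# Hypothetical scores; only substitution-model parameters are counted here.
lnL_simple, k_simple = -5210.0, 5    # e.g. HKY+G
lnL_rich,   k_rich   = -5188.0, 10   # e.g. GTR+G+I
sites, taxa = 1000, 50

for label, n in [("n = sites", sites), ("n = sites * taxa", sites * taxa)]:
    d = bic(lnL_rich, k_rich, n) - bic(lnL_simple, k_simple, n)
    print(f"{label:16s} delta BIC = {d:+.1f} -> prefers "
          f"{'richer' if d < 0 else 'simpler'} model")
```

Here n = sites favors the richer model (ΔBIC ≈ −9.5) while n = sites × taxa flips the decision (ΔBIC ≈ +10.1), matching the observation that the larger sample-size definition yields a stronger complexity penalty and hence simpler models.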
Purpose: To systematically identify the best-fit nucleotide substitution model and validate its impact on phylogenetic inference.
Methodology:
Data Preparation: Use a curated, multiple sequence alignment. Remove ambiguously aligned regions.
Comprehensive Model Testing:
Evaluate the full candidate set with dedicated software (e.g., jModelTest2).

Model Selection: Identify the best-fit model based on the BIC for phylogenetic inference (see [47] and Table 1).
Topological Impact Assessment:
Validation: Assess clade support using non-parametric bootstrap (≥1000 replicates) under the best-fit model.
Purpose: To correctly model rate variation across sites, a common source of model misspecification.
Methodology:
Define Candidate Models: Test a nested model series for a given base model (e.g., HKY85):
Model Comparison: Use LRTs to compare these nested models. The test statistic is 2ΔlnL and is approximately χ²-distributed with degrees of freedom equal to the difference in the number of parameters [52].
Decision Logic:
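The LRT in step 2 can be computed with the standard library alone for the common one-extra-parameter case (e.g., HKY85 vs HKY85+G), using the identity P(χ²₁ > x) = erfc(√(x/2)); the log-likelihoods below are hypothetical:

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square with 1 degree of freedom,
    via P(X > x) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def lrt(lnL_null, lnL_alt, df=1):
    """Likelihood ratio test for nested models; only df=1 is supported here
    (one extra parameter, e.g. the Gamma shape alpha)."""
    assert df == 1
    stat = 2.0 * (lnL_alt - lnL_null)
    return stat, chi2_sf_df1(stat)

stat, p = lrt(lnL_null=-5230.4, lnL_alt=-5210.0)   # hypothetical scores
print(f"2*dlnL = {stat:.1f}, p = {p:.2e}")          # p << 0.05: prefer +G
```

For larger degrees of freedom, scipy.stats.chi2.sf(stat, df) gives the general tail probability. Note that when the null model fixes a parameter at a boundary (e.g., p-inv = 0), the standard χ² reference distribution is conservative.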
Table 2: Research Reagent Solutions for Nucleotide Substitution Model Selection
| Reagent / Resource | Type | Primary Function | Application Note |
|---|---|---|---|
| jModelTest / ModelTest | Software Package | Automates model selection via AIC, AICc, BIC, hLRT. | Compares 56-203 models. Essential for Protocol 1. [30] [52] |
| Phylogenetic Likelihood Library (PLL) | Software Library | Provides a high-performance computational engine for likelihood calculations. | Used in RAxML and other tools for fast analysis. [30] |
| RAxML | Software Tool | Performs efficient large-scale Maximum Likelihood phylogeny inference. | Used for final tree inference under the selected model. [30] |
| MrBayes | Software Tool | Performs Bayesian phylogenetic inference. | Allows model averaging via lset nst=mixed. [30] |
| Codon Frequency Table | Data Resource | Provides species-specific codon usage biases. | Critical for analyzing protein-coding genes and codon model design. [53] [54] [55] |
Q1: What is the most critical choice affecting the accuracy of nucleotide substitution model selection?
The choice of information criterion has a more critical impact on accuracy than the choice of software. The Bayesian Information Criterion (BIC) consistently outperforms both AIC and AICc in accurately identifying the true underlying nucleotide substitution model, regardless of which phylogenetic program is used [22] [3]. While popular programs like jModelTest2, ModelTest-NG, and IQ-TREE offer comparable accuracy, selecting BIC as your criterion is recommended for more robust model selection [20].
Q2: When different criteria select different models, which should I choose and why?
When results are inconsistent, you should generally prefer the model selected by BIC. Studies show that BIC consistently identifies the true model more accurately [22]. Furthermore, when inconsistencies occur, the models selected by BIC are typically simpler (with fewer parameters) than those selected by AIC or AICc [22]. Selecting a simpler model is often preferable for computational efficiency, especially with large datasets [3].
Q3: How can I reduce the computational cost of model selection for very large genomic alignments?
You can dramatically reduce costs by using the Subsample-Upsample (SU) approach [56]. This method involves analyzing tiny subsamples of site patterns from the full alignment and then upsampling them to the original length, reducing the number of unique site patterns for the model testing phase. This can achieve a cost reduction of orders of magnitude [56].
Table 1: Computational Efficiency of the SU Approach on Empirical Datasets
| Dataset | Full MSA Analysis Memory (GB) | Full MSA Analysis Time (Hours) | SU Analysis Memory (GB) | SU Analysis Time (Hours) | Speed Increase Factor |
|---|---|---|---|---|---|
| Butterflies | 75.0 | 3140.7 | 0.04 | 0.67 | ~4,687x |
| Insects A | 115.4 | 5000.3 | 0.24 | 6.85 | ~730x |
| Mammals | 9.3 | 106.0 | 0.05 | 0.81 | ~131x |
| Yeasts | 13.0 | 973.5 | 0.03 | 0.63 | ~1,545x |
Q4: Is it acceptable to use a simpler model to save computational resources for a large dataset?
Yes, this is a common and often necessary practice. The "rule of thumb" is to pick the model with a smaller number of parameters for computational efficiency when computing resources are limited, especially for large datasets [22]. The SU approach provides a rigorous way to do this, as it reliably selects the correct model while using simpler, faster computations [56].
Symptoms: Model selection for a large multiple sequence alignment (MSA) takes days or weeks and consumes gigabytes of memory.
Solution: Implement the Subsample-Upsample (SU) method with an adaptive protocol like ModelTamer [56].
Experimental Protocol:
Subsample a small fraction (g) of unique site patterns from your full MSA without replacement. The minimum fraction (g_min) required for 99% accuracy is often very small (e.g., 0.5%–1.5% for many empirical datasets) [56]. This workflow reduces the computational burden by focusing analysis on a reduced set of unique site patterns.
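The subsample-then-upsample step can be sketched in a few lines. This toy version draws a fraction g of the unique site patterns and rescales their weights back to the original alignment length; the real SU protocol (and ModelTamer's adaptive choice of g) operates on genome-scale pattern counts and re-runs model selection on the reduced set:

```python
import random
from collections import Counter

def subsample_upsample(site_patterns, g, seed=0):
    """Illustrative sketch of the Subsample-Upsample (SU) idea: draw a
    fraction g of unique site patterns without replacement, then upsample by
    rescaling pattern weights so their total equals the original alignment
    length. Model selection then runs on the much smaller pattern set."""
    counts = Counter(site_patterns)            # pattern -> observed weight
    unique = list(counts)
    rng = random.Random(seed)
    m = max(1, round(g * len(unique)))
    chosen = rng.sample(unique, m)
    sub_total = sum(counts[p] for p in chosen)
    scale = len(site_patterns) / sub_total     # upsample to original length
    return {p: counts[p] * scale for p in chosen}

# Toy alignment columns (each string is one site pattern across 4 taxa):
cols = ["AAAA"] * 50 + ["AAGA"] * 30 + ["ACGT"] * 15 + ["AATA"] * 5
weights = subsample_upsample(cols, g=0.5)
print(len(weights), round(sum(weights.values())))
```

Because likelihood computation scales with the number of unique patterns, shrinking that set while preserving total weight is what delivers the orders-of-magnitude savings reported in Table 1.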
Symptoms: Different programs (jModelTest2, ModelTest-NG, IQ-TREE) or different criteria (AIC, AICc, BIC) suggest different best-fit models.
Solution: Standardize your workflow around the most reliable statistical criterion.
Diagnosis and Resolution Steps:
Table 2: Essential Software and Statistical Tools for Model Selection
| Item Name | Category | Function / Explanation |
|---|---|---|
| IQ-TREE | Software Package | A widely used phylogenetic software that includes ModelFinder for efficient model selection. Known for its speed and accuracy [22] [3]. |
| ModelTest-NG | Software Package | A popular, fast program for statistical selection of nucleotide substitution models [22] [3]. |
| jModelTest2 | Software Package | Another established software for model selection, providing a comprehensive set of models and criteria [22]. |
| Bayesian Information Criterion (BIC) | Statistical Criterion | A model selection criterion that heavily penalizes extra parameters. Demonstrated to be the most accurate for identifying the true substitution model [22] [3]. |
| Subsample-Upsample (SU) Approach | Computational Method | A protocol that drastically reduces computational costs for model selection on large alignments by analyzing a tiny fraction of site patterns [56]. |
| ModelTamer | Software Tool | An implementation of the adaptive SU approach that automatically selects subsamples to reliably estimate optimal models [56]. |
Model selection and model adequacy testing serve fundamentally different purposes in statistical phylogenetic analysis [57] [58].
Model Selection: A relative fit procedure that identifies the best model from a set of candidate models. Methods like hierarchical Likelihood Ratio Test (hLRT), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) compare models against each other but cannot determine if the best candidate is actually suitable for analyzing your data [58] [47].
Model Adequacy Testing: An absolute goodness-of-fit assessment that determines whether a selected model provides a plausible description of the evolutionary process that generated your data. It answers the critical question: "Could this model have plausibly produced my observed data?" [57] [58]
Think of it this way: model selection helps you choose the best tool from your toolbox, while model adequacy testing checks whether that tool is actually capable of doing the job properly [57].
Using an inadequate nucleotide substitution model can negatively impact multiple aspects of phylogenetic inference [57] [58]:
The table below summarizes key studies demonstrating these impacts:
Table 1: Documented Impacts of Inadequate Substitution Models
| Impact Area | Consequence | Supporting Evidence |
|---|---|---|
| Tree Topology | Inconsistent estimates, particularly in Felsenstein Zone conditions | [57] |
| Branch Lengths | Biased estimates affecting divergence time estimation | [57] [58] |
| Bipartition Support | Inflated or inaccurate bootstrap/posterior probability values | [58] |
| Positive Selection | Increased false positive rates in genome-wide scans | [59] |
Several statistical methods have been developed specifically for testing the adequacy of DNA substitution models:
Table 2: Comparison of Model Adequacy Testing Methods
| Method | Framework | Key Strength | Key Limitation |
|---|---|---|---|
| Goldman-Cox Test | Frequentist | Well-established methodology | Computationally intensive; shown to lack power in some studies [57] [58] |
| Posterior Predictive Simulation | Bayesian | Eliminates reliance on point estimates | Often fails to reject simpler models than GC test [58] |
| Pearson's X² with Binning | Frequentist | Demonstrated improved power; robust to low site-pattern counts [57] | Requires careful binning strategy |
Model selection criteria are limited to comparing candidate models and will always identify a "best" model even when all candidates are poor representations of the true evolutionary process [58]. Several factors contribute to this limitation:
As noted in a comprehensive simulation study, "the use of incorrect models can mislead phylogenetic inference," highlighting why adequacy checking is essential [47].
When your selected model fails an adequacy test, consider these troubleshooting steps:
Investigate Model Mis-specification Sources:
Consider More Complex Models:
Address Data Quality Issues:
Incorporate model adequacy testing as a mandatory step after model selection with this workflow:
Implementation Protocol for Pearson's Goodness-of-Fit Test [57]:
Estimate Model Parameters: Obtain maximum likelihood estimates of tree topology, branch lengths, and substitution model parameters under the null hypothesis model
Calculate Expected Site Pattern Frequencies: Using the estimated parameters, compute the expected probability for each possible site pattern
Bin Site Patterns: Apply a K-means clustering algorithm to group site patterns into bins, ensuring each bin has sufficient expected counts (typically ≥5)
Compute Test Statistic: Calculate Pearson's X² statistic comparing observed and expected frequencies across bins
Assess Significance: Generate null distribution via parametric bootstrap and compute P-value
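Steps 3 and 4 above can be sketched as follows. The published protocol uses K-means clustering for binning; for brevity this sketch uses a simple greedy merge that likewise guarantees each bin an expected count of at least 5. All counts are illustrative:

```python
def bin_patterns(obs_exp, min_expected=5.0):
    """obs_exp: (observed_count, expected_count) pairs, one per site
    pattern. Greedily merge patterns (in order of expected count) into
    bins whose expected count is >= min_expected -- a simple stand-in
    for the K-means binning step of the protocol."""
    bins, cur_o, cur_e = [], 0.0, 0.0
    for o, e in sorted(obs_exp, key=lambda t: t[1]):
        cur_o, cur_e = cur_o + o, cur_e + e
        if cur_e >= min_expected:
            bins.append((cur_o, cur_e))
            cur_o = cur_e = 0.0
    if cur_e > 0:                     # fold any remainder into the last bin
        if bins:
            o_last, e_last = bins.pop()
            bins.append((o_last + cur_o, e_last + cur_e))
        else:
            bins.append((cur_o, cur_e))
    return bins

def pearson_x2(bins):
    """Pearson's X^2 over binned (observed, expected) counts."""
    return sum((o - e) ** 2 / e for o, e in bins)

# Toy site-pattern counts (illustrative only).
pairs = [(3, 2.0), (1, 1.5), (6, 5.5), (10, 9.0), (0, 2.0)]
bins = bin_patterns(pairs)
print(bins, round(pearson_x2(bins), 3))
```

In the full protocol the significance of X² would then be assessed against a null distribution generated by parametric bootstrap (step 5), not the asymptotic chi-squared distribution.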
Model adequacy testing presents significant computational challenges that researchers should anticipate:
Mitigation Strategies:
Table 3: Key Resources for Model Adequacy Assessment
| Tool/Resource | Function | Application Context |
|---|---|---|
| ModelTest 3.7 / jModelTest | Model selection from GTR family | Initial model identification [58] [47] |
| DT-ModSel | Decision-theoretic model selection | Alternative model selection approach [58] [47] |
| BUSTED-E | Detection of positive selection with error correction | Accounts for alignment errors in selection inference [59] |
| Dirichlet Process Mixture Models | Bayesian nonparametric modeling of site heterogeneity | Accommodates unknown number of rate categories [10] |
| PRANK-C | Codon-aware sequence alignment | Reduces alignment errors that affect model adequacy [59] |
| Custom Binning Algorithms | Site pattern grouping for X² tests | Enables powerful goodness-of-fit tests [57] |
To ensure robust phylogenetic conclusions, adopt these practices:
Always Pair Model Selection with Adequacy Testing: Model selection alone provides a false sense of security; adequacy testing validates that your chosen model is biologically plausible [57] [58]
Use Both Bootstrap and Adequacy Assessment: These provide complementary information - bootstrap assesses whether you have enough data, while adequacy testing assesses whether your model is appropriate [57]
Report Adequacy Test Results: When publishing, include information about model adequacy tests alongside model selection results to provide complete methodological transparency [57]
Exercise Caution with Inadequate Models: "We caution against deriving strong conclusions from analyses based on inadequate models. At a minimum, those results derived from inadequate models can now be readily flagged using the new test, and reported as such." [57]
In phylogenetic analysis, selecting the best-fitting nucleotide substitution model is crucial for obtaining accurate evolutionary inferences. Pearson's Goodness-of-Fit test, when applied to site pattern binning, provides a statistical framework to assess how well a candidate substitution model explains the observed patterns in your sequence alignment. A significant result indicates the model does not adequately fit your data, suggesting you should consider a more complex model or different modeling approach [62] [63].
Pearson's Chi-squared (χ²) test quantifies the discrepancy between observed and expected frequencies. The test statistic is calculated as:
χ² = Σ [(Oᵢ - Eᵢ)² / Eᵢ]
Where Oᵢ is the observed frequency for bin i, and Eᵢ is the expected frequency under the null hypothesis that the model fits the data. For site pattern analysis, these "bins" are the unique site patterns observed in a multiple sequence alignment. Under the null hypothesis of a good fit, the χ² statistic follows a Chi-squared distribution with degrees of freedom equal to the number of non-empty bins minus the number of estimated parameters minus one [62] [64].
Site pattern binning involves categorizing each column in your multiple sequence alignment based on the unique pattern of nucleotides across taxa.
Table 1: Steps for Pearson's GOF Test Calculation
| Step | Task | Description and Formula |
|---|---|---|
| 1 | State Hypotheses | H₀: The candidate substitution model is a good fit for the data.Hₐ: The candidate substitution model is not a good fit for the data [64]. |
| 2 | Calculate χ² Statistic | χ² = Σ [(Oᵢ - Eᵢ)² / Eᵢ], summed over all bins i [62]. |
| 3 | Determine Degrees of Freedom (df) | df = (Number of non-empty bins - 1) - (Number of independently estimated model parameters) [62]. |
| 4 | Obtain P-value | Use the Chi-squared distribution with the calculated df to find the probability (P-value) of observing a χ² value as extreme as, or more extreme than, the one calculated, assuming H₀ is true. |
| 5 | Draw Conclusion | If the P-value is less than your significance level (e.g., α = 0.05), you reject the null hypothesis, indicating a poor fit [64]. |
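Steps 2–5 of the table can be worked through on toy numbers. The observed and expected counts below are invented for illustration; the critical value 7.815 is the standard chi-squared table value for α = 0.05 at df = 3:

```python
# Worked toy example of steps 2-5 (counts are invented for illustration).
observed = [30, 25, 20, 15, 10]
expected = [28.0, 27.0, 22.0, 13.0, 10.0]

# Step 2: Pearson's X^2 statistic.
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 3: df = (non-empty bins - 1) - independently estimated parameters.
n_params = 1                      # suppose one model parameter was estimated
df = (len(observed) - 1) - n_params

# Steps 4-5: compare against the critical value for alpha = 0.05 at df = 3
# (7.815, from standard chi-squared tables); in practice
# scipy.stats.chi2.sf(x2, df) gives the exact P-value.
critical_value = 7.815
print(f"X^2 = {x2:.3f}, df = {df}, reject H0: {x2 > critical_value}")
```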
The following diagram illustrates the logical flow from data preparation to final inference:
Table 2: Research Reagent Solutions for Model Fit Testing
| Tool / Reagent | Function | Implementation Example / Note |
|---|---|---|
| RevBayes | Bayesian Phylogenetic Analysis | Implements a full suite of nucleotide substitution models (JC, F81, HKY, GTR, etc.) for calculating expected probabilities [9]. |
| Biopython | Python Library for Bioinformatics | Bio.Phylo and related modules can parse alignments and trees, facilitating custom script development for pattern counting and test implementation [66]. |
| PhyML | Phylogenetic Estimation | Command-line tool (e.g., via Bio.Phylo.Applications.PhymlCommandline) for model-based analysis. Useful for generating expected values under a model [67]. |
| RAxML | Large-Scale Phylogeny Inference | Similar to PhyML, a widely used tool for model-based inference that can be integrated into a testing pipeline [67]. |
| R/Statistical Packages | Statistical Computing | Functions like chisq.test() can perform the final test calculation once observed and expected frequencies are prepared [68]. |
| AsymmeTree | Python Simulation Package | While designed for simulation, it can generate sequences under known models, providing a perfect benchmark for validating your GOF test pipeline [69]. |
Sparse counts are a common issue with complex patterns and many taxa: the number of possible site patterns grows exponentially with the number of sequences, leaving most patterns with few or no observations.
Validate your pipeline by simulating data under a known model with a tool such as AsymmeTree or phastSim [69] [70]; this confirms whether the issue lies with your data or with the method.
Not necessarily. A rejection indicates that none of the standard time-reversible models (e.g., GTR) perfectly capture the complexity of your data. This is a known issue, particularly for large and diverse datasets [65].
This is a critical assumption for the χ² approximation to be valid.
The core principle is identical. The main difference lies in the binning process and the calculation of expected frequencies.
Answer: The impact depends on your specific dataset and analytical goals. Multiple studies reveal that while different criteria often select different models, the practical effect on tree topology may be limited in many cases.
Recommendation: If your primary concern is the branching pattern (topology), your choice among standard criteria may not be critical. However, for analyses where high precision is required, you should perform model selection and test if the resulting tree is sensitive to the chosen model.
Answer: Research provides conflicting recommendations, but BIC is often identified as a strong candidate.
Recommendation: For general use, BIC is a robust choice due to its high accuracy. However, for specific tasks like ancestral sequence reconstruction, you may find that the choice of criterion has minimal impact on your results.
Answer: Under certain conditions, this can be a valid time-saving strategy. Research has shown that using the most parameter-rich model, GTR+I+G, can lead to phylogenetic inferences (topologies and ancestral sequences) that are similar to those obtained under a model selected through a formal procedure [71]. This suggests that for some applications, this complex model is sufficient to capture the evolutionary process in the data without misestimating the tree.
Recommendation: Skipping model selection can be a reasonable approach, especially for initial explorations or when computational time is a major constraint. However, the most rigorous practice remains to statistically justify your model choice for your final, published results.
Answer: The software itself may be less important than the criterion you choose. A recent study that analyzed model selection across three widely-used programs (jModelTest2, ModelTest-NG, and IQ-TREE) found that the choice of program did not significantly affect the ability to accurately identify the true nucleotide substitution model [20]. This indicates that researchers can confidently use any of these mainstream tools.
Table 1: Performance of different model selection criteria based on simulated datasets.
| Criterion | Accuracy in Recovering True Model | Tendency | Impact on Final Tree Topology |
|---|---|---|---|
| AIC | Variable; can be high for complex models but moderate for others [47]. | Tends toward more complex models [71] [47]. | Can yield topologically different trees for ~5% of datasets [30]. |
| AICc | Similar to AIC but with a correction for sample size [30]. | Tends toward more complex models. | Similar to AIC, but less commonly used in practice [71]. |
| BIC | High accuracy and precision; often outperforms AIC [20] [47]. | Tends toward simpler models [71] [47]. | Similar to other criteria; topological differences are often minimal [71]. |
| DT | High accuracy and precision, similar to BIC [47]. | Tends toward simpler models [47]. | Lowest dissimilarity with BIC; often selects the same model [47]. |
| hLRT | Poor performance when true model includes invariable sites [47]. | Favors complex models; depends on testing hierarchy [47]. | Can yield topologically different trees for ~5% of datasets [30]. |
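The three information criteria in the table differ only in their penalty terms. A stdlib sketch with invented log-likelihoods; the free-parameter counts are the standard substitution-parameter counts, and shared branch-length parameters are omitted because they are identical across candidates fitted on the same tree:

```python
import math

def aic(lnL, k):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * lnL

def aicc(lnL, k, n):
    """AIC with the small-sample correction (n = number of sites)."""
    return aic(lnL, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnL, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * lnL

# Toy comparison. Log-likelihoods are invented; parameter counts are
# standard (JC69: 0; HKY85: 4 = 3 base frequencies + 1 ts/tv ratio;
# GTR: 8 = 5 exchangeabilities + 3 base frequencies).
n = 1000  # alignment length in sites
candidates = {"JC69": (-5120.0, 0), "HKY85": (-5060.0, 4), "GTR": (-5055.0, 8)}

scores = {m: bic(lnL, k, n) for m, (lnL, k) in candidates.items()}
best = min(scores, key=scores.get)
print("best by BIC:", best)
```

Note how BIC's k·ln(n) penalty makes GTR's two extra log-likelihood units too expensive here, illustrating its tendency toward simpler models.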
Table 2: Community usage and practical recommendations.
| Aspect | Findings & Recommendations |
|---|---|
| Community Practice | Survey of 300 studies showed 41% used AIC, 21% BIC, and 5% AICc, while 36% did not specify the criterion [71]. |
| Software Choice | The choice between jModelTest2, ModelTest-NG, and IQ-TREE does not significantly affect model selection accuracy [20]. |
| Default Model Use | Using GTR+I+G by default can be a valid time-saving strategy, leading to similar inferences as model selection [71]. |
Objective: To determine the best-fit nucleotide substitution model for a given dataset and assess the impact of the selected model on the final phylogenetic tree topology.
Materials:
Methodology:
Model Selection:
Phylogenetic Tree Inference:
Topological Comparison:
Interpretation:
Table 3: Key software tools for nucleotide substitution model selection and phylogenetic analysis.
| Tool Name | Function | Use Case |
|---|---|---|
| jModelTest2 [30] | Statistical selection of best-fit nucleotide substitution models from a set of candidates. | Standard model selection for single-gene or multi-gene alignments. |
| ModelTest-NG [20] | A newer, faster implementation for model testing, a successor to jModelTest. | Rapid model selection on large datasets or when performing multiple tests. |
| IQ-TREE [20] | Integrated tool for model selection and ultrafast phylogenetic inference. | One-stop workflow for model selection (with ModelFinder) and tree building. |
| Phylogenetic Likelihood Library (PLL) [30] | A high-performance library that implements the core likelihood calculations. | Foundation for many software tools; used for developing custom analysis pipelines. |
FAQ 1: Why is model selection critical for phylogenetic analysis? Selecting the best-fit nucleotide substitution model is a critical step in phylogenetic analysis because distinct substitution models can change the outcome of phylogenetic analyses, directly influencing the reliability of the resulting trees and downstream interpretations [22]. Using an inappropriate model can lead to biased estimates of evolutionary relationships.
FAQ 2: Which information criterion should I use for model selection? Your choice of information criterion significantly impacts the selected model. Studies demonstrate that the Bayesian Information Criterion (BIC) consistently outperforms both AIC and AICc in accurately identifying the true model of evolution, regardless of the software program used [22]. BIC more heavily penalizes extra parameters, which helps avoid overfitting.
FAQ 3: Are the results from different model selection software comparable? Yes. Research shows that the choice of program (e.g., jModelTest2, ModelTest-NG, or IQ-TREE) does not significantly affect the ability to accurately identify the true nucleotide substitution model, indicating these programs offer comparable accuracy [22]. However, they may differ in computational speed and specific features.
FAQ 4: What is an advanced feature that can improve model selection? Some software, like IQ-TREE's ModelFinder, offers an advanced search option that concurrently searches both model-space and tree-space. This can lead to a significantly better fit between the tree, model, and data compared to the default option, which tests models on a fixed starting tree [32].
FAQ 5: My selected model has many parameters. Is this a problem? While a model with more parameters might fit your data better, it can lead to overfitting, especially with smaller datasets. BIC is often preferred as it selects simpler models. A general rule is to pick the model with a smaller number of parameters for computational efficiency when dealing with large datasets and limited computing resources [22].
Problem: AIC, AICc, and BIC select different best-fit models for the same dataset.
Solution:
Problem: The final phylogenetic tree is poorly supported or conflicts with known biology.
Solution:
Problem: Model selection is computationally prohibitive for genome-scale alignments.
Solution:
This protocol describes how to validate model selection procedures using simulated data.
1. Data Simulation:
2. Model Selection:
3. Validation:
Experimental Workflow Diagram:
1. Tree Inference:
2. Tree Comparison:
Table 1: Example Impact of Model Selection on Phylogenetic Trees from Published Studies
| Data Type and Source | ModelFinder Best Model (BIC) | Alternative Model | ΔBIC | RF Distance |
|---|---|---|---|---|
| DNA, Lassa virus [32] | SYM+R5 | SYM+I+Γ4 | 215 | 16 |
| DNA, mitochondrial, mammals [32] | GTR+R8 | GTR+I+Γ4 | 2,632 | 16 |
| Protein, plastids, green plants [32] | JTT+F+R10 | JTT+F+I+Γ4 | 8,486 | 4 |
Table summarizing real examples where selecting a more appropriate model (with a much better BIC score) resulted in different phylogenetic tree topologies [32].
Table 2: Essential Research Reagents & Software for Model Selection Benchmarking
| Item Name | Type | Function / Application |
|---|---|---|
| IQ-TREE (with ModelFinder) | Software Package | Performs model selection and phylogenetic inference. Features advanced rate-heterogeneity models such as FreeRate (+Rk) and concurrent tree-space search [32]. |
| jModelTest2 | Software Package | Selects best-fit nucleotide substitution models using likelihood statistics, AIC, AICc, and BIC [22]. |
| ModelTest-NG | Software Package | Selects best-fit nucleotide substitution models; significantly faster than jModelTest2 [22]. |
| AliSim | Software Tool | Simulates evolutionary sequences under a known model and tree for benchmarking studies [22]. |
| BIC (Bayesian Information Criterion) | Statistical Criterion | Model selection criterion that consistently demonstrates high accuracy in identifying the true model by strongly penalizing extra parameters [22]. |
| Robinson-Foulds Distance | Metric | Quantifies topological differences between two phylogenetic trees, used to assess the impact of model choice [32]. |
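Since the Robinson-Foulds distance is central to these comparisons, here is a minimal pure-Python sketch for small trees. It assumes simple newick strings without branch lengths; in practice a library such as DendroPy or ete3 would be used:

```python
def parse_newick(s):
    """Parse a simple newick string (names only, no branch lengths)."""
    s = s.strip().rstrip(";")
    pos = 0
    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1                      # consume '('
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1                      # consume ')'
            return tuple(children)
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]
    return node()

def taxa(n):
    """All leaf names under a node."""
    return (frozenset().union(*(taxa(c) for c in n))
            if isinstance(n, tuple) else frozenset([n]))

def splits(tree, all_taxa):
    """Nontrivial bipartitions induced by internal edges, each stored as
    the side containing the alphabetically first taxon (canonical form)."""
    out = set()
    def walk(n):
        if not isinstance(n, tuple):
            return frozenset([n])
        acc = frozenset()
        for c in n:
            acc |= walk(c)
        if 1 < len(acc) < len(all_taxa) - 1:
            out.add(acc if min(all_taxa) in acc else all_taxa - acc)
        return acc
    walk(tree)
    return out

def rf_distance(newick1, newick2):
    """Robinson-Foulds distance: size of the symmetric difference of the
    two trees' nontrivial bipartition sets."""
    t1, t2 = parse_newick(newick1), parse_newick(newick2)
    all_taxa = taxa(t1)
    assert taxa(t2) == all_taxa, "trees must share one taxon set"
    return len(splits(t1, all_taxa) ^ splits(t2, all_taxa))

print(rf_distance("((A,B),(C,D));", "((A,C),(B,D));"))
```

An RF distance of 0 means identical unrooted topologies; for four taxa the maximum possible distance is 2 (one conflicting internal split in each tree).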
The GTR+I+Γ (General Time Reversible + Invariant sites + Gamma-distributed rate variation) model is a complex, parameter-rich model that often provides a good fit for phylogenetic data. However, it is not universally optimal. Specific model features can make it inadequate or unidentifiable in certain situations, particularly when the assumptions of invariant sites and gamma-distributed rate variation are in conflict [71] [73]. Using an inadequate model can, despite some evidence to the contrary, lead to erroneous estimations of phylogeny and branch lengths [71]. The standard model selection tests can sometimes fail to detect these issues, requiring more sophisticated diagnostic procedures.
Several warning signs can appear during or after a phylogenetic analysis, suggesting potential problems with the GTR+I+Γ model. The table below summarizes the key diagnostics to monitor.
Table: Key Diagnostics for Identifying an Inadequate GTR+I+Γ Model
| Diagnostic Area | Specific Indicator | What It Suggests |
|---|---|---|
| Parameter Identifiability | High correlation between the shape parameter (α) of the Gamma distribution and the proportion of invariant sites (I) in MCMC samples [73]. | The model is overparameterized; the data cannot distinguish between rates being low because a site is invariant or because it falls in the slow category of the Gamma distribution. |
| Model Fit & Comparison | A model with only Gamma-distributed rates (GTR+G) has a similar or better marginal likelihood than the GTR+I+G model [10]. | The "+I" parameter is not improving the model fit and may be redundant. |
| MCMC Convergence | Poor convergence and mixing of the MCMC chain for the `pinvar` (proportion of invariant sites) and `alpha` (Gamma shape) parameters, even with long runtimes [74]. | The model is struggling to find a single optimal solution for the rate distribution across sites. |
When standard model selection criteria like AICc or BIC are inconclusive, a rigorous Bayesian model averaging and diagnosis protocol can help reveal issues with the GTR+I+Γ model [10].
1. Problem: Suspected non-identifiability or poor fit of the GTR+I+Γ model for a given nucleotide alignment.
2. Hypothesis: The GTR+G model (without invariant sites) or a multi-model mixture provides a better and more stable explanation of the evolutionary process for the data.
3. Experimental Workflow:
The following diagram outlines the core diagnostic workflow.
4. Detailed Methodologies:
If Bayesian model comparison disfavors the +I parameter, it is strong evidence against the GTR+I+Γ model.
5. Expected Outcome: If the GTR+I+Γ model is inadequate, you will observe a strong negative correlation between α and pinvar, poor MCMC convergence for these parameters, and Bayesian model selection will favor the GTR+G model or other mixture models. Proceeding with the simpler GTR+G model or a more complex mixture model will lead to more reliable and interpretable results.
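A quick way to screen for the α–pinvar trade-off is to correlate the two columns of the post-burn-in trace. A stdlib sketch with invented sample values exhibiting the pathological pattern:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient (pure stdlib)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented post-burn-in MCMC samples showing the trade-off: whenever
# alpha rises, pinvar falls, because the data cannot separate "invariant
# site" from "very slow Gamma rate category".
alpha  = [0.42, 0.55, 0.38, 0.61, 0.47, 0.71, 0.35, 0.58]
pinvar = [0.30, 0.22, 0.33, 0.18, 0.27, 0.12, 0.36, 0.21]

r = pearson_r(alpha, pinvar)
print(f"correlation(alpha, pinvar) = {r:.2f}")
if r < -0.9:
    print("Strong negative correlation: '+I' may be redundant with Gamma.")
```

On a real analysis the same correlation can be computed from the MCMC log file (e.g., the `alpha` and `pinvar` columns exported for Tracer).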
Table: Key Resources for Advanced Model Diagnostics
| Item Name | Function / Explanation |
|---|---|
| BEAST 2 with SDPM Package | A software platform for Bayesian evolutionary analysis. The SDPM package allows for model selection and averaging across site heterogeneity, directly testing the necessity of the +I+Γ parameterization [10]. |
| MrBayes | Another popular software for Bayesian phylogenetic inference. It supports rjMCMC for model selection between a set of predefined nucleotide substitution models. |
| Tracer | A graphical tool for analyzing the output of MCMC runs. It is essential for visualizing trace plots, calculating ESS values, and investigating correlations between parameters like alpha and pinvar [74]. |
| RevBayes | A highly flexible environment for phylogenetic inference, allowing users to build and test custom models, including various nucleotide substitution models [9]. |
| PartitionFinder | Uses a greedy heuristic algorithm to find the partition scheme and substitution model that maximizes the likelihood. It provides a comparative, non-Bayesian approach to model selection [10]. |
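The ESS values that Tracer reports can be approximated with the standard autocorrelation-time estimator. A stdlib sketch, in which a seeded AR(1) chain stands in for a poorly mixing MCMC trace (real analyses should rely on Tracer's own estimates):

```python
import random

def ess(samples):
    """Effective sample size: N / (1 + 2 * sum of autocorrelations),
    summing lags until the estimated autocorrelation first drops below
    zero (a simple cutoff; Tracer's estimator is similar in spirit)."""
    n = len(samples)
    mean = sum(samples) / n
    dev = [x - mean for x in samples]
    var = sum(d * d for d in dev) / n
    if var == 0:
        return float(n)
    acf_sum = 0.0
    for lag in range(1, n // 2):
        rho = sum(dev[i] * dev[i + lag] for i in range(n - lag)) / (n * var)
        if rho < 0:
            break
        acf_sum += rho
    return n / (1 + 2 * acf_sum)

rng = random.Random(42)
iid = [rng.random() for _ in range(1000)]       # well-mixed chain
ar = [0.0]
for _ in range(999):                            # sticky chain (AR(1))
    ar.append(0.95 * ar[-1] + rng.gauss(0, 1))

print(f"ESS well-mixed: {ess(iid):.0f}, ESS sticky: {ess(ar):.0f}")
```

A common rule of thumb is to require ESS > 200 for key parameters such as `alpha` and `pinvar` before trusting their posterior estimates.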
Selecting the best-fit nucleotide substitution model is not a mere preliminary step but a foundational component of rigorous phylogenetic analysis. The process requires a balanced approach, starting with a solid understanding of model foundations, leveraging powerful and comparable software with a preference for the BIC criterion, and proactively addressing heterogeneity through partitioning or mixture models. Crucially, the workflow must culminate in model adequacy testing to validate that the chosen model is plausible for the data, as even the most complex model can be inadequate. For biomedical and clinical research, where conclusions may inform drug target identification or understanding of pathogen evolution, adopting this comprehensive approach is paramount. Future directions point towards increased use of Bayesian model averaging, the development of more biologically realistic non-reversible and mutation-selection models, and the integration of these methods into standardized, reproducible pipelines to ensure the highest level of reliability in evolutionary inference.