## Abstract

In 1994, Muse and Gaut (MG) and Goldman and Yang (GY) proposed evolutionary models that recognize the coding structure of the nucleotide sequences under study, by defining a Markovian substitution process with a state space consisting of the 61 sense codons (assuming the universal genetic code). Several variations and extensions to their models have since been proposed, but no general and flexible framework for contrasting the relative performance of alternative approaches has yet been applied. Here, we compute Bayes factors to evaluate the relative merit of several MG and GY styles of codon substitution models, including recent extensions acknowledging heterogeneous nonsynonymous rates across sites, as well as selective effects inducing uneven amino acid or codon preferences. Our results on three real data sets support a logical model construction following the MG formulation, allowing for a flexible account of global amino acid or codon preferences, while maintaining distinct parameters governing overall nucleotide propensities. Through posterior predictive checks, we highlight the importance of such a parameterization. Altogether, the framework presented here suggests a broad modeling project in the MG style, stressing the importance of combining and contrasting available model formulations and grounding developments in a sound probabilistic paradigm.

CODON-BASED Markovian substitution models are widely recognized as attractive descriptions of molecular evolution (Yang 2006). There has also been an increasing recognition that some of these modeling approaches can be attributed population genetic interpretations (Thorne *et al.* 2007), and with the emerging juncture of population genetics and molecular evolutionary modeling (Felsenstein 2007), the codon-based framework could have much to contribute to these domains. A panoply of codon substitution models are now available, most of which consist of modifications and extensions of the seminal works of Muse and Gaut (1994) (MG) and Goldman and Yang (1994) (GY).

Suppose an alignment of nucleotide sequences, related according to a given phylogenetic tree. The more traditional level of interpretation surmises each nucleotide column of the alignment as arising from an independent continuous-time Markov process running over the tree and with a state space consisting of the four different nucleotides. In its general reversible form (*e.g.*, Lanave *et al.* 1984), the model is represented using six relative exchangeability parameters (with 5 effective d.f.), for each possible (unordered) pair of nucleotides, and four nucleotide propensities (with 3 effective d.f.), and is often referred to as the *general time reversible* (GTR) model. Taking this model as a starting point in the case of protein-coding sequences, a first step to mechanistically acknowledging the coding nature of the data is to suppose a strong purifying selection against stop codons; the process is reformulated in a state space consisting of nucleotide triplets, omitting triplet states corresponding to stops. In effect, such a model is equivalent to the same GTR type of model applied at the nucleotide level, but with the constraint that the nucleotide sequence must encode some full-length amino acid sequence (one-third the length of the nucleotide sequence). This is the rationale of the codon substitution models in the style of MG. Importantly, it implies that the rates of substitution (entries of the generator of the Markov processes) are proportional to the *target nucleotide* (at the mutating position). From this point, a further model construction step in the MG style is to distinguish between synonymous and nonsynonymous events, for instance utilizing the parameterization presented in the original work or the more compact representation of fixing the synonymous rate factor at one, with a free nonsynonymous rate factor.

In contrast with the MG style of model formulation, the GY models have entries of the Markov generator proportional to the stationary probability of the *target codon*. The contrast between the two formulations can be made very subtle. Indeed, a GY-style model can be specified from the same six nucleotide relative exchangeability and four nucleotide propensity parameters used in the MG-style model above. In such a case, codon stationary probabilities are approximated as proportional to the product of the three propensity parameter values associated with the nucleotides at each of the three codon positions (however, such a model entails peculiar properties, as we discuss in later sections). Another option available in the GY style is based on a full 61-dimensional (with 60 effective d.f., assuming the universal genetic code) vector of codon stationary probabilities (F61) (*e.g.*, Huelsenbeck and Dyer 2004; Huelsenbeck *et al.* 2006; Yang 2006). The GY-F61 approach has been suggested as important in giving “more freedom for the model to explain the data by modifying substitution rates using codon frequencies” (Huelsenbeck and Dyer 2004, p. 670). This may be the case, but the GY-F61 model has no natural mechanistic interpretation at the nucleotide level; nucleotide propensities have no direct parameterization in this formulation, and are only implicitly modeled, in a manner confounded with other effects inducing uneven codon stationary probabilities.

Another widely used modeling idea, adopted in both MG and GY formulations, has been to assign a separate set of nucleotide propensity parameters to each of the three codon positions. The distinction with the previously mentioned models is commonly referred to as F1 × 4 *vs.* F3 × 4, reflecting the use of a single *vs.* three vectors of dimension 4 (3 d.f.). However, the F3 × 4 configuration stands only as a phenomenological account of how the coding structure of the data induces a periodic pattern along the nucleotide sequence; there is no mechanistic sense to modeling features induced by selection via an expanded parameterization at the nucleotide level. In other words, this expanded parameterization may capture net resultants of the coding nature of the sequences, but it is not representative of our understanding of the causative factors bearing on the evolutionary process. Differences observed at each of the three codon positions are most likely the result of factors bearing on amino acid or codon preferences, or other high-order features, and should logically be modeled as such.

Recently, Yang and Nielsen (2008) studied models that address these issues, with explicit parameterizations bearing on amino acid or codon preferences within the MG formulation. From their analysis of mammalian genes, they find support for their formulation and hence further validate a modeling strategy aimed at disentangling mutational and selective features (Thorne 2007; Thorne *et al.* 2007; Yang and Nielsen 2008).

However, the ongoing developments of codon substitution models need to be complemented with rigorous evaluation and selection approaches, to contrast the relative merit of different modeling ideas. The Bayes factor (Jeffreys 1935) provides a natural framework for this purpose, with the theoretical advantage of being applicable to the comparison of any pair of models; it does not require models to be nested, and it intrinsically penalizes for higher-dimensional formulations. Bayes factors have already been used for comparing Markovian models of substitution at either the nucleotide level (*e.g.*, Suchard *et al.* 2001) or the amino acid level (*e.g.*, Lartillot and Philippe 2006). As discussed in Lartillot and Philippe (2006), reliable evaluations of Bayes factors for complex evolutionary models require elaborate computational devices. These computational demands are compounded at the codon level of interpretation, where the state space invoked leads to more taxing likelihood calculations than at the nucleotide or amino acid levels. Nonetheless, the application of Monte Carlo procedures running on modern computing machines makes these calculations feasible, thus allowing for broad codon substitution model comparisons.

In this work, we conduct such a Bayesian analysis of model fit and include all of the above-mentioned MG- and GY-style models, comparing the F1 × 4, F3 × 4, and F61 (in the GY context) configurations. We also include similar versions to the models of Yang and Nielsen (2008), which allow for a flexible account of either global amino acid preferences or global codon preferences. Each of these model configurations is contrasted in turn with a modeling of nonsynonymous rate heterogeneity, using the *Dirichlet process* device (Ferguson 1973) as described by Huelsenbeck *et al.* (2006). From our analysis of three real data sets, our findings indicate that a mechanistic MG-style modeling strategy that explicitly recognizes uneven codon preferences, while accounting for a global background of nucleotide propensities and heterogeneous nonsynonymous substitution rates, tends to outperform other models (or match top performing models). We explore aspects of the posterior distributions under the top models and conduct posterior predictive assessments (Rubin 1984; Gelman *et al.* 1996, 2004) highlighting implicit properties of the models.

## DATA

We used the following data sets, which we refer to here using a shorthand indicating the number of sequences and their length in number of codons: Globin*17-144*, 17 vertebrate sequences of the β-globin gene, described in Yang *et al.* (2000a); Lysin*25-134*, 25 abalone sperm lysin sequences, described in Yang *et al.* (2000b); and Hiv*22-99*, 22 sequences of the human immunodeficiency virus type 1 protease, described in Doron-Faigenboim and Pupko (2007). For computational reasons, we worked under a fixed tree throughout and, in all three cases, used the same topologies as those used in the works cited for each data set.

## MODELS

The models considered here are defined according to continuous-time Markov processes, with the infinitesimal generators written as *Q* = [*Q _{ij}*], specifying the instantaneous rate of substitution from codon

*i*to codon

*j*—we write

*i*(

_{c}*j*) to refer to the nucleotide state at position

_{c}*c*of codon

*i*(

*j*). The matrix

*Q*(under the universal genetic code with stop codons excluded) is 61 × 61 (

*i.e.*, 1 ≤

*i*,

*j*≤ 61, and 1 ≤

*c*≤ 3). It can be significantly simplified by viewing all substitutions as resulting from point mutations, such that each entry either is set to 0 (when states differ by two or three nucleotides) or is constructed from low-dimensional components. In the next sections, we describe these components in detail, constructing the entries of

*Q*following modeling approaches inspired from Muse and Gaut (1994) and Goldman and Yang (1994), with extensions proposed by Huelsenbeck

*et al.*(2006) and Yang and Nielsen (2008). The models are not identical to those presented in these original works, but correspond to flexible generalizations, while allowing us to focus on the distinguishing features of interest.

#### MG-style models:

We begin with the mechanistic modeling standpoint proposed by Muse and Gaut (1994), with a Markov generator given by(1)where is a set of (symmetrical) nucleotide relative exchangeability parameters, with the (arbitrary) constraint ; φ = (φ_{a})_{1≤a≤4}, with , represents a set of global nucleotide equilibrium propensities; and ω is the coefficient bearing on nonsynonymous rates, for now treated as a global parameter.

When ω = 1, this model corresponds to the well-known GTR model invoked for nucleotide-level interpretations, but with the purifying constraint against all stop codons. Here, however, ω is always treated as a free parameter, and the model is referred to as MG-F1 × 4.

Following in the MG style, Yang and Nielsen (2008) introduced a model with parameters bearing on codon fitness. A convenient parameterization of equivalent dimensionality is given as(2)where ψ = (ψ_{j})_{1≤j≤61}, with , represents a set of 61 codon preference parameters (60 d.f.). The exponent ensures reversibility (see the appendix), although the model could be expanded at this level as well (see Goldman and Whelan 2002). As it is, entries corresponding to substitutions from an unpreferred codon to a preferred codon () will be higher than entries corresponding to substitutions from a preferred to an unpreferred codon (), and in this way, an explicit account of global codon preference (CP) is included while maintaining an account of background nucleotide propensities. We refer to this model as MG-F1 × 4-CP.

Note that the codon preferences captured by ψ can be the result of several factors, including, for instance, global amino acid preferences. One way of assessing whether the CP model is capturing effects beyond those of global amino acid preferences is to compare it with a simplified version of the CP formulation, which accounts only for such features as given by(3)where ϕ = (ϕ_{k})_{1≤k≤20} is a 20-dimensional (19-d.f.) vector associated with amino acid preferences (AAP), and where *f*(*i*) returns an index corresponding the amino acid encoded by codon *i*. As in the case of the CP model, entries corresponding to substitutions from an unpreferred amino acid to a preferred amino acid () will thus be higher than entries corresponding to substitutions from a preferred to an unpreferred amino acid (). We refer to this model as MG-F1 × 4-AAP.

Finally, despite departing from the mechanistic modeling perspective, we also investigate the F3 × 4 configurations for the models defined in (1)–(3), by substituting φ appropriately with codon position-specific nucleotide propensity parameters, written as φ^{(c)} = (φ)_{1≤a≤4}, where ∀*c*, 1 ≤ *c* ≤ 3, . The MG-F3 × 4 model is thus written as(4)the MG-F3 × 4-CP model as(5)and the MG-F3 × 4-AAP model as(6)

#### GY-style models:

The models in the style proposed by Goldman and Yang (1994) have Markov generators specified as(7)where π = (π_{i})_{1≤i≤61}, with , represents a 61-dimensional (60 d.f.) vector of codon stationary probabilities (distinct from ψ).

Several options for π are available. First, it can be based on a set of global nucleotide propensity parameters according to(8)We refer to this model as GY-F1 × 4. Careful examination of this model reveals a number of peculiar properties, which seem undesirable. For instance, in a mutational context prone to events leading to A or T (*i.e.*, where the parameters φ_{A} and φ_{T} tend to be higher than φ_{C} and φ_{G}) a substitution from codon CGC to CTC would have a lower instantaneous rate than a substitution from codon ATA to AGA. In such a scenario, the rate of an event involving the second codon position depends on the nucleotide states at the first and third positions, which, in this case, leads to the higher rate for the substitution going against the mutational bias. From the mechanistic model construction described by the MG strategy, however, there are no obvious reasons for linking a change at the second position to the states at the first and third positions, unless this is mediated by selective effects at the codon level (*e.g.*, stop codons). Accordingly, for this same instance, the MG model displays the reverse situation, with the CGC to CTC substitution having a higher instantaneous rate than the ATA to AGA substitution in a manner consistent with the mutational bias.

Another similar choice for the GY models is to base π on codon-position-specific nucleotide equilibrium frequencies,(9)in which case we refer to the model as GY-F3 × 4. Note that the GY-F1 × 4 and MG-F1 × 4 models, as well as the GY-F3 × 4 and MG-F3 × 4 models, are respectively constructed from the exact same parameters; they also have the same limiting distributions and hence differ only in terms of their transient specifications (further details on this point are given in Yang and Nielsen 2008, as well as in the appendix). Finally, we consider the case where π is directly free—a full 61-dimensional (60-d.f.) vector—which we refer to as GY-F61.

The limiting distributions of all models are given in full in the appendix, along with further details specific to our implementation.

#### Priors:

Our prior on branch lengths is *exponential*, with a mean determined by a hyperparameter, itself endowed with an exponential prior of mean 1. Adopting the approach presented by Huelsenbeck *et al.* (2006), our most general prior on nonsynonymous rate factors is the Dirichlet process (DP)—as an infinite mixture across sites—with hyperparameter α, modulating the assumed “graininess” of nonsynonymous heterogeneity; α is endowed with an exponential prior of mean 1. The Dirichlet process prior also utilizes a base measure, defining the probability distribution of each component; as in Huelsenbeck *et al.* (2006), we use *p*(ω) = 1/(1 + ω)^{2}, the probability density of the ratio of two identically distributed draws from an exponential. This same base prior is used when dispensing with the DP framework, with the model based on a single global ω-factor. All other parameters have flat Dirichlet priors on their respective state space.

## MODEL COMPARISONS

#### Bayes factors:

Given a data set *D* and a model *M*, specified by some high-dimensional parameter vector θ ∈ Θ, we wish to evaluate the *predictive probability*, or *marginal likelihood*, written as *p*(*D* | *M*) and obtained by averaging the likelihood *p*(*D* | θ, *M*) over the prior *p*(θ | *M*):(10)

When comparing two models of interest, *M*_{0} and *M*_{1}, the Bayes factor (*B*_{01}) provides a measure of the *evidence* in favor of one model over the other:(11)(Jeffreys 1935). A Bayes factor > (<) 1 is considered as evidence in favor of *M*_{1} (*M*_{0}).

We used the model-switch thermodynamic integration framework described in Lartillot and Philippe (2006) to evaluate Bayes factors across all codon substitution models described above. The model-switch thermodynamic integration method is a computationally intensive Markov chain Monte Carlo procedure, and our strategy here has been to evaluate all log Bayes factors with respect to the GY-F1 × 4 model. Several log Bayes factors are computed from multiple independent model-switch schemes. For instance, the log Bayes factor between MG-F3 × 4-CP-DP and GY-F1 × 4 is assembled from(12)where each term is evaluated using a distinct model-switch scheme (see supplemental materials). Following Lartillot and Philippe (2006), we found the sampling error to be less problematic than the “thermic lag” of the method, which we thus made the focus of our MCMC settings. We ran each calculation in duplicate, using the quasi-static bidirectional method detailed in Lartillot and Philippe (2006, also see the supplemental material). Each pair of model-switch integrations produces two values, used here to construct an interval, and gives a crude sense of the precision of the settings. Note that the bidirectional scheme is employed as a framework for empirical MCMC tuning (see supplemental material) and not as a method for formal calculation of the Monte Carlo error.

As expected, the log Bayes factors can be made more precise when comparing models that are parametrically closest to the reference model (see solid bands in Figure 1). When computing log Bayes factors for the more complex models, involving several distinct model-switch schemes, the interval of the overall log Bayes factor against GY-F1 × 4 is constructed conservatively (to produce the broadest possible interval). This means that the interval (the error) grows with each step in model space. However, a careful model-space traversal appears necessary in practice, despite this growing error of multiple steps, as performing a direct integration between a high-dimensional model and the reference would require overcostly computations (see supplemental material). Still, in certain cases, an entirely unambiguous model ranking remains computationally prohibitive. For instance, with the Globin*17-144* data, the log Bayes factor of MG-F3 × 4-CP-DP against GY-F1 × 4, and the log Bayes factor of MG-F1 × 4-CP-DP against GY-F1 × 4, overlap with each other, which thus prevents us from clearly distinguishing the two models. In the present context, obtaining the required level of precision for distinguishing between log-marginal likelihoods that differ by a few units is relatively uninteresting and not worth the computational investment that would be needed when utilizing the present methods. Our objective here is rather to map out the main effects of different formulations in terms of overall model fit.

#### Posterior predictive assessments:

The calculation of Bayes factors allows for a quantitative ranking of a competing set of models. However, we want to explore how the models perform in absolute terms, by comparing the value of certain statistics computed on the real data with those computed on data replicates, generated by simulation under the model of interest. In the Bayesian framework, the simulation is typically conducted for each draw of a sample from the posterior, and the distribution of the statistic is referred to as the *posterior predictive distribution*. This designation is appropriate, in that the distribution consists of what we would expect to see under the model, in light of what has been learned from the data at hand. Discrepancies between posterior predictive distributions and observed values reveal weaknesses of a model, as previously explored in phylogenetic modeling contexts (*e.g.*, Bollback 2002; Nielsen 2002; Lartillot *et al.* 2007). A conventional way of summarizing the discrepancy between a posterior predictive distribution and the observed value of some statistic is to report the area of the distribution to the right of the observed value, otherwise referred to as the posterior predictive *P*-value (see, *e.g.*, Gelman *et al.* 2004); this can be computed as the proportion of replicates for which the statistic equals or exceeds the observed value. We illustrate this approach here using codon-position-specific nucleotide frequencies and codon entropy.

Note that formulating tests based on the posterior predictive *P*-values to “reject” models would require further calibration (since the data replicates were generated from unknown, previously estimated parameter values). In the case of the nucleotide frequencies, such a formal testing across all dimensions would require further correction (since dimensions are not independent). Nonetheless, these types of difficulties do not prevent us from performing simple graphical contrasts of features of true and replicated data (Gelman *et al.* 2004).

## RESULTS AND DISCUSSION

#### Bayes factors:

The series of log Bayes factors reported in Figure 1 reveals considerable differences in model performance, indicating the importance of conducting a careful examination of alternative parametric choices. We note that the MG-F1 × 4-CP-DP model is among the top ranking models for all three data sets. This result is somewhat expected. First, nonsynonymous rate heterogeneity has now been observed across numerous data sets (Yang 2006), and it thus seems reasonable to anticipate a good performance of the Dirichlet process framework proposed by Huelsenbeck *et al.* (2006). For the Globin*17-144* data set, we also tried a version of the DP model with a uniform hyperprior over the interval [0:1000] on α, in the MG-F1 × 4-CP-DP setting. With respect to the reference model, the resulting log Bayes factor interval of [237.8; 242.3] is overlapping with the top-ranking models, suggesting that model choice in favor of the DP model is robust to the hyperprior; more work is still needed, however, to explore alternative prior structures on the Dirichlet process in this context, in particular regarding alternative base priors distributions. The other specifications of this top model are also reassuring, in the sense that adhering closely to the mechanistic perspective of teasing apart mutational features and selective constraints produces, at worst, a model of roughly equivalent performance to models lacking such a natural interpretation. In addition, all three data sets suggest uneven codon preferences, although such preferences appear to go well beyond amino acid preferences only in the case of the Globin*17-144*.

We next note that under the simpler settings of MG-style models, suppressing AAP or CP parameters, the F3 × 4 configuration is generally preferred over the F1 × 4 configuration for all three data sets. The periodic pattern of codon-position-specific nucleotide propensities is a feature expected from the structure of the genetic code, through selective constraints. Such an interpretation, however, is not accurately represented by expanding the nucleotide-level parameterization. Indeed, with the richer models, including the CP parameters in particular, the F3 × 4 configurations are only mildly preferred over the F1 × 4 configuration, and when invoking the Dirichlet process, modeling heterogeneous nonsynonymous rates, the numerical error no longer allows for a clear distinction between these two configurations (except for the Hiv*22-99* data set, which gives preference to the F1 × 4 configuration). We also observed the parametric redundancy of including both the F3 × 4 setting and the CP parameters to be susceptible to identifiability problems (see supplemental material).

The GY style of models based on the F1 × 4 and F3 × 4 configurations is generally disfavored over the MG-style counterparts (except for the Globin*17-144* data set, which gives favor to GY-F3 × 4 over MG-F3 × 4). Surprisingly, for the Lysin*25-134* data set, the simpler GY-F1 × 4 model is slightly preferred over the GY-F3 × 4 model. However, for all three data sets, the GY model based on F61 configurations outperforms the other GY-style models, as well as the simpler MG-style models. In the case of the Globin*17-144* data set, the contrast of the F61 configuration is even greater than that observed between homogeneous and heterogeneous models of nonsynonymous rates; for instance, the log Bayes factor of GY-F61 against GY-F1 × 4 is [115.8; 117.4] whereas that of GY-F1 × 4-DP against GY-F1 × 4 is [102.3; 104.2]. These results for the GY-F61 model are also indicative of uneven codon preferences. However, as previously mentioned, the codon preferences accounted for in this GY formulation are confounded with other features, including the background of nucleotide propensities, making the model less attractive on interpretive grounds. Accordingly, when contrasted with the richer MG formulations accounting for codon (or amino acid) preferences, the GY-F61 model is less attractive on quantitative grounds (except for Hiv*22-99*, in which case it matches the top MG-style models).

#### Posterior distributions:

To highlight a few aspects of the MG-F1 × 4-CP-DP model, which was among the top performing for all three data sets, we display posterior distributions of parameters (obtained using plain MCMC sampling). Our focus is on the combination of background nucleotide propensities with global codon preference parameters, but we contrast the distributions with those obtained under the simpler model suppressing CP parameters, as well as under the GY-F61-DP model. We also explore the impacts of different models on the detection of positive selection.

Figure 2 displays the 95% credibility intervals of the posterior distributions of the global nucleotide propensity parameters for each data set. The solid lines correspond to the interval obtained under MG-F1 × 4-DP, whereas the dashed lines are obtained under MG-F1 × 4-CP-DP. The distributions are far more diffuse under the CP version, although their general locations appear similar. This is the well-known feature of increasingly diffuse distributions obtained when expanding the form of models and that we observed for the other parameters of the models as well (not shown).

Figure 3 displays the 95% credibility intervals of the posterior distributions of codon preference parameters under the MG-F1 × 4-CP-DP model, in comparison with the codon stationary parameters under GY-F61-DP model. Note that the parameters do not have the same interpretation under the two models. For the GY-F61-DP model, the parameters correspond to the limiting distribution of the Markov process; but for the MG-F1 × 4-CP-DP model, the limiting distribution of the Markov process is based on both codon preference parameters and nucleotide propensity parameters, and hence the graphical comparison displayed here does not correspond to a formal statistical testing. The GY-F61-DP model leads to tighter 95% credibility intervals than the MG-F1 × 4-CP-DP, which concords with its lower dimensionality. However, the overall distributions of the two sets of parameters are reasonably similar, with some degree of overlap in all cases. Thus both models appear to be capturing similar overall trends. The distributions suggest pronounced overall codon preferences for Globin*17-144*, but milder preferences for Lysin*25-134* and Hiv*22-99*. This corroborates well with our computed Bayes factors, which, for instance, indicate that for Lysin*25-134* and Hiv*22-99*, the improvement brought about by the CP parameters is less important than for the Globin*17-144* data. Observing the distributions for the Globin*17-144* data set in detail, we find that the parameter values appear to capture long observed tendencies of codon preferences on similar data, such as the elevated use of CTG for encoding leucine, GTG for valine, or GGC for glycine; indeed, these were some of the first observations stimulating research into the causes of codon preferences (*e.g.*, Fitch 1980; Modiano *et al.* 1981; Kimura 1983).

We contrasted the conclusions of the GY-F61-DP, MG-F1 × 4-DP, and MG-F1 × 4-CP-DP models with regard to the central application of such models: the inference of amino acid positions having undergone positive selection. Under the DP settings, the posterior probability of a site being under positive selection can be computed from the proportion of draws from a sample (obtained via plain MCMC sampling) found to be in a class ω > 1, as described in Huelsenbeck *et al.* (2006). We first note that for the Globin*17-144*, focusing on posterior probabilities at 0.9, 0.95, and 0.99 cutoff levels, the MG-F1 × 4-DP and MG-F1 × 4-CP-DP models infer sites under positive selection at each level, whereas the GY-F61-DP model infers no sites at either level (Table 1). The list of sites under positive selection under the three models considered also differs for the other two data sets (Table 1). The computed Bayes factor can be used to weigh the conclusions of different models and indeed could form the basis of a model averaging inference of positive selection. It will be important to conduct a broader empirical study of the impacts of these and other parametric choices on the detection of positive selection.

#### Posterior predictive assessments:

To investigate whether different nucleotide frequencies observed at the three different codon positions can be a result of codon preferences, we computed the codon-position-specific frequencies of each nucleotide on data replicates, generated by simulation under parameters sampled from the posterior distribution. Focusing on the Globin*17-144* data set, Figure 4 displays the posterior predictive distribution of the frequency of nucleotide A at the three positions under the MG-F1 × 4-DP model (solid lines), the MG-F1 × 4-CP-DP model (thick dashed lines), and the GY-F61 model (thin dashed lines). The MG-F1 × 4-DP model leads to closely matching distributions at the three positions, from the definition of the model (although the exclusion of stop codons induces mild shifts). In Figure 4, b and c, the posterior predictive *P*-values of 0 and 1 are indicators of problems with the MG-F1 × 4-DP model. The MG-F1 × 4-CP-DP model, in contrast, leads to markedly different distributions at the three positions, in a manner approaching the observed frequencies. The GY-F61-DP model is also capable of producing this effect, although graphical comparisons for the three other nucleotides suggest that it does not perform as well as the MG-F1 × 4-CP-DP model (supplemental material).

In another posterior predictive check, we computed the relative frequency of codons in alignments (real and replicated), from which we then evaluated the overall unevenness of the frequency profile from the codon entropy. In Figure 5, we find, as expected, that the MG-F1 × 4-DP model induces a relatively even predictive codon frequency profile (from the definition of the model), leading to a high codon entropy (solid-line histogram). The *P*-value of 1 indicates that the model is inadequate. In contrast, under the MG-F1 × 4-CP-DP model we obtain a more uneven profile, and thus a lower codon entropy (thick-dashed-line histogram); the *P*-value is 0.528, and the model is not deemed problematic under this test. The GY-F61-DP model leads to an intermediate codon entropy (thin-dashed-line histogram), which poorly anticipates the empirical value, but is nonetheless much closer than the MG-F1 × 4-DP model.

These two posterior predictive checks are indicative of tensions in the GY-F61-DP model configuration. The overall model construction strategy of the GY approach may be unable to adequately capture codon preferences without inducing distorted nucleotide propensities. Although the GY-F61-DP model finds a compromise to this tension that performs better than simpler models, in its ability to phenomenologically capture overall codon preference trends, it is outperformed by the MG-F1 × 4-CP-DP model, which explicitly decouples codon preferences from nucleotide propensities.

## CONCLUSIONS AND FUTURE DIRECTIONS

In recent years, numerous codon substitution models have been proposed, subscribing to either the GY or the MG perspective. GY-style models have been extended in several ways, for instance, to account for heterogeneous nonsynonymous rates (*e.g.*, Nielsen and Yang 1998; Yang *et al.* 2000a; Huelsenbeck and Dyer 2004; Huelsenbeck *et al.* 2006) or to recognize differences in the types of nonsynonymous substitutions, either based on various amino acid distance metrics (*e.g.*, Yang *et al.* 1998), by partitioning amino acids into predefined classes (Sainudiin *et al.* 2005; Wong *et al.* 2006), by incorporating information from an empirically derived amino acid replacement matrix (Doron-Faigenboim and Pupko 2007), or even based on full empirical modeling strategies (Kosiol *et al.* 2007). In parallel, MG-style models have been extended in similar ways as well, such as the models recognizing heterogeneous synonymous and nonsynonymous substitution rates across codon sites (Kosakovsky Pond and Muse 2005), models incorporating partition-based amino acid exchange propensities (Schadt and Lange 2002), or models accounting for dependence between codon positions due to protein tertiary structure (Robinson *et al.* 2003).

In light of all of these developments, the present study effectively takes a step back, to reassess the core motivation underlying codon-based models, namely, the formulation of a biologically meaningful parameterization that disentangles the different factors bearing on the overall substitution process. Our results indicate that a mechanistically founded attempt of teasing apart nucleotide propensities and amino acid or codon preferences in the MG style tends to surpass, or at least match, the optimal GY-style model. These results suggest that future modeling investigations should consider incorporating any extensions in the MG context.

Quantitative model comparisons based on Bayes factors would be of particular interest to evaluate how some of the recently proposed models mentioned above compare with the models studied here. For instance, the models proposed in Robinson *et al.* (2003), which can account for dependencies between codon sites due to structural constraints at the protein level, could be included into the scope of models evaluated. Working at the amino acid level, we previously proposed a variation of the thermodynamic method for evaluating Bayes factors with models accommodating a general dependence across sites (Rodrigue *et al.* 2006), which could be transposed to the codon level in a straightforward way. A broad range of model extensions are also evident from the AAP and CP approaches proposed in this work: given that these richer MG-style models lead to an improved overall fit, models based on mixtures of AAP or CP parameters could also be studied, so as to capture site-specific preferences. To this end, the Dirichlet process prior, applied here to model nonsynonymous rate heterogeneity across sites (Huelsenbeck *et al.* 2006), could also be applied to the AAP parameters or to the CP parameters. Indeed, the necessary MCMC operators for manipulating such models have been described previously (Lartillot and Philippe 2004). It might also be of interest to extend these models to nonstationary nucleotide propensities, as well as to nonstationary amino acid or codon preferences, on the basis of ideas presented in Blanquart and Lartillot (2006) and Nielsen *et al.* (2007).

We note that analogous quantitative model comparisons would be difficult in a frequentist framework. The nonnested form of the models of interest complicates traditional frequentist tests, in which case most practitioners rely on criteria such as the Akaike information criterion (AIC) (Akaike 1974). However, the AIC (and other similar criteria) relies on maximum-likelihood parameter estimates, as well as the log-likelihood value at this optimal point. Performing maximum-likelihood estimation and log-likelihood calculations will likely become increasingly difficult as richer nonanalytical models are proposed (Rodrigue *et al.* 2007), such as the Dirichlet process models or models with dependence across sites (Robinson *et al.* 2003).

Extending comparisons over many data sets will be of prime importance to determine if the results presented here can be generalized. However, such a project would entail significant computational costs using the model-switch thermodynamic methods employed here (several log Bayes factor calculations required over 40 days on an Intel P4 3.2 GHz desktop computer). We are currently investigating a marginal-likelihood estimator combining sigmoidal model-switch schemes (Lepage *et al.* 2007) with the Laplace method for integrals, along the lines described in Rodrigue *et al.* (2007). Combinations with other Bayes factor calculation schemes (*e.g.*, Suchard *et al.* 2001; Choi *et al.* 2007) are also possible, and the availability of such a suite of computational devices should enable a broad empirical study and help uncover and quantify the main factors bearing on protein-coding sequence evolution.

## APPENDIX

In our implementation the entries of *Q* are based on two sets of specifications: a 61-dimensional vector of stationary probabilities, π, and a set of *transient specifications*, written as ρ = (ρ_{ij})_{1≤i,j≤61} and combined according to(A1)(A2)In this appendix, we write out in full the stationary probabilities under the models, as well as the full transient specifications, and give an example of the detailed balance check.

#### Stationary probabilities:

First, expanding (8) for the stationary distribution under GY-F1 × 4, we have(A3)Similarly, with GY-F3×4, we have(A4)

The stationary probability under GY-F61 is already entirely specified and the models MG-F1 × 4 and MG-F3 × 4 have the same stationary distributions as (A3) and (A4), respectively.

Under the MG-F1 × 4-CP model, the stationary probability is given by(A5)and under the MG-F1 × 4-AAP model it is given by(A6)

The stationary distributions under the MG-F3 × 4-CP and MG-F3 × 4-AAP models follow directly as(A7)and(A8)respectively.

#### Transient specifications:

In the case of GY-type models, the transient specification is simply (7) without the π_{j} factor. In the case of the MG-F1 × 4 model, we have(A9)where *c′* and *c*″ are the two constant codon positions, and *Z* is the normalizing factor of the stationary distribution (in this case ). Note that this latter *Z* factor is not needed when scaling *Q*. Once again, substituting with , and the appropriate *Z*, yields the transient specification for MG-F3 × 4.

For the MG-F1 × 4-CP model, the transient specification is given by(A10)and the specification of MG-F1 × 4-AAP is given by(A11)

As always, substituting with , and the appropriate *Z*, yields the transient specifications for the F3 × 4 versions of (A10) and (A11).

We have now fully specified π and ρ used in Equation A1. We can see that upon substituting stationary and transient specifications appropriately into (A1), the models defined in the main body of the text are obtained. For instance, for a nonsynonymous substitution under the MG-F1 × 4-CP model, we have(A12)(A13)(A14)(A15)corresponding to the entry obtained from (2).

#### Checking the detailed balance:

The models studied here all satisfy the equality π*Q* = 0, and are time reversible, satisfying the equality *Q _{ij}*π

_{i}=

*Q*π

_{ji}_{j}. These developments are lengthy, and so we display only one example, for the detailed balance check under MG-F1 × 4-CP in the case where

*i*and

*j*differ at one nucleotide position, implying a nonsynonymous substitution,(A16)(A17)(A18)(A19)(A20)(A21)(A22)where the array is symmetrical, satisfying the equality.

Finally, we mention here that we follow the practice proposed by Huelsenbeck *et al*. (2006) and scale *Q* matrices such that branch lengths represent the expected number of synonymous substitutions per codon site, although we have also tried the model comparisons without any scaling of *Q* (such that branch lengths have no meaningful units) and obtained essentially identical results (not shown).

## Acknowledgments

We thank Stéphane Aris-Brosou for his feedback on an early version of the manuscript, as well as Rasmus Nielsen and two anonymous reviewers for thoughtful comments. We also thank the Réseau Québécois de calcul de haute performance for computational resources. This work was supported by Génome Québec, the biT fellowships for excellence (a Canadian Institutes of Health Research strategic training program grant in bioinformatics), the Robert Cedergren Centre for bioinformatics and genomics, the Natural Sciences and Engineering Research Council of Canada, the Canadian Institute for Advanced Research, the Canadian Research Chair Program, the Centre National de la Recherche Scientifique (through the Action Concertée Incitative–Informatique, Mathématique, Physique en Biologie Moléculaire Model-Phylo funding program), and the 60ème Commission Franco-Québécoise.

## Footnotes

Communicating editor: R. Nielsen

- Received June 3, 2008.
- Accepted September 5, 2008.

- Copyright © 2008 by the Genetics Society of America