## Abstract

More than 20 human genetic diseases are associated with inheriting an unstable expanded DNA simple sequence tandem repeat, for example, CTG (cytosine–thymine–guanine) repeats in myotonic dystrophy type 1 (DM1) and CAG (cytosine–adenine–guanine) repeats in Huntington disease (HD). These sequences mutate by changing the number of repeats not just between generations, but also during the lifetime of affected individuals. Levels of somatic instability contribute to disease onset and progression but as changes are tissue-specific, age- and repeat length-dependent, interpretation of the level of somatic instability in an individual is confounded by these considerations. Mathematical models, fitted to CTG repeat length distributions derived from blood DNA, from a large cohort of DM1-affected or at risk individuals, have recently been used to quantify inherited repeat lengths and mutation rates. Taking into account age, the estimated mutation rates are lower than predicted among individuals with small alleles (inherited repeat lengths less than 100 CTGs), suggesting that these rates may be suppressed at the lower end of the disease-causing range. In this study, we propose that a length-specific effect operates within this range and tested this hypothesis using a model comparison approach. To calibrate the extended model, we used data derived from blood DNA from DM1 individuals and, for the first time, buccal DNA from HD individuals. In a novel application of this extended model, we identified individuals whose effective repeat length, with regards to somatic instability, is less than their actual repeat length. A plausible explanation for this distinction is that the expanded repeat tract is compromised by interruptions or other unusual features. We quantified effective length for a large cohort of DM1 individuals and showed that effective length better predicts age of onset than inherited repeat length, thus improving the genotype–phenotype correlation. Under the extended model, we removed some of the bias in mutation rates making them less length-dependent. Consequently, rates adjusted in this way will be better suited as quantitative traits to investigate *cis-* or *trans*-acting modifiers of somatic mosaicism, disease onset and progression.

## 1. Introduction

Most human genetic diseases associated with inheriting an unstable expanded DNA simple sequence repeat are caused by repeat units of three nucleotide bases: CTG (cytosine–thymine–guanine) in myotonic dystrophy type 1 (DM1) and CAG (cytosine–adenine–guanine) in Huntington disease (HD) [1]. The CTG repeat unit in DM1 is found in the 3′ untranslated region of the *DMPK* gene [2–4]. In HD, the repeat unit is CAG in the coding region of the *huntingtin* gene [5].

The length of the repeat unit is inversely correlated with age of disease onset. Generally, as longer lengths get passed on from parent to child, the age of onset decreases, sometimes by as much as 30 years, between generations. This phenomenon is known as anticipation and is a feature of many expanded repeat diseases [1]. The DM1 CTG repeat is polymorphic in the general population, ranging from five to 37 repeats, and from upwards of 50 to several thousand in affected DM1 patients [3,4,6]. Individuals who inherit between 50 and 100 CTGs typically have late adult onset with mild symptoms and those inheriting between 100 and 800 CTGs have adult/juvenile onset. There is also a congenital form of DM1 usually occurring among those inheriting between 800 and 1500 CTGs [7]. Affected individuals are seen with repeat lengths over 35 CAGs in HD [5], but in contrast to DM1, in HD most individuals usually only inherit between 40 and 50 CAGs. Juvenile onset can occur over 50 CAGs, but there is not a known congenital form of HD. Despite differences between DM1 and HD with respect to the repeat motif and its position, and differences in the tissues affected and disease pathology, the unimodal shape of sized single-molecule repeat length distributions is very similar in blood or buccal DNA [8–10]. This suggests that there may be similarities in the mechanism underlying mutation in each disease.

The expanded repeat lengths mutate during the lifetime of affected individuals by changing the number of repeats, leading to repeat length differences between cells within a tissue, and giving rise to somatic mosaicism. In DM1 and HD, the diseases under discussion here, the changes are length- and time-dependent with levels of somatic mosaicism biased towards expansion and generally increasing as an individual ages [8,10]. Levels of somatic mosaicism are particularly high in pathology-related tissues, such as muscle in DM1 [11] and brain in HD [12,13] and this has given rise to the hypothesis that somatic instability may itself contribute to the pathogenic process [13,14]. The link between somatic mosaicism and disease onset suggests that preventing the repeat lengths from expanding might be therapeutic [1,15].

Interpretation of the level of somatic instability is confounded by several considerations, notably inherited repeat length, but also age and tissue in which somatic instability is observed and individual-specific rates of mutation. Distinguishing between the components of somatic instability is expected to improve interpretation and prognostic information for affected individuals and their families. Previously, we developed a mathematical model that describes changes in repeat length distributions in DM1 blood DNA over time [16]. This model was fitted to sized repeat lengths from a large cohort of DM1-affected or at risk individuals [14]. These individuals varied in terms of their estimated inherited repeat lengths (between 50 and 1500 CTGs). The probabilistic model assumed that changes in repeat length increased as a function of repeat length and time. This model was shown to explain satisfactorily the variable distributions of repeat lengths seen across this group of DM1 individuals. It was established that contractions play an important part in the mutational process, with relatively small net gains of CTGs being the predicted result of many expansions and almost as many contractions. From these results, we concluded that mutations at the DM1 locus are ultra-frequent, much more frequent than might be concluded from the net gains or examination of the modal repeat length. In summary, both expansions and contractions have resulted in the characteristic shape of the repeat length distributions [16].

However, we observed that for individuals with small alleles (inherited repeat lengths less than 100 CTGs) expansion and contraction rates clustered around the low end of the parameter spectrum [16]. As it is reasonable to expect these rates to be randomly distributed throughout the spectrum, the results indicate that the overall model may be improved by introducing a nonlinear response in line with the biological consideration that small alleles may have a reduced propensity to expand and/or contract owing to possible size effects [16].

The precise mechanisms that cause repeat units to become inserted or deleted from the repeat length tract are not known [1,17,18], but two basic types of explanation, DNA replication and DNA repair, have been proposed for the expansion of simple sequence repeats.

The aim of this study is to develop and test a more sophisticated mathematical model based on plausible biological assumptions about small alleles. The model is applied to both DM1 and HD data, and evaluated using techniques from statistical inference. We thereby quantify whether mutational propensity is lower in the small alleles and whether reduced levels of somatic mosaicism give rise to distinctive repeat length distributions. We also introduce and quantify the concept of *effective length*. Individuals whose effective length is different from their actual length are strong candidates for further experimental investigation, as lower than expected levels of somatic mosaicism may indicate the presence of modifiers of somatic instability such as interruptions in the repeat lengths.

## 2. Methods

### 2.1. Data

The data in this study comprised distributions of CTG repeat lengths sized from blood DNA from 14 DM1-affected individuals at the DM1 locus [14] and distributions of CAG repeat lengths sized from buccal DNA from 12 HD affected individuals at the HD locus [8]. A summary of the mean repeat length distributions and the range of repeat length distributions for each of the individuals above (14 DM1 and 12 HD) can be viewed in the electronic supplementary materials (figures S1, S2.1–S2.14 and S3.1–S3.12, respectively). The subset of 14 DM1 individuals with inherited repeat lengths less than 100 CTGs, see [14] for further details, was selected from the total cohort of 145 DM1 individuals. Out of these selected individuals, nine were asymptomatic when the blood samples were taken and five had late-onset, with age at onset ranging from 46 to 74 years. The 12 unrelated HD individuals [8] had estimated inherited repeat lengths between 39 and 48 CAGs and were all 39 years old when the buccal samples were taken. The distributions were sized, in terms of the number of repeats, using single-molecule PCR assays. Since the first application of small pool PCR to quantify variation at the myotonic dystrophy locus in 1995 [19], the technique has become well established as robust and reliable, and has been used to quantify triplet repeat dynamics in a wide range of scenarios and at various loci (e.g. [12,20–22]).

### 2.2. Formulation of an expansion and contraction model incorporating a length-specific effect

The framework we use to describe changes in repeat length (measured by the number of repeat units) over time in a population of cells is a stochastic birth and death process. In our context, birth is the gain of one repeat unit (expansion), and death is the loss of one repeat unit (contraction) within a cell. This is a probabilistic model with probability functions defined for the mutational events of expansion and contraction. In [16], we assumed that the likelihood of a mutational event increased linearly with repeat length over a threshold number of repeats and it is this assumption and corresponding function definitions that we will refine in this paper, see the appendix for details. A key modelling assumption is that the cells acquire mutations independently of one another, and this was justified for DM1 blood cells [16]. Our application of the model to another disease and cell type, HD buccal cells, requires the assumption that buccal cells acquire mutations independently of one another. Buccal cells, such as other external epithelium cells, are replenished from a large pool of self-renewing stem cells [23] hence an assumption of independence is reasonable.

Our approach to quantify a length-specific effect and to determine the range over which it operates is based on a repeat length distance parameter, *α*, and is sufficiently general to incorporate different distance requirements that might suppress mutation proportional to length. The length parameter, *α*, may be interpreted as the length of a DNA fragment typically processed by the DNA repair mechanism and/or DNA replication machinery. Alternatively, such a distance requirement might be created by an interruption in the repeat length.

### 2.3. Model comparison and parameter estimation

The basic model, linearly proportional expansion and contraction over a threshold number of repeats, referred to as *M*_{6b} in [16] and here as *M*, is summarized in the appendix. We develop a new model, *M _{α}*, by introducing a length-specific effect,

*R*(

_{n}*n*,

*α*), which is a function of repeat length,

*n*, and the distance constraint,

*α*, into model

*M*. The length-specific effect is quantified using a combinatorial counting argument based on the length of the constraint (one interpretation being the distance between loop-outs), denoted as

*α*, and the likelihood that mutation occurs. We introduce this effect into the probability functions for expansion and contraction, as described in the appendix.

Both models, *M* and *M _{α}*, were fitted to sized single-molecule distributions of DM1 blood DNA and HD buccal DNA, from the individuals described above. The prior ranges for Bayesian parameter estimation were chosen to represent DM1 blood/HD buccal cells: expansion per CTG/CAG unit per year (models

*M*and

*M*), contraction per CTG/CAG unit per year (models

_{α}*M*and

*M*), a threshold measured in CTG/CAG units (model

_{α}*M*) and a distance constraint measured in CTG/CAG units (model

*M*; table 1). For both datasets, the inherited number of repeats,

_{α}*n*

_{0}, was treated as an unknown parameter and its value inferred from the data along with the other parameters. The maximum likelihood was calculated using a grid search over the parameter space, modified respectively for DM1 blood cells and HD buccal cells.

The relative goodness of fit of the basic model *M* and the new model *M _{α}* was assessed using the Akaike information criterion (AIC) [24]. The models have the same number of parameters and are not nested, so AIC is an appropriate method to rank the models through a relative measure of the goodness of fit. Application of AIC involved calculating the maximum-likelihood value using a grid search over the parameter space, as outlined in table 1. Further details about model comparison and parameter estimation are found in the appendix.

## 3. Results

### 3.1. Repeat length distributions among individuals with small alleles have relatively low variance-to-mean ratios

We observed that the variance-to-mean ratios of the repeat length distribution among the subgroup of 14 DM1 individuals with inherited repeat lengths less than 100 CTGs are very low, much lower than predicted under the basic model *M*, especially when age is also taken into account. In terms of the difference between the predicted and the observed ratio per year (figure 1), the fact that all 14 DM1 individuals lie at the low end (15th percentile) of this distribution is highly significant (*p* < 10^{−}^{5} using a permutation test). These results inform our hypothesis that a length-specific effect, not accounted for in the model, *M*, results in proportionally less mutation within the small alleles.

### 3.2. Model comparison supports a role for a length-specific effect suppressing mutational rates in DM1

In order to test the hypothesis that a length-specific effect suppresses mutational rates among individuals with small alleles, we introduced a length constraint, *α*, into the basic model and used a model comparison method (AIC) to compare the extended model *M _{α}* with the basic model

*M*in terms of goodness of fit.

Model *M _{α}* (maximum-likelihood value = −4779 and AIC = 9670) ranks higher than model

*M*(maximum-likelihood value = −4805 and AIC = 9721; table 2). The difference in AIC values (51) indicates that the relative likelihood of model

*M*compared with

*M*is very low (6.90 × 10

_{α}

^{−}^{12}), and so we conclude that model

*M*fits the data significantly better than model

_{α}*M*among individuals with repeat lengths at the lower end of the DM1 range. The model fit can be visualized as a distribution curve or a cumulative distribution curve. The fits of models

*M*and

*M*compared with the data and each other are shown for two representative DM1 individuals (figure 2

_{α}*a*,

*b*), and for all DM1 individuals, see electronic supplementary material (figures S2.1–S2.14). Model

*M*is seen to be better than model

_{α}*M*at tracking the initially steep ascent of the cumulative distribution that is typical for these individuals.

The average maximum-likelihood value of *α* among these 14 DM1 individuals was 51 CTGs (153 base pairs), but there was considerable variation (s.d. = 22 CTGs). With a fixed length parameter, *α* = 51 CTGs, we estimate that the length-specific effect would be strongest between 51 CTGs and 173 CTGs. These results provide support for a length-specific effect operating below 200 CTGs in DM1. By suppressing the mutation rate per repeat unit, the length-specific effect makes a big difference to the shape of the repeat length distribution below 200 CTGs but its significance rapidly decreases beyond this level.

### 3.3. Model comparison supports a role for a length-specific effect suppressing mutational rates in Huntington disease

As for DM1 blood, the results from AIC for HD buccal cells indicate that model *M _{α}* (maximum-likelihood value = −1312 and AIC = 2746) ranks higher than model

*M*(maximum-likelihood value = −1343 and AIC = 2781; table 2). The difference in AIC values (35) indicates that the relative likelihood of model

*M*compared with

*M*is very low (2.06 × 10

_{α}

^{−}^{8}), so we conclude that model

*M*fits the buccal DNA data significantly better than model

_{α}*M*among individuals with repeat lengths at the lower end of the range. As was the case for DM1, model

*M*is better than model

_{α}*M*at tracking the initially steep ascent of the cumulative distribution (figure 2

*c*). For the fits of models

*M*and

*M*for all HD individuals, see electronic supplementary material (figures S3.1–S3.12).

_{α}Among these 12 HD individuals, there were three individuals for whom the two models, *M* and *M _{α}*, were equally likely and the estimates for the fixed length parameter were below three CAGs. These three individuals had low levels of somatic mosaicism and hence it may not be possible to distinguish between the models and estimate the length-specific factor in this instance. Among the other nine HD individuals, the average value of

*α*associated with the maximum-likelihood value was 26 CAGs (s.d. = 11 CAGs). These results provide support for a length-specific effect, suppressing the mutation rate per repeat unit, over the whole range of observed repeat lengths in this HD dataset (59 CAGs or less).

We also considered a model with global parameters for the mutation rates and length effect and individual-specific parameters only for the inherited length (results not shown). However, as reported for DM1 in reference [16], global parameters did not capture the variation seen in the data, indicating that individual-specific factors play a major role in HD somatic instability. Inclusion of contraction events, i.e. decreases in repeat length of one CAG unit for HD, was also justified statistically, as there was no support for the contraction rates being zero.

### 3.4. Estimates of inherited repeat length under model *M*_{α} are in agreement with original predictions

_{α}

In our study, we treated inherited repeat length, *n*_{0}, as an unknown parameter to be inferred from the data. Our estimates of the value of *n*_{0} are in agreement (correlation coefficient = 0.93) with those estimated using the lower bound of the distribution as seen with small pool PCR, discussed in reference [14]. Further, our estimates of the value of *n*_{0} are in complete agreement (correlation coefficient > 0.99) with the estimates by Veitch *et al.* [8], which for the HD individuals in this study were based on the lower boundary of their highly skewed distributions.

### 3.5. Mutational levels are higher in DM1 blood cells than in Huntington disease buccal cells

The parameter values associated with the maximum-likelihood value provide a point estimate for mutation rates, in terms of expansion per repeat unit and contraction per repeat unit for each individual. Comparing parameter values under model *M _{α}* for DM1 blood with those for HD buccal cells, the median expansion rate for DM1 (9.1 × 10

^{−}^{2}per CTG per year) is significantly higher (

*p*= 8.28 × 10

^{−}^{5}using the Mann–Whitney

*U*-test) than for HD (8.5 × 10

^{−}^{4}per CAG per year). Similarly, the median contraction rate is significantly higher (

*p*= 1.32 × 10

^{−}^{5}using the Mann–Whitney

*U*-test) for DM1 (7.0 × 10

^{−}^{2}per CTG per year) than for HD (1.5 × 10

^{−}^{4}per CAG per year), see figure 3 for comparison. The number of individuals is small, but there appears to be a correlation between expansion and contraction rates within DM1 (correlation coefficient > 0.99) and within HD (correlation coefficient = 0.70) suggesting a link between expansion and contraction, within the mutational mechanism, in both diseases. Interestingly, the ratio of contraction to total mutation (expansion and contraction) is higher for DM1 (0.40) than for HD (0.18), and this most probably reflects biological differences between DM1 blood cells and HD buccal cells.

### 3.6. For some DM1 individuals, effective length is lower than inherited repeat length

We have shown that model *M _{α}* better explains the distinctive distributions among DM1 individuals with smaller alleles than model

*M*. A length-specific effect may operate in other unusual individuals, particularly those with low variance-to-mean ratios (figure 1). Application of the extended model,

*M*, to the data from unusual individuals in order to infer a value for

_{α}*α*, may establish whether, and if so, when, this effect operates.

We therefore fitted model *M _{α}* to distributions of repeat lengths (blood DNA) from eight DM1 individuals with low variance-to-mean ratios (figure 1). Five of the eight DM1 individuals have estimates for

*α*of 80 repeat units or more and an improvement in fit (likelihood gain of 2 or more). The improvement in fit can be seen by comparing the fit of both models to the data for representative individual BC19 (figure 2

*d*). For the fits of models,

*M*and

*M*for other DM1 individuals (likelihood gain of 9 or more), see the electronic supplementary material, figures S4.1–S4.10. As seen before, (figure 2

_{α}*a–c*), model

*M*is better than model

_{α}*M*at capturing the steep rise at the beginning of the cumulative distribution. We then fitted model

*M*to distributions of repeat lengths (blood DNA) from the rest of the cohort (120 DM1 individuals with estimated inherited repeat length,

_{α}*n*

_{0}, over 100 CTGs). There were 16 DM1 individuals, with estimated values for

*α*over 100 CTGs (table 3

*b*). For most of these individuals, there is an improvement in fit (likelihood gain of 2 or more).

In defining effective length, we use inherited repeat length as a point of reference as it is both the initial repeat length and the major modifier of age of onset. Hence, we define effective length as the difference between inherited repeat and *α*, that is *n*_{0} − *α*. Consequently, higher *α* values imply lower effective lengths (table 3). Interestingly, BC6 and BC19 (father and son) were previously noted for two reasons. First, they have unusually mild symptoms given their estimated inherited repeat lengths (see fig. 2 in [25], BC6 = II.2 and BC19 = III.2). Second, the germline transmission from father (BC6) to son (BC19) resulted in a relatively rare apparent contraction [25].

### 3.7. Effective length is better than inherited repeat length at explaining age of onset

We would expect the effective length to align more closely and better predict age of onset and disease progression than inherited repeat length. To test this prediction, we obtained an estimate for effective length by adjusting inherited repeat length (by subtracting *α*) in the individuals with lower than expected variance-to-mean ratios (15th percentile) and setting *α* equal to zero in the other individuals (16th–100th percentile), 128 DM1 individuals in total. We then compared inherited repeat length and effective length, in 128 DM1 individuals, in terms of explaining age of onset using linear regression analysis. Effective length (adjusted *R*^{2} = 50.6%, *p* < 10^{−}^{15}, *n* = 128) was better than inherited repeat length (adjusted *R*^{2} = 46.8%, *p* < 10^{−}^{15}, *n* = 128) at explaining variance in age of onset, confirming our expectation. Using model *M _{α}*, we remove some of the bias in mutation rates, mentioned above, making them less length-dependent. Under model

*M*, the mutation rates were correlated with inherited repeat length (correlation coefficient = 0.64,

*p*< 10

^{−}^{5}), whereas under model

*M*correlation between mutation rates and inherited repeat length was much lower (correlation coefficient = 0.30,

_{α},*p*< 10

^{−}^{5}).

## 4. Discussion

In DM1, small inherited repeat lengths (less than 100 CTGs) are associated with late-onset and less disease severity. Model *M* assumed that expansion and contraction were linearly dependent on the number of repeats over a threshold number of repeats. However, estimated rates of mutation per repeat unit per year for individuals with small alleles were lower than the rest of the cohort [16]. While this does not affect our ability to describe the changes in repeat length and hence the levels of somatic mosaicism over time, the implication that individuals with small inherited repeat lengths also have low absolute rates of mutation (taking into account inherited repeat length) does not have an obvious biological basis. It is more plausible that model *M* does not take repeat length correctly into account.

We introduced a length constraint into the expressions for expansion and contraction as an extension of the original mathematical model. This approach is sufficiently general to cover a wide range of possible constraints on the mutational mechanism that act by suppressing mutation rates as alleles decrease in size. Here, we have determined that relative to longer alleles, the mutation rate in shorter alleles less than approximately 200 repeats is suppressed. This effect can be quantified and corrected for in our mathematical model of repeat dynamics by defining a new length factor, *α*. An obvious question that arises is what is the biological interpretation of *α*? The precise mechanism by which somatic mosaicism arises remains unknown, but it is assumed that it must involve the misalignment of repeats to generate small loops, hairpins or other alternative non-B-DNA structures [1,17,18]. Although the repeats may be unstable during replication, the lack of a correlation between tissue-specific somatic mosaicism and cell proliferation and the accumulation of high levels of somatic mosaicism in post-mitotic tissues such as brain and muscle [11,12,20,26–29], suggest the primary mechanism is cell division independent. Evidence from transgenic mouse models also implicates a major role for several of the DNA mismatch repair proteins [30–33], and inappropriate DNA mismatch repair of alternative structures presents as an attractive mechanism [34]. A length effect could be envisioned to operate at multiple steps along this pathway (figure 4). Consistent with a length effect mediated by the formation of alternative structures, Gellibolian *et al.* revealed from the biophysical examination of DNA mis-pairing CTG repeat tracts, that the number of mis-pairings per repeat unit was length-dependent with relatively fewer mis-pairings per repeat unit below 200 CTGs and reaching a constant rate over 200 CTGs [36]. Mis-aligned alternative DNA conformations can be formed readily *in vitro* [37], and may arise spontaneously *in vivo* owing to thermal fluctuations (DNA breathing) [38,39]. The rate at which alternative structures form would be expected to be strongly influenced by the local chromatin environment, an effect that is assumed to underlie the major modifying effect that *cis*-acting flanking sequences have on inter-locus repeat instability [40]. Local DNA breathing [38,39] and/or chromatin remodelling could mediate lateral diffusion of loop-outs along the duplex. Lateral diffusion could lead, in some cases, to resolution of the secondary structures, or alternatively lead to further separation of complementary loop-outs. Allele length effects could reasonably be expected to modify the initial distance between complementary loop-outs, which, in turn, would modify the probability of resolution by diffusion and/or limit the available space for lateral diffusion in opposing directions. Complementary loop-outs that are close together may interfere with each other [35]. Alternatively, complementary loop-outs that are close together may be more likely to be encompassed within the same repair domain, leading to no net change in allele length [34]. Whereas, complementary loop-outs separated by a greater distance may be more likely to be repaired as independent events giving rise to the opportunity to generate both expansions and contractions. Interestingly, the length of mismatch repair excision domains in human cells is estimated to be of the order of 60–230 base pairs [41], a distance that is consistent with our estimates of *α*. Of course, the above model(s) of expansions remain speculative, but it is easy to envisage how similar explanations for *α* could be incorporated into other expansion models.

Models *M* and *M _{α}* were also adapted for HD and fitted to repeat length distributions from HD buccal cells [8]. Overall, 14 DM1 blood datasets and 12 HD buccal datasets, model

*M*with its length-specific effect fitted better than the thresholded model,

_{α}*M*(table 2). This result suggests that there is a constraint on the mutational mechanism at the lower end of the allele range in DM1 blood cells and HD buccal cells. Our estimate for the length parameter in HD (average = 26 CAGs) is lower than for DM1 (average = 51 CTGs) which may reflect differences in flanking GC content and possibly explain differences in the disease threshold, which is lower for HD (35 CAGs) than for DM1 (50 CTGs). These results place

*α*within the DNA repair domain 60–230 base pairs suggested by Genschel

*et al.*[41] and thus is consistent with a hypothesis implicating inappropriate DNA repair, as outlined in figure 4.

As well as quantifying the length-specific effect, we inferred the parameter values underlying the best fit and associated with the maximum-likelihood value. For model *M _{α}*, the parameters comprised expansion rate per repeat per year (

*λ*), contraction rate per repeat per year (

*μ*), length parameter (

*α*) and inherited repeat length (

*n*

_{0}). For small alleles, in both DM1 and HD, we found statistical support for expansion and contraction (

*λ*and

*μ*both non-zero) and individual-specific parameters. Expansion and contraction rates were correlated in DM1 (correlation coefficient > 0.99) and in HD (correlation coefficient = 0.70) suggesting that expansion and contraction, in both DM1 and HD, may be different outcomes of the same underlying process or otherwise conserved components of the instability pathway. This result has direct relevance to therapies that target the mutations directly [15] in order to readdress the balance and/or reduce levels of instability. Targeting these common components may not affect the relative proportion of expansions and contractions. However, given that disease onset and progression are at least partially driven by absolute expansion rates, suppressing both expansions and contractions and hence overall net expansion rates, would still be expected to be therapeutically beneficial. Moreover, something must drive the net bias towards expansion and targeting this specific step in the expansion pathway provides a very attractive target as only a modest-specific suppression of expansions would result in a net contraction bias.

Estimated mutational rates, for both expansion and contraction, were significantly lower in HD buccal cells than in DM1 blood cells (figure 3) and were more weighted towards expansion in HD buccal cells (82%) than in DM1 blood cells (60%). These differences have implications for the shape of the repeat length distributions (more skewed to higher repeat lengths in HD and more spread out in DM1) and hence levels of somatic mosaicism. A clear difference between DM1 and HD is that the repeat length tract is found in a non-coding area in DM1 and a coding area in HD. However, the most likely explanation for differences in the estimated mutational rates is linked to cell type rather than linked to disease type. Repeat length distributions measured in both blood cells and buccal cells from the same DM1 individuals (F. Morales *et al*. 2013, unpublished data) show similar differences. Here, the variance-to-mean ratio was found to be higher in blood than in buccal cells reflecting a higher percentage of contraction and hence a lower percentage of expansion in blood than in buccal cells. Differences other than those linked to cell type may have a molecular basis related to flanking GC content that differ in DM1 and HD with a slightly higher percentage of GCs in HD. As there is a strong correlation between the relative germline expandability of these repeats and the flanking GC content [40,42], the higher percentage of GCs in HD might explain the weighting towards expansion in HD and further illuminate a modifying role for flanking GC content.

In addition to the 14 DM1 individuals with inherited repeat lengths less than 100 CTGs, we identified eight DM1 individuals with lower than predicted variance-to-mean ratios under model *M* (figure 1). We hypothesized that the repeat length distributions in these individuals may also have been affected by an individual-specific length constraint of biological origin. We, therefore, fitted model *M _{α}* using an extended range (0–200 CTGs) for the length constraint,

*α*, to an additional 128 DM1-sized single-molecule blood DNA datasets. We found that for six of the eight DM1 individuals, mentioned above, model

*M*fitted the data better than model

_{α}*M*(likelihood gain greater than or equal to 2; table 3). The explanation for this lies with high estimated values for

*α*(80 CTGs and above) and a correspondingly better fit at the low end of the repeat length distributions in these individuals (figure 2

*d*) whose likelihood remarkably increased. An improvement in fit (greater than or equal to 2) and a high estimated value for

*α*(greater than 100 CTGs) were also observed for a further 12 individuals listed in table 3

*b*, notably individuals BC8 and SCO115. These results suggest that length-specific effect may operate in some individuals over higher repeat length ranges (greater than 200 CTGs). These individuals are candidates for further investigation to establish the biological basis. For individuals whose variance-to-mean ratios conform to expectation, models

*M*and

*M*fit equally well, reflecting the decreasing significance of

_{α}*α*over 200 CTGs.

As mutation rates are assumed negligible in the repeat lengths less than *α* under model *M _{α}*, the effective length of the repeat length tract, with respect to mutation, can be considered to be the complementary number of repeats in the tract. We thus defined the difference between the inherited repeat length and

*α*as the effective length of an individual. In this context, individuals with either small inherited repeat lengths (less than 100 CTGs) and/or high estimated values for

*α*are predicted to have effective lengths much smaller than their actual lengths. The length of

*α*may be determined by individual-specific

*cis*-acting and or

*trans*-acting factors. One such plausible

*cis*-acting factor could be an interruption in the pure CTG tract such as CGG or CCG whose physical presence extends the effect of a length constraint [43,44]. A rule where the length constraint of

*α*applies only on one side of an interruption at position

*β*from the other side would be entirely consistent with the uninterrupted version. We would simply infer

*α*+

*β*in the first instance and

*α*in the second instance. High inferred

*α*values correspond to low effective lengths and potentially less instability and disease (figure 4).

Inherited repeat length explains a large proportion of variance in age of onset. Recently, levels of somatic instability have been shown to modify age of onset in DM1 [14] and HD [13]. However, the relationship between inherited repeat length, level of somatic instability and age of onset is not straightforwardly linear. By quantifying length-specific effects, we can now suggest a biologically plausible explanation for this nonlinear component, namely that levels of somatic mosaicism do not progress in a linear manner. Levels of somatic mosaicism appear to be relatively lower in small alleles than in long alleles owing to the length-specific effect. This gives rise to relatively later ages of onset in small alleles than in long alleles, resulting in the observed nonlinear relationship between age of onset and inherited repeat length [14,45].

We showed that effective length aligns more closely and better predicts age of onset and hence disease progression than inherited repeat length. Using model *M _{α}*, we also removed some of the bias in mutation rates, under model

*M*, making them less length-dependent. Consequently, rates adjusted in this way will be better suited as quantitative traits to investigate

*trans-*or

*cis*-acting modifiers of somatic mosaicism, disease onset and progression.

Our findings that mutational rates may be suppressed in the region above the disease thresholds in both HD buccal DNA (most effective up to 60 CAGs on average) and DM1 blood DNA (most effective up to 173 CTGs on average) are encouraging from a clinical perspective. Individuals with alleles in this range generally have reduced levels of somatic mosaicism, less severe phenotypes and later age of onset. Longer DM1 alleles transmitted to the next-generation result in more severe symptoms and an earlier age at onset, an effect compounded by somatic expansion. Suppression of somatic expansion is therefore expected to be therapeutically beneficial and induction of contractions potentially curative [1,15]. However, the feasibility of suppressing expansions/inducing contractions remains largely undetermined. Our results show that, in principle, therapies aimed at reducing the length of disease DNA tracts, if successful, should result in lower levels of somatic mosaicism which should slow down disease progression.

## 5. Conclusion

Inherited repeat length and somatic instability are emerging as key modifiers of disease onset and progression in DM1 and HD [13,14]. However, the relationship between inherited repeat length, somatic instability and age of onset appears complex. Our work unravels some of this complexity by formalizing biological hypotheses mathematically and using statistical inference to compare models and calibrate biological parameters that drive levels of somatic mosaicism. Through mathematical modelling and data-driven quantification, we can better assess the relative importance of these parameters within an individual, between individuals and between cell types and diseases. We find similarities in the underlying mechanism, as evidenced by strong correlation between expansion and contraction rates in both DM1 and HD. However, we also find high levels of variation in these rates suggesting that individual-specific factors modify levels of somatic mosaicism to a large degree. In addition, as illustrated here, some variant repeats or other polymorphisms may further modify repeat length distributions and disease progression. Finding factors that modify disease is an important next step that will be facilitated by the use of biologically relevant quantitative traits, such as those established here.

## Funding statement

This work was supported by a University of Glasgow Kelvin Smith PhD Scholarship awarded to C.F.H.

## Acknowledgements

The authors thank the members of their laboratory for their comments and support.

## Appendix A

### A.1. Basic model *M*

Representing the expansion rate per year, the contraction rate per year and inherited repeat length by *λ*_{n}, *μ*_{n} and *n*_{0}, respectively, and letting *P _{n}*(

*t*) denote the probability that an allele has length

*n*at time

*t*, the rate of change of

*P*(

_{n}*t*) with respect to time is governed by the master equation of the form A1

Given the allele length at time zero, *n*_{0}, we may approximate this infinite system of ordinary differential equations numerically by truncating at a suitably large value of *n* = *N* and setting *P _{n}*(

*t*) = 0 for all

*n*≥

*N*+ 1.

To specify the functional form of *λ*_{n} and *μ*_{n} for model *M*, we depart from the traditional linear model by introducing a threshold, *a*, for the birth and death process. No activity takes place for repeat lengths below this threshold, and the general propensity for expansion or contraction is proportional to the excess length above the threshold, consistent with the inherent stability observed in non-diseased individuals. Hence, the functional forms of *λ*_{n} and *μ*_{n} are *λ*_{n} = *λ*(*n* − *a*) and *μ*_{n} = *μ*(*n* − *a*), respectively, for *n* > *a*.

### A.2. Mathematical model with length-specific effect *M*_{α}

_{α}

To derive the new variation of this model, *M _{α}*, we formulate a length-specific factor, denoted

*R*, as follows. Let the total repeat length be

_{n}*n*. Consider now a length constraint on the mutational mechanism.

*A*and

*B*are locations where repair is needed (e.g. loop-outs). We hypothesize that subsequent mutation requires

*|A*−

*B|*>

*α*, where

*α*is an unknown number of repeat units to be inferred from the data. The length parameter

*α*is therefore interpreted as the minimum separation between repeats required for mutation to occur. We relate the likelihood that mutation occurs to the proportion of all possible distances that result in

*|A*−

*B|*>

*α*, assuming that

*A*and

*B*occur at arbitrary uniformly random positions along

*n*, and that these occurrences are independent. There are

*n*

^{2}possible complementary pairs. Using combinatorial counting methods, it can be shown that there are (

*n*−

*α*) (

*n*−

*α*+ 1) pairs separated by distance

*|A*−

*B|*>

*α*. Hence, the ratio,

*R*, of possible mutation events, is given by A2

_{n}We note that for fixed *α*, *R _{n}* → 1 as

*n*→

*∞*. This corresponds to the intuitively reasonable notion that the finite length constraint is negligible for very large repeat lengths.

*R _{n}* can be considered as the biophysical capacity of a repeat length to undergo expansion and contraction. We would expect smaller alleles to have a lower capacity than larger alleles to expand and contract.

Based on these considerations, we modify our basic model, equation (A 1), as follows
A3and
A4for *n* > *α*, where *λ* and *μ* are now the constant rates of expansion and contraction per repeat unit per year, respectively. For 0 ≤ *n* ≤ *α*, we have *λ*_{n} = *μ*_{n} = 0.

This introduces a nonlinearity into our equations and hence we cannot derive closed forms for the mean and variance. However, the equations can of course still be solved numerically.

### A.3. Model comparison and parameter estimation

We use likelihood methods to carry out model comparison and parameter estimation. The likelihood is defined to be the probability that a repeat length has reached the length observed given the model and its parameters. We can solve equation (A 1) numerically in order to obtain the probability distribution function components *P _{n}*(

*t*) which give the probability that repeat length is

*n*at time

*t*. The likelihood

*L*[

*] is then the product over all the data , which denotes the repeat length for the*

^{i}*j*th observation from individual

*i*, of the probability , where

*θ*

^{[i]}are the model parameters for that individual and

*t*

^{[i]}the age of the individual when the data sample was taken. This gives the likelihood for individual

*i*, A5and the overall likelihood

*L*is the product over all individuals in the population, A6The AIC is used to assess the goodness of the fit of the model [24]. AIC uses the maximized value of the likelihood of the model,

*L*

_{max}, penalized by the number of model parameters,

*k*, to rank models thus A7with the model with the smallest AIC value being ranked highest.

We obtain the maximum value of the likelihood by evaluating the likelihood over a broad parameter space described in table 1. Maximization of the likelihood *L* in equation (A 6) is essentially the maximization of each *L*^{[i]} in equation (A 5) using each dataset from an individual. For further statistical analysis, it was useful to have point estimates for the parameters. These were taken to be the maximum-likelihood values.

The relative likelihood of two models with AIC values denoted AIC_{1} and AIC_{2} respectively, where AIC_{1} < AIC_{2}, is
A8

- Received July 8, 2013.
- Accepted August 22, 2013.

- © 2013 The Author(s) Published by the Royal Society. All rights reserved.