You are viewing a javascript disabled version of the site. Please enable Javascript for this site to function properly.

# Approaches to Sequence the HTT CAG Repeat Expansion and Quantify Repeat Length Variation

### Background:

Huntington’s disease (HD) is an autosomal dominant neurodegenerative disorder caused by the expansion of the HTT CAG repeat. Affected individuals inherit ≥36 repeats and longer alleles cause earlier onset, greater disease severity and faster disease progression. The HTT CAG repeat is genetically unstable in the soma in a process that preferentially generates somatic expansions, the proportion of which is associated with disease onset, severity and progression. Somatic mosaicism of the HTT CAG repeat has traditionally been assessed by semi-quantitative PCR-electrophoresis approaches that have limitations (e.g., no information about sequence variants). Genotyping-by-sequencing could allow for some of these limitations to be overcome.

### Objective:

To investigate the utility of PCR sequencing to genotype large (>50 CAGs) HD alleles and to quantify the associated somatic mosaicism.

### Methods:

We have applied MiSeq and PacBio sequencing to PCR products of the HTT CAG repeat in transgenic R6/2 mice carrying ∼55, ∼110, ∼255 and ∼470 CAGs. For each of these alleles, we compared the repeat length distributions generated for different tissues at two ages.

### Results:

We were able to sequence the CAG repeat full length in all samples. However, the repeat length distributions for samples with ∼470 CAGs were biased towards shorter repeat lengths.

### Conclusion:

PCR sequencing can be used to sequence all the HD alleles considered, but this approach cannot be used to estimate modal allele size or quantify somatic expansions for alleles ⪢250 CAGs. We review the limitations of PCR sequencing and alternative approaches that may allow the quantification of somatic contractions and very large somatic expansions.

## INTRODUCTION

Huntington’s disease (HD) is caused by the expansion of a CAG repeat in exon one of the HTT gene. Affected individuals inherit ≥36 CAG repeats and longer alleles are associated with earlier onset, greater disease severity and faster disease progression [1–3]. The HTT CAG repeat is genetically unstable in the germline, which frequently causes the number of CAG repeats to increase in successive generations [4]. Germline instability of the HTT CAG repeat is thus the cause of the anticipation often seen in HD, i.e., earlier onset and increased disease severity in successive generations. The HTT CAG repeat is also genetically unstable in the soma [5]. The somatic mosaicism of the HTT CAG repeat is cell-type specific [6], expansion-biased and is dependent on both the number of inherited CAG repeats and age [7, 8]. In HD patients and HD mouse models, the largest somatic CAG expansions are observed in the cortex and striatum [9–12], the brain region most affected by HD pathology [13]. Such somatic expansions can contain dozens, hundreds or even thousands of re-peats and are observed early in the course of the disease [9]. Because patients with a larger number of CAG repeats experience earlier onset, greater disease severity and faster progression, somatic CAG expansions are likely determinants of the tissue specificity, onset, severity, and progression of the disease. Several association studies have gathered evidence that supports this hypothesis. Swami et al. [14] inv-estigated somatic expansion in post-mortem DNA samples from HD patients, and showed that patients with particularly early-onset have a higher proportion of large expansions in the cortex. Most recently, genetic association studies in HD patient cohorts have revealed the association between DNA repair gene variants and age at motor signs of HD and HD progression [15–19]. These data suggest that somatic CAG repeat expansions contribute toward HD pat-hology. Nevertheless, it remains unknown whether somatic expansions are required to precipitate pat-hology, or what their relative contribution to HD pathology is. More detailed quantitative analyses of the dynamics of somatic mosaicism should help to answer some of these questions, and further disentangling of the mechanisms by which they are generated should allow for the identification of additional therapeutic targets. Although several components of the DNA mismatch repair pathway have been shown to be critical for somatic expansion [12, 20–23], the pre-cise molecular processes that generate expansions and determine their tissue specificity is unknown [18, 19, 24, 25]. In this context, precise and accurate quantification of HTT CAG somatic mosaicism in patient samples and model systems is warranted, since it may provide a suitable phenotype for genetic association studies aimed at identifying genetic variants that act as modifiers of repeat instability [7, 15]. Somatic mosaicism in HD patients and model organisms is traditionally assessed by gel or capillary electrophoresis of single-molecule, small pool (SP) or bulk-PCR (PCR using thousands of DNA molecules as a template) products [26]. However, using electrophoresis to estimate the number of re-peats in the PCR fragments detected does not provide any information about genetic variants within and around the HTT CAG repeat, which can lead to erroneous electrophoresis estimates of the number of pure CAG repeats [7]. Massively parallel sequencing (MPS) should allow for some of the limitations of electrophoresis to be overcome. However, although short MPS reads (<150 nt) can be used to genotype variants within and around the HTT CAG repeat, longer reads (≥150 bp) are required to be able to phase these variants with the CAG repeat. Identifying these variants and phasing them with the CAG repeat is essential to determine pure CAG length in HD patients and because some of them (e.g., presence or absence of the CAACAG hexamer 3’ of the CAG repeat) are associated with variation in HD onset [7, 16, 27], severity and progression [7].

Indeed, MPS has also recently proven useful for the detection of microsatellite somatic mosaicism in tumour samples [28, 29]. At high read-depth, MPS data should allow precise and accurate quantification of somatic repeat length variants, whilst also pro-viding information about genetic variants within and around the repeat. However, massively parallel seq-uencing is not commonly used to sequence, size and quantify the somatic mosaicism associated with trinucleotide expansions. The main reason for this is that sequencing reads generated by commonly used MPS platforms (<150 nt) are too short for the accurate sizing of the repeat, which requires repeat-spanning sequencing reads. However, long-read sequencing platforms have recently been successfully applied to the sequencing of trinucleotide repeat expansions. Using PacBio single-molecule real-time (SMRT) long-read sequencing, Cumming et al. [30] have demonstrated the usefulness of bulk-PCR sequencing to sequence myotonic dystrophy type 1 (DM1)-assoc-iated DMPK CTG repeat expansions with modal alleles of up to ∼400 CTG repeats. Besides, Hafford-Tear et al. [31] have sequenced Fuchs endothelial corneal dystrophy-associated TCF4 CTG repeat expansions with modal alleles of up to ∼150 CTG repeats and some associated somatic expansions of more than 1,000 repeats, using an amplification-free PacBio SMRT approach. As demonstrated by Ciosi et al. [7] using 400 nt MiSeq reads for sequencing of the HTT CAG repeat, PacBio SMRT long-read sequencing is not required for the sequencing and quantification of somatic mosaicism of shorter trinucleotide repeat expansions. This is likely to apply to several other polyglutamine encoding CAG repeat disorders which are typically associated with smaller repeat expansions (<80 CAG repeats) and for which preliminary evidence also suggest somatic expansion may be important [32].

Although MPS approaches allow some of the limitations of electrophoresis to be overcome by providing information about genetic variants within and around the repeat, they might be limited in their ability to detect very large repeat expansions. However, no formal comparison of electrophoresis and MPS approaches for the quantification of somatic mosaicism of trinucleotide expansions has previously been presented. In this study, we have applied bulk-PCR sequencing approaches using Illumina MiSeq and PacBio SMRT long-read sequencing to assess their usefulness for the estimation of the modal allele, and the quantification of somatic expansions, associated with small (∼50 to ∼100 CAG) and large (∼100 to ∼500 CAG) HTT repeat expansions. Where appropriate, the bulk-PCR sequencing data were compared with more traditional small pool-PCR (SP-PCR) Southern blot analyses and capillary electrophoresis of fluorochrome-labelled bulk-PCR products.

## MATERIALS AND METHODS

### Samples

In order to evaluate the ability of bulk-PCR seq-uencing approaches to sequence different numbers of CAGs, we used previously described [33] DNA samples from an allelic series of R6/2 transgenic mice [34] carrying a single copy of a human HTT transgene with germline alleles containing ∼55, ∼110, ∼255 or ∼470 CAGs [33, 35] (Table 1). To measure different levels of somatic mosaicism of the HTT CAG repeat, DNA samples isolated from four different tissues, known to harbour different levels of somatic mosaicism, were selected (striatum, liver, cortex and cerebellum, Table 1). To capture some of the age-dependent increase in somatic mosaicism, we analysed samples from one young and one old mouse for each repeat length, aiming at analysing DNA from the oldest available mice. The life span of R6/2 mice carrying 100 to 300 CAGs is much shorter than in mice carrying <  60 and >440 CAGs [35–37]. The oldest available mice carrying ∼110 and ∼255 CAGs were therefore much younger (∼5 months old), than the oldest mice carrying ∼55 and ∼470 CAGs (Table 1). To ensure that somatic mosaicism did not bias the estimation of the number of CAGs in the progenitor allele, a DNA sample of the tail at weaning was also analysed for each of the older mice (Table 1).

##### Table 1

Characteristics of the R6/2 DNA samples analysed

 PacBio sequencing batch/run Number of CAGs * Age (weeks) Striatum Liver Cortex Cerebellum Tail at death Tail at weaning 1 53 6 + + + + + – 1 53 117 + + + + – + 1 109 4.4 + + + + – – 1 111 20 + + + + – + 2 252 6 + + + + – – 1 &2 258 20 + + + + – + 2 469 6 + + + + – – 2 476 115.7 + + + + – –

One row corresponds to one mouse analysed.  *: previously estimated by capillary electrophoresis [33].+: sample analysed. – : sample not available.

### MiSeq library preparation for four R6/2 mice inheriting HTT transgenes with ∼55 or ∼110 CAGs

The HTT exon one trinucleotide repeat (compr-ising the CAG repeat, a CAACAG hexamer, a CCGCCA hexamer, the CCG repeat and the CCT repeat) was PCR-amplified from 20 ng of genomic DNA using MiSeq-compatible PCR primers [38]. The HTT-specific part of these MiSeq-compatible PCR primers corresponded to the primer pair 31329/33934 [34] that targets the human HHTT exon one trinucleotide repeat from the R6/2 human HTT transgene and not the endogenous mouse Htt CAG repeat. A single PCR per tissue was performed. Each sample, i.e., one tissue from a particular mouse, was associated with a unique pair of indexes, also frequently referred to as a sample barcode. TruSeq CD indexes were used for the samples with ∼55 CAGs, and Nextera XT Index Kit v2 indexes were used for the samples with ∼110 CAGs (Supplementary Tables 1 and 2) [38]. MiSeq sequencing was performed with a 400 nt forward read and a 200 nt reverse read by Glasgow Polyomics (https://www.polyomics.gla.ac.uk). See Ciosi et al. [38] for the full details of the sequencing library preparation and MiSeq sequencing.

### Preparation of bulk-PCR products for PacBio SMRT sequencing

Bulk-PCR products of the HTT exon one trinucl-eotide repeat (comprising the CAG repeat, a CAACAG hexamer, a CCGCCA hexamer, the CCG re-peat and the CCT repeat) for PacBio SMRT library preparation were generated by PCR using barcoded PCR primers. PacBio barcodes for the PacBio RS II System in symmetric mode [39] were used as 5’-tails to the R6/2 transgene-specific primers MS1F (5’-GCCCAGAGCCCCATTCATTG-3’) and MS1R (5’-CCGCCATCCCCGCCGTAGCC-3’). These spe-cifically target the human HHTT exon one (which contains the CAG repeat) from the R6/2 transgene and not the endogenous mouse Htt CAG repeat (see Supplementary Table 3 for the full sequence of barcoded primers used). PCRs using the barcoded primers MS1F and MS1R should generate PCR products of 557, 722, 1157 and 1802 bp on R6/2 transgene templates containing 55, 110, 255, or 470 CAGs respectively. Each sample, i.e., one tissue from a particular mouse, was associated with a unique barcode. Fifteen microliter PCRs containing 20 ng of genomic DNA were set up as described by Ciosi et al. [38] using the following cycling conditions: (96°C, 5 min); 28 cycles of (96°C, 45 s), (59°C, 45 s) and (72°C, 3 min); and (72°C, 10 min).

Samples were processed in two batches, depending on the number of CAGs in the modal allele, with the aim to sequence the two batches on two separate PacBio RSII SMRT cells. This precaution was taken because fragment loading on PacBio RSII SMRT cells was known to be biased towards the loading of smaller molecules. The first batch corresponded to the samples with ∼55 and ∼110 CAGs, and the second batch corresponded to samples with ∼255 and ∼470 CAGs. One DNA sample, from the tail at weaning of the 20-week-old mouse with 258 CAGs (Table 1), was included in both batches to allow evaluation of potential inter-SMRT cell heterogeneity in sequencing quality.

The first batch of PCR products corresponded to the R6/2 samples with the shorter modal alleles (four or five tissues for each of the four mice with ∼55 or ∼110 CAGs, Table 1). For this first batch, five PCRs were performed (as described above) for each sample (i.e., tissue from a particular mouse). After amplification, the PCR products were pooled to obtain one pool of ∼70μl per sample. A 15μl aliquot of each of these sample-specific pools of PCR products was resolved on a 1.5% agarose gel to confirm amplification. The remaining PCR products (∼55μl per sample) corresponding to the same number of CAGs were pooled (one pool for ∼55 CAGs, one pool for ∼110 CAGs and one pool for ∼255 CAGs) and purified using a 1.4X AMPure® XP (Beckman Coulter) clean-up procedure [38] with a final elution volume equal to 1/4 the volume of beads used. The quality of the three PCR product pools (∼55, ∼110 and ∼255) post-AMPure purification was assessed by capillary electrophoresis on a Bioanalyzer (Agilent) to check that the fragments had the expected size and that primer dimers were absent. These three pools (∼55, ∼110 and ∼255 CAGs) were then combined to form an equimolar pool (i.e., the same number of molecules per sample based on Bioanalyzer-es-timated molarity), containing ≥500 ng of PCR product at ≥13 ng μl −1 as required by the PacBio RSII sequencing facility at the Earlham Institute (Norwich, UK – https://www.earlham.ac.uk).

The second batch of PCR products corresponded to the R6/2 samples with the longer modal alleles (four or five tissues for each of the four mice with ∼255 and ∼470 CAGs, Table 1). Five PCRs were performed (as described above) for each sample (i.e., tissue from a particular mouse) with ∼255 CAGs. After amplification, the PCR products were pooled to obtain one pool of ∼70μl per sample. A 7.5μl aliquot of each of these sample-specific pools of PCR products was resolved on a 1.5% agarose gel to confirm amplification. The remaining PCR products (∼60μl per sample) were then pooled and purified using the 1.4X AMPure® XP clean-up procedure described above. The quality of the pool of ∼255 CAGs PCR products was then assessed on a Bioanalyzer (Agilent) as described above.

Eight to 60 PCRs were performed (as described above) for each sample with ∼470 CAGs depending on PCR yield (i.e., more PCRs were performed for samples associated with lower PCR yield). After amplification, the PCR products were pooled to obtain one pool per sample. Each of these pools was purified using a 0.6X AMPure® XP clean-up procedure [38] with a final elution volume equal to 1/2 the volume of beads used. This lower amount of 0.6X AMPure® XP beads, relative to the amount used for the samples with ∼55, ∼110 and ∼255 CAGs, was used for the samples with ∼470 CAGs in an attempt to remove smaller fragments that would, if present, be preferentially sequenced. The quality and quantity of each of these PCR product pools were then assessed on a Bioanalyzer (Agilent) as described above. The pools of ∼470 CAGs PCR products were then combined with the pool of ∼255 CAGs PCR products to form an equimolar pool (same number of molecules per sample based on Bioanalyzer-estimated mol-arity). The equimolar pool was then concentrated using a 1.6X AMPure® XP clean-up procedure to obtain a solution containing ≥500 ng of PCR product at ≥13 ng μl −1 as required by the PacBio RSII sequencing facility at the Earlham Institute (Norwich, UK).

### PCR-capillary electrophoresis for the two R6/2 mice carrying ∼55 HTT CAG repeats

The HTT exon one repeat was PCR-amplified from 80 ng of genomic DNA using a previously described [47] human-specific PCR assay that amplifies the HTT CAG repeat from the R6/2 transgene without amplifying the mouse Htt CAG repeat. Three independent PCRs were performed per sample (i.e., tissue from each of the two R6/2 mice carrying ∼55 HTT CAG repeats). The forward primer was fluorescently labelled with 6-FAM and PCR products were resolved using the ABI 3730xl DNA analyzer (Applied Biosystems) with GeneScan 500 LIZ as internal size standard (Applied Biosystems). GeneMapper v3.7 (Applied Biosystems) was used to generate CAG repeat size distribution traces. Peaks falling below the threshold of intensity, 50 fluorescent units, were excluded from the analysis. The final CAG frequency distribution considered for each sample was the mean distribution over the three obtained for each sample, i.e., one per PCR.

### Small pool-PCR

Previous small-pool PCR (SP-PCR) experiments using the striatum from the 117-week-old R6/2 mo-use with ∼55 CAGs showed that a small percentage of somatic CAG expansions are very large (>90 CAGs) [33]. It is not clear if these very large somatic CAG expansions can also be detected by capillary electrophoresis or either of the parallel-sequencing approaches in combination with bulk-PCR. To investigate this, we used SP-PCR to quantify these very large CAG expansions in the striatum sample of the R6/2 mouse with ∼55 CAGs. The SP-PCR quantification involved a combination of single-molecule PCRs (to derive the overall length distribution and precisely quantify the amount of input DNA) and PCRs using higher template concentrations (to estimate the frequency of the rarer large repeat length increases) [48]. A concentration range experiment between 5 and 50 pg of template DNA per PCR was first conducted to establish the correct quantity of template DNA to achieve single-molecule PCRs and 17.5 pg per PCR was selected. Overall, 288 PCRs with 17.5 pg (single molecule level), 132 with 150 pg, and 44 with 250 pg of genomic DNA template as starting material, were performed as previously described [33]. The PCR products obtained were resolved by agarose gel electrophoresis, Southern blotted and hybridised as previously described [33]. Individual bands (>250 CAG) were identified and sized by comparing against the 1 Kb Plus DNA Ladder (Invitrogen) using the CLIQS 1D gel analysis software (TotalLabs, UK). Assuming the number of bands is proportional to the amount of template, the 17.5 pg data were used to calculate expected bands/lane at 150 pg under assumptions of a Poisson distribution [48] and the frequency of very long expansions in the 150 pg PCR products was determined.

### Data availability

The sequencing reads for this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB41395 (https://www.ebi.ac.uk/ena/browser/view/PRJEB41395).

## RESULTS

### Qualitative assessment of the usefulness of MiSeq and PacBio SMRT sequencing to quantify somatic mosaicism in R6/2 mice with ∼55 CAGs and comparison with capillary electrophoresis

##### Fig. 1

Qualitative assessment of somatic mosaicism comparing CAG frequency distributions obtained by capillary electrophoresis, MiSeq or PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 117-week-old R6/2 mouse with ∼55 CAGs. Capillary electrophoresis data in black, MiSeq sequencing data in white and PacBio SMRT sequencing data in grey.

##### Fig. 2

Representative sequence alignments of the 400 nt MiSeq reads (A and B), PacBio CCS reads (C and D) and PacBio subreads (E and F) uniquely aligned (i.e., reads not discarded post alignment) to a synthetic reference sequence with 115 CAGs. Alignments shown correspond to 30 sequencing reads obtained from the tail at weaning of the 20-week-old mouse with ∼110 CAGs. The part of the alignment shown corresponds to the four nucleotides in the immediate 5’–flank of the HTT CAG repeat, followed by the first 20 CAGs (A, C and E), as well as the last 7 CAGs followed by (CAACAG)1(CCGCCA)1(CCG)7(CCT)2 and the four nucleotides in the immediate 3’-flank of that sequence (B, D and F). Note that the last nucleotide sequenced for the sample with the 400 nt MiSeq reads end was the first C of the seventh CCG (B). The white box on the right-hand side of panel B represents the part of the PCR products containing 115 CAGs that could not be sequenced using 400 nt MiSeq reads.

### Comparison of three bulk-PCR approaches with SP-PCR on the 117-week-old striatum sample with ∼55 CAGs

As previously shown [33], SP-PCR revealed a high frequency of somatic expansions in the striatum of the 117-week-old R6/2 mouse with ∼55 CAGs (Fig. 3A). Most of these somatic expansions can be seen on the autoradiographs as bands between 55 CAGs (i.e., size of the modal allele) and 70 CAGs (Fig. 3A). SP-PCR also detected frequent somatic expansions with 70 to 80 CAGs, and rarer somatic expansion ≥80 CAGs (Fig. 3A). The frequency of somatic expansions ≥70 CAGs detected by SP-PCR was estimated by genotyping ∼1,300 molecules across 464 SP-PCRs. This allowed a quantitative comparison between SP-PCR and the three bulk-PCR approaches presented here (capillary electrophoresis, MiSeq and PacBio SMRT) that demonstrates that large expansions are better detected by SP-PCR. The percentage of somatic expansions with 70 to 80 CAGs was similar for the three bulk-PCR approaches (∼4.5%, Figs. 1 and 3B) and lower than the one estimated by SP-PCR (∼6%, Fig. 3B). Similar percentages of somatic expansions with 80 to 89 CAGs were detected by bulk or SP-PCR (∼1%, Fig. 3B). No somatic expansions ≥90 CAGs were detected by bulk-PCR PacBio SMRT sequencing (Fig. 3B, C). The percentage of somatic expansions with 90 to 99 CAGs estimated by SP-PCR (0.38%, Fig. 3B, C) was one order of magnitude higher than that estimated by bulk-PCR capillary electrophoresis and MiSeq (0.02% and 0.04% respectively, Fig. 3B, C). No somatic expansions ≥100 CAGs were detected by capillary electrophoresis and the percentage of such somatic expansion estimated by SP-PCR (0.61%, Fig. 3B, C) was one order of magnitude higher than that estimated by bulk-PCR MiSeq (0.04%, Fig. 3B). Read depth must be considered when directly comparing the results obtained by bulk-PCR PacBio SMRT sequencing and bulk-PCR MiSeq sequencing. For example, the much lower number of PacBio CCS reads obtained for the samples with ∼55 CAGs (Fig. 1) is a very likely explanation why somatic expansion with >90 CAGs could be detected with 23,064 MiSeq reads for the striatum with 55 CAGs, but not with 444 CCS PacBio reads (Fig. 3B, C).

##### Fig. 3

SP-PCR can detect very large HTT CAG somatic expansions (≥90 CAGs) that cannot be detected using bulk-PCR approaches. A) Representative small pool PCR autoradiograph from 150 pg template DNA obtained for the striatum of the 117-week-old R6/2 mouse with ∼55 CAGs. The number of CAG repeats, equivalent to each molecular weight marker (left) and the boundaries of the categories represented in panel A (right), is indicated. The boundaries of the categories represented in panel A (right) are also indicated by white dashed lines. B) Percentage of large (≥70 CAGs) HTT CAG somatic expansions detected by SP-PCR (black to white gradient), or bulk-PCR capillary electrophoresis (black), bulk-PCR MiSeq (white), bulk-PCR PacBio SMRT (grey) in the striatum of the 117-week-old R6/2 mouse with progenitor allele ∼55 CAGs. C: HTT CAG somatic expansions >90 CAGs from panel B. Error bars indicate the 95% confidence intervals (they could not be estimated for the bulk-PCR capillary electrophoresis because the fluorescence units measured cannot be transformed into a count of PCR products detected).

### Qualitative assessment of the usefulness of MiSeq and PacBio SMRT sequencing to quantify somatic mosaicism in R6/2 mice with ∼110 CAGs

##### Fig. 4

CAG frequency distributions obtained by MiSeq or PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 117-week-old R6/2 mouse with ∼110 CAGs. MiSeq sequencing data in white and PacBio SMRT sequencing data in grey. The dotted line on the MiSeq sequencing data panels indicates 123 CAGs, which is the theoretical maximum number of CAGs that could have been sequenced using the PCR primer pair (31329/33934) and a 400 nt MiSeq read.

It was very clear from all the PacBio CCS re-ads alignments that all mice with ∼110 CAGs car-ried a typical HTT allele with a pure CAG tract and seven CCGs (i.e., HTT allele of structure (CAG)n(CAACAG)1(CCGCCA)1(CCG)7(CCT)2 [7]). Alth-ough it was clear from the forward MiSeq read ali-gnment (Fig. 2A, B) that the CAG tract was pure, the many sequencing errors at the ends of these reads (Fig. 2B) did not allow us to confidently determine the allele structure. However, it was possible to det-ermine the allele structure by aligning the reverse MiSeq reads to the synthetic reference sequence (CAG)100(CAACAG)1(CCGCCA)1(CCG)7(CCT)2-3’-flank (Supplementary Figure 2).

As previously illustrated for the mice with ∼55 CAGs, a qualitative assessment of somatic mosaicism in the different tissues of the two mice analysed should be possible by comparing the CAG frequency distributions obtained. The CAG frequency distributions obtained for different tissues in the 20-week-old mouse should be interpreted relative to the progenitor allele which corresponds to the modal number of CAGs in the tail at weaning (Fig. 4). The CAG frequency distributions obtained with the PacBio CCS reads reflect the expected age-dependent and tissue-specific nature of HTT CAG somatic mosaicism [33]. Indeed, the variation of the number of CAGs in the PacBio data was higher in the old mouse than the young one, and was higher in the striatum than in the other tissues in the old mouse (Fig. 4). However, the CAG frequency distributions obtained with the PacBio CCS reads were very jagged and discontinuous (Fig. 4). These jagged discontinuous distributions still allowed us to observe that there are more somatic expansions in cortex and liver than in the cerebellum (Fig. 4), but could lead to inaccurate estimates of the frequency of these somatic expansions. On the contrary, the MiSeq data failed to capture any of the expected age-dependent and tissue-specific variation in somatic mosaicism of the HTT CAG repeat (Fig. 4).

### Qualitative assessment of the usefulness of PacBio sequencing to quantify somatic mosaicism in R6/2 mice with ∼255 CAGs

##### Fig. 5

CAG frequency distributions obtained by PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 20-week-old R6/2 mouse with ∼255 CAGs. The tail at weaning data for the 20-week-old mouse is not shown because only two reads with 266 and 274 CAGs were obtained post-alignment and post-discard.

### Qualitative assessment of the usefulness of PacBio sequencing to quantify somatic mosaicism in R6/2 mice with ∼470 CAGs

##### Fig. 6

CAG frequency distributions obtained by PacBio SMRT sequencing of bulk-PCR products obtained for different tissues of one 6-week-old and one 116-week-old R6/2 mouse with ∼470 CAGs.

### The PacBio sequencing data illustrates the effect of the number of inherited CAG repeats on somatic expansion

Somatic mosaicism of the HTT CAG repeat depends on age, but also on the number of inherited CAG repeats [7, 8]. An illustration of that effect can be seen in the PacBio sequencing data obtained for the older R6/2 mice analysed, where we can see more somatic expansions in some samples with a larger number of inherited CAG repeats. Indeed, we see a longer tail of more frequent somatic expansions in the cortex and striatum of the mouse with ∼255 CAGs (20-week-old, Fig. 5) compared to the mouse with ∼110 CAGs (20-week-old, Fig. 4). Moreover, despite the older mouse with ∼110 CAGs being nearly two years younger than the older mouse with ∼55 CAGs, the PacBio sequencing data shows that there were more somatic expansions in the liver of the 20-week-old mouse with ∼110 CAGs (Fig. 4) than in the 117-week-old mouse with ∼55 CAGs (Fig. 1).

## DISCUSSION

### Bulk-PCR sequencing

In the present study, we have applied bulk-PCR sequencing approaches to sequence HTT CAG repeat expansions and to quantify the associated somatic mosaicism. In doing so, we have shown that both MiSeq and PacBio technologies can be used to genotype the HTT CAG repeat and to quantify somatic mosaicism. Nevertheless, both sequencing platforms have limitations. For MiSeq, the main limitation is read length. In the present study, we have generated MiSeq sequencing libraries by PCR using a primer in the immediate 5’–flanking DNA of the CAG repeat. This, in combination with MiSeq reads of 400 nt, is expected to limit the sequencing to 123 CAGs. We have shown that, because of this upper limit, the MiSeq approach applies to sequencing HTT alleles containing up to ∼115 CAGs but not to the estimation of modal allele length of ∼110 CAGs and/or the quantification of the somatic mosaicism of such alleles. Considering this upper limit of 123 CAGs, the largest modal alleles for which somatic mosaicism will reliably be quantifiable using bulk-PCR MiSeq are thus probably ∼90 CAGs. This prediction assumes that there would be enough leeway between a progenitor of ∼90 CAGs and the upper sequencing limit of 123 CAGs to reliably quantify somatic expansions. We have also shown that the reverse MiSeq reads allow the determination of the HTT allele structure (presence/absence, organisation and number of CAA, CCA, CCG and CCTs 3’ of the CAG repeat) as accurately in samples with ∼110, as in samples with ∼55 CAGs (Supplementary Figure 2). This thus seems to indicate that the quality of the reverse reads is independent of the number of CAGs. It should thus be possible to use reverse MiSeq reads to determine the allele structure of any HTT alleles that can be PCR amplified.

It is unclear whether larger CAG repeat length and/or inter-run variability inherent to the PacBio technology is/are responsible for the lower sequencing quality associated with PacBio run 2 relative to PacBio run 1 (Supplementary File 2). This difference in sequencing quality between the two runs warrants the future use of a sequencing control, similar to the Illumina PhiX control [50], for PacBio sequencing. Nevertheless, we have shown that the PacBio platform can be used to sequence HTT alleles of up to ∼550 CAGs. Based on studies of other unstable CAG•CTG repeat loci, it is likely possible to sequence an even higher number of HTT repeats with PacBio SMRT sequencing. Indeed, Ardui [51] and Hafford-Tear et al. [31] used PacBio to sequence respectively DMPK and TCF4 CAG•CTG somatic expansions containing up to about 1,500 repeats. However, the bulk-PCR approach we have used in the present study is not appropriate to estimate the modal number of CAGs in tissues where it is very large (like the samples with ∼470 CAGs used in the present study). Our PacBio data for such samples demonstrate that the loading/sequencing bias on PacBio RSII SMRT cells produces CAG frequency distributions biased towards fragments containing <300 CAGs. This bias is similar to the one described by Ardui [51] for the DMPK CAG•CTG repeat where PacBio SMRT sequencing of bulk-PCR fragments with a mode corresponding to ∼750 CAG•CTGs produced reads with a mode at ∼255 CAG•CTGs. Data presented here demonstrate, however, that PacBio sequencing would be useful to quantify somatic mosaicism in individuals with progenitor alleles containing up to ∼250 CAGs.

### Comparison of three bulk-PCR approaches with SP-PCR

The CAG frequency distributions obtained using bulk-PCR followed by capillary electrophoresis or MiSeq sequencing for the samples with ∼55 CAGs were very similar for the smaller and more frequent somatic expansions (<85 CAGs, Fig. 1). However, compared to the SP-PCR data, large expansions >90 CAG repeats were underrepresented using the capillary electrophoresis and MiSeq approaches, and undetectable in the PacBio data (Fig. 3). A major factor in driving this disparity is the reduced PCR efficiency in amplifying large alleles compared to smaller ones. This yields a relatively lower number of end-products per input molecule for larger alleles. In SP-PCR, the reduced amplification of larger alleles is at least partially compensated for by greater hybridisation efficiency to a repeat unit probe, and by the spatial resolution offered by the low number of input molecules, multiple reactions and gel electrophoresis. That means that the products of single input molecules can still be readily detected by SP-PCR independent of their size and relative amplification efficiency (at least up to ∼1,000 repeats). In the capillary electrophoresis approach, each PCR product contains only a single fluorescent moiety incorporated into one of the primers independent of the size of the molecule. Thus, large fragments that amplify less efficiently yield a lower signal. When such larger alleles are relatively rare, the signal from such molecules becomes lost in the inevitable background fluorescence observed using this approach. The sequencing-based approaches do not yield any inherent background and, as in the MiSeq data, with high enough read depth (typically ∼45,000 sequencing reads per sample), rare large expansions can be detected, albeit at a lower absolute frequency than the relative frequency of input molecules due to the amplification bias, and probably a higher frequency of sequencing errors that likely reduces alignment efficiency too. It seems reasonable to assume that such large rare expansions would be similarly detectable with PacBio, assuming read depth was high enough.

##### Table 2

Main characteristics of different methods for the preparation of libraries for MiSeq and PacBio for the sequencing of CAG repeats and the quantification of somatic mosaicism

 Library preparation Sequencing platform Max CAG repeat size Max modal CAG size * Max somatic expansion size Estimated sequencing cost per sample ** Quantity of genomic DNA required per sample Indels Relative accuracy of the quantification of somatic expansions Relative accuracy of the quantification of somatic contractions Bulk-PCR Illumina MiSeq ∼115 ∼90 ∼120 ∼$8 † 20 ng No + not detected if PacBio RSII ≥550 ∼250 ≥550 ∼$50 † 0.1 to 1.2  μg Yes + not very large PacBio Sequel ≥550SM ∼250SM ≥550SM ∼$30 † 20 to 240 ngSM Yes + PacBio Sequel II ≥550SM ∼250SM ≥550SM ∼$10 † 20 to 60 ngSM Yes + SP-PCR Illumina MiSeq ∼115SM ∼90SM ∼120SM ∼$8 ‡ 2 ngSLM No ++ + PacBio RSII ≥1500L ≥1200L ≥1500L ∼$50 ‡ 10 to 120 ngSLM Yes ++ + PacBio Sequel ≥1500LM ≥1200LM ≥1500LM ∼$30 ‡ 2 to 24 ngSLM Yes ++ + PacBio Sequel II ≥1500LM ≥1200LM ≥1500LM ∼$10 ‡ 2 to 6 ngSLM Yes ++ + Amplification- Illumina MiSeq ∼115S ∼90S ∼120S ∼$20 † ≥5μgL No +++ +++ free PacBio RSII ≥1500L ≥150L ≥1500L ∼$1000 † ≥5μgL Yes +++ +++ PacBio Sequel ≥1500LM ≥150LM ≥1500LM ∼$600 † ≥5μgL Yes +++ +++ PacBio Sequel II ≥1500LM ≥150LM ≥1500LM ∼$100 † ≥5μgL Yes +++ +++ Amplicon Illumina MiSeq ∼115S ∼90S ∼120S ∼$20 # 20 to 150 ngL No +++ +++ sequencing PacBio RSII ≥550SM ∼250SM ≥550SM ∼$1000 # 2.5 to 30μgSL Yes +++ +++ of barcoded PacBio Sequel ≥550SM ∼250SM ≥550SM ∼$600 # 0.4 to 4.8μgSLM Yes +++ +++ single molecules PacBio Sequel II ≥550SM ∼250SM ≥550SM ∼$100 # 0.2 to 0.6μgSLM Yes +++ +++

Observations described in this study are indicated in bold, the other information corresponds to expectations for approaches not used in this study based on observations described in this study (S), on what has been described on other trinucleotide loci in the literature (L) and/or on the manufacturer information available for each sequencing platform (M).  *: Max modal allele size for which the modal allele size can be estimated by sequencing and for which somatic mosaicism will reliably be quantified.  **: based on [63]. †: assuming a minimum of ∼5,000 reads per sample and a maximum of 384 samples per sequencing run. ‡: assuming 20 SP-PCRs per sample and 250 reads per SP-PCRs. #: assuming ∼20 reads per single molecule and the genotyping of 5,000 single molecules.

The three bulk-PCR approaches (capillary electrophoresis, MiSeq and PacBio SMRT) share the same bulk-PCR pitfalls, i.e., lack of detection of very large and rare expansions that can be detected by SP-PCR (Fig. 3). The use of bulk-PCR makes the sequencing library preparation robust and straightforward, which allows the processing of a high volume of samples. This is particularly true for the MiSeq platform, for which the number of reads produced is much higher than on the PacBio platforms (Table 2). However, bulk-PCR artefacts make it difficult to estimate the size of larger progenitor alleles, hampering our ability to quantify somatic expansions very accurately and making it very difficult to quantify somatic contractions (Table 2).

### Future directions in the field for the use of parallel and single-molecule sequencing

Although strongly biased towards expansion, somatic instability is thought to involve a combination of small frequent expansions and contractions [54]. Contractions of the HTT CAG repeat are observed in some animal models [33], but virtually nothing is known about their molecular basis. A first step to improve our understanding of this process would be to be able to quantify somatic repeat contractions. Whereas it is possible to estimate the proportion of somatic expansions using bulk-PCR, repeat contractions cannot be clearly distinguished from PCR slippage artefacts and thus cannot be easily quantified (Table 2). Moreover, as illustrated here, bulk-PCR fails to identify the rare large somatic expansions that can be detected using SP-PCR (Table 2). This is a major problem in the field, as the pathogenic impact of these rare large somatic expansions relative to the more frequent and shorter somatic expansions remains unknown. Improvement of current methods to detect and quantify repeat contractions and accurately measure the net frequency of somatic expansions is thus needed. Such improved methods would expand our understanding of CAG repeat dynamics during the lifetime of an individual and the role of somatic instability in HD disease pathology. More accurately quantifying the full range of somatic repeat length changes could also prove useful in identifying genetic modifiers of somatic instability (it is likely that some variants may act to modify the size and direction of repeat length changes, and not just their absolute frequency), and for the development of outcome measures in clinical trials that aim to suppress somatic repeat expansions and/or induce contractions [55].

Some of the limitations of bulk-PCR may be overcome by using recombinase polymerase amplification, an isothermal replacement to PCR that has been shown to produce fewer HTT CAG slippage pr-oducts [56], or SP-PCR-sequencing which has the potential of allowing the detection of rare large somatic expansions [51]. However, these approaches are still limited to some extent by PCR slippage, confounding the quantification of somatic contractions (Table 2). Using amplification-free methods should allow detection and accurate quantification of both somatic contractions and large somatic expansions that remain undetectable using bulk-PCR (Table 2).

Such an amplification-free approach for sequencing of repeat-expansion loci has been developed by Tsai et al. [57]. This approach, named “no-amp targeted sequencing”, utilises the capture and enrichment of the region(s) of interest using the CRISPR/Cas9 system (Fig. 7A). In trinucleotide expansion studies, no-amp targeted sequencing has so far been used in combination with PacBio or Oxford Nanopore Technologies (ONT) long-read single-molecule sequencing [31, 49, 57–59]. This is because the size of the repeat expansion of interest was expected to be very large [31, 58, 59] but also be-cause the captured fragments are typically several kilobases long [49, 57]. Hafford-Tear et al. [31] have demonstrated that no-amp targeted PacBio SMRT sequencing can capture somatic mosaicism by showing that the variance in the number of repeats increases with the modal number of CAG•CTG in the third intron of TCF4. The application of no-amp targeted PacBio SMRT sequencing to two fully-penetrant HD-causing alleles (>39 CAGs) suggested that no-amp targeted sequencing might be useful to quantify somatic mosaicism at the HTT repeat [49]. However, Höijer et al.’s [49] no-amp HTT PacBio SMRT sequencing data contains frequent intra-individual variation in the number of CAGs associated with the non-HD-causing allele. Because somatic instability of non-HD-causing alleles has not been described before, this is very unexpected. Such variation is, therefore, likely to have been generated by a technical and/or data processing bias and should be investigated to validate the approach for studying the somatic instability of the HTT repeat. Moreover, the CRISPR/Cas9 capture efficiency was not 100% specific to the HTT locus and was associated with a very large fraction of off-target reads (∼95%) [49]. This, together with the fact that Höijer et al. [49] also targeted C9orf72, ATXN10 and FMR1, contributed to a very limited sequencing depth of the HTT repeat (157 HTT repeat reads per sample on average and only 32 reads for each for the two fully-penetrant HD-causing alleles). At that low read depth, the precision in the estimation of the frequency of somatic variants is low. For example, the estimated percentage of somatic expansions estimated was 21.88% with a 95% confidence ranging from 9.28 to 39.97% and 53.13% with a 95% confidence ranging from 34.74 to 70.91% for the two fully-penetrant HD-causing alleles analysed by no-amp HTT PacBio sequencing [49]. No-amp PacBio SMRT sequencing of the HTT repeat thus needs major improvements (reduction of the off-target capture and evaluation of technical artefacts) to be useful in HD research focused on somatic instability of the HTT repeat. Assuming that the HTT read depth would quadruple if Höijer et al.’s [49] no-amp PacBio SMRT was made HTT-specific (because the published data correspond to an assay where HTT was one of four loci targeted), the most recent Sequel II PacBio platform should theoretically produce 10,000 CCS reads per sample if sequencing ∼25 samples per Sequel II PacBio SMRT cell. Such a read depth per sample should allow accurate quantification of the frequency of somatic HTT CAG variants.

No-amp targeted ONT (Oxford Nanopore Technologies) sequencing has so far only been applied to cell line DNA to sequence the C9orf72 GGGGCC and the FMR1 CGG repeats [58]. Its usefulness to estimate somatic mosaicism of the HTT repeat from primary tissue samples thus remains to be demonstrated. Nevertheless, the high error rate of ONT sequencing, including insertion and deletion errors, stands as an obstacle in obtaining sequence information for repeated DNA with the accuracy required for the quantification of somatic mosaicism, at least through conventional alignment-based pipelines. Alignment-free raw-read signal processing algorithms were shown to improve the suitability of ONT for genotyping long repeated DNA regions [58, 60] but are still prone to technical errors even when processing reads from short repeat regions. For example, Liu et al. [60] obtained very broad CAG frequency distributions for non-disease associated alleles at the ATXN3 CAG repeat. This suggests that this type of ONT application would not be appropriate for the quantification of somatic mosaicism of small CAG•CTG expansions, although it might be sufficient for estimating the modal number of CAG•CTGs. Sequencing library preparation techniques, that allow the sequencing of the same starting molecule multiple times using a single ONT read, have been proposed to generate more accurate ONT consensus sequencing reads (conceptually similar to PacBio CCS reads) [61, 62]. In the future, these ONT consensus sequencing reads might prove more useful than standard ONT reads for the accurate quantification of somatic mosaicism of CAG•CTG expansions.

The no-amp targeted long-read sequencing approaches published so far [31, 49, 57–59] are of great utility to sequence and size very large repeat expansions (⪢3,000 repeats at some loci in some tissues) that are refractory to PCR and can only be assessed by relatively crude Southern blot analysis of restriction digested genomic DNA, or that can only be detected by SP-PCR. However, they have several major limitations that make them unsuitable, in their current form, for high-throughput analysis of low levels of variation and for modifier studies that require the analysis of large cohorts. Indeed, they require micrograms of DNA (Table 2) and produce low numbers of reads (∼5,000 on target PacBio CCS reads per RSII SMRT cell [49] and <  1,000 on target ONT reads per MinION flow cell [58]) at a high cost (assuming a best-case scenario of 5% on-target reads and the production of 4,000,000 PacBio CCS reads on one Sequel II SMRT cell, the sequencing cost would be ∼$100 per sample (Table 2) [63] with a requirement of 5,000 reads per sample to quantify somatic mosaicism). MiSeq sequencing-based methods, that offer higher throughput, remain more cost-effective for routinely quantifying somatic mosaicism in primary tissue samples from individuals having inherited HD-causing alleles with <100 CAGs (sequencing cost would be ∼$8 per sample (Table 2) [63] if producing 40,000 MiSeq reads per sample and processing 380 samples per MiSeq run).

Another potential route for quantifying somatic mosaicism that reduces some of the limitations of bulk-PCR is to use single-molecule DNA barcoding in combination with PCR. Through this approach, one can trace the sequence reads back to a single input molecule which allows the identification of PCR and sequencing errors. In the context of a repeated sequence, such an approach can be used to correct for PCR slippage and identify the repeat size in the original molecule (Fig. 7B) [64]. Besides, this approach should also, at least partially, correct for the PCR amplification efficiency problem that generates relatively fewer yields per input molecule for larger alleles. Several methods have been developed to achieve single-molecule barcoding, including molecular inversion probe (MIP) capture using probes with degenerate tags and one- to three-cycle PCR using barcoded primers at low concentrations [65, 66]. Such methods, combined with high-throughput sequencing, allow multiplexing across loci in large cohorts while providing single-molecule level sequence data. Single-molecule barcoding has been mostly explored in human tumour sequencing analysis and was shown to perform very well in detecting single-base somatic variants at an allele frequency of <0.5% [65, 66]. With a primary focus on single nucleotide variants, variation in microsatellite regions has been somewhat less well explored to date, but with some evidence showing that, for example, MIP capture is less efficient for repeat regions when compared to single base variation in multi-loci capture assays [67]. However, single-molecule barcoding by MIP capture was recently shown to be a sensitive method in detecting repeat length variants, including repeat contractions, across multiple microsatellite loci in A. thaliana. This suggests that this approach should be applicable to human microsatellites. The main disadvantage of such barcoding methods, when compared to amplification-free sequencing library preparation, is that DNA molecules are copied by polymerase during the barcoding step; thus, any insertion and deletion errors that occur during this stage are indistinguishable from true somatic variants. However, preliminary data suggest that such errors, at least in MIP capture, form background noise that does not disrupt genotype calling [64]. With relatively low amounts of the input DNA required and high-order multiplexing potential, amplicon sequencing of barcoded single molecules offers a potentially cost-effective solution for quantifying somatic mosaicism for moderate CAG expansions in HTT and could be adapted to long alleles by using long-read sequencing platforms such as PacBio (Table 2).

##### Fig. 7

Method summary for somatic mosaicism quantification at the level of a single molecule in HD. A) Generalised schematics for CRISPR/Casp9-mediated targeted enrichment of HTT locus for single-molecule long-read sequencing (i.e., no-amp targeted sequencing). Following DNA fragmentation and DNA molecule protection by adapter ligation or de-phosphorylation, CRISPR/Cas9 and locus-specific guide RNAs are used to selectively cut across the region of interest. While undigested DNA fragment ends are still protected, sequencing adapters are ligated to the Cas9 digestion product. Sequencing is then done on the appropriate single-molecule long-read sequencing platform such as PacBio SMRT or Oxford Nanopore Technologies (ONT). No-amp targeted sequencing studies of repeat expansions have used one or two Cas9 cuts with PacBio sequencing [31, 49, 57, 59] or ONT sequencing [58] respectively. Single-molecule sequencing read output can then be used to build the somatic mosaicism profile. B) The general method for amplicon sequencing of barcoded single molecules. Several methods for single-molecule barcoding exist, including one-cycle PCR using hairpin-protected primers with degenerate tags or region capture by barcoded molecular inversion probes. Following barcoding, sequencing adapters are incorporated into the uniquely tagged molecules through PCR with overhang primers. The resulting amplicon library is then sequenced on the platform of interest, including Illumina MiSeq or PacBio, depending on the amplicon length and the desired throughput. Resulting reads are grouped by barcode family, and the repeat length of the original molecule for each family is determined to build the real somatic mosaicism profile per sample.

## CONFLICT OF INTEREST

V.C.W. is a scientific advisory board member of Triplet Therapeutics, a company developing new therapeutic approaches to address triplet repeat disorders such as HD and myotonic dystrophy and of LoQus23 Therapeutics, and has provided paid consulting services to Alnylam. Her financial interests in Triplet Therapeutics were reviewed and are managed by Massachusetts General Hospital and Partners HealthCare in accordance with their conflict of interest policies. S.K. is employed by CHDI Management, Inc., as an advisor to the CHDI Foundation. D.G.M. has been a scientific consultant and/or received honoraria or stock options from Biogen Idec, AMO Pharma, Charles River, Vertex Pharmaceuticals, Triplet Therapeutics, LoQus23, and Small Molecule RNA and has had research contracts with AMO Pharma and Vertex Pharmaceuticals.

## ACKNOWLEDGMENTS

The authors wish to thank the D.G.M. group for helpful advice with this work and David Meltzer, David McGuinness and Julie Galbraith, from Glasgow Polyomics, for administrating the Galaxy instance of the University of Glasgow (D.M. & D.McG.) and for their help and insight with the MiSeq sequencing (J.G. and D.McG.). This work was supported by the CHDI Foundation (awards to D.G.M. and A.J.M.) and the National Institutes of Health USA (Grants Nos. R01NS049206 to V.C.W.).