
Evolution of short sequence repeats in Mycobacterium tuberculosis

Cath Arnold, Nicola Thorne, Anthony Underwood, Kathleen Baster, Saheer Gharbia
DOI: http://dx.doi.org/10.1111/j.1574-6968.2006.00142.x 340-346 First published online: 1 March 2006


Whole genome comparison has revealed the presence of short sequence repeats (also called mycobacterial interspersed repeat units and variable number tandem repeat units) used for genotyping schemes. In this study, we have used deletion analysis, single nucleotide polymorphism data and spoligotype taken from published data from others to investigate the evolution of selected repeats that form the common denominators of the majority of established schemes. Analysis of the number of repeats per locus from over 400 isolates revealed that the general trend globally appears to be loss of repeats in modern strains compared with ancestral strains.

  • VNTR
  • MIRU
  • tuberculosis
  • molecular evolution


Mycobacterium tuberculosis is one of the most successful human pathogens and kills around three million people every year (Dye et al., 1999). In order to follow and better understand the evolution and spread of this disease, different strain typing methods have been employed to link epidemiologically related strains and to carry out the more difficult task of differentiating epidemiologically unrelated strains of this notoriously homogeneous organism. One of the most successful methods to date has been profiling of the insertion sequence IS6110 (Thierry et al., 1990). The majority of M. tuberculosis strains contain c. 5–20 copies of this transposable element located throughout the genome. Many strains (up to 40% in India), however, do not contain high numbers of this element, and alternative strategies such as spoligotyping were used to differentiate these strains (Goyal et al., 1997). Although useful, these methods could be arduous for large numbers of strains, and alternative molecular approaches were sought to simplify methods and to facilitate data portability between laboratories. The advent of whole sequenced genomes for comparison has precipitated the use of different molecular markers, including single nucleotide polymorphisms (SNPs) (Sreevatsan et al., 1997; Gutacker et al., 2002) and deletion analysis (Brosch et al., 2002). Both of these unrelated genotyping methods correlated perfectly and indicated a novel scenario for the evolution of the M. tuberculosis complex, suggesting that M. tuberculosis did not, as is popularly believed, evolve from M. bovis, the etiologic agent of bovine tuberculosis (Brosch et al., 2002). Whole genome comparison has also revealed the presence of short sequence repeats, also called mycobacterial interspersed repeat units (MIRUs; Supply et al., 2000) and variable number tandem repeat units (VNTRs; Frothingham & Meeker-O'Connell, 1998). The location and number of repeats in different M. tuberculosis strains can be measured by a variety of PCR-based methods and the data generated are portable between laboratories (Mazars et al., 2001). Recently, Ferdinand et al. (2004) described the use of particular loci to classify isolates of the M. tuberculosis complex into different families, including the East African Indian (EAI), Beijing, Haarlem, X, and Latin-American and Mediterranean (LAM) families. This has provided a gateway for microevolution analysis and correlation of geographical and epidemiological niches for M. tuberculosis populations. However, the rate of change in a VNTR sequence and copy number is a characteristic feature of that specific locus, and higher resolution investigation of each selected target locus for genotyping is required to reconstruct the emergence of the M. tuberculosis complex (MTBC) familial cluster. In this study we utilised deletion analysis, SNP data and spoligotype generated from different published studies to construct a putative genetic lineage (Brosch et al., 2002; Filliol et al., 2002; Steinlein Cowan et al., 2002; Sola et al., 2003a, b; Sun et al., 2004a, b), and investigated the evolution of 15 of the most used repeats in molecular epidemiological studies worldwide.

Materials and methods

Spoligotype, VNTR/MIRU, exact tandem repeat (ETR) and genotypic group data were extracted from eight recent studies to establish a time line of genetic events (Fig. 1) (Sreevatsan et al., 1997; Brosch et al., 2002; Filliol et al., 2002; Steinlein Cowan et al., 2002; Sola et al., 2003a, b; Mokrousov et al., 2004; Sun et al., 2004a, b). Data used were compiled on to a single worksheet of a spreadsheet document (see supplemental data). In addition to data from the eight recent studies, data from the MIRU–VNTR Typing website were also included (http://www.ibl.fr/mirus.mirus.html). This database was developed by Dr P. Supply and E. Savine and is sited at the INSERM U447, at the Institut de Biologie de Lille/Institut Pasteur de Lille. The development of this database has been funded by the French Ministère de la Recherche. VNTR/MIRU and ETR profiles were split into two groups for analysis, ancestral and modern, according to whether two copies of MIRU 24 were present. Brosch et al. (2002) showed that the presence of a marker region, TbD1, indicates that a strain belongs to an ancestral lineage compared with strains lacking this region, and the presence of TbD1 also correlates with the presence of two copies of MIRU 24 (Banu et al., 2004; Ferdinand et al., 2004; Sun et al., 2004b). In this context, strains described as ancestral are more similar to the common ancestor at the loci examined. SNPs at katG463 and gyrA95 divide strains into molecular genetic groups (MGG) 1–3, whereby MGG1 is thought to be evolutionarily older (Sreevatsan et al., 1997). Mycobacterium tuberculosis strains belonging to MGG1 that retain both TbD1 and two copies of MIRU 24 are thought to represent ancestral strains. MGG1 strains lacking TbD1 with a single copy of MIRU 24 are considered ‘modern’. 
It is important to state clearly that the terms ancestral and modern used in this study do not imply that the ancestral strains are the ancestors of the modern strains, but rather that they are more similar to the common ancestor at the loci examined. Once separated into ancestral and modern strains, spoligotyping data were used to further elucidate evolution, based on the assumption that the direct variable repeat regions (DVRs) in the direct repeat (DR) region analysed can only be lost (singly or in groups) due to movement of the IS6110 element and cannot be regained, as little or no recombination appears to take place between strains (van Embden et al., 2000). Sequential loss of individual DVRs or groups of DVRs was mapped onto a time line (Fig. 1) and, depending on presence or absence in these groups, genotyping data such as the genetic groups defined by Sreevatsan et al. (1997) were also incorporated. The mean numbers of repeats from VNTR/MIRU and ETR profiles were then superimposed onto the timeline to elucidate patterns of evolution. Of 121 ancestral isolates analysed, 80 unique VNTR/MIRU profiles and 41 ETR profiles were examined. These were compared with 369 modern (Beijing and Haarlem) isolates, which yielded 180 VNTR/MIRU profiles and 28 ETR profiles.
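The grouping rule described above can be sketched in a few lines of Python. This is an illustration only: the function names and the three example profiles are hypothetical, and the threshold (two or more copies of MIRU 24 for ancestral, a single copy for modern) follows the definition given with Table 1.

```python
from collections import defaultdict
from statistics import mean


def classify_by_miru24(profile):
    """Assign a profile to a group from its MIRU 24 copy number."""
    copies = profile["MIRU 24"]
    if copies >= 2:
        return "ancestral"
    if copies == 1:
        return "modern"
    return "unclassified"  # assumption: zero copies is left unassigned


def mean_repeats_by_group(profiles):
    """Return {group: {locus: mean repeat number}} for a list of profiles."""
    grouped = defaultdict(lambda: defaultdict(list))
    for profile in profiles:
        group = classify_by_miru24(profile)
        for locus, copies in profile.items():
            grouped[group][locus].append(copies)
    return {
        group: {locus: mean(values) for locus, values in loci.items()}
        for group, loci in grouped.items()
    }


# Hypothetical profiles, not taken from the study data.
profiles = [
    {"MIRU 4": 5, "MIRU 24": 2, "MIRU 26": 2},
    {"MIRU 4": 4, "MIRU 24": 2, "MIRU 26": 2},
    {"MIRU 4": 2, "MIRU 24": 1, "MIRU 26": 5},
]
means = mean_repeats_by_group(profiles)
print(means["ancestral"]["MIRU 4"], means["modern"]["MIRU 26"])
```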

Figure 1

Timeline of genetic events for Mycobacterium tuberculosis.

The alignment of repeats taken from various loci (Fig. 2) was carried out using default alignment settings in BioEdit Sequence Alignment Editor (Tom Hall, Isis Pharmaceuticals, Carlsbad, CA) and then edited by eye. The repeats were extracted from the fully sequenced genome of M. tuberculosis CDC1551 (Fleischmann et al., 2002) by searching for published primer sequences for each locus (Supply et al., 2000; Le Fleche et al., 2002; Roring et al., 2002). Also included are similar repeat sequences from Mycobacterium avium and Mycobacterium leprae (GenBank accession numbers AE017235, AE017238 and MLU15181).

Figure 2

Alignment of individual repeats found at various loci in CDC1551 and similar sequences from Mycobacterium avium and Mycobacterium leprae. Sequences similar to the first sequence are represented by a dot and missing bases represented by a dash.

Statistical methods

For the comparison between ancestral and modern strains, we looked at the number of repeats for each unique strain. The count of repeats for the ancestral strains was compared to the count of repeats for the modern strains at each location using a Wilcoxon rank-sum (Mann–Whitney) test. Where the test statistic is significant, this contributes to the evidence for an alternative classification system for modern strains compared to ancestral. The Wilcoxon rank-sum test is the nonparametric equivalent to a t-test for comparing two samples of data. Observations are ranked in order of size and the sum of the ranks for each group computed. The rank-sum for the smaller group is then compared to the expected value if there had been no difference between the groups, to give the test statistics T and U. T has an approximately normal distribution from which a P-value may be obtained, while U is the number of pairs of observations xi and yj, one from each sample, for which xi<yj (or U/n1n2 is the probability that xi<yj).
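The pair-counting definition of U given above can be illustrated with a short Python sketch. It is not the authors' software: the repeat counts are hypothetical, ties are assumed to count as half a pair, the P-value uses a plain normal approximation without tie or continuity correction, and the number of tests passed to the Bonferroni adjustment is a placeholder.

```python
from itertools import product
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Count pairs (xi, yj) with xi < yj; ties contribute one half (assumption)."""
    u = 0.0
    for xi, yj in product(x, y):
        if xi < yj:
            u += 1.0
        elif xi == yj:
            u += 0.5
    return u


def two_sided_p_normal(u, n1, n2):
    """Two-sided P-value from the normal approximation to U under no difference."""
    expected = n1 * n2 / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - expected) / sd
    return erfc(abs(z) / sqrt(2))


def bonferroni(p, n_tests):
    """Bonferroni adjustment: multiply by the number of tests, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical repeat counts at one locus.
ancestral = [4, 5, 5, 6, 3]
modern = [2, 2, 3, 2, 1]

u = mann_whitney_u(ancestral, modern)
prob_modern_higher = u / (len(ancestral) * len(modern))  # U / (n1 * n2)
p_value = two_sided_p_normal(u, len(ancestral), len(modern))
print(u, prob_modern_higher, bonferroni(p_value, n_tests=17))
```

In this toy example only one of the 25 pairs is a tie and no modern count exceeds an ancestral one, so U/n1n2 is close to zero, the same pattern reported in Table 1 for loci where modern strains carry fewer repeats.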

Simpson's diversity index, λS, and its variance were calculated from the category counts, where xi is the number of isolates in category i and N is the total number of isolates in the sample.

λS is constrained between 0 (indicating completely identical isolates) and (N−1)/N (indicating completely diverse isolates).
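The formulae themselves are not reproduced in this text. One form that is consistent with the bounds just stated, given here as an assumption rather than as the authors' exact expressions, is the large-sample Simpson index and its variance, writing πi = xi/N:

```latex
\lambda_S = 1 - \sum_{i} \pi_i^{2}, \qquad \pi_i = \frac{x_i}{N}

\operatorname{Var}(\lambda_S) \approx \frac{4}{N}\left[\sum_{i} \pi_i^{3} - \Bigl(\sum_{i} \pi_i^{2}\Bigr)^{2}\right]
```

With every isolate in one category the sum of squares is 1 and λS = 0; with N categories of one isolate each it is 1/N, giving λS = (N−1)/N.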

Since λS is only an estimate of the true population diversity index, a 95% confidence interval is calculated to convey the precision about that point estimate.
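A minimal Python sketch of such a point estimate and interval follows. It assumes the index is 1 − Σ(xi/N)², that the variance is the usual large-sample approximation, and that the interval is the estimate ± 1.96 standard errors clipped to [0, 1]; none of these details is confirmed by the text, and the example counts are hypothetical.

```python
from collections import Counter
from math import sqrt


def simpson_diversity(labels):
    """Return (index, (lower, upper)) for a list of category labels."""
    counts = Counter(labels).values()
    n = sum(counts)
    props = [c / n for c in counts]
    sum_sq = sum(p ** 2 for p in props)
    sum_cu = sum(p ** 3 for p in props)
    index = 1 - sum_sq
    # Large-sample variance (assumed form); guard against tiny negative
    # values caused by floating-point rounding.
    variance = max((4 / n) * (sum_cu - sum_sq ** 2), 0.0)
    half_width = 1.96 * sqrt(variance)
    return index, (max(0.0, index - half_width), min(1.0, index + half_width))


# Hypothetical copy numbers at one locus: 77 isolates with 2 repeats, 3 with 3.
index, (lower, upper) = simpson_diversity([2] * 77 + [3] * 3)
print(round(index, 4), round(lower, 4), round(upper, 4))
```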

Results and discussion

The averaged MIRU/VNTR repeats for each of the 15 loci were calculated from a 478-isolate data set representing ancestral and modern isolates from nine different studies. VNTR and ETR profile repeat numbers are shown in Tables 1 and 2. As VNTR/MIRU and ETR data were not always available together, diversity indices have been calculated separately for MIRU and ETR data (Table 3).

Table 1

Mean MIRU/VNTR repeat number comparing ancestral and modern Mycobacterium tuberculosis strains

Data | Ancestral (n1=80) mean | Standard deviation | Modern (n2=180) mean | Standard deviation | Bonferroni-adjusted P-value | Prob(xi<yj) = U/n1n2 * | 1−Prob(xi<yj) = 1−U/n1n2 **
--- | --- | --- | --- | --- | --- | --- | ---
MIRU 2 | 2.00 | 0.225 | 1.98 | 0.308 | 1.00000 | 0.4919 | 0.5081
MIRU 4 | 4.48 | 1.896 | 2.14 | 0.635 | <0.00001 | 0.1374 | 0.8626
MIRU 10 | 4.04 | 0.489 | 3.39 | 1.359 | <0.00001 | 0.2553 | 0.7447
MIRU 16 | 2.74 | 0.497 | 2.76 | 0.690 | 1.00000 | 0.5159 | 0.4841
MIRU 20 | 1.94 | 2.436 | 1.93 | 0.250 | 1.00000 | 0.4979 | 0.5021
MIRU 23 | 5.68 | 1.613 | 4.97 | 0.832 | <0.00001 | 0.2727 | 0.7273
MIRU 24 | 2.04 | 0.191 | 1.00 | 0.000 | <0.00001 | 0.0000 | 1.0000
MIRU 26 | 2.16 | 0.849 | 5.29 | 1.802 | <0.00001 | 0.9024 | 0.0976
MIRU 27 | 2.94 | 0.536 | 2.83 | 0.479 | 0.42987 | 0.4460 | 0.5540
MIRU 31 | 4.51 | 1.019 | 3.61 | 1.160 | <0.00001 | 0.2831 | 0.7169
MIRU 39 | 2.58 | 0.839 | 2.43 | 0.819 | 0.13205 | 0.4049 | 0.5951
MIRU 40 | 2.83 | 0.883 | 2.82 | 1.144 | 1.00000 | 0.4899 | 0.5101
  • Ancestral strains were defined by the possession of two or more copies of MIRU24 and modern strains defined by the possession of a single copy of MIRU 24.

  • * Probability that a modern strain has more repeats at this location than an ancestral strain.

  • ** Probability that an ancestral strain has more repeats at this location than a modern strain.

  • MIRU, mycobacterial interspersed repeat units; VNTR, variable number tandem repeat units.

Table 2

Mean ETR repeat number comparing ancestral and modern Mycobacterium tuberculosis strains

Data | Ancestral (n1=41) mean | Standard deviation | Modern (n2=28) mean | Standard deviation | Bonferroni-adjusted P-value | Prob(xi<yj) = U/n1n2 * | 1−Prob(xi<yj) = 1−U/n1n2 **
--- | --- | --- | --- | --- | --- | --- | ---
ETR A | 5.68 | 1.753 | 2.46 | 0.838 | <0.00001 | 0.9486 | 0.0514
ETR B | 3.73 | 2.247 | 1.79 | 0.418 | 0.01697 | 0.7247 | 0.2753
ETR C | 3.61 | 0.771 | 3.61 | 0.832 | 1.00000 | 0.5113 | 0.4887
ETR D | 5.66 | 1.217 | 3.04 | 0.429 | <0.00001 | 0.9360 | 0.0640
ETR E | 4.24 | 1.200 | 2.89 | 0.916 | 0.00006 | 0.8210 | 0.1790
  • * Probability that an ancestral strain has more repeats at this location than a modern strain.

  • ** Probability that a modern strain has more repeats at this location than an ancestral strain.

Table 3

Simpson's Diversity Index, λS, for each group (modern and ancestral or both) at each individual locus

Locus | Both groups, λS (95% CI) | Ancestral, λS (95% CI) | Modern, λS (95% CI)
--- | --- | --- | ---
Using MIRU profiles | | |
All | 0.9953 (0.9944, 0.9963) | 0.9812 (0.9726, 0.9897) | 0.9942 (0.9938, 0.9946)
MIRU 2 | 0.5264 (0.4833, 0.5695) | 0.0832 (0.0061, 0.1604) | 0.1676 (0.0972, 0.2381)
MIRU 4 | 0.6270 (0.5720, 0.6820) | 0.5985 (0.4901, 0.7068) | 0.2668 (0.1857, 0.3480)
MIRU 10 | 0.7901 (0.7664, 0.8137) | 0.2000 (0.0917, 0.3084) | 0.7264 (0.6851, 0.7676)
MIRU 16 | 0.7055 (0.6689, 0.7420) | 0.3459 (0.2438, 0.4480) | 0.5027 (0.4307, 0.5746)
MIRU 20 | 0.5107 (0.4675, 0.5538) | 0.1017 (0.0199, 0.1836) | 0.1281 (0.0659, 0.1903)
MIRU 23 | 0.6841 (0.6379, 0.7304) | 0.5316 (0.4248, 0.6385) | 0.4103 (0.3298, 0.4908)
MIRU 24 | 0.4488 (0.4086, 0.4891) | 0.0624 (0, 0.1296) | 0 (0, 0)
MIRU 26 | 0.7934 (0.7686, 0.8182) | 0.0629 (0, 0.1311) | 0.7669 (0.7297, 0.8041)
MIRU 27 | 0.5834 (0.5382, 0.6286) | 0.2174 (0.1074, 0.3273) | 0.2620 (0.1841, 0.3400)
MIRU 31 | 0.8253 (0.8051, 0.8456) | 0.6368 (0.5809, 0.6927) | 0.6991 (0.6612, 0.7371)
MIRU 39 | 0.7482 (0.7129, 0.7835) | 0.5582 (0.4627, 0.6538) | 0.5464 (0.4820, 0.6108)
MIRU 40 | 0.8345 (0.8127, 0.8563) | 0.6768 (0.6352, 0.7185) | 0.7097 (0.6674, 0.7521)
Using ETR profiles | | |
All | 0.7937 (0.7478, 0.8396) | 0.8327 (0.7610, 0.9044) | 0.5813 (0.4970, 0.6656)
ETR A | 0.7638 (0.7241, 0.8035) | 0.7490 (0.6978, 0.8002) | 0.5349 (0.4676, 0.6021)
ETR B | 0.7461 (0.7080, 0.7841) | 0.7818 (0.7253, 0.8383) | 0.4876 (0.4366, 0.5385)
ETR C | 0.7426 (0.7068, 0.7784) | 0.5869 (0.5338, 0.6400) | 0.5270 (0.4620, 0.5921)
ETR D | 0.7268 (0.6931, 0.7605) | 0.6412 (0.5815, 0.7009) | 0.4787 (0.4312, 0.5261)
ETR E | 0.7580 (0.7192, 0.7967) | 0.7085 (0.6640, 0.7530) | 0.5318 (0.4647, 0.5989)
  • MIRU, mycobacterial interspersed repeat units.

Calculating the mean repeat number of 15 loci from a 478-isolate data set revealed that there are significant differences in the counts of repeats found at MIRU/VNTR locations 4, 10, 23, 24, 26 and 31 when comparing strains from some of the modern lineages (e.g. Beijing and Haarlem), defined by the absence of TbD1 and a single copy of MIRU 24, with strains from ancestral lineages. Of these, only MIRU 26 shows an increase in the average number of repeats in modern strains compared with ancestral strains, while MIRU 2, 16, 27, 39, 40 and ETR C have a similar mean value in both modern and ancestral strains. Significant differences in the counts of repeats were found at two ETR locations (A and D), and the P-values for ETR B and E were also low (0.01697 and 0.00006, respectively), indicating that reduced numbers of repeats in modern strains compared with ancestral isolates are likely for all ETRs except ETR C, which remains static. It is hypothesized that Mycobacterium tuberculosis is a recently emerged clone of Mycobacterium canettii (Fabre et al., 2004). When the mean numbers of repeats of the 43 M. canettii strains studied by Fabre et al. are compared with those of the M. tuberculosis strains studied here, they are the same as or higher than those of ancestral strains (ETR A=10.0, ETR B=3.8, ETR C=5.4, ETR E=4.5), except for ETR D (2.7). This indicates that, using M. canettii as an outgroup, the hypothesis of a trend of repeat number reduction over time is supported. However, Gutierrez et al. (2005) have shown that M. canettii can show considerable diversity, and its use as an outgroup may oversimplify the analysis of the ancestral lineage of the MTBC. Warren et al. (2004) suggested that in a closed community VNTR variation may occur in both directions (i.e. increase and decrease in the number of repeats), by a mechanism called slipped-strand mispairing (SSM) (Levinson & Gutman, 1987). 
The number of repeats is thought to increase during replication if there is slippage in the replicated strand and to decrease if it occurs in the template strand. However, the data analysed in this study do not support their report, but rather show that the general trend appears to be loss of repeats in modern strains compared with ancestral strains. Nevertheless, the Simpson's diversity index for the MIRU profiles (Table 3) was high enough (above 0.95) for each of the strain types and overall to support the current genotyping system. However, using the ETR profiles, the Simpson's diversity index falls well below the level (0.95) at which we would consider the typing method to provide a useful classification, especially for modern strains.

Mycobacterium tuberculosis is thought to have undergone an evolutionary bottleneck around 15 000–20 000 years ago (Sreevatsan et al., 1997). Essentially, this means that a single clone became extremely successful at that time and we see its descendants in today's modern strains. It is possible that the observed loss of repeats in modern strains contributed to this clonal expansion and the appearance of more successful pathogens, but the repeats are more likely to be genetic markers with no effect on phenotype. It is also possible that the ancestral clone of the modern strains by chance happened to have a lower number of repeats and that repeat number variation is occurring more slowly than previously thought. The Hunter–Gaston discriminative index (HGI) for MIRU/VNTR is thought to approach that of IS6110 RFLP typing, a method with a fast ‘molecular clock’ (Kremer et al., 2005), but only when low copy number IS6110 isolates (often ancestral) are included in the analysis.
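The HGI is not defined in this article. For orientation, the index is usually written as follows, where N is the number of isolates, s the number of distinct types and nj the number of isolates of type j; this standard form is supplied here as background and is not taken from the authors' methods:

```latex
\mathrm{HGI} = 1 - \frac{1}{N(N-1)} \sum_{j=1}^{s} n_j (n_j - 1)
```

Written this way, the index equals 1 when every isolate has a distinct type and 0 when all isolates share one type.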

However, VNTR variation within modern strains such as Haarlem and Beijing does appear to be lower, even though these so-called modern strains have existed for perhaps 4000 years or more (Zink et al., 2003), as evidenced by the presence or absence of spacers typical for MGG2 and 3 in the spoligotype of strains from Egyptian mummies. Interestingly, MIRU 26, the most variable locus in modern strains, appears to be stable and at low copy number in ancestral strains. Sequence variation in repeats may play a part in whether repeat number variation is more likely, with exact sequences ‘slipping’ more often than variable sequences. The genesis of many of the repeats studied in M. tuberculosis, which have highly similar sequences, may have started with an initial 53 bp single copy repeat which then spread to different loci throughout the genome by recombination. When some of the repeat sequences from the published sequence of M. tuberculosis CDC1551 are split into their individual repeat units and aligned (Fig. 2), it is clear that the repeats at each locus have highly similar sequences, although each repeat consensus differs slightly from that of other loci. It is possible that once a single copy of the sequence had spread throughout the genome, slight changes in sequence occurred at these different loci and the altered sequences then began to duplicate, since the sequence differences in the first repeat are also found in subsequent repeats, as detailed by Benson & Dong (1999). The fact that there is only one copy of this 53 bp repeat in M. leprae (see Fig. 2) could support this hypothesis. The reasons for this are unclear, but it may have been a result of environment or of pressure on the organism to create sequence diversity in an effort to survive. Some of the loci are in or close to genes or expression signals that may affect protein expression (e.g. MIRU 4 (ETR D) lies between the two-component regulator genes senX3 and regX3). 
It may be that the ‘locking’ or ‘unlocking’ of the repeat variability of some of these loci directly contributed to the ability of the organism to survive in different environments, but it may also reflect that chance sequence changes affecting the SSM coincided with other factors, such as the dramatic increase in the human population, thus increasing the chance of transmitting the bacteria. When comparing with the data from Fabre et al. (2004), the only ETR locus whose repeat number does not appear to be decreasing over time is ETR D.

The combined information from various genomic sources in Fig. 1 clearly shows a genetic distinction between M. africanum subtype I and M. africanum subtype II. It appears that M. africanum subtype I is the ancestor of M. bovis and that M. africanum subtype II is very closely related to M. tuberculosis. By spoligotype and SNP evidence, a distinct group of M. africanum appears to have evolved from an MGG2 strain. Sola et al. (2003a, b) proposed that M. africanum type II is not genetically M. africanum. The anomaly lay in the fact that the initial study group contained a diverse collection of strains identified mostly on the basis of colony morphology, confirming that identification tests based on phenotypic evidence are best supported by genetic evidence as well.

In summary, repeat numbers at the loci discussed are more likely to decrease with time, except for MIRU 26. The current MIRU/VNTR panel used for typing modern strains may not be as effective at resolving epidemiologically unrelated infections, as the potential for allelic variation is less if the repeat numbers are lower. However, the present resolution might still be adequate, and analysis of epidemiologically and genetically well-defined collections of isolates will address this issue. If this is not the case, utilising other more variable loci could improve the resolution. The majority of loci currently analysed stem from variation in the number of a particular 53 bp repeat unit. Many other repeat units exist in M. tuberculosis, often shorter repeats in multiples of 3 bp, indicating a possible association with gene variation. The exploitation of these areas may result in the generation of a more useful panel for so-called clonal genetic families, possibly specific for each group. Care should be exercised, however, to measure the speed of the molecular clock of each novel locus to ensure that true molecular links are established between epidemiologically related strains within an evolutionary context.

