
Displaying the relatedness among isolates of bacterial species – the eBURST approach

Brian G. Spratt, William P. Hanage, Bao Li, David M. Aanensen, Edward J. Feil
DOI: http://dx.doi.org/10.1016/j.femsle.2004.11.015, pp. 129-134. First published online: 1 December 2004


Determining the most appropriate way to represent the relationships between bacterial isolates is complicated by the differing rates of recombination within species. In many cases, a bifurcating tree can be positively misleading. The recently described program eBURST can be used with multilocus data to define groups or clonal complexes of related isolates derived from a common ancestor, the patterns of descent linking them together, and the ancestral genotype. eBURST has recently been extensively updated to include additional tools for exploring the relationships between isolates. We discuss the advantages of this approach and describe its use to explore patterns of descent within clonal complexes identified using multilocus sequence typing.

  • MLST
  • Clonal complex
  • Evolution
  • Pathogen

1 Introduction

The relatedness among isolates of a bacterial population or species is generally represented as a dendrogram. Although in many cases phylogenetic inferences at an intra-species level are tenuous, for molecular epidemiology this approach remains attractive since it conveniently identifies groups of isolates that are indistinguishable in genotype (a strain or clone), and those that are closely related in genotype, and which are likely to be descended from a recent common ancestor (a clonal complex). The approach is also versatile; it can be applied to isolates that have been characterised by a broad range of typing procedures, whether these produce patterns of DNA fragments on gels, gene sequences, or allelic profiles generated by multilocus approaches, such as repeat length polymorphisms or multilocus sequence typing. A limitation of the dendrogram is that it appears to provide a single authoritative account of how the isolates have arisen, by a bifurcating process of lineage splitting, which in most bacterial species is likely to be highly suspect. Frequent recombination over long periods of evolutionary time makes it unlikely that the relatedness among distantly related isolates implied by the dendrogram has any phylogenetic meaning for many bacterial species [1,2].

Recombination can also distort the apparent relationships between similar isolates if molecular typing indexes variation at a single locus; two identical isolates will appear to be completely different if a recombinational replacement in one of them changes the locus used for typing. For this reason, multilocus approaches are preferable for strain characterisation, as they buffer against this effect of recombination. A recombinational replacement at one locus still allows the isolates to be recognised as closely related since they remain identical at the other loci used for typing. This article therefore only considers multilocus data and focuses on data produced by multilocus sequence typing (MLST; [3]). MLST produces allelic profiles based on the sequences of seven house-keeping gene fragments and each different allelic profile is defined as a sequence type (ST), which provides a convenient descriptor for each strain or clone. MLST is widely used for strain characterisation and molecular epidemiology, and for analysing the population biology of bacteria, and produces data that readily can be compared via the internet [4,5].

MLST databases are available for many bacterial pathogens, and for two Candida species, and several of them contain thousands of isolates. Displaying the relatedness among thousands of isolates is problematic. However, even in cases where the relationships between distantly related members of a species are difficult or impossible to discern, the data may still provide robust evidence about the relationships among closely related isolates, and the patterns of recent evolutionary descent among such isolates. Once the thousands of isolates within a MLST database have been sub-divided into manageable groups of isolates that share some defined level of similarity in their genotypes, the relationships within these groups can be explored. There are practical reasons to focus on recent events, as these are important in molecular epidemiology (e.g., identifying clusters of infection), for understanding the origins of new pathogenic strains, and the response of bacterial populations to antibiotics and vaccines.

2 Displaying relationships among similar bacterial isolates

Dendrograms identify clusters but they display very poorly the recent evolutionary events that have generated these groups of isolates with similar genotypes. A number of alternative methods have been proposed to represent the relationships among isolates from bacterial populations. These include the Splits Tree program [6] and the minimum spanning tree module within the Bionumerics software package (http://www.applied-maths.com/bionumerics/bionumerics.htm), and have in common that they do not impose a tree-like pattern of descent, which is inappropriate for exploring the evolutionary relationships among similar isolates of a bacterial species in which recombination is frequent enough to disrupt phylogenetic signal. Here, we focus on one of these approaches – the eBURST algorithm – as currently it provides the clearest visualization of the likely relationships among similar isolates. BURST (based upon related sequence types) focuses on identifying groups of closely related isolates within a bacterial population, which are assumed to share a recent common ancestor (a clonal complex), and on exploring how these emerged and diversified [7]. BURST makes no attempt to display the relatedness between very different multilocus genotypes in a population. It differs radically from normal clustering algorithms as it incorporates a simple but realistic model of the way in which bacterial clones emerge and diversify to form clonal complexes. The approach was developed as a tool to analyse MLST data but it can, in principle, be used for most other types of multilocus data.

According to the model incorporated in the BURST algorithm, a genotype occasionally increases in frequency, as a consequence of genetic drift or natural selection, and diversifies by the accumulation of mutations and/or localised recombinational replacements, to result in slight variants of the founding genotype. In the context of MLST, the members of this emerging clone will initially be indistinguishable in allelic profile (same ST) but over time they will diversify to produce a number of variants in which one of the seven MLST loci has been altered (single locus variants; SLVs). Further diversification will produce variants of the founding ST that differ at two of the seven loci (double locus variants; DLVs) etc. Bacterial populations should therefore consist of clonal complexes which, provided they are relatively young, should easily be apparent using multilocus approaches to strain characterisation. According to this simple model, the founding ST should be identifiable as it will have the greatest number of SLVs. Also, the founding ST might be expected to be relatively prevalent in the population, and probably will be geographically more widely disseminated, compared to its descendent variant SLVs and DLVs.
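The SLV and DLV terminology can be made concrete with a short sketch. It assumes that an allelic profile is stored as a tuple of seven allele numbers; the function names, ST labels and profiles below are hypothetical and are not taken from any MLST database.

```python
from typing import Tuple

Profile = Tuple[int, ...]  # allele numbers at each locus (seven for MLST)


def allelic_differences(a: Profile, b: Profile) -> int:
    """Count the loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(x != y for x, y in zip(a, b))


def classify_variant(founder: Profile, other: Profile) -> str:
    """Describe 'other' relative to 'founder' as same ST, SLV, DLV, etc."""
    d = allelic_differences(founder, other)
    return {0: "same ST", 1: "SLV", 2: "DLV"}.get(d, f"{d}-locus variant")


# Hypothetical founding genotype and three descendants.
founder = (1, 1, 1, 1, 1, 1, 1)
variant_a = (1, 1, 1, 1, 1, 1, 2)  # one locus altered
variant_b = (1, 3, 1, 1, 1, 1, 2)  # two loci altered
variant_c = (4, 3, 1, 1, 1, 5, 2)  # four loci altered

for name, profile in [("A", variant_a), ("B", variant_b), ("C", variant_c)]:
    print(name, classify_variant(founder, profile))
```

Note that variant B is a DLV of the founder but an SLV of variant A, which is the kind of stepwise diversification the model describes.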

The current implementation of the BURST algorithm, eBURST v2, runs as a Java™ applet and is available at http://eburst.mlst.net, or via the MLST databases at http://www.mlst.net. The input data used by eBURST are the STs and their allelic profiles. The first step is to divide the input data (e.g., all the isolates within a MLST database) into groups of STs that have some user-defined level of similarity in allelic profile. The definition of the group can be changed, but the default in eBURST is to identify groups of related STs using the most stringent (conservative) definition, where all members assigned to the same group share identical alleles at ≥6 of the 7 loci with at least one other member of the group. A less stringent approach is to define the groups by the sharing of alleles at ≥5 of the 7 loci but in this case it is much less clear that all isolates in the group share a recent common ancestor. Whatever group definition is used, this approach results in non-overlapping groups; no ST can be assigned to more than one group. A ‘group’ is used here as a neutral term for the collection of STs that are placed together by eBURST, according to the selected group definition, whereas a clonal complex is a set of STs that are all believed to be descended from the same founding genotype. Using the stringent group definition (6/7 shared alleles), isolates in the group defined by eBURST will be considered to belong to a single clonal complex.
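The group definition described above amounts to single-linkage grouping: two STs fall in the same group if they are connected by a chain of STs in which each consecutive pair shares at least the chosen number of alleles. The following is a minimal sketch of that idea using a union-find structure. It is not the eBURST source code; the function names, the `min_shared` parameter and the profiles are assumptions made for illustration.

```python
from itertools import combinations


def shared_alleles(a, b):
    """Number of loci at which two allelic profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b))


def group_sts(profiles, min_shared=6):
    """Partition STs into non-overlapping groups (single linkage)."""
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent pointers to the representative, compressing the path.
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for a, b in combinations(profiles, 2):
        if shared_alleles(profiles[a], profiles[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for st in profiles:
        groups.setdefault(find(st), set()).add(st)
    # Largest groups first; ties ordered by lowest ST label.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))


# Hypothetical ST labels and allelic profiles.
profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (1, 1, 1, 1, 1, 1, 2),  # SLV of ST1
    3: (1, 1, 1, 1, 1, 3, 2),  # SLV of ST2, DLV of ST1
    4: (5, 5, 5, 5, 5, 5, 5),
    5: (5, 5, 5, 5, 5, 5, 6),  # SLV of ST4
    6: (9, 9, 9, 9, 9, 9, 9),  # unlinked
    7: (1, 1, 1, 1, 1, 4, 5),  # DLV of ST1, ST2 and ST3
}

stringent = group_sts(profiles, min_shared=6)
relaxed = group_sts(profiles, min_shared=5)
print(stringent)
print(relaxed)
```

With the stringent definition ST7 is left unlinked, whereas the relaxed definition draws it into the group containing ST1. Because groups are connected components, no ST can appear in more than one group under either definition.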

The BURST algorithm then attempts to identify the founding ST of each group, using the criterion that it should be the ST with the greatest number of SLVs. It then predicts the descent from the founding ST to the other STs in the group, displaying the output as a radial diagram, centred on the predicted founding ST (Fig. 1). STs that are directly linked within an eBURST diagram differ at only one of the seven MLST loci and, using the default group definition, all STs are linked in the resulting eBURST diagram. The eBURST diagram shows the abundance of each ST in the input data by the size of the circle representing the ST and shows the predicted founder of the clonal complex in blue. eBURST links directly to the MLST databases and this allows the program to be run from within a MLST website, and STs can be highlighted in an eBURST diagram, and information about them and their SLVs retrieved directly from the MLST database (Fig. 1).
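The founder criterion stated above (the ST with the greatest number of SLVs) can be sketched in a few lines. The tie-break used here, choosing the lowest ST label, is an arbitrary assumption of this sketch and is not described in the text; the function names and the example group are likewise hypothetical.

```python
def count_slvs(profiles):
    """For each ST, count the other STs in the group that differ at one locus."""
    counts = {st: 0 for st in profiles}
    sts = list(profiles)
    for i, a in enumerate(sts):
        for b in sts[i + 1:]:
            differences = sum(x != y for x, y in zip(profiles[a], profiles[b]))
            if differences == 1:
                counts[a] += 1
                counts[b] += 1
    return counts


def predict_founder(profiles):
    """Return the ST with the most SLVs (ties: lowest label, an assumption)."""
    counts = count_slvs(profiles)
    return min(counts, key=lambda st: (-counts[st], st))


# Hypothetical group: ST1 with three SLVs, one of which has its own SLV.
group = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (2, 1, 1, 1, 1, 1, 1),
    3: (1, 2, 1, 1, 1, 1, 1),
    4: (1, 1, 2, 1, 1, 1, 1),
    5: (2, 1, 1, 1, 1, 1, 3),  # SLV of ST2, DLV of ST1
}

slv_counts = count_slvs(group)
print(slv_counts)
print("predicted founder:", predict_founder(group))
```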

Figure 1

eBURST run from within the S. pneumoniae MLST website. All of the 2266 isolates in the S. pneumoniae MLST website (http://spneumoniae.mlst.net) were analysed by eBURST v2 and one of the clonal complexes is displayed as an eBURST diagram. The predicted founder of this clonal complex (ST124; 100% bootstrap support) is the major serotype 3 clone associated with invasive disease, and is shown in blue. The sizes of the circles that represent each ST indicate their prevalence in the MLST database. The numbers in the eBURST diagram correspond to the STs. ST124 has been selected (green square) and the linkage between eBURST and the MLST database has been used to extract all of the isolates of this ST from the database.

Young clonal complexes, such as that in Fig. 1, will typically have a single strongly predicted founding ST and a number of SLVs and perhaps one or two DLVs. Older clonal complexes will have diversified further and are likely to have a less simple structure. In the latter, SLVs or DLVs of the founder of the clonal complex may themselves have become prevalent in the population, and may have diversified to produce their own sets of SLVs, to become what are described as subgroup founders. In such cases the identification of the founder of the whole clonal complex may be less clear, as both the real founder and the subgroup founders may have substantial bootstrap support for being the founder. The ST21 clonal complex of Campylobacter jejuni [8] provides an example of a much larger and more diversified clonal complex, with a strongly supported founder (ST21) and its linked SLVs, some of which have become prevalent and have diversified to form major subgroups (Fig. 2). Subgroup founders are shown in yellow.

Figure 2

The eBURST diagram for a large highly diversified clonal complex. All of the isolates in the Campylobacter public MLST database (http://pubmlst.org/campylobacter/) were analysed by eBURST v2 and the largest clonal complex was displayed. For improved clarity the ST numbers are not shown. The predicted founder is identified as ST21 (blue circle; 99% bootstrap support) and is linked to a large number of SLVs, three of which have diversified to become major subgroup founders (yellow circles). ST21 and all of the major subgroup founders are prevalent in the MLST database, as indicated by the size of the circles that represent these STs. Two other large subgroups are identified but their descent from ST21 is more tenuous as they are linked to it through several other STs. Reproduced with permission in a slightly modified form from reference [7].


With a less stringent group definition, where STs assigned to the same group share alleles at 5 or more loci with at least one other ST in the group, there will typically be several separate clonal complexes as well as individual unlinked STs. eBURST can also display in a single diagram all of the STs in an entire MLST database, providing an overview of the major clonal complexes, the small clusters of linked STs, and individual unlinked STs in the input data (a population snapshot).

The original version of eBURST produced diagrams that needed to be manually edited, which can be laborious for large clonal complexes, or for a population snapshot. The current implementation (eBURST v2) automatically optimises the arrangement of all STs and provides diagrams that need minimal manual editing. The program is robust and is able to analyse large datasets. For example, Fig. 3 shows a population snapshot obtained using all 5587 isolates (3737 STs) in the N. meningitidis MLST public database. The output is very visual and highlights the contrast between the simple structure of the major serogroup A clonal complex (ST5 clonal complex; subgroup III), which has two very prevalent STs, which are SLVs of each other but very few other SLVs, and the much more diversified major serogroup B and C clonal complexes (Fig. 3).

Figure 3

The eBURST population snapshot. All 5587 isolates of N. meningitidis in the public MLST database (http://pubmlst.org/neisseria/) were displayed as a single eBURST diagram (a population snapshot). The snapshot shows in a single diagram all the clonal complexes, and their predicted founders (blue) and patterns of descent, as well as the smaller groups of linked STs and individual STs. The major serogroup A clonal complex is identified (arrow), as are a selection of the large clonal complexes that include isolates of serogroups B or C. Serogroup A isolates are associated with epidemics and pandemics of meningococcal disease whereas isolates of serogroups B and C are typically associated with endemic disease and small outbreaks of disease.

3 Exploring the eBURST diagram

To aid in the further exploration of eBURST output, the program includes a simple bootstrap procedure which estimates the degree of support for the predicted founder of a clonal complex, as well as for subgroup founders [7]. Where more than one ST has good bootstrap support, this indicates conflict. eBURST provides an hypothesis about the founding ST within a clonal complex and the evolutionary pathways, or patterns of descent, from that founder to each of the other STs, but this hypothesis should be assessed in the light of other data. For example, in the case of large old clonal complexes containing many subgroups, isolates with the same genotype as the original founder of the group may not be present in the sample (having dwindled in frequency in the population for stochastic or selective reasons). Here, eBURST will assign the next best alternative, but the patterns of descent depicted will obviously be wrong. Alternatively the founder may be present, but one subgroup in the clonal complex may be highly over-represented because the ancestor of this subgroup has acquired a feature of interest, like antibiotic resistance, resulting in over-sampling. The result is that this subgroup and its founder are erroneously identified as the root of the whole clonal complex. This is clearly the case in the ST156/162 clonal complex of Streptococcus pneumoniae, which is described below.
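The text does not give the details of the bootstrap procedure, so the following is only a generic illustration of how resampling can be used to gauge support for a founder. It assumes that STs are resampled with replacement, that the founder is re-predicted in each replicate as the ST with the most SLVs, and that ties go to the lowest ST label. All names, parameters and profiles are hypothetical, and the procedure in eBURST itself may differ.

```python
import random
from collections import Counter


def differences(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def predict_founder(profiles):
    """ST with the most SLVs; ties broken by lowest label (an assumption)."""
    counts = {
        st: sum(differences(p, q) == 1 for other, q in profiles.items() if other != st)
        for st, p in profiles.items()
    }
    return min(counts, key=lambda st: (-counts[st], st))


def bootstrap_support(profiles, replicates=1000, seed=0):
    """Percentage of resampled datasets in which each ST is the predicted founder."""
    rng = random.Random(seed)
    sts = sorted(profiles)
    tally = Counter()
    for _ in range(replicates):
        # Resample STs with replacement; duplicates collapse to one copy.
        sample = {st: profiles[st] for st in rng.choices(sts, k=len(sts))}
        tally[predict_founder(sample)] += 1
    return {st: 100.0 * tally[st] / replicates for st in sts}


# Hypothetical group: ST1 with three SLVs, one of which has its own SLV.
group = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (2, 1, 1, 1, 1, 1, 1),
    3: (1, 2, 1, 1, 1, 1, 1),
    4: (1, 1, 2, 1, 1, 1, 1),
    5: (2, 1, 1, 1, 1, 1, 3),
}

support = bootstrap_support(group)
for st, pct in support.items():
    print(f"ST{st}: {pct:.1f}%")
```

In a sketch of this kind, an ST is still credited as founder in replicates from which the true founder happens to be missing, which mirrors the caveat in the text that eBURST will assign the next best alternative when the original founder is absent from the sample.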

Bootstrapping provides one way of estimating the degree of confidence in the assignment of the founder of a clonal complex. Another useful measure provided by eBURST v2 is the average genetic distance (measured as the average number of allelic differences) of each ST from all other STs in a clonal complex. The ST within a clonal complex that has the minimum average genetic distance is likely to be the founder. For example, in a simple (presumably young) clonal complex, where there is only the founding ST and a set of SLVs, the average genetic distance from the founding ST to all other STs will be one, and all other STs in the clonal complex will have a greater average distance. For clonal complexes with multiple subgroups where it is difficult to resolve the most likely founder the minimum average distance provides another indicator of the likely founder.
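The average-distance measure can be illustrated directly. The sketch below computes, for each ST, the mean number of allelic differences to all other STs in a group; the function names and profiles are hypothetical. The example is the simple case described above, a founder plus SLVs, with each SLV altered at a different locus.

```python
def differences(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def average_distances(profiles):
    """Mean number of allelic differences from each ST to all other STs."""
    if len(profiles) < 2:
        raise ValueError("need at least two STs to compute a distance")
    result = {}
    for st, profile in profiles.items():
        others = [p for other, p in profiles.items() if other != st]
        result[st] = sum(differences(profile, p) for p in others) / len(others)
    return result


# Hypothetical young clonal complex: a founder (ST1) and three SLVs.
complex_profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (2, 1, 1, 1, 1, 1, 1),
    3: (1, 2, 1, 1, 1, 1, 1),
    4: (1, 1, 2, 1, 1, 1, 1),
}

avg = average_distances(complex_profiles)
for st, value in avg.items():
    print(f"ST{st}: {value:.2f}")
print("minimum average distance:", min(avg, key=avg.get))
```

Here the founder has an average distance of exactly one, while each SLV is one difference from the founder and two from each of the other SLVs, giving (1 + 2 + 2) / 3, or about 1.67.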

Where the founder predicted by eBURST is considered to be incorrect, or to explore how the clonal complex would look with an alternative plausible founder, the clonal complex can be re-drawn by eBURST v2 with a user-defined ST as the founder. Re-assignment of the founder can produce a considerable change in the diagram, as some STs may be SLVs of the founder predicted by eBURST, and also of a subgroup founder that is a plausible alternative candidate for the founder of the clonal complex. In building the eBURST diagram, all of the SLVs of the predicted founder are first linked to it, and this assignment rule means that they are not available to be linked as SLVs of a subgroup founder. If the subgroup founder is selected as the user-defined founder, all of its SLVs become assigned to it, now including those that are also SLVs of the founder predicted by the algorithm.
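The effect of the founder-first linking rule can be sketched with a breadth-first traversal over SLV links, in which each ST is linked to whichever ST reaches it first, starting from the chosen founder. This is a simplification made for illustration: it does not reproduce the rules eBURST uses to order subgroup founders, and the function names, ST labels and profiles are hypothetical.

```python
from collections import deque


def differences(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def link_from_founder(profiles, founder):
    """Map each ST to the ST it is linked to, starting from the founder.

    All SLVs of the founder are linked to it first, so they are no longer
    available to be linked to any other ST.
    """
    linked_to = {founder: None}
    queue = deque([founder])
    while queue:
        current = queue.popleft()
        for st in sorted(profiles):
            if st not in linked_to and differences(profiles[current], profiles[st]) == 1:
                linked_to[st] = current
                queue.append(st)
    return linked_to


# Hypothetical complex in which ST3 is an SLV of both ST1 and ST2.
profiles = {
    1: (1, 1, 1, 1, 1, 1, 1),
    2: (2, 1, 1, 1, 1, 1, 1),  # SLV of ST1
    3: (3, 1, 1, 1, 1, 1, 1),  # SLV of ST1 and of ST2
    4: (2, 2, 1, 1, 1, 1, 1),  # SLV of ST2 only
    5: (1, 1, 2, 1, 1, 1, 1),  # SLV of ST1 only
}

links_founder_1 = link_from_founder(profiles, founder=1)
links_founder_2 = link_from_founder(profiles, founder=2)
print(links_founder_1)
print(links_founder_2)
```

With ST1 as founder, ST3 is linked to ST1; when ST2 is chosen as the user-defined founder, ST3 is linked to ST2 instead, so ST2 gains a linked SLV simply because of the order in which links are assigned.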

Fig. 4 shows the clonal complex that includes ST156, the widely disseminated penicillin-resistant Spain9V-3 clone of S. pneumoniae [9,10]. There are two major subgroups in this clonal complex, one apparently descended from the penicillin-susceptible ST162, and the other from the penicillin-resistant ST156. In this example, ST156 is probably wrongly assigned by eBURST as the founder of the clonal complex, as this ST, and thus SLVs of it, have been greatly oversampled since they are penicillin-resistant. The average distance of ST156 to all other isolates in this group is 1.64, while the corresponding value for ST162 is 1.81 (the next best is ST930 with 2.08). Again, this suggests erroneously that ST156 is the likely founder due to the problem of over-sampling, but we can see that ST162 also occupies a central position relative to other members of the group. Comparison of the average distance may help to identify STs meriting further scrutiny as alternative founders. In Fig. 4(b) the clonal complex has been re-drawn with the penicillin-susceptible ST162 as the user-assigned founder.

Figure 4

Re-assigning the founder of a clonal complex. ST156 (the Spain9V-3 penicillin-resistant clone of S. pneumoniae) is assigned by eBURST as the founder of its clonal complex (a), but this is probably incorrect (see text). The penicillin-susceptible ST162 is the more likely founder and the clonal complex was re-drawn using eBURST v2 with this ST as the user-defined founder (b). The red colour of the circle that represents ST162 indicates it is a user-defined founder. ST162 has more linked SLVs when it is selected as the founder (b) than when it is a subgroup founder (a), due to the rule within eBURST that preferentially links the founder to all of its SLVs. eBURST v2 allows all of the SLVs (pink lines) and DLVs (blue lines) of any selected ST to be identified and those of ST156 are shown (b). This indicates that several of the SLVs of ST162 are also SLVs of ST156. Additional data can be used to explore whether it is more parsimonious to re-assign any of these as SLVs of ST156 (see text).

Some STs in an eBURST diagram could be descended from more than one immediate ancestor and the eBURST diagram provides only one hypothesis for the patterns of descent. eBURST uses a set of rules that attempts to maximise the number of SLVs associated with the predicted founder of the clonal complex, and with the subgroup founders, and in a large clonal complex there are many equally plausible minor variants of the arrangement of STs in the eBURST diagram. These may be explored by clicking on an ST of interest while holding down the shift key, so displaying all SLV and DLV links from that ST, not just those shown in the eBURST diagram. The display of one of several alternatives is similar to a dilemma frequently encountered in phylogenetic tree reconstruction, where there may be a large number of approximately equally plausible trees, which differ slightly in the branching order, but only one of them is displayed, or a consensus tree is generated, which is not necessarily as plausible as the individual trees which it reconciles. Additional information can in some cases be used to further explore the patterns of descent implied by an eBURST diagram. For example, differences in serotypes, or in the antibiotic resistance profiles, of the isolates of the STs within a clonal complex may favour one of two equally parsimonious linkages between STs in an eBURST diagram. The ability in eBURST v2 to show all SLVs and DLVs of any selected ST, as described above, helps to explore alternative equally plausible patterns of descent. For example, in Fig. 4, eBURST shows that some of the SLVs of ST162 are also SLVs of ST156. One of these (ST838) has high-level resistance to penicillin and if it possessed the same altered penicillin-binding protein genes as the Spain9V-3 clone (ST156) it would be more parsimonious to re-assign it as a SLV of the penicillin-resistant ST156.

4 Concluding remarks

Dendrograms provide a poor representation of the relationships among isolates with similar genotypes and do not contain easily interpreted information about ancestry or pathways of recent evolutionary descent. Incorporating a simple model of clonal expansion and diversification provides a new approach to understanding recent evolutionary events, as exemplified by the eBURST program. This approach identifies clonal complexes within bacterial populations, and predicts their founding genotypes, and the evolutionary pathways by which the other genotypes within the clonal complex may have arisen. While designed specifically for use with MLST data, the approach may be generalised to explore ancestry and descent among isolates characterized by other typing methods generating digital data. For example, the sizes of 12 mycobacterial interspersed repetitive units (MIRUs) are equated with alleles at each locus and provide allelic profiles that are used to type strains of Mycobacterium tuberculosis [11] and similar methods are used to characterise other pathogens. Such techniques, which typically index variation at many more loci than MLST, may require that the default parameters for definition of eBURST groups be changed: two strains that differ at one of twenty loci are more similar in multilocus genotype than two isolates differing at one of seven. Therefore it may be appropriate to have a less stringent definition of an eBURST group than with MLST data and some modification of the way the eBURST diagram is displayed would be necessary for a substantially greater number of loci than that used in MLST. We suggest that resolving the patterns of diversification, and identifying the founding genotype, of the major globally disseminated clones of M. tuberculosis is an example of a problem amenable to the algorithmic approach implemented in eBURST.

eBURST diagrams produce evolutionary hypotheses, rather than the truth, which can be further explored by mapping on additional genetic or phenotypic data. The ancestry and patterns of descent in simple clonal complexes are probably well represented by eBURST but there will be increasing uncertainty about the assignment of the founding ST and the direction of evolutionary descent in large, more diversified complexes. Simulations are required to explore the reliability of eBURST in identifying the founding ST and patterns of descent, how this reliability relates to the complexity of clonal complexes, and how it is influenced by over-sampling of particular genotypes.


  1. [1].
  2. [2].
  3. [3].
  4. [4].
  5. [5].
  6. [6].
  7. [7].
  8. [8].
  9. [9].
  10. [10].
  11. [11].