The Vibrio pathogenicity island (VPI) encodes the toxin-coregulated pilus and other virulence factors that enable Vibrio cholerae to colonize the human intestine and cause cholera. We assessed the level of genetic variation of VPI in nine nonpandemic isolates, and compared them with the sixth and seventh pandemic strains by sequencing c. 5 kb each from the start, middle and end regions of the VPI. Variation is similar among the three regions at around 2%, except for the tcpA gene, which has a much higher level of variation (23%). Numerous recombination segments were identified with sizes up to 2177 bp. Nearly all VPI genes sequenced have a ratio of synonymous to nonsynonymous substitutions considerably lower than that for housekeeping genes, suggesting that VPI genes are under positive selection pressure for change. The tagA gene was deleted or damaged in six isolates, which is likely to affect the efficiency of colonization of the human intestine. Two genes, orf2 and acfD, previously found to be translated differently in the sixth and seventh pandemic strains, were determined to be mutant in the seventh and sixth pandemic strains, respectively. These findings enhance our understanding of variation in the VPI, and of the pathogenic potential of VPI-positive environmental isolates.
Vibrio pathogenicity island
Vibrio cholerae is the causative agent of cholera, a life-threatening severe watery diarrhoea (Kaper et al., 1995). Two major factors, the toxin-coregulated pilus (TCP) and cholera toxin (CT), are essential for virulence (Kaper et al., 1995). The TCP is encoded in the Vibrio pathogenicity island (VPI) and CT within the CTX phage (CTXΦ) (Waldor & Mekalanos, 1996; Karaolis et al., 1998). The TCP not only aids colonization of the intestinal mucosal epithelial surface but also acts as a receptor for the CTXΦ (Waldor & Mekalanos, 1996).
The tcpA gene encodes TcpA, the major protein of the TCP (Rhine & Taylor, 1994). There are considerable differences in the epitope or antigenic structure of TcpA between classical (sixth pandemic) and El Tor (seventh pandemic) biotype strains of V. cholerae O1 (Jonson et al., 1992). The variability in the immunodominant C-terminal domains of TcpA leads to antigenic variation, which is believed to affect the level of cross protection among strains carrying variants of TcpA (Nandi et al., 2000). Comparison of the VPIs of the sixth and seventh pandemic strains showed that the TCP region has the highest level of sequence variation, with tcpA being the most divergent gene (Karaolis et al., 2001).
Vibrio cholerae occurs widely in marine and brackish environments (Colwell & Spira, 1992), as do most other species of Vibrio, and in some areas it is also found in fresh waters. It is isolated from muds and waters and also from molluscs and crustaceans (Nair et al., 1988, 1991). The environmental isolates are much more variable than those of the sixth and seventh pandemics (Beltran et al., 1999; Farfan et al., 2000), which can now be seen to be closely related clones with human pathogenic properties. The VPI was initially thought to be present only in the epidemic V. cholerae isolates, but a number of studies have shown that it is also present in some environmental isolates, with recovery of novel tcpA alleles (Novais et al., 1999; Nandi et al., 2000; Mukhopadhyay et al., 2001; Boyd & Waldor, 2002; Li et al., 2002, 2003). However, we know little of variation of the VPI genes other than for tcpA. In this study, we sequenced three regions of 5 kb each of the VPI from seven environmental isolates to determine the level of variation and assess the forces driving the divergence of VPI.
Materials and methods
The V. cholerae isolates used in this study are listed in Table 1. The two Australian non-O1 isolates were from Dr P. Desmarchelier (Food Sciences, Australia), four non-O1 isolates were from Dr T. Shimada (Japan National Institute of Health), and one isolate (M1121) from Dr C. Salles (Instituto Oswaldo Cruz, Brazil). All but M1121 had been identified as carrying the VPI by screening using PCR targeting the tcpA gene (unpublished data). Some of these strains were also reported by others to be VPI positive. Presence of VPI in M1098 and M1567 was reported by Li et al. (2002). Strains M1593 and M1118 are likely to be closely related to two other strains used by Li et al. (2002) as the O antigen, place and year of isolation are the same. The tcpA sequence of M1121 was published by Novais et al. (1999).
All primers (Supporting Information, Table S1) were designed based on N16961 (the seventh pandemic genome sequenced strain) as the template. The primer pairs amplify c. 1-kb segments, overlapping adjacent segments by c. 100 bp. Primers 9013, 9014, 9015 and 9020 were designed to replace 9007, 9011, 9012 and 9010, respectively, which failed to anneal to the DNA templates of strains M1118, M1593, M1619 and M1098, respectively.
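The tiling scheme described above can be sketched as follows. This is a minimal illustration only: the function name `tile_amplicons` is hypothetical, the amplicon length (1000 bp) and overlap (100 bp) are round-number stand-ins for the approximate values given in the text, and the actual primer positions are those listed in Table S1, not the coordinates computed here.

```python
def tile_amplicons(region_length, amplicon_length=1000, overlap=100):
    """Return 1-based inclusive (start, end) coordinates of overlapping amplicons.

    Hypothetical helper: tiles a region with fixed-size amplicons that each
    overlap the previous one by `overlap` bases. The last amplicon is
    truncated at the end of the region.
    """
    if region_length < 1:
        raise ValueError("region_length must be positive")
    if not 0 <= overlap < amplicon_length:
        raise ValueError("overlap must be smaller than amplicon_length")

    step = amplicon_length - overlap
    amplicons = []
    start = 1
    while True:
        end = min(start + amplicon_length - 1, region_length)
        amplicons.append((start, end))
        if end == region_length:
            break
        start += step
    return amplicons


if __name__ == "__main__":
    # 5734 bp is the length reported for the ALD left region.
    for start, end in tile_amplicons(5734):
        print(start, end)
```

With these assumed values, a region of about 5.7 kb needs seven amplicons, which is in line with the number of primer pairs one would expect for each c. 5-kb region.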
The phred-phrap-consed program package (Gordon et al., 1998), available at the Australian National Genomic Information Service (ANGIS), was utilized for sequence editing. The gcg package and multicomp (Reeves et al., 1994) were used for multiple sequence alignment and comparison. phylip (Felsenstein, 1989) was used to generate phylogenetic trees and bootstrap values. The rdp3 program (Martin et al., 2005) was used to detect recombination. The single likelihood ancestor counting (SLAC), fixed effects likelihood (FEL) and random effects likelihood (REL) methods (Pond & Frost, 2005) were used to detect positive selection on individual codons on the datamonkey website (http://www.datamonkey.org/). Stephens test of recombination (Stephens, 1985) was performed using a program written by one of the authors (R.L.).
GenBank accession numbers for the sequences determined in this study are FJ208994–FJ209020.
Results and discussion
The VPI is c. 40 kb in size. We sequenced 5 kb each from the left, middle and right end regions of the VPI, representing 38.5% of the island, from nine strains, including seven environmental strains and two toxigenic isolates closely related to the seventh pandemic clone. The three regions are denoted as ALD left, TCP middle and ACF right regions. The sixth and seventh pandemic VPI sequences (Karaolis et al., 2001) were included for comparison.
For the ALD left region, we sequenced 5734 bp, which covers 1059 bp of the 1521 bp aldA gene, the whole tagA gene (3042 bp) and 1305 bp of the 1711 bp orf2. An average nucleotide variation of 2.17% with a maximum of 5.68% per gene was observed (Table 2). aldA has very low variation while the level of variation in tagA and orf2 is similar. A deletion was detected by PCR in strains M1118 and M1593, and found by sequencing to be a 2367 bp deletion within the tagA gene. Four strains had mutations in tagA leading to truncated peptides: a single base deletion at base 50 in strain M1121, a G to A mutation at base 233 in M2140 and a G to T mutation at base 28 in both M1567 and M1619.
* M1118 and M1593 were excluded from Ks/Ka calculation due to the 2-kb deletion involving tagA.
† The length of the region includes intergenic regions.
A total of 4350 bp was sequenced from the TCP middle region, which includes five of the VPI genes, tcpPHABQ, in full. The region has an average nucleotide variation of 6.79% and a maximum of 9.21% (Table 2). However, the variation for all genes except tcpA ranges from 1.81% to 2.50%, while tcpA has extremely high variation (22.56%), roughly 9 to 12 times higher than the other four genes sequenced in this region.
A total of 5701 bp from the ACF right region was sequenced, which covers 828 bp of the 909 bp tagE, the complete acfA (648 bp) and 4011 bp of the 4563 bp acfD gene. The ACF right region has an average variation of 2.12%, ranging from 1.27% to 2.32%. It is clear that with the exception of tcpA and aldA, the VPI genes across all three regions have a similar level of variation at around 2%. This is comparable to the level of variation we observed in housekeeping genes for a different set of V. cholerae strains, averaging 1.52% for mdh (Byun et al., 1999), 1.41% for asd (Karaolis et al., 1995) and 3.21% for hlyA (Byun et al., 1999), suggesting that these regions of VPI resemble housekeeping genes in level of variation.
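As an illustration of how percentages of this kind can be derived from an alignment, the sketch below computes two common summaries: the percentage of polymorphic sites and the mean pairwise difference. The text does not state which definition underlies Table 2, so both functions, their names and the toy alignment are assumptions made for illustration, not the calculation performed by multicomp.

```python
from itertools import combinations

NUCLEOTIDES = set("ACGT")


def percent_polymorphic_sites(alignment):
    """Percentage of alignment columns carrying more than one nucleotide.

    Gaps and ambiguity codes are ignored; columns without any A/C/G/T
    are skipped entirely.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    compared = variable = 0
    for column in zip(*alignment):
        bases = {base for base in column if base in NUCLEOTIDES}
        if not bases:
            continue
        compared += 1
        if len(bases) > 1:
            variable += 1
    return 100.0 * variable / compared


def mean_pairwise_difference(alignment):
    """Mean percentage difference over all sequence pairs (gapped sites skipped)."""
    proportions = []
    for seq_a, seq_b in combinations(alignment, 2):
        pairs = [(a, b) for a, b in zip(seq_a, seq_b)
                 if a in NUCLEOTIDES and b in NUCLEOTIDES]
        differences = sum(1 for a, b in pairs if a != b)
        proportions.append(differences / len(pairs))
    return 100.0 * sum(proportions) / len(proportions)


if __name__ == "__main__":
    # Toy alignment, not real VPI sequence.
    toy = ["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAT", "ACGTACG-AC"]
    print(percent_polymorphic_sites(toy))
    print(mean_pairwise_difference(toy))
```

The two measures can differ substantially for the same data, so values are only comparable between studies when the same definition is used.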
Ratio of synonymous and nonsynonymous substitutions (Ks/Ka)
The Ks/Ka ratio is often used as an indicator of neutral or adaptive variation. For the 11 VPI genes sequenced, the ratios range from 1.83 for tagE to 9.3 for tcpB. In comparison, the Ks/Ka ratios in three V. cholerae housekeeping genes, asd, mdh and recA, calculated using data from Karaolis et al. (1995) are 30.88, 50.52 and 42.32, respectively. The Ks/Ka ratios of these VPI genes are much lower than those of housekeeping genes, suggesting that the VPI genes are under positive selection. Proteins or molecules exposed on the cell surface are more likely to be under selection pressure for change. The pilus subunits are obvious targets of the immune system and the major pilin subunit TcpA is discussed below. The gene encoding the minor pilin subunit, tcpB, has the highest Ks/Ka ratio among the VPI genes studied, but still substantially lower than the housekeeping genes. However, tcpP encoding a regulatory protein also has a low Ks/Ka ratio. We analysed the genes by codon position to determine whether positive selection has been exerted at individual amino acid level using the SLAC, FEL or REL methods (Pond & Frost, 2005), but none was detected (data not shown). However, as the number of sequences is small, positive selection may not be detectable.
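To make the Ks/Ka comparison concrete, the following sketch estimates Ks and Ka for one pair of aligned coding sequences. The text does not say which estimator was used, so this assumes a simplified Nei-Gojobori style count with a Jukes-Cantor correction; changes to stop codons are counted as nonsynonymous when counting sites, pathways through stop codons are skipped when counting differences, and all function names and the toy sequences are hypothetical.

```python
from itertools import permutations
from math import log

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons ordered by TCAG at each position.
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def synonymous_sites(codon):
    """Number of synonymous sites in a codon (between 0 and 3)."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                total += 1.0 / 3.0
    return total


def count_differences(codon1, codon2):
    """Return (synonymous, nonsynonymous) differences averaged over pathways."""
    positions = [i for i in range(3) if codon1[i] != codon2[i]]
    if not positions:
        return 0.0, 0.0
    syn_total = nonsyn_total = 0.0
    n_paths = 0
    for order in permutations(positions):
        current, syn, nonsyn, valid = codon1, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == "*":
                valid = False  # skip pathways passing through a stop codon
                break
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                nonsyn += 1
            current = nxt
        if valid:
            syn_total += syn
            nonsyn_total += nonsyn
            n_paths += 1
    if n_paths == 0:
        return 0.0, float(len(positions))
    return syn_total / n_paths, nonsyn_total / n_paths


def jukes_cantor(p):
    """Jukes-Cantor distance; infinite when the proportion is saturated."""
    return float("inf") if p >= 0.75 else -0.75 * log(1.0 - 4.0 * p / 3.0)


def ks_ka(seq1, seq2):
    """Return (Ks, Ka) for two aligned, gap-free, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_sites = syn_diffs = nonsyn_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if CODON_TABLE[c1] == "*" or CODON_TABLE[c2] == "*":
            raise ValueError("stop codons are not supported in this sketch")
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2.0
        sd, nd = count_differences(c1, c2)
        syn_diffs += sd
        nonsyn_diffs += nd
    nonsyn_sites = len(seq1) - syn_sites
    if syn_sites == 0:
        raise ValueError("no synonymous sites in these sequences")
    return (jukes_cantor(syn_diffs / syn_sites),
            jukes_cantor(nonsyn_diffs / nonsyn_sites))


if __name__ == "__main__":
    # Toy sequences: one synonymous (GCT>GCC) and one nonsynonymous (AAA>AGA) change.
    ks, ka = ks_ka("ATGGCTGGTCTGAAAGTTCCG", "ATGGCCGGTCTGAGAGTTCCG")
    print(ks, ka, ks / ka)
```

Because there are far fewer synonymous than nonsynonymous sites in a gene, one change of each kind still gives Ks greater than Ka, which is why ratios well above 1 are the norm for conserved genes.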
TcpA is of particular interest because it is the major pilin subunit. Although tcpA does not have the lowest Ks/Ka ratio, it has the highest Ka. tcpA has been well studied and was suggested to be under selection pressure to evade the immune system (Karaolis et al., 2001). It is known that differences in the TCP lead to differences in antigenic specificity and to substantial specificity in protective immunity (Voss et al., 1996). This was suggested to have led to selection pressure that has driven the recombinational substitution of the central region of the VPI (Karaolis et al., 2001). It is also possible that the differences in TcpA, which is the major component of the TCP, would affect interaction with various phages using it as a receptor in a way that is subject to selection (Karaolis et al., 2001). However, the recent discovery of a TCP gene cluster homologue in the genome of Vibrio fischeri (Ruby et al., 2005), a species not usually associated with human disease, suggests that TCP may play a role in other systems. The tcpA gene has also been detected in another distantly related Vibrio species, Vibrio alginolyticus (Sechi et al., 2000). Therefore, the divergence of tcpA may not have been a result of selection pressure from the human immune system only. It is possible that TCP is involved in the interactions of V. cholerae with marine invertebrates, with which V. cholerae is frequently associated without any apparent symbiosis.
Frequent recombination was detected within the VPI
We examined all three regions for recombination by visual inspection of patterns of sequence variation (Figs 1 and S1). Numerous potential recombinant segments were identified and in most cases the breakpoint of recombination is clear-cut. Recombination is distinguished from multiple mutations by the presence of segments in which there is a big difference in the frequency of substitutions relative to other strains, and often by segments in which the relationships of the strains are different from adjacent segments. It is not always possible to tell in which set of strains the recombination event occurred. We first attempted to locate recombination segments using the rdp3 program (Martin et al., 2005), which implements several algorithms, including maxchi (Maynard Smith, 1992), but all failed to find any recombination, although there were many visually obvious recombinant segments. We then used the Stephens test (Stephens, 1985) to determine recombinational segments based on nonrandom clustering of base changes and found 6, 31 and 9 partitions with at least four sites that are significantly clustered at P<0.001 for the ALD, TCP and ACF regions, respectively (Table S2). Eight partitions were significant only after removing the longest nonvarying segment. One partition in the ALD region (M1567, M1619 and M1098 vs. others) is trivial with a d0 of 4 (for definition of d0 see Table S2) while the remaining partitions have d0 of 197 or greater. The much higher number of partitions in the TCP region indicates that it has undergone more recombination than the other two regions, consistent with our visual examination results described below. However, it is not easy to relate the partitions identified by the Stephens test to specific recombination events and we report below those found by inspection.
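The first criterion used in the inspection, a segment whose substitution frequency differs sharply from the rest of the sequence, can be approximated by a simple sliding-window count. The sketch below is not the Stephens test and not any algorithm from the recombination package named above; the function name, window size and toy sequences are assumptions chosen only to illustrate the idea.

```python
def window_differences(query, reference, window=200, step=50):
    """Count mismatches between two aligned sequences in sliding windows.

    Returns a list of (start, end, mismatches) with 1-based inclusive
    coordinates. Columns with a gap in either sequence are not counted.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must have the same aligned length")
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")

    results = []
    last_start = max(len(query) - window, 0)
    for start in range(0, last_start + 1, step):
        end = min(start + window, len(query))
        mismatches = sum(
            1
            for q, r in zip(query[start:end], reference[start:end])
            if q != r and q != "-" and r != "-"
        )
        results.append((start + 1, end, mismatches))
    return results


if __name__ == "__main__":
    # Toy data: a 30-base divergent block embedded in otherwise identical sequence.
    reference = "A" * 600
    query = "A" * 250 + "C" * 30 + "A" * 320
    for start, end, mismatches in window_differences(query, reference, 100, 100):
        print(start, end, mismatches)
```

A window with many more differences than its neighbours marks a candidate segment only; deciding whether it reflects recombination still requires comparing the segment with other strains, as described in the text.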
Fig. 1. Plot of polymorphic bases in the three regions sequenced. The consensus is plotted at the top for each region to visualize the level of variation in different genes. Difference to the consensus is plotted as a vertical bar for individual strains. The deletions in the ALD region in M1118 and M1593 were plotted as a thin grey line. The same x-axis scale is used for the three regions. The horizontal line above the bars indicates the segment is a recombinant except for the lines above the consensus, which indicate gene boundaries with gene names marked. See Fig. S1 for sequence variation in detail.
In the ALD region, six recombinant segments were found, ranging in size from 77 to 2501 bp (Figs 1 and S1). The longest recombinant segment is from sites 1360 to 3860 in M1098, with another small recombinant segment (sites 3430–3506) within this segment. The segment from sites 27 to 741 in M1098 is also potentially recombinant as there are only two informative base differences to the pandemic strains. We treat any segment in M1098 similar to other strains as recombinant in origin, with five such segments identified. Alternatively, M1098 could be closely related to M1567/M1619 with replacement of other regions from dissimilar sequences. In either scenario, M1098 seems to be a mosaic sequence.
The TCP middle region has nine recombinant segments of sizes 100 bp or greater (Figs 1 and S1). Because of the high level of sequence variation, recombinant segments in this region are more difficult to identify. The largest segment is 1510 bp in M1118 with a few base differences to the potential donor, the sixth pandemic strain. However, the TCP region, particularly the tcpA gene, has several small recombinant segments (Fig. S1).
In the ACF region, nine recombinant segments ranging in size from 119 to 1149 bp were detected, including three in M1098 and four in M1593 (Figs 1 and S1). A pair of recombinant segments (sites 1675–2823 between M1118/M1121 and M1567/M1619) appeared to be a reciprocal exchange, based on the level of sequence variation and the location of the breakpoints.
Phylogenetic trees for the three regions were constructed separately using the neighbour-joining algorithm and are shown in Fig. 2. In the ALD region tree, the sixth and seventh pandemic strains, together with closely related Australian (M2140) and US Gulf (M794) strains, and two environmental isolates, M1618 and M1121, form a tight cluster. M1118 and M1593, and M1567 and M1619 are grouped together, respectively, while M1098 stands alone. Within the cluster of the pandemic and related strains, the Australian and US Gulf strains are grouped together with the seventh pandemic strain, and the M1618 Australian environmental isolate is grouped with the sixth pandemic strain.
Fig. 2. Phylogenetic trees of the three regions. (a) The ALD region; (b) the TCP region; (c) the ACF region. The trees were constructed using the neighbour-joining method. Bootstrap values are percentages of 1000 replications and, if 50% or greater, are indicated at the nodes.
In contrast, in the phylogenetic tree of the TCP region, most isolates are distantly related with long branch lengths, which reflect high divergence of the sequences between the isolates. The sixth and seventh pandemic strains are not grouped together as observed before (Karaolis et al., 2001), when it was suggested that one of them had obtained part of the TCP region from another source. However, the Australian (M2140) and US Gulf (M794) strains are still grouped together with the seventh pandemic strain, as in the tree of the ALD left region. M1618 is again grouped together with the sixth pandemic strain.
In the phylogenetic tree of the ACF region, the pandemic strains, the Australian and US Gulf strains and M1618 are very closely related, as seen in the ALD region tree. However, M1098, rather than M1121 as in the ALD region tree, is closely related to the pandemic cluster. M1619 and M1567 are grouped together, consistent with the ALD left region tree. It seems that one of these strains obtained a different TCP region. From inspection of the sequences of the TCP region (Fig. S1), the two strains were found to share similar sequences at the 5′ end including tcpP and tcpH genes, up to the intergenic region between tcpH and tcpA. The rest of the sequence in the TCP region, including tcpA, tcpB and tcpQ, is dissimilar.
The non-O1/non-O139 Australian strain M1618 has a very similar VPI to that of the sixth pandemic strain with only 14 base changes in the 15 785 bp sequenced. Sequencing of seven housekeeping genes showed that this strain is a derivative of the sixth pandemic clone (data not shown).
A phylogenetic analysis using both DNA (Fig. 3) and amino acid sequences (data not shown) showed that the 30 tcpA sequences were grouped into 12 clusters, as reported previously, and these can be regarded as major alleles. Three clusters (8, 9 and 10) are more closely related and their relationship is consistent between protein and DNA trees. The tcpA sequences obtained in this study fell into the existing clusters, sharing similar or identical sequence to known tcpA alleles.
The tcpA alleles are either distantly related (smallest difference between the major clusters 15.3%) or very closely related (largest difference between alleles within the major clusters 2.3%). A gradient of variation between the two levels was not observed. We suggest that the major alleles with a high level of divergence are more likely to have been derived from recombination for two reasons. Firstly, genes flanking tcpA have a much lower level of variation. Secondly, the high level of divergence in tcpA also extends to the intergenic regions flanking tcpA in the eight major alleles examined in this study (Fig. S1). However, it is difficult to determine which divergent tcpA alleles were obtained from recombination. The polarized levels of variation indicate that recombination rather than mutation played a major role.
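The gap between within-cluster differences (at most 2.3%) and between-cluster differences (at least 15.3%) means that any cut-off between the two values separates the major alleles. A minimal sketch of such a grouping is given below, using single-linkage clustering on uncorrected pairwise distances. The 5% threshold, the function names and the toy alleles are assumptions for illustration; the clusters in the text were defined from phylogenetic trees, not by this procedure.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gaps skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def cluster_alleles(sequences, threshold=0.05):
    """Single-linkage clustering: alleles closer than `threshold` share a cluster.

    `sequences` maps allele name to aligned sequence. Returns a sorted list
    of sorted name lists.
    """
    names = list(sequences)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if p_distance(sequences[first], sequences[second]) <= threshold:
                parent[find(first)] = find(second)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Toy alleles (40 bases each): 2.5% difference within groups, 20% between.
    toy = {
        "allele_a1": "ACGT" * 10,
        "allele_a2": "ACGT" * 9 + "ACGA",
        "allele_b1": "TCGA" * 4 + "ACGT" * 6,
        "allele_b2": "TCGA" * 4 + "ACGT" * 5 + "ACGA",
    }
    print(cluster_alleles(toy))
```

When the distribution of distances is bimodal, as reported here, the result is insensitive to the exact threshold; with a gradient of variation it would not be.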
Frame-shift mutations in the sixth and seventh pandemic strains
One frame-shift mutation each was detected in orf2 and acfD in a comparison of sixth and seventh pandemic strains (Karaolis et al., 2001). With the sequences from the other strains, it was shown that for orf2, the seventh pandemic strain has a deletion, whereas the other non-O1/non-O139 strains are all identical to the sixth pandemic strain with a ‘T’ nucleotide at the site. However, the function of orf2 is not known and therefore the significance of this mutation is not known. In the case of acfD, the sixth pandemic strain is a mutant with an additional ‘A’ nucleotide. Hughes et al. (1994) reported that the mutation affects motility of V. cholerae. Interestingly, M1618, a strain closely related to the sixth pandemic strain, does not carry the frame-shift mutation. It is not clear when the mutation arose in the sixth pandemic clone or whether it is present in all sixth pandemic strains.
The relationship of the pandemic strains and related toxigenic strains
The sequences of the VPIs of the Australian (M2140) and US Gulf (M794) strains are closely related to that of the seventh pandemic strain, differing by only four (two substitutions and two indels) and 10 (four substitutions and six indels) nucleotides, respectively (Table 3), with the Australian strain more closely related to the seventh pandemic strain than the US Gulf strain, consistent with the data on 26 housekeeping genes (Salim et al., 2005).
† The numbers refer to the nucleotide positions in the individual regions sequenced. Base differing from the majority is highlighted in bold.
‡ The sixth pandemic strain is not included in the TCP region comparison because of high-level divergence.
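Counts of substitutions and indels such as those given for the Australian and US Gulf strains can be obtained from a pairwise alignment as sketched below. The sketch assumes that a run of consecutive gap columns in one sequence is a single indel event; the text does not state how indels were counted, and the function name and toy sequences are hypothetical.

```python
def count_substitutions_and_indels(seq_a, seq_b, gap="-"):
    """Return (substitutions, indel_events) for two aligned sequences.

    Assumption: consecutive gap columns in the same sequence count as one
    indel event. Columns where both sequences have a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same aligned length")

    substitutions = 0
    indel_events = 0
    open_gap_in = None  # which sequence currently has an open gap run, if any

    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue
        if a == gap or b == gap:
            gapped = "a" if a == gap else "b"
            if gapped != open_gap_in:
                indel_events += 1
            open_gap_in = gapped
        else:
            open_gap_in = None
            if a != b:
                substitutions += 1
    return substitutions, indel_events


if __name__ == "__main__":
    # Toy alignment: one substitution, a 2-base indel and a 1-base indel.
    print(count_substitutions_and_indels("ACGTAC--GTACGT", "ACCTACGGGTA-GT"))
```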
The sixth and seventh pandemic strains have a high similarity at both ends of the VPI but a very different TCP middle region (Karaolis et al., 2001). The exchange of the TCP region must have occurred either in the sixth pandemic clone or in the lineage leading to the seventh pandemic clone before the seventh pandemic clone diverged from the other ‘El Tor’ isolates, as the Australian and the US Gulf strains carry the same form of the VPI as the seventh pandemic clone. The ctxB gene was sequenced from these strains by Olsvik et al. (1993). There are three base differences among them (Table 3). The US Gulf strain has an identical ctxB to that of the sixth pandemic strain, and the ctxB of the Australian strain differs by one base. This relationship is consistent with housekeeping gene data. However, the seventh pandemic ctxB differs by two bases from the sixth and the location of the CTXΦ is also known to be different (Olsvik et al., 1993), suggesting independent acquisition of the CTXΦ.
VPI is essential for V. cholerae pandemic and epidemic strains to colonize the human intestine. A number of environmental isolates have been found to carry the VPI, giving them the primary requirement to become pathogenic to humans. Previously we have shown that the sixth and the seventh pandemic strains have a different TCP region, one of which must be recombinant, likely to have been driven by selection pressure applied to the tcpA gene (Karaolis et al., 1998). This study assessed the overall variation of the VPI in nonpandemic isolates. The variation is generally similar throughout the VPI except for the tcpA gene, which has a much higher level of variation. We found that recombination is a primary force driving the divergence as seen in the two pandemic strains for the TCP region. Many recombinant segments were identified by similarity to sequences in other strains studied and potentially these represent the donors of the sequence. There seems to be sufficient opportunity for these VPI-positive strains to exchange DNA to give such high levels of recombination. There are also recombinant segments from unknown sources, in particular, those involving the tcpA gene. It is quite likely that there are far more VPI variants, either in V. cholerae or other species in the environment.
Although the TCP is an essential colonization factor and variation in tcpA is important in immune evasion, other genes encoded on the VPI may also affect the efficiency of colonization. Boyd & Waldor (2002) have shown that strain 208 (same strain as M1121 in this study) is far less efficient in mouse intestinal colonization than four other VPI-positive isolates studied, while the remaining three tcpA variants are equally efficient. The VPI regions sequenced from M1121 in this study were found to have a single base deletion in tagA, which may have contributed to the decrease in efficiency, although genes in unsequenced regions may also be damaged. If tagA plays a role in colonization, five other strains (M2140, M1567, M1619, M1118 and M1593) would also be less efficient in colonization as they carry a truncating mutation or deletion in tagA. The tagA gene encodes a lipoprotein with an unknown role in virulence (Harkey et al., 1995) and is in the toxT regulon (Withey & Dirita, 2005), which regulates the ctx genes and the tcp genes. Functional studies of different forms of the VPI will further elucidate the significance of variation in the VPI and the pandemic potential of VPI-carrying strains.
Additional Supporting Information may be found in the online version of this article:
Table S2. Stephens test of nonrandom clustering of polymorphic sites.
Fig. S1. Informative bases of the three regions sequenced. Note that for strains, where no sequence was obtained, the corresponding region appears blank. The numbers at the top of the figure, reading vertically, are base positions which in the text are referred to as sites. Proposed recombinant segments are coloured, but in some cases it is not possible to tell which set of strains underwent recombination. Potential donors if present in strains studied can be easily deduced by comparison of sequence similarity to other strains and are coloured the same as the recombinant. Different colours are used for recombinant blocks at the same region for clarity.
This study was supported by a University of New South Wales GoldStar award. We thank David Ryan and Pei Pei Gan for technical assistance. We thank the reviewers for helpful comments.
Hughes et al. (1994) Sequence analysis of the Vibrio cholerae acfD gene reveals the presence of an overlapping reading frame, orfZ, which encodes a protein that shares sequence similarity to the FliA and FliC products of Salmonella. Gene 146: 79–82.