Identification of bacterial rep-PCR genomic fingerprints using a backpropagation neural network

Fei Ni Tuang, Jan L.W. Rademaker, Evangelyn C. Alocilja, Frank J. Louws, Frans J. de Bruijn
DOI: http://dx.doi.org/10.1111/j.1574-6968.1999.tb13740.x; pp. 249-256; first published online: 1 August 1999

Abstract

A backpropagation neural network (BPN) was used to identify bacterial plant pathogens based on their genomic fingerprints. Genomic fingerprint data, consisting of complex DNA band patterns generated by rep-PCR using BOX, enterobacterial repetitive intergenic consensus (ERIC) and repetitive extragenic palindromic (REP) primers, were used to train three independent BPNs. Ten strains of the genus Xanthomonas, each with a characteristic host plant range, were identified correctly using the three trained BPNs. When the three BPN classifiers were combined and tested with fingerprints of bacterial strains not present in the training sets, the rejection rate was 100%. Thus, BPN protocols can be employed to generate a powerful computer-based system for the identification of pathogenic bacteria in the genus Xanthomonas.

Keywords
  • Backpropagation neural network
  • Bacterial identification
  • rep-PCR
  • Genomic fingerprint
  • Xanthomonas

1 Introduction

Bacterial plant pathogens can be difficult to identify in a timely manner. This is particularly true for strains classified within the genus Xanthomonas, which are able to cause disease on over 390 plant species. The genus comprises over 140 distinct pathogenic lineages that in many cases cannot be reliably differentiated for phylogenetic classification purposes by their cellular metabolism or other phenotypic characteristics [1]. Strains within a taxon are differentiated primarily on the basis of the plant host that they parasitize or by the type of symptoms produced on a diseased host plant [2]. The taxa are named at the species level based on DNA homology experiments [3] or at the subspecies level as pathogenic variants or pathovars. However, with the advent of DNA-based genotypic analysis methods, major advances in the reliable differentiation and classification of these host-adapted strains have been made. For example, it has recently been shown that repetitive DNA sequence-based polymerase chain reaction (rep-PCR) genomic fingerprinting can be used for the identification and classification of Xanthomonas species/pathovars [4]. Rep-PCR uses repetitive elements dispersed throughout the chromosome of diverse bacterial species [5,6] as primer sites for PCR, yielding amplicons of distinct sizes that together form a specific genomic fingerprint for each Xanthomonas species or pathovar [7]. In this report, primers targeting the highly conserved repetitive sequences from the BOXA subunit of the BOX element of Streptococcus pneumoniae (BOX), enterobacterial repetitive intergenic consensus (ERIC) and repetitive extragenic palindromic (REP) sequences were used in rep-PCR reactions ([8] and references therein) to generate genomic fingerprints of 24 Xanthomonas strains, comprising 10 pathovars within seven of the 20 species, as described by [3].

A backpropagation neural network (BPN) is a connectionist network that can be trained to identify complex patterns. The neural network approach has been applied successfully for pattern identification in several areas of biological research. Examples of such applications are the protein classification artificial neural system developed by Wu et al. [9], backpropagation and counter propagation neural networks for phylogenetic classification of ribosomal RNA sequences developed by Wu and Shivakumar [10] and a neural network for classification of mass spectra developed by Curry and Rumelhart [11]. Bungay and Bungay [12] as well as Rataj and Schlinder [13] have used neural networks for the identification of bacteria at the beginning of this decade. These explorations were continued by Chun et al. [14] who used artificial neural network analysis of pyrolysis mass spectrometric data in the identification of Streptomyces strains. Carson et al. [15] explored the recognition of pathogenic Escherichia coli O157:H7 PFGE restriction patterns by artificial neural network and Goodacre et al. [16] used infrared spectroscopy and artificial neural networks for rapid identification of Streptococcus and Enterococcus species. Bertone et al. [17] developed automated systems for identification of heterotrophic marine bacteria on the basis of their fatty acid composition. Furthermore, Gyllenberg and Koski [18] explored a taxonomic associative memory based on neural computation.

The BPN approach has several advantages over conventional numerical methods. Conventional numerical methods require the application of cluster analysis or sophisticated libraries, which usually involve pairwise comparison of all fingerprints in a collection [7]. In contrast, with a BPN the computation-intensive pattern recognition is performed up front, during the training stage. Once trained, a BPN can identify patterns using its internal connections, without searching a database to compare an unknown pattern with each individual library entry.

It is hypothesized in this research that computer-assisted pattern analysis using a BPN would be a valuable alternative for the identification of Xanthomonas species/pathovars using rep-PCR genomic fingerprints as input data. The objective of this paper is to describe the development of a procedure for the identification of bacteria based on rep-PCR genomic fingerprints and a backpropagation neural network and to show the application of this system to the analysis of 10 Xanthomonas species/pathovars.

2 Materials and methods

A system flow chart of the BPN identification and classification system is shown in Fig. 1. The BPN formulation and classification process is divided into the following four major parts.

2.1 Information gathering and data processing

Duplicate sets of BOX-, ERIC- and REP-PCR genomic fingerprints were generated for 24 Xanthomonas strains obtained from the BCCM/LMG culture collection of the Laboratorium voor Microbiologie, Universiteit Gent, Gent, Belgium (Table 1). The fingerprint images were digitized using a flatbed scanner. GelCompar (Windows-based gel image processing software, version 3.1; Applied Maths, Kortrijk, Belgium) was used to segment and normalize the digitized images as previously described [7].

Table 1. Strains analyzed with rep-PCR genomic fingerprinting

| Organism | LMG strain # | Geographical origin | Host |
|---|---|---|---|
| X. arboricola corylina | 688 | USA | Corylus avellana |
| | 689 | USA | C. maxima |
| | 8658 | Unknown | C. maxima |
| | 8660 | UK | C. avellana |
| X. axonopodis vesicatoria | 905 | Unknown | Unknown |
| | 910 | Morocco | Capsicum |
| | 929 | USA | Lycopersicon lycopersicum |
| X. a. pruni | 852 | New Zealand | Prunus salicina |
| | 8680 | New Zealand | P. dulcis |
| X. campestris barbareae | 547 | USA | Barbarea vulgaris |
| | 7385 | USA | B. vulgaris |
| X. c. incanae | 7421 | USA | Matthiola sp. |
| | 7490 | USA | M. incana |
| X. c. raphani | 7505 | USA | Raphanus sativus |
| | 8134 | USA | R. sativus |
| X. cassavae | 670 | Unknown | Unknown |
| | 673 | Malawi | Manihot esculenta |
| | 5264 | Ruanda | M. esculenta |
| X. hyacinthi | 742 | The Netherlands | Hyacinthus orientalis |
| | 8041 | The Netherlands | H. orientalis |
| X. translucens hordei | 882 | Canada | Hordeum vulgare |
| | 8279 | New Zealand | H. vulgare |
| X. vesicatoria | 911 | New Zealand | L. lycopersicum |
| | 920t1 | Italy | L. lycopersicum |
  • Name as proposed by [3].

The normalized rep-PCR genomic fingerprint of each bacterial strain was exported from GelCompar as a densitometric curve of 400 data points, with each point representing a gray level ranging from 0 to 255. A sample densitometric curve and the associated fingerprint are shown in Fig. 2. The 400 data points of each fingerprint were further linearly normalized into the range of 0–2 by dividing each data point by 127.5, since the BPN performs better with input data within this scale.
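The linear rescaling described above can be illustrated with a minimal Python sketch. The function name and the sample values are hypothetical and are not taken from the original Matlab implementation; only the divisor 127.5 and the 0 to 255 gray-level range come from the text.

```python
def normalize_curve(gray_levels):
    """Scale gray levels (0-255) linearly into the range 0-2."""
    if any(g < 0 or g > 255 for g in gray_levels):
        raise ValueError("gray levels must lie between 0 and 255")
    # Dividing by 127.5 maps 0 -> 0.0 and 255 -> 2.0.
    return [g / 127.5 for g in gray_levels]


# Made-up sample points for illustration, not real fingerprint data.
sample = [0, 51, 255]
normalized = normalize_curve(sample)
```

A gray level of 255 maps to 2.0 and a gray level of 0 stays at 0.0, so the full densitometric curve ends up in the 0–2 interval.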

Fig. 2. A sample BOX-PCR genomic fingerprint of X. a. corylina LMG 688.

In addition, all training data were assigned a target classification (Xanthomonas species or pathovar) represented in binary form [19]. The nth target classification vector consisted of one 0.8 and nine 0.2s, with the 0.8 in the nth position. The data were divided equally into training and testing data sets. For all 24 Xanthomonas strains, independent duplicates of the BOX-, ERIC- and REP-fingerprints were used to train and test the BPN. Twenty-two strains classified as species or pathovars not included in the training sets were presented to the BPN in order to test the rejection capability.
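As an illustration of this target encoding, the following Python sketch builds the vector for a given class. The function name is hypothetical, and the zero-based class index is an assumption made for the example; the text itself counts positions from one.

```python
def target_vector(class_index, n_classes=10, on=0.8, off=0.2):
    """Return a target vector with `on` at class_index and `off` elsewhere.

    class_index is zero-based here (an assumption for this sketch).
    """
    if not 0 <= class_index < n_classes:
        raise ValueError("class_index out of range")
    vector = [off] * n_classes
    vector[class_index] = on
    return vector


# Target for the third of the ten species/pathovars.
third_class_target = target_vector(2)
```

Using 0.8 and 0.2 rather than 1 and 0 keeps the targets away from the asymptotes of a logistic output unit, which can only approach 0 and 1.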

2.2 BPN modeling and training

Three independent BPNs with the same architecture were set up and trained with the BOX-, ERIC- and REP genomic fingerprints, respectively. The BPNs were multilayer feed-forward neural networks made up of an input layer, a hidden layer and an output layer [20]. A diagram of the BPN architecture is shown in Fig. 3. The input layer of each BPN contained 360 input units, corresponding to data points 31–390 of the 400-point rep-PCR genomic fingerprint densitometric curve. This segment represents DNA fragment sizes between 210 bp and 7.5 kbp. The first 30 and last 10 points were left out, as they contained little relevant information. The hidden layer contained 10 hidden units. The output layer was made up of 10 output units, each assigned to represent one Xanthomonas species/pathovar. The number of hidden units (10) was determined by trial and error. This number is important, since it determines the capacity of the BPN to learn the different rep-PCR genomic fingerprint patterns. However, a large number of hidden units can cause ‘overfitting’: a network trained too specifically to the training data will not perform well when presented with other data sets. In addition, a larger network increases the computational complexity and the training time required. Thus, it is desirable to keep the network as small as possible while maintaining the ability to learn the correct patterns. All of the units had a sigmoidal logistic activation function [19], which is non-linear.
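The 360-10-10 architecture can be sketched in plain Python as a single forward pass. This is a minimal illustration under stated assumptions, not the Matlab toolbox code used in the study: the helper names, the random initial weights, the weight scale and the presence of bias terms are all assumptions, and the input curve is random rather than a real fingerprint.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 360, 10, 10


def logistic(x):
    """Sigmoidal logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def select_window(curve_400):
    """Keep data points 31-390 (counted from 1) of a 400-point curve."""
    if len(curve_400) != 400:
        raise ValueError("expected a 400-point densitometric curve")
    return curve_400[30:390]


def layer(inputs, weights, biases):
    """One fully connected layer followed by the logistic activation."""
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, w_hidden, b_hidden, w_output, b_output):
    """Propagate 360 inputs through the hidden layer to 10 output scores."""
    return layer(layer(inputs, w_hidden, b_hidden), w_output, b_output)


def random_weights(n_rows, n_cols, rng, scale=0.1):
    """Small random weights; the scale is an arbitrary choice for this sketch."""
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]


rng = random.Random(0)
w_hidden = random_weights(N_HIDDEN, N_INPUT, rng)
w_output = random_weights(N_OUTPUT, N_HIDDEN, rng)
curve = [rng.uniform(0.0, 2.0) for _ in range(400)]  # stand-in for real data
scores = forward(select_window(curve), w_hidden, [0.0] * N_HIDDEN,
                 w_output, [0.0] * N_OUTPUT)
```

With untrained weights the ten scores carry no meaning; the sketch only shows how a 400-point curve is reduced to 360 inputs and mapped to one score per species/pathovar.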

The Matlab Neural Network toolbox, running under the Matlab 5.0 (The Math Works, 1997) environment, was used to simulate the BPN. A supervised training algorithm using the gradient descent method was used to train the BPN in batch mode [21]. An epoch represents one complete pass of the entire training set through the BPN training process [22]. The connection weights were updated after every epoch using the delta rule, in order to minimize the error between the BPN output and the target output [22]. Adaptive momentum and learning rates [23] were applied. Starting with a momentum of 0.1 and a learning rate of 0.003, the system increased the learning rate by a factor of 1.05 if the new training error was lower than the training error in the previous epoch. However, if the new training error was greater than or equal to the previous training error, the learning rate was decreased by a factor of 0.7 and the momentum was set to zero. With an adaptive learning rate and momentum, the BPN converged faster. The status of the training was determined by analyzing the training and generalization sum square errors. The generalization error was evaluated every 500 epochs by cross-validation with test data sets that were not used for training. The BPN was trained until the new generalization error was greater than the previous generalization error, in order to prevent over-training. The interval of 500 epochs was chosen so that the decision interval was large enough to avoid ending the training at a local minimum of the training/generalization error.
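The learning-rate adaptation and the stopping criterion can be summarized in a short Python sketch. The function names and the example error values are hypothetical. Restoring the momentum to its starting value after an improving epoch is an assumption, since the text only specifies the initial momentum and the reset to zero.

```python
def adapt_rates(learning_rate, prev_error, new_error,
                base_momentum=0.1, increase=1.05, decrease=0.7):
    """Return (learning_rate, momentum) to use for the next epoch."""
    if new_error < prev_error:
        # Training error fell: enlarge the step size.
        # Returning base_momentum here is an assumption of this sketch.
        return learning_rate * increase, base_momentum
    # Training error rose or stayed equal: shrink the step, drop momentum.
    return learning_rate * decrease, 0.0


def should_stop(prev_gen_error, new_gen_error):
    """Stop once the generalization error has risen since the last check."""
    return prev_gen_error is not None and new_gen_error > prev_gen_error


# Made-up error values for illustration.
learning_rate, momentum = adapt_rates(0.003, prev_error=1.20, new_error=1.10)
```

In a training loop, `adapt_rates` would be called after every epoch and `should_stop` after every 500th epoch, with the generalization error computed on the held-out duplicate fingerprints.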

2.3 BPN testing and strain classification

A testing procedure is essential to check the performance of the BPNs generated from the training protocol. Therefore, BPNs for the BOX-, ERIC- and REP-PCR data types were set up for testing, using the corresponding set of final connection weights obtained during the training period. When presented with a test genomic fingerprint as input, the trained BPN assigned a corresponding identification. Each output unit of the BPN yielded an output score for the identification it represented, ranging from 0 to 1, with 0.8 and above for a perfect match and zero for no match at all. An identification threshold of 0.55 was selected for evaluation of the system. If the output score of an output unit was greater than the threshold, the bacterial strain being examined was identified as the Xanthomonas species/pathovar corresponding to that particular output unit. In addition, the BPNs were tested for their rejection capabilities with fingerprints of species/pathovars not included in the collection used to train the BPN. Furthermore, the testing results of the three different BPN classifiers were compared and combined to yield the final designation. For improved accuracy, the identification was only considered valid if all three BPNs (based on the REP, ERIC and BOX primer sets) yielded the same result.
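The thresholding and the unanimous-vote combination can be sketched as follows. The function names are hypothetical, and picking the highest-scoring output unit when several exceed the threshold is an assumption; the text does not describe that case. The labels in the usage lines are taken from the results tables for LMG 688 and LMG 8047.

```python
THRESHOLD = 0.55


def identify(scores, labels, threshold=THRESHOLD):
    """Return the label of the best-scoring unit, or None if it is rejected."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Assumption: only the highest-scoring unit is considered.
    return labels[best] if scores[best] > threshold else None


def combine(box_id, eric_id, rep_id):
    """Accept an identification only if all three BPNs agree on it."""
    if box_id is not None and box_id == eric_id == rep_id:
        return box_id
    return None


# LMG 688: all three BPNs report X. a. corylina, so the result is accepted.
accepted = combine("X. a. corylina", "X. a. corylina", "X. a. corylina")
# LMG 8047: the BPNs disagree, so the combined result is a rejection.
rejected = combine("X. a. pruni", "X. a. corylina", "X. a. corylina")
```

Requiring agreement trades some sensitivity for specificity: a strain outside the training set is rejected as soon as one BPN rejects it or names a different pathovar.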

2.4 Graphical user interface

A graphical user interface was developed using Visual Basic 5.0 (Microsoft, 1997) to facilitate the classification process. Using the window interface, users can select input data files, corresponding rep-PCR genomic fingerprint types, as well as the identification threshold. The interface was linked to Matlab to simulate the BPNs. The results of the three different rep-PCR genomic fingerprint data types could thus directly be compared using this interface.

3 Results and discussion

The numbers of training epochs were 7000, 2000 and 22 000 for the BOX-, ERIC- and REP-BPNs, respectively. As shown in Table 2, the REP-BPN converged to a much lower training and generalization error after the largest number of training epochs. This may be explained by the fact that the REP-fingerprint patterns have a lower complexity (number of bands and variation in intensity) than the BOX- and ERIC-fingerprints.

Table 2. BPN training and classification results

| | BOX-BPN | ERIC-BPN | REP-BPN | Combined BPNs |
|---|---|---|---|---|
| Number of epochs | 7000 | 2000 | 22 000 | |
| Final training sum square error | 0.00013 | 0.00063 | 0.00008 | |
| Final generalization sum square error | 1.799 | 1.102 | 0.513 | |
| Recognition rate | 100% | 100% | 100% | 100% |
| Rejection rate of unknown pathovars | 73% | 73% | 50% | 100% |

When tested with 24 independent fingerprints of bacterial strains belonging to the 10 trained species/pathovar classifications, the three BPNs identified all 24 bacterial strains correctly at an identification threshold of 0.55, achieving a 100% recognition rate. As shown in Table 3, the REP-BPN produced higher output scores, with an average of 0.77, than the BOX- and ERIC-BPNs, which each averaged 0.70. This agrees with the observation that the REP-BPN converged to a lower generalization error.

Table 3. Neural network identification results and identification levels obtained by the BOX-, ERIC- and REP-BPN after submission of fingerprints not used to train the BPN

| Species/pathovar | Strain LMG # | BOX-BPN | ERIC-BPN | REP-BPN | Combined identification |
|---|---|---|---|---|---|
| X. arboricola corylina | 688 | 0.74 | 0.70 | 0.82 | X. a. corylina |
| | 689 | 0.70 | 0.66 | 0.80 | X. a. corylina |
| | 8658 | 0.67 | 0.71 | 0.74 | X. a. corylina |
| | 8660 | 0.66 | 0.73 | 0.81 | X. a. corylina |
| X. a. pruni | 852 | 0.56 | 0.69 | 0.71 | X. a. pruni |
| | 8680 | 0.60 | 0.78 | 0.80 | X. a. pruni |
| X. axonopodis vesicatoria | 905 | 0.81 | 0.66 | 0.78 | X. a. vesicatoria |
| | 910 | 0.80 | 0.68 | 0.79 | X. a. vesicatoria |
| | 929 | 0.78 | 0.56 | 0.77 | X. a. vesicatoria |
| X. campestris barbareae | 547 | 0.75 | 0.71 | 0.79 | X. c. barbareae |
| | 7385 | 0.75 | 0.71 | 0.74 | X. c. barbareae |
| X. c. incanae | 7490 | 0.57 | 0.69 | 0.80 | X. c. incanae |
| | 7421 | 0.59 | 0.69 | 0.76 | X. c. incanae |
| X. c. raphani | 7505 | 0.81 | 0.71 | 0.72 | X. c. raphani |
| | 8134 | 0.83 | 0.73 | 0.79 | X. c. raphani |
| X. cassavae | 5264 | 0.69 | 0.76 | 0.77 | X. cassavae |
| | 670 | 0.62 | 0.79 | 0.77 | X. cassavae |
| | 673 | 0.57 | 0.78 | 0.74 | X. cassavae |
| X. hyacinthi | 742 | 0.81 | 0.72 | 0.80 | X. hyacinthi |
| | 8041 | 0.83 | 0.59 | 0.77 | X. hyacinthi |
| X. translucens hordei | 882 | 0.65 | 0.77 | 0.71 | X. t. hordei |
| | 8279 | 0.63 | 0.75 | 0.73 | X. t. hordei |
| X. vesicatoria | 911 | 0.65 | 0.64 | 0.81 | X. vesicatoria |
| | 920t1 | 0.77 | 0.65 | 0.82 | X. vesicatoria |
  • A 100% recognition rate was found with the single and combined BPNs.
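The average output scores quoted in the text can be recomputed from the Table 3 values with a few lines of Python. The variable names are hypothetical; the scores are copied from Table 3 in row order.

```python
# Output scores from Table 3, in row order.
box = [0.74, 0.70, 0.67, 0.66, 0.56, 0.60, 0.81, 0.80, 0.78, 0.75, 0.75, 0.57,
       0.59, 0.81, 0.83, 0.69, 0.62, 0.57, 0.81, 0.83, 0.65, 0.63, 0.65, 0.77]
eric = [0.70, 0.66, 0.71, 0.73, 0.69, 0.78, 0.66, 0.68, 0.56, 0.71, 0.71, 0.69,
        0.69, 0.71, 0.73, 0.76, 0.79, 0.78, 0.72, 0.59, 0.77, 0.75, 0.64, 0.65]
rep = [0.82, 0.80, 0.74, 0.81, 0.71, 0.80, 0.78, 0.79, 0.77, 0.79, 0.74, 0.80,
       0.76, 0.72, 0.79, 0.77, 0.77, 0.74, 0.80, 0.77, 0.71, 0.73, 0.81, 0.82]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


averages = {"BOX": mean(box), "ERIC": mean(eric), "REP": mean(rep)}
# Every score exceeds the 0.55 threshold, matching the 100% recognition rate.
all_recognized = all(score > 0.55 for score in box + eric + rep)
```

The means come out at roughly 0.70 for the BOX- and ERIC-BPNs and 0.77 for the REP-BPN, in line with the averages reported in the text.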

The BPNs were also tested with fingerprints of species/pathovars not included in the trained BPNs (Table 4). The rejection rates were 73, 73 and 50% for the BOX-, ERIC- and REP-BPNs, respectively. It was observed that the BPNs tended to assign a designation even when they were presented with a fingerprint of a species/pathovar not included in the trained classification. In most of those cases, a closely related pathovar of the same species was assigned (Table 4). The REP-BPN was found to have the lowest rejection rate. This is due to the fact that a BPN that is trained too well may become predisposed to favor the desired output patterns and tends to organize itself to produce those output identifications, even when the input pattern is outside of the training domain [24]. An improved rejection rate of 100% was found when the results of the three BPNs were combined (Table 2). Basing the identification decision on the combined outputs of the three BPNs clearly helped to overcome the problem of incorrect identification. This observation is fully consistent with a variety of other studies revealing that the linear combination of fingerprinting data using multiple primer sets improves the robustness of phylogenetic trees [7].

Table 4. Neural network identification results and identification levels obtained by the BOX-, ERIC- and REP-BPN after submission of a collection of fingerprints comprising species and pathovars not used to train the BPN

| Species/pathovar | Strain LMG # | Individual BPN identifications (output score) | Combined identification |
|---|---|---|---|
| X. arboricola juglandis | 8047 | X. a. pruni (0.63); X. a. corylina (0.68); X. a. corylina (0.76) | |
| X. a. populi | 12141 | X. a. corylina (0.68) | |
| X. axonopodis citri A | 682 | | |
| X. a. phaseoli | 7455 | | |
| X. c. aberrans | 9037 | X. c. barbareae (0.70) | |
| X. c. armoraciae | 535 | | |
| X. c. armoraciae | 7383t2 | X. c. barbareae (0.63) | |
| X. c. campestris | 568 | X. c. barbareae (0.64) | |
| X. cucurbitae | 8662 | X. c. barbareae (0.55) | |
| X. hortorum pelargonii | 7314 | X. a. vesicatoria (0.64); X. a. corylina (0.56) | |
| X. melonis | 8672 | | |
| X. pisi | 847t1 | X. c. barbareae (0.62); X. a. vesicatoria (0.56) | |
| X. theicola | 8684 | X. hyacinthi (0.75); X. a. corylina (0.61) | |
| X. translucens arrhenatheri | 727t1 | X. t. hordei (0.58); X. t. hordei (0.61) | |
| X. t. cerealis | 679 | | |
| X. t. phlei | 730 | X. t. hordei (0.67); X. vesicatoria (0.57) | |
| X. t. phleipratensis | 843 | X. a. pruni (0.55); X. t. hordei (0.57) | |
| X. t. poae | 728 | X. vesicatoria (0.56) | |
| X. t. secalis | 883 | X. cassavae (0.56) | |
| X. t. translucens | 876 | X. t. hordei (0.71) | |
| X. t. undulosa | 892 | | |
| X. vasicola holcicola | 7416 | | |
  • A 100% rejection rate was found combining the single identification results.

BOX-, ERIC- and REP-PCR genomic fingerprint-based BPNs were developed. The BPNs achieved recognition rates of 100%, both individually and when combined. Rejection rates of 73, 73 and 50% were found for the BOX-, ERIC- and REP-BPNs, respectively, when tested with fingerprints of pathovars not included in the trained pathovar classification scheme. In addition, a 100% rejection rate was found with the three BPNs combined. Two BPNs were also sufficient to achieve this, except for the identification of LMG 727 with the BOX- and REP-BPN combination and of LMG 8047 with the ERIC- and REP-BPN combination.

The speed and accuracy of the BPN approach as applied to the limited collection of strains described here is comparable with conventional clustering methods (e.g. product moment/UPGMA) [7]. The identification of a single strain in a larger collection would be substantially faster using a BPN than conventional cluster analysis, since the latter requires pairwise comparison to all members of a database, while for a BPN, the database is only required at the training session. Therefore, conventional clustering methods occupy considerably more computer disk space than BPNs, although the computer programs themselves are of similar size. An additional potential advantage is that a BPN-based system of pattern analysis can be relatively cost effective.

Our results show the promise of using BPN as an alternative protocol to identify bacterial strains to a pathovar-specific level based on their BOX-, ERIC- and REP-PCR-generated genomic fingerprints. The quality of the identification can be improved by incorporating more species and pathovars in the analysis as well as including more training data sets to ensure a better coverage of the patterns and to better account for the inherent small variations in rep-PCR-generated genomic fingerprint data.

Acknowledgements

The development of the rep-PCR genomic fingerprinting method for the analysis of plant-associated and soil microbes has been supported by the DOE (DE FG 0290ER20021), the NSF Center for Microbial Ecology (DIR 8809640), Heinz, Roger Seeds, Procter and Gamble, as well as by the Consortium for Plant Biotechnology Research (DE-FC05-92OR22072) and the States of Michigan and North Carolina. We also gratefully acknowledge Drs Vauterin and J. Swings (LMG, Gent, Belgium) for bacterial DNA, Maria Schneider (MSU) for technical support and Jim Lupski (Baylor College of Medicine, Houston, TX, USA) for many useful discussions on rep-PCR genomic fingerprinting.

References

  1. [1].
  2. [2].
  3. [3].
  4. [4].
  5. [5].
  6. [6].
  7. [7].
  8. [8].
  9. [9].
  10. [10].
  11. [11].
  12. [12].
  13. [13].
  14. [14].
  15. [15].
  16. [16].
  17. [17].
  18. [18].
  19. [19].
  20. [20].
  21. [21].
  22. [22].
  23. [23].
  24. [24].