ABCdb: an online resource for ABC transporter repertories from sequenced archaeal and bacterial genomes

Gwennaele Fichant , Marie-Jeanne Basse , Yves Quentin
DOI: http://dx.doi.org/10.1111/j.1574-6968.2006.00139.x 333-339 First published online: 1 March 2006

Abstract

The ATP-binding cassette (ABC) transporters are one of the major classes of active transporters. They are widespread in archaea, bacteria, and eukaryota, indicating that they arose early in evolution. They are involved in many essential physiological processes, but the majority import or export a wide variety of compounds across cellular membranes. These systems share a common architecture composed of four (exporters) or five (importers) domains. To identify and reconstruct functional ABC transporters encoded by archaeal and bacterial genomes, we have developed a bioinformatic strategy. Cross-reference to the transport classification system is used to predict the type of compound transported. A high quality of annotation is achieved by manual verification of the predictions. However, to keep pace with the rapid increase in the number of published genomes, we also include analyses of genomes produced directly by the automated strategy. Querying the database (http://www-abcdb.biotoul.fr) allows easy retrieval of ABC transporter repertories and related data. Additional query tools have been developed for the analysis of the ABC family from both functional and evolutionary perspectives.

Keywords
  • ABC transporters
  • database
  • complete genome
  • genomic

Introduction

The ATP-binding cassette (ABC; Hyde et al., 1990) transporters, also called traffic ATPases (Ames et al., 1990), are one of the major families of active transporters in archaea, bacteria, and eukaryota (Higgins, 1992; Holland & Blight, 1999). The ABC-type ATPase domains, characterized by the presence of the ABC signature motif in addition to the Walker A and B motifs, occur not only in ABC transporters but also in proteins involved in other cellular processes. Some of these domains are present in proteins concerned with nucleotide excision repair (UvrA, Thiagalingam & Grossman, 1993; Goosen & Moolenaar, 2001), DNA mismatch repair (MutS, Obmolova et al., 2000), DNA double strand break repair (Rad50, Hopfner et al., 2000), and chromosomal segregation (SMC, Lowe et al., 2001). Others are involved in control of translation initiation and elongation (Belfield et al., 1995), or confer resistance to macrolide antibiotics through interaction, either directly or indirectly, with tRNA-containing molecules (Holland & Blight, 1999; Dassa & Bouige, 2001).

In archaea and bacteria, the majority of ABC transporters mediate the active uptake or efflux of specific molecules across cellular membranes (Higgins, 1995). Thus, they play an essential role in the adaptation of organisms to their ecological niches through tight control of the substance fluxes between the cell and the external medium. The ABC importers are involved in the uptake of a wide variety of compounds that are required for metabolism (such as sugars, other carbohydrates, amino acids, peptides, polyamines, metal ions, sulfate, iron, and molybdate). The ABC exporters are implicated in the efflux from the cell of waste products or toxins, cell surface components (such as capsular polysaccharides, lipopolysaccharides, and teichoic acids), proteins and other molecules involved in bacterial pathogenesis, such as haemolysin, heme-binding proteins, alkaline proteases and peptide antibiotics (Dassa & Bouige, 2001).

Despite their wide-ranging substrate specificity, the ABC transporters share a common basic architecture composed of four domains (Higgins, 1992; Fath & Kolter, 1993): two nucleotide-binding domains (NBDs) and two hydrophobic membrane-spanning domains (MSDs), also known as transmembrane domains (TMDs). The hydrophilic NBDs interact at the cytoplasmic face of the membrane with the MSDs to supply the energy for active transport through the hydrolysis of ATP to ADP. Typical MSDs form six putative α-helical transmembrane (TM) segments. The homo- or heterodimer of MSDs constitutes the translocation pathway through which the compound crosses the membrane. The MSDs participate in the specificity of the transporter through substrate binding sites. ATP binding and/or hydrolysis are coupled to conformational changes in the MSDs that trigger the unidirectional pumping of the compound across the membrane.

The ABC importers require the presence of a high-affinity solute binding protein (SBP, Shuman & Panagiotidis, 1993). The SBP is located in the periplasm of gram-negative bacteria (Neu & Heppel, 1965). In gram-positive bacteria, it is either anchored to the outside of the cell via N-terminal lipid groups (Perego et al., 1991), or, in a few cases, fused to the transporter itself (van der Heide & Poolman, 2002). In archaea, it is anchored to the cytoplasmic membrane via an N-terminal transmembrane segment (Albers et al., 1999). The SBPs may have two distinct but related functions. The first is to confer high affinity and specificity on the transporter. The second is to confer directionality through stimulation of the ATPase activity (Davidson et al., 1992; Liu & Ames, 1997; Chen et al., 2001; Davidson, 2002). Some ABC transporters possess several distinct SBPs and it was suspected that multiple binding domains broaden the compound specificity of such systems (Higgins & Ames, 1981).

Among the three types of domain, only the NBDs exhibit sequence conservation throughout the ABC super-family. The MSDs and SBPs show only weak sequence conservation and then only at the level of the subfamily.

Unlike eukaryotic ABC transporters, where the NBD and its cognate MSD are fused into a single polypeptide, bacterial and archaeal ABC systems are generally formed by assembly of functional domains encoded by separate genes located in the same neighborhood on the chromosome (Tomii & Kanehisa, 1998; Quentin et al., 1999). However, domain fusions are also known in prokaryotes.

Several studies focused on sequence analysis have revealed that the ABC systems can be arranged in a comprehensive classification that is well correlated with the specificity of compound transport (Tam & Saier, 1993; Saurin & Dassa, 1994; Kuan et al., 1995; Tomii & Kanehisa, 1998; Quentin et al., 1999; Quentin & Fichant, 2000; Dassa & Bouige, 2001). In addition, the congruence observed between the classifications obtained for the three types of domain suggests that the different components of the ABC transporters evolved as a unit, with little or no shuffling of their constituents. Comparative analyses of ABC transporter repertories reveal that ABC transporters have resulted from successive waves of duplications and that the ancestral ABC transporter may have arisen early in evolution, before the differentiation of prokaryotes and eukaryotes, in the last common universal ancestor (Tomii & Kanehisa, 1998; Saurin et al., 1999).

The structural, functional and evolutionary complexity of the ABC transporters in prokaryotic genomes motivated us to develop and maintain a dedicated database. Such a database should be useful for predicting the protein partners and compound specificity of hypothetical transporters obtained through systematic sequencing. It should also be relevant to studies of multi-drug resistance of eukaryotic cells and of virulence and antibiotic resistance in prokaryotes. Moreover, this database constitutes a platform for further evolutionary and structure–function relationship studies.

Data sources

Complete genomes from archaea and bacteria are regularly retrieved from the EMBL-EBI genome pages (http://www.ebi.ac.uk/genomes/) through Perl scripts. Chromosomes are renamed according to the following rule: the first letter of the genus (upper case), followed by the first three letters (lower case) of the species name, a capital letter for the strain (A, B, …) and a two-digit integer for the chromosome number (e.g. the chromosome of Bacillus subtilis strain 168 has been called BsubA01). A lower-case letter is used to refer to extra-chromosomal genetic information (for example, the two sequenced plasmids of Agrobacterium tumefaciens have been named AtumA03p and AtumA04p, since A. tumefaciens possesses two chromosomes). The protein names are composed by joining the chromosome name and the CDS name extracted from the EBI file (e.g. BsubA01.OPPD). Chromosomes and proteins can also be retrieved via their original accession number.
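The naming rule above can be sketched in a few lines of Python. This is only an illustration, not the Perl code used for ABCdb: the function names are hypothetical, and the choice of "p" as the plasmid suffix is inferred from the two Agrobacterium tumefaciens examples.

```python
def chromosome_name(genus, species, strain_letter, replicon_number, plasmid=False):
    """Build a replicon name such as BsubA01 (hypothetical helper).

    genus letter (upper case) + first three letters of the species (lower case)
    + strain letter (upper case) + two-digit replicon number.
    The trailing "p" for plasmids is an assumption based on AtumA03p/AtumA04p.
    """
    name = "{}{}{}{:02d}".format(
        genus[0].upper(), species[:3].lower(), strain_letter.upper(), replicon_number
    )
    if plasmid:
        name += "p"
    return name


def protein_name(replicon, cds_name):
    """Join the replicon name and the CDS name, e.g. BsubA01.OPPD."""
    return "{}.{}".format(replicon, cds_name)


print(chromosome_name("Bacillus", "subtilis", "A", 1))
print(chromosome_name("Agrobacterium", "tumefaciens", "A", 3, plasmid=True))
print(protein_name("BsubA01", "OPPD"))
```

The three calls reproduce the names quoted in the text (BsubA01, AtumA03p and BsubA01.OPPD).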

Data generation and quality control

A first version of the automated strategy applied to partner identification and system reconstruction has been published, together with its validation (Quentin et al., 2002). The identification of the different partners and their classification into functional subfamilies are achieved by applying different bioinformatics methods, based on either motif detection [Mast (Bailey & Elkan, 1995) and Meta-Meme (Grundy et al., 1997)] or similarity searches [BlastP and PsiBlast (Altschul et al., 1997)]. The prediction is achieved by independently applying each method to the set of protein sequences annotated in the genome under study. Once the different partners have been identified and classified, their assembly into a functional system requires that the following two conditions are met: (i) the genes encoding the different partners are closely linked on the chromosome, and (ii) the different domains involved in the ABC system belong to compatible subfamilies. This strategy has been implemented in a framework based on Perl scripts and a database management system (ACeDB). The different steps have been implemented as separate modules launched in sequential order (Fig. 1).
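The two assembly conditions can be illustrated with a minimal Python sketch. It is not the published implementation: the `Partner` record, the `max_gap` threshold and the compatibility test (identical subfamily numbers) are assumptions chosen for illustration, and the protein names are invented.

```python
from dataclasses import dataclass


@dataclass
class Partner:
    protein: str      # protein name (hypothetical examples below)
    gene_index: int   # rank of the gene along the chromosome
    domain: str       # "NBD", "MSD" or "SBP"
    subfamily: int    # subfamily number without the N/M/S prefix


def compatible(a, b):
    # Assumption: two partners are compatible when they share a subfamily number.
    return a.subfamily == b.subfamily


def assemble(partners, max_gap=2):
    """Group partners whose genes are close together and whose subfamilies agree.

    max_gap is an assumed threshold on the difference in gene rank.
    """
    systems, current = [], []
    for p in sorted(partners, key=lambda x: x.gene_index):
        linked = current and p.gene_index - current[-1].gene_index <= max_gap
        if linked and compatible(current[-1], p):
            current.append(p)
        else:
            if current:
                systems.append(current)
            current = [p]
    if current:
        systems.append(current)
    return systems


partners = [
    Partner("GenA01.x3", 12, "NBD", 5),
    Partner("GenA01.x1", 10, "SBP", 5),
    Partner("GenA01.x2", 11, "MSD", 5),
    Partner("GenA01.y1", 40, "NBD", 1),
    Partner("GenA01.y2", 41, "MSD", 7),
]
for system in assemble(partners):
    print([p.protein for p in system])
```

In this toy input the three neighbouring subfamily-5 partners form one system, while the last two proteins stay apart because their subfamilies differ even though their genes are adjacent. Such singletons correspond to the partial systems that are passed on to the expert.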

Fig. 1. Overview of the strategy developed for the identification and reconstruction of functional ATP-binding cassette (ABC) transporters. The procedure relies on a set of modules that are designed to accomplish specific tasks in a sequential order. The modules have been interfaced with a database that collects the outputs and delivers the inputs to each module. The process is circular, since new annotated sequences are used to update the parameters of the methods. Description of the different modules:

  • File retrieval: The genome files are retrieved from the EMBL-EBI genome pages (http://www.ebi.ac.uk/genomes/). This module reduces the delay in updating the database.
  • Data formatting: The flat files are parsed in order to extract and format the data used later in the strategy. These data include the protein sequences but also other information that is directly entered in the database, such as the taxonomy of the organism, brief identification of the gene function, and the identification number in the InterPro database (Apweiler et al., 2000).
  • Identification of the different partners of the system: This is the main step of the strategy. Different bioinformatic methods, based either on similarity searches (BlastP and PsiBlast) or motif identification (Mast and Meta-Meme), are launched simultaneously.
  • False positive removal: The false positives are automatically removed by a procedure based on the BlastP program. Each predicted ABC partner candidate is used as a query against the complete set of proteins encoded by the genomes already annotated in ABCdb. The false positives are identified as queries having their best hits with sequences not annotated as ABC proteins in ABCdb.
  • Domain prediction: The domains constitute the functional units of the transporter and can be found fused on the same protein. To predict their boundaries, the PsiBlast results are used. For the NBDs, the coordinates of the MEME profiles based on the Walker A, Signature and Walker B motifs are preferred when available.
  • Feature annotation: Depending on the domain type, different bioinformatic methods are launched: TopPredII (Claros & von Heijne, 1994) to predict the transmembrane fragments in MSDs and SignalP (Nielsen et al., 1997) to identify the signal peptide in solute binding proteins (SBPs).
  • Partner assembly: The assembly of partners in a functional ABC system is achieved by applying two rules deduced from experimental data: (i) the genes encoding the different partners are closely linked, and (ii) the different domains involved in the ABC system belong to compatible subfamilies.
  • Expert validation: This step is required to release high-quality data, but it is also the bottleneck of the strategy. However, the expert's work is facilitated by summary tables automatically generated by querying the database and by interfaces that ease the modification and addition of data.
  • Learning file update: The data validated by the expert are used to update the parameters of the methods involved in the identification step (PsiBlast profiles and MEME motifs) to increase their performance.
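The false positive removal rule described above (a candidate is rejected when its best BlastP hit is a protein not annotated as ABC) can be sketched as follows. The input layout, the function name and the protein identifiers are hypothetical. Candidates with no hit at all are left unflagged here, which is an assumption, since the text does not cover that case.

```python
def false_positives(hits, abc_annotated):
    """Return the candidates whose best hit is not an annotated ABC protein.

    hits: dict mapping each candidate to a list of (subject, bit_score) pairs,
          e.g. parsed beforehand from BlastP output (parsing not shown).
    abc_annotated: set of subject names already annotated as ABC proteins.
    """
    flagged = set()
    for query, query_hits in hits.items():
        if not query_hits:
            continue  # assumption: no hit means no evidence either way
        best_subject, _ = max(query_hits, key=lambda hit: hit[1])
        if best_subject not in abc_annotated:
            flagged.add(query)
    return flagged


hits = {
    "cand1": [("abcA", 250.0), ("otherX", 90.0)],
    "cand2": [("otherY", 310.0), ("abcB", 120.0)],
    "cand3": [],
}
print(false_positives(hits, {"abcA", "abcB"}))
```

Only `cand2` is flagged in this toy example, because its highest-scoring hit lies outside the annotated ABC set.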

Once the different methods have been executed, their results are automatically compared in order to validate or reject as ABC partners the proteins identified by the individual methods. At the end of the procedure, we obtain the different ABC proteins annotated for functional domains, assembled to form transporters and classified into functional categories. Final decision and intermediate data are automatically entered into the database.

The expert works on these results to either validate or reject the choices made by the system. When subfamily ambiguities or missing partners in reconstructed systems (such as an SBP or MSD) are detected, manual intervention is required. The presence of partial systems can be due either to the failure of the methods to identify the missing partner(s) or to a scattering along the chromosome of the genes encoding the different domains. In the first case, the expert analyses the products of the genes located in the neighborhood of those encoding the partners of the partial system. The expert looks for the presence of specific features (such as a minimum of four transmembrane fragments for an MSD, or a signal peptide for an SBP) and checks that the putative partner is not significantly similar to proteins unrelated to ABC transporters. Except for sequencing errors or pseudogenes, the NBDs are systematically recovered with the applied methods due to their higher sequence conservation. In the second case, to assemble into a functional system partners encoded by genes dispersed on the chromosome, we search for occurrences of orthologous systems in other genomes. If several hits are found, we use the modular organisation of these systems as a guide to recover and assemble the orthologous domains encoded by the genes spread over the chromosome.

While high-quality data can only be obtained through this evaluation step, one consequence is to delay addition to the database relative to the published sequences of prokaryotic genomes. Thus, we decided to build two sections in the new release of our database: (i) ABCdb, containing data validated by the expert, and (ii) autodb, which includes the data produced by the automated strategy and checked only to detect and rectify abnormal assemblies.

Database schema and implementation

We administer our data with the ACeDB system (http://www.acedb.org) originally developed by Thierry-Mieg and Durbin to manage the nematode genome project (ACeDB is an acronym for A Caenorhabditis elegans DataBase). The same database schema has been developed for ABCdb and autodb (see Supplementary Material). We designed two main models, Protein and Assembly, which are connected by the class Domain, since the model Assembly describes the ABC transporter in terms of functional domains. The model Protein includes a link to the peptide sequence, and specific tags to store: (i) predicted features such as Walker A and B motifs, transmembrane segments and signal peptide, (ii) the domain organization, and (iii) a link to the InterPro database (Apweiler et al., 2000; Mulder et al., 2005). The results of the annotation procedure are reachable via the class Prediction. The origin of the proteins is modeled as a path through the classes Chromosome, Species, Strain, and Taxon. The taxonomy of the genomes, as found on the NCBI server, has been implemented as a tree (classes TreeNode and Tree). The class Subfamily describes the classification into subfamilies. To name the subfamilies we kept the nomenclature that was first adopted for analysis of the ABC transporter repertories in Escherichia coli (Linton & Higgins, 1998) and B. subtilis (Quentin et al., 1999). It consists of a number preceded by an uppercase letter to indicate the domain type (N, M, and S for NBD, MSD and SBP, respectively). The number can be followed by a lowercase letter that designates a sub-subfamily, a group of members sharing a common ancestor.
An alternative classification has been proposed for the ABC transporters (Dassa & Bouige, 2001), and recently Milton Saier proposed a transporter classification (TC) system (Busch & Saier, 2002; http://www.tcdb.org/), analogous to the EC enzyme classification system except that it incorporates not only functional data but also phylogenetic information. Therefore, we cross-referenced our subfamily classification to the TC system. This is achieved by a link between the classes Subfamily and TC_number. As only proteins that have been experimentally characterized receive a TC number, this cross-reference provides more precise information on the transported compound.
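The subfamily nomenclature described above (domain-type letter, number, optional lowercase letter) is regular enough to be checked with a short parser. The sketch below is an illustration only: the function name is hypothetical, and the labels used in the example are invented strings that merely follow the pattern.

```python
import re

DOMAIN_TYPES = {"N": "NBD", "M": "MSD", "S": "SBP"}
_SUBFAMILY_RE = re.compile(r"^([NMS])(\d+)([a-z]?)$")


def parse_subfamily(label):
    """Split a subfamily label into (domain type, number, sub-subfamily letter).

    The sub-subfamily letter is None when absent.
    Raises ValueError if the label does not follow the nomenclature.
    """
    match = _SUBFAMILY_RE.match(label)
    if match is None:
        raise ValueError("not a valid subfamily label: {!r}".format(label))
    letter, number, sub = match.groups()
    return DOMAIN_TYPES[letter], int(number), sub or None


print(parse_subfamily("N5"))    # illustrative label, not taken from ABCdb
print(parse_subfamily("M12a"))  # illustrative label, not taken from ABCdb
```

Because the number is shared across domain types in this scheme, the parsed number can be used to compare the subfamilies of an NBD, an MSD and an SBP directly.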

Querying the database

The database was designed to answer different kinds of requests according to the objectives of the users. The Simple search option allows direct access to the objects by querying their names or a pattern matching the names. The Text search option retrieves objects by a free text search. For example, SwissProt accession numbers or keywords can be used. The query result appears as a list of object names, and each object can be displayed with a mouse ‘double-click’ on its name. Next, to explore the database further, one can follow the links embedded in the objects. The database can also be queried by evolutionary criteria through the Taxonomy option, which browses the taxonomy tree.

The option Compilation allows users to compute summary tables over chromosomes. One can obtain either a detailed inventory of the ABC transporters encoded by a chromosome (compilation over Assembly), or a table containing all the proteins involved in the ABC systems (compilation over Protein). In the first case, each transporter is described as an assembly of protein domains and is associated with a subfamily class. In the second case, each protein is described by its predicted type of functional domain, its subfamily, its SwissProt accession number, and its brief identification. These searches can be restricted to a given functional subfamily.

For evolutionary studies, we added a tool (Query ABC repertories) that can be used to compute, for species of a chosen lineage, the relative frequency of each functional subfamily of ABC transporters (Fig. 2a). We also designed another query interface (ABC proteins) that allows more flexible data mining. Complex queries can be constructed by combining several entry tags with logical operators. The results are presented as a table according to the given criteria, or as a list of objects if the number of retrieved objects is too large. An example of such a query is given in Fig. 2b. It retrieves unusual ABC proteins that contain more than 12 TM segments but whose domain organization is not predicted as MSD-MSD (duplicated permease domains). The protein's subfamily is also printed.
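The logic of the example query (more than 12 TM segments AND NOT MSD-MSD, with the subfamily printed) can be expressed as a simple filter. This sketch does not use the ABCdb web interface or AcePerl: the `ProteinRecord` fields and the sample records are hypothetical stand-ins for the tags stored in the Protein model.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    name: str
    tm_segments: int     # number of predicted transmembrane segments
    organization: str    # predicted domain organization, e.g. "MSD" or "MSD-MSD"
    subfamily: str


def unusual_membrane_proteins(records, min_tm=12):
    """Select proteins with more than min_tm TM segments that are NOT MSD-MSD."""
    return [
        (r.name, r.subfamily)
        for r in records
        if r.tm_segments > min_tm and r.organization != "MSD-MSD"
    ]


records = [
    ProteinRecord("GenA01.p1", 14, "MSD", "M3"),       # kept: many TMs, single MSD
    ProteinRecord("GenA01.p2", 13, "MSD-MSD", "M7"),   # dropped: duplicated permease
    ProteinRecord("GenA01.p3", 6, "MSD", "M3"),        # dropped: typical TM count
]
print(unusual_membrane_proteins(records))
```

Only the first record satisfies both criteria, mirroring the kind of result table shown for this query.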

Fig. 2. Examples of queries. (a) Result of Query ATP-binding cassette (ABC) repertories on the chosen lineage Lactobacillales. The average frequency of each functional subfamily of ABC transporters observed in all genomes entered into ABCdb (black values under the subfamily names) is compared to the relative frequency computed on the chromosomes of the bacteria belonging to the Lactobacillales lineage. This table presents an overview of the differences observed between the ABC repertories from closely related species. The link on Strain gives access to ABC protein and ABC transporter repertories. (b) Result of Query ABC proteins based on a complex request that selects a set of unusual proteins containing more than 12 predicted transmembrane segments (TMSs) but composed of only one membrane domain (AND NOT MSD–MSD). The retrieved proteins belong to different subfamilies, and a more detailed analysis reveals that the extra TMSs are located in an N-terminal or C-terminal sequence extension when compared to the closest homologues.
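The per-chromosome relative frequencies compared in panel (a) can be sketched as a simple count. The exact definition used by the Query ABC repertories tool is not given in the text, so this sketch assumes that the relative frequency of a subfamily is the number of transporters assigned to it divided by the total number of transporters on the chromosome. The subfamily labels in the example are invented.

```python
from collections import Counter


def relative_frequencies(transporter_subfamilies):
    """Map each subfamily to its share of the transporters on one chromosome.

    transporter_subfamilies: one subfamily label per reconstructed transporter.
    Assumed definition: count for the subfamily / total number of transporters.
    """
    counts = Counter(transporter_subfamilies)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {subfamily: n / total for subfamily, n in counts.items()}


# Hypothetical repertory of four transporters on one chromosome.
print(relative_frequencies(["1", "1", "5", "7"]))
```

Computing the same dictionary over all chromosomes of a lineage, and averaging over all genomes in the database, would give the two sets of values that the table places side by side.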

Finally, we implemented a BlastP form on our web server. The user query is compared to the annotated ABC proteins and html links to the protein objects and subfamilies are added in the BlastP output in order to facilitate the classification and comparison of the query sequence with ABCdb entries.

In summary, while the principal aim of ABCdb is to provide synthetic descriptions of the repertories of ABC transporters in complete archaeal and bacterial genomes, the accompanying tools developed around the database allow flexible and powerful mining of the information.

Conclusion

We have developed a bioinformatic strategy to automatically identify and analyze ABC proteins encoded by complete archaeal and bacterial genomes. The predicted partners are assembled in macromolecular systems and are arranged in our classification system. Cross-reference to the TC classification system helps to predict the type of compound transported by the reconstructed systems. The data released in ABCdb are manually curated in order to ensure high-quality annotation. However, in order to accommodate the rapid increase in the number of published genomes, we have also published in a separate section (autodb) the data produced by the automated strategy. In this version of the database, we have developed new query tools for the analysis of the ABC family from both functional and evolutionary perspectives. All these characteristics make this online resource complementary to other databases relevant to the study of ABC transporters.

In future, ABCdb will be regularly updated with data obtained from newly sequenced genomes. New strategies, making use of orthology links between sequences, will be developed to deal with the problem of reconstructing a system when the partners are encoded by genes scattered along the chromosome. We plan to extend the mining capability of the resource by adding query tools whose results will be processed and graphically represented using the R statistical environment (http://www.r-project.org/index.html).

Availability and requirements

ABCdb is made freely available to academic and non-academic users at: http://www-abcdb.biotoul.fr.

Acknowledgements

We thank Cécile Capponi for advice and encouragement, and Dave Lane for critical reading of the manuscript and helpful comments. We are grateful to Jean Thierry-Mieg and Richard Durbin for the ACeDB system, and to Lincoln D. Stein for AcePerl (Web interface for ACeDB). This work was supported by grants from the French Ministry of Education and the CNRS (Centre National de la Recherche Scientifique).
