
The Giardia genome project database

Andrew G. McArthur, Hilary G. Morrison, Julie E.J. Nixon, Nora Q.E. Passamaneck, Ulandt Kim, Gregory Hinkle, Melissa K. Crocker, Michael E. Holder, Rebecca Farr, Claudia I. Reich, Gary E. Olsen, Stephen B. Aley, Rodney D. Adam, Frances D. Gillin, Mitchell L. Sogin
DOI: http://dx.doi.org/10.1111/j.1574-6968.2000.tb09242.x Pages 271-273. First published online: 1 August 2000

Abstract

The Giardia genome project database provides an online resource for Giardia lamblia (WB strain, clone C6) genome sequence information. The database includes edited single-pass reads, the results of BLASTX searches, and details of progress towards sequencing the entire 12 million-bp Giardia genome. Pre-sorted BLASTX results can be retrieved by keyword search, and BLAST searches of the high-throughput Giardia data can be initiated from the web site or through NCBI. Descriptions of the genomic DNA libraries, project protocols and summary statistics are also available. Although the Giardia genome project is ongoing, new sequences are made available on a bi-monthly basis so that researchers have access to information that may assist them in the search for genes and their biological functions. The current URL of the Giardia genome project database is http://www.mbl.edu/Giardia.

Keywords
  • Genome sequencing
  • BLAST
  • Genome database
  • Giardia lamblia

1 Introduction

Giardia is a significant, environmentally transmitted, waterborne human pathogen [1–3]. Giardia lamblia is a model organism for genome analysis because of its well-recognized impact on human health, its relatively small genome (approximately 12 million bp distributed across five chromosomes [4]) and the insights it will provide into the origins of nuclear genome organization. Comparisons of several different gene families demonstrate Giardia's basal position in eukaryotic molecular phylogenies [5–8]. Because the divergence of G. lamblia lies close to the transition between eukaryotes and prokaryotes in universal ribosomal RNA phylogenies, Giardia is an opportune model for understanding the genetic innovations that led to the formation of eukaryotic cells. Sequence analyses of the Giardia genome will also address several important questions related to human health, including the number, gene organization and regulation of variant-specific surface protein coding sequences.

2 The Giardia genome project: organization and strategy

The Giardia genome project at the Marine Biological Laboratory, Woods Hole, is part of an NIH Investigator-Initiated Interactive Research Project Grant (IRPG). The genome sequencing component of this IRPG is a collaborative effort between the laboratories of Mitchell L. Sogin (Josephine Bay Paul Center for Comparative Molecular Biology and Evolution at the Marine Biological Laboratory in Woods Hole), Stephen Aley (University of Texas at El Paso), Rodney Adam (University of Arizona at Tucson) and Gary Olsen (University of Illinois at Urbana-Champaign). The Giardia sequencing effort is complemented by its IRPG functional genomics unit ‘Giardia: A Model For Ancient Eukaryote Genome Function’, directed by Frances D. Gillin at the University of California at San Diego.

Our sequencing strategy is a shotgun approach, chosen to maximize recovery of information about genetic diversity in an efficient, cost-effective manner. Shotgun sequencing, as pioneered by The Institute for Genomic Research (TIGR), has proven successful for several microbial genomes. Approximately 900–1100 bases are sequenced from both ends of genomic DNA (gDNA) inserts from G. lamblia strain WB, clone C6, carried in plasmid libraries. The WB strain of G. lamblia, originally isolated by F.D.G. from a patient with chronic symptomatic giardiasis (American Type Culture Collection #50803) [9], belongs to the most representative group of Giardia worldwide. Because it is effectively a reference strain and has no endosymbionts or double-stranded RNA viruses [10], which can be associated with other Giardia isolates, we used G. lamblia WB for our genome sequencing. Small-insert plasmid libraries containing genomic DNA fragments were obtained either by partial enzyme digestion or by random shearing by nebulization (details available at http://www.mbl.edu/Giardia). Single-pass read sequences are obtained using bi-directional, long-read sequencing protocols developed for LI-COR 4200 automated scanners [11]. Single-pass reads are assembled using PHRAP [12], and the contigs are mapped to chromosomes and to BACs containing 180-kbp inserts (by hybridization and BAC-end sequencing) to create a physical map. Our goal is 4-fold coverage of the genome using shotgun sequencing, followed by completion of the genome sequence by directed sequencing. Stable contig consensus sequences will be released to the database once 4-fold shotgun coverage is obtained, at which time we will begin annotation of the genome. Volunteers for a collaborative annotation initiative are most welcome and should contact us at giardia{at}mbl.edu.
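As a rough illustration of why directed sequencing is planned after the shotgun phase, one can assume the idealized Lander-Waterman (Poisson) model of random shotgun sequencing. This model is an assumption made here for illustration and is not described by the project; real libraries are not perfectly random, so the figures are approximate. With a genome size of about 12 Mb and fold coverage c, the expected fraction of bases not covered by any read is e^{-c}:

```latex
\begin{align*}
P(\text{base uncovered}) &= e^{-c} \\
c = 4:\quad e^{-4} &\approx 0.018 \\
\text{expected uncovered bases} &\approx 0.018 \times 12\ \text{Mb} \approx 0.22\ \text{Mb}
\end{align*}
```

Under these assumptions roughly 98% of the genome would be represented at 4-fold coverage, with the remaining gaps left for directed sequencing to close.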

3 The Giardia genome project database site

To accelerate contributions of this genome project to parasitology and evolutionary biology, preliminary data in the form of ‘single-pass reads’ are updated bi-monthly on the Giardia genome project database web site: http://www.mbl.edu/Giardia. All of the single-pass reads are edited to improve accuracy and can be downloaded from the web site, either individually or in multiple sequence FASTA format. Single-pass read accuracy exceeds 99.9% to 750 bases and 99.0% to 1000 bases, with 0.25% overall ambiguity. Sixty percent of single-pass reads are 900 bases in length or greater (Fig. 1). The average base composition is 49.0% G+C. As of the June 2000 data release, the Giardia genome project database contained 37.7 Mb of single-pass read sequences, representing approximately 3-fold coverage of the 12 million-bp genome.
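The coverage figure quoted above follows from simple arithmetic, and G+C content can be recomputed from a downloaded multiple sequence FASTA file. The Python sketch below is illustrative only: the function names and the two toy reads are hypothetical and are not part of the project's own software, whereas the 37.7 Mb total and the 12 million-bp genome size are taken from the text.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse multi-sequence FASTA text into {read name: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the read name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def gc_fraction(seqs: Iterable[str]) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    gc = at = 0
    for seq in seqs:
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def fold_coverage(total_read_bases: float, genome_size: float) -> float:
    """Total sequenced bases divided by the estimated genome size."""
    return total_read_bases / genome_size


# Hypothetical toy input; real reads would come from the downloaded file.
toy = ">read1 example\nGGCCAT\nAT\n>read2\nGCNNAT\n"
records = parse_fasta(toy.splitlines())
print(gc_fraction(records.values()))
print(round(fold_coverage(37.7e6, 12e6), 1))
```

Ambiguous bases such as N are excluded from the G+C denominator here; that is a design choice of this sketch, not a documented convention of the database.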

Figure 1

Histogram of LI-COR single-pass read lengths, based on an April 2000 PHRAP assembly. Read lengths are after removal of cloning vector sequence. Reads shorter than 200 bases are not released to the Giardia genome project web site or to NCBI.
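The release threshold and the read-length summary can be expressed as a short filter. This is a minimal sketch under stated assumptions: the function names and the toy lengths are hypothetical, and only the 200-base and 900-base cutoffs come from the text.

```python
from typing import List, Sequence


def releasable(read_lengths: Sequence[int], min_length: int = 200) -> List[int]:
    """Keep vector-trimmed reads at or above the release threshold."""
    return [n for n in read_lengths if n >= min_length]


def fraction_at_least(read_lengths: Sequence[int], cutoff: int) -> float:
    """Fraction of reads whose length is at least `cutoff` bases."""
    if not read_lengths:
        return 0.0
    return sum(1 for n in read_lengths if n >= cutoff) / len(read_lengths)


# Hypothetical read lengths after removal of cloning vector sequence.
toy_lengths = [150, 420, 910, 980, 1050]
kept = releasable(toy_lengths)
print(kept)
print(fraction_at_least(kept, 900))
```

Whether the sixty percent figure is computed over released reads or over all reads is not stated in the text; the sketch computes the fraction over whatever list it is given.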

Pre-sorted results of BLASTX [13] searches of the single-pass reads against the NCBI non-redundant (nr) database are also available. The BLAST output is sorted by clone name, best BLAST score and keyword content of significant matches. Sequences that match a keyword search can be downloaded in FASTA format. Approximately 33% of reads have a BLASTX hit of e−4 or better to entries in the GenBank nr database. BLASTX hits to giardial variant-specific surface proteins (VSPs), responsible for protection from immunological and environmental attack, have been found for 8% of single-pass reads. Other genes or regions occurring in high numbers in the genome (based on BLAST and PHRAP assemblies) include ribosomal RNA tandem repeats, telomeric DNA repeats, protein kinases, ankyrins, tubulins and myosin homologues. Other BLASTX hits include three new giardial VSP genes and examples of a HSP90 homologue, long chain fatty acid CoA ligase, TATA element modulatory factor, ubiquitin conjugating enzyme, protein disulfide isomerases, eukaryotic initiation factor/DEAD box protein, protein phosphatase PP2, yeast secretory pathway GDP dissociation inhibitor, phosphotyrosyl phosphatase activator, G2-specific protein kinase, serine–threonine protein kinase, vesicle-trafficking protein, 5′ guanylate kinase homologue, uridine kinase and a DNA helicase similar to a protein that binds the switch region of immunoglobulin genes. The current database also contains a wealth of information on genes involved in intermediary metabolism and information processing, e.g. subunits of DNA polymerases I, II and III, elongation factors, synthetases, histones, cytoskeletal proteins and cell division proteins. Because the utility of the database will improve with better sorting criteria, we welcome suggestions for searching and sorting BLAST results. Suggestions may be submitted by E-mail to giardia{at}mbl.edu.
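The sorting and keyword retrieval described above can be sketched as follows. This is a hypothetical illustration, not the database's actual implementation: the BlastHit record layout, the function names and the clone names are assumptions, and the 1e-4 E-value cutoff is our reading of the ‘e−4 or better’ threshold in the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class BlastHit:
    """One BLASTX match for a single-pass read (hypothetical layout)."""
    clone: str
    description: str
    bit_score: float
    evalue: float


def best_hits(hits: Iterable[BlastHit], max_evalue: float = 1e-4) -> Dict[str, BlastHit]:
    """Highest-scoring significant hit for each clone."""
    best: Dict[str, BlastHit] = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # not significant at the assumed cutoff
        current = best.get(hit.clone)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.clone] = hit
    return best


def keyword_search(hits: Iterable[BlastHit], keyword: str,
                   max_evalue: float = 1e-4) -> List[str]:
    """Clone names, sorted, with any significant hit matching the keyword."""
    kw = keyword.lower()
    return sorted({h.clone for h in hits
                   if h.evalue <= max_evalue and kw in h.description.lower()})


# Hypothetical toy results for three clones.
hits = [
    BlastHit("cloneA", "variant-specific surface protein", 210.0, 1e-50),
    BlastHit("cloneA", "protein kinase", 95.0, 1e-20),
    BlastHit("cloneB", "serine-threonine protein kinase", 120.0, 1e-30),
    BlastHit("cloneC", "protein kinase-like", 30.0, 0.5),
]
print(keyword_search(hits, "kinase"))
print(best_hits(hits)["cloneA"].description)
```

The keyword match is applied to every significant hit rather than only the best one, following the text's reference to the keyword content of significant matches.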

Giardia single-pass reads can also be searched using a local BLAST search tool or through NCBI. All the released single-pass reads are deposited in NCBI's high throughput genome sequence database (http://www.ncbi.nlm.nih.gov/HTGS/), where they are assigned GenBank accession numbers and can be searched with NCBI BLAST tools.

Finally, the Giardia genome project database reports the progress towards completion of the project, e.g. average read length and read quality, estimated genome coverage, identification of repeated elements, and number and length of contigs in the current assembly. Database users are invited to join a moderated E-mail list to receive E-mail notification of data releases, submissions to NCBI and new services (details available at http://www.mbl.edu/Giardia). Descriptions of the genomic DNA libraries and project protocols are additionally available on the project's web pages.

4 Policy on use of data

In accordance with the policy statement from the malaria genome project (http://www.sanger.ac.uk/Projects/P_falciparum/), the laboratories contributing to the Giardia genome project have agreed that the preliminary sequence information will be rapidly released to the project web site (http://www.mbl.edu/Giardia) and to NCBI's high throughput genome sequence database (http://www.ncbi.nlm.nih.gov/HTGS/). Data releases do not constitute scientific publication, but rather provide investigators with information that may ‘jump-start’ biological experimentation. Single-pass sequences should not be published in any form (including phylogenetic inferences) without independent confirmation of the primary sequence. Users of this information are encouraged to share their results with the authors in order to improve annotation of the sequence data. Investigators using data from the Giardia genome project should acknowledge the source of information or materials in any publication by citing this publication and by referring to the project database (http://www.mbl.edu/Giardia) or the NCBI high throughput genome sequence database accession numbers.

Acknowledgements

The Giardia genome project is supported by the National Institute of Allergy and Infectious Diseases (Grant numbers AI43273 and AI42488), The G. Unger Vetlesen Foundation, and LI-COR Biotechnology. Additional laboratory assistance has been provided by Bruce Luders, John Darga, Elizabeth Duffy, Margaret L. Bradley and Scott Bressoud.

References

  1. [1].
  2. [2].
  3. [3].
  4. [4].
  5. [5].
  6. [6].
  7. [7].
  8. [8].
  9. [9].
  10. [10].
  11. [11].
  12. [12].
  13. [13].