Glutamic proteases are a distinct, recently re-classified group of peptidases that are thought to be found only in fungi. We have identified putative glutamic proteases in all fungal species whose genomes have been sequenced so far and analysed the distribution of the more than 20 sequences found. Although absent from the Saccharomycetales, glutamic proteases appear to be present in all other ascomycete species examined. A large number of coding regions for glutamic proteases were also found clustered together in the Phanerochaete chrysosporium genome, despite apparently being absent from three other species of Basidiomycota.
The glutamic protease family, recently re-classified as a sixth catalytic type of peptidase (family G1) in the MEROPS database (http://www.merops.sanger.ac.uk), currently contains peptidases from five species of Ascomycota. These enzymes were previously known as the A4 family of aspartic endopeptidases, but recent analysis of their molecular structure and catalytic mechanism has identified them as a novel protease family, the Eqolisins, a name derived from the active-site residues, glutamic acid (E) and glutamine (Q). Members of this newly recognised family of peptidases have a previously undescribed β-sandwich as a tertiary fold and a unique catalytic dyad consisting of glutamine and glutamate residues which, respectively, activate the nucleophilic water and stabilise the tetrahedral intermediate on the hydrolytic pathway. The only previously isolated examples of glutamic proteases (previously designated acidic or aspartic endopeptidases) are from Scytalidium lignicolum, Aspergillus niger, Cryphonectria parasitica (chestnut blight fungus), Talaromyces emersonii and Sclerotinia sclerotiorum, all filamentous fungal species of the phylum Ascomycota. Of these species, only A. niger has a completed genome sequence, although this is not publicly available (see http://www.dsm.com and http://www.genencor.com). However, a growing number of ascomycete genomes have been sequenced [2–8] and the first from a basidiomycete (Phanerochaete chrysosporium) has been publicly released. Nevertheless, many fungal open reading frames (ORFs) still have unknown functions. Comparative genomics is increasingly being used to examine gene conservation and can be used for improved gene prediction [7,10]. Here, we have used a comparative genomics approach to look at the conservation and evolutionary distribution of genes for this interesting and newly characterised protease family in the fungi.
All predicted ORFs of 100 or more amino acids that begin with a methionine start codon, from each species listed in Table 1, were searched for amino-acid sequences with significant similarity to previously identified glutamic proteases using InterProScan and BLAST searches. GenBank was also searched for homologs of previously identified glutamic proteases. Amino-acid sequences were downloaded from the sources listed in Table 1. The coding sequences for predicted proteases were manually compared using ClustalW alignment. Hierarchical clustering of amino-acid sequence alignments was performed using the MultAlin program with the default (BLOSUM62) matrix. The SignalP program was used to look for predicted prepro signal sequences.
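The ORF pre-filter described above (at least 100 amino acids, beginning with methionine) can be illustrated with a short Python sketch. This is not the pipeline used in the study; the function names `read_fasta` and `filter_orfs` are hypothetical, and the input is assumed to be predicted protein sequences in FASTA format.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the identifier.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_orfs(text: str, min_length: int = 100) -> Dict[str, str]:
    """Keep ORFs of at least min_length residues that start with methionine."""
    kept = {}
    for name, seq in read_fasta(text):
        # Assumption: a trailing '*' marks the stop codon and is not a residue.
        seq = seq.rstrip("*").upper()
        if len(seq) >= min_length and seq.startswith("M"):
            kept[name] = seq
    return kept
```

Only the sequences that pass this filter would then be submitted to the similarity searches; the threshold is exposed as a parameter so that the effect of the length cut-off can be explored.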
The glutamic proteases are quite distinct from previously characterised proteases. This is highlighted by the fact that there are only a handful of significantly similar sequences currently in GenBank (BLASTp e-value ≤0.01), and these represent either the small number of previously isolated glutamic proteases or hypothetical proteins encoded by ORFs from recently sequenced genomes of filamentous fungal species. These hypothetical proteins and other predicted ORFs were included in this study.
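As an illustration of the significance cut-off quoted above (BLASTp e-value ≤0.01), the following Python sketch filters hits from a tab-separated BLAST report. It assumes the 12-column tabular layout written by BLAST+ with `-outfmt 6`, in which the e-value is the eleventh column; the study itself does not state which output format was used, and `significant_hits` is a hypothetical helper name.

```python
import csv
import io
from typing import List, Tuple


def significant_hits(tabular: str, max_evalue: float = 0.01) -> List[Tuple[str, str, float]]:
    """Return (query, subject, e-value) for hits at or below max_evalue.

    Assumes BLAST+ tabular output (-outfmt 6): query id in column 1,
    subject id in column 2 and e-value in column 11.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue <= max_evalue:
            hits.append((query, subject, evalue))
    return hits
```

The cut-off is passed as an argument so that a stricter threshold can be applied without changing the parsing code.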
The 23 fungal genomes (listed in Table 1) were systematically searched for ORFs with sequence similarity to previously isolated examples of glutamic proteases. InterProScan was used to search PROSITE, PRINTS, Pfam, ProDom and SMART for protein signatures in these ORFs (Table 2).
Bold sequences represent previously described genes in MEROPS and GenBank; other sequences can be found on the websites listed in Table 1. Protein signatures were identified with InterProScan from PROSITE, PRINTS, Pfam, ProDom and SMART: IPR000250, Family Peptidase A4; PD018627, PRTA A. niger; PR00977, scytalidopepsin B; PF01828.7, Peptidase A4 family; IPR008958, transglutaminase. Positions of active-site and disulphide bridge residues refer to the Scytalidium lignicolum glutamic protease.
A total of 26 ORFs from seven species with completely sequenced genomes were identified as potentially encoding glutamic proteases, including one previously identified sequence (pepB from A. niger). Investigating the predicted ORFs from more than 20 fungal species showed that glutamic proteases were restricted to the higher ascomycetes and were not found in any of the Saccharomycetales sequenced to date (Table 1). Possible ORFs encoding glutamic proteases were found in all the other ascomycetes sequenced so far from the Eurotiales, Pezizales and Sordariales. The genomes of these species each contain 1–4 ORFs predicted to encode glutamic proteases; these genomes have approximately twice as many predicted ORFs compared to members of the Saccharomycetales.
Further investigation of the predicted glutamic protease sequences (Table 2) showed a high level of conservation between species, with the glutamine and glutamic acid active-site residues resolved by Fujinaga et al. conserved in the amino-acid sequences of most of them. Exceptions to this were two similar ORFs from Neurospora crassa and Magnaporthe grisea (NCU04205.1, MG09032.4) and two adjacent ORFs from P. chrysosporium (pc.12.97.1, pc.12.98.1). The sequences of these ORFs appear to have been correctly predicted and may represent genes that have lost functionality. The P. chrysosporium ORF, pc78.43.1, has a glutamine rather than a glutamic acid residue at the second active-site position, and its product may still be functional. The additional sequences of putative glutamic proteases identified here can be used to create an improved Hidden Markov Model of this protein family for future analyses of other species. All of the ORFs characterised were predicted by SignalP to encode signal peptides, and many were seen to specify the conserved cysteine residues that form a disulphide bridge, which surrounds an aspartic acid residue that is conserved in all members of this family. Within the glutamic protease family, there appear to be two distinct groups, with only half of the ORFs predicted to contain two C-terminal transglutaminase signatures (InterPro domain IPR008958); transglutaminases catalyze the post-translational modification of proteins at glutamine residues.
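Checking whether the active-site residues are conserved amounts to reading, for each aligned ORF, the alignment columns that hold the reference active-site positions. The Python sketch below shows one way this could be done on a gapped alignment. The toy alignment, the sequence names and the residue numbers are invented for illustration and are not the real S. lignicolum coordinates.

```python
from typing import Dict, List

# Invented toy alignment; "ref" stands in for the reference protease.
TOY_ALIGNMENT = {
    "ref": "MQ-AKE",
    "orf1": "MQGAKE",
    "orf2": "MQ-AKQ",
    "orf3": "MA-AKE",
}


def column_for_residue(aligned_ref: str, residue_number: int) -> int:
    """Return the 0-based alignment column that holds the given 1-based
    residue of the ungapped reference sequence."""
    seen = 0
    for col, char in enumerate(aligned_ref):
        if char != "-":
            seen += 1
            if seen == residue_number:
                return col
    raise ValueError("residue number exceeds reference length")


def residues_at(alignment: Dict[str, str], ref_name: str, positions: List[int]) -> Dict[str, str]:
    """Report, for every sequence, the characters aligned to the given
    reference residue numbers (a '-' means the position is gapped)."""
    cols = [column_for_residue(alignment[ref_name], p) for p in positions]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}
```

With the toy data, `residues_at(TOY_ALIGNMENT, "ref", [2, 5])` reports the pair of residues aligned to reference positions 2 and 5 for each sequence, so any ORF whose pair differs from that of the reference can be flagged for closer inspection.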
The product of the Fusarium graminearum ORF, FG08196.12, has much greater amino-acid sequence similarity to other glutamic proteases (including conservation of the glutamic acid active-site residue) if the ORF is considered to have a single exon (ignoring the predicted intron). This assumption is supported by a 99% match to Gz31371614, a full-length sequence from the Phytopathogenic Fungi and Oomycete EST Database (http://cogeme.ex.ac.uk), and represents another example of the value of using alignments to cDNA sequences for correct gene prediction.
The largest number of ORFs encoding putative glutamic proteases was found in the genome of the white rot fungus P. chrysosporium, which contains 12 predicted sequences. This was unexpected, as the Basidiomycota are evolutionarily distant relatives of the Ascomycota and no members of the glutamic protease family have been identified (using tBLASTn) in the other members of the Basidiomycota that have been sequenced (but not yet annotated), namely Cryptococcus neoformans Serotype A, Coprinus cinereus and Ustilago maydis (http://www.broad.mit.edu/annotation/).
Looking more closely at the large number of predicted glutamic proteases in P. chrysosporium, it is clear that the sequences are very highly conserved. The nucleotide sequences of the ORFs are up to 94% identical and, even if the introns are included in the analysis, an identity of 91% is still found. Moreover, there is a large degree of conservation in gene structure, with almost identical sizes and positions of introns and exons (Fig. 2). In addition, the distribution of the ORFs is unusual, with three pairs of genes adjacent to one another in the same scaffold and four within a 22-kb region of scaffold 123 (Fig. 1). The variation in the ORFs adjacent to the predicted glutamic proteases within the scaffolds, together with the quality control measures taken with the genome assembly, suggests that these findings are not the result of sequencing or assembly errors. The glutamic proteases found in other species, although fewer in number, were not located on the same contigs or scaffolds as one another.
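Identity figures such as those quoted above are computed from pairwise alignments. The Python sketch below shows one common definition, in which a gap opposite a base counts as a mismatch and columns gapped in both sequences are ignored. The text does not state which definition was applied, so this choice is an assumption, and `percent_identity` is a hypothetical helper name.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Columns where both sequences have a gap are skipped; a gap opposite
    a residue is counted as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared
```

Because the denominator depends on how gapped columns are treated, identities calculated with and without introns are only comparable when the same convention is used for both.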
Conservation of glutamic protease sequences in P. chrysosporium. (a) Hierarchical clustering of amino-acid sequence alignments. Relative evolutionary distances are shown in PAM units. (b) Comparison of gene structure showing relative nucleotide sizes of introns (lines) and exons (blocks). a Manually deduced from sequence alignments.
Distribution of glutamic proteases in the P. chrysosporium genome. Lines represent genomic and intron sequences. Blocks represent exons in the direction indicated and numbers of the ORF in the scaffold are given below. Relative nucleotide positions in contigs are indicated but are not to scale.
The high level of conservation and the distribution of these ORFs within the P. chrysosporium genome may indicate local duplication events of fairly recent evolutionary origin. In S. cerevisiae, a number of gene families are telomere-associated. The P. chrysosporium genome is believed to be divided into 10 chromosomes, and the current genome assembly contains 29.6 Mb of non-repetitive sequence in 349 scaffolds greater than 3 kb. Although the shotgun sequencing approach excluded telomeres from the assembly, the association (in scaffold 12) of two adjacent ORFs encoding glutamic proteases with AADE, a close homolog of the telomere-associated aryl-alcohol dehydrogenase (AAD) genes of S. cerevisiae, may suggest that this scaffold is close to a chromosome end. The occurrence of these closely related members of a gene family contrasts clearly with the situation in N. crassa, in which the repeat-induced point mutation system appears to restrict the sizes of gene families.
Glutamic proteases have been shown to be responsible for degrading recombinant proteins in A. niger. Silencing by antisense RNA expression or protease removal by gene disruption has resulted in increased yields; further knock-outs of the protease genes identified here may also be advantageous. These data imply that glutamic proteases are not essential for fungal growth, and the distribution of members of this gene family both within and between the sequenced fungal genomes demonstrates the greater variation of non-essential genes compared with the ubiquitous occurrence of the core group of highly conserved orthologs in all species examined. The increasing number and diversity of fungal species for which fully sequenced genomes are available [2–8] is facilitating investigations of gene conservation and the evolution of protein families. The glutamic proteases are interesting to study both in terms of their novel peptidase activity and their apparently unusual distribution among the filamentous fungi.
A.H.S. is the grateful recipient of a CASE Studentship from the BBSRC and Genencor International. S.G.O. acknowledges the contribution of the COGEME Grant from the IGF Initiative of the BBSRC to his research in the bioinformatics of fungi.
(1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
(2002) Silencing of the aspergillopepsin B (pepB) gene of Aspergillus awamori by antisense RNA expression or protease removal by gene disruption results in a large increase in thaumatin production. Appl. Environ. Microbiol. 68, 3550–3559.