
Evidence of rare codon clusters within Escherichia coli coding regions

David A. Phoenix, Eugene Korotkov
DOI: http://dx.doi.org/10.1111/j.1574-6968.1997.tb12686.x, pp. 63-66. First published online: 1 October 1997


It is known that there is a high occurrence of rare codons at the start of coding regions. Here it is shown that, although the remainder of the gene is likely to contain a relatively low number of rare codons, rare and non-rare codons do not form a random sequence. Throughout the coding region there is a higher than expected number of rare codon clusters. For example, once a rare codon has occurred, there is a greater chance than expected of the next six codons containing another rare codon. This non-random distribution implies that rare codons may have an as yet unidentified biological role.

  • Translation
  • Rare codon
  • Protein folding
  • E. coli

1 Introduction

In E. coli, codons with low corresponding tRNA concentrations are generally used less frequently than synonymous codons associated with higher tRNA concentrations [1]. The differential use of codons causes a given mRNA sequence to be translated at variable rates in vivo [2]. Rare codons with extremely low tRNA concentrations may even stall translation until the correct tRNA species becomes available [2]. For several proteins it has been proposed that pauses induced by rare codons could allow discrete nascent protein domains to fold, since the partially synthesised protein would have fewer potential interactions available, thus simplifying the folding pathway [3, 4]. Proteins studied include rabbit α and β globins [5], cytochrome c, yeast pyruvate kinase [3], and the photosystem II D1 protein [6].

A high occurrence of rare codons has previously been noted at the start of coding regions [7] although in this and other studies [2, 8, 9] a broad range of codons have been defined as rare based on their frequency of occurrence. Since we are interested in the potential of codons to affect the rate of translation we have chosen the rare codon data set based on codons with low corresponding tRNA levels [1]. Here the distribution of these codons is statistically modelled and the non-random nature of this distribution is interpreted with reference to protein translation and folding.

2 Methods

2.1 Preparation of the database

E. coli coding regions were obtained from the EMBL database. Files containing anti-sense strand information, rRNA, obvious errors or incomplete coding sequences were discarded. 2233 coding regions were selected for analysis. To locate potential pause sites eight codons were chosen [8] which corresponded to minor tRNA concentrations [1]. These were CTA (Leu), ATA (Ile), ACA (Thr), CCT and CCC (Pro), CGG, AGA, and AGG (Arg).
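Locating the eight rare codons in a coding region can be sketched as below. This is a minimal illustration, not the authors' program: the function name and the example sequence are hypothetical, and the sequence is assumed to be the sense strand, in frame from the start codon.

```python
# The eight codons treated as rare in the text (minor tRNA species).
RARE_CODONS = {"CTA", "ATA", "ACA", "CCT", "CCC", "CGG", "AGA", "AGG"}


def rare_codon_positions(cds):
    """Return 0-based codon indices of rare codons in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon, if any
    return [i // 3 for i in range(0, usable, 3) if cds[i:i + 3] in RARE_CODONS]


# Hypothetical sequence (not from the data set): ATG CTA AAA AGG CCC TAA
print(rare_codon_positions("ATGCTAAAAAGGCCCTAA"))  # [1, 3, 4]
```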

2.2 Analysis of rare codon distribution

The occurrence of rare codons was modelled using the Poisson distribution. The probability of a codon position being occupied by a rare codon was assumed to be constant with this parameter being estimated by the observed overall frequency of rare codon occurrence across all of the sequences considered. Based on the level of codon occurrence within the 2233 genes the probability of any given codon being a rare codon was found to be equal to 0.038. The number of DNA sequences expected to contain between zero and six rare codons within a 40 codon bin was calculated and compared to that observed within the sample. This was repeated for codons 1–40, 41–80, 81–120, 121–160, 161–200, 201–240, 241–280, 281–320, 321–360. The sample size varied between bins due to the varying length of the coding regions and this was taken into account when calculating the expected number of occurrences in any given bin.
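The expected counts for one bin can be sketched as follows, assuming a Poisson mean equal to the per-codon probability multiplied by the bin width (0.038 × 40 = 1.52). The function names are hypothetical, and `n_sequences` stands for the number of coding regions long enough to reach the bin in question, which is how the varying sample size enters.

```python
import math

P_RARE = 0.038   # per-codon probability of a rare codon, as reported in the text
BIN_WIDTH = 40   # codons per bin


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_bin_counts(n_sequences, max_k=6):
    """Expected number of sequences with 0..max_k rare codons in a single bin."""
    lam = P_RARE * BIN_WIDTH
    return [n_sequences * poisson_pmf(k, lam) for k in range(max_k + 1)]


# First bin (codons 1-40): every one of the 2233 coding regions contributes.
for k, count in enumerate(expected_bin_counts(2233)):
    print(k, round(count, 1))
```

For later bins the same call would be made with the smaller number of sequences that extend that far.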

The number of codons occurring between two adjacent rare codons was studied for all rare codon separations within the 2233 coding regions. A separation of zero would imply that rare codons aᵢ and aᵢ₊₁ were next to each other with no intervening codons. A separation of one would imply that the two rare codons were separated by one non-rare codon, and so on. The number of times any given separation was observed was noted for separations of 0–140 codons. The observed occurrences were compared to the expected number of times any given separation should occur. Expected occurrences were calculated using Monte Carlo simulation and testing methodology [10]. Random sequences were generated with a total length of over 6×10⁵ codons, although the rare codon distribution was found to be the same using random sequences with a total length of over 2×10⁷ codons (data not shown). The probability of rare codon occurrence was kept the same as that within the 2233 gene data set.
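The separation count and its Monte Carlo expectation can be sketched as below. This is an illustration under stated assumptions rather than the method of [10]: each simulated codon is drawn independently as rare with probability `p_rare`, the function names and the fixed seed are hypothetical, and Python's standard `random` module stands in for whatever generator was actually used.

```python
import random
from collections import Counter


def separations(is_rare):
    """Number of codons between each pair of consecutive rare codons.

    `is_rare` is a sequence of booleans, one per codon. Adjacent rare
    codons give a separation of 0.
    """
    positions = [i for i, flag in enumerate(is_rare) if flag]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


def simulate_separation_counts(lengths, p_rare, seed=0):
    """Tally separations over random sequences with the given codon lengths."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter()
    for n in lengths:
        seq = [rng.random() < p_rare for _ in range(n)]
        counts.update(separations(seq))
    return counts


# Hypothetical usage: 2233 random sequences of 300 codons each.
expected = simulate_separation_counts([300] * 2233, p_rare=0.038)
total = sum(expected.values())
for sep in range(0, 11):
    print(sep, round(100 * expected[sep] / total, 2))
```

In practice `lengths` would hold the actual lengths of the coding regions, so that the simulated sequences match the data set in size as well as in rare codon frequency.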

The level of separation was further modelled by looking at the separation between rare codons aᵢ and aᵢ₊ₙ, where n could be 1, 2 or 3. Separation of aᵢ and aᵢ₊₂ would therefore imply that there were three rare codons aᵢ, aᵢ₊₁ and aᵢ₊₂, with the separation of the two outer codons being observed. These separations were analysed for the initial 40 codon bin, where a high level of rare codon occurrence was noted, and for the remainder of the sequence.
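Extending the separation measure to the n-th following rare codon might look like the sketch below. The function name and the `skip` parameter are hypothetical. The separation is taken as the number of codons lying between the two outer rare codons, so three adjacent rare codons give a separation of one for n = 2.

```python
def cluster_spans(positions, n, skip=0):
    """Separation between rare codon a_i and a_(i+n) for every i.

    `positions` are ascending 0-based codon indices of rare codons.
    Rare codons at indices below `skip` are ignored, e.g. skip=40 to
    leave out the first 40 codons of the coding region.
    """
    kept = [p for p in positions if p >= skip]
    return [kept[i + n] - kept[i] - 1 for i in range(len(kept) - n)]


print(cluster_spans([5, 6, 7], n=2))                  # [1]: three adjacent rare codons
print(cluster_spans([2, 41, 45, 50], n=1, skip=40))   # [3, 4]
```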

3 Results and discussion

The Poisson distribution relates to unpredictable, random events. If, therefore, rare codon occurrence fits this model, it implies that rare codons are randomly distributed. The sequences were divided into 40-codon-wide bins as described in the Methods, and the number of bins containing 0, 1, 2, 3, 4, 5 or 6 rare codons was counted. These counts were compared to those that would have been expected if rare codon occurrence could be described by a Poisson process. Chen and Inouye showed that there is a high occurrence of rare codons at the start of coding regions [9], and in our analysis fewer sequences than expected contained 0–2 rare codons at the start while more than expected contained 3–6 rare codons (results not shown). Interestingly, when the rest of the sequence was analysed, more of the bins than expected contained zero rare codons, and many more bins than expected contained four or more rare codons. A χ² goodness-of-fit test gave a probability of less than 10⁻³⁰ that the expected and observed distributions were similar, which clearly indicates that the rare codons do not follow a Poisson distribution. Hence the distribution of rare codons through the 2233 sequences is not random.

To investigate this further, the Monte Carlo methodology was used. Random sequences were generated containing the same proportion of rare codons as the data set. The distance between any rare codon aᵢ and its nearest neighbour aᵢ₊₁ was noted for all rare codons in the data set with separations of 0–140 codons. The number of times any given separation was observed was compared to the number of times it was expected from the random sequences (Fig. 1). The mean separation observed was 23.4 codons and that expected from the simulations was 25.7 codons. Whether these distributions differed was tested using a χ² goodness-of-fit test, which gave a χ² value of 3116, far in excess of the 0.1% critical value with 139 degrees of freedom. It can be seen from Fig. 1 that rare codons tend to cluster, preferring separations of 1–10 codons. These clusters could well occur upstream of domain boundaries, thus pausing translation while the exposed nascent chain folds. Recently Zhang et al. [6] produced a model to describe the effect of rare codon clusters, and their data implied that clusters containing as few as two rare codons maximised the increase in steady-state density of ribosomes downstream of the 'pause site'.

To investigate clustering further, we observed the separation distances of aᵢ and aᵢ₊₁ (a two codon cluster), of aᵢ and aᵢ₊₂ (a three codon cluster) and of aᵢ and aᵢ₊₃ (a four codon cluster). Since rare codons are known to cluster at the start of coding sequences, this analysis excluded the first 40 codons of the coding region. Fig. 2 shows that rare codons occur within six codons of each other at a higher frequency than expected. Groups of three rare codons have a higher than expected tendency to occur within a 20 codon band, and groups of four rare codons within 40 codons.
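The χ² goodness-of-fit statistic used in these comparisons can be sketched as below. The function name and the toy counts are hypothetical, and the sketch computes only the statistic. The critical value for the relevant degrees of freedom would be taken from tables or a statistics library.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all categories.

    Categories with an expected count of zero are skipped to avoid
    division by zero; in practice such categories would be pooled.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Toy counts, not taken from the data set.
observed = [12, 8]
expected = [10, 10]
print(chi_square_statistic(observed, expected))
```

A statistic far above the critical value, as with the value of 3116 reported here, leads to rejection of the hypothesis that the observed and expected distributions agree.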


Fig. 1. The distance between each rare codon aᵢ and its nearest neighbour aᵢ₊₁ was determined for all 2233 coding sequences. The figure compares the observed (▪) and expected (◻) number of times rare codons were separated by distances of 0 to 30 codons. Expected values were generated using Monte Carlo testing methodology with 2233 randomly generated sequences of the same lengths as the data set. The random sequences contained in total over 600 000 codons. The same distribution was obtained with sequences containing 2×10⁷ codons. The codon composition was the same as in the data set. Values shown represent the percentage of all rare codons with the separation shown. Separations of up to 140 codons were calculated (data not shown).


Fig. 2. The observed (▪) and expected (◻) number of rare codons with a rare codon separation of between 0 and 30 codons. The analysis excluded the first 40 codons. A shows the separation between consecutive rare codons, i.e. rare codons aᵢ and aᵢ₊₁; a separation of zero indicates that the codons are adjacent. B shows the separation between rare codons aᵢ and aᵢ₊₂; hence the minimum separation is one, which indicates that aᵢ, aᵢ₊₁ and aᵢ₊₂ are adjacent. C shows the separation between aᵢ and aᵢ₊₃; hence the minimum separation is two. 2233 E. coli genes were tested for the observed values, and Monte Carlo analysis with 20 000 randomly generated 1000 codon sequences was used to find the expected values by examining codons 41–1000. The codon composition was the same in the random sequences and the data set. Values shown represent the percentage of all rare codons with the separation shown. Separations of up to 100 codons were measured, but beyond 30–40 codons the observed values were all lower than expected (data not shown).

In summary it appears that rare codons do not occur randomly but prefer to cluster within DNA coding regions. This clustering effect is clear both at the start of the coding region and within the remainder of the gene. It may well be that this clustering has a role in the slowing or pausing of translation thus aiding the initiation of translation or protein folding.

