silico generation of putative ORFs from B. pseudomallei genome
The B. pseudomallei genome sequence downloaded from Sanger Institute (ftp://ftp.sanger.ac.uk/pub/pathogens/bps/) contained genome sequence contigs being assembled in FASTA format. At the time of analysis, there are 22 contigs (Figure 2).
Figure 2 Genome sequence data of B. pseudomallei. The sequence file contains currently assembled 22 contigs in FASTA format. The contig list was generated by a programme written in Perl (Appendix I).
The bacterial genetic code (translation table 11, for bacteria) was used in the in silico generation of ORFs. In our current analysis, the minimal length of an ORF was set to be 150 base pairs. The function generated two files that contain the nucleotide sequences and the amino acid sequences of all the putative ORFs respectively, both of which are in FASTA format and contain 53,516 ORFs. The contig number was calculated using the Perl script shown in Appendix I.
Sequence analysis of
ORFs – Karlin’s five criteria
The ORFs were analysed according to the five principles proposed by Karlin. The JAVA programme (Appendix II) takes the sequence of every ORF and computes the G+C frequency , genome signature profile , codon usage and amino acid usage , and compares them with the respectively genome average value.
As the genome sequence was in the form of 22 contigs at the time of analysis, the genome average values used for codon usage and amino acid usage were based on the average values of all the ORFs generated, while the genome average values used for the computation of G+C frequency and genome signature profile were based on the combination of all the 22 contigs.
1. G+C frequency
To compute G+C frequency of each ORF, the programme uses the following formulae
where the denominator is the number of bases of the ORF.
To compute the average genome G+C frequency, the programme calculates
The programme then compares the G+C frequency of an ORF and that of the genome by calculating
for every ORF and outputs the value.
2. Genome signature contrast
Genome signature profile consists of the array of dinucleotide relative abundance values
where is the frequency of the dinucleotide XY, and and are the frequency of X and Y respectively. As there are four types of nucleotide (A, T, C and G), there are 16 such values for each ORF and for the whole genome.
The programme computes the genome signature difference between each ORF and the average value of the whole genome
where sums up the genome signature different between each ORF and the entire genome for all the 16 types of XY dinucleotides, and outputs the value for every ORF.
3. Codon usage contrast
Assume that an amino acid Z is encoded by n types of codons, and the frequency of the usage of each type of codon is . Therefore, the codon usage of amino acid Z can be represented by an array of n numbers , and .
The codon usage contrast between an ORF and the entire genome is
where is the frequency of amino acid Z in this ORF, and sums up the codon usage difference for all the 20 common amino acids.
4. Amino acid contrast
Divergence in amino acid usage (amino acid contrast) between every ORF and the entire genome was calculated by
where is the frequency of amino acid Z in this ORF or in the genome, and sums up the amino acid contrast for all the 20 common amino acids.
5. Putative alien gene cluster
To assess if the ORFs generated are putative alien, the currently known ribosomal proteins genes (RP), translation and transcription factor genes (TF) and chaperone genes (CH) of B. pseudomallei (or related species B. cepacia) were retrieved from GenBank. There are currently 115 nucleotide entries of B. pseudomallei in the GenBank. Ribosomal protein genes, translation and transcription factor genes (TF) and chaperone genes (CH) were selected (Table 1).
Table 1 Ribosomal proteins genes (RP), translation and
transcription factor genes (TF) and chaperone genes (CH) selected for
codon bias analysis
Many entries that encode ribosomal RNAs (rRNAs) but are not
translated into proteins are not selected.
1. The programme
programme consists of three major parts Util.java,
shown in Appendix II.
Util.java provides some basic functions, including counting the number of
occurrence of a base, a dinucleotide, a codon, and an amino acid.
It also contains a bacterial translation table (translation table
number 11). These functions
are imported by the other two programmes that calculate the values for the
five criteria proposed by Karlin.
ORF.java contains the functions that calculate the G+C frequency , genome signature profile , codon usage and amino acid usage for a given ORF. It is also imported by Analyzer.java.
is the major part of the programme. Once
executed, the programme will prompt user for several input files and
countContig calculates the average G+C frequency
and genome signature profile
of the entire genome based on
the 22 contigs. countORF calculates
and amino acid usage
of the entire genome based on
the average value of all the putative ORFs. In consideration of the speed of computation, we let
and countORF process
contigs and ORFs respectively in different manners.
We noticed that most of the contigs are extremely long in the original contig file. Therefore, when processing the information of a contig (number of bases, number of G and C, number of dinucleotides, etc), countContig reads the contig sequence line by line without concatenating all the lines, and counts the various values of that line. After this, it moves to the next line and count the same values. At the same time, it adds up the values of all the lines. Thus, although JAVA does not have a limit on how long a string can be, we avoid such a situation where a very long string (corresponding to the entire sequence of a contig) is stored in the memory buffer, as this would dramatically slow down computation. However, the length of an ORF is usually short. Therefore, countORF performs the counting by concatenating all the lines of an ORF into a string and computes the various values of this string when it encounters the last line of this ORF.
Analyzer.java calculates both the values of every ORF and the average values of all the ORFs at the same time after it finishes reading the entire ORF file. So it processes this large file by scanning it only once, making the computation less time-consuming.
the end of the programme,
the amino acid sequence and 7 values following the nucleotide sequence of
each ORF, in a file that is specified by the user.
The 7 values are
(1) (Equation 1c)
(2) (Equation 2b)
(3) (Equation 3)
(4) (Equation 4)
(5) (Equation 2b, except that the ORF is compared with the average value of the selected ribosomal protein genes)
(6) (Equation 2b, except that the ORF is compared with the average value of the selected translation and transcription factor genes)
(Equation 2b, except
that the ORF is compared with the average value of the selected chaperone
format of the output file for one ORF is shown as an example in Figure 3.
Figure 3 Sample output of sequence analysis programme. The output file is specified by the user at the start of the computation. For each ORF, the amino acid is translated by the programme, followed by 7 values (see text for description).
2. Exception handling
Several possible exceptions were taken into consideration.
Firstly, when calculating genome signature profile (Equation 2a), if the frequency of single nucleotide X or Y is 0 (i.e., X or Y does not appear at all in the sequence.), the denominator in Equation 2a is 0. The programme sets to be 0, as dinucleotide XY will not appear in the sequence in this case.
Secondly, when calculating codon usage for
a particular amino acid in a sequence, if this amino acid does not appear
in the sequence, we will again have a case where the denominator of codon
usage of condon i that encodes amino acid Z
is 0. Since in such a case, amino acid Z does not appear at all, the programme sets to be 0 again.
Thirdly, when calculating dinucleotide frequency, if the two nucleotides X and Y are the same (for example, if they are both G), the programme counts the number of dinucleotide XY (i.e., GG) in such a sequence as “GGGTGGGA” to be 4.