In silico generation of putative ORFs from B. pseudomallei genome sequence

The B. pseudomallei genome sequence downloaded from Sanger Institute (ftp://ftp.sanger.ac.uk/pub/pathogens/bps/) contained genome sequence contigs being assembled in FASTA format.  At the time of analysis, there are 22 contigs (Figure 2).

>Burk345h10.q1c 1231 bp, 3 reads

>Burk343c07.q1c 1084 bp, 71 reads

>J7889Bf03.q1c 3307 bp, 433 reads

>J7902Af01.p1c 1099 bp, 79 reads

>J7902Af10.p1c 1021 bp, 21 reads

>J7888Cd05.p1c 1076 bp, 5 reads

>J7903Ae05.p1c 1780 bp, 242 reads

>J7888Cf03.p1c 1604 bp, 7 reads

>Burk1071h12.q1c 451662 bp, 12659 reads

>BurBAC46C2Aa05.p1c 325593 bp, 8636 reads

>J7887Cf07.p1c 1010 bp, 4 reads

>J7888Ac08.p1c 1071 bp, 5 reads

>Burk416b04.p1c 162820 bp, 5721 reads

>Burk1284f07.s2ca4334 955322 bp, 25768 reads

>Burk1288h01.s1c 444285 bp, 10729 reads

>Burk182h11.q2c4226 1077098 bp, 27062 reads

>Burk1274f04.s2ca4253 306966 bp, 8530 reads

>Burk412g10.q1c 1243699 bp, 35649 reads

>J6812Ab06.p1c 1893 bp, 34 reads

>BurBAC68B9Eh02.q1c 199380 bp, 5115 reads

>Burk1125d09.s1c 838854 bp, 20261 reads

>Burk388b06.p1c 1240696 bp, 38583 reads

 

Contig = 22

Length = 7261991

Lines = 90786

Figure 2  Genome sequence data of B. pseudomallei.  The sequence file contains currently assembled 22 contigs in FASTA format.  The contig list was generated by a programme written in Perl (Appendix I).

     The bacterial genetic code (translation table 11, for bacteria) was used in the in silico generation of ORFs.  In our current analysis, the minimal length of an ORF was set to be 150 base pairs.  The function generated two files that contain the nucleotide sequences and the amino acid sequences of all the putative ORFs respectively, both of which are in FASTA format and contain 53,516 ORFs.  The contig number was calculated using the Perl script shown in Appendix I.

Sequence analysis of ORFs – Karlin’s five criteria

     The ORFs were analysed according to the five principles proposed by Karlin.  The JAVA programme (Appendix II) takes the sequence of every ORF and computes the G+C frequency , genome signature profile , codon usage  and amino acid usage , and compares them with the respectively genome average value.

     As the genome sequence was in the form of 22 contigs at the time of analysis, the genome average values used for codon usage and amino acid usage were based on the average values of all the ORFs generated, while the genome average values used for the computation of G+C frequency and genome signature profile were based on the combination of all the 22 contigs.

1. G+C frequency comparison

     To compute G+C frequency of each ORF, the programme uses the following formulae

(1a)

where the denominator is the number of bases of the ORF.

     To compute the average genome G+C frequency, the programme calculates

(1b)

     The programme then compares the G+C frequency of an ORF and that of the genome by calculating

(1c)

for every ORF and outputs the value.

2. Genome signature contrast

     Genome signature profile consists of the array of dinucleotide relative abundance values

(2a)

where  is the frequency of the dinucleotide XY, and  and  are the frequency of X and Y respectively.  As there are four types of nucleotide (A, T, C and G), there are 16 such  values for each ORF and for the whole genome.

The programme computes the genome signature difference between each ORF and the average value of the whole genome

(2b)

where sums up the genome signature different  between each ORF and the entire genome for all the 16 types of XY dinucleotides, and outputs the  value for every ORF.

3. Codon usage contrast

Assume that an amino acid Z is encoded by n types of codons, and the frequency of the usage of each type of codon is .  Therefore, the codon usage of amino acid Z can be represented by an array of n numbers , and .

The codon usage contrast between an ORF and the entire genome is

(3)

where  is the frequency of amino acid Z in this ORF, and sums up the codon usage difference for all the 20 common amino acids.

4. Amino acid contrast

Divergence in amino acid usage (amino acid contrast) between every ORF and the entire genome was calculated by

(4)

where  is the frequency of amino acid Z in this ORF or in the genome, and sums up the amino acid contrast for all the 20 common amino acids.

5. Putative alien gene cluster

To assess if the ORFs generated are putative alien, the currently known ribosomal proteins genes (RP), translation and transcription factor genes (TF) and chaperone genes (CH) of B. pseudomallei (or related species B. cepacia) were retrieved from GenBank.  There are currently 115 nucleotide entries of B. pseudomallei in the GenBank.  Ribosomal protein genes, translation and transcription factor genes (TF) and chaperone genes (CH) were selected (Table 1).

Table 1 Ribosomal proteins genes (RP), translation and transcription factor genes (TF) and chaperone genes (CH) selected for codon bias analysis

Gene Category

GenBank Accession No.

Description

RP*

AF084812

Burkholderia pseudomallei strain ATCC15682 ribosomal protein S21 (rpsU) gene

AF084813

Burkholderia pseudomallei strain ATCC23343 ribosomal protein S21 (rpsU) gene

AF292383

Burkholderia pseudomallei ribosomal protein L21, ribosomal protein L27

AF292383

Burkholderia pseudomallei ribosomal protein L21, ribosomal protein L21

AF447447

Burkholderia pseudomallei strain Institut Pasteur 6068VIR ribosomal subunit S21 (rpsU) gene

U73848

Burkholderia pseudomallei ribosomal protein S21 (rpsU)

TF

AF288383

Burkholderia pseudomallei DNA polymerase I (polA)

AF326688

Burkholderia pseudomallei RNA polymerase sigma-32 factor (rpoD) gene

AF005358

Burkholderia pseudomallei transcriptional activator protein IrlR

AF005358

Burkholderia pseudomallei transcriptional activator protein IrlS

AY032866

Burkholderia pseudomallei hypothetical protein and probable transcriptional regulator genes

AY040244

Burkholderia pseudomallei global regulator which belongs to the LysR family of transcription factors

CH

AF016711

Burkholderia pseudomallei heat shock protein 70 (dnaK) gene

AF287633

Burkholderia pseudomallei chaperonin GroEL gene

AF453480

Burkholderia cepacia putative flagellar chaperone

AF104907

Burkholderia cepacia 10 kDa heat shock protein GroES

AF104907

Burkholderia cepacia 57 kDa heat shock protein GroES

*: Many entries that encode ribosomal RNAs (rRNAs) but are not translated into proteins are not selected.

 

Implementation of Karlin’s criteria

1. The programme

The programme consists of three major parts Util.java, Analyzer.java and ORF.java, shown in Appendix II.

Util.java provides some basic functions, including counting the number of occurrence of a base, a dinucleotide, a codon, and an amino acid.  It also contains a bacterial translation table (translation table number 11).  These functions are imported by the other two programmes that calculate the values for the five criteria proposed by Karlin.

ORF.java contains the functions that calculate the G+C frequency , genome signature profile , codon usage  and amino acid usage  for a given ORF.  It is also imported by Analyzer.java.

Analyzer.java is the major part of the programme.  Once executed, the programme will prompt user for several input files and output files:

Please input contig file name:  /* the file containing sequences of all contigs */

Please input ORF nucleotide file name:  /* the file containing sequences of all the putative ORFs that were generated by getorf */

Please input ORF RP file name:  /* the file containing sequences of ribosomal protein genes */

Please input ORF TF file name:  /* the file containing sequences of translation and transcription factor genes */

Please input ORF CH file name:  /* the file containing sequences of chaperone genes */

Please input pathogenic ORF file name:  /* the file containing sequences of known pathogenic genes.  This feature can be modified according to the number of such pathogenic gene files.  Currently there are two such files. */

Please specify output file name for all putative ORFs:  /* the output file for all putative ORFs */

Please specify output file name for pathogenic ORF 1:  /* the output file for the first pathogenic gene file */

Please specify output file name for pathogenic ORF 2:  /* the output file for the first pathogenic gene file */

countContig calculates the average G+C frequency  and genome signature profile  of the entire genome based on the 22 contigs.  countORF calculates codon usage  and amino acid usage  of the entire genome based on the average value of all the putative ORFs.  In consideration of the speed of computation, we let countContig and countORF process contigs and ORFs respectively in different manners.

We noticed that most of the contigs are extremely long in the original contig file.  Therefore, when processing the information of a contig (number of bases, number of G and C, number of dinucleotides, etc), countContig reads the contig sequence line by line without concatenating all the lines, and counts the various values of that line.  After this, it moves to the next line and count the same values.  At the same time, it adds up the values of all the lines.  Thus, although JAVA does not have a limit on how long a string can be, we avoid such a situation where a very long string (corresponding to the entire sequence of a contig) is stored in the memory buffer, as this would dramatically slow down computation.  However, the length of an ORF is usually short.  Therefore, countORF performs the counting by concatenating all the lines of an ORF into a string and computes the various values of this string when it encounters the last line of this ORF.

Analyzer.java calculates both the values of every ORF and the average values of all the ORFs at the same time after it finishes reading the entire ORF file.  So it processes this large file by scanning it only once, making the computation less time-consuming.

At the end of the programme, Analyzer.java outputs the amino acid sequence and 7 values following the nucleotide sequence of each ORF, in a file that is specified by the user.  The 7 values are

(1)    (Equation 1c)

(2)    (Equation 2b)

(3)    (Equation 3)

(4)    (Equation 4)

(5)    (Equation 2b, except that the ORF is compared with the average value of the selected ribosomal protein genes)

(6)    (Equation 2b, except that the ORF is compared with the average value of the selected translation and transcription factor genes)

(7)    (Equation 2b, except that the ORF is compared with the average value of the selected chaperone genes)

     The format of the output file for one ORF is shown as an example in Figure 3.

>Burk345h10.q1c_1 [103 - 390] 1231 bp, 3 reads

GTGGTGTTTTCCAGGCCACATAACCTGGAAGCGGTAATTGTAAAACCACACACTTTCCGA

GGCTCCACTCCCATTTATTCACACTCCATCTACAATTTCTGCCCACCAGTTCTAGAAGCC

TTGGGCTCTTTTTCTCCCGAGGAATCTTGGTGGGAGTGGGGTGCCGGGGGAGAGGTCCAC

GTATGTCTAAAACTTGGATTGGTAGACTGCATCATGGCAGGCTCCTTTTTTGTTTGTTAT

GACTGTTTGGAATCCATGGTATCATCTGTCACTGTCCATATTGTGGTG

@amino acids sequence

MVFSRPHNLEAVIVKPHTFRGSTPIYSHSIYNFCPPVLEALGSFSPEESWWEWGAGGEVH

VCLKLGLVDCIMAGSFFVCYDCLESMVSSVTVHIVV

@results

0.9224268 0.3833075 1.8995285 0.026063826 2.205662 2.2205944 2.1334183

>Burk345h10.q1c_2 [394 - 615] 1231 bp, 3 reads

CTGAAGCCCTCCAAAGATCCTGATGCTGCCCCTGTGTTTTGCTCTGTCACTTGCCTGGTC

CAGCTAGAGGCTGTTCTAGCGTGCAGCTGCAAAGGTAACAGCAGAGTAGGAACATGCCCT

TGGTCCCACCCTACCCTCCAGCCTCCATTGTGGCACCTACTGTTAAACACTGCCCTCAAC

AGAGTCTTTAAAGAAACTGCTCTTTCCACTTTTACTCACTTT

@amino acids sequence

MKPSKDPDAAPVFCSVTCLVQLEAVLACSCKGNSRVGTCPWSHPTLQPPLWHLLLNTALN

RVFKETALSTFTHF

@results

0.97987527 0.54589283 2.0341592 0.023968514 2.4359636 2.350689 2.360087

Figure 3  Sample output of sequence analysis programme.  The output file is specified by the user at the start of the computation.  For each ORF, the amino acid is translated by the programme, followed by 7 values (see text for description).

2. Exception handling

Several possible exceptions were taken into consideration.

Firstly, when calculating genome signature profile (Equation 2a), if the frequency of single nucleotide X or Y is 0 (i.e., X or Y does not appear at all in the sequence.), the denominator in Equation 2a is 0.  The programme sets  to be 0, as dinucleotide XY will not appear in the sequence in this case.

Secondly, when calculating codon usage for a particular amino acid in a sequence, if this amino acid does not appear in the sequence, we will again have a case where the denominator of codon usage of condon i that encodes amino acid Z

(5)

is 0.  Since in such a case, amino acid Z does not appear at all, the programme sets  to be 0 again.

Thirdly, when calculating dinucleotide frequency, if the two nucleotides X and Y are the same (for example, if they are both G), the programme counts the number of dinucleotide XY (i.e., GG) in such a sequence as “GGGTGGGA” to be 4.

 

Chen Kang