![]() |
|
| Home Programs Discussion References Links Acknowledgments Sitemap |
|
In
silico generation of putative ORFs from B. pseudomallei genome
sequence The B. pseudomallei genome sequence downloaded from Sanger Institute (ftp://ftp.sanger.ac.uk/pub/pathogens/bps/) contained genome sequence contigs being assembled in FASTA format. At the time of analysis, there are 22 contigs (Figure 2). >Burk345h10.q1c
1231 bp, 3 reads >Burk343c07.q1c
1084 bp, 71 reads >J7889Bf03.q1c
3307 bp, 433 reads >J7902Af01.p1c
1099 bp, 79 reads >J7902Af10.p1c
1021 bp, 21 reads >J7888Cd05.p1c
1076 bp, 5 reads >J7903Ae05.p1c
1780 bp, 242 reads >J7888Cf03.p1c
1604 bp, 7 reads >Burk1071h12.q1c
451662 bp, 12659 reads >BurBAC46C2Aa05.p1c
325593 bp, 8636 reads >J7887Cf07.p1c
1010 bp, 4 reads >J7888Ac08.p1c
1071 bp, 5 reads >Burk416b04.p1c
162820 bp, 5721 reads >Burk1284f07.s2ca4334
955322 bp, 25768 reads >Burk1288h01.s1c
444285 bp, 10729 reads >Burk182h11.q2c4226
1077098 bp, 27062 reads >Burk1274f04.s2ca4253
306966 bp, 8530 reads >Burk412g10.q1c
1243699 bp, 35649 reads >J6812Ab06.p1c
1893 bp, 34 reads >BurBAC68B9Eh02.q1c
199380 bp, 5115 reads >Burk1125d09.s1c
838854 bp, 20261 reads >Burk388b06.p1c
1240696 bp, 38583 reads Contig = 22 Length = 7261991 Lines = 90786 Figure 2 Genome sequence data of B. pseudomallei. The sequence file contains currently assembled 22 contigs in FASTA format. The contig list was generated by a programme written in Perl (Appendix I). The bacterial genetic code (translation table 11, for bacteria) was used in the in silico generation of ORFs. In our current analysis, the minimal length of an ORF was set to be 150 base pairs. The function generated two files that contain the nucleotide sequences and the amino acid sequences of all the putative ORFs respectively, both of which are in FASTA format and contain 53,516 ORFs. The contig number was calculated using the Perl script shown in Appendix I. Sequence analysis of
ORFs – Karlin’s five criteria The
ORFs were analysed according to the five principles proposed by Karlin.
The JAVA programme (Appendix II) takes the sequence of every ORF
and computes the G+C frequency
As the genome sequence was in the form of 22 contigs at the time of analysis, the genome average values used for codon usage and amino acid usage were based on the average values of all the ORFs generated, while the genome average values used for the computation of G+C frequency and genome signature profile were based on the combination of all the 22 contigs. 1. G+C frequency
comparison To compute G+C frequency of each ORF, the programme uses the following formulae
where the denominator is the number of bases of the ORF. To compute the average genome G+C frequency, the programme calculates
The programme then compares the G+C frequency of an ORF and that of the genome by calculating
for every ORF and outputs the value. 2. Genome signature contrast Genome signature profile consists of the array of dinucleotide relative abundance values
where
The programme computes the genome signature difference between each ORF and the average value of the whole genome
where
3. Codon usage contrast Assume that an amino acid Z is
encoded by n types of codons, and the frequency of the usage of
each type of codon is
The codon usage contrast between an ORF and the entire genome is
where
4. Amino acid contrast Divergence in amino acid usage (amino acid contrast) between every ORF and the entire genome was calculated by
where
5. Putative alien gene cluster To assess if the ORFs generated are putative alien, the currently known ribosomal proteins genes (RP), translation and transcription factor genes (TF) and chaperone genes (CH) of B. pseudomallei (or related species B. cepacia) were retrieved from GenBank. There are currently 115 nucleotide entries of B. pseudomallei in the GenBank. Ribosomal protein genes, translation and transcription factor genes (TF) and chaperone genes (CH) were selected (Table 1). Table 1 Ribosomal proteins genes (RP), translation and
transcription factor genes (TF) and chaperone genes (CH) selected for
codon bias analysis
*:
Many entries that encode ribosomal RNAs (rRNAs) but are not
translated into proteins are not selected. Implementation of
Karlin’s criteria 1. The programme The
programme consists of three major parts Util.java,
Analyzer.java
and ORF.java,
shown in Appendix II. Util.java provides some basic functions, including counting the number of
occurrence of a base, a dinucleotide, a codon, and an amino acid.
It also contains a bacterial translation table (translation table
number 11). These functions
are imported by the other two programmes that calculate the values for the
five criteria proposed by Karlin. ORF.java contains the functions that calculate the G+C frequency
Analyzer.java
is the major part of the programme. Once
executed, the programme will prompt user for several input files and
output files: Please input contig file name:
/* the file containing sequences of all contigs */ Please input ORF nucleotide file name:
/* the file containing sequences of all the putative ORFs that
were generated by getorf
*/ Please input ORF
RP file name: /* the file containing sequences of ribosomal protein
genes */ Please input ORF
TF file name: /* the file containing sequences of translation and
transcription factor genes */ Please input ORF
CH file name: /* the file containing sequences of chaperone genes */ Please input
pathogenic ORF file name:
/* the file containing sequences of known pathogenic genes.
This feature can be modified according to the number of such
pathogenic gene files. Currently
there are two such files. */ Please specify output file name
for all putative ORFs:
/* the output file for all putative ORFs */ Please specify output file name
for pathogenic ORF 1:
/* the output file for the first pathogenic gene file */ Please specify output file name for pathogenic ORF 2: /* the output file for the first pathogenic gene file */ countContig calculates the average G+C frequency
We noticed that most of the contigs are extremely long in the original contig file. Therefore, when processing the information of a contig (number of bases, number of G and C, number of dinucleotides, etc), countContig reads the contig sequence line by line without concatenating all the lines, and counts the various values of that line. After this, it moves to the next line and count the same values. At the same time, it adds up the values of all the lines. Thus, although JAVA does not have a limit on how long a string can be, we avoid such a situation where a very long string (corresponding to the entire sequence of a contig) is stored in the memory buffer, as this would dramatically slow down computation. However, the length of an ORF is usually short. Therefore, countORF performs the counting by concatenating all the lines of an ORF into a string and computes the various values of this string when it encounters the last line of this ORF. Analyzer.java calculates both the values of every ORF and the average values of all the ORFs at the same time after it finishes reading the entire ORF file. So it processes this large file by scanning it only once, making the computation less time-consuming. At
the end of the programme,
Analyzer.java outputs
the amino acid sequence and 7 values following the nucleotide sequence of
each ORF, in a file that is specified by the user.
The 7 values are (1)
(2)
(3)
(4)
(5)
(6)
(7)
The
format of the output file for one ORF is shown as an example in Figure 3. >Burk345h10.q1c_1
[103 - 390] 1231 bp, 3 reads GTGGTGTTTTCCAGGCCACATAACCTGGAAGCGGTAATTGTAAAACCACACACTTTCCGA GGCTCCACTCCCATTTATTCACACTCCATCTACAATTTCTGCCCACCAGTTCTAGAAGCC TTGGGCTCTTTTTCTCCCGAGGAATCTTGGTGGGAGTGGGGTGCCGGGGGAGAGGTCCAC GTATGTCTAAAACTTGGATTGGTAGACTGCATCATGGCAGGCTCCTTTTTTGTTTGTTAT GACTGTTTGGAATCCATGGTATCATCTGTCACTGTCCATATTGTGGTG @amino
acids sequence MVFSRPHNLEAVIVKPHTFRGSTPIYSHSIYNFCPPVLEALGSFSPEESWWEWGAGGEVH VCLKLGLVDCIMAGSFFVCYDCLESMVSSVTVHIVV @results 0.9224268
0.3833075 1.8995285 0.026063826 2.205662 2.2205944 2.1334183 >Burk345h10.q1c_2
[394 - 615] 1231 bp, 3 reads CTGAAGCCCTCCAAAGATCCTGATGCTGCCCCTGTGTTTTGCTCTGTCACTTGCCTGGTC CAGCTAGAGGCTGTTCTAGCGTGCAGCTGCAAAGGTAACAGCAGAGTAGGAACATGCCCT TGGTCCCACCCTACCCTCCAGCCTCCATTGTGGCACCTACTGTTAAACACTGCCCTCAAC AGAGTCTTTAAAGAAACTGCTCTTTCCACTTTTACTCACTTT @amino
acids sequence MKPSKDPDAAPVFCSVTCLVQLEAVLACSCKGNSRVGTCPWSHPTLQPPLWHLLLNTALN RVFKETALSTFTHF @results 0.97987527
0.54589283 2.0341592 0.023968514 2.4359636 2.350689 2.360087 Figure 3 Sample output of sequence analysis programme. The output file is specified by the user at the start of the computation. For each ORF, the amino acid is translated by the programme, followed by 7 values (see text for description). 2. Exception handling Several possible exceptions were taken into consideration. Firstly, when calculating genome signature
profile (Equation 2a), if the frequency of single nucleotide X
or Y is 0 (i.e., X or Y does not appear at all in the
sequence.), the denominator in Equation 2a is 0.
The programme sets
Secondly, when calculating codon usage for
a particular amino acid in a sequence, if this amino acid does not appear
in the sequence, we will again have a case where the denominator of codon
usage of condon i that encodes amino acid Z
is 0. Since in
such a case, amino acid Z does not appear at all, the programme
sets
Thirdly, when calculating dinucleotide frequency, if the two nucleotides X and Y are the same (for example, if they are both G), the programme counts the number of dinucleotide XY (i.e., GG) in such a sequence as “GGGTGGGA” to be 4.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||