G+C content is slightly higher in one type III secretion gene cluster, but not the other

B. pseudomallei and a variety of other Gram-negative pathogens use type III secretion as a conserved and highly adapted virulence mechanism (Hueck, 1998). Type III secretion systems are made up of clusters of homologous proteins with export functions that deliver virulence factors directly to host cells. The gene clusters encoding them are therefore regarded as a type of pathogenicity island (PAI) in bacterial genomes.

From the calculated results, the two currently known type III secretion gene clusters of B. pseudomallei (Winstanley et al., 1999; Attree and Attree, 2002) differ from each other in how their G+C content compares with that of the putative ORFs. The values range from 0.95 to 1.12 for cluster AY044082 (GenBank accession number), with an average of 1.06, slightly higher than the average of all putative ORFs. The values for cluster AF074878, however, range from 0.84 to 0.98, with an average of 0.93, essentially no different from the average of all putative ORFs. Some putative ORFs have a  value of about 1.1, which is one indication that they may be components of potential PAIs.
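The comparison above can be illustrated with a minimal sketch. It assumes that the reported values are ratios of each ORF's G+C fraction to the mean G+C fraction over all putative ORFs; the text does not state the exact definition, so this normalisation and the function names `gc_fraction` and `relative_gc` are illustrative assumptions rather than the programme actually used.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def relative_gc(orfs: dict) -> dict:
    """Map each ORF name to its G+C fraction divided by the mean over all ORFs.

    Assumption: the genome-wide reference is the unweighted mean of the
    per-ORF G+C fractions.
    """
    if not orfs:
        raise ValueError("no ORFs supplied")
    fractions = {name: gc_fraction(seq) for name, seq in orfs.items()}
    mean_gc = sum(fractions.values()) / len(fractions)
    if mean_gc == 0:
        raise ValueError("mean G+C fraction is zero")
    return {name: frac / mean_gc for name, frac in fractions.items()}


if __name__ == "__main__":
    # Toy sequences, not real B. pseudomallei ORFs.
    toy = {"orf1": "GCGCGCAT", "orf2": "ATATGCGC", "orf3": "GGCCGGCA"}
    for name, ratio in relative_gc(toy).items():
        print(f"{name}\t{ratio:.2f}")
```

Under this assumption, a ratio above 1 marks an ORF that is richer in G+C than the average putative ORF.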

PAIs are usually generated by horizontal gene transfer. They therefore often consist of DNA regions that differ from the core genome in G+C content and in codon usage. However, a difference in G+C content between a PAI and the core genome will not be observed if the DNA of the donor and that of the recipient have similar or identical G+C contents (Hacker and Kaper, 2000). Furthermore, laterally transferred genes may adopt the genome-wide G+C tendencies of their new host (Karlin, 2001). This may account for the difference in G+C content that we observed between the two type III secretion systems, and it highlights the possibility that the two systems were acquired at different times during evolution. It also raises a question about the validity of using G+C content as a universal indicator of PAIs.

Genome signature contrast

Nearly all the putative ORFs and the ORFs in the two type III secretion systems have  values less than 0.90 (moderately similar), with most of them less than 0.45. This is consistent with these genes belonging to the same species, given the interpretation of  (values <0.50 indicate sequences that are closely similar, as is pervasive within species) (Karlin et al., 1998). Laterally transferred genes may evolve rapidly toward the signature of their new host (Karlin, 2001), which may also account for the small  values observed for the two type III secretion systems.

A few ORFs have  values around 0.90. However, in our results, higher  values do not correlate with higher  values. We noticed that most of the ORFs with higher  values are short. The calculation error is therefore relatively large for them, which may be responsible for the large  values obtained.
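For reference, a genome signature difference of the kind discussed in this section can be sketched as follows. The sketch assumes the usual dinucleotide relative abundance definition from Karlin's work: rho*_XY = f*_XY / (f*_X f*_Y), with frequencies taken over the sequence together with its reverse complement, and a difference equal to the mean absolute difference over the 16 dinucleotides. The text does not give the formula or the scaling it used, so the function names and the unscaled output here are assumptions.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement; characters other than A, C, G, T are left unchanged."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def signature(seq: str) -> dict:
    """Symmetrised dinucleotide relative abundances rho*_XY for one sequence."""
    seq = seq.upper()
    mono, di = Counter(), Counter()
    # Count over both strands so that the profile is strand-symmetric.
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in BASES:
                mono[base] += 1
        for x, y in zip(strand, strand[1:]):
            if x in BASES and y in BASES:
                di[x + y] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        observed = di[x + y] / n_di
        rho[x + y] = observed / expected if expected else 0.0
    return rho


def signature_difference(seq_a: str, seq_b: str) -> float:
    """Mean absolute difference of rho*_XY over the 16 dinucleotides (unscaled)."""
    rho_a, rho_b = signature(seq_a), signature(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16
```

Because each dinucleotide frequency is estimated from very few observations in a short ORF, a sketch like this also makes the length effect noted above easy to reproduce on toy data.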

Codon and amino acid usage contrast

A gene is considered putative alien (pA) if the biases , ,  and  all exceed , where M is the median codon bias of  over all genes.

, ,  and  were calculated for the two type III secretion gene clusters and for all the putative ORFs in the genome. However, because we did not compute the value of M, we could not validate this criterion against the two known pathogenicity islands, nor could we search the genome for potential PAIs on this basis. Such an analysis would be informative about potential PAIs in B. pseudomallei and other bacterial genomes.

However, as noted by Karlin (2001), not every pA gene cluster is a PAI, and conversely, not every PAI is a pA gene cluster.  Therefore, this criterion cannot serve as a definitive indicator of a gene cluster being a PAI even if it is pA. 
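A codon bias contrast of the kind used in this criterion can be sketched as follows. The sketch assumes Karlin's usual form, B(F|C) = sum over amino acids a of p_a(F) times the sum over the codons of a of |f(codon) - c(codon)|, where f and c are codon frequencies normalised within each amino acid family. The reference gene sets and the threshold are left as caller-supplied parameters because their values are not recoverable from the text; `is_putative_alien` and its `threshold` argument are hypothetical names.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_counts(seqs):
    """Count sense codons over in-frame coding sequences (stops and ambiguous codons skipped)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def codon_bias(family, reference):
    """Codon bias B(F|C) of gene family F relative to reference gene set C."""
    f, c = codon_counts(family), codon_counts(reference)
    aa_f, aa_c = Counter(), Counter()
    for codon, n in f.items():
        aa_f[CODON_TABLE[codon]] += n
    for codon, n in c.items():
        aa_c[CODON_TABLE[codon]] += n
    total = sum(aa_f.values())
    if total == 0:
        raise ValueError("family contains no sense codons")
    bias = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa == "*" or aa_f[aa] == 0:
            continue
        f_norm = f[codon] / aa_f[aa]
        # Assumption: an amino acid absent from the reference contributes frequency 0.
        c_norm = c[codon] / aa_c[aa] if aa_c[aa] else 0.0
        bias += (aa_f[aa] / total) * abs(f_norm - c_norm)
    return bias


def is_putative_alien(gene, reference_sets, threshold):
    """True if the gene's bias exceeds the (caller-supplied) threshold for every reference set."""
    return all(codon_bias([gene], ref) > threshold for ref in reference_sets)
```

With this definition the bias lies between 0 (identical codon usage) and 2 (completely disjoint codon usage within every amino acid family).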

In summary

Karlin (2001) proposed five criteria for detecting anomalous gene clusters and pathogenicity islands in bacterial genomes, based on a sliding window W of length 10, 20, …, 50 kb. In the current study, we instead applied the five criteria to each in silico generated ORF, because the genome of B. pseudomallei is still being assembled (i.e., the effective size of W is smaller). A smaller W, as we have seen, may give larger relative errors in calculation. A study based on sliding windows of 10, 20, …, 50 kb should therefore be designed to reassess pathogenicity potential once the B. pseudomallei genome is assembled.

Because the genome is still being sequenced and assembled, the contig file we downloaded from the Sanger Institute may contain redundant sequences, and some genome sequences may be unrepresented in it. This could also contribute to errors in our study when we compute genome-wide averages of several parameters.

Another limitation of our method is that the ORFs are generated by a simple programme (getorf), so the resulting ORFs do not correspond to the genes actually transcribed and translated by the bacterium. We obtained 53,516 putative ORFs using getorf, as counted by the programme in Appendix I. This is obviously an over-representation of the genes of B. pseudomallei (for comparison, even the human genome is estimated to contain only tens of thousands of genes). This problem can be partially overcome by raising the minimum ORF length during ORF generation. Nevertheless, although the ORFs generated by our method are not identical to the real genes of B. pseudomallei, they are exhaustive.
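The effect of a stricter length threshold can be explored after the fact with a minimal sketch such as the one below. It assumes the ORFs are available as FASTA records, as getorf writes them; `read_fasta`, `filter_by_length` and the cut-off values are hypothetical and chosen only for illustration. EMBOSS getorf also offers a minimum-size option (`-minsize`) that applies the same idea at generation time.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any description after it.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep the records whose sequence has at least min_length residues."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


if __name__ == "__main__":
    # Toy records, not real getorf output.
    toy = [">orf_1 [1 - 9]", "ATGAAA", "TAA", ">orf_2 [20 - 25]", "ATGTAA"]
    records = list(read_fasta(toy))
    for cutoff in (6, 9):
        kept = filter_by_length(records, cutoff)
        print(f"min length {cutoff}: {len(kept)} of {len(records)} ORFs kept")
```

Counting the surviving ORFs at several cut-offs gives a quick view of how strongly the total depends on the length threshold.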

In addition to using the known type III secretion system gene clusters to validate the five criteria, validation should also be carried out using genes currently known to be non-pathogenic. Systematic statistical methods remain to be developed to analyse the calculated results. For better management and exploitation of the data, a database interface could be added to our programme so that the calculated values are automatically stored in a database for more convenient data mining.
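One possible shape for such a database interface is sketched below using SQLite from the Python standard library. The table name `orf_scores`, its columns, and the helper functions are hypothetical; they only illustrate how per-ORF values could be stored and queried, and are not part of the existing programme.

```python
import sqlite3


def store_scores(conn, rows):
    """Create the table if needed and insert (orf_id, length, gc_ratio, signature_diff) rows."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS orf_scores (
               orf_id TEXT PRIMARY KEY,
               length INTEGER NOT NULL,
               gc_ratio REAL,
               signature_diff REAL
           )"""
    )
    # INSERT OR REPLACE lets a re-run overwrite earlier values for the same ORF.
    conn.executemany("INSERT OR REPLACE INTO orf_scores VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def orfs_above_gc(conn, min_ratio):
    """Return the ids of ORFs whose stored G+C ratio is at least min_ratio."""
    cur = conn.execute(
        "SELECT orf_id FROM orf_scores WHERE gc_ratio >= ? ORDER BY orf_id",
        (min_ratio,),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
    # Toy values, not results from this study.
    store_scores(conn, [("orf1", 300, 1.10, 0.20), ("orf2", 450, 0.93, 0.40)])
    print(orfs_above_gc(conn, 1.0))
    conn.close()
```

Keeping one row per ORF makes it straightforward to combine several criteria in a single query once thresholds have been settled.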

 

Chen Kang