BioR

From LsmWiki
Jump to: navigation, search

R applications in Bioinformatics

Basic Introduction to R

  • Simple Intro [1]
  • Comprehensive Intro [2]

Rscript

Non-interactive command-line mode of using R can be directly piping R commands into R -slave, or on the command line, typically achieved using Rscript, which is found in most modern installations of R.

The philosophy of Rscript traces essentially to the significant following of Unix Shell as a powerful 4GL. See http://www.rdb.com/lib/4gl.pdf

Also the de-scoffing of scripting languages: http://awk.info/?doc/praise.html

Piping R commands to R

$ a=10; b=100; echo "mean($a:$b)" | R --slave

This script construct essentially makes use of bash interpolation to substitute values of variables using the double quotes to construct the R commands needed to be sent into the R environment for computation.

If line numbers of R are to be omitted, then try this construct:

$ a=10; b=100; echo "x=mean($a:$b); cat(x)" | R --slave

or to insert an end-of-line character:

$ a=10; b=100; echo "x=mean($a:$b); cat(x,fill=T)" | R --slave

So R can be a very very powerful text manipulator with powerful and easy to use commands within the R environment, far clearer and cleaner to implement on the commandline than say a bunch of obscure sed or awk manipulations, and perhaps even better than perl.

An Rscript takes R commands on the command line

Rscripts produce output of R commands to the stdout. R commands are delivered to Rscript and executed. These R commands in Rscript are achieved using the -e flag of Rscript and quoted to define where the R commands begin and end. More than one R command in Rscript can be added provided each command ends with a semicolon ";". For example:

$ Rscript -e "R command; R command;..." arguments

Double-quoting of Rscript R commands will result in shell variables being interpolated within the quotes. This variable substitution will be useful for calling Rscript by another script, say bash, and for bash variables to be delivered to the R commands executed by Rscript.

Passing command line arguments to Rscript

Shell variables can also be delivered as Rscript arguments. However, the R commands must be able to read these arguments. The CommandArgs command in R performs this task.

$ Rscript -e "x=commandArgs(TRUE); x[1] ; x[2]" arg1 arg2
[1] "arg1"
[1] "arg1"

If arg1 etc are to be used as numbers e.g. integers or float, then use the as.integer or as.float command in R to convert the variable/vector.

$ Rscript -e "x=commandArgs(TRUE); as.integer(x[1]); x[2]" 123  345
[1] 123
[1] "345"

Note the difference between x1 as an integer, and x2 as a string of characters.

$ Rscript -e "x=commandArgs(TRUE); mean(as.integer(x))" 1 2 3 4 5 6
[1] 3.5


Pipeing stdout into stdin of a Rscript command

Each line of the stdin can be read into a Rscript variable as specific elements of a vector.

$ echo "1
> 2
> 3" | Rscript -e "x=readLines(file('stdin')); x"
[1] "1" "2" "3"


$ ls -l /usr/bin | awk '!/total/ {print $5}' | \
  Rscript -e "x=readLines(file('stdin')); summary(as.integer(x))"
    Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
    1.0     553.2    7128.0   76650.0   24050.0 9918000.0
$ echo "col1  col2 col3
> 11  12   13
> 21  22   23
> 31  32   33" | Rscript -e "x=read.table('stdin');print(x)"


$ (echo "h1 h2 h3
> happy new year
> good morning sir
> invite for meal") | Rscript -e "x=read.table('stdin',header=T);x"
      h1      h2   h3
1  happy     new year
2   good morning  sir
3 invite     for meal

Comparison with Perl which takes five lines to read in a file and print it out, Rscript does it cleanly with one single command. Compare that with C, which is even more involved than Perl. This explains why R is gaining rapidly popularity.

$ cat /etc/passwd | Rscript -e "cat(readLines(file('stdin')))"
$ cat /etc/passwd | Rscript -e "cat(readLines(file('stdin')),fill=T)". # put in newlines

Rscript and Table Joins

Introduction

Because R can read in tables so flexibly, it can function like a relational database. For example, it can perform SQL-type database joins. When used with Rscript, this is analogous to NoSQL.

Carlo Strozzi wrote NoSQL in 1999 to leverage on Unix shell utilities and Unix redirection.

  1. http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
  2. http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Philosophy%20of%20NoSQL


Reading a file via stdin into a table (strictly R data.frame)

  • use read.table which will read a file into a data.frame
  • define stdin for file,
  • specify na.strings as "-" for any missing values, and
  • set header=TRUE so that the header line is read as headers in R.
$ cat t.table | Rscript -e "x=read.table(file('stdin'), na.strings='-',header=T); x"


Other ways of reading in a file

read.table is reading a external file whose format is a data "table" to non-R-users. Similarly, there are

read.csv() to read a CSV file, 
read.delim() to read tab-delim file. 

But both read.csv and read.delim will store the data into a data frame.

If you load the "foreign" *package* by "library(foreign)", you have read.spss, read.octave, read.systat, etc. Here is another confusing concept in R, package vs library. What is typically known as a "package", is loaded by "library()", but it is WRONG to call it a "library", which refers to something else in R.

Reading in two tables as data.frames

$ Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T); \
  y=read.table(file('t2.table'), header=T); x;  y "
 CustID Product
1      1 Toaster
2      2 Toaster
3      3 Toaster
4      4   Radio
5      5   Radio
6      6   Radio
 CustID State
1      2    AL
2      4    AL
3      6    CA

Inner join using merge

# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\    
  y=read.table(file('t2.table'), header=T); merge(x,y) "
 CustID Product State
1      2 Toaster    AL
2      4   Radio    AL
3      6   Radio    CA

We can specify which specific Column we wish to merge by using by='COLUMNNAME'

# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\    
  y=read.table(file('t2.table'), header=T); merge(x,y, by='CustID') "

Explicit inner join using merge and by='columnname'

# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);  \
  z=read.table(file('t3.table'), header=T); z;  merge(x,z, by='Product') "
 Product   MadeIn
1 Toaster    China
2   Radio Malaysia
 Product CustID   MadeIn
1   Radio      4 Malaysia
2   Radio      5 Malaysia
3   Radio      6 Malaysia
4 Toaster      1    China
5 Toaster      2    China
6 Toaster      3    China

Sorting by column name

Sorting of a dataframe according to a specified column can be achieved
by using the construct: df[order(df$ColumnName),]
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);  \
  z=read.table(file('t3.table'), header=T); xz=merge(x,z, by='Product');\
  xz[order(xz\$CustID),]"

Note that the $ sign has to be protected from the Shell with a backslash. This constructs sorts the entire dataframe according to the numeric values of the column "CustID".

  Product CustID   MadeIn
4 Toaster      1    China
5 Toaster      2    China
6 Toaster      3    China
1   Radio      4 Malaysia
2   Radio      5 Malaysia
3   Radio      6 Malaysia

If column X is a numeric column (you can check this by str(df)), the above command will sort the data frame by numeric values. If column X is a character column, it is sorted by alphabetic order. If it is a numeric column, but you want to sort it alphabetically, simply convert the column to character. (This statment is simply a combination of "order()" and ways-to-access-rows-of-data.frame)

Explicit three-"table" inner join using merge, by= in two steps

# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\
  y=read.table(file('t2.table'), header=T); \
  z=read.table(file('t3.table'), header=T); \
  dum=merge(x,z, by='Product'); merge(dum,y, by='CustID') "
 CustID Product   MadeIn State
1      2 Toaster    China    AL
2      4   Radio Malaysia    AL
3      6   Radio Malaysia    CA

Only the intersect of CustIDs in Table 2 with Table 1 will be produced

Explicit three-table outer join using merge of data.frames with all=TRUE

The Union of CustIDs in Table 2 with Table 1 will be produced with missing values as <NA> null

# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\
  y=read.table(file('t2.table'), header=T); \
  z=read.table(file('t3.table'), header=T); \
  dum=merge(x,z, by='Product'); merge(dum,y, by='CustID', all=TRUE) "
  CustID Product   MadeIn State
1      1 Toaster    China  <NA>
2      2 Toaster    China    AL
3      3 Toaster    China  <NA>
4      4   Radio Malaysia    AL
5      5   Radio Malaysia  <NA>
6      6   Radio Malaysia    CA

If we want a Left outer join, use "all.x=TRUE" and the result is a "union", with missing values

If we want an Right outer join,use "all.y=TRUE" and the result is an "intersect" and lines with missing values are not merged

  CustID Product   MadeIn State
1      2 Toaster    China    AL
2      4   Radio Malaysia    AL
3      6   Radio Malaysia    CA

To access a specific row of a data.frame named df

df[10, ] is the 10th row of df
df["MyRowname", ] is the row with name MyRowname

There are several ways to access specific rows of a dataframe:

1) by row numbers: df[10:30, ]
2) by negative row numbers: df[-(10:30), ] for all rows except 10 to 30
3) by row names
4) by boolean: df[ x > 10, ] where x is a vector with n elements, and
df has n rows (so x can be one of df's columns)


Set Functions in R

Just like in Python where one can use set functions, R also can handle this very easily. So if we have a list of entries in a text file, and we want to compare with that of another file, we can use the set functions. Note that R will interpret white space strictly, and so a line containing "x" and "x " will not be the same. To make a comparison if you are expecting white space, strip off all trailing or spaces at the beginning. like using l.strip() or r.strip() in Python.

Save the following bash script into a file, ln -s file i; ln -s file u; ln -s file s; ln -s file o

#!/bin/bash
# R-based bash script to find the
# intersection (i), union (u), subtraction (s), symmetric subtraction (o) (outersect)
# of two lists of strings one in each file
# Usage [i|u|s|o] file1  file2

case `basename $0` in
i) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \
   cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \
   cat ('--Intersect of A and B-- \n');i<-intersect(a,b); cat (i, '\n') "
;;
u) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \
   cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ 
   cat ('--Union of A and B-- \n');u<-union(a,b); cat (u, '\n') "
;;
s) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \
   cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ 
   cat ('--Subtraction of B from A-- \n');s<-setdiff(a,b); cat (s, '\n') "
;;
o) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \ 
   cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \
   cat ('--Asymmetric Subtraction of A and B-- \n');u<-union(setdiff(a,b),setdiff(b,a)); cat (u, '\n') "
;;
esac
  1. http://answers.oreilly.com/topic/1629-how-to-import-data-from-external-files-in-r/
  2. readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown")
  3. Argument Description Default
  4. con A character string (specifying a file or URL) or a connection containing the data to read. stdin()
  5. n An integer value specifying the number of lines to read. (Negative values mean .read until the end of the file..) -1L
  6. ok A logical value specifying whether to trigger an error if the number of lines in the file is less than n. TRUE
  7. warn A logical value specifying whether to warn the user if the file does not end with an EOL. TRUE
  8. encoding A character value specifying the encoding of the input file. .unknown

Basic Introduction to Bioconductor

  • Home of Bioconductor [3]
  • Home of Bioconductor [4]

R Publication Quality Graphics

  • Graphics Gallery [5]
  • ggplot2 (only works with R 2.9 and above) [6]


High Performance R

  • High performance analytics with REvolution R and HPC server [7]
  • ParallelR [8]