BioR: Difference between revisions
Line 252: | Line 252: | ||
Just like in Python where one can use set functions, R also can handle this very easily. | Just like in Python where one can use set functions, R also can handle this very easily. | ||
So if we have a list of entries in a text file, and we want to compare with that | |||
of another file, we can use the set functions. Note that R will interpret white | |||
space strictly, and so a line containing "x" and "x " will not be the same. | |||
To make a comparison if you are expecting white space, strip off all trailing | |||
or spaces at the beginning. like using l.strip() or r.strip() in Python. | |||
Save the following bash script into a file, ln -s file i; ln -s file u; ln -s file s; ln -s file o | |||
<pre> | <pre> | ||
#!/bin/bash | #!/bin/bash |
Revision as of 21:13, 7 April 2013
R applications in Bioinformatics
Basic Introduction to R
Rscript
Non-interactive command-line mode of using R can be directly piping R commands into R -slave, or on the command line, typically achieved using Rscript, which is found in most modern installations of R.
The philosophy of Rscript traces essentially to the significant following of Unix Shell as a powerful 4GL. See http://www.rdb.com/lib/4gl.pdf
Also the de-scoffing of scripting languages: http://awk.info/?doc/praise.html
Piping R commands to R
$ a=10; b=100; echo "mean($a:$b)" | R --slave
This script construct essentially makes use of bash interpolation to substitute values of variables using the double quotes to construct the R commands needed to be sent into the R environment for computation.
If line numbers of R are to be omitted, then try this construct:
$ a=10; b=100; echo "x=mean($a:$b); cat(x)" | R --slave
or to insert an end-of-line character:
$ a=10; b=100; echo "x=mean($a:$b); cat(x,fill=T)" | R --slave
So R can be a very very powerful text manipulator with powerful and easy to use commands within the R environment, far clearer and cleaner to implement on the commandline than say a bunch of obscure sed or awk manipulations, and perhaps even better than perl.
An Rscript takes R commands on the command line
Rscripts produce output of R commands to the stdout. R commands are delivered to Rscript and executed. These R commands in Rscript are achieved using the -e flag of Rscript and quoted to define where the R commands begin and end. More than one R command in Rscript can be added provided each command ends with a semicolon ";". For example:
$ Rscript -e "R command; R command;..." arguments
Double-quoting of Rscript R commands will result in shell variables being interpolated within the quotes. This variable substitution will be useful for calling Rscript by another script, say bash, and for bash variables to be delivered to the R commands executed by Rscript.
Passing command line arguments to Rscript
Shell variables can also be delivered as Rscript arguments. However, the R commands must be able to read these arguments. The CommandArgs command in R performs this task.
$ Rscript -e "x=commandArgs(TRUE); x[1] ; x[2]" arg1 arg2 [1] "arg1" [1] "arg1"
If arg1 etc are to be used as numbers e.g. integers or float, then use the as.integer or as.float command in R to convert the variable/vector.
$ Rscript -e "x=commandArgs(TRUE); as.integer(x[1]); x[2]" 123 345 [1] 123 [1] "345"
Note the difference between x1 as an integer, and x2 as a string of characters.
$ Rscript -e "x=commandArgs(TRUE); mean(as.integer(x))" 1 2 3 4 5 6 [1] 3.5
Pipeing stdout into stdin of a Rscript command
Each line of the stdin can be read into a Rscript variable as specific elements of a vector.
$ echo "1 > 2 > 3" | Rscript -e "x=readLines(file('stdin')); x" [1] "1" "2" "3"
$ ls -l /usr/bin | awk '!/total/ {print $5}' | \ Rscript -e "x=readLines(file('stdin')); summary(as.integer(x))" Min. 1st Qu. Median Mean 3rd Qu. Max. 1.0 553.2 7128.0 76650.0 24050.0 9918000.0
$ echo "col1 col2 col3 > 11 12 13 > 21 22 23 > 31 32 33" | Rscript =e "x=read.table('stdin');print(x)"
$ (echo "h1 h2 h3 > happy new year > good morning sir > invite for meal") | Rscript -e "x=read.table('stdin',header=T);x" h1 h2 h3 1 happy new year 2 good morning sir 3 invite for meal
Comparison with Perl which takes five lines to read in a file and print it out, Rscript does it cleanly with one single command. Compare that with C, which is even more involved than Perl. This explains why R is gaining rapidly popularity.
$ cat /etc/passwd | Rscript -e "cat(readLines(file('stdin')))"
Rscript and Table Joins
Introduction
Because R can read in tables so flexibly, it can function like a relational database. For example, it can perform SQL-type database joins. When used with Rscript, this is analogous to NoSQL.
Carlo Strozzi wrote NoSQL in 1999 to leverage on Unix shell utilities and Unix redirection.
- http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
- http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Philosophy%20of%20NoSQL
Reading a file via stdin into a table (strictly R data.frame)
- use read.table which will read a file into a data.frame
- define stdin for file,
- specify na.strings as "-" for any missing values, and
- set header=TRUE so that the header line is read as headers in R.
$ cat t.table | Rscript -e "x=read.table(file('stdin'), na.strings='-',header=T); x"
Other ways of reading in a file
read.table is reading a external file whose format is a data "table" to non-R-users. Similarly, there are
read.csv() to read a CSV file, read.delim() to read tab-delim file.
But both read.csv and read.delim will store the data into a data frame.
If you load the "foreign" *package* by "library(foreign)", you have read.spss, read.octave, read.systat, etc. Here is another confusing concept in R, package vs library. What is typically known as a "package", is loaded by "library()", but it is WRONG to call it a "library", which refers to something else in R.
Reading in two tables as data.frames
$ Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T); \ y=read.table(file('t2.table'), header=T); x; y " CustID Product 1 1 Toaster 2 2 Toaster 3 3 Toaster 4 4 Radio 5 5 Radio 6 6 Radio CustID State 1 2 AL 2 4 AL 3 6 CA
Inner join using merge
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\ y=read.table(file('t2.table'), header=T); merge(x,y) " CustID Product State 1 2 Toaster AL 2 4 Radio AL 3 6 Radio CA
We can specify which specific Column we wish to merge by using by='COLUMNNAME'
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\ y=read.table(file('t2.table'), header=T); merge(x,y, by='CustID') "
Explicit inner join using merge and by='columnname'
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T); \ z=read.table(file('t3.table'), header=T); z; merge(x,z, by='Product') " Product MadeIn 1 Toaster China 2 Radio Malaysia Product CustID MadeIn 1 Radio 4 Malaysia 2 Radio 5 Malaysia 3 Radio 6 Malaysia 4 Toaster 1 China 5 Toaster 2 China 6 Toaster 3 China
Sorting by column name
Sorting of a dataframe according to a specified column can be achieved
by using the construct: df[order(df$ColumnName),]
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T); \ z=read.table(file('t3.table'), header=T); xz=merge(x,z, by='Product');\ xz[order(xz\$CustID),]"
Note that the $ sign has to be protected from the Shell with a backslash. This constructs sorts the entire dataframe according to the numeric values of the column "CustID".
Product CustID MadeIn 4 Toaster 1 China 5 Toaster 2 China 6 Toaster 3 China 1 Radio 4 Malaysia 2 Radio 5 Malaysia 3 Radio 6 Malaysia
If column X is a numeric column (you can check this by str(df)), the above command will sort the data frame by numeric values. If column X is a character column, it is sorted by alphabetic order. If it is a numeric column, but you want to sort it alphabetically, simply convert the column to character. (This statment is simply a combination of "order()" and ways-to-access-rows-of-data.frame)
Explicit three-"table" inner join using merge, by= in two steps
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\ y=read.table(file('t2.table'), header=T); \ z=read.table(file('t3.table'), header=T); \ dum=merge(x,z, by='Product'); merge(dum,y, by='CustID') " CustID Product MadeIn State 1 2 Toaster China AL 2 4 Radio Malaysia AL 3 6 Radio Malaysia CA
Only the intersect of CustIDs in Table 2 with Table 1 will be produced
Explicit three-table outer join using merge of data.frames with all=TRUE
The Union of CustIDs in Table 2 with Table 1 will be produced with missing values as <NA> null
# Rscript -e "x=read.table(file('t1.table'), na.strings='-',header=T);\ y=read.table(file('t2.table'), header=T); \ z=read.table(file('t3.table'), header=T); \ dum=merge(x,z, by='Product'); merge(dum,y, by='CustID', all=TRUE) "
CustID Product MadeIn State 1 1 Toaster China <NA> 2 2 Toaster China AL 3 3 Toaster China <NA> 4 4 Radio Malaysia AL 5 5 Radio Malaysia <NA> 6 6 Radio Malaysia CA
If we want a Left outer join, use "all.x=TRUE" and the result is a "union", with missing values
If we want an Right outer join,use "all.y=TRUE" and the result is an "intersect" and lines with missing values are not merged
CustID Product MadeIn State 1 2 Toaster China AL 2 4 Radio Malaysia AL 3 6 Radio Malaysia CA
To access a specific row of a data.frame named df
df[10, ] is the 10th row of df
df["MyRowname", ] is the row with name MyRowname
There are several ways to access specific rows of a dataframe:
1) by row numbers: df[10:30, ] 2) by negative row numbers: df[-(10:30), ] for all rows except 10 to 30 3) by row names 4) by boolean: df[ x > 10, ] where x is a vector with n elements, and df has n rows (so x can be one of df's columns)
Set Functions in R
Just like in Python where one can use set functions, R also can handle this very easily. So if we have a list of entries in a text file, and we want to compare with that of another file, we can use the set functions. Note that R will interpret white space strictly, and so a line containing "x" and "x " will not be the same. To make a comparison if you are expecting white space, strip off all trailing or spaces at the beginning. like using l.strip() or r.strip() in Python.
Save the following bash script into a file, ln -s file i; ln -s file u; ln -s file s; ln -s file o
#!/bin/bash # R-based bash script to find the # intersection (i), union (u), subtraction (s), symmetric subtraction (o) (outersect) # of two lists of strings one in each file # Usage [i|u|s|o] file1 file2 case `basename $0` in i) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \ cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ cat ('--Intersect of A and B-- \n');i<-intersect(a,b); cat (i, '\n') " ;; u) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \ cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ cat ('--Union of A and B-- \n');u<-union(a,b); cat (u, '\n') " ;; s) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \ cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ cat ('--Subtraction of B from A-- \n');s<-setdiff(a,b); cat (s, '\n') " ;; o) Rscript -e "a<-readLines(file('$1'));b<-readLines(file('$2')); \ cat ('--Set A=$1--\n'); cat(a, '\n'); cat ('--Set B=$2--\n'); cat (b, '\n'); \ cat ('--Asymmetric Subtraction of A and B-- \n');u<-union(setdiff(a,b),setdiff(b,a)); cat (u, '\n') " ;; esac
- http://answers.oreilly.com/topic/1629-how-to-import-data-from-external-files-in-r/
- readLines(con = stdin(), n = -1L, ok = TRUE, warn = TRUE, encoding = "unknown")
- Argument Description Default
- con A character string (specifying a file or URL) or a connection containing the data to read. stdin()
- n An integer value specifying the number of lines to read. (Negative values mean .read until the end of the file..) -1L
- ok A logical value specifying whether to trigger an error if the number of lines in the file is less than n. TRUE
- warn A logical value specifying whether to warn the user if the file does not end with an EOL. TRUE
- encoding A character value specifying the encoding of the input file. .unknown
Basic Introduction to Bioconductor
- Home of Bioconductor [3]
- Home of Bioconductor [4]
R Publication Quality Graphics