First Step with R for Life Sciences: Learning Basics of this Tool for NGS Data Analysis

Rashid Saif, Kinza Qazi, Saeeda Zia, Tania Mahmood, Aniqa Ejaz, Talha Tamseel, Suliman Mohammed Alghanem, Adnan Khaliq


Background: R is a widely used open-source programming language, developed by the scientific community to compute, analyze and visualize large datasets from any field, including biomedical research and bioinformatics applications.

Methods: Here, we outline R, its allied packages and the affiliated bioinformatics infrastructures, e.g. Bioconductor and CRAN. The basic concepts of factor, vector and data matrix are discussed, and whole-transcriptome RNA-Seq data are analyzed. In particular, a differential expression workflow was performed on simulated prostate cancer RNA-Seq data, covering experimental design, data normalization, hypothesis testing and downstream investigation using the edgeR package. A few genes with ectopic expression were retrieved, and the know-how for gene enrichment and pathway analysis with available online tools is highlighted.
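To make the normalization step concrete, the sketch below scales a toy count matrix to counts per million (CPM), a simple library-size normalization. It is written in Python for illustration and is not the edgeR procedure itself, which can additionally apply TMM scaling factors. The gene names, sample names, counts and the function name `counts_per_million` are all hypothetical.

```python
# Hypothetical raw read counts: rows are genes, columns are samples
genes = ["geneA", "geneB", "geneC"]
samples = ["control_1", "tumor_1"]
counts = [
    [100, 300],
    [400, 300],
    [500, 1400],
]

def counts_per_million(matrix):
    """Scale each column (sample) so that its counts sum to one million."""
    # Library size = total number of reads in each sample (column sum)
    library_sizes = [sum(column) for column in zip(*matrix)]
    return [
        [value * 1e6 / size for value, size in zip(row, library_sizes)]
        for row in matrix
    ]

cpm = counts_per_million(counts)
print("gene", samples)
for gene, row in zip(genes, cpm):
    print(gene, [round(value) for value in row])
```

After scaling, each sample sums to one million, so expression values can be compared across samples with different sequencing depths.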

Results: A 4×3 data matrix was constructed, and the more complex data matrix of Golub et al. was analyzed with the χ² statistic by generating a frequency table of 15 true positives, 4 false positives, 15 true negatives and 4 false negatives based on gene expression cut-off values. A test statistic of 10.52 with 1 df and p = 0.001 was obtained, which rejects the null hypothesis and supports the alternative hypothesis that “the predicted state of a person based on gene expression cut-off values depends on the disease state of the patient” in our data. Similarly, sequence data of the human Zyxin gene were selected, and a null hypothesis of equal frequencies was rejected.
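The reported test statistic can be reproduced from the four counts given above. The following is a minimal sketch, written in Python rather than R for illustration. It assumes that the counts form a 2×2 table of predicted state against disease state and that Yates' continuity correction was applied (the default of R's `chisq.test` for 2×2 tables); neither assumption is stated in the abstract. The function name `yates_chi_square` is hypothetical.

```python
import math

def yates_chi_square(table):
    """Chi-square test of independence for a 2x2 table with Yates' correction.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            # Yates' correction: shrink |observed - expected| by 0.5
            diff = max(abs(observed - expected) - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # Upper-tail probability of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Assumed layout: rows = predicted state, columns = true disease state
table = [[15, 4],   # predicted positive: 15 true positives, 4 false positives
         [4, 15]]   # predicted negative: 4 false negatives, 15 true negatives
statistic, p_value = yates_chi_square(table)
print(f"chi-square = {statistic:.2f}, df = 1, p = {p_value:.3f}")
```

Under these assumptions the statistic is about 10.53 with p close to 0.001, in line with the values reported above, so the null hypothesis of independence is rejected at the 5% level.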

Conclusion: Machine-learning approaches using the R statistical environment are supportive tools that can provide systematic prediction of the putative causes, present state, future consequences and possible remedies of problems in modern biology.

Keywords: NGS data; R language; Zyxin gene


