1 Introduction

RCAS is an automated system that provides dynamic genome annotations for custom input files containing transcriptomic regions. Such regions could be, for instance, peak regions from CLIP-Seq analysis (which detects protein-RNA interactions), RNA modification sites (the epitranscriptome), CAGE-tag locations, or any other collection of target regions at the level of the transcriptome.

RCAS is designed as a reporting tool for the functional analysis of RNA-binding sites detected by high-throughput experiments. It takes as input a BED format file containing the genomic coordinates of the RNA-binding sites and a GTF file containing the genomic annotation features, as usually provided by publicly available databases such as Ensembl and UCSC. RCAS performs overlap operations between the coordinates of the RNA-binding sites and the annotation features and produces in-depth annotation summaries, such as the distribution of binding sites with respect to gene features (exons, introns, 5’/3’ UTRs, exon-intron boundaries, promoter regions, and whole transcripts). Moreover, from the collection of targeted transcripts, RCAS can produce functional annotation tables for enriched gene sets (annotated by the Molecular Signatures Database) and GO terms. Because identifying enriched sequence motifs is one of the most important questions in protein-RNA interaction analysis, RCAS also has a module for detecting sequence motifs enriched in the targeted regions of the transcriptome. The final report of RCAS consists of high-quality dynamic figures and tables that can be used directly in publications or for other academic purposes.

2 Input Settings

3 Annotation Summary for Query Regions

3.1 Distribution of query regions across gene features

Figure 1 : The number of query regions that overlap each kind of gene feature is counted. The ‘y’ axis denotes the types of gene features included in the analysis and the ‘x’ axis denotes the percentage of query regions (out of the total number of query regions, denoted with ‘n’) that overlap at least one genomic interval hosting the corresponding feature. Note that the percentage values for different features do not add up to 100%, because a query region may overlap multiple kinds of features.
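The counting behind Figure 1 can be illustrated with a minimal sketch. This is not the RCAS implementation: the function names, the toy coordinates, and the assumption that all intervals are half-open and strand is ignored are illustrative only.

```python
from collections import Counter

def overlaps(a_start, a_end, b_start, b_end):
    """True when two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def feature_percentages(queries, features):
    """Percentage of query regions overlapping each feature kind.

    queries:  list of (chrom, start, end)
    features: list of (chrom, start, end, kind)
    A query is counted once per kind, however many intervals of that kind it hits.
    """
    counts = Counter()
    for q_chrom, q_start, q_end in queries:
        kinds = {
            kind
            for f_chrom, f_start, f_end, kind in features
            if f_chrom == q_chrom and overlaps(q_start, q_end, f_start, f_end)
        }
        counts.update(kinds)
    n = len(queries)
    return {kind: 100.0 * count / n for kind, count in counts.items()}

# Hypothetical toy data, not taken from the report.
queries = [("chr1", 100, 150), ("chr1", 190, 220), ("chr1", 500, 520), ("chr2", 10, 20)]
features = [
    ("chr1", 90, 200, "exons"),
    ("chr1", 200, 400, "introns"),
    ("chr1", 480, 600, "introns"),
    ("chr1", 490, 510, "exons"),
]
percentages = feature_percentages(queries, features)
print(percentages)
```

In this toy data two of the four query regions overlap both an exon and an intron, so the percentages sum to more than 100%, which is the effect the caption describes.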

3.2 Interactive table of genes that overlap query regions

Table 1 : Interactive table of top 100 genes that overlap query regions, grouped by gene features such as introns, exons, UTRs, etc.

3.3 Distribution of query regions in the genome grouped by gene types

Figure 2 : The number of query regions that overlap each gene type is counted. The ‘x’ axis denotes the types of genes included in the analysis and the ‘y’ axis denotes the percentage of query regions (out of the total number of query regions, denoted with ‘n’) that overlap at least one genomic interval hosting the corresponding gene type. Query regions that do not overlap any known gene are classified as ‘Unknown’.

3.4 Coverage profile of query regions at/around Transcription Start/End Sites

Figure 3 : The depth of coverage of query regions at and around Transcription Start/End Sites

3.5 Coverage profile of query regions at Exon - Intron Boundaries

Figure 4 : The depth of coverage of query regions at exon - intron junctions

3.6 Coverage profile of query regions across the length of different gene features

Figure 5 : The query regions are overlaid with the genomic coordinates of features. Each feature entry is divided into 100 bins of equal length, and for each bin the number of query regions that cover it is counted. Features shorter than 100 bp are excluded. A coverage profile is thus obtained from the distribution of the query regions. The mean coverage score for each bin is represented with ribbons, where the thickness of the ribbon indicates the 95% confidence interval (mean ± 1.96 × standard error of the mean). The strandedness of the features is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.
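The binning and confidence interval described for Figure 5 can be sketched as follows. This is an illustration under stated assumptions, not the RCAS code: the function names are hypothetical, intervals are half-open, a query "covers" a bin if it overlaps it by at least one base, and the sample standard deviation (n - 1 denominator) is used for the standard error.

```python
import math

def bin_coverage(feature_start, feature_end, strand, queries, n_bins=100):
    """Count the query regions overlapping each of n_bins equal-width bins.

    queries is a list of (start, end) on the same chromosome as the feature.
    Returns None for features shorter than n_bins bases (they are excluded).
    Bins are reported 5' to 3', so minus-strand profiles are reversed.
    """
    length = feature_end - feature_start
    if length < n_bins:
        return None
    width = length / n_bins
    counts = []
    for i in range(n_bins):
        b_start = feature_start + i * width
        b_end = b_start + width
        counts.append(sum(1 for q_start, q_end in queries
                          if q_start < b_end and b_start < q_end))
    return counts[::-1] if strand == "-" else counts

def mean_with_ci(profiles):
    """Per-bin (mean, half-width) where half-width = 1.96 * standard error."""
    n = len(profiles)
    summary = []
    for values in zip(*profiles):
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / (n - 1) if n > 1 else 0.0
        summary.append((mean, 1.96 * math.sqrt(var / n)))
    return summary

# Hypothetical toy features, with 4 bins instead of 100 to keep the output short.
plus_profile = bin_coverage(0, 8, "+", [(0, 3)], n_bins=4)
minus_profile = bin_coverage(100, 108, "-", [(100, 101)], n_bins=4)
summary = mean_with_ci([plus_profile, minus_profile])
print(plus_profile, minus_profile, summary)
```

The ribbon in the figure would then span mean minus half-width to mean plus half-width for each bin.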

4 motifRG analysis results

4.1 Top motifs discovered using motifRG

Figure 6 : Top motifs discovered in the sequences of the query regions

Motif 1 : Consensus: CCATGG

Motif 2 : Consensus: GGCGGC

Motif 3 : Consensus: CGCTGC

Motif 4 : Consensus: CCGCCG

4.2 motifRG motif discovery statistics

Table 2 : motifRG motif discovery statistics. fg: foreground; bg: background; hits: number of motif hits; seq: number of sequences that contain the motif; frac: fraction of all sequences that contain the motif; ratio: ratio of the foreground motif fraction to the background motif fraction.
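To make the column definitions of Table 2 concrete, the sketch below computes hits, seq, frac, and ratio for one consensus motif. It only illustrates the definitions and is not how motifRG discovers or scores motifs. The function name, the toy sequences, exact forward-strand matching, and the counting of overlapping hits are all assumptions.

```python
import re

def motif_stats(pattern, foreground, background):
    """Compute hits, seq, frac for each sequence set, plus the fg/bg ratio."""
    # A lookahead matches at every start position, so overlapping hits count.
    regex = re.compile(f"(?={pattern})")

    def summarize(sequences):
        per_sequence = [len(regex.findall(s)) for s in sequences]
        hits = sum(per_sequence)                       # total motif occurrences
        seq = sum(1 for h in per_sequence if h > 0)    # sequences with >= 1 hit
        return hits, seq, seq / len(sequences)

    fg_hits, fg_seq, fg_frac = summarize(foreground)
    bg_hits, bg_seq, bg_frac = summarize(background)
    ratio = fg_frac / bg_frac if bg_frac > 0 else float("inf")
    return {
        "fg_hits": fg_hits, "fg_seq": fg_seq, "fg_frac": fg_frac,
        "bg_hits": bg_hits, "bg_seq": bg_seq, "bg_frac": bg_frac,
        "ratio": ratio,
    }

# Hypothetical toy sequences, not taken from the report.
foreground = ["AACCATGGTT", "CCATGGCCATGG", "TTTTTTTT", "GGCCATGG"]
background = ["ACGTACGT", "CCATGGAA", "TTTTAAAA", "GGGGCCCC"]
stats = motif_stats("CCATGG", foreground, background)
print(stats)
```

Here the motif appears four times in three of four foreground sequences and once in one of four background sequences, so frac is 0.75 versus 0.25 and ratio is 3.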

5 GO Term Analysis Results

5.1 Biological Processes

Table 3 : Significant Biological Process GO terms (FDR < 0.1) enriched for genes that overlap query regions

5.2 Molecular Functions

Table 4 : Significant Molecular Function GO terms (FDR < 0.1) enriched for genes that overlap query regions

5.3 Cellular Compartments

Table 5 : Significant Cellular Compartment GO terms (FDR < 0.1) enriched for genes that overlap query regions

6 Gene Set Enrichment Analysis Results

Table 6 : Significant MSigDB Gene Sets (FDR < 0.1) enriched for genes that overlap query regions
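The FDR < 0.1 threshold used in Tables 3 to 6 implies that raw enrichment p-values are corrected for multiple testing before filtering. The report does not state which correction is applied; the sketch below assumes the widely used Benjamini-Hochberg procedure, and the term names and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical terms and raw p-values, not taken from the report.
terms = ["term_a", "term_b", "term_c", "term_d"]
raw_p = [0.001, 0.02, 0.04, 0.30]
fdr = benjamini_hochberg(raw_p)
significant = [t for t, q in zip(terms, fdr) if q < 0.1]
print(list(zip(terms, fdr)), significant)
```

With these values the first three terms pass the 0.1 cutoff and the fourth does not.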

7 Acknowledgements

RCAS is developed in the group of Altuna Akalin (head of the Scientific Bioinformatics Platform) by Bora Uyar (Bioinformatics Scientist), Dilmurat Yusuf (Bioinformatics Scientist) and Ricardo Wurmus (System Administrator) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.

RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).

8 Session Information

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      stats4    parallel  methods   stats     graphics  grDevices
##  [8] utils     datasets  base     
## 
## other attached packages:
##  [1] org.Hs.eg.db_3.4.0                RCAS_1.1.1                       
##  [3] motifRG_1.18.0                    BSgenome.Hsapiens.UCSC.hg19_1.4.0
##  [5] BSgenome_1.42.0                   rtracklayer_1.34.1               
##  [7] GenomicRanges_1.26.1              GenomeInfoDb_1.10.1              
##  [9] seqLogo_1.40.0                    Biostrings_2.42.0                
## [11] XVector_0.14.0                    topGO_2.26.0                     
## [13] SparseM_1.74                      GO.db_3.4.0                      
## [15] AnnotationDbi_1.36.0              IRanges_2.8.1                    
## [17] S4Vectors_0.12.0                  Biobase_2.34.0                   
## [19] graph_1.52.0                      BiocGenerics_0.20.0              
## [21] data.table_1.9.8                  DT_0.2                           
## [23] plotly_4.5.6                      ggplot2_2.2.0                    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8                lattice_0.20-34           
##  [3] tidyr_0.6.0                Rsamtools_1.26.1          
##  [5] assertthat_0.1             rprojroot_1.1             
##  [7] digest_0.6.10              gridBase_0.4-7            
##  [9] R6_2.2.0                   plyr_1.8.4                
## [11] backports_1.0.4            RSQLite_1.0.0             
## [13] evaluate_0.10              httr_1.2.1                
## [15] zlibbioc_1.20.0            GenomicFeatures_1.26.0    
## [17] lazyeval_0.2.0             Matrix_1.2-7.1            
## [19] rmarkdown_1.2              BiocParallel_1.8.1        
## [21] readr_1.0.0                stringr_1.1.0             
## [23] htmlwidgets_0.8            RCurl_1.95-4.8            
## [25] biomaRt_2.30.0             munsell_0.4.3             
## [27] base64enc_0.1-3            htmltools_0.3.5           
## [29] SummarizedExperiment_1.4.0 tibble_1.2                
## [31] matrixStats_0.51.0         XML_3.98-1.5              
## [33] viridisLite_0.1.3          dplyr_0.5.0               
## [35] GenomicAlignments_1.10.0   bitops_1.0-6              
## [37] jsonlite_1.1               gtable_0.2.0              
## [39] DBI_0.5-1                  magrittr_1.5              
## [41] scales_0.4.1               KernSmooth_2.23-15        
## [43] stringi_1.1.2              impute_1.48.0             
## [45] reshape2_1.4.2             RColorBrewer_1.1-2        
## [47] tools_3.3.2                seqPattern_1.6.0          
## [49] purrr_0.2.2                yaml_2.1.14               
## [51] plotrix_3.6-3              colorspace_1.3-1          
## [53] genomation_1.6.0           knitr_1.15.1

The RNA Centric Analysis System Report

Bora Uyar, Dilmurat Yusuf, Ricardo Wurmus, Altuna Akalin

2016-12-22