Final Report

PiGx BSseq is a data-processing pipeline for bisulfite experiments; it automates the analysis of raw single-end or paired-end fastq reads, including quality control, trimming and alignment. The software also provides post-mapping analysis, such as differential-methylation detection, methylation segmentation, and annotation of such detected regions. It was first developed by the Akalin group at MDC in Berlin in 2017.

This report was generated with PiGx BSseq version 0.0.10.

The results of this pipeline were saved to the output directory listed in the following table; for the remainder of this report, this path will be referred to as ‘[out]’.

Parameters Values
Sample ID tet2
Source directory /scratch/AG_Akalin/bosberg/pigx_usecase/in
Output directory ([out]) /scratch/AG_Akalin/bosberg/pigx_usecase/pigx_out_standardized
Reference genome assembly mm10

Methylation Calling

The following table(s) lists some of the basic parameters under which the pipeline calculations were performed as well as output files that may be of interest to the user for further analysis.

Parameters Values
Minimum Coverage 10
Minimim Mapping Quality 10

Output files:

Format location
MethylRaw Object [out]/06_methyl_calls/tet2_se_bt2.sorted.deduped_methylRaw.RDS
Bam file [out]/05_sorting_deduplication/tet2_se_bt2.sorted.deduped.bam

Extract Methylation Calls

We first extract the methylation calls from the sequence alignment produced by the bisulfite mapper Bismark (Krueger and Andrews 2011) using the processBismarkAln() function of methylKit (Akalin et al. 2012) –a package for the analysis of DNA methylation profiles. We apply filters based on a minimum coverage of 10 and a mapping quality of at least 10 , as indicated in the parameters table. Here we show some simple statistics related to the distribution of methylation and coverage in the sample.

Segmentation

Segmentation based on methylation provides a way to compress information on methylation at the base-pair level into regional resolution. This allows for the detection of regions with methylation patterns that might warrant further investigation.

Output files:

format location
Segments (BED) [out]/08_segmentation/tet2_se_bt2.sorted.deduped_meth_segments.bed
Segments (GRanges) [out]/08_segmentation/tet2_se_bt2.sorted.deduped_meth_segments_gr.RDS

Segmentation of Methylation Profile

Segmentation of the methylation profile is done using the methSeg() function, where change-points in the genome-wide signal are recorded and the genome is partitioned into regions between consecutive change-points. This approach is typically used in the detection of copy-number variation (Klambauer et al. 2012) but can be applied to methylome segmentation as well (Wreczycka et al. 2017). Here, the identified segments are further clustered based on their average methylation signal, using a mixture-modeling approach, which permits the detection of distinct regions inside the genome (Wreczycka et al. 2017).

Various diagnostic plots of the segmentation profile observed

Various diagnostic plots of the segmentation profile observed

Export to BED

We export the above regions to a BED file, which can be loaded into any genome browser (such as IGV or UCSC ) to allow for further analysis, annotation and visualisation.

Annotation of Segments

The annotation of the identified regions with genomic features allows for a better understanding and characterization of detected regions.

Pigx first searches for a reference gene set for the given genome assembly in the path specified in the settings file. Upon failure to find such a file in this location, Pigx allows the user to query the UCSC table browser directly to fetch the reference gene set using the rtracklayer Package (Lawrence, Gentleman, and Carey 2009), when the option webfetch (in the settings file) is set to True -or, equivalently, yes. When this option is set to False –or no–, and the reference gene files are not found, the corresponding sections are simply omitted from this report. Upon acquiring the reference genes, we can determine different features (e.g. promoter, exon, intron) intrinsic to each gene, using the readTranscriptFeatures() function from genomation (Akalin et al. 2014) –likewise, for CpG islands and shores.

In this particular execution of the pipeline, the reference gene set was obtained, while the option webfetch was set to = FALSE; this implies that the .bed file was found locally on the machine.

Here, we plot the average methylation per segment group and the overlap with gene features for the input reference gene set.

Finally, we consider the average methylation over the promoter regions in reference gene set provided. In this case, all CpG sites with coverage above threshold are weighted equally and aligned to the transcription start site (TSS) in the direction of transcription.

Session Information

This report concludes with a summary of the session parameters used in the generation of the data conveyed here. Thank you for using PiGx

## R version 3.5.0 (2018-04-23)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /gnu/store/ccad09zgj85251ksp5xd71ds3cz3f7gp-openblas-0.2.20/lib/libopenblasp-r0.2.20.so
## 
## locale:
## [1] en_US.UTF-8
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] rtracklayer_1.40.3   AnnotationHub_2.12.0 DT_0.4              
##  [4] methylKit_1.6.1      GenomicRanges_1.32.3 GenomeInfoDb_1.16.0 
##  [7] IRanges_2.14.10      S4Vectors_0.18.3     BiocGenerics_0.26.0 
## [10] genomation_1.12.0   
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  matrixStats_0.53.1           
##  [3] bit64_0.9-7                   httr_1.3.1                   
##  [5] rprojroot_1.3-2               numDeriv_2016.8-1            
##  [7] tools_3.5.0                   backports_1.1.2              
##  [9] R6_2.2.2                      KernSmooth_2.23-15           
## [11] DBI_1.0.0                     lazyeval_0.2.1               
## [13] colorspace_1.3-2              seqPattern_1.12.0            
## [15] fastseg_1.26.0                bit_1.1-14                   
## [17] compiler_3.5.0                Biobase_2.40.0               
## [19] DelayedArray_0.6.1            scales_0.5.0                 
## [21] readr_1.1.1                   stringr_1.3.1                
## [23] digest_0.6.15                 Rsamtools_1.32.0             
## [25] rmarkdown_1.10                R.utils_2.6.0                
## [27] XVector_0.20.0                base64enc_0.1-3              
## [29] pkgconfig_2.0.1               htmltools_0.3.6              
## [31] plotrix_3.7-2                 bbmle_1.0.20                 
## [33] limma_3.36.1                  BSgenome_1.48.0              
## [35] highr_0.7                     htmlwidgets_1.2              
## [37] rlang_0.2.1                   RSQLite_2.1.1                
## [39] impute_1.54.0                 BiocInstaller_1.30.0         
## [41] shiny_1.1.0                   jsonlite_1.5                 
## [43] mclust_5.4                    BiocParallel_1.14.1          
## [45] gtools_3.5.0                  R.oo_1.22.0                  
## [47] RCurl_1.95-0.1.2              magrittr_1.5                 
## [49] GenomeInfoDbData_0.99.1       Matrix_1.2-14                
## [51] Rcpp_0.12.17                  munsell_0.5.0                
## [53] R.methodsS3_1.7.1             stringi_1.2.3                
## [55] yaml_2.1.19                   MASS_7.3-50                  
## [57] SummarizedExperiment_1.10.1   plyr_1.8.4                   
## [59] qvalue_2.12.0                 blob_1.1.1                   
## [61] promises_1.0.1                lattice_0.20-35              
## [63] Biostrings_2.48.0             splines_3.5.0                
## [65] hms_0.4.2                     knitr_1.20                   
## [67] pillar_1.2.3                  reshape2_1.4.3               
## [69] XML_3.98-1.11                 evaluate_0.10.1              
## [71] data.table_1.11.4             httpuv_1.4.4.1               
## [73] gtable_0.2.0                  ggplot2_2.2.1                
## [75] emdbook_1.3.10                gridBase_0.4-7               
## [77] mime_0.5                      xtable_1.8-2                 
## [79] coda_0.19-1                   later_0.7.3                  
## [81] tibble_1.4.2                  GenomicAlignments_1.16.0     
## [83] AnnotationDbi_1.42.1          memoise_1.1.0                
## [85] interactiveDisplayBase_1.18.0

References

Akalin, Altuna, Vedran Franke, Kristian Vlahovicek, Christopher Mason, and Dirk Schubeler. 2014. “Genomation: A Toolkit to Summarize, Annotate and Visualize Genomic Intervals.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btu775.

Akalin, Altuna, Matthias Kormaksson, Sheng Li, Francine E. Garrett-Bakelman, Maria E. Figueroa, Ari Melnick, and Christopher E Mason. 2012. “MethylKit: A Comprehensive R Package for the Analysis of Genome-Wide Dna Methylation Profiles.” Genome Biology 13 (10):R87.

Klambauer, Günter, Karin Schwarzbauer, Andreas Mayr, Djork-Arné Clevert, Andreas Mitterecker, Ulrich Bodenhofer, and Sepp Hochreiter. 2012. “Cn.MOPS: Mixture of Poissons for Discovering Copy Number Variations in Next-Generation Sequencing Data with a Low False Discovery Rate.” Nucleic Acids Research 40 (9):e69. https://doi.org/10.1093/nar/gks003.

Krueger, Felix, and Simon R. Andrews. 2011. “Bismark: A Flexible Aligner and Methylation Caller for Bisulfite-Seq Applications.” Bioinformatics 27 (11):1571–2. https://doi.org/10.1093/bioinformatics/btr167.

Lawrence, Michael, Robert Gentleman, and Vincent Carey. 2009. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25:1841–2. https://doi.org/10.1093/bioinformatics/btp328.

Wreczycka, Katarzyna, Alexander Gosdschan, Dilmurat Yusuf, Bjoern Gruening, Yassen Assenov, and Altuna Akalin. 2017. “Strategies for Analyzing Bisulfite Sequencing Data.” bioRxiv. Cold Spring Harbor Labs Journals. https://doi.org/10.1101/109512.