This pipeline performs mutation analysis of SARS-CoV-2: it reports and quantifies the occurrence of lineages and single-nucleotide (single-NT) mutations.

The visualizations below provide an overview of how the variants of concern (VOCs) found in the analyzed samples evolve across the given time points and locations. The abundance values for the variants are derived by deconvolution. For details, please consult the documentation.

This pipeline is part of PiGx, a collection of highly reproducible genomics pipelines developed by the Bioinformatics & Omics Data Science platform at the Berlin Institute for Medical Systems Biology (BIMSB).

1 Lineage abundance

These plots provide an overview of the relative frequency dynamics of identified lineages at specific wastewater sampling locations over time.

1.1 Lineage abundances stacked

The summary plot shows the results pooled by day across all locations, using a weighted average with the number of reads as weights.
Use the tabs to access the unpooled plots for each location.
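
The pooling step can be illustrated with a minimal base-R sketch. The data frame, its column names (date, location, lineage, frequency, reads) and the helper pool_by_day() are assumptions made for this example and are not taken from the pipeline itself.

```r
# Hypothetical per-sample lineage frequencies; all names and values are
# illustrative assumptions, not pipeline output.
samples <- data.frame(
  date      = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  location  = c("A", "B", "A"),
  lineage   = c("B.1.1.7", "B.1.1.7", "B.1.1.7"),
  frequency = c(0.20, 0.60, 0.30),
  reads     = c(1000, 3000, 2000)
)

# Pool across locations for each day and lineage, using the read number
# of each sample as its weight.
pool_by_day <- function(df) {
  groups <- split(df, list(df$date, df$lineage), drop = TRUE)
  pooled <- lapply(groups, function(g) {
    data.frame(
      date      = g$date[1],
      lineage   = g$lineage[1],
      frequency = weighted.mean(g$frequency, w = g$reads)
    )
  })
  out <- do.call(rbind, pooled)
  rownames(out) <- NULL
  out
}

pooled <- pool_by_day(samples)
```

In this sketch a sample with three times as many reads contributes three times as much to the pooled value for its day.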

Summary over all locations

USA:New York, Brooklyn, C

USA:New York, Queens, D

not applicable

USA:New York, Queens, C

USA:New York, Queens

USA:New York, Manhattan

USA:New York, Brooklyn

USA:New York, Brooklyn, A

USA:New York, Bronx

USA:New York, Queens, H

USA:New York, Queens, G

USA: New York

1.2 Lineage abundances geo-localized

This plot visualizes the relative abundances of the identified lineages at the provided sampling locations.

Locations of wastewater processing plants have been generated arbitrarily and do not correspond to actual locations.

Use the slider to select a specific date or hit the Play button to display all snapshots successively. Click on a lineage in the legend to toggle its visibility in the map; double-click to view only the selected lineage.

2 Mutation dynamics

The following plots provide an overview of detected single nucleotide mutations in different locations and how their relative frequency changes over time. Furthermore, mutations showing a significant frequency increase over time are highlighted.

Mutation notation
Mutations are denoted using the pattern
gene :: protein-sequence mutation : NT-sequence mutation
Please note that this translation was done for each single mutation independently. Combinations of single-NT mutations that together may lead to a different amino acid are not yet taken into account.
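
As an illustration of the notation, the following base-R sketch splits a label into its three parts. The helper parse_mutation() and the example label are hypothetical; the exact strings used in the report (for instance, spacing around the separators) may differ.

```r
# Split a label of the form "gene::protein-mutation:NT-mutation" into
# its parts. parse_mutation() is a hypothetical helper for illustration.
parse_mutation <- function(label) {
  gene_rest <- strsplit(label, "::", fixed = TRUE)[[1]]
  stopifnot(length(gene_rest) == 2)
  aa_nt <- strsplit(gene_rest[2], ":", fixed = TRUE)[[1]]
  stopifnot(length(aa_nt) == 2)
  list(
    gene    = trimws(gene_rest[1]),  # gene name
    protein = trimws(aa_nt[1]),      # amino acid change
    nt      = trimws(aa_nt[2])       # nucleotide change
  )
}

# Illustrative label: spike gene, D614G on the protein level.
m <- parse_mutation("S::D614G:A23403G")
```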

2.1 Increasing mutations

To show the dynamics of mutations whose frequency changes significantly over time, a linear regression model was applied to the mutation results across all samples. The following table shows the mutations with the strongest increasing trend (p <= 0.05). The number of trending mutations shown is restricted to the top 20.
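
The selection described above can be sketched in base R as follows. This is a simplified illustration, not the pipeline's implementation: the input layout, the column names, the helper fit_trend() and the ranking by slope are all assumptions.

```r
# Hypothetical long-format input: one row per mutation and sampling day.
days <- 0:9
freqs <- data.frame(
  mutation  = rep(c("mut_up", "mut_flat"), each = length(days)),
  day       = rep(days, times = 2),
  frequency = c(0.05 * days + 0.01 * (days %% 2),  # clear upward trend
                0.30 + 0.01 * (days %% 2))         # no real trend
)

# Fit frequency ~ day for one mutation; return slope and its p-value.
fit_trend <- function(df) {
  coefs <- summary(lm(frequency ~ day, data = df))$coefficients
  data.frame(
    mutation = df$mutation[1],
    slope    = coefs["day", "Estimate"],
    p_value  = coefs["day", "Pr(>|t|)"]
  )
}

res <- do.call(rbind, lapply(split(freqs, freqs$mutation), fit_trend))

# Keep increasing mutations with p <= 0.05; assumption: rank by slope
# and report at most 20 of them.
significant <- res[res$slope > 0 & res$p_value <= 0.05, ]
significant <- head(significant[order(-significant$slope), ], 20)
```

Here only mut_up survives the filter; mut_flat has a slightly positive slope but its p-value is far above 0.05.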

Mutations with significant increase in frequency over time

Download significant_mutations.csv
Download lm_res_all_mutations.csv (unfiltered)

3 Data summaries and download

Frequencies per lineage per sample, derived by deconvolution, pooled by weighted mean by read number Download variant_frequencies.csv

Frequencies per mutation per sample Download mutation_frequencies.csv

Counts of mutations found across all samples and per sample Download mutation_counts.csv

4 Detailed per-sample reports

For every sample three reports are generated:

  • a QC report reporting general statistics and amplicon coverage
  • a lineage report including tables summarizing the mutation calling and the deconvolution results for the abundance of VOCs
  • a taxonomic classification report including a pie chart showing the analysis of the unaligned reads.

The reports for each sample can be accessed here:

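# Build a table that links every sample to its three per-sample reports.
library(data.table)  # fread()
library(DT)          # datatable()

sample_sheet <- fread(params$sample_sheet)
sample_names <- sample_sheet$name

# File suffix and link text for each report type.
reports <- list(list("suffix" = ".variantreport_p_sample.html",
                     "name"   = "variant report"),
                list("suffix" = ".qc_report_per_sample.html",
                     "name"   = "QC report"),
                list("suffix" = ".taxonomic_classification.html",
                     "name"   = "taxonomic classification"))

df <- as.data.frame( dplyr::select(sample_sheet, name, location_name, date) )

# One list of HTML links per sample; the href value is quoted so that
# sample names containing spaces still produce valid links.
links <- lapply(sample_names, function (sample) {
    as.vector(lapply(reports, function (report) {
        paste0("<a href='", sample, report$suffix, "'>", report$name, "</a>")
    }))
})
df$reports <- links

# Scrollable table of samples with their report links.
datatable(df, options = list( fixedColumns = TRUE,
                               scrollY = 180,
                               scrollX = TRUE,
                               scroller = TRUE,
                               dom = 'Slfrtip'))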

5 Discarded samples

Prior to visualization and regression analysis, each sample is assigned a quality score based on the proportion of the reference genome that is covered and the proportion of signature mutation locations that are covered. Results from samples with insufficient coverage (< 90%) are not included in the visualizations or the linear regression calculations.
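
A minimal base-R sketch of such a filter is shown below. The column names and values are hypothetical, and the text does not state how the two proportions are combined into one score; the sketch assumes that both must reach the 90% threshold.

```r
# Hypothetical per-sample coverage proportions (0 to 1); names and
# values are illustrative assumptions.
quality <- data.frame(
  name                 = c("s1", "s2", "s3"),
  genome_covered       = c(0.98, 0.85, 0.95),
  sig_mutation_covered = c(0.97, 0.99, 0.60)
)

threshold <- 0.90  # corresponds to the "< 90%" cutoff in the text

# Assumption: a sample passes only if both proportions reach the threshold.
passes <- quality$genome_covered >= threshold &
  quality$sig_mutation_covered >= threshold

kept      <- quality[passes, ]   # used for plots and regression
discarded <- quality[!passes, ]  # listed as discarded samples
```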

Table 1: Discarded samples that did not meet the provided sample quality threshold