Getting Started with Flexynesis

Quick Start

Install

The installation process via pip/conda (assuming you have conda installed) should take a few minutes.

# create an environment with python 3.11 
conda create --name flexynesisenv python==3.11
conda activate flexynesisenv
# install latest version from pypi (https://pypi.org/project/flexynesis)
# make sure to use python3.11*
python -m pip install flexynesis --upgrade  

Test the installation

Download a dataset and verify the Flexynesis installation with a quick test run.

The test run should finish within a minute.

curl -L -o dataset1.tgz \
https://bimsbstatic.mdc-berlin.de/akalin/buyar/flexynesis-benchmark-datasets/dataset1.tgz

tar -xzvf dataset1.tgz

conda activate flexynesisenv

flexynesis --data_path dataset1 \
  --model_class DirectPred \
  --target_variables Erlotinib \
  --hpo_iter 1 \
  --features_top_percentile 5 \
  --data_types gex,cnv

Input Dataset Description

Flexynesis expects as input a path to a data folder with the following structure:

InputFolder/
|-- train/
|   |-- omics1.csv
|   |-- omics2.csv
|   |-- ...
|   `-- clin.csv
`-- test/
    |-- omics1.csv
    |-- omics2.csv
    |-- ...
    `-- clin.csv

File contents

clin.csv

clin.csv contains the sample metadata. The first column contains unique sample identifiers. The other columns contain sample-associated clinical variables. NA values are allowed in the clinical variables.

v1,v2
s1,a,b
s2,c,d
s3,e,f

omics.csv

The first column of the feature tables must contain unique feature identifiers (e.g. gene names). The column names must be sample identifiers that overlap with those in clin.csv; they do not have to be identical or in the same order. Samples from clin.csv that are not represented in the omics table will be dropped.

s1,s2,s3
g1,0,1,2
g2,3,3,5
g3,2,3,4

Concordance between train/test splits

The corresponding omics files in train/test splits must contain overlapping feature names (they don't have to be identical or in the same order). The clin.csv files in train/test must contain matching clinical variables.
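
Assuming the folder layout described above, one quick way to sanity-check the feature-name overlap between the splits is with standard shell tools (omics1.csv stands in for your actual file name):

# extract and sort the feature names (first column, skipping the header row)
tail -n +2 InputFolder/train/omics1.csv | cut -d, -f1 | sort > train_features.txt
tail -n +2 InputFolder/test/omics1.csv | cut -d, -f1 | sort > test_features.txt
# count how many features the two splits share
comm -12 train_features.txt test_features.txt | wc -l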

Download a curated dataset

Before using Flexynesis on your own dataset, it is highly recommended that you familiarize yourself with datasets we have already curated and used for training and testing Flexynesis models.

Below are examples of how to use Flexynesis from the command line for multi-omic data integration and clinical variable prediction.

To demonstrate the various command-line options and the different ways to run Flexynesis, we will use a multi-omic dataset of the Lower Grade Glioma (LGG) and Glioblastoma Multiforme (GBM) merged cohorts. The data were downloaded from cBioPortal, split 70/30 into train/test subsets, and used as input to Flexynesis.

wget -O lgggbm_tcga_pub_processed.tgz https://bimsbstatic.mdc-berlin.de/akalin/buyar/flexynesis-benchmark-datasets/lgggbm_tcga_pub_processed.tgz 
tar -xzvf lgggbm_tcga_pub_processed.tgz

The example dataset contains 556 training samples and 238 test samples. Each sample has both copy number variation and mutation data. The mutation data was converted into a binary gene-by-sample matrix, where the value for a gene in a given sample is 1 if the gene is mutated in that sample and 0 otherwise.
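
For illustration, a few rows of such a binary mutation matrix (hypothetical values) would look like this in the omics file format described above:

s1,s2,s3
g1,0,1,0
g2,1,1,0
g3,0,0,1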

Supervised training

Minimal setup

For supervised training, the minimum required options to run Flexynesis are:

  1. Path to a dataset folder
  2. Selection of a tool/model
  3. One target variable, which can be numerical (regression) or categorical (classification).
  4. A list of data types to use for modeling; these are the filename prefixes found in the train/test folders (e.g. mut.csv => mut). While Flexynesis is built for multi-omic integration, a single data modality is also acceptable.

While it is not a required argument, we set the hyperparameter optimisation steps to 1 to avoid lengthy run times for demonstration purposes.

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --target_variables KARNOFSKY_PERFORMANCE_SCORE \
            --data_types mut \
            --hpo_iter 1

Multi-modal training

To use multiple data modalities, provide a comma-separated list of data type names. For example, to utilize both the mutation and CNA data matrices for training:

flexynesis  --data_types mut,cna  <... other arguments> 

Different options for the outcome variables

Flexynesis supports both single-task and multi-task training. We can provide one or more target variables, and optionally survival variables, as input, and Flexynesis will build the appropriate model architecture. If a selected variable is numerical, a Multi-Layer Perceptron (MLP) with MSE loss is used. If a categorical variable is provided, an MLP with cross-entropy loss is used. If survival variables are provided, an MLP with Cox Proportional Hazards loss is attached to the model.

All the user has to do is provide a list of variable names:

Example: Regression

The target variable KARNOFSKY_PERFORMANCE_SCORE is numerical, so this will be modeled as a regression problem.

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --target_variables KARNOFSKY_PERFORMANCE_SCORE \
            --data_types mut,cna \
            --hpo_iter 1

Example: Classification

The target variable HISTOLOGICAL_DIAGNOSIS is categorical, so this will be modeled as a classification problem.

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --target_variables HISTOLOGICAL_DIAGNOSIS  \
            --data_types mut,cna \
            --hpo_iter 1

Example: Survival

For survival analysis, two separate variables are required: a numeric event variable (consisting of 0s and 1s, where 1 means an event such as disease progression or death has occurred) and a numeric time variable, which indicates the time elapsed until the event or the last patient follow-up.
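
For illustration, a hypothetical excerpt of the corresponding columns in clin.csv might look like this (values are made up):

OS_STATUS,OS_MONTHS
s1,1,14.2
s2,0,33.5
s3,1,6.1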

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --surv_event_var OS_STATUS \
            --surv_time_var OS_MONTHS \
            --data_types mut,cna \
            --hpo_iter 1

Example: Mixed/multi-task model

Flexynesis can be trained with multiple target variables, which can be a mixture of regression/classification/survival tasks.

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --target_variables HISTOLOGICAL_DIAGNOSIS,KARNOFSKY_PERFORMANCE_SCORE \
            --surv_event_var OS_STATUS \
            --surv_time_var OS_MONTHS \
            --data_types mut,cna \
            --hpo_iter 1

Using different model architectures

For the supervised tasks, the user can easily switch between different model architectures.

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class [DirectPred|supervised_vae|MultiTripletNetwork|GNN|CrossModalPred] \
            --target_variables HISTOLOGICAL_DIAGNOSIS,KARNOFSKY_PERFORMANCE_SCORE \
            --surv_event_var OS_STATUS \
            --surv_time_var OS_MONTHS \
            --data_types mut,cna \
            --hpo_iter 1

Model-specific exceptions

However, there are model-specific exceptions due to the nature of the model architectures.

  1. MultiTripletNetwork requires the first target variable to be categorical, as triplet loss is by definition computed on categorical variables.
  2. GNN: in the case of multi-omics input, the features must follow the same naming convention across modalities. A further restriction is that GNNs only work if the omics features are genes; they would not work if the features were, for instance, CpG methylation sites. The reason is that GNNs require a prior knowledge network, which is currently set to the STRING database. Other model architectures work with any kind of feature, and the feature nomenclature is not important. The current GNN implementation also uses early fusion by default.

GNNs have an additional option, --gnn_conv_type, which determines the type of graph convolution algorithm. By default it is set to GC, but it can be changed to SAGE or GCN.
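
For example, a GNN run that selects the SAGE convolution might look like this (a sketch combining the flags documented above):

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class GNN \
            --target_variables HISTOLOGICAL_DIAGNOSIS \
            --data_types mut,cna \
            --gnn_conv_type SAGE \
            --hpo_iter 1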

Modality fusion options

Flexynesis currently supports two main ways of fusing different omics data modalities:

  1. Early fusion: the input data matrices are concatenated first and then pushed through the networks.
  2. Intermediate fusion: the input data matrices are first pushed through the networks to obtain modality-specific embeddings, which are then concatenated to serve as input for the supervisor MLPs.

The fusion option can be set using the --fusion flag:

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --target_variables HISTOLOGICAL_DIAGNOSIS \
            --data_types mut,cna \
            --fusion intermediate \
            --hpo_iter 1 

Unsupervised Training

In the absence of any target or survival variables, we can use a VAE architecture to carry out unsupervised training.

Set the model class to supervised_vae and leave out the variable arguments:

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class supervised_vae \
            --data_types mut,cna \
            --hpo_iter 1

Cross-modality Training

We have implemented a special case of VAEs in which the input and output data layers can be set to different data modalities. The purpose of a cross-modality encoder is to learn embeddings that can translate one data modality into another. The cross-modality encoder we implemented supports single or multiple input layers, and one or more target/survival variables can also be added to the model.

The user needs to specify which data layers to use as input and which to use as output (the reconstruction target).

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class CrossModalPred \
            --data_types mut,cna \
            --input_layers mut \
            --output_layers cna \
            --hpo_iter 1

Both the input and output layers can be set to one or more data modalities, drawn from the modalities given in the --data_types flag. If --data_types is set to mut,cna, then --input_layers can be set to mut, cna, or mut,cna, and likewise for --output_layers. However, if --input_layers and --output_layers are set to the same values, the model behaves like supervised_vae, because the reconstruction target is identical to the input layers.

Multi-modal input and multiple target variables:

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class CrossModalPred \
            --data_types mut,cna \
            --input_layers mut,cna \
            --output_layers cna \
            --target_variables HISTOLOGICAL_DIAGNOSIS,AGE \
            --hpo_iter 1

Fine-tuning options

To enable fine-tuning, where Flexynesis builds a model on the training dataset, fine-tunes it on a portion of the test dataset, and evaluates it on the remaining test samples, set --finetuning_samples to a positive integer.

For instance, to fine-tune the model on a randomly drawn subset of 50 samples:

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --data_types mut,cna \
            --target_variables HISTOLOGICAL_DIAGNOSIS \
            --finetuning_samples 50 \
            --hpo_iter 1

Feature filtering options

By default, Flexynesis performs feature selection controlled by the following flags; an example combining them is shown after the list.

  1. --variance_threshold 1: removes the lowest 1% of features, based on their variance in the training data.
  2. --features_top_percentile 20: triggers the Laplacian scoring module to rank features by this score; the top 20% of features are kept.
  3. --correlation_threshold 0.8: among the top-ranking features, highly redundant features are dropped based on a Pearson correlation cut-off, following the Laplacian score rankings.
  4. --restrict_to_features <filepath>: if the user provides a path to a list of feature names, the analysis is restricted to only these features.
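
For example, the filtering flags can be combined in a single run (a sketch using the values from the list above):

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --data_types mut,cna \
            --target_variables HISTOLOGICAL_DIAGNOSIS \
            --variance_threshold 1 \
            --features_top_percentile 20 \
            --correlation_threshold 0.8 \
            --hpo_iter 1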

Hyperparameter optimisation

By default, Flexynesis runs 100 hyperparameter optimisation steps and stops early if no improvement has been observed in the last 10 iterations. These defaults can be changed with the --hpo_iter and --hpo_patience flags:

flexynesis  --data_path lgggbm_tcga_pub_processed \
            --model_class DirectPred \
            --data_types mut,cna \
            --target_variables HISTOLOGICAL_DIAGNOSIS \
            --hpo_iter 50 \
            --hpo_patience 20

Accelerating with GPUs

If you have access to GPUs on your system, they can be used to accelerate the training of models using the --use_gpu flag.

However, making GPUs accessible to torch is system-specific. Please contact your system administrator to confirm that GPUs are available and how to access them.
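
A quick way to check whether torch can see a GPU in your environment (this only verifies CUDA visibility from Python; it is not a Flexynesis command):

conda activate flexynesisenv
python -c "import torch; print(torch.cuda.is_available())"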

With Slurm

If you have the Slurm Workload Manager on your system, you can call flexynesis as follows:

conda activate flexynesisenv
srun --gpus=1 --pty flexynesis --use_gpu ...otherarguments

GridEngine

If you have an HPC system running GridEngine with GPU nodes, you may be able to request a node with GPUs. The important thing here is to request a GPU node with the proper CUDA version installed on it.

# request 1 GPU device node with CUDA version 12
qrsh -l gpu=1,cuda12
# activate your environment
conda activate flexynesisenv
flexynesis --use_gpu ...otherarguments 

Using Guix

You can also create a reproducible development environment or build a reproducible package of Flexynesis with GNU Guix. You will need at least the Guix channels listed in channels.scm. It also helps to have authorized the Inria substitute server to get binaries for CUDA-enabled packages. See this page for instructions on how to configure fetching binary substitutes from the build servers.

You can build a Guix package from the current committed state of your git checkout, using the specified state of Guix, like this:

guix time-machine -C channels.scm -- \
    build --no-grafts -f guix.scm

To enter an environment containing just Flexynesis:

guix time-machine -C channels.scm -- \
    shell --no-grafts -f guix.scm

To enter a development environment to hack on Flexynesis:

guix time-machine -C channels.scm -- \
    shell --no-grafts -Df guix.scm

To build a Docker image containing this package together with a matching Python installation, run:

guix time-machine -C channels.scm -- \
  pack -C none \
  -e '(load "guix.scm")' \
  -f docker \
  -S /bin=bin -S /lib=lib -S /share=share \
  glibc-locales coreutils bash python