
JASS analysis pipeline

Overview

We present here a Nextflow pipeline to harmonize, impute, and jointly analyze GWAS summary statistics.

The current pipeline integrates the following workflow:

(workflow diagram)

Quick Start - Run pipeline on test data

Install pipeline dependencies

This pipeline enables you to run multi-trait GWAS in a computationally efficient way.

  1. Install Nextflow as explained here: https://www.nextflow.io/docs/latest/getstarted.html
  2. Install the jass_preprocessing Python package or use its Docker container (see below).
  3. Install the JASS Python package or download its Docker container. A minimal install sketch is given after this list.
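
A minimal install sketch, assuming the two Python packages are published on PyPI under the names jass_preprocessing and jass (check their respective documentation if pip cannot find them):

    # install Nextflow with the official bootstrap script
    curl -s https://get.nextflow.io | bash
    # install the preprocessing and analysis packages (assumed PyPI names)
    pip install jass_preprocessing
    pip install jass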

Launch pipeline on test data

Clone the current repository locally:

    git clone https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline.git

Test data are located in the ${PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ folder

These are extracts of summary statistics from a trans-ancestry GWAS on blood traits (Chen et al.): WBC, white blood cell count; RBC, red blood cell count; PLT, platelet count.

They correspond to chromosomes 21 and 22 for the East Asian ancestry.

Once done, you can launch the pipeline as follows:

    NXF_VER=22.10.5 nextflow run jass_pipeline.nf \
        --ref_panel_WG {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/Ref_Panel/1000G_EAS_0_01_chr22_21.csv \
        --gwas_folder {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ \
        --meta_data {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/Data_test_EAS.csv \
        --region {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/All_Regions_ALL_ensemble_1000G_hg38_EAS.bed \
        --group {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/group.txt \
        -with-report jass_report.html \
        -c nextflow_local.config

See the description of required parameters in the next section. You can also set parameters in the jass_pipeline.nf header if preferred.

If all went well, the pipeline has cleaned the three summary statistics files, aligned them on the reference panel, and integrated them into one database. This database was then used to perform a multi-trait GWAS on the three traits.

Here are the output files produced by the pipeline:

  • ${PIPELINE_FOLDER}/harmonized_GWAS_1_file/ : genome-wide harmonized summary statistics
  • ${PIPELINE_FOLDER}/harmonized_GWAS_files/ : harmonized summary statistics by chromosome
  • ${PIPELINE_FOLDER}/init_table : database containing all summary statistics needed to perform the multi-trait GWAS
  • ${PIPELINE_FOLDER}/worktable : multi-trait GWAS results file
  • ${PIPELINE_FOLDER}/quadrant : quadrant plot of the multi-trait GWAS
  • ${PIPELINE_FOLDER}/manhattan : Manhattan plot of the multi-trait GWAS

To run the pipeline on your own data as it is configured for the tutorial (no imputation, no LD-score inference step), adapt the meta_data file describing your summary statistics, select an appropriate reference panel from the list below, and provide the path to the folder containing all your summary statistics.

See the sections below for running the imputation step and/or the LD-score step.

Required Input

The following items are necessary to run the JASS pipeline on real data (a combined example command is shown after the list):

  1. --meta_data: A path toward a meta-data file describing the GWAS (see the example file in ./input_files/test1.csv and the jass_preprocessing documentation)
  2. --gwas_folder: A path toward a folder containing the summary statistics to analyze
  3. --ref_panel_WG: A path toward a reference panel (whole genome as one file). See below to download curated reference panels by ancestry, derived from 1000 Genomes phase 3 on the hg38 assembly
  4. --region: Quasi LD-independent regions. These regions are used by JASS to quickly determine LD-independent hits across the genome. The input_files folder contains one region file per ancestry on the hg38 assembly. If working with a different assembly or population, you can provide 1Mb delimitations as a rough equivalent of these regions.
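
A minimal real-data invocation using only the required parameters might look like the sketch below (all paths and file names are placeholders to adapt to your own setup):

    nextflow run jass_pipeline.nf \
        --meta_data /path/to/my_meta_data.csv \
        --gwas_folder /path/to/my_summary_statistics/ \
        --ref_panel_WG /path/to/Ref_Panel/1000G_EUR_0_01.csv \
        --region /path/to/input_files/my_regions_hg38.bed \
        -c nextflow_local.config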

Optional parameters

  • --output_folder : A path toward a folder where pipeline results (inittable, worktable, ...) will be written. By default, results are published in the workflow directory.

To launch multi-trait GWAS at the end of the pipeline

You can use this pipeline to launch a batch of multi-trait GWAS at the end of the run. The two options below control this (an illustrative example follows the list):

  • --compute_project: flag indicating that you wish to perform multi-trait GWAS at the end of the pipeline
  • --group: if you wish to compute joint analyses with the pipeline, a group file with each phenotype group written on a separate line
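
For instance, a hypothetical group.txt defining two phenotype groups, one per line (trait identifiers are illustrative and must match the trait names used after harmonization; they are shown space-separated here, see the example group.txt in input_files for the exact format):

    CHEN_WBC CHEN_RBC CHEN_PLT
    CHEN_WBC CHEN_PLT

Both options are then added to the launch command, e.g. --compute_project together with --group /path/to/group.txt.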

Alternatively, use the jass create-project-data command line on the inittable file (all your harmonized summary statistics) stored in the init_table folder. See the JASS documentation for its usage (https://statistical-genetics.pages.pasteur.fr/jass/generating_joint_analysis.html).

To launch imputation based on summary statistics

For this step you will need to install an additional dependency: the RAISS Python package.
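
A minimal install sketch, assuming RAISS is published on PyPI under the name raiss (see its documentation otherwise):

    pip install raiss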

Imputed files will be stored in:

  • ${PIPELINE_FOLDER}/imputed_GWAS/ : imputed summary statistics by chromosome

Available reference panels

To make reference panels readily available, we use Git LFS. To download them, you can either install Git LFS or simply download the files manually through the GitLab interface and place them in the ./Ref_Panel folder.

Solution with git LFS:

    git lfs pull --include 1000G_AFR_0_01.csv

We provide reference panels of common SNPs (MAF > 1%) for the East Asian (EAS), African, South East Asian, Hispanic, and European populations. They were built from 1000 Genomes Project phase 3 data (hg38 build) for each ancestry (The 1000 Genomes Project Consortium 2015).

You can download all five panels using the command:

    git lfs fetch --all

or manually through the GitLab interface.

Running the LDSC regression covariance step

This step infers the multi-trait z-score null distribution, heritabilities, and genetic correlations using LD-score regression.

For accuracy, we recommend using LD-score regression to infer the multivariate distribution of z-scores under the null. The alternative, applied by default, is to estimate the null distribution by computing the covariance of z-scores with low genetic signal. Hence this step is not strictly required.

When computed for a large number of traits, this step can be computationally intensive and may require an HPC cluster.

  1. For hg37 and the EUR ancestry, download and extract the reference panel for LD-score into the pipeline folder:

    wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
    tar -jxvf eur_w_ld_chr.tar.bz2

If you want to analyze data in hg38 and for all ancestries, you can contact the main developer of this pipeline (hanna.julienne@pasteur.fr) to request the needed input files.

  2. To activate the LD-score option, turn this flag to true:

    --compute_LDSC_matrix=true

  3. Give the path of the reference panel:

    --LD_SCORE_folder ${PATH_to_REFERENCE}
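
Putting these options together, the launch command from the Quick Start can be extended as sketched below (the other parameters stay unchanged, represented here by "..."; the eur_w_ld_chr path corresponds to the panel extracted above):

    nextflow run jass_pipeline.nf \
        ... \
        --compute_LDSC_matrix=true \
        --LD_SCORE_folder {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/eur_w_ld_chr/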

If you run this additional step, the following outputs will be generated:

  • ${PIPELINE_FOLDER}/ldsc_data : data preprocessed to run the LD-score regression
  • ${PIPELINE_FOLDER}/h2_data : heritability estimation logs
  • ${PIPELINE_FOLDER}/cor_data : covariance estimation logs
  • ${PIPELINE_FOLDER}/Correlation_matrices : parsed covariance matrices

The resulting H0 covariance matrix will be integrated into the inittable file by the pipeline and hence taken into account in downstream multi-trait analyses.

Usage Example on HPC Cluster

If you are working with an HPC server (Slurm job scheduler), you can adapt the nextflow_sbatch.config file and launch the pipeline with a command like:

    sbatch --mem-per-cpu 32G -p common,dedicated,ggs --qos=long \
        --wrap "module load java/13.0.2; module load singularity/3.8.3; module load graphviz/2.42.3; ./nextflow run imputation_only.nf -with-report imput_report.html -with-timeline imput_timeline.html -c nextflow_slurm.config -qs 300"
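
For reference, a minimal Slurm executor block in a Nextflow configuration file looks like the sketch below (values are hypothetical; the repository's config files remain the authoritative templates):

    // sketch of a Slurm executor configuration for Nextflow
    process {
        executor = 'slurm'
        queue = 'common'   // partition to submit jobs to
        memory = '16 GB'   // per-process memory request
        cpus = 1           // per-process CPU request
    }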

Using Docker containers

Stable versions of the JASS tools and dependencies are available as Docker containers: