JASS analysis pipeline
Overview
We present a Nextflow pipeline to harmonize, impute, and jointly analyze GWAS summary statistics.
The current pipeline integrates the following workflow:
Quick Start - Run pipeline on test data
Install pipeline dependencies
This pipeline enables you to run multi-trait GWAS in a computationally efficient way.
- Install Nextflow as explained here: https://www.nextflow.io/docs/latest/getstarted.html
- Install the jass_preprocessing python package or use its docker container (see below).
- Install the JASS python package or download its docker container.
Launch pipeline on test data
Clone the current repository locally:
git clone https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline.git
Test data are located in the ${PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ folder.
These are extracts of summary statistics from a trans-ancestry GWAS on blood traits (Chen et al.): WBC, white blood cell count; RBC, red blood cell count; PLT, platelet count.
They correspond to chromosomes 21 and 22 for the East Asian ancestry.
Once done you can launch the pipeline as:
NXF_VER=22.10.5 nextflow run jass_pipeline.nf --ref_panel_WG {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/Ref_Panel/1000G_EAS_0_01_chr22_21.csv --gwas_folder {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ --meta-data {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/Data_test_EAS.csv --region {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/All_Regions_ALL_ensemble_1000G_hg38_EAS.bed --group {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/group.txt -with-report jass_report.html -c nextflow_local.config
See the description of the required parameters in the next section. You can also set parameters in the header of jass_pipeline.nf if preferred.
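As a convenience, the test command above can be assembled programmatically so that every path is derived from a single absolute pipeline folder. The following is a minimal sketch and not part of the pipeline: the helper name `build_test_command` and the example folder `/opt/jass_suite_pipeline` are hypothetical, while the flags and file names are copied from the command above.

```python
import shlex
from pathlib import Path


def build_test_command(pipeline_folder):
    """Return the nextflow argument list for the bundled test data.

    Hypothetical helper: flags and file names are copied verbatim from
    the README command; only the path joining is new.
    """
    root = Path(pipeline_folder)
    if not root.is_absolute():
        raise ValueError("the pipeline folder must be an absolute path")
    inputs = root / "input_files"
    return [
        "nextflow", "run", "jass_pipeline.nf",
        "--ref_panel_WG", str(root / "Ref_Panel" / "1000G_EAS_0_01_chr22_21.csv"),
        "--gwas_folder", str(root / "test_data" / "hg38_EAS") + "/",
        "--meta-data", str(inputs / "Data_test_EAS.csv"),
        "--region", str(inputs / "All_Regions_ALL_ensemble_1000G_hg38_EAS.bed"),
        "--group", str(inputs / "group.txt"),
        "-with-report", "jass_report.html",
        "-c", "nextflow_local.config",
    ]


# The example folder is an assumption; replace it with your own clone location.
command = build_test_command("/opt/jass_suite_pipeline")
# Print the command instead of running it, so it can be reviewed first.
print("NXF_VER=22.10.5 " + shlex.join(command))
```

Printing rather than executing keeps the sketch free of side effects; the printed line can be pasted into a shell from the pipeline folder.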
If all went well, you have cleaned the three summary statistic files, aligned them on the reference panel, and integrated them in one database. This database was used to perform a multi-trait GWAS on the three traits.
Here are the output files produced by the pipeline:
- ${PIPELINE_FOLDER}/harmonized_GWAS_1_file/ : genome wide harmonized summary statistics
- ${PIPELINE_FOLDER}/harmonized_GWAS_files/ : harmonized summary statistics by chromosomes
- ${PIPELINE_FOLDER}/init_table : database containing all summary statistics to perform multi-trait GWAS
- ${PIPELINE_FOLDER}/worktable : multi-trait GWAS results file
- ${PIPELINE_FOLDER}/quadrant : quadrant plot of the multi-trait GWAS
- ${PIPELINE_FOLDER}/manhattan : manhattan plot of the multi-trait GWAS
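A quick way to confirm that a run completed is to check that each output location listed above exists. The sketch below assumes the outputs sit directly under the pipeline folder, as the list suggests; the function name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# Output locations named in the list above.
EXPECTED_OUTPUTS = [
    "harmonized_GWAS_1_file",
    "harmonized_GWAS_files",
    "init_table",
    "worktable",
    "quadrant",
    "manhattan",
]


def missing_outputs(pipeline_folder):
    """Return the expected output names that are absent from pipeline_folder."""
    root = Path(pipeline_folder)
    return [name for name in EXPECTED_OUTPUTS if not (root / name).exists()]


if __name__ == "__main__":
    # "." assumes the script is run from the pipeline folder.
    absent = missing_outputs(".")
    print("all outputs present" if not absent else f"missing: {absent}")
```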
To run the pipeline on your own data as it is configured for the tutorial (with no imputation and without the LD-score inference step), adapt the meta-data file describing your summary statistics, select an appropriate reference panel (see below), and provide the path to the folder containing all your summary statistics.
See the sections below for running the imputation step and/or the LD-score step.
Required Input
The following items are required to run the JASS pipeline on real data:
- --meta_data: path to a meta-data file describing the GWAS (see the example file ./input_files/test1.csv and the jass_preprocessing documentation)
- --gwas_folder: path to a folder containing the summary statistics to analyze
- --ref_panel_WG: path to a reference panel (whole genome in a single file). See below to download curated reference panels, by ancestry, derived from 1000G V3 on the hg38 assembly
- --region: quasi LD-independent regions. JASS uses these regions to quickly determine LD-independent hits across the genome. The input_files folder contains one region file per ancestry on the hg38 assembly. If you work with a different assembly or population, you can provide 1 Mb windows as a rough equivalent of these regions.
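If no curated region file matches your assembly or population, the 1 Mb fallback mentioned above can be generated with a few lines of Python. This is a sketch under assumptions: it writes tab-separated chromosome, start, and end columns in BED style, which should be checked against the region files shipped in input_files, and the chromosome lengths in the example are placeholders rather than real assembly values. The function names are hypothetical.

```python
def one_mb_regions(chrom_lengths, size=1_000_000):
    """Yield (chromosome, start, end) windows of at most `size` bases.

    chrom_lengths maps a chromosome label to its length in base pairs.
    Coordinates are 0-based and half-open, as in BED files (assumption).
    """
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, size):
            yield chrom, start, min(start + size, length)


def write_regions(path, chrom_lengths, size=1_000_000):
    """Write the windows to `path`, one tab-separated region per line."""
    with open(path, "w") as handle:
        for chrom, start, end in one_mb_regions(chrom_lengths, size):
            handle.write(f"{chrom}\t{start}\t{end}\n")


# Placeholder lengths for illustration only; use your assembly's real values.
example_lengths = {"chr21": 2_500_000, "chr22": 1_000_000}
regions = list(one_mb_regions(example_lengths))
print(regions)
```

Calling `write_regions("my_regions.bed", lengths)` with real chromosome lengths would produce a file that can be passed to --region, provided the column layout matches what JASS expects.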
Optional parameters
- --output_folder: path to a folder in which to write pipeline results (inittable, worktable...). By default, results are published in the workflow directory.
To launch multi-trait GWAS at the end of the pipeline
You can use this pipeline to launch a batch of multi-trait GWAS as its final step.
- --compute_project: flag indicating that you wish to perform multi-trait GWAS at the end of the pipeline
- --group: a group file with each phenotype group written on a separate line (needed if you wish to compute joint analyses with the pipeline)
Alternatively, use the jass create-project-data command line on the inittable file (all your harmonized summary statistics) stored in the init_table folder. See the JASS documentation for its usage (https://statistical-genetics.pages.pasteur.fr/jass/generating_joint_analysis.html).
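The group file passed to --group holds one phenotype group per line. The sketch below formats such a file; it assumes that traits within a group are separated by single spaces and that the labels match those declared in the meta-data file. Both points, the labels `TRAIT_A`, `TRAIT_B`, `TRAIT_C`, and the function name `format_groups` are assumptions to verify against input_files/group.txt.

```python
def format_groups(groups):
    """Return group-file text: one group per line, traits space-separated.

    The space separator is an assumption; check it against the example
    group file shipped with the pipeline.
    """
    lines = []
    for group in groups:
        if not group:
            raise ValueError("empty phenotype group")
        lines.append(" ".join(group))
    return "\n".join(lines) + "\n"


# Hypothetical trait labels; use the identifiers from your own meta-data file.
groups = [
    ["TRAIT_A", "TRAIT_B", "TRAIT_C"],
    ["TRAIT_A", "TRAIT_B"],
]
group_text = format_groups(groups)
# Save this text to the file given to --group.
print(group_text, end="")
```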
To launch imputation based on summary statistics
For this step you will need to install an additional dependency, the RAISS Python package.
- --compute_imputation=true : the imputation step will be performed
- --ref_panel : A folder containing a Reference Panel in the .bim, .bed, .fam format for imputation with RAISS
- --ld-folder : A path toward a folder containing LD matrices (that can be generated from the reference panel with the raiss package as described here : http://statistical-genetics.pages.pasteur.fr/raiss/#precomputation-of-ld-correlation)
Imputed files will be stored in:
- ${PIPELINE_FOLDER}/imputed_GWAS/ : imputed harmonized summary statistics by chromosome
Available reference panels
To make reference panels readily available, we use git LFS. To download them, you can either install git LFS or simply download the files through the GitLab interface and place them in the ./Ref_Panel folder.
Solution with git LFS:
git lfs pull --include 1000G_AFR_0_01.csv
We provide reference panels of common SNPs (MAF > 1%) for the East Asian (EAS), African, South East Asian, Hispanic, and European populations. They were built for each ancestry from 1000 Genomes consortium phase 3 data on the hg38 build (The 1000 Genomes Project Consortium 2015).
You can download the five panels using the command:
git lfs fetch --all
or manually through the GitLab interface.
Running the LDSC regression covariance step
This step infers the multi-trait Z-score null distribution, heritabilities, and genetic correlations using LD-score regression.
For accuracy, we recommend using LD-score regression to infer the multivariate distribution of Z-scores under the null. The alternative, implemented by default, is to estimate the null distribution by computing the covariance of Z-scores with low genetic signal. Hence this step is not strictly required.
When computed for a large number of traits, this step can be computationally intensive and may require an HPC cluster.
- For hg37 and the EUR ancestry, download and extract the reference panel for LD-score regression in the pipeline folder:
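The default alternative described above (covariance of Z-scores with low genetic signal) can be illustrated with a short example. This is a simplified sketch, not the actual JASS implementation: the cutoff `max_abs_z=4.0` used to define "low signal" and the function name `null_covariance` are assumptions made for illustration.

```python
def null_covariance(z_rows, max_abs_z=4.0):
    """Sample covariance of Z-scores over low-signal SNPs.

    z_rows holds one row per SNP and one Z-score per trait. A SNP is
    kept only if every trait has |z| < max_abs_z (assumed cutoff).
    """
    kept = [row for row in z_rows if all(abs(z) < max_abs_z for z in row)]
    if len(kept) < 2:
        raise ValueError("need at least two low-signal SNPs")
    n, k = len(kept), len(kept[0])
    # Per-trait mean over the retained SNPs.
    means = [sum(row[j] for row in kept) / n for j in range(k)]
    # Unbiased estimator: divide by n - 1.
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in kept) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Toy Z-scores for two traits; the last SNP carries signal and is excluded.
z_scores = [[1.0, 1.0], [-1.0, -1.0], [0.0, 0.0], [9.0, 0.5]]
cov = null_covariance(z_scores)
print(cov)
```

Excluding SNPs with strong association keeps true genetic signal from inflating the estimate; LD-score regression addresses the same issue more rigorously, which is why it is the recommended option.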
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
tar -jxvf eur_w_ld_chr.tar.bz2
If you want to analyze data in hg38 and for all ancestries, you can contact the main developer of this pipeline (hanna.julienne@pasteur.fr) to request the needed input files.
- To activate the LDscore option, set this flag to true:
--compute_LDSC_matrix=true
- Give the path to the reference panel used for the LDscore regression:
--LD_SCORE_folder ${PATH_to_REFERENCE}
If you run this additional step, the following outputs will be generated
- ${PIPELINE_FOLDER}/ldsc_data : preprocessed data to run the LDscore regression
- ${PIPELINE_FOLDER}/h2_data: heritability estimation logs
- ${PIPELINE_FOLDER}/cor_data: covariance estimation logs
- ${PIPELINE_FOLDER}/Correlation_matrices: parsed covariance matrices
The H0 matrix will be integrated into the inittable file by the pipeline and hence taken into account in subsequent multi-trait analyses.
Usage Example on HPC Cluster
If you are working on an HPC server with the Slurm job scheduler, you can adapt the nextflow_sbatch.config file and launch the pipeline with a command such as:
sbatch --mem-per-cpu 32G -p common,dedicated,ggs --qos=long --wrap "module load java/13.0.2;module load singularity/3.8.3;module load graphviz/2.42.3;./nextflow run imputation_only.nf -with-report imput_report.html -with-timeline imput_timeline.html -c nextflow_slurm.config -qs 300"
Using docker container
Stable versions of JASS tools and dependencies are available as Docker containers:
- plink: https://quay.io/repository/biocontainers/plink?tab=tags
- LDscore: https://quay.io/repository/biocontainers/ldsc?tab=tags
- JASS preprocessing: https://quay.io/repository/biocontainers/jass_preprocessing?tab=tags
- JASS containers: https://quay.io/repository/biocontainers/jass?tab=tags
- RAISS containers: https://quay.io/repository/biocontainers/raiss?tab=tags