diff --git a/README.md b/README.md index cf72ec67eff0cc8e3d876b1e7401f4d85d8f39f2..91bd08354e4c5e50138240b72a4d6a34b4cbd935 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,49 @@ # JASS analysis pipeline contact hanna.julienne@pasteur.fr + +## Table of Contents + +1. [Overview](#overview) +2. [Quick Start](#quick-start---run-pipeline-on-test-data) +3. [Advanced Options](#optional-parameters) +4. [Available Reference Panels](#available-reference-panels) +5. [Usage on HPC and Docker container](#usage-example-on-hpc-cluster) + ## Overview We present here a nextflow pipeline to harmonize, impute and analyze jointly GWAS summary statistics. +For more details about the multi-trait GWAS and imputation methods, and to see example results from this type of analysis, see the publications below: + +**When referring to the theoretical basis of JASS tests, cite:** + +* Julienne H, Laville V, McCaw ZR, He Z, Guillemot V, Lasry C, Ziyatdinov A, Nerin C, Vaysse A, Lechat P, Ménager H, Le Goff W, Dube MP, Kraft P, Ionita-Laza I, Vilhjálmsson BJ, Aschard H. +Multitrait GWAS to connect disease variants and biological mechanisms. +PLoS Genet. 2021 Aug 30;17(8):e1009713. +doi: 10.1371/journal.pgen.1009713. + +**When using JASS software in a publication, cite:** +* Julienne H, Lechat P, Guillemot V, Lasry C, Yao C, Araud R, Laville V, Vilhjalmsson B, Ménager H, Aschard H. +JASS: command line and web interface for the joint analysis of GWAS results. +NAR Genom Bioinform. 2020 Mar;2(1):lqaa003. +doi: 10.1093/nargab/lqaa003. + +**When referring to the imputation of summary statistics, cite:** +* Julienne H, Shi H, Pasaniuc B, Aschard H. +RAISS: robust and accurate imputation from summary statistics. +Bioinformatics. 2019 Nov 1;35(22):4837-4839. +doi: 10.1093/bioinformatics/btz466.
+ +**Additional publications tied to JASS** +* Suzuki, Y., Ménager, H., Brancotte, B., Vernet, R., Nerin, C., Boetto, C., Auvergne, A., Linhard, C., Torchet, R., Lechat, P., Troubat, L., Cho, M.H., Bouzigon, E., Aschard, H., Julienne, H., 2023. +Trait selection strategy in multi-trait GWAS: Boosting SNPs discoverability. +doi.org/10.1101/2023.10.27.564319 + +* Troubat, L., Fettahoglu, D., Henches, L., Aschard, H., Julienne, H., 2023. +Multi-trait GWAS for diverse ancestries: Mapping the knowledge gap. +doi.org/10.1101/2023.06.23.546248 +* Auvergne, A., Traut, N., Henches, L., Troubat, L., Frouin, A., Boetto, C., Kazem, S., Julienne, H., Toro, R., Aschard, H., 2023. +Linking the genetic structure of neuroanatomical phenotypes with psychiatric disorders. +doi.org/10.1101/2023.11.01.564329 The current pipeline integrate the following workflow: @@ -26,8 +67,7 @@ Clone the current repository locally: git clone https://gitlab.pasteur.fr/statistical-genetics/jass_suite_pipeline.git ``` -[!NOTE] -The pipeline has been upgraded to nextflow DSL2 syntax recently. If you wish to use the previous version in DSL1, you find it in ./old_versions and run it with previous version of nextflow ("NXF_VER=22.10.5 nextflow run jass_pipeline.nf ....") +[!NOTE] The pipeline was recently upgraded to nextflow DSL2 syntax. If you wish to use the previous DSL1 version, you can find it in ./old_versions and run it with a previous version of nextflow ("NXF_VER=22.10.5 nextflow run jass_pipeline.nf ....") Test data are located in the ${PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ folder @@ -36,7 +76,7 @@ These are extracts of summary statistics from a trans ancestry GWAS on blood tra They correspond to the chromosome 21 and 22 for the East asian ancestry.
-Once done you can launch the pipeline as: +Once done, you can launch the pipeline with the command below, replacing {ABSOLUTE_PATH_TO_PIPELINE_FOLDER} with the absolute path to the folder where you cloned the JASS pipeline: ``` nextflow run jass_pipeline.nf --ref_panel_WG {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}Ref_Panel/1000G_EAS_0_01_chr22_21.csv --gwas_folder {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/test_data/hg38_EAS/ --meta-data {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/Data_test_EAS.csv --region {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/All_Regions_ALL_ensemble_1000G_hg38_EAS.bed --group {ABSOLUTE_PATH_TO_PIPELINE_FOLDER}/input_files/group.txt -with-report jass_report.html -c nextflow_local.config ``` @@ -62,7 +102,7 @@ See sections below, for running the imputation step and/or the LD-score step. The following Item are necessary to run JASS pipeline on real data -1. --meta_data: A path toward a meta-data file describing GWAS (see example file in ./input_files/test1.tsv and [jass_preprocessing documentation](http://statistical-genetics.pages.pasteur.fr/jass_preprocessing/)) +1. --meta_data: A path toward a meta-data file describing GWAS (see example file in ./input_files/Data_test_EAS.csv and [jass_preprocessing documentation](http://statistical-genetics.pages.pasteur.fr/jass_preprocessing/)) 2. --gwas_folder: A path toward a folder containing the summary statistics to analyze 3. --ref_panel_WG: a path toward a reference panel (all genome as 1 file). See below to download curated reference panels by ancestries derived from 1000G V3 on hg38 assembly 4. --region: Quasi LD independent regions. These regions are used by JASS to determine quickly LD-independent hits accross the genome. The input_files folder contains one region file by ancestry on hg38 assembly. If working with a different assembly or population, you can provide 1Mb delimitations as a rough equivalent of these regions.
@@ -93,26 +133,12 @@ imputed files will be stored in ## Available reference panels -To make reference panel readily available, we use git lfs. -To download them, you can either install git lfs or simply download the file through this here and place it in the ./Ref_Panel folder. - -Solution with git LFS: - -``` - git lfs pull --include 1000G_AFR_0_01.csv -``` +Reference panels for JASS can be downloaded from Zenodo: https://zenodo.org/records/13940447 +You can then decompress them and place them in the ./Ref_Panel folder. We provide a reference panel for common SNPs (MAF > 1%) for the East asian (EAS), African, South east Asian, Hispanic and European populations. They were built from 1000 Genomes consortium phase 3 data (hg38 build) for each ancestry. (The 1000 Genomes Project Consortium 2015). -You can download the five panel using the command: - -``` - git lfs fetch --all -``` -or manualy through the gitlab interface: - - ## Running the LDSC regression covariance step ### To infer multi-trait z-scores null distribution, heritabilities, genetic correlations using the LDscore regression diff --git a/Ref_Panel/1000G_AFR_0_01.csv b/Ref_Panel/1000G_AFR_0_01.csv deleted file mode 100755 index 34732ab61353114c4ab406dce02a6e50b3520077..0000000000000000000000000000000000000000 --- a/Ref_Panel/1000G_AFR_0_01.csv +++ /dev/null @@ -1,3 +0,0 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:934c63984d760943b0228802f1fcbf3a778d8a3c525e3b78138948b4c07f4ab4 -size 515852597 diff --git a/Ref_Panel/1000G_AMR_0_01.csv b/Ref_Panel/1000G_AMR_0_01.csv deleted file mode 100755 index 8517b1b84654f4b2357560d32ca1e78526789b6f..0000000000000000000000000000000000000000 Binary files a/Ref_Panel/1000G_AMR_0_01.csv and /dev/null differ diff --git a/Ref_Panel/1000G_EAS_0_01.csv b/Ref_Panel/1000G_EAS_0_01.csv deleted file mode 100644 index fb62efacb3efe39254248a5285f9a696aeb40a4e..0000000000000000000000000000000000000000 Binary files a/Ref_Panel/1000G_EAS_0_01.csv and /dev/null differ diff
--git a/Ref_Panel/1000G_EUR_0_01.csv b/Ref_Panel/1000G_EUR_0_01.csv deleted file mode 100755 index 44d786177f9c6b6ca56f0f79753640021af39bd1..0000000000000000000000000000000000000000 Binary files a/Ref_Panel/1000G_EUR_0_01.csv and /dev/null differ diff --git a/Ref_Panel/1000G_SAS_0_01.csv b/Ref_Panel/1000G_SAS_0_01.csv deleted file mode 100755 index 52db85218eeb10bd8be57ca2584ff77a3af68527..0000000000000000000000000000000000000000 Binary files a/Ref_Panel/1000G_SAS_0_01.csv and /dev/null differ diff --git a/input_files/Data_test_EAS.csv b/input_files/Data_test_EAS.csv index 39596cb6da016424d3a85faf08193d4ec571b555..11be861682923eefd7e2da9dd450a0caaad61a1b 100644 --- a/input_files/Data_test_EAS.csv +++ b/input_files/Data_test_EAS.csv @@ -1,4 +1,4 @@ -"filename" "Consortium" "Outcome" "FullName" "internalDataLink" "Type" "Reference" "ReferenceLink" "dataLink" "Nsample" "Ncase" "Ncontrol" "Nsnp" "snpid" "POS" "CHR" "a1" "a2" "freq" "pval" "n" "z" "OR" "se" "index_type" "imp" "ncas" "ncont" +"filename" "Consortium" "Outcome" "FullName" "internalDataLink" "Type" "Reference" "ReferenceLink" "dataLink" "Nsample" "Ncase" "Ncontrol" "Nsnp" "snpid" "POS" "CHR" "a1" "a2" "freq" "pval" "n" "beta_or_Z" "OR" "se" "index_type" "imp" "ncas" "ncont" "WBC_EAS_chr22.tsv" "BCT" "WBC" "White blood cell count" "Cellular" "Chen MH et al. 2020" "https://pubmed.ncbi.nlm.nih.gov/32888493/" "http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90002001-GCST90003000/GCST90002373/harmonised/32888493-GCST90002373-EFO_0007988.h.tsv.gz" 15061 36864690 "hm_rsid" "hm_pos" "hm_chrom" "hm_effect_allele" "hm_other_allele" "hm_effect_allele_frequency" "p_value" "hm_beta" "standard_error" "rs-number" "RBC_EAS_chr22.tsv" "BCT" "RBC" "Red blood cell count" "Cellular" "Chen MH et al. 
2020" "https://pubmed.ncbi.nlm.nih.gov/32888493/" "http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90002001-GCST90003000/GCST90002373/harmonised/32888493-GCST90002373-EFO_0007988.h.tsv.gz" 15061 36864690 "hm_rsid" "hm_pos" "hm_chrom" "hm_effect_allele" "hm_other_allele" "hm_effect_allele_frequency" "p_value" "hm_beta" "standard_error" "rs-number" "PLT_EAS_chr22.tsv" "BCT" "PLT" "Platelet cell count" "Cellular" "Chen MH et al. 2020" "https://pubmed.ncbi.nlm.nih.gov/32888493/" "http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST90002001-GCST90003000/GCST90002373/harmonised/32888493-GCST90002373-EFO_0007988.h.tsv.gz" 15061 36864690 "hm_rsid" "hm_pos" "hm_chrom" "hm_effect_allele" "hm_other_allele" "hm_effect_allele_frequency" "p_value" "hm_beta" "standard_error" "rs-number"