Commit c9e31651 authored by hjulienne

improve doc + add SLURM script

parent 6cfacbf3
Pipeline #7719 passed with stages
in 2 minutes and 11 seconds
......@@ -20,7 +20,7 @@ sys.path.insert(0, os.path.abspath('../..'))
# -- Project information -----------------------------------------------------
project = 'Peppa-PIG'
project = 'RAISS'
copyright = '2018, hjulienne'
author = 'hjulienne'
......@@ -82,7 +82,7 @@ pygments_style = 'sphinx'
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'alabaster'
html_theme = 'bizstyle'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
......
.. Peppa-PIG documentation master file, created by
.. RAISS documentation master file, created by
sphinx-quickstart on Mon Aug 20 16:17:59 2018.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Welcome to Peppa-PIG's documentation!
Welcome to the Robust and Accurate Imputation from Summary Statistics (RAISS) documentation!
===============================================================================================
.. toctree::
:maxdepth: 2
:caption: Contents:
What is Peppa-PIG ?
What is RAISS?
===================
Peppa-PIG is python package to impute missing SNP summary statistics from
RAISS is a Python package to impute missing SNP summary statistics from
neighboring SNPs in linkage disequilibrium.
The statistical model used for the imputation is described in :cite:`Pasaniuc2014`.
The imputation execution time is optimized by precomputing the linkage disequilibrium between SNPs.
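In outline, the model of :cite:`Pasaniuc2014` treats the vector of SNP z-scores as multivariate
normal with covariance given by the LD matrix :math:`\Sigma`, so that z-scores at SNPs to impute
(index :math:`i`) are obtained from the typed SNPs (index :math:`t`) through the conditional mean,
with the conditional variance reported alongside. The regularization term :math:`\lambda I` below
is a standard stabilization of the LD matrix and is only a sketch of the approach, not necessarily
the exact implementation choice of this package.

.. math::

   \hat{z}_i = \Sigma_{i,t} \, (\Sigma_{t,t} + \lambda I)^{-1} \, z_t,
   \qquad
   \operatorname{Var}(\hat{z}_i) = \Sigma_{i,i} - \Sigma_{i,t} \, (\Sigma_{t,t} + \lambda I)^{-1} \, \Sigma_{t,i}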
Dependencies
============
Peppa-PIG require plink version 1.9
RAISS requires plink version 1.9.
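A quick way to check that plink is installed and on your ``PATH`` (assuming the binary is simply named ``plink``):

.. code-block:: shell

   plink --version   # should report PLINK v1.90 or later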
Installation
============
.. code-block:: shell

   pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/imputation_for_jass
Precomputation of LD-correlation
......@@ -36,25 +39,63 @@ Precomputation of LD-correlation
The imputation is based on the linkage disequilibrium (LD) between SNPs.
To save computation, the LD is computed and saved before the imputation is
performed. To limit the number of SNP pairs, the LD is computed between SNPs
within regions that are approximately LD-independent. For a European ancestry,
you can use the regions defined by Berisa.
you can use the regions defined by :cite:`Berisa2015`, which are provided in the package data folder.
To compute the LD you need to specify a reference panel splitted by chromosomes
(bed, fam and bim formats of plink, see `PLINK formats <https://www.cog-genomics.org/plink2/formats>` )
To compute the LD, you need to specify a reference panel split by chromosome (bed, fam and bim
formats of plink; see `PLINK formats <https://www.cog-genomics.org/plink2/formats>`_).
.. code-block:: python

   for i in 1:22:
       impute_jass.LD.launch_plink_ld(start,stop,...)

   # path to the region file (approximately LD-independent blocks)
   region_berisa = "/mnt/atlas/PCMA/WKD_Hanna/cleaned_jass_input/Region_LD.csv"
   # path to the reference panel
   ref_folder = "/mnt/atlas/PCMA/1._DATA/ImpG_refpanel"
   # path to the folder to store the results
   ld_folder_out = "/mnt/atlas/PCMA/WKD_Hanna/impute_for_jass/berisa_ld_block"

   raiss.LD.generate_genome_matrices(region_berisa, ref_folder, ld_folder_out)
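The reference panel files are looked up as ``<ref_folder>/<chromosome>.<suffix>`` (see the change
to ``generate_genome_matrices`` in ``LD.py`` further down this page). As a sketch, passing
``suffix`` explicitly reproduces the previously hard-coded ``eur.1pct`` naming; the variables are
the ones defined in the block above:

.. code-block:: python

   # reference panel files named e.g. chr22.eur.1pct.bed / .bim / .fam
   raiss.LD.generate_genome_matrices(region_berisa, ref_folder, ld_folder_out,
                                     suffix="eur.1pct")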
Input format:
=============
GWAS result files must be provided in a tab-separated tabular format, one file per
chromosome, all in the same folder, with the following columns and header:

+-----------+-------+----+----+--------+
| rsID      | pos   | A0 | A1 | Z      |
+===========+=======+====+====+========+
| rs6548219 | 30762 | A  | G  | -1.133 |
+-----------+-------+----+----+--------+

This format can be obtained with the Processing package.
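As an illustration, such a file can be loaded and checked with pandas; the file name used here is
a placeholder, not a naming convention imposed by RAISS:

.. code-block:: python

   import pandas as pd

   # tab-separated z-score file for one chromosome (placeholder name)
   gwas = pd.read_csv("z_GWAS_chr22.txt", sep="\t")

   # the five expected columns
   assert list(gwas.columns) == ["rsID", "pos", "A0", "A1", "Z"]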
Launching imputation on one chromosome
======================================
Peppa-PIG has an interface with the command line.
RAISS has a command-line interface.
If you have access to a cluster, an efficient way to use Peppa-PIG is to launch
the imputation of each chromosome on a separate cluster nude
If you have access to a cluster, an efficient way to use RAISS is to launch
the imputation of each chromosome on a separate cluster node. The script
launch_imputation_all_gwas.sh contains an example for the SLURM scheduler.
#TODO check command line interface
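The command-line entry point is still marked to be checked (see the TODO above). Purely as a
sketch, mirroring the options used in ``launch_imputation_all_gwas.sh`` shown further down this
page (the ``impute_jass`` command name and the folder variables are taken from that script and
may change with the rename to RAISS):

.. code-block:: shell

   impute_jass --chrom chr22 --gwas GIANT_HEIGHT \
               --ref-folder ${ref_folder} \
               --ld-folder ${ld_folder} \
               --zscore-folder ${zscore_folder} \
               --output-folder ${output_folder}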
Output
======
The raiss package outputs imputed GWAS files in the following tabular format:

+------------+----+----+----------------+------+-------+------------------+-------------------+----------+---------+
| rsID       | A0 | A1 | Nsnp_to_impute | Var  | Z     | condition_number | correct_inversion | ld_score | pos     |
+============+====+====+================+======+=======+==================+===================+==========+=========+
| rs11584349 | C  | T  | 18             | 0.85 | -0.28 | 116.9            | False             | 1.34     | 1000156 |
+------------+----+----+----------------+------+-------+------------------+-------------------+----------+---------+

# Keep only useful columns
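The extra columns are diagnostics of the imputation; following the note above, they can be
dropped on the user side to keep only the useful columns. A minimal pandas sketch (the file
name and the tab separator are assumptions about the output, and ``rsID`` is assumed to be the
index column):

.. code-block:: python

   import pandas as pd

   # load one imputed chromosome and keep only the association columns
   imputed = pd.read_csv("z_GWAS_imputed_chr22.txt", sep="\t", index_col=0)
   imputed = imputed[["pos", "A0", "A1", "Z"]]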
Indices and tables
......
@article{Pasaniuc2014,
abstract = {MOTIVATION Imputation using external reference panels (e.g. 1000 Genomes) is a widely used approach for increasing power in genome-wide association studies and meta-analysis. Existing hidden Markov models (HMM)-based imputation approaches require individual-level genotypes. Here, we develop a new method for Gaussian imputation from summary association statistics, a type of data that is becoming widely available. RESULTS In simulations using 1000 Genomes (1000G) data, this method recovers 84{\%} (54{\%}) of the effective sample size for common ({\textgreater}5{\%}) and low-frequency (1-5{\%}) variants [increasing to 87{\%} (60{\%}) when summary linkage disequilibrium information is available from target samples] versus the gold standard of 89{\%} (67{\%}) for HMM-based imputation, which cannot be applied to summary statistics. Our approach accounts for the limited sample size of the reference panel, a crucial step to eliminate false-positive associations, and it is computationally very fast. As an empirical demonstration, we apply our method to seven case-control phenotypes from the Wellcome Trust Case Control Consortium (WTCCC) data and a study of height in the British 1958 birth cohort (1958BC). Gaussian imputation from summary statistics recovers 95{\%} (105{\%}) of the effective sample size (as quantified by the ratio of [Formula: see text] association statistics) compared with HMM-based imputation from individual-level genotypes at the 227 (176) published single nucleotide polymorphisms (SNPs) in the WTCCC (1958BC height) data. In addition, for publicly available summary statistics from large meta-analyses of four lipid traits, we publicly release imputed summary statistics at 1000G SNPs, which could not have been obtained using previously published methods, and demonstrate their accuracy by masking subsets of the data. We show that 1000G imputation using our approach increases the magnitude and statistical evidence of enrichment at genic versus non-genic loci for these traits, as compared with an analysis without 1000G imputation. Thus, imputation of summary statistics will be a valuable tool in future functional enrichment analyses. AVAILABILITY AND IMPLEMENTATION Publicly available software package available at http://bogdan.bioinformatics.ucla.edu/software/. CONTACT bpasaniuc@mednet.ucla.edu or aprice@hsph.harvard.edu SUPPLEMENTARY INFORMATION Supplementary materials are available at Bioinformatics online.},
archivePrefix = {arXiv},
arxivId = {arXiv:1309.3258v1},
author = {Pasaniuc, Bogdan and Zaitlen, Noah and Shi, Huwenbo and Bhatia, Gaurav and Gusev, Alexander and Pickrell, Joseph and Hirschhorn, Joel and Strachan, David P. and Patterson, Nick and Price, Alkes L.},
doi = {10.1093/bioinformatics/btu416},
eprint = {arXiv:1309.3258v1},
issn = {13674811},
journal = {Bioinformatics (Oxford, England)},
number = {20},
pages = {2906--2914},
pmid = {24990607},
title = {{Fast and accurate imputation of summary statistics enhances evidence of functional enrichment}},
volume = {30},
year = {2014}
}
@article{Berisa2015,
abstract = {We present a method to identify approximately independent blocks of linkage disequilibrium (LD) in the human genome. These blocks enable automated analysis of multiple genome-wide association studies.},
author = {Berisa, Tomaz and Pickrell, Joseph K.},
doi = {10.1093/bioinformatics/btv546},
isbn = {1367-4811 (Electronic) 1367-4803 (Linking)},
issn = {14602059},
journal = {Bioinformatics},
number = {2},
pages = {283--285},
pmid = {26395773},
title = {{Approximately independent linkage disequilibrium blocks in human populations}},
volume = {32},
year = {2015}
}
......@@ -67,7 +67,7 @@ def generate_sparse_matrix(plink_ld, ref_chr_df):
    mat_ld = mat_ld.to_sparse()
    return mat_ld

def generate_genome_matrices(region_files, reffolder, folder_output):
def generate_genome_matrices(region_files, reffolder, folder_output, suffix=""):
    """
    Go through the region files, compute the LD matrix for each region,
    transform it and save the results in a pandas sparse dataframe
......@@ -81,7 +81,7 @@ def generate_genome_matrices(region_files, reffolder, folder_output):
    for reg in regions.iterrows():
        print(reg[0])
        # input reference panel file
        fi_ref = "{0}/{1}.eur.1pct".format(reffolder, reg[1]['chr'])
        fi_ref = "{0}/{1}.{2}".format(reffolder, reg[1]['chr'], suffix)
        chr_int = re.search('([0-9]{1,2})', str(reg[1]['chr'])).group()
        # Compute the LD correlation with LD
......
#!/bin/bash
#SBATCH --mem=32000
#SBATCH -o /pasteur/projets/policy01/PCMA/WKD_Hanna/impute_for_jass/genome_imputation_script/imputation_log/IMPFJ_%a.log
#SBATCH -e /pasteur/projets/policy01/PCMA/WKD_Hanna/impute_for_jass/genome_imputation_script/error_log/error_%a.log
# define data location
module load Python/3.6.0
output_folder="/pasteur/projets/policy01/PCMA/1._DATA/ImFJ_imputed_zfiles"
ref_folder="/pasteur/projets/policy01/PCMA/1._DATA/ImpG_refpanel"
zscore_folder="/pasteur/projets/policy01/PCMA/1._DATA/ImpG_zfiles"
ld_folder="/pasteur/projets/policy01/PCMA/WKD_Hanna/impute_for_jass/berisa_ld_block"
args=$(head -n "${SLURM_ARRAY_TASK_ID}" "$1" | tail -n 1)
study=$(echo $args | cut -d' ' -f1)
chrom=$(echo $args | cut -d' ' -f2)
#quick fix to access entry point on tars
alias impute_jass="python3 /pasteur/homes/hjulienn/.local/lib/python3.6/site-packages/impute_jass/__main__.py"
#study="GIANT_HEIGHT"
#chrom="chr22"
echo "Study: ${study}; Chrom: ${chrom}"
#time impute_jass --chrom ${chrom} --gwas ${study} --ref-folder ${ref_folder} --ld-folder ${ld_folder} --zscore-folder ${zscore_folder} --output-folder ${output_folder}
time python3 /pasteur/homes/hjulienn/.local/lib/python3.6/site-packages/impute_jass/__main__.py --chrom ${chrom} --gwas ${study} --ref-folder ${ref_folder} --ld-folder ${ld_folder} --zscore-folder ${zscore_folder} --output-folder ${output_folder}
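A usage sketch for this script (the list file name and the array size are assumptions, not fixed
by the script): pass a text file whose N-th line contains a study name and a chromosome separated
by a space, and submit it as a SLURM job array so that task N imputes that line, e.g.

    sbatch --array=1-44 launch_imputation_all_gwas.sh gwas_chromosome_list.txt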