Welcome to jass_preprocessing’s documentation!

What is JASS preprocessing?

JASS preprocessing is a tool that takes heterogeneous GWAS summary statistics as input and performs standardization and quality checks. It outputs standardized summary statistic files that can be used as input to the JASS Python package and the RAISS imputation package.

Overview

The QC and preprocessing steps proceed as follows:

  • Map the columns of each GWAS to standardized names
  • Select the GWAS SNPs that are in the input reference panel
  • Align the coded allele of the GWAS with the reference panel
  • Infer the sample size per SNP if it is not present in the input data
  • Filter out SNPs with a small sample size
  • Normalize the effect size by sample size to obtain Z-scores
  • Save the output by chromosome as the following example:
rsID pos A0 A1 Z
rs6548219 30762 A G -1.133
  • (Optional) Save the output to one file with a chromosome column (the input format needed to perform LD-score)
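
The selection and alignment steps listed above can be illustrated with a minimal sketch. This is not the package's implementation: it assumes that the input Z-score refers to the GWAS coded allele a1, that the output is expressed relative to the alt allele of the reference panel, and it ignores strand flips. The GWAS rows, the function name align and the ID rs0000000 are hypothetical.

```python
# Illustrative sketch only, not the package's actual code.
# Assumption: z is given for the coded allele a1, and the output Z is
# expressed relative to the reference panel's alt allele.

reference = {
    # snp_id: (ref, alt), taken from the reference panel example
    "rs62635286": ("T", "G"),
    "rs200579949": ("A", "G"),
}

gwas = [
    # hypothetical GWAS rows
    {"snpid": "rs62635286", "a1": "G", "a2": "T", "z": 1.2},
    {"snpid": "rs200579949", "a1": "A", "a2": "G", "z": -0.5},
    {"snpid": "rs0000000", "a1": "C", "a2": "T", "z": 2.0},  # not in the panel
]


def align(row, reference):
    """Return the row aligned to the panel, or None if it cannot be used."""
    if row["snpid"] not in reference:
        return None  # SNP absent from the reference panel
    ref, alt = reference[row["snpid"]]
    if (row["a1"], row["a2"]) == (alt, ref):
        z = row["z"]  # coded allele already matches the panel
    elif (row["a1"], row["a2"]) == (ref, alt):
        z = -row["z"]  # alleles are swapped: flip the sign
    else:
        return None  # alleles do not match (strand flips are not handled)
    return {"rsID": row["snpid"], "A0": ref, "A1": alt, "Z": z}


aligned = [r for r in (align(row, reference) for row in gwas) if r is not None]
for r in aligned:
    print(r["rsID"], r["A0"], r["A1"], r["Z"])
```

Only the two SNPs present in the panel are kept, and the sign of the second Z-score is flipped because its coded allele is the reference allele.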

Installation

In a terminal, execute the following lines:

pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input

  • A reference panel (1000 Genomes format). The user is expected to provide a reference panel in TSV format with the following columns, in this order and without a header:
chr pos snp_id ref alt MAF
1 13116 rs62635286 T G 0.0970447
1 13118 rs200579949 A G 0.0970447
1 14604 rs541940975 A G 0.147564
1 14930 rs75454623 A G 0.482228
  • A folder containing all raw GWAS data (all chromosomes in one file)
  • A list containing the names of the GWAS files, as strings
  • A descriptor CSV file that describes each GWAS summary statistics file:
    • a header
    • 1 line per study
    • the fields are:
category                                field name
path to the data                        filename
study info fields                       consortia,outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp
names of the header in the GWAS file    snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont

Here is an example of a descriptor file. Fields that are irrelevant for a study (for example, the odds ratio for a continuous trait) must be filled with na.

GWAS information table:
filename consortia outcome fullName type Nsample Ncase Ncontrol Nsnp snpid a1 a2 freq pval n z OR se code imp ncas ncont
GIANT_HEIGHT_Wood_et_al.txt GIANT HEIGHT Height Anthropometry 253288 na na 2550858 MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU p N b na SE na na na na
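
To show how a descriptor row relates to the column-mapping step, the sketch below reads the example row and builds a dictionary from the GWAS-specific column names to the standardized ones, skipping fields set to na. It is an illustration, not the package's code: it assumes a tab-separated descriptor with filename as the first column, and the helper name column_mapping is hypothetical.

```python
import csv
import io

# Standardized names of the GWAS header columns, as listed in the descriptor.
STANDARD_COLUMNS = ["snpid", "a1", "a2", "freq", "pval", "n", "z", "OR", "se",
                    "code", "imp", "ncas", "ncont"]

header = ("filename consortia outcome fullName type Nsample Ncase Ncontrol Nsnp "
          "snpid a1 a2 freq pval n z OR se code imp ncas ncont").split()
row = ("GIANT_HEIGHT_Wood_et_al.txt GIANT HEIGHT Height Anthropometry 253288 "
       "na na 2550858 MarkerName Allele1 Allele2 Freq.Allele1.HapMapCEU p N b "
       "na SE na na na na").split()

# Assumption: the descriptor is tab-separated; the real delimiter may differ.
descriptor_text = "\t".join(header) + "\n" + "\t".join(row) + "\n"


def column_mapping(study):
    """Map GWAS-specific column names to standardized names, skipping 'na'."""
    return {study[std]: std for std in STANDARD_COLUMNS if study[std] != "na"}


studies = list(csv.DictReader(io.StringIO(descriptor_text), delimiter="\t"))
mapping = column_mapping(studies[0])
for original, standard in mapping.items():
    print(original, "->", standard)
```

For the GIANT height study, MarkerName is mapped to snpid and b to z, while OR, code, imp, ncas and ncont are left out because they are set to na.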

Command line usage example:

It is possible to launch the complete preprocessing (all the steps described in the Overview section) from the command line:

usage: jass_preprocessing [-h] --percent-sample-size PERCENT_SAMPLE_SIZE
                          --gwas-info GWAS_INFO --ref-folder REF_FOLDER
                          --gwas-folder GWAS_FOLDER --output-folder
                          OUTPUT_FOLDER
                          [--output-folder-1-file OUTPUT_FOLDER_1_FILE]
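
As an illustration of the synopsis above, the following sketch assembles one possible invocation from Python. Every path and the value 0.7 are hypothetical placeholders; only the flag names come from the usage text.

```python
import shlex

# All paths and the 0.7 value are hypothetical placeholders.
args = [
    "jass_preprocessing",
    "--percent-sample-size", "0.7",
    "--gwas-info", "gwas_info.tsv",
    "--ref-folder", "ref_panel/",
    "--gwas-folder", "raw_gwas/",  # must end with '/'
    "--output-folder", "preprocessed/",
]

# Render the argument list as a single shell command line.
command = shlex.join(args)
print(command)

# To actually run it (requires the package to be installed):
# import subprocess
# subprocess.run(args, check=True)
```

Passing the arguments as a list to subprocess.run avoids shell quoting issues when paths contain spaces.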

Named Arguments

--percent-sample-size
 the proportion of the 90th percentile of the sample size used to filter the SNPs
--gwas-info
 path to the file describing the format of the individual GWAS files
--ref-folder
 reference panel location (used to determine which SNPs to impute)
--gwas-folder
 path to the folder containing the GWAS summary statistics files; must end with ‘/’
--output-folder
 location of the main output folder for the preprocessed GWAS files (split by chromosome)
--output-folder-1-file
 optional location to store the preprocessing output in one tabular file with a chromosome column
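
The description of --percent-sample-size can be made concrete with a small sketch. It assumes a nearest-rank percentile and an inclusive threshold (a SNP is kept when its sample size is at least the threshold); the package may compute the percentile differently. The SNP IDs, sample sizes and function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty collection."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_by_sample_size(sample_sizes, percent_sample_size):
    """Keep SNPs whose N is at least percent_sample_size * 90th percentile of N."""
    threshold = percent_sample_size * percentile(sample_sizes.values(), 90)
    kept = {snp: n for snp, n in sample_sizes.items() if n >= threshold}
    return kept, threshold


# Hypothetical per-SNP sample sizes.
sample_sizes = {"rs1": 1000, "rs2": 950, "rs3": 400, "rs4": 990, "rs5": 980}
kept, threshold = filter_by_sample_size(sample_sizes, 0.7)
print("threshold:", threshold)
print("kept:", sorted(kept))
```

With these values the 90th percentile is 1000, so a proportion of 0.7 sets the threshold at about 700 and only rs3 is discarded.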

Indices and tables

imputation_launcher Imputation launcher
ld_matrix Function set to compute LD correlation from a reference panel in a predefined region
stat_models This module contains the statistical library for imputation.
windows Implements the imputation window that slides along the genome