Welcome to jass_preprocessing’s documentation!

What is jass preprocessing?

Jass preprocessing is a tool that takes heterogeneous GWAS summary statistics as input, then performs standardization and quality checks. It outputs standardized summary statistic files that can be used as input to the JASS Python package and the RAISS imputation package.

Overview

The QC and preprocessing steps are as follows:

  • Map the columns of each GWAS to standardized names

  • Select GWAS SNPs that are in the input reference panel

  • Align coded allele of the GWAS with the reference panel

  • Infer the number of samples per SNP if it is not present in the input data

  • Filter SNPs with a small sample size

  • Normalize the effect size by sample size to obtain Z-scores

  • Save the output by chromosome, as in the following example:

    rsID       pos    A0  A1  Z
    rs6548219  30762  A   G   -1.133

  • (Optional) Save the output to one file with a chromosome column (the input format needed to perform LD-score)
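The selection and allele alignment steps above can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: it assumes the Z-score is expressed relative to the reference panel's alternative allele (written as A1 in the output), that a GWAS whose coded allele is the panel's reference allele has its sign flipped, and that SNPs with any other allele pair are dropped. The function name align, the SNP rs0000001 and rs9999999 entries, and their values are made up for the example; strand flips are not handled.

```python
# Reference panel entries keyed by rsID: (position, ref allele, alt allele).
reference = {
    "rs6548219": (30762, "A", "G"),
    "rs0000001": (40000, "C", "T"),  # made-up SNP for illustration
}

# GWAS rows: (rsID, coded allele, other allele, Z-score).
gwas = [
    ("rs6548219", "G", "A", -1.133),  # same orientation as the panel: kept as is
    ("rs0000001", "C", "T", 2.0),     # coded allele is the panel's ref: sign flipped
    ("rs9999999", "A", "G", 0.5),     # absent from the panel: dropped
]


def align(gwas_rows, ref_panel):
    """Keep SNPs found in the panel and express Z relative to the alt allele."""
    aligned = []
    for rsid, coded, other, z in gwas_rows:
        if rsid not in ref_panel:
            continue  # SNP is not in the reference panel
        pos, ref, alt = ref_panel[rsid]
        if (coded, other) == (alt, ref):
            aligned.append((rsid, pos, ref, alt, z))
        elif (coded, other) == (ref, alt):
            aligned.append((rsid, pos, ref, alt, -z))
        # Any other allele pair is discarded in this sketch.
    return aligned


for row in align(gwas, reference):
    print(*row, sep="\t")
```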

Installation

In a terminal, execute the following lines:

pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing

Input

  • A reference panel (1000 Genomes format). The user is expected to provide a reference panel in TSV format with the following columns, in this order and without a header:

    chr  pos    snp_id       ref  alt  MAF
    1    13116  rs62635286   T    G    0.0970447
    1    13118  rs200579949  A    G    0.0970447
    1    14604  rs541940975  A    G    0.147564
    1    14930  rs75454623   A    G    0.482228
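Because the reference panel has no header, the column names have to be supplied when the file is read. A minimal sketch using only the standard library is shown here; the function name read_reference_panel is hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io

# Column names in the order described for the reference panel.
REF_COLUMNS = ["chr", "pos", "snp_id", "ref", "alt", "MAF"]

# Stand-in for an open reference panel file; rows copied from the example above.
panel_file = io.StringIO(
    "1\t13116\trs62635286\tT\tG\t0.0970447\n"
    "1\t13118\trs200579949\tA\tG\t0.0970447\n"
    "1\t14604\trs541940975\tA\tG\t0.147564\n"
    "1\t14930\trs75454623\tA\tG\t0.482228\n"
)


def read_reference_panel(handle):
    """Read a headerless, tab-separated panel into a list of dictionaries."""
    reader = csv.DictReader(handle, fieldnames=REF_COLUMNS, delimiter="\t")
    panel = []
    for row in reader:
        row["pos"] = int(row["pos"])    # genomic position as an integer
        row["MAF"] = float(row["MAF"])  # allele frequency as a float
        panel.append(row)
    return panel


panel = read_reference_panel(panel_file)
print(panel[0]["snp_id"], panel[0]["MAF"])
```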

  • A folder containing all the raw GWAS data (one file per GWAS, with all chromosomes in the same file)

  • A list containing the names of the GWAS files, as strings.

  • A descriptor CSV file that describes each GWAS summary statistic file:

    • a header

    • 1 line per study

    • the fields are:

Note that the combination of Consortium and outcome must be unique because it will be used as an index in the cleaning process.

Here is an example of a descriptor entry. Fields that are irrelevant for the study (for example, the odds ratio for a continuous trait) must be filled with na.

GWAS information table

    header    example
    filename  GIANT_HEIGHT_Wood_et_al.txt
    consortia GIANT
    outcome   HEIGHT
    fullName  Height
    type      Anthropometry
    Nsample   253288
    Ncase     na
    Ncontrol  na
    Nsnp      2550858
    snpid     MarkerName
    a1        Allele1
    a2        Allele2
    freq      Freq.Allele1.HapMapCEU
    pval      p
    n         N
    z         b
    OR        na
    se        SE
    code      na
    imp       na
    ncas      na
    ncont     na
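To illustrate how such a descriptor entry might be used, the sketch below builds an index from the consortia and outcome fields, rejects duplicates, and derives a mapping from the raw GWAS column names to the standardized ones while skipping na fields. The function names, the underscore-joined key format, and the subset of fields shown are assumptions made for the example, not the package's API.

```python
# One descriptor row, copied from the example above (only some fields shown).
descriptor = [
    {
        "filename": "GIANT_HEIGHT_Wood_et_al.txt",
        "consortia": "GIANT",
        "outcome": "HEIGHT",
        "snpid": "MarkerName",
        "a1": "Allele1",
        "a2": "Allele2",
        "pval": "p",
        "n": "N",
        "OR": "na",
        "se": "SE",
    },
]

# Descriptor fields that hold the name of a column in the raw GWAS file.
COLUMN_FIELDS = ["snpid", "a1", "a2", "pval", "n", "OR", "se"]


def study_index(rows):
    """Index studies by consortium and outcome, refusing duplicate pairs."""
    index = {}
    for row in rows:
        key = f"{row['consortia']}_{row['outcome']}"  # hypothetical key format
        if key in index:
            raise ValueError(f"duplicate consortium/outcome pair: {key}")
        index[key] = row
    return index


def column_mapping(row):
    """Map raw GWAS column names to standardized names, skipping 'na' fields."""
    return {row[field]: field for field in COLUMN_FIELDS if row[field] != "na"}


studies = study_index(descriptor)
print(column_mapping(studies["GIANT_HEIGHT"]))
```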

Command line usage example:

It is possible to launch the complete preprocessing (all steps described in the Overview section) from the command line:

usage: jass_preprocessing [-h] --gwas-info GWAS_INFO --ref-path REF_PATH
                          --input-folder INPUT_FOLDER --diagnostic-folder
                          DIAGNOSTIC_FOLDER --output-folder OUTPUT_FOLDER
                          [--output-folder-1-file OUTPUT_FOLDER_1_FILE]
                          [--percent-sample-size PERCENT_SAMPLE_SIZE]
                          [--minimum-MAF MINIMUM_MAF] [--mask-MHC MASK_MHC]
                          [--additional-masked-region ADDITIONAL_MASKED_REGION]
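As an illustration, a call with the five required arguments could be assembled from Python as sketched below. Every path is a placeholder chosen for the example, and the two optional values simply restate the documented defaults. The snippet only prints the command; running it requires the package to be installed.

```python
import shlex

# All paths below are placeholders, not files shipped with the package.
command = [
    "jass_preprocessing",
    "--gwas-info", "gwas_info.csv",
    "--ref-path", "reference_panel.tsv",
    "--input-folder", "raw_gwas/",          # trailing '/' included
    "--diagnostic-folder", "diagnostics/",
    "--output-folder", "preprocessed/",
    "--percent-sample-size", "0.7",         # documented default
    "--minimum-MAF", "0.01",                # documented default
]

# Show the command as it would be typed in a terminal.
print(shlex.join(command))

# To actually run it, one could use:
#   import subprocess
#   subprocess.run(command, check=True)
```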

Named Arguments

--gwas-info

Path to the file describing the format of the individual GWAS files, with the correct header

--ref-path

Reference panel location (used to determine which SNPs to impute)

--input-folder

Path to the folder containing the raw GWAS summary statistic files; must end with ‘/’

--diagnostic-folder

Path to the folder where diagnostic reports on the preprocessing are stored, such as the distribution of SNP sample sizes

--output-folder

Location of the main output folder for preprocessed GWAS files (split by chromosome)

--output-folder-1-file

Optional location to store the preprocessed data in one tabular file with a chromosome column (useful to compute LDSC correlation, for instance)

--percent-sample-size

the proportion (between 0 and 1) of the 90th percentile of the sample size used to filter the SNPs

Default: 0.7
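The sample size filter described by this option can be sketched as follows: the threshold is the given proportion of the 90th percentile of the per-SNP sample sizes, and SNPs below it are removed. The helper names, the linear-interpolation percentile, the use of "greater than or equal to" at the threshold, and the sample sizes themselves are assumptions made for the illustration.

```python
def percentile(values, q):
    """Linearly interpolated percentile, with q between 0 and 1."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def filter_by_sample_size(sample_sizes, percent_sample_size=0.7):
    """Keep SNPs whose sample size reaches the proportion of the 90th percentile."""
    threshold = percent_sample_size * percentile(sample_sizes.values(), 0.9)
    return {snp: n for snp, n in sample_sizes.items() if n >= threshold}


# Made-up per-SNP sample sizes; snp_d is far below the others.
sample_sizes = {
    "snp_a": 250000,
    "snp_b": 253000,
    "snp_c": 240000,
    "snp_d": 100000,
    "snp_e": 252000,
}

kept = filter_by_sample_size(sample_sizes)
print(sorted(kept))
```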

--minimum-MAF

Filter the reference panel by minimum allele frequency

Default: 0.01

--mask-MHC

Whether the MHC region should be masked or not.

Default: False

--additional-masked-region

List of dictionaries containing the coordinates of the regions to mask. For example: [{'chr': 6, 'start': 50000000, 'end': 70000000}, {'chr': 6, 'start': 100000000, 'end': 120000000}]
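A list of regions in this form can be read as a Python literal and used to test whether a SNP falls inside a masked region. The sketch below shows one way to do so; how the command line actually parses the argument is not documented here, and the function name is_masked and the inclusive bounds are assumptions.

```python
import ast

# The example value from the option description, as a string.
raw = (
    "[{'chr': 6, 'start': 50000000, 'end': 70000000}, "
    "{'chr': 6, 'start': 100000000, 'end': 120000000}]"
)

# literal_eval only accepts Python literals, unlike eval.
regions = ast.literal_eval(raw)


def is_masked(chrom, pos, masked_regions):
    """Return True if the position lies in any masked region (bounds inclusive)."""
    return any(
        region["chr"] == chrom and region["start"] <= pos <= region["end"]
        for region in masked_regions
    )


print(is_masked(6, 60000000, regions))  # inside the first region
print(is_masked(6, 80000000, regions))  # between the two regions
```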
