Welcome to jass_preprocessing’s documentation!¶
What is jass preprocessing?¶
Jass preprocessing is a tool that takes heterogeneous GWAS summary statistics as input, standardizes them, and runs quality checks. The standardized summary statistic files it produces can be used as input to the JASS Python package and the RAISS imputation package.
Overview¶
The QC and preprocessing steps proceed as follows:
- Map the columns of each GWAS to standardized names
- Select the GWAS SNPs that are in the input reference panel
- Align the coded allele of the GWAS with the reference panel
- Infer the sample size per SNP if it is not present in the input data
- Filter out SNPs with a small sample size
- Normalize the effect size by sample size to obtain Z-scores
- Save the output by chromosome as the following example:
rsID | pos | A0 | A1 | Z |
---|---|---|---|---|
rs6548219 | 30762 | A | G | -1.133 |
- (Optional) Save the output to one file with a chromosome column (the input format needed for LD-score analysis)
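The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the package's implementation. The function `preprocess`, its helper, the in-memory data layout, the standardized name `beta`, and the 0.7 default are all hypothetical. The Z-score is taken here as effect size divided by standard error, the reported effect allele is assumed to be the reference panel's alt allele, and sample-size inference is omitted.

```python
import math


def percentile_nearest_rank(values, q):
    """Return the q-th percentile (0 < q <= 100) with the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def preprocess(gwas_rows, ref_panel, column_map, percent_sample_size=0.7):
    """Hypothetical sketch of the QC steps for one GWAS.

    gwas_rows:  list of dicts keyed by the raw GWAS header names
    ref_panel:  dict snp_id -> {"pos": int, "ref": str, "alt": str}
    column_map: dict raw header name -> standardized name
    """
    # Map study-specific column names to standardized names.
    rows = [{column_map[k]: v for k, v in row.items() if k in column_map}
            for row in gwas_rows]

    # Keep only SNPs present in the reference panel.
    rows = [r for r in rows if r["snpid"] in ref_panel]

    # Align the coded allele (assumed to be a1) with the panel's alt allele.
    aligned = []
    for r in rows:
        ref = ref_panel[r["snpid"]]
        a1, a2 = r["a1"].upper(), r["a2"].upper()
        if (a1, a2) == (ref["alt"], ref["ref"]):
            sign = 1.0      # same orientation as the panel
        elif (a1, a2) == (ref["ref"], ref["alt"]):
            sign = -1.0     # swapped alleles: flip the effect sign
        else:
            continue        # alleles do not match the panel: drop the SNP
        aligned.append((r, ref, sign))
    if not aligned:
        return []

    # Filter on sample size: a proportion of the 90th percentile of N.
    n90 = percentile_nearest_rank([r["n"] for r, _, _ in aligned], 90)
    threshold = percent_sample_size * n90

    # Compute Z (assumption: beta / se) and emit the output columns.
    return [{"rsID": r["snpid"], "pos": ref["pos"], "A0": ref["ref"],
             "A1": ref["alt"], "Z": sign * r["beta"] / r["se"]}
            for r, ref, sign in aligned if r["n"] >= threshold]


# Toy data, invented for this example only.
COLUMN_MAP = {"MarkerName": "snpid", "Allele1": "a1", "Allele2": "a2",
              "b": "beta", "SE": "se", "N": "n"}
REF_PANEL = {
    "rs1": {"pos": 100, "ref": "A", "alt": "G"},
    "rs2": {"pos": 200, "ref": "C", "alt": "T"},
    "rs3": {"pos": 300, "ref": "A", "alt": "C"},
}
GWAS_ROWS = [
    {"MarkerName": "rs1", "Allele1": "g", "Allele2": "a", "b": 0.5, "SE": 0.25, "N": 1000},
    {"MarkerName": "rs2", "Allele1": "C", "Allele2": "T", "b": 0.75, "SE": 0.25, "N": 1000},
    {"MarkerName": "rs3", "Allele1": "C", "Allele2": "A", "b": 0.5, "SE": 0.25, "N": 100},
    {"MarkerName": "rs9", "Allele1": "A", "Allele2": "G", "b": 0.5, "SE": 0.25, "N": 1000},
]

result = preprocess(GWAS_ROWS, REF_PANEL, COLUMN_MAP)
```

In this toy run, `rs9` is dropped because it is absent from the reference panel, and `rs3` is dropped because its sample size falls below the threshold. Splitting the result into one file per chromosome would be a final grouping step, which is left out here.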
Installation¶
In a terminal, execute the following command:
pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing
Input¶
- A reference panel (1000 Genomes format). The user is expected to provide a reference panel in TSV format with the following columns, in this order and without a header:
chr | pos | snp_id | ref | alt | MAF |
---|---|---|---|---|---|
1 | 13116 | rs62635286 | T | G | 0.0970447 |
1 | 13118 | rs200579949 | A | G | 0.0970447 |
1 | 14604 | rs541940975 | A | G | 0.147564 |
1 | 14930 | rs75454623 | A | G | 0.482228 |
- A folder containing all raw GWAS data, with all chromosomes of a study in one file. The minimal format requirements (for example, whether the files must be tab-separated) are not specified here.
- A list of the GWAS file names, given as strings.
- A descriptor CSV file that describes each GWAS summary statistics file. It contains:
- a header
- 1 line per study
- the fields are:
category | field name |
---|---|
path to the data | filename |
study info fields | consortia,outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp |
names of the header in the GWAS file | snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont |
Here is an example of a descriptor entry. Fields that are irrelevant for the study (for example, the odds ratio for a continuous trait) must be filled with na.
filename | consortia | outcome | fullName | type | Nsample | Ncase | Ncontrol | Nsnp | snpid | a1 | a2 | freq | pval | n | z | OR | se | code | imp | ncas | ncont |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GIANT_HEIGHT_Wood_et_al.txt | GIANT | HEIGHT | Height | Anthropometry | 253288 | na | na | 2550858 | MarkerName | Allele1 | Allele2 | Freq.Allele1.HapMapCEU | p | N | b | na | SE | na | na | na | na |
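As a rough illustration of how these inputs could be read, the sketch below loads a headerless reference panel and turns a descriptor line into a mapping from raw GWAS header names to standardized names, skipping fields set to na. The function names, the comma delimiter for the descriptor, and the in-memory sample text are assumptions made for this example. They are not taken from the package.

```python
import csv
import io

# Reference panel columns, in the documented order (the file has no header).
REF_COLUMNS = ["chr", "pos", "snp_id", "ref", "alt", "MAF"]

# Descriptor fields that hold the header names used in each GWAS file.
HEADER_FIELDS = ["snpid", "a1", "a2", "freq", "pval", "n", "z",
                 "OR", "se", "code", "imp", "ncas", "ncont"]


def load_reference_panel(tsv_file):
    """Read a headerless, tab-separated panel into a dict keyed by SNP id."""
    panel = {}
    for values in csv.reader(tsv_file, delimiter="\t"):
        if not values:
            continue  # skip blank lines
        row = dict(zip(REF_COLUMNS, values))
        panel[row["snp_id"]] = {
            "chr": row["chr"], "pos": int(row["pos"]),
            "ref": row["ref"], "alt": row["alt"], "maf": float(row["MAF"]),
        }
    return panel


def column_maps(descriptor_file, delimiter=","):
    """Return {filename: {raw GWAS header: standardized name}} per study.

    The delimiter is an assumption; adjust it to the actual descriptor file.
    """
    maps = {}
    for study in csv.DictReader(descriptor_file, delimiter=delimiter):
        maps[study["filename"]] = {
            study[field]: field
            for field in HEADER_FIELDS
            if study[field] != "na"  # irrelevant fields are filled with na
        }
    return maps


# Sample text built from the example rows shown above.
REF_TEXT = (
    "1\t13116\trs62635286\tT\tG\t0.0970447\n"
    "1\t13118\trs200579949\tA\tG\t0.0970447\n"
)
DESCRIPTOR_TEXT = (
    "filename,consortia,outcome,fullName,type,Nsample,Ncase,Ncontrol,Nsnp,"
    "snpid,a1,a2,freq,pval,n,z,OR,se,code,imp,ncas,ncont\n"
    "GIANT_HEIGHT_Wood_et_al.txt,GIANT,HEIGHT,Height,Anthropometry,253288,"
    "na,na,2550858,MarkerName,Allele1,Allele2,Freq.Allele1.HapMapCEU,p,N,b,"
    "na,SE,na,na,na,na\n"
)

panel = load_reference_panel(io.StringIO(REF_TEXT))
maps = column_maps(io.StringIO(DESCRIPTOR_TEXT))
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.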
Command line usage example:¶
It is possible to launch the complete preprocessing (all steps described in the Overview section) from the command line:
usage: jass_preprocessing [-h] --percent-sample-size PERCENT_SAMPLE_SIZE
--gwas-info GWAS_INFO --ref-folder REF_FOLDER
--gwas-folder GWAS_FOLDER --output-folder
OUTPUT_FOLDER
[--output-folder-1-file OUTPUT_FOLDER_1_FILE]
Named Arguments¶
--percent-sample-size | The proportion of the 90th percentile of the sample size used to filter the SNPs |
--gwas-info | Path to the file describing the format of the individual GWAS files |
--ref-folder | Reference panel location (used to determine which SNPs to impute) |
--gwas-folder | |
--output-folder | Location of the main output folder for preprocessed GWAS files (split by chromosome) |
--output-folder-1-file | Optional location to store the preprocessed output in one tabular file with a chromosome column |
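To make the argument structure concrete, here is a minimal `argparse` sketch that mirrors the usage line above. It is an illustration only, not the package's source: the `float` type for `--percent-sample-size` and all example paths are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="jass_preprocessing")
    # Arguments shown without brackets in the usage line are required.
    parser.add_argument("--percent-sample-size", required=True, type=float,
                        help="proportion of the 90th percentile of the "
                             "sample size used to filter the SNPs")
    parser.add_argument("--gwas-info", required=True,
                        help="file describing the format of each GWAS file")
    parser.add_argument("--ref-folder", required=True,
                        help="reference panel location")
    parser.add_argument("--gwas-folder", required=True)
    parser.add_argument("--output-folder", required=True,
                        help="output folder for files split by chromosome")
    # The only bracketed (optional) argument in the usage line.
    parser.add_argument("--output-folder-1-file", default=None,
                        help="optional single-file output location")
    return parser


# Example invocation; every path here is a placeholder.
args = build_parser().parse_args([
    "--percent-sample-size", "0.7",
    "--gwas-info", "gwas_descriptor.csv",
    "--ref-folder", "ref_panel/",
    "--gwas-folder", "raw_gwas/",
    "--output-folder", "preprocessed/",
])
```

`argparse` converts dashes in option names to underscores, so the values are available as `args.percent_sample_size`, `args.gwas_info`, and so on.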
Indices and tables¶
imputation_launcher | Imputation launcher |
ld_matrix | Set of functions to compute LD correlation from a reference panel in predefined regions |
stat_models | This module contains the statistical library for imputation. |
windows | Implements the imputation window that slides along the genome |