Welcome to jass_preprocessing’s documentation!
What is JASS preprocessing?
JASS preprocessing is a tool that takes heterogeneous GWAS summary statistics as input, performs standardization and quality checks, and outputs standardized summary statistic files that can be used as input to the JASS Python package and the RAISS imputation package.
Overview
The QC and preprocessing steps proceed as follows:
Map the columns of each GWAS to standardized names
Select the GWAS SNPs that are in the input reference panel
Align the coded allele of the GWAS with the reference panel
Infer the sample size per SNP if it is not present in the input data
Filter out SNPs with a small sample size
Normalize the effect size by sample size to obtain Z-scores
Save the output by chromosome, as in the following example:
rsID | pos | A0 | A1 | Z |
---|---|---|---|---|
rs6548219 | 30762 | A | G | -1.133 |
(Optional) Save the output to one file with a chromosome column (the input format needed to perform LD-score)
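The selection and alignment steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the package's actual implementation: the function name `align_to_reference` is hypothetical, the GWAS column `a1` is assumed to be the coded allele, the output Z-score is assumed to be expressed relative to the reference panel's `alt` allele, and strand flips are ignored.

```python
def align_to_reference(gwas, reference):
    """Keep GWAS SNPs present in the reference panel and align their Z-scores.

    gwas:      dict mapping rsID -> (a1, a2, z), a1 assumed to be the coded allele
    reference: dict mapping rsID -> (ref, alt)
    Returns a dict mapping rsID -> (ref, alt, z) with z relative to alt (assumption).
    """
    aligned = {}
    for rsid, (a1, a2, z) in gwas.items():
        if rsid not in reference:
            continue  # SNP absent from the reference panel: dropped
        ref, alt = reference[rsid]
        if (a1, a2) == (alt, ref):
            aligned[rsid] = (ref, alt, z)    # same coded allele: keep the sign
        elif (a1, a2) == (ref, alt):
            aligned[rsid] = (ref, alt, -z)   # alleles swapped: flip the sign
        # any other allele pair cannot be matched and is dropped in this sketch
    return aligned


# Toy data, invented for illustration only.
gwas = {
    "rs1": ("G", "A", 1.5),
    "rs2": ("A", "G", 2.0),
    "rs3": ("C", "T", 0.3),
    "rs4": ("A", "G", 1.0),
}
reference = {"rs1": ("A", "G"), "rs2": ("A", "G"), "rs3": ("A", "G")}
aligned = align_to_reference(gwas, reference)
print(aligned)
```

In this toy run, rs4 is dropped because it is not in the reference panel and rs3 is dropped because its alleles do not match the panel.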
Installation
In a terminal, execute the following lines:
pip3 install git+https://gitlab.pasteur.fr/statistical-genetics/JASS_Pre-processing
Input
A reference panel (1000 Genomes format). The user is expected to provide a reference panel in TSV format with the following columns, in this order and without a header:
chr | pos | snp_id | ref | alt | MAF |
---|---|---|---|---|---|
1 | 13116 | rs62635286 | T | G | 0.0970447 |
1 | 13118 | rs200579949 | A | G | 0.0970447 |
1 | 14604 | rs541940975 | A | G | 0.147564 |
1 | 14930 | rs75454623 | A | G | 0.482228 |
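Because the reference panel has no header, the column names have to be supplied when the file is read. The following sketch shows one way to do that with the standard library; the helper name `load_reference_panel` is hypothetical, and an in-memory sample stands in for a real file.

```python
import csv
import io

# Column names taken from the table above; the file itself has no header.
REF_COLUMNS = ["chr", "pos", "snp_id", "ref", "alt", "MAF"]


def load_reference_panel(handle):
    """Parse a headerless, tab-separated reference panel into a list of dicts."""
    panel = []
    for row in csv.reader(handle, delimiter="\t"):
        record = dict(zip(REF_COLUMNS, row))
        record["pos"] = int(record["pos"])    # base-pair position
        record["MAF"] = float(record["MAF"])  # minor allele frequency
        panel.append(record)
    return panel


# Inline sample with the rows from the table above, standing in for a real file.
sample = io.StringIO(
    "1\t13116\trs62635286\tT\tG\t0.0970447\n"
    "1\t13118\trs200579949\tA\tG\t0.0970447\n"
    "1\t14604\trs541940975\tA\tG\t0.147564\n"
    "1\t14930\trs75454623\tA\tG\t0.482228\n"
)
ref_panel = load_reference_panel(sample)
print(ref_panel[0])
```

The chromosome is kept as a string in this sketch so that non-numeric chromosome labels would not break the parsing.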
A folder containing all raw GWAS data (all chromosomes in one file) (minimal conditions?? tab separated?)
A list containing the names of the GWAS files, as strings.
A descriptor CSV file that describes each GWAS summary statistic file:
a header
1 line per study
the fields are:
Note that the combination of consortia and outcome must be unique because it is used as an index in the cleaning process.
Here is an example of a descriptor entry. Fields that are irrelevant to the study (for example, the odds ratio for a continuous trait) must be filled with na.
| consortia | outcome | fullName | type | Nsample | Ncase | Ncontrol | Nsnp | snpid | a1 | a2 | freq | pval | n | z | OR | se | code | imp | ncas | ncont |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GIANT_HEIGHT_Wood_et_al.txt | GIANT | HEIGHT | Height | Anthropometry | 253288 | na | na | 2550858 | MarkerName | Allele1 | Allele2 | Freq.Allele1.HapMapCEU | p | N | b | na | SE | na | na | na | na |
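To make the role of the descriptor concrete, the sketch below takes the example entry above as a plain dictionary, checks that the (consortia, outcome) pairs are unique, and derives a mapping from the raw GWAS column names to the standardized ones. It is an illustration only: the function names are hypothetical, and the key `filename` for the unnamed first column is an assumption.

```python
# Descriptor fields that name a column of the raw GWAS file.
STANDARD_FIELDS = ["snpid", "a1", "a2", "freq", "pval", "n", "z",
                   "OR", "se", "code", "imp", "ncas", "ncont"]

# The example entry above; "filename" is an assumed name for the first column.
giant_height = {
    "filename": "GIANT_HEIGHT_Wood_et_al.txt",
    "consortia": "GIANT", "outcome": "HEIGHT", "fullName": "Height",
    "type": "Anthropometry", "Nsample": "253288", "Ncase": "na",
    "Ncontrol": "na", "Nsnp": "2550858",
    "snpid": "MarkerName", "a1": "Allele1", "a2": "Allele2",
    "freq": "Freq.Allele1.HapMapCEU", "pval": "p", "n": "N", "z": "b",
    "OR": "na", "se": "SE", "code": "na", "imp": "na",
    "ncas": "na", "ncont": "na",
}


def check_unique_index(rows):
    """Raise ValueError if two studies share the same (consortia, outcome) pair."""
    seen = set()
    for row in rows:
        key = (row["consortia"], row["outcome"])
        if key in seen:
            raise ValueError(f"duplicate study index: {key}")
        seen.add(key)


def column_mapping(row):
    """Return {raw column name: standardized name}, skipping fields set to na."""
    return {row[field]: field for field in STANDARD_FIELDS if row[field] != "na"}


check_unique_index([giant_height])
mapping = column_mapping(giant_height)
print(mapping)
```

Fields filled with na are simply left out of the mapping, so the corresponding standardized columns are absent for that study.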
Command line usage example:
It is possible to launch the complete preprocessing (all steps described in the Overview section) from the command line:
usage: jass_preprocessing [-h] --gwas-info GWAS_INFO --ref-path REF_PATH
--input-folder INPUT_FOLDER --diagnostic-folder
DIAGNOSTIC_FOLDER --output-folder OUTPUT_FOLDER
[--output-folder-1-file OUTPUT_FOLDER_1_FILE]
[--percent-sample-size PERCENT_SAMPLE_SIZE]
[--minimum-MAF MINIMUM_MAF] [--mask-MHC MASK_MHC]
[--additional-masked-region ADDITIONAL_MASKED_REGION]
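As a concrete illustration of the usage string above, the following sketch assembles an invocation with the five required arguments and two optional ones. All file and folder names are hypothetical placeholders; only the flags come from the usage string. The command is printed rather than executed, so the sketch runs even where the package is not installed.

```python
import shlex

# Hypothetical paths, chosen only for illustration.
args = [
    "jass_preprocessing",
    "--gwas-info", "gwas_descriptor.csv",
    "--ref-path", "reference_panel.tsv",
    "--input-folder", "raw_gwas/",          # must end with '/'
    "--diagnostic-folder", "diagnostics/",
    "--output-folder", "preprocessed/",
    "--percent-sample-size", "0.7",         # documented default
    "--minimum-MAF", "0.01",                # documented default
]

command = shlex.join(args)  # shell-safe, single-line form of the command
print(command)

# To actually launch the preprocessing (requires the package to be installed):
# import subprocess
# subprocess.run(args, check=True)
```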
Named Arguments
- --gwas-info
Path to the file describing the format of the individual GWAS files, with the correct header
- --ref-path
Reference panel location (used to determine which SNPs to impute)
- --input-folder
Path to the folder containing the raw GWAS summary statistic files; must end with '/'
- --diagnostic-folder
Path to the folder for reporting information on the preprocessing, such as the SNP sample size distribution
- --output-folder
Location of the main output folder for the preprocessed GWAS files (split by chromosome)
- --output-folder-1-file
Optional location to store the preprocessed data in one tabular file with a chromosome column (useful to compute LDSC correlation, for instance)
- --percent-sample-size
Proportion (between 0 and 1) of the 90th percentile of the sample size, used as the threshold for filtering out SNPs with a small sample size
Default: 0.7
- --minimum-MAF
Filter the reference panel by minimum allele frequency
Default: 0.01
- --mask-MHC
Whether the MHC region should be masked
Default: False
- --additional-masked-region
List of dictionaries containing the coordinates of the regions to mask. For example: [{'chr':6, 'start':50000000, 'end': 70000000}, {'chr':6, 'start':100000000, 'end': 120000000}]
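Two of the options above, --percent-sample-size and --additional-masked-region, lend themselves to a short worked example. The sketch below is a hedged illustration, not the package's code: both function names are hypothetical, the 90th percentile is computed with one particular interpolation method, SNPs exactly at the threshold are kept, and region bounds are treated as inclusive. All of these are assumptions.

```python
import statistics


def filter_by_sample_size(sample_sizes, percent_sample_size=0.7):
    """Keep SNPs whose sample size reaches a fraction of the 90th percentile.

    sample_sizes: dict mapping rsID -> sample size (at least two SNPs needed).
    """
    # The last of the nine decile cut points is the 90th percentile.
    p90 = statistics.quantiles(sample_sizes.values(), n=10, method="inclusive")[-1]
    threshold = percent_sample_size * p90
    return {rsid: n for rsid, n in sample_sizes.items() if n >= threshold}


def is_masked(chrom, pos, regions):
    """Return True if (chrom, pos) falls inside any region (bounds inclusive)."""
    return any(
        r["chr"] == chrom and r["start"] <= pos <= r["end"] for r in regions
    )


# Toy sample sizes, invented for illustration.
sizes = {
    f"rs{i}": n
    for i, n in enumerate(
        [100, 600, 650, 800, 900, 1000, 1000, 1000, 1000, 1000, 1000], start=1
    )
}
kept = filter_by_sample_size(sizes, percent_sample_size=0.7)
print(sorted(kept))

# Regions in the format shown for --additional-masked-region.
regions = [
    {"chr": 6, "start": 50000000, "end": 70000000},
    {"chr": 6, "start": 100000000, "end": 120000000},
]
print(is_masked(6, 60000000, regions), is_masked(6, 80000000, regions))
```

With these toy values the 90th percentile is 1000, so the threshold is 700 and the three SNPs with sample sizes 100, 600 and 650 are filtered out.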