Commit 91cba587 authored by Hanna  JULIENNE's avatar Hanna JULIENNE
Browse files

Merge branch 'dev' into 'master'

Dev

See merge request !9
parents cfa5ee1d c25ec683
Pipeline #81112 passed with stages
in 1 minute and 7 seconds
......@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
| rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 |
+-------------+----------+------------+------------+---------+-------+----------+---------------+
Variance is set to -1 for variants present in the input dataset
*Variance is set to -1 for variants present in the input dataset*
Optimizing RAISS parameter for your data
========================================
......@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
Test procedure :
1. Mask N SNPs on a chromosome
2. Imputed masked file
3. Compute correlation between genotype Z-values to imputed Z-values
2. Impute masked files for different values of the --eigen-threshold
and the --minimum-ld parameters
3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
To perform this test follow this procedure :
......@@ -128,44 +129,68 @@ To perform this test follow this procedure :
2. Create a folder to store z-score files imputed with different parameter
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
.. code-block:: python
:linenos:
import raiss
import pandas as pd
import sys
perf_results = raiss.imputation_R2.grid_search(
${path_z-scores_folder},
${path_to_masked_z-scores_folder},
${path_to_imputed_z-scores_folder},
${path_to_reference_panel_folder},
${path_to_LD_matrices_folder},
"GWAS_TAG", chrom="chr22",
eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
R2_threshold=0.6)
fout = "./Perf_"+GWAS_TAG+".csv"
$path_to_initial_zscores_folder,
$path_to_masked_zscores_output_folder,
$path_to_store_masked_zscores_output_folder,
$path_to_reference_panel,
$path_to_LD_matrices,
gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
ld_threshold_grid = [0,2, 5,7],
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
R2_threshold=0.6)
fout = "performance_report.csv"
print(perf_results)
perf_results.to_csv(fout, sep="\t")
The file Perf_GWAS_TAG ressemble the following output:
+----+----+--------------------+-----------------+
| |cor |mean_absolute_error |fraction_imputed |
+====+====+====================+=================+
|1.0 |0.95| 0.243 | 1.0 |
+----+----+--------------------+-----------------+
| 0.5|0.94| 0.246 | 0.95 |
+----+----+--------------------+-----------------+
The row names correspond to the eigen ratio parameter that was tested.
The second column is the correlation between imputed and genotyped Z-scores.
The third column is the mean L1-error between imputed and genotyped Z-scores.
The fourth column is the fraction of SNPs on the 5000 that were imputed.
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
The file Perf_GWAS_TAG ressembles the following output:
.. csv-table:: Performance Report
:widths: 10, 10, 10,10, 10, 10,10,10,10,10
:header-rows: 1
"eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : eigen ratio parameter that was tested.
* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`
To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend to run a grid search to pick the best parameter for your data.
However, empirically, we never observed a difference of performance from one chromosome to another.
However, so far, we never observed a difference of performance from one chromosome to another.
We suggest testing on the chr22 for computational efficiency.
Command Line Usage
==================
......
import argparse
import pandas as pd
import sys
from raiss.filter_format_output import filter_output
from raiss.imputation_launcher import ImputationLauncher
from raiss.pipes import save_chromosome_imputation
......@@ -23,7 +24,7 @@ def launch_chromosome_imputation(args):
def add_chromosome_imputation_argument():
parser = argparse.ArgumentParser()
parser = argparse.ArgumentParser(description="raiss command launch imputation for one trait on one chromosome")
parser.add_argument('--chrom', required=True, help= "chromosome to impute to the chr\d+ format")
parser.add_argument('--gwas', required=True, help= "GWAS to impute to the consortia_trait format")
......@@ -41,16 +42,29 @@ def add_chromosome_imputation_argument():
parser.add_argument("--ld-type", help= "Ld can be supplied as plink command --ld-snp-list output files (see raiss.ld_matrix.launch_plink_ld to compute these data using plink) or as a couple of a scipy sparse matrix (.npz )and an .csv containing SNPs index", default="plink")
parser.add_argument('--ref-panel-prefix', help= "prefix for the reference panel files", default = "")
parser.add_argument('--ref-panel-suffix', help= "suffix for the reference panel files", default = ".bim")
parser.add_argument('--minimum-ld', help = "this parameter ensure that their is enough typed SNPs around the imputed to perform a high accuracy imputation", default = 4)
parser.add_argument('--minimum-ld', help = "this parameter ensure that their is enough typed SNPs around the imputed loci to perform a high accuracy imputation", default = 4)
parser.set_defaults(func=launch_chromosome_imputation)
return(parser)
def main():
print("", file=sys.stderr)
print(" ******* ******* ******* ******* *******", file=sys.stderr)
print(" ** ** ** ** *** ** **", file=sys.stderr)
print(" ** ** ** ** *** ** **", file=sys.stderr)
print(" ** ***** ** ** *** ****** ******", file=sys.stderr)
print(" ** ** *********** *** ** **", file=sys.stderr)
print(" ** ** ** ** *** ** **", file=sys.stderr)
print(" ** ** ** ** ******* ******* *******", file=sys.stderr)
print("", file=sys.stderr)
print("", file=sys.stderr)
parser = add_chromosome_imputation_argument()
args = parser.parse_args()
args.func(args)
if __name__=="__main__":
main()
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment