Commit c25ec683 authored by Hanna JULIENNE
parents 34e954d3 b3672568
1 merge request: !9 Dev
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
 | rs111876722 | 201922   | C          | T          | 0.297   | 0.09  | 5.412    | 0.91578       |
 +-------------+----------+------------+------------+---------+-------+----------+---------------+
 
-Variance is set to -1 for variants present in the input dataset
+*Variance is set to -1 for variants present in the input dataset*
 
 Optimizing RAISS parameter for your data
 ========================================
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
 Test procedure :
 
 1. Mask N SNPs on a chromosome
-2. Imputed masked file
-3. Compute correlation between genotype Z-values to imputed Z-values
+2. Impute masked files for different values of the --eigen-threshold
+   and the --minimum-ld parameters
+3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
 
 To perform this test follow this procedure :
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
 2. Create a folder to store z-score files imputed with different parameter
 3. Adapt the following code snippet to apply the function to your data:
-.. code-block::
+.. code-block:: python
     :linenos:
 
+    import raiss
+    import pandas as pd
+    import sys
     perf_results = raiss.imputation_R2.grid_search(
-        ${path_z-scores_folder},
-        ${path_to_masked_z-scores_folder},
-        ${path_to_imputed_z-scores_folder},
-        ${path_to_reference_panel_folder},
-        ${path_to_LD_matrices_folder},
-        "GWAS_TAG", chrom="chr22",
-        eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
-        window_size= 500000, buffer_size=125000, l2_regularization=0.1,
-        R2_threshold=0.6)
-    fout = "./Perf_"+GWAS_TAG+".csv"
+        $path_to_initial_zscores_folder,
+        $path_to_masked_zscores_output_folder,
+        $path_to_store_masked_zscores_output_folder,
+        $path_to_reference_panel,
+        $path_to_LD_matrices,
+        gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
+        eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
+        ld_threshold_grid = [0,2, 5,7],
+        window_size= 500000, buffer_size=125000, l2_regularization=0.1,
+        R2_threshold=0.6)
+    fout = "performance_report.csv"
     print(perf_results)
     perf_results.to_csv(fout, sep="\t")
-The file Perf_GWAS_TAG ressemble the following output:
+The file Perf_GWAS_TAG resembles the following output:
-+----+----+--------------------+-----------------+
-|    |cor |mean_absolute_error |fraction_imputed |
-+====+====+====================+=================+
-|1.0 |0.95| 0.243              | 1.0             |
-+----+----+--------------------+-----------------+
-| 0.5|0.94| 0.246              | 0.95            |
-+----+----+--------------------+-----------------+
-
-The row names correspond to the eigen ratio parameter that was tested.
-The second column is the correlation between imputed and genotyped Z-scores.
-The third column is the mean L1-error between imputed and genotyped Z-scores.
-The fourth column is the fraction of SNPs on the 5000 that were imputed.
-
-The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
+.. csv-table:: Performance Report
+    :widths: 10, 10, 10,10, 10, 10,10,10,10,10
+    :header-rows: 1
+
+    "eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
+    0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
+    0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
+    0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
+    0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
+    0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
+    0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
+    0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
+    0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
+    0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+    0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+    0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+    0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+* **eigen_ratio** : eigen ratio parameter that was tested.
+* **min_ld** : minimum LD parameter (--minimum-ld) that was tested.
+* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criterion and/or the min_ld criterion)
+* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
+* **cor** : the correlation between imputed and genotyped Z-scores.
+* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
+* **median_absolute_error** : :math:`\operatorname{median}|Z_{imputed} - Z_{masked}|`
+* **min_absolute_error** : :math:`\min|Z_{imputed} - Z_{masked}|`
+* **max_absolute_error** : :math:`\max|Z_{imputed} - Z_{masked}|`
+* **SNP_max_error** : :math:`\arg\max|Z_{imputed} - Z_{masked}|`
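
Note on the column definitions above: the following is a minimal sketch of how these statistics could be computed for one parameter setting. It is not the raiss implementation. The function names `pearson` and `summarize_imputation` and the dict-based inputs are hypothetical, chosen only to illustrate the definitions.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def summarize_imputation(z_masked, z_imputed):
    """Summarize imputation accuracy on masked SNPs (hypothetical helper).

    z_masked:  dict rsID -> genotyped Z-score, for every masked SNP.
    z_imputed: dict rsID -> imputed Z-score, only for SNPs that passed
               the filters (assumed to be a subset of the masked SNPs).
    """
    # SNPs that were masked and then successfully imputed
    snps = [s for s in z_masked if s in z_imputed]
    abs_err = {s: abs(z_imputed[s] - z_masked[s]) for s in snps}
    return {
        "N_SNP": len(snps),
        "fraction_imputed": len(snps) / len(z_masked),
        "cor": pearson([z_imputed[s] for s in snps],
                       [z_masked[s] for s in snps]),
        "mean_absolute_error": mean(abs_err.values()),
        "median_absolute_error": median(abs_err.values()),
        "min_absolute_error": min(abs_err.values()),
        "max_absolute_error": max(abs_err.values()),
        # SNP with the largest absolute error
        "SNP_max_error": max(abs_err, key=abs_err.get),
    }


# Toy values for illustration only (not real GWAS data)
masked = {"rsA": 1.0, "rsB": -2.0, "rsC": 0.5, "rsD": 3.0}
imputed = {"rsA": 1.5, "rsB": -2.0, "rsD": 2.0}
stats = summarize_imputation(masked, imputed)
print(stats)
```

In this toy example one of the four masked SNPs is not imputed, so it lowers `fraction_imputed` but does not contribute to the error statistics.
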
+To pick the best parameters, we recommend searching for a compromise between a low imputation error and a high fraction of masked SNPs imputed
+(a trade-off between **fraction_imputed** and **mean_absolute_error**).
+
+The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
 Hence, we recommend to run a grid search to pick the best parameter for your data.
-However, empirically, we never observed a difference of performance from one chromosome to another.
+However, so far, we never observed a difference of performance from one chromosome to another.
 We suggest testing on the chr22 for computational efficiency.
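
Note on the trade-off described above: one possible way to turn it into a selection rule is to require a minimum fraction of imputed SNPs and then take the setting with the lowest mean absolute error. This is only a sketch. The `pick_parameters` helper and the `min_fraction` threshold are assumptions, not part of raiss. The rows are copied from the performance report in this diff and embedded as comma-separated text, whereas the snippet in the documentation writes the report with a tab separator.

```python
import csv
import io

# A few rows copied from the performance report shown in the diff
REPORT = """eigen_ratio,min_ld,N_SNP,fraction_imputed,cor,mean_absolute_error
0.1,0,2970,0.594,0.978,0.277
0.1,5,2840,0.568,0.978,0.277
0.1,7,2550,0.51,0.978,0.275
0.15,0,2470,0.494,0.976,0.282
0.2,0,2040,0.409,0.973,0.291
"""


def pick_parameters(report_text, min_fraction):
    """Return (eigen_ratio, min_ld) with the lowest mean absolute error
    among settings imputing at least `min_fraction` of the masked SNPs.
    Ties are broken in favour of the higher fraction imputed.
    Returns None when no setting reaches the threshold."""
    rows = list(csv.DictReader(io.StringIO(report_text)))
    candidates = [r for r in rows
                  if float(r["fraction_imputed"]) >= min_fraction]
    if not candidates:
        return None
    best = min(candidates,
               key=lambda r: (float(r["mean_absolute_error"]),
                              -float(r["fraction_imputed"])))
    return float(best["eigen_ratio"]), int(best["min_ld"])


# min_fraction=0.55 is an arbitrary illustrative threshold
print(pick_parameters(REPORT, min_fraction=0.55))
```

The threshold encodes how much coverage one is willing to give up for accuracy, so it has to be chosen for each study.
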
 
 Command Line Usage
 ==================