Skip to content
Snippets Groups Projects

Dev

Merged Hanna JULIENNE requested to merge dev into master
2 files
+ 73
34
Compare changes
  • Side-by-side
  • Inline
Files
2
+ 57
32
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
| rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 |
+-------------+----------+------------+------------+---------+-------+----------+---------------+
Variance is set to -1 for variants present in the input dataset
*Variance is set to -1 for variants present in the input dataset*
Optimizing RAISS parameter for your data
========================================
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
Test procedure :
1. Mask N SNPs on a chromosome
2. Imputed masked file
3. Compute correlation between genotype Z-values to imputed Z-values
2. Impute masked files for different values of the --eigen-threshold
and the --minimum-ld parameters
3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
To perform this test follow this procedure :
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
2. Create a folder to store z-score files imputed with different parameter
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
.. code-block:: python
:linenos:
import raiss
import pandas as pd
import sys
perf_results = raiss.imputation_R2.grid_search(
${path_z-scores_folder},
${path_to_masked_z-scores_folder},
${path_to_imputed_z-scores_folder},
${path_to_reference_panel_folder},
${path_to_LD_matrices_folder},
"GWAS_TAG", chrom="chr22",
eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
R2_threshold=0.6)
fout = "./Perf_"+GWAS_TAG+".csv"
$path_to_initial_zscores_folder,
$path_to_masked_zscores_output_folder,
$path_to_store_masked_zscores_output_folder,
$path_to_reference_panel,
$path_to_LD_matrices,
gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
ld_threshold_grid = [0,2, 5,7],
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
R2_threshold=0.6)
fout = "performance_report.csv"
print(perf_results)
perf_results.to_csv(fout, sep="\t")
The file Perf_GWAS_TAG ressemble the following output:
+----+----+--------------------+-----------------+
| |cor |mean_absolute_error |fraction_imputed |
+====+====+====================+=================+
|1.0 |0.95| 0.243 | 1.0 |
+----+----+--------------------+-----------------+
| 0.5|0.94| 0.246 | 0.95 |
+----+----+--------------------+-----------------+
The row names correspond to the eigen ratio parameter that was tested.
The second column is the correlation between imputed and genotyped Z-scores.
The third column is the mean L1-error between imputed and genotyped Z-scores.
The fourth column is the fraction of SNPs on the 5000 that were imputed.
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
The file Perf_GWAS_TAG ressembles the following output:
.. csv-table:: Performance Report
:widths: 10, 10, 10,10, 10, 10,10,10,10,10
:header-rows: 1
"eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : eigen ratio parameter that was tested.
* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`
To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend to run a grid search to pick the best parameter for your data.
However, empirically, we never observed a difference of performance from one chromosome to another.
However, so far, we never observed a difference of performance from one chromosome to another.
We suggest testing on the chr22 for computational efficiency.
Command Line Usage
==================