Merged Hanna JULIENNE requested to merge dev into master
2 files
+ 73
+ 57
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
| rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 |
Variance is set to -1 for variants present in the input dataset
*Variance is set to -1 for variants present in the input dataset*
Optimizing RAISS parameter for your data
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
Test procedure :
1. Mask N SNPs on a chromosome
2. Imputed masked file
3. Compute correlation between genotype Z-values to imputed Z-values
2. Impute masked files for different values of the --eigen-threshold
and the --minimum-ld parameters
3. Compute the correlation (and other statistics) between genotyped Z-values and imputed Z-values
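
As a purely conceptual sketch of step 1 (this is not RAISS's own masking code, and the helper name ``mask_snps`` and the dictionary layout are assumptions made for illustration), masking N SNPs amounts to hiding a random subset of Z-scores and keeping them aside as the ground truth:

```python
import random


def mask_snps(z_scores, n_masked, seed=0):
    """Split {snp_id: z_score} into (kept, masked) by hiding n_masked random SNPs.

    Hypothetical helper for illustration; RAISS handles masking internally.
    """
    rng = random.Random(seed)  # fixed seed so the masked set is reproducible
    masked_ids = set(rng.sample(sorted(z_scores), n_masked))
    kept = {snp: z for snp, z in z_scores.items() if snp not in masked_ids}
    masked = {snp: z for snp, z in z_scores.items() if snp in masked_ids}
    return kept, masked


# Made-up Z-scores, for illustration only
z_scores = {"rs1": 1.2, "rs2": -0.4, "rs3": 2.1, "rs4": 0.3, "rs5": -1.7}
kept, masked = mask_snps(z_scores, n_masked=2)
```

The ``kept`` Z-scores would play the role of the imputation input, while the ``masked`` ones are the reference values the imputed Z-scores are compared against in step 3.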
To perform this test follow this procedure :
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
2. Create a folder to store z-score files imputed with different parameters
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
.. code-block:: python
import raiss
import pandas as pd
import sys
perf_results = raiss.imputation_R2.grid_search(
"GWAS_TAG", chrom="chr22",
eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
fout = "./Perf_"+GWAS_TAG+".csv"
gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
ld_threshold_grid = [0,2, 5,7],
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
fout = "performance_report.csv"
perf_results.to_csv(fout, sep="\t")
The file Perf_GWAS_TAG resembles the following output:
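+-----+------+---------------------+------------------+
|     | cor  | mean_absolute_error | fraction_imputed |
+=====+======+=====================+==================+
| 1.0 | 0.95 | 0.243               | 1.0              |
+-----+------+---------------------+------------------+
| 0.5 | 0.94 | 0.246               | 0.95             |
+-----+------+---------------------+------------------+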
The row names correspond to the eigen ratio parameter that was tested.
The second column is the correlation between imputed and genotyped Z-scores.
The third column is the mean L1-error between imputed and genotyped Z-scores.
The fourth column is the fraction of SNPs on the 5000 that were imputed.
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
The file Perf_GWAS_TAG resembles the following output:
.. csv-table:: Performance Report
:widths: 10, 10, 10,10, 10, 10,10,10,10,10
:header-rows: 1
* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : minimum LD parameter that was tested.
* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`
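
The error statistics listed above can be illustrated with a small standard-library sketch. The function name ``imputation_metrics`` and the toy values are assumptions made for this example; it is not the code RAISS uses to build the report:

```python
import math
import statistics


def imputation_metrics(snp_ids, z_imputed, z_masked):
    """Summarise agreement between imputed and masked (genotyped) Z-scores."""
    abs_err = [abs(zi - zm) for zi, zm in zip(z_imputed, z_masked)]

    # Pearson correlation, computed by hand to stay within the standard library
    mean_i = statistics.fmean(z_imputed)
    mean_m = statistics.fmean(z_masked)
    cov = sum((zi - mean_i) * (zm - mean_m) for zi, zm in zip(z_imputed, z_masked))
    var_i = sum((zi - mean_i) ** 2 for zi in z_imputed)
    var_m = sum((zm - mean_m) ** 2 for zm in z_masked)

    # Index of the SNP with the largest absolute error (the argmax)
    worst = max(range(len(abs_err)), key=abs_err.__getitem__)

    return {
        "N_SNP": len(abs_err),
        "cor": cov / math.sqrt(var_i * var_m),
        "mean_absolute_error": statistics.fmean(abs_err),
        "median_absolute_error": statistics.median(abs_err),
        "min_absolute_error": min(abs_err),
        "max_absolute_error": max(abs_err),
        "SNP_max_error": snp_ids[worst],
    }


# Made-up values, for illustration only
snp_ids = ["rs1", "rs2", "rs3", "rs4"]
z_masked = [1.0, -2.0, 0.5, 3.0]
z_imputed = [0.5, -1.5, 0.5, 1.0]
metrics = imputation_metrics(snp_ids, z_imputed, z_masked)
```

Here ``fraction_imputed`` would simply be ``metrics["N_SNP"]`` divided by the total number of masked SNPs.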
To pick the best parameters, we recommend searching for a compromise between a low imputation error and a high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).
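
One possible way to make this trade-off explicit is sketched below: keep only the parameter combinations that impute at least a chosen fraction of the masked SNPs, then take the one with the lowest mean absolute error. The ``pick_parameters`` helper, the 0.9 cut-off and the report rows are all assumptions made for illustration, not a rule prescribed by RAISS:

```python
def pick_parameters(report_rows, min_fraction_imputed=0.9):
    """Return the row with the lowest mean_absolute_error among rows that
    impute at least min_fraction_imputed of the masked SNPs (None if no row does).
    """
    eligible = [
        row for row in report_rows
        if row["fraction_imputed"] >= min_fraction_imputed
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda row: row["mean_absolute_error"])


# Made-up report rows, for illustration only
report = [
    {"eigen_ratio": 1.0, "min_ld": 0, "fraction_imputed": 1.00, "mean_absolute_error": 0.30},
    {"eigen_ratio": 0.5, "min_ld": 2, "fraction_imputed": 0.95, "mean_absolute_error": 0.25},
    {"eigen_ratio": 0.1, "min_ld": 5, "fraction_imputed": 0.60, "mean_absolute_error": 0.20},
]
best = pick_parameters(report)
```

In this toy report the last row has the smallest error but imputes too few SNPs, so the second row is retained. The cut-off should be adapted to how much coverage you are willing to give up for accuracy.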
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend running a grid search to pick the best parameters for your data.
However, empirically, we never observed a difference of performance from one chromosome to another.
However, so far, we have not observed a difference in performance from one chromosome to another.
We suggest testing on chr22 for computational efficiency.
Command Line Usage