Merge branch 'dev' of https://gitlab.pasteur.fr/statistical-genetics/raiss into dev

c25ec683 · Hanna JULIENNE · 34e954d3 · b3672568 · c25ec683
Commit c25ec683 authored 3 years ago by Hanna JULIENNE
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
 | rs111876722 |  201922  |      C     |      T     |  0.297  |	0.09 |  5.412   |    0.91578    |
 +-------------+----------+------------+------------+---------+-------+----------+---------------+
-Variance is set to -1 for variants present in the input dataset
+*Variance is set to -1 for variants present in the input dataset*
 Optimizing RAISS parameter for your data
 ========================================
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
 Test procedure :
 1. Mask N SNPs on a chromosome
-2. Imputed masked file
+2. Impute masked files for different values of the --eigen-threshold
-3. Compute correlation between genotype Z-values to imputed Z-values
+and the --minimum-ld parameters
+3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
 To perform this test follow this procedure :
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
 2. Create a folder to store z-score files imputed with different parameter
 3. Adapt the following code snippet to apply the function to your data:
-.. code-block::
+.. code-block:: python
  :linenos:
+  import raiss
+  import pandas as pd
+  import sys
  perf_results = raiss.imputation_R2.grid_search(
-      ${path_z-scores_folder},
+              $path_to_initial_zscores_folder,
-      ${path_to_masked_z-scores_folder},
+              $path_to_masked_zscores_output_folder,
-      ${path_to_imputed_z-scores_folder},
+              $path_to_store_masked_zscores_output_folder,
-      ${path_to_reference_panel_folder},
+              $path_to_reference_panel,
-      ${path_to_LD_matrices_folder},
+              $path_to_LD_matrices,
-      "GWAS_TAG", chrom="chr22",
+              gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
-      eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
+              eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
-      window_size= 500000, buffer_size=125000, l2_regularization=0.1,
+              ld_threshold_grid = [0,2, 5,7],
-       R2_threshold=0.6)
+              window_size= 500000, buffer_size=125000, l2_regularization=0.1,
-  fout = "./Perf_"+GWAS_TAG+".csv"
+               R2_threshold=0.6)
+  fout = "performance_report.csv"
  print(perf_results)
  perf_results.to_csv(fout, sep="\t")
-The file Perf_GWAS_TAG ressemble the following output:
+The file Perf_GWAS_TAG ressembles the following output:
-+----+----+--------------------+-----------------+
+.. csv-table:: Performance Report
-|    |cor |mean_absolute_error |fraction_imputed |
+  :widths: 10, 10, 10,10, 10, 10,10,10,10,10
-+====+====+====================+=================+
+  :header-rows: 1
-|1.0 |0.95|       0.243        |      1.0        |
-+----+----+--------------------+-----------------+
+  "eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
-| 0.5|0.94|       0.246        |      0.95       |
+  0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
-+----+----+--------------------+-----------------+
+  0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
+  0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
-The row names correspond to the eigen ratio parameter that was tested.
+  0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
-The second column is the correlation between imputed and genotyped Z-scores.
+  0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
-The third column is the mean L1-error between imputed and genotyped Z-scores.
+  0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
-The fourth column is the fraction of SNPs on the 5000 that were imputed.
+  0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
+  0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
-The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
+  0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+  0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+  0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+  0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
+* **eigen_ratio** : eigen ratio parameter that was tested.
+* **min_ld** : eigen ratio parameter that was tested.
+* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
+* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
+* **cor** : the correlation between imputed and genotyped Z-scores.
+* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
+* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
+* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
+* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
+* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`
+To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed
+(a trade-off between **fraction_imputed** and **mean_absolute_error**).
+The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
 Hence, we recommend to run a grid search to pick the best parameter for your data.
-However, empirically, we never observed a difference of performance from one chromosome to another.
+However, so far, we never observed a difference of performance from one chromosome to another.
 We suggest testing on the chr22 for computational efficiency.
 Command Line Usage
 ==================