diff --git a/doc/source/index.rst b/doc/source/index.rst index 1f882fe17a8c9825113a221845520b37a3838350..339c9bab157947f1864342474b3cb4bd5f77d688 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format: | rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 | +-------------+----------+------------+------------+---------+-------+----------+---------------+ -Variance is set to -1 for variants present in the input dataset +*Variance is set to -1 for variants present in the input dataset* Optimizing RAISS parameter for your data ======================================== @@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter. Test procedure : 1. Mask N SNPs on a chromosome -2. Imputed masked file -3. Compute correlation between genotype Z-values to imputed Z-values +2. Impute masked files for different values of the --eigen-threshold +and the --minimum-ld parameters +3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values To perform this test follow this procedure : @@ -128,44 +129,68 @@ To perform this test follow this procedure : 2. Create a folder to store z-score files imputed with different parameter 3. Adapt the following code snippet to apply the function to your data: -.. code-block:: +.. code-block:: python :linenos: + import raiss + import pandas as pd + import sys + perf_results = raiss.imputation_R2.grid_search( - ${path_z-scores_folder}, - ${path_to_masked_z-scores_folder}, - ${path_to_imputed_z-scores_folder}, - ${path_to_reference_panel_folder}, - ${path_to_LD_matrices_folder}, - "GWAS_TAG", chrom="chr22", - eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list - window_size= 500000, buffer_size=125000, l2_regularization=0.1, - R2_threshold=0.6) - fout = "./Perf_"+GWAS_TAG+".csv" + $path_to_initial_zscores_folder, + $path_to_masked_zscores_output_folder, + $path_to_store_masked_zscores_output_folder, + $path_to_reference_panel, + $path_to_LD_matrices, + gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim", + eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1], + ld_threshold_grid = [0,2, 5,7], + window_size= 500000, buffer_size=125000, l2_regularization=0.1, + R2_threshold=0.6) + fout = "performance_report.csv" print(perf_results) perf_results.to_csv(fout, sep="\t") -The file Perf_GWAS_TAG ressemble the following output: - -+----+----+--------------------+-----------------+ -| |cor |mean_absolute_error |fraction_imputed | -+====+====+====================+=================+ -|1.0 |0.95| 0.243 | 1.0 | -+----+----+--------------------+-----------------+ -| 0.5|0.94| 0.246 | 0.95 | -+----+----+--------------------+-----------------+ - -The row names correspond to the eigen ratio parameter that was tested. -The second column is the correlation between imputed and genotyped Z-scores. -The third column is the mean L1-error between imputed and genotyped Z-scores. -The fourth column is the fraction of SNPs on the 5000 that were imputed. - -The optimal eigen_ratio can vary depending on the density of your reference panel and input data. +The file Perf_GWAS_TAG ressembles the following output: + +.. csv-table:: Performance Report + :widths: 10, 10, 10,10, 10, 10,10,10,10,10 + :header-rows: 1 + + "eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error" + 0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504" + 0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504" + 0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504" + 0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504" + 0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032" + 0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032" + 0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032" + 0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032" + 0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798" + 0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798" + 0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798" + 0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798" + + +* **eigen_ratio** : eigen ratio parameter that was tested. +* **min_ld** : eigen ratio parameter that was tested. +* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria) +* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP) +* **cor** : the correlation between imputed and genotyped Z-scores. +* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|` +* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|` +* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|` +* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|` +* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|` + +To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed +(a trade-off between **fraction_imputed** and **mean_absolute_error**). + +The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data. Hence, we recommend to run a grid search to pick the best parameter for your data. -However, empirically, we never observed a difference of performance from one chromosome to another. +However, so far, we never observed a difference of performance from one chromosome to another. We suggest testing on the chr22 for computational efficiency. - Command Line Usage ==================