Commit b3672568 authored by Hanna JULIENNE

finished doc on RAISS tuning

parent 13d0931d
......@@ -129,37 +129,27 @@ To perform this test follow this procedure :
2. Create a folder to store z-score files imputed with different parameter
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
.. code-block:: python
:linenos:
import raiss
import pandas as pd
import sys
from os import listdir
files = listdir("/pasteur/zeus/projets/p02/GGS_JASS/WKD_Hanna/RAISS_tuning/data/{0}/zscores/".format(ancestry))
def get_gwas(x):
    return "_".join(x.split("_")[1:3])
gwas_list = list(map(get_gwas, files))
for gwas in gwas_list:
perf_results = raiss.imputation_R2.grid_search(
$path_to_initial_zscores_folder,
$path_to_masked_zscores_output_folder,
$path_to_store_masked_zscores_output_folder,
$path_to_reference_panel,
$path_to_LD_matrices,
gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
ld_threshold_grid = [0,2, 5,7],
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
R2_threshold=0.6)
fout = "performance_report.csv"
print(perf_results)
perf_results.to_csv(fout, sep="\t")
perf_results = raiss.imputation_R2.grid_search(
    $path_to_initial_zscores_folder,
    $path_to_masked_zscores_output_folder,
    $path_to_store_masked_zscores_output_folder,
    $path_to_reference_panel,
    $path_to_LD_matrices,
    gwas, chrom="chr22", ref_panel_preffix="", ref_panel_suffix=".bim",
    eigen_ratio_grid=[1.1, 1, 0.9, 0.5, 0.25, 0.2, 0.15, 0.1],
    ld_threshold_grid=[0, 2, 5, 7],
    window_size=500000, buffer_size=125000, l2_regularization=0.1,
    R2_threshold=0.6)
fout = "performance_report.csv"
print(perf_results)
perf_results.to_csv(fout, sep="\t")
The file Perf_GWAS_TAG resembles the following output:
......@@ -182,17 +172,25 @@ The file Perf_GWAS_TAG ressembles the following output:
0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
The row names correspond to the eigen ratio parameter that was tested.
The second column is the correlation between imputed and genotyped Z-scores.
The third column is the mean L1-error between imputed and genotyped Z-scores.
The fourth column is the fraction of the 5000 SNPs that were imputed.
* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : minimum LD parameter that was tested (a value from ``ld_threshold_grid``).
* **N_SNP** : number of masked SNPs that were successfully imputed (i.e. not filtered out by the R2 and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`\operatorname{median}|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`\min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`\max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`\operatorname{argmax}|Z_{imputed} - Z_{masked}|`
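
To make the column definitions above concrete, here is a minimal sketch of how such a report row could be computed from two Z-score vectors. It is not the RAISS implementation: the helper name ``summarize_imputation`` and the toy SNP identifiers and values are assumptions made for illustration only.

```python
import pandas as pd


def summarize_imputation(z_masked: pd.Series, z_imputed: pd.Series) -> pd.Series:
    """Hypothetical helper (not part of RAISS): summarize imputation accuracy.

    z_masked  : true Z-scores of the SNPs that were masked, indexed by SNP id.
    z_imputed : imputed Z-scores that passed the filters, indexed by SNP id.
    """
    # SNPs that were masked AND successfully imputed
    common = z_masked.index.intersection(z_imputed.index)
    truth = z_masked.loc[common]
    imputed = z_imputed.loc[common]
    abs_err = (imputed - truth).abs()
    return pd.Series({
        "N_SNP": len(common),
        "fraction_imputed": len(common) / len(z_masked),
        "cor": imputed.corr(truth),
        "mean_absolute_error": abs_err.mean(),
        "median_absolute_error": abs_err.median(),
        "min_absolute_error": abs_err.min(),
        "max_absolute_error": abs_err.max(),
        "SNP_max_error": abs_err.idxmax(),
    })


# Toy data (made up): four masked SNPs, three of which were imputed.
z_masked = pd.Series([1.0, -2.0, 0.5, 3.0], index=["rs1", "rs2", "rs3", "rs4"])
z_imputed = pd.Series([0.8, -1.5, 2.5], index=["rs1", "rs2", "rs4"])
report = summarize_imputation(z_masked, z_imputed)
print(report)
```

In this toy case one of the four masked SNPs is missing from the imputed set, so ``fraction_imputed`` is 0.75 and the error statistics are computed on the three remaining SNPs only.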
To pick the best parameters, we recommend looking for a compromise between a low imputation error and a high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend running a grid search to pick the best parameters for your data.
However, so far, we have never observed a difference in performance from one chromosome to another.
We suggest testing on chr22 for computational efficiency.
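
One possible way to turn the trade-off between **fraction_imputed** and **mean_absolute_error** into a selection rule is sketched below. This is only an illustrative heuristic, not a RAISS feature: the function ``pick_parameters``, the ``min_fraction`` cutoff of 0.5, and the toy report values are all assumptions. It keeps the parameter pairs that impute at least a chosen fraction of the masked SNPs, then picks the pair with the lowest mean absolute error.

```python
import pandas as pd


def pick_parameters(report: pd.DataFrame, min_fraction: float = 0.5):
    """Hypothetical selection rule (an assumption, not part of RAISS).

    Keep rows with fraction_imputed >= min_fraction, then return the
    (eigen_ratio, min_ld) pair with the smallest mean_absolute_error.
    Ties are broken in favour of the larger fraction_imputed.
    """
    eligible = report[report["fraction_imputed"] >= min_fraction]
    if eligible.empty:
        raise ValueError("no parameter pair reaches the requested fraction_imputed")
    best = eligible.sort_values(
        ["mean_absolute_error", "fraction_imputed"], ascending=[True, False]
    ).iloc[0]
    return float(best["eigen_ratio"]), int(best["min_ld"])


# Toy report (made-up values) using the column names described above.
# A real report written by the snippet above could be loaded with, e.g.:
#   report = pd.read_csv("performance_report.csv", sep="\t")
report = pd.DataFrame({
    "eigen_ratio": [0.9, 0.5, 0.2, 0.1],
    "min_ld": [0, 2, 5, 0],
    "fraction_imputed": [0.30, 0.55, 0.70, 0.90],
    "mean_absolute_error": [0.20, 0.28, 0.35, 0.60],
})
best_params = pick_parameters(report, min_fraction=0.5)
print(best_params)
```

The cutoff is a judgment call: a stricter ``min_fraction`` favours coverage at the cost of accuracy, so it is worth inspecting the full report rather than relying on a single rule.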
Command Line Usage
==================
......