Commit f6f688c3 authored by Hanna JULIENNE

add code Snippet in documentation (on perf testing)

parent 13b5783e
Pipeline #68634 passed with stages
in 1 minute and 33 seconds
......@@ -97,6 +97,61 @@ The raiss package outputs imputed GWAS files in the tabular format:
| rs111876722 | 201922 | C | T | 0.297 | 0.16 | 5.412 |
+-------------+----------+------------+------------+---------+-------+----------+
Optimizing RAISS parameter for your data
========================================
The RAISS package contains a function (raiss.imputation_R2.grid_search)
to assess its performance on your data and fine-tune the RAISS parameter.
Test procedure:
1. Mask N SNPs on a chromosome
2. Impute the masked file
3. Compute the correlation between genotyped Z-scores and imputed Z-scores
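The masking step of the test procedure can be sketched as follows. This is a minimal illustration, not the RAISS implementation: it assumes the Z-scores are held in a pandas Series indexed by SNP id, and the helper name `mask_snps`, the seed and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def mask_snps(zscore, n_to_mask, seed=0):
    """Split Z-scores into (kept, masked) by drawing SNP ids without replacement.

    Hypothetical helper for illustration; it is not part of the RAISS API.
    """
    rng = np.random.default_rng(seed)
    masked_ids = rng.choice(zscore.index.to_numpy(), size=n_to_mask, replace=False)
    masked = zscore.loc[masked_ids]        # set aside as ground truth
    kept = zscore.drop(index=masked_ids)   # what the imputation is allowed to see
    return kept, masked


# Toy data: 20 SNPs with made-up Z-scores
zscore = pd.Series(np.linspace(-3, 3, 20), index=[f"rs{i}" for i in range(20)])
kept, masked = mask_snps(zscore, n_to_mask=5)
```

The masked Z-scores are kept aside so that they can later be compared with the values imputed from the remaining SNPs.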
To perform this test, follow this procedure:
1. Create a folder to store the masked z-score files
2. Create a folder to store the z-score files imputed with different parameter values
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
   :linenos:

   import raiss.imputation_R2

   gwas_tag = "GWAS_TAG"  # tag identifying your GWAS
   perf_results = raiss.imputation_R2.grid_search(
       ${path_z-scores_folder},
       ${path_to_masked_z-scores_folder},
       ${path_to_imputed_z-scores_folder},
       ${path_to_reference_panel_folder},
       ${path_to_LD_matrices_folder},
       gwas_tag, chrom="chr22",
       eigen_ratio_grid=[1, 0.5, 0.1, 0.01],  # Enter the values you want to test in this list
       window_size=500000, buffer_size=125000, l2_regularization=0.1,
       R2_threshold=0.6)
   fout = "./Perf_" + gwas_tag + ".csv"
   print(perf_results)
   perf_results.to_csv(fout, sep="\t")

The file Perf_GWAS_TAG.csv resembles the following output:
+-----+------+---------------------+------------------+
|     | cor  | mean_absolute_error | fraction_imputed |
+=====+======+=====================+==================+
| 1.0 | 0.95 | 0.243               | 1.0              |
+-----+------+---------------------+------------------+
| 0.5 | 0.94 | 0.246               | 0.95             |
+-----+------+---------------------+------------------+
The row names correspond to the eigen ratio values that were tested.
The second column is the correlation between imputed and genotyped Z-scores.
The third column is the mean absolute (L1) error between imputed and genotyped Z-scores.
The fourth column is the fraction of the 5000 masked SNPs that were imputed.
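To make the three reported columns concrete, here is a minimal sketch of how such metrics could be computed for one parameter value. It assumes the genotyped and imputed Z-scores are pandas Series indexed by SNP id; the helper name `imputation_metrics` and the toy values are hypothetical, and the exact computation inside RAISS may differ.

```python
import numpy as np
import pandas as pd


def imputation_metrics(true_z, imputed_z):
    """Compare genotyped Z-scores of masked SNPs with their imputed values.

    Hypothetical helper for illustration; it is not part of the RAISS API.
    """
    # Only SNPs that were actually imputed can be compared
    common = true_z.index.intersection(imputed_z.index)
    t = true_z.loc[common]
    i = imputed_z.loc[common]
    return {
        "cor": float(np.corrcoef(t, i)[0, 1]),
        "mean_absolute_error": float((t - i).abs().mean()),
        "fraction_imputed": len(common) / len(true_z),
    }


# Toy data: four masked SNPs, three of which were imputed
true_z = pd.Series([1.0, -2.0, 0.5, 3.0], index=["rs1", "rs2", "rs3", "rs4"])
imputed_z = pd.Series([0.8, -1.5, 2.5], index=["rs1", "rs2", "rs4"])
metrics = imputation_metrics(true_z, imputed_z)
```

In this toy case, one of the four masked SNPs is missing from the imputed set, so the fraction imputed is 0.75 and the error is averaged over the three remaining SNPs.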
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
Hence, we recommend running a grid search to pick the best parameter for your data.
However, in our experience, we have not observed a difference in performance from one chromosome to another.
We suggest testing on chromosome 22 (chr22) for computational efficiency.
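One possible way to pick a value from the performance table is sketched below. The selection rule (highest correlation among rows that impute at least a minimum fraction of SNPs) and the `min_fraction` threshold are assumptions made for illustration, not a rule prescribed by RAISS. The table is rebuilt by hand from the example output above; in practice it could be the `perf_results` object or the saved file read back with `pd.read_csv(fout, sep="\t", index_col=0)`.

```python
import pandas as pd

# Example performance table, indexed by the tested eigen ratio values
perf = pd.DataFrame(
    {
        "cor": [0.95, 0.94],
        "mean_absolute_error": [0.243, 0.246],
        "fraction_imputed": [1.0, 0.95],
    },
    index=[1.0, 0.5],
)
perf.index.name = "eigen_ratio"

min_fraction = 0.9  # hypothetical threshold on the fraction of imputed SNPs

# Keep rows that impute enough SNPs, then take the best correlation
eligible = perf[perf["fraction_imputed"] >= min_fraction]
best_eigen_ratio = eligible["cor"].idxmax()
print(best_eigen_ratio)
```

Depending on your application, you may prefer to weigh the mean absolute error or the fraction of imputed SNPs more heavily than the correlation.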
Command Line Usage
==================
......
......@@ -24,14 +24,13 @@ def generated_test_data(zscore, N_to_mask=5000, condition=None, stratifying_vec
"""
try:
if isinstance(condition, pd.Series)==True:
print("Condition vector")
masked = np.random.choice(zscore.index[condition], N_to_mask, replace=False)
else:
print("Stratifying vector?")
inter_id = zscore.index.intersection(stratifying_vector.index).drop_duplicates(keep='first')
print(inter_id[1:10])
stratifying_vector = stratifying_vector.loc[inter_id]
if isinstance(stratifying_vector, pd.Series)==True:
print("Stratifying vector")
inter_id = zscore.index.intersection(stratifying_vector.index).drop_duplicates(keep='first')
stratifying_vector = stratifying_vector.loc[inter_id]
masked = []
binned = np.digitize(stratifying_vector, stratifying_bins)
N_bins = len(stratifying_bins)-1
......
......@@ -104,6 +104,6 @@ def raiss_model(zt, sig_t, sig_i_t, lamb=0.01, rcond=0.01, batch=True):
var_norm = var_in_boundaries(var, lamb)
R2 = ((1+lamb)-var_norm)
print(R2)
mu = mu / np.sqrt(R2)
return({"var" : var, "mu" : mu, "ld_score" : ld_score, "condition_number" : condition_number, "correct_inversion":correct_inversion })