Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
R
RAISS
Manage
Activity
Members
Labels
Plan
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Snippets
Build
Pipelines
Jobs
Pipeline schedules
Artifacts
Deploy
Releases
Container Registry
Model registry
Operate
Environments
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Statistical-Genetics
RAISS
Commits
c25ec683
Commit
c25ec683
authored
3 years ago
by
Hanna JULIENNE
Browse files
Options
Downloads
Plain Diff
Merge branch 'dev' of
https://gitlab.pasteur.fr/statistical-genetics/raiss
into dev
parents
34e954d3
b3672568
No related branches found
No related tags found
1 merge request
!9
Dev
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
doc/source/index.rst
+57
-32
57 additions, 32 deletions
doc/source/index.rst
with
57 additions
and
32 deletions
doc/source/index.rst
+
57
−
32
View file @
c25ec683
...
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
...
@@ -109,7 +109,7 @@ The raiss package outputs imputed GWAS files in the tabular format:
| rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 |
| rs111876722 | 201922 | C | T | 0.297 | 0.09 | 5.412 | 0.91578 |
+-------------+----------+------------+------------+---------+-------+----------+---------------+
+-------------+----------+------------+------------+---------+-------+----------+---------------+
Variance is set to -1 for variants present in the input dataset
*
Variance is set to -1 for variants present in the input dataset
*
Optimizing RAISS parameter for your data
Optimizing RAISS parameter for your data
========================================
========================================
...
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
...
@@ -119,8 +119,9 @@ to assess its performance on your data and fine tune RAISS parameter.
Test procedure :
Test procedure :
1. Mask N SNPs on a chromosome
1. Mask N SNPs on a chromosome
2. Imputed masked file
2. Impute masked files for different values of the --eigen-threshold
3. Compute correlation between genotype Z-values to imputed Z-values
and the --minimum-ld parameters
3. Compute correlation (and other statistics) between genotype Z-values to imputed Z-values
To perform this test follow this procedure :
To perform this test follow this procedure :
...
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
...
@@ -128,44 +129,68 @@ To perform this test follow this procedure :
2. Create a folder to store z-score files imputed with different parameter
2. Create a folder to store z-score files imputed with different parameter
3. Adapt the following code snippet to apply the function to your data:
3. Adapt the following code snippet to apply the function to your data:
.. code-block::
.. code-block::
python
:linenos:
:linenos:
import raiss
import pandas as pd
import sys
perf_results = raiss.imputation_R2.grid_search(
perf_results = raiss.imputation_R2.grid_search(
${path_z-scores_folder},
$path_to_initial_zscores_folder,
${path_to_masked_z-scores_folder},
$path_to_masked_zscores_output_folder,
${path_to_imputed_z-scores_folder},
$path_to_store_masked_zscores_output_folder,
${path_to_reference_panel_folder},
$path_to_reference_panel,
${path_to_LD_matrices_folder},
$path_to_LD_matrices,
"GWAS_TAG", chrom="chr22",
gwas, chrom="chr22",ref_panel_preffix="",ref_panel_suffix=".bim",
eigen_ratio_grid = [ 1, 0.5 ,0.1, 0.01], # Enter the value you want to test in this list
eigen_ratio_grid = [1.1,1,0.9,0.5,0.25,0.2,0.15,0.1],
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
ld_threshold_grid = [0,2, 5,7],
R2_threshold=0.6)
window_size= 500000, buffer_size=125000, l2_regularization=0.1,
fout = "./Perf_"+GWAS_TAG+".csv"
R2_threshold=0.6)
fout = "performance_report.csv"
print(perf_results)
print(perf_results)
perf_results.to_csv(fout, sep="\t")
perf_results.to_csv(fout, sep="\t")
The file Perf_GWAS_TAG ressemble the following output:
The file Perf_GWAS_TAG ressembles the following output:
+----+----+--------------------+-----------------+
.. csv-table:: Performance Report
| |cor |mean_absolute_error |fraction_imputed |
:widths: 10, 10, 10,10, 10, 10,10,10,10,10
+====+====+====================+=================+
:header-rows: 1
|1.0 |0.95| 0.243 | 1.0 |
+----+----+--------------------+-----------------+
"eigen_ratio","min_ld","N_SNP","fraction_imputed","cor","mean_absolute_error","median_absolute_error","min_absolute_error","max_absolute_error","SNP_max_error"
| 0.5|0.94| 0.246 | 0.95 |
0.1,0,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
+----+----+--------------------+-----------------+
0.1,2,2970,0.594,0.978,0.277,0.171,1.47e-05,6.92,"rs5756504"
0.1,5,2840,0.568,0.978,0.277,0.169,1.47e-05,6.92,"rs5756504"
The row names correspond to the eigen ratio parameter that was tested.
0.1,7,2550,0.51,0.978,0.275,0.164,0.000285,6.92,"rs5756504"
The second column is the correlation between imputed and genotyped Z-scores.
0.15,0,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
The third column is the mean L1-error between imputed and genotyped Z-scores.
0.15,2,2470,0.494,0.976,0.282,0.172,2.43e-05,4.22,"rs59411032"
The fourth column is the fraction of SNPs on the 5000 that were imputed.
0.15,5,2450,0.49,0.976,0.281,0.172,2.43e-05,4.22,"rs59411032"
0.15,7,2320,0.465,0.976,0.282,0.172,0.00044,4.22,"rs59411032"
The optimal eigen_ratio can vary depending on the density of your reference panel and input data.
0.2,0,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,2,2040,0.409,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,5,2040,0.408,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
0.2,7,2020,0.403,0.973,0.291,0.168,6.61e-05,4.37,"rs5752798"
* **eigen_ratio** : eigen ratio parameter that was tested.
* **min_ld** : eigen ratio parameter that was tested.
* **N_SNP** : number of the masked SNPs that were successfully imputed (i.e. not filtered out by the R2 criteria and/or min_ld criteria)
* **fraction_imputed** : fraction of the masked SNPs that were successfully imputed (N_SNP / total_number_of_masked_SNP)
* **cor** : the correlation between imputed and genotyped Z-scores.
* **mean_absolute_error** : :math:`\mathbb{E}|Z_{imputed} - Z_{masked}|`
* **median_absolute_error** : :math:`median|Z_{imputed} - Z_{masked}|`
* **min_absolute_error** : :math:`min|Z_{imputed} - Z_{masked}|`
* **max_absolute_error** : :math:`max|Z_{imputed} - Z_{masked}|`
* **SNP_max_error** : :math:`argmax|Z_{imputed} - Z_{masked}|`
To pick the best parameters, we recommend to search for a compromise between low imputation error and an high fraction of masked SNPs imputed
(a trade-off between **fraction_imputed** and **mean_absolute_error**).
The optimal eigen_ratio and min_ld can vary depending on the density of your reference panel and input data.
Hence, we recommend to run a grid search to pick the best parameter for your data.
Hence, we recommend to run a grid search to pick the best parameter for your data.
However,
empirically
, we never observed a difference of performance from one chromosome to another.
However,
so far
, we never observed a difference of performance from one chromosome to another.
We suggest testing on the chr22 for computational efficiency.
We suggest testing on the chr22 for computational efficiency.
Command Line Usage
Command Line Usage
==================
==================
...
...
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment