Commit 4a9ec44c authored by Yuka SUZUKI

updated

parent b3256c4d
@@ -4,13 +4,112 @@ Scripts and data used in the analysis and visualisation in the manuscript ``Trai
* README.md : this file.
* /Figures_manuscript : contains all data (/inputs) and scripts (/scripts) used to generate figures.
|- /scripts : script name indicates the figure number.
|- /inputs
|---- 72trait_data_2023-07-07.csv : Features of 72 traits
| [Column names] 'log10_mes_semilogadjust': log10(mean effect size).
| 'log10_pi_semilogadjust': log10(polygenicity), i.e. log10 of the polygenicity after adjustment for effective sample size.
| 'mean effect size (adjusted)': mean effect size.
| 'polygenicity (adjusted)': polygenicity after adjustment for effective sample size. This is the value used in the main analysis.
| 'polygenicity (unadjusted)': polygenicity before adjustment for effective sample size.
| 'h2GWAS_mixer': h2GWAS estimated by mixer.
| 'h2m': h2GWAS minus heritability explained by univ significant regions (independent regions with significant univ hits).
| 'Neff': effective sample size.
| '# of univGWAS hits': number of non-independent hits by univariate analysis.
| '# of regions with univGWAS hits': number of univ significant regions (i.e. independent regions with hits by univariate analysis).
| 'h2_LD': heritability estimated by LDSC.
| 'perc_h2m': h2m/h2GWAS.
| 'h2_univhit_region': heritability explained by univ significant regions.
|---- berisa_region.bed : Independent regions (loci) and their positions on the chromosomes.
|---- /BMIanalysis
| |---- Evaluation_new_associations_BMI_with_allSNPs.tsv : For comparison between JASS detection with GIANT_BMI (smaller sample size) and univariate GWAS for BMI with a larger sample size.
| [Column names] 'Region': Independent region (loci),
| 'n_associated_JASS': out of the 1776 traitsets containing GIANT_BMI, the number in which the region was significant by JASS but not by GIANT_BMI.
| 'n_associated_JASS_corrected': same as 'n_associated_JASS' except that the significance by JASS is evaluated by q-value (p-value x 1776).
| 'P_small_gwas': P-value in the smaller univariate GWAS (GIANT_BMI).
| 'min_P_jass_joint': minimum P-value of the region by JASS across the 1776 traitsets with GIANT_BMI.
| 'P_univ_large_gwas': P-value in the larger univariate GWAS.
| 'evaluate': comparison between JASS and large GWAS results.
| 'JASS': whether JASS detected the region as significant.
| 'Large_GWAS': whether larger GWAS detected the region as significant.
|---- /clinical_grouping_analysis_2023-09-06
| |---- Trait_clinical_groupings.csv : Clinical categories of 72 traits.
| |---- category_traitset_with_mean_test_jass.tsv : Summary of the number of clinical categories in each jointly analysed traitset, as well as JASS multi-trait gain (for validation sets).
| |---- category_traitset_with_mean_train_jass.tsv : Summary of the number of clinical categories in each jointly analysed traitset, as well as JASS multi-trait gain (for training sets).
| |---- category_traitset_with_mean_test_mtag.tsv : Summary of the number of clinical categories in each jointly analysed traitset, as well as MTAG multi-trait gain (for validation sets).
| |---- category_traitset_with_mean_train_mtag.tsv : Summary of the number of clinical categories in each jointly analysed traitset, as well as MTAG multi-trait gain (for training sets).
| |---- category_traitset_with_mean_all_jass_no_duplicates : A file combining category_traitset_with_mean_test_jass.tsv and category_traitset_with_mean_train_jass.tsv without duplicates.
| [Column names] 'trait_set': list of traits, 'obs_joint': number of new associations observed, 'obs_gain': multi-trait gain observed, 'n_group': number of clinical categories in the trait set,
| 'group_names': Group_ID of each trait (as in Trait_clinical_groupings.csv), 'rank_datadriven': rank by multi-trait gain estimated by our regression model.
|---- Correlation_matrix_genetic.csv : genetic correlation matrix across 72 traits.
|---- COV_H0.csv : covariance matrix under null across 72 traits.
|---- GWAS_hit_count_plink.csv : univariate GWAS result summary of 72 traits
| [Column names] 'trait_name': name of the trait
| 'univ_hit_count': the number of non-independent variants significant by univariate GWAS.
| 'univ_hit_region_count': the number of independent regions (loci) with variants significant by univariate GWAS.
| 'univ_hit_plinkclumps_count': the number of independent regions estimated by plink that contain variants significant by univariate GWAS.
|---- /JASS_5CVdata-2023-08-01 <- inputs and outputs of the regression analysis for estimating contributions of trait features to multi-trait gain (as in ~/Regression_analysis/ below)
| |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model (main).
| |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wMES.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with MES.
| |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wPI.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with polygenicity.
| |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar.csv : same as above main model except the model is non-linear (SVR or RFR).
| |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wMES.csv : same as above wMES model except the model is non-linear (SVR or RFR).
| |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wPI.csv : same as above wPI model except the model is non-linear (SVR or RFR).
| | [Column names] '**_coef_mv' : coefficient in multivariate linear regression, where ** represents a feature (as described below).
| | '**_SE_mv' : standard error of the coefficient in multivariate linear regression.
| | '**_pval_mv': pvalue of the coefficient in multivariate linear regression.
| | '**_coef_uv': coefficient in univariate linear regression.
| | '**_SE_uv' : standard error of the coefficient in univariate linear regression.
| | '**_pval_uv' : pvalue of the coefficient in univariate linear regression.
| | 'R2train_mv','R2val_mv' : r2_score calculated using sklearn.metrics.r2_score (for training and validation datasets, respectively).
| | 'R2val_adj_mv' : r2_score between observed multi-trait gain and the prediction adjusted by standard deviation (for the validation dataset).
| | 'corRtrain_mv','corRval_mv': Pearson's correlation coefficient between observed and predicted multi-trait gain (for training and validation datasets, respectively). This is the metric used to evaluate model performance.
| | 'corR2train_mv','corR2val_mv': squared Pearson's correlation coefficient (for training and validation datasets, respectively).
| | 'corR2val_adj_mv': squared Pearson's correlation coefficient between observed multi-trait gain and the prediction adjusted by standard deviation (for the validation dataset).
| | 'cor_p_train_mv','cor_p_val_mv': p-value of the Pearson's correlation (for training and validation datasets, respectively).
| | **feature names <-- 'k':# of traits,'log10_avg_distance_cor': log10(Delta_Sigma),'log10_mean_gencov':log10(mean genetic covariance),'avg_Neff':mean effective sample size,'avg_h2_mixer':mean h2GWAS,'avg_perc_h2_diff_region': mean %h2u,
| | 'avg_log10pi_semilogadjust': mean log10(polygenicity), 'avg_log10mes_semilogadjust':mean log10(MES).
| |---- stat_set_with_fullJASS_power_5CVcombined_without_duplicates.tsv : summary of results by JASS full version (including variants with missing data).
| |---- traitset_jass_CV*.tsv files same as in /Regression_analysis/inputs/ described below
| |---- traitset_jass_5CVcombined_without_duplicates.tsv : a file combining traitset_jass_CV*.tsv files (from ) across 5CV without duplicated traitsets.
|---- /JASS_true_pred_gains-2023-08-03
| |---- gain-true-predicted_jass_linear_pre-selected-features_CV?.txt : observed, predicted, and adjusted predicted gains across traitsets in each CV.
|---- /MTAG_5CVdata-2023-08-01
| |---- 5foldCV_MTAG_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model for MTAG.
| |---- traitset_mtag_CV*.tsv files in the same format as traitset_jass_CV*.tsv in /Regression_analysis/inputs/. For MTAG analysis instead of JASS.
|---- Pval_cor_matrix_genetic.csv : Pvalues for genetic correlations across 72 traits.
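
The cross-validation summary columns described above separate the coefficient of determination ('R2train_mv', 'R2val_mv') from Pearson's correlation ('corRtrain_mv', 'corRval_mv') and its square. The following is a minimal sketch of why the two can disagree. It is written in plain Python instead of sklearn, and the function names and toy numbers are illustrative assumptions, not code or data from this repository.

```python
import math


def r2_score_manual(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy values (NOT real multi-trait gains): the prediction is exactly
# twice the observation, so it is perfectly correlated but badly scaled.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]

r2 = r2_score_manual(observed, predicted)  # penalises the scale error
r = pearson_r(observed, predicted)         # insensitive to scale and offset
```

In this toy case the correlation is perfect while the coefficient of determination is negative, which illustrates why a correlation-based column and an r2_score column can tell different stories about the same predictions.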
* /Regression_analysis : contains all data (/inputs) and scripts (/scripts) used for regression analysis with cross validation using JASS and MTAG.
|- /scripts
|---- 5f-cross_validation_linear.py : Linear regression analysis for JASS.
|---- 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
|---- 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
|- /inputs
|---- traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each of 5-fold cross validations for JASS.
|---- traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each of 5-fold cross validations for JASS.
|---- traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each of 5-fold cross validation for MTAG.
|---- traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each of 5-fold cross validation for MTAG.
Each of these input files contains the following columns:
1. k : the number of traits jointly analysed
2. trait : list of traits jointly analysed
3. type_sampling : type of stratified sampling used
4. Both : the number of independent regions (loci) that were significant by both univariate and multi-trait analyses
5. Joint : the number of independent regions (loci) that were significant only by multi-trait analysis
6. None : the number of independent regions (loci) that were significant by neither univariate nor multi-trait analysis
7. Univariate : the number of independent regions (loci) that were significant only by univariate analysis
8. fraction_more_significant_joint : the fraction of independent regions (loci) that had smaller p-value from multi-trait analysis than p-value from univariate analysis.
9. fraction_more_significant_joint_qval : "multi-trait gain", the fraction of independent regions (loci) that had smaller q-value (multiple testing corrected p-value) from multi-trait analysis than p-value from univariate analysis.
10. avg_log10mes_semilogadjust : mean log10(mean effect size)
11. avg_log10pi_semilogadjust : mean log10(polygenicity)
12. avg_Neff : mean effective sample size
13. avg_h2_sigSNP_region : mean across traits of heritability explained by independent regions (loci) significant by univariate analysis.
14. avg_h2_mixer : mean of h2GWAS estimated by mixer.
15. avg_perc_h2_diff_region : mean of %h2u
16. var_h2_sigSNP_region : variance across traits of heritability explained by independent regions (loci) significant by univariate analysis.
17. mean_gencor : mean genetic correlation
18. mean_null_phencor : mean correlation under null
19. avg_distance_cor : average distance between the genetic and residual correlation matrices.
20. mean_gencov : mean genetic covariance
21. mean_null_phencov : mean covariance under null
22. condition_number_rcov : condition number of covariance matrix under null
23. condition_number_gcov : condition number of genetic covariance matrix
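
As a minimal sketch of how one of these tab-separated input files could be read, the example below parses a small in-memory table with Python's standard csv module. The two rows and all their values are made up for illustration, only a subset of the 23 columns is shown, the space-separated format of the 'trait' field is a guess, and the helper name load_traitsets is hypothetical. It also assumes the count columns are stored as plain integers.

```python
import csv
import io

# Made-up example rows (NOT real data); only a subset of the columns is shown.
EXAMPLE_TSV = (
    "k\ttrait\tBoth\tJoint\tNone\tUnivariate\tfraction_more_significant_joint_qval\n"
    "2\ttraitA traitB\t10\t3\t80\t7\t0.25\n"
    "3\ttraitA traitB traitC\t12\t6\t75\t7\t0.4\n"
)

# Assumption: these columns hold integer counts and this one holds a float.
INT_COLUMNS = ("k", "Both", "Joint", "None", "Univariate")
FLOAT_COLUMNS = ("fraction_more_significant_joint_qval",)


def load_traitsets(handle):
    """Read tab-separated traitset rows and convert numeric columns."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for name in INT_COLUMNS:
            record[name] = int(record[name])
        for name in FLOAT_COLUMNS:
            record[name] = float(record[name])
        rows.append(record)
    return rows


traitsets = load_traitsets(io.StringIO(EXAMPLE_TSV))

# Both, Joint, None and Univariate are mutually exclusive classes,
# so their sum is the number of regions counted for each traitset.
region_totals = [
    row["Both"] + row["Joint"] + row["None"] + row["Univariate"]
    for row in traitsets
]
```

To read a real file, the same helper could be given an open file handle, for example `open(path, newline="")`, where `path` points to one of the files under /Regression_analysis/inputs.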