PROJECT_multitrait_power_traitselection

Scripts and data used in the analysis and visualisation in the manuscript Trait selection strategy in multi-trait GWAS: Boosting SNPs discoverability.

  • README.md : this file.

  • /Figures_manuscript : contains all data (/inputs) and scripts (/scripts) used to generate figures.

    |- /scripts : script name indicates the figure number.
    |- /inputs
    |---- 72trait_data_2023-07-07.csv : Features of 72 traits.
    |       [Column names]
    |       'log10_mes_semilogadjust' : log10(mean effect size).
    |       'log10_pi_semilogadjust' : log10(polygenicity), i.e. the log10 of polygenicity after adjustment for effective sample size.
    |       'mean effect size (adjusted)' : mean effect size.
    |       'polygenicity (adjusted)' : polygenicity after adjustment for effective sample size. This is the one used in the main analysis.
    |       'polygenicity (unadjusted)' : polygenicity before adjustment for effective sample size.
    |       'h2GWAS_mixer' : h2GWAS estimated by mixer.
    |       'h2m' : h2GWAS minus heritability explained by univariate-significant regions (independent regions with significant univariate hits).
    |       'Neff' : effective sample size.
    |       '# of univGWAS hits' : number of non-independent hits by univariate analysis.
    |       '# of regions with univGWAS hits' : number of univariate-significant regions (i.e. independent regions with hits by univariate analysis).
    |       'h2_LD' : heritability estimated by LDSC.
    |       'perc_h2m' : h2m/h2GWAS.
    |       'h2_univhit_region' : heritability explained by univariate-significant regions.
    |---- berisa_region.bed : Independent regions (loci) and their positions on the chromosomes.
    |---- /BMIanalysis
    | |---- Evaluation_new_associations_BMI_with_allSNPs.tsv : For comparison between JASS detection with GIANT_BMI (smaller sample size) and univariate GWAS for BMI with a larger sample size.
    | |       [Column names]
    | |       'Region' : independent region (locus).
    | |       'n_associated_JASS' : out of 1776 traitsets containing GIANT_BMI, the number of times the region was associated by JASS while not associated by GIANT_BMI.
    | |       'n_associated_JASS_corrected' : same as 'n_associated_JASS' except that the significance by JASS is evaluated by q-value (p-value x 1776).
    | |       'P_small_gwas' : P-value in the smaller univariate GWAS (GIANT_BMI). It is the minimum across SNPs in the same region.
    | |       'min_P_jass_joint' : minimum P-value of the region by JASS across the 1776 traitsets with GIANT_BMI. -1 if the region is associated by the small univariate BMI GWAS.
    | |       'P_univ_large_gwas' : P-value in the larger univariate GWAS.
    | |       'evaluate' : comparison between JASS and large GWAS results.
    | |       'JASS' : whether JASS detected the region as significant.
    | |       'Large_GWAS' : whether the larger GWAS detected the region as significant.
    |---- /clinical_grouping_analysis_2023-09-06
    | |---- Trait_clinical_groupings.csv : Clinical categories of 72 traits.
    | |---- category_traitset_with_mean_test_jass.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as JASS multi-trait gain (for validation sets).
    | |---- category_traitset_with_mean_train_jass.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as JASS multi-trait gain (for training sets).
    | |---- category_traitset_with_mean_test_mtag.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as MTAG multi-trait gain (for validation sets).
    | |---- category_traitset_with_mean_train_mtag.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as MTAG multi-trait gain (for training sets).
    | |---- category_traitset_with_mean_all_jass_no_duplicates : A file combining category_traitset_with_mean_test_jass.tsv and category_traitset_with_mean_train_jass.tsv without duplicates.
    | |       [Column names]
    | |       'trait_set' : list of traits.
    | |       'obs_joint' : number of new associations observed.
    | |       'obs_gain' : multi-trait gain observed.
    | |       'n_group' : number of clinical categories in the trait set.
    | |       'group_names' : Group_ID of each trait (as in Trait_clinical_groupings.csv).
    | |       'rank_datadriven' : rank by multi-trait gain estimated by our regression model.
    |---- Correlation_matrix_genetic.csv : genetic correlation matrix across 72 traits.
    |---- COV_H0.csv : covariance matrix under the null across 72 traits.
    |---- GWAS_hit_count_plink.csv : univariate GWAS result summary of 72 traits.
    |       [Column names]
    |       'trait_name' : name of the trait.
    |       'univ_hit_count' : the number of non-independent variants significant by univariate GWAS.
    |       'univ_hit_region_count' : the number of independent regions (loci) with variants significant by univariate GWAS.
    |       'univ_hit_plinkclumps_count' : the number of independent regions estimated by plink that contain variants significant by univariate GWAS.
    |---- /JASS_5CVdata-2023-08-01 <- inputs and outputs of the regression analysis for estimating contributions of trait features to multi-trait gain (as in /Regression_analysis/ below)
    | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model (main).
    | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wMES.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with MES.
    | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wPI.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with polygenicity.
    | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar.csv : same as the main model above except that the model is non-linear (SVR or RFR).
    | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wMES.csv : same as the wMES model above except that the model is non-linear (SVR or RFR).
    | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wPI.csv : same as the wPI model above except that the model is non-linear (SVR or RFR).
    | |       [Column names]
    | |       '**_coef_mv' : coefficient in multivariate linear regression, where ** represents a feature (as described below).
    | |       '**_SE_mv' : standard error of the coefficient in multivariate linear regression.
    | |       '**_pval_mv' : p-value of the coefficient in multivariate linear regression.
    | |       '**_coef_uv' : coefficient in univariate linear regression.
    | |       '**_SE_uv' : standard error of the coefficient in univariate linear regression.
    | |       '**_pval_uv' : p-value of the coefficient in univariate linear regression.
    | |       'R2train_mv', 'R2val_mv' : r2_score calculated using sklearn.metrics.r2_score (for training and validation datasets, respectively).
    | |       'R2val_adj_mv' : r2_score between observed multi-trait gain and prediction adjusted by standard deviation (for validation dataset).
    | |       'corRtrain_mv', 'corRval_mv' : Pearson's correlation coefficient between observed and predicted multi-trait gain (for training and validation datasets, respectively). This is the one used to evaluate the model performance.
    | |       'corR2train_mv', 'corR2val_mv' : squared Pearson's correlation coefficient (for training and validation datasets, respectively).
    | |       'corR2val_adj_mv' : squared Pearson's correlation coefficient between observed multi-trait gain and prediction adjusted by standard deviation (for validation dataset).
    | |       'cor_p_train_mv', 'cor_p_val_mv' : p-value of the Pearson's correlation (for training and validation datasets, respectively).
    | |       [Feature names]
    | |       'k' : number of traits.
    | |       'log10_avg_distance_cor' : log10(Delta_Sigma).
    | |       'log10_mean_gencov' : log10(mean genetic covariance).
    | |       'avg_Neff' : mean effective sample size.
    | |       'avg_h2_mixer' : mean h2GWAS.
    | |       'avg_perc_h2_diff_region' : mean %h2u.
    | |       'avg_log10pi_semilogadjust' : mean log10(polygenicity).
    | |       'avg_log10mes_semilogadjust' : mean log10(MES).
    | |---- stat_set_with_fullJASS_power_5CVcombined_without_duplicates.tsv : summary of results by the full version of JASS (including variants with missing data).
    | |---- traitset_jass_CV*.tsv : files same as in /Regression_analysis/inputs/ described below.
    | |---- traitset_jass_5CVcombined_without_duplicates.tsv : a file combining the traitset_jass_CV*.tsv files across the 5 CV folds, without duplicated traitsets.
    |---- /JASS_true_pred_gains-2023-08-03
    | |---- gain-true-predicted_jass_linear_pre-selected-features_CV?.txt : observed, predicted, and adjusted predicted gains across traitsets in each CV.
    |---- /MTAG_5CVdata-2023-08-01
    | |---- 5foldCV_MTAG_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model for MTAG.
    | |---- traitset_mtag_CV*.tsv : files in the same format as traitset_jass_CV*.tsv in /Regression_analysis/inputs/, for the MTAG analysis instead of JASS.
    |---- Pval_cor_matrix_genetic.csv : P-values for genetic correlations across 72 traits.
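The cross-validation summaries described above report both an r2_score ('R2val_mv') and a Pearson's correlation ('corRval_mv') between observed and predicted multi-trait gain. As a rough illustration of how the two quantities can differ, here is a minimal pure-Python sketch. The function names and the toy numbers are assumptions made for this example only; they are not taken from the repository scripts or data.

```python
import math


def pearson_r(observed, predicted):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def r2_score(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (single-output case)."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Toy values (hypothetical): predictions are the observations shifted by 0.5.
    observed_gain = [0.1, 0.2, 0.3, 0.4]
    predicted_gain = [0.6, 0.7, 0.8, 0.9]
    r = pearson_r(observed_gain, predicted_gain)
    print("Pearson r :", r)        # close to 1: the ranking is preserved
    print("squared r :", r ** 2)   # analogous to the 'corR2*' columns
    print("r2_score  :", r2_score(observed_gain, predicted_gain))  # negative here
```

Pearson's correlation is insensitive to a constant shift or rescaling of the predictions, whereas r2_score penalises both, so the two columns need not agree.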

  • /Regression_analysis : contains all data (/inputs) and scripts (/scripts) used for regression analysis with cross validation using JASS and MTAG.

    |- /scripts
    |---- 5f-cross_validation_linear.py : Linear regression analysis for JASS.
    |---- 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
    |---- 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
    |- /inputs
    |---- traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each of the 5-fold cross validations for JASS.
    |---- traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each of the 5-fold cross validations for JASS.
    |---- traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each of the 5-fold cross validations for MTAG.
    |---- traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each of the 5-fold cross validations for MTAG.

    Each of these input files contains:
    1. k : the number of traits jointly analysed
    2. trait : list of traits jointly analysed
    3. type_sampling : type of stratified sampling used
    4. Both : the number of independent regions (loci) that were significant by both univariate and multi-trait analyses
    5. Joint : the number of independent regions (loci) that were significant only by multi-trait analysis
    6. None : the number of independent regions (loci) that were significant by neither univariate nor multi-trait analysis
    7. Univariate : the number of independent regions (loci) that were significant only by univariate analysis
    8. fraction_more_significant_joint : the fraction of independent regions (loci) that had a smaller p-value from multi-trait analysis than from univariate analysis
    9. fraction_more_significant_joint_qval : "multi-trait gain", the fraction of independent regions (loci) that had a smaller q-value (multiple-testing-corrected p-value) from multi-trait analysis than p-value from univariate analysis
    10. avg_log10mes_semilogadjust : mean log10(mean effect size)
    11. avg_log10pi_semilogadjust : mean log10(polygenicity)
    12. avg_Neff : mean effective sample size
    13. avg_h2_sigSNP_region : mean across traits of heritability explained by independent regions (loci) significant by univariate analysis
    14. avg_h2_mixer : mean of h2GWAS estimated by mixer
    15. avg_perc_h2_diff_region : mean of %h2u
    16. var_h2_sigSNP_region : variance across traits of heritability explained by independent regions (loci) significant by univariate analysis
    17. mean_gencor : mean genetic correlation
    18. mean_null_phencor : mean correlation under the null
    19. avg_distance_cor : average distance between the genetic and residual correlation matrices
    20. mean_gencov : mean genetic covariance
    21. mean_null_phencov : mean covariance under the null
    22. condition_number_rcov : condition number of the covariance matrix under the null
    23. condition_number_gcov : condition number of the genetic covariance matrix
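The training and validation files listed above come in pairs, one per cross-validation fold. The sketch below shows one possible way to pair them and to read the multi-trait gain column using only the Python standard library. It assumes that the files are tab-separated with a header row whose names match the column names listed above, and that the script is run from the repository root; the helper names list_cv_folds and read_gain are hypothetical and do not come from the repository scripts.

```python
import csv
import glob
import os

# File-name pattern from the README; "?" matches the single fold character.
TRAIN_PATTERN = "traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv"
TRAIN_PREFIX = "traitset_jass_CVtraining"


def list_cv_folds(input_dir):
    """Return {fold_label: (training_path, test_path)} for the JASS inputs."""
    folds = {}
    for train_path in sorted(glob.glob(os.path.join(input_dir, TRAIN_PATTERN))):
        name = os.path.basename(train_path)
        label = name[len(TRAIN_PREFIX)]  # the character matched by "?"
        test_path = os.path.join(input_dir, name.replace("CVtraining", "CVtest"))
        if os.path.exists(test_path):  # keep only complete training/test pairs
            folds[label] = (train_path, test_path)
    return folds


def read_gain(path, column="fraction_more_significant_joint_qval"):
    """Read one numeric column (multi-trait gain by default) from a TSV file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")  # assumed: header row
        return [float(row[column]) for row in reader]


if __name__ == "__main__":
    # Assumed location relative to the repository root.
    for label, (train, test) in list_cv_folds("Regression_analysis/inputs").items():
        gains = read_gain(train)
        print(f"fold {label}: {len(gains)} training traitsets, "
              f"{len(read_gain(test))} validation traitsets")
```

Matching on the "?" wildcard rather than on explicit fold numbers avoids assuming how the folds are labelled in the file names.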