diff --git a/README.md b/README.md
index f09b36af49e5c3ef04cd2b024862c816dfe83911..6d2df2ad5e0aad11f9ab687d6993d0f900d1f3f8 100644
--- a/README.md
+++ b/README.md
@@ -4,13 +4,112 @@ Scripts and data used in the analysis and visualisation in the manuscript ``Trai
 * README.md : this file.
 * /Figures_manuscript : contains all data (/inputs) and scripts (/scripts) used to generate figures.
+ |- /scripts : script name indicates the figure number.
+ |- /inputs
+ |---- 72trait_data_2023-07-07.csv : Features of 72 traits
+ | [Column names] 'log10_mes_semilogadjust': log10(mean effect size).
+ | 'log10_pi_semilogadjust': log10(polygenicity), where polygenicity is adjusted by effective sample size.
+ | 'mean effect size (adjusted)': mean effect size.
+ | 'polygenicity (adjusted)': polygenicity after adjustment by effective sample size. This is the one used in the main analysis.
+ | 'polygenicity (unadjusted)': polygenicity before adjustment by effective sample size.
+ | 'h2GWAS_mixer': h2GWAS estimated by mixer.
+ | 'h2m': h2GWAS minus heritability explained by univ significant regions (independent regions with significant univ hits).
+ | 'Neff': effective sample size.
+ | '# of univGWAS hits': number of non-independent hits by univariate analysis.
+ | '# of regions with univGWAS hits': number of univ significant regions (i.e. independent regions with hits by univariate analysis).
+ | 'h2_LD': heritability estimated by LDSC.
+ | 'perc_h2m': h2m/h2GWAS.
+ | 'h2_univhit_region': heritability explained by univ significant regions.
+ |---- berisa_region.bed : Independent regions (loci) and their positions on chromosomes.
+ |---- /BMIanalysis
+ | |---- Evaluation_new_associations_BMI_with_allSNPs.tsv : For comparison between JASS detection with GIANT_BMI (smaller sample size) and univariate GWAS for BMI with a larger sample size.
+ | [Column names] 'Region': Independent region (locus).
+ | 'n_associated_JASS': out of 1776 traitsets containing GIANT_BMI, the number of times the region was associated by JASS while not associated by GIANT_BMI.
+ | 'n_associated_JASS_corrected': same as 'n_associated_JASS' except that the significance by JASS is evaluated by q-value (p-value x 1776).
+ | 'P_small_gwas': P-value in the smaller univariate GWAS (GIANT_BMI).
+ | 'min_P_jass_joint': minimum P-value of the region by JASS across 1776 traitsets with GIANT_BMI.
+ | 'P_univ_large_gwas': P-value in the larger univariate GWAS.
+ | 'evaluate': comparison between JASS and large GWAS results.
+ | 'JASS': whether JASS detected the region as significant.
+ | 'Large_GWAS': whether the larger GWAS detected the region as significant.
+ |---- /clinical_grouping_analysis_2023-09-06
+ | |---- Trait_clinical_groupings.csv : Clinical categories of 72 traits.
+ | |---- category_traitset_with_mean_test_jass.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as JASS multi-trait gain (for validation sets).
+ | |---- category_traitset_with_mean_train_jass.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as JASS multi-trait gain (for training sets).
+ | |---- category_traitset_with_mean_test_mtag.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as MTAG multi-trait gain (for validation sets).
+ | |---- category_traitset_with_mean_train_mtag.tsv : Summary of the number of clinical categories per traitset jointly analysed, as well as MTAG multi-trait gain (for training sets).
+ | |---- category_traitset_with_mean_all_jass_no_duplicates : A file combining category_traitset_with_mean_test_jass.tsv and category_traitset_with_mean_train_jass.tsv without duplicates.
+ | [Column names] 'trait_set': list of traits, 'obs_joint': number of new associations observed, 'obs_gain': multi-trait gain observed, 'n_group': number of clinical categories in the trait set,
+ | 'group_names': Group_ID of each trait (as in Trait_clinical_groupings.csv), 'rank_datadriven': rank by multi-trait gain estimated by our regression model.
+ |---- Correlation_matrix_genetic.csv : genetic correlation matrix across 72 traits.
+ |---- COV_H0.csv : covariance matrix under null across 72 traits.
+ |---- GWAS_hit_count_plink.csv : univariate GWAS result summary of 72 traits
+ | [Column names] 'trait_name': name of the trait.
+ | 'univ_hit_count': the number of non-independent variants significant by univariate GWAS.
+ | 'univ_hit_region_count': the number of independent regions (loci) with variants significant by univariate GWAS.
+ | 'univ_hit_plinkclumps_count': the number of independent regions estimated by plink that contain variants significant by univariate GWAS.
+ |---- /JASS_5CVdata-2023-08-01 <- inputs and outputs in the regression analysis for estimating contributions of trait features to multi-trait gain (as in ~/Regression_Analysis/ below)
+ | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model (main).
+ | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wMES.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with MES.
+ | |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wPI.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with polygenicity.
+ | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar.csv : same as the main model above except that the model is non-linear (SVR or RFR).
+ | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wMES.csv : same as the wMES model above except that the model is non-linear (SVR or RFR).
+ | |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wPI.csv : same as the wPI model above except that the model is non-linear (SVR or RFR).
+ | | [Column names] '**_coef_mv' : coefficient in multivariate linear regression, where ** represents a feature (as described below).
+ | | '**_SE_mv' : standard error of the coefficient in multivariate linear regression.
+ | | '**_pval_mv': p-value of the coefficient in multivariate linear regression.
+ | | '**_coef_uv': coefficient in univariate linear regression.
+ | | '**_SE_uv' : standard error of the coefficient in univariate linear regression.
+ | | '**_pval_uv' : p-value of the coefficient in univariate linear regression.
+ | | 'R2train_mv','R2val_mv' : r2_score calculated using sklearn.metrics.r2_score (for training and validation datasets, respectively).
+ | | 'R2val_adj_mv' : r2_score between observed multi-trait gain and prediction adjusted by standard deviation (for validation dataset).
+ | | 'corRtrain_mv','corRval_mv': Pearson's correlation coefficient between observed and predicted multi-trait gain (for training and validation datasets, respectively). This is the one used to evaluate the model performance.
+ | | 'corR2train_mv','corR2val_mv': squared Pearson's correlation coefficient (for training and validation datasets, respectively).
+ | | 'corR2val_adj_mv': squared Pearson's correlation coefficient between observed multi-trait gain and prediction adjusted by standard deviation (for validation dataset).
+ | | 'cor_p_train_mv','cor_p_val_mv': p-value of the Pearson's correlation (for training and validation datasets, respectively).
+ | | **feature names <-- 'k': number of traits, 'log10_avg_distance_cor': log10(Delta_Sigma), 'log10_mean_gencov': log10(mean genetic covariance), 'avg_Neff': mean effective sample size, 'avg_h2_mixer': mean h2GWAS, 'avg_perc_h2_diff_region': mean %h2u,
+ | | 'avg_log10pi_semilogadjust': mean log10(polygenicity), 'avg_log10mes_semilogadjust': mean log10(MES).
+ | |---- stat_set_with_fullJASS_power_5CVcombined_without_duplicates.tsv : summary of results by JASS full version (including variants with missing data).
+ | |---- traitset_jass_CV*.tsv : same files as in /Regression_analysis/inputs/, described below.
+ | |---- traitset_jass_5CVcombined_without_duplicates.tsv : a file combining traitset_jass_CV*.tsv files (from ) across 5CV without duplicated traitsets.
+ |---- /JASS_true_pred_gains-2023-08-03
+ | |---- gain-true-predicted_jass_linear_pre-selected-features_CV?.txt : observed, predicted, and adjusted predicted gains across traitsets in each CV.
+ |---- /MTAG_5CVdata-2023-08-01
+ | |---- 5foldCV_MTAG_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model for MTAG.
+ | |---- traitset_mtag_CV*.tsv : files in the same format as traitset_jass_CV*.tsv in /Regression_analysis/inputs/, for the MTAG analysis instead of JASS.
+ |---- Pval_cor_matrix_genetic.csv : P-values for genetic correlations across 72 traits.
+
 * /Regression_analysis : contains all data (/inputs) and scripts (/scripts) used for regression analysis with cross validation using JASS and MTAG.
- - /scripts
- - 5f-cross_validation_linear.py : Linear regression analysis for JASS.
- - 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
- - 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
- - /inputs
- - traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each of 5-fold cross validations for JASS.
- - traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each of 5-fold cross validations for JASS.
- - traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each of 5-fold cross validation for MTAG.
- - traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each of 5-fold cross validation for MTAG.
+ |- /scripts
+ |---- 5f-cross_validation_linear.py : Linear regression analysis for JASS.
+ |---- 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
+ |---- 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
+ |- /inputs
+ |---- traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each fold of the 5-fold cross validation for JASS.
+ |---- traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each fold of the 5-fold cross validation for JASS.
+ |---- traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each fold of the 5-fold cross validation for MTAG.
+ |---- traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each fold of the 5-fold cross validation for MTAG.
+ Each of these input files contains:
+ 1. k : the number of traits jointly analysed
+ 2. trait : list of traits jointly analysed
+ 3. type_sampling : type of stratified sampling used
+ 4. Both : the number of independent regions (loci) that were significant by both univariate and multi-trait analyses
+ 5. Joint : the number of independent regions (loci) that were significant only by multi-trait analysis
+ 6. None : the number of independent regions (loci) that were significant by neither univariate nor multi-trait analysis
+ 7. Univariate : the number of independent regions (loci) that were significant only by univariate analysis
+ 8. fraction_more_significant_joint : the fraction of independent regions (loci) that had a smaller p-value from multi-trait analysis than from univariate analysis.
+ 9. fraction_more_significant_joint_qval : "multi-trait gain", the fraction of independent regions (loci) that had a smaller q-value (multiple-testing-corrected p-value) from multi-trait analysis than p-value from univariate analysis.
+ 10. avg_log10mes_semilogadjust : mean log10(mean effect size)
+ 11. avg_log10pi_semilogadjust : mean log10(polygenicity)
+ 12. avg_Neff : mean effective sample size
+ 13. avg_h2_sigSNP_region : mean across traits of heritability explained by independent regions (loci) significant by univariate analysis.
+ 14. avg_h2_mixer : mean of h2GWAS estimated by mixer.
+ 15. avg_perc_h2_diff_region : mean of %h2u
+ 16. var_h2_sigSNP_region : variance across traits of heritability explained by independent regions (loci) significant by univariate analysis.
+ 17. mean_gencor : mean genetic correlation
+ 18. mean_null_phencor : mean correlation under null
+ 19. avg_distance_cor : average distance between the genetic and residual correlation matrices.
+ 20. mean_gencov : mean genetic covariance
+ 21. mean_null_phencov : mean covariance under null
+ 22. condition_number_rcov : condition number of covariance matrix under null
+ 23. condition_number_gcov : condition number of genetic covariance matrix
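
Note on the derived heritability columns: the README text in this diff defines 'h2m' as h2GWAS minus the heritability explained by univ significant regions, and 'perc_h2m' as h2m/h2GWAS. The sketch below illustrates that arithmetic with the column names 'h2GWAS_mixer' and 'h2_univhit_region' from the listing. It is an illustration only, not the authors' script: the 'trait' column name, the toy values, and the helper `derive_h2m` are hypothetical, and the README does not say whether 'perc_h2m' is stored as a fraction or as a percentage (a fraction is assumed here).

```python
import csv
import io
import math

# Hypothetical toy rows for illustration only; these values are NOT taken
# from 72trait_data_2023-07-07.csv, and the 'trait' column name is assumed.
TOY_CSV = """trait,h2GWAS_mixer,h2_univhit_region
trait_A,0.20,0.05
trait_B,0.10,0.00
"""


def derive_h2m(row):
    """Return (h2m, perc_h2m) following the README definitions.

    h2m      = h2GWAS - heritability explained by univ significant regions
    perc_h2m = h2m / h2GWAS (assumed here to be a fraction, not a percent)
    """
    h2_gwas = float(row["h2GWAS_mixer"])
    h2_hits = float(row["h2_univhit_region"])
    h2m = h2_gwas - h2_hits
    # Guard against division by zero for a trait with no estimated h2GWAS.
    perc_h2m = h2m / h2_gwas if h2_gwas > 0 else math.nan
    return h2m, perc_h2m


# Parse the in-memory CSV and derive both columns for every trait.
derived = {
    row["trait"]: derive_h2m(row)
    for row in csv.DictReader(io.StringIO(TOY_CSV))
}

for trait, (h2m, perc_h2m) in derived.items():
    print(f"{trait}: h2m={h2m:.3f}, perc_h2m={perc_h2m:.3f}")
```

To check the real file, the same function could be applied to rows read from 72trait_data_2023-07-07.csv and compared against the stored 'h2m' and 'perc_h2m' columns, provided the file's scaling matches the fraction assumption above.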