updated

Commit 4a9ec44c authored 1 year ago by Yuka SUZUKI
--- a/README.md
+++ b/README.md
@@ -4,13 +4,112 @@ Scripts and data used in the analysis and visualisation in the manuscript ``Trai
 * README.md : this file.
 * /Figures_manuscript : contains all data (/inputs) and scripts (/scripts) used to generate figures.
+    |- /scripts : script name indicates the figure number.
+    |- /inputs
+    |---- 72trait_data_2023-07-07.csv : Features of 72 traits
+    |     [Column names] 'log10_mes_semilogadjust': log10(mean effect size).
+    |			 'log10_pi_semilogadjust': log10(polygenicity), i.e. the log10 of polygenicity after adjustment for effective sample size.
+    |	 		 'mean effect size (adjusted)': mean effect size.
+    |			 'polygenicity (adjusted)': polygenicity after adjustment for effective sample size. This is the value used in the main analysis.
+    |			 'polygenicity (unadjusted)': polygenicity before adjustment for effective sample size.
+    |			 'h2GWAS_mixer': h2GWAS estimated by mixer.
+    |			 'h2m': h2GWAS minus heritability explained by univ significant regions (independent regions with significant univ hits).
+    |			 'Neff': effective sample size.
+    |			 '# of univGWAS hits': number of non-independent hits by univariate analysis.
+    |			 '# of regions with univGWAS hits': number of univ significant regions (i.e. independent regions with hits by univariate analysis).
+    |			 'h2_LD': heritability estimated by LDSC.
+    |			 'perc_h2m': h2m/h2GWAS.
+    |			 'h2_univhit_region': heritability explained by univ significant regions.
+    |---- berisa_region.bed : Independent regions (loci) and their positions on the chromosomes.
+    |---- /BMIanalysis
+    |      |---- Evaluation_new_associations_BMI_with_allSNPs.tsv : For comparison between JASS detection with GIANT_BMI (smaller sample size) and univariate GWAS for BMI with a larger sample size.
+    |            [Column names] 'Region': Independent region (locus).
+    |				'n_associated_JASS': out of the 1776 traitsets containing GIANT_BMI, the number in which the region was significant by JASS but not by GIANT_BMI.
+    |				'n_associated_JASS_corrected': same as 'n_associated_JASS' except that the significance by JASS is evaluated by q-value (p-value x 1776).
+    |				'P_small_gwas': P-value in the smaller univariate GWAS (GIANT_BMI).
+    |				'min_P_jass_joint': minimum P-value of the region by JASS across the 1776 traitsets containing GIANT_BMI.
+    |				'P_univ_large_gwas': P-value in the larger univariate GWAS.
+    |				'evaluate': comparison between JASS and large GWAS results.
+    |				'JASS': whether JASS detected the region as significant.
+    |				'Large_GWAS': whether larger GWAS detected the region as significant.
+    |---- /clinical_grouping_analysis_2023-09-06
+    |      |---- Trait_clinical_groupings.csv : Clinical categories of 72 traits. 
+    |      |---- category_traitset_with_mean_test_jass.tsv : Summary of the number of clinical categories in each jointly analysed traitset, together with the JASS multi-trait gain (for validation sets).
+    |      |---- category_traitset_with_mean_train_jass.tsv : Summary of the number of clinical categories in each jointly analysed traitset, together with the JASS multi-trait gain (for training sets).
+    |      |---- category_traitset_with_mean_test_mtag.tsv : Summary of the number of clinical categories in each jointly analysed traitset, together with the MTAG multi-trait gain (for validation sets).
+    |      |---- category_traitset_with_mean_train_mtag.tsv : Summary of the number of clinical categories in each jointly analysed traitset, together with the MTAG multi-trait gain (for training sets).
+    |      |---- category_traitset_with_mean_all_jass_no_duplicates : A file combining category_traitset_with_mean_test_jass.tsv and category_traitset_with_mean_train_jass.tsv without duplicates.
+    |            [Column names] 'trait_set': list of traits, 'obs_joint': #new associations observed, 'obs_gain': multitrait gain observed, 'n_group': number of clinical categories in the trait set, 
+    |				'group_names': Group_ID of each trait (as in Trait_clinical_groupings.csv), 'rank_datadriven': rank by multi-trait gain estimated by our regression model.
+    |---- Correlation_matrix_genetic.csv : genetic correlation matrix across 72 traits.
+    |---- COV_H0.csv : covariance matrix under null across 72 traits.
+    |---- GWAS_hit_count_plink.csv : univariate GWAS result summary of 72 traits
+    |     [Column names] 'trait_name': name of the trait
+    |                    'univ_hit_count': the number of non-independent variants significant by univariate GWAS.
+    |                    'univ_hit_region_count': the number of independent regions (loci) with variants significant by univariate GWAS.
+    |                    'univ_hit_plinkclumps_count': the number of independent regions estimated by plink that contain variants significant by univariate GWAS.
+    |---- /JASS_5CVdata-2023-08-01 <- inputs and outputs in the regression analysis for estimating contributions of trait features to multi-trait gain (as in ~/Regression_Analysis/ below)
+    |      |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model (main).
+    |      |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wMES.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with MES.
+    |      |---- 5foldCV_linear_summary_semilogadjust_log10_excCorVar_wPI.csv : summary of 5-fold cross validation with a linear model when replacing %h2u with polygenicity.
+    |      |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar.csv : same as above main model except the model is non-linear (SVR or RFR).
+    |      |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wMES.csv : same as above wMES model except the model is non-linear (SVR or RFR).
+    |      |---- 5foldCV_???_summary_semilogadjust_log10_excCorVar_wPI.csv : same as above wPI model except the model is non-linear (SVR or RFR).
+    |      |      [Column names] '**_coef_mv' : coefficient in multivariate linear regression, where ** represent a feature (as described below).
+    |      |                     '**_SE_mv' : standard error of the coefficient in multivariate linear regression.
+    |      |                     '**_pval_mv': pvalue of the coefficient in multivariate linear regression.
+    |      |                     '**_coef_uv': coefficient in univariate linear regression.
+    |      |                     '**_SE_uv' : standard error of the coefficient in univariate linear regression.
+    |      |                     '**_pval_uv' : pvalue of the coefficient in univariate linear regression.
+    |      |                     'R2train_mv','R2val_mv' : r2_score calculated using sklearn.metrics.r2_score (for training and validation datasets, respectively).
+    |      |                     'R2val_adj_mv' : r2_score between observed multi-trait gain and prediction adjusted by standard deviation (for the validation dataset).
+    |      |                     'corRtrain_mv','corRval_mv': pearson's correlation coefficient between observed and predicted multi-trait gain (for training and validation datasets, respectively). This is the one used to evaluate the model performance.
+    |      |                     'corR2train_mv','corR2val_mv': squared-pearson's correlation coefficient (for training and validation datasets, respectively).
+    |      |                     'corR2val_adj_mv': squared-pearson's correlation coefficient between observed multi-trait gain and prediction adjusted by standard deviation. (for validation dataset).
+    |      |                     'cor_p_train_mv','cor_p_val_mv': p-value in the pearson's correlation (for training and validation datasets, respectively)
+    |      |                    **feature names <-- 'k': # of traits, 'log10_avg_distance_cor': log10(Delta_Sigma), 'log10_mean_gencov': log10(mean genetic covariance), 'avg_Neff': mean effective sample size, 'avg_h2_mixer': mean h2GWAS, 'avg_perc_h2_diff_region': mean %h2u,
+    |      |					    'avg_log10pi_semilogadjust': mean log10(polygenicity), 'avg_log10mes_semilogadjust':mean log10(MES). 
+    |      |---- stat_set_with_fullJASS_power_5CVcombined_without_duplicates.tsv : summary of results by JASS full version (including variants with missing data). 
+    |      |---- traitset_jass_CV*.tsv files same as in /Regression_analysis/inputs/ described below
+    |      |---- traitset_jass_5CVcombined_without_duplicates.tsv : a file combining traitset_jass_CV*.tsv files (from ) across 5CV without duplicated traitsets.
+    |---- /JASS_true_pred_gains-2023-08-03
+    |	   |---- gain-true-predicted_jass_linear_pre-selected-features_CV?.txt : observed, predicted, and adjusted predicted gains across traitsets in each CV.
+    |---- /MTAG_5CVdata-2023-08-01
+    |      |---- 5foldCV_MTAG_linear_summary_semilogadjust_log10_excCorVar.csv : summary of 5-fold cross validation with a linear model for MTAG.
+    |      |---- traitset_mtag_CV*.tsv files in the same format as traitset_jass_CV*.tsv in /Regression_analysis/inputs/. For MTAG analysis instead of JASS.
+    |---- Pval_cor_matrix_genetic.csv : Pvalues for genetic correlations across 72 traits.
 * /Regression_analysis : contains all data (/inputs) and scripts (/scripts) used for regression analysis with cross validation using JASS and MTAG.
-    - /scripts
+    |- /scripts
-        - 5f-cross_validation_linear.py : Linear regression analysis for JASS.
+    |---- 5f-cross_validation_linear.py : Linear regression analysis for JASS.
-        - 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
+    |---- 5f-cross_validation_linear_mtag.py : Linear regression analysis for MTAG.
-        - 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
+    |---- 5f-cross_validation_nonlinear.py : Non-linear (SVR, RFR) regression analysis for JASS.
-    - /inputs
+    |- /inputs
-        - traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each of 5-fold cross validations for JASS.
+    |---- traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv : training data files for each of 5-fold cross validations for JASS.
-        - traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each of 5-fold cross validations for JASS.
+    |---- traitset_jass_CVtest?-newSUMMARY_remove-nan.tsv : validation data files for each of 5-fold cross validations for JASS.
-        - traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each of 5-fold cross validation for MTAG.
+    |---- traitset_mtag_CVtraining?-newSUMMARY_comp_correction.tsv : training data files for each of 5-fold cross validation for MTAG.
-        - traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each of 5-fold cross validation for MTAG.
+    |---- traitset_mtag_CVtest?-newSUMMARY_comp_correction.tsv : validation data files for each of 5-fold cross validation for MTAG.
+	  Each of these input files contains the following columns:
+		1. k : the number of traits jointly analysed
+		2. trait : list of traits jointly analysed
+		3. type_sampling : type of stratified sampling used
+		4. Both : the number of independent regions (loci) that were significant by both univariate and multi-trait analyses
+		5. Joint : the number of independent regions (loci) that were significant only by multi-trait analysis
+		6. None : the number of independent regions (loci) that were significant by neither univariate nor multi-trait analysis
+		7. Univariate : the number of independent regions (loci) that were significant only by univariate analysis
+		8. fraction_more_significant_joint : the fraction of independent regions (loci) that had smaller p-value from multi-trait analysis than p-value from univariate analysis.
+		9. fraction_more_significant_joint_qval : "multi-trait gain", the fraction of independent regions (loci) that had smaller q-value (multiple testing corrected p-value) from multi-trait analysis than p-value from univariate analysis.
+		10. avg_log10mes_semilogadjust : mean log10(mean effect size)
+		11. avg_log10pi_semilogadjust : mean log10(polygenicity) 
+		12. avg_Neff : mean effective sample size
+		13. avg_h2_sigSNP_region : mean across traits of heritability explained by independent regions (loci) significant by univariate analysis.
+		14. avg_h2_mixer : mean of h2GWAS estimated by mixer.
+		15. avg_perc_h2_diff_region : mean of %h2u
+		16. var_h2_sigSNP_region : variance across traits of heritability explained by independent regions (loci) significant by univariate analysis.
+		17. mean_gencor : mean genetic correlation
+		18. mean_null_phencor : mean correlation under null
+		19. avg_distance_cor : average distance between the genetic and residual correlation matrices.
+		20. mean_gencov : mean genetic covariance
+		21. mean_null_phencov : mean covariance under null
+		22. condition_number_rcov : condition number of covariance matrix under null
+		23. condition_number_gcov : condition number of genetic covariance matrix
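
The column list above can be exercised with a small sketch. It is not taken from the repository scripts: the three-row table, the helper names `read_columns` and `pearson_r`, and all numeric values are hypothetical and exist only for illustration. It assumes the input files are tab-separated with a header row, and it uses Pearson's correlation only because the README names it as the performance metric. Here it is applied to one feature against the multi-trait gain, which is an illustration and not the authors' evaluation of predicted versus observed gain.

```python
import csv
import io
import math

# Hypothetical excerpt with made-up values; a real input file has all 23 columns.
SAMPLE_TSV = (
    "k\tfraction_more_significant_joint_qval\tavg_perc_h2_diff_region\n"
    "2\t0.10\t0.60\n"
    "3\t0.25\t0.70\n"
    "4\t0.40\t0.90\n"
)


def read_columns(handle, names):
    """Return {name: [float, ...]} for the requested columns of a TSV file."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = {name: [] for name in names}
    for row in reader:
        for name in names:
            columns[name].append(float(row[name]))
    return columns


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


cols = read_columns(
    io.StringIO(SAMPLE_TSV),
    ["k", "fraction_more_significant_joint_qval", "avg_perc_h2_diff_region"],
)
# Column 9 is the multi-trait gain; column 15 is the mean %h2u feature.
r = pearson_r(
    cols["avg_perc_h2_diff_region"],
    cols["fraction_more_significant_joint_qval"],
)
print(f"Pearson r between mean %h2u and multi-trait gain: {r:.3f}")
```

To try it on a real file, replace `io.StringIO(SAMPLE_TSV)` with an opened file handle such as `open(path, newline="")`, where `path` points at one of the traitset_jass_CVtraining?-newSUMMARY_remove-nan.tsv files.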