Data Integration with Kernel Methods

Café Méthodo: Tuesday, December 13

Vincent Guillemot

Our goals for this workshop

  • Introduce tidymodels and its general philosophy on modeling.
  • Help you become proficient with the core packages for modeling.
  • Point you to places to learn more and get help.

Why tidymodels?

Several other modeling frameworks in R also try to create a uniform, cohesive, and unsurprising set of modeling APIs. Examples include caret and mlr3.

  • caret tends to suit people who prefer base R and traditional interfaces.
  • mlr3 is more Pythonic and also has many features.
  • tidymodels is probably preferable for those who value a tidy R interface, a large number of features, and the idea that interfaces should steer users into the “pit of success”.

The tidymodels package

There are many tidymodels packages, but about 90% of the work is done by five of them: rsample, recipes, parsnip, tune, and yardstick.

The best way to get started with tidymodels is to use the tidymodels meta-package. It loads the core packages plus some tidyverse packages.

The tidymodels package

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 1.0.0 ──
#> ✔ broom        1.0.1      ✔ rsample      1.1.0 
#> ✔ dials        1.1.0      ✔ tibble       3.1.8 
#> ✔ dplyr        1.0.10     ✔ tidyr        1.2.1 
#> ✔ infer        1.0.3      ✔ tune         1.0.1 
#> ✔ modeldata    1.0.1      ✔ workflows    1.1.2 
#> ✔ parsnip      1.0.3      ✔ workflowsets 1.0.0 
#> ✔ purrr        0.3.5      ✔ yardstick    1.1.0 
#> ✔ recipes      1.0.3
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Dig deeper into tidy modeling with R at https://www.tmwr.org
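
To check which packages the meta-package covers, a minimal sketch (assuming tidymodels is installed) can list them with tidymodels_packages() and confirm that the five workhorse packages are included. The variable names pkgs and core are chosen here for illustration.

```r
library(tidymodels)

# tidymodels_packages() returns a character vector naming the packages
# that belong to the tidymodels meta-package.
pkgs <- tidymodels_packages()

# The five packages that do most of the work should all be listed.
core <- c("rsample", "recipes", "parsnip", "tune", "yardstick")
core %in% pkgs
```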

Managing name conflicts

tidymodels_prefer(quiet = FALSE)
#> [conflicted] Will prefer dplyr::filter over any other package
#> [conflicted] Will prefer dplyr::select over any other package
#> [conflicted] Will prefer dplyr::slice over any other package
#> [conflicted] Will prefer dplyr::rename over any other package
#> [conflicted] Will prefer dials::neighbors over any other package
#> [conflicted] Will prefer parsnip::fit over any other package
#> [conflicted] Will prefer parsnip::bart over any other package
#> [conflicted] Will prefer parsnip::pls over any other package
#> [conflicted] Will prefer purrr::map over any other package
#> [conflicted] Will prefer recipes::step over any other package
#> [conflicted] Will prefer themis::step_downsample over any other package
#> [conflicted] Will prefer themis::step_upsample over any other package
#> [conflicted] Will prefer tune::tune over any other package
#> [conflicted] Will prefer yardstick::precision over any other package
#> [conflicted] Will prefer yardstick::recall over any other package
#> [conflicted] Will prefer yardstick::spec over any other package
#> ── Conflicts ────────────────────────────────── tidymodels_prefer() ──
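
What a name conflict means in practice can be sketched with dplyr and base R alone. The toy data frame below is invented for illustration. Once dplyr is attached, the bare name filter() resolves to dplyr::filter(), while the masked stats::filter() stays reachable through its namespace prefix.

```r
library(dplyr)

# A tiny data frame, invented for illustration.
toy <- data.frame(x = 1:6)

# With dplyr attached, the unqualified name is dplyr::filter(),
# which keeps the rows matching a condition.
kept <- filter(toy, x > 3)

# The masked function is still available with an explicit prefix.
# stats::filter() applies a linear filter, here a 3-point moving average.
smoothed <- stats::filter(toy$x, rep(1 / 3, 3))
```

Writing the prefix explicitly (dplyr::filter, stats::filter) is the safest option in scripts that other people will run with a different set of attached packages.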

Alzheimer’s disease data

Data from a clinical trial of individuals with well-characterized cognitive impairment, and age-matched control participants.

# install.packages("modeldata")
library(modeldata)
data("ad_data")
alz <- ad_data

glimpse(alz)
#> Rows: 333
#> Columns: 131
#> $ ACE_CD143_Angiotensin_Converti   <dbl> 2.0031003, 1.5618560, 1.5206598, 1.68…
#> $ ACTH_Adrenocorticotropic_Hormon  <dbl> -1.3862944, -1.3862944, -1.7147984, -…
#> $ AXL                              <dbl> 1.09838668, 0.68328157, -0.14527630, …
#> $ Adiponectin                      <dbl> -5.360193, -5.020686, -5.809143, -5.1…
#> $ Alpha_1_Antichymotrypsin         <dbl> 1.7404662, 1.4586150, 1.1939225, 1.28…
#> $ Alpha_1_Antitrypsin              <dbl> -12.631361, -11.909882, -13.642963, -…
#> $ Alpha_1_Microglobulin            <dbl> -2.577022, -3.244194, -2.882404, -3.1…
#> $ Alpha_2_Macroglobulin            <dbl> -72.65029, -154.61228, -136.52918, -9…
#> $ Angiopoietin_2_ANG_2             <dbl> 1.06471074, 0.74193734, 0.83290912, 0…
#> $ Angiotensinogen                  <dbl> 2.510547, 2.457283, 1.976365, 2.37608…
#> $ Apolipoprotein_A_IV              <dbl> -1.427116, -1.660731, -1.660731, -2.1…
#> $ Apolipoprotein_A1                <dbl> -7.402052, -7.047017, -7.684284, -8.0…
#> $ Apolipoprotein_A2                <dbl> -0.26136476, -0.86750057, -0.65392647…
#> $ Apolipoprotein_B                 <dbl> -4.624044, -6.747507, -3.976069, -6.5…
#> $ Apolipoprotein_CI                <dbl> -1.2729657, -1.2729657, -1.7147984, -…
#> $ Apolipoprotein_CIII              <dbl> -2.312635, -2.343407, -2.748872, -2.9…
#> $ Apolipoprotein_D                 <dbl> 2.0794415, 1.3350011, 1.3350011, 1.43…
#> $ Apolipoprotein_E                 <dbl> 3.7545215, 3.0971187, 2.7530556, 2.37…
#> $ Apolipoprotein_H                 <dbl> -0.15734908, -0.57539617, -0.34483937…
#> $ B_Lymphocyte_Chemoattractant_BL  <dbl> 2.2969819, 1.6731213, 1.6731213, 1.98…
#> $ BMP_6                            <dbl> -2.200744, -1.728053, -2.062421, -1.9…
#> $ Beta_2_Microglobulin             <dbl> 0.69314718, 0.47000363, 0.33647224, 0…
#> $ Betacellulin                     <int> 34, 53, 49, 52, 67, 51, 41, 42, 58, 5…
#> $ C_Reactive_Protein               <dbl> -4.074542, -6.645391, -8.047190, -6.2…
#> $ CD40                             <dbl> -0.7964147, -1.2733760, -1.2415199, -…
#> $ CD5L                             <dbl> 0.09531018, -0.67334455, 0.09531018, …
#> $ Calbindin                        <dbl> 33.21363, 25.27636, 22.16609, 23.4558…
#> $ Calcitonin                       <dbl> 1.3862944, 3.6109179, 2.1162555, -0.1…
#> $ CgA                              <dbl> 397.6536, 465.6759, 347.8639, 334.234…
#> $ Clusterin_Apo_J                  <dbl> 3.555348, 3.044522, 2.772589, 2.83321…
#> $ Complement_3                     <dbl> -10.36305, -16.10824, -16.10824, -13.…
#> $ Complement_Factor_H              <dbl> 3.5737252, 3.6000471, 4.4745686, 3.09…
#> $ Connective_Tissue_Growth_Factor  <dbl> 0.5306283, 0.5877867, 0.6418539, 0.53…
#> $ Cortisol                         <dbl> 10.0, 12.0, 10.0, 14.0, 11.0, 13.0, 4…
#> $ Creatine_Kinase_MB               <dbl> -1.710172, -1.751002, -1.383559, -1.6…
#> $ Cystatin_C                       <dbl> 9.041922, 9.067624, 8.954157, 9.58190…
#> $ EGF_R                            <dbl> -0.1354543, -0.3700474, -0.7329871, -…
#> $ EN_RAGE                          <dbl> -3.688879, -3.816713, -4.755993, -2.9…
#> $ ENA_78                           <dbl> -1.349543, -1.356595, -1.390672, -1.3…
#> $ Eotaxin_3                        <int> 53, 62, 62, 44, 64, 57, 64, 64, 64, 7…
#> $ FAS                              <dbl> -0.08338161, -0.52763274, -0.63487827…
#> $ FSH_Follicle_Stimulation_Hormon  <dbl> -0.6516715, -1.6272839, -1.5630004, -…
#> $ Fas_Ligand                       <dbl> 3.1014922, 2.9788133, 1.3600098, 2.53…
#> $ Fatty_Acid_Binding_Protein       <dbl> 2.5208712, 2.2477966, 0.9063009, 0.62…
#> $ Ferritin                         <dbl> 3.329165, 3.932959, 3.176872, 3.13809…
#> $ Fetuin_A                         <dbl> 1.2809338, 1.1939225, 1.4109870, 0.74…
#> $ Fibrinogen                       <dbl> -7.035589, -8.047190, -7.195437, -7.7…
#> $ GRO_alpha                        <dbl> 1.381830, 1.372438, 1.412679, 1.37243…
#> $ Gamma_Interferon_induced_Monokin <dbl> 2.949822, 2.721793, 2.762231, 2.88547…
#> $ Glutathione_S_Transferase_alpha  <dbl> 1.0641271, 0.8670202, 0.8890150, 0.70…
#> $ HB_EGF                           <dbl> 6.559746, 8.754531, 7.745463, 5.94943…
#> $ HCC_4                            <dbl> -3.036554, -4.074542, -3.649659, -3.8…
#> $ Hepatocyte_Growth_Factor_HGF     <dbl> 0.58778666, 0.53062825, 0.09531018, 0…
#> $ I_309                            <dbl> 3.433987, 3.135494, 2.397895, 3.36729…
#> $ ICAM_1                           <dbl> -0.1907787, -0.4620172, -0.4620172, -…
#> $ IGF_BP_2                         <dbl> 5.609472, 5.347108, 5.181784, 5.42495…
#> $ IL_11                            <dbl> 5.121987, 4.936704, 4.665910, 6.22393…
#> $ IL_13                            <dbl> 1.282549, 1.269463, 1.274133, 1.30754…
#> $ IL_16                            <dbl> 4.192081, 2.876338, 2.616102, 2.44105…
#> $ IL_17E                           <dbl> 5.731246, 6.705891, 4.149327, 4.69584…
#> $ IL_1alpha                        <dbl> -6.571283, -8.047190, -8.180721, -7.6…
#> $ IL_3                             <dbl> -3.244194, -3.912023, -4.645992, -4.2…
#> $ IL_4                             <dbl> 2.484907, 2.397895, 1.824549, 1.48160…
#> $ IL_5                             <dbl> 1.09861229, 0.69314718, -0.24846136, …
#> $ IL_6                             <dbl> 0.26936976, 0.09622438, 0.18568645, -…
#> $ IL_6_Receptor                    <dbl> 0.64279595, 0.43115645, 0.09668586, 0…
#> $ IL_7                             <dbl> 4.8050453, 3.7055056, 1.0056222, 2.33…
#> $ IL_8                             <dbl> 1.711325, 1.675557, 1.691393, 1.71994…
#> $ IP_10_Inducible_Protein_10       <dbl> 6.242223, 5.686975, 5.049856, 5.60211…
#> $ IgA                              <dbl> -6.812445, -6.377127, -6.319969, -7.6…
#> $ Insulin                          <dbl> -0.6258253, -0.9431406, -1.4466191, -…
#> $ Kidney_Injury_Molecule_1_KIM_1   <dbl> -1.204295, -1.197703, -1.191191, -1.2…
#> $ LOX_1                            <dbl> 1.7047481, 1.5260563, 1.1631508, 1.22…
#> $ Leptin                           <dbl> -1.5290628, -1.4660558, -1.6622675, -…
#> $ Lipoprotein_a                    <dbl> -4.268698, -4.933674, -5.843045, -4.9…
#> $ MCP_1                            <dbl> 6.740519, 6.849066, 6.767343, 6.78105…
#> $ MCP_2                            <dbl> 1.9805094, 1.8088944, 0.4005958, 1.98…
#> $ MIF                              <dbl> -1.237874, -1.897120, -2.302585, -1.6…
#> $ MIP_1alpha                       <dbl> 4.968453, 3.690160, 4.049508, 4.92856…
#> $ MIP_1beta                        <dbl> 3.258097, 3.135494, 2.397895, 3.21887…
#> $ MMP_2                            <dbl> 4.478566, 3.781473, 2.866631, 2.96851…
#> $ MMP_3                            <dbl> -2.207275, -2.465104, -2.302585, -1.7…
#> $ MMP10                            <dbl> -3.270169, -3.649659, -2.733368, -4.0…
#> $ MMP7                             <dbl> -3.7735027, -5.9681907, -4.0302269, -…
#> $ Myoglobin                        <dbl> -1.89711998, -0.75502258, -1.38629436…
#> $ NT_proBNP                        <dbl> 4.553877, 4.219508, 4.248495, 4.11087…
#> $ NrCAM                            <dbl> 5.003946, 5.209486, 4.744932, 4.96981…
#> $ Osteopontin                      <dbl> 5.356586, 6.003887, 5.017280, 5.76832…
#> $ PAI_1                            <dbl> 1.00350156, -0.03059880, 0.43837211, …
#> $ PAPP_A                           <dbl> -2.902226, -2.813276, -2.935541, -2.7…
#> $ PLGF                             <dbl> 4.442651, 4.025352, 4.510860, 3.43398…
#> $ PYY                              <dbl> 3.218876, 3.135494, 2.890372, 2.83321…
#> $ Pancreatic_polypeptide           <dbl> 0.57878085, 0.33647224, -0.89159812, …
#> $ Prolactin                        <dbl> 0.00000000, -0.51082562, -0.13926207,…
#> $ Prostatic_Acid_Phosphatase       <dbl> -1.620527, -1.739232, -1.636682, -1.7…
#> $ Protein_S                        <dbl> -1.784998, -2.463991, -2.259135, -2.7…
#> $ Pulmonary_and_Activation_Regulat <dbl> -0.8439701, -2.3025851, -1.6607312, -…
#> $ RANTES                           <dbl> -6.214608, -6.938214, -6.645391, -5.9…
#> $ Resistin                         <dbl> -16.475315, -16.025283, -16.475315, -…
#> $ S100b                            <dbl> 1.5618560, 1.7566212, 1.4357282, 1.25…
#> $ SGOT                             <dbl> -0.94160854, -0.65392647, 0.33647224,…
#> $ SHBG                             <dbl> -1.897120, -1.560648, -2.207275, -3.1…
#> $ SOD                              <dbl> 5.609472, 5.814131, 5.723585, 5.77144…
#> $ Serum_Amyloid_P                  <dbl> -5.599422, -6.119298, -5.381699, -6.6…
#> $ Sortilin                         <dbl> 4.908629, 5.478731, 3.810182, 3.40217…
#> $ Stem_Cell_Factor                 <dbl> 4.174387, 3.713572, 3.433987, 3.95124…
#> $ TGF_alpha                        <dbl> 8.649098, 11.331619, 10.858497, 9.454…
#> $ TIMP_1                           <dbl> 15.204651, 11.266499, 12.282857, 11.1…
#> $ TNF_RII                          <dbl> -0.06187540, -0.32850407, -0.41551544…
#> $ TRAIL_R3                         <dbl> -0.1829004, -0.5007471, -0.9240345, -…
#> $ TTR_prealbumin                   <dbl> 2.944439, 2.833213, 2.944439, 2.94443…
#> $ Tamm_Horsfall_Protein_THP        <dbl> -3.095810, -3.111190, -3.166721, -3.1…
#> $ Thrombomodulin                   <dbl> -1.340566, -1.675252, -1.534276, -1.9…
#> $ Thrombopoietin                   <dbl> -0.1026334, -0.6733501, -0.9229670, -…
#> $ Thymus_Expressed_Chemokine_TECK  <dbl> 4.149327, 3.810182, 2.791992, 4.03728…
#> $ Thyroid_Stimulating_Hormone      <dbl> -3.863233, -4.828314, -4.990833, -4.8…
#> $ Thyroxine_Binding_Globulin       <dbl> -1.4271164, -1.6094379, -1.8971200, -…
#> $ Tissue_Factor                    <dbl> 2.04122033, 2.02814825, 1.43508453, 2…
#> $ Transferrin                      <dbl> 3.332205, 2.890372, 2.890372, 2.89037…
#> $ Trefoil_Factor_3_TFF3            <dbl> -3.381395, -3.912023, -3.729701, -3.8…
#> $ VCAM_1                           <dbl> 3.258097, 2.708050, 2.639057, 2.77258…
#> $ VEGF                             <dbl> 22.03456, 18.60184, 17.47619, 17.5456…
#> $ Vitronectin                      <dbl> -0.04082199, -0.38566248, -0.22314355…
#> $ von_Willebrand_Factor            <dbl> -3.146555, -3.863233, -3.540459, -3.8…
#> $ age                              <dbl> 0.9876238, 0.9861496, 0.9866667, 0.98…
#> $ tau                              <dbl> 6.297754, 6.659294, 6.270988, 6.15273…
#> $ p_tau                            <dbl> 4.348108, 4.859967, 4.400247, 4.49488…
#> $ Ab_42                            <dbl> 12.019678, 11.015759, 12.302271, 12.3…
#> $ male                             <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1…
#> $ Genotype                         <fct> E3E3, E3E4, E3E4, E3E4, E3E3, E4E4, E…
#> $ Class                            <fct> Control, Control, Control, Control, C…

Alzheimer’s disease data

  • 1 categorical outcome: Class
  • 130 predictors
  • 126 protein measurements
  • also: age, male, Genotype
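
The split between outcome and predictors can be checked directly on the data. The following is a minimal sketch, assuming modeldata is installed; the names outcome, predictors, and type_counts are chosen here for illustration.

```r
library(modeldata)
data("ad_data", package = "modeldata")
alz <- ad_data

# Everything except the outcome column is treated as a predictor.
outcome <- "Class"
predictors <- setdiff(names(alz), outcome)
length(predictors)

# Tabulate the storage type of each predictor column
# (numeric, integer, or factor).
type_counts <- table(
  vapply(alz[predictors], function(col) class(col)[1], character(1))
)
type_counts
```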

Your turn

Explore the data.

library(tidymodels)
tidymodels_prefer()

data("ad_data", package = "modeldata")
alz <- ad_data
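
One possible starting point for the exploration, offered as a sketch rather than the expected answer: look at the class balance, compare group means for a few columns, and check for missing values. The three columns used here (tau, p_tau, Ab_42) are an arbitrary choice for illustration.

```r
library(tidymodels)
tidymodels_prefer()

data("ad_data", package = "modeldata")
alz <- ad_data

# How balanced is the outcome?
class_counts <- count(alz, Class)
class_counts

# Do the two classes differ on a few columns? (arbitrary selection)
marker_means <- alz %>%
  group_by(Class) %>%
  summarize(across(c(tau, p_tau, Ab_42), mean))
marker_means

# Are there missing values to deal with?
n_missing <- sum(is.na(alz))
n_missing
```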

Schedule for today

  • A minimal model
  • A better workflow
  • A tuned model