@@ -122,7 +122,6 @@ kable(apply(counts,2,fun_summary), caption = "Table 3: Summary of the raw counts
Figure 1 shows the total number of mapped and counted reads for each sample. Total read counts are expected to be similar within conditions, although they may differ across conditions. In practice, total counts sometimes vary widely between replicates.
```{r barplot, echo=FALSE,fig.align="center",fig.cap="Figure 1: Number of mapped reads per sample. Colors refer to the biological condition of the sample.", out.width="600px"}
@@ -132,8 +131,7 @@ A pairwise scatter plot is produced (figure 2) to show how replicates and sample
- 1 for technical replicates (technical variability follows a Poisson distribution)
- greater than 1 for biological replicates and samples from different biological conditions (biological variability is higher than technical variability, so the data are over-dispersed with respect to the Poisson distribution). The higher the SERE value, the lower the similarity. SERE is expected to be lower between biological replicates than between samples of different biological conditions. Hence, the SERE statistic can be used to detect inversions between samples.
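To make the two cases above concrete, the SERE statistic can be sketched in a few lines of R. This is a minimal sketch and not the code used by this report: the function name `sere` and the simulated counts are hypothetical, and the sketch assumes every sample has a non-zero total count.

```r
# Hypothetical helper: Simple Error Ratio Estimate for a features x samples
# count matrix. Assumes every column (sample) has a non-zero total.
sere <- function(counts) {
  counts <- as.matrix(counts)
  # Features with no reads in any sample carry no information
  counts <- counts[rowSums(counts) > 0, , drop = FALSE]
  lib_sizes <- colSums(counts)
  # Expected counts if the samples differed only by sequencing depth
  lambda <- rowSums(counts) / sum(lib_sizes)
  expected <- outer(lambda, lib_sizes)
  # Pearson statistic divided by its degrees of freedom
  pearson <- sum((counts - expected)^2 / expected)
  df <- nrow(counts) * (ncol(counts) - 1)
  sqrt(pearson / df)
}

set.seed(42)
n <- 5000
mu <- rexp(n, rate = 1 / 100)  # simulated mean expression levels

# "Technical replicates": same means, Poisson noise only
tech <- cbind(rpois(n, mu), rpois(n, mu))

# "Biological replicates": extra gamma-distributed variability (over-dispersion)
bio <- cbind(rpois(n, mu * rgamma(n, shape = 5, rate = 5)),
             rpois(n, mu * rgamma(n, shape = 5, rate = 5)))

sere_tech <- sere(tech)  # expected to be close to 1
sere_bio  <- sere(bio)   # expected to be clearly greater than 1
```

Two samples that differ only by a constant scaling factor give a SERE of 0, which is why values well below 1 usually point to duplicated data rather than to good replicates.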
```{r pairewiseScatter, echo=FALSE,fig.align="center",fig.cap="Figure 2: Pairwise comparison of samples (not produced when more than 12 samples).", out.width="1200px"}
```{r pairewiseScatter, echo=FALSE,fig.align="center",fig.cap="Figure 2: Pairwise comparison of samples (not produced when more than 12 samples).", out.width="1200px", fig.width=3*ncol(counts), fig.height=2*ncol(counts)}
@@ -144,14 +142,12 @@ The main variability within the experiment is expected to come from biological d
Figure 3 shows the sample clustering based on normalized data. A Euclidean distance is computed between samples, and the dendrogram is built using the Ward criterion. We expect this dendrogram to group replicates and separate biological conditions.
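The clustering described above can be sketched with base R as follows. This is an illustrative sketch only: the toy matrix `norm_data` is made up, and the choice of `method = "ward.D2"` is an assumption, since the text only says "Ward criterion" (base R also offers `"ward.D"`).

```r
# Toy normalized data: 4 features (rows) x 4 samples (columns), two conditions
norm_data <- cbind(
  A1 = c(1.0, 2.0, 3.0, 4.0),
  A2 = c(1.1, 2.1, 2.9, 4.2),
  B1 = c(4.0, 3.0, 2.0, 1.0),
  B2 = c(4.2, 2.9, 2.1, 1.1)
)

# dist() works on rows, so samples must be in rows: transpose first.
# The default distance is Euclidean.
sample_dist <- dist(t(norm_data))

# Hierarchical clustering with a Ward linkage (assumed variant: ward.D2)
hc <- hclust(sample_dist, method = "ward.D2")

# Cutting the tree in two should recover the two conditions
groups <- cutree(hc, k = 2)
```

Calling `plot(hc)` on the result draws the dendrogram; here the replicates of each condition are expected to end up under the same branch.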
```{r clusterplot, echo=FALSE,fig.align="center",fig.cap="Figure 3: Sample clustering based on normalized data.", out.width="600px", warning=FALSE, message=FALSE}
Another way of visualizing the experiment variability is to look at the first principal components of the PCA, as shown in Figure 4. In this figure, the first principal component (PC1) is expected to separate samples from the different biological conditions, meaning that biological variability is the main source of variance in the data.
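As a rough illustration of how the percentages of variance and the PC1 separation can be obtained, here is a sketch using base R's `prcomp`. The toy matrix `norm_data` is hypothetical, and the report itself may select or transform features differently before running the PCA.

```r
# Toy normalized data: 4 features (rows) x 4 samples (columns), two conditions
norm_data <- cbind(
  A1 = c(1.0, 2.0, 3.0, 4.0),
  A2 = c(1.1, 2.1, 2.9, 4.2),
  B1 = c(4.0, 3.0, 2.0, 1.0),
  B2 = c(4.2, 2.9, 2.1, 1.1)
)

# prcomp() expects observations (samples) in rows; it centers by default
pca <- prcomp(t(norm_data))

# Percentage of the total variance carried by each principal component
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of the samples on the first principal component
pc1 <- pca$x[, "PC1"]
```

In this toy example the two conditions fall on opposite sides of zero along PC1, which is the pattern the paragraph above describes for a well-behaved experiment.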
```{r PCA, echo=FALSE,fig.align="center",fig.cap="Figure 4: First three components of a Principal Component Analysis, with percentages of variance associated with each axis.", out.width="1200px"}
```{r PCA, echo=FALSE,fig.align="center",fig.cap="Figure 4: First three components of a Principal Component Analysis, with percentages of variance associated with each axis.", fig.height=4, out.width="1200px"}
Boxplots are often used as a qualitative check of the normalization process, as they show how the distributions are globally affected by it. We expect normalization to stabilize distributions across samples. Figure 5 shows boxplots of raw (left) and normalized (right) data.
```{r boxplot, echo=FALSE,fig.align="center",fig.cap="Figure 5: Boxplots of raw (left) and normalized (right) read counts.", out.width="1200px", warning=FALSE}
```{r boxplot, echo=FALSE,fig.align="center",fig.cap="Figure 5: Boxplots of raw (left) and normalized (right) read counts.", out.width="1200px", fig.height=4, warning=FALSE}
cat("The DESeq2 model assumes that the count data follow a negative binomial distribution, which is a robust alternative to the Poisson law when data are over-dispersed (the variance is higher than the mean). The first step of the statistical procedure is to estimate the dispersion of the data. Its purpose is to determine the shape of the mean-variance relationship. The default is to apply a GLM (Generalized Linear Model) based method (fitType='parametric'), which can handle complex designs but may not converge in some cases.\n")
...
@@ -208,7 +204,7 @@ cat("The figure 6 shows the result of the dispersion estimation step. The x- and
cat("For the differential marking/binding analysis we use the limma approach to RNA-seq [@ritchie2015]. Read counts are converted to log2-counts-per-million (logCPM) and the mean-variance relationship is modelled either with precision weights or with an empirical Bayes prior trend. Here we use the precision weights approach called “voom” [@law2014]. This transformation allows the linear modelling in the limma package to be applied to sequencing data. The systematic variability of the data is modeled with a linear approach to differentiate it from the random variability. This linear modeling is very similar to classical ANOVA or multiple regression, except that a model is fitted to each peak.")
...
@@ -225,7 +221,6 @@ cat("The figure 6 shows the result of the variance estimation step. The x- and y
Figure 7 shows the distributions of raw p-values computed by the statistical test for the comparison(s) performed. This distribution is expected to be a mixture of a uniform distribution on $[0,1]$ and a peak around 0 corresponding to the differentially expressed features.
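The expected shape can be reproduced with simulated p-values. The sketch below is purely illustrative: the proportions (9000 null features, 1000 differential ones) and the Beta distribution used for the peak near 0 are arbitrary assumptions, not values taken from this analysis.

```r
set.seed(1)

# Features with no true difference: p-values are uniform on [0, 1]
p_null <- runif(9000)

# Differentially expressed features: p-values concentrated near 0
# (a Beta(0.5, 50) distribution is an arbitrary choice for illustration)
p_alt <- rbeta(1000, shape1 = 0.5, shape2 = 50)

pvals <- c(p_null, p_alt)

# Bin the p-values without drawing; use plot = TRUE to display the histogram
h <- hist(pvals, breaks = seq(0, 1, by = 0.05), plot = FALSE)

# The first bin holds the peak; the remaining bins are roughly flat
first_bin  <- h$counts[1]
other_bins <- h$counts[-1]
```

A histogram that departs strongly from this shape, for example one with a peak near 1 or a hill in the middle, usually suggests that the model or the filtering deserves a closer look.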
```{r, echo=FALSE,fig.align="center",fig.cap="Figure 7: Distribution(s) of raw p-values", out.width="600px"}
Figure 8 shows the MA-plot of the data for each comparison performed, where differentially expressed features are highlighted in red. An MA-plot represents the log ratio of differential expression as a function of the mean intensity for each feature. Triangles correspond to features whose $\log_2(\text{FC})$ is too low or too high to be displayed on the plot.
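The quantities behind an MA-plot can be sketched as follows. This uses the generic textbook definition (M is the log2 ratio, A is the average log2 intensity) on made-up values; the helper `ma_values`, the toy vectors and the axis limits are hypothetical, and the plotting function used by this report may define the x-axis differently (for instance as the mean of normalized counts).

```r
# Hypothetical helper: M (log2 fold change) and A (mean log2 intensity)
# for two vectors of strictly positive normalized values
ma_values <- function(x, y) {
  data.frame(
    A = (log2(x) + log2(y)) / 2,
    M = log2(y) - log2(x)
  )
}

cond1 <- c(2, 100, 16, 1)     # toy normalized values, condition 1
cond2 <- c(8, 100, 4, 1024)   # toy normalized values, condition 2
ma <- ma_values(cond1, cond2)

# Features outside the displayed range are drawn at the border (as triangles)
ylim <- c(-3, 3)
ma$out_of_range <- ma$M < ylim[1] | ma$M > ylim[2]
ma$M_plot <- pmin(pmax(ma$M, ylim[1]), ylim[2])
```

Clamping `M` to the axis limits keeps extreme features visible at the edge of the plot instead of silently dropping them, which is the role of the triangles mentioned above.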
```{r MAplot, echo=FALSE,fig.align="center",fig.cap="Figure 8: MA-plot(s) of each comparison. Red dots represent significantly differentially expressed features.", out.width="600px"}