From b7c294cc0cbbd64ace8b889e36793ddeae536d1e Mon Sep 17 00:00:00 2001
From: Natalia  PIETROSEMOLI <natalia.pietrosemoli@pasteur.fr>
Date: Thu, 19 May 2022 17:06:52 +0200
Subject: [PATCH] Update Monday/practicals_day1_PHINDAccess.Rmd Deleted
 Monday/Annotation_bascis.md

---
 Monday/Annotation_bascis.md            | 248 ----------
 Monday/practicals_day1_PHINDAccess.Rmd | 608 +++++++++++++++++++++++++
 2 files changed, 608 insertions(+), 248 deletions(-)
 delete mode 100644 Monday/Annotation_bascis.md
 create mode 100644 Monday/practicals_day1_PHINDAccess.Rmd

diff --git a/Monday/Annotation_bascis.md b/Monday/Annotation_bascis.md
deleted file mode 100644
index b98007f..0000000
--- a/Monday/Annotation_bascis.md
+++ /dev/null
@@ -1,248 +0,0 @@
-# Hands-On: Annotation basics 
-
-
-
-## Insstallations
-
-Download data from server :
-
-`wget https://dl.pasteur.fr/fop/HJfzm2Py/ChIP_data.tar`
-
-Untar data:
-
-`tar xvf ChIP_data.tar`
-
-Download reference genomes files from server:
-
-`wget https://dl.pasteur.fr/fop/lroDilwn/ReferenceGenome.tar`
-
-Untar data:
-
-`tar xvf ReferenceGenome.tar`
-
-## get ePeak on your home
-
-* Load modules (ON CLUSTER ONLY)
-
-```
-module load snakemake/6.5.0
-module load python/3.7
-module load singularity
-module load git-lfs/2.13.1
-module load pysam
-```
-
-* Clone workflow:
-
-`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
-
-* Download singularity container:
-
-```
-cd ePeak
-singularity pull --arch amd64 --name epeak.img  library://rlegendre/epeak/epeak:1.0
-```
-
-## configure ePeak
-
-Open config/config.yaml and config/design.txt files
-
-* **Design file:** tabulated file of 4 columns.
-
-**Column 1** is the name of the IP file
-
-**Column 2** is the name of the corresponding INPUT file
-
-**Column 3** is the replicate number of IP file
-
-**Column 4** is the replicate number of the corresponding INPUT file
-
-
-```
-IP_NAME	INPUT_NAME	NB_IP	NB_INPUT
-H3K27ac_shCtrl	INPUT_shCtrl	1	1
-H3K27ac_shCtrl	INPUT_shCtrl	2	1
-H3K27ac_shUbc9	INPUT_shUbc9	1	1
-H3K27ac_shUbc9	INPUT_shUbc9	2	1
-Klf4_shCtrl	INPUT_shCtrl	1	1
-Klf4_shCtrl	INPUT_shCtrl	2	2
-Klf4_shUbc9	INPUT_shUbc9	1	1
-Klf4_shUbc9	INPUT_shUbc9	2	2
-```
-
-* **Config file:** yaml file containing all tools parameters
-
-This file is divided into _chunks_. Each chunk correspond to one step or one tool.
-
-
-This first chunk provides input information and assigns working directories. 
-`input_dir` path to FASTQ files directory. 
-`input_mate` mate pair format (i.e. `_R[12]` for *MATE* = R1 or R2) , must match the *MATE* parameter in FASTQ files.
-`input_extension` filename extension format (i.e. `fastq.gz` or `fq.gz`).
-`analysis_dir` path to analysis directory.
-`tmpdir` path to temporary directory (i.e. `/tmp/` or other)
-
-```
-input_dir: ../ChIP_data
-input_mate: '_R[12]'
-input_extension: '.fastq.gz'
-analysis_dir: $HOME #define for each user
-tmpdir: $TMPDIR
-```
-
-
-The design chunk aims to check that the FASTQ files name match the design file information. The `marks`, `conditions` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* parameters of the FASTQ files name and the design file. 
-For spike-in data, set `spike` on "True" and provide the spike-in genome FASTA file path through the `spike_genome_file` parameter.
-
-```
-design:
-    design_file: config/design.txt    
-    marks: H3K27ac, Klf4
-    condition: shCtrl, shUbc9
-    replicates: Rep
-    spike: false
-    spike_genome_file: genome/dmel9.fa
-```
-
-
-This genome chunk provides information about reference genome - directory, name of the index and path to fasta file.
-
-```
-genome:
-    index: yes
-    genome_directory: genome/
-    name: mm10
-    fasta_file: genome/mm10_chr1.fa
-```
-
-The fastqc chunk provides quality control checking of fastq files.
-
-```
-fastqc:
-    options: ''
-    threads: 4   
-```
-
-The adapters chunk is relative to quality trimming and adapter removal with cutadapt. A list of common adapters is provided under config directory and give to cutadapt (adapter_list). Then, different parameters are tuned to match precisely with the data.
-
-
-```
-adapters:
-    remove: yes
-    adapter_list: file:config/adapt.fa
-    m: 25
-    mode: a
-    options: -O 6 --trim-n --max-n 1 
-    quality: 30
-    threads: 4
-```
-
-
-The bowtie2_mapping chunk is relative to the reads mapping against genome file (provided by the genome chunk)
-
-```
-bowtie2_mapping:
-    options: "--very-sensitive --no-unal"
-    threads: 4
-```
-
-
-The mark duplicates chunk allows to mark PCR duplicate in BAM files. For ChIPseq data, IP and INPUT need to be deduplicated, so the dedup_IP parameter is set to True.
-
-
-```
-mark_duplicates:
-    do: yes
-    dedup_IP: 'True' 
-    threads: 4
-```
-
-The remove_biasedRegions chunk is relative to remove biased genomic regions (previously named blacklisted regions)
-
-```
-remove_biasedRegions:
-    do: yes
-    bed_file: genome/mm10.blacklist.bed
-    threads: 1
-```
-
-To produce metaregion profiles, coverages from each samples need to be producted.
-
-See https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
-
-```
-bamCoverage:
-    do: yes
-    options: "--binSize 10 --effectiveGenomeSize 2652783500 --normalizeUsing RPGC" 
-    spike-in: no
-    threads: 4
-```
-
-Set yes to geneBody chunk to produce metaregion profiles. This step need a gene model file in bed format.
-
-```
-geneBody:
-    do: yes
-    regionsFileName: genome/mm10_chr1_RefSeq.bed
-    threads: 4
-```
-
-Set all following chunks 'do' to 'no' for now.
-
-
-## run ePeak
-
-
-Test your configuration by performing a dry-run via:
-
-`snakemake --use-singularity -n --cores 1`
-
-Execute the workflow locally using $N cores via:
-
-```
-export PICARD_TOOLS_JAVA_OPTS="-Xmx8G"
-N=8
-snakemake --use-singularity --singularity-args "-B '/home/'" --cores $N
-```
-
-
-Run it specifically on Slurm cluster:
-
-`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE`
-
-
-## analyse QC reports
-
-### Look at MultiQC report
-
-- General statistics
-
-<img src="images/Multiqc_mainStats.png" width="1000" align="center" >
- 
-- Mapping with bowtie2
-
-<img src="images/bowtie2_se_plot.png" width="700" align="center" > 
-
-- Deduplication with MarkDuplicates
-
-<img src="images/picard_deduplication.png" width="700" align="center" >
-
-- Fingerplot
-
-<img src="images/deeptools_fingerprint_plot.png" width="700" align="center" >
-
-
-### Look at 05-QC directory
-
-- Cross correlation
- 
- <img src="images/H3K27ac_shCtrl_ppqt.png" width="700" align="center" >  <img src="images/Klf4_shCtrl_ppqt.png" width="700" align="center" >
-
-- GeneBody plot/heatmap
-  
-<img src="images/geneBodyplot.png" width="700" align="center" >
-
-
-
-
-Would you proceed to the analysis ? justify why
diff --git a/Monday/practicals_day1_PHINDAccess.Rmd b/Monday/practicals_day1_PHINDAccess.Rmd
new file mode 100644
index 0000000..a5d34a6
--- /dev/null
+++ b/Monday/practicals_day1_PHINDAccess.Rmd
@@ -0,0 +1,608 @@
+---
+title: "Practical session - PHINDAccess course: functional annotation basics"
+author: "Hélène Lopez-Maestre, Emeline Perthame, Natalia Pietrosemoli"
+date: "23 May 2022"
+output:
+  html_document:
+    code_folding: hide
+    theme: spacelab
+    toc: yes
+    toc_float: yes
+editor_options: 
+  chunk_output_type: console
+---
+
+```{r setup, message=FALSE, warning=FALSE}
+# option for html output
+knitr::opts_chunk$set(echo = TRUE)
+
+# Set the random generator seed so we can reproduce exactly results without any stochasticity
+set.seed(1234)
+
+# Load packages that will be useful for the analysis
+# library(FactoMineR)
+# library(limma)
+# library(factoextra)
+library(ggplot2)
+# library(AnnotationDbi)
+library(org.Hs.eg.db)
+# library(dplyr)
+# library(RColorBrewer)
+library(KEGGREST)
+library(biomaRt)
+# library(DESeq2)
+# library(pheatmap)
+library(progress)
+library(pathview)
+# library(gage)
+library(png)
+
+# Cosmetic choice: set light theme for ggplot2 plots
+theme_set(theme_light())
+
+# Set significance level for the statistical tests (e.g. False Discovery Rate)
+alpha <- 0.05
+```
+
+# I. Functional annotation
+
+**Goals**
+
+-  Identify and understand the steps of a functional analysis
+
+- Understand the output of provided R code lines. There are often several ways to write an R code obtaining the same result. This practical offers one possible solution! 
+
+- Even if you are not expected to be completely independent, ideally, by the end of the course, you should be able to reproduce a similar analysis. 
+
+## A. Annotation of Mycobacterium tuberculosis using the Gene Ontology
+
+Genes are annotated into Gene Ontology categories (terms) by some biological property. We need 
+to look for the specific annotation for our own organism.
+
+Then, we can annotate our list of genes that we have obtained,   for example, from the differential analysis of a comparison of two biological conditions.
+
+### Load data 
+
+For this exercise, the input dataset is from an experiment conducted in  Myc tb, with 3 biological conditions (FR, CR and HR) and 3 replicates per condition.
+
+```{r, message=FALSE, warning=FALSE}
+# Import  data
+
+## Define YOUR OWN working directory:
+mywd = "/Users/npietros/Documents/Course-PHINDAccess-2022/Hands on Day 1/"
+setwd(mywd)
+
+## Read the matrix of gene counts as obtained from a typical sequencing experiment.
+count.data <- read.csv(file="data/counts_myc.csv",
+                        sep="\t", 
+                        header = T, 
+                        row.names = 1)
+
+```
+
+```{r}
+## Play a bit with the parameters (e.g. define no separator) and check the results
+count.data <-read.csv(file="data/counts_myc.csv",
+                        header = T, 
+                        row.names = 1)
+```
+
+```{r}
+##  Explore the  data  
+# View(count.data)
+# head(count.data)
+
+# Q1: How many genes do you have? How many samples? 
+# dim(count.data)
+# nrow(count.data) # 4302 genes
+# ncol(count.data) # 9 samples
+
+# Q2: What is the range of the counts for the first 5 genes?
+# min(count.data[1,])
+# max(count.data[1,])
+# min(count.data[5,])
+# max(count.data[5,])
+# min(count.data[1:5,])
+# max(count.data[1:5,])
+
+# extract the first column 
+# count.data[,1]
+
+# extract the first row 
+# count.data[1,]
+
+# Get the summary statistics for each column. What kind of information does it provide ?
+# summary(count.data)
+
+# Now, get the summary statistics per row.  What kind of information does it provide ?
+
+sumPergene <- summary(t(count.data))
+# sumPergene[,1:5]
+
+# Q3: What genes have 0 counts across all the samples? And only one count?
+# head(rowSums(count.data))
+# this corresponds to the genes that are no expressed in the conditions
+# basic filtering of the data --> remove genes that are not expressed
+# which(rowSums(count.data) == 0)
+# rowSums(count.data)[1190]
+# which(rowSums(count.data) == 1)
+# count.data[ which(rowSums(count.data) == 1) , ]
+
+## Filter those genes for which you don't have enough read counts  across all samples.
+## In this case, we will filter out those genes having less than 10 counts.
+count.data <- count.data[which(rowSums(count.data) > 10),]
+
+# Q4: How many genes do you have now?
+# dim(count.data)
+
+```
+
+### Annotate genes according to their Gene Ontology 
+
+For this part, we will be using PATRIC, the Pathosystems Resource Integration Center knowledge
+data base.
+
+https://www.patricbrc.org/
+
+The aim of this exercise is to learn how to look for annotations for
+our organism and use them to annotate genes present in our dataset. This procedure is generic for other organisms, as long as they are present in PATRIC.
+
+These annotations will then allow us later to perform enrichment analysis of our genes according these GO categories.
+
+```{r Formatting, message=FALSE, warning=FALSE, fig.keep='none'}
+
+# ***** First step: select genes to annotate
+
+# Q5 Which genes are you going to annotate?
+# rownames(count.data)[1:5]
+
+# do you understand how these ids are structured? what type of annotation system do they correspond to?
+
+#  let's get the different identifiers for our genes
+
+#  let's get the gene IDs
+geneID <- gsub("\\|.*","",rownames(count.data)) # NP's solution ! 
+# geneID[1:5]
+
+# let's get the gene  symbols
+geneSymbol  <- gsub(".*\\|","",rownames(count.data))
+
+# EP's solution
+l <- strsplit(rownames(count.data),"\\|")
+geneID <- sapply(l,"[[",1)
+# geneID[1:5]
+geneSymbol <- sapply(l,"[[",2)
+# geneSymbol[1:5]
+
+# Q6:   How many genes have a geneID?. How many genes have a gene symbol? 
+# How many unique gene IDs/ gene symbols  do we have? what do you conclude on that?
+
+# length(geneID) # 4288
+# length(unique(geneID)) # 4288
+
+# length(geneSymbol) # 4288
+# length(which(geneSymbol!='NA')) # 1672
+# length(unique(geneSymbol)) # 1649
+
+# Which are the genes that have the same symbols?
+# head(duplicated(geneSymbol),10)
+# geneSymbol[duplicated(geneSymbol) & geneSymbol != "NA"]
+
+# Add these gene identifiers to the our dataset
+count.data.ids <- cbind(geneID, geneSymbol,count.data)
+
+# ***** Step 2: Find annotation
+
+########## STOP AND GO OUTSIDE of R to find the annotation file
+# https://www.patricbrc.org/view/Taxonomy/1763#view_tab=features
+# 1. once you have downloaded the file, you can load it or 
+# 2. use the file we already downloaded for you and available in the "data" folder
+
+# load annotation file  
+annotation <-read.table(file="annotation/PATRIC_genome_feature_myc.txt",
+                        sep="\t",
+                        header = T)
+
+# inspect the annotation : do you understand what each column corresponds to?
+# View(annotation)
+# Q7 Choose three columns that you think will give you interesting information
+
+# Keep only interesting columns: the gene ID identification, the Protein ID, the Symbol ID, and the GO annotation
+
+intCols <- c("RefSeq.Locus.Tag","Protein.ID", "Gene.Symbol","GO")
+annotation <- annotation[,intCols]
+# head(annotation)
+
+# ***** Step 3: annotate our count data
+
+# join your two tables (count matrix and annotation matrix) using the information they have in common: the gene ID
+ 
+annot.count.data <- merge(count.data.ids, annotation, 
+      by.x = "geneID",    
+      by.y = "RefSeq.Locus.Tag",
+      all.x = TRUE, 
+      sort = FALSE)
+
+# reorder columns to have all annotation at the begining
+Datacols <- grep("R[123]", colnames(annot.count.data ))
+
+# c(grep("R[123]",colnames(annot.count.data), invert = TRUE), Datacols)
+
+annot.count.data <- annot.count.data[, c(grep("R[123]",colnames(annot.count.data), invert = TRUE), Datacols)]
+
+# Q8 How many genes have GO annotation?
+# length(na.omit(annot.count.data$GO))
+
+# Q8bis How many unique GO annotation do we have?
+# length(unique(annot.count.data$GO)) # 589 !!
+# 
+# Save the table for further inspection
+write.table(annot.count.data, 
+            file = "output/annotatedFilteredCounts_myc.txt",
+            sep ="\t",
+            row.names = F)
+
+
+#########  Extra step: create a list of go terms 
+# This list structure will be used for the Functional Enrichment Hands On.
+
+# prepare a  list of all genes and their GO ids
+x <- list(a=1,b=2,c=3,d=2)
+tmp <- stack(x)
+# split(as.character(tmp$ind), tmp$values)
+
+GO_id = sapply(annotation$GO,FUN=function(x){stringr::str_extract_all(x,pattern="GO:[[:digit:]]+")})
+names(GO_id) = annotation$RefSeq.Locus.Tag
+# GO_id[["ERDMAN_1882"]]
+
+# prepare  a vector of  GO terms
+vect_GO = unlist(GO_id)
+# note that genes with no GO terms are removed because the element of the list is empty
+
+# now make a list of unique GO terms and the corresponding annotated genes
+# for a speficic GO term for example
+# x <- "GO:0003989"
+# names(vect_GO)[which(vect_GO %in% x)]
+
+unique_go = unique(vect_GO)
+list_go = lapply(unique_go,FUN=function(x){names(vect_GO)[which(vect_GO %in% x)]})
+names(list_go) = unique_go
+
+# Q9 How many GO terms does this organism have? 
+# leng th(list_go) # 633
+# or alternatively 
+# length(unique_go)
+
+# head(lengths(list_go)) # very sparse 
+# or alternatively 
+# head(sapply(list_go,length))
+# head = 6 FIRST values of an object
+# tail = 6 LAST values of an object
+# tail(lengths(list_go))
+
+# save this list for further use
+save(list_go, file = "output/list_go_myc.RData")
+
+# save the vector of GO in the dataset
+write.table(unique_go, 
+            file = "output/GOterms_myc.txt",
+            sep ="\t",
+            quote = F,
+            row.names = F, col.names = "GOterm")
+
+```
+
+## B. Functional annotation of Human - KEGG Pathways
+
+The aim of this exercise is to learn how to look for KEGG annotations for Human and use them to annotate genes present in our dataset before performing any enrichment analysis.
+
+
+
+
+This same procedure can be used to annotate any other organisms present in the Kegg database.
+
+### Get Kegg pathways for Human
+
+KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
+
+This time, we can use KEGGREST R package to directly query the KEGG database and find the annotation for our organism without having to manually download an annotation file.
+
+The vignette of the package is very useful (http://bioconductor.org/packages/release/bioc/vignettes/KEGGREST/inst/doc/KEGGREST-vignette.html)
+
+```{r Annotate, warning=FALSE,  verbose = FALSE}
+
+# ***** First step : Find annotation
+
+#listDatabases() # list the KEGG databases which may be searched
+
+# Obtain a list of available organisms in KEGG. This command queries directly
+# Kegg's DB. It may take little while
+
+
+org <- keggList("organism")
+
+# Inspect this object:
+# head(org)
+
+# Q1: Do you know now how to look for your organism index, what is it?
+
+index <- "T01001"
+
+# Show entry:
+# org[1,]
+# get id in the database "T.number"
+# org[1,1]
+
+# Alternatively, you can go to KEGG website and look for the ID of Homo sapiens
+
+# get pathway/gene list for this organism
+#Keggdata <- keggLink("pathway",index)
+Keggdata <- keggLink("pathway","T01001")
+# Inspect this object:
+# head(Keggdata)
+# how this object is structured? 
+
+# length(Keggdata)
+# dim(Keggdata) # NULL 
+# Keggdata[1] # id of a pathway
+# names(Keggdata)[1] # ENTREZ ID of a gene
+
+# Q2 how many genes belong to at least one Kegg pathway? 
+keggGenes <- unique(names(Keggdata))
+# length(keggGenes)
+
+# Inspect this object:
+# head(keggGenes) # ENTREZ IDs of these genes
+
+# remove the organism prefix
+keggGenes <- sub("hsa:","",keggGenes)
+
+# Q3 What kind of geneIDs do we have now?? 
+# it is ENTREZ IDs --> only numbers 
+# head(keggGenes)
+
+# Q4 how many pathways are there for this organism?
+pathList <- unique(Keggdata)
+# head(pathList)
+# length(pathList)
+
+# define a single pathway query to inspect returned object:
+# 
+query <- keggGet(pathList[1])
+# query[[1]]
+# names(query[[1]])
+
+# Q5 Qhat is the name of the pathway? How many genes does it have?
+# hsa00010 
+# "Glycolysis / Gluconeogenesis"
+# query[[1]]$NAME
+# length(query[[1]]$GENE) / 2 # because 2 lines per genes (entrez and gene name)
+
+# Q6 What is the gene id used to describe the genes listed in a pathway?
+# head(query[[1]]$GENE)
+
+# keggGet() can also return images of you pathway of interest !
+#
+png <- keggGet("hsa05130", "image") 
+t <- tempfile()
+library(png)
+writePNG(png, t)
+if (interactive()) browseURL(t)
+
+png <- keggGet("hsa00010", "image") 
+t <- tempfile()
+
+writePNG(png, t)
+if (interactive()) browseURL(t)
+
+# Now that you have done the query for one pathway,
+# you need to do it for all
+
+query <- keggGet(pathList[1:2])
+
+# create a pathways list
+# fullPathList <- list()
+
+# Depending if you want to create the pathway list or to directly load it,
+# you will need to uncomment the following lines
+
+# ##### NOTE YOU ONLY NEED TO DO THIS ONCE ####
+## Create a list of pathways
+# since this will take some time, we add a progress line to assess the time
+# it will require (and do not dispair ;-)
+# pb <- progress_bar$new(total = length(pathList), width= 60)
+# for (i in 1:length(pathList)){
+#   pb$tick() # cosmetik
+#   query <- keggGet(pathList[i])
+#   idx <- grep(c("ENTRY|NAME|CLASS|PATHWAY_MAP|GENE"), names(query[[1]]))
+#   fullPathList[[i]] <- query[[1]][idx]
+# }
+# names(fullPathList) <- pathList
+# save(fullPathList,file = "annotation/fullPathList_KEGG_Human.RData")
+
+# ## as creating the pathway list  takes time, we saved the object into a RData file that you may load using:
+
+load("output/fullPathList_KEGG_Human.RData")
+
+# What does this list contain? 
+# length(fullPathList)
+# fullPathList[[1]]
+# names(fullPathList[[1]])
+#######
+
+# drop "empty pathways" (do you get what "empty pathways" are?)
+
+toDrop <- unlist(lapply(fullPathList, function(x) is.null(x$GENE)))
+filtPathList <- fullPathList[-which(toDrop)]
+
+# Q7 How many pathways are there for this organism after filtering? 
+# length(filtPathList)
+# 12 pathways were empty 
+
+# Decompose the full list of pathways into three smaller lists for better manipulation
+
+Kegg.GeneIDs <-list()
+Kegg.GeneSymbols <-list()
+Kegg.Genes <-list()
+
+KeggPathNames <-list()
+for (i in 1:length(filtPathList)){
+  # get the number of genes belonging to the pathway
+  numGenes = c(1:length(filtPathList[[i]]$GENE))  
+  # get the gene IDs
+  Kegg.GeneIDs[[i]] <- filtPathList[[i]]$GENE[numGenes%%2==1]
+  # get the gene name and its description
+  tmp <- filtPathList[[i]]$GENE[numGenes%%2==0]
+  tmp <- strsplit(tmp, split = "]")
+  Kegg.GeneSymbols[[i]] <- paste0(unlist(lapply(tmp,function(x) x[1])),"]")
+  Kegg.Genes[[i]] <- paste(Kegg.GeneIDs[[i]], Kegg.GeneSymbols[[i]], sep = "||")
+  Pathname = strsplit(fullPathList[[i]]$NAME, split = " -")
+  KeggPathNames[[i]] <-unlist(lapply(Pathname,function(x) x[1]))
+}
+
+names(Kegg.GeneIDs) <- names(Kegg.GeneSymbols) <- names(Kegg.Genes) <- KeggPathNames
+
+# head(Kegg.GeneIDs) ## Useful for GSEA analysis tomorrow ! 
+# head(Kegg.GeneSymbols)
+# head(Kegg.Genes)
+
+
+# Now, let's create a table that we can write into a file
+
+pathwayTable <- setNames(data.frame(matrix(ncol = 2, nrow =length(KeggPathNames))),c("Pathway","Genes"))
+
+for (i in 1:length(KeggPathNames)){
+ pathwayTable$Pathway[i] <- KeggPathNames[[i]] 
+ pathwayTable$Genes[i] <-  paste(Kegg.Genes[[i]],collapse = "///") 
+}
+
+# pathwayTable[1,]
+# dim(pathwayTable)
+
+write.table(pathwayTable, file = "output/KeggPathways_human.txt", 
+            sep ="\t", 
+            col.names = T, 
+            row.names = FALSE)
+
+```
+
+### Load data 
+
+For this exercise, input data consists in a human dataset, with 2 biological conditions (C and T) and 3 replicates per condition.
+
+
+### Annotate our own count table
+
+Now that we have explored the KEGGrest package functionalities, we are going to annotate our dataset. The aim of this section is to have a correspondence between the list 
+`Kegg.GeneIDs` which contains the correspondance between ENTREZ gene ids and KEGG pathways and the genes identifiers of the count table.
+
+```{r loading,message=FALSE, warning=FALSE}
+
+# Explore the format of this object
+# head(Kegg.GeneIDs)
+
+# Import  count table
+
+## Read the count data
+count.data <-read.table(file="data/counts_human.txt",
+                        header = T) 
+
+# Q1. What is the gene annotation system for this dataset?
+# rownames(count.data)[1:10]  
+# ENSEMBL 
+# gene ids are not ENTREZ ids in this dataset so we need to convert rownames 
+```
+
+#### Using org.Hs.eg.db package 
+
+org.Hs.eg.db is an R package that contains features annotation for human and allows to go from ENSEMBL ids to ENTREZids etc (check https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html for documentation)
+
+Note : the ENSEMBL version should match between the mapping bioinformatic step and the annotation step --> check the package release on Bioconductor and on ENSEMBL website
+
+```{r org.hs.ed.db,message=FALSE, warning=FALSE}
+entrezIDs <- mapIds(org.Hs.eg.db,keys=rownames(count.data),keytype="ENSEMBL",column="ENTREZID")
+# entrezIDs[1]
+entrezIDs <- data.frame(entrezIDs)
+# dim(entrezIDs)
+# head(entrezIDs,100)
+entrezIDs <- entrezIDs[! is.na(entrezIDs$entrezIDs),,drop=FALSE]
+# dim(entrezIDs)
+
+# in this part, we make sure that all ids are unique as functional enriments methods require unique gene identifiers
+# For repeated entries (doubles), we will keep the most variant gene
+
+# merge count data and annotation matching dataframe
+count.data.solution1 <- merge(count.data,entrezIDs, by.x =0, by.y = 0)
+# example of repetitions
+# count.data.solution1[count.data.solution1$entrezIDs == "100652739",]
+# compute the standard deviation of each gene, then order genes by variability, identify doublons that are less variable and remove them 
+count.data.solution1$std <- apply(count.data.solution1[,c("C1","C2","C3","T1","T2","T3")],1,sd)
+# count.data.solution1[count.data.solution1$entrezIDs == "100652739",]
+
+count.data.solution1 <- count.data.solution1[order(count.data.solution1$std, decreasing = TRUE),]
+# see, doublons are in decreasing order for variability
+# count.data.solution1[count.data.solution1$entrezIDs == "100652739",]
+# then identify and remove doubles
+count.data.solution1 <- count.data.solution1[ ! duplicated(count.data.solution1$entrezIDs),]
+# count.data.solution1[count.data.solution1$entrezIDs == "100652739",]
+# then set rownames and remove useless columns
+rownames(count.data.solution1) <- count.data.solution1$entrezIDs
+
+count.data.solution1$entrezIDs <- count.data.solution1$Row.names <- count.data.solution1$std <- NULL
+
+# Q1 how many genes did we loose after this manipulation? 
+# dim(count.data)
+# dim(count.data.solution1)
+
+# # Q2. Now, what is the gene annotation system for this dataset?
+# 
+# # We export the result into a new count table in order to use it in the GSEA section
+write.csv(count.data.solution1,file="output/annotatedCounts_Human_solution1.txt")
+```
+
+#### To go further : using biomaRt package
+
+Otherwise, one can query the ENSEMBL website using biomaRt package. For documentation, visit the vignette of the package, it is very user friendly (http://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html)
+
+Note : the ENSEMBL version should match between the mapping bioinformatic step and the annotation step --> it is an option of useEnsembl() function of biomaRt package
+
+```{r biomart,message=FALSE, warning=FALSE}
+# ENSEMBL version 92
+# retrieve ENSEMBL database
+ensembl <- useEnsembl(biomart="genes", dataset="hsapiens_gene_ensembl", version=103)
+
+# what is the structure of this object? what does it contain? 
+# head(listAttributes(ensembl), 20)
+# get annotation information for hsa
+# attention: it is an online query, may take time
+
+bm <- getBM(attributes=c("ensembl_gene_id","entrezgene_id"), 
+             filters="ensembl_gene_id",
+             values=rownames(count.data), 
+             mart=ensembl)
+
+# many other attributes exist, you may access to the list using 
+# searchAttributes(mart = ensembl, pattern = "gene_name")
+# or
+# listAttributes(ensembl)
+# and then we will have to explore the 2 objects to find the syntax you need...
+
+# head(bm)
+# dim(bm) # 58630
+# sum(duplicated(bm$ensembl_gene_id)) # 210
+
+# repeat the same procedure as before to get unique ENTREZ ids for count table
+# count.data.solution2 .... 
+
+# # We export the result into a new count table in order to use it in the GSEA section
+# write.csv(count.data.solution2,file="output/annotatedCounts_Human_solution2.txt")
+```
+
+
+# R session information
+
+In the reproducible research framework, an important step is to save all the versions of the packages used to perform the statistical analysis. They must be provided when submitting a paper.
+
+```{r sessionInfo, results='asis'}
+sessionInfo()
+```
-- 
GitLab