Update Annotation_bascis.md

a9d01cf0 · Claudia CHICA · 8dd73dc7 · a9d01cf0
Commit a9d01cf0 authored 3 years ago by Claudia CHICA
--- a/Annotation_bascis.md
+++ b/Annotation_bascis.md
+# Hands-On: Annotation basics 
+
+
+
+## Insstallations
+
+Download data from server :
+
+`wget https://dl.pasteur.fr/fop/HJfzm2Py/ChIP_data.tar`
+
+Untar data:
+
+`tar xvf ChIP_data.tar`
+
+Download reference genomes files from server:
+
+`wget https://dl.pasteur.fr/fop/lroDilwn/ReferenceGenome.tar`
+
+Untar data:
+
+`tar xvf ReferenceGenome.tar`
+
+## get ePeak on your home
+
+* Load modules (ON CLUSTER ONLY)
+
+```
+module load snakemake/6.5.0
+module load python/3.7
+module load singularity
+module load git-lfs/2.13.1
+module load pysam
+```
+
+* Clone workflow:
+
+`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
+
+* Download singularity container:
+
+```
+cd ePeak
+singularity pull --arch amd64 --name epeak.img  library://rlegendre/epeak/epeak:1.0
+```
+
+## configure ePeak
+
+Open config/config.yaml and config/design.txt files
+
+* **Design file:** tabulated file of 4 columns.
+
+**Column 1** is the name of the IP file
+
+**Column 2** is the name of the corresponding INPUT file
+
+**Column 3** is the replicate number of IP file
+
+**Column 4** is the replicate number of the corresponding INPUT file
+
+
+```
+IP_NAME	INPUT_NAME	NB_IP	NB_INPUT
+H3K27ac_shCtrl	INPUT_shCtrl	1	1
+H3K27ac_shCtrl	INPUT_shCtrl	2	1
+H3K27ac_shUbc9	INPUT_shUbc9	1	1
+H3K27ac_shUbc9	INPUT_shUbc9	2	1
+Klf4_shCtrl	INPUT_shCtrl	1	1
+Klf4_shCtrl	INPUT_shCtrl	2	2
+Klf4_shUbc9	INPUT_shUbc9	1	1
+Klf4_shUbc9	INPUT_shUbc9	2	2
+```
+
+* **Config file:** yaml file containing all tools parameters
+
+This file is divided into _chunks_. Each chunk correspond to one step or one tool.
+
+
+This first chunk provides input information and assigns working directories. 
+`input_dir` path to FASTQ files directory. 
+`input_mate` mate pair format (i.e. `_R[12]` for *MATE* = R1 or R2) , must match the *MATE* parameter in FASTQ files.
+`input_extension` filename extension format (i.e. `fastq.gz` or `fq.gz`).
+`analysis_dir` path to analysis directory.
+`tmpdir` path to temporary directory (i.e. `/tmp/` or other)
+
+```
+input_dir: ../ChIP_data
+input_mate: '_R[12]'
+input_extension: '.fastq.gz'
+analysis_dir: $HOME #define for each user
+tmpdir: $TMPDIR
+```
+
+
+The design chunk aims to check that the FASTQ files name match the design file information. The `marks`, `conditions` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* parameters of the FASTQ files name and the design file. 
+For spike-in data, set `spike` on "True" and provide the spike-in genome FASTA file path through the `spike_genome_file` parameter.
+
+```
+design:
+    design_file: config/design.txt    
+    marks: H3K27ac, Klf4
+    condition: shCtrl, shUbc9
+    replicates: Rep
+    spike: false
+    spike_genome_file: genome/dmel9.fa
+```
+
+
+This genome chunk provides information about reference genome - directory, name of the index and path to fasta file.
+
+```
+genome:
+    index: yes
+    genome_directory: genome/
+    name: mm10
+    fasta_file: genome/mm10_chr1.fa
+```
+
+The fastqc chunk provides quality control checking of fastq files.
+
+```
+fastqc:
+    options: ''
+    threads: 4   
+```
+
+The adapters chunk is relative to quality trimming and adapter removal with cutadapt. A list of common adapters is provided under config directory and give to cutadapt (adapter_list). Then, different parameters are tuned to match precisely with the data.
+
+
+```
+adapters:
+    remove: yes
+    adapter_list: file:config/adapt.fa
+    m: 25
+    mode: a
+    options: -O 6 --trim-n --max-n 1 
+    quality: 30
+    threads: 4
+```
+
+
+The bowtie2_mapping chunk is relative to the reads mapping against genome file (provided by the genome chunk)
+
+```
+bowtie2_mapping:
+    options: "--very-sensitive --no-unal"
+    threads: 4
+```
+
+
+The mark duplicates chunk allows to mark PCR duplicate in BAM files. For ChIPseq data, IP and INPUT need to be deduplicated, so the dedup_IP parameter is set to True.
+
+
+```
+mark_duplicates:
+    do: yes
+    dedup_IP: 'True' 
+    threads: 4
+```
+
+The remove_biasedRegions chunk is relative to remove biased genomic regions (previously named blacklisted regions)
+
+```
+remove_biasedRegions:
+    do: yes
+    bed_file: genome/mm10.blacklist.bed
+    threads: 1
+```
+
+To produce metaregion profiles, coverages from each samples need to be producted.
+
+See https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
+
+```
+bamCoverage:
+    do: yes
+    options: "--binSize 10 --effectiveGenomeSize 2652783500 --normalizeUsing RPGC" 
+    spike-in: no
+    threads: 4
+```
+
+Set yes to geneBody chunk to produce metaregion profiles. This step need a gene model file in bed format.
+
+```
+geneBody:
+    do: yes
+    regionsFileName: genome/mm10_chr1_RefSeq.bed
+    threads: 4
+```
+
+Set all following chunks 'do' to 'no' for now.
+
+
+## run ePeak
+
+
+Test your configuration by performing a dry-run via:
+
+`snakemake --use-singularity -n --cores 1`
+
+Execute the workflow locally using $N cores via:
+
+```
+export PICARD_TOOLS_JAVA_OPTS="-Xmx8G"
+N=8
+snakemake --use-singularity --singularity-args "-B '/home/'" --cores $N
+```
+
+
+Run it specifically on Slurm cluster:
+
+`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE`
+
+
+## analyse QC reports
+
+### Look at MultiQC report
+
+- General statistics
+
+<img src="images/Multiqc_mainStats.png" width="1000" align="center" >
+ 
+- Mapping with bowtie2
+
+<img src="images/bowtie2_se_plot.png" width="700" align="center" > 
+
+- Deduplication with MarkDuplicates
+
+<img src="images/picard_deduplication.png" width="700" align="center" >
+
+- Fingerplot
+
+<img src="images/deeptools_fingerprint_plot.png" width="700" align="center" >
+
+
+### Look at 05-QC directory
+
+- Cross correlation
+ 
+ <img src="images/H3K27ac_shCtrl_ppqt.png" width="700" align="center" >  <img src="images/Klf4_shCtrl_ppqt.png" width="700" align="center" >
+
+- GeneBody plot/heatmap
+  
+<img src="images/geneBodyplot.png" width="700" align="center" >
+
+
+
+
+Would you proceed to the analysis ? justify why