Commit a9d01cf0 authored by Claudia CHICA
Update Annotation_bascis.md

parent 8dd73dc7
# Hands-On: Annotation basics
## Installation
Download the data from the server:
`wget https://dl.pasteur.fr/fop/HJfzm2Py/ChIP_data.tar`
Untar the data:
`tar xvf ChIP_data.tar`
Download the reference genome files from the server:
`wget https://dl.pasteur.fr/fop/lroDilwn/ReferenceGenome.tar`
Untar the reference genome files:
`tar xvf ReferenceGenome.tar`
## get ePeak in your home directory
* Load modules (ON CLUSTER ONLY)
```
module load snakemake/6.5.0
module load python/3.7
module load singularity
module load git-lfs/2.13.1
module load pysam
```
* Clone workflow:
`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
* Download singularity container:
```
cd ePeak
singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
```
## configure ePeak
Open the `config/config.yaml` and `config/design.txt` files.
* **Design file:** tab-delimited file with 4 columns.
**Column 1** is the name of the IP file
**Column 2** is the name of the corresponding INPUT file
**Column 3** is the replicate number of IP file
**Column 4** is the replicate number of the corresponding INPUT file
```
IP_NAME INPUT_NAME NB_IP NB_INPUT
H3K27ac_shCtrl INPUT_shCtrl 1 1
H3K27ac_shCtrl INPUT_shCtrl 2 1
H3K27ac_shUbc9 INPUT_shUbc9 1 1
H3K27ac_shUbc9 INPUT_shUbc9 2 1
Klf4_shCtrl INPUT_shCtrl 1 1
Klf4_shCtrl INPUT_shCtrl 2 2
Klf4_shUbc9 INPUT_shUbc9 1 1
Klf4_shUbc9 INPUT_shUbc9 2 2
```
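Before launching the workflow, it can help to confirm that the design file really has four columns on every line. The function below is a minimal sketch, not part of ePeak: `check_design` is a hypothetical helper name, and it assumes the first line is a header and that the replicate columns hold unsigned integers.

```shell
# check_design FILE (hypothetical helper, not provided by ePeak)
# Print every data line of a design file that does not have exactly
# 4 whitespace-separated fields, or whose replicate columns (3 and 4)
# are not unsigned integers. Returns non-zero if any line is malformed.
check_design() {
    awk '
        NR == 1 { next }    # skip the header line
        NF == 0 { next }    # ignore blank lines
        NF != 4 || $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

From the ePeak directory, it could be called as `check_design config/design.txt`; no output and a zero exit status mean every line passed the check.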
* **Config file:** YAML file containing all tool parameters.
This file is divided into _chunks_. Each chunk corresponds to one step or one tool.
The first chunk provides input information and assigns working directories.
`input_dir` path to FASTQ files directory.
`input_mate` mate pair format (e.g. `_R[12]` for *MATE* = R1 or R2); it must match the *MATE* field of the FASTQ file names.
`input_extension` filename extension format (e.g. `fastq.gz` or `fq.gz`).
`analysis_dir` path to analysis directory.
`tmpdir` path to temporary directory (e.g. `/tmp/` or other).
```
input_dir: ../ChIP_data
input_mate: '_R[12]'
input_extension: '.fastq.gz'
analysis_dir: $HOME #define for each user
tmpdir: $TMPDIR
```
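To see which files the `input_dir`, `input_mate` and `input_extension` settings above would select, the pattern can be tried directly in the shell. This is only a sketch: `list_fastq` is a hypothetical helper, and it uses shell globbing, which agrees with the workflow's own matching for a simple pattern such as `_R[12]` but is not guaranteed to for more complex ones.

```shell
# list_fastq DIR MATE EXT (hypothetical helper, not provided by ePeak)
# Print the base names of files in DIR that end with the mate pattern
# followed by the extension, e.g. MATE='_R[12]' and EXT='.fastq.gz'.
# MATE is left unquoted on purpose so the shell expands it as a glob.
list_fastq() {
    for f in "$1"/*$2"$3"; do
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
}
```

With the values shown above, the call would be `list_fastq ../ChIP_data '_R[12]' '.fastq.gz'`. An empty result suggests that the mate pattern or the extension does not match the file names.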
The design chunk checks that the FASTQ file names match the information in the design file. The `marks`, `condition` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* fields of the FASTQ file names and of the design file.
For spike-in data, set `spike` to `true` and provide the path to the spike-in genome FASTA file through the `spike_genome_file` parameter.
```
design:
    design_file: config/design.txt
    marks: H3K27ac, Klf4
    condition: shCtrl, shUbc9
    replicates: Rep
    spike: false
    spike_genome_file: genome/dmel9.fa
```
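One way to check the design file against the FASTQ directory is to list the sample names that the design implies. The sketch below assumes that a sample name is built as `<NAME>_<REP><NUMBER>` (for example `Klf4_shCtrl_Rep1`). That is an assumption about the naming convention, so confirm it against the ePeak documentation. `expected_samples` is a hypothetical helper name.

```shell
# expected_samples DESIGN REP_PREFIX (hypothetical helper)
# Print the sorted, de-duplicated sample names implied by a design file,
# ASSUMING names are built as <NAME>_<REP_PREFIX><NUMBER>.
expected_samples() {
    awk -v rep="$2" '
        NR == 1 || NF < 4 { next }    # skip the header and short lines
        {
            print $1 "_" rep $3       # IP sample
            print $2 "_" rep $4       # matching INPUT sample
        }
    ' "$1" | LC_ALL=C sort -u
}
```

Running `expected_samples config/design.txt Rep` and comparing the result with the contents of `input_dir` shows quickly whether a replicate is missing or misnamed.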
The genome chunk provides information about the reference genome: its directory, the name of the index and the path to the FASTA file.
```
genome:
    index: yes
    genome_directory: genome/
    name: mm10
    fasta_file: genome/mm10_chr1.fa
```
The fastqc chunk configures the quality control of the FASTQ files.
```
fastqc:
    options: ''
    threads: 4
```
The adapters chunk controls quality trimming and adapter removal with cutadapt. A list of common adapters is provided in the config directory and passed to cutadapt through `adapter_list`. The remaining parameters can then be tuned to fit the data.
```
adapters:
    remove: yes
    adapter_list: file:config/adapt.fa
    m: 25
    mode: a
    options: -O 6 --trim-n --max-n 1
    quality: 30
    threads: 4
```
The bowtie2_mapping chunk controls the mapping of reads against the genome file (provided by the genome chunk).
```
bowtie2_mapping:
    options: "--very-sensitive --no-unal"
    threads: 4
```
The mark_duplicates chunk marks PCR duplicates in the BAM files. For ChIP-seq data, both IP and INPUT need to be deduplicated, so the `dedup_IP` parameter is set to `True`.
```
mark_duplicates:
    do: yes
    dedup_IP: 'True'
    threads: 4
```
The remove_biasedRegions chunk removes biased genomic regions (previously called blacklisted regions).
```
remove_biasedRegions:
    do: yes
    bed_file: genome/mm10.blacklist.bed
    threads: 1
```
To produce metaregion profiles, coverage must first be computed for each sample.
For the `--effectiveGenomeSize` value, see https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
```
bamCoverage:
    do: yes
    options: "--binSize 10 --effectiveGenomeSize 2652783500 --normalizeUsing RPGC"
    spike-in: no
    threads: 4
```
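The `--normalizeUsing RPGC` option rescales each sample to 1x average coverage, using the effective genome size given in the options. The arithmetic can be sketched as below. This is a simplified illustration, not deepTools' own code: `rpgc_scale` is a hypothetical helper, and the read count and fragment length in the usage note are made-up values.

```shell
# rpgc_scale MAPPED_READS FRAGMENT_LENGTH EFFECTIVE_GENOME_SIZE
# (hypothetical helper illustrating the 1x normalization arithmetic)
# Print the factor that rescales raw coverage to 1x average coverage:
#   scale = effective genome size / (mapped reads * fragment length)
rpgc_scale() {
    awk -v n="$1" -v len="$2" -v g="$3" \
        'BEGIN { printf "%.4f\n", g / (n * len) }'
}
```

For instance, `rpgc_scale 26527835 100 2652783500` prints `1.0000`, because 26527835 reads of 100 bp already cover the effective genome exactly once. Doubling the read count halves the factor.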
Set `do` to `yes` in the geneBody chunk to produce metaregion profiles. This step needs a gene model file in BED format.
```
geneBody:
    do: yes
    regionsFileName: genome/mm10_chr1_RefSeq.bed
    threads: 4
```
For now, set `do` to `no` in all the following chunks.
## run ePeak
Test your configuration by performing a dry-run via:
`snakemake --use-singularity -n --cores 1`
Execute the workflow locally using $N cores via:
```
export PICARD_TOOLS_JAVA_OPTS="-Xmx8G"
N=8
snakemake --use-singularity --singularity-args "-B '/home/'" --cores $N
```
To run it on a Slurm cluster:
`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE`
## analyse QC reports
### Look at MultiQC report
- General statistics
<img src="images/Multiqc_mainStats.png" width="1000" align="center" >
- Mapping with bowtie2
<img src="images/bowtie2_se_plot.png" width="700" align="center" >
- Deduplication with MarkDuplicates
<img src="images/picard_deduplication.png" width="700" align="center" >
- Fingerprint plot
<img src="images/deeptools_fingerprint_plot.png" width="700" align="center" >
### Look at 05-QC directory
- Cross correlation
<img src="images/H3K27ac_shCtrl_ppqt.png" width="700" align="center" > <img src="images/Klf4_shCtrl_ppqt.png" width="700" align="center" >
- GeneBody plot/heatmap
<img src="images/geneBodyplot.png" width="700" align="center" >
Would you proceed with the analysis? Justify your answer.