diff --git a/Annotation_bascis.md b/Annotation_bascis.md
new file mode 100644
index 0000000000000000000000000000000000000000..b98007fea73f2f9a91b3d7e230cc051882e306a7
--- /dev/null
+++ b/Annotation_bascis.md
@@ -0,0 +1,248 @@
+# Hands-On: Annotation basics
+
+
+
+## Installations
+
+Download the data from the server:
+
+`wget https://dl.pasteur.fr/fop/HJfzm2Py/ChIP_data.tar`
+
+Untar the data:
+
+`tar xvf ChIP_data.tar`
+
+Download the reference genome files from the server:
+
+`wget https://dl.pasteur.fr/fop/lroDilwn/ReferenceGenome.tar`
+
+Untar the data:
+
+`tar xvf ReferenceGenome.tar`
+
+## get ePeak in your home directory
+
+* Load modules (ON CLUSTER ONLY)
+
+```
+module load snakemake/6.5.0
+module load python/3.7
+module load singularity
+module load git-lfs/2.13.1
+module load pysam
+```
+
+* Clone the workflow:
+
+`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
+
+* Download the Singularity container:
+
+```
+cd ePeak
+singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
+```
+
+## configure ePeak
+
+Open the `config/config.yaml` and `config/design.txt` files.
+
+* **Design file:** tab-delimited file with 4 columns.
+
+**Column 1** is the name of the IP file
+
+**Column 2** is the name of the corresponding INPUT file
+
+**Column 3** is the replicate number of the IP file
+
+**Column 4** is the replicate number of the corresponding INPUT file
+
+
+```
+IP_NAME INPUT_NAME NB_IP NB_INPUT
+H3K27ac_shCtrl INPUT_shCtrl 1 1
+H3K27ac_shCtrl INPUT_shCtrl 2 1
+H3K27ac_shUbc9 INPUT_shUbc9 1 1
+H3K27ac_shUbc9 INPUT_shUbc9 2 1
+Klf4_shCtrl INPUT_shCtrl 1 1
+Klf4_shCtrl INPUT_shCtrl 2 2
+Klf4_shUbc9 INPUT_shUbc9 1 1
+Klf4_shUbc9 INPUT_shUbc9 2 2
+```
+
+* **Config file:** YAML file containing the parameters of all tools
+
+This file is divided into _chunks_. Each chunk corresponds to one step or one tool.
+
+
+The first chunk provides input information and assigns working directories.
+`input_dir`: path to the FASTQ files directory.
+`input_mate`: mate pair format (e.g.
`_R[12]` for *MATE* = R1 or R2), which must match the *MATE* parameter of the FASTQ file names.
+`input_extension`: filename extension format (e.g. `fastq.gz` or `fq.gz`).
+`analysis_dir`: path to the analysis directory.
+`tmpdir`: path to the temporary directory (e.g. `/tmp/` or another location).
+
+```
+input_dir: ../ChIP_data
+input_mate: '_R[12]'
+input_extension: '.fastq.gz'
+analysis_dir: $HOME #define for each user
+tmpdir: $TMPDIR
+```
+
+
+The design chunk is used to check that the FASTQ file names match the information in the design file. The `marks`, `condition` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* parameters of the FASTQ file names and of the design file.
+For spike-in data, set `spike` to `true` and provide the path to the spike-in genome FASTA file through the `spike_genome_file` parameter.
+
+```
+design:
+    design_file: config/design.txt
+    marks: H3K27ac, Klf4
+    condition: shCtrl, shUbc9
+    replicates: Rep
+    spike: false
+    spike_genome_file: genome/dmel9.fa
+```
+
+
+The genome chunk describes the reference genome: its directory, the name of the index and the path to the FASTA file.
+
+```
+genome:
+    index: yes
+    genome_directory: genome/
+    name: mm10
+    fasta_file: genome/mm10_chr1.fa
+```
+
+The fastqc chunk configures the quality control of the FASTQ files.
+
+```
+fastqc:
+    options: ''
+    threads: 4
+```
+
+The adapters chunk configures quality trimming and adapter removal with cutadapt. A list of common adapters is provided in the config directory and passed to cutadapt through `adapter_list`. The other parameters are then tuned to match the data.
+
+
+```
+adapters:
+    remove: yes
+    adapter_list: file:config/adapt.fa
+    m: 25
+    mode: a
+    options: -O 6 --trim-n --max-n 1
+    quality: 30
+    threads: 4
+```
+
+
+The bowtie2_mapping chunk configures the mapping of reads against the genome file provided in the genome chunk.
+
+```
+bowtie2_mapping:
+    options: "--very-sensitive --no-unal"
+    threads: 4
+```
+
+
+The mark_duplicates chunk marks PCR duplicates in BAM files. For ChIP-seq data, both IP and INPUT need to be deduplicated, so the `dedup_IP` parameter is set to `True`.
+
+
+```
+mark_duplicates:
+    do: yes
+    dedup_IP: 'True'
+    threads: 4
+```
+
+The remove_biasedRegions chunk configures the removal of biased genomic regions (previously called blacklisted regions).
+
+```
+remove_biasedRegions:
+    do: yes
+    bed_file: genome/mm10.blacklist.bed
+    threads: 1
+```
+
+To produce metaregion profiles, a coverage track must first be produced for each sample.
+
+For the choice of `--effectiveGenomeSize`, see https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
+
+```
+bamCoverage:
+    do: yes
+    options: "--binSize 10 --effectiveGenomeSize 2652783500 --normalizeUsing RPGC"
+    spike-in: no
+    threads: 4
+```
+
+Set `do` to `yes` in the geneBody chunk to produce metaregion profiles. This step needs a gene model file in BED format.
+
+```
+geneBody:
+    do: yes
+    regionsFileName: genome/mm10_chr1_RefSeq.bed
+    threads: 4
+```
+
+For now, set `do` to `no` in all the following chunks.
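Before launching the workflow, it can help to sanity-check the design file described above. The snippet below is a minimal Python sketch and is not part of ePeak: the `check_design` helper and `EXPECTED_HEADER` are hypothetical names, and it assumes the design file is tab-separated with the header line shown in the example. It only checks the column count and that the replicate numbers are integers.

```python
import csv
import io

# Assumption: the header line matches the example design file shown above.
EXPECTED_HEADER = ["IP_NAME", "INPUT_NAME", "NB_IP", "NB_INPUT"]


def check_design(handle):
    """Return a list of problems found in a tab-separated design file."""
    problems = []
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header should be: " + "\t".join(EXPECTED_HEADER))
    for number, row in enumerate(rows[1:], start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != 4:
            problems.append(
                "line %d: expected 4 columns, found %d" % (number, len(row))
            )
            continue
        if not (row[2].isdigit() and row[3].isdigit()):
            problems.append("line %d: replicate numbers must be integers" % number)
    return problems


# Small in-memory example; the third line uses spaces instead of tabs.
example = io.StringIO(
    "IP_NAME\tINPUT_NAME\tNB_IP\tNB_INPUT\n"
    "H3K27ac_shCtrl\tINPUT_shCtrl\t1\t1\n"
    "Klf4_shCtrl INPUT_shCtrl 2 2\n"
)
for problem in check_design(example):
    print(problem)
```

To check a real file, pass an open handle instead, for example `check_design(open("config/design.txt", newline=""))`. An empty list means no problem was detected by these basic checks; it does not guarantee that the names match the FASTQ files.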
+
+
+## run ePeak
+
+
+Test your configuration by performing a dry run:
+
+`snakemake --use-singularity -n --cores 1`
+
+Execute the workflow locally using $N cores:
+
+```
+export PICARD_TOOLS_JAVA_OPTS="-Xmx8G"
+N=8
+snakemake --use-singularity --singularity-args "-B '/home/'" --cores $N
+```
+
+
+Run it on a Slurm cluster:
+
+`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE`
+
+
+## analyse QC reports
+
+### Look at the MultiQC report
+
+- General statistics
+
+<img src="images/Multiqc_mainStats.png" width="1000" align="center" >
+
+- Mapping with bowtie2
+
+<img src="images/bowtie2_se_plot.png" width="700" align="center" >
+
+- Deduplication with MarkDuplicates
+
+<img src="images/picard_deduplication.png" width="700" align="center" >
+
+- Fingerprint plot
+
+<img src="images/deeptools_fingerprint_plot.png" width="700" align="center" >
+
+
+### Look at the 05-QC directory
+
+- Cross correlation
+
+ <img src="images/H3K27ac_shCtrl_ppqt.png" width="700" align="center" > <img src="images/Klf4_shCtrl_ppqt.png" width="700" align="center" >
+
+- GeneBody plot/heatmap
+
+<img src="images/geneBodyplot.png" width="700" align="center" >
+
+
+
+
+Would you proceed with the analysis? Justify your answer.