diff --git a/Tuesday/GREAT/GREAT_analysis_forEpigenomicData.md b/Tuesday/GREAT/GREAT_analysis_forEpigenomicData.md index 056ab6ceb921699df2d53d11b6fd8db9e2e997d5..0f78e02a7dafdaa9b2815db11813a4388936f198 100644 --- a/Tuesday/GREAT/GREAT_analysis_forEpigenomicData.md +++ b/Tuesday/GREAT/GREAT_analysis_forEpigenomicData.md @@ -1,236 +1,39 @@ # Hands-On: ChIPseq functional analysis -GREAT is a tool GREAT for the functional analysis of cis-regulatory regions. Assigns biological meaning to a set of non-coding genomic regions by analyzing the annotations of the nearby genes. It is thus useful for peaks identified by any epigenomic profiling technique. +GREAT is a tool GREAT for the functional analysis of cis-regulatory regions. Assigns biological meaning to a set of non-coding genomic regions by analyzing the annotations of the nearby genes. It uses an ORA (Over-Representation analyis) approach. It is useful for peaks identified by any epigenomic profiling technique. http://great.stanford.edu/public/html/ ## Data -Regions (peaks) showing significant changes in SUMO enrichment among cell cycle phases, classified in 4 profiles by profile clustering. +ChIP-seq for SUMO2 in synchronized WI-38 fibroblasts at four different cell cycle phases: G1, S, late S and G2/M. +<details> + <summary markdown="span">What's the actual input data for the functional analyis?</summary> -## get ePeak on your home +Input data for functional analysis are regions (peaks) showing significant changes in SUMO enrichment among cell cycle (CC) phases, classified in 4 profiles by profile clustering. +ChIPseq analyis done using hg19. -* Load modules (ON CLUSTER ONLY) +</details> -``` -module load snakemake/6.5.0 -module load python/3.7 -module load singularity -module load git-lfs/2.13.1 -module load pysam -``` -* Clone workflow: +## Perfom analysis -`git clone https://gitlab.pasteur.fr/hub/ePeak.git` +* Go to http://great.stanford.edu/public/html/ -* Download singularity container: -``` -cd ePeak -singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0 -``` +* Choose test and backgroung regions -## configure ePeak +<details> + <summary markdown="span">How how you chose the background regions?</summary> -Open config/config.yaml and config/design.txt files +Depends on your question: Do you want to compare the functional enrichment in comparison to any random region of the region (Whole genome) or, else, in comparison to a given subset of regions that may have a specific genomic context (e.g. all the SUMO peaks that change along the CC). -* **Design file:** tabulated file of 4 columns. +</details> -**Column 1** is the name of the IP file +* Define association rules. What's the main effect of the the different rules? -**Column 2** is the name of the corresponding INPUT file +* Repeat the analysis for the 4 clusters. -**Column 3** is the replicate number of IP file +## Conclude -**Column 4** is the replicate number of the corresponding INPUT file - - -``` -IP_NAME INPUT_NAME NB_IP NB_INPUT -H3K27ac_shCtrl INPUT_shCtrl 1 1 -H3K27ac_shCtrl INPUT_shCtrl 2 1 -H3K27ac_shUbc9 INPUT_shUbc9 1 1 -H3K27ac_shUbc9 INPUT_shUbc9 2 1 -Klf4_shCtrl INPUT_shCtrl 1 1 -Klf4_shCtrl INPUT_shCtrl 2 2 -Klf4_shUbc9 INPUT_shUbc9 1 1 -Klf4_shUbc9 INPUT_shUbc9 2 2 -``` - -* **Config file:** yaml file containing all tools parameters - -This file is divided into _chunks_. Each chunk correspond to one step or one tool. - - -This first chunk provides input information and assigns working directories. -`input_dir` path to FASTQ files directory. -`input_mate` mate pair format (i.e. `_R[12]` for *MATE* = R1 or R2) , must match the *MATE* parameter in FASTQ files. -`input_extension` filename extension format (i.e. `fastq.gz` or `fq.gz`). -`analysis_dir` path to analysis directory. -`tmpdir` path to temporary directory (i.e. `/tmp/` or other) - -``` -input_dir: ../ChIP_data -input_mate: '_R[12]' -input_extension: '.fastq.gz' -analysis_dir: $HOME #define for each user -tmpdir: $TMPDIR -``` - - -The design chunk aims to check that the FASTQ files name match the design file information. The `marks`, `conditions` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* parameters of the FASTQ files name and the design file. -For spike-in data, set `spike` on "True" and provide the spike-in genome FASTA file path through the `spike_genome_file` parameter. - -``` -design: - design_file: config/design.txt - marks: H3K27ac, Klf4 - condition: shCtrl, shUbc9 - replicates: Rep - spike: false - spike_genome_file: genome/dmel9.fa -``` - - -This genome chunk provides information about reference genome - directory, name of the index and path to fasta file. - -``` -genome: - index: yes - genome_directory: genome/ - name: mm10 - fasta_file: genome/mm10_chr1.fa -``` - -The fastqc chunk provides quality control checking of fastq files. - -``` -fastqc: - options: '' - threads: 4 -``` - -The adapters chunk is relative to quality trimming and adapter removal with cutadapt. A list of common adapters is provided under config directory and give to cutadapt (adapter_list). Then, different parameters are tuned to match precisely with the data. - - -``` -adapters: - remove: yes - adapter_list: file:config/adapt.fa - m: 25 - mode: a - options: -O 6 --trim-n --max-n 1 - quality: 30 - threads: 4 -``` - - -The bowtie2_mapping chunk is relative to the reads mapping against genome file (provided by the genome chunk) - -``` -bowtie2_mapping: - options: "--very-sensitive --no-unal" - threads: 4 -``` - - -The mark duplicates chunk allows to mark PCR duplicate in BAM files. For ChIPseq data, IP and INPUT need to be deduplicated, so the dedup_IP parameter is set to True. - - -``` -mark_duplicates: - do: yes - dedup_IP: 'True' - threads: 4 -``` - -The remove_biasedRegions chunk is relative to remove biased genomic regions (previously named blacklisted regions) - -``` -remove_biasedRegions: - do: yes - bed_file: genome/mm10.blacklist.bed - threads: 1 -``` - -To produce metaregion profiles, coverages from each samples need to be producted. - -See https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html - -``` -bamCoverage: - do: yes - options: "--binSize 10 --effectiveGenomeSize 2652783500 --normalizeUsing RPGC" - spike-in: no - threads: 4 -``` - -Set yes to geneBody chunk to produce metaregion profiles. This step need a gene model file in bed format. - -``` -geneBody: - do: yes - regionsFileName: genome/mm10_chr1_RefSeq.bed - threads: 4 -``` - -Set all following chunks 'do' to 'no' for now. - - -## run ePeak - - -Test your configuration by performing a dry-run via: - -`snakemake --use-singularity -n --cores 1` - -Execute the workflow locally using $N cores via: - -``` -export PICARD_TOOLS_JAVA_OPTS="-Xmx8G" -N=8 -snakemake --use-singularity --singularity-args "-B '/home/'" --cores $N -``` - - -Run it specifically on Slurm cluster: - -`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE` - - -## analyse QC reports - -### Look at MultiQC report - -- General statistics - -<img src="images/Multiqc_mainStats.png" width="1000" align="center" > - -- Mapping with bowtie2 - -<img src="images/bowtie2_se_plot.png" width="700" align="center" > - -- Deduplication with MarkDuplicates - -<img src="images/picard_deduplication.png" width="700" align="center" > - -- Fingerplot - -<img src="images/deeptools_fingerprint_plot.png" width="700" align="center" > - - -### Look at 05-QC directory - -- Cross correlation - - <img src="images/H3K27ac_shCtrl_ppqt.png" width="700" align="center" > <img src="images/Klf4_shCtrl_ppqt.png" width="700" align="center" > - -- GeneBody plot/heatmap - -<img src="images/geneBodyplot.png" width="700" align="center" > - - - - -Would you proceed to the analysis ? justify why +Use the output to identify if there is a functional difference between the SUMO regions associated to each profile. How would you proceed?