# ChIPflow: from replicated ChIP-seq raw data to epigenomic dynamics
# ePeak: from replicated chromatin profiling data to epigenomic dynamics
...
...
@@ -10,15 +10,15 @@
[[_TOC_]]
# What is ChIPflow ?
# What is ePeak ?
ChIPflow is a snakemake-based workflow for the analysis of ChIP-seq data from raw FASTQ files to differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps like the quality assessment of the immunoprecipitation using the cross correlation and the replicate comparison for both narrow and broad peaks. For the differential analysis, ChIPflow provides linear and non-linear methods for normalisation between samples as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)).
ePeak is a snakemake-based workflow for the analysis of ChIP-seq data from raw FASTQ files to differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps like the quality assessment of the immunoprecipitation using the cross correlation and the replicate comparison for both narrow and broad peaks. For the differential analysis, ePeak provides linear and non-linear methods for normalisation between samples as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)).
A tutorial to create a conda environment with all dependencies is available here : [env.sh](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/env.sh)
A tutorial to create a conda environment with all dependencies is available here : [env.sh](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/env.sh)
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and its DOI (see [above](https://gitlab.pasteur.fr/hub/chipflow#how-to-cite-chipflow-))
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and its DOI (see [above](https://gitlab.pasteur.fr/hub/ePeak#how-to-cite-ePeak-))
If you are using Singularity, you need to copy the Singularity image in the cloned ChIPflow directory.
If you are using Singularity, you need to copy the Singularity image in the cloned ePeak directory.
* Step 2: Configure workflow
To configure your analysis, you need to fill in 2 configuration files, one to specify the experimental design of your project and one to set the parameters of each step of the pipeline (stored in `config/`):
In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-multiqc-config)
In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-multiqc-config)
| NB_IP | Number of replicates of the histone mark or TFs (i.e. 1 or 2) |
| NB_INPUT | Number of replicates of INPUT files (i.e. 1 or 2) |
| NB_IP | Replicate number of the histone mark or TFs (i.e. 1 or 2) |
| NB_INPUT | Replicate number of INPUT file (i.e. 1 or 2) |
Link to an Example: [design.txt](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/design.txt)
Link to an Example: [design.txt](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/design.txt)
### How to fill the config file
All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill in before running the pipeline. Here is a filled example: [config.yaml](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/config.yaml)
All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill in before running the pipeline. Here is a filled example: [config.yaml](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/config.yaml)
This config file is divided in 2 sections:
...
...
@@ -230,7 +230,7 @@ At the beginning of `config/multiqc_config.yaml` file, you have the possibility
# Title to use for the report.
title: "ChIP-seq analysis"
subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells" # Set your own text
intro_text: "MultiQC reports summarise analysis results produced from ChIPflow"
intro_text: "MultiQC reports summarise analysis results produced from ePeak"
# Add generic information to the top of reports
report_header_info:
...
...
@@ -244,7 +244,7 @@ report_header_info:
## ChIPflow running modes
## ePeak running modes
After the read pre-processing steps of the pipeline, MACS2 peak calling software is used to identify enriched regions. Several settings of MACS2 are possible:
- To estimate the fragment size, you can use the MACS2 model (default) or PhantomPeakQualTools.
...
...
@@ -268,10 +268,15 @@ peak_calling:
compute_idr:
    do: yes
    rank: 'signal.value'
    thresh: 0.05
intersectionApproach: no
intersectionApproach:
    do: no
    ia_overlap: 0.8
```
> If none or very few peaks pass the IDR, this means that there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `do` under `intersectionApproach` to 'yes'.
...
...
@@ -288,10 +293,42 @@ peak_calling:
    cutoff: 0.01
    genomeSize: mm
compute_idr:
    do: no
    rank: 'signal.value'
    thresh: 0.01
intersectionApproach: yes
intersectionApproach:
    do: yes
    ia_overlap: 0.8
```
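The examples above show `intersectionApproach` as a mapping with `do` and `ia_overlap` keys, where older configs used a single yes/no value. The following is a minimal sketch of how an older config could be upgraded, assuming it has already been loaded into a Python dict by a YAML parser. The helper name `upgrade_intersection_approach` and the default overlap of 0.8 (taken from the examples above) are illustrative and are not part of ePeak.

```python
def upgrade_intersection_approach(config, default_overlap=0.8):
    """Return a shallow copy of `config` with intersectionApproach in nested form.

    Hypothetical helper: ePeak itself does not ship this function.
    """
    upgraded = dict(config)
    value = upgraded.get("intersectionApproach")
    if isinstance(value, dict):
        return upgraded  # already uses the nested do/ia_overlap layout
    # YAML parsers usually read yes/no as booleans; also accept plain strings.
    enabled = value is True or str(value).strip().lower() in ("yes", "true")
    upgraded["intersectionApproach"] = {"do": enabled, "ia_overlap": default_overlap}
    return upgraded


old_config = {"compute_idr": {"do": False}, "intersectionApproach": True}
print(upgrade_intersection_approach(old_config)["intersectionApproach"])
```

A config that already uses the nested layout is returned unchanged, so the helper can be applied to either form.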
### Default mode for cut&run
With CUT&RUN data, deduplicate only the INPUT/IgG data (set `dedup_IP` to False). Then perform a stringent peak calling with SEACR and use the intersection approach (IA). The IA overlap parameter on peaks is set to 0.8.
```
mark_duplicates:
    do: yes
    dedup_IP: 'False'
    threads: 4
seacr:
    do: yes
    threshold: 'stringent'
    norm: 'norm'
compute_idr:
    do: no
    rank: 'signal.value'
    thresh: 0.01
intersectionApproach:
    do: yes
    ia_overlap: 0.8
```
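As an illustration of what an overlap threshold such as `ia_overlap: 0.8` can mean, here is a minimal sketch. It assumes that a peak from replicate 1 is kept when some peak in replicate 2 covers at least that fraction of its length. The actual rule used by ePeak may differ (for instance, it could require a reciprocal overlap), and the function names are hypothetical.

```python
def overlap_fraction(peak, other):
    """Fraction of `peak` covered by `other`; peaks are (chrom, start, end)."""
    chrom_a, start_a, end_a = peak
    chrom_b, start_b, end_b = other
    if chrom_a != chrom_b or end_a <= start_a:
        return 0.0
    shared = min(end_a, end_b) - max(start_a, start_b)
    return max(shared, 0) / (end_a - start_a)


def intersect_peaks(rep1, rep2, min_overlap=0.8):
    """Keep replicate-1 peaks covered at >= min_overlap by any replicate-2 peak."""
    return [
        peak for peak in rep1
        if any(overlap_fraction(peak, other) >= min_overlap for other in rep2)
    ]


rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 110, 210), ("chr1", 570, 700)]
print(intersect_peaks(rep1, rep2))  # only ("chr1", 100, 200) is covered at 0.9
```

Under this assumption, lowering `min_overlap` retains more peaks at the cost of less agreement between replicates.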
...
...
@@ -302,18 +339,21 @@ You need to have sra-toolkit installed before to download test data.
- Inside `IP_NAME` you can use "-" but not "\_", because "\_" is used to separate `MARK`, `COND`, `REP` and `MATE` in FASTQ filenames. For example: `TF4_Ctl-HeLa_rep1_R1.fastq.gz`
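
The naming rule can be checked with a short sketch. It assumes that the four underscore-separated fields appear in the order `MARK`, `COND`, `REP`, `MATE` and that the file ends in `.fastq.gz`, as in the example above. The function name is hypothetical, and ePeak's own parsing may be stricter.

```python
def parse_fastq_name(filename, suffix=".fastq.gz"):
    """Split a FASTQ filename into its MARK, COND, REP and MATE fields."""
    if not filename.endswith(suffix):
        raise ValueError(f"{filename!r} does not end with {suffix!r}")
    fields = filename[: -len(suffix)].split("_")
    if len(fields) != 4:
        # An extra "_" inside a field (e.g. in the condition) lands here.
        raise ValueError(f"expected 4 '_'-separated fields, got {len(fields)}")
    return dict(zip(("MARK", "COND", "REP", "MATE"), fields))


print(parse_fastq_name("TF4_Ctl-HeLa_rep1_R1.fastq.gz"))
```

A name such as `TF4_Ctl_HeLa_rep1_R1.fastq.gz` raises an error in this sketch because the extra underscore produces five fields.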
**Can I use relative path in config ?**
- Yes, but relative paths are resolved from the ChIPflow directory.
- Yes, but relative paths are resolved from the ePeak directory.
**What if I have 3 INPUT replicates?**
- You can use ChIPflow with more than 2 replicates; the replicates will be merged into a Maxi Pool.
- You can use ePeak with more than 2 replicates; the replicates will be merged into a Maxi Pool.
**What if I have 3 IP replicates?**
- The IDR for 3 IP replicates is not yet implemented.
**What INPUT is used when I have 2 INPUTs?**
- Only the first INPUT is used for peak calling to reduce technical variability for now.
- The first INPUT can be used for peak calling on all IPs to reduce technical variability, but each IP can also be associated with a specific INPUT in the design file.
**What if I have no INPUT ?**
- The pipeline fails to run if no INPUT is provided. Following ENCODE guidelines, a control experiment must be performed; it is used during peak calling.
...
...
@@ -388,19 +429,19 @@ done
**Can I force the re-calculation of all the steps ?**
- Yes, you can add the snakemake option `--forceall` to force the execution of the first rule and all the rules it depends on.
**Can I rename ChIPflow directory ?**
**Can I rename ePeak directory ?**
- Yes, but the new name must be the same as in config.yaml (`analysis_dir`), or you must use a relative path.
**The pipeline fails because the IDR doesn't select enough peaks?**
- If none or very few peaks pass the IDR, this means that there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes'.
- If none or very few peaks pass the IDR, this means that there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach:do` to 'yes'.
**When should I use PhantomPeakQualTools' fragment size estimation instead of MACS2's?**
- If MACS2 cannot compute the fragment size estimation (or if you prefer), set `no_model` to yes, and the fragment length used by MACS2 will be the one computed by PhantomPeakQualTools for each sample.
**What if I don't know whether my chromatin factor is narrow or broad?**
- The output directory names of the peak calling, peak reproducibility and differential analysis steps include the peak calling mode name, the peak reproducibility method name, and the normalisation and variance estimation method names. This allows ChIPflow to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- The output directory names of the peak calling, peak reproducibility and differential analysis steps include the peak calling mode name, the peak reproducibility method name, and the normalisation and variance estimation method names. This allows ePeak to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- For example, if you have run the pipeline in narrow mode, and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories.
- If the type of chromatin factor is unknown, we advise running ChIPflow in narrow mode with IDR and IA, and afterwards in broad mode. Results from narrow peak calling will be stored in the "06-PeakCalling/Narrow" directory, and results from broad peak calling in "06-PeakCalling/Broad".
- If the type of chromatin factor is unknown, we advise running ePeak in narrow mode with IDR and IA, and afterwards in broad mode. Results from narrow peak calling will be stored in the "06-PeakCalling/Narrow" directory, and results from broad peak calling in "06-PeakCalling/Broad".