
ePeak: from replicated chromatin profiling data to epigenomic dynamics
Table of content
- ePeak: from replicated chromatin profiling data to epigenomic dynamics
- What is ePeak ?
- How to install ePeak ?
- How to run ePeak ?
- Q&A
- Authors
- How to cite ePeak ?
What is ePeak ?
ePeak is a snakemake-based workflow for the analysis of ChIP-seq/CUT&RUN/CUT&Tag data from raw FASTQ files to differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps like the quality assessment of the immunoprecipitation using the cross correlation and the replicate comparison for both narrow and broad peaks. For the differential analysis ePeak provides linear and non linear methods for normalisation between samples as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see chipflowr).
How to install ePeak ?
Installation with singularity
Pre-required tools:
- python >= 3.8
- snakemake >=4.8.0
- pandas
- pysam
- singularity >= 3.8
A tutorial to create a conda environment with all dependencies is available here : env.sh
Download the singularity container:
singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
Manual installation
List of tools to install:
- python >= 3.8
- snakemake >=4.8.0
- pandas
- pysam
- singularity
- cutadapt >= 3.2
- fastqc == 0.11.9
- samtools >= 1.10
- bowtie2 >= 2.3.5
- macs2 >= 2.2.7
- picard >= 2.18.25
- featureCounts (from subread) >= 2.0.0
- bedtools >= 2.27.1
- idr == 2.0.3
- R >= 4.0.3
- spp == 1.15.2
- R cran packages: snow, snowfail, bitops, caTools, RCurl, Rcpp
- R bioconductor packages: GenomeInfoDB, genomicRanges, Rsamtools
- chipflowr
Run it on IFB Core Cluster without any installation
- Load modules:
module load snakemake/7.7.0
module load python/3.7
module load singularity
module load git-lfs/2.13.1
module load pysam
- Clone workflow:
git clone https://gitlab.pasteur.fr/hub/ePeak.git
- Download singularity container:
cd ePeak
singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
Then, you can configure the workflow.
How to run ePeak ?
Note: you can test the pipeline with an example datasets: here
Usage
- Step 1: Download workflow
git clone https://gitlab.pasteur.fr/hub/ePeak.git
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, its DOI (see above)
If you are using Singularity, you need to copy the Singularity image in the cloned ePeak directory.
- Step 2: Configure workflow
To configure your analysis, you need to fill 2 configuration files, one to specify the design experimental of you project and one to parameter each step of the pipeline (stored in config/
):
In addition, you can customize the MultiQC report by configuring this file: multiqc_config.yaml
- Step 3: Execute workflow
Test your configuration by performing a dry-run via:
snakemake --use-singularity -n --cores 1
Execute the workflow locally using $N cores via:
snakemake --use-singularity --singularity-args "-B '/home/login/'" --cores $N
Run it in a cluster environment via this standard command line:
snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores 1
Or specifically on Slurm cluster:
sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE
or on SGE cluster:
snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "qsub -cwd -V -b y -l h_vmem={cluster.ram} -pe [PE] {threads}" -j 200 --nolock --cores 1
Visualize how the rules are connected via:
snakemake -s Snakefile --rulegraph --nolock | dot -Tsvg > rulegraph.svg
or how the files are processed via:
snakemake -s Snakefile -j 10 --dag --nolock | dot -Tsvg > dag.svg
Rename FASTQ files
All FASTQ files have to observe the following name nomenclature: MARK_COND_REP_MATE.fastq.gz
. For FASTQ files with only one mate (single-end sequencing), or if no replicates are available, set MATE to R1 or REP to Rep1, respectively.
Wildcard | Description |
---|---|
MARK | Histone mark or Transcription Factor (TF) name (i.e. H3K4me1, Klf4) |
COND | Biological condition name (i.e. Normal, Cancer, Cells) |
REP | Replicate number (i.e. Rep1 or Rep2) |
MATE | Identification of mate pair sequencing (i.e. R1 or R2) |
All the FASTQ files must be stored in the same directory.
Example of FASTQ file names:
-
H3K27ac_shCtrl_Rep1_R1.fastq.gz
and its corresponding INPUT file:INPUT_shCtrl_Rep1_R1.fastq.gz
-
TF4_HeLa_rep1_R1.fastq.gz
and its corresponding INPUT file:Input_HeLa_rep1_R1.fastq.gz
-
CTCF_WT_REP1_R1.fastq.gz
and its corresponding INPUT file:INPUT_WT_REP1_R1.fastq.gz
How to fill the design
The experimental analysis design is summarised in a tabulated design file that the user have to fill before running the pipeline.
Design columns:
Column | Description |
---|---|
IP_NAME | IP FASTQ files prefix (i.e. MARK_COND1) |
INPUT_NAME | INPUT FASTQ files prefix (i.e. INPUT_COND1) |
NB_IP | Replicate number of the histone mark or TFs (i.e. 1 or 2) |
NB_INPUT | Replicate number of INPUT file (i.e. 1 or 2) |
Link to an Example: design.txt
Example of design file and associated fastq file:
For an experiment with one mark (H3K27ac) and 2 conditions (shCtrl and shUbc9) in duplicates, except for one INPUT which have failed. The list of fastq is the following :
H3K27ac_shCtrl_rep1_R1.fastq.gz
H3K27ac_shCtrl_rep2_R1.fastq.gz
INPUT_shCtrl_rep1_R1.fastq.gz
H3K27ac_shUbc9_rep1_R1.fastq.gz
H3K27ac_shUbc9_rep2_R1.fastq.gz
INPUT_shUbc9_rep1_R1.fastq.gz
INPUT_shUbc9_rep2_R1.fastq.gz
For this experiment, the design file should be :
IP_NAME INPUT_NAME NB_IP NB_INPUT
H3K27ac_shCtrl INPUT_shCtrl 1 1
H3K27ac_shCtrl INPUT_shCtrl 2 1
H3K27ac_shUbc9 INPUT_shUbc9 1 1
H3K27ac_shUbc9 INPUT_shUbc9 2 2
IP of the first condition are linked to the unique INPUT, and each IP file of second condition is linked to his INPUT.
For design with INPUT files, NB_IP and NB_INPUT correspond to the replicate number of files to associate IP with a specific INPUT. So, one line by IP (for each replicate) is expected.
How to fill the config file
All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill before running the pipeline. Here is an filled example: config.yaml
This config file is divided in 2 sections:
- Section 1: Running pipeline chunks
This first chunk provides input information and assigns working directories.
input_dir
path to FASTQ files directory.
input_mate
mate pair format (i.e. _R[12]
for MATE = R1 or R2) , must match the MATE parameter in FASTQ files.
input_extension
filename extension format (i.e. fastq.gz
or fq.gz
).
analysis_dir
path to analysis directory.
tmpdir
path to temporary directory (i.e. /tmp/
or /local/scratch/
)
input_dir: /path/to/raw_data
input_mate: '_R[12]'
input_extension: '.fastq.gz'
analysis_dir: /path/to/directory/analysis
tmpdir: "/tmp/"
The design chunk aims to check that the FASTQ files name match the design file information. The marks
, conditions
and replicates
parameters must respectively match the MARK, COND and REP parameters of the FASTQ files name and the design file.
For spike-in data, set spike
on "True" and provide the spike-in genome FASTA file path through the spike_genome_file
parameter.
design:
design_file: /path/to/directory/analysis/config/design.txt
marks: K4Me3
condition: C, U
replicates: Rep
spike: false
spike_genome_file: /path/to/genome/directory/dmel9.fa
- Section 2: Rule parameters chunks
Each step has it proper setting in independent chunk.
The second section of the configuration file is divided in multiple chunks. Each chunk gather the parameters of one step of the pipeline: adapters
, bowtie2_mapping
, mark_duplicates
, remove_blacklist
, peak_calling
, compute_idr
and differential_analysis
.
The options
parameter present in adapters
, bowtie2_mapping
and peak_calling
allows you to provide any parameter recognised by cutadapt, bowtie2 and macs2 respectively. For example for the bowtie2_mapping
chunk, options
can be fill with "--very-sensitive" only or more options (adapted to paired-end data, see example 2).
example 1 :
bowtie2_mapping:
options: "--very-sensitive"
threads: 4
example 2 :
bowtie2_mapping:
options: "--very-sensitive -X 700 --no-mixed --no-discordant"
threads: 4
How to fill the multiqc config
At the beginning of config/multiqc_config.yaml
file, you have the possibility to customize header of MultiQC report according to your experiment as you can see below:
# Title to use for the report.
title: "ChIP-seq analysis"
subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells" # Set your own text
intro_text: "MultiQC reports summarise analysis results produced from ePeak"
# Add generic information to the top of reports
report_header_info:
- Contact E-mail: 'rlegendre@pasteur.fr' # Set your own text
- Application Type: 'ChIP-seq' # Set your own text
- Project Type: 'Differential peak expression' # Set your own text
- Sequencing Platform: 'NextSeq2000' # Set your own text
- Sequencing Setup: 'PE75' # Set your own text

ePeak running modes
After the read pre-processing steps of the pipeline, MACS2 peak calling software is used to identify enriched regions. Several settings of MACS2 are possible:
- To estimate the fragment size you can: use MACS2 model (default) or use PhantomPeakQualTool.
- To call the peaks you can either choose narrow or broad.
For the selection of reproducible peaks between replicates we implemented 2 methods:
- The IDR that is the recommended method by ENCODE to measure consistency between replicates for narrow peaks.
- The intersection approach that select the peaks that overlap (default:80%) between the replicates. We optimised this approach for broad peaks. Indeed, the IDR is not optimised for broad peaks. This because the broader size of the peaks induces a bigger position and signal variability making that very few peaks are passing the IDR.
In summary, the user as the possibility to choose multiple fragment size estimation method, multiple peak calling modes and multiple methods of selection of reproducible peaks between replicates. The following subsection shows example of running mode that we advise:
Default narrow mode with IDR
peak_calling:
mode_choice: 'narrow'
no_model: no
options: "--keep-dup all "
cutoff: 0.1
genomeSize: mm
compute_idr:
do: yes
rank: 'signal.value'
thresh: 0.05
intersectionApproach:
do: no
ia_overlap: 0.8
If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting
intersectionApproach
to 'yes'.
Default broad mode with intersection approach
In broad mode, perform a stringent peak calling (cutoff of 0.01) and use Intersection Approach. Overlapping parameter of IA on broad peaks is set at 0.8.
peak_calling:
mode_choice: 'broad'
no_model: no
options: "--keep-dup all --broad-cutoff 0.01"
cutoff: 0.01
genomeSize: mm
compute_idr:
do: no
rank: 'signal.value'
thresh: 0.01
intersectionApproach:
do: yes
ia_overlap: 0.8
Default mode for CUT&RUN (with INPUT)
With CUT&RUN data, make deduplication only on INPUT/IgG data (dedup_IP to False). Then perform a stringent peak calling with SEACR and use Intersection Approach. Overlapping parameter of IA on peaks is set at 0.8. Set SEACR normalization to non if experiment have control genome (a scaling factor will be calulated from spike-in) .
mark_duplicates:
do: yes
dedup_IP: 'False'
threads: 4
seacr:
do: yes
threshold: 'stringent'
norm: 'norm'
compute_idr:
do: no
rank: 'signal.value'
thresh: 0.01
intersectionApproach:
do: yes
ia_overlap: 0.8
Default mode for CUT&Tag (or with no INPUT)
With CUT&Tag data, that does not require an input control or IgG control sample, the "noCTL" version of ePeak is recommended.
To running ePeak without any control, the design file should be adapted:
IP_NAME NB_IP
H3K27ac_shCtrl 2
H3K27ac_shUbc9 2
For design without any INPUT file, NB_IP correspond to the number of replicates available for each sample. So, one line by sample (one for all replicates) is expected.
The config file stay identical.
bowtie2_mapping:
options: "--very-sensitive -X 700 --no-mixed --no-discordant"
threads: 4
mark_duplicates:
do: yes
dedup_IP: 'False'
threads: 4
peak_calling:
mode_choice: 'narrow'
no_model: no
options: "--keep-dup all "
cutoff: 0.1
genomeSize: mm
compute_idr:
do: yes
rank: 'signal.value'
thresh: 0.05
intersectionApproach:
do: no
ia_overlap: 0.8
To run the pipeline, the correct Snakefile (Snakefile_noCTL) should be used:
snakemake -s Snakefile_noCTL --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores 1
Run the pipeline on test data
You need to have sra-toolkit installed before to download test data.
# Download genome references
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz
# get blacklisted regions
wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
gunzip mm10.blacklist.bed.gz
# download bed file for GeneBody plot
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.refGene.gtf.gz
gunzip mm10.refGene.gtf.gz
# create genome directory
mkdir genome
mv mm10.* genome/
# copy config file
cp test/config.yaml config/config.yaml
cp test/design.txt config/design.txt
# Download FastQ files from GEO (GSE99009) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99009
# Only include the H3K27ac histone mark and Klf4 transcription factor with their associated inputs for the shUbc9 and shCtrl conditions
SRR=("SRR5572646" "SRR5572647" "SRR5572658" "SRR5572659" "SRR5572668" "SRR5572669" "SRR5572676" "SRR5572677" "SRR5572652" "SRR5572653" "SRR5572664" "SRR5572665")
sample=("H3K27ac_shCtrl_Rep1_R1" "H3K27ac_shCtrl_Rep2_R1" "H3K27ac_shUbc9_Rep1_R1" "H3K27ac_shUbc9_Rep2_R1" "Klf4_shCtrl_Rep1_R1" "Klf4_shCtrl_Rep2_R1" "Klf4_shUbc9_Rep1_R1" "Klf4_shUbc9_Rep2_R1" "INPUT_shCtrl_Rep1_R1" "INPUT_shCtrl_Rep2_R1" "INPUT_shUbc9_Rep1_R1" "INPUT_shUbc9_Rep2_R1")
mkdir data
cd data
for i in ${!SRR[*]} ; do
echo ${SRR[$i]}, ${sample[$i]}
prefetch ${SRR[$i]} -o ${sample[$i]}.sra
fastq-dump ${sample[$i]}.sra
done
rm *.sra
for file in *.fastq ; do
pigz $file ;
done
Q&A
What's a IP?
- IP stand for ImmunoPrecipitation. The IP samples contains the DNA that bind the protein of interest (i.e. H3K4me1, CTCF, TFIIA). These samples has been cross-linked, sonicated and immuno-precipitated.
What's an INPUT?
- INPUT stand for control DNA. The control DNA has been cross-linked and sonicated but not immuno-precipitated.
Replicate format in FASTQ file name?
- Rep1, rep1 and REP1 accepted, all samples must have the same format:
- Rep1 and Rep2 works but Rep1 and REP2 doesn't
What is the IP_NAME
format columns? (i.e. What if I want to specify something more (species, incubation time...))
- Inside
IP_NAME
you can use "-" but do not "_" because this is used to separateMARK
,COND
,REP
andMATE
from FASTQ filenames. For example:TF4_Ctl-HeLa_rep1_R1.fastq.gz
Can I use relative path in config ?
- yes, but you need to consider ePeak directory as origin.
What if I have 3 INPUT replicates?
- You can use ePeak with more than 2 replicates, replicates will be merged in a Maxi Pool.
What if I have 3 IP replicate?
- The IDR for 3 IP replicates is not yet implemented.
What INPUT is used when I have 2 INPUTs?
- First INPUT could be used for peak calling on all IP to reduce technical variability, but each IP could be associated to a specific INPUT in the design file.
What if I have no INPUT ?
- Following ENCODE guidelines, control experiment must be performed and are used during peak calling for ChIp-seq experiments. Nevertheless, the CUT&Tag method reduces the background noise and allows for more precise and sensitive detection of protein-DNA interactions, reducing the need for an input control or IgG control. Follow the CUT&Tag section for more explanation about running ePeak without INPUT.
What if I have one INPUT for one mark, and two for another ?
- Experimental design with different numbers of replicates for IP and/or INPUT are included to ePeak.
What if I have one replicate for one mark ?
- The pipeline will stop after peak calling, without selection of peaks as reproducibility cannot be ensure with only one replicate.
What are the default ADAPTERS for cutadapt?
- By default, indexes from TruSeq and NEBNext Kits for Illumina are done in the config/adapt.fa.
Can I give my own adapters list for cutadapt?
- In-house adapters could be added to the config/adapt.fa or in an independent file. Don't forget to change the value of
adapter_list
in cutadapt rule inconfig.yaml
file.
I want to test other options for Bowtie2, how can I re-run the mapping step ?
- As any Snakemake pipeline, you can force the re-calculation of one or more rule. After changing the options of
bowtie2_mapping
chunk, you can restart the pipeline adding this snakemake option :--forcerun bowtie2_mapping
. All rules depending ofbowtie2_mapping
will be re-calculated.
Can I force the re-calculation of all the steps ?
- Yes, you can add this snakemake option
--forceall
to force the execution of the first rule.
Can I rename ePeak directory ?
- yes, but must to be the same as in config.yaml (
analysis_dir
) or use relative path
The pipeline fails because the IDR doesn't select enough reads?
- If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting
intersectionApproach:do
to 'yes'.
When should I use PhantompeakCalling's fragment size estimation instead of MACS2's one?
- If MACS2 cannot compute the fragment size estimation (or if you want), set
no_model
to yes, and the fragment length use for MACS2 will be the one computed by PhantompeakQualTools for each sample.
What if I don't know if my chromatim factor in narrow or broad?
- The output directories names of peak Calling, peak reproducibility and differential analysis steps includes the peak calling mode name, the peak reproducibility method name and the normalisation and variance estimation method name. This allows ePeak to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- For example, if you have run the pipeline in narrow mode, and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories.
- In case of unknown chromatin factor, we advice to run ePeak in narrow mode with IDR and IA, and afterward in broad mode. Results from narrow peak calling will be stored in "06-PeakCalling/Narrow" directory, and in "06-PeakCalling/Broad" for broad peak calling.
Authors
- Rachel Legendre (@rlegendr)
- Maëlle Daunesse
- Adrien Pain
- Hugo Varet
- Claudia Chica
How to cite ePeak ?
Daunesse Maëlle, Legendre Rachel, Varet Hugo, Pain Adrien, Chica Claudia. ePeak: from replicated chromatin profiling data to epigenomic dynamics. NAR Genom Bioinform. 2022 May 27;4(2):lqac041. doi: 10.1093/nargab/lqac041. PMID: 35664802; PMCID: PMC9154330.