# ChIPflow: from replicated ChIP-seq raw data to epigenomic dynamics

[![Snakemake](https://img.shields.io/badge/snakemake-≥5.2.1-brightgreen.svg)](https://snakemake.readthedocs.io/en/stable/) [![Python 3.6](https://img.shields.io/badge/python-3.6-blue.svg)](https://www.python.org/downloads/release/python-360/)


## What is ChIPflow ?

ChIPflow is a Snakemake-based workflow for the analysis of ChIP-seq data, from raw FASTQ files to the differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps such as the quality assessment of the immunoprecipitation using cross-correlation and the comparison of replicates for both narrow and broad peaks. For the differential analysis, ChIPflow provides linear and non-linear methods for normalisation between samples, as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)).

<img src="images/chipflow_pipeline.svg" width="700">

## How to install ChIPflow ?

### Installation with singularity

Pre-required tools:
- python >= 3.8
- snakemake >=4.8.0
- pandas
- pysam
- singularity

A tutorial to create a conda environment with all the dependencies is available here: [env.sh](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/env.sh)

Download the singularity container:
 
`singularity pull --arch amd64 library://rlegendre/default/chipflow:latest`

### Manual installation 

List of tools to install:

- python >= 3.8
- snakemake >=4.8.0
- pandas
- pysam
- singularity
- cutadapt >= 3.2
- fastqc == 0.11.9
- samtools >= 1.10
- bowtie2 >= 2.3.5
- macs2 >= 2.2.7
- picard >= 2.18.25
- featureCounts (from subread) >= 2.0.0
- bedtools >= 2.27.1
- idr == 2.0.3
- R >= 4.0.3
- spp == 1.15.2
    - R CRAN packages: snow, snowfall, bitops, caTools, RCurl, Rcpp
    - R Bioconductor packages: GenomeInfoDb, GenomicRanges, Rsamtools
- [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)

### Run it on IFB Core Cluster without any installation

* Load modules:

```
module load snakemake/6.5.0
module load python/3.7
module load singularity
module load git-lfs/2.13.1
module load pysam
```

* Clone workflow:

`git clone https://gitlab.pasteur.fr/hub/chipflow.git`

* Download singularity container:

```
singularity pull --arch amd64 library://rlegendre/default/chipflow:latest
mv chipflow_latest.sif chipflow/chipflow.img
```

Then, you can configure the workflow.

# How to run ChIPflow ?

> Note: you can test the pipeline with an example dataset: [here](https://gitlab.pasteur.fr/hub/chipflow#run-the-pipeline-on-test-data)

## Usage

* Step 1: Download workflow

`git clone https://gitlab.pasteur.fr/hub/chipflow.git`

If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and its DOI (see [below](https://gitlab.pasteur.fr/hub/chipflow#how-to-cite-chipflow-)).

If you are using Singularity, you need to copy the Singularity image into the cloned ChIPflow directory and rename it "chipflow.img":

`mv chipflow_latest.sif chipflow/chipflow.img`


*  Step 2: Configure workflow

To configure your analysis, you need to fill in two configuration files (stored in `config/`): one describing the experimental design of your project and one setting the parameters of each step of the pipeline:
- [config.yaml](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-config-file)
- [design.txt](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-design)

In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-multiqc-config)

*  Step 3: Execute workflow

Test your configuration by performing a dry-run via:

`snakemake --use-singularity -n --cores 1`

Execute the workflow locally using $N cores via:

`snakemake --use-singularity --singularity-args "-B '/home/login/'" --cores $N`

Run it in a cluster environment via this standard command line:

`snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores 1`

Or specifically on a Slurm cluster:

`sbatch snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "sbatch --mem={cluster.ram} --cpus-per-task={threads} " -j 200 --nolock --cores $SLURM_JOB_CPUS_PER_NODE`

or on an SGE cluster:

`snakemake --use-singularity --singularity-args "-B '$HOME'" --cluster-config config/cluster_config.json --cluster "qsub -cwd -V -b y -l h_vmem={cluster.ram} -pe [PE] {threads}" -j 200 --nolock --cores 1`


Visualize how the rules are connected via:

`snakemake -s Snakefile --rulegraph --nolock | dot -Tsvg > rulegraph.svg`

or how the files are processed via:

`snakemake -s Snakefile -j 10 --dag --nolock | dot -Tsvg > dag.svg`


### Rename FASTQ files

All FASTQ files must follow the naming convention `MARK_COND_REP_MATE.fastq.gz`. For FASTQ files with only one mate (single-end sequencing), or if no replicates are available, set MATE to R1 or REP to Rep1, respectively.

| Wildcard |                          Description                                 |
|----------|----------------------------------------------------------------------|
|   MARK   | Histone mark or transcription factor (TF) name (e.g. H3K4me1, Klf4)   |
|   COND   | Biological condition name (e.g. Normal, Cancer, Cells)                |
|   REP    | Replicate number (e.g. Rep1 or Rep2)                                  |
|   MATE   | Identification of mate pair sequencing (e.g. R1 or R2)                |

All the FASTQ files must be stored in the same directory.

Examples of FASTQ file names:

- `H3K27ac_shCtrl_Rep1_R1.fastq.gz` and its corresponding INPUT file: `INPUT_shCtrl_Rep1_R1.fastq.gz`
- `TF4_HeLa_rep1_R1.fastq.gz` and its corresponding INPUT file: `Input_HeLa_rep1_R1.fastq.gz`
- `CTCF_WT_REP1_R1.fastq.gz` and its corresponding INPUT file: `INPUT_WT_REP1_R1.fastq.gz`

### How to fill the design

The experimental analysis design is summarised in a tabulated design file that the user has to fill in before running the pipeline.

Design columns:

|    Column    |                          Description                          |
|--------------|---------------------------------------------------------------|
|   IP_NAME    | IP FASTQ files prefix (e.g. MARK_COND1)                       |
|   INPUT_NAME | INPUT FASTQ files prefix (e.g. INPUT_COND1)                   |
|   NB_IP      | Number of replicates of the histone mark or TF (e.g. 1 or 2)  |
|   NB_INPUT   | Number of replicates of INPUT files (e.g. 1 or 2)             |

Example design file: [design.txt](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/design.txt)
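
A minimal sketch of what such a design file could contain (columns as in the table above, illustrative values based on the test data set; see the linked file for the exact format):

```
IP_NAME          INPUT_NAME      NB_IP   NB_INPUT
H3K27ac_shCtrl   INPUT_shCtrl    2       2
Klf4_shCtrl      INPUT_shCtrl    2       2
```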


### How to fill the config file

All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill in before running the pipeline. Here is a filled example: [config.yaml](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/config.yaml)

This config file is divided into two sections:

1. Section 1: Running pipeline chunks

This first chunk provides input information and assigns working directories:
- `input_dir`: path to the FASTQ files directory.
- `input_mate`: mate pair format (e.g. `_R[12]` for *MATE* = R1 or R2); it must match the *MATE* part of the FASTQ file names.
- `input_extension`: filename extension (e.g. `fastq.gz` or `fq.gz`).
- `analysis_dir`: path to the analysis directory.
- `tmpdir`: path to a temporary directory (e.g. `/tmp/` or `/local/scratch/`).

```
input_dir: /path/to/raw_data
input_mate: '_R[12]'
input_extension: '.fastq.gz'
analysis_dir: /path/to/directory/analysis
tmpdir: "/tmp/"
```

The design chunk checks that the FASTQ file names match the information in the design file. The `marks`, `condition` and `replicates` parameters must match the *MARK*, *COND* and *REP* parts of the FASTQ file names and of the design file, respectively.
For spike-in data, set `spike` to `true` and provide the path to the spike-in genome FASTA file through the `spike_genome_file` parameter.

```
design:
    design_file: /path/to/directory/analysis/config/design.txt    
    marks: K4Me3
    condition: C, U
    replicates: Rep
    spike: false
    spike_genome_file: /path/to/genome/directory/dmel9.fa
```

2. Section 2: Rule parameters chunks

The second section of the configuration file is divided into multiple chunks, each gathering the parameters of one step of the pipeline: `adapters`, `bowtie2_mapping`, `mark_duplicates`, `remove_blacklist`, `peak_calling`, `compute_idr` and `differential_analysis`.

The `options` parameter present in the `adapters`, `bowtie2_mapping` and `peak_calling` chunks lets you pass any parameter recognised by cutadapt, bowtie2 and macs2, respectively. For example, in the `bowtie2_mapping` chunk, `options` can be filled with "--very-sensitive".

```
bowtie2_mapping:
    options: "--very-sensitive"
    threads: 4
```

### ChIPflow running modes

After the read pre-processing steps of the pipeline, the MACS2 peak calling software is used to identify enriched regions. Several MACS2 settings are possible:
- To estimate the fragment size, you can either use the MACS2 model (default) or PhantomPeakQualTools.
- To call the peaks, you can choose either narrow or broad mode.

For the selection of reproducible peaks between replicates, we implemented two methods:
- The IDR, which is the method recommended by ENCODE to measure consistency between replicates for narrow peaks.
- The intersection approach, which selects the peaks that overlap (default: 80%) between replicates. We optimised this approach for broad peaks, for which the IDR is not well suited: the larger peak size induces greater position and signal variability, so very few peaks pass the IDR.

In summary, the user can choose between several fragment size estimation methods, peak calling modes and methods for selecting reproducible peaks between replicates. The following subsections show examples of the running modes that we advise:

#### Default narrow mode with IDR

```
peak_calling:
    mode_choice: 'narrow'    
    no_model: no             
    options: "--keep-dup all "
    cutoff: 0.1
    genomeSize: mm


compute_idr:
    rank: 'signal.value'
    thresh: 0.05
    intersectionApproach: no
    ia_overlap: 0.8
```

> If no peaks or very few peaks pass the IDR, there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes', as sketched below.
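
For instance, a minimal sketch of the `compute_idr` chunk with the intersection approach enabled in narrow mode (the other values are taken from the example above):

```
compute_idr:
    rank: 'signal.value'
    thresh: 0.05
    intersectionApproach: yes
    ia_overlap: 0.8
```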

#### Default broad mode with intersection approach

In broad mode, we advise performing a stringent peak calling (cutoff of 0.01) and using the intersection approach. The overlap parameter of the IA on broad peaks is set to 0.8.

```
peak_calling:
    mode_choice: 'broad'    
    no_model: no             
    options: "--keep-dup all --broad-cutoff 0.01"
    cutoff: 0.01
    genomeSize: mm
    
compute_idr:
    rank: 'signal.value'
    thresh: 0.01
    intersectionApproach: yes
    ia_overlap: 0.8
```

### How to fill the multiqc config

At the beginning of the `config/multiqc_config.yaml` file, you can customize the header of the MultiQC report according to your experiment, as shown below:

```
# Title to use for the report.
title: "ChIP-seq analysis"
subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells"                  # Set your own text
intro_text: "MultiQC reports summarise analysis results produced from ChIPflow"     

# Add generic information to the top of reports
report_header_info:
    - Contact E-mail: 'rlegendre@pasteur.fr'                                        # Set your own text
    - Application Type: 'ChIP-seq'                                                  # Set your own text
    - Project Type: 'Differential peak expression'                                  # Set your own text
    - Sequencing Platform: 'NextSeq2000'                                            # Set your own text
    - Sequencing Setup: 'PE75'                                                      # Set your own text
```
<img src="images/multiqc_header.png" width="600">



### Run the pipeline on test data 

You need to have sra-toolkit installed to download the test data.

```
# Download genome references
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_genomic.fna.gz
gunzip GCF_000001635.26_GRCm38.p6_genomic.fna.gz


# get blacklisted regions
wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
gunzip mm10.blacklist.bed.gz

# create genome directory
mkdir genome
mv GCF_000001635.26_GRCm38.p6_genomic.fna genome/mm10.fa
mv mm10.blacklist.bed genome/mm10.blacklist.bed

# copy config file
cp test/config.yaml config/config.yaml
cp test/design.txt config/design.txt

# Download FastQ files from GEO (GSE99009) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE99009
# Only include the H3K27ac histone mark and Klf4 transcription factor with their associated inputs for the shUbc9 and shCtrl conditions
SRR=("SRR5572646" "SRR5572647" "SRR5572658" "SRR5572659" "SRR5572668" "SRR5572669" "SRR5572676" "SRR5572677" "SRR5572652" "SRR5572653" "SRR5572664" "SRR5572665")
sample=("H3K27ac_shCtrl_Rep1_R1" "H3K27ac_shCtrl_Rep2_R1" "H3K27ac_shUbc9_Rep1_R1" "H3K27ac_shUbc9_Rep2_R1" "Klf4_shCtrl_Rep1_R1" "Klf4_shCtrl_Rep2_R1" "Klf4_shUbc9_Rep1_R1" "Klf4_shUbc9_Rep2_R1" "INPUT_shCtrl_Rep1_R1" "INPUT_shCtrl_Rep2_R1" "INPUT_shUbc9_Rep1_R1" "INPUT_shUbc9_Rep2_R1")

mkdir data
cd data
for i in  ${!SRR[*]} ; do
    echo ${SRR[$i]}, ${sample[$i]}
    prefetch ${SRR[$i]} -o ${sample[$i]}.sra
    fastq-dump ${sample[$i]}.sra
done

rm *.sra 
for file in *.fastq ; do 
    pigz $file ; 
done
```
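
Once the test data, genome files and configuration files are in place, you can launch the pipeline from the cloned ChIPflow directory as described in Step 3. A minimal sketch (dry-run first, then a local run; the 4 cores are only illustrative):

```
cd ..   # back to the ChIPflow directory if you are still in data/
snakemake --use-singularity -n --cores 1
snakemake --use-singularity --singularity-args "-B '$HOME'" --cores 4
```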


## Q&A

What's an IP?
- IP stands for ImmunoPrecipitation. The IP samples contain the DNA bound by the protein of interest (e.g. H3K4me1, CTCF, TFIIA). These samples have been cross-linked, sonicated and immunoprecipitated.

What's an INPUT?
- INPUT stands for the control DNA, which has been cross-linked and sonicated but not immunoprecipitated.

Which replicate formats are accepted in FASTQ file names?
- Rep1, rep1 and REP1 are accepted, but all samples must use the same format: Rep1 and Rep2 works, whereas Rep1 and REP2 doesn't.

What is the IP_NAME format? (e.g. what if I want to specify something more: species, incubation time...)
- Use "-" and not "_": "_" is reserved to separate MARK, COND, REP and MATE in FASTQ file names.

Can I use relative paths in the config?
- Yes, but they must be relative to the ChIPflow directory.
 
What if I have 3 INPUT replicates?
- You can use ChIPflow with more than 2 INPUT replicates; they will be merged into a Maxi Pool.

What if I have 3 IP replicates?
- The IDR for 3 IP replicates is not yet implemented.

What INPUT is used when I have 2 INPUTs?
- INPUT1 only: using the same INPUT keeps the same noise baseline for all IPs and avoids adding technical variability.

What if I have no INPUT?
- The pipeline will fail to run if no INPUT is provided. Following ENCODE guidelines, a control experiment must be performed and is used during peak calling.

What if I have one INPUT for one mark, and two for another?
- We advise running two separate analyses. Experimental designs with different numbers of replicates for IP or INPUT are not supported for now.

What if I have only one replicate for a mark?
- The pipeline will stop after peak calling, without selecting reproducible peaks, as reproducibility cannot be assessed with only one replicate.

What are the default ADAPTERS for cutadapt?
- By default, adapters from the TruSeq and NEBNext kits for Illumina are provided in `config/adapt.fa`.

Can I provide my own adapter list for cutadapt?
- In-house adapters can be added to `config/adapt.fa` or provided in a separate file. Don't forget to change the value of `adapter_list` in the cutadapt chunk of the `config.yaml` file, as sketched below.
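
A minimal sketch of what this chunk could look like (the keys in your own `config.yaml` are authoritative; the `options` value is only illustrative):

```
adapters:
    options: "-q 30 --minimum-length 25"
    adapter_list: config/adapt.fa
```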

I want to test other options for Bowtie2: how can I re-run the mapping step?
- As with any Snakemake pipeline, you can force the re-computation of one or more rules. After changing the options of the `bowtie2_mapping` chunk, restart the pipeline with the Snakemake option `--forcerun bowtie2_mapping`: all rules depending on `bowtie2_mapping` will be re-computed. See the sketch below.
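
For example, a minimal sketch (`$N` is the number of cores, as in the commands above):

```
snakemake --use-singularity --forcerun bowtie2_mapping --cores $N
```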

Can I force the re-computation of all the steps?
- Yes, you can add the Snakemake option `--forceall` to force the execution of all the rules of the pipeline, regardless of existing output.

Can I rename the ChIPflow directory?
- Yes, but the new name must match the `analysis_dir` value in `config.yaml`, or you must use relative paths.

What if the pipeline fails because the IDR doesn't select enough peaks?
- If no peaks or very few peaks pass the IDR, there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes'.

When should I use PhantomPeakQualTools' fragment size estimation instead of MACS2's?
- If MACS2 cannot compute the fragment size estimation (or if you prefer), set `no_model` to yes; the fragment length used by MACS2 will then be the one computed by PhantomPeakQualTools for each sample.
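
For example, a minimal sketch of the `peak_calling` chunk with this option enabled (the other values are taken from the narrow mode example above):

```
peak_calling:
    mode_choice: 'narrow'
    no_model: yes
    options: "--keep-dup all "
    cutoff: 0.1
    genomeSize: mm
```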

What if I don't know whether my chromatin factor is narrow or broad?
- The output directory names of the peak calling, peak reproducibility and differential analysis steps include the peak calling mode, the peak reproducibility method, and the normalisation and variance estimation methods. This allows ChIPflow to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- For example, if you have run the pipeline in narrow mode and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories.
- For an unknown chromatin factor, we advise running ChIPflow in narrow mode with IDR and IA, and afterwards in broad mode. Results from narrow peak calling will be stored in the "06-PeakCalling/Narrow" directory, and results from broad peak calling in "06-PeakCalling/Broad".


---
## Authors

* Rachel Legendre (@rlegendr)
* Maëlle Daunesse
* Adrien Pain
* Hugo Varet
* Claudia Chica


## How to cite ChIPflow ?

https://doi.org/10.1101/2021.02.02.429342