Open the `config/config.yaml` and `config/design.txt` files.
* **Design file:** tab-delimited file with 4 columns.
**Column 1** is the name of the IP file
**Column 2** is the name of the corresponding INPUT file
**Column 3** is the replicate number of the IP file
**Column 4** is the replicate number of the corresponding INPUT file
```
IP_NAME INPUT_NAME NB_IP NB_INPUT
H3K27ac_shCtrl INPUT_shCtrl 1 1
H3K27ac_shCtrl INPUT_shCtrl 2 1
H3K27ac_shUbc9 INPUT_shUbc9 1 1
H3K27ac_shUbc9 INPUT_shUbc9 2 1
Klf4_shCtrl INPUT_shCtrl 1 1
Klf4_shCtrl INPUT_shCtrl 2 2
Klf4_shUbc9 INPUT_shUbc9 1 1
Klf4_shUbc9 INPUT_shUbc9 2 2
```
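As an illustration of the four-column layout described above, here is a minimal Python sketch that reads a design table and converts the replicate columns to integers. The `DesignRow` type and the `read_design` function are hypothetical names introduced for this example; they are not part of the pipeline, and the pipeline's own parsing may differ.

```python
from typing import List, NamedTuple


class DesignRow(NamedTuple):
    ip_name: str      # column 1: name of the IP file
    input_name: str   # column 2: name of the corresponding INPUT file
    ip_rep: int       # column 3: replicate number of the IP file
    input_rep: int    # column 4: replicate number of the INPUT file


def read_design(text: str) -> List[DesignRow]:
    """Parse a design table whose first non-empty line is the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for row_number, line in enumerate(lines[1:], start=1):  # skip the header
        fields = line.split()  # accepts tabs or spaces as separators
        if len(fields) != 4:
            raise ValueError(
                f"design row {row_number}: expected 4 columns, got {len(fields)}"
            )
        ip_name, input_name, ip_rep, input_rep = fields
        rows.append(DesignRow(ip_name, input_name, int(ip_rep), int(input_rep)))
    return rows


example = """IP_NAME INPUT_NAME NB_IP NB_INPUT
H3K27ac_shCtrl INPUT_shCtrl 1 1
Klf4_shCtrl INPUT_shCtrl 2 2
"""
rows = read_design(example)
```

Each returned row pairs one IP replicate with the INPUT replicate it should be compared against.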
* **Config file:** YAML file containing the parameters of all tools.
This file is divided into _chunks_. Each chunk corresponds to one step or one tool.
The first chunk provides input information and assigns working directories.
`input_dir` path to FASTQ files directory.
`input_mate` mate pair format (e.g. `_R[12]` for *MATE* = R1 or R2); it must match the *MATE* field of the FASTQ file names.
`input_extension` filename extension format (e.g. `fastq.gz` or `fq.gz`).
`analysis_dir` path to analysis directory.
`tmpdir` path to temporary directory (e.g. `/tmp/` or other).
```
input_dir: ../ChIP_data
input_mate: '_R[12]'
input_extension: '.fastq.gz'
analysis_dir: $HOME #define for each user
tmpdir: $TMPDIR
```
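To make the role of `input_mate` and `input_extension` concrete, the sketch below combines them into a regular expression and splits a FASTQ file name into its sample prefix and mate tag. This assumes the mate tag sits directly before the extension; the helper names and the example file name are hypothetical, and the pipeline may build its patterns differently.

```python
import re
from typing import Optional, Tuple


def fastq_pattern(input_mate: str, input_extension: str) -> "re.Pattern[str]":
    """Build a regex: <sample><mate><extension>, with the extension taken literally."""
    return re.compile(
        r"^(?P<sample>.+?)(?P<mate>" + input_mate + r")"
        + re.escape(input_extension) + r"$"
    )


def split_fastq_name(filename: str, pattern: "re.Pattern[str]") -> Optional[Tuple[str, str]]:
    """Return (sample, mate) if the file name matches, else None."""
    match = pattern.match(filename)
    if match is None:
        return None
    return match.group("sample"), match.group("mate")


pattern = fastq_pattern("_R[12]", ".fastq.gz")
# Hypothetical file name used only for illustration.
result = split_fastq_name("H3K27ac_shCtrl_Rep1_R1.fastq.gz", pattern)
```

A file whose mate tag or extension does not fit the configured values simply fails to match, which is a cheap way to spot a misconfigured `input_mate` or `input_extension`.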
The design chunk checks that the FASTQ file names match the information in the design file. The `marks`, `condition` and `replicates` parameters must respectively match the *MARK*, *COND* and *REP* fields of the FASTQ file names and of the design file.
For spike-in data, set `spike` to `true` and provide the path to the spike-in genome FASTA file through the `spike_genome_file` parameter.
```
design:
    design_file: config/design.txt
    marks: H3K27ac, Klf4
    condition: shCtrl, shUbc9
    replicates: Rep
    spike: false
    spike_genome_file: genome/dmel9.fa
```
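The consistency check described for the design chunk can be sketched as follows. This is only an illustration under the assumption that IP names have the form `MARK_COND`, as in the design file example; `split_list` and `check_design_names` are hypothetical helpers, not functions of the pipeline.

```python
from typing import Iterable, List


def split_list(value: str) -> List[str]:
    """Turn a comma-separated config value such as 'H3K27ac, Klf4' into a list."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_design_names(ip_names: Iterable[str], marks: str, conditions: str) -> List[str]:
    """Return the IP names that are not MARK_COND with a known MARK and COND."""
    known_marks = set(split_list(marks))
    known_conditions = set(split_list(conditions))
    unexpected = []
    for name in ip_names:
        mark, separator, condition = name.partition("_")
        if not separator or mark not in known_marks or condition not in known_conditions:
            unexpected.append(name)
    return unexpected


unexpected = check_design_names(
    ["H3K27ac_shCtrl", "Klf4_shUbc9", "H3K4me3_shCtrl"],  # last name is deliberately wrong
    marks="H3K27ac, Klf4",
    conditions="shCtrl, shUbc9",
)
```

An empty result means every IP name in the design file is covered by the configured marks and conditions.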
The genome chunk provides information about the reference genome: its directory, the name of the index, and the path to the FASTA file.
```
genome:
    index: yes
    genome_directory: genome/
    name: mm10
    fasta_file: genome/mm10_chr1.fa
```
The fastqc chunk configures the quality control of the FASTQ files.
```
fastqc:
    options: ''
    threads: 4
```
The adapters chunk configures quality trimming and adapter removal with cutadapt. A list of common adapters is provided in the config directory and passed to cutadapt (`adapter_list`). The remaining parameters can then be tuned to match the data.
```
adapters:
    remove: yes
    adapter_list: file:config/adapt.fa
    m: 25
    mode: a
    options: -O 6 --trim-n --max-n 1
    quality: 30
    threads: 4
```
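To show how these values could translate into a cutadapt call, the sketch below assembles a single-end command line from the chunk. The mapping is an assumption made for illustration (`mode` to the adapter flag `-a`, `m` to `-m`, `quality` to `-q`, `threads` to `-j`); the pipeline's actual rule may differ, and `cutadapt_command` and the file names are hypothetical.

```python
import shlex
from typing import Dict, List


def cutadapt_command(cfg: Dict[str, object], fastq_in: str, fastq_out: str) -> List[str]:
    """Build an argument list for a single-end cutadapt run (not executed here)."""
    command = [
        "cutadapt",
        f"-{cfg['mode']}", str(cfg["adapter_list"]),  # adapter type and adapter file
        "-m", str(cfg["m"]),                          # minimum read length after trimming
        "-q", str(cfg["quality"]),                    # quality cutoff
        "-j", str(cfg["threads"]),                    # number of cores
    ]
    command += shlex.split(str(cfg["options"]))       # extra options, passed through as-is
    command += ["-o", fastq_out, fastq_in]
    return command


adapters = {
    "adapter_list": "file:config/adapt.fa",
    "m": 25,
    "mode": "a",
    "options": "-O 6 --trim-n --max-n 1",
    "quality": 30,
    "threads": 4,
}
# Hypothetical input and output file names.
command = cutadapt_command(adapters, "in.fastq.gz", "out.fastq.gz")
```

Building the command as a list rather than a string avoids shell quoting issues if it is later passed to `subprocess.run`.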
The bowtie2_mapping chunk configures the mapping of reads against the genome file (provided by the genome chunk).
```
bowtie2_mapping:
    options: "--very-sensitive --no-unal"
    threads: 4
```
The mark_duplicates chunk marks PCR duplicates in BAM files. For ChIP-seq data, both IP and INPUT need to be deduplicated, so the `dedup_IP` parameter is set to `True`.
```
mark_duplicates:
    do: yes
    dedup_IP: 'True'
    threads: 4
```
The remove_biasedRegions chunk configures the removal of biased genomic regions (previously named blacklisted regions).
```
remove_biasedRegions:
    do: yes
    bed_file: genome/mm10.blacklist.bed
    threads: 1
```
To produce metaregion profiles, coverage needs to be computed for each sample.
See https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html