Commit 04d21226 authored by Rachel Legendre

Merge branch 'dev' into 'master'

Merge new features to master branch

See merge request hub/chipflow!5
parents bad7aae7 3ee37bc7
ePeak - a snakemake-based workflow for the analysis of epigenomic data (ChIP-seq, CUT&RUN, CUT&Tag)
from the raw fastq files to the differential analysis of transcription factor binding
or histone modification marking.
Copyright © 2020 Institut Pasteur (Paris) and CNRS.
------------------------------------------------------------------
Citation:
Daunesse M, Legendre R, Varet H, Pain A, Chica C. ePeak: from replicated chromatin profiling data to epigenomic dynamics
------------------------------------------------------------------
@@ -18,18 +18,9 @@ The code includes contributions and input from:
- Claudia Chica, Institut Pasteur
Some modules of ePeak were initially implemented within the framework of Sequana.
- Sequana
Url: https://github.com/sequana/sequana
Author: Thomas Cokelaer, Dimitri Desvillechabrol, Rachel Legendre, Melissa Cardon
Citation: Cokelaer et al. (2017), "Sequana": a set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, doi:10.21105/joss.00352
License: BSD (see website for details)
-----------------------------------------------------------------
ePeak is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation version 3 of the License.
<img align="left" width="100" height="100" src="images/EpiFlow_logo.png" alt="ePeak">
# ePeak: from replicated chromatin profiling data to epigenomic dynamics
@@ -10,15 +10,15 @@
[[_TOC_]]
# What is ePeak ?
ePeak is a snakemake-based workflow for the analysis of ChIP-seq data, from raw FASTQ files to the differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps such as the quality assessment of the immunoprecipitation using cross-correlation, and the replicate comparison for both narrow and broad peaks. For the differential analysis, ePeak provides linear and non-linear methods for normalisation between samples, as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)).
<img src="images/epeak_workflow.svg" width="700">
# How to install ePeak ?
## Installation with singularity
@@ -29,11 +29,11 @@ Pre-required tools:
- pysam
- singularity
A tutorial to create a conda environment with all dependencies is available here: [env.sh](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/env.sh)
Download the singularity container:
` singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0`
## Manual installation
@@ -73,38 +73,38 @@ module load pysam
* Clone workflow:
`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
* Download singularity container:
```
cd ePeak
singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
```
Then, you can configure the workflow.
# How to run ePeak ?
> Note: you can test the pipeline with an example dataset: [here](https://gitlab.pasteur.fr/hub/ePeak#run-the-pipeline-on-test-data)
## Usage
* Step 1: Download workflow
`git clone https://gitlab.pasteur.fr/hub/ePeak.git`
In any case, if you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this repository and its DOI (see [above](https://gitlab.pasteur.fr/hub/ePeak#how-to-cite-ePeak-)).
If you are using Singularity, you need to copy the Singularity image in the cloned ePeak directory.
* Step 2: Configure workflow
To configure your analysis, you need to fill in two configuration files (stored in `config/`): one describing the experimental design of your project and one setting the parameters of each step of the pipeline:
- [config.yaml](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-design)
- [design.txt](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-design)
In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-multiqc-config)
* Step 3: Execute workflow
@@ -167,15 +167,15 @@ Design columns:
|--------------|---------------------------------------------------------------|
| IP_NAME      | IP FASTQ files prefix (e.g. MARK_COND1)                       |
| INPUT_NAME   | INPUT FASTQ files prefix (e.g. INPUT_COND1)                   |
| NB_IP        | Replicate number of the histone mark or TF (e.g. 1 or 2)      |
| NB_INPUT     | Replicate number of the INPUT file (e.g. 1 or 2)              |
Link to an Example: [design.txt](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/design.txt)
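For illustration, a design file for one mark in two conditions, with two replicates each, could look like this (tab-separated, one row per IP replicate; the sample names are hypothetical):

```
IP_NAME	INPUT_NAME	NB_IP	NB_INPUT
H3K4me3_COND1	INPUT_COND1	1	1
H3K4me3_COND1	INPUT_COND1	2	2
H3K4me3_COND2	INPUT_COND2	1	1
H3K4me3_COND2	INPUT_COND2	2	2
```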
### How to fill the config file
All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill in before running the pipeline. Here is a filled example: [config.yaml](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/config.yaml)
This config file is divided into 2 sections:
@@ -230,7 +230,7 @@ At the beginning of `config/multiqc_config.yaml` file, you have the possibility
# Title to use for the report.
title: "ChIP-seq analysis"
subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells" # Set your own text
intro_text: "MultiQC reports summarise analysis results produced from ePeak"
# Add generic information to the top of reports
report_header_info:
@@ -244,7 +244,7 @@ report_header_info:
## ePeak running modes
After the read pre-processing steps of the pipeline, the MACS2 peak calling software is used to identify enriched regions. Several MACS2 settings are possible:
- To estimate the fragment size, you can use the MACS2 model (default) or PhantomPeakQualTools.
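This choice is exposed in `config.yaml` through the `no_model` switch (a sketch; the exact position of the key inside the peak calling section follows the example config linked above):

```
peak_calling:
  no_model: yes    # use the fragment size estimated by PhantomPeakQualTools instead of the MACS2 model
```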
@@ -268,10 +268,15 @@ peak_calling:
compute_idr:
do: yes
rank: 'signal.value'
thresh: 0.05
intersectionApproach:
do: no
ia_overlap: 0.8
```
> If none or very few peaks pass the IDR, it means that there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach:do` to 'yes'.
@@ -288,10 +293,42 @@ peak_calling:
cutoff: 0.01
genomeSize: mm
compute_idr:
do: no
rank: 'signal.value'
thresh: 0.01
intersectionApproach:
do: yes
ia_overlap: 0.8
```
### Default mode for CUT&RUN
With CUT&RUN data, deduplication is performed only on INPUT/IgG data (`dedup_IP` set to 'False'). A stringent peak calling is then performed with SEACR, and the Intersection Approach is used, with the IA overlap parameter on peaks set to 0.8.
```
mark_duplicates:
do: yes
dedup_IP: 'False'
threads: 4
seacr:
do: yes
threshold: 'stringent'
norm: 'norm'
compute_idr:
do: no
rank: 'signal.value'
thresh: 0.01
intersectionApproach:
do: yes
ia_overlap: 0.8
```
@@ -302,18 +339,21 @@ You need to have sra-toolkit installed before downloading the test data.
```
# Download genome references
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz
# get blacklisted regions
wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
gunzip mm10.blacklist.bed.gz
# download bed file for GeneBody plot
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.refGene.gtf.gz
gunzip mm10.refGene.gtf.gz
# create genome directory
mkdir genome
mv mm10.* genome/
# copy config file
cp test/config.yaml config/config.yaml
@@ -336,6 +376,7 @@ rm *.sra
for file in *.fastq ; do
pigz "$file" ;
done
```
@@ -356,16 +397,16 @@ done
- Inside `IP_NAME` you can use "-" but not "\_", because "\_" is used to separate `MARK`, `COND`, `REP` and `MATE` in FASTQ filenames. For example: `TF4_Ctl-HeLa_rep1_R1.fastq.gz`
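The naming convention above can be checked with a one-line split (a minimal illustration; the sample name is hypothetical):

```python
# FASTQ prefixes decompose on "_" into MARK, COND, REP and MATE
prefix = "TF4_Ctl-HeLa_rep1_R1"
mark, cond, rep, mate = prefix.split("_")
print(mark, cond, rep, mate)  # TF4 Ctl-HeLa rep1 R1
```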
**Can I use relative paths in the config?**
- Yes, but you need to consider the ePeak directory as the origin.
**What if I have 3 INPUT replicates?**
- You can use ePeak with more than 2 replicates; the replicates will be merged in a Maxi Pool.
**What if I have 3 IP replicates?**
- The IDR for 3 IP replicates is not yet implemented.
**What INPUT is used when I have 2 INPUTs?**
- The first INPUT can be used for peak calling on all IPs to reduce technical variability, but each IP can be associated to a specific INPUT in the design file.
**What if I have no INPUT?**
- The pipeline fails to run if no INPUT is provided. Following ENCODE guidelines, a control experiment must be performed, and it is used during peak calling.
@@ -388,19 +429,19 @@ done
**Can I force the re-calculation of all the steps ?**
- Yes, you can add this snakemake option `--forceall` to force the execution of the first rule.
**Can I rename ePeak directory ?**
- Yes, but the directory name must match the one in config.yaml (`analysis_dir`), or you can use a relative path.
**What if the pipeline fails because the IDR doesn't select enough peaks?**
- If none or very few peaks pass the IDR, it means that there is too much variability between your replicates, i.e. they are probably not good replicates. If you want to proceed with the analysis anyway, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach:do` to 'yes'.
**When should I use PhantomPeakQualTools' fragment size estimation instead of MACS2's?**
- If MACS2 cannot compute the fragment size estimation (or if you prefer), set `no_model` to yes, and the fragment length used by MACS2 will be the one computed by PhantomPeakQualTools for each sample.
**What if I don't know whether my chromatin factor is narrow or broad?**
- The output directory names of the peak calling, peak reproducibility and differential analysis steps include the peak calling mode name, the peak reproducibility method name, and the normalisation and variance estimation method name. This allows ePeak to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- For example, if you have run the pipeline in narrow mode, and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories.
- In case of an unknown chromatin factor, we advise running ePeak in narrow mode with IDR and IA, and afterwards in broad mode. Results from narrow peak calling will be stored in the "06-PeakCalling/Narrow" directory, and broad peak calling results in "06-PeakCalling/Broad".
---
@@ -413,6 +454,6 @@ done
* Claudia Chica
# How to cite ePeak ?
https://doi.org/10.1101/2021.02.02.429342
#########################################################################
# ePeak: Standardized and reproducible ChIP-seq analysis from raw      #
# data to differential analysis #
# Authors: Rachel Legendre, Maelle Daunesse #
# Copyright (c) 2019-2020 Institut Pasteur (Paris) and CNRS. #
# #
# This file is part of ePeak workflow. #
# #
# ePeak is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. #
# #
# ePeak is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details.                          #
# #
# You should have received a copy of the GNU General Public License #
# along with ePeak (LICENSE). #
# If not, see <https://www.gnu.org/licenses/>. #
#########################################################################
import pandas as pd
from fnmatch import fnmatch
from re import sub, match
from itertools import repeat, chain
import os
import glob
#-------------------------------------------------------
# read config files
@@ -37,7 +38,7 @@ configfile: "config/config.yaml"
#put a relative path
RULES = os.path.join("workflow", "rules")
design_user = pd.read_csv(config["design"]["design_file"], header=0, sep='\t')
"""
REQUIREMENTS IN DESIGN:
@@ -82,6 +83,27 @@ marks = [ x.strip() for x in (config["design"]["marks"]).split(",")]
conds = [ x.strip() for x in (config["design"]["condition"]).split(",")]
rep_flag = config["design"]["replicates"]
# -----------------------------------------------
# Compute number of replicates by IP/INPUT
# get unique IP and INPUT names
unique_ip = pd.unique(design_user["IP_NAME"])
unique_input = pd.unique(design_user["INPUT_NAME"])
# fill the new design with information about IP
design = pd.DataFrame(unique_ip, columns = ['IP_NAME'])
# count replicates per IP, keeping the order of unique_ip
design['NB_IP'] = design_user["IP_NAME"].value_counts()[unique_ip].values
#initialize INPUT related columns
inputs = []
nb_input = []
# fill the new design with information about INPUT
for row in design.itertuples(index=True, name='Pandas'):
ip_rows = design_user.loc[design_user['IP_NAME'] == getattr(row, "IP_NAME"), 'INPUT_NAME']
inputs.append(ip_rows.iloc[0])
nb_input.append(int(ip_rows.value_counts().iloc[0]))
design['INPUT_NAME'] = inputs
design['NB_INPUT'] = nb_input
# -----------------------------------------------
# get list of INPUT files
@@ -123,7 +145,7 @@ for row in design.itertuples(index=True, name='Pandas'):
if mark == getattr(row, "IP_NAME").split("_")[0]:
nb_rep = getattr(row, "NB_IP")
if (getattr(row, "IP_NAME")).endswith(tuple(conds)) and getattr(row, "NB_IP") > 1:
if nb >= 1 and mark not in MARK_OK:
MARK_OK.append(mark)
CORR_INPUT_OK.append(getattr(row, "INPUT_NAME").split("_")[0])
break
@@ -131,6 +153,7 @@ for row in design.itertuples(index=True, name='Pandas'):
nb += 1
nb = 0
if config["differential_analysis"]["input_counting"]:
CORR_INPUT = [] # corresponding INPUT files for input counting
for mark in CORR_INPUT_OK:
@@ -164,6 +187,17 @@ if len(conds) > 1 :
"file and fastq filenames: %s not found %s times in %s" % (getattr(row, "INPUT_NAME"), getattr(row, "NB_INPUT"),samples))
# -----------------------------------------------
# get correspondence between IP and INPUT files
d = {}
d["IP"] = []
d["INPUT"] = []
for row in design_user.itertuples(index=True, name='Pandas'):
d["IP"].append(getattr(row, "IP_NAME") +"_"+ rep_flag + str(getattr(row, "NB_IP")))
d["INPUT"].append(getattr(row, "INPUT_NAME") +"_"+ rep_flag + str(getattr(row, "NB_INPUT")))
IP_INPUT = pd.DataFrame.from_dict(d)
# -----------------------------------------------
# get REP names
@@ -215,30 +249,36 @@ for row in design.itertuples(index=True, name='Pandas'):
IP_IDR = []
IP_SPR = []
IP_PPR = []
SPR_POOL = []
PPR_POOL = []
INPUT_SPR = []
INPUT_PPR = []
#IDR_REP = []
# only for IPs with more than one replicate
for cond in IP_REP:
tmp = []
tmp2 = []
for ip in IP_ALL:
name = ip.split("_" + rep_flag)
if ip.startswith(cond):
spr_file = sub(rep_flag, 'SPR', ip)
ppr_file = sub(rep_flag, 'PPR', ip)
pool_file = sub(r''+rep_flag+'[0-9]*', 'PPRPool', ip)
tmp.append(ip)
#IDR_REP.append("Rep"+name[1])
IP_SPR.append(spr_file+".1")
IP_SPR.append(spr_file+".2")
IP_PPR.append(ppr_file)
SPR_POOL.append(pool_file)
PPR_POOL.append(pool_file)
# add corresponding INPUT
#get INPUT for this IP and this rep
ctl_rows = design_user.loc[(design_user['IP_NAME'] == name[0]) & (design_user['NB_IP'] == int(name[1]))]
ctl = ctl_rows['INPUT_NAME'].iloc[0] + "_" + rep_flag + str(ctl_rows['NB_INPUT'].iloc[0])
#check if the INPUT is a pool or not
if design.loc[design['INPUT_NAME'] == (design_user.loc[design_user['IP_NAME'] == name[0], 'INPUT_NAME'].iloc[0]), 'NB_INPUT'].iloc[0] > 1 :
ctl_pool = (design_user.loc[design_user['IP_NAME'] == name[0], 'INPUT_NAME'].iloc[0])+'_Pool'
else: ctl_pool = ctl
#tmp_df from all preIDR + input
tmp_df = pd.DataFrame({"IP":[spr_file+".1",spr_file+".2", ppr_file, pool_file],"INPUT":[ctl, ctl, ctl, ctl_pool]})
#add to IP_INPUT (DataFrame.append was removed in pandas 2.0)
IP_INPUT = pd.concat([IP_INPUT, tmp_df], ignore_index = True)
IP_IDR.append(tmp)
@@ -248,16 +288,18 @@ for cond in IP_REP:
PPR_POOL = []
INPUT_POOL = []
for row in design.itertuples(index=True, name='Pandas'):
if getattr(row, "NB_IP") > 1 and getattr(row, "NB_INPUT") > 1 :
PPR_POOL.append(getattr(row, "IP_NAME") + "_PPRPool")
INPUT_POOL.append(getattr(row, "INPUT_NAME") + "_Pool")
pass_pool = 0
#elif getattr(row, "NB_IP") > 1 and getattr(row, "NB_INPUT") < 2 :
elif getattr(row, "NB_IP") > 1 and getattr(row, "NB_INPUT") == 1 :
PPR_POOL.append(getattr(row, "IP_NAME") + "_PPRPool")
INPUT_POOL.append(getattr(row, "INPUT_NAME")+ "_" + rep_flag + "1")
pass_pool = 1
# all files passing PhantomPeakQualTool if no-model is chosen
ALL = IP_ALL + IP_SPR + IP_PPR + PPR_POOL
@@ -266,14 +308,11 @@ ALL = IP_ALL + IP_SPR + IP_PPR + PPR_POOL
ALL_IP_PC = IP_ALL + IP_SPR + IP_PPR + PPR_POOL
ALL_INPUT_PC = INPUT + INPUT_SPR + INPUT_PPR + INPUT_POOL
# get files for IDR
CASE = [rep_flag, "PPR", "SPR1.", "SPR2."]*len(IP_REP)
REP_IDR = list(chain(*zip(*repeat(IP_REP,4))))
IN_IDR = list(chain(*zip(*repeat(IP_REP_DUP,4))))
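# Note: chain(*zip(*repeat(IP_REP, 4))) repeats each IP name four times
# consecutively, so each replicate group lines up with the four CASE entries,
# e.g. ["A", "B"] -> ["A", "A", "A", "A", "B", "B", "B", "B"].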
# -----------------------------------------------
# Add wildcard constraints
@@ -341,7 +380,7 @@ if config["adapters"]["remove"] :
cutadapt_min = config["adapters"]["m"]
cutadapt_qual = config["adapters"]["quality"]
cutadapt_log = os.path.join(analysis_dir, "01-Trimming/logs/{SAMPLE}_trim.txt")
#final_output.extend(expand(cutadapt_output, SAMPLE=samples))
include: os.path.join(RULES, "cutadapt.rules")
else:
@@ -352,7 +391,7 @@ else:
# genome gestion
#----------------------------------
ref = config["genome"]["name"]
#if config["design"]["spike"]:
# ref += [ "spikes" ]
@@ -376,7 +415,7 @@ if config["genome"]["index"]:
bowtie2_index_output_done = os.path.join(config["genome"]["genome_directory"],"{REF}.1.bt2")
bowtie2_index_output_prefix = os.path.join(config["genome"]["genome_directory"],"{REF}")
bowtie2_index_log = os.path.join(analysis_dir, "02-Mapping/logs/bowtie2_{REF}_indexing.log")
#final_output.extend(expand(bowtie2_index_output_done, REF=ref))
include: os.path.join(RULES, "bowtie2_index.rules")
else:
@@ -396,22 +435,37 @@ bowtie2_mapping_logs_err = os.path.join(analysis_dir, "02-Mapping/logs/{SAMPLE}_
bowtie2_mapping_logs_out = os.path.join(analysis_dir, "02-Mapping/logs/{SAMPLE}_{REF}_mapping.out")
bowtie2_mapping_prefix_index = bowtie2_index_output_prefix
bowtie2_mapping_options = config["bowtie2_mapping"]["options"]
#final_output.extend(expand(bowtie2_mapping_sort, SAMPLE=samples, REF=ref))
include: os.path.join(RULES, "bowtie2_mapping.rules")
biasedRegions_dir = os.path.join(analysis_dir, "02-Mapping")
biasedRegions = ""
#----------------------------------
# Mark duplicated reads
#----------------------------------
dedup_IP = config['mark_duplicates']['dedup_IP']
def dedup(wildcards):
if dedup_IP == "true":
return "true"
elif (wildcards.SAMPLE in IP_ALL) and dedup_IP == "false":
return "false"
else:
return "true"
if config['mark_duplicates']['do']:
biasedRegions = "_dedup"
biasedRegions_dir = os.path.join(analysis_dir, "03-Deduplication")
mark_duplicates_input = bowtie2_mapping_sort
mark_duplicates_output = os.path.join(analysis_dir, "03-Deduplication/{{SAMPLE}}_{{REF}}_sort{}.bam".format(biasedRegions))
mark_duplicates_metrics = os.path.join(analysis_dir, "03-Deduplication/{SAMPLE}_{REF}_sort_dedup.txt")
mark_duplicates_remove = dedup
mark_duplicates_log_std = os.path.join(analysis_dir, "03-Deduplication/logs/{SAMPLE}_{REF}_sort_dedup.out")
mark_duplicates_log_err = os.path.join(analysis_dir, "03-Deduplication/logs/{SAMPLE}_{REF}_sort_dedup.err")
mark_duplicates_tmpdir = config['tmpdir']
#final_output.extend(expand(mark_duplicates_output, SAMPLE=samples, REF=ref))
include: os.path.join(RULES, "mark_duplicates.rules")
@@ -421,46 +475,109 @@ include: os.path.join(RULES, "mark_duplicates.rules")
if config["design"]["spike"]:
# counting on spikes
spikes_counting_input = expand(os.path.join(analysis_dir, "{}/{{SAMPLE}}_{{REF}}_sort{}.bam".format(biasedRegions_dir, biasedRegions)), SAMPLE=samples, REF="spikes")
spikes_counting_output_json = os.path.join(analysis_dir, "09-CountMatrix/Spikes_count.json")
spikes_counting_output = os.path.join(analysis_dir, "Spikes_metrics.out")
spikes_counting_log = os.path.join(analysis_dir, "03-Deduplication/logs/Spikes_metrics.out")
final_output.extend([spikes_counting_output_json])
include: os.path.join(RULES, "spikes_counting.rules")
#compute scale factor if spike-in
compute_scaling_factor_input = "{}/{{SAMPLE}}_spikes_sort{}.bam".format(biasedRegions_dir, biasedRegions)
compute_scaling_factor_output = "{}/{{SAMPLE}}_scaleFactor.txt".format(biasedRegions_dir)
#final_output.extend(expand(compute_scaling_factor_output, SAMPLE=samples))
include: os.path.join(RULES, "compute_scaling_factor.rules")
#----------------------------------
# Remove biased regions
#----------------------------------
if config["remove_biasedRegions"]["do"]:
remove_biasedRegions_input = os.path.join(analysis_dir, "{}/{{SAMPLE}}_{{REF}}_sort{}.bam".format(biasedRegions_dir, biasedRegions))
biasedRegions += "_biasedRegions"
biasedRegions_dir = os.path.join(analysis_dir, "04-NobiasedRegions")
remove_biasedRegions_output = os.path.join(analysis_dir, "04-NobiasedRegions/{{SAMPLE}}_{{REF}}_sort{}.bam".format(biasedRegions))
remove_biasedRegions_log_std = os.path.join(analysis_dir, "04-NobiasedRegions/logs/{{SAMPLE}}_{{REF}}_sort{}.out".format(biasedRegions))
remove_biasedRegions_log_err = os.path.join(analysis_dir, "04-NobiasedRegions/logs/{{SAMPLE}}_{{REF}}_sort{}.err".format(biasedRegions))
#final_output.extend(expand(remove_biasedRegions_output, SAMPLE=samples, REF=ref))
include: os.path.join(RULES, "remove_biasedRegions.rules")
#----------------------------------
# Coverage step
#----------------------------------
if config["bamCoverage"]["do"]:
bamCoverage_input = "{}/{{SAMPLE}}_{{REF}}_sort{}.bam".format(biasedRegions_dir, biasedRegions)
bamCoverage_logs = os.path.join(analysis_dir, "12-IGV/logs/{SAMPLE}_{REF}.out")
bamCoverage_output = os.path.join(analysis_dir, "12-IGV/{SAMPLE}_{REF}_coverage.bw")
bamCoverage_options = config['bamCoverage'