Commit 04d21226 authored by Rachel  LEGENDRE's avatar Rachel LEGENDRE
Browse files

Merge branch 'dev' into 'master'

Merge new features to master branch

See merge request hub/chipflow!5
parents bad7aae7 3ee37bc7
ChIPflow - a snakemake-based workflow for the analysis of epigenomic data (ChIP-seq) ePeak - a snakemake-based workflow for the analysis of epigenomic data (ChIP-seq, CUT&RUN, CUT&Tag)
from the raw fastq files to the differential analysis of transcription factor binding from the raw fastq files to the differential analysis of transcription factor binding
or histone modification marking. or histone modification marking.
Copyright © 2020 Institut Pasteur (Paris) and CNRS. Copyright © 2020 Institut Pasteur (Paris) and CNRS.
------------------------------------------------------------------ ------------------------------------------------------------------
Citation: Citation:
Daunesse M, Legendre R, Varet H, Pain A, Chica C. ChIPflow: from raw data to epigenomic dynamics Daunesse M, Legendre R, Varet H, Pain A, Chica C. ePeak: from replicated chromatin profiling data to epigenomic dynamics
------------------------------------------------------------------ ------------------------------------------------------------------
...@@ -18,18 +18,9 @@ The code includes contributions and input from: ...@@ -18,18 +18,9 @@ The code includes contributions and input from:
- Claudia Chica, Institut Pasteur - Claudia Chica, Institut Pasteur
Some modules of ChIPflow were initially implemented within the framework of Sequana.
- Sequana
Url: https://github.com/sequana/sequana
Author: Thomas Cokelaer, Dimitri Desvillechabrol, Rachel Legendre, Melissa Cardon
Citation: Cokelaer et al, (2017), 'Sequana': a Set of Snakemake NGS pipelines, Journal of Open Source Software, 2(16), 352, doi:10.21105/joss.00352
License: BSD (see website for details)
----------------------------------------------------------------- -----------------------------------------------------------------
ChIPflow is free software: you can redistribute it and/or modify ePeak is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by it under the terms of the GNU General Public License as published by
the Free Software Foundation version 3 of the License. the Free Software Foundation version 3 of the License.
......
<img align="left" width="100" height="100" src="images/EpiFlow_logo.png" alt="ChIPflow"> <img align="left" width="100" height="100" src="images/EpiFlow_logo.png" alt="ePeak">
# ChIPflow: from replicated ChIP-seq raw data to epigenomic dynamics # ePeak: from replicated chromatin profiling data to epigenomic dynamics
...@@ -10,15 +10,15 @@ ...@@ -10,15 +10,15 @@
[[_TOC_]] [[_TOC_]]
# What is ChIPflow ? # What is ePeak ?
ChIPflow is a snakemake-based workflow for the analysis of ChIP-seq data from raw FASTQ files to differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps like the quality assessment of the immunoprecipitation using the cross correlation and the replicate comparison for both narrow and broad peaks. For the differential analysis ChIPflow provides linear and non linear methods for normalisation between samples as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)). ePeak is a snakemake-based workflow for the analysis of ChIP-seq data from raw FASTQ files to differential analysis of transcription factor binding or histone modification marking. It streamlines critical steps like the quality assessment of the immunoprecipitation using the cross correlation and the replicate comparison for both narrow and broad peaks. For the differential analysis ePeak provides linear and non linear methods for normalisation between samples as well as conservative and stringent models for estimating the variance and testing the significance of the observed differences (see [chipflowr](https://gitlab.pasteur.fr/hub/chipflowr)).
<img src="images/chipflow_pipeline.svg" width="700"> <img src="images/epeak_workflow.svg" width="700">
# How to install ChIPflow ? # How to install ePeak ?
## Installation with singularity ## Installation with singularity
...@@ -29,11 +29,11 @@ Pre-required tools: ...@@ -29,11 +29,11 @@ Pre-required tools:
- pysam - pysam
- singularity - singularity
A tutorial to create a conda environment with all dependencies is available here : [env.sh](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/env.sh) A tutorial to create a conda environment with all dependencies is available here : [env.sh](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/env.sh)
Download the singularity container: Download the singularity container:
` singularity pull --arch amd64 --name chipflow.img library://rlegendre/default/chipflow:latest ` ` singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0`
## Manual installation ## Manual installation
...@@ -73,38 +73,38 @@ module load pysam ...@@ -73,38 +73,38 @@ module load pysam
* Clone workflow: * Clone workflow:
`git clone https://gitlab.pasteur.fr/hub/chipflow.git` `git clone https://gitlab.pasteur.fr/hub/ePeak.git`
* Download singularity container: * Download singularity container:
``` ```
cd chipflow cd ePeak
singularity pull --arch amd64 --name chipflow.img library://rlegendre/default/chipflow:latest singularity pull --arch amd64 --name epeak.img library://rlegendre/epeak/epeak:1.0
``` ```
Then, you can configure the workflow. Then, you can configure the workflow.
# How to run ChIPflow ? # How to run ePeak ?
> Note: you can test the pipeline with an example datasets: [here](https://gitlab.pasteur.fr/hub/chipflow#run-the-pipeline-on-test-data) > Note: you can test the pipeline with an example datasets: [here](https://gitlab.pasteur.fr/hub/ePeak#run-the-pipeline-on-test-data)
## Usage ## Usage
* Step 1: Download workflow * Step 1: Download workflow
`git clone https://gitlab.pasteur.fr/hub/chipflow.git` `git clone https://gitlab.pasteur.fr/hub/ePeak.git`
In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, its DOI (see [above](https://gitlab.pasteur.fr/hub/chipflow#how-to-cite-chipflow-)) In any case, if you use this workflow in a paper, don't forget to give credits to the authors by citing the URL of this repository and, its DOI (see [above](https://gitlab.pasteur.fr/hub/ePeak#how-to-cite-ePeak-))
If you are using Singularity, you need to copy the Singularity image in the cloned ChIPflow directory. If you are using Singularity, you need to copy the Singularity image in the cloned ePeak directory.
* Step 2: Configure workflow * Step 2: Configure workflow
To configure your analysis, you need to fill 2 configuration files, one to specify the design experimental of you project and one to parameter each step of the pipeline (stored in `config/`): To configure your analysis, you need to fill 2 configuration files, one to specify the design experimental of you project and one to parameter each step of the pipeline (stored in `config/`):
- [config.yaml](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-design) - [config.yaml](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-design)
- [design.txt](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-design) - [design.txt](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-design)
In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/chipflow#how-to-fill-the-multiqc-config) In addition, you can customize the MultiQC report by configuring this file: [multiqc_config.yaml](https://gitlab.pasteur.fr/hub/ePeak#how-to-fill-the-multiqc-config)
* Step 3: Execute workflow * Step 3: Execute workflow
...@@ -167,15 +167,15 @@ Design columns: ...@@ -167,15 +167,15 @@ Design columns:
|--------------|---------------------------------------------------------------| |--------------|---------------------------------------------------------------|
| IP_NAME | IP FASTQ files prefix (i.e. MARK_COND1) | | IP_NAME | IP FASTQ files prefix (i.e. MARK_COND1) |
| INPUT_NAME | INPUT FASTQ files prefix (i.e. INPUT_COND1) | | INPUT_NAME | INPUT FASTQ files prefix (i.e. INPUT_COND1) |
| NB_IP | Number of replicates of the histone mark or TFs (i.e. 1 or 2) | | NB_IP | Replicate number of the histone mark or TFs (i.e. 1 or 2) |
| NB_INPUT | Number of replicates of INPUT files (i.e. 1 or 2) | | NB_INPUT | Replicate number of INPUT file (i.e. 1 or 2) |
Link to an Example: [design.txt](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/design.txt) Link to an Example: [design.txt](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/design.txt)
### How to fill the config file ### How to fill the config file
All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill before running the pipeline. Here is an filled example: [config.yaml](https://gitlab.pasteur.fr/hub/chipflow/-/blob/master/test/config.yaml) All the parameters to run the pipeline are gathered in a YAML config file that the user has to fill before running the pipeline. Here is an filled example: [config.yaml](https://gitlab.pasteur.fr/hub/ePeak/-/blob/master/test/config.yaml)
This config file is divided in 2 sections: This config file is divided in 2 sections:
...@@ -230,7 +230,7 @@ At the beginning of `config/multiqc_config.yaml` file, you have the possibility ...@@ -230,7 +230,7 @@ At the beginning of `config/multiqc_config.yaml` file, you have the possibility
# Title to use for the report. # Title to use for the report.
title: "ChIP-seq analysis" title: "ChIP-seq analysis"
subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells" # Set your own text subtitle: "ChIP-seq analysis of CTCF factor in breast tumor cells" # Set your own text
intro_text: "MultiQC reports summarise analysis results produced from ChIPflow" intro_text: "MultiQC reports summarise analysis results produced from ePeak"
# Add generic information to the top of reports # Add generic information to the top of reports
report_header_info: report_header_info:
...@@ -244,7 +244,7 @@ report_header_info: ...@@ -244,7 +244,7 @@ report_header_info:
## ChIPflow running modes ## ePeak running modes
After the read pre-processing steps of the pipeline, MACS2 peak calling software is used to identify enriched regions. Several settings of MACS2 are possible: After the read pre-processing steps of the pipeline, MACS2 peak calling software is used to identify enriched regions. Several settings of MACS2 are possible:
- To estimate the fragment size you can: use MACS2 model (default) or use PhantomPeakQualTool. - To estimate the fragment size you can: use MACS2 model (default) or use PhantomPeakQualTool.
...@@ -268,10 +268,15 @@ peak_calling: ...@@ -268,10 +268,15 @@ peak_calling:
compute_idr: compute_idr:
do: yes
rank: 'signal.value' rank: 'signal.value'
thresh: 0.05 thresh: 0.05
intersectionApproach: no
intersectionApproach:
do: no
ia_overlap: 0.8 ia_overlap: 0.8
``` ```
> If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes'. > If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes'.
...@@ -288,10 +293,42 @@ peak_calling: ...@@ -288,10 +293,42 @@ peak_calling:
cutoff: 0.01 cutoff: 0.01
genomeSize: mm genomeSize: mm
compute_idr: compute_idr:
do: no
rank: 'signal.value' rank: 'signal.value'
thresh: 0.01 thresh: 0.01
intersectionApproach: yes
intersectionApproach:
do: yes
ia_overlap: 0.8
```
### Default mode for cut&run
With cut&run data, make deduplication only on INPUT/IgG data (dedup_IP to False). Then perform a stringent peak calling with SEACR and use Intersection Approach. Overlapping parameter of IA on peaks is set at 0.8.
```
mark_duplicates:
do: yes
dedup_IP: 'False'
threads: 4
seacr:
do: yes
threshold: 'stringent'
norm: 'norm'
compute_idr:
do: no
rank: 'signal.value'
thresh: 0.01
intersectionApproach:
do: yes
ia_overlap: 0.8 ia_overlap: 0.8
``` ```
...@@ -302,18 +339,21 @@ You need to have sra-toolkit installed before to download test data. ...@@ -302,18 +339,21 @@ You need to have sra-toolkit installed before to download test data.
``` ```
# Download genome references # Download genome references
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_genomic.fna.gz wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip GCF_000001635.26_GRCm38.p6_genomic.fna.gz gunzip mm10.fa.gz
# get blacklisted regions # get blacklisted regions
wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz wget http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/mm10-mouse/mm10.blacklist.bed.gz
gunzip mm10.blacklist.bed.gz gunzip mm10.blacklist.bed.gz
# download bed file for GeneBody plot
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/genes/mm10.refGene.gtf.gz
gunzip mm10.refGene.gtf.gz
# create genome directory # create genome directory
mkdir genome mkdir genome
mv GCF_000001635.26_GRCm38.p6_genomic.fna genome/mm10.fa mv mm10.* genome/
mv mm10.blacklist.bed genome/mm10.blacklist.bed
# copy config file # copy config file
cp test/config.yaml config/config.yaml cp test/config.yaml config/config.yaml
...@@ -336,6 +376,7 @@ rm *.sra ...@@ -336,6 +376,7 @@ rm *.sra
for file in *.fastq ; do for file in *.fastq ; do
pigz $file ; pigz $file ;
done done
``` ```
...@@ -356,16 +397,16 @@ done ...@@ -356,16 +397,16 @@ done
- Inside `IP_NAME` you can use "-" but do not "\_" because this is used to separate `MARK`, `COND`, `REP` and `MATE` from FASTQ filenames. For example: `TF4_Ctl-HeLa_rep1_R1.fastq.gz` - Inside `IP_NAME` you can use "-" but do not "\_" because this is used to separate `MARK`, `COND`, `REP` and `MATE` from FASTQ filenames. For example: `TF4_Ctl-HeLa_rep1_R1.fastq.gz`
**Can I use relative path in config ?** **Can I use relative path in config ?**
- yes, but you need to consider ChIPflow directory as origin. - yes, but you need to consider ePeak directory as origin.
**What if I have 3 INPUT replicates?** **What if I have 3 INPUT replicates?**
- You can use ChIPflow with more than 2 replicates, replicates will be merged in a Maxi Pool. - You can use ePeak with more than 2 replicates, replicates will be merged in a Maxi Pool.
**What if I have 3 IP replicate?** **What if I have 3 IP replicate?**
- The IDR for 3 IP replicates is not yet implemented. - The IDR for 3 IP replicates is not yet implemented.
**What INPUT is used when I have 2 INPUTs?** **What INPUT is used when I have 2 INPUTs?**
- Only first INPUT is used for peak calling to reduce technical variability for now. - First INPUT could be used for peak calling on all IP to reduce technical variability, but each IP could be associated to a specific INPUT in the design file.
**What if I have no INPUT ?** **What if I have no INPUT ?**
- The pipeline failed to run if no INPUT are done. Following ENCODE guidelines, control experiment must be performed and are used during peak calling. - The pipeline failed to run if no INPUT are done. Following ENCODE guidelines, control experiment must be performed and are used during peak calling.
...@@ -388,19 +429,19 @@ done ...@@ -388,19 +429,19 @@ done
**Can I force the re-calculation of all the steps ?** **Can I force the re-calculation of all the steps ?**
- Yes, you can add this snakemake option `--forceall` to force the execution of the first rule. - Yes, you can add this snakemake option `--forceall` to force the execution of the first rule.
**Can I rename ChIPflow directory ?** **Can I rename ePeak directory ?**
- yes, but must to be the same as in config.yaml (`analysis_dir`) or use relative path - yes, but must to be the same as in config.yaml (`analysis_dir`) or use relative path
**The pipeline fails because the IDR doesn't select enough reads?** **The pipeline fails because the IDR doesn't select enough reads?**
- If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach` to 'yes'. - If none or very few peaks pass the IDR, this means that there is to much variability between your replicates aka they are probably not good replicates. If you want to proceed anyway with the analysis, you can use the intersection approach (less stringent) instead of the IDR by setting `intersectionApproach:do` to 'yes'.
**When should I use PhantompeakCalling's fragment size estimation instead of MACS2's one?** **When should I use PhantompeakCalling's fragment size estimation instead of MACS2's one?**
- If MACS2 cannot compute the fragment size estimation (or if you want), set `no_model` to yes, and the fragment length use for MACS2 will be the one computed by PhantompeakQualTools for each sample. - If MACS2 cannot compute the fragment size estimation (or if you want), set `no_model` to yes, and the fragment length use for MACS2 will be the one computed by PhantompeakQualTools for each sample.
**What if I don't know if my chromatim factor in narrow or broad?** **What if I don't know if my chromatim factor in narrow or broad?**
- The output directories names of peak Calling, peak reproducibility and differential analysis steps includes the peak calling mode name, the peak reproducibility method name and the normalisation and variance estimation method name. This allows ChIPflow to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output. - The output directories names of peak Calling, peak reproducibility and differential analysis steps includes the peak calling mode name, the peak reproducibility method name and the normalisation and variance estimation method name. This allows ePeak to test multiple combinations of peak calling, peak reproducibility and differential analysis parameters without erasing any output.
- For example, if you have run the pipeline in narrow mode, and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories. - For example, if you have run the pipeline in narrow mode, and you want broad mode, you just need to modify the corresponding parameter in the configuration YAML file. The pipeline will then restart at the peak calling step and all the output will be stored in "06-PeakCalling/{}" directories.
- In case of unknown chromatin factor, we advice to run ChIPflow in narrow mode with IDR and IA, and afterward in broad mode. Results from narrow peak calling will be stored in "06-PeakCalling/Narrow" directory, and in "06-PeakCalling/Broad" for broad peak calling. - In case of unknown chromatin factor, we advice to run ePeak in narrow mode with IDR and IA, and afterward in broad mode. Results from narrow peak calling will be stored in "06-PeakCalling/Narrow" directory, and in "06-PeakCalling/Broad" for broad peak calling.
--- ---
...@@ -413,6 +454,6 @@ done ...@@ -413,6 +454,6 @@ done
* Claudia Chica * Claudia Chica
# How to cite ChIPflow ? # How to cite ePeak ?
https://doi.org/10.1101/2021.02.02.429342 https://doi.org/10.1101/2021.02.02.429342
######################################################################### #########################################################################
# ChIPflow: Standardize and reproducible ChIP-seq analysis from raw # # ePeak: Standardize and reproducible ChIP-seq analysis from raw #
# data to differential analysis # # data to differential analysis #
# Authors: Rachel Legendre, Maelle Daunesse # # Authors: Rachel Legendre, Maelle Daunesse #
# Copyright (c) 2019-2020 Institut Pasteur (Paris) and CNRS. # # Copyright (c) 2019-2020 Institut Pasteur (Paris) and CNRS. #
# # # #
# This file is part of ChIPflow workflow. # # This file is part of ePeak workflow. #
# # # #
# ChIPflow is free software: you can redistribute it and/or modify # # ePeak is free software: you can redistribute it and/or modify #
# it under the terms of the GNU General Public License as published by # # it under the terms of the GNU General Public License as published by #
# the Free Software Foundation, either version 3 of the License, or # # the Free Software Foundation, either version 3 of the License, or #
# (at your option) any later version. # # (at your option) any later version. #
# # # #
# ChIPflow is distributed in the hope that it will be useful, # # ePeak is distributed in the hope that it will be useful, #
# but WITHOUT ANY WARRANTY; without even the implied warranty of # # but WITHOUT ANY WARRANTY; without even the implied warranty of #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the #
# GNU General Public License for more details . # # GNU General Public License for more details . #
# # # #
# You should have received a copy of the GNU General Public License # # You should have received a copy of the GNU General Public License #
# along with ChIPflow (LICENSE). # # along with ePeak (LICENSE). #
# If not, see <https://www.gnu.org/licenses/>. # # If not, see <https://www.gnu.org/licenses/>. #
######################################################################### #########################################################################
import pandas as pd import pandas as pd
from fnmatch import fnmatch from fnmatch import fnmatch
from re import sub, match from re import sub, match
from itertools import repeat, chain from itertools import repeat, chain
import os import os
import glob
#------------------------------------------------------- #-------------------------------------------------------
# read config files # read config files
...@@ -37,7 +38,7 @@ configfile: "config/config.yaml" ...@@ -37,7 +38,7 @@ configfile: "config/config.yaml"
#put a relative path #put a relative path
RULES = os.path.join("workflow", "rules") RULES = os.path.join("workflow", "rules")
design = pd.read_csv(config["design"]["design_file"], header=0, sep='\t') design_user = pd.read_csv(config["design"]["design_file"], header=0, sep='\t')
""" """
REQUIREMENTS IN DESIGN: REQUIREMENTS IN DESIGN:
...@@ -82,6 +83,27 @@ marks = [ x.strip() for x in (config["design"]["marks"]).split(",")] ...@@ -82,6 +83,27 @@ marks = [ x.strip() for x in (config["design"]["marks"]).split(",")]
conds = [ x.strip() for x in (config["design"]["condition"]).split(",")] conds = [ x.strip() for x in (config["design"]["condition"]).split(",")]
rep_flag = config["design"]["replicates"] rep_flag = config["design"]["replicates"]
# -----------------------------------------------
# Compute number of replicates by IP/INPUT
#get uniq IP and INPUT
unique_ip = pd.unique(design_user["IP_NAME"])
unique_input = pd.unique(design_user["INPUT_NAME"])
#Fill new design with informations about IP
design = pd.DataFrame(unique_ip, columns = ['IP_NAME'])
design['NB_IP'] = design_user["IP_NAME"].value_counts().values
#initialize INPUT related columns
inputs = []
nb_input = []
#fill new design with informations about INPUT
for row in design.itertuples(index=True, name='Pandas'):
inputs.append(design_user.loc[design_user['IP_NAME'] == getattr(row, "IP_NAME"), 'INPUT_NAME'].iloc[0])
nb_input.append(int(design_user.loc[design_user['IP_NAME'] == getattr(row, "IP_NAME"), 'INPUT_NAME'].value_counts().values))
design['INPUT_NAME'] = inputs
design['NB_INPUT'] = nb_input
# ----------------------------------------------- # -----------------------------------------------
# get list of INPUT files # get list of INPUT files
...@@ -123,7 +145,7 @@ for row in design.itertuples(index=True, name='Pandas'): ...@@ -123,7 +145,7 @@ for row in design.itertuples(index=True, name='Pandas'):
if mark == getattr(row, "IP_NAME").split("_")[0]: if mark == getattr(row, "IP_NAME").split("_")[0]:
nb_rep = getattr(row, "NB_IP") nb_rep = getattr(row, "NB_IP")
if (getattr(row, "IP_NAME")).endswith(tuple(conds)) and getattr(row, "NB_IP") > 1: if (getattr(row, "IP_NAME")).endswith(tuple(conds)) and getattr(row, "NB_IP") > 1:
if nb >= 1: if nb >= 1 and mark not in MARK_OK:
MARK_OK.append(mark) MARK_OK.append(mark)
CORR_INPUT_OK.append(getattr(row, "INPUT_NAME").split("_")[0]) CORR_INPUT_OK.append(getattr(row, "INPUT_NAME").split("_")[0])
break break
...@@ -131,6 +153,7 @@ for row in design.itertuples(index=True, name='Pandas'): ...@@ -131,6 +153,7 @@ for row in design.itertuples(index=True, name='Pandas'):
nb += 1 nb += 1
nb = 0 nb = 0
if config["differential_analysis"]["input_counting"]: if config["differential_analysis"]["input_counting"]:
CORR_INPUT = [] # corresponding INPUT files for input counting CORR_INPUT = [] # corresponding INPUT files for input counting
for mark in CORR_INPUT_OK: for mark in CORR_INPUT_OK:
...@@ -164,6 +187,17 @@ if len(conds) > 1 : ...@@ -164,6 +187,17 @@ if len(conds) > 1 :
"file and fastq filenames: %s not found %s times in %s" % (getattr(row, "INPUT_NAME"), getattr(row, "NB_INPUT"),samples)) "file and fastq filenames: %s not found %s times in %s" % (getattr(row, "INPUT_NAME"), getattr(row, "NB_INPUT"),samples))
# -----------------------------------------------
# get correspondance between IP and INPUT files
d = {}
d["IP"] = []
d["INPUT"] = []
for row in design_user.itertuples(index=True, name='Pandas'):
d["IP"].append(getattr(row, "IP_NAME") +"_"+ rep_flag + str(getattr(row, "NB_IP")))
d["INPUT"].append(getattr(row, "INPUT_NAME") +"_"+ rep_flag + str(getattr(row, "NB_INPUT")))
IP_INPUT = pd.DataFrame.from_dict(d)
# ----------------------------------------------- # -----------------------------------------------
# get REP names # get REP names
...@@ -215,30 +249,36 @@ for row in design.itertuples(index=True, name='Pandas'): ...@@ -215,30 +249,36 @@ for row in design.itertuples(index=True, name='Pandas'):
IP_IDR = [] IP_IDR = []
IP_SPR = [] IP_SPR = []
IP_PPR = [] IP_PPR = []
SPR_POOL = [] PPR_POOL = []
INPUT_SPR = [] INPUT_SPR = []
INPUT_PPR = [] INPUT_PPR = []
#IDR_REP = [] #IDR_REP = []
# for only IP with more than one replicate
for cond in IP_REP: for cond in IP_REP:
tmp = [] tmp = []
tmp2 = []
for ip in IP_ALL: for ip in IP_ALL:
name = ip.split("_" + rep_flag) name = ip.split("_" + rep_flag)
if ip.startswith(cond): if ip.startswith(cond):
spr_file = sub(rep_flag, 'SPR', ip) spr_file = sub(rep_flag, 'SPR', ip)
ppr_file = sub(rep_flag, 'PPR', ip) ppr_file = sub(rep_flag, 'PPR', ip)
pool_file = sub(rep_flag, 'PPRPool', ip) pool_file = sub(r''+rep_flag+'[0-9]*', 'PPRPool', ip)
tmp.append(ip) tmp.append(ip)
#IDR_REP.append("Rep"+name[1]) #IDR_REP.append("Rep"+name[1])
IP_SPR.append(spr_file+".1") IP_SPR.append(spr_file+".1")
IP_SPR.append(spr_file+".2") IP_SPR.append(spr_file+".2")
IP_PPR.append(ppr_file) IP_PPR.append(ppr_file)
SPR_POOL.append(pool_file) PPR_POOL.append(pool_file)
for row in design.itertuples(index=True, name='Pandas'): # add corresponding INPUT
if (getattr(row, "IP_NAME")) in name[0]: #get INPUT for this IP and this rep
INPUT_PPR.append(getattr(row, "INPUT_NAME")+"_"+rep[0]) ctl = design_user.loc[(design_user['IP_NAME'] == name[0]) & (design_user['NB_IP'] == int(name[1])), 'INPUT_NAME'].iloc[0]+"_"+rep_flag+str(design_user.loc[(design_user['IP_NAME'] == name[0]) & (design_user['NB_IP'] == int(name[1])), 'NB_INPUT'].iloc[0])
INPUT_SPR.append(getattr(row, "INPUT_NAME")+"_"+rep[0]) #check is INPUT is a POOL or not
INPUT_SPR.append(getattr(row, "INPUT_NAME")+"_"+rep[0]) if design.loc[design['INPUT_NAME'] == (design_user.loc[design_user['IP_NAME'] == name[0], 'INPUT_NAME'].iloc[0]), 'NB_INPUT'].iloc[0] > 1 :
ctl_pool = (design_user.loc[design_user['IP_NAME'] == name[0], 'INPUT_NAME'].iloc[0])+'_Pool'