small_RNA-seq.snakefile 209 KB
Newer Older
1
2
"""
Snakefile to analyse small_RNA-seq data.
Blaise Li's avatar
Blaise Li committed
3
4

TODO: Some figures and summaries may be overridden when changing the mapper. The mapper name should be added to their path.
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

There are 2 types of counts. One resulting from small RNA annotations using a custom script, and one resulting from remapping and counting using featureCount.

The first one can contain "chimeric" annotations (possibly resulting from overlapping annotations?):
```
bli@naples:/extra2/Documents/Mhe/bli/small_RNA-seq_analyses$ head results_HRDE1_IP/bowtie2/mapped_C_elegans/annotation/hrde1_flag_HRDE1_IP_1_18-26_on_C_elegans/all_siu_counts.txt
gene	count
DNA?|DNA?|NDNAX1_CE:81	1
DNA?|DNA?|NDNAX2_CE:110_and_WBGene00005903	1
DNA?|DNA?|NDNAX2_CE:45	2
DNA?|DNA?|NDNAX3_CE:31	1
DNA|CMC-Chapaev|Chapaev-1_CE:45	1
DNA|CMC-Chapaev|Chapaev-1_CE:58_and_WBGene00005792	1
DNA|CMC-Chapaev|Chapaev-2_CE:40_and_WBGene00021886	1
DNA|CMC-Chapaev|Chapaev-2_CE:41_and_WBGene00021886	3
DNA|CMC-Chapaev|Chapaev-2_CE:42_and_WBGene00021886	5

Maybe it would be better to not use these counts.
```

25
"""
Blaise Li's avatar
Blaise Li committed
26
27
28
29
import sys
major, minor = sys.version_info[:2]
if major < 3 or (major == 3 and minor < 6):
    sys.exit("Need at least python 3.6\n")
30

31
32
33
34
# TODO: Try to find what the proportion of small reads map at unique position within repetitive elements.
# What is the distribution of unique reads among repetitive elements of a given family: is it evenly spread, or biased?
# Is it different if we use all small RNAs instead of just unique ones?
# For bowtie2, allow MAPQ >= 23 to get "not too ambiguously mapped reads"
35

36
37
# TODO: implement RPM folds from feature_counts results (and also for Ribo-seq pipeline)

38
39
40
41
42
43
44
45
46
47
48
# TODO: scatterplots log(IP_RPM) vs log(input_RPM) (or mutant vs WT), colouring with gene list
# TODO: heatmaps with rows (genes) ranked by expression in WT (mean reference level (wild types or inputs)), columns representing fold in mut vs WT
# Possibly remove the low expressed ones
# TODO: MA plot: fold vs "mean" count (to be defined)?

# TODO: extract most abundant piRNA reads (WT_48hph)
#bioawk -c fastx '{print $seq}' results_prg1_HS/bowtie2/mapped_C_elegans/reads/WT_RT_1_18-26_on_C_elegans/piRNA.fastq.gz | sort | uniq -c | sort -n | mawk '$1 >= 25 {print}' | wc -l
# -> 6857
# map them on the genome (without the piRNA loci), or on pseudo-genomes with specific features (histone genes...), tolerating a certain amount of mismatches (using bowtie1)
# Or: scan the genome (or "pseudo-genome") to find where there are matches with 0 to 3 mismatches at specific positions (13-21) (and perfect matches in the 5' zone)

Blaise Li's avatar
Blaise Li committed
49
50
51
52
53
54
# TODO: meta-profiles with IP and input on the same meta profile for Ortiz_oogenic and Ortiz_spermatogenic
# Make external script to generate meta-profiles on a custom list of libraries, and a given list of genes.

# TODO: boxplots with contrasts as series and a selection of gene lists
# Ex: 3 time points and oogenic and spermatogenic -> 6 boxes (custom program needed)
# Or simpler: one figure per gene list -> done for a full group of contrasts in make_gene_list_lfc_boxplots
55
# TODO: boxplots for all genes, for different small RNA categories (same as above but with all genes)
Blaise Li's avatar
Blaise Li committed
56

Blaise Li's avatar
Blaise Li committed
57
# Total number of "non-structural" (mapped - (fwd-(t,sn,sno,r-RNA))) to compute RPKM
58
59
# Quick-and-dirty: use gene span (TES - TSS) -> done for repeats
# Better: use length of union of exons for each gene -> done for genes
60
# Ideal: use transcriptome-based per-isoform computation
Blaise Li's avatar
Blaise Li committed
61
# Actually, we don't want a normalization by length. We deal with small RNAs, not transcripts
62
# TODO: make a bigwig of IP/input: is it possible?
Blaise Li's avatar
Blaise Li committed
63
64
# For RNA-seq?: use (oocyte, spermatogenic) intersected with "CSR-1-loaded" (=targets for different time point), csr1ADH_vs_WT_all_alltypes_up_genes_ids.txt (and intersected with non-spermatogenic) for meta-profiles

65
66
67

# Possibly filter out on RPKM
# Then, compute folds of RP(K)M IP/input (for a given experiment, i.-e. REP)
68
# and give list of genes sorted by such folds -> see /Gene_lists/csr1_prot_si_supertargets*
69
70
71
72
73
74
75
76

# Exploratory:
# Heatmap (fold)
# genes | experiment

# Then either:
# - take the genes (above log-fold threshold) common across replicates
# - look at irreproducible discovery rate (http://dx.doi.org/doi:10.1214%2F11-AOAS466)
77
# -> define CSR-1-loaded
78
79
80

# For metagene: see --metagene option of computeMatrix
# Retrieve gtf info after filtering out interfering genes based on merged bed
81

82
83
# TODO
# Locate potential targets of piRNA (using perfect complementarity), in a given annot_biotypes category for a given set of piRNA (the whole list, or for instance, up or down-regulated ones...).
84
85
# To do this: map all perfect matches, make a bed file, intersect with the annotation file for the given biotype, with antisense (note that this will automatically filter out self matches) -> done but only for all piRNAs
# Plot the count of endo-siRNA around those sites (from 100 upstream up to 100 downstream, using meta-TSS-like analysis) -> done
86
87
88
89


import os
OPJ = os.path.join
90
from glob import glob
Blaise Li's avatar
Blaise Li committed
91
from re import sub
92
93
from pickle import load
from fileinput import input as finput
94
from sys import stderr
95
from subprocess import Popen, PIPE, CalledProcessError
96
97

# Useful for functional style
98
from itertools import chain, combinations, product, repeat, starmap
99
from functools import partial, reduce
100
from operator import or_ as union
101
from cytoolz import concat, merge_with, take_nth, valmap
Blaise Li's avatar
Blaise Li committed
102
103
104

def concat_lists(lists):
    return list(concat(lists))
105
106
107
108
109

# Useful data structures
from collections import OrderedDict as od
from collections import defaultdict, Counter

110

111
112
113
114
115
116
117
118
119
120
121
import warnings


def formatwarning(message, category, filename, lineno, line):
    """Used to format warning messages."""
    return "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)


warnings.formatwarning = formatwarning


122
123
# from gatb import Bank
from mappy import fastx_read
124
# To parse SAM format
Blaise Li's avatar
Blaise Li committed
125
import pyBigWig
126

Blaise Li's avatar
Blaise Li committed
127
128
# To compute correlation coefficient
from scipy.stats.stats import pearsonr
129
130
# To catch errors when plotting KDE
from scipy.linalg import LinAlgError
131
# For data processing and displaying
132
from sklearn import preprocessing
133
from sklearn.decomposition import PCA
134
135
import matplotlib as mpl
# To be able to run the script without a defined $DISPLAY
136
137
# https://github.com/mwaskom/seaborn/issues/1262
#mpl.use("agg")
138
139
140
141
142
mpl.use("PDF")
#mpl.rcParams["figure.figsize"] = 2, 4
mpl.rcParams["font.sans-serif"] = [
    "Arial", "Liberation Sans", "Bitstream Vera Sans"]
mpl.rcParams["font.family"] = "sans-serif"
Blaise Li's avatar
Blaise Li committed
143
#mpl.rcParams["figure.figsize"] = [16, 30]
144

145
146
147
from matplotlib import cm
from matplotlib.colors import Normalize

148
from matplotlib import numpy as np
149
from math import ceil, floor
Blaise Li's avatar
Blaise Li committed
150
from matplotlib.backends.backend_pdf import PdfPages
151
152
import pandas as pd
import matplotlib.pyplot as plt
153
154
import seaborn as sns
# import seaborn.apionly as sns
155
# import husl
156
# predefined seaborn graphical parameters
Blaise Li's avatar
Blaise Li committed
157
158
sns.set_context("talk")

159
from libsmallrna import PI_MIN, PI_MAX, SI_MIN, SI_MAX
Blaise Li's avatar
Blaise Li committed
160
161
# Do this outside the workflow
#from libhts import gtf_2_genes_exon_lengths, repeat_bed_2_lengths
162
from libdeseq import do_deseq2
163
from libhts import aligner2min_mapq
164
from libhts import status_setter, make_empty_bigwig
165
from libhts import median_ratio_to_pseudo_ref_size_factors, size_factor_correlations
166
from libhts import plot_paired_scatters, plot_norm_correlations, plot_counts_distribution, plot_boxplots, plot_histo
167
from libworkflows import texscape, wc_applied, ensure_relative, cleanup_and_backup
168
169
from libworkflows import get_chrom_sizes, column_converter, make_id_list_getter
from libworkflows import read_int_from_file, strip_split, file_len, last_lines, save_plot, SHELL_FUNCTIONS
170
from libworkflows import filter_combinator, read_feature_counts, sum_feature_counts, sum_htseq_counts, warn_context
171
from smincludes import rules as irules
172
173
strip = str.strip

174
alignment_settings = {"bowtie2": "-L 6 -i S,1,0.8 -N 0"}
175

176
# Positions in small RNA sequences for which to analyse nucleotide distribution
177
178
#POSITIONS = ["first", "last"]
POSITIONS = config["positions"]
179
180
# Orientations of small RNAs with respect to an annotated feature orientation.
# "fwd" and "rev" restrict feature quantification to sense or antisense reads.
181
ORIENTATIONS = config["orientations"]
182
183
184
185
186
# Codes for types of small RNAs
# Codes ending with "u" are for uridinylated endo siRNAs
# (recognized as not having mapped without first trimming a poly-T tail.
# "prot", "te" and "pseu" prefixes indicate the type of mRNA template that is inferred to have been used for endo siRNA synthesis by RdRP
# See small_RNA_seq_annotate.py and libsmallrna.pyx for more details
187
188
#SMALL_TYPES = ["prot_si", "te_si", "pseu_si", "pi", "mi", "prot_siu", "te_siu", "pseu_siu"]
SMALL_TYPES = config["small_types"]
189
190
191
192
193
#SI_TYPES = ["prot_si", "te_si", "pseu_si", "satel_si", "simrep_si"]
#SIU_TYPES = ["prot_siu", "te_siu", "pseu_siu", "satel_siu", "simrep_siu"]
SI_TYPE_PREFIXES = ["prot", "te", "pseu", "satel", "simrep"]
SI_TYPES = [prefix + "_si" for prefix in SI_TYPE_PREFIXES]
SIU_TYPES = [prefix + "_siu" for prefix in SI_TYPE_PREFIXES]
194
# small RNA types on which to run DESeq2
195
DE_TYPES = config["de_types"]
Blaise Li's avatar
Blaise Li committed
196
# small RNA types on which to compute IP/input RPM folds
197
198
199
200
201
# prot_si does not include polyU siRNA, so the counts here may be lower than in pisimi
# siu includes SIU_TYPES, but not SI_TYPES, so the counts here may be lower than in pisimi
# pisimi includes all (SI_TYPES, SIU_TYPES, pi and mi (pi and mi will not be on the same genes as the siRNAs))
IP_TYPES = ["pisimi", "siu", "prot_si"]
#IP_TYPES = config["ip_types"]
202
203
204
205
# Cutoffs in log fold change
LFC_CUTOFFS = [0.5, 1, 2]
UP_STATUSES = [f"up{cutoff}" for cutoff in LFC_CUTOFFS]
DOWN_STATUSES = [f"down{cutoff}" for cutoff in LFC_CUTOFFS]
206
#PISIMI_TYPES = ["si", "pi", "mi", "siu"]
Blaise Li's avatar
Blaise Li committed
207
#STANDARDS = ["zscore", "robust", "minmax", "unit"]
208
#STANDARDS = ["zscore", "robust", "minmax"]
Blaise Li's avatar
Blaise Li committed
209
210
# hexbin jointplot for principal components crashes on MemoryError for PCA without standardization
#STANDARDS = ["robust", "identity"]
211
STANDARDS = ["robust"]
212
213
214

COMPL = {"A" : "T", "C" : "G", "G" : "C", "T" : "A", "N" : "N"}

215
# Possible feature ID conversions
216
ID_TYPES = ["name", "cosmid"]
217
218
219
220
221
222
223
224
225
226
227
228
229

#########################
# Reading configuration #
#########################
# key: library name
# value: 3' adapter sequence
lib2adapt = config["lib2adapt"]
trim5 = config["trim5"]
trim3 = config["trim3"]
# key: library name
# value: path to raw data
lib2raw = config["lib2raw"]
LIBS = list(lib2raw.keys())
230
231
232
233
234
#REF=config["WT"]
#MUT=config["mutant"]
# Used to associate colours to libraries
# key: series name
# value: list of libraries
235
236
237
238
239
240
241
242
colour_series_dict = config["colour_series"]
genotype_series = colour_series_dict["genotype_series"]
# Groups of libraries to put together on a same metagene profile
lib_groups = config["lib_groups"]
GROUP_TYPES = list(lib_groups.keys())
merged_groups = merge_with(concat_lists, *lib_groups.values())
ALL_LIB_GROUPS = list(merged_groups.keys())
#all_libs_in_group = concat_lists(merge_with(concat_lists, *lib_groups.values()).values())
243
REPS = config["replicates"]
244
#TREATS = config["treatments"]
245
# Conditions to compare using DESeq2
246
247
248
249
250
# Note: we don't want contrasts of the following type:
# - ({geno}, {geno}) where various treatments are taken into account
# - ({treat}, {treat}) where various genotypes are taken into account
# We also don't want contrasts where more than one factor vary (ex: prg1_RT vs WT_HS30)
#ALL_COND_PAIRS = [(MUT, REF)] + list(combinations(reversed(TREATS), 2)) + [("%s_%s" % (lib2, treat2), "%s_%s" % (lib1, treat1)) for ((lib1, treat1), (lib2, treat2)) in combinations(product(LIBS, TREATS), 2)]
251
252
253
254
255
256
#COND_PAIRS = [(lib2, lib1) for (lib1, lib2) in [
#    ["%s_%s" % (geno, treat) for (geno, treat) in product(genos, treats)] for (genos, treats) in chain(
#        product(combinations(LIBS, 1), combinations(TREATS, 2)),
#        product(combinations(LIBS, 2), combinations(TREATS, 1)))]]
#CONTRASTS = ["%s_vs_%s" % cond_pair for cond_pair in COND_PAIRS]
#CONTRAST2PAIR = dict(zip(CONTRASTS, COND_PAIRS))
257
DE_COND_PAIRS = config["de_cond_pairs"]
258
259
260
261
262
msg = "\n".join([
    "Some contrats do not use known library names.",
    "Contrasts:"
    ", ".join([f"({cond}, {ref})" for (cond, ref) in DE_COND_PAIRS])])
assert all([cond in LIBS and ref in LIBS for (cond, ref) in DE_COND_PAIRS]), msg
263
IP_COND_PAIRS = config["ip_cond_pairs"]
264
265
266
267
268
msg = "\n".join([
    "Some contrats do not use known library names.",
    "Contrasts:"
    ", ".join([f"({cond}, {ref})" for (cond, ref) in IP_COND_PAIRS])])
assert all([cond in LIBS and ref in LIBS for (cond, ref) in IP_COND_PAIRS]), ""
269
COND_PAIRS = DE_COND_PAIRS + IP_COND_PAIRS
270
DE_CONTRASTS = [f"{cond1}_vs_{cond2}" for [cond1, cond2] in DE_COND_PAIRS]
271
IP_CONTRASTS = [f"{cond1}_vs_{cond2}" for [cond1, cond2] in IP_COND_PAIRS]
Blaise Li's avatar
Blaise Li committed
272
contrasts_dict = {"de" : DE_CONTRASTS, "ip" : IP_CONTRASTS}
273
CONTRASTS = DE_CONTRASTS + IP_CONTRASTS
274
CONTRAST2PAIR = dict(zip(CONTRASTS, COND_PAIRS))
275
276
MIN_LEN = config["min_len"]
MAX_LEN = config["max_len"]
Blaise Li's avatar
Blaise Li committed
277
size_selected = "%s-%s" % (MIN_LEN, MAX_LEN)
278
279
280
281
282
283
284
285
286
287
288
# bli@naples:/pasteur/entites/Mhe/Genomes$ cat C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Annotation/Genes/miRNA.bed | mawk '{hist[$3-$2]++} END {for (l in hist) print l"\t"hist[l]}' | sort -n
# 17	1
# 18	2
# 19	4
# 20	18
# 21	57
# 22	172
# 23	138
# 24	45
# 25	11
# 26	3
289
# This should ideally come from genome configuration:
290
291
292
293
294
295
MI_MAX = 26
read_type_max_len = {
    size_selected: int(MAX_LEN),
    "pi": min(PI_MAX, int(MAX_LEN)),
    "si": min(SI_MAX, int(MAX_LEN)),
    "mi": min(MI_MAX, int(MAX_LEN))}
296
297
298
299
300
301
READ_TYPES_FOR_COMPOSITION = [
    size_selected, "nomap_siRNA", "all_siRNA"] + list(map(lambda x: "%s%s" % (x, "RNA"), SMALL_TYPES))
READ_TYPES_FOR_MAPPING = [
    size_selected, "nomap_siRNA", "prot_sisiuRNA",
    "all_siRNA", "all_siuRNA", "all_sisiuRNA",
    "siRNA", "siuRNA", "sisiuRNA"] + list(map(lambda x: "%s%s" % (x, "RNA"), SMALL_TYPES))
302
303
# Types of annotation features, as defined in the "gene_biotype"
# GTF attribute sections of the annotation files.
304
COUNT_BIOTYPES = config["count_biotypes"]
305
BOXPLOT_BIOTYPES = config.get("boxplot_biotypes", [])
306
ANNOT_BIOTYPES = config["annot_biotypes"]
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326

RMSK_BIOTYPES = [
    "DNA_transposons_rmsk",
    "RNA_transposons_rmsk",
    "satellites_rmsk",
    "simple_repeats_rmsk"]
RMSK_FAMILIES_BIOTYPES = [
    "DNA_transposons_rmsk_families",
    "RNA_transposons_rmsk_families",
    "satellites_rmsk_families",
    "simple_repeats_rmsk_families"]

BIOTYPES_TO_JOIN = {
    "all_rmsk": [biotype for biotype in COUNT_BIOTYPES + BOXPLOT_BIOTYPES if biotype in RMSK_BIOTYPES],
    "all_rmsk_families": [biotype for biotype in COUNT_BIOTYPES + BOXPLOT_BIOTYPES if biotype in RMSK_FAMILIES_BIOTYPES],
    # We only count "protein_coding", not "protein_codin_{5UTR,CDS,3UTR}"
    "alltypes": [biotype for biotype in COUNT_BIOTYPES + BOXPLOT_BIOTYPES if not biotype.startswith("protein_coding_")]}
# Filter out those with empty values
JOINED_BIOTYPES = [joined_biotype for (joined_biotype, biotypes) in BIOTYPES_TO_JOIN.items() if biotypes]

327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
# cf Gu et al. (2009), supplement:
# -----
# To compare 22G-RNAs derived from a gene, transposon, or pseudogene between two
# samples, each sample was normalized using the total number of reads less structural RNAs, i.e.
# sense small RNA reads likely derived from degraded ncRNAs, tRNAs, snoRNAs, rRNAs,
# snRNAs, and scRNAs. Degradation products of structural RNAs map to the sense strand, with a
# poorly defined size profile and 1 st nucleotide distribution. At least 25 22G-RNA reads per million,
# nonstructural reads in one of the two samples was arbitrarily chosen as a cutoff for comparison
# analyses. A change of 2-fold or more between samples was chosen as an enrichment threshold.
# Because some 21U-RNAs or miRNAs overlap with protein coding genes, reads derived from
# miRNA loci within a window of ± 4 nt and all the known 21U-RNAs were filtered out prior to
# comparison analysis.
# -----
# And Germano Cecere, about scRNA:
# -----
# Its an old nomenclature and in anycase there is only one of this annotated scRNAs
# (small cytoplasmic RNA genes).
# https://www.ncbi.nlm.nih.gov/books/NBK19701/table/genestructure_table2/?report=objectonly
# Don't even pay attention to this
# -----
STRUCTURAL_BIOTYPES = ["tRNA", "snRNA", "snoRNA", "rRNA", "ncRNA"]
Blaise Li's avatar
Blaise Li committed
348
GENE_LISTS = config["gene_lists"]
349
350
351
352
353
354
355
BOXPLOT_GENE_LISTS = config["boxplot_gene_lists"]
#BOXPLOT_GENE_LISTS = [
#    "all_genes",
#    "replication_dependent_octamer_histone",
#    "piRNA_dependent_prot_si_down4",
#    "csr1_prot_si_supertargets_48hph",
#    "spermatogenic_Ortiz_2014", "oogenic_Ortiz_2014"]
356
357
358
359
360
361
362
aligner = config["aligner"]
########################
# Genome configuration #
########################
genome_dict = config["genome_dict"]
genome = genome_dict["name"]
chrom_sizes = get_chrom_sizes(genome_dict["size"])
363
chrom_sizes.update(valmap(int, genome_dict.get("extra_chromosomes", {})))
364
365
366
367
368
genomelen = sum(chrom_sizes.values())
genome_db = genome_dict["db"][aligner]
# bed file binning the genome in 10nt bins
genome_binned = genome_dict["binned"]
annot_dir = genome_dict["annot_dir"]
369
exon_lengths_file = OPJ(annot_dir, "union_exon_lengths.txt"),
370
371
372
# TODO: figure out the difference between OPJ(convert_dir, "wormid2name.pickle") and genome_dict["converter"]
convert_dir = genome_dict["convert_dir"]
gene_lists_dir = genome_dict["gene_lists_dir"]
373
avail_id_lists = set(glob(OPJ(gene_lists_dir, "*_ids.txt")))
374
index = genome_db
375

376
output_dir = config["output_dir"]
377
378
379
local_annot_dir = config.get("local_annot_dir", OPJ(output_dir, "annotations"))
log_dir = config.get("log_dir", OPJ(output_dir, "logs"))
data_dir = config.get("data_dir", OPJ(output_dir, "data"))
380
# To put the results of small_RNA_seq_annotate
381
382
383
384
mapping_dir = OPJ(output_dir, aligner, f"mapped_{genome}")
reads_dir = OPJ(mapping_dir, "reads")
annot_counts_dir = OPJ(mapping_dir, "annotation")
feature_counts_dir = OPJ(mapping_dir, "feature_count")
385
READ_TYPES_FOR_COUNTING = ["all_sisiuRNA", "sisiuRNA", f"{size_selected}_and_nomap_siRNA", "prot_siRNA_and_prot_siuRNA"] + list(map(lambda x: "%s%s" % (x, "RNA"), ["prot_sisiu"]))
386
387
# Reads remapped and counted using featureCounts
REMAPPED_COUNTED = [
388
389
390
391
392
393
394
    f"{small_type}_{mapped_type}_{biotype}_{orientation}_transcript" for (
        small_type, mapped_type, biotype, orientation) in product(
            READ_TYPES_FOR_COUNTING,
            [f"on_{genome}", f"unique_on_{genome}"],
            #[f"on_{genome}"],
            COUNT_BIOTYPES + JOINED_BIOTYPES,
            ORIENTATIONS)]
395
396
# Used to skip some genotype x treatment x replicate number combinations
# when some of them were not sequenced
397
forbidden = {frozenset(wc_comb.items()) for wc_comb in config["missing"]}
Blaise Li's avatar
Blaise Li committed
398
399
CONDITIONS = [{
    "lib" : lib,
400
    "rep" : rep} for rep in REPS for lib in LIBS]
Blaise Li's avatar
Blaise Li committed
401
402
403
404
405
406
# We use this for various things in order to have always the same library order:
COND_NAMES = ["_".join((
    cond["lib"],
    cond["rep"])) for cond in CONDITIONS]
COND_COLUMNS = pd.DataFrame(CONDITIONS).assign(
    cond_name=pd.Series(COND_NAMES).values).set_index("cond_name")
Blaise Li's avatar
Blaise Li committed
407
408
#SIZE_FACTORS = ["raw", "deduped", size_selected, "mapped", "siRNA", "miRNA"]
#SIZE_FACTORS = [size_selected, "mapped", "miRNA"]
409
TESTED_SIZE_FACTORS = ["mapped", "non_structural", "siRNA", "miRNA", "median_ratio_to_pseudo_ref"]
Blaise Li's avatar
Blaise Li committed
410
#SIZE_FACTORS = ["mapped", "miRNA", "median_ratio_to_pseudo_ref"]
411
412
413
414
# "median_ratio_to_pseudo_ref" is a size factor adapted from
# the method described in the DESeq paper, but with addition
# and then substraction of a pseudocount, in order to deal with zero counts.
# This seems to perform well (see "test_size_factor" results).
Blaise Li's avatar
Blaise Li committed
415
DE_SIZE_FACTORS = ["non_structural", "median_ratio_to_pseudo_ref"]
416
417
SIZE_FACTORS = ["non_structural"]
#NORMALIZER = "median_ratio_to_pseudo_ref"
418

Blaise Li's avatar
Blaise Li committed
419
# For metagene analyses
420
421
#META_MARGIN = 300
META_MARGIN = 0
422
META_SCALE = 500
423
424
#UNSCALED_INSIDE = 500
UNSCALED_INSIDE = 0
Blaise Li's avatar
Blaise Li committed
425
426
427
#META_MIN_LEN = 1000
META_MIN_LEN = 2 * UNSCALED_INSIDE
MIN_DIST = 2 * META_MARGIN
428

429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
# split = str.split
#
#
# def strip_split(text):
#     return split(strip(text), "\t")
#
#
# # http://stackoverflow.com/a/845069/1878788
# def file_len(fname):
#     p = Popen(
#         ['wc', '-l', fname],
#         stdout=PIPE,
#         stderr=PIPE)
#     result, err = p.communicate()
#     if p.returncode != 0:
#         raise IOError(err)
#     return int(result.strip().split()[0])
446
447


448
449
450
451
452
453
454
455
def add_dataframes(df1, df2):
    return df1.add(df2, fill_value=0)


def sum_dataframes(dfs):
    return reduce(add_dataframes, dfs)


456
def sum_counts(fname):
457
    p = Popen(
458
        ['awk', '$1 ~ /^piRNA$|^miRNA$|^pseudogene$|^satellites_rmsk$|^simple_repeats_rmsk$|^protein_coding_|^.NA_transposons_rmsk$/ {sum += $2} END {print sum}', fname],
459
460
461
462
463
        stdout=PIPE,
        stderr=PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
464
465
466
467
468
    try:
        return int(result.strip().split()[0])
    except IndexError:
        warnings.warn(f"No counts in {fname}\n")
        return 0
469
470

def sum_te_counts(fname):
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
    p = Popen(
        ['awk', '$1 !~ /WBGene/ {sum += $2} END {print sum}', fname],
        stdout=PIPE,
        stderr=PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])


import matplotlib.patheffects


class Scale(matplotlib.patheffects.RendererBase):
    def __init__(self, sx, sy=None):
        self._sx = sx
        self._sy = sy

    def draw_path(self, renderer, gc, tpath, affine, rgbFace):
        affine = affine.identity().scale(self._sx, self._sy) + affine
        renderer.draw_path(gc, tpath, affine, rgbFace)


494
495
496
497
498
499
500
501
502
503
504
505
506
507
def print_wc(wildcards):
    print(wildcards)
    return []


def check_wc(wildcards):
    #if hasattr(wildcards, "read_type"):
    #    if wildcards.read_type.endswith("_on_C"):
    #        print(wildcards)
    if any(attr.endswith("_on_C") for attr in vars(wildcards).values() if type(attr) == "str"):
        print(wildcards)
    return []


508
509
510
filtered_product = filter_combinator(product, forbidden)

# Limit risks of ambiguity by imposing replicates to be numbers
511
# and restricting possible forms of some other wildcards
512
wildcard_constraints:
513
    lib="|".join(LIBS),
514
    #treat="|".join(TREATS),
515
    rep="\d+",
Blaise Li's avatar
Blaise Li committed
516
517
    min_dist="\d+",
    min_len="\d+",
518
    #max_len="\d+",
519
    biotype="|".join(set(COUNT_BIOTYPES + ANNOT_BIOTYPES + JOINED_BIOTYPES)),
520
    id_list="|".join(GENE_LISTS),
521
    type_set="|".join(["all", "protein_coding", "protein_coding_TE"]),
522
    mapped_type="|".join([f"on_{genome}", f"unique_on_{genome}"]),
523
    small_type="[spm]i|pisimi|siu|prot_si|te_si|pseu_si|satel_si|simrep_si|prot_siu|te_siu|pseu_siu|satel_siu|simrep_siu|sisiu|all_si|all_siu|all_sisiu",
524
    read_type = "|".join(REMAPPED_COUNTED + READ_TYPES_FOR_MAPPING + READ_TYPES_FOR_COUNTING + ["trimmed", "nomap"]),
525
526
527
528
529
530
531
532
    # read_type="|".join([
    #     "raw", "trimmed", "deduped", f"{size_selected}",
    #     "nomap", "nomap_siRNA", "mapped",
    #     "siRNA", "piRNA", "miRNA", "siuRNA",
    #     "prot_siRNA", "te_siRNA", "pseu_siRNA", "satel_siRNA", "simrep_siRNA",
    #     "prot_siuRNA", "te_siuRNA", "pseu_siuRNA", "satel_siuRNA", "simrep_siuRNA",
    #     "prot_sisiuRNA", "te_sisiuRNA", "pseu_sisiuRNA", "satel_sisiuRNA", "simrep_sisiuRNA",
    #     "all_siRNA", "all_siuRNA", "all_sisiuRNA"] + REMAPPED_COUNTED),
Blaise Li's avatar
Blaise Li committed
533
    standard="zscore|robust|minmax|unit|identity",
Blaise Li's avatar
Blaise Li committed
534
    orientation="all|fwd|rev",
535
    contrast="|".join(CONTRASTS),
536
    norm="|".join(TESTED_SIZE_FACTORS),
537
538
    lib_group="|".join(ALL_LIB_GROUPS),
    group_type="|".join(GROUP_TYPES),
Blaise Li's avatar
Blaise Li committed
539
    fold_type="|".join(["mean_log2_RPM_fold", "log2FoldChange", "lfcMLE"]),
540

Blaise Li's avatar
Blaise Li committed
541
#ruleorder: map_on_genome > sam2indexedbam > compute_coverage > remap_on_genome > resam2indexedbam > recompute_coverage
542
543

# Define functions to be used in shell portions
544
545
546
shell.prefix(SHELL_FUNCTIONS)


547
548
549
550
###################################
# Preparing the input of rule all #
###################################

551
bigwig_files = [
Blaise Li's avatar
Blaise Li committed
552
    # individual libraries
553
554
    expand(
        expand(
555
            OPJ(mapping_dir, "{lib}_{rep}",
556
                "{lib}_{rep}_{{read_type}}_on_%s_by_{{norm}}_{{orientation}}.bw" % genome),
557
            filtered_product, lib=LIBS, rep=REPS),
558
        read_type=READ_TYPES_FOR_MAPPING, norm=SIZE_FACTORS, orientation=["all"]),
559
560
    #expand(
    #    expand(
561
    #        OPJ(mapping_dir, "{lib}_{rep}", "{lib}_{rep}_{{read_type}}_on_%s_{{orientation}}.bw" % genome),
562
563
    #            filtered_product, lib=LIBS, rep=REPS),
    #    read_type=READ_TYPES_FOR_MAPPING, orientation=ORIENTATIONS),
Blaise Li's avatar
Blaise Li committed
564
    # means of replicates
565
566
    expand(
        expand(
567
            OPJ(mapping_dir, "{lib}_mean",
568
                "{lib}_mean_{{read_type}}_on_%s_by_{{norm}}_{{orientation}}.bw" % genome),
569
            filtered_product, lib=LIBS),
570
        read_type=READ_TYPES_FOR_MAPPING, norm=SIZE_FACTORS, orientation=["all"]),
571
572
573
574
575
    ]

meta_profiles = [
        #expand(OPJ(local_annot_dir, "transcripts_{type_set}", "merged_isolated_{min_dist}.bed"), type_set=["all", "protein_coding", "protein_coding_TE"], min_dist="0 5 10 25 50 100 250 500 1000 2500 5000 10000".split()),
        #expand(OPJ(local_annot_dir, "transcripts_{type_set}", "merged_isolated_{min_dist}_{biotype}_min_{min_len}.bed"), type_set=["all", "protein_coding", "protein_coding_TE"], min_dist="0 5 10 25 50 100 250 500 1000 2500 5000 10000".split(), biotype=["protein_coding"], min_len=[str(META_MIN_LEN)]),
Blaise Li's avatar
Blaise Li committed
576
        [expand(
577
            OPJ(output_dir, "figures", "mean_meta_profiles_meta_scale_{meta_scale}",
578
                "{read_type}_by_{norm}_{orientation}_on_{type_set}_merged_isolated_{min_dist}_{biotype}_min_{min_len}_{group_type}_{lib_group}_meta_profile.pdf"),
Blaise Li's avatar
Blaise Li committed
579
            meta_scale=[str(META_SCALE)], read_type=[size_selected, "siRNA", "siuRNA", "miRNA", "piRNA", "all_siRNA"],
580
581
            norm=SIZE_FACTORS, orientation=["all"], type_set=["protein_coding_TE"], min_dist=[str(MIN_DIST)],
            biotype=["protein_coding", "DNA_transposons_rmsk", "RNA_transposons_rmsk"], min_len=[str(META_MIN_LEN)],
582
            group_type=[group_type], lib_group=list(lib_group.keys())) for (group_type, lib_group) in lib_groups.items()],
583
584
585
        # Same as above ?
        # [expand(
        #     OPJ(output_dir, "figures", "mean_meta_profiles_meta_scale_{meta_scale}",
586
        #         "{read_type}_by_{norm}_{orientation}_on_{type_set}_merged_isolated_{min_dist}_{biotype}_min_{min_len}_{group_type}_{lib_group}_meta_profile.pdf"),
587
588
589
        #     meta_scale=[str(META_SCALE)], read_type=[size_selected, "siRNA", "siuRNA", "miRNA", "piRNA", "all_siRNA"],
        #     norm=SIZE_FACTORS, orientation=["all"], type_set=["protein_coding_TE"], min_dist=[str(MIN_DIST)],
        #     biotype=["protein_coding", "DNA_transposons_rmsk", "RNA_transposons_rmsk"], min_len=[str(META_MIN_LEN)],
590
        #     group_type=[group_type], lib_group=list(lib_group.keys())) for (group_type, lib_group) in lib_groups.items()],
Blaise Li's avatar
Blaise Li committed
591
        [expand(
592
            OPJ(output_dir, "figures", "mean_meta_profiles_meta_scale_{meta_scale}",
593
                "{read_type}_by_{norm}_{orientation}_on_{type_set}_merged_isolated_{min_dist}_{biotype}_min_{min_len}_{group_type}_{lib_group}_meta_profile.pdf"),
Blaise Li's avatar
Blaise Li committed
594
            meta_scale=[str(META_SCALE)], read_type=[size_selected, "siRNA", "siuRNA", "miRNA", "piRNA", "all_siRNA"],
595
596
            norm=SIZE_FACTORS, orientation=["all"], type_set=["protein_coding"], min_dist=[str(MIN_DIST)],
            biotype=["protein_coding"], min_len=[str(META_MIN_LEN)],
597
            group_type=[group_type], lib_group=list(lib_group.keys())) for (group_type, lib_group) in lib_groups.items()],
Blaise Li's avatar
Blaise Li committed
598
        [expand(
599
            OPJ(output_dir, "figures", "mean_meta_profiles_meta_scale_{meta_scale}",
600
                "{read_type}_by_{norm}_{orientation}_on_{type_set}_merged_isolated_{min_dist}_{id_list}_{group_type}_{lib_group}_meta_profile.pdf"),
601
602
            meta_scale=[str(META_SCALE)], read_type=[size_selected, "siRNA", "siuRNA", "miRNA", "piRNA", "all_siRNA"],
            norm=SIZE_FACTORS, orientation=["all"], type_set=["protein_coding_TE"], min_dist=["0"], id_list=GENE_LISTS,
603
            group_type=[group_type], lib_group=list(lib_group.keys())) for (group_type, lib_group) in lib_groups.items()],
604
605
606
        ## TODO: Resolve issue with bedtools
        # expand(
        #     OPJ(output_dir, "figures", "{lib}_{rep}",
607
        #         "{read_type}_by_{norm}_{orientation}_pi_targets_in_{biotype}_profile.pdf"),
608
        #     lib=LIBS, rep=REPS, read_type=[size_selected, "siRNA", "siuRNA", "miRNA", "piRNA", "all_siRNA"],
609
        #     norm=SIZE_FACTORS, orientation=["all"], biotype=["protein_coding"]),
610
611
612
    ]

read_graphs = [
613
614
    expand(
        expand(
615
616
            OPJ(output_dir, "figures", "{{lib}}_{{rep}}", "{read_type}_base_composition_from_{position}.pdf"),
            read_type=READ_TYPES_FOR_COMPOSITION, position=["start", "end"]),
617
618
619
        filtered_product, lib=LIBS, rep=REPS),
    expand(
        expand(
620
621
            OPJ(output_dir, "figures", "{{lib}}_{{rep}}", "{read_type}_base_logo_from_{position}.pdf"),
            read_type=READ_TYPES_FOR_COMPOSITION, position=["start", "end"]),
622
        filtered_product, lib=LIBS, rep=REPS),
623
624
    expand(
        expand(
625
626
            OPJ(output_dir, "figures", "{{lib}}_{{rep}}", "{read_type}_{position}_base_composition.pdf"),
            read_type=READ_TYPES_FOR_COMPOSITION, position=POSITIONS),
627
628
629
        filtered_product, lib=LIBS, rep=REPS),
    expand(
        expand(
630
631
            OPJ(output_dir, "figures", "{{lib}}_{{rep}}", "{read_type}_{position}_base_logo.pdf"),
            read_type=READ_TYPES_FOR_COMPOSITION, position=POSITIONS),
632
633
634
        filtered_product, lib=LIBS, rep=REPS),
    expand(
        expand(
635
636
            OPJ(output_dir, "figures", "{{lib}}_{{rep}}", "{read_type}_size_distribution.pdf"),
            read_type=["trimmed", "nomap"]),
637
638
        filtered_product, lib=LIBS, rep=REPS),
    expand(
639
640
        OPJ(output_dir, "figures", "{lib}_{rep}", f"{size_selected}_smallRNA_barchart.pdf"),
        filtered_product, lib=LIBS, rep=REPS),
641
    expand(
642
643
        OPJ(output_dir, "figures", "{lib}_{rep}", "nb_reads.pdf"),
        filtered_product, lib=LIBS, rep=REPS),
644
645
    ]

646
647
648
ip_fold_heatmaps = []
de_fold_heatmaps = []
ip_fold_boxplots = []
649
650
# Temporary, until used for boxplots:
remapped_folds = []
651
remapped_fold_boxplots = []
652
if contrasts_dict["ip"]:
653
654
655
656
657
658
659
660
661
662
663
664
665
    if BOXPLOT_BIOTYPES:
        remapped_fold_boxplots = expand(
            OPJ(output_dir, "figures", "{contrast}", "by_biotypes",
                "{contrast}_{read_type}_{mapped_type}_{biotypes}_{orientation}_transcript_{fold_type}_{gene_list}_boxplots.pdf"),
            contrast=IP_CONTRASTS,
            read_type=READ_TYPES_FOR_COUNTING,
            mapped_type=[f"on_{genome}", f"unique_on_{genome}"],
            #mapped_type=[f"on_{genome}"],
            # TODO: Read this from config file
            biotypes=["_and_".join(BOXPLOT_BIOTYPES)],
            orientation=ORIENTATIONS,
            fold_type=["mean_log2_RPM_fold"],
            gene_list=BOXPLOT_GENE_LISTS)
666
    remapped_folds = expand(
667
        OPJ(mapping_dir, f"RPM_folds_{size_selected}", "all", "remapped", "{counted_type}_mean_log2_RPM_fold.txt"),
668
669
670
671
        counted_type=REMAPPED_COUNTED),
    # snakemake -n OK
    # expand(
    #     OPJ(mapping_dir, f"RPM_folds_{size_selected}", "all",
672
673
    #         "{read_type}_{mapped_type}_{biotype}_{orientation}_transcript_mean_log2_RPM_fold.txt"),
    #     read_type=READ_TYPES_FOR_COUNTING, mapped_type=[f"on_{genome}"], biotype=COUNT_BIOTYPES, orientation=ORIENTATIONS),
674
675
676
    # snakemake -n OK
    # expand(
    #     OPJ(mapping_dir, f"RPM_folds_{size_selected}", "{contrast}",
677
678
    #         "{contrast}_{small_type}_{mapped_type}_{biotype}_{orientation}_transcript_RPM_folds.txt"),
    #     contrast=IP_CONTRASTS, small_type=READ_TYPES_FOR_COUNTING, mapped_type=[f"on_{genome}"], biotype=COUNT_BIOTYPES, orientation=ORIENTATIONS),
679
    ip_fold_heatmaps = expand(
680
681
        OPJ(output_dir, "figures", "fold_heatmaps", "{small_type}_{fold_type}_heatmap.pdf"),
        small_type=["pisimi", "prot_si"], fold_type=["mean_log2_RPM_fold"])
682
683
    ip_fold_boxplots = expand(
        OPJ(output_dir, "figures", "all_{contrast_type}",
684
            "{contrast_type}_{small_type}_{fold_type}_{gene_list}_boxplots.pdf"),
685
        contrast_type=["ip"], small_type=IP_TYPES, fold_type=["mean_log2_RPM_fold"],
686
        gene_list=BOXPLOT_GENE_LISTS)
687
688
689
    # Subdivided by biotype (to enable distinction between 5'UTR, CDS and 3'UTR)
    # ip_fold_boxplots = expand(
    #     OPJ(output_dir, "figures", "all_{contrast_type}",
690
    #         "{contrast_type}_{small_type}_{fold_type}_{gene_list}_{biotype}_boxplots.pdf"),
691
    #     contrast_type=["ip"], small_type=IP_TYPES, fold_type=["mean_log2_RPM_fold"],
692
    #     gene_list=BOXPLOT_GENE_LISTS, biotype=["protein_coding_5UTR", "protein_coding_CDS", "protein_coding_3UTR"])
693
694
if contrasts_dict["de"]:
    de_fold_heatmaps = expand(
695
696
        OPJ(output_dir, "figures", "fold_heatmaps", "{small_type}_{fold_type}_heatmap.pdf"),
        small_type=["pisimi", "prot_si"], fold_type=["log2FoldChange"])
697
698

exploratory_graphs = [
699
700
    ip_fold_heatmaps,
    de_fold_heatmaps,
701
    # Large figures, not very readable
702
    # expand(
703
704
    #     OPJ(mapping_dir, f"deseq2_{size_selected}", "{contrast}", "{contrast}_{small_type}_by_{norm}_pairplots.pdf"),
    #     contrast=DE_CONTRASTS, small_type=DE_TYPES, norm=DE_SIZE_FACTORS),
705
706
    ## TODO: debug PCA
    #expand(
707
708
    #    OPJ(output_dir, "figures", "{small_type}_{standard}_PCA.pdf"),
    #    small_type=["pisimi"], standard=STANDARDS),
709
    #expand(
710
711
    #    OPJ(output_dir, "figures", "{small_type}_{standard}_PC1_PC2_distrib.pdf"),
    #    small_type=["pisimi"], standard=STANDARDS),
712
713
    ##
    #expand(
714
715
    #    OPJ(output_dir, "figures", "{small_type}_clustermap.pdf"),
    #    small_type=SMALL_TYPES),
716
    #expand(
717
718
    #    OPJ(output_dir, "figures", "{small_type}_zscore_clustermap.pdf"),
    #    small_type=SMALL_TYPES),
719
    #expand(
720
721
    #    OPJ(output_dir, "figures", "{contrast}", "{small_type}_zscore_clustermap.pdf"),
    #    contrast=CONTRASTS, small_type=DE_TYPES),
722
723
724
725
    ]

de_fold_boxplots = expand(
    OPJ(output_dir, "figures", "{contrast}",
726
        "{contrast}_{small_type}_{fold_type}_{gene_list}_boxplots.pdf"),
Blaise Li's avatar
Blaise Li committed
727
    contrast=DE_CONTRASTS, small_type=DE_TYPES, fold_type=["log2FoldChange", "mean_log2_RPM_fold"],
728
    gene_list=["all_gene_lists"])
729
730
ip_fold_boxplots_by_contrast = expand(
    OPJ(output_dir, "figures", "{contrast}",
731
        "{contrast}_{small_type}_{fold_type}_{gene_list}_boxplots.pdf"),
Blaise Li's avatar
Blaise Li committed
732
    contrast=IP_CONTRASTS, small_type=IP_TYPES, fold_type=["mean_log2_RPM_fold"],
733
    gene_list=["all_gene_lists"])
734
735
736

fold_boxplots = [de_fold_boxplots, ip_fold_boxplots_by_contrast, ip_fold_boxplots]

737
738
rule all:
    input:
739
        expand(
740
            OPJ(mapping_dir, "{lib}_{rep}", "{read_type}_on_%s_coverage.txt" % genome),
741
742
743
            filtered_product, lib=LIBS, rep=REPS, read_type=[size_selected]),
        bigwig_files,
        meta_profiles,
744
745
746
        # snakemake -n OK
        # expand(
        #     expand(
747
        #         OPJ(feature_counts_dir, "summaries",
748
749
750
751
752
753
754
        #             "{{lib}}_{{rep}}_{read_type}_on_%s_{orientation}_transcript_counts.txt" % (genome)),
        #         read_type=READ_TYPES_FOR_MAPPING, orientation=ORIENTATIONS),
        #     filtered_product, lib=LIBS, rep=REPS),
        # TODO
        # snakemake -n OK
        # expand(
        #     expand(
755
        #         OPJ(feature_counts_dir,
756
757
758
759
760
761
        #             "{{lib}}_{{rep}}_{small_type}_counts.txt"),
        #         small_type=REMAPPED_COUNTED),
        #     filtered_product, lib=LIBS, rep=REPS),
        # snakemake -n not OK: summaries contains all biotypes
        # expand(
        #     expand(
762
        #         OPJ(feature_counts_dir, "summaries",
763
764
765
766
        #             "{{lib}}_{{rep}}_{small_type}_counts.txt"),
        #         small_type=REMAPPED_COUNTED),
        #     filtered_product, lib=LIBS, rep=REPS),
        # REMAPPED_COUNTED = [f"{small_type}_on_{genome}_{biotype}_{orientation}_transcript" for (small_type, biotype, orientation) in product(READ_TYPES_FOR_MAPPING, set(COUNT_BIOTYPES + ANNOT_BIOTYPES), ORIENTATIONS)]
767
768
        expand(
            expand(
769
                OPJ(feature_counts_dir, "summaries",
770
                    "{{lib}}_{{rep}}_%s_and_nomap_siRNA_on_%s_{orientation}_transcript_counts.txt" % (size_selected, genome)),
771
772
773
774
                orientation=ORIENTATIONS),
            filtered_product, lib=LIBS, rep=REPS),
        expand(
            expand(
775
                OPJ(feature_counts_dir, "summaries",
776
                    "{{lib}}_{{rep}}_prot_siRNA_and_prot_siuRNA_on_%s_{orientation}_transcript_counts.txt" % genome),
777
778
779
                orientation=ORIENTATIONS),
            filtered_product, lib=LIBS, rep=REPS),
        read_graphs,
780
781
        #expand(OPJ(feature_counts_dir, "summaries", "{lib}_{rep}_nb_non_structural.txt"), filtered_product, lib=LIBS, rep=REPS),
        OPJ(mapping_dir, f"RPM_folds_{size_selected}", "all", "pisimi_mean_log2_RPM_fold.txt"),
782
783
        # Not looking ad deseq2 results any more
        #expand(OPJ(mapping_dir, f"deseq2_{size_selected}", "all", "pisimi_{fold_type}.txt"), fold_type=["log2FoldChange"]),
784
        #expand(OPJ(mapping_dir, "RPM_folds_%s" % size_selected, "{contrast}", "{contrast}_{small_type}_RPM_folds.txt"), contrast=IP_CONTRASTS, small_type=DE_TYPES),
785
        exploratory_graphs,
786
        remapped_folds,
787
        fold_boxplots,
788
        remapped_fold_boxplots,
789
        #expand(OPJ(output_dir, "figures", "{small_type}_unit_clustermap.pdf"), small_type=SMALL_TYPES),
790
        #expand(OPJ(
791
        #    feature_counts_dir,
792
        #    "all_{read_type}_{mapped_type}_{biotype}_{orientation}_transcript_counts.txt"), read_type=READ_TYPES_FOR_COUNTING, mapped_type=[f"on_{genome}"], biotype=ANNOT_BIOTYPES, orientation=ORIENTATIONS),
793
        # expand(OPJ(
794
        #     feature_counts_dir,
795
796
        #     "all_{small_type}_{mapped_type}_{biotype}_{orientation}_transcript_counts.txt"),
        #     small_type=READ_TYPES_FOR_MAPPING, mapped_type=[f"on_{genome}"], biotype=set(COUNT_BIOTYPES + ANNOT_BIOTYPES), orientation=ORIENTATIONS),
797
        expand(
798
            OPJ(annot_counts_dir, f"all_{size_selected}_on_{genome}", "{small_type}_RPM.txt"),
799
            small_type=["mi", "prot_si", "te_si", "pseu_si", "satel_si", "simrep_si", "prot_siu", "te_siu", "pseu_siu",  "pisimi"]),
800
801
        # piRNA and satel_siu raise ValueError: `dataset` input should have multiple elements when plotting
        # simrep_siu raise TypeError: Empty 'DataFrame': no numeric data to plot
802
        expand(
803
804
            OPJ(output_dir, "figures", "{small_type}_norm_correlations.pdf"),
            small_type=["mi", "prot_si", "te_si", "pseu_si", "satel_si", "simrep_si", "prot_siu", "te_siu", "pseu_siu",  "pisimi"]),
805
        expand(
806
807
            OPJ(output_dir, "figures", "{small_type}_norm_counts_distrib.pdf"),
            small_type=["mi", "prot_si", "te_si", "pseu_si", "satel_si", "simrep_si", "prot_siu", "te_siu", "pseu_siu", "pisimi"]),
808

809
810
811
812
813
814
#absolute = "/pasteur/homes/bli/src/bioinfo_utils/snakemake_wrappers/includes/link_raw_data.rules"
#relative_include_path = "../snakemake_wrappers/includes/link_raw_data.snakefile"
#absolute_include_path = os.path.join(workflow.basedir, relative_include_path)
#assert os.path.exists(absolute_include_path)
#include: relative_include_path
include: ensure_relative(irules["link_raw_data"], workflow.basedir)
815

816

817
818
819
820
821
rule trim_and_dedup:
    input:
        rules.link_raw_data.output,
        #OPJ(data_dir, "{lib}_{rep}.fastq.gz"),
    params:
822
        adapter = lambda wildcards: lib2adapt[wildcards.lib],
823
824
825
        trim5 = trim5,
        trim3 = trim3,
    output:
826
827
828
829
        trimmed = OPJ(data_dir, "trimmed", "{lib}_{rep}_trimmed.fastq.gz"),
        nb_raw =  OPJ(data_dir, "trimmed", "{lib}_{rep}_nb_raw.txt"),
        nb_trimmed =  OPJ(data_dir, "trimmed", "{lib}_{rep}_nb_trimmed.txt"),
        nb_deduped =  OPJ(data_dir, "trimmed", "{lib}_{rep}_nb_deduped.txt"),
830
831
    #threads: 2
    message:
832
        "Trimming adaptor from raw data, deduplicating reads, removing random 5' {trim5}-mers and 3' {trim3}-mers for {wildcards.lib}_{wildcards.rep}."
833
834
    benchmark:
        OPJ(log_dir, "trim_and_dedup", "{lib}_{rep}_benchmark.txt")
835
    log:
836
837
        cutadapt = OPJ(log_dir, "cutadapt", "{lib}_{rep}.log"),
        trim_and_dedup = OPJ(log_dir, "trim_and_dedup", "{lib}_{rep}.log"),
838
839
840
841
842
843
    shell:
        """
        zcat {input} \\
            | tee >(count_fastq_reads {output.nb_raw}) \\
            | cutadapt -a {params.adapter} --discard-untrimmed - 2> {log.cutadapt} \\
            | tee >(count_fastq_reads {output.nb_trimmed}) \\
844
            | dedup \\
845
846
847
848
849
850
851
852
            | tee >(count_fastq_reads {output.nb_deduped}) \\
            | trim_random_nt {params.trim5} {params.trim3}  2>> {log.cutadapt} \\
            | gzip > {output.trimmed} \\
            2> {log.trim_and_dedup}
        """


def awk_size_filter(wildcards):
Blaise Li's avatar
Blaise Li committed
853
    """Returns the bioawk filter to select reads of size from MIN_LEN to MAX_LEN."""
854
    return "%s <= length($seq) && length($seq) <= %s" % (MIN_LEN, MAX_LEN)
855

Blaise Li's avatar
Blaise Li committed
856

857
rule select_size_range:
Blaise Li's avatar
Blaise Li committed
858
    """Select (and count) reads in the correct size range."""
859
860
861
    input:
        rules.trim_and_dedup.output.trimmed
    output:
862
863
        selected = OPJ(data_dir, "trimmed", "{lib}_{rep}_%s.fastq.gz" % size_selected),
        nb_selected = OPJ(data_dir, "trimmed", "{lib}_{rep}_nb_%s.txt" % size_selected),
864
865
866
    params:
        awk_filter = awk_size_filter,
    message:
867
        "Selecting reads size %s for {wildcards.lib}_{wildcards.rep}." % size_selected
868
869
870
    shell:
        """
        bioawk -c fastx '{params.awk_filter} {{print "@"$name" "$4"\\n"$seq"\\n+\\n"$qual}}' {input} \\
871
872
            | tee >(count_fastq_reads {output.nb_selected}) \\
            | gzip > {output.selected}
873
874
875
        """


876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# if int(snakemake.__version__.split(".")[0]) == 5:
#     def source_fastq(wildcards):
#         """Determine the fastq file corresponding to a given read type."""
#         read_type = wildcards.read_type
#         if read_type == "raw":
#             return OPJ(data_dir, "{wildcards.lib}_{wildcards.rep}.fastq.gz")
#             # return rules.link_raw_data.output
#         elif read_type == "trimmed":
#             return rules.trim_and_dedup.output.trimmed
#         elif read_type == size_selected:
#             return rules.select_size_range.output.selected
#         elif read_type == "nomap":
#             return rules.map_on_genome.output.nomap_fastq
#         elif read_type == "nomap_siRNA":
#             return rules.extract_nomap_siRNAs.output.nomap_si
#         elif read_type == "prot_siRNA":
#             return rules.small_RNA_seq_annotate.output.prot_si
#         elif read_type == "te_siRNA":
#             return rules.small_RNA_seq_annotate.output.te_si
#         elif read_type == "pseu_siRNA":
#             return rules.small_RNA_seq_annotate.output.pseu_si
#         elif read_type == "satel_siRNA":
#             return rules.small_RNA_seq_annotate.output.satel_si
#         elif read_type == "simrep_siRNA":
#             return rules.small_RNA_seq_annotate.output.simrep_si
#         elif read_type == "siRNA":
#             return rules.join_si_reads.output.si
#         elif read_type == "piRNA":
#             return rules.small_RNA_seq_annotate.output.pi
#         elif read_type == "miRNA":
#             return rules.small_RNA_seq_annotate.output.mi
#         elif read_type == "all_siRNA":
#             return rules.small_RNA_seq_annotate.output.all_si
#         elif read_type == "prot_siuRNA":
#             return rules.small_RNA_seq_annotate.output.prot_siu
#         elif read_type == "te_siuRNA":
#             return rules.small_RNA_seq_annotate.output.te_siu
#         elif read_type == "pseu_siuRNA":
#             return rules.small_RNA_seq_annotate.output.pseu_siu
#         elif read_type == "satel_siuRNA":
#             return rules.small_RNA_seq_annotate.output.satel_siu
#         elif read_type == "simrep_siuRNA":
#             return rules.small_RNA_seq_annotate.output.simrep_siu
#         elif read_type == "all_siuRNA":
#             return rules.small_RNA_seq_annotate.output.all_siu
#         elif read_type == "siuRNA":
#             return rules.join_si_reads.output.siu
#         elif read_type == "sisiuRNA":
#             return rules.join_si_reads.output.sisiu
#         elif read_type == "prot_sisiuRNA":
#             return rules.join_si_reads.output.prot_sisiu
#         elif read_type == "te_sisiuRNA":
#             return rules.join_si_reads.output.te_sisiu
#         elif read_type == "pseu_sisiuRNA":
#             return rules.join_si_reads.output.pseu_sisiu
#         elif read_type == "satel_sisiuRNA":
#             return rules.join_si_reads.output.satel_sisiu
#         elif read_type == "simrep_sisiuRNA":
#             return rules.join_si_reads.output.simrep_sisiu
#         elif read_type == "all_sisiuRNA":
#             return rules.join_si_reads.output.all_sisiu
#         else:
#             raise NotImplementedError("Unknown read type: %s" % read_type)
# else:
#     def source_fastq(wildcards):
#         """Determine the fastq file corresponding to a given read type."""
#         read_type = wildcards.read_type
#         if read_type == "raw":
#             return rules.link_raw_data.output
#         elif read_type == "trimmed":
#             return rules.trim_and_dedup.output.trimmed
#         elif read_type == size_selected:
#             return rules.select_size_range.output.selected
#         elif read_type == "nomap":
#             return rules.map_on_genome.output.nomap_fastq
#         elif read_type == "nomap_siRNA":
#             return rules.extract_nomap_siRNAs.output.nomap_si
#         elif read_type == "prot_siRNA":
#             return rules.small_RNA_seq_annotate.output.prot_si
#         elif read_type == "te_siRNA":
#             return rules.small_RNA_seq_annotate.output.te_si
#         elif read_type == "pseu_siRNA":
#             return rules.small_RNA_seq_annotate.output.pseu_si
#         elif read_type == "satel_siRNA":
#             return rules.small_RNA_seq_annotate.output.satel_si
#         elif read_type == "simrep_siRNA":
#             return rules.small_RNA_seq_annotate.output.simrep_si
#         elif read_type == "siRNA":
#             return rules.join_si_reads.output.si
#         elif read_type == "piRNA":
#             return rules.small_RNA_seq_annotate.output.pi
#         elif read_type == "miRNA":
#             return rules.small_RNA_seq_annotate.output.mi
#         elif read_type == "all_siRNA":
#             return rules.small_RNA_seq_annotate.output.all_si
#         elif read_type == "prot_siuRNA":
#             return rules.small_RNA_seq_annotate.output.prot_siu
#         elif read_type == "te_siuRNA":
#             return rules.small_RNA_seq_annotate.output.te_siu
#         elif read_type == "pseu_siuRNA":
#             return rules.small_RNA_seq_annotate.output.pseu_siu
#         elif read_type == "satel_siuRNA":
#             return rules.small_RNA_seq_annotate.output.satel_siu
#         elif read_type == "simrep_siuRNA":
#             return rules.small_RNA_seq_annotate.output.simrep_siu
#         elif read_type == "all_siuRNA":
#             return rules.small_RNA_seq_annotate.output.all_siu
#         elif read_type == "siuRNA":
#             return rules.join_si_reads.output.siu
#         elif read_type == "sisiuRNA":
#             return rules.join_si_reads.output.sisiu
#         elif read_type == "prot_sisiuRNA":
#             return rules.join_si_reads.output.prot_sisiu
#         elif read_type == "te_sisiuRNA":
#             return rules.join_si_reads.output.te_sisiu
#         elif read_type == "pseu_sisiuRNA":
#             return rules.join_si_reads.output.pseu_sisiu
#         elif read_type == "satel_sisiuRNA":
#             return rules.join_si_reads.output.satel_sisiu
#         elif read_type == "simrep_sisiuRNA":
#             return rules.join_si_reads.output.simrep_sisiu
#         elif read_type == "all_sisiuRNA":
#             return rules.join_si_reads.output.all_sisiu
#         else:
#             raise NotImplementedError("Unknown read type: %s" % read_type)
For faster browsing, not all history is shown. View entire blame