A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.
**WARNING**: This code is a proof of concept, not production-ready software. If the code is too slow for your tests or you encounter bugs (maybe it's a feature? :p), don't hesitate to contact us through the issues or by mailing me directly (yoann [dot] dufresne [at] pasteur [dot] fr).
## Nomenclature warnings
While we were writing the scientific article, some of the data structure names were changed.
Most of the names in this repository are still the old ones.
Most of the scripts use argparse.
To see how to use a script, run it with the -h command line option.
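
As a rough illustration of what `-h` prints, the sketch below builds a small argparse parser and shows its auto-generated help text. The parser is hypothetical: the program name `example_script.py` and its two arguments are placeholders chosen for illustration, not the interface of any particular script in this repository.

```python
import argparse


def build_parser():
    # Hypothetical parser, for illustration only.
    parser = argparse.ArgumentParser(
        prog="example_script.py",
        description="Toy parser showing the help that argparse generates.",
    )
    # Placeholder positional argument.
    parser.add_argument("graph", help="Input graph file.")
    # Placeholder option with a short alias and a default value.
    parser.add_argument("--threads", "-t", type=int, default=8,
                        help="Number of threads to use.")
    return parser


# argparse adds the -h/--help option automatically.
help_text = build_parser().format_help()

if __name__ == "__main__":
    # Same text the script would print when called with -h.
    print(help_text)
```

Running a real script with `-h` prints the equivalent text for its own arguments and then exits.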
### Test the complete pipeline on simulated data
For a complete test, we made a bunch of snakemake files.
If you are looking for a complete pipeline from synthetic data generation, you should look into the "Snakefile_d2_eval" file.
You can play with the N (number of molecules in the interval graph), M (average number of merges to perform in a barcode), and DEV (standard deviation of the number of merges) variables to see their impact on performance.
These values are arrays: you can enter multiple values and every combination will be run.
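
To get a feel for how quickly the number of runs grows, here is a minimal sketch that enumerates the combinations as a Cartesian product. It assumes the pipeline combines the arrays this way (which is how Snakemake's `expand` behaves by default); the helper `parameter_grid` and the example values are illustrative, not part of the repository.

```python
from itertools import product

# Example arrays; adjust them to match what you set in the Snakefile.
N = [5000, 10000]
M = [2, 3]
DEV = [0, 1]


def parameter_grid(n_values, m_values, dev_values):
    """Return one dict per (n, m, dev) combination (hypothetical helper)."""
    return [
        {"n": n, "m": m, "dev": dev}
        for n, m, dev in product(n_values, m_values, dev_values)
    ]


grid = parameter_grid(N, M, DEV)

if __name__ == "__main__":
    print(len(grid))  # 2 * 2 * 2 = 8 combinations
    for params in grid:
        print(params)
```

Each extra value in any array multiplies the number of runs, so start with one value per variable.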
A summary is written to the tsv file "{WORKDIR}/eval_compare_maxclique.tsv".
Warning: the pipeline can be very slow when there are many parameter combinations.
Command to run the pipeline:
`snakemake -s Snakefile_d2_eval`
### Data simulation
* generate_fake_molecule_graph.py: Create a linear molecule graph, where the molecules are linked to the d molecules on their left and d molecules on their right.
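
The structure described above can be sketched in a few lines of plain Python. This is not the code of generate_fake_molecule_graph.py, only an illustration of the idea: the function names are hypothetical, molecules are numbered 0 to n-1, and molecules near the two ends are assumed to simply have fewer neighbours.

```python
def linear_molecule_edges(n, d):
    """Edges of a linear graph where molecule i is linked to the d
    molecules on its left and the d molecules on its right
    (fewer near the ends). Hypothetical helper for illustration."""
    if n < 0 or d < 0:
        raise ValueError("n and d must be non-negative")
    edges = []
    for i in range(n):
        # Only look to the right; the left links were added
        # when the earlier molecules were visited.
        for j in range(i + 1, min(i + d, n - 1) + 1):
            edges.append((i, j))
    return edges


def neighbours(edges, node):
    """Set of molecules sharing an edge with `node`."""
    return {b if a == node else a for a, b in edges if node in (a, b)}


if __name__ == "__main__":
    edges = linear_molecule_edges(6, 2)
    print(edges)
    print(sorted(neighbours(edges, 3)))  # [1, 2, 4, 5]
```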
rule d2_simplification:
"python3 deconvolution/main/d2_reduction.py -o {output.simplified_d2} {input.barcode_graph} {input.d2_raw}"
"python3 deconvolution/main/d2_reduction.py -o {output.simplified_d2} {input.d2_raw}"
rule d2_generation:
shell(f"python3 deconvolution/main/to_d2_graph.py {{input.barcode_graph}} --{{wildcards.method}} -t {{threads}} -o {WORKDIR}/{{wildcards.file}}_d2_raw_{{wildcards.method}}")
shell(f"python3 deconvolution/main/to_d2_graph.py {{input.barcode_graph}} --{{wildcards.method}} -t {{threads}} -o {WORKDIR}/{{wildcards.file}}_d2_raw_{{wildcards.method}}.gexf")
rule setup_workdir:
include: "Snakefile_d2_path"
WORKDIR = "snake_experiments" if "workdir" not in config else config["workdir"]
N = [5000, 10000]
N = [5000]
D = [10]
M = [2, 3]
DEV = [0, 1]
M = [2]
DEV = [0]
rule generate_compare:
rule d2_path_generation:
best = 0
for _ in range(number_try):
shell("python3 deconvolution/main/d2_to_path.py {input.barcode} {input.d2} > {output}_tmp.out")
shell("python3 deconvolution/main/d2_to_path.py {input.d2} > {output}_tmp.out")
score = 0
with open(f"{output}_tmp.out") as out:
score_line = out.readlines()[-2].strip()
def parse_arguments():
parser = argparse.ArgumentParser(description='Transform a barcode graph into a lcp graph. The program digs for a set of lcps and then merges them into a lcp graph.')
parser.add_argument('barcode_graph', help='The barcode graph file. Must be a gexf formatted file.')
parser.add_argument('--output_prefix', '-o', default="", help="Output file prefix.")
parser.add_argument('--outfile', '-o', default="", help="Output file name for lcp graph.")
parser.add_argument('--threads', '-t', default=8, type=int, help='Number of threads to use for dgraph computation')
parser.add_argument('--debug', '-d', action='store_true', help="Debug")
parser.add_argument('--verbose', '-v', action='store_true', help="Verbose")
args = parser.parse_args()
if args.output_prefix == "":
args.output_prefix = ".".join(args.barcode_graph.split(".")[:-1]) + "_lcpg"
if args.outfile == "":
args.outfile = ".".join(args.barcode_graph.split(".")[:-1]) + "_lcpg.gexf"
return args
debug_path = "/dev/null"
if args.debug:
debug = True
debug_path = f"{args.output_prefix}_debug"
debug_path = ".".join(args.barcode_graph.split(".")[:-1]) + "_debug"
import os, shutil
if os.path.isdir(debug_path):
nx.write_gexf(d2g, f"{args.output_prefix}.gexf")
nx.write_gexf(d2g, args.outfile)
if __name__ == "__main__":