......@@ -2,6 +2,8 @@
A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.
**WARNING**: This code is a proof of concept, not production-ready software. If the code is too slow for your tests or you are encountering bugs (maybe it's a feature? :p), don't hesitate to contact us via the issues or by emailing me directly (yoann [dot] dufresne [at] pasteur [dot] fr).
## Nomenclature warnings
While the scientific article was being written, some of the data structure names were changed.
Most of the names in this repository are still the old ones.
......@@ -27,6 +29,20 @@ Install the package from the root directory.
Most of the scripts use argparse.
To see how to use a script, run it with the -h command line option.
### Test the complete pipeline on simulated data
For a complete test, we provide several Snakemake files.
If you are looking for a complete pipeline starting from synthetic data generation, look at the "Snakefile_d2_eval" file.
You can change the N (number of molecules in the interval graph), M (average number of merges to perform in a barcode) and DEV (standard deviation of the number of merges) variables to see their impact on performance.
These values are arrays. You can enter multiple values, and every combination will be run.
A summary is written to the tsv file "{WORKDIR}/eval_compare_maxclique.tsv".
Warning: the pipeline can be very slow when there are many parameter combinations.
Command to run the pipeline:
```
snakemake -s Snakefile_d2_eval
```
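To get a feeling for how quickly the number of runs grows, here is a minimal sketch of the combinations produced by the N, M and DEV arrays. It assumes a plain Cartesian product; the helper `parameter_combinations` is hypothetical and is not part of the repository, and the values are only examples.

```python
from itertools import product

# Example values; the names mirror the N, M and DEV arrays of the Snakefile.
N = [5000, 10000]
M = [2, 3]
DEV = [0, 1]


def parameter_combinations(n_values, m_values, dev_values):
    """Return every (n, m, dev) triple, i.e. the Cartesian product.

    Hypothetical helper, written only to illustrate the growth of the grid.
    """
    return list(product(n_values, m_values, dev_values))


combos = parameter_combinations(N, M, DEV)
print(len(combos))  # 2 * 2 * 2 = 8 pipeline runs
for n, m, dev in combos:
    print(f"N={n} M={m} DEV={dev}")
```

Adding one more value to any of the arrays multiplies the number of runs, which is why large arrays make the pipeline slow.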
### Data simulation
* generate_fake_molecule_graph.py: Create a linear molecule graph, where the molecules are linked to the d molecules on their left and d molecules on their right.
......
......@@ -35,14 +35,13 @@ rule compress_data:
rule d2_simplification:
input:
barcode_graph="{barcode_path}.gexf",
d2_raw="{barcode_path}_d2_raw_{method}.gexf"
output:
simplified_d2="{barcode_path}_d2_simplified_{method}.gexf"
wildcard_constraints:
method="[A-Za-z0-9]+"
shell:
"python3 deconvolution/main/d2_reduction.py -o {output.simplified_d2} {input.barcode_graph} {input.d2_raw}"
"python3 deconvolution/main/d2_reduction.py -o {output.simplified_d2} {input.d2_raw}"
rule d2_generation:
......@@ -54,7 +53,7 @@ rule d2_generation:
wildcard_constraints:
method="[A-Za-z0-9]+"
run:
shell(f"python3 deconvolution/main/to_d2_graph.py {{input.barcode_graph}} --{{wildcards.method}} -t {{threads}} -o {WORKDIR}/{{wildcards.file}}_d2_raw_{{wildcards.method}}")
shell(f"python3 deconvolution/main/to_d2_graph.py {{input.barcode_graph}} --{{wildcards.method}} -t {{threads}} -o {WORKDIR}/{{wildcards.file}}_d2_raw_{{wildcards.method}}.gexf")
rule setup_workdir:
......
......@@ -3,10 +3,10 @@ include: "Snakefile_d2"
include: "Snakefile_d2_path"
WORKDIR = "snake_experiments" if "workdir" not in config else config["workdir"]
N = [5000, 10000]
N = [5000]
D = [10]
M = [2, 3]
DEV = [0, 1]
M = [2]
DEV = [0]
rule generate_compare:
input:
......
......@@ -5,14 +5,13 @@ threshold = 0.9
rule d2_path_generation:
input:
barcode="{path}.gexf",
d2="{path}_d2_{type}_{method}.gexf"
output:
"{path}_d2_{type}_{method}_path.gexf"
run:
best = 0
for _ in range(number_try):
shell("python3 deconvolution/main/d2_to_path.py {input.barcode} {input.d2} > {output}_tmp.out")
shell("python3 deconvolution/main/d2_to_path.py {input.d2} > {output}_tmp.out")
score = 0
with open(f"{output}_tmp.out") as out:
score_line = out.readlines()[-2].strip()
......
......@@ -10,7 +10,7 @@ from deconvolution.d2graph import d2_graph as d2
def parse_arguments():
parser = argparse.ArgumentParser(description='Transform a barcode graph into a lcp graph. The program digs for a set of lcps and then merges them into a lcp graph.')
parser.add_argument('barcode_graph', help='The barcode graph file. Must be a gexf formatted file.')
parser.add_argument('--output_prefix', '-o', default="", help="Output file prefix.")
parser.add_argument('--outfile', '-o', default="", help="Output file name for lcp graph.")
parser.add_argument('--threads', '-t', default=8, type=int, help='Number of threads to use for dgraph computation')
parser.add_argument('--debug', '-d', action='store_true', help="Debug")
parser.add_argument('--verbose', '-v', action='store_true', help="Verbose")
......@@ -20,8 +20,8 @@ def parse_arguments():
args = parser.parse_args()
if args.output_prefix == "":
args.output_prefix = ".".join(args.barcode_graph.split(".")[:-1]) + "_lcpg"
if args.outfile == "":
args.outfile = ".".join(args.barcode_graph.split(".")[:-1]) + "_lcpg.gexf"
return args
......@@ -43,7 +43,7 @@ def main():
debug_path = "/dev/null"
if args.debug:
debug = True
debug_path = f"{args.output_prefix}_debug"
debug_path = ".".join(args.barcode_graph.split(".")[:-1]) + "_debug"
import os, shutil
if os.path.isdir(debug_path):
shutil.rmtree(debug_path)
......@@ -58,7 +58,7 @@ def main():
verbose=args.verbose
)
nx.write_gexf(d2g, f"{args.output_prefix}.gexf")
nx.write_gexf(d2g, args.outfile)
if __name__ == "__main__":
......
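As a side note on the default output name computed in `parse_arguments`, the sketch below isolates the same expression in a small function so its behaviour can be checked on a few inputs. The function name `default_outfile` and the example paths are hypothetical; only the join/split expression comes from the script.

```python
def default_outfile(barcode_graph, suffix="_lcpg.gexf"):
    """Build the default output name from the input graph path.

    Hypothetical helper: it wraps the expression used in parse_arguments.
    """
    # Drop the last dot-separated component (the extension), keep the rest.
    stem = ".".join(barcode_graph.split(".")[:-1])
    return stem + suffix


print(default_outfile("data/sim.gexf"))      # data/sim_lcpg.gexf
print(default_outfile("data/sim.v2.gexf"))   # data/sim.v2_lcpg.gexf
print(default_outfile("reads"))              # _lcpg.gexf (no dot: the stem is empty)
```

The last case shows that an input name without any dot yields an empty stem. If that matters, `os.path.splitext` from the standard library would keep the full name in that case.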