# Linked Reads molecule separation

A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.

**WARNING**: This code is a proof of concept, not production-ready software.
If the code is too slow for your tests or you are encountering bugs (maybe it's a feature? :p), don't hesitate to contact us via the issues or by mailing me directly (yoann [dot] dufresne [at] pasteur [dot] fr).

## Nomenclature warnings

While the scientific article was being written, some of the data structure names were changed.
Most of the names in this repository are still the old ones.
Here is a short list of equivalences:
- unit d-graph -> local clique pair
- udg -> lcp
- d²-graph (or d2-graph) -> lcp graph
- udg divergence = lcp weight
- udg edge distance = lcp edge weight

## Installation

Install the package from the root directory.

```bash
  # For users
  pip install . --user
  # For developers
  pip install -e . --user
```

## Scripts

Most of the scripts use argparse.
To see how to use a script, run it with the `-h` command line option.

### Test the complete pipeline on simulated data

For a complete test, we wrote a set of snakemake files.
If you are looking for a complete pipeline starting from synthetic data generation, look at the "Snakefile_d2_eval" file.
You can play with the variables N (number of molecules in the interval graph), M (average number of merges to perform in a barcode) and DEV (standard deviation of the number of merges) to see their impact on performance.
These values are arrays: you can enter multiple values and every combination will be run.
A summary is written to the tsv file "{WORKDIR}/eval_compare_maxclique.tsv".

Warning: the pipeline can be very slow when the number of parameter combinations is large.

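To get a feeling for how quickly the number of runs grows, here is a minimal sketch that enumerates the combinations of N, M and DEV. The values below are hypothetical and are not the ones set in "Snakefile_d2_eval"; the sketch only assumes that one evaluation is run per combination, as described above.

```python
from itertools import product

# Hypothetical example values, not taken from Snakefile_d2_eval.
N = [5000, 10000]   # number of molecules in the interval graph
M = [2, 3]          # average number of merges per barcode
DEV = [0, 1]        # standard deviation of the number of merges

# One run per element of the Cartesian product of the three arrays.
combinations = list(product(N, M, DEV))

print(f"{len(combinations)} runs to perform")
for n, m, dev in combinations:
    print(f"N={n} M={m} DEV={dev}")
```

The number of runs is the product of the array lengths (here 2 x 2 x 2 = 8), so adding one value to each array grows the total multiplicatively.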

Command to run the pipeline:
```
  snakemake -s Snakefile_d2_eval
```

### Data simulation

* generate_fake_molecule_graph.py: Create a linear molecule graph, where each molecule is linked to the d molecules on its left and the d molecules on its right.
* generate_fake_barcode_graph.py: Take a molecule graph as input (gexf formatted) and output a barcode graph.
  The barcode graph is created by merging nodes of the molecule graph.
* Use the snakefile "Snakemake_data_simu".
  Every parameter can be an integer or a list of integers.
  Each combination of parameters generates a barcode graph.
  Config parameters:
  * n: the number of initial molecules
  * m: the average number of nodes merged in each barcode
  * d: the average coverage of a molecule in the initial graph
  * workdir: the directory to create and use as output

### Data structures and algorithms

* Create a d2 graph from a barcode graph: use the snakemake file "Snakefile_d2".
  The result is generated as a compressed file in the workdir.
  Config parameters:
  * input: the input barcode graph (gexf format preferred).
  * workdir: the working and output directory.
* to_d2_graph.py: Load a barcode graph into memory and create a d2 graph from it.
* d2_to_path.py: Take a d2 graph as input and explore its nodes to extract a udg path.
* evaluate.py: Take a d2 graph gexf file and analyse it.
  Look for an approximation of the longest correct path to reconstruct a molecule graph.
  Takes as input a d2 graph where the truth is known from the node names (the format used to create fake data).
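
To make the two data simulation steps more concrete, here is a minimal, standard-library-only sketch of the idea. It is not the code of generate_fake_molecule_graph.py or generate_fake_barcode_graph.py: the function names are hypothetical, graphs are plain adjacency dictionaries instead of gexf files, every barcode receives exactly `m` molecules (the scripts use an average), and dropping edges between molecules of the same barcode is an assumption.

```python
import random


def linear_molecule_graph(n, d):
    """Return adjacency sets for n molecules on a line.

    Molecule i is linked to at most d molecules on its left and d on its right.
    """
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + d, n - 1) + 1):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def merge_into_barcodes(adj, m, seed=0):
    """Fuse random groups of m molecules into barcode nodes.

    Simplification: groups have a fixed size m, and edges between two
    molecules of the same barcode are dropped (assumption).
    """
    rng = random.Random(seed)
    molecules = sorted(adj)
    rng.shuffle(molecules)
    # Consecutive chunks of the shuffled list share a barcode id.
    barcode_of = {mol: idx // m for idx, mol in enumerate(molecules)}
    barcode_adj = {b: set() for b in set(barcode_of.values())}
    for mol, neighbours in adj.items():
        for other in neighbours:
            a, b = barcode_of[mol], barcode_of[other]
            if a != b:
                barcode_adj[a].add(b)
                barcode_adj[b].add(a)
    return barcode_of, barcode_adj


molecule_graph = linear_molecule_graph(n=10, d=2)
barcode_of, barcode_graph = merge_into_barcodes(molecule_graph, m=2)
print(len(molecule_graph), "molecules ->", len(barcode_graph), "barcodes")
```

Because unrelated molecules end up sharing a barcode, the barcode graph mixes neighbourhoods that are far apart in the molecule graph. Separating them again is what the d2 graph construction is meant to help with.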