Authored by Yoann Dufresne
Linked Reads molecule separation
A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.
Nomenclature warnings
Some data structure names were changed while the scientific article was being written. Most names in this repository are still the old ones, so here is a short list of equivalences (old name -> new name):
- unit d-graph -> local clique pair
- udg -> lcp
- d²-graph (or d2-graph) -> lcp graph
- udg divergence -> lcp weight
- udg edge distance -> lcp edge weight
Installation
Install the package from the root directory.
# For users
pip install . --user
# For developers
pip install -e . --user
Scripts
Most of the scripts use argparse. Run a script with the -h command line option to see how to use it.
Data simulation
- generate_fake_molecule_graph.py: Create a linear molecule graph in which each molecule is linked to the d molecules on its left and the d molecules on its right.
- generate_fake_barcode_graph.py: Take a molecule graph as input (gexf format) and output a barcode graph. The barcode graph is created by fusing nodes of the molecule graph.
- Use the snakefile "Snakemake_data_simu". Each parameter can be an integer or a list of integers. Each combination of parameter values generates one barcode graph.
  Config parameters:
  - n: the number of initial molecules
  - m: the average number of nodes merged into each barcode
  - d: the average coverage of a molecule in the initial graph
  - workdir: the directory to create and use as output
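The simulation steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code of the repository scripts: the function names `linear_molecule_graph`, `merge_into_barcodes` and `parameter_grid` are hypothetical, graphs are stored as dictionaries of neighbour sets instead of gexf files, and every barcode receives exactly m molecules (the last one possibly fewer), whereas the pipeline describes m as an average.

```python
import random
from itertools import product


def linear_molecule_graph(n, d):
    """Build a linear graph of n molecules (hypothetical helper).

    Molecule i is linked to the d molecules before it and the d molecules
    after it, when they exist.
    """
    graph = {i: set() for i in range(n)}
    for i in range(n):
        # Link i to its right neighbours; the left links are added symmetrically.
        for j in range(i + 1, min(i + d, n - 1) + 1):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def merge_into_barcodes(molecule_graph, m, seed=0):
    """Fuse molecule nodes into barcode nodes (hypothetical helper).

    Assumption: molecules are shuffled and grouped into blocks of m.
    Two barcodes are linked when at least one pair of their molecules is
    linked in the molecule graph. Links inside a barcode are dropped.
    """
    rng = random.Random(seed)
    molecules = sorted(molecule_graph)
    rng.shuffle(molecules)
    barcode_of = {mol: idx // m for idx, mol in enumerate(molecules)}

    barcode_graph = {b: set() for b in set(barcode_of.values())}
    for mol, neighbours in molecule_graph.items():
        for other in neighbours:
            a, b = barcode_of[mol], barcode_of[other]
            if a != b:
                barcode_graph[a].add(b)
                barcode_graph[b].add(a)
    return barcode_graph, barcode_of


def parameter_grid(n, m, d):
    """Expand integer-or-list parameters into every combination."""
    def as_list(value):
        return value if isinstance(value, list) else [value]

    return [
        {"n": a, "m": b, "d": c}
        for a, b, c in product(as_list(n), as_list(m), as_list(d))
    ]
```

For example, `parameter_grid(100, [2, 3], [1, 2, 3])` yields six parameter sets, and each one could be passed to `linear_molecule_graph` and then `merge_into_barcodes` to obtain one simulated barcode graph.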
Data structures and algorithms
- Create a d2 graph from a barcode graph: use the snakefile "Snakefile_d2". The result is generated as a compressed file in the workdir.
  Config parameters:
  - input: the input barcode graph (gexf format preferred).
  - workdir: the working and output directory.
- to_d2_graph.py: Load a barcode graph into memory and create a d2 graph from it.
- d2_to_path.py: Take a d2 graph as input and explore its nodes to extract a udg path.
- evaluate.py: Take a d2 graph gexf file and analyse it. The script looks for an approximation of the longest correct path in order to reconstruct a molecule graph. The input must be a d2 graph in which the truth is known from the node names (the format used to create fake data).