Linked Reads molecule separation

A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.

Nomenclature warnings

While the associated scientific article was being written, some of the data structure names were changed. Most of the names in this repository are still the old ones, so here is a short list of equivalences (old name -> new name):

  • unit d-graph -> local clique pair
  • udg -> lcp
  • d²-graph (or d2-graph) -> lcp graph
  • udg divergence -> lcp weight
  • udg edge distance -> lcp edge weight

Installation

Install the package from the root directory.

    # For users
    pip install . --user
    # For developers
    pip install -e . --user

Scripts

Most of the scripts use argparse. Run a script with the -h command line option to see how to use it.

Data simulation

  • generate_fake_molecule_graph.py: Create a linear molecule graph, where the molecules are linked to the d molecules on their left and d molecules on their right.

  • generate_fake_barcode_graph.py: Takes a molecule graph as input (gexf format) and outputs a barcode graph. The barcode graph is created by merging nodes of the molecule graph.

  • Use the snakefile "Snakemake_data_simu". The numeric parameters (n, m, d) can each be an integer or a list of integers. Each combination of parameter values generates one barcode graph.
    Config parameters:

    • n: the number of initial molecules
    • m: average number of nodes merged into each barcode
    • d: average coverage of a molecule in the initial graph
    • workdir: the directory to create and use as output
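
The simulation steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code of the scripts themselves: the function names (linear_molecule_graph, merge_into_barcodes, expand_grid) are hypothetical, graphs are held as dictionaries of neighbour sets rather than gexf files, and every barcode receives exactly m molecules where the real parameter is an average.

```python
import itertools
import random


def linear_molecule_graph(n, d):
    """Adjacency sets of a linear graph on molecules 0..n-1.

    Molecule i is linked to the d molecules on its left and the d
    molecules on its right (fewer near the two ends of the line).
    """
    graph = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + d, n - 1) + 1):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def merge_into_barcodes(molecule_graph, m, seed=0):
    """Fuse molecules into barcodes and return (barcode_graph, barcode_of).

    Assumption: molecules are shuffled and cut into groups of exactly m.
    Two barcodes are linked when at least one pair of their molecules is
    linked in the molecule graph.
    """
    rng = random.Random(seed)
    molecules = sorted(molecule_graph)
    rng.shuffle(molecules)
    barcode_of = {mol: idx // m for idx, mol in enumerate(molecules)}
    barcode_graph = {b: set() for b in set(barcode_of.values())}
    for mol, neighbours in molecule_graph.items():
        for other in neighbours:
            a, b = barcode_of[mol], barcode_of[other]
            if a != b:  # molecules fused into the same barcode add no edge
                barcode_graph[a].add(b)
                barcode_graph[b].add(a)
    return barcode_graph, barcode_of


def expand_grid(config):
    """Yield one {name: value} dict per combination of parameter values.

    A value that is not a list is treated as a single-element list.
    """
    names = sorted(config)
    value_lists = [
        config[name] if isinstance(config[name], list) else [config[name]]
        for name in names
    ]
    for combo in itertools.product(*value_lists):
        yield dict(zip(names, combo))
```

For instance, iterating over expand_grid({"n": [100, 200], "m": 2, "d": [3, 5]}) gives four parameter sets, and each one can be passed to linear_molecule_graph and then merge_into_barcodes to obtain one simulated barcode graph.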

Data structures and algorithms

  • Create a d2 graph from a barcode graph: use the snakefile "Snakefile_d2".
    The result is generated as a compressed file in the workdir.
    Config parameters:

    • input: the input barcode graph (gexf format preferred).
    • workdir: The working and output directory.
  • to_d2_graph.py: Load a barcode graph into memory and create a d2 graph from it.

  • d2_to_path.py: Takes a d2 graph as input and explores its nodes to extract a udg path.

  • evaluate.py: Takes a d2 graph gexf file and analyses it. It looks for an approximation of the longest correct path from which a molecule graph can be reconstructed. The input must be a d2 graph where the truth is known from the node names (the format used to create fake data).
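
All of the tools above exchange graphs as gexf files, which are XML documents listing node and edge elements. As a rough illustration of that input format, the sketch below reads an undirected gexf graph using only the Python standard library. The helper names (local_name, read_gexf) are hypothetical, and the repository's scripts may well load their graphs differently, for example through a graph library.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def read_gexf(source):
    """Return (labels, adjacency) for an undirected gexf graph.

    source is a file path or a file object accepted by ElementTree.parse.
    labels maps a node id to its label, adjacency maps a node id to the
    set of its neighbours' ids.
    """
    labels = {}
    adjacency = {}
    for elem in ET.parse(source).iter():
        name = local_name(elem.tag)
        if name == "node":
            node_id = elem.get("id")
            labels[node_id] = elem.get("label", node_id)
            adjacency.setdefault(node_id, set())
        elif name == "edge":
            u, v = elem.get("source"), elem.get("target")
            adjacency.setdefault(u, set()).add(v)
            adjacency.setdefault(v, set()).add(u)
    return labels, adjacency
```

The labels are kept next to the adjacency because, in the simulated data, the ground truth is stored in the node names.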