# Linked Reads molecule separation

A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.

## Nomenclature warnings
While writing a scientific article, some of the data structure names were changed.
Most of the names in this repository are still the old ones.
Here is a short list of equivalences (old name -> new name):
- unit d-graph -> local clique pair
- udg -> lcp
- d²-graph (or d2-graph) -> lcp graph
- udg divergence -> lcp weight
- udg edge distance -> lcp edge weight

## Installation

Install the package from the root directory.
```bash
    # For users
    pip install . --user
    # For developers
    pip install -e . --user
```

## Scripts

Most of the scripts use argparse.
To see how to use a script, run it with the `-h` command-line option.

### Data simulation

* generate_fake_molecule_graph.py: Create a linear molecule graph, where the molecules are linked to the d molecules on their left and d molecules on their right.

* generate_fake_barcode_graph.py: Take a molecule graph as input (GEXF format) and output a barcode graph. The barcode graph is created by merging nodes of the molecule graph.

* Use the Snakefile "Snakemake_data_simu".
Each parameter can be an integer or a list of integers.
Each combination of parameters generates a barcode graph.  
Config parameters:
  * n: the number of initial molecules
  * m: average number of nodes merged into each barcode
  * d: average coverage of a molecule in the initial graph
  * workdir: the directory to create and use as output
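
As a rough illustration of the simulation described above, here is a minimal sketch in plain Python. It is not the code of the scripts in this repository. The function names, the `config` dictionary, the fixed (rather than average) values of `d` and `m`, and the random choice of which molecules are merged are all assumptions made for the example.

```python
import itertools
import random


def linear_molecule_graph(n, d):
    """Adjacency sets for n molecules on a line.

    Each molecule is linked to the d molecules on its left and the
    d molecules on its right (fewer near both ends of the line).
    """
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, min(i + d, n - 1) + 1):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def merge_into_barcodes(adj, m, seed=0):
    """Merge random groups of m molecules into barcode nodes.

    Assumption: every barcode absorbs exactly m molecules (the last
    one may get fewer); the README only states an average.
    """
    rng = random.Random(seed)
    molecules = sorted(adj)
    rng.shuffle(molecules)
    # barcode_of[molecule] = index of the barcode that absorbs it
    barcode_of = {mol: idx // m for idx, mol in enumerate(molecules)}
    barcodes = {b: set() for b in set(barcode_of.values())}
    for u, neighbours in adj.items():
        for v in neighbours:
            bu, bv = barcode_of[u], barcode_of[v]
            if bu != bv:  # skip self-loops created by the merge
                barcodes[bu].add(bv)
    return barcodes, barcode_of


def as_list(value):
    """Accept either a single integer or a list of integers."""
    return value if isinstance(value, list) else [value]


# Hypothetical config: one barcode graph per (n, d, m) combination.
config = {"n": [10, 20], "d": 2, "m": [1, 2]}
graphs = {}
for n, d, m in itertools.product(
    as_list(config["n"]), as_list(config["d"]), as_list(config["m"])
):
    graphs[(n, d, m)], _ = merge_into_barcodes(linear_molecule_graph(n, d), m)
```

With this hypothetical config, four barcode graphs are built, one for each combination of `n`, `d` and `m`. With `m = 1` no molecules are merged, so the barcode graph has the same shape as the molecule graph.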

### Data structures and algorithms

* Create a d2 graph from a barcode graph: use the Snakefile "Snakefile_d2".  
The result is generated as a compressed file in the workdir.  
Config parameters:

  * input: the input barcode graph (gexf format preferred).
  * workdir: The working and output directory.

* to_d2_graph.py: Load a barcode graph into memory and create a d2 graph from it.

* d2_to_path.py: Take a d2 graph as input and explore its nodes to extract a udg path.

* evaluate.py: Take a d2 graph GEXF file and analyse it. Look for an approximation of the longest correct path to reconstruct a molecule graph. The input must be a d2 graph where the truth is known from the node names (the format used to create fake data).