# Linked Reads molecule separation

A compilation of scripts and pipelines to count and extract scaffolds of barcodes from linked reads datasets.

**WARNING**: This code is a proof of concept, not production-ready software. If the code is too slow for your tests or you encounter bugs (maybe it's a feature? :p), don't hesitate to contact us via the issues or by emailing me directly (yoann [dot] dufresne [at] pasteur [dot] fr).

## Nomenclature warnings
While writing the scientific article, we renamed some of the data structures.
Most names in this repository are still the old ones.
Here is a short list of equivalences (old name -> new name):
- unit d-graph -> local clique pair
- udg -> lcp
- d²-graph (or d2-graph) -> lcp graph
- udg divergence -> lcp weight
- udg edge distance -> lcp edge weight
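
When reading the code next to the article, the equivalences listed above can be kept at hand as a small lookup table. The sketch below is only an illustration: the mapping is copied from the list, while the `OLD_TO_NEW` and `to_article_name` names are hypothetical and do not exist in this repository.

```python
# Old (repository) names mapped to the names used in the article.
# The mapping comes from the list above; the helper itself is hypothetical.
OLD_TO_NEW = {
    "unit d-graph": "local clique pair",
    "udg": "lcp",
    "d²-graph": "lcp graph",
    "d2-graph": "lcp graph",
    "udg divergence": "lcp weight",
    "udg edge distance": "lcp edge weight",
}


def to_article_name(old_name: str) -> str:
    """Return the article name for an old name, or the input if it is unknown."""
    return OLD_TO_NEW.get(old_name.strip().lower(), old_name)


print(to_article_name("udg"))  # lcp
```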

## Installation

Install the package from the root directory.
```bash
    # For users
    pip install . --user
    # For developers
    pip install -e . --user
```

## Scripts

Most of the scripts use argparse to parse their command line.
To see how to use a script, run it with the -h command-line option.
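
As a reminder of how argparse provides the -h option, here is a minimal sketch. The arguments (`barcode_graph`, `--output`) are hypothetical and are not the real options of any script in this repository; only the argparse calls themselves are standard.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical arguments, for illustration only.
    parser = argparse.ArgumentParser(
        description="Example script taking a barcode graph as input."
    )
    parser.add_argument("barcode_graph", help="path to the input graph (gexf)")
    parser.add_argument("--output", "-o", default="out.gexf", help="output file")
    return parser


parser = build_parser()
# argparse adds -h/--help automatically; this is the text it would print.
help_text = parser.format_help()
print(help_text)

# Parse an explicit argument list instead of sys.argv to keep the sketch testable.
args = parser.parse_args(["graph.gexf", "-o", "result.gexf"])
print(args.barcode_graph, args.output)  # graph.gexf result.gexf
```

With a real script, the same help text is displayed by running it with -h.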

### Test the complete pipeline on simulated data

For a complete test, we wrote several snakemake files.
If you are looking for a complete pipeline starting from synthetic data generation, look at the "Snakefile_d2_eval" file.
You can change the variables N (number of molecules in the interval graph), M (average number of merges to perform in a barcode) and DEV (standard deviation of the number of merges) to see their impact on performance.
These values are arrays: you can enter multiple values, and every combination will be run.
A summary is written to the tsv file "{WORKDIR}/eval_compare_maxclique.tsv".
Warning: the pipeline can be very slow when there are many parameter combinations.

Command to run the pipeline:
```bash
    snakemake -s Snakefile_d2_eval
```
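
To get a feeling for how quickly the number of runs grows, the sketch below enumerates every combination of N, M and DEV. The values are hypothetical examples, not the defaults of the Snakefile, and the real pipeline may expand the combinations differently.

```python
from itertools import product

# Hypothetical values; in the Snakefile, N, M and DEV are arrays.
N = [1000, 5000]
M = [2, 3]
DEV = [0, 1]

# One run per element of the Cartesian product of the three arrays.
combinations = list(product(N, M, DEV))
print(len(combinations))  # 8
for n, m, dev in combinations:
    print(f"n={n} m={m} dev={dev}")
```

The number of runs is the product of the array lengths, which is why long arrays make the pipeline slow.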

### Data simulation

* generate_fake_molecule_graph.py: Create a linear molecule graph, where the molecules are linked to the d molecules on their left and d molecules on their right.

* generate_fake_barcode_graph.py: Take a molecule graph as input (gexf formatted) and output a barcode graph. The barcode graph is created by merging nodes of the molecule graph.

* Use the snakefile "Snakemake_data_simu".
Each parameter can be an integer or a list of integers.
Each combination of parameters generates a barcode graph.  
Config parameters:
  * n: the number of initial molecules
  * m: average number of nodes merged in each barcode
  * d: average coverage of a molecule in the initial graph
  * workdir: the directory to create and use as output
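
The two simulation steps described above can be sketched in plain Python. This is a simplified illustration, not the code of the scripts: the function names are hypothetical, graphs are stored as adjacency sets instead of gexf files, and each barcode merges exactly m molecules chosen at random (an assumption; the scripts use m as an average).

```python
import random


def linear_molecule_graph(n: int, d: int) -> dict[int, set[int]]:
    """Adjacency sets: molecule i is linked to the d molecules on each side."""
    return {
        i: {j for j in range(max(0, i - d), min(n, i + d + 1)) if j != i}
        for i in range(n)
    }


def merge_into_barcodes(graph, m, seed=0):
    """Randomly group molecules m by m and merge each group into one barcode."""
    rng = random.Random(seed)
    molecules = list(graph)
    rng.shuffle(molecules)
    # Assumption: fixed group size m (the last group may be smaller).
    barcode_of = {mol: idx // m for idx, mol in enumerate(molecules)}
    barcodes = {b: set() for b in set(barcode_of.values())}
    # Two barcodes are linked if any of their molecules are linked.
    for mol, neighbours in graph.items():
        for nei in neighbours:
            a, b = barcode_of[mol], barcode_of[nei]
            if a != b:
                barcodes[a].add(b)
                barcodes[b].add(a)
    return barcodes, barcode_of


mol_graph = linear_molecule_graph(n=10, d=2)
print(sorted(mol_graph[0]))  # [1, 2]
print(sorted(mol_graph[5]))  # [3, 4, 6, 7]

barcode_graph, barcode_of = merge_into_barcodes(mol_graph, m=2)
print(len(barcode_graph))  # 5
```

Molecules at the ends of the line have fewer than 2d neighbours, which is why molecule 0 only has two.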

### Data structures and algorithms

* Create a d2 graph from a barcode graph: use the snakefile "Snakefile_d2".  
The result is generated as a compressed file in the workdir.  
Config parameters:

  * input: the input barcode graph (gexf format preferred).
  * workdir: the working and output directory.

* to_d2_graph.py: Load a barcode graph into memory and create a d2 graph from it.

* d2_to_path.py: Take a d2 graph as input and explore its nodes to extract a udg path.

* evaluate.py: Take a d2 graph gexf file and analyse it. Look for an approximation of the longest correct path to reconstruct a molecule graph. The input must be a d2 graph where the truth is known from the node names (the format used to create fake data).