# "Let's call a graph a d-graph if the vertices can be ordered on a line and any vertices within d/2 of each other are connected by an edge. Observe that a d-graph has all vertices of degree d, except at the very end and beginning. For example, for d=6, the vertex at position 10 will be connected to 7,8,9,11,12,13. A d-graph (sort of?) captures what the neighborhood of a molecule would look like."
#
# Alternatively, observe that a d-graph can be seen as a line in a special overlap graph, where each node is the set of neighbors of a certain barcode graph node, and two nodes in the special overlap graph are linked by an edge if their neighbor sets intersect over d-1 (or d-2?) elements
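#
# Illustrative sketch only (not part of the original script): the definition
# quoted above can be made concrete by building an ideal d-graph on integer
# positions. The helper name make_d_graph and its parameters are hypothetical;
# it assumes d is even and vertices are numbered 0..n_vertices-1.
import networkx as nx


def make_d_graph(n_vertices, d):
    """Return an ideal d-graph: vertices within d/2 positions are connected."""
    half = d // 2
    G = nx.Graph()
    G.add_nodes_from(range(n_vertices))
    for u in range(n_vertices):
        # link u to the next d/2 vertices on the line (edges are undirected,
        # so the previous d/2 vertices are covered when they are visited)
        for v in range(u + 1, min(u + half, n_vertices - 1) + 1):
            G.add_edge(u, v)
    return G

# e.g. sorted(make_d_graph(20, 6).neighbors(10)) gives the six neighbors of
# position 10 from the example in the quote.
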
# arguments: graphml file
# attempts to deconvolve a barcode graph using d-graph detection
# actually uses a very loose definition of d-graph, where consecutive neighbors of a node on the line don't need to differ in cardinality by exactly +1/-1
# (for that, use strict_d_line_compatible_neighbors)
# but may differ by +3/-3
def is_d_graph(graph, all_neighbors_graph):
    # create a graph of 'neighbor-compatibility': whether two nodes in the original graph
    # have 'almost' the same neighbor set, apart from one to the left and one to the right
    # (typical in d-graphs)
    Gn = nx.Graph()
    #print("tentative d graph",list(nx.connected_components(Gn)))
    if len(list(nx.connected_components(Gn))) != 1:
        return False
    # this is really a heuristic:
    #
    # for all nodes in the putative d-graph, make sure that the majority of their neighbors are in the component rather than in the rest of the graph
    # this is a critical filter to remove putative d-graphs that are made of multiple molecules
    # see e.g. 100_5_2-neighbors_of_molecule_21_83 without that filter..
    # 3 d-graphs found: 2 ok, 1 not ok:
    # d graph found ['10:24_65', '12:66_19', '22:25_81', '37:22_42', '46:63_86', '9:85_45'] -> chimera of the molecules neighboring 21 and 83, via barcode 22:25_81
    # number of molecules per barcode: 10 (~/10x/drosophila/chen_data_longranger_run_on_ref/outs$ samtools view phased_possorted_bam.bam | python ~/10x-barcode-graph/scripts/sam_stats.py)
    # so in total, 500k molecules
    # i.e. one molecule starts every 140Mbp/500k = 280bp on average; with ~70kbp molecules, the molecule coverage is around 250x
    # conservatively, it seems that we can get overlaps for at least 20 neighbor molecules
    # so in that setting, considering each molecule as '1bp', i.e. scaling the genome down to 140Mbp/70kbp=2kbp