Commit de7c603a authored by Alexis  CRISCUOLO's avatar Alexis CRISCUOLO

0.01

parent 06478cb8
......@@ -18,7 +18,7 @@ The source code of _SAM2MSA_ is inside the _src_ directory and can be compiled a
#### Building executable jar files
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (6 or higher) installed, Java executable jar files can be created.
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (8 or higher) installed, Java executable jar files can be created.
In a command-line window, go to the _src_ directory and type:
```bash
javac SAM2MAP.java
......@@ -163,7 +163,31 @@ Run _SAM2MAP_ without option to read the following documentation:
* For each position of the specified reference, _SAM2MAP_ summarizes the corresponding aligned bases in a tab-delimited MAP file. _SAM2MAP_ also estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (.cov.txt); the Poisson+NB distribution is used to determine the min/max coverage depths to assess reference regions where the consensus sequence can be accurately built.
* For most positions (e.g. associated with the map code 'M' in the assessed min/max read coverage regions), the inferred consensus character state is the majority one(s) observed in at least 80% of the aligned bases (option -f). Of note, when more than one character state is required to determine at least 80% of the aligned ones, the corresponding degenerate nucleotide is inferred in the consensus sequence. This can be useful to build consensus sequences corresponding to quasi-species.
* For most positions (e.g. associated with the map code 'M' in the assessed min/max read coverage regions), the inferred consensus character state is the majority one(s) observed in at least 80% of the aligned bases (option -f). Of note, when more than one character state is required to determine at least 80% of the aligned ones, the corresponding degenerate nucleotide is inferred in the consensus sequence. This can be useful to build consensus sequences corresponding to quasi-species. The table below lists the inferred base in the consensus sequence for each majority character state set:
<div align="center">
| majority set | inferred base | | majority set | inferred base |
|:------------:|:-------------:|:-:|:------------:|:-------------:|
| `{A}` | `A` | | `{A-}` | `a` |
| `{C}` | `C` | | `{C-}` | `c` |
| `{G}` | `G` | | `{G-}` | `g` |
| `{T}` | `T` | | `{T-}` | `t` |
| `{AC}` | `M` | | `{AC-}` | `m` |
| `{AG}` | `R` | | `{AG-}` | `r` |
| `{AT}` | `W` | | `{AT-}` | `w` |
| `{CG}` | `S` | | `{CG-}` | `s` |
| `{CT}` | `Y` | | `{CT-}` | `y` |
| `{GT}` | `K` | | `{GT-}` | `k` |
| `{ACG}` | `V` | | `{ACG-}` | `v` |
| `{ACT}` | `H` | | `{ACT-}` | `h` |
| `{AGT}` | `D` | | `{AGT-}` | `d` |
| `{CGT}` | `B` | | `{CGT-}` | `b` |
| `{ACGT}` | `N` | | `{ACGT-}` | `n` |
</div>
* As the main aim of _SAM2MSA_ is to quickly build MSA from read alignments, it should be stressed that insertions are never considered, therefore guaranteeing that the consensus sequence will have the same length as the the reference, which is expected to quickly build an MSA by pooling all consensus sequences.
......@@ -350,18 +374,18 @@ Faster running times can be obtained by using [minimap2](https://github.com/lh3/
The directory _example/_ contains the three output files written by _SAM2MAP_ for the sample ERR1806508:
* _ERR1806508.cov.txt_: the observed and theoretical mapping coverage distributions,
* _ERR1806508.map_: the MAP file summarizing the overall mapping for each reference position,
* _ERR1806508.map.gz_: the gziped MAP file summarizing the overall mapping for each reference position,
* _ERR1806508.fasta_: the consensus sequences in FASTA format.
#### _FASTA2MSA_ on consensus sequences
The following command line allows creating a MSA using _FASTA2MSA_ on the 38 generated consensus sequences (the input file _infile.txt_ is available in the directory _example/_):
The following command line create a MSA using _FASTA2MSA_ on the 38 generated consensus sequences (the input file _infile.txt_ is available _example/_):
```bash
FASTA2MSA -i infile.txt -o msa -g CH16.gff -v
```
The directory _example/_ contains the three output files written by _FASTA2MSA_:
* _msa.fasta_: a multiple sequence alignment of 8,371,282 aligned nucleotide characters in FASTA format,
The directory _example/_ contains the tow output files written by _FASTA2MSA_:
* _msa.fasta.gz_: a multiple sequence alignment of 8,371,282 aligned nucleotide characters in gziped FASTA format,
* _msa.var.tsv_: a tab-delimited file summarizing the 489 variable characters inside _msa.fasta_.
......@@ -373,20 +397,19 @@ iqtree2 -s msa.fasta -T 12 -m HKY+F
```
Branch lengths of the ML tree (_msa.fasta.treefile_ in _example/_) were rescaled using the total number of aligned characters (i.e. 8,371,282) to estimate the number of SNPs on each branch with the following _awk_ one-liner:
```bash
awk 'BEGIN{s=8371282;p=0} {t=t$0}
END{while((c=substr(t,++x,1))!=";")
if(c==":"){
y=++x; while((c=substr(t,++y,1))!=","&&c!=")"){}
t=substr(t,1,x-1)sprintf("%."p"f",s*substr(t,x,y-x))substr(t,y)}
print t}' msa.fasta.treefile > msa.nwk
awk -v s=8371282 '{t=t$0}
END{while((c=substr(t,++x,1))!=";")
if(c==":"){y=++x; while((c=substr(t,++y,1))!=","&&c!=")"){}
t=substr(t,1,x-1)sprintf("%.0f",s*substr(t,x,y-x))substr(t,y)} print t}' msa.fasta.treefile > msa.nwk
```
The final phylogenetic tree (_msa.nwk_ in _example/_) is represented below (to be compared with Figure 2 in [Seth-Smith et al. 2019](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article)).
The final phylogenetic tree (_msa.nwk_ in _example/_) is represented below (to be compared with [Figure 2](https://wwwnc.cdc.gov/eid/article/25/6/17-2119-f2) in [Seth-Smith et al. 2019](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article)).
![CH16 SNP tree](example/msa.nwk.png "CH16 SNP tree")
<!--- <iframe width="800" height="900" src="https://phylogeny.io/share/0ad13a8140437565f9593fba1b04b36028dc882b"></iframe> --->
<!--- https://phylogeny.io/edit/0ad13a8140437565f9593fba1b04b36028dc882b/dcc6740bddd18a1f0f93c031ad38beaf5c17b4e6 --->
## References
Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty19).
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment