@@ -34,6 +34,7 @@ The source code of _SAM2MSA_ is inside the _src_ directory and can be compiled a
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html)(8 or higher) installed, Java executable jar files can be created.
In a command-line window, go to the _src_ directory and type:
```bash
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
...
...
@@ -43,7 +44,9 @@ do
rm MANIFEST.MF $p.class ;
done
```
This will create the three executable jar files `SAM2MAP.jar`, `MAP2FASTA.jar` and `FASTA2MSA.jar` that can be run with the following command line models:
On computers with [GraalVM](https://www.graalvm.org/downloads/) installed, native executables can also be built.
In a command-line window, go to the _src_ directory, and type:
```bash
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
...
...
@@ -62,7 +66,9 @@ do
rm $p.class ;
done
```
This will create the three native executables `SAM2MAP`, `MAP2FASTA` and `FASTA2MSA` that can be launched with the following command line models:
```bash
./SAM2MAP [options]
./MAP2FASTA [options]
...
...
@@ -173,7 +179,12 @@ Run _SAM2MAP_ without option to read the following documentation:
* By default, all sequenced bases with Phred score < 20 are not considered (option `-q`), therefore minimizing the impact of sequencing errors when computing the consensus sequence. By default, all read alignments with Phred score < 20 (as assessed by the read mapping program) are also not considered (option `-Q`), therefore discarding from the consensus sequence every region with low mappability (e.g. low complexity or repeated regions).
* _SAM2MAP_ estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (cov.txt file extension). The Poisson distribution is dedicated to observed (near-)zero read coverage distribution. The NB distribution is used to determine the min/max coverage depths to assess reference regions where the consensus sequence can be trustingly built.
* _SAM2MAP_ estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (cov.txt file extension). <br>
The Poisson distribution is dedicated to observed (near-)zero read coverage distribution (called the coverage tail distribution into output files *.cov.txt). It is determined by the probability mass function (PMF) <b>P</b><sub><em>λ</em></sub>(<em>x</em>) = <em>λ</em><sup><em>x</em></sup><em>e</em><sup>-<em>λ</em></sup>Γ(<em>x</em>+1)<sup>-1</sup>, where Γ is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function). <br>
The (main) NB distribution is used to determine the min/max coverage depths (as ruled by option `-p`) to assess reference regions where the consensus sequence can be reliably built. The NB(<em>p</em>,<em>r</em>) distribution is determined by the PMF <b>P</b><sub><em>p</em>,<em>r</em></sub>(<em>x</em>) = Γ(<em>r</em>+<em>x</em>) Γ(<em>x</em>+1)<sup>-1</sup>Γ(<em>r</em>)<sup>-1</sup><em>p</em><sup><em>x</em></sup> (1-<em>p</em>)<sup><em>r</em></sup>. However, when the observed read coverage distribution is not overdispersed (i.e. the NB parameter <em>r</em> tends to infinity), the theoretical NB distribution is replaced by the Generalized Poisson (GP) one. The GP(<em>λ'</em>,<em>ρ</em>) distribution is here determined by the PMF <b>P</b><sub><em>λ'</em>,<em>ρ</em></sub>(<em>x</em>) = <em>λ'</em> (<em>λ'</em>+<em>ρx</em>)<sup><em>x</em>-1</sup><em>e</em><sup>-<em>λ'</em>-<em>ρx</em></sup>Γ(<em>x</em>+1)<sup>-1</sup>, where <em>ρ</em> &lt; 0; when <em>ρ</em> = 0, GP(<em>λ'</em>,0) reduces to a Poisson distribution of parameter <em>λ'</em> (for more details, see e.g. Consul and Shoukri 1985). <br>
From the above formalizations, the Poisson+NB theoretical distribution is therefore determined by the PMF <em>w</em><b>P</b><sub><em>λ</em></sub>(<em>x</em>) + (1-<em>w</em>) <b>P</b><sub><em>p</em>,<em>r</em></sub>(<em>x</em>). The values of the different parameters <em>w</em>, <em>λ</em>, <em>p</em> and <em>r</em> are written into output files *.cov.txt. Of note, such statistical results can also be used jointly with a genome coverage profile analysis (e.g. Lindner et al. 2013).
* For each position of the specified reference, _SAM2MAP_ summarizes the corresponding aligned bases in a tab-delimited MAP file. A MAP is defined by the following fields: reference position and base, no. A, C, G, T and gaps, no. reverse read, map code, variant.
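
The PMFs given above for the Poisson, NB and GP distributions, and for the Poisson+NB mixture, can be evaluated numerically as in the following minimal sketch. It is not the _SAM2MAP_ implementation: the class and method names are hypothetical, the log-gamma function is computed with a standard Lanczos approximation chosen here for self-containment, the parameter values in `main` are made up, and returning 0 when <em>λ'</em>+<em>ρx</em> ≤ 0 in the GP case is an assumed truncation convention.

```java
// Hypothetical sketch of the PMFs described above (not the SAM2MAP source code).
final class CoverageMixture {

    // Lanczos coefficients (g = 7, 9 terms), used only to approximate log Γ(z)
    private static final double[] LANCZOS = {
        0.99999999999980993, 676.5203681218851, -1259.1392167224028,
        771.32342877765313, -176.61502916214059, 12.507343278686905,
        -0.13857109526572012, 9.9843695780195716e-6, 1.5056327351493116e-7
    };

    // log Γ(z) for z > 0
    static double logGamma(double z) {
        if (z < 0.5) return Math.log(Math.PI / Math.sin(Math.PI * z)) - logGamma(1.0 - z);
        z -= 1.0;
        double a = LANCZOS[0];
        for (int i = 1; i < LANCZOS.length; i++) a += LANCZOS[i] / (z + i);
        double t = z + 7.5;
        return 0.5 * Math.log(2.0 * Math.PI) + (z + 0.5) * Math.log(t) - t + Math.log(a);
    }

    // Poisson PMF: lambda^x e^(-lambda) / Γ(x+1), with lambda > 0
    static double poissonPmf(int x, double lambda) {
        return Math.exp(x * Math.log(lambda) - lambda - logGamma(x + 1.0));
    }

    // NB(p,r) PMF: Γ(r+x) / (Γ(x+1) Γ(r)) p^x (1-p)^r, with 0 < p < 1 and r > 0
    static double nbPmf(int x, double p, double r) {
        return Math.exp(logGamma(r + x) - logGamma(x + 1.0) - logGamma(r)
                + x * Math.log(p) + r * Math.log(1.0 - p));
    }

    // GP(lambda',rho) PMF: lambda' (lambda'+rho x)^(x-1) e^(-lambda'-rho x) / Γ(x+1)
    // ASSUMPTION: the PMF is set to 0 when lambda' + rho x <= 0 (possible when rho < 0)
    static double gpPmf(int x, double lambda, double rho) {
        double base = lambda + rho * x;
        if (base <= 0.0) return 0.0;
        return Math.exp(Math.log(lambda) + (x - 1) * Math.log(base) - base - logGamma(x + 1.0));
    }

    // Poisson+NB mixture PMF: w P_lambda(x) + (1-w) P_{p,r}(x)
    static double mixturePmf(int x, double w, double lambda, double p, double r) {
        return w * poissonPmf(x, lambda) + (1.0 - w) * nbPmf(x, p, r);
    }

    public static void main(String[] args) {
        // made-up parameter values, for illustration only
        double w = 0.05, lambda = 0.5, p = 0.9, r = 5.0;
        double total = 0.0;
        for (int x = 0; x <= 500; x++) total += mixturePmf(x, w, lambda, p, r);
        System.out.printf("sum of the mixture PMF over x = 0..500: %.6f%n", total);
    }
}
```

Working on the log scale avoids overflow of Γ(<em>x</em>+1) at large coverage depths; with <em>ρ</em> = 0, `gpPmf` returns the same values as `poissonPmf`, in line with the reduction stated above.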
...
...
@@ -355,21 +366,24 @@ The example below shows how the different _SAM2MSA_ programs can be used to buil
#### Sample FASTQ files
The following command lines allow downloading from the [ENA](https://www.ebi.ac.uk/ena) ftp repository the 38 pairs of FASTQ files (Illumina MiSeq) associated to the 38 sequenced isolate genomes:
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
Of note, faster downloading times can be observed using [axel](https://github.com/axel-download-accelerator/axel) instead of [wget](https://www.gnu.org/software/wget/): on computers with [axel](https://github.com/axel-download-accelerator/axel) installed, replace the two occurrences of `wget` by e.g. `axel -a -n 10`.
#### Reference FASTA and GFF3 files
The following command lines allow downloading from the [NCBI](https://www.ncbi.nlm.nih.gov/) ftp repository the FASTA and GFF3 files associated to the reference genome of [_B. stabilis_ CH16](https://www.ncbi.nlm.nih.gov/genome/45559?genome_assembly_id=413612)(3 chromosomes and 1 plasmid; see the corresponding [GenBank repository](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/240/005/GCA_900240005.1)):
The following command lines allow read mapping to be carried out against _CH16.fasta_ using [minimap2](https://github.com/lh3/minimap2)([Li 2018](https://doi.org/10.1093/bioinformatics/bty191)) on 6 threads; the SAM-formatted read alignments are directly read via a pipe (`|`) by _SAM2MAP_ (default options) to obtain the MAP files and the FASTA-formatted consensus sequences:
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
Faster running times can be obtained by using [minimap2](https://github.com/lh3/minimap2) on more threads (option `-t`).
The directory _example/_ contains the three output files written by _SAM2MAP_ for the sample ERR1806508:
...
...
@@ -396,9 +412,11 @@ The directory _example/_ contains the three output files written by _SAM2MAP_ fo
#### _FASTA2MSA_ on consensus sequences
The following command line builds a MSA using _FASTA2MSA_ on the 38 generated consensus sequences (the input file _infile.txt_ is available in _example/_):
```bash
FASTA2MSA -i infile.txt -o msa -g CH16.gff -v
```
The directory _example/_ contains the two output files written by _FASTA2MSA_:
* _msa.fasta.mfc_: a multiple sequence alignment of 8,371,282 aligned nucleotide characters in FASTA format, compressed using [MFCompress](http://bioinformatics.ua.pt/software/mfcompress/)([Pinho and Pratas 2014](http://dx.doi.org/10.1093/bioinformatics/btt594)),
* _msa.var.tsv_: a tab-delimited file summarizing the 489 variable characters inside _msa.fasta_; coding characters are identified from the annotation file _CH16.gff_ specified using option `-g`.
...
...
@@ -407,35 +425,40 @@ The directory _example/_ contains the two output files written by _FASTA2MSA_:
#### Phylogenetic inference
A phylogenetic tree was inferred using [IQ-TREE 2](http://www.iqtree.org/)([Minh et al. 2020](https://academic.oup.com/mbe/article/37/5/1530/5721363)) on 12 threads with evolutionary model HKY+F:
```bash
iqtree2 -s msa.fasta -T 12 -m HKY+F
```
Branch lengths of the ML tree (_msa.fasta.treefile_ in _example/_) were rescaled using the total number of aligned characters (i.e. _s_ = 8,371,282) to estimate the number of SNPs on each branch with the following _awk_ one-liner:
This simple approach leads to accurate estimates, as the _p_-distance _p_ (i.e. number of observed mismatches per character) is always similar to the evolutionary distance _d_ (i.e. number of substitution events per character) when _d_ < 0.1; therefore _sp_ (i.e. the number of SNPs) can be accurately approximated by _sd_ as performed by the above _awk_ one-liner.
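
The rescaling idea (multiplying each branch length _d_ by _s_ to approximate the number of SNPs _sp_) can also be illustrated with the following minimal Java sketch. It is not the _awk_ one-liner referred to above: the class name `RescaleNewick`, the regular expression, and the rounding of _sd_ to the nearest integer are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: rescales every branch length of a Newick tree by s.
final class RescaleNewick {

    // a branch length is assumed to be the number directly following a colon
    private static final Pattern BRANCH =
            Pattern.compile(":([0-9]*\\.?[0-9]+(?:[eE][-+]?[0-9]+)?)");

    // replaces each branch length d by s*d, rounded to the nearest integer (assumption)
    static String rescale(String newick, long s) {
        Matcher m = BRANCH.matcher(newick);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            double d = Double.parseDouble(m.group(1));
            m.appendReplacement(out, ":" + Math.round(d * s));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // toy tree with made-up branch lengths; s is the number of aligned characters
        String tree = "(A:0.0000012,B:0.0000024,(C:0.0000001,D:0.0000036)100:0.0000060);";
        System.out.println(rescale(tree, 8371282L));
    }
}
```

Values preceding a colon (e.g. branch supports) are left unchanged, since only the number following each colon is rewritten.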
The final phylogenetic tree (_msa.nwk_ in _example/_) is represented below (to be compared with [Figure 2](https://wwwnc.cdc.gov/eid/article/25/6/17-2119-f2) in [Seth-Smith et al. 2019](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article)).

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty191).
Consul PC, Shoukri MM (1985) _The generalized poisson distribution when the sample mean is larger than the sample variance_. **Communications in Statistics - Simulation and Computation**, 14(3):667-681. [doi:10.1080/03610918508812463](https://doi.org/10.1080/03610918508812463).
Li H (2018) _Minimap2: pairwise alignment for nucleotide sequences_. **Bioinformatics**, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty191).
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) _The Sequence alignment/map (SAM) format and SAMtools_. **Bioinformatics**, 25(16):2078-2079. [doi:10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352).
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25(16):2078-2079. [doi:10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352).
Lindner MS, Kollock M, Zickmann F, Renard BY (2013) _Analyzing genome coverage profiles with applications to quality control in metagenomics_. **Bioinformatics**, 29(10):1260-1267. [doi:10.1093/bioinformatics/btt147](https://doi.org/10.1093/bioinformatics/btt147).
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37(5):1530-1534. [doi:10.1093/molbev/msaa015](https://doi.org/10.1093/molbev/msaa015).
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) _IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era_. **Molecular Biology and Evolution**, 37(5):1530-1534. [doi:10.1093/molbev/msaa015](https://doi.org/10.1093/molbev/msaa015).
Pinho AJ and Pratas D (2014) MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117-118. [doi:10.1093/bioinformatics/btt594](http://dx.doi.org/10.1093/bioinformatics/btt594).
Pinho AJ and Pratas D (2014) _MFCompress: a compression tool for FASTA and multi-FASTA data_. **Bioinformatics**, 30(1):117-118. [doi:10.1093/bioinformatics/btt594](http://dx.doi.org/10.1093/bioinformatics/btt594).
Seth-Smith HMB, Casanova C, Sommerstein R, Meinel DM, Abdelbary MMH, Blanc DS, Droz S, Führer U, Lienhard R, Lang C, Dubuis O, Schlegel M, Widmer A, Keller PM, Marschall J, Egli A (2019) Phenotypic and Genomic Analyses of Burkholderia stabilis Clinical Contamination, Switzerland. Emerging Infectious Diseases, 25(6):1084-1092. [doi:10.3201/eid2506.172119](https://doi.org/10.3201/eid2506.172119).
Seth-Smith HMB, Casanova C, Sommerstein R, Meinel DM, Abdelbary MMH, Blanc DS, Droz S, Führer U, Lienhard R, Lang C, Dubuis O, Schlegel M, Widmer A, Keller PM, Marschall J, Egli A (2019) _Phenotypic and Genomic Analyses of Burkholderia stabilis Clinical Contamination, Switzerland_. **Emerging Infectious Diseases**, 25(6):1084-1092. [doi:10.3201/eid2506.172119](https://doi.org/10.3201/eid2506.172119).
//##### estimates the inverse NB(p,r) CDF, i.e. the largest x st. CDF(x) < pvalue
...
...
@@ -809,6 +816,21 @@ public class MAP2FASTA {
return gammq(x+1,lambda);
}
//##### estimates the GP(lambda, theta) PMF, i.e. P(X=x) with X~GP(lambda, theta)
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain an underdispersed counting distribution
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain an underdispersed counting distribution
//##### estimates the inverse NB(p,r) CDF, i.e. the largest x st. CDF(x) < pvalue
...
...
@@ -897,6 +904,21 @@ public class SAM2MAP {
return gammq(x+1,lambda);
}
//##### estimates the GP(lambda, theta) PMF, i.e. P(X=x) with X~GP(lambda, theta)
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain an underdispersed counting distribution
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain an underdispersed counting distribution