Commit 6ec228ac authored by Alexis  CRISCUOLO's avatar Alexis CRISCUOLO
Browse files

0.3.3.1

parent e9258760
......@@ -34,6 +34,7 @@ The source code of _SAM2MSA_ is inside the _src_ directory and can be compiled a
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (8 or higher) installed, Java executable jar files can be created.
In a command-line window, go to the _src_ directory and type:
```bash
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
......@@ -43,7 +44,9 @@ do
rm MANIFEST.MF $p.class ;
done
```
This will create the three executable jar files `SAM2MAP.jar`, `MAP2FASTA.jar` and `FASTA2MSA.jar` that can be run with the following command line models:
```bash
java -jar SAM2MAP.jar [options]
java -jar MAP2FASTA.jar [options]
......@@ -54,6 +57,7 @@ java -jar FASTA2MSA.jar [options]
On computers with [GraalVM](hhttps://www.graalvm.org/downloads/) installed, native executables can also be built.
In a command-line window, go to the _src_ directory, and type:
```bash
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
......@@ -62,7 +66,9 @@ do
rm $p.class ;
done
```
This will create the three native executable `SAM2MAP`, `MAP2FASTA` and `FASTA2MSA` that can be launched with the following command line models:
```bash
./SAM2MAP [options]
./MAP2FASTA [options]
......@@ -173,7 +179,12 @@ Run _SAM2MAP_ without option to read the following documentation:
* By default, all sequenced bases with Phread score < 20 are not considered (option `-q`), therefore minimizing the impact of sequencing errors when computing the consensus sequence. By default, all read alignment with Phred score < 20 (as assessed by the read mapping program) are also not considered (option `-Q`), therefore discarding from the consensus sequence every region with low mappability (e.g. low complexity or repeated regions).
* _SAM2MAP_ estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (cov.txt file extension). The Poisson distribution is dedicated to observed (near-)zero read coverage distribution. The NB distribution is used to determine the min/max coverage depths to assess reference regions where the consensus sequence can be trustingly built.
* _SAM2MAP_ estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (cov.txt file extension). <br>
The Poisson distribution is dedicated to observed (near-)zero read coverage distribution (called the coverage tail distribution into output files *.cov.txt). It is determined by the probability mass function (PMF) <b>P</b><sub><em>&lambda;</em></sub>(<em>x</em>) = <em>&lambda;</em><sup><em>x</em></sup> <em>e</em><sup>-<em>&lambda;</em></sup> &Gamma;(<em>x</em>+1)<sup>-1</sup>, where &Gamma; is the [gamma function](https://en.wikipedia.org/wiki/Gamma_function). <br>
The (main) NB distribution is used to determine the min/max coverage depths (as ruled by option `-p`) to assess reference regions where the consensus sequence can be trustingly built. The NB(<em>p</em>,<em>r</em>) distribution is determined by the PMF <b>P</b><sub><em>p</em>,<em>r</em></sub>(<em>x</em>) = &Gamma;(<em>r</em>+<em>x</em>) &Gamma;(<em>x</em>+1)<sup>-1</sup> &Gamma;(<em>r</em>)<sup>-1</sup> <em>p</em><sup><em>x</em></sup> (1-<em>p</em>)<sup><em>r</em></sup>. However, when the observed read coverage distribution is not overdispersed (i.e. the NB parameter <em>r</em> tends to infinity), the theoretical NB distribution is replaced by the Generalized Poisson (GP) one. The GP(<em>&lambda;'</em>,<em>&rho;</em>) distribution is here determined by the PMF <b>P</b><sub><em>&lambda;'</em>,<em>&rho;</em></sub>(<em>x</em>) = <em>&lambda;'</em> (<em>&lambda;'</em>+<em>&rho;x</em>)<sup><em>x</em>-1</sup> <em>e</em><sup>-<em>&lambda;'</em>-<em>&rho;x</em></sup> &Gamma;(<em>x</em>+1)<sup>-1</sup>, where <em>&rho;</em> < 0; when <em>&rho;</em> = 0, GP(<em>&lambda;'</em>,0) reduces to a Poisson distribution of parameter <em>&lambda;'</em> (for more details, see e.g. Consul and Shoukri 1985). <br>
From the above formalizations, the Poisson+NB theoretical distribution is therefore determined by the PMF <em>w</em> <b>P</b><sub><em>&lambda;</em></sub>(<em>x</em>) + (1-<em>w</em>) <b>P</b><sub><em>p</em>,<em>r</em></sub>(<em>x</em>). The values of the different parameters <em>w</em>, <em>&lambda;</em>, <em>p</em> and <em>r</em> are written into output files *.cov.txt. Of note, such statistical results can also be used jointly with a genome coverage profile analysis (e.g. Lindner et al. 2013).
* For each position of the specified reference, _SAM2MAP_ summarizes the corresponding aligned bases in a tab-delimited MAP file. A MAP is defined by the following fields: reference position and base, no. A, C, G, T and gaps, no. reverse read, map code, variant.
......@@ -355,21 +366,24 @@ The example below shows how the different _SAM2MSA_ programs can be used to buil
#### Sample FASTQ files
The following command lines allow downloading from the [ENA](https://www.ebi.ac.uk/ena) ftp repository the 38 pairs of FASTQ files (Illumina MiSeq) associated to the 38 sequenced isolate genomes:
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
do
echo -n "$err ...";
wget -q ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_1.fastq.gz &
wget -q ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_2.fastq.gz ;
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_1.fastq.gz &
wget -q ftp://ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_2.fastq.gz ;
wait ;
echo " [ok]";
done
```
Of note, faster downloading times can be observed using [axel](https://github.com/axel-download-accelerator/axel) instead of [wget](https://www.gnu.org/software/wget/): on computers with [axel](https://github.com/axel-download-accelerator/axel) installed, replace the two occurrences of `wget` by e.g. `axel -a -n 10`.
#### Reference FASTA and GFF3 files
The following command lines allow downloading from the [NCBI](https://www.ncbi.nlm.nih.gov/) ftp repository the FASTA and GFF3 files associated to the reference genome of [_B. stabilis_ CH16](https://www.ncbi.nlm.nih.gov/genome/45559?genome_assembly_id=413612) (3 chromosomes and 1 plasmid; see the corresponding [GenBank repository](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/240/005/GCA_900240005.1)):
```bash
BASEURL="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/240/005/GCA_900240005.1_BStabCH16";
wget -q -O- $BASEURL/GCA_900240005.1_BStabCH16_genomic.fna.gz | zcat > CH16.fasta ;
......@@ -379,12 +393,14 @@ wget -q -O- $BASEURL/GCA_900240005.1_BStabCH16_genomic.gff.gz | zcat > CH16.gff
#### _SAM2MAP_ on read alignments
The following command lines allow read mapping to be carried out against _CH16.fasta_ using [minimap2](https://github.com/lh3/minimap2) ([Li 2018](https://doi.org/10.1093/bioinformatics/bty19)) on 6 threads; the SAM-formatted read alignments are directly read via a pipe (`|`) by _SAM2MAP_ (default options) to obtain the MAP files and the FASTA-formatted consensus sequences:
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
do
minimap2 -ax sr -t 6 CH6.fasta $err""_1.fastq.gz $err""_2.fastq.gz 2>/dev/null | SAM2MAP -i - -r CH6.fasta -o $err -v ;
minimap2 -ax sr -t 6 CH16.fasta $err""_1.fastq.gz $err""_2.fastq.gz 2>/dev/null | SAM2MAP -i - -r CH16.fasta -o $err -v ;
done
```
Faster running times can be obtained by using [minimap2](https://github.com/lh3/minimap2) on more threads (option `-t`).
The directory _example/_ contains the three output files written by _SAM2MAP_ for the sample ERR1806508:
......@@ -396,9 +412,11 @@ The directory _example/_ contains the three output files written by _SAM2MAP_ fo
#### _FASTA2MSA_ on consensus sequences
The following command line builds a MSA using _FASTA2MSA_ on the 38 generated consensus sequences (the input file _infile.txt_ is available in _example/_):
```bash
FASTA2MSA -i infile.txt -o msa -g CH16.gff -v
```
The directory _example/_ contains the two output files written by _FASTA2MSA_:
* _msa.fasta.mfc_: a multiple sequence alignment of 8,371,282 aligned nucleotide characters in FASTA format, compressed using [MFCompress](http://bioinformatics.ua.pt/software/mfcompress/) ([Pinho and Pratas 2014](http://dx.doi.org/10.1093/bioinformatics/btt594)),
* _msa.var.tsv_: a tab-delimited file summarizing the 489 variable characters inside _msa.fasta_; coding characters are identified from the annotation file _CH16.gff_ specified using option `-g`.
......@@ -407,35 +425,40 @@ The directory _example/_ contains the two output files written by _FASTA2MSA_:
#### Phylogenetic inference
A phylogenetic tree was inferred using [IQ-TREE 2](http://www.iqtree.org/) ([Minh et al. 2020](https://academic.oup.com/mbe/article/37/5/1530/5721363)) on 12 threads with evolutionary model HKY+F:
```bash
iqtree2 -s msa.fasta -T 12 -m HKY+F
```
Branch lengths of the ML tree (_msa.fasta.treefile_ in _example/_) were rescaled using the total number of aligned characters (i.e. _s_ = 8,371,282) to estimate the number of SNPs on each branch with the following _awk_ one-liner:
```bash
awk -v s=8371282 '{t=t$0}
END{while((c=substr(t,++x,1))!=";")
if(c==":"){y=++x; while((c=substr(t,++y,1))!=","&&c!=")"){}
t=substr(t,1,x-1)sprintf("%.0f",s*substr(t,x,y-x))substr(t,y)} print t}' msa.fasta.treefile > msa.nwk
```
This simple approach leads to accurate estimates, as the _p_-distance _p_ (i.e. number of observed mismatches per character) is always similar to the evolutionary distance _d_ (i.e. number of substitution events per character) when _d_ < 0.1; therefore _sp_ (i.e. the number of SNP) can be accurately approximated by _sd_ as performed by the above _awk_ one-liner.
The final phylogenetic tree (_msa.nwk_ in _example/_) is represented below (to be compared with [Figure 2](https://wwwnc.cdc.gov/eid/article/25/6/17-2119-f2) in [Seth-Smith et al. 2019](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article)).
![CH16 SNP tree](example/msa.nwk.png "CH16 SNP tree")
<!--- <iframe width="800" height="900" src="https://phylogeny.io/share/0ad13a8140437565f9593fba1b04b36028dc882b"></iframe> --->
<!--- https://phylogeny.io/edit/0ad13a8140437565f9593fba1b04b36028dc882b/dcc6740bddd18a1f0f93c031ad38beaf5c17b4e6 --->
## References
Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty19).
Consul PC, Shoukri MM (1985) _The generalized poisson distribution when the sample mean is larger than the sample variance_. **Communications in Statistics - Simulation and Computation**, 14(3):667-681. [doi:10.1080/03610918508812463](https://doi.org/10.1080/03610918508812463).
Li H (2018) _Minimap2: pairwise alignment for nucleotide sequences_. **Bioinformatics**, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty19).
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) _The Sequence alignment/map (SAM) format and SAMtools_. **Bioinformatics**, 25(16):2078-2079. [doi:10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352).
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25(16):2078-2079. [doi:10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352).
Lindner MS, Kollock M, Zickmann F, Renard BY (2013) _Analyzing genome coverage profiles with applications to quality control in metagenomics_. **Bioinformatics**, 29(10):1260-1267. [doi:10.1093/bioinformatics/btt147](https://doi.org/10.1093/bioinformatics/btt147).
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37(5):1530-1534. [doi:10.1093/molbev/msaa015](https://doi.org/10.1093/molbev/msaa015).
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) _IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era_. **Molecular Biology and Evolution**, 37(5):1530-1534. [doi:10.1093/molbev/msaa015](https://doi.org/10.1093/molbev/msaa015).
Pinho AJ and Pratas D (2014) MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117-118. [doi:10.1093/bioinformatics/btt594](http://dx.doi.org/10.1093/bioinformatics/btt594)
Pinho AJ and Pratas D (2014) _MFCompress: a compression tool for FASTA and multi-FASTA data_. **Bioinformatics**, 30(1):117-118. [doi:10.1093/bioinformatics/btt594](http://dx.doi.org/10.1093/bioinformatics/btt594)
Seth-Smith HMB, Casanova C, Sommerstein R, Meinel DM, Abdelbary MMH, Blanc DS, Droz S, Führer U, Lienhard R, Lang C, Dubuis O, Schlegel M, Widmer A, Keller PM, Marschall J, Egli A (2019) Phenotypic and Genomic Analyses of Burkholderia stabilis Clinical Contamination, Switzerland. Emerging Infectious Diseases, 25(6):1084-1092. [doi:10.3201/eid2506.172119](https://doi.org/10.3201/eid2506.172119).
Seth-Smith HMB, Casanova C, Sommerstein R, Meinel DM, Abdelbary MMH, Blanc DS, Droz S, Führer U, Lienhard R, Lang C, Dubuis O, Schlegel M, Widmer A, Keller PM, Marschall J, Egli A (2019) _Phenotypic and Genomic Analyses of Burkholderia stabilis Clinical Contamination, Switzerland_. **Emerging Infectious Diseases**, 25(6):1084-1092. [doi:10.3201/eid2506.172119](https://doi.org/10.3201/eid2506.172119).
......@@ -3,7 +3,7 @@
FASTA2MSA: combining consensus sequences into a multiple sequence alignment
Copyright (C) 2020 Institut Pasteur
Copyright (C) 2020-2021 Institut Pasteur
This program is part of the package SAM2MSA.
......@@ -22,8 +22,7 @@
Alexis Criscuolo alexis.criscuolo@pasteur.fr
Genome Informatics & Phylogenetics (GIPhy) giphy.pasteur.fr
Bioinformatics and Biostatistics Hub research.pasteur.fr/team/hub-giphy
USR 3756 IP CNRS research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Dpt. Biologie Computationnelle research.pasteur.fr/department/computational-biology
Dpt. Biologie Computationnelle research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Institut Pasteur, Paris, FRANCE research.pasteur.fr
########################################################################################################
......
......@@ -3,7 +3,7 @@
MAP2FASTA: inferring a consensus sequence from a MAP file
Copyright (C) 2020 Institut Pasteur
Copyright (C) 2020-2021 Institut Pasteur
This program is part of the package SAM2MSA.
......@@ -22,8 +22,7 @@
Alexis Criscuolo alexis.criscuolo@pasteur.fr
Genome Informatics & Phylogenetics (GIPhy) giphy.pasteur.fr
Bioinformatics and Biostatistics Hub research.pasteur.fr/team/hub-giphy
USR 3756 IP CNRS research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Dpt. Biologie Computationnelle research.pasteur.fr/department/computational-biology
Dpt. Biologie Computationnelle research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Institut Pasteur, Paris, FRANCE research.pasteur.fr
########################################################################################################
......@@ -35,7 +34,7 @@ import java.util.*;
public class MAP2FASTA {
//### constants ################################################################
final static String VERSION = "0.2.200712ac";
final static String VERSION = "0.3.210528c";
final static String NOTHING = "N.o./.T.h.I.n.G";
final static String STDIN = "-";
final static int MIN_COV = 10; // default minimum coverage
......@@ -356,8 +355,8 @@ public class MAP2FASTA {
if ( minC < 0 ) minC = (x < MIN_COV) ? MIN_COV : x;
y = cdfNBinv(1-pvC, cNBzP[2], cNBzP[3]);
if ( maxC < 0 ) maxC = y;
if ( verb ) {
System.err.println("mean coverage " + ((int)(cNBzP[2]*cNBzP[3]/(1-cNBzP[2]))) + "x");
if ( verb ) { // ? Poisson : ? GP : NB
System.err.println("mean coverage " + ((int)(Double.isInfinite(cNBzP[3]) ? cNBzP[2] : (cNBzP[3] < 0) ? cNBzP[2]/(1-cNBzP[3]) : cNBzP[2]*cNBzP[3]/(1-cNBzP[2]))) + "x");
System.err.println("coverage " + String.format(Locale.US, "%.2f", 100.0*(1.0-2.0*pvC)) + "% CI [" + x + " , " + y + "]");
System.err.println("coverage lower bound (U) " + minC + "x");
System.err.println("coverage upper bound (O) " + maxC + "x");
......@@ -762,9 +761,14 @@ public class MAP2FASTA {
x = l; while ( --x >= 0 ) { fx = dist[(k=x)]; tmp = r-1; while ( --k >= 0 ) { up += fx/(++tmp); dn -= fx/(tmp*tmp); } }
dg = up/dn;
}
//r = ( Math.abs(r-dg) > rmax ) ? (1 - P_MIN) * sumxy / (P_MIN * sumy) : r; //## NOTE: when r too large, r = avg*(1-P_MIN)/P_MIN
if ( Math.abs(r-dg) > rmax ) { //## NOTE: switching to Poisson
pr[0] = sumxy / sumy; pr[1] = Double.POSITIVE_INFINITY; //System.out.println(" NB lambda " + pr[0]);
if ( Math.abs(r-dg) > rmax ) { //## NOTE: switching to (Generalized) Poisson
avg = sumxy / sumy;
var = 0; x = l; while ( --x >= 0 ) var += dist[x]*square(avg-x);
var /= sumy;
pr[0] = Math.sqrt(avg*avg*avg/var); pr[1] = 1 - Math.sqrt(avg/var); //System.out.println(" GP lambda theta " + pr[0] + " " + pr[1]);
if ( pr[1] >= 0 ) { //## NOTE: not underdispersed, switching to Poisson
pr[0] = avg; pr[1] = Double.POSITIVE_INFINITY; //System.out.println(" NB lambda " + pr[0]);
}
return pr;
}
p = sumxy / (sumxy + sumy*r); //System.out.println(" NB p " + p);
......@@ -781,16 +785,19 @@ public class MAP2FASTA {
}
//##### estimates the NB(p,r) PMF, i.e. P(X=x) with X~NB(p,r)
//##### estimates the Poisson(p) PMF when r is infinite
//##### r >= 0 : estimates the NB(p,r) PMF, i.e. P(X=x) with X~NB(p,r)
//##### r = +inf : estimates the Poisson(p) PMF
//##### r < 0 : estimates the GP(p,r) PMF
final static double pmfNB(final int x, final double p, final double r) {
return ( Double.isInfinite(r) ) ? pmfPoisson(x, p) : ( x == 0 ) ? Math.pow(1-p, r) : Math.exp(gammln(r+x) - gammln(x+1) - gammln(r) + x*Math.log(p) + r*Math.log(1-p));
return ( Double.isInfinite(r) ) ? pmfPoisson(x, p) : ( r < 0 ) ? pmfGP(x, p, r)
: ( x == 0 ) ? Math.pow(1-p, r) : Math.exp(gammln(r+x) - gammln(x+1) - gammln(r) + x*Math.log(p) + r*Math.log(1-p));
}
//##### estimates the NB(p,r) CDF
//##### estimates the Poisson(p) CDF when r is infinite
//##### r >= 0 : estimates the NB(p,r) CDF
//##### r = +inf : estimates the Poisson(p) CDF
//##### r < 0 : estimates the GP(p,r) CDF
final static double cdfNB(final int x, final double p, final double r) {
return ( Double.isInfinite(r) ) ? cdfPoisson(x, p) : 1 - betai(x+1, r, p);
return ( Double.isInfinite(r) ) ? cdfPoisson(x, p) : ( r < 0 ) ? cdfGP(x, p, r) : 1 - betai(x+1, r, p);
}
//##### estimates the inverse NB(p,r) CDF, i.e. the largest x st. CDF(x) < pvalue
......@@ -809,6 +816,21 @@ public class MAP2FASTA {
return gammq(x+1, lambda);
}
//##### estimates the GP(lambda, theta) PMF, i.e. P(X=x) with X~GP(lambda, theta)
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain a underdispersed counting distribution
final static double pmfGP(final int x, final double lambda, final double theta) {
return ( x == 0 ) ? Math.exp(-lambda) : Math.exp(Math.log(lambda) + (x-1)*Math.log(lambda + theta*x) - gammln(x+1) - lambda - theta*x);
}
//##### estimates the GP(lambda, theta) CDF
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain a underdispersed counting distribution
final static double cdfGP(final int x, final double lambda, final double theta) {
double cdf = 0; int i = -1; while ( ++i <= x ) cdf += pmfGP(i, lambda, theta);
return cdf;
}
//##### squares the specified double value
final static double square(final double x) {
return x*x;
......@@ -951,6 +973,7 @@ public class MAP2FASTA {
double l = cnbzp[0], wp = cnbzp[1], p = cnbzp[2], r = cnbzp[3], wnb = cnbzp[4];
out.write(String.format(Locale.US, "# Poisson(l) coverage tail distribution: l=%.8f w=%.8f", l, wp)); out.newLine();
if ( Double.isInfinite(r) ) out.write(String.format(Locale.US, "* Poisson(l') coverage distribution: l'=%.8f 1-w=%.8f", p, wnb));
else if ( r < 0 ) out.write(String.format(Locale.US, "* GP(l',r) coverage distribution: l'=%.8f r=%.8f 1-w=%.8f", p, r, wnb));
else out.write(String.format(Locale.US, "* NB(p,r) coverage distribution: p=%.8f r=%.8f 1-w=%.8f", p, r, wnb));
out.newLine();
int sumy = 0;
......
......@@ -3,7 +3,7 @@
SAM2MAP: inferring a consensus sequence from SAM-formatted read alignments
Copyright (C) 2020 Institut Pasteur
Copyright (C) 2020-2021 Institut Pasteur
This program is part of the package SAM2MSA.
......@@ -22,8 +22,7 @@
Alexis Criscuolo alexis.criscuolo@pasteur.fr
Genome Informatics & Phylogenetics (GIPhy) giphy.pasteur.fr
Bioinformatics and Biostatistics Hub research.pasteur.fr/team/hub-giphy
USR 3756 IP CNRS research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Dpt. Biologie Computationnelle research.pasteur.fr/department/computational-biology
Dpt. Biologie Computationnelle research.pasteur.fr/team/bioinformatics-and-biostatistics-hub
Institut Pasteur, Paris, FRANCE research.pasteur.fr
########################################################################################################
......@@ -35,7 +34,7 @@ import java.util.*;
public class SAM2MAP {
//### constants ################################################################
final static String VERSION = "0.2.200712c";
final static String VERSION = "0.3.210528c";
final static String NOTHING = "N.o./.T.h.I.n.G";
final static String STDIN = "-";
final static int MIN_COV = 10; // default minimum coverage
......@@ -442,8 +441,8 @@ public class SAM2MAP {
if ( minC < 0 ) minC = (x < MIN_COV) ? MIN_COV : x;
y = cdfNBinv(1-pvC, cNBzP[2], cNBzP[3]);
if ( maxC < 0 ) maxC = y;
if ( verb ) {
System.err.println("mean coverage " + ((int)(Double.isInfinite(cNBzP[3]) ? cNBzP[2] : cNBzP[2]*cNBzP[3]/(1-cNBzP[2]))) + "x");
if ( verb ) { // ? Poisson : ? GP : NB
System.err.println("mean coverage " + ((int)(Double.isInfinite(cNBzP[3]) ? cNBzP[2] : (cNBzP[3] < 0) ? cNBzP[2]/(1-cNBzP[3]) : cNBzP[2]*cNBzP[3]/(1-cNBzP[2]))) + "x");
System.err.println("coverage " + String.format(Locale.US, "%.2f", 100.0*(1.0-2.0*pvC)) + "% CI [" + x + " , " + y + "]");
System.err.println("coverage lower bound (U) " + minC + "x");
System.err.println("coverage upper bound (O) " + maxC + "x");
......@@ -851,9 +850,14 @@ public class SAM2MAP {
x = l; while ( --x >= 0 ) { fx = dist[(k=x)]; tmp = r-1; while ( --k >= 0 ) { up += fx/(++tmp); dn -= fx/(tmp*tmp); } }
dg = up/dn;
}
//r = ( Math.abs(r-dg) > rmax ) ? (1 - P_MIN) * sumxy / (P_MIN * sumy) : r; //## NOTE: when r too large, r = avg*(1-P_MIN)/P_MIN
if ( Math.abs(r-dg) > rmax ) { //## NOTE: switching to Poisson
pr[0] = sumxy / sumy; pr[1] = Double.POSITIVE_INFINITY; //System.out.println(" NB lambda " + pr[0]);
if ( Math.abs(r-dg) > rmax ) { //## NOTE: switching to (Generalized) Poisson
avg = sumxy / sumy;
var = 0; x = l; while ( --x >= 0 ) var += dist[x]*square(avg-x);
var /= sumy;
pr[0] = Math.sqrt(avg*avg*avg/var); pr[1] = 1 - Math.sqrt(avg/var); //System.out.println(" GP lambda theta " + pr[0] + " " + pr[1]);
if ( pr[1] >= 0 ) { //## NOTE: not underdispersed, switching to Poisson
pr[0] = avg; pr[1] = Double.POSITIVE_INFINITY; //System.out.println(" NB lambda " + pr[0]);
}
return pr;
}
p = sumxy / (sumxy + sumy*r); //System.out.println(" NB p " + p);
......@@ -869,16 +873,19 @@ public class SAM2MAP {
return pr;
}
//##### estimates the NB(p,r) PMF, i.e. P(X=x) with X~NB(p,r)
//##### estimates the Poisson(p) PMF when r is infinite
//##### r >= 0 : estimates the NB(p,r) PMF, i.e. P(X=x) with X~NB(p,r)
//##### r = +inf : estimates the Poisson(p) PMF
//##### r < 0 : estimates the GP(p,r) PMF
final static double pmfNB(final int x, final double p, final double r) {
return ( Double.isInfinite(r) ) ? pmfPoisson(x, p) : ( x == 0 ) ? Math.pow(1-p, r) : Math.exp(gammln(r+x) - gammln(x+1) - gammln(r) + x*Math.log(p) + r*Math.log(1-p));
return ( Double.isInfinite(r) ) ? pmfPoisson(x, p) : ( r < 0 ) ? pmfGP(x, p, r)
: ( x == 0 ) ? Math.pow(1-p, r) : Math.exp(gammln(r+x) - gammln(x+1) - gammln(r) + x*Math.log(p) + r*Math.log(1-p));
}
//##### estimates the NB(p,r) CDF
//##### estimates the Poisson(p) CDF when r is infinite
//##### r >= 0 : estimates the NB(p,r) CDF
//##### r = +inf : estimates the Poisson(p) CDF
//##### r < 0 : estimates the GP(p,r) CDF
final static double cdfNB(final int x, final double p, final double r) {
return ( Double.isInfinite(r) ) ? cdfPoisson(x, p) : 1 - betai(x+1, r, p);
return ( Double.isInfinite(r) ) ? cdfPoisson(x, p) : ( r < 0 ) ? cdfGP(x, p, r) : 1 - betai(x+1, r, p);
}
//##### estimates the inverse NB(p,r) CDF, i.e. the largest x st. CDF(x) < pvalue
......@@ -897,6 +904,21 @@ public class SAM2MAP {
return gammq(x+1, lambda);
}
//##### estimates the GP(lambda, theta) PMF, i.e. P(X=x) with X~GP(lambda, theta)
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain a underdispersed counting distribution
final static double pmfGP(final int x, final double lambda, final double theta) {
return ( x == 0 ) ? Math.exp(-lambda) : Math.exp(Math.log(lambda) + (x-1)*Math.log(lambda + theta*x) - gammln(x+1) - lambda - theta*x);
}
//##### estimates the GP(lambda, theta) CDF
//##### GP = Generalized Poisson, e.g. Consul (1989) Generalized Poisson Distributions: Properties and Applications. Marcel Dekker Inc., New York/Basel
//##### NOTE: here, theta < 0 to obtain a underdispersed counting distribution
final static double cdfGP(final int x, final double lambda, final double theta) {
double cdf = 0; int i = -1; while ( ++i <= x ) cdf += pmfGP(i, lambda, theta);
return cdf;
}
//##### squares the specified double value
final static double square(final double x) {
return x*x;
......@@ -1040,6 +1062,7 @@ public class SAM2MAP {
double l = cnbzp[0], wp = cnbzp[1], p = cnbzp[2], r = cnbzp[3], wnb = cnbzp[4];
out.write(String.format(Locale.US, "# Poisson(l) coverage tail distribution: l=%.8f w=%.8f", l, wp)); out.newLine();
if ( Double.isInfinite(r) ) out.write(String.format(Locale.US, "* Poisson(l') coverage distribution: l'=%.8f 1-w=%.8f", p, wnb));
else if ( r < 0 ) out.write(String.format(Locale.US, "* GP(l',r) coverage distribution: l'=%.8f r=%.8f 1-w=%.8f", p, r, wnb));
else out.write(String.format(Locale.US, "* NB(p,r) coverage distribution: p=%.8f r=%.8f 1-w=%.8f", p, r, wnb));
out.newLine();
int sumy = 0;
......
SAM2MSA package v0.3.3.1
> updated documentation
+ SAM2MAP v0.3
> adding GP(l',r), to be used instead of NB(p,r) when dealing with
underdispersed observed coverage distribution
+ MAP2FASTA v0.3
> adding GP(l',r), to be used instead of NB(p,r) when dealing with
underdispersed observed coverage distribution
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment