README.md 27 KB
Newer Older
Alexis  CRISCUOLO's avatar
Alexis CRISCUOLO committed
1 2
# SAM2MSA

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
3 4 5 6 7 8 9 10 11 12 13 14
> [**Compilation and execution**](#compilation-and-execution)
>
> [**_SAM2MAP_ documentation**](#sam2map)
>
> [**_MAP2FASTA_ documentation**](#map2fasta)
>
> [**_FASTA2MSA_ documentation**](#fasta2msa)
>
> [**Usage example**](#example)



Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
15
The package _SAM2MSA_ is a set of command line programs written in [Java](https://docs.oracle.com/javase/8/docs/technotes/guides/language/index.html) to build multiple sequence alignments (MSA) from read alignments in [SAM format](https://samtools.github.io/hts-specs/SAMv1.pdf).
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
16
_SAM2MSA_ tools are specifically designed to ease the quick building of MSA that are well-suited for phylogenetic inference.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
17 18 19 20 21 22

_SAM2MSA_ is composed of three tools:
* _SAM2MAP_ to build a MAP file (i.e. a tab-delimited read alignment summary) from SAM-formatted read alignments,
* _MAP2FASTA_ to convert a MAP file into a FASTA-formatted consensus sequence file,
* _FASTA2MSA_ to pool different FASTA-formatted consensus sequence files into a unique MSA.

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
23 24 25
No need to sort or compress read alignments: _SAM2MSA_ only needs standard SAM-formatted data to quickly infer consensus sequences, and next MSA that can be directly used for phylogenetic analyses.


Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
26 27 28 29 30 31 32 33 34



## Compilation and execution

The source code of _SAM2MSA_ is inside the _src_ directory and can be compiled and executed in two different ways. 

#### Building executable jar files

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
35
On computers with [Oracle JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html) (8 or higher) installed, Java executable jar files can be created. 
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
36 37
In a command-line window, go to the _src_ directory and type:
```bash
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
38 39 40 41 42 43 44
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
  javac $p.java ;
  echo Main-Class: $p > MANIFEST.MF ;
  jar -cmvf MANIFEST.MF $p.jar $p.class ;
  rm MANIFEST.MF $p.class ;
done
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
45 46 47 48 49 50 51 52 53 54 55 56 57
```
This will create the three executable jar files `SAM2MAP.jar`, `MAP2FASTA.jar` and `FASTA2MSA.jar` that can be run with the following command line models:
```bash
java -jar SAM2MAP.jar   [options]
java -jar MAP2FASTA.jar [options]
java -jar FASTA2MSA.jar [options]
```

#### Building a native executable

On computers with [GraalVM](hhttps://www.graalvm.org/downloads/) installed, native executables can also be built. 
In a command-line window, go to the _src_ directory, and type:
```bash
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
58 59 60 61 62 63
for p in SAM2MAP MAP2FASTA FASTA2MSA
do
  javac $p.java ;
  native-image $p $p ;
  rm $p.class ;
done
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
```
This will create the three native executable `SAM2MAP`, `MAP2FASTA` and `FASTA2MSA` that can be launched with the following command line models:
```bash
./SAM2MAP   [options]
./MAP2FASTA [options]
./FASTA2MSA [options]
```

## Usage

### _SAM2MAP_

Run _SAM2MAP_ without option to read the following documentation:

```
 SAM2MAP

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
81 82 83 84 85
 SAM2MAP infers a consensus sequence from SAM-formatted read alignments against a
 reference.  The inferred  consensus sequence  has always  the same  size as  the
 reference one.  At each position,  the inferred character state is the majority-
 rule one within the aligned bases (option -f).  For each position,  the inferred
 character state is associated with one of the following map code:
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
86 87 88 89 90 91 92
   U  under-covered position (options -p or -c)
   u  position neighboring map code 'U'
   O  over-covered position (options -p or -C)
   o  position neighboring map code 'O'
   S  strand-biased position (option -s)
   X  position within SNP-rich or SNP-poor regions (options -x and -w)
   M  unbiased position
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
93 94 95 96 97
 Inferred  character states  associated  with the  map code  'U' are  always '?'.
 Inferred character  states that  differ from  the reference ones  and associated
 with the map codes  'u', 'O', 'o' or 'S'  can be replaced by 'X'  or not (option
 -m). Inferred character states associated  with the map code 'X' are replaced by
 'x' (options -x and -w).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
 The main output file is  a map file that summarizes  the read alignments against
 the reference sequence. The inferred sequence is also written in FASTA format.

 USAGE: SAM2MAP [-i SAMFILE] [-r REFFILE] [-o BASENAME] ...

 GENERAL OPTIONS:

   -i FILE      read alignment in SAM format; set "-" to read from standard input
                (mandatory)
   -r FILE      reference sequence(s)  in FASTA format;  should contain  at least
                one sequence used for the read alignment (mandatory)
   -o BASENAME  basename for output files (mandatory)
   -n STRING    name of the inferred sequence;  when set,  a unique sequence will
                be written in a FASTA file with the specified name in the header
   -v           verbose mode

 READ ALIGNMENT:

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
116
   -q INTEGER   minimum allowed  Phred score;  sequenced bases  associated with a
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
117 118
                Phred  score  smaller  than  the   specified  threshold  are  not
                considered (default: 20)
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
119 120
   -Q INTEGER   minimum allowed  mapping Phred  score;  aligned reads  associated
                with a Phred score  smaller than the specified  threshold are not
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
121 122 123 124 125 126 127 128 129 130 131
                considered (default: 20)

 READ COVERAGE:

   -p NUMBER    p-value  to  determine the  coverage  depth  confidence interval;
                after  estimating  a  theoretical   coverage  depth  distribution
                (Poisson  + NB)  from the  read alignments,  the lower  and upper
                coverage bounds  are determined  by the  CDF NB values p and 1-p,
                respectively (default: 0.005)
   -c INTEGER   coverage depth  lower bound;  if the  number of  aligned reads is
                smaller  than  this  threshold,  the  corresponding  position  is
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
132 133
                considered as  under-covered and associated with the map code 'U'
                (default: estimated from the data via option -p, but at least 10)
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
134 135
   -C INTEGER   coverage depth  upper bound;  if the  number of  aligned reads is
                larger  than  this  threshold,   the  corresponding  position  is
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
136
                considered as over-covered  and associated with the  map code 'O'
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
137 138
                (default: estimated from the data via option -p)
   -s INTEGER   minimum number of  reads for each strand  to trust a position; if
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
139 140
                a position  does not verify  this condition,  it is considered as
                strand-biased and associated with the map code 'S' (default: 5)
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170

 CHARACTER STATE INFERENCE:

   -f NUMBER    minimum proportion  to infer  the majority-rule  character state;
                at each position,  the set of  most occuring aligned  nucleotides
                is built  up to  this threshold  and the  corresponding character
                state is inferred from the one(s) in this set (default: 0.8)
   -m STRING    allowing mismatch in biased positions; when an inferred character
                state differs  from the  reference one  and is  associated to map
                codes  'u',  'O',  'o' or  'S', it is replaced by 'X' by default;
                this can  be relaxed  by specifying  the  map  code(s) for  which
                mismatches are allowed (default: "M")

 SNP DENSITY:

   -x NUMBER    threshold to determine  whether a position  does not belong  to a
                clonal SNP  density region;  after estimating  a theoretical  SNP
                density distribution  (NB) using  a sliding  window  (option -w),
                each position is associated  to an index that  is close to 0 when
                it belongs to  a region containing  significantly too few  or too
                many SNP;  every  position  with index  lower than  the specified
                threshold (e.g. 0.01) is associated to map codes 'X' and replaced
                by 'x' (default: 0)
   -w INTEGER   window size to assess clonal SNP density (default: 1000)
```

#### Notes on _SAM2MAP_

* As most read mapping programs (e.g. [BWA](http://bio-bwa.sourceforge.net/), [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml), [minimap2](https://github.com/lh3/minimap2)) output the read alignments in SAM format, _SAM2MAP_ can be directly used together with any of them without conversion, compression or sorting step.

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
171 172
* Option `-i` can be used to specify an input file in SAM format. To read SAM-formatted alignments from `stdin`, use option `-i -`. Therefore, _SAM2MAP_ can be used together with [_samtools_](http://www.htslib.org/) ([Li et al. 2009](https://doi.org/10.1093/bioinformatics/btp352)) to read BAM-formatted files, e.g  
`samtools view -h infile.bam | SAM2MAP -i - -r ref.fa -o outfile`
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
173

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
174
* By default, all sequenced bases with Phread score < 20 are not considered (option `-q`), therefore minimizing the impact of sequencing errors when computing the consensus sequence. By default, all read alignment with Phred score < 20 (as assessed by the read mapping program) are also not considered (option `-Q`), therefore discarding from the consensus sequence every region with low mappability (e.g. low complexity or repeated regions).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
175

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
176
* _SAM2MAP_ estimates a Poisson+Negative Binomial (NB) theoretical distribution from the observed read coverage distribution, and writes the results into an output file (cov.txt file extension). The Poisson distribution is dedicated to observed (near-)zero read coverage distribution. The NB distribution is used to determine the min/max coverage depths to assess reference regions where the consensus sequence can be trustingly built.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
177

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
178
* For each position of the specified reference, _SAM2MAP_ summarizes the corresponding aligned bases in a tab-delimited MAP file. A MAP is defined by the following fields: reference position and base, no. A, C, G, T and gaps, no. reverse read, map code, variant. 
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
179

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
180 181 182
* For most positions (e.g. associated with the map code 'M' in the assessed min/max read coverage regions), the inferred consensus character state is the majority one(s) observed in at least 80% of the aligned bases (option `-f`). Of note, when more than one character state is required to determine at least 80% of the aligned ones, the corresponding degenerated nucleotide is inferred in the consensus sequence. This can be useful to build consensus sequences corresponding to quasi-species. The table below lists the inferred base in the consensus sequence for each majority character state set:

<div align="center">
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203

| majority set | inferred base |   | majority set | inferred base |
|:------------:|:-------------:|:-:|:------------:|:-------------:|
| `{A}`        | `A`           |   | `{A-}`       | `a`           |
| `{C}`        | `C`           |   | `{C-}`       | `c`           |
| `{G}`        | `G`           |   | `{G-}`       | `g`           |
| `{T}`        | `T`           |   | `{T-}`       | `t`           |
| `{AC}`       | `M`           |   | `{AC-}`      | `m`           |
| `{AG}`       | `R`           |   | `{AG-}`      | `r`           |
| `{AT}`       | `W`           |   | `{AT-}`      | `w`           |
| `{CG}`       | `S`           |   | `{CG-}`      | `s`           |
| `{CT}`       | `Y`           |   | `{CT-}`      | `y`           |
| `{GT}`       | `K`           |   | `{GT-}`      | `k`           |
| `{ACG}`      | `V`           |   | `{ACG-}`     | `v`           |
| `{ACT}`      | `H`           |   | `{ACT-}`     | `h`           |
| `{AGT}`      | `D`           |   | `{AGT-}`     | `d`           |
| `{CGT}`      | `B`           |   | `{CGT-}`     | `b`           |
| `{ACGT}`     | `N`           |   | `{ACGT-}`    | `n`           |

</div>

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
204

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
205
* As the main aim of _SAM2MSA_ is to quickly build MSA from read alignments, it should be stressed that _SAM2MAP_ never consider insertions relative to the reference, therefore guaranteeing that the consensus sequence will have the same length as the reference, which is expected to quickly build an MSA by pooling all consensus sequences.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
206 207 208

* Basically, running _SAM2MAP_ with default options is sufficient to build a MAP file and its associated FASTA-formatted consensus sequence in many cases. However, if the consensus sequence is found to be incorrect, alternative ones can be quickly built by using _MAP2FASTA_ directly on the MAP file.

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
209 210 211
* The reference file used by _SAM2MAP_ (option `-r`) is not expected to be the same as the one used to align reads, but it should contain at least one common sequence (identified by its FASTA header). This can be useful to infer a specific consensus sequence (e.g. a given chromosome) from the read alignments against a complete genome (e.g. multiple chromosomes/replicons).

* For phylogenetic purposes, _SAM2MAP_ is also able to search for regions containing significantly too few or too many mismatches (SNP; option `-x`). First, a NB distribution of SNP density is estimated using a sliding window (option `-w`). Next, a _p_-value is estimated for each position, assessing that a given position belongs to a region that is SNP-poor or SNP-rich (i.e. _p_-value close to 0) in comparison with the number of SNP expected in a standard genome evolution context. This option can be useful to filter out regions involved in non-horizontal evolutionary events (e.g. gene transfer, homologous recombination).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227



### _MAP2FASTA_

Run _MAP2FASTA_ without option to read the following documentation:

```
 MAP2FASTA

 MAP2FASTA infers a consensus sequence from a MAP file.

 USAGE: MAP2FASTA [-i MAPFILE] ...

 GENERAL OPTIONS:

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
228
   -i FILE      tab-delimited  MAP file;  set "-"  to read  from  standard  input
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
229 230
                (mandatory)
   -n STRING    name of the inferred sequence;  when set,  a unique sequence will
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
231
                be written in a FASTA file with  the specified name in the header
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
232 233 234 235 236 237 238 239 240 241 242
   -v           verbose mode

 READ COVERAGE:

   -p NUMBER    p-value  to  determine the  coverage  depth  confidence interval;
                after  estimating  a  theoretical   coverage  depth  distribution
                (Poisson  + NB)  from the  read alignments,  the lower  and upper
                coverage bounds  are determined  by the  CDF NB values p and 1-p,
                respectively (default: 0.005)
   -c INTEGER   coverage depth  lower bound;  if the  number of  aligned reads is
                smaller  than  this  threshold,  the  corresponding  position  is
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
243 244
                considered as  under-covered and associated with the map code 'U'
                (default: estimated from the data via option -p, but at least 10)
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
245 246
   -C INTEGER   coverage depth  upper bound;  if the  number of  aligned reads is
                larger  than  this  threshold,   the  corresponding  position  is
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
247
                considered as over-covered  and associated with the  map code 'O'
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
248 249
                (default: estimated from the data via option -p)
   -s INTEGER   minimum number of  reads for each strand  to trust a position; if
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
250 251
                a position  does not verify  this condition,  it is considered as
                strand-biased and associated with the map code 'S' (default: 5)
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279

 CHARACTER STATE INFERENCE:

   -f NUMBER    minimum proportion  to infer  the majority-rule  character state;
                at each position,  the set of  most occuring aligned  nucleotides
                is built  up to  this threshold  and the  corresponding character
                state is inferred from the one(s) in this set (default: 0.8)
   -m STRING    allowing mismatch in biased positions; when an inferred character
                state differs  from the  reference one  and is  associated to map
                codes  'u',  'O',  'o' or  'S', it is replaced by 'X' by default;
                this can  be relaxed  by specifying  the  map  code(s) for  which
                mismatches are allowed (default: "M")

 SNP DENSITY:

   -x NUMBER    threshold to determine  whether a position  does not belong  to a
                clonal SNP  density region;  after estimating  a theoretical  SNP
                density distribution  (NB) using  a sliding  window  (option -w),
                each position is associated  to an index that  is close to 0 when
                it belongs to  a region containing  significantly too few  or too
                many SNP;  every  position  with index  lower than  the specified
                threshold (e.g. 0.01) is associated to map codes 'X' and replaced
                by 'x' (default: 0)
   -w INTEGER   window size to assess clonal SNP density (default: 1000)
```

#### Notes on _MAP2FASTA_

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
280
* The program _MAP2FASTA_ has broadly the same options as _SAM2MAP_. When the consensus sequence built using _SAM2MAP_ is found to be incorrect (e.g. not enough mismatches in comparison with the reference, unsatisfactory read coverage confidence interval), alternative ones can be built using _MAP2FASTA_ directly on the MAP file generated by _SAM2MAP_. As a MAP file (e.g. one line per reference position) is smaller than a SAM file (e.g. one line per sequenced read), _MAP2FASTA_ runs faster than _SAM2MAP_ for the same aim.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314

* Of note, output files written by _MAP2FASTA_ are derived from the input file name.


### _FASTA2MSA_

Run _FASTA2MSA_ without option to read the following documentation:

```
 FASTA2MSA

 FASTA2MSA combines  the different  consensus sequences  estimated by  SAM2MAP or
 MAP2FASTA into a multiple sequence alignment (MSA) in FASTA format.

 USAGE: FASTA2MSA [-i INFILE] [-o BASENAME] ...

 GENERAL OPTIONS:

   -i FILE      text file containing one line per taxon, each line containing the
                taxon  name  followed  by  the  corresponding  FASTA  file  name,
                separated by blank space(s) or tabulation(s) (mandatory)
   -o BASENAME  basename for output files (mandatory)
   -v           verbose mode

 CHARACTER FILTERING:

   -u STRING    the character state(s) to be considered as unknown ones (default:
                "xXnN-?")
   -p NUMBER    the maximum  allowed frequency  of unknown  character states  per
                aligned character (default: 0.5)
   -f STRING    the  character  state(s)  to be  considered  as  forbidden;  each
                character containing at least one  forbidden character state will
                be discarded (default: "")
   -V           when set, only variable characters will be selected (default: not
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
315
                set)
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337
   -s NUMBER    discarding all characters before the specified position (default:
                not set)
   -e NUMBER    discarding all characters  after the specified position (default:
                not set)

 CODING REGIONS:

   -g FILE      GFF3 file containing  annotations of the  reference sequence used
                for creating  the input  FASTA files  (same contig  order as  the
                FASTA reference sequence file)
   -1           only codon positions 1 will be selected  (requires option -g; can
                be combined with options -2 and -3)
   -2           only codon positions 2 will be selected  (requires option -g; can
                be combined with options -1 and -3)
   -3           only codon positions 3 will be selected  (requires option -g; can
                be combined with options -1 and -2)
   -G NUMBER    genetic code  identifier according  to the translation  tables at
                www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi (default: 1)
```

#### Notes on _FASTA2MSA_

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
338
* _FASTA2MAP_ pools the consensus sequences generated by _SAM2MAP_ and/or _MAP2FASTA_ into a multiple sequence alignment (MSA). It is therefore expected that the input FASTA files (option `-i`) contain sequences of identical lengths and with comparable FASTA headers.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
339

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
340
* Regions containing too many unknown character states (e.g. '?', 'N', 'X', gaps) can be filtered out (option `-p`; default 50%). 
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
341

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
342
* Variable characters in the MSA are summarized in a tab-delimited file (var.tsv file extension). A GFF3-formatted annotation file associated with the reference can be used (option `-g`) to determine variable characters that are synonymous/non-synonymous within the MSA (different genetic codes available using option `-G`). To obtain the list of all variable characters, use option `-p 1`.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
343

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
344
* A MSA corresponding to a specific region of the reference can be obtained using option(s) `-s` and/or `-e`.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
345

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
346
* The FASTA-formatted MSA file generated by _FASTA2MSA_ can be directly used to perform a phylogenetic analysis of the sequenced samples. Of note, recombination-purged MSA can be obtained by filtering out from the different consensus sequences the regions that contain significantly too many or not enough mismatches when compared to the reference sequence. This can be obtained when using _SAM2MAP_ and/or _MAP2FASTA_ (options `-x` and `-w`).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
347 348 349 350 351



## Example

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
352 353
The directory _example/_ contains different files associated with the study of [Seth-Smith et al. (2019)](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article) describing an outbreak of _Burkholderia stabilis_ identified among hospitalized patients across several cantons in Switzerland during 2015-2016.
The example below shows how the different _SAM2MSA_ programs can be used to build a MSA from 38 isolates sequenced in the context of the [Seth-Smith et al. (2019)](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article) study.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
354 355 356 357 358 359 360 361 362 363 364 365 366 367

#### Sample FASTQ files

The following command lines allow downloading from the [ENA](https://www.ebi.ac.uk/ena) ftp repository the 38 pairs of FASTQ files (Illumina MiSeq) associated to the 38 sequenced isolate genomes:
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
do
  echo -n "$err ..."; 
  wget -q ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_1.fastq.gz & 
  wget -q ftp.sra.ebi.ac.uk/vol1/fastq/${err:0:6}/00${err:9:1}/$err/$err""_2.fastq.gz ; 
  wait ; 
  echo " [ok]";
done
```
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
368
Of note, faster downloading times can be observed using [axel](https://github.com/axel-download-accelerator/axel) instead of [wget](https://www.gnu.org/software/wget/): on computers with [axel](https://github.com/axel-download-accelerator/axel) installed, replace the two occurrences of `wget` by e.g. `axel -a -n 10`.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
369 370 371 372 373 374 375 376 377 378 379 380

#### Reference FASTA and GFF3 files

The following command lines allow downloading from the [NCBI](https://www.ncbi.nlm.nih.gov/) ftp repository the FASTA and GFF3 files associated to the reference genome of [_B. stabilis_ CH16](https://www.ncbi.nlm.nih.gov/genome/45559?genome_assembly_id=413612) (3 chromosomes and 1 plasmid; see the corresponding [GenBank repository](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/240/005/GCA_900240005.1)):
```bash
BASEURL="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/240/005/GCA_900240005.1_BStabCH16";
wget -q -O- $BASEURL/GCA_900240005.1_BStabCH16_genomic.fna.gz | zcat > CH16.fasta ;
wget -q -O- $BASEURL/GCA_900240005.1_BStabCH16_genomic.gff.gz | zcat > CH16.gff ;
```

#### _SAM2MAP_ on read alignments

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
381
The following command lines allow read mapping to be carried out against _CH16.fasta_ using [minimap2](https://github.com/lh3/minimap2) ([Li 2018](https://doi.org/10.1093/bioinformatics/bty19)) on 6 threads; the SAM-formatted read alignments are directly read via a pipe (`|`) by _SAM2MAP_ (default options) to obtain the MAP files and the FASTA-formatted consensus sequences:
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
382 383 384 385 386 387
```bash
for err in ERR180650{8..9} ERR18065{10..34} ERR1810578 ERR18105{89..98}
do
  minimap2 -ax sr -t 6 CH6.fasta $err""_1.fastq.gz $err""_2.fastq.gz 2>/dev/null | SAM2MAP -i - -r CH6.fasta -o $err -v ;
done
```
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
388
Faster running times can be obtained by using [minimap2](https://github.com/lh3/minimap2) on more threads (option `-t`).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
389 390 391

The directory _example/_ contains the three output files written by _SAM2MAP_ for the sample ERR1806508:
* _ERR1806508.cov.txt_: the observed and theoretical mapping coverage distributions,
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
392
* _ERR1806508.map.xz_: the xzipped MAP file summarizing the overall mapping for each reference position,
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
393 394 395 396 397
* _ERR1806508.fasta_: the consensus sequences in FASTA format.


#### _FASTA2MSA_ on consensus sequences

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
398
The following command line builds a MSA using _FASTA2MSA_ on the 38 generated consensus sequences (the input file _infile.txt_ is available in _example/_):
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
399 400 401
```bash
FASTA2MSA -i infile.txt -o msa -g CH16.gff -v
```
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
402
The directory _example/_ contains the two output files written by _FASTA2MSA_:
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
403 404
* _msa.fasta.mfc_: a multiple sequence alignment of 8,371,282 aligned nucleotide characters in FASTA format, compressed using [MFCompress](http://bioinformatics.ua.pt/software/mfcompress/) ([Pinho and Pratas 2014](http://dx.doi.org/10.1093/bioinformatics/btt594)),
* _msa.var.tsv_: a tab-delimited file summarizing the 489 variable characters inside _msa.fasta_; coding characters are identified from the annotation file _CH16.gff_ specified using option `-g`.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
405 406 407 408


#### Phylogenetic inference

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
409
A phylogenetic tree was inferred using [IQ-TREE 2](http://www.iqtree.org/) ([Minh et al. 2020](https://academic.oup.com/mbe/article/37/5/1530/5721363)) on 12 threads with evolutionary model HKY+F:
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
410 411 412
```bash
iqtree2 -s msa.fasta -T 12 -m HKY+F
```
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
413
Branch lengths of the ML tree (_msa.fasta.treefile_ in _example/_) were rescaled using the total number of aligned characters (i.e. _s_ = 8,371,282) to estimate the number of SNPs on each branch with the following _awk_ one-liner:
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
414
```bash
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
415 416 417 418
awk -v s=8371282 '{t=t$0} 
                  END{while((c=substr(t,++x,1))!=";")
                       if(c==":"){y=++x; while((c=substr(t,++y,1))!=","&&c!=")"){}
                                  t=substr(t,1,x-1)sprintf("%.0f",s*substr(t,x,y-x))substr(t,y)} print t}' msa.fasta.treefile > msa.nwk
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
419
```
Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
420
This simple approach leads to accurate estimates, as the _p_-distance _p_ (i.e. number of observed mismatches per character) is always similar to the evolutionary distance _d_ (i.e. number of substitution events per character) when _d_ < 0.1; therefore _sp_ (i.e. the number of SNP) can be accurately approximated by _sd_ as performed by the above _awk_ one-liner.
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
421

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
422
The final phylogenetic tree (_msa.nwk_ in _example/_) is represented below (to be compared with [Figure 2](https://wwwnc.cdc.gov/eid/article/25/6/17-2119-f2) in [Seth-Smith et al. 2019](https://wwwnc.cdc.gov/eid/article/25/6/17-2119_article)).
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
423

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
424 425
![CH16 SNP tree](example/msa.nwk.png "CH16 SNP tree")
<!--- <iframe width="800" height="900" src="https://phylogeny.io/share/0ad13a8140437565f9593fba1b04b36028dc882b"></iframe> --->
Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
426 427
<!--- https://phylogeny.io/edit/0ad13a8140437565f9593fba1b04b36028dc882b/dcc6740bddd18a1f0f93c031ad38beaf5c17b4e6  --->

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
428

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
429 430 431 432
## References

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. [doi:10.1093/bioinformatics/bty191](https://doi.org/10.1093/bioinformatics/bty19).

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
433 434
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25(16):2078-2079. [doi:10.1093/bioinformatics/btp352](https://doi.org/10.1093/bioinformatics/btp352).

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
435 436
Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R (2020) IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37(5):1530-1534. [doi:10.1093/molbev/msaa015](https://doi.org/10.1093/molbev/msaa015).

Alexis  CRISCUOLO's avatar
0.1.1.1  
Alexis CRISCUOLO committed
437 438
Pinho AJ and Pratas D (2014) MFCompress: a compression tool for FASTA and multi-FASTA data. Bioinformatics, 30(1):117-118. [doi:10.1093/bioinformatics/btt594](http://dx.doi.org/10.1093/bioinformatics/btt594)

Alexis  CRISCUOLO's avatar
0.01  
Alexis CRISCUOLO committed
439 440 441
Seth-Smith HMB, Casanova C, Sommerstein R, Meinel DM, Abdelbary MMH, Blanc DS, Droz S, Führer U, Lienhard R, Lang C, Dubuis O, Schlegel M, Widmer A, Keller PM, Marschall J, Egli A (2019) Phenotypic and Genomic Analyses of Burkholderia stabilis Clinical Contamination, Switzerland. Emerging Infectious Diseases, 25(6):1084-1092. [doi:10.3201/eid2506.172119](https://doi.org/10.3201/eid2506.172119).