ASSU
ASSU (ASSembling SSU) is a command line tool written in Bash to carry out the reference-guided assembly of small subunit (SSU) 16S ribosomal ribonucleic acid (rRNA) using short high-throughput sequencing (HTS) reads derived from whole genome sequencing of bacterial or archaeal strains.
This tool was developed to compensate for the failure of several de novo assembly programs to assemble (at least one) non-fragmented SSU segment when the sequenced genome contains different 16S rRNA copies with sequence variation, especially when using short HTS reads.
Dependencies
You will need to install the required programs and tools listed in the following tables, or to verify that they are already installed with the required version.
Mandatory programs
program | package | version | sources |
---|---|---|---|
bwa-mem2 | - | ≥ 2.2.1 | github.com/bwa-mem2/bwa-mem2 |
samtools | - | ≥ 1.18 | github.com/samtools/samtools sourceforge.net/projects/samtools |
Optional programs
program | package | version | sources |
---|---|---|---|
bzip2 | - | > 1.0.0 | sourceware.org/bzip2/downloads.html |
DSRC | - | ≥ 2.0 | github.com/refresh-bio/DSRC |
pigz | - | ≥ 2.4 | github.com/madler/pigz |
Standard GNU packages and utilities
program | package | version | sources |
---|---|---|---|
echo head fold paste tail tr | coreutils | > 8.0 | ftp.gnu.org/gnu/coreutils |
gunzip zgrep | gzip | > 1.0 | ftp.gnu.org/gnu/gzip |
bc | - | > 1.0 | ftp.gnu.org/gnu/bc |
gawk | - | > 4.0.0 | ftp.gnu.org/gnu/gawk |
grep | - | > 2.0 | ftp.gnu.org/gnu/grep |
sed | - | > 4.2 | ftp.gnu.org/gnu/sed |
Installation and execution
A. Clone this repository with the following command line:
git clone https://gitlab.pasteur.fr/GIPhy/ASSU.git
B. Go to the created directory and give the execute permission to the file ASSU.sh:
cd ASSU/
chmod +x ASSU.sh
C. Check the dependencies (and their version) using the following command line:
./ASSU.sh -c
D. If at least one of the required programs (see Dependencies) is not available on your $PATH variable (or if one compiled binary has a different default name), it should be manually specified.
To specify the location of a specific binary, edit the file ASSU.sh and indicate the local path to the corresponding binary(ies) within the code block REQUIREMENTS (approximately lines 60-110).
For each required program, the table below reports the corresponding variable assignment instruction to edit (if needed) within the code block REQUIREMENTS:
program | variable assignment | program | variable assignment |
---|---|---|---|
bwa-mem2 | BWAMEM2_BIN=bwa-mem2; | gunzip | GUNZIP_BIN=gunzip; |
bzip2 | BZIP2_BIN=bzip2; | pigz | PIGZ_BIN=pigz; |
DSRC | DSRC_BIN=dsrc; | samtools | SAMTOOLS_BIN=samtools; |
gawk | GAWK_BIN=gawk; | zgrep | ZGREP_BIN=zgrep; |
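As an illustration of such an edit, the sketch below sets two of the variables listed in the table and then reports whether each value resolves to an executable. The installation path /opt/bwa-mem2-2.2.1/ and the helper name resolve_bin are hypothetical, and the check is a generic one based on the shell builtin command -v, not the test performed by ASSU.sh itself.

```bash
# Example assignments for the REQUIREMENTS block (hypothetical path below;
# replace it with the location of the binary on the local system).
BWAMEM2_BIN=/opt/bwa-mem2-2.2.1/bwa-mem2;
SAMTOOLS_BIN=samtools;

# Succeeds when its argument is an executable reachable by the shell,
# either as a program name found in $PATH or as an explicit path.
resolve_bin() {
  command -v "$1" >/dev/null 2>&1
}

# Report the status of each configured binary.
for bin in "$BWAMEM2_BIN" "$SAMTOOLS_BIN"; do
  if resolve_bin "$bin"; then
    echo "[ok]      $bin"
  else
    echo "[missing] $bin"
  fi
done
```

A value reported as missing indicates that the corresponding assignment still needs to be adapted before running ASSU.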
E. Execute ASSU with the following command line model:
./ASSU.sh [options] <infile> [<infile> ...]
F. ASSU also requires a databank of reference SSU sequences. By default, a version of this databank is provided inside the directory db/ as a file named SSUdb.gz (see details in SSUdb.version.txt). However, a more recent version can be quickly built using the provided script makeSSUdb.sh with the following command line:
./makeSSUdb.sh
After a few seconds, a new SSU databank file named SSUdb.gz will be automatically created from the NCBI RefSeq Targeted Loci Project. Note that the previous command line will overwrite the provided version of SSUdb.gz when run in the same directory.
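Because the databank file can be overwritten, it may be convenient to keep a dated copy of the current one before rebuilding it. The following is a minimal sketch, not part of ASSU: the function name backup_file and the naming scheme of the copy are assumptions.

```bash
# Copy a .gz file next to itself with the current date inserted in its name,
# e.g. db/SSUdb.gz -> db/SSUdb.2024-02-18.gz, and print the name of the copy.
backup_file() {
  local src=$1
  if [ ! -f "$src" ]; then
    echo "nothing to back up: $src" >&2
    return 1
  fi
  local dest="${src%.gz}.$(date +%Y-%m-%d).gz"
  cp -p -- "$src" "$dest" && echo "$dest"
}

# Possible usage before rebuilding the databank (left as a comment because
# it depends on the local directory layout):
#   backup_file db/SSUdb.gz && ./makeSSUdb.sh
```

The saved copy can later be passed to ASSU with option -d.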
Usage
Run ASSU without any option to read the following documentation:
USAGE: ASSU [options] <infile> [<infile> ...]
OPTIONS:
-d <file> SSU databank file (default: db/SSUdb.gz in the same directory as ASSU)
-p <string> restricts the SSU databank to the specified (extended regex) pattern
(default: none)
-o <string> output FASTA-formatted SSU sequence file name (default: ssu.fasta)
-O <string> writes the selected reads into the specified FASTQ-formatted file name
(default: none)
-l <int> minimum sequence length (default: 1000)
-L <int> minimum read length (default: AUTO)
-Q <int> minimum base Phred quality value (default: 20)
-M <int> minimum mapping Phred quality value (default: 20)
-D <int> minimum coverage depth (default: 50)
-F <float> minimum proportion of the majority base to infer that base (default: 0.8)
-A <float> minimum ratio of the alternative base(s) to the majority one to add that
base(s) to the consensus (default: 0.2)
-N set N when multiple bases at a consensus position (default: not set)
-w <dir> path to the tmp directory (default: $TMPDIR, otherwise /tmp)
-t <int> thread numbers (default: 2)
-v verbose mode
-s prints the content of the SSU databank and exit
-c checks dependencies and exit
-h prints this help and exit
EXAMPLES:
ASSU -t 24 -o 16s.fasta fwd.fastq.gz rev.fastq.gz sgl.fastq.gz
ASSU -d SSUdb.gz -O 16s.fastq -p "Devosia limi" -L 75 -v *.fastq
ASSU -p "Citrobacter|Escherichia|Shigella" -N -v hts.fastq.bz2
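When several paired-end samples have to be processed, the command line model above can be generated in a loop. The sketch below is only an illustration: the directory reads/, the file naming scheme <sample>_1.fastq.gz / <sample>_2.fastq.gz and the function name assu_command are assumptions. It uses only the documented options -t and -o, and it prints the command lines instead of running them, so that they can be inspected first.

```bash
# Print (without running it) the ASSU command line for one paired-end sample,
# given the forward read file and an optional number of threads (default: 2).
assu_command() {
  local fwd=$1 threads=${2:-2}
  local sample=${fwd%_1.fastq.gz}       # strip the forward-read suffix
  printf './ASSU.sh -t %s -o %s.ssu.fasta %s %s\n' \
    "$threads" "${sample##*/}" "$fwd" "${sample}_2.fastq.gz"
}

# One command line per forward read file found in the (hypothetical) reads/ directory.
for fwd in reads/*_1.fastq.gz; do
  [ -e "$fwd" ] || continue             # skip when the glob matches nothing
  assu_command "$fwd" 12
done
```

The printed lines can be redirected into a file and executed once they look correct.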
Notes
- In brief, ASSU first quickly aligns the specified HTS reads against all the reference sequences available in the SSU databank using bwa-mem2. This first step makes it possible to determine the most suitable reference sequence (called model), as well as the subset of HTS reads that arise from SSU genome regions. Next, every HTS read from the subset is accurately aligned against the model sequence, and the resulting alignments are processed by samtools to build a final (consensus) sequence.
- ASSU requires at least one HTS read file. Input file(s) should be in FASTQ format and can be compressed using gzip, bzip2 or DSRC (Roguski and Deorowicz 2014). Note that input files compressed using bzip2 or DSRC require the associated decompression tool to be read (see Dependencies).
- ASSU does not work with long HTS reads, as bwa-mem2 was not designed to align HTS reads against significantly shorter reference sequences. The source code of ASSU can be easily modified (on request) to deal with such a case, but long HTS reads generally lead to complete SSU segments via de novo assembly.
- By default, ASSU expects that the SSU databank file SSUdb.gz is located in the directory db/. However, an alternative SSU databank file (e.g. different version, different file name) can be specified using option -d. The content of the specified SSU databank can be summarized using option -s.
- The running time of ASSU depends strongly on the size of the input files, but faster running times can be obtained using multiple threads (option -t) and/or a temporary directory located on a high-speed drive (option -w).
- The assembled sequence is written in FASTA format into an output file (option -o; default name: ssu.fasta). Optionally, the selected HTS reads can be written in FASTQ format into a specified output file (option -O).
- The selection of the model sequence can be oriented/forced by using the option -p to set an (extended regex) pattern (e.g. accessions, genus, species). It is recommended to specify the pattern between quotation marks.
- As the assembled SSU sequence is often the consensus of several copies with sequence variation within the sequenced genome (e.g. Větrovský and Baldrian 2013), it may contain ambiguous positions resulting from the consensus of different sequenced bases at those positions. In such cases, degenerate nucleotides are used to represent the consensus of different character states (see e.g. Table 1 in Johnson 2010), or lowercase characters when a deletion (i.e. gap) is involved in the consensus. Note that every degenerate nucleotide can be replaced by the character state N using option -N.
- The (number of) ambiguous positions can be slightly modified by considering shorter HTS reads (option -L), putative sequencing errors (option -Q), weak alignments (option -M), low coverage depth (option -D) or an alternative model sequence (option -p). The consensus definition can be modified by tuning the two options -F and -A, corresponding to the options --call-fract and --het-fract of samtools consensus (mode simple), respectively.
- No output file is written in several situations:
• insufficient coverage depth (default: at least 50×; option -D),
• too short assembled SSU sequence (default: at least 1,000 bps; option -l),
• too many ambiguous positions (i.e. more than 5%).
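The proportion of ambiguous positions mentioned in the last item can also be estimated by hand on any FASTA file. The sketch below assumes that every character other than an uppercase A, C, G or T (degenerate nucleotides, N, lowercase characters) counts as ambiguous, which may differ from the exact rule implemented in ASSU.sh; the function name ambiguity_percent is hypothetical.

```bash
# Read a FASTA stream on stdin and print the percentage of sequence
# characters that are not an uppercase A, C, G or T.
ambiguity_percent() {
  grep -v '^>' | tr -d '\n\r' |
    LC_ALL=C awk '{
      total = length($0)                # sequence length before any edit
      n = gsub(/[^ACGT]/, "")           # number of non-ACGT characters
      if (total > 0) printf "%.2f\n", 100 * n / total
    }'
}

# Toy sequence of 10 characters with 3 ambiguous ones (R, Y and a).
printf '>toy\nACGTRYACGa\n' | ambiguity_percent   # prints 30.00
```

For instance, 14 ambiguous characters among 1,543 positions correspond to about 0.9%, which is well below the 5% limit.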
Example
In order to illustrate the usefulness of ASSU, the following example describes its usage for assembling the 16S rRNA (consensus) segment of Escherichia coli O113:H21 strain FWSEC0011. Its genome assembly (GCF_005171095.1) consists of one chromosome (NZ_CP031892.1) and one plasmid (NZ_CP031893.1), built from short and long HTS reads (SRS3815841).
Downloading input files
Paired-end sequencing of this genome was performed using Illumina MiSeq, and the resulting pair of (compressed) FASTQ files (225 Mb and 249 Mb, respectively) can be downloaded using the following command lines:
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR789/009/SRR7896249/SRR7896249_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR789/009/SRR7896249/SRR7896249_2.fastq.gz
Running ASSU
Use the following command line to run ASSU on these two FASTQ files using 12 threads:
./ASSU.sh -t 12 -o FWSEC0011.ssu.fasta -v SRR7896249_*.fastq.gz
Note that the SSU databank used for this assembly is the version 2024-02-18 (20,404 sequences).
As the verbose mode was set (option -v), this command line leads to the following output:
# ASSU v1.1
# Copyright (C) 2024 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/ASSU
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> SSUdb: /local/bin/ASSU/db/SSUdb.gz
> SSUdb v2024-02-18 (20404 sequences)
[00:00] checking input files ... [ok]
+ SRR7896249_1.fastq.gz
+ SRR7896249_2.fastq.gz
[00:00] creating tmp directory .... [ok]
> TMP_DIR=/tmp/ASSU.uYf5cUoa6R
[00:01] examining SSU databank ...... [ok]
> model: Bacteria | Escherichia fergusonii | NR_074902.1 | 1542 bps
[00:27] building SSU sequence .... [ok]
> 3016 selected reads (903953 bases; lgt > 269)
> coverage depth: 586x
> 1543 bps (ambiguous bases: 14)
10 20 30 40 50 60 70 80 90 100
| | | | | | | | | |
1 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGRAARCAGCTTGCTGYTTYGCTGACG
* * * *
101 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
201 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTWGTWGGTGGGGTAACGGCTCACCWAGGCGACGATCCCTAGCTGGTCTGAGA
* * *
301 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
401 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
501 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
601 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
701 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
801 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
901 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
1001 CRGAASTTTYCAGAGATGaGAWTgGTGCCTTCGGGAACYGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
* * * * * * *
1101 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
1201 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
1301 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
1401 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
1501 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
[00:30] writing output file ... [ok]
+ FASTA: FWSEC0011.ssu.fasta
[00:30] exit
The SSU sequence NR_074902.1 was selected as a model to carry out the reference-guided assembly using 3,016 HTS reads, leading to an assembled (consensus) sequence of length 1,543 bps (coverage depth: 586×) written into the FASTA file FWSEC0011.ssu.fasta. The overall running time was < 30 seconds.
The assembled SSU sequence contains 14 ambiguous bases, highlighted with a * in the above output. This suggests that the genome of E. coli O113:H21 strain FWSEC0011 contains different 16S rRNA copies with sequence variations.
In fact, its chromosome (NZ_CP031892.1) contains seven 16S rRNA segments labeled with the following locus tags:
• C8202_RS02200
• C8202_RS06240
• C8202_RS19645
• C8202_RS23325
• C8202_RS23525
• C8202_RS24135
• C8202_RS24625
Below is a multiple sequence alignment (MSA) of these seven 16S rRNA segments together with the assembled SSU sequence, showing that the 14 ambiguous bases (*) reflect, as expected, the variability between the different copies.
10 20 30 40 50 60 70 80 90 100
| | | | | | | | | |
SSU AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGRAARCAGCTTGCTGYTTYGCTGACG
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||*||*||||||||||*||*|||||||
RS02200 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAACAGCTTGCTGTTTCGCTGACG
RS06240 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAACAGCTTGCTGTTTCGCTGACG
RS19645 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTCGCTGACG
RS23325 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGAAAGCAGCTTGCTGCTTTGCTGACG
RS23525 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGAAAGCAGCTTGCTGCTTTGCTGACG
RS24135 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTCGCTGACG
RS24625 AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGCAGCTTGCTGCTTTGCTGACG
110 120 130 140 150 160 170 180 190 200
| | | | | | | | | |
SSU AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS06240 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS19645 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS23325 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS23525 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS24135 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
RS24625 AGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAG
210 220 230 240 250 260 270 280 290 300
| | | | | | | | | |
SSU GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTWGTWGGTGGGGTAACGGCTCACCWAGGCGACGATCCCTAGCTGGTCTGAGA
|||||||||||||||||||||||||||||||||||||||||||||||||*||*|||||||||||||||||||*|||||||||||||||||||||||||||
RS02200 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
RS06240 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
RS19645 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGA
RS23325 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGA
RS23525 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGA
RS24135 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGA
RS24625 GGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGA
310 320 330 340 350 360 370 380 390 400
| | | | | | | | | |
SSU GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS06240 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS19645 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS23325 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS23525 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS24135 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
RS24625 GGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGC
410 420 430 440 450 460 470 480 490 500
| | | | | | | | | |
SSU CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS06240 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS19645 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS23325 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS23525 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS24135 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAG
RS24625 CGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTACTCATTGACGTTACCCGCAGAAGAAG
510 520 530 540 550 560 570 580 590 600
| | | | | | | | | |
SSU CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS06240 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS19645 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS23325 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS23525 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS24135 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
RS24625 CACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCA
610 620 630 640 650 660 670 680 690 700
| | | | | | | | | |
SSU GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS06240 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS19645 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS23325 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS23525 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS24135 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
RS24625 GATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCG
710 720 730 740 750 760 770 780 790 800
| | | | | | | | | |
SSU TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS06240 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS19645 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS23325 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS23525 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS24135 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
RS24625 TAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGG
810 820 830 840 850 860 870 880 890 900
| | | | | | | | | |
SSU TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS06240 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS19645 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS23325 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS23525 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS24135 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
RS24625 TAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCA
910 920 930 940 950 960 970 980 990 1000
| | | | | | | | | |
SSU AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS06240 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS19645 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS23325 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS23525 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS24135 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
RS24625 AGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCA
1010 1020 1030 1040 1050 1060 1070 1080 1090 1100
| | | | | | | | | |
SSU CRGAASTTTYCAGAGATGaGAWTgGTGCCTTCGGGAACYGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
|*|||*|||*||||||||*||*|*||||||||||||||*|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CAGAACTTTCCAGAGATG-GATTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS06240 CAGAACTTTCCAGAGATG-GATTGGTGCCTTCGGGAACTGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS19645 CGGAAGTTTTCAGAGATGAGAAT-GTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS23325 CGGAAGTTTTCAGAGATGAGAAT-GTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS23525 CGGAAGTTTTCAGAGATGAGAAT-GTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS24135 CGGAAGTTTTCAGAGATGAGAAT-GTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
RS24625 CGGAAGTTTTCAGAGATGAGAAT-GTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCG
1110 1120 1130 1140 1150 1160 1170 1180 1190 1200
| | | | | | | | | |
SSU CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS06240 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS19645 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS23325 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS23525 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS24135 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
RS24625 CAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGT
1210 1220 1230 1240 1250 1260 1270 1280 1290 1300
| | | | | | | | | |
SSU CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS06240 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS19645 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS23325 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS23525 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS24135 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
RS24625 CATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTA
1310 1320 1330 1340 1350 1360 1370 1380 1390 1400
| | | | | | | | | |
SSU GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS06240 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS19645 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS23325 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS23525 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS24135 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
RS24625 GTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACAC
1410 1420 1430 1440 1450 1460 1470 1480 1490 1500
| | | | | | | | | |
SSU CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
RS02200 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS06240 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS19645 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS23325 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS23525 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS24135 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
RS24625 CGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTA
1510 1520 1530 1540
| | | |
SSU ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
|||||||||||||||||||||||||||||||||||||||||||
RS02200 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS06240 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS19645 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS23325 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS23525 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS24135 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
RS24625 ACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTTA
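The positions marked with * in the alignment above can also be listed directly from a FASTA file. The following sketch is not part of ASSU: the function name ambiguous_positions is hypothetical, the base sets follow the standard IUPAC degenerate codes, and lowercase characters are labeled according to the convention described in the Notes (a gap is involved in the consensus).

```bash
# Read a FASTA stream on stdin and print, for every character that is not an
# uppercase A, C, G or T: its 1-based position, the character, its meaning.
ambiguous_positions() {
  grep -v '^>' | tr -d '\n\r' |
    LC_ALL=C awk '
      BEGIN {
        m["R"] = "A/G"; m["Y"] = "C/T"; m["S"] = "C/G"; m["W"] = "A/T"
        m["K"] = "G/T"; m["M"] = "A/C"; m["B"] = "C/G/T"; m["D"] = "A/G/T"
        m["H"] = "A/C/T"; m["V"] = "A/C/G"; m["N"] = "any"
      }
      {
        for (i = 1; i <= length($0); i++) {
          c = substr($0, i, 1)
          if (c ~ /^[ACGT]$/) continue
          if (c in m) note = m[c]
          else if (c ~ /^[a-z]$/) note = "gap involved"
          else note = "unknown"
          print i, c, note
        }
      }'
}

# Toy example with three ambiguous characters.
printf '>toy\nACGRTTYa\n' | ambiguous_positions
```

Run as ambiguous_positions < FWSEC0011.ssu.fasta, and assuming that the file holds the sequence displayed above, the first reported positions should be 76 and 79 (both R, i.e. A/G), in agreement with the first alignment block.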
As the HTS reads arise from an E. coli genome, ASSU can also be run by using option -p to specify this species as model (don't forget to use quotation marks when specifying a multi-word pattern with option -p):
./ASSU.sh -t 12 -p "Escherichia coli" -v SRR7896249_*.fastq.gz
This command line leads to the following output:
# ASSU v1.1
# Copyright (C) 2024 Institut Pasteur
+ https://gitlab.pasteur.fr/GIPhy/ASSU
> Syst: x86_64-redhat-linux-gnu
> Bash: 4.4.20(1)-release
> SSUdb: /local/bin/ASSU/db/SSUdb.gz
> SSUdb v2024-02-18 (20404 sequences)
[00:00] checking input files ... [ok]
+ SRR7896249_1.fastq.gz
+ SRR7896249_2.fastq.gz
[00:00] creating tmp directory .... [ok]
> TMP_DIR=/tmp/ASSU.TxJar4O6ZP
[00:00] examining SSU databank ...... [ok]
> selection pattern: Escherichia coli
> model: Bacteria | Escherichia coli | NR_114042.1 | 1467 bps
[00:25] building SSU sequence .... [ok]
> 3016 selected reads (903953 bases; lgt > 269)
> coverage depth: 616x
> 1468 bps (ambiguous bases: 14)
10 20 30 40 50 60 70 80 90 100
| | | | | | | | | |
1 ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGRAARCAGCTTGCTGYTTYGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGG
* * * *
101 GAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGG
201 ATGTGCCCAGATGGGATTAGCTWGTWGGTGGGGTAACGGCTCACCWAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGA
* * *
301 CACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTT
401 GTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCG
501 CGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGG
601 GAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCG
701 AAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACT
801 TGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGG
901 GGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACRGAASTTTYCAGAGATGaGAWTgGTG
* * * * * *
1001 CCTTCGGGAACYGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTT
*
1101 GCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTAC
1201 ACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACT
1301 CCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTG
1401 CAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAG
[00:29] writing output file ... [ok]
+ FASTA: ssu.fasta
[00:29] exit
As expected, ASSU assembles a similar 16S rRNA sequence using E. coli as a model (e.g. same ambiguous positions). However, as the E. coli model sequence (NR_114042.1; 1,467 bps) from the SSU databank is shorter than the E. fergusonii one (NR_074902.1; 1,542 bps), the last assembled SSU sequence (1,468 bps) is also shorter than the previously assembled one (1,543 bps).
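To get a feeling for how an extended regex pattern narrows down the candidate models, the sketch below filters the two model descriptions reported in the outputs above with grep -E. It assumes that option -p behaves like an extended regex match on such description lines, which is an interpretation of the documentation and not a description of the code of ASSU.sh; the function name select_models is hypothetical.

```bash
# Keep the description lines (read on stdin) matching an extended regex.
select_models() {
  grep -E -- "$1"
}

# The two description lines are copied from the ASSU outputs shown above.
printf '%s\n' \
  'Bacteria | Escherichia fergusonii | NR_074902.1 | 1542 bps' \
  'Bacteria | Escherichia coli | NR_114042.1 | 1467 bps' |
  select_models 'Escherichia coli'
```

Only the E. coli line is printed here, whereas the pattern 'Escherichia' alone would keep both lines. Alternatives can be combined with |, as in the pattern "Citrobacter|Escherichia|Shigella" from the usage examples.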
References
Johnson AD (2010) An extended IUPAC nomenclature code for polymorphic nucleic acids. Bioinformatics, 26(10):1386-1389. doi:10.1093/bioinformatics/btq098.
Roguski L, Deorowicz S (2014) DSRC 2: Industry-oriented compression of FASTQ files. Bioinformatics, 30(15):2213-2215. doi:10.1093/bioinformatics/btu208.
Větrovský T, Baldrian P (2013) The Variability of the 16S rRNA Gene in Bacterial Genomes and Its Consequences for Bacterial Community Analyses. PLoS One, 8(2):e57923. doi:10.1371/journal.pone.0057923.
Citations
Kämpfer P, Glaeser SP, McInroy JA, Busse H-J, Clermont D, Criscuolo A (2024) Description of Cohnella rhizoplanae sp. nov., isolated from the root surface of soybean (Glycine max). Antonie van Leeuwenhoek, 118:41. doi:10.1007/s10482-024-02051-y