- ``-o``: put all result and temporary files in this directory
In your ``Examples/3-pangenome`` folder, you should have your pangenome in a file called ``PanGenome-GENO3.All.prt-clust-0.8-mode1_<date>.tsv.lst``. It contains one line per family: the first column is the family number, and the other columns are the family members. You also have the qualitative (``.quali.txt``) and quantitative (``.quanti.txt``) matrices of this pangenome, as well as a summary file (``.summary.txt``). For more information on these files, see the :ref:`output format description<panfile>`.
In your ``Examples/3-pangenome`` folder, you should have your pangenome in a file called ``PanGenome-GENO3.All.prt-clust-0.8-mode1.lst``. It contains one line per family: the first column is the family number, and the other columns are the family members. You also have the qualitative (``.quali.txt``) and quantitative (``.quanti.txt``) matrices of this pangenome, as well as a summary file (``.summary.txt``). For more information on these files, see the :ref:`output format description<panfile>`.
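As an illustration of the layout described above (one line per family, family number first, members after), here is a minimal Python sketch that loads such a file into a dictionary. It assumes the columns are separated by whitespace, and the member names in the example are hypothetical; check your own file before relying on it.

```python
def parse_pangenome(lines):
    """Map each family number (first column) to the list of its members."""
    families = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip empty lines
        families[fields[0]] = fields[1:]
    return families


# Hypothetical content, only to show the expected shape of the data.
example = [
    "1 GENO.0001.00001 GENO.0002.00004",
    "2 GENO.0003.00010",
]
families = parse_pangenome(example)
print(len(families), "families")
```

To read a real file, pass an open file handle instead of the example list, for instance ``parse_pangenome(open(path))``.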
In your ``Examples/2-res-prokka/Proteins`` folder, you should have a new file, ``GENO3.All.prt``, containing all proteins of the 3 genomes.
...
...
@@ -190,29 +190,28 @@ If you used the same dataset and parameters as in this file, you should get a pa
If you did not run the annotate step, but just want to test the pangenome step, use::
The core genome is inferred from the pangenome, so the only required file is your pangenome, obtained in the previous step. By default, the ``corepers`` module generates a core genome. Here is the command line to obtain the core genome of our dataset::
You now have your core genome (we used the default parameter) in ``Examples/4-corepers/PersGenome_<pangenome-filename>_1.lst``, where ``_1`` means that each family must be present in 100% of the genomes.
If you used the same dataset and parameters as in this file, you should get a core genome with **6 families**.
If you want a persistent genome, specify the required :ref:`options<docorepers>` (minimum percentage of genomes in a family to be considered as persistent, allowing or not multi/mixed families...). For example, for a strict persistent genome at 95%::
The output file will be ``Examples/4-corepers/PersGenome_<pangenome-filename>_0.95.lst``, and will contain the same 6 families: with only 3 genomes, a family must be present in all of them to reach 95%.
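The arithmetic behind this can be sketched as follows. The function name ``min_genomes`` is hypothetical, and rounding up to the next whole genome is an assumption made for illustration, not a statement about how the tool computes its threshold.

```python
import math


def min_genomes(tol, nb_genomes):
    """Smallest number of genomes that reaches the fraction `tol`.

    Assumption: the threshold is rounded up to a whole number of genomes.
    """
    return math.ceil(tol * nb_genomes)


# With 3 genomes, 2 out of 3 is only about 67%, so 95% requires all 3.
print(min_genomes(0.95, 3))
print(min_genomes(1, 3))
```

With so few genomes, any threshold above 2/3 is equivalent to asking for a core genome; the difference only shows up on larger datasets.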
If you did not run the annotate and pangenome steps, but only want to test the corepers module, run::
If you have already calculated the genome metrics (genome name, size, L90, number of contigs), you can give them directly as input instead of the list of genome files. This text file must have at least 4 columns (others are ignored), with the following headers (in any order): 'to_annotate', 'gsize', 'nb_conts', 'L90'.
This file can be the :ref:`output file<step2>` of ``prepare``.
This file can be the :ref:`output file<step2>` of ``prepare``, or even the output of this ``annotate`` step, if you want to re-run it with other parameters.
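To check that a metrics file has the four required headers before using it, something like the sketch below may help. It assumes a tab-separated file, and the function name ``read_info`` and the example values are hypothetical.

```python
import csv
import io

REQUIRED = ("to_annotate", "gsize", "nb_conts", "L90")


def read_info(handle):
    """Read the four required columns; any other column is ignored."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumption: tab-separated
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        rows.append({
            "to_annotate": row["to_annotate"],
            "gsize": int(row["gsize"]),
            "nb_conts": int(row["nb_conts"]),
            "L90": int(row["L90"]),
        })
    return rows


# Hypothetical file content: headers in arbitrary order, plus one extra column.
example = io.StringIO(
    "L90\textra\tto_annotate\tnb_conts\tgsize\n"
    "3\tfoo\tgenomeA.fasta\t5\t120000\n"
)
rows = read_info(example)
print(rows[0]["to_annotate"], rows[0]["gsize"])
```

Because the columns are looked up by header name, their order in the file does not matter, which matches the "in any order" rule above.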
.. _seq:
...
...
@@ -515,7 +515,7 @@ This will create a folder ``<res_path>``, with the following files inside:
- ``QC_L90-<list_file>.png``: histogram of the L90 values of all genomes
- ``QC_nb-contigs-<list_file>.png``: histogram of number of contigs in all genomes
- ``discarded-<list_file>.lst``: list of genomes that would be discarded if you keep the default limits (L90 :math:`\leq` 100 and #contigs :math:`\leq` 999).
- ``info-genomes-<list_file>.lst``: file with information on each genome: size, number of contigs and L90.
- ``ALL-GENOMES-info-<list_file>.lst``: file with information on each genome: size, number of contigs and L90.
- ``tmp_files`` folder: containing your genomic sequences, split at each stretch of at least 5 ``N``.
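The quality-control quantities listed above can be illustrated with a short Python sketch: split a sequence at each stretch of at least 5 ``N``, count the resulting contigs, compute L90 (the smallest number of contigs covering at least 90% of the total length), and compare with the default limits. All function names are hypothetical, and treating lowercase ``n`` like ``N`` is an assumption; this is not the tool's own implementation.

```python
import re


def split_on_n(seq, min_n=5):
    """Split a sequence at each stretch of at least `min_n` N characters."""
    parts = re.split(r"[Nn]{%d,}" % min_n, seq)  # assumption: 'n' counts as 'N'
    return [p for p in parts if p]  # drop empty pieces at the ends


def l90(contig_lengths):
    """Smallest number of contigs covering at least 90% of the total length."""
    total = sum(contig_lengths)
    covered = 0
    for count, length in enumerate(sorted(contig_lengths, reverse=True), start=1):
        covered += length
        if covered >= 0.9 * total:
            return count
    return 0  # no contigs at all


def passes_default_limits(nb_contigs, l90_value, max_l90=100, max_contigs=999):
    """True if the genome stays within the default limits quoted above."""
    return l90_value <= max_l90 and nb_contigs <= max_contigs


# Toy sequence: one stretch of 5 N (split) and one stretch of 4 N (kept).
seq = "ACGT" * 10 + "N" * 5 + "ACGT" * 2 + "NNNN" + "AC"
contigs = split_on_n(seq)
lengths = [len(c) for c in contigs]
print(len(contigs), l90(lengths), passes_default_limits(len(contigs), l90(lengths)))
```

In the toy sequence, the stretch of 4 ``N`` is too short to trigger a split, so it stays inside the second contig.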
.. _logf:
...
...
@@ -552,7 +552,7 @@ This command will run the same steps as described in quality control only, with
This will create a folder ``<res_path>``, with the following files inside:
- same files as quality control only, except ``info-genomes-<list_file>.lst``.
- same files as quality control only, except ``ALL-GENOMES-info-<list_file>.lst``.
- ``LSTINFO_<list_file>.lst``: information on annotated genomes, as described :ref:`here<lstinfof>`
- prokka result folders in your ``tmp_files`` directory
- The 5 folders ``LSTINFO``, ``gff3``, ``Replicons``, ``Genes`` and ``Proteins`` as described in :ref:`output file formats<outform>`.
...
...
@@ -815,8 +815,6 @@ If you want to calculate the core/persistent genome of a subset of genomes, give
Output file format
------------------
.. note:: See :ref:`below<docorepers>` for the details on output filenames.
Your persistent genome file (``PersGenome_<pangenome>_<tol>[-multi][-mixed].lst`` or specified name) has the same format as the pangenome file. The family numbers in the first column correspond to pangenome family numbers.
.. _docorepers:
...
...
@@ -1040,9 +1038,13 @@ In your ``<resdir>`` directory, you will find:
+ ``<dataset_name>-current.<fam_num>.prt`` with all proteins extracted
+ ``<dataset_name>-current.<fam_num>.miss.lst`` with the list of genomes not present in the family
+ ``<dataset_name>-complete.nucl.cat.aln`` DNA sequence concatenation of all family alignments
+ ``<dataset_name>-complete..cat.aln`` concatenation of all family alignments in aa (if option required by user)
+ ``<dataset_name>-complete.aa.cat.aln`` concatenation of all family alignments in aa (if option required by user)
- a folder ``Phylo-<dataset_name>``: contains
+ ``<dataset_name>.nucl.grp.aln``, the alignment of all families grouped by genome, as described in :ref:`output files section<outalign>`. This is the file you will need to infer a phylogenetic tree.
+ ``<dataset_name>.aa.grp.aln``, same, but protein sequences instead of DNA. Generated only if option ``-P`` is given
- a folder ``Phylo-<dataset_name>``: contains ``<dataset_name>.nucl.grp.aln``, the alignment of all families grouped by genome, as described in :ref:`output files section<outalign>`. This is the file you will need to infer a phylogenetic tree.