Commit 560c4ff3 authored by Veronique Legrand

Julian's updates

parent 81b3d164
@@ -7,118 +7,145 @@ PhageTerm.py - run as command line in a shell
VERSION
=======
Version 4.1.0 (first python3 version)
Version 4.0.0
Compatible with python 3.7
INTRODUCTION
============
PhageTerm software is a tool to determine phage termini and packaging mode
from high throughput sequences that rely on the random fragmentation of DNA (e.g.
Illumina TruSeq). Phage sequencing reads from a fastq file are aligned to the phage
reference genome in order to calculate two types of coverage values (whole genome coverage
and the starting position coverage). The starting position coverage is used to perform a
detailed termini analysis. If the user provides the host sequence, reads that does not
match the phage genome are tested on the host using the same mapping function.
PhageTermVirome software is a tool to determine phage termini and packaging mode
from high throughput sequences that rely on the random fragmentation of DNA (e.g. Covaris fragmentation)
and conservation of natural DNA ends (e.g. library preparation using Illumina TruSeq).
Phage sequencing reads from a fastq file are aligned to the assembled phage genome in order to
calculate two types of coverage values (whole genome coverage and the Starting Position Coverage (SPC)).
The starting position coverage is used to perform a detailed termini and packaging mode analysis.
If the user suspects that the phage has a Mu-like type of packaging, they can additionally provide the host (bacterial)
genome sequence. This analysis takes the reads that do not match the phage genome and aligns them to the bacterial
genome using the same mapping function. The analysis to identify Mu-like phages is available only when providing a
single phage genome (not possible if the user provides a multi-fasta file with multiple assembled phage contigs).
- Multi-fasta reference e.g. VIROME
- Host : Bexare, if you are using multiple sequences, the host analysis is not possible. Separate your references if you want to do this analysis.
The previous PhageTerm program (single phage analysis) and information are still available at https://sourceforge.net/projects/phageterm/ (for versions <3.0.0)
The PhageTerm program and information is available at https://sourceforge.net/projects/phageterm/ for versions <3.0.0
and at : https://gitlab.pasteur.fr/vlegrand/ptv for versions higher.
A Galaxy wrapper version is also available at https://galaxy.pasteur.fr (for versions <3.0.0)
A Galaxy wrapper version is also available at https://galaxy.pasteur.fr (only for the first version PhageTerm, PhageTermVirome is not implemented on Galaxy yet).
Since version 3.0.0, PhageTerm can work in 2 modes:
- mono machine mode (parallelization on several cores on tne same machine).
- multi machine mode (parallelization on several machines, using intermediate files for data exchange).
- the usual mono machine mode (parallelization on several cores on the same machine).
- a new multi machine mode (advanced users) with parallelization on several machines, using intermediate files for data exchange.
The default mode is mono machine.
Version 3.0.0 up to version 4.0 work with python 2.7
Since version 4.0, PhageTerm works with python 3.7 only
Since version 4.1, pvalue and pvalue adj for peaks are printed in the workflow.txt report (case of virome analysis),
Since version 4.0, PhageTerm (now PhageTermVirome) works with python 3.7
PREREQUISITES
=============
For version 3.0 up to version 4.0 (not included)
Unix/Linux
For version 4.0
- Python 2.7.X
- matplotlib 1.3.1
- numpy 1.9.2
- pandas 0.19.1
- sklearn 0.18.1
- scipy 0.18.1
- statsmodels 0.6.1
- reportlab 3.3.0
Unix/Linux
A conda virtualenv containing python2.7 and all dependencies is provided for convenience so that users
don't need to install anything else than miniconda or conda.
- backports
- backports.functools_lru_cache
- backports_abc
- cycler
- libwebp-base
- lz4-c
- matplotlib-base
- matplotlib
- numpy
- openssl
- pandas
- patsy
- pillow
- pip
- pyparsing
- python=3.7
- python-dateutil
- python_abi
- pytz
- readline
- reportlab
- scikit-learn
- scipy
- setuptools
- singledispatch
- statsmodels
- tk
- tornado
A conda virtualenv containing python3.7 and all dependencies is provided for convenience so that users
don't need to install anything else than miniconda or conda. (See below)
For version 4.0 and higher
Unix/Linux
INSTALLING PHAGETERMVIROME USING THE CONDA VIRTUALENV (easiest option)
======================================================================
- Python 3.7
- matplotlib
- numpy
- pandas
- sklearn
- scipy
- statsmodels
- reportlab
First install miniconda (you don't even need to have python 2.7 or python 3.7 installed on your machine for that since
miniconda contains it): https://docs.conda.io/en/latest/miniconda.html
A conda virtualenv containing python3.7 and all dependencies is provided for convenience so that users
don't need to install anything else than miniconda or conda.
Then, create the conda environment using the PhageTerm_env_3.yml file for version >=4.0 (python3)
conda env create -f PhageTerm_env_3.yml
Then activate the environment so you can launch PhageTermVirome:
conda activate PhageTerm_env_py3
USING THE CONDA VIRTUALENV
==========================
First install miniconda (you don't even need to have python 2.7 or python 3.7 installed on your machine for that;
miniconda contains it): https://docs.conda.io/en/latest/miniconda.html
NOTE:
Then, create the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2):
You can still use the old PhageTerm under python 2.7 (but no multi-fasta analysis is possible) using the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2):
conda env create -f PhageTerm_env.yml
or from the PhageTerm_env_3.yml file for version >=4.0 (python3)
conda env create -f PhageTerm_env_3.yml
conda activate PhageTerm_env
Acctivate it to be able to work:
conda activate PhageTerm_env
or
conda activate PhageTerm_env_py3
COMMAND LINE
============
Basic usage with mandatory options (PhageTermVirome needs at least one read file, but user can provide a second corresponding paired-end read file if available, using the -p option).
./PhageTerm.py -f reads.fastq -r phage_sequence.fasta
[-n phage_name -p reads_paired -s seed_lenght -d surrounding -t installation_test -c nbr_core -g host.fasta -l limit_multi-fasta -v virome_time]
[--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
[--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
[--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
(warning increase process time)]
./PhageTerm.py -f reads.fastq -r phage_sequence(s).fasta
Help:
./PhageTerm.py -h
./PhageTerm.py --help
Software run test:
-t TEST_VALUE, --test=TEST_VALUE
TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda)
TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97)
TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7)
TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5)
TEST_VALUE=H : Test run for a Headful packaging (e.g. P1)
TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu)
Non-mandatory options
[-p reads_paired -c nbr_core_threads -n analysis_name -s seed_lenght -d surrounding -t installation_test -g host.fasta -l contig_size_limit_multi-fasta -v virome_run_time_estimation]
Additional advanced options (only for multi-machine users)
[--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
[--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta] [--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
Mandatory Options:
Detailed options:
Raw reads file in fastq format:
-f INPUT_FILE, --fastq=INPUT_FILE
@@ -126,21 +153,21 @@ COMMAND LINE
(NGS sequences from random fragmentation DNA only,
e.g. Illumina TruSeq)
Phage genome in fasta format:
Phage genome(s) in fasta format:
-r INPUT_FILE, --ref=INPUT_FILE
Reference phage genome as unique contig in fasta format
Reference phage genome(s) as unique contig in fasta format
Other options common to both modes:
Raw reads file in fastq format:
Raw reads file in fastq format:
-p INPUT_FILE, --paired=INPUT_FILE
Paired fastq reads
(NGS sequences from random fragmentation DNA only,
e.g. Illumina TruSeq)
Name of the phage being analyzed by the user:
Analysis_name to write on output reports:
-n PHAGE_NAME, --phagename=PHAGE_NAME
Manually enter the name of the phage being analyzed.
Used as prefix for output files.
@@ -150,7 +177,7 @@ COMMAND LINE
Manually enter the lenght of the seed used for reads
in the mapping process (Default: 20).
Lenght of the seed used for reads in the mapping process:
Number of nucleotides around the main peak to consider for merging adjacent significant peaks (set to 1 to discover secondary terminus but sites).
-d SUROUNDING_LENGHT, --surrounding=SUROUNDING_LENGHT
Manually enter the lenght of the surrounding used to
merge close peaks in the analysis process (Default: 20).
@@ -165,11 +192,11 @@ COMMAND LINE
Phage mean coverage to use (Default: 250).
Define phage mean coverage:
-l LIMIT_MFASTA, --limit=LIMIT_MFASTA
-l LIMIT_FASTA, --limit=LIMIT_FASTA
Minimum phage fasta length (Default: 500).
Options for mono machine (default) mode
Options for mono machine (default) mode only
Software run test:
-t TEST_VALUE, --test=TEST_VALUE
@@ -186,14 +213,13 @@ COMMAND LINE
Options for multi machine mode
Options for multi machine mode only
Indicate that PageTerm should run on several machines:
Indicate that PhageTerm should run on several machines:
--mm
Options for step 1 (calculating reads coverage) on several machines
Options for step 1 of multi-machine mode (calculating reads coverage) on several machines
Directory for coverage results:
--dir_cov_mm=DIR_PATH/DIR_NAME
@@ -220,7 +246,7 @@ COMMAND LINE
Options for step 2 (calculating per sequence statistics from reads coverage results) on several machines
Options for step 2 of multi-machine mode (calculating per sequence statistics from reads coverage results) on several machines
Directory for coverage results:
--dir_cov_mm=DIR_PATH/DIR_NAME
@@ -250,7 +276,7 @@ COMMAND LINE
Must be the same value as given via -c at step 1 (CORE_NBR).
Options for step 3 (final report generation)
Options for step 3 of multi-machine mode (final report generation)
Directory for DR results
--DR_path=DIR_PATH/DIR_NAME
@@ -275,7 +301,7 @@ OUTPUT FILES
(ii) Statistical table (.csv)
(iii) Sequence files (.fasta)
(iii) File containing sequences re-organized to start at the predicted termini (.fasta)
CONTACT