diff --git a/README.txt b/README.txt
index 0faafb6c157bfbc6aee589b9427ada52afa15bab..e9ed3d9360d093e4f61c21516fc52ca1c327af16 100644
--- a/README.txt
+++ b/README.txt
@@ -7,118 +7,145 @@ PhageTerm.py - run as command line in a shell
 VERSION
 =======
-Version 4.1.0 (first python3 version)
+Version 4.0.0
+Compatible with python 3.7
 INTRODUCTION
 ============
-PhageTerm software is a tool to determine phage termini and packaging mode
-from high throughput sequences that rely on the random fragmentation of DNA (e.g.
-Illumina TruSeq). Phage sequencing reads from a fastq file are aligned to the phage
-reference genome in order to calculate two types of coverage values (whole genome coverage
-and the starting position coverage). The starting position coverage is used to perform a
-detailed termini analysis. If the user provides the host sequence, reads that does not
-match the phage genome are tested on the host using the same mapping function.
+PhageTermVirome software is a tool to determine phage termini and packaging mode
+from high throughput sequences that rely on the random fragmentation of DNA (e.g. Covaris fragmentation)
+and conservation of natural DNA ends (e.g. library preparation using Illumina TruSeq).
+Phage sequencing reads from a fastq file are aligned to the assembled phage genome in order to
+calculate two types of coverage values (whole genome coverage and the Starting Position Coverage (SPC)).
+The starting position coverage is used to perform a detailed termini and packaging mode analysis.
+If the user suspects the phage to have a Mu-like type of packaging, they can additionally provide the host (bacterial)
+genome sequence. This analysis takes the reads that do not match the phage genome and aligns them to the bacterial
+genome using the same mapping function. The analysis to identify Mu-like phages is available only when providing a
+single phage genome (it is not possible if the user provides a multi-fasta file with multiple assembled phage contigs).
-- Multi-fasta reference e.g. VIROME
-- Host : Bexare, if you are using multiple sequences, the host analysis is not possible. Separate your references if you want to do this analysis.
+The previous PhageTerm program (single phage analysis) and information are still available at https://sourceforge.net/projects/phageterm/ (for versions <3.0.0)
-The PhageTerm program and information is available at https://sourceforge.net/projects/phageterm/ for versions <3.0.0
-and at : https://gitlab.pasteur.fr/vlegrand/ptv for versions higher.
-A Galaxy wrapper version is also available at https://galaxy.pasteur.fr (for versions <3.0.0)
+A Galaxy wrapper version is also available at https://galaxy.pasteur.fr (only for the first version of PhageTerm; PhageTermVirome is not yet implemented on Galaxy).
 Since version 3.0.0, PhageTerm can work in 2 modes:
-- mono machine mode (parallelization on several cores on tne same machine).
-- multi machine mode (parallelization on several machines, using intermediate files for data exchange).
+- the usual mono machine mode (parallelization on several cores on the same machine).
+- a new multi machine mode (advanced users) with parallelization on several machines, using intermediate files for data exchange.
+
 The default mode is mono machine.
 Version 3.0.0 up to version 4.0 work with python 2.7
-Since version 4.0, PhageTerm works with python 3.7 only
-Since version 4.1, pvalue and pvalue adj for peaks are printed in the workflow.txt report (case of virome analysis),
-
-
+Since version 4.0, PhageTerm (now PhageTermVirome) works with python 3.7
 PREREQUISITES
 =============
-For version 3.0 up to version 4.0 (not included)
-Unix/Linux
+For version 4.0
-- Python 2.7.X
-- matplotlib 1.3.1
-- numpy 1.9.2
-- pandas 0.19.1
-- sklearn 0.18.1
-- scipy 0.18.1
-- statsmodels 0.6.1
-- reportlab 3.3.0
+Unix/Linux
-A conda virtualenv containing python2.7 and all dependencies is provided for convenience so that users
-don't need to install anything else than miniconda or conda.
+ - backports
+ - backports.functools_lru_cache
+ - backports_abc
+ - cycler
+ - libwebp-base
+ - lz4-c
+ - matplotlib-base
+ - matplotlib
+ - numpy
+ - openssl
+ - pandas
+ - patsy
+ - pillow
+ - pip
+ - pyparsing
+ - python=3.7
+ - python-dateutil
+ - python_abi
+ - pytz
+ - readline
+ - reportlab
+ - scikit-learn
+ - scipy
+ - setuptools
+ - singledispatch
+ - statsmodels
+ - tk
+ - tornado
+A conda virtualenv containing python3.7 and all dependencies is provided for convenience so that users
+don't need to install anything else than miniconda or conda. (See below)
-For version 4.0 and higher
-Unix/Linux
+INSTALLING PHAGETERMVIROME USING THE CONDA VIRTUALENV (easiest option)
+======================================================================
-- Python 3.7
-- matplotlib
-- numpy
-- pandas
-- sklearn
-- scipy
-- statsmodels
-- reportlab
+First install miniconda (you don't even need to have python 2.7 or python 3.7 installed on your machine for that since
+miniconda contains it): https://docs.conda.io/en/latest/miniconda.html
-A conda virtualenv containing python3.7 and all dependencies is provided for convenience so that users
-don't need to install anything else than miniconda or conda.
+Then, create the conda environment using the PhageTerm_env_3.yml file for version >=4.0 (python3)
+ conda env create -f PhageTerm_env_3.yml
+Then activate the environment so you can launch PhageTermVirome:
+
+ conda activate PhageTerm_env_py3
-USING THE CONDA VIRTUALENV
-==========================
-First install miniconda (you don't even need to have python 2.7 or python 3.7 installed on your machine for that;
-miniconda contains it): https://docs.conda.io/en/latest/miniconda.html
+NOTE:
-Then, create the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2):
+You can still use the old PhageTerm under python 2.7 (but no multi-fasta analysis is possible) using the miniconda environment from the PhageTerm_env.yml file for version<4.0 (python2):
 conda env create -f PhageTerm_env.yml
-or from the PhageTerm_env_3.yml file for version >=4.0 (python3)
- conda env create -f PhageTerm_env_3.yml
+conda activate PhageTerm_env
-Acctivate it to be able to work:
- conda activate PhageTerm_env
-
- or
- conda activate PhageTerm_env_py3
 COMMAND LINE
 ============
+Basic usage with mandatory options (PhageTermVirome needs at least one read file, but the user can provide a second corresponding paired-end read file if available, using the -p option).
- ./PhageTerm.py -f reads.fastq -r phage_sequence.fasta
-[-n phage_name -p reads_paired -s seed_lenght -d surrounding -t installation_test -c nbr_core -g host.fasta -l limit_multi-fasta -v virome_time]
-[--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
-[--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
-[--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
- (warning increase process time)]
+ ./PhageTerm.py -f reads.fastq -r phage_sequence(s).fasta
 Help:
 ./PhageTerm.py -h
 ./PhageTerm.py --help
+
+
+ Software run test:
+ -t TEST_VALUE, --test=TEST_VALUE
+ TEST_VALUE=C5 : Test run for a 5' cohesive end (e.g. Lambda)
+ TEST_VALUE=C3 : Test run for a 3' cohesive end (e.g. HK97)
+ TEST_VALUE=DS : Test run for a short Direct Terminal Repeats end (e.g. T7)
+ TEST_VALUE=DL : Test run for a long Direct Terminal Repeats end (e.g. T5)
+ TEST_VALUE=H : Test run for a Headful packaging (e.g. P1)
+ TEST_VALUE=M : Test run for a Mu-like packaging (e.g. Mu)
+
+
+Non-mandatory options
+
+[-p reads_paired -c nbr_core_threads -n analysis_name -s seed_lenght -d surrounding -t installation_test -g host.fasta -l contig_size_limit_multi-fasta -v virome_run_time_estimation]
+
+
+Additional advanced options (only for multi-machine users)
+
+
+[--mm --dir_cov_mm path_to_coverage_results -c nb_cores --core_id idx_core -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
+[--mm --dir_cov_mm path_to_coverage_results --dir_seq_mm path_to_sequence_results --DR_path path_to_results --seq_id index_of_sequence --nb_pieces nbr_of_read_chunks -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta] [--mm --DR_path path_to_results --dir_seq_mm path_to_sequence_results -p reads_paired -s seed_lenght -d surrounding -l limit_multi-fasta]
+
- Mandatory Options:
+ Detailed options:
+
 Raw reads file in fastq format:
 -f INPUT_FILE, --fastq=INPUT_FILE
@@ -126,21 +153,21 @@ COMMAND LINE
 (NGS sequences from random fragmentation DNA only, e.g. Illumina TruSeq)
- Phage genome in fasta format:
+ Phage genome(s) in fasta format:
 -r INPUT_FILE, --ref=INPUT_FILE
- Reference phage genome as unique contig in fasta format
+ Reference phage genome(s) as unique contig in fasta format
 Other options common to both modes:
- Raw reads file in fastq format:
+ Raw reads file in fastq format:
 -p INPUT_FILE, --paired=INPUT_FILE
 Paired fastq reads
 (NGS sequences from random fragmentation DNA only, e.g. Illumina TruSeq)
- Name of the phage being analyzed by the user:
+ Analysis_name to write on output reports:
 -n PHAGE_NAME, --phagename=PHAGE_NAME
 Manually enter the name of the phage being analyzed. Used as prefix for output files.
@@ -150,7 +177,7 @@ COMMAND LINE
 Manually enter the lenght of the seed used for reads in the mapping process (Default: 20).
- Lenght of the seed used for reads in the mapping process:
+ Number of nucleotides around the main peak to consider for merging adjacent significant peaks (set to 1 to discover secondary terminus sites).
 -d SUROUNDING_LENGHT, --surrounding=SUROUNDING_LENGHT
 Manually enter the lenght of the surrounding used to merge close peaks in the analysis process (Default: 20).
@@ -165,11 +192,11 @@ COMMAND LINE
 Phage mean coverage to use (Default: 250).
 Define phage mean coverage:
- -l LIMIT_MFASTA, —limit=LIMIT_MFASTA
+ -l LIMIT_FASTA, --limit=LIMIT_FASTA
 Minimum phage fasta length (Default: 500).
- Options for mono machine (default) mode
+ Options for mono machine (default) mode only
 Software run test:
 -t TEST_VALUE, --test=TEST_VALUE
@@ -186,14 +213,13 @@ COMMAND LINE
- Options for multi machine mode
+ Options for multi machine mode only
- Indicate that PageTerm should run on several machines:
+ Indicate that PhageTerm should run on several machines:
 --mm
-
- Options for step 1 (calculating reads coverage) on several machines
+ Options for step 1 of multi-machine mode (calculating reads coverage) on several machines
 Directory for coverage results:
 --dir_cov_mm=DIR_PATH/DIR_NAME
@@ -220,7 +246,7 @@ COMMAND LINE
- Options for step 2 (calculating per sequence statistics from reads coverage results) on several machines
+ Options for step 2 of multi-machine mode (calculating per sequence statistics from reads coverage results) on several machines
 Directory for coverage results:
 --dir_cov_mm=DIR_PATH/DIR_NAME
@@ -250,7 +276,7 @@ COMMAND LINE
 Must be the same value as given via -c at step 1 (CORE_NBR).
- Options for step 3 (final report generation)
+ Options for step 3 of multi-machine mode (final report generation)
 Directory for DR results
 --DR_path=DIR_PATH/DIR_NAME
@@ -275,7 +301,7 @@ OUTPUT FILES
 (ii) Statistical table (.csv)
- (iii) Sequence files (.fasta)
+ (iii) File containing contigs re-organized to start at the predicted termini (.fasta)
 CONTACT