Snippets Groups Projects

2 years ago
237b8d11

README update · 237b8d11
Julien GUGLIELMINI authored 2 years ago

237b8d11

History

README update
Julien GUGLIELMINI authored 2 years ago

README.md 7.24 KiB

wGRR - weighted Gene Repertoire Relatedness

Description

Introduction

wGRR is a set of scripts to calculate the weighted Gene Repertoire Relatedness (wGRR) of protein sequences of a set of genetic elements (genomes, plasmids, prophages, etc.). wGRR relies on Best Bi-directional Hits (BBH) which are couples of proteins that are reciprocal best matches in a BLAST-like analysis. Once the BBH have been defined for a pair of genetic elements, we can write the wGRR as:

wGRR(A,B) = \frac{\sum_{i}{id(A_i,B_i)}}{min(P_A,P_B)}

where

$i$
is the number of BBH between elements
$A$
and
$B$
$id(A_i,B_i)$
is the identity score of BBH
$i$
$P_A$
and
$P_B$
are the numbers of proteins of
$A$
and
$B$
respectively

NOTE
wGRR calculates 3 versions of the wGRR score depending on which BBH are considered (numerator) and what proteins should be counted (denominator). See the Output section of this manual for more explanations.

Dependencies

BBH are defined by all versus all protein comparisons using MMseqs2.

The scripts are written in zsh and awk and as such wGRR should run on any Unix system. A recent version of awk is necessary (gawk) is recommended.

wGRR can also be run on Institut Pasteur's cluster Maestro. In this case, all dependencies will be automatically loaded upon execution.

Installation

Download the files, then (if necessary):

chmod +x wGRR*

Usage

Example

./wGRR -i example/test_0.prt -o TEST0

On a local machine

./wGRR -i $fasta [-p $mmseqs2_path -o $output_prefix -t $threads -a $comparisons -T -f]

WARNING
Memory consumption can be really high and running wGRR might exhaust your system. It is advised to perform a test run (with the -T flag) first.

On an interactive session on Maestro

wGRR can be used interactively on Maestro. First you need to allocate resources. For example

salloc -p hubbioit -c 10 --mem 20G

This will allocate 10 CPUs and 20G of memory of the hubbioit partition.

When the resource is available, wGRR can be executed as described above. Note that you do not need to load any module - wGRR will automatically handle this. The number of threads passed via the -t option will be used for both the MMseqs step and the wGRR calculation.

Using `sbatch`

This is the recommended way of using wGRR for large datasets. Simply use the sbatch command with the proper partition specification. Note that there is no need to allocate more than one CPU (with the sbatch option -c). For example:

sbatch -p hubbioit ./wGRR -i test_2.prt -t 30

This will run wGRR on the file test_2.prt on the hubbioit partition. If you do not provide a partition, wGRR will use Maestro's default, i.e. the "common" partition with the "normal" Quality of Service (QoS).
The MMseqs job will be submitted to the cluster's scheduler with 30 CPUs (-t 30). Then for the actual wGRR calculation, the required amount of jobs (depending on the value passed with the -a option) will be submitted to the queue. If 100 jobs (1 CPU each) are necessary, a job array of 100 jobs will be submitted to the scheduler. To avoid using 100% of your partition's CPUs you can adjust the number of maximum jobs running simultaneously with the -m option.

If you do not have access to a dedicated partition, or if there is not enough free CPUs on your partition, you can try to turn on the -f flag. By doing so the wGRR workers will be submitted to the common and dedicated machines of Maestro, on the "fast" QoS. Jobs running on the fast QoS have a higher priority (so the workers will start faster) but are limited to 2 hours. Also, using the -m parameter is not necessary because in this case you will use a lot of different common resources. But you need to be sure that each worker will end in less than 2 hours otherwise the run will fail.

Mandatory parameter

$fasta is a fasta file containing all the proteins of all the elements you want to compare. The protein names must be formatted as:

>Element_id_PROT

with PROT being a protein identifier, unique for each protein inside the element Element_id. For example:

>ESCO001.0321.00001.P001_00141

This corresponds to the protein 00141 of the element ESCO001.0321.00001.P001.

Notes:

wGRR expects all characters after the last underscore to correspond to the protein id.
These characters can be anything except spaces and underscores.
The Element_id can have any characters (including underscores) except spaces.

Optional parameters

$mmseqs2_path should be provided if the mmseqs executable is not in your system path. This is not needed if you work on Maestro.
$output_prefix is a prefix that will be used for each output files. By default, a random string is used.
$threads is the number of threads to use. Using more than one is highly recommended.
$comparisons is the number of genetic elements to be compared simultaneously on each thread. The default value, 10,000, means that on each thread groups of 10,000 genetic elements will be compared simultaneously. Increasing this number will decrease the total number of tasks to be performed but will increase RAM usage.
test run. By using the -T flag, a test run is executed, which omits MMseqs and wGRR calculations, and outputs some statistics of your input file to help you set up the analysis properly.

Output

wGRR produces a table with several metrics corresponding to 3 different ways of calculating the wGRR.

The first two columns indicate the pair of elements for this row.

wGRR1 is the easiest way of calculating the wGRR. It corresponds to the sum of the identity scores for all BBH between the two elements, divided by the smallest total number of protein of the two elements (columns RealLengthA and RealLengthB).

Common1 is the proportion of common proteins, i.e. the number of BBH pairs divided by the mean number of proteins of the two elements.

wGRR2 uses the same numerator than wGRR1. For the denominator calculation, an extra single linkage clustering step is perfomed to build protein families. If a protein belongs to a family containing a BBH but this protein is not part of a BBH, then it is not counted in the denominator, resulting sometimes in a greater wGRR. This results in a new number of proteins (nProt2) per element, and the denominator is the smallest nProt2.
wGRR2 is especially interesting when lots of paralogs are expected - the presence of paralogs will artificially lower the wGRR1.

Common2 is the number of BBH pairs divided by the mean nProt2 of two elements.

wGRR3 also uses the protein families. But this time, if two BBH pairs are found in the same family, only one will be counted (the one with the highest identity score) for the numerator. nProt3 is the number of protein families + the number of proteins that are not part of a family. The denominator is the smallest nProt3.

Common3 is the number of protein families containing at least one BBH divided by the mean nProt3 of two elements.