Snippets Groups Projects

2 years ago
2b6789a3

Fixed path to image in readme · 2b6789a3
Julien GUGLIELMINI authored 2 years ago

2b6789a3

History

Fixed path to image in readme
Julien GUGLIELMINI authored 2 years ago

README.md 6.37 KiB

wGRR - weighted Gene Repertoire Relatedness

Description

Introduction

wGRR is a set of scripts to calculate the weighted Gene Repertoire Relatedness (wGRR) of protein sequences of a set of genetic elements (genomes, plasmids, prophages, etc.). wGRR relies on Best Bi-directional Hits (BBH) which are couples of proteins that are reciprocal best matches in a BLAST-like analysis. Once the BBH have been defined for a pair of genetic elements, we can write the wGRR as:

wGRR(A,B) = \frac{\sum_{i}{id(A_i,B_i)}}{min(P_A,P_B)}

where

id(A_i,B_i) is the identity score for each BBH pair between element A and element B
min(P_A,P_B) is the number of proteins in the smallest of A and B elements.

Dependencies

BBH are defined by all versus all protein comparisons using MMseqs2.

The scripts are written in zsh and awk and as such wGRR should run on any Unix system. A recent version of awk is necessary (gawk) is recommended.

wGRR can also be run on Institut Pasteur's cluster Maestro. In this case, all dependencies will be automatically loaded upon execution.

Installation

Download the files, then (if necessary):

chmod +x wGRR*

Usage

Example

./wGRR -i example/test_0.prt -o TEST0

On a local machine

./wGRR -i $fasta [-p $mmseqs2_path -o $output_prefix -t $threads -a $comparisons -T -f]

On an interactive session on Maestro

wGRR can be used interactively on Maestro. First, you need to allocate resources. For example

salloc -p hubbioit -c 10 --mem 20G

This will allocate 10 CPUs and 20G of memory of the hubbioit partition.

When the resource is available, wGRR can be executed as described above. Note that you do not need to load any module - wGRR will automatically handle this. The number of threads $THREADS passed by the -t option will be used for both the MMseqs step and the wGRR calculation.

Using `sbatch`

This is the recommended way of using wGRR. Simply use the sbatch command with the proper partition specification. Note that there is no need to allocate more than one CPU (with the sbatch option -c). For example:

sbatch -p hubbioit ./wGRR -i test_2.prt -t 30

This will run wGRR on the file test_2.prt on the hubbioit partition. The MMseqs job will be submitted to the cluster's scheduler with 30 CPUs. Then for the actual wGRR calculation, the required amount of jobs (depending on the value passed with the -a option) will be submitted to the queue. If 100 jobs (1 CPU each) are necessary, a job array of 100 jobs will be submitted to the scheduler. You can adjust the number of maximum jobs running simultaneously (to avoid using 100% of your partition's CPUs) by using the -m option.

If you do not have access to a dedicated partition, or if there is not enough free CPUs on your partition, you can try to turn on the -f flag. By doing so, the wGRR workers will be submitted to the common and dedicated machines of Maestro, on the "fast" Quality of Service (QoS). Jobs running on the fast QoS have a higher priority (so the workers will start faster) but are limited to 2 hours. Also, using the -m parameter is less necessary because you will use a lot of different common resources. But you need to be sure that each worker will end in less than 2 hours - otherwise the run will fail.

Mandatory parameter

$fasta is a fasta file containing all the proteins of all the elements you want to compare. The protein names must be formatted as:

>Element_id_PROT

with PROT being a protein identifier, unique for each protein inside the element Element_id. For example:

>ESCO001.0321.00001.P001_00141

This corresponds to the protein 00141 of the element ESCO001.0321.00001.P001.

Notes:

wGRR expects all characters after the last underscore to correspond to the protein id.
These characters can be anything except spaces and underscores.
The Element_id can have any characters (including underscores) except spaces.

Optional parameters

$mmseqs2_path should be provided if the mmseqs executable is not in your system path. This is not needed if you work on Maestro.
$output_prefix is a prefix that will be used for each output files. By default, a random string is used.
$threads is the number of threads to use. Using more than one is highly recommended.
$comparisons is the number of genetic elements to be compared simultaneously on each thread. The default value, 10,000, means that on each thread groups of 10,000 genetic elements will be compared simultaneously. Increasing this number will decrease the total number of tasks to be performed but will increase RAM usage.
test run. By using the -T flag, a test run is executed, which omits MMseqs and wGRR calculations, and outputs some statistics of your input file to help you set up the analysis properly.

Output

wGRR produces a table with several metrics corresponding to 3 different ways of calculating the wGRR.

The first two columns indicate the pair of elements for this row.

wGRR1 is the easiest way of calculating the wGRR. It corresponds to the sum of the identity scores for all BBH between the two elements, divided by the smallest total number of protein of the two elements (columns RealLengthA and RealLengthB).

Common1 is the proportion of common proteins, i.e. the number of BBH pairs divided by the same denominator as the wGRR.

wGRR2 uses the same numerator than wGRR1. For the denominator calculation, an extra single linkage clustering step is perfomed to build protein families. If a protein belongs to a family containing a BBH, but this protein is not part of a BBH, then it is not counted in the denominator, resulting sometimes in a greater wGRR. This is can be useful when lots of paralogs are expected - the presence of paralogs will artificially lower the wGRR as calculated for wGRR1.

Common2 is the number of BBH pairs divided by the denominator of wGRR2.

wGRR3 also uses the protein families. But this time, if two BBH pairs are found in the same family, only one will be counted (the one with the highest identity score) for the numerator. The denominator is the number of protein families + the number of proteins that are not part of a family.

Common3 is the number of protein families containing at least one BBH divided by the denominator of wGRR3.