TABLE OF CONTENTS
- AIM
- WARNINGS
- CONTENT
- INPUT
- HOW TO RUN
- OUTPUT
- VERSIONS
- LICENCE
- CITATION
- CREDITS
- ACKNOWLEDGEMENTS
- WHAT'S NEW IN
AIM
- Annotation of mRNA sequences of the immunoglobulin heavy or light variable region (sequences provided as fasta files).
- Clustering of the annotated sequences into clonal groups (same germline origin).
- Tree visualization of the clonal groups.
WARNINGS
- Right now, only dedicated to the analysis of VDJ repertoires (corresponding to the germlines/imgt/<SPECIES>/vdj folder of the IMGT database).
- To make the repertoire contingency tables and heatmaps, the script currently takes the first of the IMGT annotations when several are present in the v_call or j_call column of the all_passed_seq.tsv file.
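The second warning can be illustrated with a minimal sketch. It assumes that multiple annotations are separated by commas, as in the IGKV1-39*01,IGKV1D-39*01 example given in the OUTPUT section. The sample file and its three-column layout are hypothetical, and this is not the code used by the pipeline:

```shell
# Hypothetical sample mimicking the v_call and j_call columns of all_passed_seq.tsv
tmp_dir=$(mktemp -d)
printf 'sequence_id\tv_call\tj_call\n' > "${tmp_dir}/sample.tsv"
printf 'seq1\tIGKV1-39*01,IGKV1D-39*01\tIGKJ1*01\n' >> "${tmp_dir}/sample.tsv"
printf 'seq2\tIGHV3-23*01\tIGHJ4*02,IGHJ5*02\n' >> "${tmp_dir}/sample.tsv"

# Keep only the text before the first comma in columns 2 and 3 (header left untouched)
first_annot=$(awk 'BEGIN{FS=OFS="\t"} NR>1{sub(/,.*/, "", $2); sub(/,.*/, "", $3)} {print}' "${tmp_dir}/sample.tsv")
printf '%s\n' "${first_annot}"
```

With this sample, seq1 keeps IGKV1-39*01 as V allele and seq2 keeps IGHJ4*02 as J allele.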
CONTENT
Files and folder | Description |
---|---|
main.nf | File that can be executed using a linux terminal, a MacOS terminal or Windows 10 WSL2. |
nextflow.config | Parameter settings for the main.nf file. Users have to open this file, set the desired settings and save these modifications before execution. |
bin folder | Contains files required by the main.nf file. |
xlsx2fasta.R | Accessory file that creates all the fasta files required from a .xlsx file. To use it, 1) open the file, 2) complete the "Parameters that need to be set by the user" section, 3) save the modifications and 4) run the file in R. |
Licence.txt | Licence of the release. |
INPUT
Required files |
---|
A folder (zipped or not) containing nucleotide fasta files, each containing a single sequence. Use xlsx2fasta.R (https://github.com/gael-millot/xlsx2fasta) if sequences are in a .xlsx file. |
A metadata file (optional) for adding information in the results. |
The dataset used in the nextflow.config file, as an example, is available at https://zenodo.org/records/8403994.
Use this code to split a multi sequence fasta file into fasta files made of a single sequence:
FASTA_FILE="./test.fasta" # add path and name of the fasta file here
awk -v slice_size=1 -v prefix="cut" '$1 ~ /^>/{nbSeq++; currSlice=int((nbSeq-1)/slice_size)+1; myOutFile=prefix"_"currSlice".fasta"}{print $0 > myOutFile}' ${FASTA_FILE}
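Since each input fasta file must contain a single sequence, the result of the split can be checked before running the pipeline. The sketch below applies the command above to a hypothetical three-sequence fasta file created in a temporary folder, then counts the header lines in each output file. The file content and the variable names n_files and n_bad are only illustrative:

```shell
work_dir=$(mktemp -d)
cd "${work_dir}"
# Hypothetical three-sequence fasta file, used only for illustration
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nCATG\n' > test.fasta

FASTA_FILE="./test.fasta"
awk -v slice_size=1 -v prefix="cut" '$1 ~ /^>/{nbSeq++; currSlice=int((nbSeq-1)/slice_size)+1; myOutFile=prefix"_"currSlice".fasta"}{print $0 > myOutFile}' "${FASTA_FILE}"

# Each output file is expected to hold exactly one header line (starting with ">")
n_files=0
n_bad=0
for f in cut_*.fasta; do
    n_files=$((n_files + 1))
    n_headers=$(grep -c '^>' "${f}")
    if [ "${n_headers}" -ne 1 ]; then
        n_bad=$((n_bad + 1))
    fi
done
echo "files created: ${n_files}, files without exactly one header: ${n_bad}"
```

A non-zero second count would indicate that the split did not produce single-sequence files.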
HOW TO RUN
1. Prerequisite
Installation of:
- nextflow DSL2
- Graphviz (sudo apt install graphviz for Linux Ubuntu)
- Apptainer
2. Local running (personal computer)
2.1. main.nf file in the personal computer
- Mount a server if required:
DRIVE="Z" # change the letter to fit the correct drive
sudo mkdir /mnt/share
sudo mount -t drvfs $DRIVE: /mnt/share
Warning: without mounting, nextflow may do nothing, or display a message like:
Launching `main.nf` [loving_morse] - revision: d5aabe528b
/mnt/share/Users
- Run the following command from the folder containing the main.nf and nextflow.config files (example: \\wsl$\Ubuntu-20.04\home\gael):
nextflow run main.nf -c nextflow.config
with -c to specify the name of the config file used.
2.2. main.nf file in the public gitlab repository
Run the following command from where you want the results:
nextflow run -hub pasteur gmillot/repertoire_profiler -r v1.0.0
3. Distant running (example with the Pasteur cluster)
3.1. Pre-execution
Copy-paste this after having modified the EXEC_PATH variable:
EXEC_PATH="/pasteur/helix/projects/BioIT/gmillot/repertoire_profiler" # where the bin folder of the main.nf script is located
export CONF_BEFORE=/opt/gensoft/exe # on maestro
export JAVA_CONF=java/13.0.2
export JAVA_CONF_AFTER=bin/java # on maestro
export APP_CONF=apptainer/1.3.5
export APP_CONF_AFTER=bin/apptainer # on maestro
export GIT_CONF=git/2.39.1
export GIT_CONF_AFTER=bin/git # on maestro
export GRAPHVIZ_CONF=graphviz/2.42.3
export GRAPHVIZ_CONF_AFTER=bin/graphviz # on maestro
# one comma-separated entry per tool
MODULES="${CONF_BEFORE}/${JAVA_CONF}/${JAVA_CONF_AFTER},${CONF_BEFORE}/${APP_CONF}/${APP_CONF_AFTER},${CONF_BEFORE}/${GIT_CONF}/${GIT_CONF_AFTER},${CONF_BEFORE}/${GRAPHVIZ_CONF}/${GRAPHVIZ_CONF_AFTER}"
cd ${EXEC_PATH}
chmod 755 ${EXEC_PATH}/bin/*.* # not required if no bin folder
module load ${JAVA_CONF} ${APP_CONF} ${GIT_CONF} ${GRAPHVIZ_CONF}
3.2. main.nf file in a cluster folder
Modify the second line of the code below, and run from where the main.nf and nextflow.config files are (which has been set thanks to the EXEC_PATH variable above):
HOME_INI=$HOME
HOME="${HELIXHOME}/repertoire_profiler/" # $HOME changed to allow the creation of .nextflow into /$HELIXHOME/repertoire_profiler/, for instance. See NXF_HOME in the nextflow software script
nextflow run --modules ${MODULES} main.nf -c nextflow.config
HOME=$HOME_INI
3.3. main.nf file in the public gitlab repository
Modify the first and third lines of the code below, and run (results will be where the EXEC_PATH variable has been set above):
VERSION="v1.0"
HOME_INI=$HOME
HOME="${HELIXHOME}/repertoire_profiler/" # $HOME changed to allow the creation of .nextflow into /$HELIXHOME/repertoire_profiler/, for instance. See NXF_HOME in the nextflow software script
nextflow run --modules ${MODULES} -hub pasteur gmillot/repertoire_profiler -r $VERSION -c $HOME/nextflow.config
HOME=$HOME_INI
4. Error messages and solutions
Message 1
Unknown error accessing project `gmillot/repertoire_profiler` -- Repository may be corrupted: /pasteur/sonic/homes/gmillot/.nextflow/assets/gmillot/repertoire_profiler
Purge using:
rm -rf /pasteur/sonic/homes/gmillot/.nextflow/assets/gmillot*
Message 2
WARN: Cannot read project manifest -- Cause: Remote resource not found: https://gitlab.pasteur.fr/api/v4/projects/gmillot%2Frepertoire_profiler
Contact Gael Millot (distant repository is not public).
Message 3
permission denied
Use chmod to change the user rights. Example linked to files in the bin folder:
chmod 755 bin/*.*
OUTPUT
An example of results obtained with the dataset is present at this address: https://zenodo.org/record/8403994/files/repertoire_profiler.zip
Complete information is in the Protocol 144-rev0 Ig clustering - Immcantation.docx (contact Gael Millot).
Mandatory elements:
repertoire_profiler_<UNIQUE_ID> folder | Description |
---|---|
reports | Folder containing all the reports of the different processes, including the nextflow.config file used. |
repertoires | Folder containing the repertoires, i.e., contingency tables of the VDJ allele usage from the all_passed_seq.tsv file (see below). Warning: the script currently takes the first of the IMGT annotations when several are present in the v_call or j_call column of the all_passed_seq.tsv file (e.g., v_call with IGKV1-39*01,IGKV1D-39*01), so that contingencies are identical to those from the donut frequencies, which use the germline_v_call and germline_j_call columns (allele reassignment by the CreateGermlines.py tool of Immcantation). |
png | Folder containing the graphs in png format. |
svg | Folder containing the graphs in svg vectorial format. |
RData | Folder containing, for each clonal group, objects that can be used in R to further analyze or plot the data. Also contains the all_trees.RData file that combines the tree R objects of the different files into a single trees object. |
seq_distance.pdf | Distribution of the distances between the two nearest sequences (see the nearest_distance column in the all_passed_seq.tsv file). |
donuts.pdf | Donut plots showing the frequency of sequences per clonal group. |
repertoire.pdf | Heatmaps of the files from the repertoires folder (see above), showing the frequency of alleles used among all the passed sequences ("all"), non-empty cells ("non-zero") and "annotated" sequences (if metadata are provided). Non-zero means that unused alleles (empty rows or columns) are removed from the heatmap. Warning: to build the repertoire contingencies, the script currently takes the first of the IMGT annotations when several are present in the v_call or j_call column of the all_passed_seq.tsv file (see the all_passed_seq_several_annot_igmt.tsv file below). |
germ_tree.pdf | Phylogenetic trees of the sequences that belong to a clonal (supposedly germline) group made of at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file (one page per clonal group). Warning: clonal group full names are those given by dowser::formatClones, i.e., those from the germline_v_call and germline_j_call columns of the all_passed_seq.tsv file. |
germ_no_tree.pdf | All the clonal groups with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file (one page per clonal group). Clonal group information is summarized on each page. |
donut_stat.tsv | Statistics associated with the donuts.pdf file. |
igblast_unaligned_seq.tsv | Names of sequences that failed to be annotated by igblast (empty file if all the sequences are annotated). |
igblast_aligned_seq.tsv | Names of sequences annotated by igblast (more precisely by MakeDb.py igblast). If empty, a subsequent nextflow failure is generated. The total number of lines in igblast_unaligned_seq.tsv and igblast_aligned_seq.tsv is equal to the number of submitted .fasta files. |
productive_seq.tsv | Productive sequences annotated by igblast (see the unproductive_seq.tsv file for sequences that failed to be productively annotated by igblast). Productive means: (1) coding region has an open reading frame, (2) no defect in the start codon, splicing sites or regulatory elements, (3) no internal stop codons, (4) an in-frame junction region. |
all_passed_seq.tsv | Sequences from the productive_seq.tsv file with germline clustering (clone ID), allele reannotation (germline_v_call and germline_j_call columns), mutation load, distance and sequence nickname (annotation from the metadata file) added. Warning: the number of sequences (i.e., rows) can be lower than in the productive_seq.tsv file because some sequences may fail clone assignment (see the non_clone_assigned_sequence.tsv file). |
all_passed_seq_several_annot_igmt.tsv | Sequences from the all_passed_seq.tsv file with several annotations in the v_call or j_call column, because several alleles tied for the best alignment using imgt blast. |
unproductive_seq.tsv | Sequences that failed productive annotations by igblast (empty file if all the sequences are productively annotated). |
non_clone_assigned_sequence.tsv | Productive sequences that failed to be assigned to a clone ID by the DefineClones.py function (empty file if all the sequences are assigned). |
germ_tree_clone_id.tsv | Clonal group IDs used in the germline tree analysis (clonal group with at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file). |
germ_tree_dismissed_clone_id.tsv | Clonal group IDs not used in the germline tree analysis (clonal group with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file). |
germ_tree_seq.tsv | Sequences of the all_passed_seq.tsv file used in the germline tree analysis (clonal group with at least n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file). |
germ_tree_dismissed_seq.tsv | Sequences of the all_passed_seq.tsv file not used in the germline tree analysis (clonal group with fewer than n sequences, n being set by the nb_seq_per_clone parameter in the nextflow.config file). |
germ_tree_dup_seq_not_displayed.tsv | Sequences used in the germline tree analysis but not displayed in the graph, because (1) they are strictly identical to another sequence already in the tree and (2) the tree_duplicate_seq parameter of the nextflow.config file has been set to "FALSE". |
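The split between clonal groups used and not used in the germline tree analysis can be sketched as follows. This is only an illustration of the nb_seq_per_clone threshold, not the code of the pipeline. The sample file is hypothetical, and the clone ID is assumed to sit in its second column (the header of the real all_passed_seq.tsv file should be checked before adapting the column number):

```shell
tmp_dir=$(mktemp -d)
# Hypothetical sample: column 1 = sequence name, column 2 = clone ID
printf 'sequence_id\tclone_id\n' > "${tmp_dir}/sample.tsv"
printf 'seq1\t1\nseq2\t1\nseq3\t1\nseq4\t2\nseq5\t3\nseq6\t3\n' >> "${tmp_dir}/sample.tsv"

nb_seq_per_clone=3 # same meaning as the parameter of the nextflow.config file

# Count the sequences of each clone ID, then keep the IDs reaching the threshold
kept=$(awk -F '\t' -v n="${nb_seq_per_clone}" 'NR>1{count[$2]++} END{for(id in count) if(count[id] >= n) print id}' "${tmp_dir}/sample.tsv" | sort -n | paste -sd ' ' -)
# Same count, but keep the IDs below the threshold
dismissed=$(awk -F '\t' -v n="${nb_seq_per_clone}" 'NR>1{count[$2]++} END{for(id in count) if(count[id] < n) print id}' "${tmp_dir}/sample.tsv" | sort -n | paste -sd ' ' -)

echo "clonal groups with at least ${nb_seq_per_clone} sequences: ${kept}"
echo "clonal groups with fewer than ${nb_seq_per_clone} sequences: ${dismissed}"
```

With this sample, clonal group 1 (three sequences) passes the threshold, while groups 2 and 3 are dismissed.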
Optional elements only returned if the igblast_aa parameter is 'false' and if the input fasta are nucleotide sequences:
repertoire_profiler_<UNIQUE_ID> folder | Description |
---|---|
aa | Folder containing the translation of the alignment_sequence column of the productive_seq.tsv file in fasta files. |
aligned_seq | Folder containing the alignment_sequence column of the productive_seq.tsv file in fasta files. |
aa.tsv | File containing all the translations of the alignment_sequence column of the productive_seq.tsv file. |
VERSIONS
The different releases are tagged here.
LICENCE
This package of scripts can be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details at https://www.gnu.org/licenses or in the attached Licence.txt file.
CITATION
Version V10.3:
CREDITS
Pascal Chappert, INSERM U1151 Institut Necker Enfants Malades, Paris, France
Frédéric Lemoine, Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, France
Gael A. Millot, Bioinformatics and Biostatistics Hub, Institut Pasteur, Paris, France
ACKNOWLEDGEMENTS
The developers & maintainers of the mentioned software and packages, including:
Special acknowledgement to Kenneth Hoehn, Yale School of Medicine, New Haven, CT, USA
WHAT'S NEW IN
v11.1
- Option comment_char = '' of read.table() for all .R files with read.table() in bin
v11.0
- Tree plots improved
v10.4
- Donut plot legend corrected
v10.3
- Spaces removed in file names
- Spaces removed in the first line of each fasta file
- .fas fasta extension allowed
v10.2
xlsx2fasta.R file: error fixed for the fasta by categ
v10.1
xlsx2fasta.R file improved to take into account the problem of NA
v10.0
xlsx2fasta.R file strongly improved to deal with empty sequences
v9.9
Bug fixed in xlsx2fasta.R
v9.8
Error fixed in the nextflow.config file about cute path
v9.7
Cute folder updated to 12.8 to fix the palette bug in fun_gg_donut()
v9.6
Bug in the metadata_check process fixed
v9.5
README improved so that now dataset and results are in zenodo
v9.4
bug fixed
v9.3
important check added for the metadata file
v9.2
- repertoires now ok with a mix of IGK and IGL sequences
- first annotation taken if several v or j allele annotation by imgt for the repertoires
v9.1
bug corrected for the meta_name_replacement parameter
v9.0
repertoire_profiler.nf and .config name changed so that now can be run from gitlab
v8.14
README file improved, that clarify the differences between sequence_alignment and germline_* sequences
v8.13
README file improved, that clarify if results are from the productive seq or all passed seq
v8.12
Check added for the metadata file of the meta_path parameter. But check the content of the first column of this file remains to be added
v8.11
bugs fixed
v8.10
bugs fixed
v8.9
ig_clustering name replaced everywhere by repertoire_profiler
v8.8
Quotes removed from all output files
v8.7
xlsx2fasta.R file modified so that it now split data according to each class of the categ parameter
v8.6
Minor aesthetic modifications in trees and donut
v8.5
Bugs fixed in distToNearest
v8.4
- Bugs fixed in clone_assignment and get_tree with a new output file created non_clone_assigned_sequence.tsv
- Bugs fixed in tree_vizu when no metadata
v8.3
- Bug fixed in tree_vizu: now nb of removed seq are displayed again
- Now annotated seq are colored
v8.2
Bug fixed in tree_vizu: now meta are displayed again
v8.1
Bug fixed in get_tree
v8.0
- xlsx2fasta.R file modified so that fasta files have correct names
- now trees.pdf returns an empty graph (but with infos) for clonal groups without trees
v7.1
Clean version of the xlsx2fasta.R script: can now be run on any excel file
v7.0
- Bug of the tree_seq_not_displayed.tsv file fixed
- Number of seq removed added in tree leafs
v6.11
Code debugged. tree_seq_not_displayed.tsv remains to be debugged
v6.10
Repertoires improved
v6.9
Repertoires added
v6.8
- igblast_aa parameter not operational yet. igblast_aa = "true" does not work for the moment because no j data in the imgt database and no junction data are returned which block the clone_assignment process
- tree_vizu process not sensitive to cache
v6.7
Problem of cache fixed for distance_hist process and bug fixed for empty pdf plot in this process. It was the name seq_distance, creating a replacement of the seq_distance.pdf file
v6.6
Names of metadata now systematically present in trees, even when identical sequences are removed, and bug fixed for empty pdf plot
v6.5
Code secured and bug fixed for NULL metadata file path
v6.4
Distances added in the returned productive_seq.tsv, now 3 histograms to help to set the distance threshold, many bugs fixed
v6.3
AA sequences added. Translation of the alignment_sequence column is added in the returned productive_seq.tsv and new aa.tsv file
v6.2
Supp donut added regarding clonal groups with functional annotation
v6.1
New parameters for the donut
v6.0
- Distance histogram added
- Empty graphs added
- New parameters for the donut
v5.1
Metadata info added in donut plot
v5.0
- Bug solved when the tree_meta_path_names parameter is a categorical column
- New tree_meta_name_replacement parameter
v4.4
Bug solved in seq_not_displayed.tsv file
v4.3
- igblast_aa parameter removed from the .config file
- Bug solved in tree_vizu
v4.2
Donut charts grouped in a single pdf
v4.1
seq_not_displayed.tsv file added to better understand the absence of seq in trees
V and J alleles added in the donut legend
v4.0
tree_meta_path modified so that it can now be a 'NULL' path
V and J alleles added in tree titles
Duplicated sequences can be removed or not from trees
v3.5
Bug solved
v3.4
Two Donut charts added
v3.3
Empty channel solved
Donut chart added
v3.2
README file updated for localization of the igblast database
v3.1
README file updated
v3.0
First version that provides trees of clonal groups
v2.1
- xlsx2fasta.R file added
- Nicer tree representation added
v2.0
Conversion into DSL2 ok
v1.0
First DSL1 version that works