TaggingBackends
This library helps to implement automatic tagging backends for the Nyx larva tagger.
Template project for tagging backends
A tagging backend, called e.g. `TaggingBackend`, is a Python project with the following directory structure:
├── LICENSE                        <- Default is MIT.
├── README.md                      <- Project description.
├── data
│   ├── raw                        <- The input data are accessible from this
│   │                                 directory, with their original file structure.
│   ├── interim                    <- Preprocessed data and extracted features
│   │                                 can be stored in this directory.
│   └── processed                  <- Predicted labels from predict_model.py are
│                                     expected in this directory.
│
├── models                         <- Hyperparameters and weights of trained
│                                     classifiers can be stored here.
├── pretrained_models              <- Partially trained models the training procedure
│                                     starts from; optional.
│
├── pyproject.toml                 <- Project definition file for Poetry.
├── src
│   └── taggingbackend             <- Python package name.
│       │                             Same as project name, with lowercase letters,
│       │                             hyphens converted into underscores.
│       │                             For example, `My-Tagger` becomes `my_tagger`.
│       ├── __init__.py            <- Defines variable `__version__`.
│       ├── data
│       │   └── make_dataset.py    <- Picks and converts raw data files; if files
│       │                             are to be written, they go into data/interim;
│       │                             optional.
│       ├── features
│       │   └── build_features.py  <- Extracts and saves features to file in
│       │                             data/interim; optional.
│       └── models
│           ├── train_model.py     <- Trains the behavior tagging algorithm and
│           │                         stores the trained model in models/;
│           │                         optional.
│           ├── finetune_model.py  <- Further trains the behavior tagging algorithm
│           │                         and stores the retrained model as a new model
│           │                         instance in models/; optional.
│           │                         *Available since version 0.14*.
│           └── predict_model.py   <- Loads the trained model and features from
│                                     data/interim, and moves the resulting
│                                     labels in data/processed.
└── test
    ├── __init__.py                <- Empty file.
    └── test_taggingbackend.py     <- Automated tests; optional.
                                      Filename is `test_<package_name>.py`.
The above structure borrows elements from the Cookiecutter Data Science project template, adapted for use with Poetry.
The `src/<package_name>/{data,features,models}` directories can accommodate Python modules (in subpackages `<package_name>.{data,features,models}` respectively).
For example, the model can be implemented as a Python class in an additional file in `src/<package_name>/models`, e.g. `mymodel.py`.
In this case, an empty `__init__.py` file should be created in the same directory.
As the Python package is installed, this custom module will be loadable from anywhere with `import <package_name>.models.mymodel`.
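As a rough illustration, such a module might look like the sketch below. The class name `MyModel`, its `fit`/`predict` methods and the majority-vote logic are hypothetical placeholders chosen for this example; none of them is prescribed by TaggingBackends.

```python
# Hypothetical content for src/<package_name>/models/mymodel.py.
from collections import Counter


class MyModel:
    """Toy classifier that always predicts the most frequent training label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, labels):
        # `labels` is an iterable of discrete behavior names (str).
        self.majority_label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        # Return one predicted label per requested time step.
        if self.majority_label is None:
            raise RuntimeError("fit must be called before predict")
        return [self.majority_label] * n
```

With the package installed, the class would then be imported with `from <package_name>.models.mymodel import MyModel`.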
On the other hand, `make_dataset.py`, `build_features.py`, `predict_model.py`, `train_model.py` and `finetune_model.py` are Python scripts, each with a main program.
These scripts are run using Poetry, from the project root.
More exactly, although the Nyx tagging UI does not expect the backend to be a Python project, the backend should be set up as a Poetry-managed virtual environment with the `taggingbackends` package installed as a dependency, so that the backend can be operated by calling `poetry run tagging-backend [train|predict|finetune]`, which in turn calls the above-mentioned Python scripts.
New in version 0.14, fine-tuning: `finetune_model.py` differs from `train_model.py` in that it takes an existing trained model and further trains it. In contrast, `train_model.py` trains a model from data only, or from a so-called pretrained model.
For example, MaggotUBA-adapter trains a classifier on top of a pretrained encoder.
In this particular backend, `train_model.py` picks a pretrained encoder in the `pretrained_models` directory and saves the resulting model (encoder+classifier) in the `models` directory. `finetune_model.py` instead picks a model from the `models` directory and saves the retrained model in `models` as well, under a different name (subdirectory).
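The difference between the two procedures can be summarized as a choice of source directory. The sketch below only illustrates the layout described above; it is not code from MaggotUBA-adapter, and the function name `resolve_model_dirs` and its arguments are hypothetical.

```python
from pathlib import Path


def resolve_model_dirs(backend_root, action, source_name, new_instance):
    """Return (source_dir, destination_dir) for a training action.

    Hypothetical helper: training starts from pretrained_models/,
    fine-tuning starts from models/; both save into models/.
    """
    root = Path(backend_root)
    if action == "train":
        source = root / "pretrained_models" / source_name
    elif action == "finetune":
        source = root / "models" / source_name
        if source_name == new_instance:
            raise ValueError("the retrained model must be saved under a different name")
    else:
        raise ValueError(f"unsupported action: {action!r}")
    return source, root / "models" / new_instance
```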
Note that the `pretrained_models` directory is included more for explanatory purposes.
It is not expected or checked for by the TaggingBackends logic, unlike all the other directories and scripts mentioned above. The `pretrained_models` directory was introduced by MaggotUBA-adapter.
See example scripts in the `examplebackend` directory.
Only `predict_model.py` is required by the Nyx tagging UI.
The simplest working directory structure for a tagging backend is:
├── models/
│   └── trained_model/
├── pyproject.toml
└── scripts/
    └── predict_model.py
with `trained_model` the name of the trained model.
Backends that do not need to store trained models should still have an empty subdirectory there, as these subdirectories in `models` are looked for by the Nyx tagger UI.
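For illustration, this minimal layout can be scaffolded with the standard library alone. The function name `make_minimal_backend` is hypothetical, and the files it creates are empty placeholders: a real backend still needs a valid `pyproject.toml` and a working `predict_model.py`.

```python
import tempfile
from pathlib import Path


def make_minimal_backend(root, model_name="trained_model"):
    """Create the minimal directory structure with empty placeholder files."""
    root = Path(root)
    # The (possibly empty) model subdirectory makes the model discoverable.
    (root / "models" / model_name).mkdir(parents=True, exist_ok=True)
    (root / "scripts").mkdir(exist_ok=True)
    (root / "pyproject.toml").touch()
    (root / "scripts" / "predict_model.py").touch()
    return root


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    backend = make_minimal_backend(Path(tmp) / "mybackend")
    created = sorted(p.relative_to(backend).as_posix() for p in backend.rglob("*"))
```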
The `data` directory is automatically created by the `BackendExplorer` object, together with its `raw` and `processed` subdirectories; therefore there is no need to include these directories in the backend.
The `tests` directory is renamed `test` for compatibility with Julia projects.
Python/Poetry do not need additional configuration to properly handle the tests.
Input and output data
By default, the input data will be copied into `data/raw`.
Input data files can be in any format, and a backend is responsible for handling these
various files.
By default, training labels are provided as JSON files that can be loaded using the `taggingbackends.data.labels.Labels` class.
Predicted labels are expected in `data/processed`.
A backend can make use of `data/interim` or ignore it.
Similarly, the trained models can be stored in `models` or not.
However, a subdirectory by the name of the model instance should be created for model
discovery (see below).
Full paths to data and model directories, and files, are made available in the training and prediction procedures with a `taggingbackends.explorer.BackendExplorer` object.
Labels
The internal representation is as follows:
- dictionary of run/assay identifiers (`str`, typically `date_time`) as keys and, as values:
  - dictionary of track/larva identifiers (`int`) as keys and, as values:
    - dictionary of timestamps (`float`) as keys and discrete behavioral states/actions (`str` or `list` of `str`) as values.
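The nested structure described above can be written out as a plain Python dictionary. The run identifier, track numbers, timestamps and behavior names below are made up for illustration, and `states_at` is a hypothetical helper, not part of the `Labels` class.

```python
# run/assay identifier -> track/larva identifier -> timestamp -> state(s)
labels = {
    "20230311_120000": {  # hypothetical date_time run identifier
        1: {0.0: "run", 0.1: "run", 0.2: ["cast", "hunch"]},
        2: {0.0: "stop"},
    }
}


def states_at(labels, run, track, timestamp):
    """Return the states at a time step as a list, whether stored as str or list."""
    state = labels[run][track][timestamp]
    return [state] if isinstance(state, str) else list(state)
```

Normalizing to a list spares downstream code from testing for both value types at every time step.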
Labels are encapsulated in a dedicated datatype that also stores metadata and information about labels (names, colors).
See `taggingbackends.data.labels.Labels`, and an example JSON labels file.
Model specification
A backend can train and use multiple model instances.
Each instance is assigned an identifier and the related files, including data files, are actually stored in corresponding subdirectories in the `data/raw`, `data/interim`, `data/processed` and `models` directories.
By default, a new model is identified with a timestamp in the `YYYYMMDD_HHMMSS` format.
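An identifier in this format can be produced with the standard `datetime` module. This is a minimal sketch; the function name `new_model_identifier` is hypothetical and the way TaggingBackends itself generates the identifier is not shown here.

```python
from datetime import datetime


def new_model_identifier(now=None):
    """Format a datetime (default: current local time) as YYYYMMDD_HHMMSS."""
    now = datetime.now() if now is None else now
    return now.strftime("%Y%m%d_%H%M%S")


# Fixed date used so the result is reproducible.
example = new_model_identifier(datetime(2023, 3, 11, 14, 5, 9))  # "20230311_140509"
```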
For example, the `taggingbackends.explorer.BackendExplorer.list_input_files` method will look for files in the `data/raw/<instance_identifier>` directory only.
A backend can store the model in the `models/<instance_identifier>` directory (please call the `taggingbackends.explorer.BackendExplorer.model_dir` method to get the exact location) or not.
However, the directory should be created to make the trained models discoverable.
Indeed, the Nyx tagging UI will look for subdirectories in the `models` directory to list the available trained models.
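The discovery rule can be mimicked in a few lines. This sketch only reproduces the behavior described above (one model per subdirectory of `models`); it is not the code of the Nyx tagging UI, and `list_trained_models` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def list_trained_models(models_dir):
    """Return the names of the subdirectories of models_dir, sorted."""
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return []
    # Plain files are ignored; only subdirectories count as model instances.
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models"
    (models / "20230311_140509").mkdir(parents=True)
    (models / "20230312_090000").mkdir()  # empty directories are listed too
    (models / "notes.txt").touch()
    found = list_trained_models(models)
```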
Recommended installation
TaggingBackends uses both Python and Julia. In particular, Python-side taggingbackends may call Julia-side TaggingBackends to compile the data from a data repository, prior to training a tagger.
The communication between the two language stacks requires Julia package TaggingBackends to be manually installed. Depending on where TaggingBackends is installed, a pointer to that location may be needed.
The simplest approach consists in doing so in the main Julia environment. In this case, you will have to explicitly install PlanarLarvae as well, e.g.:
julia -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/planarlarvae.jl"); Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'
This is enough for Python-side taggingbackends to find its Julia counterpart.
Another approach we recommend, so that your main Julia environment is not populated by packages you do not need in every circumstance, consists in installing TaggingBackends in an existing Julia environment, e.g. the one that accommodates the LarvaTagger package.
As a major inconvenience of this approach, the `JULIA_PROJECT` environment variable will have to be set whenever `tagging-backend train` is called.
The `JULIA_PROJECT` variable should be the absolute path to the directory associated with the environment that accommodates the TaggingBackends package.
If for example you have a local copy of the LarvaTagger.jl project, you can install TaggingBackends in the associated environment:
cd /path/to/LarvaTagger/
julia --project=. -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'
export JULIA_PROJECT=$(pwd)
The `export` expression will work in Unix-like OSes only. On Windows, the `JULIA_PROJECT` environment variable can be set in the same configuration panel as the user's `Path` variable.
Note also that, with the above `export` expression, the `JULIA_PROJECT` environment variable is set for the running command interpreter, and will have to be set again if the interpreter is closed and started again.
To install a backend, taking MaggotUBA-adapter as an example:
git clone --depth 1 --single-branch -b 20230311 https://gitlab.pasteur.fr/nyx/MaggotUBA-adapter MaggotUBA
cd MaggotUBA
JULIA_PROJECT=$(pwd) poetry install
You can check for message "PyCall is already installed" in the output of:
JULIA_PROJECT=$(pwd) poetry run python -c 'import julia; julia.install()'
Note that, if PyCall is not found, the above command will install it. However, TaggingBackends still needs to be installed for Python-side taggingbackends to successfully call Julia-side TaggingBackends.
So again, the `JULIA_PROJECT` environment variable must be set accordingly every time the `train` command is called, which can also be done by assigning the adequate absolute path to the variable on the same line as the command, immediately before the command.
For example, from the backend directory tree, on Unix-like OSes:
JULIA_PROJECT=<path> poetry run tagging-backend train
or, from LarvaTagger.jl root directory:
JULIA_PROJECT=<path> scripts/larvatagger.jl train
with `<path>` the absolute path to the Julia project/environment with TaggingBackends installed.
There is a known issue with `JULIA_PROJECT` not being properly propagated on calling `larvatagger.jl` when Julia was installed using juliaup.
Prefer jill instead of juliaup.
Note also that, on Linux, or macOS with coreutils installed, a relative path can be conveniently turned into an absolute path using the `realpath` command:
JULIA_PROJECT=$(realpath <path>)
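When the backend is driven from a Python script rather than from a shell, the same setup can be sketched with `os.path.realpath` and `subprocess`. The helper names `julia_env` and `run_train` are hypothetical; only the `poetry run tagging-backend train` command and the `JULIA_PROJECT` variable come from the instructions above.

```python
import os
import subprocess


def julia_env(julia_project, base_env=None):
    """Return a copy of the environment with JULIA_PROJECT set to an absolute path."""
    env = dict(os.environ if base_env is None else base_env)
    # realpath turns a relative path into an absolute one, like the shell command.
    env["JULIA_PROJECT"] = os.path.realpath(julia_project)
    return env


def run_train(julia_project, backend_dir="."):
    """Run `poetry run tagging-backend train` from the backend directory."""
    return subprocess.run(
        ["poetry", "run", "tagging-backend", "train"],
        cwd=backend_dir,
        env=julia_env(julia_project),
        check=True,  # raise CalledProcessError if training fails
    )
```

Passing the variable through `env` affects the child process only, so the calling interpreter's own environment is left unchanged.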