TaggingBackends
This library helps to implement automatic tagging backends for the Nyx larva tagger.
Template project for tagging backends
A tagging backend, called e.g. TaggingBackend, is a Python project with the following directory structure:
├── LICENSE                    <- Default is MIT.
├── README.md                  <- Project description.
├── data
│   ├── raw                    <- The input data are accessible from this
│   │                             directory, with their original file structure.
│   ├── interim                <- Preprocessed data and extracted features
│   │                             can be stored in this directory.
│   └── processed              <- Predicted labels from predict_model.py are
│                                 expected in this directory.
├── models                     <- Hyperparameters and weights of trained
│                                 classifiers can be stored here.
│
├── pyproject.toml             <- Project definition file for Poetry.
├── src
│   └── taggingbackend         <- Python package name.
│       │                         Same as project name, with lowercase letters,
│       │                         hyphens converted into underscores.
│       │                         For example, `My-Tagger` becomes `my_tagger`.
│       ├── __init__.py        <- Defines variable `__version__`.
│       ├── data
│       │   └── make_dataset.py    <- Picks and converts raw data files; if files
│       │                             are to be written, they go into data/interim;
│       │                             optional.
│       ├── features
│       │   └── build_features.py  <- Extracts and saves features to file in
│       │                             data/interim; optional.
│       └── models
│           ├── train_model.py     <- Trains the behavior tagging algorithm and
│           │                         stores the trained model in models/;
│           │                         optional.
│           └── predict_model.py   <- Loads the trained model and features from
│                                     data/interim, and moves the resulting
│                                     labels into data/processed.
└── test
    ├── __init__.py            <- Empty file.
    └── test_taggingbackend.py <- Automated tests; optional.
                                  Filename is `test_<package_name>.py`.
The above structure borrows elements from the Cookiecutter Data Science project template, adapted for use with Poetry.
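The package naming rule given in the tree (lowercase letters, hyphens converted into underscores) can be sketched in a few lines of Python. The function name package_name is only illustrative and is not part of TaggingBackends:

```python
def package_name(project_name: str) -> str:
    """Derive the Python package name from a project name.

    Applies the rule stated above: lowercase letters,
    hyphens converted into underscores.
    """
    return project_name.lower().replace("-", "_")


print(package_name("My-Tagger"))  # prints: my_tagger
```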
The src/<package_name>/{data,features,models} directories can accommodate Python modules (in subpackages <package_name>.{data,features,models} respectively).
For example, the model can be implemented as a Python class in an additional file in src/<package_name>/models, e.g. mymodel.py.
In this case, an empty __init__.py file should be created in the same directory.
As the Python package is installed, this custom module will be loadable from anywhere with import <package_name>.models.mymodel.
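As an illustration, here is a minimal sketch of what such a mymodel.py module could contain. The class name MyModel and its fit and predict methods are hypothetical; TaggingBackends does not require this interface, and a real tagger would replace the toy majority-label logic with an actual classifier.

```python
# Hypothetical content for src/<package_name>/models/mymodel.py
from collections import Counter


class MyModel:
    """Toy tagger that always predicts the most frequent training label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, labels):
        # labels: iterable of behavioral state names (str)
        self.majority_label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, timestamps):
        # Return one predicted label per timestamp
        return [self.majority_label for _ in timestamps]


model = MyModel().fit(["run", "run", "cast"])
print(model.predict([0.0, 0.1]))  # prints: ['run', 'run']
```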
On the other hand, make_dataset.py, build_features.py, predict_model.py and train_model.py are Python scripts, each with a main program.
These scripts are run using Poetry, from the project root.
See the example scripts in the examplebackend directory.
Only predict_model.py is required by the Nyx tagging UI.
The simplest working directory structure for a tagging backend is:
├── models/
│   └── trained_model/
├── pyproject.toml
└── scripts/
    └── predict_model.py
with trained_model the name of the trained model.
Backends that do not need to store trained models should still have an empty subdirectory there, as the Nyx tagger UI looks for these subdirectories in models.
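The minimal layout above can be created with a short script. This is only a sketch: the function name scaffold_backend and the project name MyBackend are made up for the example, and the generated pyproject.toml and predict_model.py are empty placeholders that still need real content.

```python
from pathlib import Path
import tempfile


def scaffold_backend(root: Path, model_name: str = "trained_model") -> None:
    """Create the minimal directory structure of a tagging backend."""
    # Empty model subdirectory, so that the model can be discovered
    (root / "models" / model_name).mkdir(parents=True, exist_ok=True)
    (root / "scripts").mkdir(parents=True, exist_ok=True)
    # Placeholder files, to be filled in by the backend author
    (root / "pyproject.toml").touch()
    (root / "scripts" / "predict_model.py").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "MyBackend"
    scaffold_backend(root)
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
    print(created)
```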
The data directory is automatically created by the BackendExplorer object, together with its raw and processed subdirectories, so there is no need to include these directories in the backend.
Although the Nyx tagger UI does not expect the project to include a Python package, a Poetry-managed virtual environment should be set up with the taggingbackends package installed, so that the command poetry run tagging-backend is available at the project root directory.
The tests directory is renamed test for compatibility with Julia projects.
Python/Poetry do not need additional configuration to properly handle the tests.
Input and output data
By default, the input data are copied into data/raw.
Input data files can be in any format, and a backend is responsible for handling these
various files.
By default, training labels are provided as JSON files that can be loaded using the taggingbackends.labels.Labels class.
Predicted labels are expected in data/processed.
A backend can make use of data/interim or ignore it.
Similarly, the trained models can be stored in models or not.
However, a subdirectory by the name of the model instance should be created for model
discovery (see below).
Full paths to the data and model directories and files are made available to the training and prediction procedures by a taggingbackends.explorer.BackendExplorer object.
Labels
The internal representation is as follows:
- dictionary of run identifiers (str, typically date_time) as keys and, as values:
  - dictionary of larva identifiers (int) as keys and, as values:
    - dictionary of timestamps (float) as keys and discrete behavioral states (str) as values.
Labels are encapsulated in a dedicated datatype that also stores metadata and information about labels (names, colors).
See taggingbackends.data.labels.Labels, and an example JSON labels file.
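The nested-dictionary representation described above can be pictured with plain Python dictionaries. The sketch below does not use the Labels class itself; the run identifier, larva identifiers and state names ("run", "cast", "stop") are made-up values, and the helpers state_at and count_states are hypothetical.

```python
from collections import Counter

# run identifier (str) -> larva identifier (int) -> timestamp (float) -> state (str)
labels = {
    "20230111_120000": {
        1: {0.0: "run", 0.1: "run", 0.2: "cast"},
        2: {0.0: "stop", 0.1: "run"},
    },
}


def state_at(labels, run, larva, timestamp):
    """Look up the behavioral state of one larva at one time step."""
    return labels[run][larva][timestamp]


def count_states(labels):
    """Count how many time steps are tagged with each state, across all runs."""
    counts = Counter()
    for larvae in labels.values():
        for track in larvae.values():
            counts.update(track.values())
    return dict(counts)


print(state_at(labels, "20230111_120000", 1, 0.2))  # prints: cast
print(count_states(labels))
```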
Model specification
A backend can train and use multiple model instances.
Each instance is assigned an identifier, and the related files, including data files, are stored in corresponding subdirectories in the data/raw, data/interim, data/processed and models directories.
By default, a new model is identified with a timestamp in the YYYYMMDD_HHMMSS format.
For example, the taggingbackends.explorer.BackendExplorer.list_input_files method will look for files in the data/raw/<instance_identifier> directory only.
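To make the naming scheme concrete, the following sketch generates an identifier in the YYYYMMDD_HHMMSS format and derives the per-instance subdirectories listed above. It is not the BackendExplorer implementation; the function names new_instance_identifier and instance_dirs, and the project root "backend", are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path


def new_instance_identifier(now=None):
    """Return a timestamp identifier in the YYYYMMDD_HHMMSS format."""
    now = datetime.now() if now is None else now
    return now.strftime("%Y%m%d_%H%M%S")


def instance_dirs(project_root, instance_identifier):
    """Map each storage location to the subdirectory of a model instance."""
    root = Path(project_root)
    return {
        "raw": root / "data" / "raw" / instance_identifier,
        "interim": root / "data" / "interim" / instance_identifier,
        "processed": root / "data" / "processed" / instance_identifier,
        "models": root / "models" / instance_identifier,
    }


identifier = new_instance_identifier(datetime(2023, 1, 11, 14, 30, 5))
print(identifier)  # prints: 20230111_143005
for name, path in instance_dirs("backend", identifier).items():
    print(name, path.as_posix())
```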
A backend can store the model in the models/<instance_identifier> directory (please call the taggingbackends.explorer.BackendExplorer.model_dir method to get the exact location) or not.
However, the directory should be created to make the trained models discoverable.
Indeed, the Nyx tagging UI will look for subdirectories in the models directory to list the available trained models.
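The discovery mechanism described above amounts to listing the subdirectories of models. The sketch below mimics that behavior with pathlib; it is not the code used by the Nyx tagging UI, and the function name list_trained_models is hypothetical.

```python
from pathlib import Path
import tempfile


def list_trained_models(project_root):
    """List model instances as the names of the subdirectories of models/."""
    models_dir = Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(p.name for p in models_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models" / "20230111_143005").mkdir(parents=True)
    (root / "models" / "20230112_090000").mkdir()  # may stay empty
    (root / "models" / "notes.txt").touch()  # plain files are ignored
    available = list_trained_models(root)
    print(available)
```

Note that an empty subdirectory is enough for a model instance to be listed, which is why a backend that stores nothing in models should still create it.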
Recommended installation
TaggingBackends uses both Python and Julia. In particular, Python-side taggingbackends may call Julia-side TaggingBackends to compile the data from a data repository, prior to training a tagger.
The communication between the two language stacks requires Julia package TaggingBackends to be manually installed. Depending on where TaggingBackends is installed, a pointer to that location may be needed.
The simplest approach consists in doing so in the main Julia environment. In this case, you will have to explicitly install PlanarLarvae as well:
julia -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/planarlarvae.jl"); Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'
This is enough for Python-side taggingbackends to find its Julia counterpart.
Another approach we recommend, if you have LarvaTagger.jl installed, consists in installing TaggingBackends in the Julia environment associated with LarvaTagger, and setting the JULIA_PROJECT environment variable so that it points to the directory associated with this environment.
This directory is typically the root directory of your local copy of LarvaTagger.jl, if you installed LarvaTagger using the julia --project=. command.
To do so, on Unix-like OSes:
cd /path/to/LarvaTagger.jl/
export JULIA_PROJECT=$(pwd)
julia --project=. -e 'using Pkg; Pkg.add(url="https://gitlab.pasteur.fr/nyx/TaggingBackends")'
The export statement above sets the JULIA_PROJECT environment variable for the current shell session only.
To install a backend, taking MaggotUBA-adapter as an example:
git clone --depth 1 --single-branch -b 20230111 https://gitlab.pasteur.fr/nyx/MaggotUBA-adapter MaggotUBA
cd MaggotUBA
poetry install
You can check for the message "PyCall is already installed" in the output of:
poetry run python -c 'import julia; julia.install()'
Note that, if PyCall is not found, the above command will install it. However, TaggingBackends still needs to be installed for Python-side taggingbackends to successfully call Julia-side TaggingBackends.
A major drawback of this second approach is that the JULIA_PROJECT environment variable must be set accordingly every time the train command is called.
For example, from the backend directory tree, on Unix-like OSes:
JULIA_PROJECT=<path> poetry run tagging-backend train
or, from LarvaTagger.jl root directory:
JULIA_PROJECT=<path> scripts/larvatagger.jl train
with <path> the path of the Julia project with TaggingBackends installed.
Note however that the last command above will not work if Julia was installed using juliaup. Prefer jill.
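When the train command is launched from a Python script rather than from a shell, JULIA_PROJECT can be set for that single call in the same way. The sketch below is an assumption-laden illustration: the helper names train_env and run_train and the example paths are hypothetical, and run_train only wraps the poetry run tagging-backend train command shown above.

```python
import os
import subprocess


def train_env(julia_project, base_env=None):
    """Return a copy of the environment with JULIA_PROJECT set."""
    env = dict(os.environ if base_env is None else base_env)
    env["JULIA_PROJECT"] = str(julia_project)
    return env


def run_train(julia_project, backend_dir):
    """Run the train command from the backend directory tree."""
    return subprocess.run(
        ["poetry", "run", "tagging-backend", "train"],
        cwd=backend_dir,
        env=train_env(julia_project),
        check=True,  # raise an error if the command fails
    )


# Only the environment is built here; run_train is not called.
env = train_env("/path/to/LarvaTagger.jl", base_env={"PATH": "/usr/bin"})
print(env["JULIA_PROJECT"])  # prints: /path/to/LarvaTagger.jl
```

Copying the environment instead of modifying os.environ keeps the setting local to the subprocess, which mirrors the JULIA_PROJECT=<path> prefix used on the command line.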