Commit 126032cf authored by Kenzo-Hugo Hillion

Merge branch '140-update-documentation' into 'dev'

Dev documentation to run metageneDB locally

Closes #140

See merge request !65
parents c9b0d9b3 2cc71fd9
Pipeline #33343 passed with stages
in 3 minutes and 35 seconds
......@@ -3,150 +3,15 @@
[![pipeline status](https://gitlab.pasteur.fr/metagenomics/metagenedb/badges/dev/pipeline.svg)](https://gitlab.pasteur.fr/metagenomics/metagenedb/commits/dev)
[![coverage report](https://gitlab.pasteur.fr/metagenomics/metagenedb/badges/dev/coverage.svg)](https://gitlab.pasteur.fr/metagenomics/metagenedb/commits/dev)
Django-based project to build a gene catalog and tools
to explore it and contact external services.
## The project
----
The main motivation behind MetageneDB is to support all analyses that are based on gene catalogs.
It is composed of both an API and a client side for visualization and interaction with the DB.
## Setup the services on your local machine
* Graphical interface to browse through the catalog
* REST API to programmatically query and retrieve information from the database
* (not implemented) Interface to perform analysis from gene counts present on the catalog
### Dependencies
## Wiki & Documentation
The application depends on several services that run independently in Docker containers, all
orchestrated by `docker-compose`.
Therefore, to run the application you need:
* `Docker` : [Install instructions](https://docs.docker.com/install/)
* `Docker Compose` : [Install instructions](https://docs.docker.com/compose/install/)
### Configuration
For `docker-compose`, you need to create a `.env` file: `touch .env`. An example is available: `.env.sample`.
The settings of the Django server are based on the `backend/.env` file. You can copy the sample file
(`cp backend/.env.sample backend/.env`) and fill in the variables.
You can of course customize more of the Django server settings in the `settings` module of metagenedb.
The following sections go through the different parts.
#### Secret key
This is the Django `SECRET_KEY` and you need to specify your own. For instance, you can run
`openssl rand -base64 32` to generate one from the command line.
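If `openssl` is not at hand, a comparable key can be produced from Python's standard library. This is only a sketch of an alternative to the command above; the helper name `generate_secret_key` is made up for illustration and is not part of metagenedb.

```python
import base64
import secrets


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return num_bytes of OS-provided randomness, base64-encoded.

    Hypothetical helper: it mirrors `openssl rand -base64 32`.
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


if __name__ == "__main__":
    # Paste the printed line into backend/.env
    print(f"SECRET_KEY={generate_secret_key()}")
```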
#### Create your own DB on postgresql
The following variables have the default value:
```bash
DATABASE_HOST=postgresql
DATABASE_USER=postgres
DATABASE_NAME=postgres
DATABASE_PASSWORD=""
DATABASE_PORT=5432
```
It will work if you leave these values as they are, but a default database without credentials
may expose you to security issues.
We recommend creating your own database. One way to do it is described here. You first need to
run the db image and identify its container ID:
```bash
khillion:~/metagenedb $ docker-compose up -d postgresql # This runs only the postgresql service of your docker-compose file in detached mode (-d)
Creating postgresql ... done
khillion:~/metagenedb $ docker ps # List your running docker images
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5002f210f9d8 postgres:11.4-alpine "docker-entrypoint.s…" 1 minute ago Up 1 minute 0.0.0.0:5433->5432/tcp postgresql
```
Now that you have the `CONTAINER ID` (here `5002f210f9d8`), you can run a `bash` terminal in this container and
create your own database:
```bash
khillion:~/metagenedb $ docker exec -it 5002f210f9d8 bash
bash-5.0# psql --user=postgres
```
This will open the `SQL` console where you can do what you need:
```psql
CREATE ROLE metagenedb WITH LOGIN PASSWORD 'yourawesomepassword';
ALTER ROLE metagenedb WITH CREATEDB;
CREATE DATABASE metagenedb WITH OWNER metagenedb;
exit
```
You now have your own database, protected by a password, and you need to update your `.env`:
```bash
DATABASE_HOST=postgresql
DATABASE_USER=metagenedb
DATABASE_NAME=metagenedb
DATABASE_PASSWORD=yourawesomepassword
DATABASE_PORT=5432
```
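For orientation, the variables above could be turned into a Django `DATABASES` entry along the following lines. This is a sketch under assumptions: the function name `database_settings` is invented here, and the project's `settings` module may load `backend/.env` differently.

```python
import os

# Defaults copied from the sample values listed earlier in this section.
DEFAULTS = {
    "DATABASE_HOST": "postgresql",
    "DATABASE_USER": "postgres",
    "DATABASE_NAME": "postgres",
    "DATABASE_PASSWORD": "",
    "DATABASE_PORT": "5432",
}


def database_settings(environ=None):
    """Build a Django-style database dict from environment variables.

    Hypothetical helper, not the project's actual settings code.
    """
    environ = os.environ if environ is None else environ
    values = {key: environ.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "ENGINE": "django.db.backends.postgresql",
        "HOST": values["DATABASE_HOST"],
        "USER": values["DATABASE_USER"],
        "NAME": values["DATABASE_NAME"],
        "PASSWORD": values["DATABASE_PASSWORD"],
        "PORT": int(values["DATABASE_PORT"]),
    }
```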
> **Note**: The default port for postgres is `5432`. In the `docker-compose.yaml` you will notice that this
port is redirected to `5433` on the `localhost`. This is done in order to not interfere with your local
postgres if you have one. This means you need to change `DATABASE_HOST` to `localhost` and `DATABASE_POS
### Pre-computed statistics
Some statistics about genes are pre-computed and can be accessed through the `/api/catalog/v1/statistics` endpoint.
The ID is constructed with the following format: `<statistics-type>-<gene_source>-<method>-<options>`.
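As an illustration of that format, the four parts can simply be joined with hyphens. The helper name and the example values below are assumptions made for this sketch; only the gene sources `all`, `virgo` and `igc` appear elsewhere in this commit.

```python
def build_statistics_id(stats_type, gene_source, method, options):
    """Join the four components with '-' (hypothetical helper)."""
    return "-".join([stats_type, gene_source, method, options])


# Example values are illustrative only, not real IDs from the API.
example_id = build_statistics_id("counts", "igc", "taxonomy", "phylum")
print(example_id)
```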
----
## Run the application
For the moment, only the `docker-compose.dev.yaml` is used. To run the application simply run the command:
```bash
docker-compose up --build
```
The `--build` option is only necessary during the first usage or when you make changes that need the docker
container to be built again.
Since the directories containing the source code are mounted in the containers, changes you make locally should be
directly reflected in the application.
### Populate the database
You have a set of scripts available within the `backend/scripts` directory that you can execute directly
from within the container. First identify the container ID corresponding to the backend with the `docker ps` command. Then you can open a bash terminal within the container and execute the scripts you want:
```bash
docker exec -it YOURCONTAINER_ID bash
root@YOURCONTAINER_ID:/code# python scripts/script.py
```
For the moment you can:
* Import all KEGG orthologies with `load_kegg.py`: it directly fetches all KEGG KOs from the KEGG REST API.
* Import genes from the IGC catalog from the [annotation file](ftp://ftp.cngb.org/pub/SciRAID/Microbiome/humanGut_9.9M/GeneAnnotation/IGC.annotation_OF.summary.gz). You can find a small part of this annotation file in the `dev_data` folder.
> **Note**: You can also execute the scripts locally, from a `pipenv shell` for instance. In that case, make
sure that you change how you connect to postgres, since access from your machine differs from access
from a container.
-----
## Dev tips
#### Profiling code
```python
from metagenedb.common.utils.profiling import profile


@profile("/my/file/path")
def my_function(a, b, c):
    ...
```
```bash
snakeviz /my/file/path
```
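The source of `metagenedb.common.utils.profiling.profile` is not shown in this commit. A decorator with the same call shape could be sketched with the standard `cProfile` module as below; this is an assumption about how such a decorator may work, not the project's implementation. The stats file it writes is in the format `snakeviz` reads.

```python
import cProfile
import functools


def profile(output_path):
    """Decorator factory: write cProfile stats of each call to output_path.

    Illustrative sketch only; the real metagenedb decorator may differ.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            try:
                # runcall profiles the call and returns the function's result
                return profiler.runcall(func, *args, **kwargs)
            finally:
                # Overwrites the file on every call
                profiler.dump_stats(output_path)
        return wrapper
    return decorator
```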
\ No newline at end of file
For more information, please have a look at our [Wiki](https://gitlab.pasteur.fr/metagenomics/metagenedb/-/wikis/MetageneDB)
......@@ -6,3 +6,4 @@ DATABASE_NAME=postgres
DATABASE_PASSWORD=""
DATABASE_PORT=5432
SECRET_KEY=YOUR_KEY
DB_LOG_LEVEL=INFO
from factory import DjangoModelFactory, fuzzy
from faker import Factory
from factory import DjangoModelFactory, Faker, fuzzy
from metagenedb.apps.catalog import models
from .fuzzy_base import FuzzyLowerText
faker = Factory.create()
SELECTED_SOURCE = [i[0] for i in models.Function.SOURCE_CHOICES]
EGGNOG_VERSIONS = [i[0] for i in models.EggNOG.VERSION_CHOICES]
class BaseFunctionFactory(DjangoModelFactory):
function_id = FuzzyLowerText(prefix='function-', length=15)
class FunctionFactory(BaseFunctionFactory):
class FunctionFactory(DjangoModelFactory):
class Meta:
model = models.Function
function_id = Faker('bothify', text='function-####')
source = fuzzy.FuzzyChoice(SELECTED_SOURCE)
function_id = FuzzyLowerText(prefix='function-', length=15)
class EggNOGFactory(BaseFunctionFactory):
class EggNOGFactory(DjangoModelFactory):
function_id = Faker('bothify', text='COG####')
name = Faker('bothify', text='COG-????????')
class Meta:
model = models.EggNOG
version = fuzzy.FuzzyChoice(EGGNOG_VERSIONS)
class KeggOrthologyFactory(BaseFunctionFactory):
class KeggOrthologyFactory(DjangoModelFactory):
function_id = Faker('bothify', text='K0####')
class Meta:
model = models.KeggOrthology
......
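The new factories rely on Faker's `bothify` provider, which replaces each `#` with a random digit and each `?` with a random letter. The standard-library sketch below only emulates that behaviour to show what the patterns above expand to; `bothify_sketch` is a made-up name, not Faker's API.

```python
import random
import string


def bothify_sketch(text, rng=random):
    """Emulate bothify: '#' -> random digit, '?' -> random ASCII letter."""
    out = []
    for char in text:
        if char == "#":
            out.append(rng.choice(string.digits))
        elif char == "?":
            out.append(rng.choice(string.ascii_letters))
        else:
            out.append(char)  # literal characters are kept as-is
    return "".join(out)


print(bothify_sketch("K0####"))        # 'K0' followed by four digits
print(bothify_sketch("COG-????????"))  # 'COG-' followed by eight letters
```

Note that a pattern such as `K0####` can yield at most 10,000 distinct values, so large batches may repeat IDs.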
from factory import (
DjangoModelFactory, RelatedFactory, SubFactory, fuzzy
DjangoModelFactory, Faker, RelatedFactory, SubFactory, fuzzy
)
from faker import Factory
from metagenedb.apps.catalog import models
from .fuzzy_base import FuzzyLowerText
from .function import FunctionFactory
from .function import FunctionFactory, KeggOrthologyFactory, EggNOGFactory
from .taxonomy import TaxonomyFactory
faker = Factory.create()
GENE_SOURCES = [i[0] for i in models.Gene.SOURCE_CHOICES]
......@@ -18,9 +14,9 @@ class GeneFactory(DjangoModelFactory):
class Meta:
model = models.Gene
gene_id = FuzzyLowerText(prefix='gene-', length=15)
name = fuzzy.FuzzyText(prefix='name-', length=15)
length = fuzzy.FuzzyInteger(200, 10000)
gene_id = Faker('bothify', text='gene-?#?#??-#??#?#')
name = Faker('bothify', text='Gene_name-##-????')
length = Faker('pyint', min_value=200, max_value=4200)
source = fuzzy.FuzzyChoice(GENE_SOURCES)
......@@ -36,9 +32,17 @@ class GeneFunctionFactory(DjangoModelFactory):
function = SubFactory(FunctionFactory)
class GeneKeggFactory(GeneFunctionFactory):
function = SubFactory(KeggOrthologyFactory)
class GeneEggNOGFactory(GeneFunctionFactory):
function = SubFactory(EggNOGFactory)
class GeneWithKeggFactory(GeneFactory):
kegg = RelatedFactory(GeneFunctionFactory, 'gene', function__source='kegg')
kegg = RelatedFactory(GeneKeggFactory, 'gene')
class GeneWithEggNOGFactory(GeneFactory):
eggnog = RelatedFactory(GeneFunctionFactory, 'gene', function__source='eggnog')
eggnog = RelatedFactory(GeneEggNOGFactory, 'gene')
from collections import OrderedDict
from factory import DjangoModelFactory, fuzzy
from faker import Factory
......@@ -25,44 +27,48 @@ class DbGenerator:
self.created_ids = set() # store already created IDs to skip them
def generate_db_from_tree(self, tree):
"""
Tree need to be an OrderedDict from higher to lower level
"""
self.last_tax = None
for rank, desc in tree.items():
if desc['tax_id'] not in self.created_ids:
TaxonomyFactory.create(
self.last_tax = TaxonomyFactory.create(
tax_id=desc['tax_id'],
name=desc['name'],
rank=rank,
parent=getattr(self, "last_tax", None)
)
self.created_ids.add(desc['tax_id'])
self.last_tax.build_hierarchy()
def _generate_lactobacillus_db(db_generator):
"""
Generate db with few ranks corresponding to Lactobacillus genus
"""
tree = {
"class": {"name": "Bacilli", "tax_id": "91061"},
"genus": {"name": "Lactobacillus", "tax_id": "1578"},
"order": {"name": "Lactobacillales", "tax_id": "186826"},
"family": {"name": "Lactobacillaceae", "tax_id": "33958"},
"phylum": {"name": "Firmicutes", "tax_id": "1239"},
"no_rank": {"name": "cellular organisms", "tax_id": "131567"},
"superkingdom": {"name": "Bacteria", "tax_id": "2"},
"species_group": {"name": "Lactobacillus casei group", "tax_id": "655183"}
}
tree = OrderedDict()
tree['no_rank'] = {"name": "root", "tax_id": "1"}
tree["superkingdom"] = {"name": "Bacteria", "tax_id": "2"}
tree["phylum"] = {"name": "Firmicutes", "tax_id": "1239"}
tree["class"] = {"name": "Bacilli", "tax_id": "91061"}
tree["order"] = {"name": "Lactobacillales", "tax_id": "186826"}
tree["family"] = {"name": "Lactobacillaceae", "tax_id": "33958"}
tree["genus"] = {"name": "Lactobacillus", "tax_id": "1578"}
tree["species_group"] = {"name": "Lactobacillus casei group", "tax_id": "655183"}
db_generator.generate_db_from_tree(tree)
def _generate_escherichia_db(db_generator):
tree = {
"class": {"name": "Gammaproteobacteria", "tax_id": "1236"},
"genus": {"name": "Escherichia", "tax_id": "561"},
"order": {"name": "Enterobacterales", "tax_id": "91347"},
"family": {"name": "Enterobacteriaceae", "tax_id": "543"},
"phylum": {"name": "Proteobacteria", "tax_id": "1224"},
"no_rank": {"name": "cellular organisms", "tax_id": "131567"},
"species": {"name": "Escherichia coli", "tax_id": "562"},
"superkingdom": {"name": "Bacteria", "tax_id": "2"}
}
tree = OrderedDict()
tree["no_rank"] = {"name": "root", "tax_id": "1"}
tree["superkingdom"] = {"name": "Bacteria", "tax_id": "2"}
tree["phylum"] = {"name": "Proteobacteria", "tax_id": "1224"}
tree["class"] = {"name": "Gammaproteobacteria", "tax_id": "1236"}
tree["order"] = {"name": "Enterobacterales", "tax_id": "91347"}
tree["family"] = {"name": "Enterobacteriaceae", "tax_id": "543"}
tree["genus"] = {"name": "Escherichia", "tax_id": "561"}
tree["species"] = {"name": "Escherichia coli", "tax_id": "562"}
db_generator.generate_db_from_tree(tree)
......
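The switch to an `OrderedDict` filled from the highest to the lowest rank matters because each entry becomes the parent of the next one. The sketch below reproduces that idea with a plain dataclass instead of `TaxonomyFactory`; `Node` and `build_chain` are hypothetical names. Unlike the diff (whose indentation is lost in this view), the sketch also reuses an already created node as the parent when a second tree shares its upper ranks, which is an assumption about the intended behaviour.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    tax_id: str
    name: str
    rank: str
    parent: Optional["Node"] = None


def build_chain(tree, created):
    """Walk the tree from the highest rank down, linking each node to the previous one."""
    last = None
    for rank, desc in tree.items():
        if desc["tax_id"] not in created:
            created[desc["tax_id"]] = Node(desc["tax_id"], desc["name"], rank, parent=last)
        # Reuse existing nodes so shared upper ranks stay connected
        last = created[desc["tax_id"]]
    return last


created = {}
tree = OrderedDict()
tree["no_rank"] = {"name": "root", "tax_id": "1"}
tree["superkingdom"] = {"name": "Bacteria", "tax_id": "2"}
tree["phylum"] = {"name": "Firmicutes", "tax_id": "1239"}
leaf = build_chain(tree, created)
print(leaf.name, "->", leaf.parent.name, "->", leaf.parent.parent.name)
```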
import logging
from random import randint
from django.core.management.base import BaseCommand
from metagenedb.apps.catalog.factory import GeneFactory, GeneWithEggNOGFactory, GeneWithKeggFactory
from metagenedb.apps.catalog.factory.taxonomy import generate_simple_db as gen_tax_db
from metagenedb.apps.catalog.models import (
Gene, Function, Taxonomy
)
from metagenedb.apps.catalog.management.commands.compute_stats import (
ComputeStatistics, ComputeCounts, ComputeGeneLength, ComputeTaxonomyRepartition, ComputeTaxonomyPresence
)
logging.basicConfig(format='[%(asctime)s] %(levelname)s:%(name)s:%(message)s')
logger = logging.getLogger()
def empty_db():
Gene.objects.all().delete()
Taxonomy.objects.all().delete()
Function.objects.all().delete()
def create_taxonomy_db():
Taxonomy.objects.all().delete()
gen_tax_db()
def create_genes_db():
Gene.objects.all().delete()
GeneFactory.create_batch(50)
GeneWithEggNOGFactory.create_batch(15)
GeneWithKeggFactory.create_batch(12)
for tax in Taxonomy.objects.all():
GeneFactory.create_batch(randint(1, 10), taxonomy=tax)
GeneWithEggNOGFactory.create(taxonomy=tax)
GeneWithKeggFactory.create(taxonomy=tax)
def compute_stats():
ComputeStatistics('all').clean_db()
for gene_source in ['all', 'virgo', 'igc']:
ComputeCounts(gene_source).all()
ComputeGeneLength(gene_source).all()
ComputeTaxonomyRepartition(gene_source).all()
ComputeTaxonomyPresence(gene_source).all()
def create_small_db():
empty_db()
create_taxonomy_db()
create_genes_db()
compute_stats()
class Command(BaseCommand):
help = 'Create a light DB with random items to illustrate functionnalities of the application.'
def set_logger_level(self, verbosity):
if verbosity > 2:
logger.setLevel(logging.DEBUG)
elif verbosity > 1:
logger.setLevel(logging.INFO)
def handle(self, *args, **options):
self.set_logger_level(int(options['verbosity']))
create_small_db()
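The verbosity handling in `set_logger_level` can be read as a small mapping from Django's `--verbosity` value to a logging level. The standalone function below is a sketch of that mapping; the name `level_for_verbosity` is invented, and the `WARNING` fallback reflects the root logger's default level rather than anything set explicitly in the command.

```python
import logging


def level_for_verbosity(verbosity):
    """Mirror the thresholds used in set_logger_level (hypothetical helper)."""
    if verbosity > 2:
        return logging.DEBUG
    if verbosity > 1:
        return logging.INFO
    # The command leaves the root logger untouched here; its default is WARNING
    return logging.WARNING


for verbosity in (0, 1, 2, 3):
    print(verbosity, logging.getLevelName(level_for_verbosity(verbosity)))
```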
......@@ -62,12 +62,13 @@ class Taxonomy(models.Model):
Build and save parental hierarchy for an entry
"""
hierarchy = {}
if self.name != 'root' and self.parent is not None:
if self.name != 'root':
hierarchy[self.rank] = {
'tax_id': self.tax_id,
'name': self.name
}
hierarchy = {**hierarchy, **getattr(self.parent, 'hierarchy', self.parent.build_hierarchy())}
if self.parent is not None:
hierarchy = {**hierarchy, **getattr(self.parent, 'hierarchy', self.parent.build_hierarchy())}
self.hierarchy = hierarchy
self.save()
return hierarchy
......
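The reworked `build_hierarchy` separates two concerns: a non-root entry adds itself, and any entry with a parent merges in the parent's hierarchy. A Django-free sketch of that logic follows; `Taxon` is a stand-in class, not the real model, and it does not call `save()`. One point worth knowing: in `getattr(obj, 'hierarchy', default)`, Python evaluates the default expression before the call, so the parent's `build_hierarchy()` runs every time. The sketch avoids that with an explicit check.

```python
class Taxon:
    """Minimal stand-in for the Taxonomy model (illustrative only)."""

    def __init__(self, tax_id, name, rank, parent=None):
        self.tax_id = tax_id
        self.name = name
        self.rank = rank
        self.parent = parent
        self.hierarchy = None

    def build_hierarchy(self):
        hierarchy = {}
        if self.name != "root":
            hierarchy[self.rank] = {"tax_id": self.tax_id, "name": self.name}
        if self.parent is not None:
            parent_hierarchy = self.parent.hierarchy
            if parent_hierarchy is None:
                # Only recurse when the parent has not been built yet
                parent_hierarchy = self.parent.build_hierarchy()
            hierarchy = {**hierarchy, **parent_hierarchy}
        self.hierarchy = hierarchy
        return hierarchy


root = Taxon("1", "root", "no_rank")
bacteria = Taxon("2", "Bacteria", "superkingdom", parent=root)
firmicutes = Taxon("1239", "Firmicutes", "phylum", parent=bacteria)
print(firmicutes.build_hierarchy())
```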
......@@ -297,12 +297,12 @@ class TestBuildTaxoMapping(APITestCase):
self.assertDictEqual(self.import_igc_genes.phylum_mapping, expected_phylum_dict)
class TestBuildBuildFunctionCatalog(APITestCase):
class TestBuildFunctionCatalog(APITestCase):
@classmethod
def setUpTestData(cls):
cls.keggs = KeggOrthologyFactory.create_batch(100)
cls.eggnogs = EggNOGFactory.create_batch(100)
cls.keggs = KeggOrthologyFactory.create_batch(10)
cls.eggnogs = EggNOGFactory.create_batch(10)
def setUp(self):
self.import_igc_genes = ImportIGCGenes('test', 'test_url', 'test_token')
......