BioMAJ3
=====

This project is a complete rewrite of BioMAJ (http://biomaj.genouest.org).

BioMAJ (BIOlogie Mise A Jour) is a workflow engine dedicated to data
synchronization and processing. The software automates the update cycle and the
supervision of the locally mirrored databank repository.

Common usages are to download remote databanks (GenBank for example) and apply
some transformations (blast indexing, emboss indexing, ...). Any script can be
applied to the downloaded data. When all treatments have been applied
successfully, the bank is put in "production" in a dedicated release directory.
With cron tasks, updates can be executed at regular intervals; data are
downloaded again only if a change is detected.

More documentation is available on the wiki page.

BioMAJ is compatible with Python 2 and 3.

Getting started
===============

Edit the global.properties file to match your settings. The minimal configuration covers the database connection and the directories.

    biomaj-cli.py -h

    biomaj-cli.py --config global.properties --status

    biomaj-cli.py --config global.properties  --bank alu --update
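
For scheduled updates (for example from a cron task), the update command above can be wrapped in a small script. The following is a minimal sketch and not part of BioMAJ: the function names `build_update_command` and `update_bank` are hypothetical, and only the `--config`, `--bank` and `--update` options shown above are used.

```python
import subprocess


def build_update_command(bank, config="global.properties"):
    """Return the biomaj-cli.py argument list for updating one bank."""
    return ["biomaj-cli.py", "--config", config, "--bank", bank, "--update"]


def update_bank(bank, config="global.properties"):
    """Run the update and return the CLI exit code (hypothetical helper)."""
    return subprocess.call(build_update_command(bank, config))


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_update_command("alu")))
```

Calling `update_bank("alu")` from a scheduled job would run the same command as the last line of the listing above, assuming `biomaj-cli.py` is on the `PATH`.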

Migration
=========

To migrate from BioMAJ 1.x, a script is available at:
https://github.com/genouest/biomaj-migrate. The script imports the old database into
the new database and updates the configuration files to the modified format. The data directory stays the same.

Migration from 3.0 to 3.1:

BioMAJ 3.1 provides an optional microservice architecture that makes it possible to separate, distribute and scale BioMAJ components over one or many hosts. This architecture is optional but recommended for server installations. A monolithic installation can be kept for a local computer installation.
Because the BioMAJ code has been split into multiple components, upgrading an existing 3.0 installation requires installing or updating the biomaj Python package as well as the biomaj-cli and biomaj-daemon packages. The database must then be upgraded manually (see Upgrading in the documentation).

To execute the database migration:

    python biomaj_migrate_database.py

Application Features
====================

* Synchronisation:
  * Multiple remote protocols (ftp, sftp, http, local copy, ...)
  * Data transfer integrity check
  * Release versioning using an incremental approach
  * Multi-threading
  * Data extraction (gzip, tar, bzip)
  * Data tree directory normalisation


* Pre- and post-processing:
  * Advanced workflow description (D.A.G)
  * Post-process indexation for various bioinformatics software (blast, srs,
    fastacmd, readseq, etc.)
  * Easy integration of personal scripts for bank post-processing automation


* Supervision:
  * Optional administration web interface (biomaj-watcher)
  * CLI management
  * Mail alerts for the update cycle supervision
  * Optional Prometheus and InfluxDB integration
  * Optional Consul supervision of processes


* Scalability:
  * Monolithic (local install) or microservice architecture (remote access to a BioMAJ server)
  * Microservice installation allows per-process scalability and supervision (number of processes in charge of download, execution, etc.)


* Remote access:
  * Optional FTP server providing authenticated or anonymous data access
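
The pre- and post-processing workflow listed above is described as a directed acyclic graph (D.A.G), which means each process runs only after the processes it depends on. The following is a minimal sketch of that ordering idea, not BioMAJ's own implementation or configuration format: the function `execution_order` and the process names are hypothetical.

```python
def execution_order(depends_on):
    """Order process names so each runs after the ones it depends on.

    depends_on maps a process name to the names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    remaining = {name: set(deps) for name, deps in depends_on.items()}
    # Dependencies without an entry of their own have nothing to wait for.
    for deps in list(remaining.values()):
        for dep in deps:
            remaining.setdefault(dep, set())
    order = []
    while remaining:
        # Processes whose dependencies are all done; sorted for a stable result.
        ready = sorted(name for name, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected in process dependencies")
        for name in ready:
            order.append(name)
            del remaining[name]
        for deps in remaining.values():
            deps.difference_update(ready)
    return order


# Hypothetical process names, for illustration only.
workflow = {
    "extract": [],
    "blast_index": ["extract"],
    "emboss_index": ["extract"],
    "report": ["blast_index", "emboss_index"],
}
print(execution_order(workflow))
```

In this example the two indexing steps have no dependency on each other, so they could also run in parallel once `extract` has finished.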

Dependencies
============

Packages:
 * Debian: libcurl-dev, gcc
 * CentOS: libcurl-devel, openldap-devel, gcc

Linux tools: tar, unzip, gunzip, bunzip2

Database:
 * mongodb (local or remote)

Indexing (optional):
 * elasticsearch (global property, use_elastic=1)

Elasticsearch indexing adds advanced search features to BioMAJ, for example to
find banks that have files in a specific format.
Configuration of Elasticsearch is outside the scope of the BioMAJ documentation.
For a basic installation, one instance of Elasticsearch is enough (low volume of
data). In that case, the Elasticsearch configuration file should be modified
accordingly:

    node.name: "biomaj" (or any other name)
    index.number_of_shards: 1
    index.number_of_replicas: 0

Installation
============

From source:

After installing the dependencies, go to the BioMAJ source directory and run:

    python setup.py install

From packages:

    pip install biomaj biomaj-cli biomaj-daemon


You should consider using a Python virtual environment (virtualenv) to install BioMAJ.

Copy global.properties from tools/examples and update it to match your local
installation.

The tools/process directory contains example process files (Python and shell).
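
Before a first run, it can help to check that your copy of global.properties defines the database connection and the directories. The following is a minimal sketch using Python's standard `configparser`. The `GENERAL` section name and the key names in `REQUIRED_KEYS` are assumptions that this README does not confirm, so compare them with the example file in tools/examples; `missing_keys` is a hypothetical helper.

```python
import configparser

# Assumed key names; check them against your own global.properties.
REQUIRED_KEYS = ("db.url", "db.name", "conf.dir", "log.dir", "process.dir", "lock.dir")


def missing_keys(text, section="GENERAL", required=REQUIRED_KEYS):
    """Return the required keys that are absent from the given section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section(section):
        return list(required)
    return [key for key in required if not parser.has_option(section, key)]


# Deliberately incomplete sample content, for illustration only.
SAMPLE = """
[GENERAL]
db.url=mongodb://localhost:27017
db.name=biomaj
conf.dir=/var/lib/biomaj/conf
log.dir=/var/lib/biomaj/log
"""

print(missing_keys(SAMPLE))
```

To check a real file, read its content first, for example with `open("global.properties").read()`, and pass the result to `missing_keys`.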

Docker
======

You can use BioMAJ with Docker (genouest/biomaj):


    docker pull genouest/biomaj
    docker pull mongo
    docker run --name biomaj-mongodb -d mongo
    # Wait ~10 seconds for mongo to initialize
    # Create a local directory where databases will be permanently stored
    # *local_path*
    docker run --rm -v local_path:/var/lib/biomaj --link biomaj-mongodb:biomaj-mongodb osallou/biomaj-docker --help


Copy your bank properties to the *local_path*/conf directory and your post-processes (if any) to *local_path*/process.

You can override global.properties by mounting your own file at /etc/biomaj/global.properties (-v xx/global.properties:/etc/biomaj/global.properties).

No default bank property file or process is available in the container.

Examples are available at https://github.com/genouest/biomaj-data

API documentation
=================

https://readthedocs.org/projects/biomaj/

Status
======

[![Build Status](https://travis-ci.org/genouest/biomaj.svg?branch=master)](https://travis-ci.org/genouest/biomaj)

[![Documentation Status](https://readthedocs.org/projects/biomaj/badge/?version=latest)](https://readthedocs.org/projects/biomaj/?badge=latest)

[![Code Health](https://landscape.io/github/genouest/biomaj/master/landscape.svg?style=flat)](https://landscape.io/github/genouest/biomaj/master)

Testing
=======

Execute unit tests:

    nosetests

Execute unit tests, skipping those that need network access:

    nosetests -a '!network'

Monitoring
==========

InfluxDB can be used to monitor BioMAJ. The following series are available:

* biomaj.banks.quantity (number of banks)
* biomaj.production.size.total (size of all production directories)
* biomaj.workflow.duration (workflow duration)
* biomaj.production.size.latest (size of latest update)
* biomaj.bank.update.downloaded_files (number of downloaded files)
* biomaj.bank.update.new (track updates)

License
=======

A-GPL v3+

Remarks
=======

BioMAJ uses libcurl. To use sftp, libcurl must be compiled with sftp support.
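
One way to get a first indication of sftp support is to look at the `Protocols:` line printed by `curl --version`. The following is a minimal sketch with hypothetical helper names. It inspects the curl command-line tool, which may be linked against a different libcurl than the one BioMAJ loads, so treat the result as a hint only.

```python
import subprocess


def protocols_from_version_output(output):
    """Return the protocol names listed by `curl --version`."""
    for line in output.splitlines():
        if line.startswith("Protocols:"):
            return line.split(":", 1)[1].split()
    return []


def curl_cli_supports_sftp():
    """Run `curl --version` and report whether sftp is listed."""
    output = subprocess.check_output(["curl", "--version"]).decode()
    return "sftp" in protocols_from_version_output(output)


# Shortened sample of `curl --version` output, for illustration only.
SAMPLE_OUTPUT = (
    "curl 7.68.0 (x86_64-pc-linux-gnu) libcurl/7.68.0\n"
    "Protocols: dict file ftp ftps http https scp sftp\n"
    "Features: IPv6 Largefile SSL\n"
)
print("sftp" in protocols_from_version_output(SAMPLE_OUTPUT))
```

`curl_cli_supports_sftp()` performs the same check against the curl binary installed on the current host.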

To delete the Elasticsearch index:

    curl -XDELETE 'http://localhost:9200/biomaj_test/'
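
The same request can be issued from Python with the standard library. The following is a minimal sketch that only builds the request: the helper name `delete_index_request` is hypothetical, and the host and index name are the ones used in the curl command above.

```python
import urllib.request


def delete_index_request(index, host="http://localhost:9200"):
    """Build (but do not send) the HTTP DELETE request for an index."""
    url = "{0}/{1}/".format(host, index)
    return urllib.request.Request(url, method="DELETE")


request = delete_index_request("biomaj_test")
print(request.get_method(), request.full_url)
# Sending is left commented out because it permanently removes the index:
# urllib.request.urlopen(request)
```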

Credits
======

Special thanks to tuco at Pasteur Institute for the intensive testing and new
ideas.
Thanks to the old BioMAJ team for the work they have done.

BioMAJ is developed at the IRISA research institute.