Resolve "Change simulation tool"

Merged Kenzo-Hugo Hillion requested to merge 10-change-simulation into master
1 file
+ 17
119
# Simulation
[CAMISIM](https://github.com/CAMI-challenge/CAMISIM) can model different microbial abundance
profiles (from multi-sample time series to differential abundance studies) and was used to
generate the benchmark data sets of the first CAMI challenge.
[InSilicoSeq](https://github.com/HadrienG/InSilicoSeq) is a sequencing simulator producing
realistic Illumina reads. Primarily intended for simulating metagenomic samples, it can also
be used to produce sequencing data from a single genome.
We describe here only de novo metagenome simulation. For more exhaustive
documentation, please refer to the [CAMISIM wiki](https://github.com/CAMI-challenge/CAMISIM/wiki).
We describe here only the usage for a simple simulation. For more exhaustive documentation,
please refer to the [InSilicoSeq documentation](https://insilicoseq.readthedocs.io/en/latest/?badge=latest)
or look at the help section of the tool with `iss generate --help`.
## Install
CAMISIM has many dependencies; the list can be found [here](https://github.com/CAMI-challenge/CAMISIM/wiki/User-manual#installation).
There are several ways of installing InSilicoSeq, all described [here](https://insilicoseq.readthedocs.io/en/latest/iss/install.html).
However, we recommend using the docker image `cami/camisim:latest` or the singularity one ([here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim), built only from the docker image for the moment).
Here, we will use the docker image `hadrieng/insilicoseq:1.4.2` or the singularity one built from the docker image.
## Run de novo metagenomes simulation
The repository of the tool ([here](https://github.com/CAMI-challenge/CAMISIM)) comes with a
series of sample data, which makes it possible to test the tool directly without downloading
anything.
### Build your reference metagenome
For this example, we are going to start from the files given and modify the parameters slightly to
make the process a bit faster. Files can be found in the `example/` directory.
All the paths are prefixed with the `/input` directory since we are going to mount our config
files into this directory.
The first step is to select the genomes that will make up your metagenome: pick the genomes of
interest and put them in a single `fasta` file.
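
As a minimal sketch of this step, assuming one FASTA file per selected genome in a `genomes/` directory (the directory, file names and sequences below are placeholders, not part of the original instructions), the genomes can simply be concatenated:

```bash
# Hypothetical layout: one FASTA file per genome in genomes/.
workdir="$(mktemp -d)"
mkdir -p "$workdir/genomes"

# Two tiny placeholder genomes so that the sketch runs on its own.
printf '>genome_A\nACGTACGTAC\n' > "$workdir/genomes/genome_A.fasta"
printf '>genome_B\nTTGACCGTAA\n' > "$workdir/genomes/genome_B.fasta"

# Concatenate every selected genome into a single multi-FASTA file.
cat "$workdir"/genomes/*.fasta > "$workdir/metagenome.fasta"

# Count the sequences in the result (one header line per record).
n_records="$(grep -c '^>' "$workdir/metagenome.fasta")"
echo "metagenome.fasta contains $n_records sequences"
```

The resulting `metagenome.fasta` is the kind of file passed to `--genomes` in the commands of this page.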
> **Warning** Generated paired-end reads are written to a single `fastq` file that needs to be split into `_R1` and `_R2` files.
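
One possible way to do this split is sketched here with `awk`, under the assumption that the file is strictly interleaved (read 1 then read 2 of each pair) and that every record spans exactly four lines. The file names and reads are placeholders.

```bash
workdir="$(mktemp -d)"

# Placeholder interleaved file: read 1 and read 2 of each pair alternate.
cat > "$workdir/sample.fq" <<'EOF'
@pair1/1
ACGT
+
IIII
@pair1/2
TGCA
+
IIII
@pair2/1
GGCC
+
IIII
@pair2/2
CCGG
+
IIII
EOF

# Records are 4 lines long; records 0, 2, 4... go to R1 and records 1, 3, 5... to R2.
awk -v r1="$workdir/sample_R1.fastq" -v r2="$workdir/sample_R2.fastq" '
    { rec = int((NR - 1) / 4) }
    rec % 2 == 0 { print > r1 }
    rec % 2 == 1 { print > r2 }
' "$workdir/sample.fq"

# Both files should hold the same number of lines.
wc -l "$workdir/sample_R1.fastq" "$workdir/sample_R2.fastq"
```

If the reads are not strictly interleaved, this approach silently produces mismatched pairs, so it is worth checking the read names in both files afterwards.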
> **Note** InSilicoSeq can [download and build a metagenome for you](https://github.com/HadrienG/InSilicoSeq#generate-reads-without-input-genomes) but you need internet access.
### With docker
```bash
docker pull cami/camisim:latest
docker pull hadrieng/insilicoseq:1.4.2
mkdir output
docker run -it -v "/path/to/folder/example/:/input:rw" -v "/path/to/folder/output/:/output:rw" cami/camisim:latest metagenomesimulation.py /input/config.ini
docker run -it -v "/metagenome/folder:/input:rw" -v "/path/to/folder/output/:/output:rw" hadrieng/insilicoseq:1.4.2 iss generate --genomes /input/metagenome.fasta --model HiSeq --output /output/simulation --compress
```
### With singularity on TARS
You can build the singularity image from the docker one. A recipe doing that is
available [here](https://gitlab.pasteur.fr/metagenomics/singularity/tree/master/tools/camisim).
You can easily build the singularity image from the docker one with the `singularity pull NAME.simg docker://hadrieng/insilicoseq:1.4.2` command.
You will probably also need to mount a tmp directory, since the default one from
singularity seems to be too small for the tool.
```bash
module load singularity
mkdir output tmp
singularity run --bind /path/to/folder/example:/input,/path/to/folder/tmp:/tmp,/path/to/folder/output:/output /path/to/singularity.simg /input/config.ini
```
> **WARNING**: This process can require 12 GB of RAM or more
### Configuration file
This is where you set the parameters of your simulation. The file to customize is
`config.ini`.
We will quickly go through the different parts of this config file. You can find the
complete description in the [documentation](https://github.com/CAMI-challenge/CAMISIM/wiki/Configuration-File-Options).
The default paths can be kept. However, with singularity it is easier to use absolute paths
instead of the default relative ones.
#### Main
```ini
[Main]
seed=632741178 # if None is used, random seed is chosen
phase=0 # 0: Full run; 1: Only community design; 2: Start with read simulation
max_processors=8
dataset_id=RL # name of the created sample
output_directory=out
temp_directory=/tmp
gsa=True # whether a gold standard assembly should be created
pooled_gsa=True # whether a pooled gold standard over all samples is created
anonymous=True # whether the output is anonymized (reads from genomes are shuffled)
compress=1 # 0 is for no compression, 9 is maximum compression
```
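
As an illustration of customizing the `[Main]` section without touching the original file, the sketch below writes an edited copy with `sed`. The trimmed-down `config.ini`, the name `config_custom.ini` and the new values (`4`, `/output`) are assumptions chosen for the example. Note that this substitution replaces the whole line, so any inline comment on an edited line is dropped.

```bash
workdir="$(mktemp -d)"

# Minimal stand-in for the [Main] section shown above.
cat > "$workdir/config.ini" <<'EOF'
[Main]
seed=632741178
phase=0
max_processors=8
output_directory=out
temp_directory=/tmp
EOF

# Write an edited copy rather than modifying the original in place.
sed -e 's|^max_processors=.*|max_processors=4|' \
    -e 's|^output_directory=.*|output_directory=/output|' \
    "$workdir/config.ini" > "$workdir/config_custom.ini"

# Show the two edited parameters.
grep -E '^(max_processors|output_directory)=' "$workdir/config_custom.ini"
```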
#### Read Simulator
```ini
[ReadSimulator]
readsim=/usr/local/bin/tools/art_illumina-2.3.6/art_illumina # leave by default since we are in a container
error_profiles=/usr/local/bin/tools/art_illumina-2.3.6/profiles # leave by default
samtools=/usr/local/bin/tools/samtools-1.3/samtools # leave by default
profile=mbarc # choose for ART: mi/hi/hi150/mbarc
size=0.1 # size of a single sample in Gigabasepairs (Gbp)
type=art # simulation tool
fragments_size_mean=270
fragment_size_standard_deviation=27
```
All the paths to the tools are left at their defaults since we are running the tool from a container.
The different profiles correspond to error profiles, and the documentation mentions
that `mbarc` is recommended for bacterial communities.
#### Community Design
```ini
[CommunityDesign]
#distribution_file_paths='out/abundance0.tsv', 'out/abundance1.tsv', 'out/abundance2.tsv', 'out/abundance3.tsv', 'out/abundance4.tsv', 'out/abundance5.tsv', 'out/abundance6.tsv', 'out/abundance7.tsv', 'out/abundance8.tsv', 'out/abundance9.tsv'
ncbi_taxdump=/usr/local/bin/tools/ncbi-taxonomy_20170222.tar.gz
strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
number_of_samples=10 # Number of samples to be created
```
Again, the different paths can be left at their defaults since everything runs in the container.
#### Community
```ini
[community0]
metadata=metadata.tsv # tab-separated table mapping genome IDs to additional information on their classification
id_to_genome_file=genome_to_id.tsv # format is: "genome_ID\tgenome_path" without header
id_to_gff_file=
genomes_total=24 # Total number of simulated genomes
genomes_real=24 # Number of genomes used from the input genomes
max_strains_per_otu=1 # Maximum number of strains drawn from genomes belonging to a single OTU
ratio=1 # ratio between different communities
mode=differential # mode for changing the abundances in different samples (replicates/timeseries_lognormal/timeseries_normal/differential)
log_mu=1 # mean of the used log-normal distribution
log_sigma=2 # standard deviation of the used log-normal distribution
gauss_mu=1 # mean of the used normal distribution
gauss_sigma=1 # standard deviation of the used normal distribution
view=False # show the used distribution of genomes before simulating
mkdir output
singularity run /path/to/singularity.simg iss generate --genomes metagenome.fasta --model HiSeq --output /output/simulation --compress
```
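
The `id_to_genome_file` format described above (`genome_ID<TAB>genome_path`, no header) can be generated rather than typed by hand. This is a sketch only: the `genomes/` directory and the use of each file's base name as the genome ID are assumptions, and the IDs presumably have to match the `genome_ID` column of the metadata file.

```bash
workdir="$(mktemp -d)"
mkdir -p "$workdir/genomes"

# Placeholder genomes so that the sketch runs on its own.
printf '>seq1\nACGT\n' > "$workdir/genomes/Genome1.fasta"
printf '>seq1\nTTGA\n' > "$workdir/genomes/Genome2.fasta"

# One line per genome: "<genome_ID><TAB><genome_path>", no header.
for fasta in "$workdir"/genomes/*.fasta; do
    genome_id="$(basename "$fasta" .fasta)"
    printf '%s\t%s\n' "$genome_id" "$fasta"
done > "$workdir/genome_to_id.tsv"

cat "$workdir/genome_to_id.tsv"
```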
##### Metadata `tsv` file
The first row of the file is the header for the following columns:
genome_ID | OTU | NCBI_ID | novelty_category
--------- | --- | ------- | ----------------
__novelty category__: if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (`new_strain`, `new_species`, `new_genus`, ...).
More information on this file is available here: [Genome selection](https://github.com/CAMI-challenge/CAMISIM/wiki/Genome-Selection)
A default file is provided, but we are going to work on a copy of it so that it can easily be
replaced by a customized one.
##### Number of genomes
If `genomes_total` is larger than `genomes_real`, the genomes making up the difference are simulated by sgEvolver.
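
To check how many genomes would be simulated for a given configuration, the two values can be read from the config file and subtracted. The sketch below uses a trimmed-down `config.ini` with the values shown earlier; the helper `get_value` is hypothetical and only handles simple `key=value # comment` lines.

```bash
workdir="$(mktemp -d)"

# Stand-in for the [community0] section, with the values used on this page.
cat > "$workdir/config.ini" <<'EOF'
[community0]
genomes_total=24 # Total number of simulated genomes
genomes_real=24 # Number of genomes used from the input genomes
EOF

# Print the value of a key, with any trailing "# comment" removed.
get_value() {
    awk -F'=' -v key="$1" '$1 == key { sub(/[ \t]*#.*/, "", $2); print $2 }' "$2"
}

genomes_total="$(get_value genomes_total "$workdir/config.ini")"
genomes_real="$(get_value genomes_real "$workdir/config.ini")"

# Genomes that would have to be simulated rather than taken from the input.
genomes_simulated=$((genomes_total - genomes_real))
echo "genomes to simulate: $genomes_simulated"
```

With the example values, both parameters are equal, so no genome is simulated and every genome comes from the input.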