Getting the data
# for V. cholerae
./scripts/download_genome.py configs/download_Vcholerae.yaml
# for E. coli
./scripts/download_genome.py configs/download_Ecoli.yaml
Computing codon usage
Counting codons
On an interactive compute node with 2 CPUs requested (for slurm, option -c 2):
# for V. cholerae
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Vcholerae.yaml -j 2 1>log 2>err
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Vcholerae_first30.yaml -j 2 1>log 2>err
# for E. coli
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Ecoli.yaml -j 2 1>log 2>err
snakemake --snakefile workflow/codon_usage.snakefile --configfile configs/codon_usage_Ecoli_first30.yaml -j 2 1>log 2>err
As of 05/07/2021, this workflow actually only counts codons.
Codon counts are available here:
Similar files are also available in the codon_usage_first30 folder.
Filtering genes
Gene filtering is performed in the following Jupyter notebooks:
Select_genes_Vcholerae.ipynb
Select_genes_Vcholerae_first30.ipynb
Select_genes_Ecoli.ipynb
Select_genes_Ecoli_first30.ipynb
This consists of pre-processing the counts to exclude genes with no valid start codon and to group codons by their corresponding amino-acid.
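The two pre-processing steps can be illustrated with a minimal sketch. It is not the code of the notebooks: the data layout (plain dictionaries), the function names, and the set of start codons considered valid are assumptions made for this illustration. The codon table is the standard genetic code, with stop codons grouped under "*".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption: the start codons accepted here are illustrative only;
# the notebooks may use a different set.
VALID_START_CODONS = {"ATG", "GTG", "TTG"}


def filter_genes(start_codons, counts):
    """Keep only the genes whose annotated start codon is considered valid.

    start_codons: {gene: first codon of the gene}
    counts: {gene: {codon: count}}
    """
    return {
        gene: codon_counts
        for gene, codon_counts in counts.items()
        if start_codons.get(gene) in VALID_START_CODONS
    }


def group_by_amino_acid(codon_counts):
    """Turn {codon: count} into {amino-acid: {codon: count}} for one gene."""
    grouped = {}
    for codon, count in codon_counts.items():
        grouped.setdefault(CODON_TO_AA[codon], {})[codon] = count
    return grouped


# Hypothetical toy data: geneB starts with CTG and is excluded.
counts = {
    "geneA": {"ATG": 1, "AAA": 3, "AAG": 1},
    "geneB": {"CTG": 1, "AAA": 2},
}
start_codons = {"geneA": "ATG", "geneB": "CTG"}
kept = filter_genes(start_codons, counts)
grouped = group_by_amino_acid(kept["geneA"])
```
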
The notebooks can be executed on the command-line as follows:
jupyter nbconvert --execute --to html Select_genes_Vcholerae.ipynb
jupyter nbconvert --execute --to html Select_genes_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Select_genes_Ecoli.ipynb
jupyter nbconvert --execute --to html Select_genes_Ecoli_first30.ipynb
This results in the following "filtered" counts tables:
For V. cholerae:
For E. coli:
Similar files are also available in the codon_usage_first30 folder.
Computing various usage biases
The actual computation of codon usage biases is performed in the following Jupyter notebooks:
Explore_usage_biases_Vcholerae.ipynb
Explore_usage_biases_Vcholerae_first30.ipynb
Explore_usage_biases_Ecoli.ipynb
Explore_usage_biases_Ecoli_first30.ipynb
The notebooks can be executed on the command-line as follows:
jupyter nbconvert --execute --to html Explore_usage_biases_Vcholerae.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Ecoli.ipynb
jupyter nbconvert --execute --to html Explore_usage_biases_Ecoli_first30.ipynb
This results in the following standardized codon biases tables:
For V. cholerae:
- "Gene-wide" codon biases: standardized_codon_usage_biases.tsv
- "By amino-acid" codon biases: standardized_codon_usage_biases_by_aa.tsv
For E. coli:
- "Gene-wide" codon biases: standardized_codon_usage_biases.tsv
- "By amino-acid" codon biases: standardized_codon_usage_biases_by_aa.tsv
Similar files are also available in the codon_usage_first30 folder.
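The exact definition of the standardized biases is the one implemented in the notebooks. As a rough illustration only, the sketch below assumes that "by amino-acid" usage is the proportion of each codon within its amino-acid family, and that standardizing means centering and scaling each codon column across genes (a z-score). Both the definitions and the function names are assumptions of this sketch.

```python
from statistics import fmean, pstdev


def proportions_by_aa(grouped_counts):
    """Proportion of each codon within its amino-acid family, for one gene.

    grouped_counts: {amino-acid: {codon: count}}
    Families with no counts in the gene are skipped.
    """
    proportions = {}
    for family in grouped_counts.values():
        total = sum(family.values())
        if total == 0:
            continue
        for codon, count in family.items():
            proportions[codon] = count / total
    return proportions


def standardize(props_per_gene):
    """Z-score each codon across the genes in which it has a value.

    props_per_gene: {gene: {codon: proportion}}
    Assumption: population standard deviation, and 0.0 when a codon
    shows no variation across genes.
    """
    codons = sorted({c for props in props_per_gene.values() for c in props})
    result = {gene: {} for gene in props_per_gene}
    for codon in codons:
        values = {g: p[codon] for g, p in props_per_gene.items() if codon in p}
        mean = fmean(list(values.values()))
        sd = pstdev(list(values.values()))
        for gene, value in values.items():
            result[gene][codon] = (value - mean) / sd if sd > 0 else 0.0
    return result


# Hypothetical toy data with a single amino-acid family (lysine).
props = {
    "geneA": proportions_by_aa({"K": {"AAA": 3, "AAG": 1}}),
    "geneB": proportions_by_aa({"K": {"AAA": 1, "AAG": 3}}),
}
biases = standardize(props)
```

In this toy case the two genes have mirrored preferences, so their standardized biases are of opposite sign for each codon.
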
Clustering genes based on their codon usage biases
The above notebooks (Explore_usage_biases_Vcholerae.ipynb and Explore_usage_biases_Ecoli.ipynb, and their _first30 versions) are also used to try clustering methods applied for each amino-acid on "by amino-acid" codon biases, with as many clusters as there are codons for this amino-acid:
- KMeans (previously tried for V. cholerae only, not used any more).
- A method inspired by KMeans (i.e. using the same distance type, the squared Euclidean distance), but with fixed "centroids" defined by computing standardized usage biases assuming exclusive use of one of the possible codons ("full bias"). Each gene is simply assigned to the cluster whose centroid it is closest to.
- Assigning a gene to a cluster labelled by the codon with the highest bias for this gene ("highest bias").
The second and third methods make it easier to associate each cluster with a preferred codon.
Comparing the cluster compositions on one example showed that the methods give different results. All methods seem to yield reasonable results, judging by the differentiated codon usage bias distributions that are obtained when drawing violin plots for each cluster.
For the sake of clarity, the KMeans method was not used for E. coli.
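The "full bias" and "highest bias" assignments can be sketched as follows for one amino-acid. This is an illustration, not the code of the notebooks: it assumes that a standardized bias is obtained as (proportion - mean) / standard deviation, and the per-codon means and standard deviations used below are hypothetical values.

```python
def full_bias_centroids(means, sds):
    """One fixed centroid per codon of the family.

    Each centroid is the standardized bias vector of an imaginary gene
    using exclusively that codon (proportion 1 for it, 0 for the others).
    means, sds: {codon: value}, assumed to come from the standardization step.
    """
    codons = list(means)
    return {
        chosen: {
            codon: ((1.0 if codon == chosen else 0.0) - means[codon]) / sds[codon]
            for codon in codons
        }
        for chosen in codons
    }


def squared_euclidean(a, b):
    """Squared Euclidean distance between two {codon: bias} vectors."""
    return sum((a[codon] - b[codon]) ** 2 for codon in a)


def assign_full_bias(gene_biases, centroids):
    """Label of the closest fixed centroid."""
    return min(centroids, key=lambda c: squared_euclidean(gene_biases, centroids[c]))


def assign_highest_bias(gene_biases):
    """Label of the codon with the highest standardized bias."""
    return max(gene_biases, key=gene_biases.get)


# Hypothetical lysine family where AAA is globally preferred.
means = {"AAA": 0.7, "AAG": 0.3}
sds = {"AAA": 0.1, "AAG": 0.1}
centroids = full_bias_centroids(means, sds)

# A gene using AAA slightly less than average (proportions 0.65 / 0.35).
gene = {"AAA": -0.5, "AAG": 0.5}
full_bias_label = assign_full_bias(gene, centroids)
highest_bias_label = assign_highest_bias(gene)
```

With these toy numbers the two methods disagree: the gene still uses AAA most of the time, so it is closest to the AAA "full bias" centroid, while its highest standardized bias is for AAG.
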
The cluster assignments are available in the following tables:
For V. cholerae:
- "By amino-acid" codon biases, with cluster assignments based on biases within each codon family: standardized_codon_usage_biases_by_aa_with_aa_based_clusterings.tsv
For E. coli:
- "By amino-acid" codon biases, with cluster assignments based on biases within each codon family: standardized_codon_usage_biases_by_aa_with_aa_based_clusterings.tsv
For each method and each amino-acid there is one column containing the cluster assignments.
The column names follow the pattern cluster_<aa>_<method> where <aa> is the one-letter code of an amino-acid, and <method> is either full_bias or highest_bias.
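When reading these tables programmatically, the clustering columns can be recognized from their names. The helper below is a hypothetical convenience written for this README, not part of the notebooks; it relies only on the naming pattern described above.

```python
import re

# cluster_<aa>_<method>, with a one-letter amino-acid code and one of the
# two method names given above.
CLUSTER_COLUMN = re.compile(
    r"^cluster_(?P<aa>[A-Z])_(?P<method>full_bias|highest_bias)$"
)


def parse_cluster_column(name):
    """Return (amino-acid, method) for a clustering column, else None."""
    match = CLUSTER_COLUMN.match(name)
    if match is None:
        return None
    return match["aa"], match["method"]


# Hypothetical header: only the cluster_* names follow the documented pattern.
header = ["gene_id", "cluster_K_full_bias", "cluster_K_highest_bias"]
cluster_columns = {
    name: parse_cluster_column(name)
    for name in header
    if parse_cluster_column(name) is not None
}
```
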
Similar files are also available in the codon_usage_first30 folder.
Extracting lists of genes for each cluster
This is done in the following notebooks:
Extract_gene_lists_Vcholerae.ipynb
Extract_gene_lists_Vcholerae_first30.ipynb
Extract_gene_lists_Ecoli.ipynb
Extract_gene_lists_Ecoli_first30.ipynb
The notebooks can be executed on the command-line as follows:
jupyter nbconvert --execute --to html Extract_gene_lists_Vcholerae.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Vcholerae_first30.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Ecoli.ipynb
jupyter nbconvert --execute --to html Extract_gene_lists_Ecoli_first30.ipynb
These notebooks contain links to gene lists as well as violin plots representing the distribution of codon usage biases within each cluster.
Extracting most positively biased genes from each cluster
A further processing step was added to the above notebooks. It consists in defining a threshold that cuts the distribution of codon usage biases for a given gene cluster at a "valley" of low density in this distribution. Genes above this threshold are the most positively biased genes for the codon around which the cluster is built.
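One possible way to locate such a valley is sketched below. It is an assumption-laden illustration, not the procedure of the notebooks: the density is estimated with a simple Gaussian kernel, the bandwidth and grid size are arbitrary, and the valley is taken as the point of lowest density between the two highest density peaks.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density of `values` at a point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def valley_threshold(values, bandwidth=0.5, n_points=200):
    """Bias value at the lowest density between the two highest peaks.

    Returns None when fewer than two interior peaks are found
    (for example when the distribution is unimodal).
    """
    low, high = min(values), max(values)
    step = (high - low) / (n_points - 1)
    grid = [low + i * step for i in range(n_points)]
    density = gaussian_kde(values, bandwidth)
    dens = [density(x) for x in grid]
    # Interior local maxima of the estimated density.
    peaks = [
        i for i in range(1, n_points - 1)
        if dens[i - 1] < dens[i] >= dens[i + 1]
    ]
    if len(peaks) < 2:
        return None
    # Two highest peaks, reordered by position along the grid.
    first, second = sorted(sorted(peaks, key=lambda i: dens[i], reverse=True)[:2])
    valley = min(range(first, second + 1), key=lambda i: dens[i])
    return grid[valley]


# Hypothetical biases for one cluster: a main group near 0 and a
# strongly biased group near 4.
cluster_biases = [-0.4, -0.2, 0.0, 0.2, 0.4, 3.6, 3.8, 4.0, 4.2, 4.4]
threshold = valley_threshold(cluster_biases)
top_genes = [b for b in cluster_biases if b > threshold]
```

The result depends on the bandwidth: a value that is too large merges the two modes and no valley is found, while a value that is too small creates spurious valleys.
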