Commit 33165a29 authored by Cosmin SAVEANU

Corrected the markdown from the pymultialign_description.md

parent 02aad68c
# Multiple alignment visualization (pymultialign)
Working on model organisms to understand conserved molecular mechanisms
is often frustrating because my favorite yeast genes often have
......@@ -12,43 +12,37 @@ various NMD factors is reproduced below:
| *S. cerevisiae* | *C. elegans* | *D. melanogaster* | *H. sapiens* |
| --------------- | ------------ | ----------------- | ------------ |
| Upf1<br>Upf2<br>Upf3 | SMG-2<br>SMG-3<br>SMG-4 | Upf1<br>Upf2<br>Upf3 | UPF1<br>UPF2<br>UPF3a/b |
| Nmd4 (?) | SMG-6 | Smg6 | SMG6 |
| Ebs1 (?) | SMG-5; SMG-7 | Smg5 (Smg7?) | SMG5; SMG7 |
| absent | SMG-1 | Smg1 | SMG1 |
The architecture of Upf1 proteins from different organisms is quite
similar: most of the sequence is conserved, with the exception of the
N-terminal and C-terminal domains. But what exactly is the
correspondence between, say, residue 572 in yeast and the equivalent
amino acid in the human protein? I can align the two sequences, if they
are similar enough, and find the answer. But to know how conserved this
position is in other species, I would:
- go to [OrthoMCL](http://orthomcl.org/orthomcl/) and find the group of
conserved Upf1s;
- recover the FASTA format sequences of the proteins;
- do a multiple alignment (using [Jalview](http://www.jalview.org/) and
web services for multiple alignments, such as T-Coffee or MUSCLE);
- manually filter out the sequences that are obviously wrong (potential
annotation issues);
- align the sequences that passed the filter;
- count residues to find equivalences;
- find some way to make a picture out of the result.
The last step is somewhat difficult: how can the alignment information
be compressed and shown in an easy-to-grasp way?
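The residue-counting step in the list above can be illustrated with a minimal sketch. It is not taken from the pymultialign scripts: the function name `residue_map`, the gap character `-` and the toy sequences are assumptions made for the example. It walks two rows of the same alignment and records which residue number in the first sequence faces which residue number in the second.

```python
def residue_map(aligned_1, aligned_2, gap="-"):
    """Map residue numbers of sequence 1 to residue numbers of sequence 2.

    Both inputs are rows of the same alignment, so they must have equal
    length. Residue numbers start at 1 and do not count gaps.
    """
    if len(aligned_1) != len(aligned_2):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    pos_1 = pos_2 = 0
    for aa_1, aa_2 in zip(aligned_1, aligned_2):
        if aa_1 != gap:
            pos_1 += 1
        if aa_2 != gap:
            pos_2 += 1
        # Record an equivalence only where both sequences have a residue.
        if aa_1 != gap and aa_2 != gap:
            mapping[pos_1] = pos_2
    return mapping


# Toy alignment rows, made up for illustration (not real Upf1 sequences).
toy_yeast = "MK-TAYIAK"
toy_human = "MKQT--IAR"
equivalents = residue_map(toy_yeast, toy_human)
print(equivalents)
```

Residues of the first sequence that face a gap in the second one are simply absent from the returned dictionary, so a lookup such as `equivalents.get(4)` returns `None` for them.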
After searching for some time for a solution, without finding one, I
wrote two short Python scripts that can do the job. The first isolates
two sequences of interest from a multiple alignment. Then, it looks for
......@@ -67,25 +61,20 @@ position that has an amino acid there, it will have a value of 0 as well.
Finally, the output for two sequences contains several columns with the
following information:
- `alignment_pos`: a number that starts at 1 at the first amino acid in
the multiple alignment and counts all the positions; only the positions
at which either of the two selected sequences has a residue are
retained;
- `position_1` and `position_2`: the residue number for each of the two
sequences, not including gaps (from 1 to the length of each sequence);
- `pcaligned_1` and `pcaligned_2`: the percentage of sequences in the
alignment that have, at a given position, an amino acid from the same
group as sequence 1 or sequence 2, respectively;
- `aagroup_1` and `aagroup_2`: the similarity group used to calculate
the percentages;
- `ownaa_1` and `ownaa_2`: the residue present at that position of the
alignment; if there is no correspondence, the value is "nan";
- `idx`: an index from 1 to the common length of both sequences.
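To make the column definitions above more concrete, here is a hedged sketch that builds rows with the same column names from a toy alignment. It is an illustration, not the code of the first script: the similarity groups in `AA_GROUPS`, the helper names, the use of `None` for a missing position or group, and the choice to count the selected sequence itself in the percentage are all assumptions.

```python
# Hypothetical similarity groups; the groups used by the real script may differ.
AA_GROUPS = {
    "aliphatic": "AVLIM",
    "aromatic": "FWY",
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQ",
    "special": "CGP",
}
AA_TO_GROUP = {aa: name for name, members in AA_GROUPS.items() for aa in members}


def percent_same_group(column, aa):
    """Percentage of sequences whose residue in this column shares aa's group."""
    group = AA_TO_GROUP.get(aa)
    if group is None:  # gap or unknown residue: nothing to compare with
        return 0.0
    hits = sum(1 for other in column if AA_TO_GROUP.get(other) == group)
    return 100.0 * hits / len(column)


def pair_table(alignment, name_1, name_2, gap="-"):
    """One dict per alignment column where either selected sequence has a residue."""
    seq_1, seq_2 = alignment[name_1], alignment[name_2]
    rows = []
    pos_1 = pos_2 = 0
    for col, (aa_1, aa_2) in enumerate(zip(seq_1, seq_2)):
        has_1, has_2 = aa_1 != gap, aa_2 != gap
        if not (has_1 or has_2):
            continue  # both selected sequences have a gap: column is dropped
        pos_1 += has_1
        pos_2 += has_2
        column = [seq[col] for seq in alignment.values()]
        rows.append({
            "alignment_pos": col + 1,
            "position_1": pos_1 if has_1 else None,
            "position_2": pos_2 if has_2 else None,
            "pcaligned_1": percent_same_group(column, aa_1),
            "pcaligned_2": percent_same_group(column, aa_2),
            "aagroup_1": AA_TO_GROUP.get(aa_1),
            "aagroup_2": AA_TO_GROUP.get(aa_2),
            "ownaa_1": aa_1 if has_1 else "nan",
            "ownaa_2": aa_2 if has_2 else "nan",
            "idx": len(rows) + 1,
        })
    return rows


# Toy alignment, made up for illustration only.
toy_alignment = {"seq_a": "MK-TA", "seq_b": "MR-T-", "seq_c": "LEQS-"}
table = pair_table(toy_alignment, "seq_a", "seq_b")
for row in table:
    print(row)
```

In this toy case the third alignment column is skipped because both selected sequences have a gap there, so `alignment_pos` jumps from 2 to 4 while `idx` keeps counting without interruption.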
The second script uses the output from the first one to build an image
of the alignment information. Having separated steps for counting aligned
......@@ -93,8 +82,8 @@ residues and for the display allows more flexibility. One can just use
the first script to find equivalent regions and feed its output to R or
Python for a custom graphical representation. Alternatively, a
different computation could be used in the first step and, as long as
the output format is the same, the second script can rapidly generate a
standard graphic from it.
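As an example of reusing the first script's output in Python, the sketch below looks up the residue of sequence 2 that corresponds to a given residue of sequence 1. The tab-separated layout, the helper name `equivalent_residue` and the values in the toy table are assumptions for illustration; only the column names come from the description above.

```python
import csv
import io
import math

# Made-up excerpt using the column names described above. The separator and
# the values are assumptions, not output of the real script.
toy_output = (
    "alignment_pos\tposition_1\tposition_2\tpcaligned_1\tpcaligned_2\n"
    "10\t8\t9\t95.0\t95.0\n"
    "11\t9\tnan\t40.0\t0.0\n"
    "12\t10\t10\t100.0\t100.0\n"
)


def equivalent_residue(rows, residue_1):
    """Return the residue number in sequence 2 facing residue_1, or None."""
    for row in rows:
        # float() accepts "8", "8.0" and "nan"; nan never equals a number.
        if float(row["position_1"]) == residue_1:
            pos_2 = float(row["position_2"])
            return None if math.isnan(pos_2) else int(pos_2)
    return None


rows = list(csv.DictReader(io.StringIO(toy_output), delimiter="\t"))
print(equivalent_residue(rows, 8))
print(equivalent_residue(rows, 9))
```

With a real output file, `io.StringIO(toy_output)` would be replaced by an opened file handle, and the same rows could be passed to any plotting library.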
The scripts have a few parameters, briefly explained when the scripts are
invoked with the `-h` flag. Required libraries are pandas, numpy and
plotnine. They can be installed via `conda` or `pip`.