Commit 32ac503c authored by csaveanu's avatar csaveanu

added description

parent 26dd949e
Multiple alignment visualization (pymultialign)
Working on model organisms to understand conserved molecular mechanisms
is often frustrating because my favorite yeast genes often have
different names in *C. elegans*, people or *D. melanogaster*. It becomes
worse when you want to compare interaction domains or, god forbid,
specific point mutations. NMD illustrates well these issues, as the
yeast Nam7 protein is known as SMG-2 in *C. elegans* and as Upf1 in
yeast and other species. A table of correspondence between names of
various NMD factors is reproduced below:
| *S. cerevisiae* | *C. elegans* | *D. melano-gaster* | *H. sapiens* |
| Upf1\ | SMG-2 | Upf1\ | UPF1\ |
| Upf2\ | | Upf2\ | UPF2\ |
| Upf3 | SMG-3 | Upf3 | UPF3a/b |
| | | | |
| | SMG-4 | | |
| Nmd4 (?) | SMG-6 | Smg6 | SMG6 |
| Ebs1 (?) | SMG-5; SMG-7 | Smg5 (Smg7?) | SMG5; SMG7 |
| absent | SMG-1 | Smg1 | SMG1 |
The architecture of Upf1s from different organisms is quite similar,
with most oand an example of use. Theyf the sequence conserved, with the
exception of the N-ter and C-ter domain. But what is exactly the
correspondence between, say, residue 572 in yeast and the equivalent
aminoacid in human cells ? I can align the sequences if they are similar
enough and find the answer. But to know how conserved this position is
in other species, I would:
- go to [OrthoMCL]( and find the group of
conserved Upf1s;\
- recover the Fasta format sequences of the proteins;\
- do a multiple alignment (using [jalview]( and
web services for multiple alignments, such as T-coffee or Muscle);\
- manually filter those that are obviously wrong (potential annotation
- align any sequence that passed;\
- count residues to find equivalences;\
- find some way to make a picture out of the result.\
The last step is somewhat difficult. How to compress the alignment
information and show it in an easy to grasp way ?\
After searching for some time for a solution, without finding one, I
wrote two short Python scripts that can do the job. The first isolates
two sequences of interest from a multiple alignment. Then, it looks for
the number of sequences that have a similar amino acid at each aligned
position. Similarity in the first version of the script is defined by
the chemical properties of amino acids - aliphatic \"GAVLI\", aromatic
\"FYW\", sulphur containing \"CM\", having hydroxyl groups \"ST\", basic
\"KRH\", acidic \"DENQ\", rigid \"P\". Better similarity scores can also
be used, and an residue may belong to several groups. For each position
and for each group of aminoacids we compute how many of the sequences
display an aligned residue from that group. If the amino acid of the
selected sequence belongs to the most frequent group in the alignment,
it gets the value of the percentage of that group, if not, it gets a
value of \"0\". If the alignment has only inserted gaps (\"-\") at a
position that has a amino acid there, it will have a value of 0 as well.
Finally, the output for two sequences contains several columns with the
following information:
\- alignment\_pos: a number that starts at 1 at the first amino acid in
the multiple alignment and counts all the positions; only the positions
at which either of the two selected sequences has a residue are
\- position\_1 and position\_2 - the residue number for each of the two
sequences, not includding gaps (from 1 to the length of each sequence);
\- pcaligned\_1 and pcaligned\_2 - percent of sequences conserved at a
given position, if the sequence 1 or 2 have at that same position an
aminoacid from the same group;
\- aagroup\_1 and aagroup\_2 - group of similarity used to calculate
\- ownaa\_1 and ownaa\_2 - the residue present at that position of the
alignment - if there is no correspondence, the value is \"nan\";
\- idx - an index from 1 to the common length of both sequences.
The second script uses the output from the first one to build an image
of the aligment information. Having separated steps for counting aligned
residues and for the display allows more flexibility. One can just use
the first script to find equivalent regions and input that in R or
Python for his or her own graphical representation. Alternatively, a
different computation could be used in the first step, and, as long as
the format is the same, rapidly generate a standard graphics out of it.\
The scripts have a few parameters, briefly explained if the scripts are
invoked with the -h flag. Required libraries are pandas, numpy and
plotnine. They can be installed via conda or pip.
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment