Commit 597466f5 authored by Bertrand  NÉRON's avatar Bertrand NÉRON
Browse files

add pseudocode for most of exercises

adapt solutions to use functions as they are view earlier
externalize code to be dowloadable for student
parent 2192b4c9
This diff is collapsed.
......@@ -11,7 +11,6 @@ Exercises
Exercise
--------
Calculates the 10 first number of the Fibonacci sequence .
The Fibonacci sequence are the numbers in the following integer sequence:
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...
......@@ -24,35 +23,162 @@ The fibonacci suite can be defined as following:
|
| F\ :sub:`n` = F\ :sub:`n-1` + F\ :sub:`n-2`
Write a function which take an integer ``n`` as parameter
and returns a list containing the ``n`` first number of the Fibonacci sequence.
.. literalinclude:: _static/code/fibonacci_iteration.py
:linenos:
:language: python
:download:`fibonacci_iteration.py <_static/code/fibonacci_iteration.py>` .
We will see another way more elegant to implement the fibonacci suite in :ref:`Advance Programming Techniques` section.
Exercise
--------
display the largest element in list (containing float or integer only)?::
Reimplement your own function max (my_max).
This function will take a list or tuple of float or integer and
returns the largest element?
l = [1,2,3,4,58,9]
higest = l[0]
for i in l:
if i > highest:
highest = i
print highest
Write the pseudocode before to propose an implementation.
pseudocode
^^^^^^^^^^
| *function my_max(l)*
| *max <- first elt of l*
| *for each elts of l*
| *if elt is > max*
| *max <- elt*
| *return max*
implementation
^^^^^^^^^^^^^^
::
def my_max(seq):
"""
return the maximum value in a sequence
work only with integer or float
"""
higest = seq[0]
for i in seq:
if i > highest:
highest = i
return highest
l = [1,2,3,4,58,9]
print my_max(l)
58
Exercise
--------
let the following enzymes collection: ::
| We want to establish a restriction map of a sequence.
| But we will do this step by step.
| and reuse the enzymes used in previous chapter:
* create a function that take a sequence and an enzyme as parameter and return
the position of first binding sites.
(write the pseudocode)
**pseudocode**
| *function one_enz_binding_site(dna, enzyme)*
| *if enzyme binding site is substring of dna*
| *return of first position of substring in dna*
**implementation**
.. literalinclude:: _static/code/restriction.py
:linenos:
:lines: 1-8
:language: python
* improve the previous function to return all positions of binding sites
**pseudocode of first algorithm**
| *function one_enz_binding_sites(dna, enzyme)*
| *positions <- empty*
| *if enzyme binding site is substring of dna*
| *add the position of the first substring in dna in positions*
| *positions <- find binding_sites in rest of dna sequence*
| *return positions*
**implementation**
.. literalinclude:: _static/code/restriction.py
:linenos:
:lines: 9-16
:language: python
**pseudocode of second algorithm**
| *function one_enz_binding_sites(dna, enzyme)*
| *positions <- empty*
| *find first position of binding site in dna*
| *while we find binding site in dna*
| *add position of binding site to positions*
| *find first position of binding site in dna in rest of dna*
| *return positions*
**implementation**
.. literalinclude:: _static/code/restriction.py
:linenos:
:lines: 19-25
:language: python
search all positions of Ecor1 binding sites in dna_1
::
import collections
RestrictEnzyme = collections.namedtuple("RestrictEnzyme", "name comment sequence cut end")
ecor1 = RestrictEnzyme("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
dna_1 = """tcgcgcaacgtcgcctacatctcaagattcagcgccgagatccccgggggttgagcgatccccgtcagttggcgtgaattcag
cagcagcgcaccccgggcgtagaattccagttgcagataatagctgatttagttaacttggatcacagaagcttccaga
ccaccgtatggatcccaacgcactgttacggatccaattcgtacgtttggggtgatttgattcccgctgcctgccagg"""
* generalize the binding sites function to take a list of enzymes and return a list of tuple (enzyme name, position)
**pseudocode**
| *function binding_sites(dna, set of enzymes)*
| *positions <- empty*
| *for each enzyme in enzymes*
| *pos <- one_enz_binding_sites(dna, enzyme)*
| *pos <- for each position create a tuple enzyme name, position*
| *positions <- pos*
| *return positions*
**implementation**
in bonus we can try to sort the list in the order of the position of the binding sites like this:
[('Sau3aI', 38), ('SmaI', 42), ('Sau3aI', 56), ('EcoRI', 75), ...
.. literalinclude:: _static/code/restriction.py
:linenos:
:lines: 27-
:language: python
::
import collections
RestrictEnzyme = collections.namedtuple("RestrictEnzyme", "name comment sequence cut end")
ecor1 = RestrictEnzyme("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
ecor5 = RestrictEnzyme("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt")
bamh1 = RestrictEnzyme("BamHI", "type II restriction endonuclease from Bacillus amyloliquefaciens ", "ggatcc", 1, "sticky")
hind3 = RestrictEnzyme("HindIII", "type II site-specific nuclease from Haemophilus influenzae", "aagctt", 1 , "sticky")
......@@ -72,77 +198,17 @@ and the 2 dna fragments: ::
attcagtggcttcagaattcctcccgggagaagctgaatagtgaaacgattgaggtgttgtggtgaaccgagtaag
agcagcttaaatcggagagaattccatttactggccagggtaagagttttggtaaatatatagtgatatctggcttg"""
| which enzymes cut the dna_1 get the name of the enzymes and all their positions of binding site?
| the dna_2 ?
| the dna_1 but not the dna_2?
::
enzymes= (ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1)
binding_sites(dna_1, enzymes)
[('Sau3aI', 38), ('SmaI', 42), ('Sau3aI', 56), ('EcoRI', 75), ('SmaI', 95), ('EcoRI', 105),
('Sau3aI', 144), ('HindIII', 152), ('BamHI', 173), ('Sau3aI', 174), ('BamHI', 193), ('Sau3aI', 194)]
dna_1 = dna_1.replace('\n', '')
dans_2 = dna_2.replace('\n', '')
binding_sites(dna_2, enzymes)
[('EcoRI', 11), ('NotI', 33), ('HaeIII', 35), ('EcoRI', 98), ('SmaI', 106),
('EcoRI', 179), ('HaeIII', 193), ('EcoRV', 225)]
enzymes = [ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1]
digest_1 = []
for enz in enzymes:
pos = dna_1.find(enz.sequence)
if pos != -1:
digest_1.append(enz)
with this first algorithm we find if an enzyme cut the dna but we cannot find all cuts in the dna for an enzyme.
If we find a cutting site, we must search again starting at the first nucleotid after the begining of the match
until the end of the the dna, for this we use the start parameter of the find function, and so on.
As we don't know how many loop we need to scan the dna until the end we use a ``while`` loop testing for the presence of a cutting site.::
digest_1 = []
for enz in enzymes:
pos = dna_1.find(enz.sequence)
while pos != -1:
digest_1.append(enz)
pos = dna_1.find(enz.sequence, pos + 1)
digest_2 = []
for enz in enzymes:
pos = dna_2.find(enz.sequence)
while pos != -1:
digest_2.append(enz)
pos = dna_2.find(enz.sequence, pos + 1)
cut_dna_1 = set(digest_1)
cut_dna_2 = set(digest_2)
cut_dna_1_not_dna_2 = cut_dna_1 - cut_dna_2
but we want also the position, for instance to compute the fragments of dna. ::
digest_1 = []
for enz in enzymes:
pos = dna_1.find(enz.sequence)
while pos != -1:
digest_1.append((enz, pos))
pos = dna_1.find(enz.sequence, pos + 1)
#if we want to sort the list in function of their positions in the sequence
from operator import itemgetter
digest_1.sort(key=itemgetter(1))
print [(e.name, pos) for e, pos in digest_1]
:download:`restriction.py <_static/code/restriction.py>` .
digest_2 = []
for enz in enzymes:
pos = dna_2.find(enz.sequence)
while pos != -1:
digest_2.append((enz, pos))
pos = dna_2.find(enz.sequence, pos + 1)
print "list of all enzymes cutting dna 1 and theirs position in dna1 :", [(e.name, pos) for e, pos in digest_1]
print "list of all enzymes cutting dna 2 and theirs position in dna2 :", [(e.name, pos) for e, pos in digest_2]
cut_dna_1 = set([e.name for e, pos in digest_1])
cut_dna_2 = set([e.name for e, pos in digest_2])
cut_dna_1_not_dna_2 = cut_dna_1 - cut_dna_2
Exercise
--------
......
......@@ -195,7 +195,7 @@ without header or any non sequence characters
pseudocode:
| *fasta_to_one_line(seq)*
| *function fasta_to_one_line(seq)*
| *header_end_at <- find the first return line character*
| *raw_seq <- remove header from sequence*
| *raw_seq <- remove non sequence chars*
......
.. _Creating_and_Calling_Functions:
.. _Dive_into_Functions:
******************************
Creating and Calling Functions
******************************
*******************
Dive into Functions
*******************
Exercises
=========
......@@ -105,7 +105,7 @@ As soon as you make an assignment to a variable in a scope,
that variable becomes local to that scope and shadows any similarly named variable in the outer scope.
even if the assignment appear later in code.
Here *x = y* make *x* as local variable whatever you are in func.
so at line *y = x + 2* we try to use the loca variable *x* but we have to asign it a value (it is done later) so
so at line *y = x + 2* we try to use the local variable *x* but we have to asign it a value (it is done later) so
Python raise an UnboundLocalError (`see python faq for details <https://docs.python.org/3/faq/programming.html#why-am-i-getting-an-unboundlocalerror-when-the-variable-has-a-value>`_)
.. container:: clearer
......
......@@ -27,22 +27,13 @@ Exercise
Write a program that calculates the similarity of 2 RNA sequences.
* To compute the simalirity you need to parse a file containing the similarity matrix.
* To compute the simalirity you need to parse a file containing the :download:`similarity matrix <_static/data/similarity_matrix>`.
**Hint**: use the module containing the functions that handle a matrix from previous chapter.
put this matrix.py file in a directory named "my_python_lib" in your home or Desktop
and import it in your current program (the similarity script must be placed elsewhere).
* The similarity of the 2 sequences is the sum of base similarities.
so you have to compare the first base of to sequence and use the matrix to get the similarity
so you have to compare the first base of two sequences and use the matrix to get the similarity
from the similarity table, on so on for all bases then sum these similarities.
.. note::
as we don't yet see how to read a file, we provide a list of strings that represents the file
as we can get them if we read that file.
::
lines = iter([' A G C U\n'
'A 1.0 0.5 0.0 0.0\n',
'G 0.5 1.0 0.0 0.0\n',
'C 0.0 0.0 1.0 0.5\n',
'U 0.0 0.0 0.5 1.0\n'])
.. literalinclude:: _static/code/similarity.py
:linenos:
......
def all_codons():
all_codons = []
alphabet = 'acgt'
for base_1 in alphabet:
for base_2 in alphabet:
for base_3 in alphabet:
codon = base_1 + base_2 + base_3
all_codons.append(codon)
return all_codons
\ No newline at end of file
from itertools import product
def all_codons():
alphabet = 'acgt'
all_codons = [ ''.join(codon) for codon in product(alphabet, repeat = 3)]
return all_codons
\ No newline at end of file
def one_line(seq):
return seq.replace('\n', '')
def enz_filter(enzymes, dna):
cuting_enz = []
for enz in enzymes:
if enz.sequence in dna:
cuting_enz.append(enz)
return cuting_enz
import collections
def kmer(sequence, k):
kmers = collection.defaultdict(int)
for i in range(len(sequence) - k):
kmer = sequence[i:i + k]
def get_kmer_occurences(seq, kmer_len):
"""
return a list of tuple
each tuple contains a kmers present in seq and its occurence
"""
kmers = {}
stop = len(seq) - kmer_len
for i in range(stop + 1):
kmer = s[i : i + kmer_len]
kmers[kmer] = kmers.get(kmer, 0) + 1
return kmers
\ No newline at end of file
return kmers.items()
\ No newline at end of file
import collections
from operator import itemgetter
def get_kmer_occurences(seq, kmer_len):
"""
return a list of tuple
each tuple contains a kmers present in seq and its occurence
"""
kmers = collections.defaultdict(int)
stop = len(seq) - kmer_len
for i in range(stop + 1):
kmer = s[i : i + kmer_len]
kmers[kmer] += 1
kmers = kmers.items()
kmers.sort(key = itemgetter(1), reverse =True)
return kmers
\ No newline at end of file
from operator import itemgetter
def one_enz_binding_site(dna, enzyme):
"""
return the first position of enzyme binding site in dna
or None if there is not
"""
pos = dna.find(enzyme.sequence)
if pos != -1:
return pos
def one_enz_binding_sites1(dna, enzyme):
"""
return all positions of enzyme binding sites in dna
"""
positions = []
pos = dna.find(enzyme.sequence)
if pos != -1:
positions.append(pos)
positions.extend(one_enz_binding_site1(dna[pos+1:], enzyme))
return positions
def one_enz_binding_sites(dna, enzyme):
"""
return all positions of enzyme binding sites in dna
"""
positions = []
pos = dna.find(enzyme.sequence)
while pos != -1:
positions.append(pos)
pos = dna.find(enzyme.sequence, pos + 1)
return positions
def binding_sites(dna, enzymes):
"""
return all positions of all enzymes binding sites present in dna
sort by the incresing position
"""
positions = []
for enzyme in enzymes:
pos = one_enz_binding_sites(dna, enzyme)
pos = [(enzyme.name, pos) for pos in pos]
positions.extend(pos)
positions.sort(key = itemgetter(1))
return positions
def rev_comp(seq):
"""
return the reverse complement of seq
the sequence must be in lower case
"""
complement = {'a' : 't',
'c' : 'g',
'g' : 'c',
't' : 'a'}
rev_seq = seq[::-1]
rev_comp = ''
for nt in rev_seq:
rev_comp += complement[nt]
return rev_comp
\ No newline at end of file
import string
def rev_comp(seq):
"""
return the reverse complement of seq
the case is respect but if the sequence mix upper and lower case the function will failed
"""
upper = seq.isupper()
reverse = seq[::-1]
direct = 'acgt'
comp = 'tgca'
if upper:
table = string.maketrans(direct.upper(), comp.upper())
else:
table = string.maketrans(direct, comp)
rev_comp = reverse.translate(table)
return rev_comp
\ No newline at end of file
import sys
import os.path
sys.path.insert(0, os.path.join(expanduser('~'), "my_python_lib"))
import matrix
lines = iter( [' A G C U\n',
'A 1.0 0.5 0.0 0.0\n',
'G 0.5 1.0 0.0 0.0\n',
'C 0.0 0.0 1.0 0.5\n',
'U 0.0 0.0 0.5 1.0\n']
)
def parse_similarity_file():
def parse_similarity_file(path):
"""
parse file containing RNA similarity matrix and return a matrix
"""
sim_matrix = matrix.create(4, 4)
#skip first line
lines.next()
for row_no, line in enumerate(lines):
line = line.strip()
fields = line.split()
values = [float(val) for val in fields[1:]]
matrix.replace_row(sim_matrix, row_no, values)
with open(path, 'r') as sim_file:
#skip first line
sim_file.next()
for row_no, line in enumerate(sim_file):
line = line.strip()
fields = line.split()
values = [float(val) for val in fields[1:]]
matrix.replace_row(sim_matrix, row_no, values)
return sim_matrix
def get_similarity(b1, b2, sim_matrix):
"""
:param b1: the first base must be in ('A', 'G', 'C', 'U')
......@@ -40,6 +41,7 @@ def get_similarity(b1, b2, sim_matrix):
if not b2 in bases:
raise KeyError("unknown base b2: " + str(b2))
return matrix.get_cell(sim_matrix, bases[b1], bases[b2])
def compute_similarity(seq1, seq2, sim_matrix):
"""
......@@ -58,12 +60,13 @@ def compute_similarity(seq1, seq2, sim_matrix):
sim = get_similarity(b1, b2, sim_matrix)
similarities.append(sim)
return sum(similarities)
if __name__ == '__main__':
seq1 = 'AGCAUCUA'
seq2 = 'ACCGUUCU'
sim_matrix = parse_similarity_file()
sim_matrix = parse_similarity_file("similarity_matrix")
print matrix.to_str(sim_matrix)
similarity = compute_similarity(seq1, seq2, sim_matrix)
print similarity
\ No newline at end of file
......@@ -19,17 +19,21 @@ genetic_code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
def translate(nuc_seq, code):
prot_seq = ''
start = 0
while (start + 2) < len(nuc_seq):
n = 0
# to avoid to compute len(seq)/3 at each loop
# I compute it once and use a reference
# it could be expensive if the sequence is very long.
cycle = len(nuc_seq)/3
while n < cycle:
start = n * 3
end = start + 3
print start, end
codon = nuc_seq[start:end]
codon = codon.lower()
if codon in code:
prot_seq += code[codon]
else:
raise RuntimeError("unknow codon: " + codon)
start += 3
n += 1
return prot_seq
def translate2(nuc_seq, code, phase = 1):
......@@ -39,7 +43,9 @@ def translate2(nuc_seq, code, phase = 1):
elif -4 < phase < 0:
start = -phase - 1
nuc_seq = nuc_seq[::-1]
while(start + 2) < len(nuc_seq):
# an other way to determine the end of looping
stop_iteration = len(nuc_seq)
while (start + 2) < stop_iteration:
end = start + 3
codon = nuc_seq[start:end].lower()
if codon in code:
......@@ -47,4 +53,4 @@ def translate2(nuc_seq, code, phase = 1):
else:
raise RuntimeError("unknow codon")
start += 3
return prot_seq
\ No newline at end of file
return prot_seq
def uniqify(l):
uniq = []
for item in l:
if item not in uniq:
uniq.append(item)
return (uniq)
\ No newline at end of file
A G C U
A 1.0 0.5 0.0 0.0
G 0.5 1.0 0.0 0.0
C 0.0 0.0 1.0 0.5
U 0.0 0.0 0.5 1.0
......@@ -18,7 +18,7 @@ Contents:
Collection_Data_Types
Logical_Operations
Control_Flow_Statements
Creating_and_Calling_Functions
Dive_into_Functions
Modules_and_Packages