### add pseudocode for most of exercises

```adapt solutions to use functions as they are view earlier
externalize code to be dowloadable for student```
parent 2192b4c9
This diff is collapsed.
 ... ... @@ -11,7 +11,6 @@ Exercises Exercise -------- Calculates the 10 first number of the Fibonacci sequence . The Fibonacci sequence are the numbers in the following integer sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ... ... ... @@ -24,35 +23,162 @@ The fibonacci suite can be defined as following: | | F\ :sub:`n` = F\ :sub:`n-1` + F\ :sub:`n-2` Write a function which take an integer ``n`` as parameter and returns a list containing the ``n`` first number of the Fibonacci sequence. .. literalinclude:: _static/code/fibonacci_iteration.py :linenos: :language: python :download:`fibonacci_iteration.py <_static/code/fibonacci_iteration.py>` . We will see another way more elegant to implement the fibonacci suite in :ref:`Advance Programming Techniques` section. Exercise -------- display the largest element in list (containing float or integer only)?:: Reimplement your own function max (my_max). This function will take a list or tuple of float or integer and returns the largest element? l = [1,2,3,4,58,9] higest = l for i in l: if i > highest: highest = i print highest Write the pseudocode before to propose an implementation. pseudocode ^^^^^^^^^^ | *function my_max(l)* | *max <- first elt of l* | *for each elts of l* | *if elt is > max* | *max <- elt* | *return max* implementation ^^^^^^^^^^^^^^ :: def my_max(seq): """ return the maximum value in a sequence work only with integer or float """ higest = seq for i in seq: if i > highest: highest = i return highest l = [1,2,3,4,58,9] print my_max(l) 58 Exercise -------- let the following enzymes collection: :: | We want to establish a restriction map of a sequence. | But we will do this step by step. | and reuse the enzymes used in previous chapter: * create a function that take a sequence and an enzyme as parameter and return the position of first binding sites. (write the pseudocode) **pseudocode** | *function one_enz_binding_site(dna, enzyme)* | *if enzyme binding site is substring of dna* | *return of first position of substring in dna* **implementation** .. literalinclude:: _static/code/restriction.py :linenos: :lines: 1-8 :language: python * improve the previous function to return all positions of binding sites **pseudocode of first algorithm** | *function one_enz_binding_sites(dna, enzyme)* | *positions <- empty* | *if enzyme binding site is substring of dna* | *add the position of the first substring in dna in positions* | *positions <- find binding_sites in rest of dna sequence* | *return positions* **implementation** .. literalinclude:: _static/code/restriction.py :linenos: :lines: 9-16 :language: python **pseudocode of second algorithm** | *function one_enz_binding_sites(dna, enzyme)* | *positions <- empty* | *find first position of binding site in dna* | *while we find binding site in dna* | *add position of binding site to positions* | *find first position of binding site in dna in rest of dna* | *return positions* **implementation** .. literalinclude:: _static/code/restriction.py :linenos: :lines: 19-25 :language: python search all positions of Ecor1 binding sites in dna_1 :: import collections RestrictEnzyme = collections.namedtuple("RestrictEnzyme", "name comment sequence cut end") ecor1 = RestrictEnzyme("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky") dna_1 = """tcgcgcaacgtcgcctacatctcaagattcagcgccgagatccccgggggttgagcgatccccgtcagttggcgtgaattcag cagcagcgcaccccgggcgtagaattccagttgcagataatagctgatttagttaacttggatcacagaagcttccaga ccaccgtatggatcccaacgcactgttacggatccaattcgtacgtttggggtgatttgattcccgctgcctgccagg""" * generalize the binding sites function to take a list of enzymes and return a list of tuple (enzyme name, position) **pseudocode** | *function binding_sites(dna, set of enzymes)* | *positions <- empty* | *for each enzyme in enzymes* | *pos <- one_enz_binding_sites(dna, enzyme)* | *pos <- for each position create a tuple enzyme name, position* | *positions <- pos* | *return positions* **implementation** in bonus we can try to sort the list in the order of the position of the binding sites like this: [('Sau3aI', 38), ('SmaI', 42), ('Sau3aI', 56), ('EcoRI', 75), ... .. literalinclude:: _static/code/restriction.py :linenos: :lines: 27- :language: python :: import collections RestrictEnzyme = collections.namedtuple("RestrictEnzyme", "name comment sequence cut end") ecor1 = RestrictEnzyme("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky") ecor5 = RestrictEnzyme("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt") bamh1 = RestrictEnzyme("BamHI", "type II restriction endonuclease from Bacillus amyloliquefaciens ", "ggatcc", 1, "sticky") hind3 = RestrictEnzyme("HindIII", "type II site-specific nuclease from Haemophilus influenzae", "aagctt", 1 , "sticky") ... ... @@ -72,77 +198,17 @@ and the 2 dna fragments: :: attcagtggcttcagaattcctcccgggagaagctgaatagtgaaacgattgaggtgttgtggtgaaccgagtaag agcagcttaaatcggagagaattccatttactggccagggtaagagttttggtaaatatatagtgatatctggcttg""" | which enzymes cut the dna_1 get the name of the enzymes and all their positions of binding site? | the dna_2 ? | the dna_1 but not the dna_2? :: enzymes= (ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1) binding_sites(dna_1, enzymes) [('Sau3aI', 38), ('SmaI', 42), ('Sau3aI', 56), ('EcoRI', 75), ('SmaI', 95), ('EcoRI', 105), ('Sau3aI', 144), ('HindIII', 152), ('BamHI', 173), ('Sau3aI', 174), ('BamHI', 193), ('Sau3aI', 194)] dna_1 = dna_1.replace('\n', '') dans_2 = dna_2.replace('\n', '') binding_sites(dna_2, enzymes) [('EcoRI', 11), ('NotI', 33), ('HaeIII', 35), ('EcoRI', 98), ('SmaI', 106), ('EcoRI', 179), ('HaeIII', 193), ('EcoRV', 225)] enzymes = [ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1] digest_1 = [] for enz in enzymes: pos = dna_1.find(enz.sequence) if pos != -1: digest_1.append(enz) with this first algorithm we find if an enzyme cut the dna but we cannot find all cuts in the dna for an enzyme. If we find a cutting site, we must search again starting at the first nucleotid after the begining of the match until the end of the the dna, for this we use the start parameter of the find function, and so on. As we don't know how many loop we need to scan the dna until the end we use a ``while`` loop testing for the presence of a cutting site.:: digest_1 = [] for enz in enzymes: pos = dna_1.find(enz.sequence) while pos != -1: digest_1.append(enz) pos = dna_1.find(enz.sequence, pos + 1) digest_2 = [] for enz in enzymes: pos = dna_2.find(enz.sequence) while pos != -1: digest_2.append(enz) pos = dna_2.find(enz.sequence, pos + 1) cut_dna_1 = set(digest_1) cut_dna_2 = set(digest_2) cut_dna_1_not_dna_2 = cut_dna_1 - cut_dna_2 but we want also the position, for instance to compute the fragments of dna. :: digest_1 = [] for enz in enzymes: pos = dna_1.find(enz.sequence) while pos != -1: digest_1.append((enz, pos)) pos = dna_1.find(enz.sequence, pos + 1) #if we want to sort the list in function of their positions in the sequence from operator import itemgetter digest_1.sort(key=itemgetter(1)) print [(e.name, pos) for e, pos in digest_1] :download:`restriction.py <_static/code/restriction.py>` . digest_2 = [] for enz in enzymes: pos = dna_2.find(enz.sequence) while pos != -1: digest_2.append((enz, pos)) pos = dna_2.find(enz.sequence, pos + 1) print "list of all enzymes cutting dna 1 and theirs position in dna1 :", [(e.name, pos) for e, pos in digest_1] print "list of all enzymes cutting dna 2 and theirs position in dna2 :", [(e.name, pos) for e, pos in digest_2] cut_dna_1 = set([e.name for e, pos in digest_1]) cut_dna_2 = set([e.name for e, pos in digest_2]) cut_dna_1_not_dna_2 = cut_dna_1 - cut_dna_2 Exercise -------- ... ...
 ... ... @@ -195,7 +195,7 @@ without header or any non sequence characters pseudocode: | *fasta_to_one_line(seq)* | *function fasta_to_one_line(seq)* | *header_end_at <- find the first return line character* | *raw_seq <- remove header from sequence* | *raw_seq <- remove non sequence chars* ... ...
 .. _Creating_and_Calling_Functions: .. _Dive_into_Functions: ****************************** Creating and Calling Functions ****************************** ******************* Dive into Functions ******************* Exercises ========= ... ... @@ -105,7 +105,7 @@ As soon as you make an assignment to a variable in a scope, that variable becomes local to that scope and shadows any similarly named variable in the outer scope. even if the assignment appear later in code. Here *x = y* make *x* as local variable whatever you are in func. so at line *y = x + 2* we try to use the loca variable *x* but we have to asign it a value (it is done later) so so at line *y = x + 2* we try to use the local variable *x* but we have to asign it a value (it is done later) so Python raise an UnboundLocalError (`see python faq for details `_) .. container:: clearer ... ...
 ... ... @@ -27,22 +27,13 @@ Exercise Write a program that calculates the similarity of 2 RNA sequences. * To compute the simalirity you need to parse a file containing the similarity matrix. * To compute the simalirity you need to parse a file containing the :download:`similarity matrix <_static/data/similarity_matrix>`. **Hint**: use the module containing the functions that handle a matrix from previous chapter. put this matrix.py file in a directory named "my_python_lib" in your home or Desktop and import it in your current program (the similarity script must be placed elsewhere). * The similarity of the 2 sequences is the sum of base similarities. so you have to compare the first base of to sequence and use the matrix to get the similarity so you have to compare the first base of two sequences and use the matrix to get the similarity from the similarity table, on so on for all bases then sum these similarities. .. note:: as we don't yet see how to read a file, we provide a list of strings that represents the file as we can get them if we read that file. :: lines = iter([' A G C U\n' 'A 1.0 0.5 0.0 0.0\n', 'G 0.5 1.0 0.0 0.0\n', 'C 0.0 0.0 1.0 0.5\n', 'U 0.0 0.0 0.5 1.0\n']) .. literalinclude:: _static/code/similarity.py :linenos: ... ...
 def all_codons(): all_codons = [] alphabet = 'acgt' for base_1 in alphabet: for base_2 in alphabet: for base_3 in alphabet: codon = base_1 + base_2 + base_3 all_codons.append(codon) return all_codons \ No newline at end of file
 from itertools import product def all_codons(): alphabet = 'acgt' all_codons = [ ''.join(codon) for codon in product(alphabet, repeat = 3)] return all_codons \ No newline at end of file
 def one_line(seq): return seq.replace('\n', '') def enz_filter(enzymes, dna): cuting_enz = [] for enz in enzymes: if enz.sequence in dna: cuting_enz.append(enz) return cuting_enz
 import collections def kmer(sequence, k): kmers = collection.defaultdict(int) for i in range(len(sequence) - k): kmer = sequence[i:i + k] def get_kmer_occurences(seq, kmer_len): """ return a list of tuple each tuple contains a kmers present in seq and its occurence """ kmers = {} stop = len(seq) - kmer_len for i in range(stop + 1): kmer = s[i : i + kmer_len] kmers[kmer] = kmers.get(kmer, 0) + 1 return kmers \ No newline at end of file return kmers.items() \ No newline at end of file
 import collections from operator import itemgetter def get_kmer_occurences(seq, kmer_len): """ return a list of tuple each tuple contains a kmers present in seq and its occurence """ kmers = collections.defaultdict(int) stop = len(seq) - kmer_len for i in range(stop + 1): kmer = s[i : i + kmer_len] kmers[kmer] += 1 kmers = kmers.items() kmers.sort(key = itemgetter(1), reverse =True) return kmers \ No newline at end of file
 from operator import itemgetter def one_enz_binding_site(dna, enzyme): """ return the first position of enzyme binding site in dna or None if there is not """ pos = dna.find(enzyme.sequence) if pos != -1: return pos def one_enz_binding_sites1(dna, enzyme): """ return all positions of enzyme binding sites in dna """ positions = [] pos = dna.find(enzyme.sequence) if pos != -1: positions.append(pos) positions.extend(one_enz_binding_site1(dna[pos+1:], enzyme)) return positions def one_enz_binding_sites(dna, enzyme): """ return all positions of enzyme binding sites in dna """ positions = [] pos = dna.find(enzyme.sequence) while pos != -1: positions.append(pos) pos = dna.find(enzyme.sequence, pos + 1) return positions def binding_sites(dna, enzymes): """ return all positions of all enzymes binding sites present in dna sort by the incresing position """ positions = [] for enzyme in enzymes: pos = one_enz_binding_sites(dna, enzyme) pos = [(enzyme.name, pos) for pos in pos] positions.extend(pos) positions.sort(key = itemgetter(1)) return positions
 def rev_comp(seq): """ return the reverse complement of seq the sequence must be in lower case """ complement = {'a' : 't', 'c' : 'g', 'g' : 'c', 't' : 'a'} rev_seq = seq[::-1] rev_comp = '' for nt in rev_seq: rev_comp += complement[nt] return rev_comp \ No newline at end of file
 import string def rev_comp(seq): """ return the reverse complement of seq the case is respect but if the sequence mix upper and lower case the function will failed """ upper = seq.isupper() reverse = seq[::-1] direct = 'acgt' comp = 'tgca' if upper: table = string.maketrans(direct.upper(), comp.upper()) else: table = string.maketrans(direct, comp) rev_comp = reverse.translate(table) return rev_comp \ No newline at end of file
 import sys import os.path sys.path.insert(0, os.path.join(expanduser('~'), "my_python_lib")) import matrix lines = iter( [' A G C U\n', 'A 1.0 0.5 0.0 0.0\n', 'G 0.5 1.0 0.0 0.0\n', 'C 0.0 0.0 1.0 0.5\n', 'U 0.0 0.0 0.5 1.0\n'] ) def parse_similarity_file(): def parse_similarity_file(path): """ parse file containing RNA similarity matrix and return a matrix """ sim_matrix = matrix.create(4, 4) #skip first line lines.next() for row_no, line in enumerate(lines): line = line.strip() fields = line.split() values = [float(val) for val in fields[1:]] matrix.replace_row(sim_matrix, row_no, values) with open(path, 'r') as sim_file: #skip first line sim_file.next() for row_no, line in enumerate(sim_file): line = line.strip() fields = line.split() values = [float(val) for val in fields[1:]] matrix.replace_row(sim_matrix, row_no, values) return sim_matrix def get_similarity(b1, b2, sim_matrix): """ :param b1: the first base must be in ('A', 'G', 'C', 'U') ... ... @@ -40,6 +41,7 @@ def get_similarity(b1, b2, sim_matrix): if not b2 in bases: raise KeyError("unknown base b2: " + str(b2)) return matrix.get_cell(sim_matrix, bases[b1], bases[b2]) def compute_similarity(seq1, seq2, sim_matrix): """ ... ... @@ -58,12 +60,13 @@ def compute_similarity(seq1, seq2, sim_matrix): sim = get_similarity(b1, b2, sim_matrix) similarities.append(sim) return sum(similarities) if __name__ == '__main__': seq1 = 'AGCAUCUA' seq2 = 'ACCGUUCU' sim_matrix = parse_similarity_file() sim_matrix = parse_similarity_file("similarity_matrix") print matrix.to_str(sim_matrix) similarity = compute_similarity(seq1, seq2, sim_matrix) print similarity \ No newline at end of file
 ... ... @@ -19,17 +19,21 @@ genetic_code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C', def translate(nuc_seq, code): prot_seq = '' start = 0 while (start + 2) < len(nuc_seq): n = 0 # to avoid to compute len(seq)/3 at each loop # I compute it once and use a reference # it could be expensive if the sequence is very long. cycle = len(nuc_seq)/3 while n < cycle: start = n * 3 end = start + 3 print start, end codon = nuc_seq[start:end] codon = codon.lower() if codon in code: prot_seq += code[codon] else: raise RuntimeError("unknow codon: " + codon) start += 3 n += 1 return prot_seq def translate2(nuc_seq, code, phase = 1): ... ... @@ -39,7 +43,9 @@ def translate2(nuc_seq, code, phase = 1): elif -4 < phase < 0: start = -phase - 1 nuc_seq = nuc_seq[::-1] while(start + 2) < len(nuc_seq): # an other way to determine the end of looping stop_iteration = len(nuc_seq) while (start + 2) < stop_iteration: end = start + 3 codon = nuc_seq[start:end].lower() if codon in code: ... ... @@ -47,4 +53,4 @@ def translate2(nuc_seq, code, phase = 1): else: raise RuntimeError("unknow codon") start += 3 return prot_seq \ No newline at end of file return prot_seq
 def uniqify(l): uniq = [] for item in l: if item not in uniq: uniq.append(item) return (uniq) \ No newline at end of file
 A G C U A 1.0 0.5 0.0 0.0 G 0.5 1.0 0.0 0.0 C 0.0 0.0 1.0 0.5 U 0.0 0.0 0.5 1.0

28.8 KB

34.6 KB

 ... ... @@ -18,7 +18,7 @@ Contents: Collection_Data_Types Logical_Operations Control_Flow_Statements Creating_and_Calling_Functions Dive_into_Functions Modules_and_Packages Input_Output