Commit c5581f23 authored by Bertrand NÉRON

Merge branch 'master' of gitlab.pasteur.fr:hub-courses/python_one_week_4_biologists_solutions

parents 2763f420 e986fb63
.. sectnum::
   :start: 5

.. _Collection_Data_types:

*********************
@@ -16,27 +16,27 @@ Exercise

| Draw the representation in memory of the following expressions.
| What is the data type of each object?

::

   x = [1, 2, 3, 4]
   y = x[1]
   y = 3.14
   x[1] = 'foo'

.. figure:: _static/figs/list_1.png
   :width: 400px
   :alt: set
   :figclass: align-center

::

   x = [1, 2, 3, 4]
   x += [5, 6]

.. figure:: _static/figs/augmented_assignment_list.png
   :width: 400px
   :alt: set
   :figclass: align-center

::

@@ -46,121 +46,132 @@ Exercise
   >>> x += [5,6]
   >>> id(x)
   139950507563632
With a mutable object like ``list``, mutating the object modifies its state,
but the reference to the object is unchanged.

Comparison with the exercise on strings and integers:

Since lists are mutable, when ``+=`` is used the original list object is modified, so no rebinding of *x* is necessary.
We can observe this using *id()*, which in CPython gives the memory address of an object. This address does not change after the
``+=`` operation.
.. note::

   Even though the results are the same, there is a subtlety in using an augmented assignment operator.
   In an ``a operator= b`` statement, Python looks up ``a``'s value only once, so it is potentially faster
   than the equivalent ``a = a operator b`` statement.

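The difference between the two forms can be observed on a list. The following is a minimal sketch; the names ``alias``, ``in_place_same`` and ``rebinding_same`` are only illustrative and do not come from the course material.

```python
x = [1, 2, 3, 4]
alias = x                    # a second reference to the same list object

x += [5, 6]                  # augmented assignment: the list is extended in place
in_place_same = x is alias   # True: x still references the original object

x = x + [7]                  # plain assignment: a new list is built, then x is rebound
rebinding_same = x is alias  # False: x now references a different object

print(alias)  # the original object kept the in-place change only
print(x)
```

With ``+=`` the original object is modified, so every reference to it sees the change. With ``x = x + [7]`` only ``x`` is rebound and ``alias`` keeps referencing the old list.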
Compare ::

   x = 3
   y = x
   y += 3
   x = ?
   y = ?

.. figure:: _static/figs/augmented_assignment_int2.png
   :width: 400px
   :alt: augmented_assignment
   :figclass: align-center

and ::

   x = [1,2]
   y = x
   y += [3,4]
   x = ?
   y = ?

.. figure:: _static/figs/augmented_assignment_list2.png
   :width: 400px
   :alt: list extend
   :figclass: align-center

In this example we have two ways to access the list ``[1, 2]``.
If we modify the state of the list itself, but not the references to this object, then the two variables ``x`` and ``y`` still reference the same list, which now contains
``[1, 2, 3, 4]``.

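The two cases compared above can be checked directly in Python. This is only a sketch: the variable names ``a``, ``b``, ``int_result`` and ``list_result`` are introduced here for illustration.

```python
# Integers are immutable: += builds a new object and rebinds y only.
x = 3
y = x
y += 3
int_result = (x, y)            # x is left untouched

# Lists are mutable: += modifies the shared object in place.
a = [1, 2]
b = a
b += [3, 4]
list_result = (a, b, a is b)   # both names still reference one list

print(int_result)
print(list_result)
```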
Exercise
--------

.. note::

   ``sum`` is a function that returns the sum of all the elements of a list.

Without using the Python shell, tell what the effects of the following statements are::

   x = [1, 2, 3, 4]
   x[3] = -4  # What is the value of x now?
   y = sum(x) / len(x)  # What is the value of y? Why?

Solution (using the Python shell ;) )::

   >>> x = [1, 2, 3, 4]
   >>> x[3] = -4
   >>> x
   [1, 2, 3, -4]
   >>> y = sum(x) / len(x)
   >>> y
   0.5

Here, we compute the mean of the values contained in the list ``x``, after having changed its last element to -4.

.. warning::

   In Python 2 the result is ::

      y = 0

   because ``sum(x)`` is an integer and ``len(x)`` is also an integer, so in Python 2.x the result is an integer:
   all the digits after the decimal point are discarded.

Exercise
--------

Draw the representation in memory of the ``x`` and ``y`` variables when the following code is executed::

   x = [1, ['a', 'b', 'c'], 3, 4]
   y = x[1]
   y[2] = 'z'
   # What is the value of x?

.. figure:: _static/figs/list_2-1.png
   :width: 400px
   :alt: set
   :figclass: align-center

.. container:: clearer

   .. image:: _static/figs/spacer.png

When we execute *y = x[1]*, we create ``y``, which references the list ``['a', 'b', 'c']``.
This list has two references to it: ``y`` and ``x[1]``.

.. figure:: _static/figs/list_2-2.png
   :width: 400px
   :alt: set
   :figclass: align-center

.. container:: clearer

   .. image:: _static/figs/spacer.png

This object is a list, so it is mutable.
We can therefore access **and** modify it in two ways, through ``y`` or through ``x[1]`` ::

   x = [1, ['a','b','z'], 3, 4]

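The shared reference can be verified with the ``is`` operator. The sketch below also contrasts mutation with rebinding; the names ``same_object`` and ``x_after_rebinding`` are illustrative only.

```python
x = [1, ['a', 'b', 'c'], 3, 4]
y = x[1]

same_object = y is x[1]   # y and x[1] reference the very same inner list

y[2] = 'z'                # mutation through y is visible through x[1]
print(x)

y = ['other', 'list']     # rebinding y does NOT touch the inner list of x
x_after_rebinding = x
print(x_after_rebinding)
```

Mutating the inner list changes what both references see, whereas assigning a new object to ``y`` only changes what ``y`` references.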
Exercise
--------

@@ -177,17 +188,23 @@ or ::

Exercise
--------

.. note::

   A codon is a triplet of nucleotides.
   A nucleotide can be one of the four letters A, C, G, T.

Write a function that returns a list containing strings representing all possible codons.

Write the pseudocode before proposing an implementation.

pseudocode:
"""""""""""

| *function all_codons()*
|     *all_codons <- empty list*
|     *vary the first base*
|     *for each first base, vary the second base*
|     *for each combination of first base and second base, vary the third base*
|     *add the concatenation base 1 base 2 base 3 to all_codons*
|     *return all_codons*

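The pseudocode above can be sketched with three nested loops. This is one possible transcription, not necessarily the content of the ``codons.py`` file distributed with the course; the lowercase alphabet ``'acgt'`` is an assumption made to match the sequences used elsewhere in these exercises.

```python
def all_codons():
    """Return a list of the 64 possible codons as 3-letter strings."""
    nucleotides = 'acgt'  # assumed lowercase alphabet
    codons = []
    for base_1 in nucleotides:              # vary the first base
        for base_2 in nucleotides:          # for each first base, vary the second
            for base_3 in nucleotides:      # for each pair, vary the third
                codons.append(base_1 + base_2 + base_3)
    return codons


codons = all_codons()
print(len(codons))
print(codons[:4])
```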
@@ -201,14 +218,14 @@ first implementation:
   python -i codons.py
   >>> codons = all_codons()

:download:`codons.py <_static/code/codons.py>`.

second implementation:
""""""""""""""""""""""

Mathematically speaking, the generation of all codons can be seen as the Cartesian product
of 3 copies of the vector 'acgt'.
In Python there is a function to do that in the ``itertools`` module: `product <https://docs.python.org/3/library/itertools.html#itertools.product>`_

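A minimal sketch of the Cartesian product approach follows. It is an illustration under the same ``'acgt'`` assumption and may differ from the ``codons_itertools.py`` file provided with the course.

```python
from itertools import product


def all_codons():
    """Return the 64 codons, built as the Cartesian product 'acgt' x 'acgt' x 'acgt'."""
    # product yields tuples such as ('a', 'c', 'g'); join turns each one into 'acg'.
    return [''.join(triplet) for triplet in product('acgt', repeat=3)]


codons = all_codons()
print(len(codons))
```

``product('acgt', repeat=3)`` yields the triplets in the same order as three nested loops would, so both implementations return the same list.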
@@ -220,14 +237,14 @@ In python there is a function to do that in ``itertools module``: `https://docs.
   python -i codons.py
   >>> codons = all_codons()

:download:`codons_itertools.py <_static/code/codons_itertools.py>`.

Exercise
--------

From a list, return a new list without any duplicates, regardless of the order of the items.
For example: ::

   >>> l = [5,2,3,2,2,3,5,1]
@@ -268,7 +285,7 @@ If we plan to use ``uniqify`` with large list we should find a better algorithm.
In the specification we can read that uniqify can work *regardless of the order of the resulting list*.
So we can use the fact that a set cannot contain duplicates ::

   >>> list(set(l))
@@ -277,18 +294,18 @@ Exercise
We need to compute the number of occurrences of all kmers of a given length present in a sequence.

Below we propose 2 algorithms.

pseudo code 1
"""""""""""""

| *function get_kmer_occurences(seq, kmer_len)*
|     *all_kmers <- generate all possible kmers of kmer_len*
|     *occurences <- 0*
|     *for each kmer in all_kmers*
|         *count the occurrences of kmer*
|         *store the count*

pseudo code 2
"""""""""""""

@@ -297,29 +314,29 @@ pseudo code 2
|     *from i = 0 to sequence length - kmer_len*
|         *kmer <- kmer starting at pos i in sequence*
|         *increase by 1 the number of occurrences of kmer*

.. note::

   Computer scientists typically measure an algorithm’s efficiency in terms of its worst-case running time,
   which is the largest amount of time an algorithm can take given the most difficult input of a fixed size.
   The advantage of considering the worst-case running time is that we are guaranteed that our algorithm
   will never behave worse than our worst-case estimate.

   Big-O notation compactly describes the running time of an algorithm.
   For example, if your algorithm for sorting an array of n numbers takes roughly n\ :sup:`2` operations for the most difficult dataset,
   then we say that the running time of your algorithm is O(n\ :sup:`2`). In reality, depending on your implementation, it may use any number of operations,
   such as 1.5n\ :sup:`2`, n\ :sup:`2` + n + 2, or 0.5n\ :sup:`2` + 1; all these algorithms are O(n\ :sup:`2`) because big-O notation only cares about the term that grows the fastest with
   respect to the size of the input. This is because as n grows very large, the difference in behavior between two O(n\ :sup:`2`) functions,
   like 999 · n\ :sup:`2` and n\ :sup:`2` + 3n + 9999999, is negligible when compared to the behavior of functions from different classes,
   say O(n\ :sup:`2`) and O(n\ :sup:`6`). Of course, we would prefer an algorithm requiring 1/2 · n\ :sup:`2` steps to an algorithm requiring 1000 · n\ :sup:`2` steps.

   When we write that the running time of an algorithm is O(n\ :sup:`2`), we technically mean that it does not grow faster than a function with a
   leading term of c · n\ :sup:`2`, for some constant c. Formally, a function f(n) is Big-O of function g(n), or O(g(n)), when f(n) <= c · g(n) for some
   constant c and sufficiently large n.

   For more on Big-O notation, see `A Beginner's Guide to Big-O Notation <http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/>`_.

Compare the pseudocode of each of them and implement the fastest one. ::

@@ -334,6 +351,7 @@ Compare the pseudocode of each of them and implement the fastest one. ::
   acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca"""

In the first algorithm:

| we first compute all kmers: we generate 4\ :sup:`kmer length` kmers,
| then we count the occurrences of each kmer in the sequence,
| so for each kmer we read the whole sequence, and the algorithm is in O(4\ :sup:`kmer length` * ``sequence length``).

In the second algorithm:

| we read the sequence only once,
| so the algorithm is in O(sequence length).

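The two strategies can be compared side by side. The following is a sketch only: the function names ``count_by_enumeration`` and ``count_by_sliding_window`` are hypothetical, and the code is not taken from the ``kmer.py`` file of the course. Both functions return a dictionary mapping each kmer found in the sequence to its number of occurrences.

```python
from itertools import product


def count_by_enumeration(seq, kmer_len):
    """Algorithm 1: generate the 4**kmer_len kmers, then scan seq once per kmer."""
    occurrences = {}
    for letters in product('acgt', repeat=kmer_len):
        kmer = ''.join(letters)
        count = 0
        # Explicit scan so that overlapping occurrences are counted too.
        for i in range(len(seq) - kmer_len + 1):
            if seq[i:i + kmer_len] == kmer:
                count += 1
        if count:
            occurrences[kmer] = count
    return occurrences


def count_by_sliding_window(seq, kmer_len):
    """Algorithm 2: read seq once and count each kmer as it is met."""
    occurrences = {}
    for i in range(len(seq) - kmer_len + 1):
        kmer = seq[i:i + kmer_len]
        occurrences[kmer] = occurrences.get(kmer, 0) + 1
    return occurrences


toy_seq = 'acgtacgt'  # made-up sequence, for illustration only
print(count_by_sliding_window(toy_seq, 3))
```

Both functions give the same counts, but the first one repeats the scan of the sequence for each of the 4\ :sup:`kmer length` candidate kmers, while the second one visits each position only once.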
@@ -369,9 +396,9 @@ Compute the 6 mers occurences of the sequence above, and print each 6mer and it'
   aacttc .. 1
   gcaact .. 1
   aaatat .. 2

:download:`kmer.py <_static/code/kmer.py>`.

bonus:
@@ -406,9 +433,9 @@ Print the kmers by ordered by occurences.
   aggaaa .. 4
   ttctga .. 3
   ccagtg .. 3

:download:`kmer_2.py <_static/code/kmer_2.py>`.

Exercise
@@ -438,10 +465,10 @@ pseudocode:
   >>> seq = 'acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca'
   >>> print(rev_comp(seq))
   tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt

:download:`rev_comp.py <_static/code/rev_comp.py>`.

other solution
""""""""""""""

@@ -454,9 +481,9 @@ to change, the second string the corresponding characters in the new string.
Thus the two strings **must** have the same length. The correspondence between
the characters to change and their new values is determined by their position:
the first character of the first string will be replaced by the first character of the second string,
the second character of the first string will be replaced by the second character of the second string, and so on.
So we can write the reverse complement without a loop.

.. literalinclude:: _static/code/rev_comp2.py
   :linenos:
   :language: python

@@ -467,7 +494,7 @@ So we can write the reverse complement without loop.
   >>> seq = 'acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca'
   >>> print(rev_comp(seq))
   tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt

:download:`rev_comp2.py <_static/code/rev_comp2.py>`.

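The position-wise replacement described above can be sketched with ``str.maketrans`` and ``str.translate`` from Python 3. This is an illustration assuming a lowercase sequence made only of ``a``, ``c``, ``g`` and ``t``; it is not a copy of ``rev_comp2.py``.

```python
def rev_comp(seq):
    """Return the reverse complement of a lowercase DNA sequence."""
    # The two strings have the same length: 'a'->'t', 'c'->'g', 'g'->'c', 't'->'a'.
    complement_table = str.maketrans('acgt', 'tgca')
    # translate complements every base, the [::-1] slice reverses the result.
    return seq.translate(complement_table)[::-1]


print(rev_comp('acgg'))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.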
Exercise
@@ -477,7 +504,7 @@ let the following enzymes collection:
We decide to implement enzymes as tuples with the following structure
("name", "comment", "sequence", "cut", "end")
::

   ecor1 = ("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
   ecor5 = ("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt")
@@ -513,22 +540,22 @@ and the 2 dna fragments: ::
#. use the functions above to compute the enzymes which cut the dna_1
   apply the same functions to compute the enzymes which cut the dna_2
   compute the difference between the enzymes which cut the dna_1 and enzymes which cut the dna_2

.. literalinclude:: _static/code/enzyme_1.py
   :linenos:
   :language: python

::

   from enzyme_1 import *

   enzymes = [ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1]
   dna_1 = one_line(dna_1)
   dna_2 = one_line(dna_2)
   enz_1 = enz_filter(enzymes, dna_1)
   enz_2 = enz_filter(enzymes, dna_2)
   enz1_only = set(enz_1) - set(enz_2)

:download:`enzymes_1.py <_static/code/enzyme_1.py>`.

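The filtering and the set difference used above can be illustrated on a toy example. The bodies of ``one_line`` and ``enz_filter`` below are assumptions about what these functions do (the content of ``enzyme_1.py`` is not reproduced here), and the two DNA fragments ``dna_a`` and ``dna_b`` are made up for the illustration.

```python
ecor1 = ("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
ecor5 = ("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt")


def one_line(seq):
    """Assumed behaviour: return seq on one line, without newline characters."""
    return seq.replace('\n', '')


def enz_filter(enzymes, dna):
    """Assumed behaviour: return the enzymes whose binding site is found in dna."""
    # The binding site is the third item of the tuple (index 2).
    return [enz for enz in enzymes if enz[2] in dna]


# Hypothetical fragments: dna_a contains both sites, dna_b only the EcoRV one.
dna_a = one_line("ccgaattcgg\nttgatatcaa")
dna_b = one_line("aagatatctt")

enz_a = enz_filter([ecor1, ecor5], dna_a)
enz_b = enz_filter([ecor1, ecor5], dna_b)
enz_a_only = set(enz_a) - set(enz_b)   # tuples are hashable, so they can be put in a set
print(enz_a_only)
```

Because the enzymes are tuples, and therefore immutable and hashable, they can be stored in sets and compared with the set difference operator.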
With this algorithm we can find whether an enzyme cuts the dna, but we cannot find all the cuts in the dna for an enzyme. ::

@@ -569,7 +596,7 @@ The code must be adapted as below
   :linenos: