Commit c5581f23 authored by Bertrand NÉRON

Merge branch 'master' of gitlab.pasteur.fr:hub-courses/python_one_week_4_biologists_solutions

parents 2763f420 e986fb63
Pipeline #58155 passed with stages
in 13 seconds
...@@ -47,25 +47,22 @@ Exercise ...@@ -47,25 +47,22 @@ Exercise
>>> id(x) >>> id(x)
139950507563632 139950507563632
With mutable object like ``list`` when we mutate the object the state of the object is modified. With mutable object like ``list``, when we mutate the object, the state of the object is modified.
But the reference to the object is still unchanged. But the reference to the object is still unchanged.
So in this example we have two ways to access to the list [1,2] if we modify the state of the list itself.
but not the references to this object, then the 2 variables x and y still reference the list containing
[1,2,3,4].
compare with the exercise on string and integers: Comparison with the exercise on strings and integers:
Since list are mutable, when ``+=`` is used the original list object is modified, so no rebinding of *x* is necessary. Since lists are mutable, when ``+=`` is used, the original list object is modified, so no rebinding of *x* is necessary.
We can observe this using *id()* which give the memory address of an object. This address does not change after the We can observe this using *id()* which gives the memory address of an object. This address does not change after the
``+=`` operation. ``+=`` operation.
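The difference between mutable and immutable objects can be checked with a minimal sketch like the one below. The variable names (``id_before``, ``list_same_object``, ``n``, ``m``) are ours and are used only for illustration.

```python
# Lists are mutable: += extends the existing object in place.
x = [1, 2]
y = x                      # x and y reference the same list
id_before = id(x)
x += [3, 4]
list_same_object = id(x) == id_before   # no rebinding of x happened
y_sees_change = y == [1, 2, 3, 4]       # y references the mutated list

# Integers are immutable: += builds a new object and rebinds the name.
n = 3
m = n
n += 1
int_rebound = (n, m) == (4, 3)          # m still references the old value

print(list_same_object, y_sees_change, int_rebound)
```

All three checks should be true: the list keeps its identity, whereas the integer name is simply rebound to a new object.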
.. note:: .. note::
even the results is the same there is a subtelty to use augmented operator. Even if the results are the same, there is a subtlety in using an augmented operator.
in ``a operator= b`` python looks up ``a``s value only once, so it is potentially faster In an ``a operator= b`` operation, Python looks up ``a``'s value only once, so it is potentially faster
than the ``a = a operator b``. than the ``a = a operator b`` operation.
compare :: Compare ::
x = 3 x = 3
y = x y = x
...@@ -95,41 +92,55 @@ and :: ...@@ -95,41 +92,55 @@ and ::
:figclass: align-center :figclass: align-center
In this example we have two ways to access to the list ``[1, 2]``.
If we modify the state of the list itself, but not the references to this object, then the two variables ``x`` and ``y`` still reference the list containing
``[1, 2, 3, 4]``.
Exercise Exercise
-------- --------
wihout using python shell, what is the results of the following statements:
.. note:: .. note::
sum is a function which return the sum of each elements of a list. ``sum`` is a function that returns the sum of all the elements of a list.
Without using the Python shell, tell what the effects of the following statements are::
::
x = [1, 2, 3, 4] x = [1, 2, 3, 4]
x[3] = -4 # what is the value of x now ? x[3] = -4 # What is the value of x now?
y = sum(x)/len(x) #what is the value of y ? why ? y = sum(x) / len(x) # What is the value of y? Why?
Solution (using the Python shell ;) )::
>>> x = [1, 2, 3, 4]
>>> x[3] = -4
>>> x
[1, 2, 3, -4]
>>> y = sum(x) / len(x)
>>> y
0.5
y = 0.5 Here, we compute the mean of the values contained in the list ``x``, after having changed its last element to -4.
.. warning::
In python2 the result is :: .. .. warning::
y = 0 .. In python2 the result is ::
because sum(x) is an integer, len(x) is also an integer so in python2.x the result is an integer, .. y = 0
.. because sum(x) is an integer, len(x) is also an integer so in python2.x the result is an integer,
all the digits after the periods are discarded. all the digits after the periods are discarded.
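As a small check of the solution, the sketch below recomputes the mean with true division (``/``) and, for comparison, with floor division (``//``). The comparison with ``//`` is our addition: in Python 3 it is the explicit way to ask for an integer quotient.

```python
x = [1, 2, 3, 4]
x[3] = -4                    # x is now [1, 2, 3, -4]

y = sum(x) / len(x)          # true division: 2 / 4
floor_y = sum(x) // len(x)   # floor division: the fractional part is dropped

print(y, floor_y)
```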
Exercise Exercise
-------- --------
Draw the representation in memory of the following expressions. :: Draw the representation in memory of the ``x`` and ``y`` variables when the following code is executed::
x = [1, ['a','b','c'], 3, 4] x = [1, ['a', 'b', 'c'], 3, 4]
y = x[1] y = x[1]
y[2] = 'z' y[2] = 'z'
# what is the value of x ? # What is the value of x?
.. figure:: _static/figs/list_2-1.png .. figure:: _static/figs/list_2-1.png
:width: 400px :width: 400px
...@@ -141,7 +152,7 @@ Draw the representation in memory of the following expressions. :: ...@@ -141,7 +152,7 @@ Draw the representation in memory of the following expressions. ::
.. image :: _static/figs/spacer.png .. image :: _static/figs/spacer.png
When we execute *y = x[1]*, we create ``y`` wich reference the list ``['a', 'b', 'c']``. When we execute *y = x[1]*, we create ``y`` which references the list ``['a', 'b', 'c']``.
This list has 2 references on it: ``y`` and ``x[1]`` . This list has 2 references on it: ``y`` and ``x[1]`` .
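The shared reference can be verified with the ``is`` operator, as in this minimal sketch (the name ``same_object`` is ours):

```python
x = [1, ['a', 'b', 'c'], 3, 4]
y = x[1]                  # y and x[1] reference the same inner list
same_object = y is x[1]   # identity check, not just equality

y[2] = 'z'                # mutate the inner list through y
print(same_object)
print(x)                  # the change is visible through x as well
```

Because no copy is made by ``y = x[1]``, mutating the inner list through ``y`` also changes what ``x`` contains.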
...@@ -178,16 +189,22 @@ or :: ...@@ -178,16 +189,22 @@ or ::
Exercise Exercise
-------- --------
generate a list containing all codons. .. note::
A codon is a triplet of nucleotides.
A nucleotide can be one of the four letters A, C, G, T
Write a function that returns a list containing strings representing all possible codons.
Write the pseudocode before proposing an implementation.
pseudocode: pseudocode:
""""""""""" """""""""""
| *function all_codons()* | *function all_codons()*
| *all_codons <- empty list* | *all_codons <- empty list*
| *let varying the first base* | *let vary the first base*
| *for each first base let varying the second base* | *for each first base let vary the second base*
| *for each combination first base, second base let varying the third base* | *for each combination first base, second base let vary the third base*
| *add the concatenation base 1 base 2 base 3 to all_codons* | *add the concatenation base 1 base 2 base 3 to all_codons*
| *return all_codons* | *return all_codons*
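The pseudocode above could be translated as three nested loops. This is only a sketch under our own assumptions (uppercase letters, bases taken in the order A, C, G, T); it is not necessarily identical to the ``codons.py`` file distributed with the course.

```python
def all_codons():
    """Return a list of the strings representing all possible codons."""
    bases = 'ACGT'
    codons = []
    for base_1 in bases:              # vary the first base
        for base_2 in bases:          # for each first base, vary the second
            for base_3 in bases:      # for each pair, vary the third
                codons.append(base_1 + base_2 + base_3)
    return codons


codons = all_codons()
print(len(codons))
```

With 4 possible letters at each of the 3 positions, the list should contain 4 * 4 * 4 = 64 codons.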
...@@ -202,7 +219,7 @@ first implementation: ...@@ -202,7 +219,7 @@ first implementation:
python -i codons.py python -i codons.py
>>> codons = all_codons() >>> codons = all_codons()
:download:`codons.py <_static/code/codons.py>` . :download:`codons.py <_static/code/codons.py>`.
second implementation: second implementation:
"""""""""""""""""""""" """"""""""""""""""""""
...@@ -334,6 +351,7 @@ Compare the pseudocode of each of them and implement the fastest one. :: ...@@ -334,6 +351,7 @@ Compare the pseudocode of each of them and implement the fastest one. ::
acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca""" acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca"""
In the first algorithm. In the first algorithm.
| we first compute all kmers we generate 4\ :sup:`kmer length` | we first compute all kmers we generate 4\ :sup:`kmer length`
...@@ -341,6 +359,15 @@ In the first algorithm. ...@@ -341,6 +359,15 @@ In the first algorithm.
| so for each kmer we read all the sequence so the algorithm is in O( 4\ :sup:`kmer length` * ``sequence length``) | so for each kmer we read all the sequence so the algorithm is in O( 4\ :sup:`kmer length` * ``sequence length``)
| In the second algorithm we read the sequence only once | In the second algorithm we read the sequence only once
| So the algorithm is in O(sequence length) | So the algorithm is in O(sequence length)
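The second algorithm can be sketched as a single pass over the sequence with a dictionary of counters. The function name ``count_kmers`` and the toy sequence are ours, and this sketch is not necessarily the same as the ``kmer.py`` file of the course.

```python
def count_kmers(sequence, k):
    """Count each kmer of length k by reading the sequence only once."""
    counts = {}
    # There are len(sequence) - k + 1 windows of length k.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


toy = "acgtacgt"
counts = count_kmers(toy, 3)
for kmer, occurrence in sorted(counts.items()):
    print(kmer, '..', occurrence)
```

For a fixed kmer length, the amount of work grows linearly with the sequence length, and only the kmers that are really present in the sequence are stored.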
...@@ -371,7 +398,7 @@ Compute the 6 mers occurences of the sequence above, and print each 6mer and it' ...@@ -371,7 +398,7 @@ Compute the 6 mers occurences of the sequence above, and print each 6mer and it'
aaatat .. 2 aaatat .. 2
:download:`kmer.py <_static/code/kmer.py>` . :download:`kmer.py <_static/code/kmer.py>`.
bonus: bonus:
...@@ -408,7 +435,7 @@ Print the kmers by ordered by occurences. ...@@ -408,7 +435,7 @@ Print the kmers by ordered by occurences.
ccagtg .. 3 ccagtg .. 3
:download:`kmer_2.py <_static/code/kmer_2.py>` . :download:`kmer_2.py <_static/code/kmer_2.py>`.
Exercise Exercise
...@@ -439,7 +466,7 @@ pseudocode: ...@@ -439,7 +466,7 @@ pseudocode:
>>> print(rev_comp(seq)) >>> print(rev_comp(seq))
tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt
:download:`rev_comp.py <_static/code/rev_comp.py>` . :download:`rev_comp.py <_static/code/rev_comp.py>`.
other solution other solution
...@@ -528,7 +555,7 @@ and the 2 dna fragments: :: ...@@ -528,7 +555,7 @@ and the 2 dna fragments: ::
enz_2 = enz_filter(enzymes, dna_2) enz_2 = enz_filter(enzymes, dna_2)
enz1_only = set(enz_1) - set(enz_2) enz1_only = set(enz_1) - set(enz_2)
:download:`enzymes_1.py <_static/code/enzyme_1.py>` . :download:`enzymes_1.py <_static/code/enzyme_1.py>`.
with this algorithm we find if an enzyme cut the dna but we cannot find all cuts in the dna for an enzyme. :: with this algorithm we find if an enzyme cut the dna but we cannot find all cuts in the dna for an enzyme. ::
...@@ -569,7 +596,7 @@ The code must be adapted as below ...@@ -569,7 +596,7 @@ The code must be adapted as below
:linenos: :linenos:
:language: python :language: python
:download:`enzymes_1_namedtuple.py <_static/code/enzyme_1_namedtuple.py>` . :download:`enzymes_1_namedtuple.py <_static/code/enzyme_1_namedtuple.py>`.
Exercise Exercise
-------- --------
......
...@@ -167,11 +167,12 @@ create a representation in fasta format of following sequence : ...@@ -167,11 +167,12 @@ create a representation in fasta format of following sequence :
TFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSE TFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSE
TTFMCEYADETATIVEFLNRWITFCQSIISTLT""" TTFMCEYADETATIVEFLNRWITFCQSIISTLT"""
>>> s = name + comment + '\n' + sequence >>> s = ">" + name + " " + comment + '\n' + sequence
or or
>>> s = "{name} {comment} \n{sequence}".format(id= id, comment = comment, sequence = sequence) >>> s = ">{name} {comment}\n{sequence}".format(name=name, comment=comment, sequence=sequence)
or or
>>> s = f""{name} {comment} \n{sequence}"
>>> s = f">{name} {comment}\n{sequence}"
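The three ways of building the fasta representation can be compared on a small example. The values given to ``name``, ``comment`` and ``sequence`` below are hypothetical placeholders, not the ones of the exercise.

```python
# Hypothetical values, used only to illustrate the formatting.
name = "seq_1"
comment = "toy example"
sequence = "MYRMQLLSCI\nALSLALVTNS"

by_concatenation = ">" + name + " " + comment + "\n" + sequence
by_format = ">{name} {comment}\n{sequence}".format(name=name,
                                                   comment=comment,
                                                   sequence=sequence)
by_fstring = f">{name} {comment}\n{sequence}"

print(by_fstring)
```

The three strings should be identical: a header line starting with ``>``, followed by the sequence on the next lines.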
Exercise Exercise
...@@ -180,7 +181,7 @@ Exercise ...@@ -180,7 +181,7 @@ Exercise
For the following exercise use the python file :download:`sv40 in fasta <_static/code/sv40_file.py>` which is a python file with the sequence of sv40 in fasta format For the following exercise use the python file :download:`sv40 in fasta <_static/code/sv40_file.py>` which is a python file with the sequence of sv40 in fasta format
already embedded, and use python -i sv40_file.py to work.
how long is the sv40 in bp? How long is the sv40 in bp?
Hint : the fasta header is 61bp long. Hint : the fasta header is 61bp long.
(http://www.ncbi.nlm.nih.gov/nuccore/J02400.1) (http://www.ncbi.nlm.nih.gov/nuccore/J02400.1)
...@@ -203,7 +204,7 @@ pseudocode: ...@@ -203,7 +204,7 @@ pseudocode:
:language: python :language: python
:download:`fasta_to_one_line.py <_static/code/fasta_to_one_line.py>` . :download:`fasta_to_one_line.py <_static/code/fasta_to_one_line.py>`.
:: ::
...@@ -215,14 +216,16 @@ pseudocode: ...@@ -215,14 +216,16 @@ pseudocode:
>>> print(len(sv40_seq)) >>> print(len(sv40_seq))
5243 5243
Is that the following enzymes: Consider the following restriction enzymes:
* BamHI (ggatcc), * BamHI (ggatcc)
* EcorI (gaattc), * EcoRI (gaattc)
* HindIII (aagctt), * HindIII (aagctt)
* SmaI (cccggg) * SmaI (cccggg)
have recogition sites in sv40 (just answer by True or False)? :: For each of them, tell whether it has recognition sites in sv40 (just answer True or False).
::
>>> "ggatcc".upper() in sv40_sequence >>> "ggatcc".upper() in sv40_sequence
True True
...@@ -233,13 +236,16 @@ have recogition sites in sv40 (just answer by True or False)? :: ...@@ -233,13 +236,16 @@ have recogition sites in sv40 (just answer by True or False)? ::
>>> "cccggg".upper() in sv40_sequence >>> "cccggg".upper() in sv40_sequence
False False
for the enzymes which have a recognition site can you give their positions? :: For the enzymes which have a recognition site can you give their positions?
::
>>> sv40_sequence = sv40_sequence.lower() >>> sv40_sequence = sv40_sequence.lower()
>>> sv40_sequence.find("ggatcc") >>> sv40_sequence.find("ggatcc")
2532 2532
>>> # remember that strings are indexed from 0 >>> # remember that strings are indexed from 0
>>> 2532 + 1 = 2533 >>> 2532 + 1
2533
>>> # the recognition motif of BamHI starts at 2533 >>> # the recognition motif of BamHI starts at 2533
>>> sv40_sequence.find("gaattc") >>> sv40_sequence.find("gaattc")
1781 1781
...@@ -248,30 +254,44 @@ for the enzymes which have a recognition site can you give their positions? :: ...@@ -248,30 +254,44 @@ for the enzymes which have a recognition site can you give their positions? ::
1045 1045
>>> # HindIII -> 1046 >>> # HindIII -> 1046
is there only one site in sv40 per enzyme? Is there only one site in sv40 per enzyme?
The ``find`` method gives the index of the first occurrence or -1 if the substring is not found.
So we can not determine the number of occurrences of a site only with the ``find`` method.
The ``find`` method give the index of the first occurrence or -1 if the substring is not found.
So we can not determine the occurrences of a site only with the find method.
We can know how many sites are present with the ``count`` method. We can know how many sites are present with the ``count`` method.
We will see how to determine the site of all occurrences when we learn looping and conditions.
::
>>> sv40_seq.count("ggatcc")
1
>>> sv40_seq.count("gaattc")
1
>>> sv40_seq.count("aagctt")
6
>>> sv40_seq.count("cccggg")
0
We will see how to determine all occurrences of restriction sites when we learn looping and conditions.
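As a preview, here is a minimal sketch of how all positions of a site could be collected by calling ``find`` repeatedly with a start index. The function name ``all_positions`` and the toy sequence are ours; the sv40 sequence itself is not reproduced here.

```python
def all_positions(sequence, site):
    """Return the 1-based positions of every occurrence of site."""
    positions = []
    start = sequence.find(site)
    while start != -1:                  # find returns -1 when nothing is left
        positions.append(start + 1)     # strings are indexed from 0
        start = sequence.find(site, start + 1)
    return positions


toy = "aagcttccaagctt"
print(toy.find("aagctt"))               # index of the first occurrence only
print(toy.count("aagctt"))              # number of occurrences
print(all_positions(toy, "aagctt"))     # position of each occurrence
```

Note that ``count`` only counts non-overlapping occurrences, whereas this loop restarts the search one character after each match and therefore also reports overlapping ones.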
Exercise Exercise
-------- --------
We want to perform a PCR on sv40, can you give the length and the sequence of the amplicon? We want to perform a PCR on sv40. Can you give the length and the sequence of the amplicon?
Write a function which have 3 parameters ``sequence``, ``primer_1`` and ``primer_2`` Write a function which has 3 parameters ``sequence``, ``primer_1`` and ``primer_2`` and returns the amplicon length.
* *We consider only the cases where primer_1 and primer_2 are present in sequence* * *We consider only the cases where primer_1 and primer_2 are present in the sequence.*
* *to simplify the exercise, the 2 primers can be read directly on the sv40 sequence.* * *To simplify the exercise, the 2 primers can be read directly in the sv40 sequence (i.e. no need to reverse-complement).*
test you algorithm with the following primers Test your algorithm with the following primers:
| primer_1 : 5' CGGGACTATGGTTGCTGACT 3' | primer_1 : 5' CGGGACTATGGTTGCTGACT 3'
| primer_2 : 5' TCTTTCCGCCTCAGAAGGTA 3' | primer_2 : 5' TCTTTCCGCCTCAGAAGGTA 3'
Write the pseudocode before to implement it. Write the function in pseudocode before implementing it.
| *function amplicon_len(sequence primer_1, primer_2)* | *function amplicon_len(sequence primer_1, primer_2)*
| *pos_1 <- find position of primer_1 in sequence* | *pos_1 <- find position of primer_1 in sequence*
...@@ -293,16 +313,18 @@ Write the pseudocode before to implement it. ...@@ -293,16 +313,18 @@ Write the pseudocode before to implement it.
>>> print(amplicon_len(sequence, first_primer, second_primer)) >>> print(amplicon_len(sequence, first_primer, second_primer))
199 199
:download:`amplicon_len.py <_static/code/amplicon_len.py>` . :download:`amplicon_len.py <_static/code/amplicon_len.py>`.
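One possible reading of this exercise is sketched below. It assumes that both primers are read directly on the sequence, that ``primer_1`` lies upstream of ``primer_2``, and that the amplicon runs from the first base of ``primer_1`` to the last base of ``primer_2``. The toy sequence and primers are ours, and the code is not necessarily identical to ``amplicon_len.py``.

```python
def amplicon_len(sequence, primer_1, primer_2):
    """Return the amplicon length, both primers being present in sequence."""
    # Compare everything in lowercase so that the case does not matter.
    sequence = sequence.lower()
    pos_1 = sequence.find(primer_1.lower())
    pos_2 = sequence.find(primer_2.lower())
    # From the start of primer_1 to the end of primer_2.
    return pos_2 + len(primer_2) - pos_1


toy = "ttacgtgggaaacccttt"
toy_len = amplicon_len(toy, "ACGT", "CCC")
print(toy_len)
```

Here the amplicon is ``toy[2:15]``, that is ``acgtgggaaaccc``.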
Exercise Exercise
-------- --------
reverse the following sequence "TACCTTCTGAGGCGGAAAGA" (don't compute the complement): :: #. Reverse the following sequence ``"TACCTTCTGAGGCGGAAAGA"`` (don't compute the complement).
::
>>> "TACCTTCTGAGGCGGAAAGA"[::-1] >>> "TACCTTCTGAGGCGGAAAGA"[::-1]
or # or
>>> s = "TACCTTCTGAGGCGGAAAGA" >>> s = "TACCTTCTGAGGCGGAAAGA"
>>> l = list(s) >>> l = list(s)
# take care: reverse() reverses a list in place (the method has a side effect and returns None) # take care: reverse() reverses a list in place (the method has a side effect and returns None)
...@@ -310,20 +332,24 @@ reverse the following sequence "TACCTTCTGAGGCGGAAAGA" (don't compute the complem ...@@ -310,20 +332,24 @@ reverse the following sequence "TACCTTCTGAGGCGGAAAGA" (don't compute the complem
>>> l.reverse() >>> l.reverse()
>>> print(l) >>> print(l)
>>> ''.join(l) >>> ''.join(l)
or # or
>>> rev_s = reversed(s) >>> rev_s = reversed(s)
>>> ''.join(rev_s) >>> ''.join(rev_s)
The most efficient way to reverse a string or a list is the way using the slice. The most efficient way to reverse a string or a list is the way using the slice.
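The three approaches can be put side by side to check that they give the same string. This is only a sketch; the variable names are ours.

```python
s = "TACCTTCTGAGGCGGAAAGA"

# 1. slice with a negative step
by_slice = s[::-1]

# 2. list.reverse() works in place and returns None
letters = list(s)
returned = letters.reverse()
by_list = ''.join(letters)

# 3. reversed() returns an iterator that join can consume
by_reversed = ''.join(reversed(s))

print(by_slice == by_list == by_reversed)
print(returned)
```

The slice needs neither an intermediate list nor a call to ``join``, which is why it is usually the preferred form.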
.. #. Using the shorter string ``s = 'gaattc'`` draw what happens in memory when you reverse ``s``.
Exercise Exercise
-------- --------
| The il2_human contains 4 cysteins (C) in positions 9, 78, 125, 145. | The ``il2_human`` sequence contains 4 cysteins (C) in positions 9, 78, 125, 145.
| We want to generate the sequence of a mutatnt were the cysteins 78 and 125 are replaced by serins (S) | We want to generate the sequence of a mutant where the cysteins 78 and 125 are replaced by serins (S)
| Write the pseudocode, before to propose an implementation: | Write the pseudocode, before proposing an implementation:
We have to take care of the string numbered vs sequence numbered:
We have to take care of the difference between Python string numbering and usual position numbering:
| C in seq -> in string | C in seq -> in string
| 9 -> 8 | 9 -> 8
...@@ -332,18 +358,16 @@ We have to take care of the string numbered vs sequence numbered: ...@@ -332,18 +358,16 @@ We have to take care of the string numbered vs sequence numbered:
| 145 -> 144 | 145 -> 144
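The correspondence between the two numberings can be checked with a short loop over the sequence. This is a sketch: ``enumerate`` and the loop belong to later chapters, and the names ``indexes`` and ``positions`` are ours.

```python
il2_human = ('MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRML'
             'TFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSE'
             'TTFMCEYADETATIVEFLNRWITFCQSIISTLT')

# Index (numbered from 0) of each cystein in the Python string.
indexes = [i for i, amino_acid in enumerate(il2_human) if amino_acid == 'C']
# Usual position (numbered from 1) in the protein sequence.
positions = [i + 1 for i in indexes]

print(indexes)
print(positions)
```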
| *generate 3 slices from the il2_human* | *generate 3 slices from the il2_human*
| *head <- from the begining and cut between the first cytein and the second* | *head <- from the beginning and cut between the first cystein and the second*
| *body <- include the 2nd and 3rd cystein* | *body <- include the 2nd and 3rd cystein*
| *tail <- cut after the 3rd cystein until the end* | *tail <- cut after the 3rd cystein until the end*
| *replace body cystein by serin* | *replace body cystein by serin*
| *make new sequence with head body_mutate tail* | *make new sequence with head body_mutate tail*
il2_human =
'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSETTFMCEYADETATIVEFLNRWITFCQSIISTLT'
:: ::
il2_human = 'MYRMQLLSCIALSLALVTNSAPTSSSTKKTQLQLEHLLLDLQMILNGINNYKNPKLTRMLTFKFYMPKKATELKHLQCLEEELKPLEEVLNLAQSKNFHLRPRDLISNINVIVLELKGSETTFMCEYADETATIVEFLNRWITFCQSIISTLT'
head = il2_human[:77] head = il2_human[:77]
body = il2_human[77:125] body = il2_human[77:125]
tail = il2_human[125:] tail = il2_human[125:]
...@@ -353,15 +377,15 @@ il2_human = ...@@ -353,15 +377,15 @@ il2_human =
Exercise Exercise
-------- --------
Write a function Write a function which:
* which take a sequence as parameter * takes a sequence as parameter;
* compute the GC% * computes the GC%;
* and return it * and returns it;
* display the results readable for human as a micro report like this: * displays the results as a "human-readable" micro report like this:
'the sv40 is 5243 bp length and have 40.80% gc' ``'The sv40 is 5243 bp length and has 40.80% gc'``.
use sv40 sequence to test your function. Use the sv40 sequence to test your function.
.. literalinclude:: _static/code/gc_percent.py .. literalinclude:: _static/code/gc_percent.py
:linenos: :linenos:
...@@ -375,8 +399,8 @@ use sv40 sequence to test your function. ...@@ -375,8 +399,8 @@ use sv40 sequence to test your function.
>>> >>>
>>> sequence = fasta_to_one_line(sv40) >>> sequence = fasta_to_one_line(sv40)
>>> gc_pc = gc_percent(sequence) >>> gc_pc = gc_percent(sequence)
>>> report = "the sv40 is {0} bp length and have {1:.2%} gc".format(len(sequence), gc_pc) >>> report = "The sv40 is {0} bp length and has {1:.2%} gc".format(len(sequence), gc_pc)
>>> print(report) >>> print(report)
'the sv40 is 5243 bp length and have 40.80% gc' The sv40 is 5243 bp length and has 40.80% gc
:download:`gc_percent.py <_static/code/gc_percent.py>` . :download:`gc_percent.py <_static/code/gc_percent.py>` .
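For readers who do not have the file at hand, a function of this kind could look like the sketch below. It assumes that the function returns a fraction between 0 and 1, which is what the ``{1:.2%}`` format expects, and it is tested on a toy sequence of ours rather than on sv40. It is not necessarily identical to ``gc_percent.py``.

```python
def gc_percent(sequence):
    """Return the GC content of sequence as a fraction between 0 and 1."""
    sequence = sequence.lower()
    gc_count = sequence.count('g') + sequence.count('c')
    return gc_count / len(sequence)


toy = "ggccaatt"
gc_pc = gc_percent(toy)
report = "The toy sequence is {0} bp length and has {1:.2%} gc".format(len(toy), gc_pc)
print(report)
```

The ``%`` presentation type multiplies the value by 100 and appends the percent sign, so the function itself does not have to.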