Collection_Data_Types.rst 19.9 KB
Newer Older
1
2
.. sectnum::
   :start: 5
Blaise Li's avatar
Blaise Li committed
3

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
4
5
6
7
8
.. _Collection_Data_types:

*********************
Collection Data Types
*********************
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
9

10
Exercises
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
11
12
=========

13
Exercise
14
15
--------

16
17
| Draw the representation in memory of the following expressions.
| what is the data type of each object?
18

Blaise Li's avatar
Blaise Li committed
19
::
20
21
22
23
24

   x = [1, 2, 3, 4]
   y = x[1]
   y = 3.14
   x[1] = 'foo'
Blaise Li's avatar
Blaise Li committed
25

26
27
28
29
.. figure:: _static/figs/list_1.png
   :width: 400px
   :alt: set
   :figclass: align-center
Blaise Li's avatar
Blaise Li committed
30

31
32
33
34
35
::

   x = [1, 2, 3, 4]
   x += [5, 6]

Blaise Li's avatar
Blaise Li committed
36
.. figure:: _static/figs/augmented_assignment_list.png
37
38
   :width: 400px
   :alt: set
Blaise Li's avatar
Blaise Li committed
39
   :figclass: align-center
40
41
42
43
44
45
46
47
48

::

   >>> x = [1, 2, 3, 4]
   >>> id(x)
   139950507563632
   >>> x += [5,6]
   >>> id(x)
   139950507563632
Blaise Li's avatar
Blaise Li committed
49
50

With mutable object like ``list``, when we mutate the object, the state of the object is modified.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
51
But the reference to the object is still unchanged.
52

Blaise Li's avatar
Blaise Li committed
53
Comparison with the exercise on strings and integers:
54

Blaise Li's avatar
Blaise Li committed
55
56
Since lists are mutable, when ``+=`` is used, the original list object is modified, so no rebinding of *x* is necessary.
We can observe this using *id()* which gives the memory address of an object. This address does not change after the
57
58
59
``+=`` operation.

.. note::
Blaise Li's avatar
Blaise Li committed
60
61
62
   Even the results are the same, there is a subtelty to use augmented operator.
   In ``a operator= b`` opeeration, Python looks up ``a``'s value only once, so it is potentially faster
   than the ``a = a operator b`` operation.
63

64

Blaise Li's avatar
Blaise Li committed
65
Compare ::
66
67
68
69
70
71

   x = 3
   y = x
   y += 3
   x = ?
   y = ?
Blaise Li's avatar
Blaise Li committed
72
73
74


.. figure:: _static/figs/augmented_assignment_int2.png
75
76
   :width: 400px
   :alt: augmented_assignment
Blaise Li's avatar
Blaise Li committed
77
78
   :figclass: align-center

79
80
81
82
83
84
85

and ::

   x = [1,2]
   y = x
   y += [3,4]
   x = ?
Blaise Li's avatar
Blaise Li committed
86
   y = ?
87
88


Blaise Li's avatar
Blaise Li committed
89
.. figure:: _static/figs/augmented_assignment_list2.png
90
91
   :width: 400px
   :alt: list extend
Blaise Li's avatar
Blaise Li committed
92
   :figclass: align-center
93
94


Blaise Li's avatar
Blaise Li committed
95
96
97
In this example we have two ways to access to the list ``[1, 2]``.
If we modify the state of the list itself, but not the references to this object, then the two variables ``x`` and ``y`` still reference the list containing
``[1, 2, 3, 4]``.
98
99


100
Exercise
101
102
--------

Blaise Li's avatar
Blaise Li committed
103
.. note::
Blaise Li's avatar
Blaise Li committed
104
   ``sum`` is a function that returns the sum of all the elements of a list.
Blaise Li's avatar
Blaise Li committed
105
106

Wihout using the Python shell, tell what are the effects of the following statements::
Blaise Li's avatar
Blaise Li committed
107

108

109
   x = [1, 2, 3, 4]
Blaise Li's avatar
Blaise Li committed
110
111
112
113
114
115
116
117
118
119
120
121
122
123
   x[3] = -4  # What is the value of x now?
   y = sum(x) / len(x)  # What is the value of y? Why?

Solution (using the Python shell ;) )::

    >>> x = [1, 2, 3, 4]
    >>> x[3] = -4
    >>> x
    [1, 2, 3, -4]
    >>> y = sum(x) / len(x)
    >>> y
    0.5

Here, we compute the mean of the values contained in the list ``x``, after having changed its last element to -4.
124

Blaise Li's avatar
Blaise Li committed
125
.. .. warning::
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
126

Blaise Li's avatar
Blaise Li committed
127
..    In python2 the result is ::
128

Blaise Li's avatar
Blaise Li committed
129
..        y = 0
130

Blaise Li's avatar
Blaise Li committed
131
..    because sum(x) is an integer, len(x) is also an integer so in python2.x the result is an integer,
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
132
    all the digits after the periods are discarded.
133
134


135
Exercise
136
137
--------

Blaise Li's avatar
Blaise Li committed
138
Draw the representation in memory of the ``x`` and ``y`` variables when the following code is executed::
139

Blaise Li's avatar
Blaise Li committed
140
   x = [1, ['a', 'b', 'c'], 3, 4]
141
142
   y = x[1]
   y[2] = 'z'
Blaise Li's avatar
Blaise Li committed
143
   # What is the value of x?
Blaise Li's avatar
Blaise Li committed
144

145
146
147
148
.. figure:: _static/figs/list_2-1.png
   :width: 400px
   :alt: set
   :figclass: align-center
Blaise Li's avatar
Blaise Li committed
149

150
151
152
153

.. container:: clearer

    .. image :: _static/figs/spacer.png
Blaise Li's avatar
Blaise Li committed
154

Blaise Li's avatar
Blaise Li committed
155
 When we execute *y = x[1]*, we create ``y`` which references the list ``['a', 'b', 'c']``.
156
 This list has 2 references on it: ``y`` and ``x[1]`` .
Blaise Li's avatar
Blaise Li committed
157
158


159
160
161
162
.. figure:: _static/figs/list_2-2.png
   :width: 400px
   :alt: set
   :figclass: align-center
Blaise Li's avatar
Blaise Li committed
163
164


165
166
167
.. container:: clearer

    .. image :: _static/figs/spacer.png
Blaise Li's avatar
Blaise Li committed
168
169


170
171
 This object is a list so it is a mutable object.
 So we can access **and** modify it by the two ways ``y`` or ``x[1]`` ::
Blaise Li's avatar
Blaise Li committed
172

173
   x = [1, ['a','b','z'], 3, 4]
Blaise Li's avatar
Blaise Li committed
174

175
176
177
178
179
180
181
182
183
Exercise
--------

from the list l = [1, 2, 3, 4, 5, 6, 7, 8, 9] generate 2 lists l1 containing all odd values, and l2 all even values.::

   l = [1, 2, 3, 4, 5, 6, 7, 8, 9]
   l1 = l[::2]
   l2 = l[1::2]

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
184
185
186
187
188
or ::

    even = [item for item in l if item % 2 == 0]
    odd = [item for item in l if item % 2 != 0]

189
Exercise
190
--------
Blaise Li's avatar
Blaise Li committed
191

Blaise Li's avatar
Blaise Li committed
192
193
194
195
196
197
198
.. note::
    A codon is a triplet of nucleotides.
    A nucleotide can be one of the four letters A, C, G, T

Write a function that returns a list containing strings representing all possible codons.

Write the pseudocode before proposing an implementation.
Blaise Li's avatar
Blaise Li committed
199

200
201
202
203
204
pseudocode:
"""""""""""

| *function all_codons()*
|     *all_codons <- empty list*
Blaise Li's avatar
Blaise Li committed
205
206
207
|     *let vary the first base*
|     *for each first base let vary the second base*
|     *for each combination first base, second base let vary the third base*
208
209
210
211
212
213
214
215
216
217
218
219
220
|     *add the concatenation base 1 base 2 base 3 to all_codons*
|     *return all_codons*

first implementation:
"""""""""""""""""""""
.. literalinclude:: _static/code/codons.py
   :linenos:
   :language: python

::

   python -i codons.py 
   >>> codons = all_codons()
Blaise Li's avatar
Blaise Li committed
221
222

:download:`codons.py <_static/code/codons.py>`.
223
224
225
226

second implementation:
""""""""""""""""""""""

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
227
Mathematically speaking the generation of all codons can be the cartesian product
Blaise Li's avatar
Blaise Li committed
228
between 3 vectors 'acgt'.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
229
In python there is a function to do that in ``itertools module``: `https://docs.python.org/3/library/itertools.html#itertools.product <product>`_
230
231
232
233
234
235
236
237
238
239


.. literalinclude:: _static/code/codons_itertools.py
   :linenos:
   :language: python

::

   python -i codons.py 
   >>> codons = all_codons()
Blaise Li's avatar
Blaise Li committed
240

241
242
:download:`codons_itertools.py <_static/code/codons_itertools.py>` .

Blaise Li's avatar
Blaise Li committed
243

244
Exercise
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
245
246
--------

Blaise Li's avatar
Blaise Li committed
247
From a list return a new list without any duplicate, regardless of the order of items.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
248
249
250
251
252
253
254
For example: ::

   >>> l = [5,2,3,2,2,3,5,1]
   >>> uniqify(l)
   >>> [1,2,3,5] #is one of the solutions 


255
256
pseudocode:
"""""""""""
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
257

258
259
260
261
262
| *function uniqify(l)*
|     *uniq <- empty list*
|     *for each element of l*
|        *add element if is not in uniq*
|     *return uniq*
263

264
265
implementation:
"""""""""""""""
266

267
268
269
.. literalinclude:: _static/code/uniqify.py
   :linenos:
   :language: python
270

271
::
272

273
274
275
   >>> l=[1,2,3,2,3,4,5,1,2,3,3,2,7,8,9]
   >>> uniqify(l)
   [1, 2, 3, 4, 5, 7, 8, 9]
276

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
277
:download:`uniqify.py <_static/code/uniqify.py>` .
278

279
280
second implementation:
""""""""""""""""""""""
281

282
283
284
285
The problem with the first implementation come from the line 4.
Remember that the membership operator uses a linear search for list, which can be slow for very large collections.
If we plan to use ``uniqify`` with large list we should find a better algorithm.
In the specification we can read that uniqify can work *regardless the order of the resulting list*.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
286
So we can use the specificity of set ::
287

Blaise Li's avatar
Blaise Li committed
288

289
   >>> list(set(l))
290

291

292
293
Exercise
--------
294

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
295
We need to compute the occurrence of all kmers of a given length present in a sequence.
296

Blaise Li's avatar
Blaise Li committed
297
Below we propose 2 algorithms.
298

299
300
pseudo code 1
"""""""""""""
301

302
303
|   *function get_kmer_occurences(seq, kmer_len)*
|      *all_kmers <- generate all possible kmer of kmer_len*
Blaise Li's avatar
Blaise Li committed
304
|      *occurences <- 0*
305
306
307
|      *for each kmer in all_kmers*
|         *count occurence of kmer*
|         *store occurence*
Blaise Li's avatar
Blaise Li committed
308

309
310
311
312
313
314
315
316
pseudo code 2
"""""""""""""

|  *function get_kmer_occurences(seq, kmer_len)*
|     *all_kmers <- empty*
|     *from i = 0 to sequence length - kmer_len*
|        *kmer <- kmer startin at pos i im sequence*
|        *increase by of occurence of kmer*
Blaise Li's avatar
Blaise Li committed
317

318

319
.. note::
320

Blaise Li's avatar
Blaise Li committed
321
322
323
   Computer scientists typically measure an algorithm’s efficiency in terms of its worst-case running time,
   which is the largest amount of time an algorithm can take given the most difficult input of a fixed size.
   The advantage to considering the worst case running time is that we are guaranteed that our algorithm
324
   will never behave worse than our worst-case estimate.
Blaise Li's avatar
Blaise Li committed
325
326
327
328
329
330
331

   Big-O notation compactly describes the running time of an algorithm.
   For example, if your algorithm for sorting an array of n numbers takes roughly n2 operations for the most difficult dataset,
   then we say that the running time of your algorithm is O(n2). In reality, depending on your implementation, it may be use any number of operations,
   such as 1.5n2, n2 + n + 2, or 0.5n2 + 1; all these algorithms are O(n2) because big-O notation only cares about the term that grows the fastest with
   respect to the size of the input. This is because as n grows very large, the difference in behavior between two O(n2) functions,
   like 999 · n2 and n2 + 3n + 9999999, is negligible when compared to the behavior of functions from different classes,
332
333
   say O(n2) and O(n6). Of course, we would prefer an algorithm requiring 1/2 · n2 steps to an algorithm requiring 1000 · n2 steps.

Blaise Li's avatar
Blaise Li committed
334
335
   When we write that the running time of an algorithm is O(n2), we technically mean that it does not grow faster than a function with a
   leading term of c · n2, for some constant c. Formally, a function f(n) is Big-O of function g(n), or O(g(n)), when f(n) <= c · g(n) for some
336
337
338
   constant c and sufficiently large n.

   For more on Big-O notation, see A `http://rob-bell.net/2009/06/a-beginners-guide-to-big-o-notation/Beginner's <Guide to Big-O Notation>`_.
Blaise Li's avatar
Blaise Li committed
339

340

341
Compare the pseudocode of each of them and implement the fastest one. ::
342

343
344
345
346
347
348
349
350
351
   """gtcagaccttcctcctcagaagctcacagaaaaacacgctttctgaaagattccacactcaatgccaaaatataccacag
      gaaaattttgcaaggctcacggatttccagtgcaccactggctaaccaagtaggagcacctcttctactgccatgaaagg
      aaaccttcaaaccctaccactgagccattaactaccatcctgtttaagatctgaaaaacatgaagactgtattgctcctg
      atttgtcttctaggatctgctttcaccactccaaccgatccattgaactaccaatttggggcccatggacagaaaactgc
      agagaagcataaatatactcattctgaaatgccagaggaagagaacacagggtttgtaaacaaaggtgatgtgctgtctg
      gccacaggaccataaaagcagaggtaccggtactggatacacagaaggatgagccctgggcttccagaagacaaggacaa
      ggtgatggtgagcatcaaacaaaaaacagcctgaggagcattaacttccttactctgcacagtaatccagggttggcttc
      tgataaccaggaaagcaactctggcagcagcagggaacagcacagctctgagcaccaccagcccaggaggcacaggaaac
      acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca"""
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
352
353


354
<<<<<<< HEAD
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
355
In the first algorithm.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
356

357
| we first compute all kmers we generate 4\ :sup:`kmer length`
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
358
359
| then we count the occurrence of each kmer in the sequence
| so for each kmer we read all the sequence so the algorithm is in O( 4\ :sup:`kmer length` * ``sequence length``)
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
360

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
361
| In the second algorithm we read the sequence only once
362
=======
Blaise Li's avatar
Blaise Li committed
363
In the first alogrithm.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
364

365
366
| we first compute all kmers we generate 4\ :sup:`kmer length`
| then we count the occurence of each kmer in the sequence
Blaise Li's avatar
Blaise Li committed
367
| so for each kmer we read all the sequence so the algorith is in O( 4\ :sup:`kmer length` * ``sequence length``)
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
368

Blaise Li's avatar
Blaise Li committed
369
| In the secon algorithm we read the sequence only once
370
>>>>>>> e986fb63db27fe063adb907bfb916dbb79c5db9b
371
| So the algorithm is in O(sequence length)
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
372
373


374
Compute the 6 mers occurences of the sequence above, and print each 6mer and it's occurence one per line.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
375

376
377
378
.. literalinclude:: _static/code/kmer.py
   :linenos:
   :language: python
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
379

380
::
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
381

382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
   >>> s = """"gtcagaccttcctcctcagaagctcacagaaaaacacgctttctgaaagattccacactcaatgccaaaatataccacag
   ... gaaaattttgcaaggctcacggatttccagtgcaccactggctaaccaagtaggagcacctcttctactgccatgaaagg
   ... aaaccttcaaaccctaccactgagccattaactaccatcctgtttaagatctgaaaaacatgaagactgtattgctcctg
   ... atttgtcttctaggatctgctttcaccactccaaccgatccattgaactaccaatttggggcccatggacagaaaactgc
   ... agagaagcataaatatactcattctgaaatgccagaggaagagaacacagggtttgtaaacaaaggtgatgtgctgtctg
   ... gccacaggaccataaaagcagaggtaccggtactggatacacagaaggatgagccctgggcttccagaagacaaggacaa
   ... ggtgatggtgagcatcaaacaaaaaacagcctgaggagcattaacttccttactctgcacagtaatccagggttggcttc
   ... tgataaccaggaaagcaactctggcagcagcagggaacagcacagctctgagcaccaccagcccaggaggcacaggaaac
   ... acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca"""
   >>> s = s.replace('\n', '')
   >>> kmers = get_kmer_occurences(s, 6)
   >>> for kmer in kmers:
   >>>   print kmer[0], '..', kmer[1]
   gcagag .. 2
   aacttc .. 1
   gcaact .. 1
   aaatat .. 2
Blaise Li's avatar
Blaise Li committed
399
400
401


:download:`kmer.py <_static/code/kmer.py>`.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
402
403
404


bonus:
405
""""""
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
406

407
Print the kmers by ordered by occurences.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
408

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
409
410
| see `https://docs.python.org/3/library/stdtypes.html#mutable-sequence-types <sort>`_
| see `https://docs.python.org/3/library/operator.html#operator.itemgetter <operator.itemgetter>`_
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
411
412


413
414
415
.. literalinclude:: _static/code/kmer_2.py
   :linenos:
   :language: python
416

417
::
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
418

419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
   >>> s = """"gtcagaccttcctcctcagaagctcacagaaaaacacgctttctgaaagattccacactcaatgccaaaatataccacag
   ... gaaaattttgcaaggctcacggatttccagtgcaccactggctaaccaagtaggagcacctcttctactgccatgaaagg
   ... aaaccttcaaaccctaccactgagccattaactaccatcctgtttaagatctgaaaaacatgaagactgtattgctcctg
   ... atttgtcttctaggatctgctttcaccactccaaccgatccattgaactaccaatttggggcccatggacagaaaactgc
   ... agagaagcataaatatactcattctgaaatgccagaggaagagaacacagggtttgtaaacaaaggtgatgtgctgtctg
   ... gccacaggaccataaaagcagaggtaccggtactggatacacagaaggatgagccctgggcttccagaagacaaggacaa
   ... ggtgatggtgagcatcaaacaaaaaacagcctgaggagcattaacttccttactctgcacagtaatccagggttggcttc
   ... tgataaccaggaaagcaactctggcagcagcagggaacagcacagctctgagcaccaccagcccaggaggcacaggaaac
   ... acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca"""
   >>> s = s.replace('\n', '')
   >>> kmers = get_kmer_occurences(s, 6)
   >>> for kmer, occ in kmers:
   >>>   print kmer, '..', occ
   cacagg .. 4
   aggaaa .. 4
   ttctga .. 3
   ccagtg .. 3
Blaise Li's avatar
Blaise Li committed
436
437
438


:download:`kmer_2.py <_static/code/kmer_2.py>`.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
439
440


441
442
443
Exercise
--------

444
445
446
447
448
| Write a function which take a sequence as parameter and return it's reversed complement.
| Write the pseudocode before to propose an implementation.

pseudocode:
"""""""""""
449

450
451
452
453
454
455
| *function reverse_comp(sequence)*
|     *complement <- establish a correpondance and each base and its complement*
|     *rev_seq <- revert the sequence*
|     *rev_comp <- empty*
|     *for each nt of rev_seq*
|        *concatenate nt complement to rev_comp*
456
|     *return rev_comp*
457

458
.. literalinclude:: _static/code/rev_comp.py
459
460
   :linenos:
   :language: python
461

462
463
464
465
466
::
   >>> from rev_comp import rev_comp
   >>>
   >>> seq = 'acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca'
   >>> print rev_comp(seq)
467
   tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt
468

Blaise Li's avatar
Blaise Li committed
469
470
471
:download:`rev_comp.py <_static/code/rev_comp.py>`.


472
473
474
other solution
""""""""""""""

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
475
python provide an interesting method for our problem.
476
477
478
479
480
The ``translate`` method work on string and need a parameter which is a object
that can do the correspondance between characters in old string a the new one.
``maketrans`` is a function in module ``string`` that allow us to build this object.
``maketrans`` take 2 arguments, two strings, the first string contains the characters
to change, the second string the corresponding characters in the new string.
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
481
482
Thus the two strings **must** have the same length. The correspondance between
the characters to change and their new values is made in function of their position.
483
the first character of the first string will be replaced by the first character of the second string,
Blaise Li's avatar
Blaise Li committed
484
the second character of the first string will be replaced by the second character of the second string, on so on.
485
So we can write the reverse complement without loop.
Blaise Li's avatar
Blaise Li committed
486

487
488
489
.. literalinclude:: _static/code/rev_comp2.py
   :linenos:
   :language: python
490
491

::
492
493
494
495
496
   >>> from rev_comp2 import rev_comp
   >>>
   >>> seq = 'acggcaacatggctggccagtgggctctgagaggagaaagtccagtggatgctcttggtctggttcgtgagcgcaacaca'
   >>> print rev_comp(seq)
   tgtgttgcgctcacgaaccagaccaagagcatccactggactttctcctctcagagcccactggccagccatgttgccgt
Blaise Li's avatar
Blaise Li committed
497

498
:download:`rev_comp2.py <_static/code/rev_comp2.py>` .
499

500
501
502
Exercise
--------

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
503
504
505
506
let the following enzymes collection:
We decide to implement enzymes as tuple with the following structure
("name", "comment", "sequence", "cut", "end")
::
Blaise Li's avatar
Blaise Li committed
507

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
508
509
510
511
512
513
514
515
516
517

   ecor1 = ("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
   ecor5 = ("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt")
   bamh1 = ("BamHI", "type II restriction endonuclease from Bacillus amyloliquefaciens ", "ggatcc", 1, "sticky")
   hind3 = ("HindIII", "type II site-specific nuclease from Haemophilus influenzae", "aagctt", 1 , "sticky")
   taq1 = ("TaqI", "Thermus aquaticus", "tcga", 1 , "sticky")
   not1 = ("NotI", "Nocardia otitidis", "gcggccgc", 2 , "sticky")
   sau3a1 = ("Sau3aI", "Staphylococcus aureus", "gatc", 0 , "sticky")
   hae3 = ("HaeIII", "Haemophilus aegyptius", "ggcc", 2 , "blunt")
   sma1 =  ("SmaI", "Serratia marcescens", "cccggg", 3 , "blunt")
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533

and the 2 dna fragments: ::

   dna_1 = """tcgcgcaacgtcgcctacatctcaagattcagcgccgagatccccgggggttgagcgatccccgtcagttggcgtgaattcag
   cagcagcgcaccccgggcgtagaattccagttgcagataatagctgatttagttaacttggatcacagaagcttccaga
   ccaccgtatggatcccaacgcactgttacggatccaattcgtacgtttggggtgatttgattcccgctgcctgccagg"""

   dna_2 = """gagcatgagcggaattctgcatagcgcaagaatgcggccgcttagagcgatgctgccctaaactctatgcagcgggcgtgagg
   attcagtggcttcagaattcctcccgggagaagctgaatagtgaaacgattgaggtgttgtggtgaaccgagtaag
   agcagcttaaatcggagagaattccatttactggccagggtaagagttttggtaaatatatagtgatatctggcttg"""

| which enzymes cut the dna_1 ?
|                  the dna_2 ?
|                  the dna_1 but not the dna_2?


Bertrand  NÉRON's avatar
Bertrand NÉRON committed
534
535
536
537
538
539
540
541
542
* In a file <my_file.py>
    #. Write a function *seq_one_line* which take a multi lines sequence and return a sequence in one line.
    #. Write a function *enz_filter* which take a sequence and a list of enzymes and return a new list containing
       the enzymes which have a binding site in the sequence
    #. open a terminal with the command python -i <my_file.py>
    #. copy paste the enzymes and dna fragments
    #. use the functions above to compute the enzymes which cut the dna_1
       apply the same functions to compute the enzymes which cut the dna_2
       compute the difference between the enzymes which cut the dna_1 and enzymes which cut the dna_2
Blaise Li's avatar
Blaise Li committed
543

544
.. literalinclude:: _static/code/enzyme_1.py
545
546
547
548
549
   :linenos:
   :language: python

::
   from enzyme_1 import *
Blaise Li's avatar
Blaise Li committed
550

551
552
553
554
   enzymes = [ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1]
   dna_1 = one_line(dna_1)
   dans_2 = one_line(dna_2)
   enz_1 = enz_filter(enzymes, dna_1)
Blaise Li's avatar
Blaise Li committed
555
   enz_2 = enz_filter(enzymes, dna_2)
556
557
   enz1_only = set(enz_1) - set(enz_2)

Blaise Li's avatar
Blaise Li committed
558
:download:`enzymes_1.py <_static/code/enzyme_1.py>`.
559
560
561
562
563
564
565
566

with this algorithm we find if an enzyme cut the dna but we cannot find all cuts in the dna for an enzyme. ::

   enzymes = [ecor1, ecor5, bamh1, hind3, taq1, not1, sau3a1, hae3, sma1]
   digest_1 = []
   for enz in enzymes:
      print enz.name, dna_1.count(enz.sequence)

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
567
the latter algorithm display the number of occurrence of each enzyme, But we cannot determine the position of every sites.
568
We will see how to do this later.
569

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
570
Bonus
571
^^^^^
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
572
573
574
575
576
577
578
579
580

There is another kind of tuple which allow to access to itmes by index or name.
This data collection is called NamedTuple. The NamedTuple are not accessible directly they are in `collections` package,
so we have to import it before to use it.
We also have to define which name correspond to which item::

    import collections
    RestrictEnzyme = collections.namedtuple("RestrictEnzyme", ("name", "comment", "sequence", "cut", "end"))

Bertrand  NÉRON's avatar
typos    
Bertrand NÉRON committed
581
Then we can use this new kind of tuple::
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598

    ecor1 = RestrictEnzyme("EcoRI", "Ecoli restriction enzime I", "gaattc", 1, "sticky")
    ecor5 = RestrictEnzyme("EcoRV", "Ecoli restriction enzime V", "gatatc", 3, "blunt")
    bamh1 = RestrictEnzyme("BamHI", "type II restriction endonuclease from Bacillus amyloliquefaciens ", "ggatcc", 1, "sticky")
    hind3 = RestrictEnzyme("HindIII", "type II site-specific nuclease from Haemophilus influenzae", "aagctt", 1 , "sticky")
    taq1 = RestrictEnzyme("TaqI", "Thermus aquaticus", "tcga", 1 , "sticky")
    not1 = RestrictEnzyme("NotI", "Nocardia otitidis", "gcggccgc", 2 , "sticky")
    sau3a1 = RestrictEnzyme("Sau3aI", "Staphylococcus aureus", "gatc", 0 , "sticky")
    hae3 = RestrictEnzyme("HaeIII", "Haemophilus aegyptius", "ggcc", 2 , "blunt")
    sma1 =  RestrictEnzyme("SmaI", "Serratia marcescens", "cccggg", 3 , "blunt")

The code must be adapted as below

.. literalinclude:: _static/code/enzyme_1_namedtuple.py
   :linenos:
   :language: python

Blaise Li's avatar
Blaise Li committed
599
:download:`enzymes_1_namedtuple.py <_static/code/enzyme_1_namedtuple.py>`.
600

601
Exercise
Bertrand  NÉRON's avatar
Bertrand NÉRON committed
602
603
604
605
606
--------

given the following dict : ::

   d = {1 : 'a', 2 : 'b', 3 : 'c' , 4 : 'd'}
Blaise Li's avatar
Blaise Li committed
607

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
608
609
610
611
612
613
614
615
616
We want obtain a new dict with the keys and the values inverted so we will obtain: ::

   inverted_d  {'a': 1, 'c': 3, 'b': 2, 'd': 4}

solution ::

   inverted_d = {}
   for key in d.keys():
       inverted_d[d[key]] = key
Blaise Li's avatar
Blaise Li committed
617

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
618
619
620
621
622
solution ::

   inverted_d = {}
   for key, value in d.items():
       inverted_d[value] = key
Blaise Li's avatar
Blaise Li committed
623

Bertrand  NÉRON's avatar
Bertrand NÉRON committed
624
625
solution ::

626
   inverted_d = {v : k for k, v in d.items()}
Blaise Li's avatar
Blaise Li committed
627