add exercise using csv module

Verified Commit 779fff97 authored 6 years ago by Bertrand NÉRON
--- a/source/Input_Output.rst
+++ b/source/Input_Output.rst
@@ -58,6 +58,74 @@ use the file :download:`seq.fasta <_static/data/seq.fasta>` to test your code.

 :download:`fasta_reader.py <_static/code/fasta_reader.py>` .   

+
+Exercise
+--------
+
+Read a multiple-sequence file in FASTA format and write each selected sequence to its own file.
+Keep only the sequences that start with methionine (M) and contain at least six tryptophans (W).
+ 
+(*you should create files for sequences: ABCD1_HUMAN, ABCD1_MOUSE, ABCD2_HUMAN, ABCD2_MOUSE, ABCD2_RAT, ABCD4_HUMAN, ABCD4_MOUSE*)
+
+Bonus
+^^^^^
+
+Write the sequences with 80 amino acids per line.
+
+.. literalinclude:: _static/code/fasta_filter.py
+   :linenos:
+   :language: python
+
+:download:`fasta_filter.py <_static/code/fasta_filter.py>` .
+
+Exercise
+--------
+
+We ran a BLAST search with the following command: *blastall -p blastp -d uniprot_sprot -i query_seq.fasta -e 1e-05 -m 8 -o blast2.txt*
+
+The option -m 8 selects the tabular output, so each field is separated from the next one by a tab character ('\t').
+
+The fields are: query id, database sequence (subject) id, percent identity, alignment length, number of mismatches, number of gap openings, 
+query start, query end, subject start, subject end, Expect value, HSP bit score. 
+
+:download:`blast2.txt <_static/data/blast2.txt>` .
+
+| Parse the file.
+| Sort the hits by their *percent identity* in descending order.
+| Write the results to a new file.
+
+(adapted from *Managing Your Biological Data with Python*, p. 138)
+
+.. literalinclude:: _static/code/parse_blast.py
+   :linenos:
+   :language: python
+
+:download:`parse_blast.py <_static/code/parse_blast.py>` .
+
+Exercise
+--------
+
+* Parse the files exp1.csv and exp2.csv (:download:`exp1.csv <_static/data/exp1.csv>`, :download:`exp2.csv <_static/data/exp2.csv>`)
+  (create a function that parses a file and keeps only the fields GenAge ID, symbol, name, entrez gene id, uniprot).
+* Get the genes that are in exp1 but not in exp2, comparing them by their uniprot identifier (both files are in CSV format).
+* Write the result to a file in CSV format.
+
+Hint:
+^^^^^
+
+Use the Python csv module (https://docs.python.org/3/library/csv.html#module-csv)
+and create a reader as shown below ::
+
+ >>> reader = csv.reader(input, quotechar='"')
+
+
+.. literalinclude:: _static/code/csv.py
+   :linenos:
+   :language: python
+
+:download:`csv.py <_static/code/csv.py>` .
+
+
 Exercise
 --------

@@ -70,7 +138,7 @@ solution 1
   :linenos:
   :language: python

-:download:`multiple_fasta_reader.py <_static/code/multiple_fasta_reader.py>` 
+:download:`multiple_fasta_reader.py <_static/code/multiple_fasta_reader.py>`

 solution 2
 ^^^^^^^^^^
@@ -78,7 +146,7 @@ solution 2
   :linenos:
   :language: python

-:download:`multiple_fasta_reader2.py <_static/code/multiple_fasta_reader2.py>` 
+:download:`multiple_fasta_reader2.py <_static/code/multiple_fasta_reader2.py>`

 solution 3
 ^^^^^^^^^^
@@ -86,8 +154,8 @@ solution 3
   :linenos:
   :language: python

-:download:`fasta_iterator.py <_static/code/fasta_iterator.py>` .   
-   
+:download:`fasta_iterator.py <_static/code/fasta_iterator.py>` .
+
 With the first version, we have to load all the sequences before processing them.
 If the file is huge (more than 1 GB), this can be a problem.

@@ -96,61 +164,18 @@ To do that we have to open the file outside the reader function
 The FASTA format is very convenient for humans but not for parsers.
 The end of a sequence is indicated by the end of the file or the beginning of a new one.
 So with this version we have to move the cursor backward
-when we encouter a new sequence. then the cursor is placed at the right place 
+when we encounter a new sequence, so that the cursor is at the right place
 for the next sequence.

-    
-The third version  is an iterator and use generator. 
+
+The third version is an iterator and uses a generator.
 Generators are functions that keep their state between two calls.
 Generators do not use return to hand back a value but the keyword yield.
 Thus this implementation returns the sequences one by one without having to move the cursor.
-You can call this function and put in in a loop or call next. 
+You can call this function and put it in a loop, or call next on the result.
 Work with one sequence, then move on to the next one, and so on.
 For instance, a very convenient way to use it is: ::
-   
+
   for seq in fasta_iter('my_fasta_file.fasta'):
      print(seq)
-    
-
-Exercise
--------
-
-Read a multiple sequence file in fasta format and write to a new file, one sequence by file,
-only sequences starting with methionine and containing at least six tryptophanes (W).
- 
-(*you should create files for sequences: ABCD1_HUMAN, ABCD1_MOUSE, ABCD2_HUMAN, ABCD2_MOUSE, ABCD2_RAT, ABCD4_HUMAN, ABCD4_MOUSE*)
-
-bonus
-^^^^^
-
-Write sequences with 80 aa/line
-
-.. literalinclude:: _static/code/fasta_filter.py
-   :linenos:
-   :language: python
-
-:download:`fasta_iterator.py <_static/code/fasta_filter.py>` .
-
-Exercise
--------
-
-we ran a blast with the folowing command *blastall -p blastp -d uniprot_sprot -i query_seq.fasta -e 1e-05 -m 8 -o blast2.txt*
-
-m 8 is the tabular output. So each fields is separate to the following by a '\t' 
-
-The fields are: query id, database sequence (subject) id, percent identity, alignment length, number of mismatches, number of gap openings, 
-query start, query end, subject start, subject end, Expect value, HSP bit score. 
-
-:download:`blast2.txt <_static/data/blast2.txt>` .
-
-| parse the file
-| sort the hits by their *percent identity* in the descending order.
-| write the results in a new file.
-
-(adapted from *managing your biological data with python* p138) 
-
-.. literalinclude:: _static/code/parse_blast.py
-   :linenos:
-   :language: python

-:download:`parse_blast.py <_static/code/parse_blast.py>` .   
\ No newline at end of file
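
Note on the generator explanation and the filtering exercise in Input_Output.rst: the idea can be illustrated with a minimal Python 3 sketch. It is not the content of fasta_iterator.py or fasta_filter.py, which this diff does not show. The names `iter_fasta`, `keep_sequence` and `wrap` are hypothetical, and unlike `fasta_iter` in the text, `iter_fasta` is assumed to take an already opened file object rather than a path.

```python
def iter_fasta(handle):
    """Yield (header, sequence) pairs, one per record, from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # the end of the file closes the last record
    if header is not None:
        yield header, "".join(chunks)


def keep_sequence(sequence):
    """True if the sequence starts with methionine and has at least six tryptophans."""
    return sequence.startswith("M") and sequence.count("W") >= 6


def wrap(sequence, width=80):
    """Return the sequence split into lines of at most `width` residues."""
    return "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
```

Because `iter_fasta` yields as soon as a record is complete, a loop such as `for header, sequence in iter_fasta(handle)` holds only one sequence in memory at a time and never has to move the cursor backward.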
--- a/source/_static/code/csv.py
+++ b/source/_static/code/csv.py
+import csv
+
+
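+def parse_gene_file(path):
+    """Return the set of (GenAge ID, symbol, name, entrez gene id, uniprot) tuples read from a csv file."""
+    genes = set()
+    # newline='' is what the csv module documentation recommends for files passed to csv.reader
+    with open(path, 'r', newline='') as input:
+        reader = csv.reader(input, quotechar='"')
+        for row in reader:
+            # the third column is not needed, nor are the columns after uniprot
+            id_, symbol, _, name, entrez, uniprot, *_ = row
+            genes.add((id_, symbol, name, entrez, uniprot))
+    return genes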
+
+
+if __name__ == '__main__':
+    exp1 = parse_gene_file('exp1.csv')
+    exp2 = parse_gene_file('exp2.csv')
+
+    # the uniprot identifier is the last field of each tuple
+    exp1_uniprot = {item[-1] for item in exp1}
+    exp2_uniprot = {item[-1] for item in exp2}
+
+    # identifiers present in exp1 but not in exp2
+    spe = exp1_uniprot - exp2_uniprot
+    with open('exp1_specific.csv', 'w', newline='') as out:
+        writer = csv.writer(out)
+        for row in exp1:
+            if row[-1] in spe:
+                writer.writerow(row)
\ No newline at end of file
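
Note on csv.py: a script named csv.py that is run from its own directory shadows the standard csv module, so `import csv` inside it imports the script itself instead of the library. The sketch below is one possible alternative, assuming it is saved under another name (for instance a hypothetical gene_diff.py). It also assumes that both files start with a header row using exactly the field names listed in the exercise; the function names are hypothetical.

```python
import csv

# Field names as listed in the exercise; assumed to match the header row of the files.
KEPT_FIELDS = ["GenAge ID", "symbol", "name", "entrez gene id", "uniprot"]


def parse_gene_table(handle):
    """Map each uniprot identifier to a dict holding only the kept fields."""
    reader = csv.DictReader(handle, quotechar='"')
    return {
        row["uniprot"]: {field: row[field] for field in KEPT_FIELDS}
        for row in reader
    }


def specific_to_first(first, second):
    """Return the rows of `first` whose uniprot identifier is absent from `second`."""
    return [row for uniprot, row in first.items() if uniprot not in second]


def write_gene_table(rows, handle):
    """Write the rows in csv format, preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=KEPT_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
```

Reading the columns by name rather than by position keeps the parser independent of the column order, at the cost of relying on the header names being exactly those assumed above.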
--- a/source/_static/data/exp1.csv
+++ b/source/_static/data/exp1.csv
--- a/source/_static/data/exp2.csv
+++ b/source/_static/data/exp2.csv
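
Note on the BLAST exercise in Input_Output.rst: the three steps (parse, sort by percent identity, write) can be sketched as follows. This is not the content of parse_blast.py, which this diff does not show. The function names are hypothetical, and the sketch assumes that percent identity is the third tab-separated field, as in the field list given in the exercise.

```python
def parse_blast(handle):
    """Return one list of fields per non-empty line of a tabular BLAST report."""
    hits = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and possible comment lines
        hits.append(line.split("\t"))
    return hits


def sort_by_identity(hits):
    """Return the hits sorted by percent identity (third field), highest first."""
    return sorted(hits, key=lambda hit: float(hit[2]), reverse=True)


def write_blast(hits, handle):
    """Write the hits back in the same tab-separated layout."""
    for hit in hits:
        handle.write("\t".join(hit) + "\n")
```

The percent identity is converted with `float` before sorting because comparing the raw strings would rank "9.5" above "100.00".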