Verified Commit 779fff97 authored by Bertrand NÉRON

add exercise using csv module

parent ec869e63
Pipeline #10355 passed
......@@ -58,6 +58,74 @@ use the file :download:`seq.fasta <_static/data/seq.fasta>` to test your code.
:download:`fasta_reader.py <_static/code/fasta_reader.py>` .
Exercise
--------
Read a multiple-sequence file in FASTA format and write each selected sequence to a new file (one sequence per file).
Keep only the sequences starting with methionine (M) and containing at least six tryptophans (W).
(*you should create files for sequences: ABCD1_HUMAN, ABCD1_MOUSE, ABCD2_HUMAN, ABCD2_MOUSE, ABCD2_RAT, ABCD4_HUMAN, ABCD4_MOUSE*)
bonus
^^^^^
Write sequences with 80 aa/line
.. literalinclude:: _static/code/fasta_filter.py
   :linenos:
   :language: python

:download:`fasta_filter.py <_static/code/fasta_filter.py>` .
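The selection rule and the 80 aa/line bonus can be sketched as below. This is only a minimal illustration and not the content of fasta_filter.py: the names `keep_sequence`, `wrap` and `format_record` are hypothetical, and the sequence is assumed to be an upper-case string of one-letter amino acid codes.

```python
def keep_sequence(sequence, min_trp=6):
    """True if the sequence starts with methionine (M) and has at least min_trp tryptophans (W)."""
    return sequence.startswith('M') and sequence.count('W') >= min_trp


def wrap(sequence, width=80):
    """Split the sequence into lines of at most `width` residues."""
    return '\n'.join(sequence[i:i + width] for i in range(0, len(sequence), width))


def format_record(name, sequence, width=80):
    """Build the text of one FASTA record (hypothetical helper)."""
    return '>{}\n{}\n'.format(name, wrap(sequence, width))


if __name__ == '__main__':
    # made-up sequence, only to show the output layout
    demo = 'M' + 'W' * 6 + 'A' * 100
    if keep_sequence(demo):
        print(format_record('DEMO_SEQ', demo), end='')
```

Keeping the test (`keep_sequence`) apart from the formatting (`wrap`) lets each part be checked on its own before any file is written.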
Exercise
--------
We ran BLAST with the following command: *blastall -p blastp -d uniprot_sprot -i query_seq.fasta -e 1e-05 -m 8 -o blast2.txt*
The option -m 8 selects the tabular output, so each field is separated from the next one by a tab character ('\t').
The fields are: query id, database sequence (subject) id, percent identity, alignment length, number of mismatches, number of gap openings,
query start, query end, subject start, subject end, Expect value, HSP bit score.
:download:`blast2.txt <_static/data/blast2.txt>` .
| Parse the file.
| Sort the hits by their *percent identity* in descending order.
| Write the results to a new file.

(adapted from *Managing Your Biological Data with Python*, p. 138)
.. literalinclude:: _static/code/parse_blast.py
   :linenos:
   :language: python

:download:`parse_blast.py <_static/code/parse_blast.py>` .
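The three steps (parse, sort, write) can be sketched with the csv module and a tab delimiter. This is an illustrative sketch and not the content of parse_blast.py: the function names are hypothetical and the three demo rows are invented, they do not come from blast2.txt. The percent identity is assumed to be the third column, as in the field list above.

```python
import csv
import io


def parse_blast_tabular(handle):
    """Return the hits of a tabular BLAST report as a list of rows (lists of strings)."""
    return [row for row in csv.reader(handle, delimiter='\t') if row]


def sort_by_identity(hits):
    """Sort hits by percent identity (third field), highest first."""
    return sorted(hits, key=lambda hit: float(hit[2]), reverse=True)


def write_hits(hits, handle):
    """Write the hits back in the same tab-separated layout."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerows(hits)


# invented rows, only to exercise the functions
demo = io.StringIO(
    "q1\tsubjA\t85.50\t200\t29\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "q1\tsubjB\t99.00\t200\t2\t0\t1\t200\t1\t200\t1e-90\t410\n"
    "q1\tsubjC\t40.25\t180\t100\t3\t5\t184\t7\t186\t2e-06\t55.1\n"
)
sorted_hits = sort_by_identity(parse_blast_tabular(demo))

if __name__ == '__main__':
    out = io.StringIO()
    write_hits(sorted_hits, out)
    print(out.getvalue(), end='')
```

The identity is converted with `float` before sorting, because sorting the raw strings would compare them character by character.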
Exercise
--------
* Parse the files exp1.csv and exp2.csv (:download:`exp1.csv <_static/data/exp1.csv>`, :download:`exp2.csv <_static/data/exp2.csv>`).
  Both files are in csv format. Create a function to parse a file and keep only the fields: GenAge ID, symbol, name, entrez gene id, uniprot.
* Get the genes which are in exp1 but not in exp2, comparing them on the uniprot identifier.
* Write the result to a file in csv format.
Hint:
^^^^^

Use the csv module of the Python standard library: https://docs.python.org/3/library/csv.html#module-csv

Use a reader like below ::

   >>> reader = csv.reader(input, quotechar='"')

.. literalinclude:: _static/code/csv.py
   :linenos:
   :language: python

:download:`csv.py <_static/code/csv.py>` .
Exercise
--------
......@@ -70,7 +138,7 @@ solution 1
:linenos:
:language: python
:download:`multiple_fasta_reader.py <_static/code/multiple_fasta_reader.py>`
solution 2
^^^^^^^^^^
......@@ -78,7 +146,7 @@ solution 2
:linenos:
:language: python
:download:`multiple_fasta_reader2.py <_static/code/multiple_fasta_reader2.py>`
solution 3
^^^^^^^^^^
......@@ -86,8 +154,8 @@ solution 3
:linenos:
:language: python
:download:`fasta_iterator.py <_static/code/fasta_iterator.py>` .
With the first version, we have to load all the sequences before processing them.
If the file is huge (more than a gigabyte), it can be a problem.
......@@ -96,61 +164,18 @@ To do that we have to open the file outside the reader function
The FASTA format is very convenient for humans but not for parsers.
The end of a sequence is indicated by the end of the file or by the beginning of a new sequence.
So with this version we have to move the cursor backward when we encounter a new sequence,
so that the cursor is at the right place for the next sequence.
The third version is an iterator and uses a generator.
Generators are functions which keep their state between two calls.
A generator does not use return to return a value but the keyword yield.
Thus this implementation returns the sequences one by one, without having to move the cursor.
You can call this function in a loop, or call next on the result:
work with one sequence, then pass to the next one, and so on.
For instance, a very convenient way to use it is: ::

   for seq in fasta_iter('my_fast_file.fasta'):
       print(seq)
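To make the role of yield concrete, here is a minimal generator sketch. It is not the content of fasta_iterator.py: the name `iter_fasta_records` is hypothetical, and unlike `fasta_iter` above it is assumed to take an already opened file-like object rather than a path, and to yield (header, sequence) pairs.

```python
import io


def iter_fasta_records(handle):
    """Yield (header, sequence) pairs one at a time from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # a new header closes the previous record: hand it over and pause here
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # the end of the file closes the last record
    if header is not None:
        yield header, ''.join(chunks)


# made-up data, only to show the two ways of consuming the generator
demo = io.StringIO(">seq1 first\nMKV\nLLA\n>seq2\nMWW\n")
records = iter_fasta_records(demo)
first = next(records)      # runs the function up to the first yield
remaining = list(records)  # resumes after that yield until the end of the file
```

Only one record is held in memory at a time, and the local variables `header` and `chunks` survive between two calls to next, which is the state the text refers to.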
\ No newline at end of file
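import csv


def parse_gene_file(path):
    """Return the set of (symbol, name, entrez, uniprot) tuples read from a csv file."""
    genes = set()
    # newline='' is recommended by the csv module documentation for file objects
    with open(path, 'r', newline='') as input:
        reader = csv.reader(input, quotechar='"')
        for row in reader:
            # the third column and everything after uniprot are ignored
            id_, symbol, _, name, entrez, uniprot, *_ = row
            genes.add((symbol, name, entrez, uniprot))
    return genes


if __name__ == '__main__':
    exp1 = parse_gene_file('exp1.csv')
    exp2 = parse_gene_file('exp2.csv')
    # the uniprot identifier is the last item of each tuple
    exp1_uniprot = {item[-1] for item in exp1}
    exp2_uniprot = {item[-1] for item in exp2}
    spe = exp1_uniprot - exp2_uniprot
    with open('exp1_specific.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        for row in exp1:
            if row[-1] in spe:
                writer.writerow(row)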
\ No newline at end of file