# 14607-klebsiella-data
## How to use it
```shell script
# export NCBI_API_KEY=blablabla
# export NCBI_EMAIL=john.doe@pasteur.fr
pip install -r requirements.txt
./extract.py -i genomes_proks815-genomesonnovember.sample.csv --attribute-threshold 0.01 -o my-genomes_proks815-genomesonnovember-with-attributes.csv
```
It is recommended to provide an API key and email, see https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Coming_in_December_2018_API_Key for more information.
## What does it do
Execution has two steps:
* Count how many times each attribute is filled, to determine which attributes should be exported.
* Export, for each strain, the requested SRA and assembly attributes, plus the BioSample attributes that are filled in for more than *attribute-threshold* of the strains (default is 0.01, i.e. 1%).
### First step
It fetches all BioSample records listed in the input file and counts how many times each attribute is filled, ignoring placeholder values such as Missing, Unknown, N/A, ...
It produces a file, bio_sample_attributes.csv, with the observed counts.
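The counting described above can be sketched as follows. This is a minimal illustration, not the code in extract.py: the function names and the exact list of placeholder spellings are assumptions, and records are represented as plain dictionaries.

```python
from collections import Counter

# Assumption: placeholder spellings treated as "not filled".
# The actual list used by extract.py may differ.
PLACEHOLDERS = {"", "missing", "unknown", "n/a", "na", "not applicable",
                "not collected", "not provided"}


def is_filled(value):
    """Return True when the value carries real information."""
    return value is not None and value.strip().lower() not in PLACEHOLDERS


def count_filled_attributes(records):
    """Count, per attribute name, how many records have a filled value.

    `records` is an iterable of dicts mapping attribute name to raw value.
    """
    counts = Counter()
    for record in records:
        for name, value in record.items():
            if is_filled(value):
                counts[name] += 1
    return counts


# Made-up records, for illustration only.
records = [
    {"collection_date": "2015-03-02", "host_age": "Missing"},
    {"collection_date": "2016", "host_age": "42"},
    {"collection_date": "N/A", "host_age": "unknown"},
]
print(count_filled_attributes(records))
```

Comparing values case-insensitively after stripping whitespace keeps variants such as `missing` and ` Missing ` from being counted as filled.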
### Second step
It uses the counts in bio_sample_attributes.csv to select the attributes to export. For example, collection_date is filled for 96% of the strains, while host_age is only filled for 7% of the strains. BioSample attributes are sorted in decreasing order of usage.
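A possible shape for this selection is sketched below. The function name `select_attributes` and the sample counts are hypothetical; only the threshold rule (strictly more than *attribute-threshold*) and the decreasing sort come from the description above.

```python
def select_attributes(counts, strain_count, threshold=0.01):
    """Return (name, ratio) pairs filled in more than `threshold` of the
    strains, most used first."""
    ratios = {name: count / strain_count for name, count in counts.items()}
    kept = [(name, ratio) for name, ratio in ratios.items() if ratio > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Illustrative counts for 100 strains (not real data).
counts = {"host_age": 7, "collection_date": 96, "air_temp": 1}
for name, ratio in select_attributes(counts, strain_count=100):
    print(f"{name}\t{ratio:.2%}")
```

With these numbers, `air_temp` sits exactly at the 1% threshold and is therefore left out, since the rule is "more than" rather than "at least".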
## Preliminary tests
I use the NCBI E-utilities REST API.
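As a rough sketch of what an E-utilities request can look like, the snippet below builds an `efetch` URL for one BioSample record. It reads the `NCBI_API_KEY` and `NCBI_EMAIL` variables from the usage example. The helper name `build_efetch_url` and the UID are placeholders, and this is not necessarily how extract.py issues its requests.

```python
import os
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(biosample_uid, env=os.environ):
    """Build an efetch URL for one BioSample record.

    The API key and email are added only when present in `env`.
    """
    params = {"db": "biosample", "id": biosample_uid, "retmode": "xml"}
    if env.get("NCBI_API_KEY"):
        params["api_key"] = env["NCBI_API_KEY"]
    if env.get("NCBI_EMAIL"):
        params["email"] = env["NCBI_EMAIL"]
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# "1234567" is a placeholder UID, not a real record.
print(build_efetch_url("1234567", env={"NCBI_EMAIL": "john.doe@pasteur.fr"}))
```

Only the URL is built here; no request is sent.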
......
......@@ -2,6 +2,7 @@
import csv
import json
import re
import traceback
from tqdm import tqdm
......@@ -64,7 +65,10 @@ def fetch_attribute(args):
     bio_samples_attributes = {}
     for strain in tqdm(strains):
         for attr_name, aka, value in strain.get_bio_sample_attributes_and_value():
-            translated_attr_name = max(aka, key=lambda x: bio_samples_attributes_raw[x][1])
+            filtered_aka = [a for a in aka if re.match('^[a-z_0-9]+$', a)]
+            if len(filtered_aka) == 0:
+                filtered_aka = aka
+            translated_attr_name = max(filtered_aka, key=lambda x: bio_samples_attributes_raw[x][1])
             stats = bio_samples_attributes.setdefault(translated_attr_name, {"cpt": 0, "values": set()})
             if translated_attr_name != attr_name:
                 translations.add((attr_name, translated_attr_name))
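The effect of this change can be illustrated in isolation. In the sketch below, `pick_canonical_name` and the `usage` dictionary are hypothetical stand-ins: the real code looks counts up in `bio_samples_attributes_raw[x][1]`, while here a plain name-to-count mapping is assumed.

```python
import re

# Same pattern as in the hunk: lowercase letters, digits and underscores only.
SNAKE_CASE = re.compile(r"^[a-z_0-9]+$")


def pick_canonical_name(aliases, usage):
    """Pick the most used alias, preferring snake_case names.

    Falls back to all aliases when none of them is snake_case.
    """
    candidates = [a for a in aliases if SNAKE_CASE.match(a)] or list(aliases)
    return max(candidates, key=lambda name: usage.get(name, 0))


# Illustrative usage counts (not real data).
usage = {"air temperature": 5, "air_temp": 3, "MLST Sequence Type": 1}
print(pick_canonical_name(["air temperature", "air_temp"], usage))
print(pick_canonical_name(["MLST Sequence Type"], usage))
```

The snake_case alias wins even when a display-style alias is more frequent, which matches the renames visible in the attribute list (for example `air temperature` becoming `air_temp`), while names with no snake_case alias are kept unchanged.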
......
......@@ -141,21 +141,21 @@ urine_collect_meth 1 0.00012099213551119177
 tissue 1 0.00012099213551119177
 MLST Sequence Type 1 0.00012099213551119177
 abs_air_humidity 1 0.00012099213551119177
-air temperature 1 0.00012099213551119177
+air_temp 1 0.00012099213551119177
 build_occup_type 1 0.00012099213551119177
 building_setting 1 0.00012099213551119177
 carb_dioxide 1 0.00012099213551119177
 filter_type 1 0.00012099213551119177
-heating and cooling system type 1 0.00012099213551119177
+heat_cool_type 1 0.00012099213551119177
 indoor_space 1 0.00012099213551119177
 light_type 1 0.00012099213551119177
 occup_samp 1 0.00012099213551119177
 occupant_dens_samp 1 0.00012099213551119177
-organism count 1 0.00012099213551119177
-relative air humidity 1 0.00012099213551119177
+organism_count 1 0.00012099213551119177
+rel_air_humidity 1 0.00012099213551119177
 space_typ_state 1 0.00012099213551119177
-typical occupant density 1 0.00012099213551119177
-ventilation type 1 0.00012099213551119177
+typ_occupant_dens 1 0.00012099213551119177
+ventilation_type 1 0.00012099213551119177
 component_organism 1 0.00012099213551119177
 Jinru Ji 1 0.00012099213551119177
 Yonghong Xiao 1 0.00012099213551119177
......