Processing a dataset
Use open_data_store with a loader, processor, and writer app to batch-process a directory of files with apply_to, optionally enabling progress bars and parallel execution. Then inspect the results.
We will translate the DNA sequences in raw.zip into amino acid sequences and store them in a SQLite database. We will then interrogate the generated data store to get a synopsis of the results.
Translating DNA to amino acid
- Open the zipped input data store, selecting .fa files as members.
- Create a writable SQLite output data store. Using a single database file is more efficient than writing many small files.
- Compose loader, translator, and writer into a single pipeline.
- Apply the pipeline to every member of the input data store. Results are written to out_dstore.
- Summary showing counts of completed records, not-completed records, and log files.
- Verify the integrity of all records via MD5 checksums.
- Summary of why some records could not be processed, e.g. sequences not divisible by 3 or containing stop codons.
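The steps above can be sketched roughly as follows. This is a minimal sketch rather than a definitive listing: it assumes cogent3 is installed, that the app names load_unaligned, translate_seqs and write_db are available through get_app, and that raw.zip is in the working directory. The function name translate_dataset and the output file name translated.sqlitedb are hypothetical, chosen only for illustration.

```python
def translate_dataset(in_path="raw.zip", out_path="translated.sqlitedb", parallel=False):
    """Translate every .fa member of a zipped data store into a SQLite data store.

    The file names and this function name are illustrative assumptions.
    """
    # Imported inside the function so the sketch can be defined without cogent3.
    from cogent3 import get_app, open_data_store

    # Open the zipped input data store, selecting .fa files as members.
    in_dstore = open_data_store(in_path, suffix="fa", mode="r")
    # Create a writable SQLite output data store (one file, not many small ones).
    out_dstore = open_data_store(out_path, mode="w")

    # Compose loader, translator, and writer into a single pipeline.
    loader = get_app("load_unaligned", moltype="dna")
    translate = get_app("translate_seqs")
    writer = get_app("write_db", data_store=out_dstore)
    app = loader + translate + writer

    # Apply the pipeline to every member of the input data store.
    # On some platforms, parallel=True needs an `if __name__ == "__main__":` guard.
    out_dstore = app.apply_to(in_dstore, show_progress=True, parallel=parallel)

    print(out_dstore.describe)               # counts of completed, not completed, logs
    print(out_dstore.validate())             # MD5 checksum verification of records
    print(out_dstore.summary_not_completed)  # why some records could not be processed
    return out_dstore
```

Calling translate_dataset() would run the whole workflow and return the output data store for further inspection.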
Note
The .completed and .not_completed attributes give access to the different types of members, while .members gives them all. For example, len(out_dstore.not_completed) returns the count of failed records and each element is a DataMember.
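As a small illustration of these attributes, the sketch below tallies each member type. The helper name member_counts is hypothetical; it only assumes an object exposing .completed, .not_completed and .members as sized collections, as described in the note.

```python
def member_counts(dstore):
    """Return the number of members of each type in a data store.

    Works with any object that has .completed, .not_completed and .members
    attributes supporting len(); the helper itself is an illustrative assumption.
    """
    return {
        "completed": len(dstore.completed),
        "not_completed": len(dstore.not_completed),
        "members": len(dstore.members),
    }
```

Passing out_dstore to this helper gives the same counts as calling len() on each attribute individually.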