Read and write files
How to use open_() for reading and writing files with automatic compression detection (gzip, bzip2, lzma, zip), atomic_write for safe file writes that clean up on failure, iter_splitlines and iter_line_blocks for streaming large files, and is_url/open_url for working with URLs.
Writing a compressed file
open_() detects the compression format from the file suffix and handles it automatically. Writing a gzip-compressed text file is identical to writing a plain text file — just use a .gz suffix.
Reading a compressed file
Reading works the same way — open_() detects the .gz suffix and decompresses transparently.
Supported compression formats are gzip (.gz), bzip2 (.bz2), lzma (.xz, .lzma), and zip (.zip).
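The suffix-based detection described above can be approximated with the standard library alone. The sketch below is not scinexus's implementation: open_by_suffix, roundtrip, and the _OPENERS table are hypothetical names used only for illustration, and zip and .lzma handling are left out to keep it short.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Hypothetical suffix table for this sketch; not scinexus's internal mapping.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_suffix(path, mode="rt", **kwargs):
    """Open path with the opener matching its final suffix, else plain open()."""
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode, **kwargs)


def roundtrip(suffix, text="first line\nsecond line\n"):
    """Write text to a temporary file ending in suffix, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"data.txt{suffix}"
        # Writing looks the same for every format; only the suffix differs.
        with open_by_suffix(path, "wt") as out:
            out.write(text)
        # Reading decompresses transparently based on the same suffix.
        with open_by_suffix(path, "rt") as infile:
            return infile.read()
```

Calling roundtrip(".gz"), roundtrip(".bz2"), or roundtrip("") exercises the same write and read code path, which is the convenience that open_() is described as providing.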
Reading a URL
open_() also handles URLs. Use is_url() to check whether a path is a URL before opening it.
Checking and reading a URL
is_url() returns True for http, https, and file scheme URLs. open_() detects the URL and delegates to open_url(). Only read mode is supported for URLs.
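As a rough illustration of the check-then-open pattern, the following standard-library sketch tests the scheme of a path and reads from either a URL or a local file. It assumes nothing about how scinexus implements is_url() or open_url(); looks_like_url, read_text, and demo are hypothetical names, and the demo uses a file URL so that no network access is needed.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# Schemes treated as URLs in this sketch, mirroring the ones named above.
URL_SCHEMES = {"http", "https", "file"}


def looks_like_url(path):
    """Return True when path parses with an http, https, or file scheme."""
    return urlparse(str(path)).scheme.lower() in URL_SCHEMES


def read_text(path, encoding="utf-8"):
    """Read text from a URL (read-only) or from a local path."""
    if looks_like_url(path):
        with urlopen(path) as response:
            return response.read().decode(encoding)
    return Path(path).read_text(encoding=encoding)


def demo():
    """Read the same temporary file once by path and once by file URL."""
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "sample.txt"
        local.write_bytes(b"hello\n")
        return read_text(local), read_text(local.as_uri())
```

Checking the scheme first keeps the calling code identical for local paths and URLs, at the cost of supporting only reads on the URL branch.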
Efficiently reading large files
Reading an entire large file into memory can exhaust available RAM, and iterating line by line with Python's built-in readline() can be slow. The built-in approach pays a fixed overhead for every line it returns, which becomes a bottleneck for files with millions of lines. scinexus provides two functions that read data in large chunks and then split the chunks into lines, greatly reducing this per-line overhead.
iter_splitlines
iter_splitlines(path, chunk_size=1_000_000) reads a file in chunks (default 1 MB) and yields individual lines. It correctly handles lines that span chunk boundaries.
from scinexus.io_util import iter_splitlines

for line in iter_splitlines("large_file.txt"):
    process(line)  # process() is a placeholder for your own per-line handler
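To make the chunk-boundary handling concrete, here is a minimal standard-library sketch of the general technique: read a fixed number of characters, split on newlines, and carry the last (possibly incomplete) piece over to the next chunk. This is an illustration under assumptions, not the scinexus source; chunked_lines and split_sample are hypothetical names, and the sketch yields lines without their trailing newline.

```python
import tempfile
from pathlib import Path


def chunked_lines(path, chunk_size=1_000_000):
    """Yield lines (without newlines), reading chunk_size characters at a time."""
    leftover = ""
    with open(path, "rt") as infile:
        while chunk := infile.read(chunk_size):
            # The final piece may be an incomplete line, so hold it back
            # and prepend it to the next chunk.
            *complete, leftover = (leftover + chunk).split("\n")
            yield from complete
    if leftover:
        # Last line of a file that does not end with a newline.
        yield leftover


def split_sample(text, chunk_size):
    """Write text to a temporary file and split it with chunked_lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.txt"
        path.write_text(text)
        return list(chunked_lines(path, chunk_size=chunk_size))
```

Because the held-back piece is rejoined with the next chunk, the result does not depend on where chunk boundaries fall, even with a chunk size of one character.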
iter_line_blocks
iter_line_blocks(path, num_lines=1000, chunk_size=5_000_000) builds on iter_splitlines — it accumulates lines into lists of num_lines and yields each list. This is useful when downstream processing works on batches (e.g. FASTA records where each record spans a fixed number of lines).
from scinexus.io_util import iter_line_blocks

for block in iter_line_blocks("large_file.txt", num_lines=1000):
    process_batch(block)  # block is a list of up to 1000 strings
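The batching step itself is simple to express over any iterable of lines. The sketch below shows one way to do it with itertools.islice; it is an assumption about the general approach, not the scinexus implementation, and line_blocks is a hypothetical name.

```python
from itertools import islice


def line_blocks(lines, num_lines=1000):
    """Group an iterable of lines into lists of at most num_lines items."""
    iterator = iter(lines)
    # islice pulls the next num_lines items; an empty list means the
    # input is exhausted and the loop ends.
    while block := list(islice(iterator, num_lines)):
        yield block
```

Only the final block can be shorter than num_lines, so downstream code that expects fixed-size batches needs to handle that last case.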
Use iter_splitlines when you need one line at a time. Use iter_line_blocks when your processing naturally operates on batches of lines.