Read and write files

How to use open_() for reading and writing files with automatic compression detection (gzip, bzip2, lzma, zip), atomic_write for safe file writes that clean up on failure, iter_splitlines and iter_line_blocks for streaming large files, and is_url/open_url for working with URLs.

Writing a compressed file

open_() detects the compression format from the file suffix and handles it automatically. Writing a gzip-compressed text file is identical to writing a plain text file — just use a .gz suffix.

from scinexus import open_

with open_("data/sample.txt.gz", "wt") as f:
    f.write("Hello, compressed world!\n")
    f.write("Line two of the file.\n")

Reading a compressed file

Reading works the same way — open_() detects the .gz suffix and decompresses transparently.

from scinexus import open_

with open_("data/sample.txt.gz") as f:
    print(f.read())

# Hello, compressed world!
# Line two of the file.

Supported compression formats are gzip (.gz), bzip2 (.bz2), lzma (.xz, .lzma), and zip (.zip).

Reading a URL

open_() also handles URLs. Use is_url() to check whether a path is a URL before opening it.

Checking and reading a URL

from scinexus.io_util import is_url
from scinexus import open_

url = "https://github.com/user-attachments/files/26728407/raw.zip"
print(is_url(url))  # (1)!

with open_(url, "rb") as f:  # (2)!
    header = f.read(20)
print(header)

(1) is_url() returns True for http, https, and file scheme URLs.
(2) open_() detects the URL and delegates to open_url(). Only read mode is supported for URLs.
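
A scheme check of this kind can be sketched with urllib.parse from the standard library. This is an illustration of the idea, not the code behind is_url(): the helper name looks_like_url is hypothetical, and the set of schemes is taken from the description above.

```python
from urllib.parse import urlparse

# Schemes taken from the text above; the helper name is hypothetical.
URL_SCHEMES = {"http", "https", "file"}


def looks_like_url(path):
    """Return True if path starts with one of the recognised URL schemes."""
    return urlparse(str(path)).scheme.lower() in URL_SCHEMES


candidates = [
    "https://github.com/user-attachments/files/26728407/raw.zip",
    "file:///tmp/sample.txt",
    "data/sample.txt.gz",
]
results = {candidate: looks_like_url(candidate) for candidate in candidates}
```

Checking against an explicit set of schemes means a Windows path such as C:\data\sample.txt, whose drive letter parses as a one-letter scheme, is not mistaken for a URL.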

Efficiently reading large files

Reading an entire large file into memory can exhaust RAM, while iterating line by line with Python's built-in readline() can be slow. The built-in approach pays a fixed cost for every line (Python's file objects are buffered, so this is mostly per-call overhead and small buffer refills rather than one system call per line), which becomes a bottleneck for files with millions of lines. scinexus provides two functions that read data in large chunks and then split each chunk into lines, greatly reducing that overhead.

`iter_splitlines`

iter_splitlines(path, chunk_size=1_000_000) reads a file in chunks (default 1 MB) and yields individual lines. It correctly handles lines that span chunk boundaries.

from scinexus.io_util import iter_splitlines

for line in iter_splitlines("large_file.txt"):
    process(line)
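
The chunk-boundary handling can be illustrated with a small standard-library sketch. It is not the scinexus implementation: split_lines_in_chunks is a hypothetical name, and yielding lines without their trailing newline is an assumption about the behaviour.

```python
import tempfile
from pathlib import Path


def split_lines_in_chunks(path, chunk_size=1_000_000):
    """Yield lines (newline removed) while reading chunk_size characters at a time."""
    leftover = ""  # partial line carried over from the previous chunk
    with open(path, "rt") as f:
        while chunk := f.read(chunk_size):
            lines = (leftover + chunk).split("\n")
            # The last piece may be an incomplete line, so hold it back.
            leftover = lines.pop()
            yield from lines
    if leftover:
        # The file did not end with a newline; emit the final line.
        yield leftover


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "large_file.txt"
    sample.write_text("first line\nsecond line\n\nlast line without newline")
    # A tiny chunk size forces every line to span several chunks.
    tiny_chunks = list(split_lines_in_chunks(sample, chunk_size=7))
    one_chunk = list(split_lines_in_chunks(sample))
```

Both calls produce the same four lines, which shows that the result does not depend on where the chunk boundaries fall.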

`iter_line_blocks`

iter_line_blocks(path, num_lines=1000, chunk_size=5_000_000) builds on iter_splitlines — it accumulates lines into lists of num_lines and yields each list. This is useful when downstream processing works on batches (e.g. FASTA records where each record spans a fixed number of lines).

from scinexus.io_util import iter_line_blocks

for block in iter_line_blocks("large_file.txt", num_lines=1000):
    process_batch(block)  # block is a list of up to 1000 strings
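
The batching step on its own can be sketched with itertools.islice. This is an assumption about the general technique rather than the library's code, and batch_lines is a hypothetical helper that works on any iterable of lines.

```python
from itertools import islice


def batch_lines(lines, num_lines=1000):
    """Group any iterable of lines into lists of at most num_lines items."""
    iterator = iter(lines)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while block := list(islice(iterator, num_lines)):
        yield block


# Seven lines in batches of three: the final block holds the remainder.
blocks = list(batch_lines((f"line {i}" for i in range(7)), num_lines=3))
block_sizes = [len(block) for block in blocks]
```

Only the final block can be shorter than num_lines, which matches the "up to 1000 strings" note in the example above.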

Use iter_splitlines when you need one line at a time. Use iter_line_blocks when your processing naturally operates on batches of lines.