IO utilities

API for file IO functions that work independently of the app framework.

atomic_write

performs atomic write operations, cleaning up if the write fails

__init__(path, tmpdir=None, in_zip=None, mode='w', encoding=None)

Parameters:

Name Type Description Default
path PathType

path to the file, or path within the archive specified by in_zip

required
tmpdir PathType | None

directory where temporary file will be created

None
in_zip PathType | bool | None

path to the zip archive containing path, e.g. if in_zip="path/to/data.zip" and path="data/seqs.tsv", then decompressing the archive will produce "data/seqs.tsv"

None
mode str

file writing mode

'w'
encoding str | None

text encoding

None

close()

closes file

write(text)

writes text to file
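
The write-then-swap pattern behind an atomic write can be illustrated with the standard library alone. This is a minimal sketch, not the class's implementation: the helper name atomic_write_text is hypothetical, and placing the temporary file next to the destination is an assumption made so the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text, encoding="utf-8"):
    """Hypothetical helper: write text so readers never see a partial file."""
    path = Path(path)
    # Assumption: the temporary file lives beside the destination so the
    # final rename does not cross filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as tmp:
            tmp.write(text)
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temporary file if anything went wrong, then re-raise.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "seqs.tsv"
    atomic_write_text(target, "name\tlength\n")
    written = target.read_text(encoding="utf-8")
    # Anything other than the target would be a stray temporary file.
    leftovers = [p.name for p in Path(workdir).iterdir() if p.name != "seqs.tsv"]
```

After a successful write only the destination file remains; after a failure the destination is untouched and the temporary file is removed.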

get_format_suffixes(filename)

returns file, compression suffixes
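
As an illustration of what separating a format suffix from a compression suffix can look like, here is a standalone sketch. The function name split_format_suffixes, the set of recognised compression suffixes, and the use of None for a missing part are all assumptions for this example, not the documented behaviour of get_format_suffixes.

```python
from pathlib import Path

# Assumed set of compression suffixes, chosen for illustration only.
COMPRESSION_SUFFIXES = {"gz", "bz2", "xz", "zip"}


def split_format_suffixes(filename):
    """Hypothetical stand-in: return (format suffix, compression suffix)."""
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off first.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = suffixes.pop()
    # Whatever suffix remains last is treated as the file format.
    file_format = suffixes[-1] if suffixes else None
    return file_format, compression


both = split_format_suffixes("seqs.fasta.gz")
plain = split_format_suffixes("seqs.fasta")
```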

is_url(path)

whether a path is a url
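
One common way to make this kind of check is to inspect the scheme of the parsed path. The sketch below uses only the standard library; the name looks_like_url and the accepted schemes (http, https, file) are assumptions and may not match what is_url accepts.

```python
from urllib.parse import urlparse


def looks_like_url(path):
    """Hypothetical stand-in: True if path parses with an accepted scheme."""
    parsed = urlparse(str(path))
    # Assumed scheme list; a Windows drive letter such as "C:" parses as
    # scheme "c" and is therefore rejected.
    return parsed.scheme in {"http", "https", "file"}


remote = looks_like_url("https://example.org/data/seqs.fasta")
local = looks_like_url("data/seqs.fasta")
```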

iter_line_blocks(path, num_lines=1000, chunk_size=5000000)

yields lists of num_lines lines (str) from path

Parameters:

Name Type Description Default
path PathType

data file

required
num_lines int | None

number of lines per block. If None, returns all lines.

1000
chunk_size int | None

number of bytes to load in one go from path

5000000
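
The grouping of lines into fixed-size blocks can be sketched independently of file handling. In this example the name line_blocks is hypothetical, the input is an in-memory stream rather than a path, and stripping the trailing newline from each line is an assumption.

```python
import io
from itertools import islice


def line_blocks(lines, num_lines=1000):
    """Hypothetical stand-in: group an iterable of lines into lists."""
    lines = iter(lines)
    if num_lines is None:
        # No block size given, so everything goes into a single list.
        yield list(lines)
        return
    # islice takes at most num_lines items; an empty list ends the loop.
    while block := list(islice(lines, num_lines)):
        yield block


text = io.StringIO("a\nb\nc\nd\ne\n")
blocks = list(line_blocks((line.rstrip("\n") for line in text), num_lines=2))
```

The last block may hold fewer than num_lines items when the line count is not an exact multiple of the block size.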

iter_record_chunks(*, path, delimiter, chunk_size=5000000)

yields bytes between successive occurrences of delimiter

Parameters:

Name Type Description Default
path PathType

data file. Accepts a path, URL, or any PathType and opens it via open_(path, mode="rb") so compressed formats are handled transparently. If path is a URL the stream is read in full (chunk_size is forced to None).

required
delimiter bytes

bytes delimiter on which records are split. Must be non-empty.

required
chunk_size int | None

bytes read per iteration. If None, or if the on-disk file is smaller than chunk_size, the file is read in a single call.

5000000

Yields:

Type Description
bytes

each item is the content between two successive delimiters. The first item is whatever precedes the first delimiter (often empty for files that start with a delimiter). The final item is whatever follows the last delimiter; callers filter as needed for their format.

Raises:

Type Description
ValueError

if delimiter is empty.

Notes

Reads path in chunks of chunk_size bytes and splits on delimiter, holding any trailing partial record across chunk boundaries so that delimiters spanning a boundary are detected correctly. Peak memory is bounded by chunk_size plus the size of the largest record, rather than the full file size.

Operates on raw bytes only; callers that need text decoding should do so per yielded record.

Examples:

>>> import tempfile, pathlib
>>> with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
...     _ = f.write(b">a\nAAA>b\nBBB>c\nCCC")
...     tmp = pathlib.Path(f.name)
>>> list(iter_record_chunks(path=tmp, delimiter=b">", chunk_size=8))
[b'', b'a\nAAA', b'b\nBBB', b'c\nCCC']
>>> tmp.unlink()
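
The boundary handling described in the Notes can be sketched on an in-memory stream. This is an illustration of the technique under stated assumptions, not the function's source: the name split_records is hypothetical and it takes an already opened binary stream rather than a path.

```python
import io


def split_records(stream, delimiter, chunk_size):
    """Hypothetical stand-in: yield the bytes between successive delimiters."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    pending = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Prepend the held-back tail so a delimiter that straddles two
        # chunks is seen whole.
        pending += chunk
        # Every piece except the last is a complete record; the last piece
        # may be partial and is carried into the next iteration.
        *complete, pending = pending.split(delimiter)
        yield from complete
    # Whatever follows the final delimiter is the last record.
    yield pending


data = io.BytesIO(b">a\nAAA>b\nBBB>c\nCCC")
records = list(split_records(data, b">", chunk_size=8))
# A two-byte delimiter read two bytes at a time exercises the boundary case.
straddled = list(split_records(io.BytesIO(b"x||y||z"), b"||", chunk_size=2))
```

Only the held-back tail and the current chunk are in memory at any time, which is what keeps peak memory near chunk_size plus the largest record.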

iter_splitlines(path, chunk_size=1000000)

yields lines from file

Parameters:

Name Type Description Default
path PathType

data file

required
chunk_size int | None

number of bytes to load in one go from path

1000000

Notes

Loads chunks of data from the file, yields one line at a time

open_(filename, mode='rt', **kwargs)

open that handles different compression

Parameters:

Name Type Description Default
filename PathType

path or url; a url is delegated to open_url

required
mode str

standard file opening mode

'rt'
kwargs Any

passed to open functions

{}

Returns:

Type Description
an object compatible with the file protocol
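
Choosing an opener from the file suffix is one way to handle compression transparently. The sketch below shows that idea with standard library modules only; the name open_any and the suffix table are assumptions, and URL and zip handling are left out.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Hypothetical suffix-to-opener table; the library's own mapping may differ.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_any(filename, mode="rt", **kwargs):
    """Hypothetical stand-in: pick an opener from the file's final suffix."""
    opener = OPENERS.get(Path(filename).suffix.lower(), open)
    # All of these openers share the (filename, mode, **kwargs) calling form.
    return opener(filename, mode, **kwargs)


with tempfile.TemporaryDirectory() as workdir:
    packed = Path(workdir) / "seqs.fasta.gz"
    with open_any(packed, "wt", encoding="utf-8") as out:
        out.write(">a\nACGT\n")
    with open_any(packed, "rt", encoding="utf-8") as infile:
        content = infile.read()
    # gzip files start with the magic bytes 1f 8b.
    is_gzip = packed.read_bytes()[:2] == b"\x1f\x8b"
```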

open_url(url, mode='rt', **kwargs)

open a url

Parameters:

Name Type Description Default
url str | ParseResult

url of a file at an http or https web address

required
mode str

file reading mode, one of 'rb', 'rt', 'r'

'rt'

Raises:

Type Description
IOError

if mode is a write mode or url is not a url.

Returns:

Type Description
file object which reads binary if "b" in mode, else text.

open_zip(filename, mode='r', **kwargs)

open a single member zip-compressed file

Note

If mode="r", the function raises ValueError if the zip has more than one record. The returned object is wrapped by TextIOWrapper with latin encoding, so it yields str rather than bytes.

If mode="w", returns an atomic_write() instance.
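
The read path described in the note can be approximated with zipfile and io.TextIOWrapper. This is a sketch under assumptions: the name single_member_text is hypothetical, "latin-1" is taken as the meaning of "latin encoding", it is written as a context manager for tidy cleanup, and it also rejects an empty archive, which the note does not mention.

```python
import contextlib
import io
import tempfile
import zipfile
from pathlib import Path


@contextlib.contextmanager
def single_member_text(filename, encoding="latin-1"):
    """Hypothetical stand-in: open the only member of a zip archive as text."""
    with zipfile.ZipFile(filename) as archive:
        members = archive.namelist()
        if len(members) != 1:
            raise ValueError(f"expected 1 member, found {len(members)}")
        # ZipFile.open gives a binary stream; the wrapper decodes it to str.
        with archive.open(members[0]) as raw:
            yield io.TextIOWrapper(raw, encoding=encoding)


with tempfile.TemporaryDirectory() as workdir:
    zipped = Path(workdir) / "data.zip"
    with zipfile.ZipFile(zipped, "w") as archive:
        archive.writestr("data/seqs.tsv", "name\tlength\n")
    with single_member_text(zipped) as infile:
        first_line = infile.readline()
```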

path_exists(path)

whether path is a valid path and it exists