IO utilities
API for file IO functions that work independently of the app framework.
atomic_write
performs atomic write operations, cleaning up if the write fails
__init__(path, tmpdir=None, in_zip=None, mode='w', encoding=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | PathType | path to file, or relative to directory specified by in_zip | required |
| tmpdir | PathType \| None | directory where temporary file will be created | None |
| in_zip | PathType \| bool \| None | path to the zip archive containing path, e.g. if in_zip="path/to/data.zip" and path="data/seqs.tsv", decompressing the archive will produce "data/seqs.tsv" | None |
| mode | str | file writing mode | 'w' |
| encoding | str \| None | text encoding | None |
close()
closes file
write(text)
writes text to file
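The write-then-clean-up behaviour described for atomic_write can be illustrated with the standard library alone. The following is a minimal sketch of the general pattern, not this class's implementation: `atomic_write_text` is a hypothetical helper, and placing the temporary file next to the destination is an assumption made so the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text, encoding="utf-8"):
    """Write text so that readers never observe a half-written file."""
    path = Path(path)
    # Assumption: the temporary file lives in the destination directory,
    # so os.replace() does not have to cross filesystems.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as tmp:
            tmp.write(text)
        os.replace(tmp_name, path)  # swap the finished file into place
    except BaseException:
        # Remove the temporary file if anything went wrong.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "seqs.tsv"
    atomic_write_text(target, "name\tseq\n")
    result = target.read_text(encoding="utf-8")
    leftovers = sorted(p.name for p in Path(workdir).iterdir())
```

After a successful write only the destination file remains in the directory; on failure the temporary file is deleted and the exception propagates.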
get_format_suffixes(filename)
returns the file suffix and the compression suffix
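Separating a file-format suffix from a compression suffix can be sketched with pathlib. `split_suffixes` and `COMPRESSION_SUFFIXES` are hypothetical names, and both the recognised compression suffixes and the use of None for a missing suffix are assumptions; get_format_suffixes itself may behave differently.

```python
from pathlib import Path

# Assumed set of compression suffixes, for illustration only.
COMPRESSION_SUFFIXES = {"gz", "bz2", "xz", "zip"}


def split_suffixes(filename):
    """Return (file_suffix, compression_suffix); either may be None."""
    suffixes = [s.lstrip(".").lower() for s in Path(filename).suffixes]
    compression = None
    # A trailing compression suffix is peeled off first.
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = suffixes.pop()
    file_suffix = suffixes[-1] if suffixes else None
    return file_suffix, compression


compressed = split_suffixes("data/seqs.fasta.gz")
plain = split_suffixes("data/seqs.fasta")
```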
is_url(path)
whether a path is a url
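One common way to make this kind of check is to inspect the scheme reported by `urllib.parse.urlparse`. The sketch below is illustrative only: `looks_like_url` is a hypothetical helper, and the set of accepted schemes is an assumption rather than what is_url actually tests.

```python
from urllib.parse import urlparse

# Assumed set of schemes treated as URLs, for illustration only.
URL_SCHEMES = {"http", "https", "file"}


def looks_like_url(path):
    """True if path parses with one of the assumed URL schemes."""
    return urlparse(str(path)).scheme.lower() in URL_SCHEMES


remote = looks_like_url("https://example.org/data/seqs.fasta")
local = looks_like_url("data/seqs.fasta")
```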
iter_line_blocks(path, num_lines=1000, chunk_size=5000000)
iter_record_chunks(*, path, delimiter, chunk_size=5000000)
yield bytes between successive occurrences of delimiter
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | PathType | data file. Accepts a path, URL, or any | required |
| delimiter | bytes | bytes delimiter on which records are split. Must be non-empty. | required |
| chunk_size | int \| None | bytes read per iteration. If | 5000000 |
Yields:
| Type | Description |
|---|---|
| bytes | each item is the content between two successive delimiters. The first item is whatever precedes the first delimiter (often empty for files that start with a delimiter). The final item is whatever follows the last delimiter; callers filter as needed for their format. |
Raises:
| Type | Description |
|---|---|
| ValueError | if |
Notes
Reads path in chunks of chunk_size bytes and splits on
delimiter, holding any trailing partial record across chunk
boundaries so that delimiters spanning a boundary are detected
correctly. Peak memory is bounded by chunk_size plus the size
of the largest record, rather than the full file size.
Operates on raw bytes only; callers that need text decoding should do so per yielded record.
Examples:
>>> import tempfile, pathlib
>>> with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
... _ = f.write(b">a\nAAA>b\nBBB>c\nCCC")
... tmp = pathlib.Path(f.name)
>>> list(iter_record_chunks(path=tmp, delimiter=b">", chunk_size=8))
[b'', b'a\nAAA', b'b\nBBB', b'c\nCCC']
>>> tmp.unlink()
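The boundary handling described in the Notes can be sketched in a few lines. This is a minimal illustration of the carry-over technique under stated assumptions, not the function's source: `split_records` is a hypothetical name, and it reads from an already open binary stream rather than a path.

```python
import io


def split_records(stream, delimiter, chunk_size):
    """Yield the bytes between successive delimiters in a binary stream."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    pending = b""  # trailing partial record carried between chunks
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the last delimiter is complete; the remainder
        # may hold a partial record or a partial delimiter.
        *complete, pending = pending.split(delimiter)
        yield from complete
    yield pending  # whatever follows the last delimiter


single = list(split_records(io.BytesIO(b">a\nAAA>b\nBBB>c\nCCC"), b">", 8))
multi = list(split_records(io.BytesIO(b"r1//\nr2//\nr3"), b"//\n", 3))
```

Because the partial tail is prepended to the next chunk before splitting, a multi-byte delimiter that straddles a chunk boundary is still found, and memory use stays near chunk_size plus the largest record.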
iter_splitlines(path, chunk_size=1000000)
yields lines from a file
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | PathType | data file | required |
| chunk_size | int \| None | number of bytes to load in one go from path | 1000000 |
Notes
Loads chunks of data from the file, yields one line at a time
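Reading in chunks while still yielding whole lines requires holding back the unfinished last line of each chunk. The sketch below shows one way to do that; `chunked_lines` is a hypothetical name, it takes an open text stream instead of a path, and it assumes lines end with "\n".

```python
import io


def chunked_lines(stream, chunk_size):
    """Yield lines, without their line endings, from a text stream."""
    pending = ""  # unfinished line carried over to the next chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        *complete, pending = pending.split("\n")
        yield from complete
    if pending:
        # A final line that has no trailing newline.
        yield pending


lines = list(chunked_lines(io.StringIO("first\nsecond\nthird\n"), 4))
```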
open_(filename, mode='rt', **kwargs)
open() replacement that handles different compression formats
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filename | PathType | path or url, if a url delegates processing to open_url | required |
| mode | str | standard file opening mode | 'rt' |
| kwargs | Any | passed to open functions | {} |
Returns:
| Type | Description |
|---|---|
| | an object compatible with the file protocol |
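Choosing an opener from the file's compression suffix is one way to get this behaviour. The following is a sketch of that dispatch idea using standard library modules, not the implementation of open_: `open_any` and `OPENERS` are hypothetical names, the suffix mapping is an assumption, and URL handling is left out.

```python
import bz2
import gzip
import lzma
import tempfile
from pathlib import Path

# Assumed suffix-to-opener mapping, for illustration only.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_any(filename, mode="rt", **kwargs):
    """Open filename with the opener matching its compression suffix."""
    # Fall back to the built-in open() for uncompressed files.
    opener = OPENERS.get(Path(filename).suffix.lower(), open)
    return opener(filename, mode, **kwargs)


contents = {}
with tempfile.TemporaryDirectory() as workdir:
    for name in ("seqs.txt", "seqs.txt.gz", "seqs.txt.bz2", "seqs.txt.xz"):
        target = Path(workdir) / name
        with open_any(target, "wt", encoding="utf-8") as out:
            out.write("ACGT\n")
        with open_any(target, "rt", encoding="utf-8") as infile:
            contents[name] = infile.read()
```

Each returned object supports the usual file protocol, so calling code does not need to know which compression, if any, was used.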
open_url(url, mode='rt', **kwargs)
open a url
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| url | str \| ParseResult | URL of a file at an http or https web address | required |
| mode | str | mode of reading file, 'rb', 'rt', 'r' | 'rt' |
Raises:
| Type | Description |
|---|---|
| IOError | if mode is a write mode or url is not a url |
Returns:
| Type | Description |
|---|---|
| | file object which reads binary if "b" in mode, else text |
open_zip(filename, mode='r', **kwargs)
open a single member zip-compressed file
Note
If mode="r", the function raises ValueError if the zip archive has more than one record. The returned object is wrapped by TextIOWrapper with latin encoding, so it yields text rather than bytes.
If mode="w", the function returns an atomic_write() instance.
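The read-mode behaviour in the note can be approximated with the standard zipfile module. This is a hedged sketch rather than the body of open_zip: `read_single_member_zip` is a hypothetical helper, it also rejects empty archives, and it loads the member into memory before wrapping it, which the real function may not do.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def read_single_member_zip(filename, encoding="latin-1"):
    """Return a text handle for the only member of a zip archive."""
    with zipfile.ZipFile(filename) as archive:
        names = archive.namelist()
        if len(names) != 1:
            raise ValueError(f"expected 1 member, found {len(names)}")
        # Assumption: the member is small enough to hold in memory.
        data = archive.read(names[0])
    # Wrapping the bytes means callers receive str, not bytes.
    return io.TextIOWrapper(io.BytesIO(data), encoding=encoding)


with tempfile.TemporaryDirectory() as workdir:
    zip_path = Path(workdir) / "data.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("data/seqs.tsv", "name\tseq\n")
    with read_single_member_zip(zip_path) as handle:
        first_line = handle.readline()
```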
path_exists(path)
whether path is a valid path and it exists