struct MetroArea {
    int year;
    char name[12];
    char country[2];
    float population;
};
Parsing binary records with struct
The struct module provides functions to parse fields of bytes into a tuple of Python objects, and to perform the opposite conversion, from a tuple into packed bytes. struct can be used with bytes, bytearray, and memoryview objects.
The struct module is powerful and convenient, but before using it you should seriously consider alternatives, so that’s the first short section in this post.
Should we use struct?
Proprietary binary records in the real world are brittle and can be corrupted easily. The super simple example in Struct 101 will expose one of many caveats: a string field may be limited only by its size in bytes, it may be padded by spaces, or it may contain a null-terminated string followed by random garbage up to a certain size. There is also the problem of endianness: the order of the bytes used to represent integers and floats, which depends on the CPU architecture.
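To make the endianness problem concrete, here is a minimal sketch that packs the same integer with both byte orders. The value 2018 is borrowed from the example later in this post; the `<` and `>` prefixes are the struct module's markers for little-endian and big-endian.

```python
import struct
import sys

year = 2018  # 0x000007E2

little = struct.pack('<i', year)  # least significant byte first
big = struct.pack('>i', year)     # most significant byte first

print(little)  # b'\xe2\x07\x00\x00'
print(big)     # b'\x00\x00\x07\xe2'

# Reading with the wrong byte order raises no error; it just yields a wrong number.
wrong = struct.unpack('>i', little)[0]
print(wrong)

# With no prefix, struct uses the byte order of the machine running the code.
print(sys.byteorder)
```

Because the mistake is silent, it is usually worth stating the byte order in the format string whenever the file format defines one.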
If you need to read or write from an existing binary format, I recommend trying to find a library that is ready to use instead of rolling your own solution.
If you need to exchange binary data among in-company Python systems, the pickle module is the easiest way—but beware that different versions of Python use different binary formats by default, and reading a pickle may run arbitrary code, so it’s not safe for external use.
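One way to soften the version problem is to pin the pickle protocol explicitly instead of relying on the default. The sketch below assumes protocol 4, which Python 3.4 and later can read; the record is made-up sample data, and the right protocol number depends on the oldest Python version you need to support.

```python
import pickle

# Sample data for illustration only.
record = {'year': 2018, 'name': 'Tokyo', 'country': 'JP'}

# Pin the protocol so writers and readers on different Python versions agree.
# Protocol 4 is an assumption here; pick the one your oldest reader supports.
payload = pickle.dumps(record, protocol=4)

# Only unpickle data from a source you trust: loading can run arbitrary code.
restored = pickle.loads(payload)
print(restored == record)  # True

# The defaults of the running interpreter, for comparison.
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```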
If the exchange involves programs in other languages, use JSON or a multi-platform binary serialization format like MessagePack or Protocol Buffers.
Struct 101
Suppose you need to read a binary file containing data about metropolitan areas, produced by a C program with a record defined as in Example 1.
Here is how to read one record in that format, using struct.unpack:
>>> from struct import unpack, calcsize
>>> FORMAT = 'i12s2sf'
>>> size = calcsize(FORMAT)
>>> data = open('metro_areas.bin', 'rb').read(size)
>>> data
b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"
>>> unpack(FORMAT, data)
(2018, b'Tokyo\x00\xc5\x05\x01\x00\x00\x00', b'JP', 43868228.0)
Note how unpack returns a tuple with four fields, as specified by the FORMAT string.
The letters and numbers in FORMAT are Format Characters described in the struct module documentation.
part | size | C type | Python type | limits to actual content |
---|---|---|---|---|
i | 4 bytes | int | int | 32 bits; range -2,147,483,648 to 2,147,483,647 |
12s | 12 bytes | char[12] | bytes | length = 12 |
2s | 2 bytes | char[2] | bytes | length = 2 |
f | 4 bytes | float | float | 32 bits; approximate range ±3.4×10^38 |
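The sizes in the table add up to 22 bytes, yet the record shown in Example 2 has 24. With no prefix, the format string uses native alignment, so on typical platforms two padding bytes are placed before the float (they are the b'\x00\x00' right after b'JP'). The sketch below spells that padding out with a standard-size format. Treating the data as little-endian is an assumption based on the bytes shown in Example 2.

```python
from struct import calcsize, unpack

# The 24 bytes shown in Example 2.
data = b"\xe2\x07\x00\x00Tokyo\x00\xc5\x05\x01\x00\x00\x00JP\x00\x00\x11X'L"

print(calcsize('<i12s2sf'))    # 22: standard sizes, no padding
print(calcsize('<i12s2s2xf'))  # 24: '2x' marks two explicit pad bytes
print(calcsize('i12s2sf'))     # native size and alignment; platform dependent

# Explicit layout: does not depend on the machine running the code.
record = unpack('<i12s2s2xf', data)
print(record)
```

An explicit format like this one is more verbose, but it reads the same file the same way on every platform.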
One detail about the layout of metro_areas.bin is not clear from the code in Example 1: size is not the only difference between the name and country fields. The country field always holds a 2-letter country code, but name is a null-terminated sequence with up to 12 bytes including the terminating b'\0', which you can see in Example 2 right after the word Tokyo.[1]
Now let’s review a script to extract all records from metro_areas.bin and produce a simple report like this:
$ python3 metro_read.py
2018 Tokyo, JP 43,868,228
2015 Shanghai, CN 38,700,000
2015 Jakarta, ID 31,689,592
Example 3, the metro_read.py script, showcases the handy struct.iter_unpack function.
from struct import iter_unpack

FORMAT = 'i12s2sf'                             # (1)

def text(field: bytes) -> str:                 # (2)
    octets = field.split(b'\0', 1)[0]          # (3)
    return octets.decode('cp437')              # (4)

with open('metro_areas.bin', 'rb') as fp:      # (5)
    data = fp.read()

for fields in iter_unpack(FORMAT, data):       # (6)
    year, name, country, pop = fields
    place = text(name) + ', ' + text(country)  # (7)
    print(f'{year}\t{place}\t{pop:,.0f}')
(1) The struct format.
(2) Utility function to decode and clean up the bytes fields; returns a str.[2]
(3) Handle null-terminated C string: split once on b'\0', then take the first part.
(4) Decode bytes into str.
(5) Open and read the whole file in binary mode; data is a bytes object.
(6) iter_unpack(…) returns a generator that produces one tuple of fields for each sequence of bytes matching the format string.
(7) The name and country fields need further processing by the text function.
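To experiment with iter_unpack without the data file, the records can be built in memory with struct.pack. This is only a sketch: the format below is an explicit little-endian variant with two pad bytes (an assumption, chosen so the result does not depend on the platform), and the two records reuse values from the report above. It also shows what happens when the buffer is not a whole number of records.

```python
import struct

FORMAT = '<i12s2s2xf'  # assumed explicit layout: little-endian, 2 pad bytes
record_size = struct.calcsize(FORMAT)

# Build two records in memory instead of reading metro_areas.bin.
data = (struct.pack(FORMAT, 2018, b'Tokyo', b'JP', 43868228.0)
        + struct.pack(FORMAT, 2015, b'Shanghai', b'CN', 38700000.0))

for year, name, country, pop in struct.iter_unpack(FORMAT, data):
    print(year, name.split(b'\0', 1)[0], country, pop)

# iter_unpack requires the buffer length to be a multiple of the record size.
try:
    list(struct.iter_unpack(FORMAT, data[:-1]))
except struct.error as exc:
    print('truncated data:', exc)
```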
The struct module provides no way to specify null-terminated string fields. When processing a field like name in the example above, after unpacking we need to inspect the returned bytes to discard the first b'\0' and all bytes after it in that field. It is quite possible that bytes after the first b'\0' and up to the end of the field are garbage. You can actually see that in Example 2.
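As a variation on that clean-up step, here is a sketch using bytes.partition, which handles fields with and without a terminator in the same way. The helper name c_string and its encoding parameter are hypothetical, introduced only for illustration. The last lines show that, when packing, struct pads short values of an s field with null bytes.

```python
import struct

def c_string(field: bytes, encoding: str = 'cp437') -> str:
    """Decode the bytes before the first null byte; ignore the rest."""
    octets, _, _ = field.partition(b'\0')
    return octets.decode(encoding)

# The name field from Example 2, with garbage after the terminator.
print(c_string(b'Tokyo\x00\xc5\x05\x01\x00\x00\x00'))  # Tokyo

# A field with no null byte at all is returned whole.
print(c_string(b'JP'))  # JP

# When packing, struct pads short 's' values with null bytes,
# so fields written from Python carry no garbage.
packed = struct.pack('12s', b'Tokyo')
print(packed)  # b'Tokyo\x00\x00\x00\x00\x00\x00\x00'
```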
Memory views can make it easier to experiment and debug programs using struct, as the next section explains.
Structs and Memory Views
Python’s memoryview type does not let you create or store byte sequences. Instead, it provides shared memory access to slices of data from other binary sequences, packed arrays, and buffers such as Python Imaging Library (PIL) images,[3] without copying the bytes.
Example 4 shows the use of memoryview and struct together to extract the width and height of a GIF image.
>>> import struct
>>> fmt = '<3s3sHH' # (1)
>>> with open('filter.gif', 'rb') as fp:
...     img = memoryview(fp.read())  # (2)
...
>>> header = img[:10] # (3)
>>> bytes(header) # (4)
b'GIF89a+\x02\xe6\x00'
>>> struct.unpack(fmt, header) # (5)
(b'GIF', b'89a', 555, 230)
>>> del header # (6)
>>> del img
(1) struct format: < little-endian; 3s3s two sequences of 3 bytes; HH two 16-bit integers.
(2) Create memoryview from file contents in memory…
(3) …then another memoryview by slicing the first one; no bytes are copied here.
(4) Convert to bytes for display only; 10 bytes are copied here.
(5) Unpack memoryview into tuple of: type, version, width, and height.
(6) Delete references to release the memory associated with the memoryview instances.
Note that slicing a memoryview returns a new memoryview, without copying bytes.[4]
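A quick way to convince yourself that the slice shares memory with the original is to change the underlying buffer and unpack the slice again. The sketch below uses a hand-made bytearray that imitates the start of a GIF header, not a real image file; the width 640 is an arbitrary value picked for the demonstration.

```python
import struct

# A hand-made buffer: signature, version, width, height, plus filler bytes.
buf = bytearray(b'GIF89a' + struct.pack('<HH', 555, 230) + b'\x00' * 6)

view = memoryview(buf)
header = view[:10]  # a new memoryview over the same bytes; nothing is copied

# Overwrite the width directly in the underlying buffer.
struct.pack_into('<H', buf, 6, 640)

# The slice sees the change because it shares memory with buf.
fields = struct.unpack('<3s3sHH', header)
print(fields)  # (b'GIF', b'89a', 640, 230)

# unpack_from reads at an offset without needing a slice at all.
size = struct.unpack_from('<HH', buf, 6)
print(size)  # (640, 230)

# Release the views explicitly when done with them.
header.release()
view.release()
```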
I will not go deeper into memoryview or the struct module, but if you work with binary data, you’ll find it worthwhile to study their docs:
Built-in Types » Memory Views and struct — Interpret bytes as packed binary data.
[1] \0 and \x00 are two valid escape sequences for the null character, chr(0), in a Python str or bytes literal.
mmap module to open the image as a memory-mapped file. That module is outside the scope of this book, but if you read and change binary files frequently, learning more about mmap
— Memory-mapped file support will be very fruitful.