Monday, September 01, 2008

Packed Binary Routines for PLT Scheme

I have implemented an equivalent of the Python pack/unpack functions in PLT Scheme. I needed it primarily to be able to (more easily) read binary data from other applications for analysis in PLT Scheme.

The documentation for the Python capability can be found in the Python Library Reference: 4.3 struct -- Interpret strings as packed data. I have implemented the entire capability except for the "p" and "P" formats that encode a "Pascal string". [I don't have pack_into or unpack_from implemented yet - they are new in Pyton 2.5 - but I will implement them before posting the package to PLaneT. The following description is modified from the above reference.

packed-binary.ss

This module performs conversions between PLT Scheme values and C structs represented as PLT Scheme byte strings. It uses format strings (explained below) as compact descriptions of the layout of the C structs and the intended conversion to/from PLT Scheme values. This can be used in handling binary data stored in files or from network connections, among other sources.

The module defines the following functions:

(pack format v1 v2 ...)
Returns a byte string containing the values v1, v2, ... packed according to the given format. The arguments must match the values required by the format exactly.
(pack-into format buffer offset v1 v2 ...)
Pack the values v1, v2, ... according to the given format into the (mutable) byte string buffer starting at offset. Note that the offset is not an optional argument.
(write-packed format port v1 v2 ...)
Pack the values v1, v2, ... according to the given format and write them to the specified output port. Note that port is not an optional argument. [Note that write-packed is not part of the Python module, but was added because of its utility.]
(unpack format bytes)
Unpack the bytes according to the given format. The result is a list, even if it contains exactly one item. The bytes must contain exactly the amount of data required by the format [(bytes-length bytes) must equal (calculate-size format)].
(unpack-from format buffer (offset 0))
Unpack the byte string buffer according to the given format. The result is a list, even if it contains exactly one element. The buffer must contain at least the amount of data required by the format [(- (bytes-length buffer) offset) must be at least (calculate-size format)].
(read-packed format port)
Read (calculate-size format) bytes from the input port and unpack them according to the given format. The result is a list, even if it contains exactly one element. [Note that read-packed is not part of the Python module, but was added because of its utility.]
(calculate-size format)
Return the size of the byte string corresponding to the given format.
Format characters have the following meaning; the conversion between C and PLT Scheme values should be obvious given their types:

FormatC TypePLT Scheme
xpad byteno value
ccharchar
bsigned charinteger
Bunsigned charinteger
hshortinteger
Hunsigned shortinteger
iintinteger
Iunsigned intinteger
llonginteger
Lunsigned longinteger
qlong longinteger
Qunsigned long longinteger
ffloatreal
ddoublereal
schar[]string
A format character may be preceded by an integral repeat count. For example, the format string "4h" means exactly the same thing as "hhhh".

Whitespace characters between formats are ignored; a count and its format must not contain whitespace though.

For the "s" format character, the count is interpreted as the size of the string, not a repeat count like for the other format characters. For example, "10s" means a 10-byte string while "10c" means 10 characters. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting string always has exactly the specified number of bytes. As a special case, "0s" means a single, empty string (while "0c" means 0 characters).

By default, C numbers are represented in the machine's native format and byte order, and properly aligned by skipping pad bytes if necessary.

Alternatively, the first character of the format string can be used to indicate the byte order, size, and alignment of the packed data according to the following table:
CharacterByte orderSize and alignment
@nativenative
=nativestandard
<little endianstandard
>big endianstandard
!network (= big endian)standard

If the first character is not one of these, "@" is assumed.

Native byte order is big endian or little endian. For example, Motorola and Sun processors are big endian; Intel and DEC processors are little endian.

Standard size and alignment are as follows: no alignment is required for any type (so you have to use pad bytes); short is 2 bytes; int and long are 4 bytes; long long is 8 bytes; float and double are 32-bit and 64-but IEEE floating point numbers, respectively.

Note the difference between "@" and "=": both use native byte order, but the size and alignment of the latter is standardized.

The form "!" is available for those who can't remember whether network byte order is big endian or little endian - it is big endian.

There is no way to indicate non-native byte order (force byte swapping); use the appropriate choice of "<" or ">".

Hint, to align the end of a structure to the alignment requirement of a particular type, end the format with the code for that type with a repeat count of zero. For example, the format "llh0l" specified two pad bytes at the end, assuming longs are aligned on 4-byte boundaries. This only works when native size and alignment are in effect; standard size and alignment does not enforce any alignment.

The current implementation may not properly handle native alignment in all cases. For the current implementation, the native alignment is assumed to be the same as the size. This may result in excess pad bytes, particularly for 8-byte objects.

I will be putting the code up on the Schematics project on source forge (probably tomorrow) and it will be available on PLaneT soon thereafter.

Labels:

2 Comments:

Blogger Dan L said...

I have an application where I need to pack scheme data up for consumption by a C program. Any updates on the code's availability?

5:14 PM  
Blogger Doug Williams said...

The code is available from the PLaneT repository using the form (require (planet williams/packed-binary/packed-binary)) or from the Schematics project subversion repository at http://sourceforge.net/projects/schematics/. Thanks for the interest.

9:11 AM  

Post a Comment

<< Home