Tuesday, March 25, 2008

Binary Interfaces

Here's something I dug up from an old folder that may have some merit being posted. Heh... It's from when I first started working... I've learned a lot about programming since then, so don't judge me for my choices from back then, ok?

This is the result of research into creating a binary interface for one of our protocols that exclusively used XML strings. The goal was to pass binary data through XML. We wanted to use a standard if possible and, if not, something close to one. Today, I'd probably spend all my time arguing against using XML for binary transfers. Since we were using Java, Java RMI would have been sufficient for what we were doing back then, with very little hassle. If not, I bet there were easier solutions to be found.

But here's what I came up with back then:

Option one is to encode the binary data into XML. Simply create a tag and place the bytes in. The only problem is that XML has a limited character set, so the byte stream needs to be encoded into characters and then decoded after it's received. There are different algorithms for encoding. If large data sets are expected and the byte values within them are unevenly distributed, it would be best to go with the Huffman encoding approach (which encodes based on byte frequency). If we expect the data sets to be small or we want to build the application quickly, it's best to go with Base 64 (which uses a standard scheme of 3 bytes to 4 characters). By my research, the better algorithm (Huffman encoding) gives about 1 character per byte of data, and the second best but more popular algorithm (Base 64) gives about 1.33 characters per byte of data. This method is simple, though definitely not ideal, because of the overhead of encoding and decoding and the size increase from not getting a 1 to 1 byte to character conversion. Also, XML itself doesn't tell the receiver which encoding was used, so the receiver may not know how to decode the data unless both sides agree on it beforehand.
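For what it's worth, here is a minimal sketch of Option 1 with Base 64 in Java. It relies on java.util.Base64, which only arrived in Java 8 (well after this research was done), and the <payload> element and its encoding attribute are names I made up for illustration, not anything from our actual protocol.

```java
import java.util.Arrays;
import java.util.Base64;

final class Base64XmlSketch {
    // Wraps raw bytes in a hypothetical <payload> element as Base 64 text.
    // The Base 64 alphabet contains no '<' or '&', so no XML escaping is needed.
    static String toXml(byte[] data) {
        String encoded = Base64.getEncoder().encodeToString(data);
        return "<payload encoding=\"base64\">" + encoded + "</payload>";
    }

    // Pulls the text out of the element produced by toXml and decodes it.
    // A real receiver would use an XML parser instead of string searching.
    static byte[] fromXml(String xml) {
        int start = xml.indexOf('>') + 1;
        int end = xml.lastIndexOf("</payload>");
        return Base64.getDecoder().decode(xml.substring(start, end));
    }

    public static void main(String[] args) {
        // Includes bytes that would be illegal or special if placed in XML as-is.
        byte[] original = {0x00, (byte) 0xFF, 0x10, 0x3C, 0x26};
        String xml = toXml(original);
        System.out.println(xml);
        System.out.println("round trip ok: " + Arrays.equals(original, fromXml(xml)));
    }
}
```

The attribute is there only so the receiver has some hint about how to decode the contents; both sides still have to agree on it.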

Option two is a Multipart/Related MIME message. The essential idea is to send a MIME multipart/related message over HTTP that can contain multiple types of data, such as XML and an image. There is a SOAP standard that allows for this (it even allows the XML to reference the data, which is useful if the binary data is one of the parameters), though it does not seem to be an XML standard. However, at this point, there IS no XML standard (according to my research) and this seems to be as close to a standard as anything else. MIME multipart messages are currently used for e-mail attachments.
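To make the shape of such a message concrete, here is a rough Java sketch that assembles a two-part multipart/related body by hand: an XML part that refers to a binary part through a cid: reference. The boundary string, the Content-ID values, and the <image href="..."/> element are all placeholders I picked for illustration; a real implementation would more likely lean on a MIME or SOAP library than build the bytes itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MultipartRelatedSketch {
    // Hypothetical boundary; it must not occur inside any of the parts.
    static final String BOUNDARY = "example-boundary";
    static final String CRLF = "\r\n";

    // Value for the Content-Type header of the HTTP request or response.
    static String contentType() {
        return "multipart/related; type=\"text/xml\"; "
             + "start=\"<root@example.invalid>\"; boundary=\"" + BOUNDARY + "\"";
    }

    // Builds the body: the XML root part first, then the raw binary part.
    static byte[] build(String xml, byte[] binary, String binaryType) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ascii(out, "--" + BOUNDARY + CRLF
                 + "Content-Type: text/xml; charset=UTF-8" + CRLF
                 + "Content-ID: <root@example.invalid>" + CRLF + CRLF);
        raw(out, xml.getBytes(StandardCharsets.UTF_8));
        ascii(out, CRLF + "--" + BOUNDARY + CRLF
                 + "Content-Type: " + binaryType + CRLF
                 + "Content-Transfer-Encoding: binary" + CRLF
                 + "Content-ID: <data@example.invalid>" + CRLF + CRLF);
        raw(out, binary); // the bytes go in untouched, with no text encoding
        ascii(out, CRLF + "--" + BOUNDARY + "--" + CRLF);
        return out.toByteArray();
    }

    private static void ascii(ByteArrayOutputStream out, String s) {
        raw(out, s.getBytes(StandardCharsets.US_ASCII));
    }

    private static void raw(ByteArrayOutputStream out, byte[] b) {
        out.write(b, 0, b.length);
    }

    public static void main(String[] args) {
        String xml = "<message><image href=\"cid:data@example.invalid\"/></message>";
        byte[] body = build(xml, new byte[] {0x00, (byte) 0xFF, 0x3C}, "image/png");
        System.out.println("Content-Type: " + contentType());
        System.out.println(new String(body, StandardCharsets.ISO_8859_1));
    }
}
```

The point of the exercise is that the binary part travels as raw bytes, so there is no encoding overhead, at the cost of the message no longer being a plain XML string.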

Option three is DIME messaging, which is a new specification for handling binary data with SOAP messages (or other messages). It doesn't seem to be widespread, but it seems to be getting there. It is quite similar to Multipart/Related MIME messages, but offers a few key benefits. Essentially, it sacrifices flexibility for simplicity, so it is faster and simpler to create. It also allows large data sets to be broken up so that they can be sent in chunks. Furthermore, DIME will have standards that apply to more than just HTTP. Microsoft will be motivated to focus on DIME for binary attachments in its future SOAP tools and platforms.

Those were the most feasible options I found while trying to keep the current XML messaging intact.

Sources:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnservice/html/service01152002.asp - DIME
http://www.w3.org/TR/SOAP-attachments - Multipart MIME
http://www.javaworld.com/javaworld/javatips/jw-javatip117.html?tip - Encoding

We ended up going with Option 1 using the Base 64 approach. For transferring mostly 70KB files around, it's performed decently.
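To put a rough number on that, here is a small Java sketch of what Base 64 does to a file of that size. It assumes "70KB" means 70 * 1024 bytes and padded Base 64 output with no line breaks; the class and method names are just for illustration.

```java
import java.util.Base64;

final class Base64Overhead {
    // Padded Base 64 emits 4 characters for every 3 input bytes, rounding up.
    static long encodedLength(long byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void main(String[] args) {
        int fileSize = 70 * 1024; // assumption: 70KB = 71,680 bytes
        long predicted = encodedLength(fileSize);
        // Cross-check the formula against the JDK encoder (Java 8+).
        int actual = Base64.getEncoder().encodeToString(new byte[fileSize]).length();
        System.out.println(fileSize + " bytes -> " + predicted + " characters");
        System.out.println("matches JDK encoder: " + (predicted == actual));
        System.out.printf("characters per byte: %.3f%n", (double) predicted / fileSize);
    }
}
```

So each file grows by about a third on the wire, which is tolerable at this size but adds up quickly for larger transfers.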