Andre's Blog
Perfection is when there is nothing left to take away
Data object serialization

The concept of serializing data objects located in memory, so they can be stored or transmitted over a network link, has been around for a long, long time. It seems that almost everyone has taken a shot at it over the last 20-30 years: RPC, Borland's streamable interfaces, MFC's archives, Java's remote method invocation (RMI), ASN.1, distributed COM, SOAP and XML-RPC, to name just a few well-known ones, not to mention the myriad proprietary designs.

How it works

Let's take a look at how serialization works. Suppose you have a simple object representing a gamer. Each gamer may have multiple friends, who are also gamers. Such a gamer can be represented using a simple C++ structure like this:

struct gamer_t {
   char      *name;
   int       score;
   int       friend_count; 
   gamer_t   *friends;
};

For a gamer with two friends, this would look like this in memory:

┌───────────────────────────┐
│ name         : Devastator │
│ score        : 1003       │
│ friend_count : 2          │
│ friends      : pointer  ●------► ┌─────────────────────────────┐
└───────────────────────────┘      │friend[0]: name: Rachel; ... │
                                   ├─────────────────────────────┤
                                   │friend[1]: name: Doku;   ... │
                                   └─────────────────────────────┘

The data member friends is a pointer and cannot be saved to disk or transferred over the network because in either case its value, the address of the array of gamer_t structures, would have no meaning when read from disk or received by another process. The same is true of name, which is a pointer to a character array.

The simplest, and very naive, way to serialize gamer_t is to write its data members to the stream, one by one, following each pointer and writing the data it points to. Such a stream would have this content when generated on an x86 machine:

Devastator 00 EB 03 02 00 Rachel 00 56 00 00 00 Doku 00 3F 00 00 00
───────────── ───── ───── ───────────────────── ───────────────────
name          score count      friend[0]             friend[1]

In this stream, a zero byte indicates the end of a string of characters, much the same way C strings are terminated. The name is followed by EB 03, which is the hexadecimal representation of 1003 as stored on a little-endian processor architecture (the least significant byte is at the lower address). The next two bytes, 02 00, are the count of gamer_t structures representing friends, which follow the main gamer_t structure. Each integer occupies two bytes in this example.
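To make the layout concrete, here is a minimal sketch of such a naive writer in JavaScript, the language used for the later examples in this post. The function names and the object shape (name, score, friends) are my own assumptions for illustration, and names are assumed to be plain ASCII.

```javascript
// Hypothetical helper: appends a 16-bit integer, least significant byte first.
function writeUint16LE(out, value) {
  out.push(value & 0xff, (value >> 8) & 0xff);
}

// Hypothetical helper: writes a gamer in the naive layout shown above:
// zero-terminated name, 16-bit score, 16-bit friend count, then each friend.
function serializeGamer(gamer, out = []) {
  for (const ch of gamer.name) {
    out.push(ch.charCodeAt(0) & 0xff); // assumes ASCII names
  }
  out.push(0); // string terminator
  writeUint16LE(out, gamer.score);
  writeUint16LE(out, gamer.friends.length);
  for (const friend of gamer.friends) {
    serializeGamer(friend, out);
  }
  return out;
}

const devastator = {
  name: "Devastator",
  score: 1003,
  friends: [
    { name: "Rachel", score: 86, friends: [] },
    { name: "Doku", score: 63, friends: [] },
  ],
};

const bytes = serializeGamer(devastator);

// Render the four bytes that follow the name (score and count) as hex.
const scoreAndCount = bytes
  .slice(11, 15)
  .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
  .join(" ");
console.log(scoreAndCount); // "EB 03 02 00"
```

Note that the pointers themselves never reach the stream. The writer follows them and emits the data they point to, which is exactly why the friend count has to be written before the friends.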

Byte order

Those who have any experience in programming will immediately note that the stream described above cannot be read correctly on a big-endian architecture. There, EB 03 will be read as 60163, which would make this gamer's score much better than it really is. More important, and more dangerous, the sequence 02 00 will be read as 512, implying that there are 512 gamer_t structures following the main one. In order to deal with byte ordering issues, all participating systems must either agree on one type of ordering or encode the byte ordering within the stream.
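The misreading is easy to reproduce with the standard DataView object, whose getUint16 and setUint16 methods take an explicit little-endian flag. The sketch below reads the same two score bytes both ways and then shows one possible convention, always writing big-endian. The helper names are hypothetical.

```javascript
// The two score bytes from the stream above.
const scoreBytes = new Uint8Array([0xeb, 0x03]);
const view = new DataView(scoreBytes.buffer);

const asLittleEndian = view.getUint16(0, true); // 1003
const asBigEndian = view.getUint16(0, false); // 60163

// Hypothetical convention: every writer emits big-endian values,
// no matter which CPU it runs on, and every reader expects the same.
function writeUint16BE(value) {
  const buf = new Uint8Array(2);
  new DataView(buf.buffer).setUint16(0, value, false);
  return buf;
}

function readUint16BE(buf) {
  return new DataView(buf.buffer, buf.byteOffset, buf.byteLength).getUint16(0, false);
}
```

Because the byte order is spelled out in the read and write calls rather than inherited from the processor, both sides produce and consume the same bytes.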

Versioning

Many naive implementations of serializable data objects lack object versioning. That is, if a new data member were added to gamer_t, there would be no way for the receiving component to know that. Embedding per-object version information increases the size of the stream, but is invaluable when a data structure is expected to change, which is true, over time, for most data structures.
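As a rough sketch of what per-object versioning could look like, assume a record that starts with a one-byte version number, and a hypothetical level member that was added to the gamer in version 2. Neither the layout nor the level field comes from a real format; they only illustrate the idea.

```javascript
// Hypothetical layout:
//   [version: 1 byte][score: uint16 LE][level: uint16 LE, version 2 only]
function writeGamerRecord(gamer, version) {
  const buf = new Uint8Array(version >= 2 ? 5 : 3);
  const view = new DataView(buf.buffer);
  view.setUint8(0, version);
  view.setUint16(1, gamer.score, true);
  if (version >= 2) {
    view.setUint16(3, gamer.level, true);
  }
  return buf;
}

function readGamerRecord(buf) {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const version = view.getUint8(0);
  if (version < 1 || version > 2) {
    throw new Error("unsupported gamer record version " + version);
  }
  const gamer = { score: view.getUint16(1, true) };
  // Version 1 streams carry no level, so fall back to a default.
  gamer.level = version >= 2 ? view.getUint16(3, true) : 0;
  return gamer;
}
```

A reader written this way can still consume old streams, and it fails loudly on a version it does not know instead of misinterpreting the bytes.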

Size vs. quality

As the robustness of a stream increases, so does its size, as it has to accommodate versions, sub-object size information, checksums, character set and currency identifiers, and so on. It is not unusual for performance-conscious implementations to use individual bits in order to decrease the stream size.
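Here is a small sketch of the bit-level approach, assuming a made-up header byte that squeezes a version, a checksum-present flag and a character set identifier into eight bits. The field widths and names are assumptions chosen for the example.

```javascript
// Hypothetical header byte: vvv c ssss
//   vvv  = version (0-7)
//   c    = checksum-present flag
//   ssss = character set id (0-15)
function packHeader(version, hasChecksum, charsetId) {
  if (version < 0 || version > 7 || charsetId < 0 || charsetId > 15) {
    throw new RangeError("header field out of range");
  }
  return (version << 5) | ((hasChecksum ? 1 : 0) << 4) | charsetId;
}

function unpackHeader(byte) {
  return {
    version: (byte >> 5) & 0x07,
    hasChecksum: ((byte >> 4) & 0x01) === 1,
    charsetId: byte & 0x0f,
  };
}

const header = packHeader(2, true, 3); // 0x53
```

Three fields that would take three bytes when written naively fit into one, at the price of a stream that is harder to read in a hex dump.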

Character streams

However, some implementations prefer the operator's convenience over stream size and use character-only streams. These are easier for a human to troubleshoot and sometimes provide better security, as more readily available tools may be used to ensure the stream does not carry unsafe data.

Many Internet protocols were initially designed to serve text data. For example, SMTP, HTTP and POP3 were initially meant to transport character messages. Even today, binary data often has to be turned into character-based data using such methods as quoted-printable and base64 encodings, although many of the protocols have been extended to be able to serve binary data as-is.
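To show what turning binary data into characters involves, here is a hand-written base64 encoder. It is only a sketch of the standard alphabet with = padding; in practice you would more likely use whatever encoder your platform already provides.

```javascript
const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encodes an array of byte values (0-255) as base64 text.
// Every group of three bytes becomes four characters of six bits each.
function toBase64(bytes) {
  let text = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i];
    const b1 = i + 1 < bytes.length ? bytes[i + 1] : 0;
    const b2 = i + 2 < bytes.length ? bytes[i + 2] : 0;
    const group = (b0 << 16) | (b1 << 8) | b2;

    text += BASE64_ALPHABET[(group >> 18) & 0x3f];
    text += BASE64_ALPHABET[(group >> 12) & 0x3f];
    // Missing input bytes are marked with "=" padding.
    text += i + 1 < bytes.length ? BASE64_ALPHABET[(group >> 6) & 0x3f] : "=";
    text += i + 2 < bytes.length ? BASE64_ALPHABET[group & 0x3f] : "=";
  }
  return text;
}

// The score and friend count bytes from the binary stream above.
const encoded = toBase64([0xeb, 0x03, 0x02, 0x00]); // "6wMCAA=="
```

Four bytes became eight characters here. In general base64 grows the data by about a third, which is the price of staying within a character-only protocol.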

XML streams

One of the most famous types of character documents used in data streaming and storage is XML, along with its ancestor, SGML, on which HTML is based. The advantage of a character-based stream is self-evident to anyone who has ever looked at the source of an HTML page, trying to fix it, work around some problem or, sometimes, just learn how it is done.

XML has come a long way. It not only offers a well-documented way to store and transfer structured data, but is also backed by many tools that help to query large XML structures (XPath), transform them into other structures (XSLT), validate them (DTD and schemas) and so on. The list is quite long.

Just as with binary representation, there are many ways to represent data in XML. For example, the gamer_t structure described above may be serialized in either of the following two ways:

<gamer name="Devastator" score="1003">
<friends>
   <gamer name="Rachel" score="86"/>
   <gamer name="Doku" score="63"/>
</friends>
</gamer>

<gamers>
<gamer id="g1" name="Devastator" score="1003">
   <friends>
      <friend idref="g2"/>
      <friend idref="g3"/>
   </friends>
</gamer>
<gamer id="g2" name="Rachel" score="86"/>
<gamer id="g3" name="Doku" score="63"/>
</gamers>

The first structure is self-contained and may be used without any additional lookups (e.g. when the result of a single game is to be displayed), but it will result in data duplication when multiple entries are serialized (e.g. if two or more gamers have the same friends). The second structure is better suited for larger gamer lists, as each gamer entry only lists references to other gamers and not the content of other gamer structures.

There are many other ways to represent the same data in XML, which may be more appropriate for some situations and less for others.
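For illustration, the first, self-contained layout could be produced by a short recursive function like the one below. This is only a sketch: gamerToXml and escapeAttr are hypothetical helpers, the output is written without indentation, and a real application would more likely rely on an XML library.

```javascript
// Escapes the characters that cannot appear as-is inside a quoted attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: serializes a gamer and, recursively, its friends.
function gamerToXml(gamer) {
  const attrs = `name="${escapeAttr(gamer.name)}" score="${escapeAttr(gamer.score)}"`;
  if (!gamer.friends || gamer.friends.length === 0) {
    return `<gamer ${attrs}/>`;
  }
  const friends = gamer.friends.map((friend) => gamerToXml(friend)).join("");
  return `<gamer ${attrs}><friends>${friends}</friends></gamer>`;
}

const xml = gamerToXml({
  name: "Devastator",
  score: 1003,
  friends: [
    { name: "Rachel", score: 86 },
    { name: "Doku", score: 63 },
  ],
});
```

Escaping the attribute values matters even in a sketch, because a gamer name containing a quote or an ampersand would otherwise produce a document that is not well formed.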

JSON

JavaScript Object Notation (JSON) is a relatively recent phenomenon hailed as a fat-free alternative to XML. JSON syntax is based on JavaScript object initialization syntax (ECMA-262, 3rd Edition, 11.1.5).

A typical object initialization code using the initializer syntax is as follows:

var o = {a : "ABC", b : function(p) {return p;}, c : true};
alert(o.b(o.a));

The code below is functionally equivalent to the code above. Both code fragments will produce an alert box with the text "ABC".

function myObj()
{
   this.a = "ABC";
   this.b = new Function("p", "return p");
   this.c = true;
}
var o = new myObj();
alert(o.b(o.a));

The way JSON works is that object initializers are passed as strings between websites and then parsed by either the JavaScript engine or a specialized parser, resulting in a JavaScript object, which is used as the request or the response of a JSON transaction.

var json = connection.getRequest();
var request = eval('(' + json + ')');

If the text stored in the json variable is well formed, the request object will be initialized with all properties and values sent by the originator of the request.

Sounds neat, doesn't it? Well, anyone who is even remotely familiar with security would comment on the fact that JSON is all about passing code between participating components. That's right, an object initializer is code that gets interpreted by the JavaScript engine on the receiving side.

JSON believers go on and on about how safe JSON is, as long as a set of rules is followed. However, the bottom line is that as long as unvalidated code reaches the JavaScript engine through eval, JSON cannot be considered safe. Only if JSON is treated as text, parsed without ever being evaluated as code, and validated during parsing can it be viewed as a safe serialization mechanism.
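One way to follow that rule, sketched below, is the JSON.parse function that has been part of the language since ECMAScript 5. It treats its input strictly as data and throws a SyntaxError on anything that is not JSON. The parseGamerJson helper and the fields it expects are assumptions borrowed from the gamer example.

```javascript
// Parses JSON text without evaluating it as code, then validates the shape.
// Returns null for anything that is not a well-formed gamer object.
function parseGamerJson(text) {
  let data;
  try {
    data = JSON.parse(text); // throws SyntaxError on anything that is not JSON
  } catch (err) {
    return null;
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    return null;
  }
  if (typeof data.name !== "string" || !Number.isInteger(data.score)) {
    return null;
  }
  // Copy only the validated fields into a fresh object.
  return { name: data.name, score: data.score };
}

const good = parseGamerJson('{"name": "Devastator", "score": 1003}');

// A payload that eval would happily run as code is rejected here.
const bad = parseGamerJson('{"name": (function () { return "x"; })(), "score": 1}');
```

Here good holds the two expected properties and bad is null, because the function call in the second payload is not valid JSON and never gets a chance to run.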

Another security-related aspect of JSON is that when data is pulled using the script tag, the same-origin policy is not applied to scripts, so it is possible for a malicious page to retrieve data it is not authorized to see.

Bottom line

If you are planning to use serialization for inter-component communications, such as a database client talking to a database server, and the performance and size of the serialized data are paramount, use binary serialization and languages that are capable of handling binary data well, such as C++ and Java.

If you are planning to communicate over HTTP, XML is your best choice because it is easy to handle through the XMLHttpRequest object, which does all of the HTTP and XML work, so you only have to query the object for results.

If you have to use JSON, do not let the JavaScript engine evaluate the input. Parse it as text first, and then create a data object out of the parsed data for further processing.

Links

Extensible Markup Language (XML)

Simple Object Access Protocol (SOAP)

XML-RPC Specification

XML-RPC vs. SOAP

The XMLHttpRequest Object

ECMAScript Language Specification

http://www.json.org/

Cross-site request forgery
