OutOfMemoryExceptions while remoting very large datasets
2 Sep 2008 8:41 AM
When you have to pass an object back and forth between processes or application domains you have to serialize it into some type of stream that can be understood by both the client and the server.
The bigger and more complex the object gets, the more expensive it is to serialize, both CPU-wise and memory-wise, and if the object is big and complex enough you can easily run into out-of-memory exceptions during the actual serialization process... and that is exactly what happened to one of my customers...
They had to pass very large datasets back and forth between the UI layer and the data layer, and these datasets could easily get up to a couple of hundred MB in size. When they passed the datasets back they would get OutOfMemoryExceptions in stacks like this one... in other words, they would get OOMs while serializing the dataset to pass it back to the client...
0454f350 773442eb [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
0454f458 7964db64 System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f498 793ba2bb System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.Serialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f4c0 793b9cef System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(System.Object, System.Runtime.Remoting.Messaging.Header[], System.Runtime.Serialization.Formatters.Binary.__BinaryWriter, Boolean)
0454f500 793b9954 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object, System.Runtime.Remoting.Messaging.Header[], Boolean)
0454f524 6778c0b0 System.Runtime.Remoting.Channels.BinaryServerFormatterSink.SerializeResponse(System.Runtime.Remoting.Channels.IServerResponseChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f57c 6778bb0f System.Runtime.Remoting.Channels.BinaryServerFormatterSink.ProcessMessage(System.Runtime.Remoting.Channels.IServerChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders, System.IO.Stream, System.Runtime.Remoting.Messaging.IMessage ByRef, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f600 67785616 System.Runtime.Remoting.Channels.Tcp.TcpServerTransportSink.ServiceRequest(System.Object)
0454f660 67777732 System.Runtime.Remoting.Channels.SocketHandler.ProcessRequestNow()
0454f690 677762a2 System.Runtime.Remoting.Channels.RequestQueue.ProcessNextRequest(System.Runtime.Remoting.Channels.SocketHandler)
0454f694 67777693 System.Runtime.Remoting.Channels.SocketHandler.BeginReadMessageCallback(System.IAsyncResult)
0454f6c4 7a569ca9 System.Net.LazyAsyncResult.Complete(IntPtr)
0454f6fc 7a56a46e System.Net.ContextAwareResult.CompleteCallback(System.Object)
0454f704 79373ecd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0454f71c 7a56a436 System.Net.ContextAwareResult.Complete(IntPtr)
0454f734 7a569bed System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
0454f764 7a61062d System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f79c 79405534 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f93c 79e7c74b [GCFrame: 0454f93c]
My gut feeling was that they were SOL. I know that serialization is very memory expensive and that the resulting serialized xml strings can get enormous, so I wasn't very surprised, especially knowing how large their datasets were.
I am not a data access guru, but I have seen this type of issue enough times that I knew what the recommendation should be.
1. Re-think the architecture... what are you using these datasets for? Who will be browsing through hundreds of MB of data anyway? (And this still holds true: in most cases where there is this much data involved, only a very small part of it is actually needed, and if that is the case, then only that small piece of the data should be handled, i.e. filter out what you need and leave the rest.)
2. Re-consider passing this data through remoting/webservices/out-of-proc session state or whatever it might be. Once you start serializing and deserializing this amount of data you are treading on thin ice when it comes to the scalability of your application, both performance-wise and memory-wise. Again, this still holds true: if the dataset itself is 100 MB you will only be able to handle a handful of concurrent requests before you run out of memory for the datasets alone.
3. If you really, really, really need this much data and this architecture you need to start thinking about moving to 64-bit, but even there you need to make sure you have enough RAM and disk space to back the memory you're using, and you still need to be careful, because the more memory you use, the longer full garbage collections will take.
We discussed a couple of options, like bringing back partial datasets or chunking the data up (something along the lines of the sketch below), but most of it was a no-go.
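Just to illustrate what "bringing back partial datasets" could look like, here is a rough, hypothetical sketch of a paged data layer method (the method name, query and connectionString are made up, not their actual code), so that the client only pulls down one page of rows per call:

// Hypothetical paged data access method - assumes using System.Data and System.Data.SqlClient
public DataSet GetCustomersPage(int pageIndex, int pageSize)
{
    DataSet ds = new DataSet();
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM Customers ORDER BY CustomerID", conn))
    {
        // Fill only the requested page of rows into the DataSet
        // (the query itself could also be paged on the server side)
        adapter.Fill(ds, pageIndex * pageSize, pageSize, "Customers");
    }
    return ds;
}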
Debugging
I created a very small remoting sample with just one method that returns a very large dataset (you can find the code for the sample at the bottom of this post), just to see how much memory we were actually using for the serialization (the dataset itself was 102 MB).
I attached to the remoting server with windbg and loaded up sos (.loadby sos mscorwks), and then I set a breakpoint on mscorwks!WKS::gc_heap::allocate_large_object so that I could record the size of the allocation (?@ebx) and the stack (!clrstack) every time we allocated a large object (I figured this was enough for a rough estimate).
Lo and behold, the last attempted allocation before the OOM was a whopping 1 142 400 418 bytes (~1 GB!!!! for a 100 MB dataset):
0:004> x mscorwks!WKS*allocate_large*
79ef212d mscorwks!WKS::gc_heap::allocate_large_object = <no type information>
0:004> bp 79ef212d "?@ebx;!clrstack;g"
Evaluate expression: 1142400418 = 4417a5a2
OS Thread Id: 0x128c (4)
ESP EIP
0454f350 79ef212d [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
...
When you try to allocate an object like that it needs to be allocated in one chunk. Since it is larger than the size of the LOH segment we will try to create a segment the size of the object, and in my case I just didn't have 1 GB of free space in my virtual memory in one large chunk, so the allocation fails with an OOM.
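As a side note, you can see this contiguous-allocation requirement with a tiny console app like the one below (just an illustration, not part of the repro above); in a 32-bit process a single ~1 GB array allocation will usually fail simply because there is rarely that much contiguous free address space:

using System;

class LargeAllocationDemo
{
    static void Main()
    {
        try
        {
            // A single array must fit in one contiguous block of virtual memory,
            // so this ~1.1 GB request typically fails in a 32-bit process
            byte[] huge = new byte[1142400418];
            Console.WriteLine("Allocated " + huge.Length + " bytes");
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("OutOfMemoryException: no contiguous block large enough");
        }
    }
}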
Fine, so what did I learn from this? Well, I just confirmed what I already knew: serialization is very expensive. In fact, in my case I had to allocate 1 GB to serialize 100 MB, so a factor of 10, and that is not even all... had that allocation succeeded, I would still have had to allocate some more intermediate strings in the neighborhood of a couple of hundred MB, so all in all it seemed like an insurmountable task to serialize a dataset this big.
Solutions
I mentioned a few earlier, which basically boil down to: don't serialize datasets this big, and if you must, then go to 64-bit.
I remembered, though, that for 1.1 there was an article with some suggestions on how to optimize the serialization by creating dataset surrogates, i.e. wrapper classes that perform their own serialization rather than using the standard one that remoting uses: http://support.microsoft.com/kb/829740
I knew things had changed in 2.0 so that article was no longer applicable, but I didn't really know what it had changed to, so I went on an internet search and found this article that turned out to explain a lot of good stuff about serialization of datasets:
http://msdn.microsoft.com/en-us/magazine/cc163911.aspx
The article suggests that you should change the serialization method if you need to remote very large datasets. I did this by adding one single line to the remoting server, before returning the dataset:
ds.RemotingFormat = SerializationFormat.Binary;
Then I re-ran the test and didn't get the OOM. Not only that, but when I ran it through the debugger with the same breakpoint, instead of the 1 GB allocation I ended up with 5 x 240 KB allocations and one 225 KB allocation used for the serialization (not counting any non-large objects). Memory-wise, that is an improvement of 100 000% for one extra line in your code; that's a little bit hard to beat :)
Have a good one,
Tess
Sample code used for this post
Server:
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;

namespace MyServer
{
class Program
{
static void Main(string[] args)
{
MyServer();
}
static void MyServer()
{
Console.WriteLine("Remoting Server started...");
TcpChannel tcpChannel = new TcpChannel(1234);
ChannelServices.RegisterChannel(tcpChannel, false);
Type commonInterfaceType = Type.GetType("MyServer.DataLayer");
RemotingConfiguration.RegisterWellKnownServiceType(commonInterfaceType, "DataLayerService", WellKnownObjectMode.SingleCall);
Console.WriteLine("Press ENTER to quit");
Console.ReadLine();
}
}
public interface DataLayerInterface
{
DataSet GetDS(int rows);
}
public class DataLayer : MarshalByRefObject, DataLayerInterface
{
public DataSet GetDS(int rows)
{
//populate a table with the featured products
DataTable dt = new DataTable();
DataRow dr;
DataColumn dc;
dc = new DataColumn("ID", typeof(Int32));
dc.Unique = true;
dt.Columns.Add(dc);
dt.Columns.Add(new DataColumn("FirstName", typeof(string)));
dt.Columns.Add(new DataColumn("LastName", typeof(string)));
dt.Columns.Add(new DataColumn("UserName", typeof(string)));
dt.Columns.Add(new DataColumn("IsUserAMemberOfTheAdministratorsGroup", typeof(string)));
DataSet ds = new DataSet();
ds.Tables.Add(dt);
for (int i = 0; i < rows; i++)
{
dr = dt.NewRow();
dr["id"] = i;
dr["FirstName"] = "Jane";
dr["LastName"] = "Doe";
dr["UserName"] = "jd";
dr["IsUserAMemberOfTheAdministratorsGroup"] = "No";
dt.Rows.Add(dr);
}
ds.RemotingFormat = SerializationFormat.Binary; //<-- this line makes a world of difference
return ds;
}
}
}
Client:
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;
using MyServer;

namespace Client
{
class Program
{
static void Main(string[] args)
{
TcpChannel tcpChannel = new TcpChannel();
ChannelServices.RegisterChannel(tcpChannel, false);
Type requiredType = typeof(DataLayerInterface);
DataLayerInterface remoteObject = (DataLayerInterface)Activator.GetObject(requiredType, "tcp://localhost:1234/DataLayerService");
DataSet ds = remoteObject.GetDS(600000);
Console.WriteLine("Number of rows in ds: " + ds.Tables[0].Rows.Count.ToString());
Console.ReadLine();
}
}
}
http://msdn.microsoft.com/en-us/magazine/cc163911.aspx
Cutting Edge
Binary Serialization of DataSets
Dino Esposito
The ADO.NET DataSet object plays an
essential role in most of today's distributed, multitiered applications.
Instances of the DataSet class are used to move data across the tiers
and to exchange data with external services. The DataSet has most of the
features needed to represent real-world business entities.
First, the DataSet is a disconnected
object—you can use it without a physical connection to the data source.
In addition, the DataSet provides a rich programming interface. It
supports multiple tables of data, accepts relations defined between
pairs of tables, and allows you to enumerate its contents. Last but not
least, the DataSet is a fully serializable object. It can be serialized
in three different ways—to a standard .NET formatter, to an XML writer,
and through the XML serializer.
For objects that represent the real-world
entities of a multitiered app, serialization is key because the object
must be serialized to be moved across tiers. This month, I'll discuss a
problem with serialization in ADO.NET 1.x and how serialization in ADO.NET 2.0 will improve upon it. Back in the December 2002 installment of Cutting Edge, I reviewed the serialization capabilities of the DataSet object, which you might want to refer to for a little background.
The Serialization Problem Defined
I have a client with a classic three-tier
application in which a number of user interface process components were
moving data up and down to business components through a .NET Remoting
channel. The runtime environment of .NET Remoting used a custom host
application configured to work over binary serialization. The client
chose binary serialization, thinking it was the fastest way to move data
between processes. But their application was running slowly and
experiencing problems moving data. Hence, the client put a sniffer on
the network to look at the size of the data being moved back and forth
and discovered that too many bytes were being transported; the actual
size of the data was grossly exceeding the expected size. So what was
going on?
How DataSet Serialization Really Works
Marked with the [Serializable] attribute,
the DataSet object implements the ISerializable interface to gain full
control over its serialization process. The .NET formatter streamlines,
in a default way, objects that don't implement ISerializable and
guarantees that data is stored as a SOAP payload if the SoapFormatter is
used or as a binary stream if the BinaryFormatter is used. The
responsibility of the formatters, however, ends here if the serialized
object supports ISerializable. In that case, the formatter passes an
empty memory buffer (the SerializationInfo data structure) and waits for
the serializee to populate it with data. The formatter's activities are
limited to flushing this memory buffer to a binary stream or wrapping
it up as a SOAP packet. However, there are no guarantees made about the
type of the data added to the SerializationInfo object.
The following pseudocode shows how the
DataSet serializes with a .NET formatter. The GetObjectData method is
the sole method of the ISerializable interface.
void GetObjectData(SerializationInfo info, StreamingContext context)
{
    info.AddValue("XmlSchema", this.GetXmlSchema());
    this.WriteXml(strWriter, XmlWriteMode.DiffGram);
    info.AddValue("XmlDiffGram", strWriter.ToString());
}
Regardless of the formatter being used,
the DataSet always serializes first to XML. What's worse, the DataSet
uses a pretty verbose schema—the DiffGram format plus any related schema
information. Now take a DataSet with a few thousand records in it and
imagine such a large chunk of text traveling over the network with no
sort of optimization or compression (even blank spaces aren't removed).
That's exactly the problem that I'm sure many of you have been called to
solve at one time or another.
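If you want to see just how verbose that DiffGram really is, you can dump it for any populated DataSet you have handy with a couple of lines like these (an illustrative snippet, assuming ds is a filled DataSet):

// Dump the schema and the DiffGram that the formatter would end up serializing
ds.WriteXmlSchema(Console.Out);
ds.WriteXml(Console.Out, XmlWriteMode.DiffGram);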
Avoiding the Problem
Until ADO.NET 2.0 the prognosis was
mixed. There was both good and bad news. The bad news was that this
problem is virtually unavoidable if you choose the DataSet as the
implementor of your business entities. Alternative choices are raw XML,
custom collections, and arrays of custom classes. All of these have pros
and cons. All are serializable and enumerable data types. In addition,
raw XML and custom types are better at interoperating with components
running on platforms such as J2EE than is the DataSet. This assumes
you're employing Web services to bridge user interface process
components to business entities (see Figure 1). For more information about Web services and DataSets, check out The XML Files column in the April 2003 issue of MSDN®Magazine.
Figure 1 User Interface
If you modify your architecture to do
without DataSets, you avoid the problem of serialization but lose some
usability and simplicity. The DataSet is a great class to code against.
Using it you can pass the full DataSet around, and then if needed, you
can also easily excerpt only the changed rows in each table in order to
minimize bandwidth and CPU usage for a call.
The good news is that there are some workarounds. To see the various options compared, check out the DataSet FAQ. One way to improve the end-to-end transfer speed is to override the DataSet serialization mechanism.
You can create your own DataSet-derived
class and implement the members of the ISerializable interface to
fulfill your performance and scalability requirements. As an
alternative, you can generate a typed DataSet using Visual Studio®
.NET and then modify its source code to reimplement the ISerializable
interface as needed. A third option is to use a little-known feature of
the .NET Framework—serialization surrogates. Jeffrey Richter provides
excellent coverage of serialization surrogates in his September 2002
installment of the .NET column.
Technically speaking, a serialization surrogate is a class that
implements the ISerializationSurrogate interface. The interface consists
of two members: GetObjectData and SetObjectData.
By using surrogates you can override the
way types serialize themselves. The technique has a couple of
interesting practical applications: serialization of unserializable
types and deserialization of an object to a different version of its
type. Surrogates work in cooperation with .NET runtime formatters to
handle serialization and deserialization for a given type. For example,
the BinaryFormatter class has a member named SurrogateSelector that
installs a chain of surrogates for a variety of types. When instances of
any of these types are going to be serialized or deserialized, the
process won't go through the object's serialization interface
(ISerializable or the reflection-based default algorithm) but rather
takes advantage of the surrogate's capabilities using its
ISerializationSurrogate interface. Here's an example of how to override
the serialization of a DataSet using a custom surrogate
DataSetSurrogate:
SurrogateSelector ss = new SurrogateSelector();
DataSetSurrogate dss = new DataSetSurrogate();
ss.AddSurrogate(typeof(DataSet),
    new StreamingContext(StreamingContextStates.All), dss);
formatter.SurrogateSelector = ss;
In the DataSetSurrogate class, you
implement the GetObjectData and SetObjectData methods to work around the
known performance limitations of the standard DataSet serialization.
Next, you add an instance of the DataSet surrogate to a selector class.
Finally, the selector class is bound to an instance of the formatter.
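As an illustration only (this is not the code from Knowledge Base article 829740, and the class and key names are made up), a surrogate along these lines could compress the DiffGram before the formatter ever sees it, using the GZipStream class that ships with the .NET Framework 2.0:

// Assumes using System.Data, System.IO, System.IO.Compression and System.Runtime.Serialization
public class CompressedDataSetSurrogate : ISerializationSurrogate
{
    // Called by the formatter instead of DataSet.GetObjectData
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        DataSet ds = (DataSet)obj;
        info.AddValue("Schema", ds.GetXmlSchema());

        MemoryStream buffer = new MemoryStream();
        using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
        {
            // Keep the row-state information by still writing a DiffGram, but compressed
            ds.WriteXml(gzip, XmlWriteMode.DiffGram);
        }
        info.AddValue("CompressedDiffGram", buffer.ToArray());
    }

    // Called by the formatter during deserialization; the returned object is what gets used
    public object SetObjectData(object obj, SerializationInfo info, StreamingContext context, ISurrogateSelector selector)
    {
        DataSet ds = new DataSet();
        ds.ReadXmlSchema(new StringReader(info.GetString("Schema")));

        byte[] data = (byte[])info.GetValue("CompressedDiffGram", typeof(byte[]));
        using (GZipStream gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        {
            ds.ReadXml(gzip, XmlReadMode.DiffGram);
        }
        return ds;
    }
}

Such a surrogate would then be registered through the SurrogateSelector exactly as shown in the snippet above.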
Whichever approach you choose to improve
the DataSet's serialization performance (a new class or a serialization
surrogate), the end goal should be to reduce the amount of information
being moved.
The DataSet serializes to an XML
DiffGram—a rich XML schema that contains the current snapshot of the
data as well as pending errors on table rows and a history of all the
changes that occurred on the rows. The history is actually the original
value of the row assigned when either the DataSet was created or the
last time the AcceptChanges method was invoked.
Since the main problem is the size of the
data, compression is one possible answer. The solution outlined in
Knowledge Base article 829740 ("Improving DataSet Serialization and Remoting Performance")
employs a DataSet-like class that is simply marked as Serializable and
doesn't implement ISerializable. Other ADO.NET objects (DataTable,
DataColumn) are also replaced to take full control of the serialization
process. This simple change provides a double advantage: it reduces the
amount of data being moved and reduces the pressure on the .NET Remoting
framework to serialize and deserialize larger DataSet objects.
Another approach to the problem, that is
much more than just a smart way to serialize a DataSet, involves
implementing a custom formatter, such as Angelo Scotto's
CompactFormatter (available for download at http://www.freewebs.com/compactFormatter/About.html). The CompactFormatter is a generic formatter for both the Microsoft®
.NET Framework and the .NET Compact Framework and is capable of
producing an even more compact byte stream than the native
BinaryFormatter class. In addition, it supports compression, which
further reduces the amount of data being moved. More importantly, the
CompactFormatter is not DataSet-specific and works with most .NET types.
Drilling Down a Little Further
All .NET distributed systems that make
intensive use of disconnected data (as recommended by Microsoft
architecture patterns and practices) are sensitive to the size of
serialized data. The larger the DataSet, the more CPU cycles, memory,
and bandwidth these systems consume. Fortunately, ADO.NET 2.0 provides a
great fix. Before I explain it, though, I should make clear that I'm
not saying DataSet XML serialization is a bad thing in general. Standard
DataSet XML serialization is a "stateful" form of serialization in the
sense that it maintains some state information such as pending errors
and current changes to rows. It can be customized to a great extent by
choosing the favorite combination of nodes and attributes; you can also
decide if relations should simply be tracked or rendered by nesting
child records under their parents.
What's really interesting is the fact that as long as only a few records (for example, less than 100) are involved in the serialization, the difference in performance between a standard DataSet and an optimized DataSet (custom DataSet, surrogate, compressed, whatever) is negligible. Therefore, I wouldn't spend much time implementing alternative forms of DataSet serialization if my application moves only small chunks of data.
Although performance worsens as the size
of the DataSet grows, even for a thousand records the performance is not
much of a problem. Note that performance here relates to .NET Remoting
end-to-end speed, a parameter that includes the size of the data to
transfer and the costs of successive instantiations. As you scale from a
thousand to a few thousand records, the performance begins to suffer
significantly. Figure 2 shows the difference in .NET
Remoting end-to-end speed when XML (standard) and true binary
serialization are used (the latter is new to ADO.NET 2.0; more on this
in a moment). For the sake of precision, I have to say that the numbers
behind the graph have been obtained with a Beta build of ADO.NET 2.0.
Figure 2 Remoting End-to-End Time
Of course the worst scenario is when you
move a large DataSet (thousands of rows) with numerous changes. If
you're simply forwarding this DataSet to the data access layer (DAL) for
applying changes, you can alleviate the issue by using the DataSet's
GetChanges method. This method returns a new DataSet that contains only
the rows in the various tables that have been modified, added, or
deleted.
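For example, assuming a hypothetical dal.Update method in the data access layer, forwarding only the delta could look like this:

// Forward only the modified, added, and deleted rows to the DAL (dal.Update is hypothetical)
DataSet changes = ds.GetChanges();
if (changes != null)
{
    dal.Update(changes);
    ds.AcceptChanges();   // mark the local copy as clean once the update succeeds
}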
However, when you're moving a DataSet
from tier to tier, or from a business component to an external service
that implements a business process, you have no choice but to pass a
stateful representation of the data—the whole DataSet with its own set
of relations, pending changes, and errors.
Why does the end-to-end speed slow down
once you reach a certain threshold? Handling large DataSets poses an
additional problem to the .NET Remoting infrastructure aside from the
time and space needed to complete the operation. This extra problem has
to do with the specific algorithm used to serialize and, more
importantly, deserialize DataSets saved as XML DiffGrams. To restore a
DataSet from a DiffGram, the binary formatter invokes a protected
constructor on the DataSet class (part of the ISerializable
implementation). The pseudocode of this DataSet constructor is shown in Figure 3.
As you can see, the DataSet's ReadXml method is called to process the
DiffGram. Especially for large DataSets, ReadXml works by creating lots
of transient, short-lived objects, a few for each row to be processed.
This mechanism puts a lot of additional pressure on the .NET Remoting
infrastructure and consumes a lot of memory and CPU cycles. Exactly how
bad it is will depend on the runtime conditions and hardware equipment
of the system. This explains why certain clients experience visible
problems when moving only 7MB of data while others won't see problems
until they've moved 20MB of data.
// Figure 3 - pseudocode of the DataSet deserialization constructor
protected DataSet(SerializationInfo info, StreamingContext context)
{
    string schema, diffgram;
    schema = (string) info.GetValue("XmlSchema", typeof(string));
    diffgram = (string) info.GetValue("XmlDiffGram", typeof(string));
    if (schema != null)
        ReadXmlSchema(new XmlTextReader(new StringReader(schema)), true);
    if (diffgram != null)
        ReadXml(new XmlTextReader(new StringReader(diffgram)), XmlReadMode.DiffGram);
}
DataSet Serialization in ADO.NET 2.0
When you upgrade to ADO.NET 2.0, your
problems will be solved. In ADO.NET 2.0, the DataSet class provides a
new serialization option specifically designed to optimize remoting
serialization. As a result, remoting a DataSet uses less memory and
bandwidth. The end-to-end latency is greatly improved, as Figure 2 illustrates.
In ADO.NET 2.0, the DataSet and DataTable
come with a new property named RemotingFormat defined to be of type
SerializationFormat (see Figure 4). By default, the new
property is set to SerializationFormat.Xml to preserve backward
compatibility. The property affects the behavior of the GetObjectData
members on the ISerializable interface and ultimately is a way to
control the serialization of the DataSet. The following code snippet
represents the pseudocode of the method in ADO.NET 2.0:
void GetObjectData(SerializationInfo info, StreamingContext context)
{
    SerializationFormat fmt = RemotingFormat;
    SerializeDataSet(info, context, fmt);
}

// Requesting binary serialization of a DataSet
DataSet ds = GetData();
ds.RemotingFormat = SerializationFormat.Binary;
BinaryFormatter bin = new BinaryFormatter();
bin.Serialize(stream, ds);

// Pseudocode of DataSet.SerializeDataSet in ADO.NET 2.0
private void SerializeDataSet(
    SerializationInfo info,
    StreamingContext context,
    SerializationFormat remotingFormat)
{
    info.AddValue("DataSet.RemotingVersion", new Version(2, 0));
    if (remotingFormat != SerializationFormat.Xml)
    {
        int i;
        info.AddValue("DataSet.RemotingFormat", remotingFormat);
        SerializeDataSetProperties(info, context);
        info.AddValue("DataSet.Tables.Count", this.Tables.Count);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeConstraints(info, context, i, true);
        SerializeRelations(info, context);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeExpressionColumns(info, context, i);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeTableData(info, context, i);
        return;
    }
    // 1.x code
}
It is interesting to measure the
performance gain that you get from this new feature. Here's a simple
technique you can easily reproduce. Fill a DataSet with the results of a
query and persist it to a file on disk (see Figure 6). You can wrap the code in Figure 6
in either a Web Form or a Windows Form. Run the sample application and
take a look at the size of the files created. Try first with a simple
query like this:
SELECT lastname, firstname FROM employees

// Figure 6 - persist the DataSet to disk as XML and as binary
SqlDataAdapter adapter = new SqlDataAdapter(query, connString);
DataSet ds = new DataSet();
adapter.Fill(ds);
BinaryFormatter bin = new BinaryFormatter();

// Save as XML
using (StreamWriter writer1 = new StreamWriter(@"c:\xml.dat"))
{
    bin.Serialize(writer1.BaseStream, ds);
}

// Save as binary
using (StreamWriter writer2 = new StreamWriter(@"c:\bin.dat"))
{
    ds.RemotingFormat = SerializationFormat.Binary;
    bin.Serialize(writer2.BaseStream, ds);
}
If you're familiar with the Northwind
database, you know that this query returns only nine records. Quite
surprisingly, in this case the XML DiffGram is about half the size of
the binary file! Don't worry, there's nothing wrong in the code or in
the underlying technology. To see the difference, run a query that will
return a few thousand records, like this one:
SELECT * FROM [order details]
Now the binary file is about 10 times smaller than the XML file. With this evidence, look back at the graph in Figure 2.
The yellow series shows the performance of the DataSet binary
serializer and fortunately increases very slowly as the size of the
DataSet grows. The same can't be said for the DataSet XML serializer
which reports a sudden upswing and significantly decreased absolute
performance as the number of rows exceeds a few thousand.
DataTable Enhancements
In ADO.NET 1.x, the DataTable
class suffers from three main limitations. A DataTable can't be used
with Web service methods and doesn't provide a direct way to serialize
its contents. In addition, while DataTable instances can be passed
across a remoting channel, strongly typed DataTable instances cannot be
remoted. These limitations have been removed in ADO.NET 2.0. Let's take a
look at exactly how this can be accomplished.
Return values and input parameters of Web
service methods must be serializable through the XmlSerializer class.
This class is responsible for translating .NET types into XML Schema
Definition (XSD) types and vice versa. Unlike the runtime formatter
classes (such as BinaryFormatter), the XmlSerializer class doesn't
handle circular references. In other words, if you try to serialize a
DataTable you get an error because the DataTable contains a reference to
a DataSet—the DataSet property—and the DataSet, in turn, contains a
reference to the DataTable through the Tables collection. For this
reason, XmlSerializer raises an exception if you try to pass a DataTable
to, or return a DataTable from, a Web service method. DataSet objects
have the same circular reference, but unlike the DataTable class,
DataSets implement the little-known IXmlSerializable interface.
The XmlSerializer class specifically
looks for this interface when serializing an object. If the interface is
found, the serializer yields control and waits for the WriteXml/ReadXml
methods to terminate the serialization and deserialization process. The
IXmlSerializable interface has the following three members:
XmlSchema GetSchema();
void ReadXml(XmlReader reader);
void WriteXml(XmlWriter writer);
In ADO.NET 2.0, the DataTable class fully
supports the IXmlSerializable interface so you can finally use
DataTables as input parameters or return values in Web service methods.
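For instance, a minimal sketch of an ASMX Web method returning a DataTable (the query and connection string are placeholders, and the method is assumed to live in a class deriving from System.Web.Services.WebService):

[WebMethod]
public DataTable GetCustomers()
{
    // Works as a return value in ADO.NET 2.0 because DataTable now implements IXmlSerializable;
    // the table needs a name to be serialized
    DataTable table = new DataTable("Customers");
    using (SqlDataAdapter adapter = new SqlDataAdapter(
        "SELECT CustomerID, CompanyName FROM Customers", connectionString))
    {
        adapter.Fill(table);
    }
    return table;
}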
Note that the GetSchema method exists on
IXmlSerializable purely for backward compatibility, and it is safe to
choose a trivial implementation of GetSchema that returns null. However,
if the class is being used in a Web service method's signature, an
XmlSchemaProviderAttribute should be applied to the class to denote a
static method on the class that can be used to get a schema. This XSD
representation can then be mapped back to the actual object by
implementing a SchemaImporterExtension. For more information on
XmlSchemaProviderAttribute and SchemaImporterExtension, see New Features for Web Service Developers in Beta 1 of the .NET Framework 2.0.
In addition, the DataTable provides a
simple XML I/O channel. A DataTable can be populated from an XML stream
using the new ReadXml method and persisted to disk using the new
WriteXml method. The methods have the same role and signature as the
equivalent methods on the DataSet class. Last, but not least, a
DataTable supports the RemotingFormat property and can be serialized in a
true binary format.
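A quick sketch of those DataTable additions (the file names and the GetOrdersTable helper are made up for illustration):

DataTable orders = GetOrdersTable();                        // hypothetical helper returning a named DataTable

orders.WriteXml("orders.xml", XmlWriteMode.WriteSchema);    // new: persist the table (schema + data)

DataTable reloaded = new DataTable();
reloaded.ReadXml("orders.xml");                             // new: repopulate the table from XML

orders.RemotingFormat = SerializationFormat.Binary;         // new: true binary serialization
using (FileStream fs = File.Create("orders.bin"))
{
    new BinaryFormatter().Serialize(fs, orders);
}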
Note that the list of changes for the
DataTable class doesn't end here. For example, the class also features
full integration with streaming interfaces and can be loaded from
readers (including SQL DataReaders) and its contents can be read through
DataReaders. However, these changes don't directly (or necessarily)
affect the building of middle tiers of enterprise class applications and
in this column I don't have the space to cover them or even the
majority of the new ADO.NET 2.0 features.
Call to Action
One of the main ADO.NET 2.0 themes is the
emphasis on improved performance and scalability. This encompasses a
new, faster index engine to locate table records and the oft-requested
binary serialization for DataSets. By adding the same capabilities to
the DataTable class, I'd say the team exceeded developers' expectations
by a wide margin.
So what's the best thing to do if you
have an application that suffers from DataSet serialization performance
problems? Once you upgrade to ADO.NET 2.0, fixing the problem is as easy
as adding one line of code—the one that sets the RemotingFormat
property on the DataSet. The fix is simple to apply and terrifically
effective. Especially if you haven't made any progress on alternative
routes (like compressors, surrogates, and the like), I suggest you stay
and start planning a full migration for when the new platform (or at
least its Go-Live license) is available. If you absolutely need a fix
today, I suggest that you choose the one that has the least impact on
your system. My experience suggests that since most applications use
typed DataSets you can just modify the GetObjectData method of your
typed DataSets to use surrogates (see the Knowledge Base article 829740)
or, more simply, zip the XML DiffGram. The performance improvement is
immediate, though not as good as it could be. It's a workaround after
all.
ADO.NET 2.0 is evolution, not revolution.
It's a fully backward-compatible data access platform in which the
RemotingFormat property stands out as the feature that really fixes many
existing enterprise applications, allowing Web services to bridge user
interface components and business entities.
Send your questions and comments for Dino to cutting@microsoft.com.
Dino Esposito is a Wintellect instructor and consultant based in Italy. Author of Programming ASP.NET and the newest Introducing ASP.NET 2.0
(both from Microsoft Press), he spends most of his time teaching
classes on ASP.NET and ADO.NET and speaking at conferences. Get in touch
with Dino at cutting@microsoft.com or join the blog at http://weblogs.asp.net/despos.