OutOfMemoryExceptions while remoting very large datasets
2 Sep 2008 8:41 AM
When you have to pass an object back and forth between processes or application domains you have to serialize it into some type of stream that can be understood by both the client and the server.
The bigger and more complex the object gets, the more expensive it is to serialize, both CPU-wise and memory-wise, and if the object is big and complex enough you can easily run into out-of-memory exceptions during the actual serialization process... and that is exactly what happened to one of my customers...
They had to pass very large datasets back and forth between the UI layer and the data layer, and these datasets could easily get up to a couple of hundred MB in size. When they passed the datasets back they would get OutOfMemoryExceptions in stacks like this one... in other words, they would get OOMs while serializing the dataset to pass it back to the client...
0454f350 773442eb [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
0454f458 7964db64 System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.InitSerialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f498 793ba2bb System.Runtime.Serialization.Formatters.Binary.WriteObjectInfo.Serialize(System.Object, System.Runtime.Serialization.ISurrogateSelector, System.Runtime.Serialization.StreamingContext, System.Runtime.Serialization.Formatters.Binary.SerObjectInfoInit, System.Runtime.Serialization.IFormatterConverter, System.Runtime.Serialization.Formatters.Binary.ObjectWriter)
0454f4c0 793b9cef System.Runtime.Serialization.Formatters.Binary.ObjectWriter.Serialize(System.Object, System.Runtime.Remoting.Messaging.Header[], System.Runtime.Serialization.Formatters.Binary.__BinaryWriter, Boolean)
0454f500 793b9954 System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Serialize(System.IO.Stream, System.Object, System.Runtime.Remoting.Messaging.Header[], Boolean)
0454f524 6778c0b0 System.Runtime.Remoting.Channels.BinaryServerFormatterSink.SerializeResponse(System.Runtime.Remoting.Channels.IServerResponseChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f57c 6778bb0f System.Runtime.Remoting.Channels.BinaryServerFormatterSink.ProcessMessage(System.Runtime.Remoting.Channels.IServerChannelSinkStack, System.Runtime.Remoting.Messaging.IMessage, System.Runtime.Remoting.Channels.ITransportHeaders, System.IO.Stream, System.Runtime.Remoting.Messaging.IMessage ByRef, System.Runtime.Remoting.Channels.ITransportHeaders ByRef, System.IO.Stream ByRef)
0454f600 67785616 System.Runtime.Remoting.Channels.Tcp.TcpServerTransportSink.ServiceRequest(System.Object)
0454f660 67777732 System.Runtime.Remoting.Channels.SocketHandler.ProcessRequestNow()
0454f690 677762a2 System.Runtime.Remoting.Channels.RequestQueue.ProcessNextRequest(System.Runtime.Remoting.Channels.SocketHandler)
0454f694 67777693 System.Runtime.Remoting.Channels.SocketHandler.BeginReadMessageCallback(System.IAsyncResult)
0454f6c4 7a569ca9 System.Net.LazyAsyncResult.Complete(IntPtr)
0454f6fc 7a56a46e System.Net.ContextAwareResult.CompleteCallback(System.Object)
0454f704 79373ecd System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
0454f71c 7a56a436 System.Net.ContextAwareResult.Complete(IntPtr)
0454f734 7a569bed System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
0454f764 7a61062d System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f79c 79405534 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
0454f93c 79e7c74b [GCFrame: 0454f93c]
My gut feeling was that they were SOL. I know that serialization is very memory expensive and that the resulting serialized xml strings can get enormous, so I wasn't very surprised, especially knowing how large their datasets were.
I am not a data access guru, but I have seen this type of issue enough times that I knew what the recommendation should be.
1. Re-think the architecture... what are you using these datasets for? Who will be browsing through hundreds of MB of data anyway? (And this still holds true: in most cases where there is this much data involved, only a very small part of it is actually needed, and if that is the case, then only that small piece of the data should be handled, i.e. filter out what you need and leave the rest.)
2. Re-consider passing this data through remoting/webservices/out-of-proc session state or whatever it might be. Once you start serializing and deserializing this amount of data you are treading on thin ice when it comes to the scalability of your application, both performance-wise and memory-wise. Again, this still holds true: if the dataset itself is 100 MB you will only be able to handle a handful of concurrent requests before you run out of memory for the datasets alone.
3. If you really, really, really need this much data and this architecture you need to start thinking about moving to 64-bit, but even there you need to make sure you have enough RAM and disk space to back the memory you're using, and you still need to be careful, because the more memory you use, the longer full garbage collections will take.
We discussed a couple of options, like bringing back partial datasets or chunking the data up (something along the lines of the sketch below), but most of it was a no-go.
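Just to illustrate what "bringing back partial datasets" could look like, here is a rough, hypothetical sketch of a paged data layer method (the method name, query and connectionString are made up, not their actual code), so that the client only pulls down one page of rows per call:

// Hypothetical paged data access method - assumes using System.Data and System.Data.SqlClient
public DataSet GetCustomersPage(int pageIndex, int pageSize)
{
    DataSet ds = new DataSet();
    using (SqlConnection conn = new SqlConnection(connectionString))
    using (SqlDataAdapter adapter = new SqlDataAdapter("SELECT * FROM Customers ORDER BY CustomerID", conn))
    {
        // Fill only the requested page of rows into the DataSet
        // (the query itself could also be paged on the server side)
        adapter.Fill(ds, pageIndex * pageSize, pageSize, "Customers");
    }
    return ds;
}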
Debugging
I created a very small remoting sample with just one method that returns a very large dataset (you can find the code for the sample at the bottom of this post), just to see how much memory we were actually using for the serialization (the dataset itself was 102 MB).
I attached to the remoting server with windbg and loaded up sos (.loadby sos mscorwks), and then I set a breakpoint on mscorwks!WKS::gc_heap::allocate_large_object so that I could record the size of the allocation (?@ebx) and the stack (!clrstack) every time we allocated a large object (I figured this was enough for a rough estimate).
Lo and behold, the last attempted allocation before the OOM was a whopping 1 142 400 418 bytes (~1 GB!!!! for a 100 MB dataset):
0:004> x mscorwks!WKS*allocate_large*
79ef212d mscorwks!WKS::gc_heap::allocate_large_object = <no type information>
0:004> bp 79ef212d "?@ebx;!clrstack;g"
Evaluate expression: 1142400418 = 4417a5a2
OS Thread Id: 0x128c (4)
ESP EIP
0454f350 79ef212d [HelperMethodFrame: 0454f350]
0454f3a8 793631b3 System.String.GetStringForStringBuilder(System.String, Int32, Int32, Int32)
0454f3d0 79363167 System.Text.StringBuilder..ctor(System.String, Int32, Int32, Int32)
0454f3f8 793630cc System.Text.StringBuilder..ctor(System.String, Int32)
0454f408 651eadee System.Data.DataSet.SerializeDataSet(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext, System.Data.SerializationFormat)
0454f448 651eaa5b System.Data.DataSet.GetObjectData(System.Runtime.Serialization.SerializationInfo, System.Runtime.Serialization.StreamingContext)
...
When you try to allocate an object like that it needs to be allocated in one chunk. Since it is larger than the size of the LOH segment we will try to create a segment the size of the object, and in my case I just didn't have 1 GB of free space in my virtual memory in one large chunk, so the allocation fails with an OOM.
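As a side note, you can see this contiguous-allocation requirement with a tiny console app like the one below (just an illustration, not part of the repro above); in a 32-bit process a single ~1 GB array allocation will usually fail simply because there is rarely that much contiguous free address space:

using System;

class LargeAllocationDemo
{
    static void Main()
    {
        try
        {
            // A single array must fit in one contiguous block of virtual memory,
            // so this ~1.1 GB request typically fails in a 32-bit process
            byte[] huge = new byte[1142400418];
            Console.WriteLine("Allocated " + huge.Length + " bytes");
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("OutOfMemoryException: no contiguous block large enough");
        }
    }
}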
Fine, so what did I learn from this? Well, I just confirmed what I already knew: serialization is very expensive. In fact, in my case I had to allocate 1 GB to serialize 100 MB, so a factor of 10, and that is not even all... had that allocation succeeded, I would still have had to allocate some more intermediate strings in the neighborhood of a couple of hundred MB, so all in all it seemed like an insurmountable task to serialize a dataset this big.
Solutions
I mentioned a few earlier, which basically boil down to: don't serialize datasets this big, and if you must, then go to 64-bit.
I remembered, though, that for 1.1 there was an article with some suggestions on how to optimize the serialization by creating dataset surrogates, i.e. wrapper classes that perform their own serialization rather than using the standard one that remoting uses: http://support.microsoft.com/kb/829740
I knew things had changed in 2.0 so that article was no longer applicable, but I didn't really know what it had changed to, so I went on an internet search and found this article that turned out to explain a lot of good stuff about serialization of datasets:
http://msdn.microsoft.com/en-us/magazine/cc163911.aspx
The article suggests that you should change the serialization method if you need to remote very large datasets. I did this by adding one single line to the remoting server, before returning the dataset:
ds.RemotingFormat = SerializationFormat.Binary;
Then I re-ran the test and didn't get the OOM. Not only that, but when I ran it through the debugger with the same breakpoint, instead of the 1 GB allocation I ended up with 5 x 240 KB allocations and one 225 KB allocation used for the serialization (not counting any non-large objects). Memory-wise, that is an improvement of 100 000% for one extra line in your code; that's a little bit hard to beat :)
Have a good one,
Tess
Sample code used for this post
Server:
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;

namespace MyServer
{
class Program
{
static void Main(string[] args)
{
MyServer();
}
static void MyServer()
{
Console.WriteLine("Remoting Server started...");
TcpChannel tcpChannel = new TcpChannel(1234);
ChannelServices.RegisterChannel(tcpChannel, false);
Type commonInterfaceType = Type.GetType("MyServer.DataLayer");
RemotingConfiguration.RegisterWellKnownServiceType(commonInterfaceType, "DataLayerService", WellKnownObjectMode.SingleCall);
Console.WriteLine("Press ENTER to quit");
Console.ReadLine();
}
}
public interface DataLayerInterface
{
DataSet GetDS(int rows);
}
public class DataLayer : MarshalByRefObject, DataLayerInterface
{
public DataSet GetDS(int rows)
{
//populate a table with the featured products
DataTable dt = new DataTable();
DataRow dr;
DataColumn dc;
dc = new DataColumn("ID", typeof(Int32));
dc.Unique = true;
dt.Columns.Add(dc);
dt.Columns.Add(new DataColumn("FirstName", typeof(string)));
dt.Columns.Add(new DataColumn("LastName", typeof(string)));
dt.Columns.Add(new DataColumn("UserName", typeof(string)));
dt.Columns.Add(new DataColumn("IsUserAMemberOfTheAdministratorsGroup", typeof(string)));
DataSet ds = new DataSet();
ds.Tables.Add(dt);
for (int i = 0; i < rows; i++)
{
dr = dt.NewRow();
dr["id"] = i;
dr["FirstName"] = "Jane";
dr["LastName"] = "Doe";
dr["UserName"] = "jd";
dr["IsUserAMemberOfTheAdministratorsGroup"] = "No";
dt.Rows.Add(dr);
}
ds.RemotingFormat = SerializationFormat.Binary; //<-- this line makes a world of difference
return ds;
}
}
}
Client:
using System;
using System.Runtime.Remoting;
using System.Runtime.Remoting.Channels;
using System.Runtime.Remoting.Channels.Tcp;
using System.Data;
using MyServer;

namespace Client
{
class Program
{
static void Main(string[] args)
{
TcpChannel tcpChannel = new TcpChannel();
ChannelServices.RegisterChannel(tcpChannel, false);
Type requiredType = typeof(DataLayerInterface);
DataLayerInterface remoteObject = (DataLayerInterface)Activator.GetObject(requiredType, "tcp://localhost:1234/DataLayerService");
DataSet ds = remoteObject.GetDS(600000);
Console.WriteLine("Number of rows in ds: " + ds.Tables[0].Rows.Count.ToString());
Console.ReadLine();
}
}
}
http://msdn.microsoft.com/en-us/magazine/cc163911.aspx
Cutting Edge
Binary Serialization of DataSets
Dino Esposito
The ADO.NET DataSet object plays an
essential role in most of today's distributed, multitiered applications.
Instances of the DataSet class are used to move data across the tiers
and to exchange data with external services. The DataSet has most of the
features needed to represent real-world business entities.
First, the DataSet is a disconnected
object—you can use it without a physical connection to the data source.
In addition, the DataSet provides a rich programming interface. It
supports multiple tables of data, accepts relations defined between
pairs of tables, and allows you to enumerate its contents. Last but not
least, the DataSet is a fully serializable object. It can be serialized
in three different ways—to a standard .NET formatter, to an XML writer,
and through the XML serializer.
For objects that represent the real-world
entities of a multitiered app, serialization is key because the object
must be serialized to be moved across tiers. This month, I'll discuss a
problem with serialization in ADO.NET 1.x and how serialization in ADO.NET 2.0 will improve upon it. Back in the December 2002 installment of Cutting Edge, I reviewed the serialization capabilities of the DataSet object, which you might want to refer to for a little background.
The Serialization Problem Defined
I have a client with a classic three-tier
application in which a number of user interface process components were
moving data up and down to business components through a .NET Remoting
channel. The runtime environment of .NET Remoting used a custom host
application configured to work over binary serialization. The client
chose binary serialization, thinking it was the fastest way to move data
between processes. But their application was running slowly and
experiencing problems moving data. Hence, the client put a sniffer on
the network to look at the size of the data being moved back and forth
and discovered that too many bytes were being transported; the actual
size of the data was grossly exceeding the expected size. So what was
going on?
How DataSet Serialization Really Works
Marked with the [Serializable] attribute,
the DataSet object implements the ISerializable interface to gain full
control over its serialization process. The .NET formatter streamlines,
in a default way, objects that don't implement ISerializable and
guarantees that data is stored as a SOAP payload if the SoapFormatter is
used or as a binary stream if the BinaryFormatter is used. The
responsibility of the formatters, however, ends here if the serialized
object supports ISerializable. In that case, the formatter passes an
empty memory buffer (the SerializationInfo data structure) and waits for
the serializee to populate it with data. The formatter's activities are
limited to flushing this memory buffer to a binary stream or wrapping
it up as a SOAP packet. However, there are no guarantees made about the
type of the data added to the SerializationInfo object.
The following pseudocode shows how the
DataSet serializes with a .NET formatter. The GetObjectData method is
the sole method of the ISerializable interface.
void GetObjectData(SerializationInfo info, StreamingContext context)
{
    info.AddValue("XmlSchema", this.GetXmlSchema());
    this.WriteXml(strWriter, XmlWriteMode.DiffGram);
    info.AddValue("XmlDiffGram", strWriter.ToString());
}
Regardless of the formatter being used,
the DataSet always serializes first to XML. What's worse, the DataSet
uses a pretty verbose schema—the DiffGram format plus any related schema
information. Now take a DataSet with a few thousand records in it and
imagine such a large chunk of text traveling over the network with no
sort of optimization or compression (even blank spaces aren't removed).
That's exactly the problem that I'm sure many of you have been called to
solve at one time or another.
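If you want to see just how verbose that DiffGram really is, you can dump it for any populated DataSet you have handy with a couple of lines like these (an illustrative snippet, assuming ds is a filled DataSet):

// Dump the schema and the DiffGram that the formatter would end up serializing
ds.WriteXmlSchema(Console.Out);
ds.WriteXml(Console.Out, XmlWriteMode.DiffGram);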
Avoiding the Problem
Until ADO.NET 2.0 the prognosis was
mixed. There was both good and bad news. The bad news was that this
problem is virtually unavoidable if you choose the DataSet as the
implementor of your business entities. Alternative choices are raw XML,
custom collections, and arrays of custom classes. All of these have pros
and cons. All are serializable and enumerable data types. In addition,
raw XML and custom types are better at interoperating with components
running on platforms such as J2EE than is the DataSet. This assumes
you're employing Web services to bridge user interface process
components to business entities (see Figure 1). For more information about Web services and DataSets, check out The XML Files column in the April 2003 issue of MSDN®Magazine.
Figure 1 User Interface
If you modify your architecture to do
without DataSets, you avoid the problem of serialization but lose some
usability and simplicity. The DataSet is a great class to code against.
Using it you can pass the full DataSet around, and then if needed, you
can also easily excerpt only the changed rows in each table in order to
minimize bandwidth and CPU usage for a call.
The good news is that there are some workarounds. To see the various options compared, check out the DataSet FAQ. One way to improve the end-to-end transfer speed is to override the DataSet serialization mechanism.
You can create your own DataSet-derived
class and implement the members of the ISerializable interface to
fulfill your performance and scalability requirements. As an
alternative, you can generate a typed DataSet using Visual Studio®
.NET and then modify its source code to reimplement the ISerializable
interface as needed. A third option is to use a little-known feature of
the .NET Framework—serialization surrogates. Jeffrey Richter provides
excellent coverage of serialization surrogates in his September 2002
installment of the .NET column.
Technically speaking, a serialization surrogate is a class that
implements the ISerializationSurrogate interface. The interface consists
of two members: GetObjectData and SetObjectData.
By using surrogates you can override the
way types serialize themselves. The technique has a couple of
interesting practical applications: serialization of unserializable
types and deserialization of an object to a different version of its
type. Surrogates work in cooperation with .NET runtime formatters to
handle serialization and deserialization for a given type. For example,
the BinaryFormatter class has a member named SurrogateSelector that
installs a chain of surrogates for a variety of types. When instances of
any of these types are going to be serialized or deserialized, the
process won't go through the object's serialization interface
(ISerializable or the reflection-based default algorithm) but rather
takes advantage of the surrogate's capabilities using its
ISerializationSurrogate interface. Here's an example of how to override
the serialization of a DataSet using a custom surrogate
DataSetSurrogate:
SurrogateSelector ss = new SurrogateSelector();
DataSetSurrogate dss = new DataSetSurrogate();
ss.AddSurrogate(typeof(DataSet),
    new StreamingContext(StreamingContextStates.All), dss);
formatter.SurrogateSelector = ss;
In the DataSetSurrogate class, you
implement the GetObjectData and SetObjectData methods to work around the
known performance limitations of the standard DataSet serialization.
Next, you add an instance of the DataSet surrogate to a selector class.
Finally, the selector class is bound to an instance of the formatter.
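As an illustration only (this is not the code from Knowledge Base article 829740, and the class and key names are made up), a surrogate along these lines could compress the DiffGram before the formatter ever sees it, using the GZipStream class that ships with the .NET Framework 2.0:

// Assumes using System.Data, System.IO, System.IO.Compression and System.Runtime.Serialization
public class CompressedDataSetSurrogate : ISerializationSurrogate
{
    // Called by the formatter instead of DataSet.GetObjectData
    public void GetObjectData(object obj, SerializationInfo info, StreamingContext context)
    {
        DataSet ds = (DataSet)obj;
        info.AddValue("Schema", ds.GetXmlSchema());

        MemoryStream buffer = new MemoryStream();
        using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
        {
            // Keep the row-state information by still writing a DiffGram, but compressed
            ds.WriteXml(gzip, XmlWriteMode.DiffGram);
        }
        info.AddValue("CompressedDiffGram", buffer.ToArray());
    }

    // Called by the formatter during deserialization; the returned object is what gets used
    public object SetObjectData(object obj, SerializationInfo info, StreamingContext context, ISurrogateSelector selector)
    {
        DataSet ds = new DataSet();
        ds.ReadXmlSchema(new StringReader(info.GetString("Schema")));

        byte[] data = (byte[])info.GetValue("CompressedDiffGram", typeof(byte[]));
        using (GZipStream gzip = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        {
            ds.ReadXml(gzip, XmlReadMode.DiffGram);
        }
        return ds;
    }
}

Such a surrogate would then be registered through the SurrogateSelector exactly as shown in the snippet above.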
Whichever approach you choose to improve
the DataSet's serialization performance (a new class or a serialization
surrogate), the end goal should be to reduce the amount of information
being moved.
The DataSet serializes to an XML
DiffGram—a rich XML schema that contains the current snapshot of the
data as well as pending errors on table rows and a history of all the
changes that occurred on the rows. The history is actually the original
value of the row assigned when either the DataSet was created or the
last time the AcceptChanges method was invoked.
Since the main problem is the size of the
data, compression is one possible answer. The solution outlined in
Knowledge Base article 829740 ("Improving DataSet Serialization and Remoting Performance")
employs a DataSet-like class that is simply marked as Serializable and
doesn't implement ISerializable. Other ADO.NET objects (DataTable,
DataColumn) are also replaced to take full control of the serialization
process. This simple change provides a double advantage: it reduces the
amount of data being moved and reduces the pressure on the .NET Remoting
framework to serialize and deserialize larger DataSet objects.
Another approach to the problem, that is
much more than just a smart way to serialize a DataSet, involves
implementing a custom formatter, such as Angelo Scotto's
CompactFormatter (available for download at http://www.freewebs.com/compactFormatter/About.html). The CompactFormatter is a generic formatter for both the Microsoft®
.NET Framework and the .NET Compact Framework and is capable of
producing an even more compact byte stream than the native
BinaryFormatter class. In addition, it supports compression, which
further reduces the amount of data being moved. More importantly, the
CompactFormatter is not DataSet-specific and works with most .NET types.
Drilling Down a Little Further
All .NET distributed systems that make
intensive use of disconnected data (as recommended by Microsoft
architecture patterns and practices) are sensitive to the size of
serialized data. The larger the DataSet, the more CPU cycles, memory,
and bandwidth these systems consume. Fortunately, ADO.NET 2.0 provides a
great fix. Before I explain it, though, I should make clear that I'm
not saying DataSet XML serialization is a bad thing in general. Standard
DataSet XML serialization is a "stateful" form of serialization in the
sense that it maintains some state information such as pending errors
and current changes to rows. It can be customized to a great extent by
choosing the favorite combination of nodes and attributes; you can also
decide if relations should simply be tracked or rendered by nesting
child records under their parents.
What's really interesting is the fact that as long as only a few records (for example, less than 100) are involved in the serialization, the difference in performance between a standard DataSet and an optimized DataSet (custom DataSet, surrogate, compressed, whatever) is negligible. Therefore, I wouldn't spend much time implementing alternative forms of DataSet serialization if my application moves only small chunks of data.
Although performance worsens as the size
of the DataSet grows, even for a thousand records the performance is not
much of a problem. Note that performance here relates to .NET Remoting
end-to-end speed, a parameter that includes the size of the data to
transfer and the costs of successive instantiations. As you scale from a
thousand to a few thousand records, the performance begins to suffer
significantly. Figure 2 shows the difference in .NET
Remoting end-to-end speed when XML (standard) and true binary
serialization are used (the latter is new to ADO.NET 2.0; more on this
in a moment). For the sake of precision, I have to say that the numbers
behind the graph have been obtained with a Beta build of ADO.NET 2.0.
Figure 2 Remoting End-to-End Time
Of course the worst scenario is when you
move a large DataSet (thousands of rows) with numerous changes. If
you're simply forwarding this DataSet to the data access layer (DAL) for
applying changes, you can alleviate the issue by using the DataSet's
GetChanges method. This method returns a new DataSet that contains only
the rows in the various tables that have been modified, added, or
deleted.
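For example, assuming a hypothetical dal.Update method in the data access layer, forwarding only the delta could look like this:

// Forward only the modified, added, and deleted rows to the DAL (dal.Update is hypothetical)
DataSet changes = ds.GetChanges();
if (changes != null)
{
    dal.Update(changes);
    ds.AcceptChanges();   // mark the local copy as clean once the update succeeds
}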
However, when you're moving a DataSet
from tier to tier, or from a business component to an external service
that implements a business process, you have no choice but to pass a
stateful representation of the data—the whole DataSet with its own set
of relations, pending changes, and errors.
Why does the end-to-end speed slow down
once you reach a certain threshold? Handling large DataSets poses an
additional problem to the .NET Remoting infrastructure aside from the
time and space needed to complete the operation. This extra problem has
to do with the specific algorithm used to serialize and, more
importantly, deserialize DataSets saved as XML DiffGrams. To restore a
DataSet from a DiffGram, the binary formatter invokes a protected
constructor on the DataSet class (part of the ISerializable
implementation). The pseudocode of this DataSet constructor is shown in Figure 3.
As you can see, the DataSet's ReadXml method is called to process the
DiffGram. Especially for large DataSets, ReadXml works by creating lots
of transient, short-lived objects, a few for each row to be processed.
This mechanism puts a lot of additional pressure on the .NET Remoting
infrastructure and consumes a lot of memory and CPU cycles. Exactly how
bad it is will depend on the runtime conditions and hardware equipment
of the system. This explains why certain clients experience visible
problems when moving only 7MB of data while others won't see problems
until they've moved 20MB of data.
// Figure 3 - pseudocode of the DataSet deserialization constructor
protected DataSet(SerializationInfo info, StreamingContext context)
{
    string schema, diffgram;
    schema = (string) info.GetValue("XmlSchema", typeof(string));
    diffgram = (string) info.GetValue("XmlDiffGram", typeof(string));
    if (schema != null)
        ReadXmlSchema(new XmlTextReader(new StringReader(schema)), true);
    if (diffgram != null)
        ReadXml(new XmlTextReader(new StringReader(diffgram)), XmlReadMode.DiffGram);
}
DataSet Serialization in ADO.NET 2.0
When you upgrade to ADO.NET 2.0, your
problems will be solved. In ADO.NET 2.0, the DataSet class provides a
new serialization option specifically designed to optimize remoting
serialization. As a result, remoting a DataSet uses less memory and
bandwidth. The end-to-end latency is greatly improved, as Figure 2 illustrates.
In ADO.NET 2.0, the DataSet and DataTable
come with a new property named RemotingFormat defined to be of type
SerializationFormat (see Figure 4). By default, the new
property is set to SerializationFormat.Xml to preserve backward
compatibility. The property affects the behavior of the GetObjectData
members on the ISerializable interface and ultimately is a way to
control the serialization of the DataSet. The following code snippet
represents the pseudocode of the method in ADO.NET 2.0:
void GetObjectData(SerializationInfo info, StreamingContext context)
{
    SerializationFormat fmt = RemotingFormat;
    SerializeDataSet(info, context, fmt);
}

// Requesting binary serialization of a DataSet
DataSet ds = GetData();
ds.RemotingFormat = SerializationFormat.Binary;
BinaryFormatter bin = new BinaryFormatter();
bin.Serialize(stream, ds);

// Pseudocode of DataSet.SerializeDataSet in ADO.NET 2.0
private void SerializeDataSet(
    SerializationInfo info,
    StreamingContext context,
    SerializationFormat remotingFormat)
{
    info.AddValue("DataSet.RemotingVersion", new Version(2, 0));
    if (remotingFormat != SerializationFormat.Xml)
    {
        int i;
        info.AddValue("DataSet.RemotingFormat", remotingFormat);
        SerializeDataSetProperties(info, context);
        info.AddValue("DataSet.Tables.Count", this.Tables.Count);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeConstraints(info, context, i, true);
        SerializeRelations(info, context);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeExpressionColumns(info, context, i);
        for (i = 0; i < Tables.Count; i++)
            Tables[i].SerializeTableData(info, context, i);
        return;
    }
    // 1.x code
}
It is interesting to measure the
performance gain that you get from this new feature. Here's a simple
technique you can easily reproduce. Fill a DataSet with the results of a
query and persist it to a file on disk (see Figure 6). You can wrap the code in Figure 6
in either a Web Form or a Windows Form. Run the sample application and
take a look at the size of the files created. Try first with a simple
query like this:
SELECT lastname, firstname FROM employees

// Figure 6 - persist the DataSet to disk as XML and as binary
SqlDataAdapter adapter = new SqlDataAdapter(query, connString);
DataSet ds = new DataSet();
adapter.Fill(ds);
BinaryFormatter bin = new BinaryFormatter();

// Save as XML
using (StreamWriter writer1 = new StreamWriter(@"c:\xml.dat"))
{
    bin.Serialize(writer1.BaseStream, ds);
}

// Save as binary
using (StreamWriter writer2 = new StreamWriter(@"c:\bin.dat"))
{
    ds.RemotingFormat = SerializationFormat.Binary;
    bin.Serialize(writer2.BaseStream, ds);
}
If you're familiar with the Northwind
database, you know that this query returns only nine records. Quite
surprisingly, in this case the XML DiffGram is about half the size of
the binary file! Don't worry, there's nothing wrong in the code or in
the underlying technology. To see the difference, run a query that will
return a few thousand records, like this one:
SELECT * FROM [order details]
Now the binary file is about 10 times smaller than the XML file. With this evidence, look back at the graph in Figure 2.
The yellow series shows the performance of the DataSet binary
serializer and fortunately increases very slowly as the size of the
DataSet grows. The same can't be said for the DataSet XML serializer
which reports a sudden upswing and significantly decreased absolute
performance as the number of rows exceeds a few thousand.
DataTable Enhancements
In ADO.NET 1.x, the DataTable
class suffers from three main limitations. A DataTable can't be used
with Web service methods and doesn't provide a direct way to serialize
its contents. In addition, while DataTable instances can be passed
across a remoting channel, strongly typed DataTable instances cannot be
remoted. These limitations have been removed in ADO.NET 2.0. Let's take a
look at exactly how this can be accomplished.
Return values and input parameters of Web
service methods must be serializable through the XmlSerializer class.
This class is responsible for translating .NET types into XML Schema
Definition (XSD) types and vice versa. Unlike the runtime formatter
classes (such as BinaryFormatter), the XmlSerializer class doesn't
handle circular references. In other words, if you try to serialize a
DataTable you get an error because the DataTable contains a reference to
a DataSet—the DataSet property—and the DataSet, in turn, contains a
reference to the DataTable through the Tables collection. For this
reason, XmlSerializer raises an exception if you try to pass a DataTable
to, or return a DataTable from, a Web service method. DataSet objects
have the same circular reference, but unlike the DataTable class,
DataSets implement the little-known IXmlSerializable interface.
The XmlSerializer class specifically
looks for this interface when serializing an object. If the interface is
found, the serializer yields control and waits for the WriteXml/ReadXml
methods to terminate the serialization and deserialization process. The
IXmlSerializable interface has the following three members:
XmlSchema GetSchema();
void ReadXml(XmlReader reader);
void WriteXml(XmlWriter writer);
In ADO.NET 2.0, the DataTable class fully
supports the IXmlSerializable interface so you can finally use
DataTables as input parameters or return values in Web service methods.
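For instance, a minimal sketch of an ASMX Web method returning a DataTable (the query and connection string are placeholders, and the method is assumed to live in a class deriving from System.Web.Services.WebService):

[WebMethod]
public DataTable GetCustomers()
{
    // Works as a return value in ADO.NET 2.0 because DataTable now implements IXmlSerializable;
    // the table needs a name to be serialized
    DataTable table = new DataTable("Customers");
    using (SqlDataAdapter adapter = new SqlDataAdapter(
        "SELECT CustomerID, CompanyName FROM Customers", connectionString))
    {
        adapter.Fill(table);
    }
    return table;
}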
Note that the GetSchema method exists on
IXmlSerializable purely for backward compatibility, and it is safe to
choose a trivial implementation of GetSchema that returns null. However,
if the class is being used in a Web service method's signature, an
XmlSchemaProviderAttribute should be applied to the class to denote a
static method on the class that can be used to get a schema. This XSD
representation can then be mapped back to the actual object by
implementing a SchemaImporterExtension. For more information on
XmlSchemaProviderAttribute and SchemaImporterExtension, see New Features for Web Service Developers in Beta 1 of the .NET Framework 2.0.
In addition, the DataTable provides a
simple XML I/O channel. A DataTable can be populated from an XML stream
using the new ReadXml method and persisted to disk using the new
WriteXml method. The methods have the same role and signature as the
equivalent methods on the DataSet class. Last, but not least, a
DataTable supports the RemotingFormat property and can be serialized in a
true binary format.
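A quick sketch of those DataTable additions (the file names and the GetOrdersTable helper are made up for illustration):

DataTable orders = GetOrdersTable();                        // hypothetical helper returning a named DataTable

orders.WriteXml("orders.xml", XmlWriteMode.WriteSchema);    // new: persist the table (schema + data)

DataTable reloaded = new DataTable();
reloaded.ReadXml("orders.xml");                             // new: repopulate the table from XML

orders.RemotingFormat = SerializationFormat.Binary;         // new: true binary serialization
using (FileStream fs = File.Create("orders.bin"))
{
    new BinaryFormatter().Serialize(fs, orders);
}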
Note that the list of changes for the
DataTable class doesn't end here. For example, the class also features
full integration with streaming interfaces and can be loaded from
readers (including SQL DataReaders) and its contents can be read through
DataReaders. However, these changes don't directly (or necessarily)
affect the building of middle tiers of enterprise class applications and
in this column I don't have the space to cover them or even the
majority of the new ADO.NET 2.0 features.
Call to Action
One of the main ADO.NET 2.0 themes is the
emphasis on improved performance and scalability. This encompasses a
new, faster index engine to locate table records and the oft-requested
binary serialization for DataSets. By adding the same capabilities to
the DataTable class, I'd say the team exceeded developers' expectations
by a wide margin.
So what's the best thing to do if you
have an application that suffers from DataSet serialization performance
problems? Once you upgrade to ADO.NET 2.0, fixing the problem is as easy
as adding one line of code—the one that sets the RemotingFormat
property on the DataSet. The fix is simple to apply and terrifically
effective. Especially if you haven't made any progress on alternative
routes (like compressors, surrogates, and the like), I suggest you stay
and start planning a full migration for when the new platform (or at
least its Go-Live license) is available. If you absolutely need a fix
today, I suggest that you choose the one that has the least impact on
your system. My experience suggests that since most applications use
typed DataSets you can just modify the GetObjectData method of your
typed DataSets to use surrogates (see the Knowledge Base article 829740)
or, more simply, zip the XML DiffGram. The performance improvement is
immediate, though not as good as it could be. It's a workaround after
all.
ADO.NET 2.0 is evolution, not revolution.
It's a fully backward-compatible data access platform in which the
RemotingFormat property stands out as the feature that really fixes many
existing enterprise applications, allowing Web services to bridge user
interface components and business entities.
Send your questions and comments for Dino to cutting@microsoft.com.
Dino Esposito is a Wintellect instructor and consultant based in Italy. Author of Programming ASP.NET and the newest Introducing ASP.NET 2.0
(both from Microsoft Press), he spends most of his time teaching
classes on ASP.NET and ADO.NET and speaking at conferences. Get in touch
with Dino at cutting@microsoft.com or join the blog at http://weblogs.asp.net/despos.