Monday 13 December 2010

Performance and the Entity Framework (EF) 3.5 - How to Lazy Load Fields within the one table

There are 2 different camps when it comes to application performance:
  1. Functionality should be delivered first and we can optimize performance later
  2. We should constantly strive to test and improve application performance throughout development - this is particularly important when dealing with new or unproven technologies.
While it is good to be proactive about performance, constant tuning can sometimes be a hindrance to the project and to delivering functionality that the client can actually use. Clearly Microsoft took the first approach with the first version of the Entity Framework (EF 3.5). As a hybrid of these two camps, I strongly believe in building a Proof of Concept for every project, based on core use cases and aimed at proving integration approaches and the expected performance of the end system. This helps you develop some golden rules/rules of thumb for that particular implementation and can help you avoid larger-scale issues down the track.

Performance approaches aside, one of my clients recently had an issue with the performance of a system based on the Entity Framework 3.5. Many of the general issues with EF performance are well documented and I will not detail them here - however, there are some golden rules that apply to any database-driven application:
  1. Minimize the amount of data you bring across the network (see the projection sketch after this list)
  2. Minimize network "chattiness", as each round-trip has an overhead. You can batch up requests to reduce this.
  3. JOIN and filter your queries to minimize the number of records that SQL needs to process in order to return results.
  4. Index your DB properly and use Indexed (SQL Server)/Materialized (Oracle) Views for the most common JOINs
  5. Cache data and HTML that is static so you don't have to hit the database or the ORM model in the first place
  6. Denormalize your database if performance is suffering due to "over-normalization"
  7. Reduce the number of dynamically generated objects where possible, as they incur an overhead.
  8. Explicitly load entities rather than loading them through the ORM (e.g. via an ObjectQuery in Entity Framework) when the ORM outputs poorly performing JOINs or UNIONs.
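
As a quick illustration of rules 1 and 3, a LINQ to Entities query can filter on the server and project just the columns it needs instead of materializing whole entities. This is only a sketch - the context and property names ("SampleEntities", "CreatedDate", "FileName") are hypothetical:

DateTime cutoffDate = DateTime.Today.AddDays(-7);

using (var context = new SampleEntities())
{
    // Filter in the WHERE clause and project only two columns, so SQL
    // processes fewer rows and no unneeded data crosses the network.
    var recentFiles = (from f in context.SystemFiles
                       where f.CreatedDate >= cutoffDate
                       select new { f.FileId, f.FileName }).ToList();
}
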
One thing I noticed in this application that violated Rule 1 was the use of an EF entity "SystemFile" with a field called "Contents" that held large binary streams (aka BLOBs), which were pulled from the database every time the entity was involved in a query. The Entity Framework doesn't support lazy loading of individual fields per se - but it does support loading related entities separately.
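
To make the problem concrete, here is a sketch of the original pattern (the "SampleEntities" context, the "FolderId" property and the folderId variable are hypothetical stand-ins):

using (var context = new SampleEntities())
{
    // Any query that materializes SystemFile entities also pulls the
    // "Contents" BLOB column across the network - used or not.
    var files = context.SystemFiles
                       .Where(f => f.FolderId == folderId)
                       .ToList(); // the generated SELECT includes Contents on every row
}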

Using this separate-loading concept, the most obvious approach seemed to be:
  1. Remove the "Contents" field from the "SystemFile" entity so it didn't get loaded automatically whenever the entity was referenced in a LINQ to Entities query.
  2. Create an inherited entity "SystemFileContents" that held just the contents of the file, so the application could load it only when needed.
This was fine - but my Data Access Layer then wouldn't compile and I received the following error:

Error 3034: Problem in Mapping Fragments starting at lines 6872, 6884: Two entities with different keys are mapped to the same row. Ensure these two mapping fragments do not map two groups of entities with overlapping keys to the same group of rows.

After a little investigation, I found there are a few different approaches to resolving this error:
  1. Implement Table Per Hierarchy (TPH) inheritance as described at http://msdn.microsoft.com/en-us/library/bb738443(v=VS.90).aspx. This would mean making some database changes to move the binary file contents into a separate table, making the parent "SystemFile" class abstract, and only ever referring to 2 new child classes, "SystemFileWithContents" and "SystemFileWithoutContents".
  2. Simply split the table into 2 different entities with a 1:1 association, rather than an inheritance relationship, in the Entity Framework model.
Option 2 was the best in terms of minimizing code impact, as this application had been in development for over a year. To this end, I followed the advice in the post below on mapping multiple entity types to the same table:

http://blogs.msdn.com/b/adonet/archive/2008/12/05/table-splitting-mapping-multiple-entity-types-to-the-same-table.aspx
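
Conceptually, table splitting leaves you with two entity types over the one table, along these lines (a simplified sketch only - the real EF 3.5 code generator emits full change-tracking classes, and everything here apart from FileId and Contents is illustrative):

public partial class SystemFile
{
    public int FileId { get; set; }        // shared primary key
    public string FileName { get; set; }   // lightweight metadata stays here
    // 1:1 navigation property to the BLOB-carrying entity
    public SystemFileContent SystemFileContent { get; set; }
}

public partial class SystemFileContent
{
    public int FileId { get; set; }        // same key, same table row
    public byte[] Contents { get; set; }   // the BLOB, loaded only on demand
}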

The designer in Visual Studio 2008 doesn't support this arrangement (though the designer in Visual Studio 2010 does, as per http://thedatafarm.com/blog/data-access/leveraging-vs2010-rsquo-s-designer-for-net-3-5-projects/) - so you have to modify the XML of the model file directly and add a "ReferentialConstraint" node to correctly relate the 2 entities. The referential constraint informs the model that the keys of these two entity types are tied to each other:

<Association Name="SystemFileSystemFileContent">
  <End Type="SampleModel.SystemFile" Role="SystemFile" Multiplicity="1" />
  <End Type="SampleModel.SystemFileContent" Role="SystemFileContent" Multiplicity="1" />
  <ReferentialConstraint>
    <Principal Role="SystemFile"><PropertyRef Name="FileId"/></Principal>
    <Dependent Role="SystemFileContent"><PropertyRef Name="FileId"/></Dependent>
  </ReferentialConstraint>
</Association>
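
With the constraint in place, the application can load the file contents on demand through the 1:1 association. A minimal sketch, assuming the EF 3.5 code generator produced the usual "SystemFileContentReference" EntityReference member for the association (the context name and the fileId variable are illustrative):

using (var context = new SampleEntities())
{
    // Metadata-only query - the BLOB no longer comes across the wire
    SystemFile file = context.SystemFiles.First(f => f.FileId == fileId);

    // Explicitly load the contents entity only when it is actually needed
    if (!file.SystemFileContentReference.IsLoaded)
    {
        file.SystemFileContentReference.Load();
    }

    byte[] data = file.SystemFileContent.Contents;
}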

This reduced the load on both SQL Server and the web server, as the BLOB data no longer had to be dragged across on every query against the SystemFile table. Any performance improvement must be measurable - so the team confirmed this one with scripted Visual Studio 2008 load tests using a customer-validated test mix based on the expected usage of the system.

DDK
