Inside Microsoft CRM


Entity size is always a problem

Running into the customization ceiling when adding attributes? I feel your pain. I really do. The team down the hall from me is working quite hard on making some of this pain go away, and they've done a bunch of work in the query processor layer in the platform. There's a reason the limitation exists in V1.x and there's a reason it wasn't "fixed" earlier.

The original COLA (contact, opportunity, lead, and account) definitions were quite small and left a ton of room for extensions. One of the things we looked at was allowing customizations where one could store everything one wanted in an XML document in the database. There were way too many problems with that approach (although there are some great upsides too). Simply put, search is going to be a problem with the property bag approach. There really aren't any great mechanisms for telling <fetch> about the semantics of an attribute. It knows all about the entities, attributes, and relationships, but that's where its knowledge stops. The application, and most other display patterns (except reporting), would work fairly well because it's all just XML and XSLT, and writing another XPATH expression to reach into the bag and pull out the rabbit is a well understood problem.
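To make the tradeoff concrete, here's a tiny Python sketch (the bag layout and attribute names are invented for illustration). Pulling a value out of the bag for display is one XPath-style lookup; any search, though, means parsing and filtering every row in the application tier, because the database has no column to index.

```python
# Hypothetical property-bag schema: one XML document per contact row
# holding all of that contact's custom attributes.
import xml.etree.ElementTree as ET

rows = [
    (1, "<bag><pettype>cat</pettype><vet>Dr. Smothers</vet></bag>"),
    (2, "<bag><pettype>dog</pettype><vet>Dr. Jones</vet></bag>"),
]

# Display is easy: one XPath-style lookup into the bag.
def display(doc, path):
    node = ET.fromstring(doc).find(path)
    return node.text if node is not None else None

# Search is the problem: without column-level metadata the database
# can't index any of this, so every row has to be parsed and filtered
# in the application.
def search(rows, path, value):
    return [pk for pk, doc in rows if display(doc, path) == value]

print(display(rows[0][1], "vet"))      # Dr. Smothers
print(search(rows, "pettype", "cat"))  # [1]
```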

The second approach was to allow physical attributes to be added to entities in the form of database columns. There are some problems with this as well, particularly around name collisions and upgrade scenarios, but none of them was insurmountable with some decent engineering work.

A little history lesson may help. This is an excerpt from a whitepaper I wrote when we first started looking at how to create an extensible product. This is really ancient history at this point, so there's no reason I can think of not to share it.

Several proposals are on the table to allow application developers to customize the storage characteristics (the tables and fields).

1) The approach taken for ClearLead 1.x (bCentral Customer Manager). Each interesting object (business, user, prospect, event) has a set of developer-defined named properties. This approach was an attempt at solving the problems inherent in approach 3. However, it quickly caused two severe problems. First, performance was horrible: each query required multiple outer joins to gather all the detail-level information. Second, the volume of stored data rapidly exploded. Where it would have been possible to store a single inbound email event record in a single row using an ntext blob, the CL model took the approach that all large data be broken into 2000-character chunks and stored individually. This meant that any time the information was read or written, the data had to be reconstructed.

2) Expose a single, opaque, application-specific blob field on every interesting object. This has some appeal since it leaves all the interpretation to the application and puts the burden on the developer to manage and render this information as necessary. The drawback here is that the blob isn't quickly searchable and can't be indexed (full-text indexing is an option, but isn't quite mature enough to be relied upon).

Another drawback with this format is that simple queries against the data are difficult to construct and very expensive to run. For example, how would a query be constructed which found all contacts who brought a cat into a vet clinic in May and were serviced by Dr. Smothers? If this data is 'stuffed' into a single XML blob, the format isn't controllable by the platform, so a generic query like this wouldn't be possible to construct.

A secondary problem with this approach is the opaqueness of the data: neither the application nor the platform has any knowledge of the document structure. The platform would need to be written with the document structure in mind to make any reasonable use of the data, in which case the extensibility mechanism is defeated. The application, on the other hand, may have knowledge of the structure, but has no guarantee of it. That is, the structure may need to be interpreted differently for each individual object. [If the application were to force a fixed document structure on each class of objects, that would reduce some of the problems.]

3) Supply a fixed number of customizable fields per object - say 5 or 10 extra fields. The problem with this approach is that it breaks the zero, one, infinity rule. As soon as we present n fields to the developer, they'd ask for n + 1. If we told them they had 255 Unicode characters per field, they'd ask for 256. We can get around the second part of this problem by implementing the extra fields as sql_variant, but that limits the fields' usefulness by changing their meaning in large searches.

4) Use a metadata-driven model and "hide" the physical model from the application and platform developers. The appeal here is that each developer can actually think about the problem at hand and customize the object definition to meet their needs.
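For flavor, here's a minimal Python sketch of the metadata-driven idea (all table, column, and attribute names below are invented for illustration): the platform keeps a map from logical attribute names to physical table/column pairs and generates the SQL itself, so neither the application developer nor the platform code ever touches the physical model directly.

```python
# Hypothetical attribute metadata: logical name -> (table, column).
METADATA = {
    "contact": {
        "firstname": ("ContactBase", "FirstName"),
        "pettype":   ("ContactExtensionBase", "New_PetType"),
    }
}

def build_select(entity, attributes):
    """Generate a SELECT over however many physical tables the
    requested logical attributes happen to live in."""
    tables, cols = {}, []
    for attr in attributes:
        table, column = METADATA[entity][attr]
        alias = tables.setdefault(table, f"t{len(tables)}")
        cols.append(f"{alias}.{column} AS {attr}")
    table_list = list(tables.items())
    base, base_alias = table_list[0]
    sql = f"SELECT {', '.join(cols)} FROM {base} {base_alias}"
    # Join any extension tables back on the (assumed) shared key.
    for table, alias in table_list[1:]:
        sql += (f" JOIN {table} {alias}"
                f" ON {alias}.{entity}Id = {base_alias}.{entity}Id")
    return sql

print(build_select("contact", ["firstname", "pettype"]))
```

The point of the sketch is that the join across base and extension tables is an implementation detail the metadata layer owns; the caller only ever names logical attributes.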

For the longest time we (hell, I'll take the blame on this, this was my idea) were under the impression that spreading things across tables would be a way around the problem. That's one of the reasons that addresses are bound into the entities (although I really dislike that design, I have to say it does make sense at times). The issue is that SQL needs to create temp tables to hold the 'inserted' table data for the update triggers. While most people would never write all of the data to a record at once, it is possible, and in those situations things will just break.
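A back-of-the-envelope sketch of why this bites only sometimes (the column names and widths below are made up, and the arithmetic assumes SQL Server 2000's 8,060-byte row limit and 2 bytes per declared nvarchar character): the declared widths can add up past the limit even though most real rows never fill every column, which is exactly the situation where the trigger's materialized row breaks.

```python
PAGE_ROW_LIMIT = 8060  # SQL Server 2000 max bytes per row

# Hypothetical declared schema for one logical entity.
declared = {
    "description": ("nvarchar", 2000),
    "notes":       ("nvarchar", 2000),
    "address1":    ("nvarchar", 256),
}

def declared_bytes(cols):
    # nvarchar costs 2 bytes per declared character.
    return sum(width * 2 if t == "nvarchar" else width
               for t, width in cols.values())

# Fine until someone actually fills every column at once.
print(declared_bytes(declared), declared_bytes(declared) > PAGE_ROW_LIMIT)
```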

The way we get around this is to remove the updatable views and triggers altogether and use the metadata to construct the cross-table queries. Until that happens there's really no way around the 8k limitation (at least not in the supported world).

Looking back on this now, I think I'd take either option 1 or 2 above. If I were to take the XML blob approach, I'd likely work on an extension pattern that forced the extension author to describe the extension in terms of metadata (and in terms of an XSD) so the tools which manipulate metadata for presentation, query, and update operations would "know" how to interpret the data. It still doesn't solve the reporting problems, and it likely won't until a reporting rendering engine can be built that knows about XML as source data and uses XPATH as the layout. There would still be problems with query, particularly around aggregate functions and ordering (what if someone wants to group on one element in the extension and order by another - they're not columns to SQL, so you'd need to lift that functionality out of the database, where it should be, and into an independent query processing layer...)
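That lifted-out query layer would end up looking something like this Python sketch (the extension-document layout is hypothetical): grouping by one extension element and ordering by an aggregate of another, all computed in the application tier because neither is a column SQL can see.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical extension documents, one per record.
docs = [
    "<ext><region>West</region><revenue>100</revenue></ext>",
    "<ext><region>East</region><revenue>250</revenue></ext>",
    "<ext><region>West</region><revenue>50</revenue></ext>",
]

def group_sum(docs, group_path, value_path):
    """GROUP BY one extension element, SUM another, ORDER BY the
    aggregate descending; all done outside the database."""
    totals = defaultdict(int)
    for doc in docs:
        root = ET.fromstring(doc)
        totals[root.findtext(group_path)] += int(root.findtext(value_path))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(group_sum(docs, "region", "revenue"))  # [('East', 250), ('West', 150)]
```

Every row gets parsed on every query, which is precisely the performance problem the whole post is circling around.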