I was recently asked about ‘semantic integration’ in the context of a technology solution relying on a more or less centralised design and validation approach. But before we get to that, let’s define some background.
Integration is the process of achieving interoperability between two or more applications. Interoperability is the ability to share and use information, services and processes. This implies two types of data to be described as part of performing the integration: the data to be shared, and the data to enable the sharing.
In other words, we need to describe:
- Structures: the ability to map and transform between sets of data structures, e.g., the data fields (length, type, identification) and the nesting of data fields.
- Semantics: the ability to understand and interpret the meaning of a given data structure within a given context (e.g., ‘client’ would mean something different depending on whether you are a consulting company or a social service agency).
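To make the structural side concrete, here is a minimal sketch of mapping between two applications’ record layouts. The systems, field names and shapes are invented for illustration: the same ‘client/customer’ concept, modelled differently in each system.

```python
# Hypothetical example: map a CRM 'client' record to a billing system's
# 'customer' layout -- same concept, different structure.

def crm_to_billing(crm_record: dict) -> dict:
    """Transform between two sets of data structures."""
    return {
        "customer_id": str(crm_record["clientId"]),  # int -> string identifier
        "name": f'{crm_record["firstName"]} {crm_record["lastName"]}',
        "phone": crm_record["contact"]["mobile"],    # nested field -> flat field
    }

crm = {
    "clientId": 4711,
    "firstName": "Ada",
    "lastName": "Lovelace",
    "contact": {"mobile": "+44 20 7946 0000"},
}
print(crm_to_billing(crm))
```

The structural mapping is the easy part; nothing in the code above tells you whether the CRM’s ‘client’ and the billing system’s ‘customer’ actually mean the same thing, which is the semantic question.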
Some degree of centralisation is required to avoid a complete data mess, but too much centralisation can stall most project initiatives – backward compatibility becomes a problem (as every project needs its own small adjustments). You can then either include everything (and risk creating a very large, inefficient data model) or suffer the problem of fragmentation.
Wasn’t Service Oriented Architecture (SOA) supposed to solve this problem of interoperability?
SOA has mostly been about defining loosely coupled services (surprise, surprise), but it is really only about the ‘loose coupling’ of location (UDDI), communication (HTTP, SOAP), interface (WSDL) and function (WSDL and WS-BPEL). In data terms, these standards mainly describe the data to enable the sharing, and to some extent the structure of the data to be shared. But there isn’t much, if anything, about data semantics.
Semantic integration was basically left out of SOA. In my opinion, it consists of three parts:
- Business artefacts: A set of commonly defined and understood business artefacts. These are not what integration people understand as the common or canonical data definitions, but the identification of business-level artefacts (e.g., invoice, customer, order etc.) and their meaning – without the detailed data definition. They are important, as they allow for the identification of cross-domain data objects that are the same, yet modelled differently. ‘Customer’ may be described differently, but conceptually remain the same. Large-scale projects like the NCI Cancer project (caBIG) refer to these as ‘concept identifiers’. You can also read more about IBM’s idea about SOA and business artefacts here and here.
- Data element definitions: Standardisation of the data elements to be made available for use as part of project-level message and data modelling. This is not a fixed, common data model, but the laundry list of available data elements, their type and semantic meaning. These are at the data field level (e.g., ‘mobile phone number’, ‘postcode’, or ‘order number’), and not at the ‘customer’, ‘address’ or ‘invoice’ level. Note that there is a grey zone between these two levels.
- Data elements assembly: A method or standard for the assembly of data elements to guide the project and application level modelling activities.
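To illustrate the second and third parts together, here is a minimal sketch of a central data element registry and an assembly check. The element names, types and meanings are invented examples, not a proposed standard.

```python
# Hypothetical central registry: a laundry list of data elements with
# their type and semantic meaning -- not a fixed, common data model.
DATA_ELEMENTS = {
    "mobile_phone_number": {"type": "string", "meaning": "Mobile number of a contact"},
    "postcode": {"type": "string", "meaning": "Postal code of an address"},
    "order_number": {"type": "string", "meaning": "Unique identifier of a sales order"},
}

def validate_assembly(message_fields: dict) -> list:
    """Check a project-level message assembly against the registry:
    every field must be a registered element of the declared type."""
    problems = []
    for field, value in message_fields.items():
        element = DATA_ELEMENTS.get(field)
        if element is None:
            problems.append(f"'{field}' is not a registered data element")
        elif element["type"] == "string" and not isinstance(value, str):
            problems.append(f"'{field}' should be a string")
    return problems

# A project assembles its own message from registered elements;
# the postcode is mistyped as a number and gets flagged.
msg = {"order_number": "SO-1001", "postcode": 2000}
print(validate_assembly(msg))  # -> ["'postcode' should be a string"]
```

The point is that the project owns the shape of the message, while the centre owns the vocabulary it is assembled from.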
The idea is to get away from the ‘universal schema’ definitions, as I believe they are too rigid and next to impossible to implement effectively. The schema modelling should be at the project and application level, but done within the context of the above. Rather than centrally manage XSDs and validate XML, you aim to centrally manage how XSDs should be defined and validate the project-level XSD definitions.
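As a sketch of what ‘validating the project-level XSD definitions’ could look like, the following checks a project schema’s leaf elements against a central registry of element names. The schema and the registered names are invented examples; a real implementation would check types and assembly rules as well.

```python
# Validate a project-level XSD against central modelling rules,
# rather than centrally validating instance XML.
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical central registry of approved data element names.
REGISTERED_ELEMENTS = {"orderNumber", "postcode", "mobilePhoneNumber"}

PROJECT_XSD = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderNumber" type="xs:string"/>
        <xs:element name="deliveryNote" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def check_project_schema(xsd_text: str) -> list:
    """Flag leaf elements that are not in the central element registry."""
    root = ET.fromstring(xsd_text)
    problems = []
    for elem in root.iter(XS + "element"):
        name = elem.get("name")
        # Only check leaf elements (those declared with a simple type);
        # container elements like 'order' are project-level modelling.
        if elem.get("type") and name not in REGISTERED_ELEMENTS:
            problems.append(f"'{name}' is not a registered data element")
    return problems

print(check_project_schema(PROJECT_XSD))
```

The project is free to shape its own ‘order’ message, but ‘deliveryNote’ is flagged because it was never registered centrally – the centre governs the rules and the vocabulary, not the schema itself.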
The approach solves the problem of semantic ambiguity without locking everyone into a single data model.