Data Integration and XML

Data integration means combining heterogeneous data sources. It is a complex activity that involves reconciliation at several levels: data models, data schemas, and data instances. This creates a strong need for automation tools that organize data into a common syntax, and XML is widely touted as well suited to this critical requirement. This article briefly explains each of these levels and how XML can address the challenges they raise.

Data models - Data sources use different structures, such as files, tables, and objects, to represent or publish data. This heterogeneity of data structures, or models, calls for a common data model onto which information from the various sources can be mapped.

Data Schema - The same entity or property can be represented in different ways. For example, two data sources may use different names for the same entity (price and cost), the same name for two different concepts, or two different ways of conveying the same information (age and date of birth). In addition, data sources may represent the same information using different data structures. For example, consider two data sources that both follow the relational model and both model the entity "Employee", where the first uses a single table and the second uses two tables for the same entity. An automation tool has to take all these differences into account to achieve data integration.

Data Instances - At the instance level, integration problems include determining whether objects coming from different sources represent the same real-world entity, and selecting a source when different data sources contain contradictory information, for example, different birth dates for the same person.
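The schema-level mismatches described above (price versus cost, age versus date of birth) can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module; the sample documents, the TAG_MAP table, and the helper names are hypothetical and chosen only for illustration. It assumes the mapping between the two vocabularies is already known, which in practice is the hard part.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical records from two sources describing the same kind of item.
SOURCE_A = "<item><name>Widget</name><price>9.50</price></item>"
SOURCE_B = "<product><title>Widget</title><cost>9.50</cost></product>"

# Assumed mapping from one source's tag names to a shared vocabulary.
TAG_MAP = {
    "product": "item",
    "title": "name",
    "cost": "price",
}


def normalize(xml_text):
    """Parse a document and rename its tags to the shared vocabulary."""
    root = ET.fromstring(xml_text)
    for element in root.iter():  # iter() visits the root element as well
        element.tag = TAG_MAP.get(element.tag, element.tag)
    return root


def age_from_birth_date(birth_date, today):
    """Derive an age in whole years so 'date of birth' can be compared with 'age'."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)


a = normalize(SOURCE_A)
b = normalize(SOURCE_B)

# After normalization both sources serialize to the same document.
print(ET.tostring(a, encoding="unicode") == ET.tostring(b, encoding="unicode"))
print(age_from_birth_date(date(1980, 6, 15), date(2024, 6, 14)))
```

Renaming tags only handles naming differences. Structural differences, such as one table versus two, need a restructuring step in addition to a lookup table.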

XML Capabilities

1.   XML provides a simple, standard, and well-accepted data model. XML can structure data using hierarchical or graph-based representations, which makes it possible to represent structured, semi-structured, and unstructured information. XML is therefore capable of providing a solid, flexible, and manageable common data model over the Web. A number of software tools make it easier to publish database contents in XML.

2.  Because XML is extensible, the names and meanings of XML tags are arbitrary. This creates a need for standardized, domain-specific tags and schemas, since such standardization is expected to simplify the data integration task. Several efforts are under way in this direction, such as OASIS and BizTalk. In addition, XSLT can define mappings among heterogeneous tags.

At the same time, a mechanism is needed to describe the semantics of elements and attributes. For example, suppose two data sources use the same element "price" to describe the amount of money required to buy a specific item. Questions then arise, such as which currency the price is expressed in and whether it includes taxes and other expenses. The semantics of the information to be integrated therefore has to be understood. This is the role of metadata, and metadata can itself be expressed in XML.

The schema-level reconciliation process can also account for information about context, such as the meaning of a particular name or specific value that depends on the context in which the information occurs. For instance, an employee's ID number can convey relevant information to readers who know the numbering conventions followed in that company; that is, the information and its reader share the same context. It is therefore critical to devise efficient techniques and tools for the interchange and integration of context information.

3.  Metadata also plays a very important role in data instance reconciliation. It helps to deal with similar or contradictory information in the data sources. For example, metadata can attach a time stamp or quality level to the data being integrated, and such information can be used to resolve conflicts among information stored in different data sources.
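As a rough illustration of the time-stamp idea in item 3, the sketch below keeps the most recently updated record when two sources disagree. The "source" and "updated" attributes, the sample data, and the function name are assumptions made for this example, not part of any XML standard. The sketch also assumes that records sharing an id already refer to the same real-world entity, so the entity-matching problem is out of scope here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical records for the same person from two sources. The "source"
# and "updated" attributes are assumed metadata, not a standard vocabulary.
RECORDS = """
<people>
  <person id="17" source="hr" updated="2001-03-02T10:00:00">
    <birthDate>1970-05-01</birthDate>
  </person>
  <person id="17" source="payroll" updated="2002-11-20T08:30:00">
    <birthDate>1970-05-10</birthDate>
  </person>
</people>
""".strip()


def resolve_by_timestamp(xml_text):
    """For each id, keep the record whose 'updated' time stamp is most recent."""
    newest = {}
    for person in ET.fromstring(xml_text).iter("person"):
        stamp = datetime.fromisoformat(person.get("updated"))
        key = person.get("id")
        if key not in newest or stamp > newest[key][0]:
            newest[key] = (stamp, person)
    return {key: person for key, (_stamp, person) in newest.items()}


resolved = resolve_by_timestamp(RECORDS)
winner = resolved["17"]
print(winner.get("source"), winner.findtext("birthDate"))  # payroll 1970-05-10
```

"Most recent wins" is only one possible policy. A quality-level attribute could be compared in the same way, or combined with the time stamp, depending on how far each source is trusted.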

Conclusion

XML, an evolving Web technology, is poised to help with the task of data integration and to reduce the work of reconciling heterogeneous data sources. Efforts are under way to develop a language that can specify the semantics associated with data content. The development of suitable tools for XML-based integration of heterogeneous sources is also progressing steadily.