Unified Architecture for Integrating Intelligence Data

February 10, 2010

ACM Journal of Data and Information Quality. Pending decision

Authors

Suzanne Yoakum-Stover, Ph.D.
Institute for Modern Intelligence, Executive Director
Alexandria, VA

Tatiana Malyuta, Ph.D.
Data Tactics Corp., Principal Database Architect
Alexandria, VA
New York City College of Technology, Associate Professor

Abstract

The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data into one single coherent repository of knowledge. Current practice whereby all data-models would be merged into a single “Uber-model” simply does not work. We require a solution that remains viable in a freely evolving, interdependent collective of human and computational systems, very little of which will ever be under our control. Our approach is database-centric and proceeds in stages. The first addresses the unified representation of the broad spectrum of artifacts existing within the Intelligence Enterprise today regardless of modality or structure. The second builds upon the foundation provided by the first to address the unified storage of structured data and semantic data integration. In both we embrace the diversity of data-models employed throughout the Intelligence Community. The result is a layered data architecture that can accommodate any kind of data without placing restrictions on vocabulary, structure, semantics, or constraints in a way that addresses today’s intelligence needs while providing a seamless transition path toward a future of Ultra-Large Scale (ULS) systems imbued with semantic technologies.

Introduction

The principal problem spanning the Intelligence Community today is how to integrate the great variety of disparate data stores and streams, both legacy and bleeding-edge, into one coherent repository of knowledge. Pieces of the intelligence puzzle lie scattered in data silos sequestered by the very systems that served to create them. Each of these systems, including most of today’s Army Programs of Record, was built as an end-to-end solution with its own sensors, processors, and data stores, implemented and operated to achieve a specific intelligence objective. They were never meant to interoperate, share data, or even expose data beyond a narrow mission-focused enclave. The advent of network technologies and protocols, which have effectively eliminated the physical barriers between systems, has done little to bridge the chasm between these data silos. Although we can now transfer data over the wire, disparate and utterly incompatible data-models characterized by straightforward and subtle differences in vocabulary, structure, semantics, and constraints continue to stymie data search, exploration, enrichment, and exploitation efforts. The fundamental problems of data integration remain to be solved.

Current practice in data integration, whereby all data-models would be merged or harmonized, either physically or virtually [Batini 1986, Parent 1998, Halevy 2005, Bernstein 2007], fails to accommodate the demands of our fluid and rapidly growing Intelligence Enterprise. The physical mapping of disparate models into a single canonical data-model [Omelayenko 2001] is simply untenable as the scale and complexity of their subjects quickly overwhelm our tools and methods. Federation approaches share this defect and introduce new ones [Izydor 2007, Yero 2008]. In practice, these approaches provide only the illusion of data integration as they mainly integrate data-models, not the data itself, and in so doing confine all data to a model that is incapable of adapting itself or its contents as our knowledge about the domain evolves.

In all but the most constrained situations, what begins as a perfectly neat solution for a handful of systems quickly becomes intractable with scale, exposing not only the limitations of traditional approaches, but also of our grasp of the foundations of knowledge representation itself. This phenomenon is but one early symptom of our evolution toward Ultra-Large Scale (ULS) systems [Northrop 2006] and as such, invites a completely different approach – one that remains viable in a freely evolving, interdependent collective of data sources / types / modalities / models, analytics, tools, interfaces, mission applications, perspectives, and users, very little of which will ever be under our control. Our objective is to define such a solution.

Conceptual Approach

Our approach to integrating intelligence data in a ULS systems environment is data-centric (as opposed to data-model centric) and proceeds in stages. The first addresses the unified storage of the entire spectrum of intelligence artifacts regardless of modality or representation. The second stage builds upon the foundation provided by the first to address the unified storage of structured data to enable semantic data integration. A third stage (beyond the scope of this paper) addresses unified storage of knowledge models. In all stages we embrace the diversity of domain-specific data-models employed throughout the Intelligence Community by taking a data-model agnostic approach wherein the persistence model makes the least possible commitment to any particular data-model. In the case of “raw” artifacts, this means storing each according to its native representation without the application of structural or semantic transformations. In the case of structured artifacts, it means: a) perceiving data as first-class citizens by de-coupling them from the data-model and b) using an abstract model for the unified representation and persistence of the artifact and integration semantics. A key aspect of our approach is that the character and meaning of the source data and data-model are preserved and made accessible by the data store. The result is a layered Data Architecture and Semantic Integration Framework that can accommodate any kind of data and semantics without placing restrictions on vocabulary, structure, semantics, or constraints, in a way that addresses the needs of the Intelligence Community today while providing a seamless transition path toward a future of ULS systems imbued with semantic technologies.

Scope

The types of intelligence collected by sensors and systems today span the electro-magnetic spectrum to include all manner of signals, audio, video, and images, in addition to so-called human intelligence (e.g. text artifacts such as reports, messages, web pages). Our approach to data integration supports all of these simultaneously regardless of their underlying source data-model, or lack thereof. It does not, however, prescribe a solution for data-model harmonization. In particular, our approach imposes no relationship between the data-models to which the artifacts adhere. It does, however, allow such relationships, created by external processes of any sort, to be effectively represented and persisted together.

Whereas the business of intelligence is to develop and communicate understanding (which entails the collection, exploitation, and provisioning of intelligence), intelligence business processing includes any automated activity that moves intelligence artifacts with respect to the cognitive hierarchy (see Fig.1a). This includes data collection, semantic enhancement, and fusion from data to information to knowledge, and communication / collaboration to create understanding.

Our Data Architecture and Semantic Integration Framework mirrors both the structure of the cognitive hierarchy and the operations of intelligence business processing. Built atop a collection of indigenous artifacts (see Fig. 1b), Layer 1 of the Framework supports an aspect of collection and rudimentary exploitation of artifact semantics. Layer 2 supports the processing by which data extracted from artifacts is enhanced with semantics to produce information, and the processing by which information is enhanced with richer associations to produce knowledge. Layer 3 supports the management and integration of knowledge models employed by Layers 1 and 2. Finally, Layer 4 supports human computer interfaces through which the analyst “sees” all of this intelligence. This paper focuses on Layers 1 and 2, which together support the provisioning of integrated intelligence at the level of data, information, and knowledge. Layers 3 and 4 will be the subject of a subsequent paper.

Layer 1

The broad and ever-changing spectrum of intelligence artifacts existing within the Intelligence Enterprise today reflects a nearly equally broad and ever-changing spectrum of intelligence collectors, producers, and consumers. The types of artifacts they generate vary tremendously in their modality (e.g. text, images, audio, video, signals), structure (e.g. relational, object, key-value) and representation (e.g. free text, XML, SQL, vector, raster). As this diversity is beyond our control, we term all such artifacts “indigenous” and the diversity of the external data stores and systems in which they reside the “wild.”

Layer 1 of our data integration framework addresses the integration of the entire spectrum of artifacts existing in the wild by simply collecting them together in a unified data store. The decision to bring an artifact into the Framework is recorded in Layer 1 by persisting a reference to the artifact along with a minimal set of essential metadata whose main purposes are: a) to support analysis of artifact content; and b) to provide access to the artifact from the higher layers of the framework (see Fig. 2). The original artifact may also be physically captured in an indigenous data store that sits below Layer 1; however, this is not required. In addition to the artifact reference and metadata, the Layer 1 schema also supports associations between artifacts. Any kind of relationship between artifacts can be represented since the set of predicates used to express them is not pre-defined. As described subsequently, predicates are persisted in Layer 3 of the Framework.
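As a minimal sketch of the kind of record Layer 1 persists, the following Python fragment stores artifact references with a few metadata fields and free-form associations between artifacts. The class names, field names, file locations, and the 'attachedTo' predicate are illustrative assumptions and are not taken from the Layer 1 schema of Fig. 2.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArtifactRef:
    """A reference to an indigenous artifact plus minimal metadata (hypothetical fields)."""
    artifact_id: str      # hypothetical local identifier
    location: str         # where the artifact resides in the "wild" (hypothetical URI)
    modality: str         # e.g. "text", "image", "audio"
    representation: str   # e.g. "free text", "XML", "raster"


@dataclass
class Layer1Store:
    """An in-memory stand-in for the unified Layer 1 data store."""
    artifacts: dict = field(default_factory=dict)
    associations: set = field(default_factory=set)  # (subject_id, predicate, object_id)

    def register(self, ref):
        # Recording the reference is what brings the artifact into the framework.
        self.artifacts[ref.artifact_id] = ref

    def associate(self, subject_id, predicate, object_id):
        # The predicate is free-form: the set of predicates is not pre-defined.
        if subject_id not in self.artifacts or object_id not in self.artifacts:
            raise KeyError("both artifacts must be registered first")
        self.associations.add((subject_id, predicate, object_id))


store = Layer1Store()
store.register(ArtifactRef("a1", "file:///wild/messages/msg1.txt", "text", "free text"))
store.register(ArtifactRef("a2", "file:///wild/images/photo7.jpg", "image", "raster"))
store.associate("a2", "attachedTo", "a1")  # hypothetical predicate
```

Note that the artifacts themselves are never copied or transformed here; only references and associations are held, which mirrors the statement that physical capture of the original artifact is optional.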

Layer 2

Structured Data

Every analyst engaged in intelligence processing either creates or uses structured data. Just as we do not control the sources or format of indigenous artifacts, we also do not control the various methods by which such artifacts might be structured or the data-models employed therein. Thus, just as the objective of Layer 1 is to represent the diversity of indigenous artifacts regardless of type or format, the objective of Layer 2 is to accommodate the diversity of all structured data regardless of vocabulary, organization, representation, or semantics.

Structured data necessarily adheres to some sort of model, which in general specifies vocabulary, organization, semantics, and constraints. Though not all data-models specify all of these, at minimum, every structured artifact entails a vocabulary reflecting a set of entity types (e.g. person, message) and an organization reflecting their relationships (e.g. message to person). These basic elements are illustrated in the simplified example of Fig. 3. Part (a) of the figure shows a short unstructured text message, and part (b) shows a data-model according to which a message might be structured. Part (c) then shows the original message structured according to the data-model and part (d) shows how that structured message might typically be persisted in a database.

Notice how the database schema is tightly coupled to the data-model that was used to structure the data, and how the raw message is bound to the data-model by the database. In effect, the data-model is imposed on the database, and the data itself is frozen into it such that no additional attributes or relationships are possible (without modifying the database schema). This is a severe shortcoming considering the tremendous variety of ways in which a given artifact might be structured or enhanced with additional attributes and associations. Even for the simple case shown in the figure, we can easily imagine data-models that use different entities (e.g. ‘Individual’ instead of ‘Person’), different relationships (e.g. ‘Sender’ instead of ‘From’), and different organizations (e.g. by including ‘MessageDate’), not to mention the wealth of other information external to the message itself (e.g. about ‘Tanya’) that might be brought to bear.

In a ULS systems environment, it is simply unreasonable to presume that the data-models or the various processes, either automated or manual, that structure data can be controlled or constrained. It is also unreasonable to presume that it is possible to anticipate the totality of their breadth or their application. To the contrary, the urgency and diversity driving our Intelligence Enterprise essentially guarantee that as many different methods for extracting entities, relationships, and events will be brought to bear as our imaginations and increasingly powerful technologies can support. Thus, although we might like to enhance Layer 1 of our Data Architecture and Semantic Integration Framework by exposing all possible extracted elements along with their properties and attributes in order to support efficient information retrieval and broad application, introducing an ever expanding array of fields and tables into a database is as impractical as attempting to accommodate every kind of data and purpose within a single canonical data-model.

The challenge, therefore, is to build the next layer of the Framework to accommodate structured data in a way that exposes that structure for use, without imposing the structure on the data store itself. In other words, we must determine a method for storing and managing any kind of structured data, reflecting any data-model, so that it can be shared, efficiently exploited, and extended in unforeseen ways without requiring model-specific storage implementations. In short, we seek a universal, domain-neutral storage model for structured data.

Data-Model Abstraction

The key to devising a domain-neutral storage model for structured data is to decouple what varies, namely vocabularies and, more generally the data-models, from that which remains constant, namely the source artifact, and ideally the storage structure. To achieve this, we consider structure, vocabulary, semantics, and constraints from a higher level of abstraction from which we then distill a minimal set of elements sufficient to capture any data-model. These are defined as follows:

Sign: A sign, g, is a representation of a chunk of data, either physically located within a tangible artifact, or contained within an analyst’s mind. Examples of the former include a string of text in a document; an object within an image; a segment of audio in an audio stream; a spike in a signal. As illustrated in Fig. 4, regardless of the type of medium, a sign for tangible data is always associated with a physical extent within the artifact and has a quantifiable span, which we call a mention. In contrast, signs that reside in an analyst’s mind become tangible only when she writes down her thoughts. We explicitly include such intangible signs here to support the analyst’s ability to assert information directly into the data store without having to first represent it in a physical artifact. The set of all signs, G = {gi}, spans across all data sources. In the set, each element is unique: ∀ i,j (i ≠ j) → (gi ≠ gj). G is the construct by which data are represented. From the text data shown in Fig. 4, signs G’ = {‘Suzi’, ‘Tanya’, ‘July 4, 2007’, ‘Bring lunch’, ‘Message1’} contribute to G (i.e. G’ ⊆ G), though many more signs may be identified even from this simple example.

Concept: A concept, c, is a representation of an abstract idea, defined explicitly or implicitly by a source data-model. For example, the nodes of an ontology, the tag set in an XML Schema Document (XSD), and the attribute / table names in a relational database all represent concepts. In the set of all concepts C = {ci}, each element is unique: ∀ i,j (i ≠ j) → (ci ≠ cj). From the text data shown in Fig. 4, concepts C’ = {‘Message’, ‘Person’, ‘Body_text’} contribute to the full set of concepts C (i.e. C’ ⊆ C).

Predicate: A predicate, p, is a representation of an abstract idea used to express a relationship between “things.” Predicates are used in the formation of statements (described below) and may be defined either explicitly or implicitly by a source data-model. For example, the arcs of an ontology, and the attributes of an XML or database schema represent predicates. In the set of all predicates P = {pi}, each element is unique: ∀ i,j (i ≠ j) → (pi ≠ pj). The text example of Fig. 4 contributes predicates P’ = {‘To’, ‘From’, ‘Body’} to the set of all predicates P (i.e. P’ ⊆ P). The only predicate that is “built into” (i.e. defined by) our storage model is the ‘IsInstanceOf’ predicate, which is used to disambiguate signs to form terms as described below. Concepts and predicates are the constructs by which we link to data-models and, thereby, explicitly expose data-semantics.

Term: A term, tij, is an ordered pair <gi, cj> where gi ∈ G and cj ∈ C. Each term represents a disambiguated sign. The process of disambiguation associates a sign with a concept using the ‘IsInstanceOf’ predicate (though not every sign from G is necessarily disambiguated, and not every concept from C is necessarily used for disambiguation). In the set of all terms T = {tij}, each element is unique: ∀ i,j,k,l (i ≠ k or j ≠ l) → (tij ≠ tkl). The text example of Fig. 4 contributes terms T’ = {t1, t2, t3, t4} where t1 = <‘Suzi’, ‘Person’>, t2 = <‘Tanya’, ‘Person’>, t3 = <‘Bring lunch’, ‘Body_text’>, t4 = <‘Message1’, ‘Message’> to the complete set of terms T (i.e. T’ ⊆ T).

Statement: A statement, s, encodes a binary relationship between a subject and an object mediated by a predicate. A statement is represented by an ordered triple sijh = <subjecti, predicatej, objecth>. Among the set of all statements, each element is unique: ∀ i,j,h,l,m,n (i ≠ l or j ≠ m or h ≠ n) → (sijh ≠ slmn). In our model, subject and object may be either a term or a statement. The simplest kind of statement is one in which subject and object are terms: s0ijh = <ti, pj, th>. Statements in which the object is itself another statement represent reifications: s1klm = <tk, pl, sm>. Finally, a statement in which both subject and object are other statements represents a relationship between statements: s2xyz = <sx, py, sz>. The set of all statements is S = {s0ijh} ∪ {s1klm} ∪ {s2xyz}. The text example of Fig. 4 shows three statements: S’ = {<t4, ‘To’, t1>, <t4, ‘From’, t2>, <t4, ‘Body’, t3>}, all with the same subject, which is the term corresponding to the message itself. These statements contribute to the set of all statements, i.e. S’ ⊆ S.
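The five constructs can be sketched as plain data types. The following Python fragment is an illustrative encoding of the Fig. 4 example, not the storage model itself; the type names are assumptions, and the analyst sign 'Analyst7' and the 'Asserts' predicate are hypothetical additions used only to show a reification.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A disambiguated sign: the pair <g, c> carries 'IsInstanceOf' semantics."""
    sign: str
    concept: str


@dataclass(frozen=True)
class Statement:
    """An ordered triple whose subject and object are terms or other statements."""
    subject: Union[Term, "Statement"]
    predicate: str
    object: Union[Term, "Statement"]


# Signs, concepts, and predicates contributed by the text example.
signs = {"Suzi", "Tanya", "July 4, 2007", "Bring lunch", "Message1"}
concepts = {"Message", "Person", "Body_text"}
predicates = {"To", "From", "Body"}

t1 = Term("Suzi", "Person")
t2 = Term("Tanya", "Person")
t3 = Term("Bring lunch", "Body_text")
t4 = Term("Message1", "Message")

# Sets enforce the uniqueness condition stated for each construct.
statements = {
    Statement(t4, "To", t1),
    Statement(t4, "From", t2),
    Statement(t4, "Body", t3),
}

# Reification: the object of a statement is itself a statement.
analyst = Term("Analyst7", "Person")                       # hypothetical sign
reified = Statement(analyst, "Asserts", Statement(t4, "From", t2))  # hypothetical predicate
statements.add(reified)

# Re-asserting an existing triple does not create a second element.
statements.add(Statement(t4, "To", t1))
```

Because the dataclasses are frozen, equal pairs and triples hash identically, so Python sets give the "each element is unique" property without any extra bookkeeping.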

Note that the above definitions are formulated to be clear and unambiguous with respect to our particular approach and may not match those found in other literature. Throughout the paper, we will denote instances of signs, concepts, predicates, terms, and statements using Arial font within single quotes (e.g. ‘person’).

DDF

Abstracted from the milieu of all possible data-models, these elementary constructs (concept, predicate, sign, term, and statement) provide the fixed-points of a data reference model that will ultimately form the basis of a practical Data Architecture and Semantic Integration Framework which we call the Data Description Framework (DDF). Despite its simplicity, the DDF is an amazingly rich model that can be viewed from at least two different perspectives. From one perspective, the DDF encompasses a synergistic combination of two higher order models lying along different dimensions of abstraction – one that is outward-looking (“extrospective”), one inward-looking (introspective).

The “extrospective” portion of the model is a meta-model formed by (a) C and P, which look outward to domain knowledge (represented in data / knowledge models), and (b) G, which looks outward toward the data. Signs bring data into the DDF as first class entities which may then participate in various, unlimited conceptualizing relationships created by any sort of automated or manual process at any time. Signs provide a fundamental level of data integration (that traditional approaches lack) resulting from having eliminated data-model barriers. Concepts and predicates are to domain knowledge what signs are to data. They are the mechanism by which such knowledge (typically encoded in domain-specific data / knowledge models) is linked into the DDF and exposed by our Data Integration Framework for use and re-use.

The introspective portion of the model is a semantic model formed by T and S which abstract data-model internals to expose structure in a uniform way. Terms bind signs to concepts, exposing the meaning of the data unambiguously with respect to the original source data-model. Statements represent semantic relationships about, within, and between disambiguated data elements.

Together the introspective and “extrospective” models that comprise the DDF enable both horizontal and vertical data integration. The “extrospective” abstraction bridges data and domain knowledge (vertical integration). The introspective abstraction bridges data structured by various disparate processes (horizontal integration) and binds the two outward looking faces of the “extrospective” model to provide a comprehensive data integration model.

From the second perspective, the DDF may be regarded as a synergistic combination of two interaction patterns – one that decouples, one that binds. DDF achieves decoupling in two ways. First, as a higher order data-model abstraction, DDF effectively decouples data from data-models. Thus, the DDF can encapsulate any sort of data regardless of the source data-model. Second, as a higher order data-structure, DDF effectively decouples structured data from data storage structures. Thus, the DDF can accommodate any data regardless of the source storage structure. As a result, the DDF provides a practical foundation for implementing a stable database that can accommodate any sort of structured data.

The ways in which DDF implements binding are illustrated in Fig. 5. Specifically, sign g binds with concept c to form term t, and predicate p binds with term t to form statement s. The diagram also indicates that a predicate may bind a term and a statement to form a reification, or a predicate may bind a statement with another statement to form a statement relationship. These bindings allow data to be integrated within and across data-models and continuously enriched into knowledge.

Together these interaction patterns make the DDF a powerful yet practical platform for data integration. Decoupling gives DDF the character of a universal data store and successive bindings progressively move intelligence artifacts (or their constituent elements) upward through the cognitive hierarchy. The result is a universal data integration and semantic enrichment platform that supports data structured by any means, unrestricted associations within and between them, and increasingly rich semantics.

Representational Power

Although the expressiveness of the DDF is sufficient to capture the data and data-semantics of any structured data source, we illustrate this for the relational model since it is the most commonly used. Similar arguments can be made for other model types, such as hierarchical, object-oriented, and graph.

In accordance with common relational formalism [Date 2004], a relation R is defined by the set of attributes A = {Ai} (1 ≤ i ≤ n). The subset of attributes that comprise the primary key are denoted as K = {Kl} (1 ≤ l ≤ k), K ⊆ A. The set of all data values in R is D = {dij}, where dij is a value on the intersection of attribute Ai and row Wj (1 ≤ j ≤ m). We can integrate data and its original semantics from R into a DDF data space consisting of G0, C0, P0, T0, and S0 according to the following procedure:

  • All attributes of R are added to the set of concepts:

    C = C0 ∪ A

  • Non-key attributes are added to the set of predicates:

    P = P0 ∪ (A – K)

  • D’ = {d’i} is the set of unique values of D: ∀ i,j (i ≠ j) → (d’i ≠ d’j). The values in D’ that are not already present in G0 are added to the set of signs:

    G = G0 ∪ (D’ – G0)

  • We build the set of terms TR = {tij} where tij = <dij, Ai> and 1 ≤ i ≤ n, 1 ≤ j ≤ m. T’R is the subset of unique terms of TR. Terms of T’R are added to T0:

    T = T0 ∪ T’R

  • We build the set of statements SR = {sij} where sij = < <dkj, K>, Ai, <dij, Ai> >, dkj represents the combination of values of the key attributes for the row Wj, Ai ∈ A – K, and k+1 ≤ i ≤ n, 1 ≤ j ≤ m. Statements of SR are added to S0:

    S = S0 ∪ SR
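The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the encoding of terms and statements as tuples, and the sample relation with attributes 'MsgId', 'To', 'From', and 'Body' are all hypothetical.

```python
def ingest_relation(attributes, key, rows, G0=(), C0=(), P0=(), T0=(), S0=()):
    """Integrate a relation R into a DDF data space (G, C, P, T, S).

    attributes: attribute names A; key: the key attributes K (a subset of A);
    rows: one dict per row W_j mapping attribute name to value.
    """
    non_key = [a for a in attributes if a not in key]

    C = set(C0) | set(attributes)                               # C = C0 ∪ A
    P = set(P0) | set(non_key)                                  # P = P0 ∪ (A – K)
    G = set(G0) | {row[a] for row in rows for a in attributes}  # G = G0 ∪ (D’ – G0)
    T = set(T0) | {(row[a], a) for row in rows for a in attributes}  # T = T0 ∪ T’R

    S = set(S0)                                                 # S = S0 ∪ SR
    for row in rows:
        # <dkj, K>: the combination of key values paired with the key itself.
        key_term = (tuple(row[k] for k in key), tuple(key))
        for a in non_key:
            S.add((key_term, a, (row[a], a)))
    return G, C, P, T, S


# Hypothetical relation used only for illustration.
attributes = ["MsgId", "To", "From", "Body"]
key = ["MsgId"]
rows = [
    {"MsgId": "Message1", "To": "Suzi", "From": "Tanya", "Body": "Bring lunch"},
    {"MsgId": "Message2", "To": "Tanya", "From": "Suzi", "Body": "Bring lunch"},
]

G, C, P, T, S = ingest_relation(attributes, key, rows)
```

In this example the value 'Bring lunch' occurs in two rows but yields a single sign and a single term, while each row still contributes its own statements keyed by its 'MsgId' value.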

Representation of R in DDF is lossless (no loss or distortion of data and semantics, even though semantics of R is not explicitly represented in DDF) because we can restore R from DDF:

  1. R is contained in statements S, therefore, using processing metadata (described in the following section and shown in Fig. 6), extract from S the statements that originated from R:

    SR = {sij} where sij = < <dkj, K>, Ai, <dij, Ai> >

  2. From SR restore the structure and rows of R as follows:

    K        Ak+1      Ak+2      . . .    An
    dk1      dk+1,1    dk+2,1    . . .    dn1
    dk2      dk+1,2    dk+2,2    . . .    dn2
    . . .    . . .     . . .     . . .    . . .
    dkm      dk+1,m    dk+2,m    . . .    dnm

The process that was used to build combinations of values of the key attributes can be reversed to get to the relation in its original form:

    A1       . . .    Ak       Ak+1      Ak+2      . . .    An
    d11      . . .    dk1      dk+1,1    dk+2,1    . . .    dn1
    d12      . . .    dk2      dk+1,2    dk+2,2    . . .    dn2
    . . .    . . .    . . .    . . .     . . .     . . .    . . .
    d1m      . . .    dkm      dk+1,m    dk+2,m    . . .    dnm

Therefore, by the integration procedure described above, the data and data-semantics from R are faithfully represented with the DDF. The structure of R itself and its identity integrity are explicitly captured in Layer 3.
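The restoration argument can likewise be sketched in Python. The fragment below assumes that the statements originating from R have already been selected (in the paper this selection relies on process metadata) and that statements are encoded as nested tuples of the form < <key values, K>, Ai, <dij, Ai> >. The function name and the sample data are hypothetical.

```python
def restore_relation(statements, key):
    """Rebuild the rows of R from statements of the form
    ((key_values, key_attrs), predicate, (value, attribute))."""
    rows = {}
    for (key_values, key_attrs), predicate, (value, _attr) in statements:
        if key_attrs != tuple(key):
            continue  # not a statement about this relation's key
        # Reverse the key combination: one column per key attribute.
        row = rows.setdefault(key_values, dict(zip(key_attrs, key_values)))
        # Each non-key predicate names the column that holds the object value.
        row[predicate] = value
    return [rows[k] for k in sorted(rows)]


# Statements assumed to have been extracted for a hypothetical relation
# with key 'MsgId' and non-key attributes 'To', 'From', 'Body'.
S_R = {
    ((("Message1",), ("MsgId",)), "To", ("Suzi", "To")),
    ((("Message1",), ("MsgId",)), "From", ("Tanya", "From")),
    ((("Message1",), ("MsgId",)), "Body", ("Bring lunch", "Body")),
    ((("Message2",), ("MsgId",)), "To", ("Tanya", "To")),
    ((("Message2",), ("MsgId",)), "From", ("Suzi", "From")),
    ((("Message2",), ("MsgId",)), "Body", ("Bring lunch", "Body")),
}

restored = restore_relation(S_R, key=["MsgId"])
```

Row order is not preserved, which is consistent with the relational model, where a relation is an unordered set of rows.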

This procedure further reveals two powerful and distinguishing features of the DDF:

  • The DDF can accommodate data and data-semantics from structured sources without loss or distortion.
  • Structured artifacts can be integrated within the DDF in a mechanical fashion without requiring prior knowledge of, or analysis of, their domain-specific data-models.

Integration Power

The DDF exposes four levels of data and semantics (signs, terms, statements, and concepts and predicates), which support four levels of largely independent integration actions or patterns: establishing signs, disambiguation, association, and data-model enhancement:

  • Signs are established from mentions asserted by users or data ingest processes. Many mentions may relate to the same sign, and it is this “re-use” of signs that provides a primal level of data integration.
  • Disambiguation actions create associations between signs and concepts, either when data is ingested into the DDF, or by a subsequent semantic process. The same sign may be disambiguated in any number of different ways and signs and concepts may be associated regardless of their originating data source. In other words, the sign and the concept to which it is bound may originate from disparate sources.
  • Association actions create binary statements between terms and/or statements using a predicate. These may originate from semantic relationships expressed within a data source or may be created by a subsequent semantic process. Any term / statement may be associated to any other regardless of their origin.
  • Finally, data-model enhancement occurs when processes operating on Layer 3 extend, enhance, or harmonize data-models associated with the incorporated data sources. Since these operations do not affect the original data-models, but serve to establish new overarching data-models, multiple perspectives can co-exist (e.g. both the original and enhanced).

By virtue of these integration actions, the DDF is able to support an essentially endless process of semantic enrichment and each of these actions may be conducted on any integration level without necessarily affecting the other elements.
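The independence of the first three actions can be illustrated with a short Python sketch. The artifact identifiers, character offsets, and the 'Knows' predicate are hypothetical; the 'Individual' concept is the alternative vocabulary mentioned earlier for 'Person'.

```python
from collections import defaultdict

# Establishing signs: several mentions, possibly from different artifacts,
# resolve to the same sign (hypothetical artifact ids and offsets).
mentions = [
    ("msg1", 0, 4, "Suzi"),
    ("msg2", 10, 14, "Suzi"),
    ("report9", 120, 125, "Tanya"),
]
sign_to_mentions = defaultdict(list)
for artifact_id, start, end, text in mentions:
    sign_to_mentions[text].append((artifact_id, start, end))

# Disambiguation: the same sign may be bound to concepts from different
# data-models without touching the sign or its mentions.
terms = {
    ("Suzi", "Person"),        # concept from one source data-model
    ("Suzi", "Individual"),    # concept from another source data-model
    ("Tanya", "Person"),
}

# Association: a later process links existing terms with a predicate
# (hypothetical) without modifying the signs or terms themselves.
statements = {
    (("Suzi", "Person"), "Knows", ("Tanya", "Person")),
}

suzi_concepts = sorted(concept for sign, concept in terms if sign == "Suzi")
```

Each collection grows on its own: adding the second disambiguation of 'Suzi' or the new statement leaves the sign table and its mentions unchanged.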

Evolutionary Power

The independence of the integration actions imbues the DDF with extraordinary evolutionary power. New sources, modifications of the integrated data sources, and changes of integration models can be accommodated without requiring the data space to be rebuilt. In particular, the growth of a source system, both in terms of data and semantics, can be accommodated by the addition of new DDF elements (signs, terms, statements, concepts, and predicates) associated with that growth. Source system modifications can be handled in several ways depending on how the integrated store is used. Because the dependencies between all of the elements in the Framework are known, we can always define how to propagate / manage changes and the approaches employed may vary from source to source. Finally, new integration models can be introduced by the integration actions described previously and these will simply co-exist with all other models. If an integration model needs to be modified, we can introduce it as a new model or mark the changed elements appropriately and propagate those changes.

DDF in Relationship to RDF

The reader familiar with the Resource Description Framework (RDF/RDFS) may wonder what is different here. Indeed, RDF and DDF share DNA, so to speak, since both employ a similar level of abstraction and make semantics explicit. Unlike RDF, however, DDF also prescribes the exposure of data as signs which can freely participate in the disambiguations and associations necessary for data integration. In other words, data represent themselves directly as signs and participate in the DDF as first-class citizens. Because of the way they are designed, DDF signs provide a primal level of data integration. In contrast, data in RDF are represented either as literals or by URIs. A datum represented as an RDF literal cannot be explicitly disambiguated or associated. Furthermore, because the URI is a first-class citizen, not the datum, there is no mechanism in RDF to prevent a single datum from being represented by multiple URIs and/or literals.

This is not a criticism of RDF as these differences reflect the fact that DDF is an abstract model aimed at data integration, whereas RDF is a meta-model. Thus, employing RDF for data integration necessitates building a model in RDF (i.e. a particular meta-model instance) along with rules prescribing the manner of data exposure. In contrast, DDF is a model that makes explicit commitments to support data integration. Because this model represents an abstraction over domain data-models, the DDF can represent data structured by any data-model, and be represented in any meta-model (including RDF). Fig. 6 illustrates the place of DDF in relation to the models and meta-models.

Implementation

A universal storage model based on DDF can be implemented in a variety of ways (e.g. objects, relations, triples). We have explored and implemented two approaches, the first using relational technology (Oracle, MySQL) and the second using cloud technology (Hadoop / HBase) [Hadoop], [HBase], [Chang 2006]. In both cases we employ the Dimensional Data Modeling (DDM) approach [Kimball 2002] because it nicely captures the business processes associated with moving intelligence artifacts upward through the cognitive hierarchy while accommodating the metadata that intelligence processing requires. In particular, we maintain not only the contextual metadata about the indigenous artifact itself (e.g. the who, what, when, and where of its creation and transmission), but also process metadata regarding the processing by which signs, terms, and statements are created. The former are captured in the Layer 1 storage structure as described previously while the latter are accommodated in Layer 2.

As reflected in Fig. 7, the essential intelligence business processes that the DDF captures are semantic disambiguation and association formation. Thus, the DDF storage model consists of two main fact-tables, SemanticFact and AssociationFact. The SemanticFact table records metrics relating to the formation and disambiguation of signs, and references dimension tables that record signs, concepts, and process metadata. The signs themselves are represented using two tables, Sign and Mention. The value of a mention is identified by the region of the artifact in which it is localized. The boundary of such a region is recorded in the Mention table. The value of a sign may represent any number of source mentions that are exactly the same or are considered to be the same from the perspective of the process which extracts / identifies them. The Concept dimension records elements from the domain knowledge which includes the source artifacts’ data-models. Each record in the SemanticFact table binds a sign to a concept using ‘isInstanceOf’ semantics.
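To make the SemanticFact star more concrete, the following is a minimal sketch using Python's built-in sqlite3 module. Only the table names (Sign, Mention, Concept, SemanticFact, and a ProcessMetadata dimension) come from the text; every column name, the single "confidence" metric, and the sample rows are assumptions made for illustration and should not be read as the schema of Fig. 7.

```python
import sqlite3

# Hypothetical columns: the text names the tables and their roles only.
SCHEMA = """
CREATE TABLE Sign (
    sign_id INTEGER PRIMARY KEY,
    value   TEXT NOT NULL UNIQUE           -- one record per distinct sign value
);
CREATE TABLE Mention (
    mention_id  INTEGER PRIMARY KEY,
    sign_id     INTEGER NOT NULL REFERENCES Sign(sign_id),
    artifact_id INTEGER NOT NULL,          -- Layer 1 artifact holding the region
    span_begin  INTEGER NOT NULL,          -- boundary of the region
    span_end    INTEGER NOT NULL
);
CREATE TABLE Concept (
    concept_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    data_model TEXT NOT NULL               -- data-model the concept belongs to
);
CREATE TABLE ProcessMetadata (
    process_id INTEGER PRIMARY KEY,
    creator    TEXT NOT NULL,
    created_on TEXT NOT NULL
);
CREATE TABLE SemanticFact (                -- binds a sign to a concept
    sign_id    INTEGER NOT NULL REFERENCES Sign(sign_id),
    concept_id INTEGER NOT NULL REFERENCES Concept(concept_id),
    process_id INTEGER NOT NULL REFERENCES ProcessMetadata(process_id),
    confidence REAL                        -- illustrative metric only
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Illustrative rows: one sign backed by two mentions in the same artifact.
conn.execute("INSERT INTO Sign VALUES (1, 'IBM')")
conn.execute("INSERT INTO Mention VALUES (1, 1, 100, 0, 3)")
conn.execute("INSERT INTO Mention VALUES (2, 1, 100, 39, 42)")
conn.execute("INSERT INTO Concept VALUES (1, 'Company', 'demo-type-system')")
conn.execute("INSERT INTO ProcessMetadata VALUES (1, 'demo-extractor', '2010-02-10')")
conn.execute("INSERT INTO SemanticFact VALUES (1, 1, 1, 0.9)")

# For each sign/concept binding, count the source mentions behind the sign.
rows = conn.execute("""
    SELECT s.value, c.name, COUNT(m.mention_id)
    FROM SemanticFact f
    JOIN Sign s    ON s.sign_id = f.sign_id
    JOIN Concept c ON c.concept_id = f.concept_id
    JOIN Mention m ON m.sign_id = s.sign_id
    GROUP BY s.value, c.name
""").fetchall()
```

In this sketch the Mention table points at Sign, so a single sign value can stand for any number of localized mentions, which is one plausible reading of the relationship described above.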

The AssociationFact table records metrics relating to the formation of associations and references dimension tables that record statements, predicates, and process metadata. Recall that statements come in three types: an association between terms (i.e. statement), an association between a term and another statement (i.e. reification), and an association between two statements (statement relation). These are accommodated by the three subclasses of the Statement dimension, which are Statement0, Statement1, and Statement2 respectively. The Predicate dimension records predicates from the domain knowledge.
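The three statement types can be sketched as plain Python data classes. The class names Statement0, Statement1, and Statement2 follow the text; the field names, the Term class layout, and the example predicates ("worksFor", "asserts", "contradicts") are hypothetical and serve only to show how the types nest.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A <sign, concept> pair."""
    sign: str
    concept: str


@dataclass(frozen=True)
class Statement0:
    """Statement: an association between two terms."""
    subject: Term
    predicate: str
    object: Term


@dataclass(frozen=True)
class Statement1:
    """Reification: an association between a term and a statement."""
    subject: Term
    predicate: str
    object: "Statement"


@dataclass(frozen=True)
class Statement2:
    """Statement relation: an association between two statements."""
    subject: "Statement"
    predicate: str
    object: "Statement"


Statement = Union[Statement0, Statement1, Statement2]


def statement_kind(statement: Statement) -> str:
    """Return the name the text gives to each Statement subclass."""
    kinds = {
        Statement0: "statement",
        Statement1: "reification",
        Statement2: "statement relation",
    }
    return kinds[type(statement)]


# Hypothetical example data.
alice = Term("Alice", "Person")
ibm = Term("IBM", "Company")
report = Term("Report 17", "Document")

works_for = Statement0(alice, "worksFor", ibm)
resigned = Statement0(alice, "resignedFrom", ibm)
sourced = Statement1(report, "asserts", works_for)
conflict = Statement2(works_for, "contradicts", resigned)
```

Because a Statement1 or Statement2 may itself refer to any other statement, provenance and conflicting claims can be recorded without altering the statements they describe.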

The ProcessMetadata package shown in Fig. 7 represents a collection of dimensional tables used to record operational and contextual metadata about the various external processes that create SemanticFact and AssociationFact records. The particular elements and formulation of this metadata would be designed to support the information assurance needs of the Intelligence Community. Typically these would include Date, Time, Creator, and SecurityClassification dimensions.

The DDF does not prescribe or constrain the processing by which the DDF storage model would be populated, and the nature of such processing depends both on the modality and structure (or lack thereof) of the indigenous artifacts. Nevertheless, to illustrate how DDF works, and provide more insight into the relationship between external processes and our Data Integration Framework, the interested reader may find a brief discussion of the processing by which Layers 1 and 2 would be populated in the Appendix.

Relation to Other Approaches

Data quality professionals widely recognize the importance of data integration and the need for efficient data integration approaches to redress a panoply of data quality problems [Lee 2006].

A large body of work exists on data integration approaches [Batini 1986], [Parent 1998], [Halevy 2005], [Bernstein 2007], many of which have contributed to successful Enterprise Information Integration solutions. However, because they are all based on some kind of data-model harmonization (i.e. mapping), they fail to provide a practical solution for ULS intelligence data integration. In particular, data-model integration does not address data integration, which intelligence data processing requires. Physical data integration, typical of data warehouse applications, likewise requires heavy up-front data-model analysis and harmonization. This activity is not only resource intensive, it often results in the loss and/or distortion of data and its semantics which, in the context of intelligence, may reduce the richness and power of the data. DDF addresses the needs of the Intelligence Community by supporting ad-hoc, lossless data integration without imposing a heavy pre-processing burden.

The Dataspaces approach introduced by Halevy et al. [Franklin 2005], [Halevy 2005], [Halevy 2006-1], [Halevy 2006-2] is similar in philosophy to the DDF in that it supports the co-existence of disparate data sources regardless of their type.

“Dataspaces are not a data integration approach; rather, they are more of a data co-existence approach. The goal of Dataspaces support is to provide base functionality over all data sources, regardless of how integrated they are” [Franklin 2005].

With this approach, a mediated (general, global) schema serves as an integration model. Individual sources participate in the integration by exposing Local-As-View (LaV) schemas that comply with the mediated schema. In practice, a LaV is implemented as a view on the source, and the mediated schema provides an interface to Dataspaces participants. To support the storage of new associations between data (a shortcoming of virtual integration) it may be necessary to introduce a Local Store for these associations. DDF and Dataspaces represent two different approaches to data management and have different niches: Dataspaces focuses on eliminating the barriers to data access and provides some limited data integration capability, whereas DDF focuses on comprehensive data integration and supports deep semantic enrichment. The DDF can leverage the Dataspaces as a mechanism of data access, and participate within Dataspaces as a semantically rich Integrated Local Store.

A number of commercial products, including some used in military applications, e.g. Palantir [Palantir], are based on the object meta-model where data entities are represented by objects with dynamic properties. Although they claim to be able to accommodate data from any structured or unstructured source, in practice they impose a particular, albeit modifiable, data-model on the structured data. This requires heavy pre-processing of the source, as is typical for such solutions, and results in loss and/or distortion of source data and data semantics. In addition, customers are dependent on the solution provider as they cannot perform modifications of the integration model.

The Information Model Interoperability Reference Model [Melnik 2000], [Omelayenko 2001], proposed for presenting information on the web, consists of three layers – syntax, object, and semantic. The syntax layer represents serialized data content, similar to our indigenous text artifacts. The semantic layer provides semantics through data-models and languages, and the object layer provides a bridge between the two. Unfortunately however, the IMI does not provide a practical model for the implementation of those layers and their interfaces.

The Data Reference Model (DRM) of the Federal Enterprise Architecture (FEA) aims to provide standards for the description, categorization, and sharing of data [DRF 2005]. Unlike DDF however, it does not resolve the issues of data integration and unfortunately exhibits the typical shortcomings of most physical and virtual data integration approaches.

Finally, the Common Warehouse Model (CWM) [CWM 2001] offers a standardized approach (and tools that support it) for representing and mediating the automated interchange of metadata in warehouse applications that involve multiple data sources and data processing applications. Being focused on metadata integration, as opposed to data integration, the CWM mainly addresses issues relating to Layer 3 of our Data Integration Framework.

Current & Future Work

Today there is a deployed system called the Joint Intelligence Operational Capability in Iraq (JIOC-I) that essentially implements Layer 1 of our Data Integration Framework, though only for text artifacts. Unfortunately, the JIOC-I by itself falls short of a complete integration solution because it does not address structured data in a way that exposes that structure to support further analytical processing and visualization. In other words, it lacks Layer 2. Consequently, there has been much criticism of the JIOC-I, along with various suggestions for “fixing” it (e.g. by extending the schema to accommodate structured data). In contrast, we recognize the JIOC-I as a foundational element (that got it mostly right) and a first step toward a ULS intelligence system that integrates data while embracing data diversity. Indeed, the JIOC-I was the inspiration that led us to develop the layers above, and the DDF in particular.

While several relational database implementations of Layers 1 and 2 of our Data Architecture and Semantic Integration Framework are being developed and tested in various US Army CERDEC I2WD projects, our most ambitious implementation is being made for the Army using cloud computing technology. Our current effort on a 52-node cloud implements all three Layers of the data architecture using the Hadoop Distributed File System (HDFS) and HBase to achieve an Ultra-Large Scale, unified “dataspace” that supports not only a diversity of data, but also a diversity of processing (Hadoop Map / Reduce [Dean 2004] analytics). Although we reserve the performance metrics and details of the DDF implementation within the cloud computing key-value meta-model for a separate paper, it is interesting to note that cloud appears to accommodate the DDF in a particularly efficient manner.

Other key aspects of our Data Integration Framework are described elsewhere. [Yoakum 2008 IQIS] highlights the low barrier to entry for data integration by describing the process for lossless mechanical data ingestion which requires no costly pre-processing or data-model harmonization. Data surfing, drilling, and discovery on the DDF unified data space are described in [Yoakum 2008 DAMA]. [Yoakum 2008 SIMA] addresses the utility of DDF in Situation Management – another activity that requires rapid, ad-hoc data integration. Finally, [Yoakum 2009] describes the formation of a contiguous and therefore navigable integrated data space that enables “vertical” integration from artifacts to structured data and to knowledge models, and “horizontal” integration of data from various sources.

As they mature, we anticipate that Layers 3 and 4 of the Framework will provide fertile ground for entirely new work in knowledge interaction and perception. Layer 3 serves as a universal substrate on which to explore, discover, and encode relationships between knowledge models that go well beyond harmonization and integration to include, for example, dissonant perspectives which cannot and should not be “harmonized.” Layer 4 provides the lenses through which the human user looks into this cauldron of knowledge, information, and data to explore and make sense of the object of his interest (e.g. a domain, a situation, an entity) according to a chosen perspective. Having all four layers present will close the loop between data and knowledge in both directions so that they may co-evolve to yield more complete and accurate understanding. Atop the immense foundation of integrated data provided by Layers 1 and 2, Layers 3 and 4 will fuel the engines of ULS systems research for a very long way into the future.

Conclusion

The Intelligence Enterprise is inexorably evolving into an Ultra-Large Scale systems world that cannot, and will not, be constrained in its processes or products. The data integration problem is but one early symptom of this burgeoning reality. Although this knowledge does not provide a recipe for good solutions, it makes it rather easy to spot bad ones. Unfortunately, current data integration approaches generally represent the latter.

In this paper, we have presented the first two layers of a multi-layer Data Integration and Semantic Enrichment Framework that enables deep semantic data integration in a ULS systems environment. The model on which it is founded, the DDF, supports both horizontal and vertical data integration (i.e. across disparate data-models and from data to knowledge) by embracing the diversity of data / knowledge models and processes by which data is structured. More importantly, the model admits a practical implementation (i.e. “hard running code”) that accommodates artifacts of any modality (e.g. text, audio, images, video, signals) in a single unified data store that enables true data integration and the continuous enrichment of data into knowledge.

References

[Batini 1986] Batini, C. et al. A comparative analysis of methodologies for database schema integration, ACM Computing Surveys, (18) 4, 1986.

[Bernstein 2007] Bernstein P., Ho, H. Model Management and Schema Mappings: Theory and Practice, Proceedings of VLDB Conference, 2007.

[Chang 2006] Chang, F. et al. Bigtable: A Distributed Storage System for Structured Data. 2006. http://labs.google.com/papers/bigtable.html

[CWM 2001] Object Management Group “Common Warehouse Model (CWM) Specification”, OMG, 2001. http://www.omg.org/docs/ad/01-02-01.pdf

[Date 2004] Date, C. An Introduction to Database Systems, 8th edition, Addison Wesley, 2004.

[Dean 2004] Dean, J. and Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. 2004. http://labs.google.com/papers/mapreduce.html

[DRF 2005] Federal Enterprise Architecture Program “The Data Reference Model”, 2005. http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf

[Franklin 2005] Franklin, M., Halevy, A., and Maier, D. From Databases to Dataspaces: A New Abstraction for Information Management. ACM SIGMOD Record, 2005.

[Hadoop] Apache Hadoop. http://hadoop.apache.org/

[Halevy 2005] Halevy, A. et al. Enterprise information integration: successes, challenges and controversies, Proceedings of 24th International Conference on Management of Data, Baltimore, 2005.

[Halevy 2006-1] Halevy, A., Franklin, M., and Maier, D. Principles of Dataspace Systems. Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2006.

[Halevy 2006-2] Halevy, A., Rajaraman, A., and Ordille, J. Data Integration: The Teenage Years. Proceedings of VLDB Conference, 2006.

[HBase] HBase Tutorial. http://arunma.com/2008/10/26/hbase-tutorial/

[Izydor 2007] Izydor, C. and McCollum, P. BI, Process and Integration Trends. DM Review Magazine, August 2007. http://www.dmreview.com/issues/20070801/1089409-1.html?portal=data_integration

[Kimball 2002] Kimball, R. and Ross, M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, Wiley, 2002.

[Lee 2006] Lee, Y., Pipino, L., Funk, J., Wang, R. Journey to Data Quality, The MIT Press, Cambridge, MA, 2006.

[Melnik 2000] Melnik, S. and Decker, S. A layered approach to Information Modeling and Interoperability on the Web. Proc. ECDL’00 Workshop on the Semantic Web, Lisbon, Portugal, Sept 2000. http://infolab.stanford.edu/~melnik/pub/sw00/.

[Northrop 2006] Northrop, L. et al. Ultra-Large-Scale Systems: The Software Challenge of the Future, Pittsburgh: Carnegie Mellon University, 2007. http://www.sei.cmu.edu/publications/books/engineering/uls.html

[Omelayenko 2001] Omelayenko, B. and Fensel, D. An Analysis of B2B Catalogue Integration Problems. Proceedings of the International Conference on Enterprise Information Systems (ICEIS-2001), July 7-10, 2001, p. 945-952.

[Palantir] Palantir Technologies. http://www.palantirtech.com/

[Parent 1998] Parent, C. and Spaccapietra, S. Issues and approaches of database integration, Communications of the ACM, 41(5), 1998.

[RDF 2004] RDF Core Working Group “Resource Description Framework (RDF)”, W3C, 2004. http://www.w3.org/RDF/.

[Steinberg 1998] Steinberg, N., Bowman, C. L. and White F. E. Revision to the JDL Data Fusion Model, Joint NATO/IRIS Conference, Quebec City, October 1998.

[Yero 2008] Yero, J. Logical vs. Physical Data Integration: A Practical Decision Guide, The DAMA International Symposium & Wilshire Meta-Data Conference. San Diego, CA, 2008.

[Yoakum 2008 IQIS] Yoakum-Stover, S. and Malyuta, T. Unified Architecture for Integrating Intelligence Data, Proceedings of MIT Information Quality Industry Symposium, MIT, Cambridge, MA, 2008.  http://blog.systover.net/

[Yoakum 2008 DAMA] Yoakum-Stover, S. and Malyuta, T. Unified Integration Architecture for Intelligence Data, Proceedings of DAMA International Europe Conference, London, UK, 2008.  http://blog.systover.net/

[Yoakum 2008 SIMA] Yoakum-Stover, S. and Malyuta, T. Unified Data Integration for Situation Management, Proceedings of the 4th IEEE Workshop on Situation Management (SIMA 2008) at MILCOM 2008, San Diego CA, 2008.  http://blog.systover.net/

[Yoakum 2009] Yoakum-Stover, S., Malyuta, T., and Antunes, N. A Data Integration Framework with Full Spectrum Fusion Capabilities. MSS Information Fusion Symposium, Las Vegas, NV, Aug 3-7, 2009. http://blog.systover.net/

Appendix – Processing

Ingestion

Consider first the processes that load indigenous artifacts into Layer 1, either physically or virtually, so that they may be unambiguously referenced within Layer 2. Typically these are called ingestion processes. Such processes insert either the entire indigenous artifact, or a reference to its location within the authoritative data source, into Layer 1. In addition, both artifact and process metadata are recorded in the appropriate metadata tables. The former essentially provides a card catalogue for the artifact and the latter provides information assurance.
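The distinction between physical and virtual ingestion can be illustrated with a small in-memory sketch. The Layer1 and ArtifactRecord classes, their field names, and the example location string are all hypothetical; the only commitments taken from the text are that an artifact enters Layer 1 either whole or by reference, and that artifact and process metadata are recorded alongside it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ArtifactRecord:
    artifact_id: int
    content: Optional[bytes]     # set for physical ingestion
    location: Optional[str]      # set for virtual ingestion (reference only)
    artifact_metadata: dict      # "card catalogue" entry for the artifact
    process_metadata: dict       # who or what ingested it, and when


class Layer1:
    """Toy stand-in for the Layer 1 artifact store."""

    def __init__(self) -> None:
        self.records: Dict[int, ArtifactRecord] = {}

    def ingest(self, *, artifact_metadata: dict, process_metadata: dict,
               content: Optional[bytes] = None,
               location: Optional[str] = None) -> int:
        # Exactly one of the two ingestion modes must be chosen.
        if (content is None) == (location is None):
            raise ValueError("provide exactly one of content or location")
        artifact_id = len(self.records) + 1
        self.records[artifact_id] = ArtifactRecord(
            artifact_id, content, location, artifact_metadata, process_metadata
        )
        return artifact_id


layer1 = Layer1()
process = {"creator": "demo-ingester", "date": "2010-02-10"}

# Physical ingestion: the whole artifact is stored.
physical_id = layer1.ingest(
    content=b"IBM announced results.",
    artifact_metadata={"title": "Report 17"},
    process_metadata=process,
)

# Virtual ingestion: only a reference into the authoritative source is stored.
virtual_id = layer1.ingest(
    location="hypothetical://source-system/reports/18",
    artifact_metadata={"title": "Report 18"},
    process_metadata=process,
)
```

Either way, the returned identifier is what Layer 2 records would use to reference the artifact unambiguously.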

Unstructured Information

Processes that structure unstructured artifacts generate SemanticFact and AssociationFact records in Layer 2. Each such process necessarily entails a particular data-model. This data-model is persisted in Layer 3. Concepts and predicates from the data-model (or references to them) are also persisted in the Concept and Predicate dimension tables of Layer 2 along with sufficient metadata to identify and retrieve the data-model source artifact (i.e. schema, ontology, etc.).

Unstructured information processing typically identifies all instances of the concepts within its data-model or type system. For example, a given text extractor may identify all occurrences of ‘IBM’ and associate them with the concept ‘Company.’ Each such instance is represented as a DDF mention. The position of each mention within the source artifact is recorded in the Mention table (e.g. using beginChar, endChar) and a single record is added to the Sign table using, for example, the actual contents of the span (‘IBM’) as the sign value. Each disambiguation occurrence (i.e. the association made by the text extractor between a mention and a concept) is recorded in the SemanticFact table along with appropriate process metadata, and a term consisting of <sign, concept> is created in the Term table (if such a term does not already exist).
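The ‘IBM’ example can be walked through in a few lines of Python. This is a sketch under stated assumptions, not the extractor used in any deployed system: the store is a set of in-memory stand-ins for the Sign, Mention, SemanticFact, and Term tables, the extractor is a literal string match, and every field name other than beginChar, endChar, and spanCoordinateType is hypothetical.

```python
import re


def new_store():
    # Hypothetical in-memory stand-ins for the Layer 2 tables named in the text.
    return {"Sign": {}, "Mention": [], "SemanticFact": [], "Term": set()}


def record_extraction(store, artifact_id, text, surface, concept, process):
    """Record each occurrence of `surface` in `text` as a mention of `concept`."""
    spans = [(m.start(), m.end()) for m in re.finditer(re.escape(surface), text)]
    if not spans:
        return None

    # A single Sign record per distinct value; reused if it already exists.
    sign_id = store["Sign"].setdefault(surface, len(store["Sign"]) + 1)

    for begin, end in spans:
        mention_id = len(store["Mention"]) + 1
        store["Mention"].append({
            "mentionId": mention_id,
            "signId": sign_id,
            "artifactId": artifact_id,
            "spanCoordinateType": "char",
            "beginChar": begin,
            "endChar": end,  # exclusive offset, following Python slicing
        })
        # One SemanticFact per disambiguation occurrence, with process metadata.
        store["SemanticFact"].append({
            "mentionId": mention_id,
            "signId": sign_id,
            "concept": concept,
            "process": process,
        })

    # The <sign, concept> term is created only if it does not already exist.
    store["Term"].add((surface, concept))
    return sign_id


store = new_store()
text = "IBM announced results. Analysts expect IBM to grow."
record_extraction(store, "artifact-1", text, "IBM", "Company", "demo-extractor")
```

Running the extractor again over a second artifact would add further Mention and SemanticFact entries while reusing the same Sign record and Term.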

Further semantic processing may identify relationships between elements within the artifact. The elements themselves would have already been recorded as SemanticFacts. For each such relationship, an AssociationFact is recorded along with appropriate process metadata, and a Statement table entry is created.

Unstructured information processing of other than text artifacts is similar, the main differences being that entries in the Mention table will have a different spanCoordinateType, and the method for assigning a sign value will be different. For example, consider object recognition software that extracts faces from within an image of a crowd. For each extracted face, the corresponding rectangular area of the image could be recorded in the Mention table with the help of pixelUpperLeft and pixelLowerRight, and a sign (e.g. ‘Suzi’s faceImage’) would be assigned to all extracted mentions.
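As an illustration of how one Mention structure might serve several modalities, the sketch below tags each mention with a spanCoordinateType and stores the boundary in whatever coordinates that type implies. The Mention class layout, the "char" and "pixel" type labels, and the sample rectangles are assumptions; pixel coordinates are taken to have their origin at the upper-left corner of the image.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Mention:
    artifact_id: str
    span_coordinate_type: str   # hypothetical labels: "char" or "pixel"
    span: Tuple                 # boundary, interpreted according to the type
    sign_value: str


def text_mention(artifact_id, begin_char, end_char, sign_value):
    """Mention localized by character offsets within a text artifact."""
    if not 0 <= begin_char < end_char:
        raise ValueError("invalid character span")
    return Mention(artifact_id, "char", (begin_char, end_char), sign_value)


def image_mention(artifact_id, pixel_upper_left, pixel_lower_right, sign_value):
    """Mention localized by a rectangle within an image artifact."""
    (x0, y0), (x1, y1) = pixel_upper_left, pixel_lower_right
    if not (x0 < x1 and y0 < y1):
        raise ValueError("upper-left corner must be above and left of lower-right")
    return Mention(artifact_id, "pixel",
                   (pixel_upper_left, pixel_lower_right), sign_value)


# Hypothetical recognizer output: two regions judged to show the same face.
regions = [((10, 20), (60, 90)), ((200, 35), (250, 110))]
face_mentions = [
    image_mention("crowd-image-1", upper_left, lower_right, "Suzi's faceImage")
    for upper_left, lower_right in regions
]
word_mention = text_mention("report-17", 0, 3, "IBM")
```

Downstream records only need the sign value and the mention reference, so the same SemanticFact and AssociationFact structures apply regardless of modality.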

Extract-Transform-Load

Consider next the Extract-Transform-Load (ETL) processes that pull data from other structured data sources, typically databases, into Layer 2. The initial phase of the ETL loads the source data-model (e.g. database data dictionary) into Layer 3, and concepts and predicates (or their references) into the Concept and Predicate dimension tables of Layer 2. Sufficient metadata to identify and retrieve the data-model source artifact (i.e. schema) are also stored. Subsequent ETL processing, which entails a mapping to the DDF structure, inserts signs, terms, and statements into the SemanticFact and AssociationFact tables along with appropriate process metadata.

Because the ETL process needs only to capture the explicit semantics of the source meta-model (e.g. relational, hierarchical, graph…), one ETL can be developed for a whole class of data stores. For example, a discussion of ETL for relational stores may be found in [Yoakum 2008 IQIS].
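The idea of a single ETL per meta-model can be sketched for the relational case using Python's built-in sqlite3 module. This is not the procedure of [Yoakum 2008 IQIS]; it is a generic mapping chosen for illustration, and it assumes that each column becomes a concept, that each non-key column yields a predicate linking it to the primary key, and that each non-null cell becomes a <sign, concept> term. The naming scheme for concepts and predicates and the Person table are hypothetical.

```python
import sqlite3


def etl_relational(conn, table):
    """Map one relational table to concepts, predicates, terms, and statements."""
    # Phase 1: read the data-model from the database's own data dictionary.
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    columns = [row[1] for row in info]
    key = next(row[1] for row in info if row[5])  # first primary-key column
    concepts = [f"{table}.{c}" for c in columns]
    predicates = [f"{table}.{key}->{c}" for c in columns if c != key]

    # Phase 2: map each row to terms and statements.
    terms, statements = set(), []
    for record in conn.execute(f'SELECT * FROM "{table}"'):
        values = dict(zip(columns, record))
        key_term = (str(values[key]), f"{table}.{key}")
        terms.add(key_term)
        for column in columns:
            if column == key or values[column] is None:
                continue
            term = (str(values[column]), f"{table}.{column}")
            terms.add(term)  # identical <sign, concept> pairs collapse here
            statements.append((key_term, f"{table}.{key}->{column}", term))
    return concepts, predicates, terms, statements


# Hypothetical source database.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE Person (id INTEGER PRIMARY KEY, name TEXT, employer TEXT)"
)
source.execute("INSERT INTO Person VALUES (1, 'Alice', 'IBM')")
source.execute("INSERT INTO Person VALUES (2, 'Bob', 'IBM')")

concepts, predicates, terms, statements = etl_relational(source, "Person")
```

Nothing in etl_relational is specific to the Person table: the function reads the structure from the data dictionary, which is why the same code could be pointed at any table in any store of the same class.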

Interactive

Finally, consider an interactive user interface that enables an analyst to assert semantic and association facts directly into the DDF. The analyst will have the option to use existing concepts, predicates, terms, and statements or to create new ones. In the case of the latter, recorded and asserted mentions will reference the source analyst. Metadata recorded for manual processes will also reference the source analyst.