Virtualization services for Data Grids

Data Grids provide a set of virtualization services to enable management and integration of data that are distributed across multiple sites and storage systems. Some of the key services are naming, location transparency, federation, and information integration.



Digital entities are bit streams that can only be interpreted through an infrastructure. They include sensor data, output from simulations, and even output from word-processing programs.


Digital entities inherently are composed of data, information (metadata tags), and knowledge in the form of logical relationships between metadata tags or structural relationships defined by the data model. Treating the processes used to generate digital entities as first-class objects gives rise to the notion of ‘virtual’ digital entities, or virtual data.

The virtual data issue also arises in the context of long-term data persistence, when a digital entity may need to be accessed possibly years after its creation. Either the application that was used to create the digital entity is preserved, in a process called emulation, or the information and the knowledge content is preserved, in a process called migration.

Long-term persistence

Given that the infrastructure continues to evolve, one approach to digital entity management is to try to keep the interfaces between the infrastructure components invariant. Emulation specifies a mapping from the original interface (e.g. operating system calls) to the new interface. Thus, emulation is a mapping between interface standards. Migration specifies a mapping from the original encoding format of a data model to a new encoding format.


Data corresponds to the bits (zeroes and ones) that comprise a digital entity. Information corresponds to any semantic tag associated with the bits; the tags assign semantic meaning to the bits and provide context. Knowledge corresponds to any relationship that is defined between information attributes or that is inherent within the data model.

A unifying abstraction

Data are managed as files in a storage repository, information is managed as metadata in a database, and knowledge is managed as relationships in a knowledge repository.

Files are manipulated in file systems, information in digital libraries, and knowledge in inference engines.

A Data Grid defines the interoperability mechanisms for interacting with multiple versions of each type of repository. These levels of interoperability can be captured in a single diagram that addresses the ingestion, management, and access of data, information, and knowledge.

Digital Entity Management System

The figure above shows an extension of the two-dimensional diagram with a third dimension to indicate the requirement for information integration across various boundaries, for example, disciplines and/or organizations. In two dimensions, the diagram is a 3 × 3 data management matrix that characterizes the data handling systems in the lowest row, the information handling systems in the middle row, and the knowledge handling systems in the top row. The ingestion mechanisms used to import digital entities into management systems are characterized in the left column. Management systems for repositories are characterized in the middle column and access systems are characterized in the right column. The rows of the data management matrix describe different naming conventions.

Finally, at the knowledge level, concept spaces are used to span multiple Data Grids.

Virtualization and levels of data abstraction

Since any component of the hardware and software infrastructure may change over time, emulation needs to be able to deal with changes not only in the operating system calls but also in the storage and the display systems. An application can be wrapped to map the original operating system calls used by the application to a new set of operating system calls.

A Data Grid specifies virtualization services and a set of abstractions for interoperating with multiple types of storage systems.

It is possible to define an abstraction for storage that encompasses file systems, databases, archives, Web sites, and essentially all types of storage systems. The storage system abstraction for a Data Grid uses a logical namespace to reference digital entities that may be located on storage systems at different sites. The logical namespace provides a global identifier and maintains the mapping to the physical file names. Each of the data, the information, and the knowledge abstractions in a Data Grid introduces a new namespace for characterizing digital entities.
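The logical-namespace mapping described above can be sketched as a small catalog that maps global identifiers to physical replicas; the class and method names here are illustrative, not part of any actual Data Grid API.

```python
class LogicalNamespace:
    """Maps global logical names to physical replicas on different storage systems."""

    def __init__(self):
        self._catalog = {}  # logical name -> list of (site, physical path)

    def register(self, logical_name, site, physical_path):
        # A single logical name may map to several physical replicas.
        self._catalog.setdefault(logical_name, []).append((site, physical_path))

    def resolve(self, logical_name):
        # Location transparency: callers never handle physical paths directly.
        replicas = self._catalog.get(logical_name)
        if not replicas:
            raise KeyError(f"unknown logical name: {logical_name}")
        return replicas

ns = LogicalNamespace()
ns.register("/grid/survey/image42", "siteA", "/archive/img/000042.fits")
ns.register("/grid/survey/image42", "siteB", "hpss://store/img42")
print(ns.resolve("/grid/survey/image42"))
```

The point of the sketch is that the logical name stays stable even if a replica moves between storage systems; only the mapping is updated.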

Data Grid infrastructure

The Data Grid community is developing a consensus on the fundamental capabilities that should be provided by Data Grids. In addition, the Persistent Archive Research Group of the Global Grid Forum is developing a consensus on the additional capabilities that are needed in Data Grids to support the implementation of a persistent archive. Distributed data management has been largely solved by the Data Grid community. A storage abstraction is used to implement a common set of operations across archives, Hierarchical Resource Managers, file systems, and databases. The SRB (Storage Resource Broker) organizes the attributes for each digital entity within a logical namespace that is implemented in an object-relational database. The logical namespace supports organization of digital entities as collections/subcollections and supports soft links between subcollections. The types of digital entities that can be registered include files, URLs, SQL command strings, and databases.
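The collection/subcollection organization that the SRB keeps in its database can be pictured with a toy hierarchy; the class below is an invented sketch, not the SRB's actual schema or API.

```python
class Collection:
    """A collection holds registered digital entities and nested subcollections."""

    def __init__(self, name):
        self.name = name
        self.entities = {}        # entity name -> (kind, reference)
        self.subcollections = {}  # name -> Collection

    def add_subcollection(self, name):
        return self.subcollections.setdefault(name, Collection(name))

    def register(self, name, kind, reference):
        # Registered entities may be files, URLs, SQL command strings, or databases.
        self.entities[name] = (kind, reference)

root = Collection("home")
survey = root.add_subcollection("survey")
survey.register("image42", "file", "/archive/img/000042.fits")
survey.register("catalog-query", "sql", "SELECT * FROM objects WHERE mag < 20")
print(sorted(survey.entities))
```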

Data Grid projects

The Data Grid community has been developing software infrastructure to support distributed data collections for many scientific disciplines, including high-energy physics, chemistry, biology, Earth systems science, and astronomy. These systems are in production use, managing data collections that contain millions of digital entities and aggregate terabytes in size. The common Data Grid capabilities that are emerging across all of the Data Grids include implementation of a logical namespace that supports the construction of a uniform naming convention across multiple storage systems. The logical namespace is managed independently of the physical file names used at a particular site, and a mapping is maintained between the logical file name and the physical file name. Each Data Grid has added attributes to the namespace to support location transparency (access without knowing the physical location of the file), file manipulation, and file organization. Most of the Grids provide support for organizing the data files in a hierarchical directory structure within the logical namespace and support for ownership of the files by a community or collection ID.


As described earlier, Data Grid technologies are becoming available for creating and managing data collections, information repositories, and knowledge bases. However, a new requirement emerges for a virtualization service that integrates information from different collections, repositories, and knowledge bases. Such a virtualization service provides a (virtual) single-site view over a set of distributed, heterogeneous information sources. The motivation arises from the fact that human endeavors – whether in science, commerce, or security – are inherently becoming multidisciplinary in nature. 'Next-generation' applications in all of these areas require access to information from a variety of distributed information sources.

The virtualization service may support simple federation, that is, federation of a set of distributed files and associated metadata, or it may provide complex semantic integration of data from multiple domains. The need for such integration was realized several years ago in some domains, for example, commercial and government applications, owing to the existence of multiple legacy, 'stovepiped' systems in these domains. Another approach to information integration is the application integration approach. While the objective is similar to database integration, that is, to integrate data and information from disparate, distributed sources, the techniques are based on a programming-language approach rather than a database approach.

Logic-based, semantic data integration may be viewed as a 'horizontal integration' approach, since the integration is across databases that are all 'about' the same high-level concept but employ disparate terminology, format, and type of content, for example, geologic maps, geochemistry data, gravity maps, and geochronology tables for plutons. An additional level of complexity is introduced in scenarios that require 'vertical integration' of data. Such scenarios require domain models that describe causal relationships, statistical linkages, and the various interactions that occur across these levels, to enable integration of information. Indeed, it would be impossible to integrate the information in the absence of such models. We refer to this approach as model-based integration; it requires statistical and probabilistic techniques and the ability to handle probabilistic relationships among data from different sources rather than just logical relationships.

Data warehousing


As mentioned earlier, data warehousing has been in existence for almost 10 years. Since all the data are brought into one physical location, it becomes possible to exploit this feature by developing efficient storage and indexing schemes for the centralized data. To create a warehouse, data is physically exported from source data systems. Thus, a copy of the necessary data is sent for incorporation into the warehouse. Typically, a defined subset, or ‘view’, of the source data is exported.
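The export of a defined subset, or 'view', into a central store can be sketched in a few lines; the field names and sources are invented for illustration.

```python
def export_view(source_rows, columns, predicate):
    """Export only the agreed-upon subset (a 'view') of a source's data."""
    return [{c: row[c] for c in columns} for row in source_rows if predicate(row)]

# A source pushes its view into the central warehouse; internal columns stay home.
site_a = [{"id": 1, "temp": 21.5, "internal_note": "x"},
          {"id": 2, "temp": 35.0, "internal_note": "y"}]
warehouse = []
warehouse.extend(export_view(site_a, ["id", "temp"], lambda r: r["temp"] > 30))
print(warehouse)
```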

Database and application integration


One of the drawbacks of warehouses is that the data can be out of date with respect to the actual data at the sources. Moreover, the warehouse model requires a higher level of commitment from each participating source, since each source is required to take on the additional task of periodically exporting its data to the warehouse. Database integration and mediation techniques focus on integration of data from multiple sources without assuming that the data is exported from the sources into a single location. Thus, the key difference here is that sources export views rather than data. However, many of the conceptual issues in integrating data may be quite similar to the data warehouse case. Application integration techniques are also designed to provide an integrated view of disparate data sources and applications, to the end user. They differ in approach from database integration techniques in that they employ a programming language and associated ‘data model’ for integration.
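The contrast with warehousing can be sketched as a mediator that keeps no copy of the data and instead calls each source's exported view at query time; the sources and query shape here are hypothetical.

```python
# Each source exports a view: a callable that answers queries against live data.
def source_a_view(min_temp):
    live_data = [{"id": 1, "temp": 21.5}, {"id": 2, "temp": 35.0}]
    return [r for r in live_data if r["temp"] >= min_temp]

def source_b_view(min_temp):
    live_data = [{"id": 7, "temp": 40.2}]
    return [r for r in live_data if r["temp"] >= min_temp]

def mediator(min_temp, views=(source_a_view, source_b_view)):
    # No central copy: results are always as fresh as the sources themselves.
    results = []
    for view in views:
        results.extend(view(min_temp))
    return results

print(mediator(30.0))
```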

Semantic data integration


Semantic data integration is necessary in cases in which information is integrated across sources that have differing terminologies or ontologies. Information integration in this case is based on developing ‘conceptual models’ for each source and linking these models to a global knowledge representation structure, which represents the encyclopedic knowledge in the domain.
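A minimal sketch of semantic integration: each source's local terms are mapped to a concept in a shared global vocabulary before records are combined. The mappings and terms below are invented examples.

```python
# Per-source conceptual models: local vocabulary -> global ontology concept.
MAPPINGS = {
    "geochemistry": {"SiO2_pct": "silica_content"},
    "geologic_map": {"silica": "silica_content"},
}

def to_global(source, record):
    """Translate a source record into the global ontology's vocabulary."""
    mapping = MAPPINGS[source]
    return {mapping.get(k, k): v for k, v in record.items()}

a = to_global("geochemistry", {"SiO2_pct": 63.1})
b = to_global("geologic_map", {"silica": 58.0})
# Both records now use the same concept name, so they can be integrated.
print(a, b)
```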

Model-based integration


There is an increasing need in some scientific disciplines to integrate information ‘across scale’. In bioinformatics applications, scientists are interested in integrating information from the molecular level to the sequence, genome, protein, cell, tissue, and even organ level. Supporting analysis pipelines requires a virtualization service that can support robust workflows in the Data Grid. Efforts are under way to standardize Web workflow specifications. Increasingly, the processes in the workflow/pipeline are Web services, so managing the execution of a pipeline is essentially the same as the problem of managing, or ‘orchestrating’, the execution of a set of Web services.
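Orchestrating such a pipeline can be sketched as chaining services in declared order, each stage feeding its output to the next; real workflow engines add dependency graphs, retries, and remote (Web service) invocation. The stage functions are hypothetical stand-ins.

```python
def run_pipeline(stages, data):
    """Execute each stage (a service) in order, feeding outputs forward."""
    for stage in stages:
        data = stage(data)
    return data

# Invented stages of a bioinformatics pipeline.
def fetch_sequence(accession):
    return "ATGGCC"  # stand-in for a sequence-database lookup

def translate(dna):
    codons = {"ATG": "M", "GCC": "A"}
    return "".join(codons[dna[i:i + 3]] for i in range(0, len(dna), 3))

protein = run_pipeline([fetch_sequence, translate], "ACC123")
print(protein)
```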


The Semantic Grid: a future e-Science infrastructure

At present, the key communication technologies are predominantly e-mail and the Web. Together these have shown a glimpse of what is possible; however, to more fully support the e-Scientist, the next generation of technology will need to be much richer, more flexible, and much easier to use. Against this background, this chapter focuses on the requirements, the design and implementation issues, and the research challenges associated with developing a computing infrastructure to support future e-Science. The computing infrastructure for e-Science is commonly referred to as the Grid, and this is, therefore, the term that is used here.

The Semantic Grid is characterised as an open system in which users, software components and computational resources (all owned by different stakeholders) come and go on a continual basis. There should be a high degree of automation that supports flexible collaborations and computation on a global scale. Moreover, this environment should be personalised to the individual participants and should offer seamless interactions with both software components and other relevant users.

Given the above view of the scope of e-Science, it has become popular to characterise the computing infrastructure as consisting of three conceptual layers:

• Data/computation

• Information

• Knowledge



This section expands upon the view of the Semantic Grid as a service-oriented architecture in which entities provide services to one another under various forms of contract. The e-Scientist's environment is composed of data/computation services, information services, and knowledge services.

Justification of a service-oriented view

A key question in designing and building Grid applications is: what is the most appropriate conceptual model for the system? Without a conceptual underpinning, Grid endeavours will simply be a series of handcrafted and ad hoc implementations that represent point solutions.

Services can be related to the domain of the Grid, the infrastructure of the computing facility, or the users of the Grid – that is, at the data/computation layer, at the information layer, or at the knowledge layer. Services do not exist in a vacuum; rather, they exist in a particular institutional context. Thus, all services have an owner (or set of owners). The owner is the body (individual or institution) that is responsible for offering the service for consumption by others. The owner sets the terms and conditions under which the service can be accessed.

The key components of a service-oriented architecture are service owners that offer services to service consumers under particular contracts. Each owner–consumer interaction takes place in a given marketplace whose rules are set by the market owner. The market owner may be one of the entities in the marketplace or it may be a neutral third party. This dynamic service-composition activity is akin to creating a new virtual organisation.

The service creation process covers three broad types of activity. Firstly, specifying how the service is to be realized by the service owner using an appropriate service description language. Secondly, specifying the metainformation associated with the service. Thirdly, making the service available in the appropriate marketplace.

The service procurement phase is situated in a particular marketplace and involves a service owner and a service consumer establishing a contract for the enactment of the service according to a particular set of terms and conditions.
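The creation and procurement phases can be caricatured in a few lines: an owner publishes a service description into a marketplace, and a consumer then establishes a contract under the stated terms. All names and the contract shape are illustrative, not a real Grid API.

```python
class Marketplace:
    """A marketplace where owners offer services and consumers procure them."""

    def __init__(self, owner):
        self.owner = owner  # the market owner sets the marketplace rules
        self.offers = {}    # service name -> (service owner, description, terms)

    def publish(self, service_owner, name, description, terms):
        # Service creation: make the described service available in the market.
        self.offers[name] = (service_owner, description, terms)

    def procure(self, consumer, name):
        # Service procurement: establish a contract between owner and consumer.
        service_owner, description, terms = self.offers[name]
        return {"provider": service_owner, "consumer": consumer,
                "service": name, "terms": terms}

market = Marketplace(owner="neutral-third-party")
market.publish("lab-A", "sequence-alignment", "aligns genome sequences", "pay-per-use")
contract = market.procure("scientist-B", "sequence-alignment")
print(contract)
```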

It can be seen that a service-oriented architecture is well suited to Grid applications:

• Able to store and process huge volumes of content in a timely fashion.

• Allow different stakeholders to retain ownership of their own content and processing capabilities, but to allow others access under the appropriate terms and conditions.

• Allow users to discover, transparently access and process relevant content wherever it may be located in the Grid.

• Allow users to form, maintain, and disband communities of practice with restricted membership criteria and rules of operation.

• Allow content to be combined from multiple sources in unpredictable ways according to the users’ needs.

• Support evolutionary growth as new content and processing techniques become available.

Key technical challenges

Service owners and consumers as autonomous agents – An agent is an encapsulated computer system that is situated in some environment and that is capable of flexible, autonomous action in that environment in order to meet its design objectives. Each service owner will have one or more agents acting on its behalf. These agents will manage access to the services for which they are responsible and will ensure that the agreed contracts are fulfilled.

Interacting agents – Grid applications involve multiple stakeholders interacting with one another in order to procure and deliver services. Once semantic interoperation has been achieved, the agents can engage in various forms of interaction. These interactions can vary from simple information interchanges, to requests for particular actions to be performed, and on to cooperation, coordination and negotiation in order to arrange interdependent activities.

The nature of the interactions between the agents can be broadly divided into two main camps. Firstly, those that are associated with making service contracts. This will typically be achieved through some form of automated negotiation since the agents are autonomous. When designing these negotiations, three main issues need to be considered:

• The negotiation protocol

• The negotiation object

• The agent’s decision-making models

The second main type of interaction is when a number of agents decide to come together to form a new virtual organisation. There are a number of techniques and algorithms that can be employed to address the coalition formation process.
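The three design issues above can be made concrete with a toy alternating-offers negotiation over a single issue (price); the protocol, negotiation object, and decision rules here are deliberately simplistic stand-ins.

```python
def negotiate(seller_reserve, buyer_limit, opening_price, concession=5.0, max_rounds=20):
    """Alternating-offers protocol over a single-issue negotiation object (price).

    Decision models: the buyer accepts any offer at or below its limit; the
    seller concedes a fixed amount each round but never goes below its reserve.
    """
    price = opening_price
    for _ in range(max_rounds):
        if price <= buyer_limit:   # buyer's decision rule: accept if affordable
            return price           # agreement reached
        price -= concession        # seller's decision rule: concede and re-offer
        if price < seller_reserve:
            return None            # no zone of agreement
    return None

print(negotiate(seller_reserve=40.0, buyer_limit=60.0, opening_price=100.0))
```

With a reserve of 40 and a buyer limit of 60 there is a zone of agreement, so the seller's concessions eventually land on an acceptable price; raise the reserve above the limit and the negotiation fails.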

Marketplace structures-It should be possible to establish marketplaces by any agent(s) in the system (including a service owner, a service consumer or a neutral third party).


The aim of the knowledge layer is to act as an infrastructure to support the management and application of scientific knowledge to achieve particular types of goal and objective. In order to achieve this, it builds upon the services offered by the data/computation and information layers.

The knowledge life cycle

The knowledge life cycle can be regarded as a set of challenges as well as a sequence of stages, and each stage has variously been seen as a bottleneck. The effort of acquiring knowledge was one bottleneck recognised early, but so too are modelling, retrieval, reuse, publication and maintenance. In this section, we examine the nature of the challenges at each stage in the knowledge life cycle and review the various methods and techniques at our disposal.

Knowledge acquisition sets the challenge of getting hold of the information that is around, and turning it into knowledge by making it usable. Knowledge modelling bridges the gap between the acquisition of knowledge and its use: knowledge models must be able both to act as straightforward placeholders for the acquired knowledge, and to represent the knowledge so that it can be used for problem solving.

One of the most serious impediments to the cost-effective use of knowledge is that too often knowledge components have to be constructed afresh; there is little knowledge reuse. This arises partly because knowledge tends to require different representations depending on the problem solving that it is intended to do. We need to understand how to find patterns in knowledge, to allow for its storage so that it can be reused when circumstances permit.

Having acquired knowledge, modelled and stored it, the issue then arises as to how to get that knowledge to the people who subsequently need it. The challenge of knowledge publishing, or disseminating, can be described as getting the right knowledge, in the right form, to the right person or system, at the right time. Finally, having acquired and modelled the knowledge, and having managed to retrieve and disseminate it appropriately, the last challenge is to keep the knowledge content current: knowledge maintenance. This may involve the regular updating of content as knowledge changes.

Ontologies and the knowledge layer

The concept of an ontology is necessary to capture the expressive power that is needed for modelling and reasoning with knowledge. Generally speaking, an ontology determines the extension of terms and the relationships between them. It is important to recognize that enrichment or metatagging can be applied at any conceptual level in the three-tier Grid. This yields the idea of metadata, metainformation, and metaknowledge.

Domain ontologies: Conceptualizations of the important objects, properties, and relations between those objects

Task ontologies: Conceptualizations of tasks and processes, their interrelationships and properties

Quality ontologies: Conceptualizations of the attributes that knowledge assets possess and their interrelationships

Value ontologies: Conceptualizations of those attributes that are relevant to establishing the value of content

Personalization ontologies: Conceptualizations of features that are important to establishing a user model or perspective

Argumentation ontologies: A wide range of annotations can relate to the reasons why content was acquired, why it was modelled in the way it was, and who supports or dissents from it. This is particularly powerful when extended to the concept of associating discussion threads with content
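Enrichment against such ontologies can be pictured as tagging a content asset with controlled terms, rejecting terms outside the agreed vocabulary. The vocabularies and asset below are invented examples, not any standard ontology.

```python
# Invented controlled vocabularies, one per ontology type.
ONTOLOGIES = {
    "domain": {"pluton", "gravity-map", "geochronology"},
    "task": {"data-cleaning", "cross-matching"},
    "quality": {"peer-reviewed", "provisional"},
}

def annotate(asset, ontology, term):
    """Attach a controlled term to an asset; only agreed vocabulary is allowed."""
    if term not in ONTOLOGIES[ontology]:
        raise ValueError(f"{term!r} is not in the {ontology} ontology")
    asset.setdefault(ontology, []).append(term)
    return asset

asset = {"name": "survey-table-7"}
annotate(asset, "domain", "pluton")
annotate(asset, "quality", "provisional")
print(asset)
```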

Given the developments outlined in this section, a general process that might drive the emergence of the knowledge Grid would comprise the following:

• The development, construction, and maintenance of application-based (specific and more general areas of science and engineering) and community-based (sets of collaborating scientists) ontologies.

• The large-scale annotation and enrichment of scientific data, information and knowledge in terms of these ontologies.

• The exploitation of this enriched content by knowledge technologies.

Knowledge layer aspects of the scenario

The knowledge layer is described in terms of the agent-based, service-oriented analysis. Important components of this conceptualization were the software proxies for human agents, such as the scientist agent and the technician agent. These software agents will interact with their human counterparts to elicit preferences, priorities, and objectives.

One of the most pervasive knowledge services in our scenario is the partial or fully automated annotation of scientific data: before it can be used as knowledge, we need to equip the data with meaning. These acquisition and annotation services, along with many others, will be underpinned by ontology services that maintain agreed vocabularies and conceptualizations of the scientific domain. Personalization services will also be invoked by a number of agents in the scenario. Personal annotations might reflect genuine differences of terminology or perspective – particular signal types often have local vocabulary to describe them.

At this point, agents are invoked whose job it is to locate other systems or agents that might have an interest in the results. Raw results are unlikely to be especially interesting, so the generation of natural-language summaries of results will be important for many of the agents in our scenario. Ultimately it will be up to application designers to determine whether the knowledge services described in this scenario are invoked separately or as part of the inherent competencies of the agents described earlier. Whatever the design decisions, it is clear that knowledge services will play a fundamental role in realizing the potential of the Semantic Grid for the e-Scientist.


Peer to Peer Grids

P2P technology is exemplified by Napster and Gnutella, which can enable ad hoc communities of low-end clients to advertise and access the files on the communal computers. Typical P2P operations are browsing and accessing files on a peer, or advertising one's interest in a particular file; these require services to set up and join peer groups. The Grid, by contrast, could support job submittal and status services and access to sophisticated data management.

Grids have structured, robust security services, while P2P networks can exhibit more intuitive trust mechanisms reminiscent of the 'real world'.

We explore the concept of a P2P Grid with a set of services that includes the services of both Grids and P2P networks.

As an example of a Grid combined with P2P, consider the way one uses the Internet to access information – either news items or multimedia entertainment. Perhaps the large sites such as Yahoo, CNN, and future digital movie distribution centers have a Grid-like organization: there are well-defined central repositories and high-performance delivery mechanisms involving caching to support access. Security is likely to be strict for premium channels. This structured information is augmented by the P2P mechanisms popularized by Napster, with communities sharing MP3 files and other treasures in a less organized and controlled fashion.

Key Technology concepts for P2P Grids

The figure below shows a traditional Grid with Web middleware mediating between clients and back-end resources, but arranged democratically as in a P2P environment.

Figure: A peer-to-peer grid

Distributed object technology is implemented with objects defined in an XML-based IDL (Interface Definition Language) called WSDL (Web Services Description Language). This allows 'traditional approaches' such as CORBA or Java to be used 'under the hood', with an XML wrapper providing a uniform interface. Another key concept – that of the resource – comes from the Web consortium W3C.

Everything – whether an external or an internal entity – is a resource labeled by a Uniform Resource Identifier (URI); a URL additionally locates it. Resources that are directly exposed to users and to other services are built as distributed objects serving as services, so that their capabilities and properties can be accessed by a message-based protocol.

Typically, services can be migrated between computers. The use of XML and WSDL standards allows interoperability.
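The idea of an XML wrapper giving a uniform, message-based interface over an arbitrary implementation can be sketched as follows. This is a toy dispatcher, not real WSDL/SOAP machinery; the service and request format are invented.

```python
import xml.etree.ElementTree as ET

class TemperatureService:
    """The 'under-the-hood' implementation could equally be CORBA or Java."""
    def current(self, city):
        return {"Delhi": 31.0, "Oslo": 4.0}.get(city, 20.0)

def invoke(service, request_xml):
    # A uniform interface: the method name and arguments travel as an XML message,
    # so the caller never depends on the implementation technology.
    req = ET.fromstring(request_xml)
    method = getattr(service, req.attrib["method"])
    args = [child.text for child in req]
    return str(method(*args))

reply = invoke(TemperatureService(), '<request method="current"><arg>Oslo</arg></request>')
print(reply)
```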

There are several technology and research-and-development areas on which the infrastructure builds:

  • Web services: LDAP, XML databases/files.
  • The messaging subsystem between Web services and external resources, addressing performance and fault tolerance. Both P2P and Grid systems need messaging technology.
  • Toolkits to enable applications to be packaged as Web services and libraries.
  • Metadata needed to describe all stages of scientific endeavour.
  • Services such as visualization and collaboration.
  • The Semantic Grid.
  • Portal technology defining user-facing ports on Web services.

Peer-to-peer Grid event service

The communication subsystem provides the messaging between the resources and the Web services. Key to the future e-Science infrastructure will be this messaging subsystem and its network communications, performance, and reliability.

Messaging services include SOAP, the JXTA peer-to-peer protocols, and the commercial Java Message Service (JMS).
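An event service of the topic-based publish/subscribe kind that JMS provides can be sketched in a few lines; a real implementation would add reliability, persistence, and network transport, all omitted here.

```python
from collections import defaultdict

class EventBroker:
    """Topic-based publish/subscribe: publishers and subscribers never meet directly."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber on the topic receives the event.
        for callback in self._subscribers[topic]:
            callback(message)

broker = EventBroker()
received = []
broker.subscribe("sensor/temperature", received.append)
broker.publish("sensor/temperature", {"value": 21.5})
print(received)
```

Decoupling producers from consumers this way is what lets the same messaging substrate serve both Grid services and ad hoc P2P peers.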

Collaboration in P2P Grids

Both Grids and P2P networks are associated with collaborative environments. P2P networks started with ad hoc communities such as those sharing MP3 files; Grids support virtual enterprises or organizations – these are unstructured or structured societies, respectively. At a high level, collaboration involves sharing and in our context this is sharing of Web services, objects or resources.

We can expect that Web service interfaces to 'everything' will be available, and we take this point of view later, where a Word document, a Web page, a computer visualization, or the audio–video stream (at, say, 30 frames per second) from some videoconferencing system will all be viewed as objects or resources with a known Web service interface.

Asynchronous collaboration has no special time constraint and typically each community member can access the resource in their own fashion; objects are often shared in a coarse grain fashion with a shared URL pointing to a large amount of information.

Synchronous collaboration at a high level is no different from the asynchronous case, except that the sharing of information is done in real time. The 'real-time' constraint implies delays of around 10 to 1000 ms per participant, or rather 'jitter in transit delays' of a 'few' milliseconds.

An example of collaboration is the shared display. In the shared-display model, one shares the bitmap display, and state is maintained between the clients by transmitting the changes in the display.
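Transmitting only the changes in a shared display can be sketched with a simple diff of screen cells; real systems work on compressed bitmap regions, but the idea is the same. The cell-based representation here is an invented simplification.

```python
def diff(old, new):
    """Compute only the cells that changed since the last frame."""
    return {pos: val for pos, val in new.items() if old.get(pos) != val}

def apply_patch(screen, patch):
    screen.update(patch)
    return screen

master = {(0, 0): "A", (0, 1): "B"}
client = dict(master)          # clients start from the same display state
master[(0, 1)] = "C"           # a change on the master display
patch = diff(client, master)   # only the changed cell is transmitted
apply_patch(client, patch)
print(patch, client == master)
```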

User interfaces and universal access

This section considers user interface issues in the context of universal accessibility, that is, ensuring that any Web service can be accessed by any user irrespective of their physical capability or their network/client connection. Universal access requires that the user interface be defined intelligently by an interaction between the user 'profile' (specifying user and client capabilities and preferences) and the semantics of the Web service. Only the service itself can, in general, specify what is essential about its user-facing view.

There are three key user-facing sets of ports:

  • The main user-facing specification output ports, which in general do not deliver the information defining the display but rather a menu that defines many possible views. A selector combines a user profile from the client (specified on a special profile port) with this menu to produce the 'specification of actual user output' that is used by a portal, which aggregates many user-interface components (from different Web services) into a single view. The result of the transformer may just be a handle that points to a user-facing customized output port.
  • The customized user-facing port. It seems appropriate to consider interactions with user profiles and filters as outside the original Web service, since they can be defined as interacting with the message using a general logic valid for many originating Web services.
  • The user-facing input/output port, which is the control channel.
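The selector in the first bullet can be sketched as a function that intersects the service's menu of possible views with the client's profile and picks the richest view the client can handle; the port contents and profile fields below are invented.

```python
def select_view(menu, profile):
    """Combine a service's menu of possible views with a user/client profile."""
    candidates = [v for v in menu
                  if v["bandwidth_kbps"] <= profile["bandwidth_kbps"]
                  and v["modality"] in profile["modalities"]]
    # Prefer the richest view the client connection can support.
    return max(candidates, key=lambda v: v["bandwidth_kbps"]) if candidates else None

# The menu delivered on the specification output port.
menu = [
    {"name": "full-video", "modality": "visual", "bandwidth_kbps": 2000},
    {"name": "thumbnail", "modality": "visual", "bandwidth_kbps": 100},
    {"name": "audio-summary", "modality": "audio", "bandwidth_kbps": 64},
]
# The user profile delivered on the special profile port.
profile = {"bandwidth_kbps": 500, "modalities": {"visual", "audio"}}
print(select_view(menu, profile)["name"])
```

Changing the profile (say, audio-only for a visually impaired user or a low-bandwidth link) changes which view is selected, without the service having to know anything about the client.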


