
CSE 333 Final Project Report 12/16/2005

Final Report

for

CSE 333 – Distributed Component Systems

Instructor

Prof. Steven Demurjian

Department of Computer Science and Engineering

University of Connecticut

October 2005

XML and Component-Based Systems: Four Application Areas

Bryan Bentz Jason Hayden

Upsorn Praphamontripong Paul Vandal


TABLE OF CONTENTS

LIST OF FIGURES

CHAPTER

1 Introduction
  1.1 Background
  1.2 Scope
  1.3 The Common Thread

2 Semantic Web
  2.1 Background
  2.2 Semantic Networks
  2.3 RDF: the Resource Description Framework
    2.3.1 Practical RDF
    2.3.2 The Dublin Core
  2.4 Building a Semantic Web Server
  2.5 Conclusions

3 Software Component Retrieval
  3.1 Background
  3.2 Overview of Architecture
  3.3 XML-based Software Component Specification
  3.4 Match Definitions
    3.4.1 Exact Match
    3.4.2 Generalization Match
    3.4.3 Specialization Match
    3.4.4 Partial Match
    3.4.5 Reference Match
  3.5 Similarity Measurement
    3.5.1 Component Similarity
    3.5.2 Method Similarity
    3.5.3 Pre-condition Similarity
    3.5.4 Post-condition Similarity
    3.5.5 Input Similarity
    3.5.6 Output Similarity
  3.6 Illustration
  3.7 Adoption of RDF
  3.8 Future Work

4 OpenDocument Exchange
  4.1 Introduction
  4.2 Format Specifications
    4.2.1 Meta.xml
    4.2.2 Settings.xml
    4.2.3 Styles.xml
    4.2.4 Content.xml
    4.2.5 Pictures folders
    4.2.6 Thumbnail.png
    4.2.7 Configurations2
    4.2.8 Manifest.xml
  4.3 Testing Portability
    4.3.1 Simple Text Document
    4.3.2 Advanced Text Document
    4.3.3 Simple Spreadsheet Document
    4.3.4 Advanced Spreadsheet Document
  4.4 Conclusions

5 Distributed Warehouse Systems
  5.1 Data Warehouse Systems
    5.1.1 Overview
    5.1.2 Metadata Interoperability in the Data Warehouse Environment
    5.1.3 An Intelligent Search by applying the Semantic Web
  5.2 Persons of Interest Tracking
  5.3 Conclusion

6 References


LIST OF FIGURES

FIGURE

1 Semantic Networks
2 RDF Representation
3 RDF Gateway Tool
4 RDFQL
4b Ontology cardinality
5 Overview architecture of the software component retrieval system
6 XML-based software component specification with individual conditions
7 XML-based software component specification with Boolean conditions
8 XML-based software component specification with nested Boolean conditions
9 XML-based software component specification with nested conditions
10 XML-based software component specification with an “isa” relation
11 Example query and result from the retrieval process
12 Example meta.xml file
13 Example settings.xml file
14 Example styles.xml file
15 Example content.xml file
16 Text document 1 opened in OpenOffice
17 Text document 1 opened in StarOffice
18 Text document 1 opened in KOffice
19 Text document 2 opened in OpenOffice
20 Text document 2 opened in StarOffice
21 Text document 2 opened in KOffice
22 Simple spreadsheet in OpenOffice
23 Simple spreadsheet in StarOffice
24 Simple spreadsheet in KOffice
25 Advanced spreadsheet in OpenOffice
26 Advanced spreadsheet in StarOffice
27 Advanced spreadsheet in KOffice
28 Metadata bridge problem
29 A central proprietary metadata repository
30 Common Warehouse Metamodel construction
31 A central CWM metadata repository
32 Full data searching in a data warehouse
33 Proprietary metadata searching
34 Searching an RDF repository mapped from a CWM repository
35 Persons of interest search


1 Introduction

1.1 Background

The emergence of the Internet and distributed computing has brought new challenges to the software development community. Today, it is possible for a team of developers to work together on a project while located in various parts of the world. In addition to being scattered, they may also use different types of computers, operating systems, or software. This has created a pressing need for program and component interoperability.

This interoperability needs to be supported on many levels. This includes creating a standard model that defines how to generate data, how to organize it, and also a standard way to interchange this data between components. Such a format would enable programs to save files using a strict structure, allowing for simple interchange and remote interpretation of the file among a range of machines and programs.

A new standardized format called XML has won over the software community and has gained the recognition of the World Wide Web Consortium (W3C). Unlike HTML, XML does not concern itself with the display and formatting of text; instead it describes the text and how it is structured within the XML file. With this simple standard, many different types of programs are able to exchange data between programs and machines in an organized and structured fashion. XMI, or XML Metadata Interchange, uses an XML document for the interchange of metadata. The content of the file is the metadata itself, and the tags are meta-metadata, which are defined by the meta-metamodel MOF, or Meta Object Facility.

1.2 Scope

We have investigated four areas that rely on XML, and which explore a common problem with XML-represented data: that there must also be agreement on the semantic content and markup mechanism before data may be exchanged (what we will refer to as the “Meta Problem”). These areas are:

Semantic Web: a next-generation World Wide Web, allowing data and services to be located and used by programs as well as people;

Software reuse, facilitating the locating, retrieval, and integration of software components based on requirements;

OpenDocument exchange, allowing the creation and exchange of office documents across multiple document editors;

Business data warehouse systems, for a more efficient interchange and interoperability of data across distributed data warehouses.


These distinct areas all use XML (and often XMI) to provide interoperability between disparate components, which may evolve independently at different points in time. In each of these areas the challenge of agreeing on XML structure arises, specifically in determining the detailed structure and tags that are to be used in the XML, or in providing some way to negotiate these as needed. In the case of the Semantic Web, there are techniques for exchanging information about the XML representations used; in the Open Office area, an agreed-upon standard is used (OpenDocument). Business database systems may use either technique.

1.3 The Common Thread

The one application of XML that addresses the Meta Problem head-on is the Semantic Web. This is intended to be the next generation of the World Wide Web, containing information marked up in XML so that meaning may be used as a search criterion. As with the World Wide Web, it is intended that this will be done in a decentralized manner by many authors, so no centrally-imposed rigid set of XML guidelines may be relied on for consistency; rather, this is a runtime problem for those searching for information.

Semantic Web work has led to an approach to this problem which we will use as the underlying thread unifying our work. In short, this approach involves representing semantic information in the Resource Description Framework (RDF), which is rich enough to allow runtime resolution of the Meta Problem.

2 Semantic Web

By Bryan Bentz

2.1 Background

The Semantic Web is an extension of the current web, in which information is given well-defined meaning, allowing richer interactions with both humans and machines [Bryan-1]. The purpose is twofold: to be a web of distributed knowledge bases, accessible by agents (as well as humans); and to be a repository of web services, allowing agents to locate, select, employ, compose, reuse, and monitor services automatically [Bryan-2].

Distributed knowledge bases use ontologies to define their structures, allowing interoperability of web resources containing related content. Ontologies provide the framework upon which the XML for a given knowledge base is constructed. The subject of ontologies spans ontology representation languages, ontology development, ontology learning approaches, and ontology library systems.

Web services use one of several techniques to advertise their capabilities and allow invocation; software agents must be able to find, communicate with, monitor, control, and handle the output of such services; indeed, services may be composed into new web services [Bryan-3]. To do this there are several evolving standards, such as UDDI (Universal Description, Discovery, and Integration); WSDL (Web Services Description Language), proposed by IBM and Microsoft; DAML (DARPA Agent Markup Language); and ebXML (Electronic Business using XML), proposed by the UN. At this point in time, composition of services is largely an ongoing research topic, though some sorts of composition (analogous to UNIX pipes, that is, chaining together existing services) are practical today.

All of these applications of the Semantic Web require solving the Meta Problem; some do so by formalizing a set of XML tags (such as ebXML), others by providing some means of dynamically negotiating a common XML tag vocabulary. The difficulty is that data of a given type may be marked up by different people in different ways, e.g.

<Telephone>(860) 536-1477</Telephone>
<PhoneNumber>8605361477</PhoneNumber>

How do agents, or users performing a semantic search, recognize these as containing the same information?

One answer is to use a richer model of the tags, one that will be adequate to allow an automated determination that the above two lines represent the same information. A bedrock foundation of the Semantic Web is RDF [Bryan-4], which encodes a type of knowledge representation known as a semantic network.

2.2 Semantic Networks

In concept a semantic network is quite simple: it is a directed graph, in which the nodes represent semantic tokens and the edges represent relationships. It was initially developed by Richard Richens of the Cambridge Language Research Unit in 1956 [Bryan-5] as a way of representing the underlying meaning of natural language; for example, the phrase “my kettle and cup of coffee” might be represented as this semantic network:

Figure 1: A Semantic Network

In this network, the labeled nodes represent ‘concepts’ in a sense – and in a complete representation, would consist of semantic networks themselves that fully represent those concepts (that is, consider this to be a network fragment, though in practice it may be all that needs to be represented for a particular application). The links represent relationships, typically subclassing (“ako” = “a kind of”), instances (“inst.”), or other relationships, either ad hoc or defined elsewhere in the network (“contains” in the above example).


Note how useful this representation would be, for example, in performing machine translation of (say) English to French. The English would be used to construct the network from the phrase – and at that point, the network fully represents the phrase, and it has no English grammatical structure embedded in it. From this representation, a description in French may be built, using the appropriate French grammatical structures to describe this network. Semantic networks are indeed very powerful representational tools and may be used in a number of contexts.

In the Semantic Web use of semantic networks, this sort of translation is what is used to reconcile one set of XML tags with another – rather than translating from English to French, the idea is to translate from my set of tags to your set of tags. In practice we may have to map my semantic network representation to yours as well, but this is now a well-defined operation of identifying matching nodes and topology in our semantic networks.
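As a minimal sketch of this kind of tag reconciliation, assume that each author's tag has already been linked to a shared concept node. The concept name `phone_number`, the mapping table, and the digits-only normalization rule are illustrative assumptions, not part of any published ontology.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical mapping from each author's tag to a shared concept node.
TAG_TO_CONCEPT = {
    "Telephone": "phone_number",
    "PhoneNumber": "phone_number",
}


def normalize(concept, text):
    """Reduce a value to a canonical form (assumed rule: keep digits only)."""
    if concept == "phone_number":
        return re.sub(r"\D", "", text)
    return text.strip()


def same_information(xml_a, xml_b):
    """True if both fragments map to one concept and one normalized value."""
    a, b = ET.fromstring(xml_a), ET.fromstring(xml_b)
    concept_a = TAG_TO_CONCEPT.get(a.tag)
    concept_b = TAG_TO_CONCEPT.get(b.tag)
    if concept_a is None or concept_a != concept_b:
        return False
    return normalize(concept_a, a.text or "") == normalize(concept_b, b.text or "")


print(same_information(
    "<Telephone>(860) 536-1477</Telephone>",
    "<PhoneNumber>8605361477</PhoneNumber>",
))
```

A fixed lookup table stands in here for the semantic network; in a real Semantic Web setting the link between the two tags would be derived from RDF rather than hard-coded.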

2.3 RDF: the Resource Description Framework

RDF originated with the work of R.V. Guha at Apple on what was known as the Meta Content Framework. An RDF representation consists of a set of triples, each one of the form:

Subject Predicate Object

This might seem like a poor way to begin building a sophisticated knowledge representation, but in reality it is an encoding of a semantic network, representing one link at a time:

Figure 2: RDF Representation

Each RDF triple contains 2 objects and a relationship between them; these two objects are the subject and object of the RDF triple, and the relationship is the predicate.

Provided that each object is represented with a unique name, RDF triples may unambiguously encode an arbitrary semantic network. Furthermore, if the names are globally unique (across all systems), different systems may combine their local semantic networks with that of other systems. This is a powerful idea, indeed a surprising one: the Semantic Web at its core involves the construction of one large semantic network, distributed across many, many machines.
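The idea of combining local networks can be sketched in a few lines, under the assumption that a network is simply a set of (subject, predicate, object) tuples. The `ex:` node names below are hypothetical stand-ins for globally unique names and are not taken from Figure 1.

```python
# A network is modeled as a set of (subject, predicate, object) triples.
# All node names here are hypothetical; "ex:" stands in for a unique prefix.
local_a = {
    ("ex:kettle", "ako", "ex:container"),
    ("ex:kettle", "contains", "ex:coffee"),
}
local_b = {
    ("ex:cup", "ako", "ex:container"),
    ("ex:cup", "contains", "ex:coffee"),
}


def merge(*networks):
    """Combine networks; identical names denote the same node, so union suffices."""
    return set().union(*networks)


def subjects(network, predicate, obj):
    """All subjects linked to obj by predicate, in sorted order."""
    return sorted(s for s, p, o in network if p == predicate and o == obj)


merged = merge(local_a, local_b)
print(subjects(merged, "ako", "ex:container"))
```

Because both systems use the same name for the shared node, no explicit alignment step is needed: the union of the two triple sets is already the combined graph.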


2.3.1 Practical RDF

In practice, Uniform Resource Identifiers (URIs) are used to name nodes; these are unique, and furthermore may point to URLs that contain further information about the object being represented. Ontologies, which are published on the web, consist of semantic network fragments, generally covering some particular domain. Ontology tools let designers work on the semantic network representation from a global perspective. The resulting RDF descriptions may then be exchanged (in an XML format for RDF), merged, or analyzed by Semantic Web components.

2.3.2 The Dublin Core

One well-known and well-used ontology is known as the Dublin Core [Bryan-6], and defines terms that are used to indicate information about publications; items represented include [Bryan-7]:

Title          Format
Creator        Identifier
Subject        Source
Description    Language
Publisher      Relation
Contributor    Coverage
Date           Rights
Type

This is a fairly simple set, though the relationships between these elements, and the nature of the data which may exist for each attribute, mean that this is actually a relatively large ontology. Because so much of what is currently on the web, and of what is likely to be on the web, may be considered to be a ‘document’, the Dublin Core is widely used.
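To make the use of these elements concrete, the sketch below describes this report with three Dublin Core elements, expressed as triples. The element namespace is the standard Dublin Core 1.1 one; the report identifier `urn:example:cse333-final-report` and the helper `describe` are hypothetical, introduced only for illustration.

```python
# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical identifier for this report; not a registered URI.
REPORT = "urn:example:cse333-final-report"


def describe(subject, **elements):
    """Build one triple per value, using Dublin Core element names as predicates."""
    triples = []
    for name, value in elements.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            triples.append((subject, DC + name, v))
    return triples


triples = describe(
    REPORT,
    title="XML and Component-Based Systems: Four Application Areas",
    creator=["Bryan Bentz", "Jason Hayden",
             "Upsorn Praphamontripong", "Paul Vandal"],
    date="2005-12-16",
)

for triple in triples:
    print(triple)
```

A repeated element such as Creator simply yields one triple per value, which is why the flat element list can still describe documents with several authors.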

2.4 Building a Semantic Web Server

Much of the literature about the Semantic Web is hypothetical, or theoretical; to develop an understanding of the pragmatics involved, we implemented an experimental semantic web server, using the RDF Gateway tool from Intellidimension. As a domain area, we chose to represent software components and the relationships between them; for this domain we built an RDF description covering those relationships.

It should be noted that usually the inference power that draws upon an RDF representation is used to draw conclusions about types, for instance about the equivalence of XML tags: this is meta-information about data to be marked up in XML. We felt this would be confusing in the context of an example, as the RDF would then encode meta-information about types, which is meta-meta-information about data to be marked up with tags. While this was just as feasible as what we did, it would not be as clear – it seemed to have too many levels of indirection to be intuitive to the reader. By representing software components (rather than tags used to mark up software component descriptions) we have an example that is more concrete without any loss of generality.

We did spend some time trying to locate published ontologies about software, but found nothing of note. We felt that this was surprising, as it might be a very useful representation to have, as it would allow Semantic Web techniques to be used to identify and search for software components and tools.

The RDF Gateway tool we chose is quite a useful and powerful package, able to interact with existing databases, external data sources, and COM objects.

Figure 3: RDF Gateway Tool

We interacted with RDF Gateway using a browser, and used RDFQL, the RDF Query Language, to establish as well as query the semantic model, which is maintained within an RDF Gateway database.

We represented several components:

FFT code in C++

FIR filter code in C

Hidden Markov model code in C

Wavelet transforms in Java

For each component we had the source code (often in multiple files), documentation, and auxiliary files (e.g. Makefiles). We used this to write RDF describing each component; for example, for the HMM, we had:

hmm.c        The basic code file
hmm.h        The header file
hmmrand.c    Platform-independent random nums.
hmmutils.c   File I/O, matrix code
hmmtut.ps    PostScript documentation


We used the RDFQL query language to instantiate these dependencies one triple at a time, that is, walking through each such list and building the network representation. A fragment of the semantic network we were representing looks like this:

Figure 4: Our Example Semantic Network

Some of these links represent ‘Requires’; for example, hmm.c requires hmmrand.c. Some represent the implementation language, and some represent the documentation. The white boxes represent the abstract algorithm being implemented.

RDF for these links looked like this:

“source/hmm.c” requires “source/hmmutils.c”
“source/hmmtut.ps” documents “source/hmm.c”
“source/hmmutils.c” requires “sysutils.h”

We could then declare an inference rule, e.g.:

INFER {?A 'requires' ?C}
FROM {?A 'requires' ?B} AND {?B 'requires' ?C};

Here the “?” denotes a variable that will be filled by matching against an RDF triple. This rule says that you may infer that A requires C from having A requires B and B requires C; the inference engine that applies this to RDF does so recursively, so if either B or C requires other components, they too will be inferred to be required by A.
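The effect of this rule can be reproduced outside RDF Gateway with a short fixed-point loop. This is a sketch of transitive inference over the three triples quoted above, not a description of how the RDF Gateway inference engine is implemented; the function name `infer_requires` is our own.

```python
# The three triples quoted in the text, as (subject, predicate, object).
BASE = {
    ("source/hmm.c", "requires", "source/hmmutils.c"),
    ("source/hmmtut.ps", "documents", "source/hmm.c"),
    ("source/hmmutils.c", "requires", "sysutils.h"),
}


def infer_requires(triples):
    """Apply 'A requires B and B requires C implies A requires C' until nothing new appears."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        requires = [(s, o) for s, p, o in facts if p == "requires"]
        for a, b in requires:
            for b2, c in requires:
                if b == b2 and (a, "requires", c) not in facts:
                    facts.add((a, "requires", c))
                    changed = True
    return facts


inferred = infer_requires(BASE) - BASE
print(sorted(inferred))
```

Repeating the pass until no new triple is added is what makes the rule recursive: a dependency chain of any length is eventually collapsed into direct ‘requires’ links.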

We were successful at requesting the set of dependencies that we’d set up in the semantic network, both for ‘requires’ and ‘documents’; these were illustrative link types, and could trivially be extended to represent the full language environment (compiler, compiler version, etc.) and machine appropriate to each component, or to condition the returned list of source files based upon a given machine type and environment. The inference engine may be looked upon as a basic expert system, which may take input from a semantic web request and operate on that input and the local (and potentially more global) semantic network to compute and return an answer.


We built an interface to the inference engine using the RDF Gateway’s ability to generate HTML as output; it can of course equally output XML for use by other tools.

This experiment let us get into the inner workings of a Semantic Web server and see concretely how RDF could be used to represent useful information. The power of this representation is sufficient to allow the kind of inferences that are necessary in mapping one set of XML tags into another: where one application might wish to represent software dependencies via a directed graph, another might wish to use a list of required files for each component, and we demonstrated that we can convert from one to the other.

We felt that our choice of RDF as the unifying representational approach running through all of our areas was justified.

2.5 Conclusions

One nagging doubt arose as I began to understand the nature of the Semantic Web. As I remarked in our midterm presentation, I feel I have seen this problem before.

In the 1980’s, Artificial Intelligence was all the rage, and it seemed like answers were just around the corner – after all, the tools (such as semantic networks) were there, it seemed all that needed to be done was to assemble them appropriately for given domains. It didn’t turn out that way – working through the details uncovered problems of progressively greater depth. This didn’t mean that AI was a bad idea or wouldn’t eventually work in the way envisioned – but that it would take a lot more time to get there.

For example, consider the Cyc project, which aimed at representing broad common-sense knowledge; it was begun in 1984 by Doug Lenat, who was recently quoted as saying that it’s been going for 20 years, and probably will need another 20 years to complete. (Cyc does have useful applications today, and indeed can import and export Semantic Web ontologies – so this depth of effort may pay off to some degree in the context of the Semantic Web.)

It seems that Semantic Web implementers are in the grip of the same sort of enthusiasms that animated the AI community in the 1980’s: the basic technologies seem to be available, and the payoff of being able to search on and work with conceptual entities in the World Wide Web seems to offer vast new possibilities. What is ironic is that the very same problems that faced AI researchers, particularly in knowledge representation, have to be solved to make it work. Often this is quite striking, as current Semantic Web contributors seem unaware of nearly identical research performed earlier in AI.

For example, consider the underlying semantic network representation; this has been used and studied in the AI community for many years. There are known problems with semantic networks that Semantic Web work has yet to address, or even acknowledge. It is worth being specific about a few of these problems.

One is that even what people think of as a standard, well-defined link type may be ambiguous; consider the “is-a” link. One network might say that “Lassie” is-a “Dog”, while in another part of the network we might have “Dog” is-a “Mammal”.

The first example indicates that a particular instance is a member of a class; the second is a subclass-to-class relationship. This sort of confused use of a link type can give inference engines difficulty. (The problem may be fixed; one researcher, Patrick Winston of MIT, used a new “a-kind-of” link to represent class relationships - unfortunately this is not at all standard.)

Another problem is that the links (similar to UML links) may require cardinality on both sides. For instance, if the semantic network is being used to represent a compound object, it is important to explicitly represent the count of particular kinds of subobjects. One might imagine other properties it might be desirable to attribute to links as well, such as cost.

Related to this limited definition of links is the problem of quantification over sets; while it is possible to do with semantic networks, it is usually messy. It may be simple to represent “The man ate the apple” in a semantic network, but it takes a bit of doing to represent “Every man has eaten an apple” in a useful way. Other set operations (e.g. disjunction) are also hard; these may be done, but usually one has to introduce an explicit representation of sets and set operations for each such set one wants to work with. In short: it’s a mess.

In summary, there are no formal semantics for semantic networks, no accepted definitions of what nodes and arcs represent, no accepted way to represent ensembles of objects or cardinality of relationships. The AI community addressed these problems by coming up with Frames. The Semantic Web community hasn’t yet addressed these issues.

In conclusion, I think (as with AI) the Semantic Web efforts will in the near term succeed in some well-defined areas, probably in the context of particularly valuable, well-constrained interactions – these might involve automatic or assisted purchasing in a particular industry, for instance. Software component/service discovery and use within a constrained context may well be another. However, I don’t believe that the glorious possibilities that seem to be just around the corner will come to fruition any time soon; it will likely take decades of experience to uncover the thorny knowledge representation problems inherent in adequately representing the range of information required for something as all-encompassing as the Web.

Figure 4b: UML Cardinality Example


3 Software Component Retrieval By Upsorn Praphamontripong

3.1 Background

Fundamentally, a major promise of component-based software engineering is to reduce application development time and cost by reusing existing software components. To achieve high efficiency in software reuse, the existing reusable components that best satisfy users’ needs must be correctly located and retrieved. A software component may be composed of one or more components with explicit dependencies, and each component may consist of one or more methods. Let a software component be an object C = {m1, …, mk} consisting of a set of methods mj, let a library (or repository) L = {C1, …, Cn} be a set of existing reusable components Ci, and let a query Q describe a desired component C. Simply put, the software component retrieval problem is to find a match <C, Ci> that satisfies a given similarity measure.

Many techniques have been used in software component retrieval systems. Traditionally, keyword-based querying and Boolean operations have played important roles in the retrieval process. Keyword-based querying focuses on extracting information or documents related to a keyword, and on comparing a keyword-based query against software component descriptions represented by a structured list of pre-defined descriptive keywords. Synonyms (different words with the same meaning) and homonyms (words with different meanings in different contexts) can be handled by adopting a dictionary or thesaurus such as WordNet, which supports simple word-for-word substitution as well as a more sophisticated generalization-specialization hierarchy [Upsorn-2]. A candidate component is selected when the keywords that form its representation match all of the keywords that form the query. Because the concept of keyword-based querying is simple, it is widely used in both information retrieval and software component retrieval systems.

To enhance the retrieval process, Boolean operators (AND, OR, NOT, or alternatively NEAR or WITHIN) are applied to keyword-based querying; this is called Boolean querying. In Boolean querying, Boolean operators combine several queries to express requirements more specifically. To perform retrieval, the terms specified in a given query are used as index terms, or keywords, stored in an inverted list. These keywords are located in the inverted list, which gives the set of associated base documents. Base documents are searched in order, i.e., from left to right and with the most specific operators first. Those documents for which the query is true are considered candidates and are retrieved.
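Boolean querying over an inverted list can be sketched in a few lines. The following Python example is only an illustration: the component names and keywords in the library dictionary are hypothetical, and the helper names build_inverted_index and lookup are assumptions. It maps AND, OR, and NOT onto set intersection, union, and difference.

```python
from collections import defaultdict


def build_inverted_index(descriptions):
    """Map each keyword to the set of component names it describes."""
    index = defaultdict(set)
    for name, keywords in descriptions.items():
        for keyword in keywords:
            index[keyword.lower()].add(name)
    return index


def lookup(index, term):
    """Return the components indexed under a term (empty set if none)."""
    return set(index.get(term.lower(), set()))


# Hypothetical library of keyword-described components.
library = {
    "ArrayStack": ["stack", "push", "pop", "array"],
    "LinkedQueue": ["queue", "enqueue", "dequeue", "list"],
    "ArrayQueue": ["queue", "enqueue", "dequeue", "array"],
}
index = build_inverted_index(library)
everything = set(library)

# AND is set intersection, OR is union, NOT is complement within the library.
queue_and_array = lookup(index, "queue") & lookup(index, "array")
stack_or_list = lookup(index, "stack") | lookup(index, "list")
array_not_queue = lookup(index, "array") & (everything - lookup(index, "queue"))

print(queue_and_array, stack_or_list, array_not_queue)
```

Proximity operators such as NEAR or WITHIN would need word positions in the index and are left out of this sketch.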

Moreover, behavior-based methods are among the most commonly used retrieval techniques. Behavior-based software component retrieval focuses on executable software components and is therefore sometimes called execution-based retrieval [Upsorn-3]. All components in the repository whose behavior has features in common with the component behavior described in a query are identified. Examples of input and output data are given by the users, and components whose outputs match those specified for the given input data are retrieved [Upsorn-5]. Other interesting techniques, signature matching and formal specification matching, focus on the level of individual functional units within a component [Upsorn-4]. Jeng and Cheng [Upsorn-6] introduced a two-step retrieval process based on specification matching. Zaremski and Wing [Upsorn-7] applied specification matching to the retrieval system. Queries and existing components are represented as (pre-condition, post-condition) pairs.

These approaches rely primarily on formal specifications, that is, mechanisms for expressing a software component in mathematically based languages [Upsorn-1] such as Z, Larch/ML, and the Vienna Development Method (VDM). Although a concise and precise specification of software components is desirable, these formal languages demand knowledge of and familiarity with the notation from anyone describing software components. Furthermore, in practice, the software reuse and retrieval environment involves a broader class of documents or artifacts than source code alone. These include design and analysis documents in Unified Modeling Language (UML) notation and in textual format, as well as source code and textual component interface definitions. Formal specifications may not be able to support all of these artifacts. For these reasons, it would be more convenient for developers or users to express component behavior in a natural language such as English. This raises its own difficulties, because developers may describe the same component behavior in different ways or with different choices of terms, which may impede the retrieval process. Accordingly, a systematic mechanism to guide the expression of components is needed.

Furthermore, with the advancement of Internet technologies, retrieving software components over the web has become more convenient, as developers can locate and access available reusable software components at any time and from anywhere. Hence, there has been a tremendous amount of activity in the development of reusable software components and of the mediators used to locate and retrieve them. To support the exchange of component descriptions over the Internet, a metadata format such as XML should be used. XML-based specifications have already been defined and constructed in several software component retrieval projects such as AUTOSOFT and MOOSE [Upsorn-8,9,10]. Therefore, to better understand how XML can be used in software component retrieval, a small software component retrieval system was developed in this project. Domain-specific XML-based software component specifications were constructed, and existing components are retrieved according to the match type and its similarity measurement. The implementation nonetheless revealed some difficulties. Although a semi-structured format like XML is used, the terms used in tags and contents still rely on English. Matching different word forms or different spellings may produce an incorrect result set, and matching the same word form with different possible meanings also produces unexpected results. Additionally, in practice developers prefer to integrate their specifications, which may be written in different standards or formats. Therefore, RDF will be adopted to improve matching of the semantic contents as well as to integrate multiple specifications when constructing a large, complex software application.

In the next section, an overview of the retrieval system will be discussed. Then, XML-based software component specification will be presented, followed by the match determination and the similarity measurement. Finally, the adoption of RDF and future work will be addressed.

3.2 Overview of Architecture

The basic architecture of the software component retrieval system implemented in this work consists of two main parts, the specification analyzer and the match maker, as illustrated in Figure 5. The specification analyzer extracts and interprets the XML-like software component specification. That is, the spelling of the units, the grammatical structure, and the semantic content of the specifications are analyzed, verified, and extracted. Then, the match maker performs matching according to the match types defined in Section 3.4, and similarity values are computed at the same time. Finally, the system returns a list of candidate existing components along with their similarity measurements.

Figure 5: An overview architecture of the software component retrieval system.

3.3 XML-based Software Component Specification

An XML-based software component specification is a software component specification based on XML technology. Because we emphasize “what” a component should do rather than “how” it should do it, the information needed to distinguish a component is its signature and its behavior. Since a component may naturally have one or more methods, we apply the signature and behavior concepts at the method level. Therefore, for each method, the structure of an XML-based software component specification is divided into two main parts: (1) input and output type information and (2) behavior. The input and output type information is used to ensure compatible signatures between components. The behavior of a component is expressed in pre-condition and post-condition predicates, where the former describes the starting state and the latter the final state of each method belonging to a component.

To demonstrate our idea according to the characteristics of a component addressed previously, four cases of software component specifications are constructed as follows: a component whose method has simple individual conditions without a Boolean join (shown in Figure 6); a component whose method has simple conditions with a Boolean join (shown in Figure 7); a component whose method has multi-level conditions (shown in Figures 8 and 9); and a component that inherits characteristics from another component, a so-called “is-a” component (shown in Figure 10).

[Figure 5 diagram labels: Query Component Specification; Library Component Specification; Specification Analyzer; Extracted query component specification; Extracted library component specification; Match Maker; Retrieved components]

A component element delimits the information block for a particular component. Each component is identified by the content of its cname element. Since a component may have one or more methods, a method element delimits the information block for a particular method. For each method, a paramNo element gives the number of parameters, where parameters are the inputs of the method. A list of param elements contains the input information. Each param element optionally contains a parameter name and a parameter size, given in pname and psize elements. Only the parameter’s type, specified in a ptype element, is required, and it is used for type comparison.

To express the behavior of a component, we consider the behavior of each of its methods. While a method may or may not have pre-conditions, it must contain at least one post-condition to state what holds after the method completes. A pre-condition is a predicate specifying the starting state that must hold before a method executes. Each pre-condition is expressed in a precond element. For a method with simple conditions, a precond element is composed of three basic elements: a left element, an operator element, and a right element, specifying the left operand, the operator, and the right operand of the condition, respectively. A post-condition, in contrast, is a predicate describing the final state that must hold after a method executes. Like a pre-condition, a post-condition, expressed in a postcond element, comprises a left element, an operator element, and a right element. Note that the left element is optional for a condition such as EXIST X, where EXIST is the operator and X is the right operand.

Though the left-operator-right predicate is the basic form of both pre-conditions and post-conditions, it can be extended into a complex one, for example by using Boolean operators (and, or) to join a successor predicate with a predecessor predicate at the same level, or by forming nested conditions. Moreover, each method generally produces some kind of result, called its output. While the input information is needed to verify the type of the inputs passed to a method, the output information is also required to ensure that a method generates compatible results. The output information on which we focus is the type of the return value, specified in a return element.
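To make the element structure concrete, the following Python sketch reads a specification with simple (non-nested) conditions using the standard xml.etree.ElementTree module. It is not the project's specification analyzer; the sample component example_math, its method root, and the helper names are hypothetical, and nested subcond elements are deliberately not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical specification using the element names described above.
SPEC = """
<component>
  <cname> example_math </cname>
  <method>
    <mname> root </mname>
    <paramNo> 1 </paramNo>
    <param> <pname> j </pname> <ptype> real </ptype> </param>
    <precond>
      <left> j </left> <operator> greater_or_equal </operator> <right> 0 </right>
    </precond>
    <postcond>
      <operator> EXIST </operator> <right> k </right>
    </postcond>
    <return> real </return>
  </method>
</component>
"""


def child_text(elem, tag):
    """Return the stripped text of a direct child, or None if absent or empty."""
    value = elem.findtext(tag)
    value = value.strip() if value else ""
    return value or None


def parse_condition(cond):
    """Read a simple condition as a (left, operator, right) tuple; left may be None."""
    return (child_text(cond, "left"),
            child_text(cond, "operator"),
            child_text(cond, "right"))


def parse_method(method):
    post = [parse_condition(c) for c in method.findall("postcond")]
    if not post:
        # The text requires at least one post-condition per method.
        raise ValueError("a method must declare at least one post-condition")
    return {
        "name": child_text(method, "mname"),
        "input_types": [child_text(p, "ptype") for p in method.findall("param")],
        "pre": [parse_condition(c) for c in method.findall("precond")],
        "post": post,
        "returns": child_text(method, "return"),
    }


def parse_component(xml_text):
    root = ET.fromstring(xml_text.strip())
    return {
        "name": child_text(root, "cname"),
        "methods": [parse_method(m) for m in root.findall("method")],
    }


component = parse_component(SPEC)
print(component)
```

Only ptype is read from each param, since the text states that the parameter type is the only required piece of input information.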


Figure 6: An XML-based software component specification with a set of individual conditions

<component>
  <cname> component_name </cname>
  <method>
    <mname> method_name1 </mname>
    <paramNo> no_of_parameter </paramNo>
    <param>
      <pname> parameter_name1 </pname>
      <ptype> parameter_type1 </ptype>
      <psize> parameter_size1 </psize>
    </param>
    …
    <precond>
      <left> left_operand1 </left>
      <operator> operator1 </operator>
      <right> right_operand1 </right>
    </precond>
    <precond>
      …
    </precond>
    …
    <postcond>
      <left> left_operand1 </left>
      <operator> operator1 </operator>
      <right> right_operand1 </right>
    </postcond>
    <postcond>
      …
    </postcond>
    …
    <return> return_type </return>
  </method>
  …
</component>

Figure 7: An XML-based software component specification with Boolean conditions

<postcond>
  <subcond>
    <left> left_operand1 </left>
    <operator> operator1 </operator>
    <right> right_operand1 </right>
  </subcond>
  <and>
    <subcond>
      <left> left_operand2 </left>
      <operator> operator2 </operator>
      <right> right_operand2 </right>
    </subcond>
  </and>
  …
</postcond>

Figure 8: An XML-based software component specification with nested Boolean conditions

<postcond>
  <subcond>
    <left> left_operand1 </left>
    <operator> operator1 </operator>
    <right> right_operand1 </right>
    <and>
      <subcond>
        <left> left_operand2 </left>
        <operator> operator2 </operator>
        <right> right_operand2 </right>
        <and>
          <subcond>
            <left> left_operand3 </left>
            <operator> operator3 </operator>
            <right> right_operand3 </right>
          </subcond>
          …
        </and>
      </subcond>
    </and>
  </subcond>
</postcond>

Figure 9: An XML-based software component specification with nested conditions

<postcond>
  <left>
    <subcond>
      <left> left_operand1 </left>
      <operator> operator1 </operator>
      <right> right_operand1 </right>
    </subcond>
  </left>
  <operator> operator2 </operator>
  <right> right_operand2 </right>
</postcond>

Figure 10: An XML-based software component specification with an “isa” relation

<component>
  <cname> component_name1 </cname>
  <method>
    …
  </method>
</component>
<component>
  <cname> component_name2 </cname>
  <isa> component_name1 </isa>
  <method>
    …
  </method>
</component>


3.4 Match Definitions

Let Q and L denote a query and an existing component stored in a library, respectively. The subscripts in, out, pre, and post indicate the input type information, output type information, pre-condition, and post-condition of a component, respectively. In this project, five forms of matches are considered.

3.4.1 Exact Match

$\mathit{MatchExact}(Q, L) = (Q_{pre} \equiv L_{pre}) \land (Q_{post} \equiv L_{post}) \land (Q_{in} \equiv L_{in}) \land (Q_{out} \equiv L_{out})$

A component described by a query Q matches an existing component L in a library exactly if the component behavior described by L provides everything needed by Q. Thus, L may be used in any implementation of a component described by Q without additional modification.

3.4.2 Generalization Match

$\mathit{MatchGen}(Q, L) = (L_{pre} \subseteq Q_{pre} \land L_{post} \subseteq Q_{post})$

An existing component L in a library is said to be a generalized version of a component described by Q if all component behaviors described in L are included in Q, while some behaviors described by Q may not be included in L. Thus, L may be used in any implementation of a component described by Q with some modification.

3.4.3 Specialization Match

$\mathit{MatchSpec}(Q, L) = (Q_{pre} \subseteq L_{pre} \land Q_{post} \subseteq L_{post})$

An existing component L in a library is said to be a specialized version of a component described by Q if all component behaviors described in Q are included in L, while some behaviors described by L may not be included in Q. Thus, L may be used in any implementation of a component described by Q with some modification.

3.4.4 Partial Match

$\mathit{MatchPart}(Q, L) = (\mathit{MatchGen}(Q, L) \lor \mathit{MatchSpec}(Q, L)) \land (\exists L'_{in} \subseteq L_{in}\,(L'_{in} = Q_{in}) \lor \exists L'_{out} \subseteq L_{out}\,(L'_{out} = Q_{out}))$

An existing component in a library is said to partially match a component described by Q if it is either a more general version or a more specific version of Q and some input or output type information of L is similar to that of Q. Thus, L may be used in any implementation of a component described by Q with some modification.

3.4.5 Reference Match

$\mathit{MatchRef}(Q, L) = (\mathit{MatchExact}(Q, l) \lor \mathit{MatchPart}(Q, l)) \lor (\mathit{MatchExact}(q, L) \lor \mathit{MatchPart}(q, L))$

Let q and l be specifications describing a desired component and an existing component in the library, and let q and l inherit from Q and L, respectively. Q and L are said to referentially match if a referred component l of L matches the query Q exactly or partially, or a referred component q of Q matches the library component L exactly or partially. Thus, L may be used in any implementation of a component described by Q with some modification.
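The match forms can be sketched in Python under one particular reading of the definitions, which is an assumption and not necessarily the project's implementation: each pre-condition or post-condition is a comparable (left, operator, right) tuple, "included in" means set inclusion, and exact match means equality of all four parts. The Spec class and function names are hypothetical, and the reference match is omitted because it additionally needs the inheritance links between specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical container: conditions and type information as sets."""
    pre: frozenset
    post: frozenset
    inputs: frozenset
    outputs: frozenset


def match_exact(q, l):
    return (q.pre == l.pre and q.post == l.post
            and q.inputs == l.inputs and q.outputs == l.outputs)


def match_gen(q, l):
    # Everything described by L is also described by Q.
    return l.pre <= q.pre and l.post <= q.post


def match_spec(q, l):
    # Everything described by Q is also described by L.
    return q.pre <= l.pre and q.post <= l.post


def match_part(q, l):
    # Some subset of L's input (or output) types equals Q's types.
    types_covered = q.inputs <= l.inputs or q.outputs <= l.outputs
    return (match_gen(q, l) or match_spec(q, l)) and types_covered


def match_forms(q, l):
    """Return the names of all match forms that hold for the pair."""
    checks = {"exact": match_exact, "generalization": match_gen,
              "specialization": match_spec, "partial": match_part}
    return {name for name, check in checks.items() if check(q, l)}


non_negative_in = ("j", "greater_or_equal", "0")
non_negative_out = ("k", "greater_or_equal", "0")
is_root = ("k power 2", "equal", "j")

query = Spec(frozenset({non_negative_in}),
             frozenset({non_negative_out, is_root}),
             frozenset({"real"}), frozenset({"real"}))
# A library component that promises only one of the two post-conditions.
library_component = Spec(frozenset({non_negative_in}),
                         frozenset({non_negative_out}),
                         frozenset({"real"}), frozenset({"real"}))

print(sorted(match_forms(query, library_component)))
```

Returning every form that holds, rather than a single label, avoids assuming a ranking among the forms that the definitions do not state.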

3.5 Similarity Measurement

Beyond the form of a match, the closeness between components is also important. In this project, we compute the similarity between the existing components and the query at several levels, aiming to give users detailed information to help them determine which component to reuse. Each measurement is associated with a weight between 0 and 1, with all weights adding up to 1.0. Moreover, weights may be assigned according to the importance that users attach to each measurement.

3.5.1 Component Similarity

$\mathit{SimComp}(Q, L) = \left( \sum_{i=1}^{N} \mathit{SimMethod}(\mathit{Method}_{Q_i}, \mathit{Method}_{L_i}) \right) / N$

Component similarity is the sum of the similarity measures of its N methods, divided by N.

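3.5.2 Method Similarity

$\mathit{SimMethod}(\mathit{Method}_Q, \mathit{Method}_L) = W_{Precond} \cdot \mathit{SimPrecond}(\mathit{Method}_Q, \mathit{Method}_L) + W_{Postcond} \cdot \mathit{SimPostcond}(\mathit{Method}_Q, \mathit{Method}_L) + W_{Input} \cdot \mathit{SimInput}(\mathit{Method}_Q, \mathit{Method}_L) + W_{Output} \cdot \mathit{SimOutput}(\mathit{Method}_Q, \mathit{Method}_L)$

Method similarity is the weighted sum of the similarity of its pre-conditions, post-conditions, input parameters, and return types (output).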

3.5.3 Pre-condition Similarity

$\mathit{SimPrecond}(\mathit{Method}_Q, \mathit{Method}_L) = \sum_{i=1}^{N} W_i \cdot \mathit{SimPrecond}(\mathit{cond}_{Q_i}, \mathit{cond}_{L_i})$

Pre-condition similarity is a weighted sum of the similarity of the pre-conditions of the pair of methods. Suppose a method of a component described by Q has NQ pre-conditions and a method of an existing component L has NL pre-conditions; then N is obtained from min(NQ, NL).

3.5.4 Post-condition Similarity

$\mathit{SimPostcond}(\mathit{Method}_Q, \mathit{Method}_L) = \sum_{i=1}^{N} W_i \cdot \mathit{SimPostcond}(\mathit{cond}_{Q_i}, \mathit{cond}_{L_i})$

Post-condition similarity is a weighted sum of the similarity of the post-conditions of the pair of methods. Suppose a method of a component described by Q has NQ post-conditions and a method of an existing component L has NL post-conditions; then N is obtained from min(NQ, NL).


3.5.5 Input Similarity

$\mathit{SimInput}(\mathit{Method}_Q, \mathit{Method}_L) = M / N$

Input similarity is the ratio of the number of inputs that match (M) to the total number of inputs of the methods (N). Suppose a method of a component described by Q has NQ inputs and a method of an existing component L has NL inputs; then N is obtained from max(NQ, NL).

3.5.6 Output Similarity

$\mathit{SimOutput}(\mathit{Method}_Q, \mathit{Method}_L) = M / N$

Output similarity is the ratio of the number of outputs that match (M) to the total number of outputs of the methods (N). Suppose a method of a component described by Q has NQ outputs and a method of an existing component L has NL outputs; then N is obtained from max(NQ, NL).

3.6 Illustration

Figure 11 displays an example query and the result of the retrieval process. We use the component described by query q2 as an example to determine the match and to evaluate the similarity values. Suppose the weights given to the input, output, pre-condition, and post-condition of query q2 are 10%, 10%, 30%, and 50%, respectively, and assume that all pre-conditions are of equal importance, as are all post-conditions. Consider an existing component math1, which contains a method named sqr. The input and output similarities are both equal to 1, as all inputs and all outputs of q2 match those of math1. Furthermore, as all pre-conditions described in q2 match those of math1, the pre-condition similarity is 1. The post-condition similarity is 50%(0) + 50%(1) = 50%, as one out of the two post-conditions of q2 matches those of math1. Thus, the method similarity is 50%(50%) + 30%(100%) + 10%(100%) + 10%(100%) = 75%. Accordingly, since component math1 consists of one method (sqr), the component similarity of query q2 and existing component math1 is 75%.
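The arithmetic of this example can be checked with a short Python sketch of the similarity formulas from Section 3.5. The function names are assumptions, the stated weights 10, 10, 30, and 50 are treated as the fractions 0.10, 0.10, 0.30, and 0.50, and the input and output similarities of 1 are taken directly from the text rather than computed from math1's actual signature, which is not shown.

```python
def condition_similarity(matches, weights):
    """Weighted sum over paired conditions; each match value is 0 or 1."""
    return sum(w * m for w, m in zip(weights, matches))


def fraction_similarity(matched, n_query, n_library):
    """M / N, where N = max(NQ, NL); used for inputs and outputs."""
    total = max(n_query, n_library)
    return matched / total if total else 1.0


def method_similarity(sim_pre, sim_post, sim_in, sim_out, weights):
    return (weights["pre"] * sim_pre + weights["post"] * sim_post
            + weights["in"] * sim_in + weights["out"] * sim_out)


def component_similarity(method_sims):
    """Sum of the method similarities divided by the number of methods."""
    return sum(method_sims) / len(method_sims)


weights = {"in": 0.10, "out": 0.10, "pre": 0.30, "post": 0.50}

# q2 against math1: the single pre-condition matches,
# and one of the two equally weighted post-conditions matches.
sim_pre = condition_similarity([1], [1.0])
sim_post = condition_similarity([0, 1], [0.5, 0.5])
sim_in = 1.0   # stated in the text
sim_out = 1.0  # stated in the text

sim_method = method_similarity(sim_pre, sim_post, sim_in, sim_out, weights)
sim_component = component_similarity([sim_method])
print(f"method: {sim_method:.2%}, component: {sim_component:.2%}")
```

With the same weights, a post-condition similarity of 0 gives a method similarity of 50%, which agrees with the math3 row of the similarity table in Figure 11.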


Figure 11: An example query and result from the retrieval process

<query>
  <component>
    <cname> q2 </cname>
    <method>
      <mname> m2 </mname>
      <paramNo> 2 </paramNo>
      <param>
        <pname> k </pname>
        <ptype> real </ptype>
      </param>
      <param>
        <pname> j </pname>
        <ptype> real </ptype>
      </param>
      <precond>
        <left> j </left>
        <operator> greater_or_equal </operator>
        <right> 0 </right>
      </precond>
      <postcond>
        <left> k </left>
        <operator> greater_or_equal </operator>
        <right> 0 </right>
      </postcond>
      <postcond>
        <left>
          <subcond>
            <left> k </left>
            <operator> power </operator>
            <right> 2 </right>
          </subcond>
        </left>
        <operator> equal </operator>
        <right> j </right>
      </postcond>
      <return> k </return>
    </method>
  </component>
</query>

Display Similarity Table
Format: component name, method name, Sim-Post, Sim-Pre, Sim-Param, Sim-Return, method-sim, match_type, ref_match
math1, sqr, 50.0, 100.0, 100.0, 100.0, 75.0, general
math2, sqr2, 100.0, 100.0, 100.0, 100.0, 100.0, exact
math3, sqr, 0.0, 100.0, 100.0, 100.0, 50.0, partial

Component name: q2
Parameter ---->
  Method: m2 k
  Method: m2 j
Pre condition ---->
  Method: m2 weight =100.00% j greater_or_equal 0
Post condition ---->
  Method: m2 weight =50.00% k greater_or_equal 0
  Method: m2 weight =50.00% (k power 2) equal j
Return value ---->
  Method: m2 k

Although XML-based specifications seem powerful for software component retrieval, some issues need to be addressed. In practice, several component developers or vendors are involved in the development of software applications, and they may use the same or different standards to describe software components. Their choices of terms may also differ. Consequently, it may not always be possible to convert or transform among specifications. This leads to three major concerns: i) the use of different tags referring to the same element; ii) the use of different terms referring to the same content; and iii) the integration of specifications. Accordingly, a mechanism to deal with these concerns is needed.

3.7 Adoption of RDF

To overcome the difficulties that we have experienced while using XML-based specifications (as addressed in the previous section), RDF, a major component of the semantic web, is adopted. Basically, RDF relies on triples consisting of subject, predicate, and object. Thus, it naturally allows mapping between objects via relations.

Consider the case in which several developers are involved in software development and have agreed to describe software components using XML. They may still use different tags to refer to the same element. For example, two developers may want to specify the type of some parameters, where one uses an “<input-type>” tag and the other uses a “<datatype>” tag. Such inconsistent tags may impede retrieval, and the result may be misleading. Since these tags are expected to refer to the same element, they should be substitutable for each other. One possible way to deal with this situation is to use RDF to map between these tags with a relation like “is-substitute-for.” As a result, transformation among specifications is not needed prior to retrieval.

Additionally, multiple developers may use different terms to express the same content. For instance, two developers who deal with array operations may use the terms “add” and “insert” for the operation that places an element into an array. Presumably, the developers expect the terms “add” and “insert” to be used interchangeably. Hence, RDF may be used to map between these terms with a relation such as “is-a-synonym-of” or “is-synonym-for.” There should be no restriction on the choice of relation, as long as it best expresses the dependency between the objects.
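The tag and term mappings described above can be sketched with plain subject-predicate-object tuples. The following Python example is an illustration only and does not use an RDF library: the choice to treat the object of each triple as the canonical name, and the helper names canonical_map, normalize, and same_meaning, are assumptions.

```python
# Mapping triples in subject-predicate-object form, using the examples from the text.
TRIPLES = [
    ("input-type", "is-substitute-for", "datatype"),
    ("insert", "is-a-synonym-of", "add"),
]

# Relations treated as "these two names mean the same thing".
EQUIVALENCE_RELATIONS = {"is-substitute-for", "is-a-synonym-of", "is-synonym-for"}


def canonical_map(triples):
    """Map each subject to its object for every equivalence relation."""
    return {s: o for (s, p, o) in triples if p in EQUIVALENCE_RELATIONS}


def normalize(name, canon):
    """Follow mappings until a canonical name is reached (guarding against cycles)."""
    seen = set()
    while name in canon and name not in seen:
        seen.add(name)
        name = canon[name]
    return name


def same_meaning(a, b, canon):
    return normalize(a, canon) == normalize(b, canon)


canon = canonical_map(TRIPLES)
print(same_meaning("input-type", "datatype", canon))
print(same_meaning("insert", "add", canon))
print(same_meaning("insert", "delete", canon))
```

Normalizing names at comparison time means the specifications themselves stay unchanged, which is the point made in the text about not transforming specifications before retrieval.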

Multiple developers or vendors may be involved in the development of a piece of software, each responsible for particular parts of it. Some may specify their components using the same standard, while others use different ones. Although a software component may be deployed independently, it is frequently composed with others to form a large, complex system, so these developers will eventually want to integrate their parts into a larger application structure. Because the specifications may be written in different formats or standards, it may not always be feasible to convert among standards, and doing so is very time-consuming. To facilitate the integration of specifications, RDF offers an alternative way to map between component specifications. Relations that best express the dependencies between components should be used. For example, a component may be mapped to the component from which it inherits properties using an “inherit” relation. As another example, suppose there are three components A, B, and C, where A inherits from B and B inherits from C, so that A indirectly inherits from C. Because reasoning over RDF triples can conclude A → C from A → B and B → C, RDF may be applied to handle this situation.
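
The indirect inheritance example can be made concrete with a small closure computation. This is a minimal sketch, assuming the “inherit” relation is declared transitive by a rule that the sketch itself supplies; the function name infer_transitive and the tuple representation are illustrative only.

```python
def infer_transitive(triples, predicate="inherit"):
    """Return the given triples plus all those implied by transitivity of `predicate`."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges so the set can grow inside the loops.
        edges = [(s, o) for s, p, o in inferred if p == predicate]
        for a, b in edges:
            for c, d in edges:
                if b == c and (a, predicate, d) not in inferred:
                    inferred.add((a, predicate, d))
                    changed = True
    return inferred


# A inherits from B, and B inherits from C.
stated = {("A", "inherit", "B"), ("B", "inherit", "C")}
closure = infer_transitive(stated)
```

After the call, closure also contains ("A", "inherit", "C"), which is the indirect relationship described in the text.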

3.8 Future Work

Our experiments suggest that XML is powerful for software component retrieval, but they also show some difficulties in using XML to describe software component specifications: developers may use different tags and terms to refer to the same element or the same content. In addition, since developers may specialize in different standards, specifications may be written in different standards or formats, and combining such specifications is difficult. Our future work will therefore address the following points.

We will generate more test cases to better understand how RDF can overcome these difficulties.

As the examples show, the tags and terms used are normally domain-specific. We will explore mechanisms for handling domain-specific vocabulary that keep it flexible while remaining meaningful.

Furthermore, although users of RDF do not have to worry about transforming among standards, the retrieval tool itself should implement some form of interpretation in order to support multiple specification standards.

Lastly, since many developers or vendors may be involved in the development of software, authorization and privacy become concerns. Role-based access control may be applied to restrict developers from accessing particular parts of the component specifications.

4 OpenDocument Exchange
By Jason Hayden

4.1 Introduction

The ability to create and exchange office documents across multiple document editors is crucial to facilitating the exchange of data. This is accomplished by storing information in the XML-based file format OpenDocument. The OpenDocument format is an open document file format for saving and exchanging documents across multiple editors [Jason-1]. It was developed by the OASIS consortium, whose members include large corporations such as Adobe, Corel, IBM, and Sun Microsystems [Jason-2]. The format was ratified on May 1, 2005. It was originally based on a file format created by OpenOffice.org but has been greatly modified and enhanced since its inception. The specification is available under a royalty-free license without any restrictions on how it can be implemented.

Currently, a wide variety of document editors can create and edit OpenDocument files, including AbiWord, IBM Workplace, KOffice, OpenOffice, StarOffice, and TextMaker [Jason-4]. One of the main attractions of the OpenDocument format is the ability to exchange office documents across a variety of document editors. In this study, the validity of this claim is investigated by creating multiple OpenDocument files and opening them in different applications. Each document is examined to ensure that the information entered, the formatting of the text, and the images created are the same across all the document editors. One item studied in detail is the portability of complex formulas in a spreadsheet document. The OpenDocument specification defines that formulas are stored in the attribute table:formula but does not define the syntax of formulas. For example, the SUM( ) function in one application might accept different inputs than the SUM( ) function in another. Problems that occur with incompatible function inputs are documented and explored.

4.2 Format Specifications

The OpenDocument format can describe text documents, spreadsheets, presentations, drawings, images, charts, mathematical formulas, databases, and document templates. Table 1 lists the document types and their file extensions.

File Type               Extension
Text                    .odt
Text Template           .ott
Spreadsheet             .ods
Spreadsheet Template    .ots
Presentation            .odp
Presentation Template   .otp
Drawing                 .odg
Drawing Template        .otg
Chart                   .odc
Formula                 .odf
Database                .odb
Image                   .odi

Table 1: File Types of OpenDocument [Jason-1]

An OpenDocument file is a compressed archive consisting of multiple files and directories. It includes the files meta.xml, settings.xml, styles.xml, content.xml, manifest.xml, and thumbnail.png, in addition to the folders Pictures and Configurations2. The only files required to create a valid OpenDocument file are content.xml and manifest.xml.
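
The package layout can be inspected with the ZIP support in the Python standard library. The following is a minimal sketch, assuming the package is an ordinary ZIP archive; the helper names build_package and missing_required are hypothetical. Packages written by real editors typically keep the manifest inside a META-INF directory, so the check below compares base names rather than full paths.

```python
import io
import zipfile

# The two files the text names as required.
REQUIRED = {"content.xml", "manifest.xml"}


def build_package(members):
    """Build an in-memory archive from a {path: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in members.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def missing_required(package_bytes):
    """Return the required file names that are absent from the archive."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        # Compare on base names so a file stored in a subdirectory still counts.
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return REQUIRED - present
```

Calling missing_required on the bytes of a saved document returns an empty set when both required files are present.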

4.2.1 Meta.xml

The meta.xml file is an XML file that contains information describing the document, such as the title of the document, its creation date, its creator, and the number of times it has been edited. The root element of the file is <office:document-meta>, which contains one child element, <office:meta>. The elements within <office:meta> hold the actual information about the particular OpenDocument file. An example of this file is shown in Figure 12.

<office:document-meta>
  <office:meta>
    <meta:generator>OpenOffice.org/2.0$Win32 OpenOffice.org_project/680m3$Build-8968</meta:generator>
    <meta:initial-creator>Jason Hayden</meta:initial-creator>
    <meta:creation-date>2005-12-02T11:06:07</meta:creation-date>
    <dc:creator>Jason Hayden</dc:creator>
    <dc:date>2005-12-02T11:07:09</dc:date>
    <dc:language>en-US</dc:language>
    <meta:editing-cycles>2</meta:editing-cycles>
    <meta:editing-duration>PT1M6S</meta:editing-duration>
    <meta:user-defined meta:name="Info 1" />
    <meta:user-defined meta:name="Info 2" />
    <meta:user-defined meta:name="Info 3" />
    <meta:user-defined meta:name="Info 4" />
    <meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="3" meta:character-count="18" />
  </office:meta>
</office:document-meta>

Figure 12: Example meta.xml file
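
To show how an editor or script might read this metadata, here is a minimal sketch using xml.etree.ElementTree from the Python standard library. The figure omits the namespace declarations, so the sample below adds them; the read_meta function and the keys of the dictionary it returns are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the prefixes used in the figure.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A shortened copy of the figure with the namespace declarations added.
SAMPLE = """<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <office:meta>
    <meta:initial-creator>Jason Hayden</meta:initial-creator>
    <meta:creation-date>2005-12-02T11:06:07</meta:creation-date>
    <meta:editing-cycles>2</meta:editing-cycles>
    <meta:document-statistic meta:page-count="1" meta:word-count="3"/>
  </office:meta>
</office:document-meta>"""


def read_meta(xml_text):
    """Pull a few descriptive fields out of a meta.xml document."""
    root = ET.fromstring(xml_text)
    meta = root.find("office:meta", NS)
    stats = meta.find("meta:document-statistic", NS)
    # Attribute names must be given in {namespace}local form.
    word_count_key = "{%s}word-count" % NS["meta"]
    return {
        "creator": meta.findtext("meta:initial-creator", namespaces=NS),
        "created": meta.findtext("meta:creation-date", namespaces=NS),
        "editing_cycles": int(meta.findtext("meta:editing-cycles", namespaces=NS)),
        "word_count": int(stats.get(word_count_key)),
    }
```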

4.2.2 Settings.xml

The settings.xml file records option settings of the application that created the file. This information is independent of the document content and describes property settings of the application. An example file is shown in Figure 13. The root element of the file is <office:document-settings>, which contains the element <office:settings>. The <office:settings> element contains a number of <config:config-item-set> elements, each with a config:name attribute that identifies the group of settings being described. Within each <config:config-item-set> there are <config:config-item> elements that carry config:name and config:type attributes. The config:name attribute names the application option, and config:type gives the data type of the value. The content of the <config:config-item> element is the value of the named option, which the document editor uses as its setting.

<office:document-settings>
  <office:settings>
    <config:config-item-set config:name="ooo:view-settings">
      <config:config-item config:name="ViewAreaTop" config:type="int">0</config:config-item>
      <config:config-item config:name="ViewAreaLeft" config:type="int">0</config:config-item>
      …
      <config:config-item config:name="CurrentDatabaseCommand" config:type="string" />
      <config:config-item config:name="PrintDrawings" config:type="boolean">true</config:config-item>
      <config:config-item config:name="PrintBlackFonts" config:type="boolean">false</config:config-item>
    </config:config-item-set>
  </office:settings>
</office:document-settings>

Figure 13: Example settings.xml
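
The role of config:type can be illustrated by converting each item to a typed value. This is a sketch only: the sample adds the namespace declarations that the figure omits, the CONVERTERS table covers just the three types that appear in the figure, and read_settings is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
CONFIG_NS = "urn:oasis:names:tc:opendocument:xmlns:config:1.0"

# A shortened copy of the figure with namespace declarations added.
SAMPLE = f"""<office:document-settings xmlns:office="{OFFICE_NS}" xmlns:config="{CONFIG_NS}">
  <office:settings>
    <config:config-item-set config:name="ooo:view-settings">
      <config:config-item config:name="ViewAreaTop" config:type="int">0</config:config-item>
      <config:config-item config:name="CurrentDatabaseCommand" config:type="string"/>
      <config:config-item config:name="PrintDrawings" config:type="boolean">true</config:config-item>
    </config:config-item-set>
  </office:settings>
</office:document-settings>"""

# Assumption: only the three types shown in the figure are handled here.
CONVERTERS = {"int": int, "boolean": lambda text: text == "true", "string": str}


def read_settings(xml_text):
    """Return {section name: {option name: typed value}}."""
    root = ET.fromstring(xml_text)
    name_key = f"{{{CONFIG_NS}}}name"
    type_key = f"{{{CONFIG_NS}}}type"
    settings = {}
    for item_set in root.iter(f"{{{CONFIG_NS}}}config-item-set"):
        section = settings.setdefault(item_set.get(name_key), {})
        for item in item_set.findall(f"{{{CONFIG_NS}}}config-item"):
            # Unknown types fall back to plain strings.
            convert = CONVERTERS.get(item.get(type_key), str)
            section[item.get(name_key)] = convert(item.text or "")
    return settings
```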

4.2.3 Styles.xml

Information about the format and layout of the text used within a document is stored in the styles.xml file. A style can be applied to a paragraph, page, character, frame, or list. The styles.xml file starts with the root element <office:document-styles>, which contains two elements, <office:font-face-decls> and <office:styles>. The <office:font-face-decls> element lists the fonts that are used in the document. The <office:styles> element contains information about the styles used within the document, including elements that define the default style for text and custom styles created by the end user. An example styles.xml file is shown in Figure 14.

<office:document-styles>
  <office:font-face-decls>
    <style:font-face style:name="Tahoma1" svg:font-family="Tahoma" />
    <style:font-face style:name="Tahoma" svg:font-family="Tahoma" style:font-pitch="variable" />
  </office:font-face-decls>
  <office:styles>
    <style:default-style style:family="paragraph">
      <style:paragraph-properties fo:hyphenation-ladder-count="no-limit" style:text-autospace="ideograph-alpha" style:punctuation-wrap="hanging" style:line-break="strict" style:tab-stop-distance="0.4925in" style:writing-mode="page" />
      <style:text-properties style:use-window-font-color="true" style:font-name="Thorndale" fo:font-size="12pt" fo:language="en" fo:country="US" style:font-name-asian="Andale Sans UI" style:font-size-asian="12pt" style:language-asian="none" style:country-asian="none" style:font-name-complex="Tahoma" style:font-size-complex="12pt" style:language-complex="none" style:country-complex="none" fo:hyphenate="false" fo:hyphenation-remain-char-count="2" fo:hyphenation-push-char-count="2" />
    </style:default-style>
    <style:style style:name="Index" style:family="paragraph" style:parent-style-name="Standard" style:class="index">
      <style:paragraph-properties text:number-lines="false" text:line-number="0" />
      <style:text-properties style:font-name-complex="Tahoma1" />
    </style:style>
  </office:styles>
</office:document-styles>

Figure 14: Example styles.xml

4.2.4 Content.xml

The content.xml file is one of the two required files and contains the content of the office document being created. This file differs greatly depending on which type of office document is created, but the top-level elements are the same regardless of type. The root element is <office:document-content>, whose attributes declare the namespaces used in the file and name the application that saved it. The root element contains <office:scripts>, <office:automatic-styles>, and <office:body>. The <office:scripts> element is unused at this time and is scheduled to be used for storing macros in future revisions of OpenDocument. Additional styles that are defined after the file is initially created are stored within the <office:automatic-styles> element, using the same attributes and elements that describe styles in the styles.xml file. The <office:body> element stores the content entered by the user. Its first child element identifies the type of document: <office:text>, <office:drawing>, <office:presentation>, <office:spreadsheet>, <office:chart>, or <office:image>. Within that element, text is stored inside elements that describe its formatting. Figure 15 shows the content.xml file of a simple text document.

<office:document-content>
  <office:scripts/>
  <office:styles>
    <office:automatic-styles>
      <!-- style information -->
    </office:automatic-styles>
  </office:styles>
  <office:body>
    <office:text>
      <b>Simple Text Document </b>
    </office:text>
  </office:body>
</office:document-content>

Figure 15: Content.xml file

4.2.5 Pictures folder

The Pictures folder contains all the images that are used within the document. Most images are stored in their original format, except BMP images, which are converted to PNG.

4.2.6 Thumbnail.png

The thumbnail.png file is a thumbnail image of the whole formatted document. It shows the same view that would be seen if a print preview option were selected.

4.2.7 Configurations2

This folder is currently unused and is planned for storage of configurations in future releases of OpenDocument.

4.2.8 Manifest.xml

The manifest.xml file is a required file that lists the files contained in the OpenDocument file. The office document editor uses it to locate and load those files.
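
A manifest can be read in the same way as the other XML parts of the package. The sketch below parses a small sample manifest and returns each listed path with its media type; the sample entries and the list_entries helper are illustrative assumptions, not the contents of any file produced for this report.

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"

# Illustrative manifest: one entry for the package itself, two for XML parts.
SAMPLE = f"""<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
  <manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.text" manifest:full-path="/"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
  <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="styles.xml"/>
</manifest:manifest>"""


def list_entries(xml_text):
    """Return {path: media type} for every file entry in the manifest."""
    root = ET.fromstring(xml_text)
    path_key = f"{{{MANIFEST_NS}}}full-path"
    type_key = f"{{{MANIFEST_NS}}}media-type"
    entries = root.findall(f"{{{MANIFEST_NS}}}file-entry")
    return {entry.get(path_key): entry.get(type_key) for entry in entries}
```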

4.3 Testing Portability

One of the largest advantages claimed for OpenDocument is that an office document can be created and edited in a variety of document editors. We test this claim by checking whether a document can be opened and edited in different editors. To examine this capability, four documents of increasing complexity are created in OpenOffice 2.0 and then opened in StarOffice 8.0 and KOffice 1.4.2. Since text documents and spreadsheets receive the most attention in the business sector, we focus our testing on those types of documents. In addition, we chose the most popular OpenDocument editors, since they are the most actively developed.

4.3.1 Simple Text Document

The first document tested is a simple text document that contains the phrase “Open Document test” centered on one line, with different formatting on each word. The word “Open” is underlined and set in Times New Roman at size 20. The word “Document” is bold and set in Arial at size 44. Finally, the word “test” is italicized and set in Tahoma at size 10. This document as seen in OpenOffice is shown in Figure 16.

Figure 16: Text document 1 opened in OpenOffice

When the same file was opened in StarOffice, the document opened exactly the same way as it did in OpenOffice. All the formatting of the text was correct and there were no problems editing the file. This is shown in Figure 17.

Figure 17: Text document 1 opened in StarOffice

In KOffice, the formatting and styling of the document were also correct when it was opened (Figure 18). There were no problems in editing and saving the document.

Figure 18: Text document 1 opened in KOffice

4.3.2 Advanced Text Document

The next document is an advanced text document that uses many features normally available only in a full-featured text editor. First, the document uses predefined formatting through heading and body text styles. Within the body text under the headings, multiple fonts and font sizes are used. The document also includes a six-cell table with dark grey shading in the top row, text in a variety of justifications, and a small GIF image in one of the cells. This document as opened in OpenOffice is shown in Figure 19.

Figure 19: Text document 2 opened in OpenOffice

When the document is opened in StarOffice (Figure 20), it looks exactly the same as it does in OpenOffice. There are no differences in the formatting, and the file can be edited further.

Figure 20: Text document 2 in StarOffice

There were no issues when the document was opened in KOffice, as seen in Figure 21. The table was formatted correctly and the justifications of the text were correct.

Figure 21: Text document 2 in KOffice

4.3.3 Simple Spreadsheet Document

The third document tested is a simple spreadsheet that performs four simple calculations (Figure 22). In row 1, the contents of columns A and B are added together and stored in column C. In row 2, the contents of column A are subtracted from column B and stored in column C. In row 3, column C holds the product of columns A and B. In row 4, column C holds the result of dividing column A by column B.

Figure 22: Simple spreadsheet in OpenOffice

There were no problems when opening the document in StarOffice (Figure 23). The calculations and formatting were correct.

Figure 23: Simple spreadsheet in StarOffice

The spreadsheet opened without a problem in KOffice as seen in Figure 24. The mathematical equations were correct and the formatting was the same.

Figure 24: Simple spreadsheet in KOffice

4.3.4 Advanced Spreadsheet Document

The last document is an advanced spreadsheet that combines multiple functions in the formula “=(SUM(A1;A3;AVERAGE(A1:A5)+LOG(A1+10;10)-SQRT(A4*SIN(0.75)+MOD(A3;A4))) + (POWER(3;3)/LN(A4)))”. The formula and its result can be seen in Figure 25.

Figure 25: Advanced spreadsheet in OpenOffice

The spreadsheet opened without a problem in StarOffice and the calculation of the equation is correct. This can be seen in Figure 26.

Figure 26: Advanced spreadsheet in StarOffice

When the document was opened in KOffice, the formula did not port over. As seen in Figure 27, the value in cell B1 is ####. The problem occurred because OpenDocument defines only how a formula is stored, not a standard for the functions within it. Currently, each application defines its own conventions for functions. Work on an extension to OpenDocument called OpenFormula has been started so that a standard for exchanging formulas within OpenDocument files may be developed [Jason-3].

Figure 27: Advanced spreadsheet in KOffice
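
One way to anticipate this kind of failure is to list the functions a formula uses and compare them with the set an editor is known to accept. The sketch below is only an illustration under stated assumptions: formulas are read from the table:formula attribute of table cells as described earlier, a function name is taken to be an identifier directly followed by an opening parenthesis, and the set of supported functions passed to unsupported is supplied by the caller. It does not describe which functions any particular editor supports.

```python
import re
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# The formula from the advanced spreadsheet test.
ADVANCED_FORMULA = (
    "=(SUM(A1;A3;AVERAGE(A1:A5)+LOG(A1+10;10)"
    "-SQRT(A4*SIN(0.75)+MOD(A3;A4))) + (POWER(3;3)/LN(A4)))"
)


def formulas_in(content_xml):
    """Collect the table:formula attribute of every cell that has one."""
    root = ET.fromstring(content_xml)
    key = f"{{{TABLE_NS}}}formula"
    cells = root.iter(f"{{{TABLE_NS}}}table-cell")
    return [cell.get(key) for cell in cells if cell.get(key)]


def functions_used(formula):
    """Names that are directly followed by an opening parenthesis."""
    return set(re.findall(r"([A-Za-z][A-Za-z0-9]*)\s*\(", formula))


def unsupported(formula, supported):
    """Functions in the formula that are not in the caller's supported set."""
    return functions_used(formula) - set(supported)
```

Note that this checks function names only. Two editors may share a function name and still disagree on its arguments, which is the SUM( ) case raised in the introduction.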

4.4 Conclusions

OpenDocument has distinct advantages for storing office documents. First, the use of XML to store information about the document allows that information to be parsed easily by multiple editors and allows for the possibility of data recovery if a file is corrupted. A binary format, by contrast, requires more overhead to access the data and offers little chance of recovery after corruption. Second, the portability of OpenDocument permits documents to be opened in a variety of editors without requiring a specific application, so the end user can select the editor of choice. Although the portability of OpenDocument is not yet universal, a great deal of work is under way to achieve this goal. Lastly, since OpenDocument is an open format, it allows for continued development by a wide range of people who are not constrained by the priorities of one company. These three advantages combine to create a strong future for OpenDocument as a means of sharing office documents.

5 Distributed Warehouse Systems
By Paul Vandal

The scope of my project has evolved throughout the semester. Initially, I proposed to research data warehouses and their use of XMI/XML as a solution for interoperability. Through this research I discovered the Common Warehouse Metamodel [Paul-1][Paul-2][Paul-3], or CWM, which is used in data warehouses to provide a standard format for representing metadata in heterogeneous systems. While CWM uses XMI as a standard interchange format, XMI is only a small part of the big picture, and if each team member researched XMI in their own subsection, there was a good chance we would all be doing the same research. At that point, I began looking at the various dimensions of CWM, including the technologies it uses and how it addresses the problem of proprietary metadata within heterogeneous data warehouse systems.

By mid-semester, we each had interesting research fields with a lot of compiled information, but no common thread. At that point, we started to look at the semantic web as a way to tie our individual sections together under a more unified theme. As introduced above, we found RDF to be this common thread. My research shifted from the details of CWM itself to how it could interface with a technology such as RDF. In the following sections, I provide an overview of my research on CWM and then move on to my work with RDF.

5.1 Data Warehouse Systems

5.1.1 Overview

On a daily basis, businesses are driven in different directions by decisions made by the executives who control them. These decisions need to be based on concrete and correct data that represents the trends of the business and its clients. Companies must be dynamic and adapt to these trends in order to survive the tough competition they face each day. Businesses that do not take the proper steps to be competitive in the market will be unsuccessful and short-lived, while their competitors prosper from the use of new technology for more insightful business decisions.

Raw data held in a database is almost worthless to the decision makers in a business. These databases usually consist of data such as customer information and inventories, and it is the relationships among the data that hold value. Looking at a customer name or the cost of a product does not support the kind of decision making that can change the future of a company. Instead, it is useful for decision makers to know how many products have been sold in the past year or even the past month. Take, for instance, a product in a store that has not sold in the past two weeks. With this information, the decision makers of the company can properly analyze the problem. The price may be too high and need to be lowered, or it may be time to remove the item from the store because recent trends show that customers have moved on to a different type of product. A data warehouse holds these types of trends and relationships that exist in the data over time. As shown in [Paul-3], numerous tools are integrated into this system to extract the most value from the data at hand.

The research presented in this section has been divided into two main subsections. First, the problem of heterogeneous metadata in the metadata repository will be considered. This problem will be explored along with different types of metadata architectures in a data warehouse system. The problems that arise for these systems will be considered, and solutions to this interoperability problem will be presented. In the second half, a proposal to add a Resource Description Framework repository [Bryan-4][Paul-5] to the data warehouse system will be introduced. An application will then be presented which integrates the RDF repository into a data warehouse system that uses a central CWM repository.

5.1.2 Metadata Interoperability in the Data Warehouse Environment

Two different architectures apply within a data warehouse. First, there is the architecture of the data warehouse system itself. This usually takes the form of a pipe-and-filter architecture, in which tools feed data into the data warehouse and other tools output the data in the form of reports. While the data traverses the data warehouse, the metadata that describes it traverses a second layer. As explored in [Paul-4], this layer is known as the metadata architecture, and it can be structured in ways similar to the data warehouse architecture. Some metadata architectures use centralized repositories, while others are point-to-point.

Many tools interface with the data warehouse. These tools have various uses, including transforming the data to the specifications of the user and creating reports for the user. This data process chain is quite interesting but is out of the scope of this paper. Instead, the interfacing and interoperability of these tools is explored. In a data warehouse system, the tools that operate on the data usually do not all come from the same vendor and, worst of all, are not designed to interact with tools from other vendors. While a business can purchase a data warehouse system created entirely by one vendor, it is usually more beneficial to buy heterogeneous parts. This allows the business to buy the specific tools that best fit the needs of its application. It is safe to assume that vendors design tools for different situations. For example, one tool may have been created to handle high, bursty loads, while another was designed for a low-load system. The low-load tool may have been simpler to design and may therefore be a cheaper alternative for a small business with smaller needs.

An interoperability issue arises with the introduction of heterogeneous tools into the data warehouse system. Each tool has its own way of storing metadata about the data it operates on. These tools usually expose a set of public Application Programming Interfaces, but complicated metadata bridges are still necessary between tools that interact. When adding a new tool to a heterogeneous data warehouse system, a business must hire a programmer to interface the newly purchased tool with the rest of the system. These complex bridges may be expensive to create, and the decision makers of a company may shy away from upgrading to a newer system, reducing the sales of the company that designed the tool.

From a different point of view, the designers of a tool may need to spend money on creating bridges to other existing data warehouse tools. If they already provide the complex bridge that is needed, companies may be more apt to buy their product. This has a profound effect on the design companies, however, as revenue is spent on researching and creating these complex bridges. A team of programmers may need to be hired just to keep the pool of bridges current with new tools that are created and old tools that are upgraded by other vendors.

Alleviating the heterogeneity problem will also have an industry-wide impact. Instead of dedicating programmers to creating these bridges, companies can create new products and the data warehouse system can evolve. More money can be directed at cutting-edge ideas and concepts for more efficient data warehouse tools. This is an advantage for both the design companies and the companies that use the data warehouses.

Figure 28: PV1
Figure 29: PV2

The first step in addressing the metadata integration problem of heterogeneous tools is to create a central metadata repository. In Figure 28 (PV1), there are four tools and six bidirectional metadata bridges. Each bidirectional bridge requires two one-way bridges to be implemented, one in each direction, bringing the total to twelve for all four tools, assuming that every tool interacts with every other tool. Introducing a metadata repository, as shown in Figure 29 (PV2), reduces the number of bidirectional bridges to four, for a total of eight one-way bridges. The metadata repository has its own metadata language, but the advantage is that each tool needs to interface only with that common language rather than with every other tool in the system.
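
The bridge counts above generalize to any number of tools. As a small worked example, assuming every tool interacts with every other tool in the point-to-point case, the counts can be computed as follows; the function names are hypothetical.

```python
def point_to_point(tools):
    """Return (bidirectional, one-way) bridge counts when every pair of tools is linked."""
    bidirectional = tools * (tools - 1) // 2  # one bridge per pair of tools
    return bidirectional, 2 * bidirectional


def central_repository(tools):
    """Return (bidirectional, one-way) bridge counts when each tool links only to the repository."""
    return tools, 2 * tools
```

For four tools this gives 6 and 12 bridges point-to-point versus 4 and 8 with a repository. The point-to-point count grows quadratically with the number of tools, while the repository count grows linearly, so the saving increases as tools are added.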

Although it can be used in many different metadata architectures, the Common Warehouse Metamodel builds on this central metadata repository to take the interoperability of the repository architecture even further. It introduces a standard metamodel for creating metadata, designed specifically for the data warehouse domain. To design this common metamodel, the creators of the technology studied the metadata of leading warehouse tools and identified the metadata that those tools had in common. From this shared metadata they created the common warehouse metamodel, as shown in Figure 30 (PV3). As the figure also shows, the common metamodel is only a small part of a larger metamodel. A package in CWM provides extensibility, allowing a developer to use the common metamodel and extend it for the specific application being developed.

Figure 30: PV3 Figure 31: PV4

Extensibility is one of the many advantages that make it easier to use CWM for creating applications. Another advantage is the use of XML to interchange the metadata between tools. This also raises the level of interoperability a great deal, since XML is another widely used standard and is easily encapsulated and transmitted over HTTP. The HTTP port on a computer is the most widely used and is almost never firewalled, allowing seamless interchange of CWM metadata as XML carried over HTTP.

The main reason for using this metamodel, and its biggest advantage, is that it solves the metadata bridge problem. In Figure 31 (PV4), each warehouse tool is built using the CWM specification. While the designers of the tools may still store proprietary metadata internally, an inefficient practice, they can develop their tools using CWM-compliant bidirectional bridges to deal with external metadata. This creates a plug-and-play environment, where new tools can be introduced into the system without much effort.


With the metadata integration problem solved, other areas of the data warehouse can be explored to create efficient data warehouses. An interesting and important area of research is the metadata management strategies that can be implemented on these data warehouses. Different strategies will have different effects on the system, and it would be interesting to see which strategies should be implemented for different scenarios. The effect of CWM extensibility on interoperability can also be explored: it would be interesting to see how two different metamodels that extend the base CWM metamodel interact and how well interoperability holds up.

5.1.3 An Intelligent Search by applying the Semantic Web

Several questions arose during the midterm presentation as to what the major theme of our project was. We answered them by shifting the main scope of the project to a topic that could glue the research areas together. The main report gives an overview of this shift; the details of the change in my research area are explored here. It was suggested that we find some way to glue all of our projects together in the form of an IT system that would need to draw on all of our research areas. The group chose the Resource Description Framework (RDF), a technology used in the semantic web, as this glue, so the data warehouses needed to be interfaced with it. After some initial investigation, it seemed quite possible to glue CWM to RDF through a mapping in which RDF would create objects out of the CWM metadata and store them with rich relationships between the data. While no mapping of this type exists at present, the semantic web has been gaining momentum in the computing world, so this does not seem too much of a long shot. This paper could possibly form the basis of that effort. After the mapping from CWM to RDF has been proposed, a person of interest search using the RDF repository [Paul-5] will be proposed as well.

In CWM, metadata is created without any special relationships between the data. Say, for example, customer A purchases product X, and two days later customer B also buys product X. In CWM, there is a tuple that represents the customer linked to a tuple that represents the product. Both customers will be linked to the instance of product X that they have bought.

When the data is mapped from a CWM repository to an RDF repository, it is turned into objects based on an ontology. Then relationships and inferences are created between them. Say, for instance, that the product the two customers bought is an apple. This object is a member of an ontology hierarchy: it is a fruit, fruit is produce, and produce is a grocery store item. This data already exists in the CWM repository in some form, but it is not as organized as it is in the RDF repository. While both repositories can be searched, the capabilities of the two searches are quite different.
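As a rough illustration of this idea, the sketch below turns flat purchase records into subject-predicate-object triples and then adds the broader facts implied by the apple hierarchy. It is only a toy model written with plain Python data structures: the record layout, the `bought` predicate, and the `SUBCLASS_OF` table are assumptions made for the example, not an actual CWM export format or a proposed CWM-to-RDF mapping.

```python
# Hypothetical flat records, standing in for linked tuples in a CWM repository.
purchases = [
    {"customer": "customerA", "product": "apple"},
    {"customer": "customerB", "product": "apple"},
]

# Hypothetical ontology: each class points to its broader parent class.
SUBCLASS_OF = {
    "apple": "fruit",
    "fruit": "produce",
    "produce": "grocery_item",
}


def to_triples(records):
    """Turn each flat record into a (subject, predicate, object) triple."""
    return [(r["customer"], "bought", r["product"]) for r in records]


def ancestors(cls):
    """Walk the subclass chain upward and return every broader class."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def infer(triples):
    """Add one inferred triple for each broader class of a purchased product."""
    inferred = list(triples)
    for subject, predicate, obj in triples:
        for parent in ancestors(obj):
            inferred.append((subject, predicate, parent))
    return inferred


triples = infer(to_triples(purchases))
for triple in triples:
    print(triple)
```

After inference, a query for customers who bought produce finds both customers even though no original record mentions produce.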

An RDF metadata repository allows for a powerful search of the data stored inside it. Take, for instance, the two customers that purchased apples at a grocery store. Both customers bought an apple, but customer A bought a Granny Smith and customer B bought a Red Delicious. In this grocery store application, a customer hierarchy will also exist in the ontology. Customer A may be a 30-year-old man while customer B is a 40-year-old woman. It would be possible to track the women and men that bought different types of apples and, depending on how deeply the ontology has been developed, the different age groups of men, women, or both that bought certain types of apples. It is also possible to generalize from the types of apples to apples in general, or even to fruit. This can be important information to a grocery store for many reasons, including whom to target for different types of products.
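The roll-up described here can be sketched as a small query that counts buyers by a customer attribute at any chosen level of the product hierarchy. The data, the `PARENT` table, and the function names are hypothetical and exist only to make the example concrete.

```python
from collections import Counter

# Hypothetical product hierarchy: variety -> apple -> fruit.
PARENT = {"granny_smith": "apple", "red_delicious": "apple", "apple": "fruit"}

# Hypothetical customer attributes and purchase facts.
CUSTOMERS = {
    "customerA": {"gender": "male", "age": 30},
    "customerB": {"gender": "female", "age": 40},
}
PURCHASES = [("customerA", "granny_smith"), ("customerB", "red_delicious")]


def roll_up(item, level):
    """Climb the hierarchy from item; return level if reached, else None."""
    while item is not None and item != level:
        item = PARENT.get(item)
    return item


def buyers_by(level, attribute):
    """Count buyers of anything under `level`, grouped by a customer attribute."""
    counts = Counter()
    for customer, item in PURCHASES:
        if roll_up(item, level) == level:
            counts[CUSTOMERS[customer][attribute]] += 1
    return dict(counts)


print(buyers_by("granny_smith", "gender"))  # only the Granny Smith buyer
print(buyers_by("apple", "gender"))         # both buyers, one level up
print(buyers_by("fruit", "age"))            # same purchases grouped by age
```

The same purchase facts answer questions at the variety, apple, or fruit level without being stored again for each level, which is the benefit the ontology provides.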

Depending on how intricately the ontology is designed, a research group that is interested in tracking a certain ingredient could use this as a research tool. Suppose an ingredient they have been researching is known to cause acne breakouts in teenagers. The research group can use the data recorded over a period of time to explore which foods containing the ingredient a given group, such as 20-year-old females, is most likely to buy. Then, by linking the customers that have experienced breakouts, the group may be able to deduce a common ingredient that may be responsible for them.

To illustrate the powerful searches that the RDF repository allows a user to perform, the different types of searches that are possible will be presented. One of these uses a CWM metadata repository. The RDF repository will then be introduced as a search tool and compared and contrasted with the other types of searches. The types of searches that will be explored are: full data search, metadata search, global metadata search, and RDF repository search.

A full data search is shown in Figure 32 (PV5). Here, the search tool is attached directly to the data warehouse. Using the grocery store application above, suppose a 20-year-old woman wants to buy a product that does not contain a specific preservative, one she is allergic to and that is used in most brands of the product she wants. To do this, she has to look at each individual product in the store and determine which ones do not use the preservative. This parallels a full data search, in which a search tool has to search every document and piece of data contained in the data warehouse for some user-defined keyword. These types of searches are quite inefficient.

Figure 32: PV5 Figure 33: PV6


An improvement on the full data search is a metadata search. The grocery store the young woman is in has just created an online book of all its products. The book records the metadata for each product: the type of product, where it can be found, the manufacturer, and the ingredients it contains. She can flip through the book and find a product faster than by looking through all of the products on the shelves. However, the product descriptions were not labeled in any specific format. This slows her down, as she needs to determine what the labels on the metadata actually represent. This parallels Figure 33 (PV6), in which each warehouse tool stores its metadata in its own proprietary way and needs a bridge to understand external metadata. In this figure, the search tool relies on each warehouse tool to conduct the search of its own metadata repository and then propagate the results upward.

To make this woman’s life easier, the grocery store decides to adopt a standard for representing its products. To implement this, the store samples the different ways in which the products have been represented, takes the most common ones, and creates a base for representing products in an organized and standard fashion. This is the function that the Common Warehouse Metamodel provides for data warehouse tools. Although the interoperability issue has been solved, searches on this metadata are still quite unintelligent, allowing users to search only for keywords or instances of metadata in the data warehouse.

Figure 34: PV7

Taking these searches further, the CWM repository is mapped to an RDF repository as shown in Figure 34 (PV7). The woman now has a wealth of knowledge at her fingertips. She can look up the ingredient that she is allergic to and see all the products that use it, so she will know which products to stay away from. But that is only the beginning. From there, she can search for other ingredients that may cause the same allergic reaction. Assuming no confidentiality of health records, she can search for other people who have had the same reaction and the ingredients that caused their allergic reactions. From these results, she can narrow down the search to her age group, gender, and even nationality.
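The kind of browsing described above amounts to following relationships in a graph of triples. The sketch below shows one way this could look, using a small in-memory list of triples and a wildcard pattern match; all product, ingredient, and shopper names are made up, and the `contains` and `reacted_to` predicates are assumptions for the example rather than terms from CWM or any RDF vocabulary.

```python
# Hypothetical triples: products and their ingredients, shoppers and reactions.
TRIPLES = [
    ("cereal_x", "contains", "preservative_p"),
    ("soup_y", "contains", "preservative_p"),
    ("soup_y", "contains", "dye_q"),
    ("bread_z", "contains", "flour"),
    ("shopper_1", "reacted_to", "preservative_p"),
    ("shopper_2", "reacted_to", "preservative_p"),
    ("shopper_2", "reacted_to", "dye_q"),
]


def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def products_with(ingredient):
    """Products the shopper should avoid because they contain the ingredient."""
    return sorted(t[0] for t in match(p="contains", o=ingredient))


def related_ingredients(ingredient):
    """Other ingredients that people who reacted to `ingredient` also reacted to."""
    people = {t[0] for t in match(p="reacted_to", o=ingredient)}
    others = {t[2] for person in people for t in match(s=person, p="reacted_to")}
    return sorted(others - {ingredient})


print(products_with("preservative_p"))        # ['cereal_x', 'soup_y']
print(related_ingredients("preservative_p"))  # ['dye_q']
```

Each further question in the scenario, such as narrowing by age group or gender, would be another pattern over the same triples rather than a new keyword search.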

Assuming that there is some way to map the CWM repository to the RDF repository, a topic that can be researched further, it is important to understand the differences between the two data sets. In a regular CWM metadata repository, there are no special relationships between metadata. There are tuples that represent objects; take, for example, a person. A person instance will have a name, social security number, phone number, address, etc. Then, if the person buys an apple, the person instance will be linked to the instance of the apple. In an RDF repository, the relationships have some similarities, but there is more information available. The apple is in an ontology hierarchy; specifically, it is a fruit. Then there are different types of apples. The person who bought the apple is linked to the object itself.

5.2 Persons of Interest Tracking

The grocery store example above was used to illustrate the different types of searches, but there are also many other interesting applications that can be built on this CWM-RDF framework. Take, for instance, Figure 35 (PV8), which shows a system in which both the FBI and the CIA have a CWM-compliant data warehouse. These systems were created to store data about persons of interest. Both the CIA and the FBI have tracking devices that record the electronic actions of these people. In addition, there are spies who enter all other actions of these people that are not automatically recorded. As this data continuously flows into the data warehouses, the RDF repository is constantly updated with the new objects and relationships. That is, new persons of interest are added to the repository, along with their actions and the objects they have interacted with, as they happen.

Figure 35: PV8

Numerous types of persons of interest are tracked by this system, including organized crime members, serial killers, and suspected terrorists. The actions these persons take and the objects they interact with are endless, and include simple things such as depositing money at a bank. Every detail about the actions of a suspect is recorded, such as the amount of currency, the bank name and location, and the day and time that the suspect deposited money at the bank.

The investigation team that has implemented this repository has been tipped off that one of the suspects in the repository is a suspected terrorist and may also be involved in some business with a mafia organization. With just this word of mouth, it is impossible for the team to file for a warrant or arrest the suspect to get him off the street. To build a concrete case against the suspect, they turn to the RDF repository. Querying the RDF repository, they search for the activities this man has been involved in, any suspicious activities, and other persons of interest he has been interacting with or sharing similar actions with. If he has been to numerous banks, the team can browse the various banks that he has visited. While browsing one bank, they may find that another suspected terrorist used the same bank at around the same time that terrorist X did. This may not be totally incriminating, but if these exchanges of money were of any significance, a flag may be raised and closer attention may be paid to these suspects.
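The "same bank at around the same time" check can be pictured as a search for pairs of recorded events that share a location and fall within a time window. The following is a minimal sketch with invented event data; the one-hour window and the event layout are assumptions chosen for the example, and a real system would query the RDF repository rather than a Python list.

```python
from datetime import datetime, timedelta
from itertools import combinations

# Hypothetical recorded events: (person, location, time of the deposit).
EVENTS = [
    ("suspect_x", "bank_1", datetime(2005, 11, 1, 10, 0)),
    ("suspect_y", "bank_1", datetime(2005, 11, 1, 10, 20)),
    ("suspect_z", "bank_1", datetime(2005, 11, 3, 9, 0)),
    ("suspect_y", "bank_2", datetime(2005, 11, 1, 10, 5)),
]


def co_occurrences(events, window=timedelta(hours=1)):
    """Return (person, person, location) for different people who were at the
    same location within `window` of each other."""
    hits = []
    for (p1, loc1, t1), (p2, loc2, t2) in combinations(events, 2):
        if p1 != p2 and loc1 == loc2 and abs(t1 - t2) <= window:
            hits.append((p1, p2, loc1))
    return hits


print(co_occurrences(EVENTS))  # [('suspect_x', 'suspect_y', 'bank_1')]
```

Comparing every pair of events is fine for a handful of records; a repository of realistic size would need to index events by location and time first.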

This system can be effective in catching a person of interest before they have the ability to cause harm to others. The live updates can take this even further. A tool can be implemented to interface with the RDF repository and analyze any suspicious patterns that newly inserted data creates. The best way to illustrate this is with a kingpin drug dealer in the system. A week ago, the system recorded that the drug dealer bought a household cleaner at a grocery store. Then, five minutes ago, this person bought another household cleaner that, combined with the first one, can be used to create some form of drug concoction. While each action alone is perfectly normal, this combination of events raises a flag in the new data analyzer, which notifies the investigation team.
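One way to picture such an analyzer is as a rule that is re-evaluated every time a new purchase is inserted. The sketch below is purely illustrative: the `PurchaseMonitor` class, the 30-day window, and the placeholder item names are all assumptions, and the watched combination is abstract rather than a real list of products.

```python
from datetime import datetime, timedelta

# Placeholder combination of items that is only of interest when bought together.
WATCHED_COMBINATIONS = [frozenset({"item_a", "item_b"})]


class PurchaseMonitor:
    """Keeps recent purchases per person and checks each new one against rules."""

    def __init__(self, window=timedelta(days=30)):
        self.window = window
        self.history = {}  # person -> list of (time, item)

    def record(self, person, item, when):
        """Store the purchase and return the watched combinations it completes."""
        recent = [
            (t, i) for t, i in self.history.get(person, [])
            if when - t <= self.window
        ]
        recent.append((when, item))
        self.history[person] = recent
        held = {i for _, i in recent}
        return [
            combo for combo in WATCHED_COMBINATIONS
            if item in combo and combo <= held
        ]


monitor = PurchaseMonitor()
first = monitor.record("person_1", "item_a", datetime(2005, 12, 1))
second = monitor.record("person_1", "item_b", datetime(2005, 12, 8))
print(first)   # [] because one purchase alone completes nothing
print(second)  # the watched combination, which would trigger a notification
```

Only the purchase that completes a combination raises a flag, which matches the point that each action alone is perfectly normal.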

5.3 Conclusion

With the help of new technologies, businesses are capable of making more intuitive business decisions. But even with these new technologies, certain bottlenecks need to be removed for these businesses to move faster. The main bottleneck explored in this paper was metadata interoperability in data warehouses, and the Common Warehouse Metamodel was introduced as the solution. With this problem of metadata interoperability solved, money that was once spent on complex metadata bridges can now be saved or used to create better warehouse tools and systems. Further research areas that can be explored as extensions to this paper include different metadata management strategies for the CWM metamodel and the effect of extensibility on interoperability.

Once this global metadata standard for data warehouses has been implemented, the data warehouse system can evolve in new ways. The Resource Description Framework was proposed as one of these evolutions. Mapping a CWM repository to an RDF repository creates a richer data set that allows more intelligent applications to be built on the existing data. Without the standard metadata format for data warehouses, this mapping might not be possible without complex bridges, which could prevent the process from ever getting started. An application of this RDF repository was introduced that used the mapping of CWM to RDF to create a search tool for persons of interest tracking. This opens another line of research: how a mapping of this nature would be designed and implemented in the data warehouse system.


6 References

[Bryan-1] “The Semantic Web: Opportunities and Challenges for Next-Generation Web Applications”, Shiyong Lu, Ming Dong, Farshad Fotouhi, Information Research, Vol. 7 No. 4, July 2002, at http://informationr.net/ir/7-4/paper134.html

[Bryan-2] “DAML-S: Semantic Markup for Web Services”, Anupriya Ankolekar, Mark Burstein, et al, 2001, in “The First Semantic Web Working Symposium”, pp 411-430, Heidelberg: Springer-Verlag Heidelberg.

[Bryan-3] “Integrating Applications on the Semantic Web”, James Hendler, Journal of the Institute of Electrical Engineers of Japan, Vol 122(10), October 2002, pp 676-680, at http://www.w3.org/2002/07/swint

[Bryan-4] “Resource Description Framework”, W3 Consortium, at http://www.w3.org/RDF/

[Bryan-5] “Machine Translation: Past, Present, Future” Hutchins, W. J., 1986, Wiley & Sons.

[Bryan-6] The Dublin Core Metadata Initiative is documented fully at http://dublincore.org/

[Bryan-7] Dublin Core metadata terms, at http://dublincore.org/documents/dcmi-terms/

[Jason-1] OpenDocument Format for Office Applications, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

[Jason-2] OASIS OpenDocument Essentials, http://books.evc-cit.info/odbook/book.html

[Jason-3] OpenFormula, http://www.dwheeler.com/openformula/

[Jason-4] OpenDocument Fellowship, http://opendocumentfellowship.org/Applications/HomePage

[Paul-1] Common Warehouse Metamodel, http://www.omg.org/technology/cwm/

[Paul-2] D.T. Chang, “Common Warehouse Meta-model (CWM), UML and XML,” Meta-Data Conference, March 19-23, 2000, http://www.cwmforum.org/cwm.pdf.

[Paul-3] Poole, J., Chang, D., Tolbert, D., & Mellor, D. (2002), Common Warehouse Metamodel: An Introduction to the Standard for Data Warehouse Integration. New York, NY: John Wiley & Sons.


[Paul-4] H. Do, E. Rahm, “On Metadata Interoperability in Data Warehouses,” March 2000.

[Paul-5] T. Priebe, “INWISS – Integrative Enterprise Knowledge Portal,” August 26, 2004

[Upsorn-1] V.S. Alagar and K. Periyasamy, Specification of Software Systems. Graduate Texts in Computer Science, NY: Springer-Verlag, 1998.

[Upsorn-2] C. Fellbaum. WordNet: An Electronic Lexical Database, Cambridge, Mass: MIT Press, 1998.

[Upsorn-3] R. Hall, “Generalized Behavior-Based Retrieval,” Proceedings of 15th International Conference on Software Engineering (ICSE’93), Baltimore, MD, page 371-380, ACM Press, 1993.

[Upsorn-4] D. Hemer and P. Lindsay, “Specification-Based Retrieval Strategies for Module Reuse,” Proceedings of Australian Software Engineering Conference (ASWEC’2001), page 235-243, IEEE Computer Society Press, August 2001.

[Upsorn-5] D.M. Hoffman and D.M. Weiss, Software Fundamentals (Collected papers by David L. Parnas), NG: Addison-Wesley, 2001.

[Upsorn-6] J.J. Jeng and B.H.C. Cheng, “Specification Matching for Software Reuse: A Foundation,” Proceedings of the ACM Symposium on Software Reusability (SSR’95), Seattle, WA, page 97-105, April 1995.

[Upsorn-7] A.M. Zaremski and J. Wing, “Specification Matching of Software Components,” ACM Transactions on Software Engineering and Methodology, Vol. 6, No. 4, page 333-369, October 1997.

[Upsorn-8] F. Gibb, C. McCartan, R. O’Donnel, N. Sweeney, and R. Leon, “The Integration of Information Retrieval Techniques within a Software Reuse Environment,” Journal of Information Science, 26(4), page 211-226, 2000.

[Upsorn-9] R. Pinheiro, M.N. Costa, R.M.M. Braga, M. Mattoso, and C.M.L. Werner, “Software Component Reuse Through Web Search and Retrieval,” Proceedings of the International Workshop on Information Integration on the Web Technologies and Applications, Rio de Janeiro, Brazil, page 12-18, April 2001.

[Upsorn-10] A. Schlapbach, “Generic XMI Support for the MOOSE Reengineering Environment,” Software Composition Group, Institut für Informatik und angewandte Mathematik, Universität Bern, June 17, 2001.
