Modeling and language support for the management of pattern-bases

Please download to get full document.

View again

of 30
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Similar Documents
Information Report



Views: 57 | Pages: 30

Extension: PDF | Download: 0

In our days knowledge extraction methods are able to produce artifacts (also called patterns) that concisely represent data. Patterns are usually quite heterogeneous and require ad-hoc processing techniques. So far, little emphasis has been posed on
  Modeling and language support for the managementof pattern-bases  q Manolis Terrovitis  a,* , Panos Vassiliadis  b , Spiros Skiadopoulos  c , Elisa Bertino  d ,Barbara Catania  e , Anna Maddalena  e , Stefano Rizzi  f  a School of Electrical and Computer Engineering, National Technical University of Athens, Greece b Department of Computer Science, University of Ioannina, Greece c Department of Computer Science, University of Peloponnese, Greece d Department of Computer Science, Purdue University, USA e Department of Computer and Information Science, University of Genoa, Italy f  Department of Electronics, Computer Science and Systems, University of Bologna, Italy Received 4 March 2006; received in revised form 24 July 2006; accepted 5 October 2006Available online 9 November 2006 Abstract Information overloading is today a serious concern that may hinder the potential of modern web-based informationsystems. A promising approach to deal with this problem is represented by knowledge extraction methods able to produceartifacts (also called patterns) that concisely represent data. Patterns are usually quite heterogeneous and voluminous. Sofar, little emphasis has been posed on developing an overall integrated environment for uniformly representing and que-rying different types of patterns. In this paper we consider the larger problem of modeling, storing, and querying patterns,in a database-like setting and use a Pattern-Base Management System (PBMS) for this purpose. Specifically, (a) we for-mally define the logical foundations for the global setting of pattern management through a model that covers data, pat-terns, and their intermediate mappings; (b) we present a formalism for pattern specification along with safety restrictions;and (c) we introduce predicates for comparing patterns and query operators.   2006 Elsevier B.V. All rights reserved. Keywords:  Pattern bases; Pattern-Base Management Systems; Data models; Knowledge warehousing 1. Introduction Nowadays, we are experiencing a phenomenon of information overload, which escalates beyond any of ourtraditional beliefs. As a recent survey states [1], the world produces between 1 and 2 exabytes of unique infor- 0169-023X/$ - see front matter    2006 Elsevier B.V. All rights reserved.doi:10.1016/j.datak.2006.10.002 q This work was partially funded by the European Commission and the Hellenic Ministry of Education through EPEAEK II and thePANDA IST Thematic Network. * Corresponding author. Tel.: +30 210 7721402. E-mail address: (M. Terrovitis).Data & Knowledge Engineering 62 (2007) 368–397  mation per year, 90% of which is digital and with a 50% annual growth rate. Clearly, this sheer volume of collected data in digital form calls for novel information extraction, management, and querying techniques,thus posing the need for novel Database Management Systems (DBMSs). Still, even novel DBMS architec-tures are insufficient to cover the gap between the exponential growth of data and the slow growth of ourunderstanding [2], due to our methodological bottlenecks and simple human limitations. To compensatefor these shortcomings, we reduce the available data to  knowledge artifacts  (e.g., clusters, association rules)through data processing methods (pattern recognition, data mining, knowledge extraction) that reduce theirnumber and size (so that they are manageable by humans), while preserving as much as possible from theirhidden/interesting/available information. Again, the volume, diversity, and complexity of these knowledgeartifacts make their management by a DBMS-like environment imperative. In the remainder of this document,we will refer to all these knowledge artifacts as  patterns .So far, patterns have not been adequately treated as  persistent objects  that can be stored, retrieved, and que-ried. Thus, the challenge of integration between patterns and data seems to be achievable by designing funda-mental approaches for providing database support to pattern management. In particular, since patternsrepresent relevant knowledge, often very large in size, it is important that such knowledge is handled as  first-class  citizen. This means that  patterns should be modeled, stored, processed, and queried in a similar wayto data in traditional DBMSs . Motivating example . To show the benefits of treating patterns as first-class citizens, we briefly discuss a sce-nario involving a manager interacting with a set of stored patterns that he can query through the facilities Data Person Result 1 1. Find alldata thatcorrespond tothis pattern2. For acertainrecord, findall relevantpatterns Patterns of Day 1Pattern 1Result 2Interesting data 3. check whatpatternscorrespond to anew subset andvice-versa4. How different arethe new and the oldset of patterns?  Patterns of Day 2 5. Is thisdifferencesignificant? Pattern 2 6. What areits mostsimilarpatterns?   LegendOperation parameterPattern/data derivationPattern/data subset Fig. 1. An example of pattern generation scenario. M. Terrovitis et al. / Data & Knowledge Engineering 62 (2007) 368–397   369  offered by a  Pattern-Base Management System  (PBMS) [3] (Fig. 1). We will assume the existence of data for the customers and the transactions of two different supermarket branches organized in an object-relationaldatabase like the following: CustBranch1(id:int,name:varchar,age:int,income:int,sex:int)CustBranch2(id:int,name:varchar,age:int,income:int,sex:int)TransBranch1(id:long int,customerID:int,totalSum:euro,items: { int } )TransBranch2(id:long int,customerID:int,totalSum:euro,items:{int}) The relations  CustBranch1 ,  CustBranch2  hold data for customers from two different branches of thesupermarket and the relations  TransBranch1 ,  TransBranch2  hold data for the transactions performed atthe respective branches. The scenario takes place during two different days. Day 1 : A data mining algorithm is executed over the underlying data and the results are stored in the pat-tern base. For example, the data mining algorithm involves the identification of frequent itemsets. Possiblepatterns stored in the pattern base involve sets of items that are frequently bought together; e.g.,{ bread,butter }. The user is notified on this event and starts interacting with the stored data and patterns.During the early stages of this interaction the user is navigating back and forth from the pattern to data space.[ Which are the data represented by a pattern ?] The user selects an interesting pattern and decides to find outwhat are the underlying data that relate to it.[ Which are the patterns representing a data item ?] Out of a vast number of data that the user receives as ananswer, he picks an interesting record and decides to see which other patterns relate to it (so that he canrelate the srcinating pattern with these ones, if this is possible).[ Does a pattern represent adequately a data set ?] This process of cross-over operations goes on for a while, asthe user gets accustomed to the data. Then, he decides to proceed to some more elaborate analysis, consist-ing of ‘‘what-if’’ operations. Therefore, he arbitrarily chooses another dataset (on the basis of someassumption) and checks which of the available patterns can adequately represent it. Then, on the basisof the results, he selects a certain pattern and tries to determine which sets of data correspond to this pat-tern. This interaction with the stored data and patterns goes on for some time.[ Could we derive new patterns from knowledge in the pattern base ?] The user may exploit patterns stored inthe pattern base in order to derive or even to conjecture the existence of new patterns representing existingdata without applying any time consuming mining technique but using available query operators. Day 2 : Datainthesourcedatabaseschange,therefore,theoriginaldataminingalgorithmsareexecutedonaregularbasis(e.g.,weekly)inordertocapturedifferenttrendsonthedata.Onceasecondversionofthepatternsiscomputed(forthenewdatathistime),theanalystisinterestedindeterminingthechangesthathavetakenplace.[ How different are two pattern classes ?] First, the user wants to find out which are the new patterns andwhich patterns are missing from the new pattern set. This involves a simple qualitative criterion: presenceor absence of a pattern.[ Is the difference of two pattern classes significant ?] Then, the user wants some quantitative picture of thepattern set: which patterns have significant differences in their measures compared to their previous version?[ How similar are two patterns ?] Finally, the user wants to see trends: how could he predict that something isabout to happen? So, he picks a new pattern and tries to find its most similar pattern in the previous patternset, through a similarity operator. Why is the problem hard to solve with the current state-of-practice ? Assuming that someone would like totreat patterns as first-class citizens (i.e., store and query them), a straightforward approach would be toemploy standard object-relational or XML database technology to store patterns and exploit its query facil-ities to navigate between the two spaces and pose queries. Although it is feasible to come up with an imple-mentation of this kind, several problems occur. A preliminary object-relational implementation was pursuedin the PANDA project [28] and we briefly list its findings here. First, we need to address the problem of havinga generic structure able to accommodate all kinds of patterns; unfortunately, standard relational, and even 370  M. Terrovitis et al. / Data & Knowledge Engineering 62 (2007) 368–397   object-relational technology, results in a schema which is too complicated. As typically happens with rela-tional technology, such a complex structure has a direct overhead in the formulation of user queries, whichbecome too lengthy and complicated for the user to construct. An XML database could probably handlethe heterogeneity problem gracefully; nevertheless, the main strength of XML databases lies in the efficientanswering of path queries. Unfortunately, the particularities of the (pointer-like) navigation between the dataand the pattern space as well as the novel query operations concerning collections of patterns are not compat-ible with this nature of XML path queries.Based on all the above, the situation calls for usage of customized techniques. In particular, we need both alogical model for the concise definition of a pattern base and an appropriate query language for the efficientformulation of queries. As we shall see, in this paper we treat the problem from a blank sheet of paper anddiscuss the management of patterns as a sui-generis problem. It is also interesting to point out that havingan appropriate logical model, allows its underlying support by an object-relational or an XML storage enginein a way that is transparent to the user. Why is the problem novel  ? Both the standardization and the research community have made efforts close tothe main problem that we investigate; nevertheless, to the best of our knowledge there is no approach thatsuccessfully covers all the parameters of the problem in a coherent setting.So far,  standardization efforts  [4–6] have focused towards providing a common format for describing pat-terns with the aim of interchangeability. The storage and eventual querying of such patterns has not beentaken into consideration in the design of these approaches. A much deeper approach, similar to our own,has been performed in the field of   inductive databases : a particular characteristic of inductive databases is thatthe querying process treats data and patterns equally. Nevertheless, the inductive database community avoidsthe provision of a generic model, capable of handling any kind of patterns and focuses, instead, on specificcategories of them. Therefore, our approaches are complementary since the inductive database community fol-lows a bottom-up approach, with individual cases of patterns considered separately, whereas we follow a top-down approach with a generic model that aims to integrate any kind of patterns. Finally, it is interesting topoint out the relationship to the  data warehousing   approach, where data are aggregated and stored in appli-cation-oriented data marts. We follow a similar, still broader, approach, in the sense that although data ware-housing aggregation is one possible way to provide semantically rich information to the user in a concise way,it is not the only one. As already mentioned, we try to pursue this goal in an extensible fashion, for a signif-icantly broader range of patterns. Problem and main idea. The problem that we address is focused towards providing a generic and extensiblemodel for patterns, along with the necessary languages for their definition and manipulation . The issues that arisein this context are specifically: (a) the accommodation of any kind of patterns through a generic model, (b) theimposing of a logical structure to the vast space of patterns and the organization of patterns in semanticallyrich, easy to query collections, (c) the ability to navigate between the data and the pattern space and the abilityto interrelate the corresponding objects of the two worlds, and, (d) the ability to perform novel types queriesboth over data, and, most importantly, over collections of patterns.The importance of a logical model is paramount in this effort. As usual, a logical model provides the inter-mediate mathematical formulation that gracefully bridges the subjective, conceptual user view to the actualphysical implementation. Moreover, apart from the formal view on the internal structure of data that the log-ical model imposes, it also forms the basis upon which a query language can be defined. Contributions.  The main idea of this paper is the organization of patterns and data in a  Pattern Warehouse environmentthat(a)isbasedontheformalfoundationsofawellspecifiedlogicalmodeland(b)providesthenec-essary mechanisms for the querying of patterns and data. Specifically, we make the following contributions: •  First, we formally define the foundations for the global setting of pattern management through a logicalmodel that covers data, patterns and intermediate mappings. The model covers typing issues, both for dataandpatterns,withthelaterprovisionaimingtowardsbeing abletodefinereusablepatterntypesthatarecus-tomizedpercase.Moreover,themodelallowsthedatabasedesignertoorganizesemanticallysimilarpatternsin the appropriate collections that can be subsequently queried, just like relations in traditional databases. Adistinguishing provision of our model is the ability to relate patterns and relevant data either explicitly,through intermediate, pointer-based mappings, or approximately, through declarative expressions. M. Terrovitis et al. / Data & Knowledge Engineering 62 (2007) 368–397   371  •  Second, we discuss language issues in the context of pattern definition and management. In particular,we present a declarative formalism for defining the approximate relation between the patterns and thedata space. The approximate formulae that perform this approximate mapping are complementary tothe precise mappings (e.g., for the case that the latter are absent). Safety issues are also discussed in thiscontext. •  Third, we provide a characterization of pattern relationships (disjointness, equivalence, subsumption, sim-ilarity and equality) and we discuss their completeness and minimality. This fundamental set of relations islater exploited in the formation of the appropriate predicates on patterns, which are part of the query lan-guage that we introduce. •  Fourth, we introduce query operators for patterns and discuss their usage and semantics. Our query oper-ators involve (a) the selection of patterns satisfying some criteria, (b) the reconstruction or combination of existing patterns to form new ones and (c) the navigation from the data to the pattern space and vice versa. •  Finally, we describe PSYCHO ( Pattern base management SYstem arCHitecture prOtotype ) [7], a PBMSprototype implementation, developed in order to assess the usability of the proposed pattern model andlanguages.A short version of this paper appears in [8]. In this paper, we extend [8] as follows: (a) we provide a fundamental classification of both precise pattern relationships, upon which we define appropriate predi-cates; (b) a proof of completeness for this classification; (c) a similar classification, along with the appro-priate predicates and completeness proofs for approximate pattern relationships, (d) a novel family of similarity operators for patterns, (e) a thorough discussion and exemplification of the components of thePattern Warehouse throughout all the paper, with particular emphasis on the formal definition and discus-sion of the operators of the query language, and finally, (f) a brief presentation of a prototypeimplementation.The rest of this paper is organized as follows. In Section 2, we introduce the notion of Pattern-Base Man-agement system. Section 3 presents a general model for the coexistence of data and patterns. In Section 4, we present the formalism that describes the closed form of the relation between data and patterns. In Section 5,we explain how patterns can be queried and introduce predicates for comparing patterns and query operatorsand in Section 6, we discuss the related work. Finally, in Section 7 we provide a brief description of a proto- type implementation, while Section 8 offers conclusions and topics of future work. 2. Patterns and Pattern-Base management systems By observing the motivating example of Section 1, we can see that the overall structure of the environmentand the operations performed by the user abide by some fundamental principles: •  There exist a  data space  and a  pattern space . •  Patterns serve as artifacts, which describe (a subset of) data with similar properties and/or behavior, pro-viding a compact and rich in semantics representation of data. Frequent itemsets in the case of our moti-vating example are compact (in terms of size) and rich in terms of information conveyed to the user. •  Relationships  exist among the members of the data space and the members of the pattern space. In general,these relationships can be of cardinality  many-to-many , i.e., a pattern can correspond to more than one dataitem and vice versa. We say that the data that are represented by a pattern form the  image of the pattern  inthe data space. •  The user navigates from the pattern to the data space , back and forth. Patterns are related to their dataimages and data are mapped to patterns corresponding to them. •  To avoid performing costly operations over large data spaces, the user exploits their concise pattern repre-sentations, by  applying query operators over collections of patterns, as well as individual patterns . Therefore,patterns are selected by value, compared to each other, matched for similarity or equivalence, just like datarecords are. At the same time, whole classes of patterns are compared to provide differences and similaritiesamong their members, again, exactly as collections of data records (e.g., relational tables, object-orientedclasses, etc.) would be handled. 372  M. Terrovitis et al. / Data & Knowledge Engineering 62 (2007) 368–397 
View more...
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!