Data mining, also known as knowledge discovery in databases, is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information from large databases. Its main aim is to give organizations tools to sift through large databases and find the trends, patterns, and correlations that can guide strategic decision making. An important data mining problem is the discovery of sequential patterns in large databases. Sequential pattern mining is the mining of frequently occurring patterns related to time or other sequences. Typical patterns that may be discovered include: "A customer who bought a TV three months ago is likely to order a new VCR within one month" or "A typical user's visit to a web server follows page A, then page C, then page D". Formally, the sequential pattern discovery problem may be formulated as follows: given a database D of sequences, find the maximal sequences among all sequences that have a certain user-specified minimum support.
However, new applications emerging from the widespread use of the WWW and of data mining create a need for advanced query processing on sequences. For example, we would like to search web access logs stored in a relational database for user sequences that support a given sequential pattern: "identify all users who access a web server in a way similar to a sequence q". For large data volumes this type of retrieval may be extremely time consuming, and it is not well supported by traditional indexing techniques.
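To make the notions of support and pattern query concrete, here is a minimal Python sketch. It rests on simplifying assumptions that are not part of the abstract: each sequence is a plain list of single items (page names), a sequence supports a pattern if the pattern occurs in it as an order-preserving but not necessarily contiguous subsequence, and "similar to q" is read as "contains q". The names contains_pattern, support, pattern_query, and the sample logs are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence

Item = Hashable


def contains_pattern(sequence: Sequence[Item], pattern: Sequence[Item]) -> bool:
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must come after the previous one.
    return all(item in it for item in pattern)


def support(database: Dict[str, Sequence[Item]], pattern: Sequence[Item]) -> float:
    """Fraction of sequences in the database that contain the pattern."""
    if not database:
        return 0.0
    hits = sum(contains_pattern(seq, pattern) for seq in database.values())
    return hits / len(database)


def pattern_query(database: Dict[str, Sequence[Item]], pattern: Sequence[Item]) -> List[str]:
    """Naive pattern query: scan every sequence and keep the users that match."""
    return [user for user, seq in database.items() if contains_pattern(seq, pattern)]


# Hypothetical web access log: user id -> ordered list of visited pages.
logs = {
    "u1": ["A", "B", "C", "D"],
    "u2": ["A", "C", "D"],
    "u3": ["C", "A", "D"],
    "u4": ["A", "D"],
}

q = ["A", "C", "D"]
print(pattern_query(logs, q))  # ['u1', 'u2']
print(support(logs, q))        # 0.5
```

The full scan in pattern_query touches every sequence in the database, which is the cost that a dedicated index is meant to avoid on large data volumes.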
The talk will present the concept of a new index structure for optimizing pattern search queries on a database of sequences. First, the set retrieval problem will be stated and the Group Hash Bitmap Index will be presented. Then the new index structure for sequence search queries (called pattern queries) will be introduced, with a focus on its physical structure, maintenance, and performance.
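As general background on the set retrieval problem, the sketch below shows a generic hashed bitmap signature filter in Python. It is not the Group Hash Bitmap Index itself, whose physical structure is not described in this abstract. It only illustrates the common filter-then-verify idea: each item is hashed to a bit position, a set is summarized by the OR of its item bits, and a subset query first compares signatures and then verifies the surviving candidates against the actual data. The signature width, the use of CRC32 as hash function, and all names are assumptions made for illustration.

```python
import zlib
from typing import Dict, Hashable, Iterable, List, Set

SIGNATURE_BITS = 16  # assumed width; a real index would tune this


def signature(items: Iterable[Hashable], bits: int = SIGNATURE_BITS) -> int:
    """OR together one hashed bit per item to form a bitmap signature."""
    sig = 0
    for item in items:
        position = zlib.crc32(str(item).encode("utf-8")) % bits
        sig |= 1 << position
    return sig


def build_index(database: Dict[int, Set[str]]) -> Dict[int, int]:
    """Map each set identifier to the signature of its set."""
    return {key: signature(items) for key, items in database.items()}


def subset_search(database: Dict[int, Set[str]], index: Dict[int, int],
                  query: Set[str]) -> List[int]:
    """Return identifiers of all sets that contain every item of `query`."""
    q_sig = signature(query)
    # Filter step: a containing set must have all query bits set (no false negatives).
    candidates = [key for key, sig in index.items() if sig & q_sig == q_sig]
    # Verification step: hash collisions can yield false positives, so check the real data.
    return [key for key in candidates if query <= database[key]]


# Hypothetical database of item sets.
baskets = {
    1: {"tv", "vcr"},
    2: {"tv", "radio"},
    3: {"vcr", "tape", "tv"},
}
index = build_index(baskets)
matches = subset_search(baskets, index, {"tv", "vcr"})
```

Because a signature ignores item order, a filter of this kind applied to sequences could only discard sequences that lack some item of the pattern; the order of the items would still have to be checked during verification.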
Speaker: Prof. Tadeusz Morzy
TU Poznan / Poland
When: Friday, 14th of June 2002, 15:30 (s.t.)
Where: HS10, University of Klagenfurt