
Mining Interesting Sequential Patterns for Intelligent Systems

Show-Jane Yen*
Department of Computer Science and Information Engineering, Ming Chuan University, Taipei, Taiwan

Mining sequential patterns means discovering the sequential purchasing behaviors of most customers from a large number of customer transactions. Past transaction data can be analyzed to discover customer purchasing behaviors such that the quality of business decisions can be improved. However, the transaction database can be very large. Finding all the sequential patterns in a large database is very time consuming, and users may be interested in only some of them. Moreover, different users may apply different criteria to the sequential patterns to be discovered. Traditional mining methods can therefore generate many sequential patterns that are uninteresting to the user. Hence, a data mining language needs to be provided such that users can query only the knowledge of interest to them from a large database of customer transactions. In this article, a data mining language is presented. With this language, users can specify the items of interest and the criteria of the sequential patterns to be discovered. Also, an efficient data mining technique is proposed to extract the sequential patterns according to the users' requests. © 2005 Wiley Periodicals, Inc.

1. INTRODUCTION

Because storage capacity keeps growing, large amounts of data can be stored in a database. Potentially useful information may be embedded in these large databases. Hence, how to discover the useful information that exists in such databases is becoming a popular field in computer science. The purpose of data mining [1–8] is to discover the useful information from large databases, such that the quality of decision making can be improved.

A transaction database consists of a set of transactions. A transaction typically consists of the transaction identifier, the customer identifier (the buyer), the transaction date (or transaction time), and the items purchased in this transaction. Mining sequential patterns [2,9–13] is to find the sequential purchasing behavior of most customers from a large transaction database. For example, suppose the sequential pattern ⟨{Basic Computer Concepts}{Programming Language}{System Programming}⟩ with a support of 70% is discovered from the transaction database of a bookstore. This means that 70% of the customers read "Programming Language" after reading the books about "Basic Computer Concepts," and then read "System Programming" after reading "Programming Language." The manager can use this information to recommend that new customers read "Programming Language" and "System Programming" when they read "Basic Computer Concepts."

*e-mail: [email protected].

INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, VOL. 20, 73–87 (2005). © 2005 Wiley Periodicals, Inc. Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/int.20054

The definitions used in discussing mining sequential patterns are as follows: An itemset is a nonempty set of items, and a sequence is an ordered list of itemsets. A sequence $s$ is denoted as $\langle s_1, s_2, \ldots, s_n \rangle$, where $s_i$ is an itemset. A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in another sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist integers $i_1 < i_2 < \cdots < i_n$, $1 \le i_k \le m$, such that $a_1 \subseteq b_{i_1}, \ldots, a_n \subseteq b_{i_n}$; in that case $\langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of $\langle b_1, b_2, \ldots, b_m \rangle$.

A customer sequence is the list of all the transactions of a customer, ordered by increasing transaction time. A customer sequence c supports a sequence s if s is contained in c. The support for a sequence s (or an itemset i) is defined as the ratio of the number of customer sequences that support s (or i) to the total number of customer sequences. If the support for a sequence s (or an itemset i) satisfies the user-specified minimum support threshold, then s (or i) is called a frequent sequence (or frequent itemset). The length of an itemset X is the number of items in X, and the length of a sequence s is the number of itemsets in s. An itemset of length k is called a k-itemset, and a frequent itemset of length k a frequent k-itemset. Likewise, a sequence of length k is called a k-sequence, and a frequent sequence of length k a frequent k-sequence.
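The containment and support definitions above can be illustrated with a minimal Python sketch. The function names `is_contained`, `support`, and `seq` are hypothetical, and the data are the five customer sequences of Table I (Section 3.3); the greedy left-to-right matching is one straightforward way to test containment, not necessarily how the article's system does it.

```python
def seq(*itemsets):
    """Hypothetical helper: seq("A", "CE") -> [frozenset("A"), frozenset("CE")]."""
    return [frozenset(i) for i in itemsets]

def is_contained(s, c):
    """True if sequence s is contained in sequence c.

    Each itemset of s is matched greedily to the earliest remaining
    transaction of c that is a superset of it.
    """
    pos = 0
    for itemset in s:
        while pos < len(c) and not itemset <= c[pos]:
            pos += 1
        if pos == len(c):
            return False
        pos += 1  # the next itemset must match a strictly later transaction
    return True

def support(s, customer_sequences):
    """Fraction of customer sequences that contain s."""
    hits = sum(1 for c in customer_sequences if is_contained(s, c))
    return hits / len(customer_sequences)

# Customer sequences of Table I (CID 1 to 5).
TABLE_I = [
    seq("C", "AC", "ACE"),
    seq("AE", "A", "ACE", "CE"),
    seq("C", "E", "E", "CE"),
    seq("BD", "AE", "BC", "AE", "ABE", "F"),
    seq("D", "DEF", "CEF", "AD", "BD", "DF"),
]

print(support(seq("C", "E"), TABLE_I))  # 0.8: every customer except CID 5
```

With a minimum support of, say, 50%, the sequence {C}{E} would therefore be a frequent 2-sequence in this database.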

A sequential pattern is a frequent sequence that is not contained in any other frequent sequence, that is, a maximal frequent sequence. Typically, the work of mining sequential patterns can be separated into two parts: The first part is to find all frequent itemsets (i.e., the frequent 1-sequences) [1,4,12,14–16]. The second part is to discover all the frequent sequences, and then find all the sequential patterns among the frequent sequences. In general, before generating the frequent sequences (or frequent itemsets), we need to generate the candidate sequences (or candidate itemsets), and scan the database to count the support for each candidate sequence (or candidate itemset) to decide whether it is a frequent sequence (or frequent itemset). A candidate sequence (or candidate itemset) of length k is called a candidate k-sequence (or candidate k-itemset).

To find sequential patterns, all frequent sequences need to be generated from the database. However, the size of the database can be very large. It is very time consuming to find all sequential patterns from the large database, and users may be interested only in the sequential patterns among certain items. Moreover, the criteria (such as minimum support) to discover sequential patterns may not be the same for all users. Many sequential patterns uninteresting to the users can be generated when traditional methods of mining sequential patterns are applied. Hence, a data mining language is needed such that users can query knowledge from a large database of customer transactions.

In designing a data mining language, two important issues need to be considered: an easy-to-use user interface and efficient processing of the data mining language. This article is concerned with both issues. We present a data mining language in which users only need to specify the criteria and the items of interest for discovering the sequential patterns.

2. RELATED WORK

For the problem of mining frequent itemsets, the algorithm Apriori [1] needs to scan the database in multiple passes. For the kth pass, it counts supports for the candidate k-itemsets and generates the frequent k-itemsets. Different from the previous algorithm, algorithm DLG (Direct Large itemset Generation) [16] scans the database once to record the related information and construct an association graph. After constructing the graph, DLG generates all the frequent itemsets by traversing the association graph. Hence, DLG is more efficient than the previous two algorithms. However, DLG needs a lot of memory space to record the related information. The algorithm DIC (Dynamic Itemset Counting) [4] can reduce the number of database scans. For each database scan, DIC counts the supports of candidates whose lengths can be different. However, DIC has to take a lot of time to count the supports of the candidate itemsets for each pass.

Agrawal and Srikant [2,9] proposed an approach to find sequential patterns. For the first part, they applied the algorithm Apriori [1] to find all the frequent itemsets. For the second part, they presented an algorithm, AprioriAll, to discover all the sequential patterns. This algorithm needs to make multiple passes over the database. For the kth pass, many candidate k-sequences need to be counted to generate all the frequent k-sequences. The algorithm DSG (Direct Sequential pattern Generation) [16] needs to scan the database only once to generate all the frequent sequences. During this database scan, DSG records the related information and constructs an association graph; all the frequent sequences can then be generated by traversing the constructed association graph. However, as the database grows, the related information may not fit in main memory.

Han and Pei proposed the FP-growth algorithm [14,17] and the PrefixSpan algorithm [10] to generate frequent itemsets and frequent sequences, respectively. These algorithms adopt a divide-and-conquer method to project and partition databases based on the currently discovered frequent patterns and grow such patterns to longer ones in the projected databases. These approaches are more efficient than the above approaches, because no candidate needs to be generated. However, the patterns discovered by these approaches may not be interesting for the users.

Meo, Psaila, and Ceri [18] proposed a SQL-like operator for extracting association rules. The SQL-like operator is capable of expressing the problem of mining association rules. However, its expressive power is still limited. For example, users may want to query the associations among certain items and all the other items. The SQL-like operator cannot express this kind of query. Furthermore, the SQL-like query language is inconvenient for naive users, although it is suitable for SQL programmers and experts, and the SQL-like operator performs set-oriented operations (i.e., join operations), which are very inefficient. Yen and Chen [19] and Yen and Lee [20] proposed a data mining language for mining interesting association rules. They presented a user-friendly mining language in which users can specify the items of interest and the criteria of the rules to be discovered. They also proposed an efficient data mining technique to extract the association rules according to the users' requests.

3. DATA MINING LANGUAGE AND DATABASE TRANSFORMATION

In this section, we propose a data mining query language and transform the original transaction data into another type to improve the efficiency of query processing.

3.1. Data Mining Language

In this section, we present a data mining language. Users can query sequential patterns by specifying the related parameters in the data mining language. The data mining language is defined as follows:

Mining ⟨Sequential Patterns⟩
From ⟨CSD⟩
With ⟨{D1}, {D2}, ..., {Dm}⟩
Support ⟨s%⟩

(1) In the Mining clause, ⟨Sequential Patterns⟩ is specified because the knowledge to be discovered is sequential patterns.

(2) In the From clause, ⟨CSD⟩ specifies the name of the database from which users query the sequential patterns.

(3) In the With clause, ⟨{D1}, {D2}, ..., {Dm}⟩ are user-specified itemsets ordered by increasing purchasing time. The notation "*" can appear in an itemset Di, where it denotes any itemset, and {Di} itself can be the notation "*", which represents any sequence.

(4) The Support clause is followed by the user-specified minimum support s%.

3.2. Database Transformation

To find the interesting sequential patterns efficiently, we need to transform the original transaction data into another type. Each item in each customer sequence is transformed into a bit string. The length of a bit string is the number of transactions in the customer sequence. If the ith transaction of the customer sequence contains an item, then the ith bit in the bit string for this item is set to 1. Otherwise, the ith bit is set to 0.

For example, in Table I, the customer sequence in CID 1 contains items A, C, and E. Because item A is contained in the second and the third transactions in this customer sequence, the second and the third bits in the bit string for item A in this customer sequence are 1s and the other bits in the bit string are 0s. The bit string for item A in CID 1 is 011. Hence, we can transform the customer sequence database (Table I) into the bit-string database (Table II).
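As a small illustration of this transformation, the sketch below converts one customer sequence into per-item bit strings. The function name `to_bit_strings` and the use of Python strings of "0" and "1" are assumptions made for readability; a real implementation would more likely pack the bits into machine words.

```python
def to_bit_strings(customer_sequence):
    """Map each item to a bit string with one bit per transaction.

    Bit i is "1" if the i-th transaction (in time order) contains the item.
    """
    items = sorted(set().union(*customer_sequence))
    return {
        item: "".join("1" if item in t else "0" for t in customer_sequence)
        for item in items
    }

# Customer sequence of CID 1 in Table I: {C}{AC}{ACE}
cid1 = [{"C"}, {"A", "C"}, {"A", "C", "E"}]
print(to_bit_strings(cid1))  # {'A': '011', 'C': '111', 'E': '001'}
```

The printed result matches the first row of Table II.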


3.3. Sequential Bit-String Operation

Suppose a customer sequence contains the two sequences $S_1$ and $S_2$. We present an operation called the sequential bit-string operation to check whether the sequence $S_1 S_2$ is also contained in this customer sequence. The process of the sequential bit-string operation is described as follows: Let the bit string for sequence $S_1$ in customer sequence c be $B_1$, and for sequence $S_2$ be $B_2$. Bit string $B_1$ is scanned from left to right until a bit value 1 is visited. We set this bit and all bits on the left-hand side of this bit to 0 and set all bits on the right-hand side of this bit to 1, and assign the resultant bit string to a template $T_b$. Then, the bit string for sequence $S_1 S_2$ in c can be obtained by performing a logical AND operation on bit strings $T_b$ and $B_2$. If the number of 1s in the bit string for sequence $S_1 S_2$ is not zero, then $S_1 S_2$ is contained in customer sequence c. Otherwise, the customer sequence c does not contain $S_1 S_2$.

For example, consider Table I. We want to check whether sequence {A}{C} is contained in customer sequence CID 1. From Table II, we can see that items A and C are contained in customer sequence CID 1, and the bit strings of items A and C in CID 1 are $B_A = 011$ and $B_C = 111$, respectively. We scan the bit string $B_A$ from left to right and generate the template bit string $T_b = 001$. By performing a logical AND operation on $T_b$ and $B_C$, we obtain that the bit string for sequence {A}{C} in customer sequence CID 1 is 001, in which the number of 1s is not zero. Hence, the sequence {A}{C} is contained in the customer sequence CID 1, and the resultant bit string 001 is the bit string for the sequence {A}{C} in the customer sequence CID 1.
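The sequential bit-string operation can be sketched in a few lines of Python. The names `template`, `bit_and`, and `sequential_op` are hypothetical, bit strings are again modeled as Python strings, and returning an all-zero template when the first bit string contains no 1 is an assumption (the text only applies the operation to sequences that occur in the customer sequence).

```python
def template(b1):
    """Build the template T_b from bit string B_1.

    The first 1 and every bit to its left become 0; every bit to its
    right becomes 1. An all-zero input yields an all-zero template
    (assumption, not stated in the article).
    """
    first = b1.find("1")
    if first == -1:
        return "0" * len(b1)
    return "0" * (first + 1) + "1" * (len(b1) - first - 1)

def bit_and(x, y):
    """Bitwise AND of two equal-length bit strings."""
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))

def sequential_op(b1, b2):
    """Bit string for the sequence S1 S2, given the bit strings of S1 and S2."""
    return bit_and(template(b1), b2)

# {A}{C} in CID 1: B_A = 011, B_C = 111
print(template("011"))              # 001
print(sequential_op("011", "111"))  # 001, so {A}{C} is contained in CID 1
```

Each 1 in the result marks a transaction in which the second sequence ends after the earliest occurrence of the first, which is why a nonzero result implies containment.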

Table I. Customer sequence database (CSD).

CID  Customer sequence
1    {C}{AC}{ACE}
2    {AE}{A}{ACE}{CE}
3    {C}{E}{E}{CE}
4    {BD}{AE}{BC}{AE}{ABE}{F}
5    {D}{DEF}{CEF}{AD}{BD}{DF}

Table II. Bit-string database.

CID  Items             Bit string for each item
1    A, C, E           011, 111, 001
2    A, C, E           1110, 0011, 1011
3    C, E              1001, 0111
4    A, B, C, D, E, F  010110, 101010, 001000, 100000, 010110, 000001
5    A, B, C, D, E, F  000100, 000010, 001000, 110111, 011000, 011001

4. MINING INTERESTING SEQUENTIAL PATTERNS

In this section, we describe how to process a user's query and find the interesting sequential patterns. If no notation "*" is specified in the With clause of a user's query, then the query checks whether the sequence following the With clause is a frequent sequence. We call this type of query a Type I query. If the user would like to extract the sequential patterns that contain other sequences in addition to the sequences specified in the With clause, then the notation "*" has to be specified in the With clause. We call this type of query a Type II query. In the following, we discuss the query processing for the two types of user queries.
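To make the distinction concrete, the sketch below models a query as a small Python data structure and classifies it as Type I or Type II. The class `MiningQuery`, its field names, and the representation of the With clause as a list of frozensets and "*" markers are all hypothetical; the article does not specify an internal representation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union

WILDCARD = "*"

@dataclass
class MiningQuery:
    """Hypothetical in-memory form of a Mining/From/With/Support query."""
    database: str
    with_clause: List[Union[str, FrozenSet[str]]]  # itemsets and "*" markers
    min_support: float  # e.g. 0.30 for Support <30%>

    def query_type(self) -> str:
        """Return "II" if any "*" appears in the With clause, else "I"."""
        for element in self.with_clause:
            if element == WILDCARD:
                return "II"
            if isinstance(element, frozenset) and WILDCARD in element:
                return "II"
        return "I"

q1 = MiningQuery("CSD", [frozenset("A"), frozenset("CE")], 0.30)
q2 = MiningQuery(
    "CSD",
    ["*", frozenset("E"), "*", frozenset("A"), "*", frozenset("B"), "*"],
    0.40,
)
print(q1.query_type(), q2.query_type())  # I II
```

The two instances correspond to Query 1 and Query 2 discussed in this section.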

4.1. Query Processing for Type I Query

Suppose the sequence specified in the With clause is $S = \{D_1\}\{D_2\} \ldots \{D_m\}$, where $D_i$ is an itemset. The method to check whether sequence S is a frequent sequence is described as follows:

Step 1. Scan the bit-string database and find the number of customer sequences that contain the sequence S.

For each record in the bit-string database, if the customer sequence contains all items in sequence S, then scan sequence S from left to right. For each itemset $D_i$ ($1 \le i \le m$) in sequence S, perform the logical AND operation on the bit strings for all items in $D_i$; the resultant bit string is the bit string for itemset $D_i$. If the bit string for itemset $D_i$ is not zero, then perform the sequential bit-string operation on the bit strings for $D_i$ and $D_{i+1}$. The resultant bit string is the bit string for sequence $\{D_i\}\{D_{i+1}\}$. Then, perform the sequential bit-string operation on the bit strings for sequence $\{D_i\}\{D_{i+1}\}$ and itemset $D_{i+2}$, and so on. While performing those operations, if the resultant bit string is zero, then we do not need to continue the process, because we can be sure that the customer sequence does not contain sequence S. If the final resultant bit string is not zero, then we increase the number of customer sequences that contain the sequence S.

Step 2. Determine if the sequence S is a frequent sequence.

The support for the sequence S can be obtained by dividing the number of customer sequences that contain the sequence S by the total number of customer sequences. If the support of the sequence S is no less than the minimum support threshold, then sequence S is a frequent sequence.

Example 1. We would like to check whether at least 30% of customers purchase items C and E together after purchasing item A. This query can be written as Query 1.

Query 1:

Mining ⟨Sequential Patterns⟩
From ⟨CSD⟩
With ⟨{A}, {C, E}⟩
Support ⟨30%⟩


First we transform Table I into Table II. In Step 1, for each record in Table II, we check if the record contains the three items A, C, and E. If the three items are contained in this record, then the sequential bit-string operation is performed. For example, items A, C, and E are contained in the first record. Hence, we perform the AND operation on the bit strings of items C and E, and the resultant bit string is the bit string of the itemset {C, E}. Finally, we perform the sequential bit-string operation on the bit strings for item A and itemset {C, E}. The bit string for the sequence {A}{C, E} is 001, which is not zero, so the number of customer sequences that support the sequence {A}{C, E} is increased. After scanning Table II, we find that two records contain sequence {A}{C, E}. The support for this sequence is therefore 2/5 = 40%, which is greater than the minimum support of 30%. Hence, the specified sequence in Query 1 is a frequent sequence; that is, more than 30% of customers purchase items C and E together after purchasing item A.
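The two steps of Type I query processing can be traced on Query 1 with the following sketch. It hard-codes the bit-string database of Table II; the function names (`itemset_bits`, `contains`, `type1_support`) are hypothetical, and Python strings stand in for real bit vectors.

```python
# Bit-string database of Table II: CID -> {item: bit string}
TABLE_II = {
    1: {"A": "011", "C": "111", "E": "001"},
    2: {"A": "1110", "C": "0011", "E": "1011"},
    3: {"C": "1001", "E": "0111"},
    4: {"A": "010110", "B": "101010", "C": "001000",
        "D": "100000", "E": "010110", "F": "000001"},
    5: {"A": "000100", "B": "000010", "C": "001000",
        "D": "110111", "E": "011000", "F": "011001"},
}

def bit_and(x, y):
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))

def sequential_op(b1, b2):
    """AND b2 with the template built from b1 (see Section 3.3)."""
    first = b1.find("1")
    if first == -1:
        return "0" * len(b1)
    return bit_and("0" * (first + 1) + "1" * (len(b1) - first - 1), b2)

def itemset_bits(record, itemset):
    """AND the bit strings of all items in the itemset; None if an item is absent."""
    bits = None
    for item in itemset:
        if item not in record:
            return None
        bits = record[item] if bits is None else bit_and(bits, record[item])
    return bits

def contains(record, sequence):
    """Step 1 for one record: does the customer sequence contain `sequence`?"""
    current = None
    for itemset in sequence:
        bits = itemset_bits(record, itemset)
        if bits is None:
            return False
        current = bits if current is None else sequential_op(current, bits)
        if "1" not in current:  # zero bit string: stop early
            return False
    return True

def type1_support(db, sequence):
    """Step 2: supporting customer sequences divided by all customer sequences."""
    return sum(contains(rec, sequence) for rec in db.values()) / len(db)

query1 = [("A",), ("C", "E")]  # With <{A}, {C, E}>
s = type1_support(TABLE_II, query1)
print(s, s >= 0.30)  # 0.4 True
```

Only CID 1 and CID 2 support the sequence, which reproduces the count of two records reported in Example 1.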

4.2. Query Processing for a Type II Query

For a Type II query, the notation "*" is specified in the With clause. For example, in Query 2, the user would like to find all the sequential patterns that contain the sequence {E}{A}{B} from the customer sequence database (Table I), and the minimum support threshold is set to 40%.

Query 2:

Mining ⟨Sequential Patterns⟩
From ⟨CSD⟩
With ⟨*, {E}, *, {A}, *, {B}, *⟩
Support ⟨40%⟩

Suppose the user specifies a sequence that contains m itemsets $D_1, D_2, \ldots, D_m$ in the With clause and $S = \{D_1\}\{D_2\} \ldots \{D_m\}$. We divide the algorithm for this type of query into two steps: The first step is to find the frequent (m + 1)-sequences that contain sequence S, and the second step is to find all the frequent q-sequences ($q \ge m + 2$) that contain sequence S. In the following, we describe the two steps.

Step 1. Find all the frequent (m + 1)-sequences.

Step 1.1. Scan the bit-string database; if all items in S are contained in a record, then output the items in this record and the bit string for each item into the 1-itemset database. If S is a frequent sequence, then find all frequent 1-itemsets. The frequent itemsets are found in each iteration. For the kth iteration ($k \ge 1$), the candidate (k + 1)-itemsets are generated, and the (k + 1)-itemset database is scanned to find the frequent (k + 1)-itemsets.

The method to generate the candidate (k + 1)-itemsets is described as follows [1]: For every two frequent k-itemsets $A = \{a_1, \ldots, a_{k-1}, r\}$ and $B = \{a_1, \ldots, a_{k-1}, t\}$, the candidate (k + 1)-itemset $\{a_1, \ldots, a_{k-1}, r, t\}$ can be generated. For each record in the k-itemset database, we use the frequent k-itemsets in this record and apply the above method to generate candidate (k + 1)-itemsets. Suppose the two frequent k-itemsets X and Y in a record generate candidate (k + 1)-itemset Z. We perform the AND operation on the two bit strings for the two frequent k-itemsets X and Y, and the resultant bit string is the bit string for the candidate (k + 1)-itemset Z. If this bit string is not zero, then output the candidate (k + 1)-itemset Z and its bit string into the (k + 1)-itemset database. We also output the frequent k-itemsets and their bit strings in each record into the frequent itemset database.

For example, in Table II, the records that contain the sequence {E}{A}{B} in the With clause in Query 2 are CID 4 and CID 5. Hence, the 1-itemset database can be generated, which is shown in Table III. Then, the 1-itemset database is scanned to generate frequent 2-itemsets and the 2-itemset database. The 2-itemset database is shown in Table IV. Finally, we can generate the frequent itemsets {A}, {B}, {C}, {D}, {E}, {F}, and {B, D}, and the frequent itemset database, which is shown in Table V.
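Step 1.1 can be sketched as a level-wise loop over per-record bit strings. The code below starts from the 1-itemset database exactly as printed in Table III and assumes, as in Query 2, five customer sequences in total and a 40% minimum support; the function name `frequent_itemsets` and the tuple representation of itemsets are hypothetical.

```python
from collections import Counter

# 1-itemset database as printed in Table III: CID -> {item: bit string}
TABLE_III = {
    4: {"A": "010110", "B": "101010", "C": "001000",
        "D": "100000", "E": "010110", "F": "000001"},
    5: {"A": "000100", "B": "000010", "C": "001000",
        "D": "110111", "E": "010000", "F": "010001"},
}

def bit_and(x, y):
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(x, y))

def frequent_itemsets(db, total_customers, min_support):
    """Level-wise search: join frequent k-itemsets that share their first
    k-1 items, AND their bit strings, and keep nonzero results per record."""
    # level maps CID -> {itemset (sorted tuple): bit string}
    level = {cid: {(item,): bits for item, bits in rec.items()}
             for cid, rec in db.items()}
    result = []
    while any(level.values()):
        # An itemset is supported by every record that lists it.
        counts = Counter(k for rec in level.values() for k in rec)
        frequent = {k for k, n in counts.items()
                    if n / total_customers >= min_support}
        result.extend(sorted(frequent))
        next_level = {}
        for cid, rec in level.items():
            keep = sorted(k for k in rec if k in frequent)
            out = {}
            for i, x in enumerate(keep):
                for y in keep[i + 1:]:
                    if x[:-1] == y[:-1]:  # same first k-1 items
                        bits = bit_and(rec[x], rec[y])
                        if "1" in bits:
                            out[x + (y[-1],)] = bits
            next_level[cid] = out
        level = next_level
    return result

print(frequent_itemsets(TABLE_III, total_customers=5, min_support=0.40))
```

In the second iteration the per-record candidates are exactly the 2-itemsets listed in Table IV, and only {B, D} occurs in both records, so the result is the six single items plus {B, D}.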

Step 1.2. Each frequent itemset (i.e., frequent 1-sequence) is given a unique number. Replace the frequent itemsets in the frequent itemset database with their numbers to form a 1-sequence database.

For example, the numbers for the frequent itemsets {A}, {B}, {C}, {D}, {E}, {F}, and {B, D} are 1, 2, 3, 4, 5, 6, and 7, respectively, and the 1-sequence database is shown in Table VI, which is generated from Table V.

Step 1.3. Generate candidate 2-sequences, and scan the 1-sequence database to generate the 2-sequence database and find all the frequent 2-sequences.

The candidate 2-sequences are generated as follows: For each frequent 1-sequence f except $D_1$, the itemset $D_1$ is combined with the frequent 1-sequence to generate a candidate 2-sequence. If a notation "*" appears before the itemset $D_1$ in the With clause, then the candidate 2-sequence $\{f\}\{D_1\}$ is generated. If the notation "*" appears after the itemset $D_1$, then the candidate 2-sequence $\{D_1\}\{f\}$ is generated. If the reverse order of a candidate 2-sequence is contained in the specified sequence S, then this candidate 2-sequence can be pruned.

For each record in the 1-sequence database, we use the frequent 1-sequences in the record and apply the above method to generate candidate 2-sequences. Suppose that the two frequent 1-sequences X and Y in a record generate candidate 2-sequence Z. We perform the sequential bit-string operation on the two bit strings for the two frequent 1-sequences X and Y, and the resultant bit string is the bit string for the candidate 2-sequence Z. If this bit string is not zero, then output the candidate 2-sequence Z and its bit string into the 2-sequence database. After scanning the 1-sequence database, the 2-sequence database can be generated and the candidate 2-sequences can be counted. If the support for a candidate 2-sequence is no less than the minimum support threshold, then the candidate 2-sequence is a frequent 2-sequence.

Table III. 1-itemset database.

CID  Items             Bit string for each item
4    A, B, C, D, E, F  010110, 101010, 001000, 100000, 010110, 000001
5    A, B, C, D, E, F  000100, 000010, 001000, 110111, 010000, 010001

Table IV. 2-itemset database.

CID  2-itemsets                    Bit string for each 2-itemset
4    {AB}, {AE}, {BC}, {BD}, {BE}  000010, 010110, 001000, 100000, 000010
5    {AD}, {BD}, {DE}, {DF}, {EF}  000100, 000010, 010000, 010001, 010000

For example, in Query 2, the first itemset specified in the With clause is {E}, whose number is 5, and the notation "*" appears both before and after the itemset {E}. Hence, the generated candidate 2-sequences are {1}{5}, {5}{1}, {2}{5}, {5}{2}, {3}{5}, {5}{3}, {4}{5}, {5}{4}, {6}{5}, {5}{6}, {7}{5}, and {5}{7}. From these candidate 2-sequences, {1}{5} and {2}{5} can be pruned, because the reverse orders of the two sequences are contained in the specified sequence {5}{1}{2}. After scanning the 1-sequence database (Table VI), the generated 2-sequence database is shown in Table VII, and the frequent 2-sequences are {5}{1}, {5}{2}, {4}{5}, {5}{3}, and {5}{6}.
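Step 1.3 can be reproduced on the Query 2 example with the sketch below. It takes the 1-sequence database as printed in Table VI, generates the candidate 2-sequences around $D_1$ = {E} (number 5), prunes those whose reverse is contained in S, and counts the rest with the sequential bit-string operation. All function names are hypothetical, and containment in S is tested on the itemset numbers only, which is a simplification that suffices for this example.

```python
from collections import Counter

# 1-sequence database as printed in Table VI: CID -> {itemset number: bit string}
TABLE_VI = {
    4: {1: "010110", 2: "101010", 3: "001000", 4: "100000",
        5: "010110", 6: "000001", 7: "100000"},
    5: {1: "000100", 2: "000010", 3: "001000", 4: "110111",
        5: "010000", 6: "010001", 7: "000010"},
}
S = (5, 1, 2)  # {E}{A}{B} written with itemset numbers
D1 = 5

def sequential_op(b1, b2):
    first = b1.find("1")
    if first == -1:
        return "0" * len(b1)
    tmpl = "0" * (first + 1) + "1" * (len(b1) - first - 1)
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(tmpl, b2))

def is_subsequence(small, big):
    it = iter(big)
    return all(x in it for x in small)  # consumes `it`, so order is respected

def candidate_2_sequences(codes, d1, s, star_before=True, star_after=True):
    out = []
    for f in codes:
        if f == d1:
            continue
        pairs = ([(f, d1)] if star_before else []) + ([(d1, f)] if star_after else [])
        # Prune a candidate whose reverse order is contained in S.
        out.extend(c for c in pairs if not is_subsequence(c[::-1], s))
    return out

def frequent_2_sequences(db, candidates, total_customers, min_support):
    counts = Counter()
    for rec in db.values():
        for x, y in candidates:
            if x in rec and y in rec and "1" in sequential_op(rec[x], rec[y]):
                counts[(x, y)] += 1
    return sorted(c for c, n in counts.items() if n / total_customers >= min_support)

cands = candidate_2_sequences(range(1, 8), D1, S)
print(len(cands))                                       # 10 candidates remain
print(frequent_2_sequences(TABLE_VI, cands, 5, 0.40))
# [(4, 5), (5, 1), (5, 2), (5, 3), (5, 6)]
```

The surviving pairs are the same five frequent 2-sequences listed in the example, here printed in sorted order.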

Step 1.4. Generate candidate 3-sequences, and scan the 2-sequence database to generate the 3-sequence database and find all the frequent 3-sequences.

The method to generate candidate 3-sequences is as follows: For every two frequent 2-sequences $S_1 = \{D_1\}\{r\}$, which is a subsequence of S, and $S_2 = \{D_1\}\{t\}$ (or $S_1 = \{D_1\}\{r\}$ and $S_2 = \{t\}\{D_1\}$), we can generate the candidate 3-sequences $\{D_1\}\{r\}\{t\}$ and $\{D_1\}\{t\}\{r\}$ (or $\{t\}\{D_1\}\{r\}$).

For the above example, the generated candidate 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{1}{3}, {5}{3}{2}, {5}{2}{3}, {5}{6}{1}, {5}{1}{6}, {5}{6}{2}, and {5}{2}{6}. After scanning each record in the 2-sequence database (Table VII), the candidate 3-sequences in each record can be generated, and the candidate 3-sequences can be counted. Finally, the generated frequent 3-sequences are {5}{1}{2}, {4}{5}{1}, {4}{5}{2}, {5}{3}{1}, {5}{3}{2}, {5}{1}{6}, and {5}{2}{6}.

Table V. Frequent itemset database.

CID  Frequent itemsets                     Bit string for each frequent itemset
4    {A}, {B}, {C}, {D}, {E}, {F}, {BD}    010110, 101010, 001000, 100000, 010110, 000001, 100000
5    {A}, {B}, {C}, {D}, {E}, {F}, {BD}    000100, 000010, 001000, 110111, 010000, 010001, 000010

Table VI. 1-sequence database.

CID  1-sequence           Bit string for each 1-sequence
4    1, 2, 3, 4, 5, 6, 7  010110, 101010, 001000, 100000, 010110, 000001, 100000
5    1, 2, 3, 4, 5, 6, 7  000100, 000010, 001000, 110111, 010000, 010001, 000010


Step 1.5. Frequent (h + 1)-sequences ($3 \le h \le m$) are generated in each iteration. For the (h − 2)th iteration, we use frequent h-sequences to generate candidate (h + 1)-sequences, and scan the h-sequence database to generate the (h + 1)-sequence database and find all the frequent (h + 1)-sequences.

We use the following method to generate candidate (h + 1)-sequences: For any two frequent h-sequences $S_1 = \{s_1\}\{s_2\} \ldots \{s_{h-1}\}\{r\}$ and $S_2 = \{s_1\}\{s_2\} \ldots \{s_{h-1}\}\{t\}$, in which $\{s_1\}\{s_2\} \ldots \{s_{h-1}\}$ is a subsequence of S or $\{r\}$ and $\{t\}$ are contained in S, the candidate (h + 1)-sequences $\{s_1\}\{s_2\} \ldots \{s_{h-1}\}\{r\}\{t\}$ and $\{s_1\}\{s_2\} \ldots \{s_{h-1}\}\{t\}\{r\}$ can be generated. If a generated candidate (h + 1)-sequence contains more than one itemset that is not contained in S, then the candidate (h + 1)-sequence can be pruned.

For each record in the h-sequence database, we use the frequent h-sequences in this record and the above method to generate candidate (h + 1)-sequences, and perform the sequential bit-string operation on the two bit strings for the two frequent h-sequences that generate the candidate (h + 1)-sequence. The resultant bit string is the bit string for the candidate (h + 1)-sequence. If the resultant bit string is not zero, then output the candidate (h + 1)-sequence and its bit string into the (h + 1)-sequence database and count the support for the candidate (h + 1)-sequence. After scanning the h-sequence database, the (h + 1)-sequence database can be generated and the supports for the candidate (h + 1)-sequences can be computed. If the support for a candidate (h + 1)-sequence is no less than the minimum support, then the candidate (h + 1)-sequence is a frequent (h + 1)-sequence. If there are frequent (m + 1)-sequences generated, then Step 2 needs to be performed. Otherwise, Step 3 is performed directly.

For the above example, according to Step 1.5, the generated candidate 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2}, {5}{1}{6}{2}, and {5}{1}{2}{6}. After scanning the 3-sequence database, the generated frequent 4-sequences are {4}{5}{1}{2}, {5}{3}{1}{2}, and {5}{1}{2}{6}. Because there are frequent 4-sequences generated, we need to perform the next step.

Step 2. The frequent (m + n + 1)-sequences ($n \ge 1$) that contain the specified sequence S are generated in each iteration. For the nth iteration, we use the frequent (m + n)-sequences to generate candidate (m + n + 1)-sequences and scan the (m + n)-sequence database and the 1-sequence database to generate the (m + n + 1)-sequence database, in which each record contains the candidate (m + n + 1)-sequences but not their bit strings, and find the frequent (m + n + 1)-sequences.

The method to generate candidate (m + n + 1)-sequences is as follows: For every two frequent (m + n)-sequences $S_1 = \{s_1\}\{s_2\} \ldots \{s_i\}\{r\}\{s_{i+1}\} \ldots \{s_{m+n-1}\}$ and $S_2 = \{s_1\}\{s_2\} \ldots \{s_j\}\{t\}\{s_{j+1}\} \ldots \{s_{m+n-1}\}$ ($i \le j$), in which $\{r\}$ is not contained in $S_2$ and $\{t\}$ is not contained in $S_1$, a candidate (m + n + 1)-sequence $\{s_1\}\{s_2\} \ldots \{r\} \ldots \{t\} \ldots \{s_{m+n-1}\}$ can be generated. For each record in the (m + n)-sequence database, we also use every two frequent (m + n)-sequences in this record and apply the above method to generate a candidate (m + n + 1)-sequence, and perform the sequential bit-string operations on the bit strings for the itemsets in the candidate (m + n + 1)-sequence by scanning the 1-sequence database. If the resultant bit string is not zero, then output the candidate (m + n + 1)-sequence into the (m + n + 1)-sequence database and count the support for the candidate (m + n + 1)-sequence. After scanning the (m + n)-sequence database, the (m + n + 1)-sequence database can be generated and the frequent (m + n + 1)-sequences can be found.

Table VII. 2-sequence database.

CID  2-sequence                                              Bit string for each 2-sequence
4    {3}{5}, {4}{5}, {7}{5}, {5}{1}, {5}{2}, {5}{3}, {5}{6}  000110, 010110, 010110, 000110, 001010, 001000, 000001
5    {4}{5}, {5}{1}, {5}{2}, {5}{3}, {5}{4}, {5}{6}, {5}{7}  010000, 000100, 000010, 001000, 000111, 000001, 000010

For the above example, according to Step 2, the generated candidate 5-sequences are ⟨4⟩⟨5⟩⟨3⟩⟨1⟩⟨2⟩, ⟨4⟩⟨5⟩⟨1⟩⟨2⟩⟨6⟩, and ⟨5⟩⟨3⟩⟨1⟩⟨2⟩⟨6⟩. For each record in the 4-sequence database, we use the frequent 4-sequences in this record to generate candidate 5-sequences for this record, and perform the sequential bit-string operations on the bit strings for the itemsets in each candidate 5-sequence to generate its bit string. If the resultant bit string is not zero, we count the support for the candidate 5-sequence. After scanning the 4-sequence database, the generated frequent 5-sequences are ⟨4⟩⟨5⟩⟨3⟩⟨1⟩⟨2⟩, ⟨4⟩⟨5⟩⟨1⟩⟨2⟩⟨6⟩, and ⟨5⟩⟨3⟩⟨1⟩⟨2⟩⟨6⟩. These frequent 5-sequences further generate the candidate 6-sequence ⟨4⟩⟨5⟩⟨3⟩⟨1⟩⟨2⟩⟨6⟩. After scanning the 5-sequence database, the generated frequent 6-sequence is also ⟨4⟩⟨5⟩⟨3⟩⟨1⟩⟨2⟩⟨6⟩, and no candidate 7-sequence is generated. Hence, the algorithm for mining frequent sequences terminates.
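The support-counting step can be pictured with a short Python sketch. The sequential bit-string operation itself is defined earlier in the article and is not restated here, so the sketch assumes one plausible form of it: the result keeps the 1 bits of the next itemset that lie strictly after the first 1 bit of the prefix, with the leftmost bit standing for the earliest transaction. The function names and the sample 1-sequence database are hypothetical.

```python
from functools import reduce


def seq_op(prefix_bits: int, itemset_bits: int) -> int:
    """Assumed sequential bit-string operation.

    Keeps the 1 bits of itemset_bits that lie strictly to the right of the
    leftmost 1 bit of prefix_bits (leftmost bit = earliest transaction).
    """
    if prefix_bits == 0:
        return 0
    later_positions = (1 << (prefix_bits.bit_length() - 1)) - 1
    return itemset_bits & later_positions


def sequence_bits(candidate, one_sequence_bits):
    """Fold seq_op over the itemset bit strings of one customer record."""
    strings = [one_sequence_bits.get(code, 0) for code in candidate]
    return reduce(seq_op, strings)


def count_support(candidate, one_sequence_db):
    """Number of customer records whose resultant bit string is not zero."""
    return sum(1 for record in one_sequence_db.values()
               if sequence_bits(candidate, record) != 0)


# Hypothetical 1-sequence database: customer id -> {itemset code: bit string}.
ONE_SEQUENCE_DB = {
    "c1": {4: 0b100000, 5: 0b010110, 1: 0b000110, 2: 0b001010},
    "c2": {5: 0b000001, 1: 0b100000, 2: 0b010000},
}
```

With these made-up bit strings, `count_support((4, 5, 1, 2), ONE_SEQUENCE_DB)` evaluates to 1, because only the first record contains the four itemsets in that order.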

Step 3. For each frequent sequence, the code for each itemset in the frequent sequence is replaced with the itemset itself. If a frequent sequence is not contained in another frequent sequence, then this frequent sequence is a sequential pattern.

For the above example, the frequent sequences that satisfy the user requirement in Query 2 are ⟨E⟩⟨A⟩⟨B⟩, ⟨D⟩⟨E⟩⟨A⟩⟨B⟩, ⟨E⟩⟨C⟩⟨A⟩⟨B⟩, ⟨E⟩⟨A⟩⟨B⟩⟨F⟩, ⟨D⟩⟨E⟩⟨C⟩⟨A⟩⟨B⟩, ⟨D⟩⟨E⟩⟨A⟩⟨B⟩⟨F⟩, ⟨E⟩⟨C⟩⟨A⟩⟨B⟩⟨F⟩, and ⟨D⟩⟨E⟩⟨C⟩⟨A⟩⟨B⟩⟨F⟩, and the sequential pattern is ⟨D⟩⟨E⟩⟨C⟩⟨A⟩⟨B⟩⟨F⟩.
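Step 3 can be sketched in Python as a decoding pass followed by a maximality filter. The code-to-itemset table below is inferred from the running example (code 4 stands for D, 5 for E, and so on). The helper names are hypothetical, and containment is assumed to be the usual order-preserving subset test between itemsets.

```python
# Code table inferred from the running example; every itemset has one item here.
CODE_TO_ITEMSET = {
    1: frozenset({"A"}), 2: frozenset({"B"}), 3: frozenset({"C"}),
    4: frozenset({"D"}), 5: frozenset({"E"}), 6: frozenset({"F"}),
}


def decode(coded_sequence):
    """Replace each itemset code with the itemset itself."""
    return tuple(CODE_TO_ITEMSET[code] for code in coded_sequence)


def is_contained(small, big):
    """True if the itemsets of `small` appear, in order, inside itemsets of `big`."""
    remaining = iter(big)
    return all(any(s <= b for b in remaining) for s in small)


def sequential_patterns(frequent):
    """Keep the frequent sequences that are not contained in another one."""
    return [s for s in frequent
            if not any(s != other and is_contained(s, other) for other in frequent)]


def show(sequence):
    """Render a sequence as text, e.g. <D><E>."""
    return "".join("<" + ",".join(sorted(itemset)) + ">" for itemset in sequence)


coded_frequent = [
    (5, 1, 2), (4, 5, 1, 2), (5, 3, 1, 2), (5, 1, 2, 6),
    (4, 5, 3, 1, 2), (4, 5, 1, 2, 6), (5, 3, 1, 2, 6), (4, 5, 3, 1, 2, 6),
]
patterns = sequential_patterns([decode(s) for s in coded_frequent])
```

For the eight frequent sequences of the example, `patterns` holds a single sequence, which `show` renders as `<D><E><C><A><B><F>`.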

4.3. The Special Cases for the Type II Query

In this section, we discuss two special cases for the Type II query.

Case 1. The notation “*” is just before or after the specified sequence in the With clause.

Query 3:

Mining <Sequential Patterns>
From <CSD>
With <{A,C}{D,E},*>
Support <40%>

In Query 3, the user would like to find all the sequential patterns that begin with the sequence ⟨A,C⟩⟨D,E⟩ and whose support is no less than 40%.

Case 2. There is no specified sequence in the With clause.

Query 4:

Mining <Sequential Patterns>
From <CSD>
With <*>
Support <50%>

In Query 4, the user would like to find all the sequential patterns whose support is no less than 50%.
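To make the two special cases concrete, the following Python sketch splits a query into its clauses and classifies the With clause. It assumes that clause arguments are delimited by angle brackets and itemsets by braces, as in Query 3 and Query 4. The function name, the returned dictionary, and the case labels are hypothetical, and the full grammar of the language is not reproduced.

```python
import re

QUERY_RE = re.compile(
    r"Mining\s*<(?P<target>[^<>]*)>\s*"
    r"From\s*<(?P<source>[^<>]*)>\s*"
    r"With\s*<(?P<spec>[^<>]*)>\s*"
    r"Support\s*<(?P<support>\d+(?:\.\d+)?)%>"
)


def parse_query(text):
    """Split a query into its clauses and classify its With clause."""
    match = QUERY_RE.search(text)
    if match is None:
        raise ValueError("not a recognised mining query")
    # A token is either the wildcard "*" or a run of adjacent {..} itemsets.
    tokens = re.findall(r"\*|(?:\{[^{}]*\})+", match["spec"])
    runs = [tok for tok in tokens if tok != "*"]
    if not runs:
        case = "case 2"    # no specified sequence, as in Query 4
    elif len(runs) == 1 and len(tokens) == 2:
        case = "case 1"    # "*" just before or after the sequence, as in Query 3
    else:
        case = "general"   # anything else is left to the general procedure
    sequences = [
        [[item.strip() for item in body.split(",")]
         for body in re.findall(r"\{([^{}]*)\}", run)]
        for run in runs
    ]
    return {
        "source": match["source"],
        "support_percent": float(match["support"]),
        "case": case,
        "sequences": sequences,
    }


query_3 = parse_query(
    "Mining <Sequential Patterns> From <CSD> With <{A,C}{D,E},*> Support <40%>")
query_4 = parse_query(
    "Mining <Sequential Patterns> From <CSD> With <*> Support <50%>")
```

Here `query_3["case"]` is `"case 1"` with the specified sequence parsed as `[["A", "C"], ["D", "E"]]`, and `query_4["case"]` is `"case 2"` with no specified sequence.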

For the above two special cases, the methods of candidate generation are different from that of the general case, such as Query 2. In the two special cases, because there is no notation “*” between the itemsets in the specified sequence in the With clause, we can simply use apriori-gen [9] to generate candidate sequences. The difference between the two special cases is that the candidate sequences generated in Case 1 (e.g., Query 3) have to contain the specified sequence, whereas in Case 2 (e.g., Query 4) there is no specified sequence to consider. The query processing for the two cases is simpler than for the general case.
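For readers unfamiliar with apriori-gen, a simplified Python sketch of its join and prune steps for sequences is given below. It follows the general idea of the procedure in Ref. 9 rather than its exact text: sequences are tuples of itemset codes, a sequence is never joined with itself, and the input data are illustrative only. How Case 1 restricts candidates to those containing the specified sequence is not modeled here.

```python
from itertools import permutations


def apriori_gen(frequent):
    """Candidate k-sequences from frequent (k-1)-sequences (join, then prune)."""
    frequent = set(frequent)
    # Join: two sequences that agree on all but their last element.
    joined = {p + (q[-1],)
              for p, q in permutations(frequent, 2)
              if p[:-1] == q[:-1]}
    # Prune: every (k-1)-subsequence of a candidate must itself be frequent.
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))}


# Illustrative frequent 3-sequences (not taken from the article).
frequent_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = apriori_gen(frequent_3)
```

The join step produces (1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), and (1, 3, 5, 4); only (1, 2, 3, 4) survives pruning, because each of the others has a 3-subsequence that is not frequent.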

5. EXPERIMENTAL RESULT

In this section, we evaluate the performance of our algorithm on synthetic databases of sales transactions. The method to generate synthetic transactions is similar to the one used in Ref. 9. The parameters used in our experiments are shown in Table VIII.

First, we generate four synthetic transaction databases: C5-T10-I10-R10, C10-T10-I10-R10, C20-T10-I10-R10, and C10-T10-I10-R20. The parameter settings are shown in Table IX. We also generated three general cases of Type II queries. Figure 1 shows the relative execution time for PrefixSpan [10] and our algorithm for a generated Type II query, using the four synthetic datasets shown in Table IX. Figure 2 shows the relative execution time for the three generated queries in the database C20-T10-I10-R10 for different minimum support thresholds.
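The database names encode their parameter settings, which a few lines of Python can unpack. The sketch assumes that the number of transactions equals the number of customers times the average number of transactions per customer, which agrees with the settings listed in Table IX. The function name is hypothetical.

```python
def parse_database_name(name):
    """Turn a name such as 'C5-T10-I10-R10' into a parameter dictionary.

    C and the derived D are in thousands, T and I are averages, and R is a
    percentage.
    """
    params = {part[0]: int(part[1:]) for part in name.split("-")}
    # Assumed relation: transactions = customers x transactions per customer.
    params["D"] = params["C"] * params["T"]
    return params


settings = {name: parse_database_name(name)
            for name in ("C5-T10-I10-R10", "C10-T10-I10-R10",
                         "C20-T10-I10-R10", "C10-T10-I10-R20")}
```

For example, `settings["C20-T10-I10-R10"]["D"]` is 200, that is, 200,000 transactions for 20,000 customers.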

Table VIII. The parameters.

C     Number of customers (in 000s)
D     Number of transactions (in 000s)
|T|   Average number of the transactions for each customer
|I|   Average size of the transactions
R     The percentage of the repeated items between every two neighbor transactions

The experimental results show that our algorithm outperforms the PrefixSpan algorithm, and the performance gap increases as the minimum support threshold decreases. When the minimum support decreases, the number of frequent sequences, the number of projected databases, and the size of each projected database all increase, which degrades the performance of the PrefixSpan algorithm. In addition, the PrefixSpan algorithm needs extra time to pick, from the large number of frequent itemsets, those that match the user queries.

Our algorithm, in contrast, focuses only on the items specified in user queries; that is, no redundant frequent sequences are generated. Hence, our algorithm can significantly outperform the PrefixSpan algorithm.

6. CONCLUSION

In this article, we introduce a data mining language. With this language, users can specify the items or sequences of interest and the minimum support threshold of the sequential patterns to be discovered.

We propose an efficient data mining technique to process the user requirement. Our algorithms reduce the number of combinations of itemsets or sequences in each customer sequence when counting the supports of the candidate sequences, and reduce the number of candidate sequences according to the user’s requests. However, to improve the efficiency, we need to generate another database, which costs extra memory space. Nevertheless, reducing the response time is more important for a data mining query system. In the future, we shall consider applying our approach to discover interesting frequent traversal patterns [21,22] in the World Wide Web environment.

Table IX. The parameter settings for the four synthetic databases.

Database           |C|   |D|   |T|   |I|   |R|
C5-T10-I10-R10      5     50    10    10    10
C10-T10-I10-R10     10    100   10    10    10
C20-T10-I10-R10     20    200   10    10    10
C10-T10-I10-R20     10    100   10    10    20

Figure 1. Relative execution time.

References

1. Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Bocca JB, Jarke M, Zaniolo C, editors. Proc Int Conf on Very Large Data Bases, Santiago de Chile. San Francisco, CA: Morgan Kaufmann; 1994. pp 487–499.

2. Agrawal R, Srikant R. Mining sequential patterns: Generalizations and performance improvements. In: Apers P, Bouzeghoub M, Gardarin G, editors. Int Conf on Extending Database Technology (EDBT), Avignon, France. Heidelberg, Germany: Springer-Verlag; 1996. pp 3–17.

3. Bayardo RJ Jr, Agrawal R. Mining the most interesting rules. In: Proc 5th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM Press; 1999. pp 145–154.

4. Brin S, Motwani R, Ullman JD, Tsur S. Dynamic itemset counting and implication rules for market basket data. In: Proc ACM SIGMOD Int Conf on Management of Data. New York: ACM Press; 1997. pp 255–264.

5. Liu B, Hsu W, Ma Y. Mining association rules with multiple minimum supports. In: ACM Conf on Knowledge Discovery in Data (KDD). New York: ACM Press; 1999. pp 337–341.

6. Liu J, Pan Y, Wang K, Han J. Mining frequent item sets by opportunistic projection. In: ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining; 2002. pp 229–238.

7. Yen SJ. Mining generalized multiple-level association rules. Lecture Notes in Artificial Intelligence: Principles of Data Mining and Knowledge Discovery, Vol 1910; 2000. pp 679–684.

8. Zaki MJ. SPADE: An efficient algorithm for mining frequent sequences. Mach Learn J 2001;42:31–60.

9. Agrawal R, Srikant R. Mining sequential patterns. In: Yu PS, Chen ALP, editors. Proc Int Conf on Data Engineering. IEEE Computer Society; 1995. pp 3–14.

10. Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu MC. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In: ICDE’01, Heidelberg, Germany. IEEE Computer Society; 2001. pp 215–224.

Figure 2. Relative execution time for database C20-T10-I10-R10.

11. Yen SJ, Cho CW. An efficient approach to discovering sequential patterns from large databases. Lecture Notes in Artificial Intelligence: Principles of Data Mining and Knowledge Discovery, Vol 1910; 2000. pp 685–690.

12. Yen SJ, Chen ALP. A graph-based approach for discovering various types of association rules. IEEE Trans Knowl Data Eng 2001;13:839–845.

13. Yen SJ, Cho CW. An efficient approach for updating sequential patterns using database segmentation. Int J Fuzzy Syst 2001;3(2):422–431.

14. Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: Chen W, Naughton JF, Bernstein PA, editors. SIGMOD’00, Dallas, TX. ACM Press; 2000. pp 1–12.

15. Han J, Pei J, Lu H, Nishio S, Tang S, Yang D. H-Mine: Hyper-structure mining of frequent patterns in large databases. In: Cercone N, Lin TY, Wu X, editors. Proc 2001 IEEE Int Conf on Data Mining (ICDM’01), San Jose, CA, November 29–December 2, 2001. IEEE Computer Society; 2001.

16. Yen SJ, Chen ALP. An efficient approach to discovering knowledge from large databases. In: Proc Int Conf on Parallel and Distributed Information Systems. IEEE Computer Society; 1996. pp 8–18.

17. Han J, Pei J. Mining frequent patterns by pattern-growth: Methodology and implications. In: ACM SIGKDD, December 2000. New York: ACM Press; 2000. pp 30–36.

18. Meo R, Psaila G, Ceri S. A new SQL-like operator for mining association rules. In: Vijayaraman TM, Buchmann AP, Mohan C, editors. Proc Int Conf on Very Large Data Bases. San Francisco, CA: Morgan Kaufmann; 1996. pp 122–133.

19. Yen SJ, Chen ALP. An efficient data mining technique for discovering interesting association rules. In: Hameurlain A, Tjoa AM, editors. Proc 8th Int Conf and Workshop on Database and Expert Systems Applications. Heidelberg, Germany: Springer-Verlag; 1997. pp 664–669.

20. Yen SJ, Lee YS. Mining interesting association rules: A data mining language. Lect Notes Artif Intell 2002;2336:172–176.

21. Yen SJ. An efficient approach for analyzing user behaviors in a web-training environment. Int J Distance Educ Technol 2003;1:55–71.

22. Yen SJ, Lee YS. An efficient data mining algorithm for discovering web access patterns. Lect Notes Comput Sci Web Technol Appl 2003;2642:187–192.
