summarizing large query logs in ettu - arxiv · summarizing large query logs in ettu gokhan kul,...

12
Summarizing Large Query Logs in Ettu * Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya University at Buffalo, SUNY {gokhanku, ducthanh, tingxie, pcoonan, chandola, okennedy, shambhu}@buffalo.edu ABSTRACT Database access logs are large, unwieldy, and hard for hu- mans to inspect and summarize. In spite of this, they remain the canonical go-to resource for tasks ranging from perfor- mance tuning to security auditing. In this paper, we address the challenge of compactly encoding large sequences of SQL queries for presentation to a human user. Our approach is based on the Weisfeiler-Lehman (WL) approximate graph isomorphism algorithm, which identifies salient features of a graph — or in our case of an abstract syntax tree. Our gen- eralization of WL allows us to define a distance metric for SQL queries, which in turn permits automated clustering of queries. We also present two techniques for visualizing query clusters, and an algorithm that allows these visualizations to be constructed at interactive speeds. Finally, we evalu- ate our algorithms in the context of a motivating example: insider threat detection at a large US bank. We show ex- perimentally on real world query logs that (a) our distance metric captures a meaningful notion of similarity, and (b) the log summarization process is scalable and performant. Keywords Query Logs; Summarization; Database Security; Clustering 1. INTRODUCTION Database access logs are used in a wide variety of settings, from performance monitoring, to benchmark development, and even to database auditing and compliance. Examin- ing a history of the queries serviced by a database can help database administrators with tuning, or help security an- alysts to assess the possibility and/or extent of a security breach. However, logs from enterprise database systems are far too large to examine manually. As one example, our group was able to obtain a log of all query activity at a ma- jor US bank for over a period of 19 hours. The log includes * The first three authors contributed equally and should be considered a joint first author nearly 17 million SQL queries and over 60 million stored pro- cedure execution events. Even excluding stored procedures, it is unrealistic to expect any human to manually inspect all 17 million queries. Let us consider an analyst (let’s call her Jane) faced with the task of analyzing such a query log. She might first at- tempt to identify some aggregate properties about the log. For example, she might count how many times each table is accessed or the frequency with which different classes of join predicates occur. Unfortunately, such fine-grained proper- ties do not always provide a clear picture of how the data is being used, combined, and/or manipulated. To get a com- plete picture of the intent of users and applications interact- ing with the database, Jane must look at entire queries. So, she might turn to more coarse-grained properties, like the top-k most frequent queries. Here too she would run into a problem. Consider the following two queries: SELECT * FROM R WHERE R.A = 1 SELECT * FROM R WHERE R.A = 2 They differ only in the value of a single constant: 1 or 2. When counting queries, should she count these as the same query or two different queries? If she chooses to count them together, are there perhaps other “similar” queries that she should also count together, like for example: SELECT * FROM R WHERE R.B = 1 Of course the answer depends on the content of the log, the database schema, and numerous other details that may not be available to Jane immediately when she sits down to analyze a log. As a result, this type of log analysis can quickly become a tedious, time-consuming process. In this paper, we introduce a framework that automati- cally creates compact, easy-to-consume summaries of large query logs. The first part of this framework is an algorithm for extracting features from the abstract syntax trees (ASTs) of SQL queries. The resulting feature vectors help to define a similarity metric for queries, which in turn allows us to cluster queries into groups based on their structure. To further aid users in their understanding of query logs, we propose two techniques for summarizing and visualizing queries in a cluster: (1) a text explanation that overviews common features of queries in the cluster, (2) a graph vi- sualization that presents these features in the context of an actual query. 1.1 Motivating Application: Ettu It is increasingly important for organizations to be able to detect and respond to cyber attacks. An especially difficult arXiv:1608.01013v1 [cs.DB] 2 Aug 2016

Upload: others

Post on 27-Sep-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

Summarizing Large Query Logs in Ettu∗

Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan,Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

University at Buffalo, SUNY{gokhanku, ducthanh, tingxie, pcoonan, chandola, okennedy, shambhu}@buffalo.edu

ABSTRACTDatabase access logs are large, unwieldy, and hard for hu-mans to inspect and summarize. In spite of this, they remainthe canonical go-to resource for tasks ranging from perfor-mance tuning to security auditing. In this paper, we addressthe challenge of compactly encoding large sequences of SQLqueries for presentation to a human user. Our approach isbased on the Weisfeiler-Lehman (WL) approximate graphisomorphism algorithm, which identifies salient features of agraph — or in our case of an abstract syntax tree. Our gen-eralization of WL allows us to define a distance metric forSQL queries, which in turn permits automated clustering ofqueries. We also present two techniques for visualizing queryclusters, and an algorithm that allows these visualizationsto be constructed at interactive speeds. Finally, we evalu-ate our algorithms in the context of a motivating example:insider threat detection at a large US bank. We show ex-perimentally on real world query logs that (a) our distancemetric captures a meaningful notion of similarity, and (b)the log summarization process is scalable and performant.

KeywordsQuery Logs; Summarization; Database Security; Clustering

1. INTRODUCTIONDatabase access logs are used in a wide variety of settings,

from performance monitoring, to benchmark development,and even to database auditing and compliance. Examin-ing a history of the queries serviced by a database can helpdatabase administrators with tuning, or help security an-alysts to assess the possibility and/or extent of a securitybreach. However, logs from enterprise database systems arefar too large to examine manually. As one example, ourgroup was able to obtain a log of all query activity at a ma-jor US bank for over a period of 19 hours. The log includes

∗The first three authors contributed equally and should beconsidered a joint first author

nearly 17 million SQL queries and over 60 million stored pro-cedure execution events. Even excluding stored procedures,it is unrealistic to expect any human to manually inspect all17 million queries.

Let us consider an analyst (let’s call her Jane) faced withthe task of analyzing such a query log. She might first at-tempt to identify some aggregate properties about the log.For example, she might count how many times each table isaccessed or the frequency with which different classes of joinpredicates occur. Unfortunately, such fine-grained proper-ties do not always provide a clear picture of how the data isbeing used, combined, and/or manipulated. To get a com-plete picture of the intent of users and applications interact-ing with the database, Jane must look at entire queries. So,she might turn to more coarse-grained properties, like thetop-k most frequent queries. Here too she would run into aproblem. Consider the following two queries:

SELECT * FROM R WHERE R.A = 1

SELECT * FROM R WHERE R.A = 2

They differ only in the value of a single constant: 1 or 2.When counting queries, should she count these as the samequery or two different queries? If she chooses to count themtogether, are there perhaps other “similar” queries that sheshould also count together, like for example:

SELECT * FROM R WHERE R.B = 1

Of course the answer depends on the content of the log,the database schema, and numerous other details that maynot be available to Jane immediately when she sits downto analyze a log. As a result, this type of log analysis canquickly become a tedious, time-consuming process.

In this paper, we introduce a framework that automati-cally creates compact, easy-to-consume summaries of largequery logs. The first part of this framework is an algorithmfor extracting features from the abstract syntax trees (ASTs)of SQL queries. The resulting feature vectors help to definea similarity metric for queries, which in turn allows us tocluster queries into groups based on their structure.

To further aid users in their understanding of query logs,we propose two techniques for summarizing and visualizingqueries in a cluster: (1) a text explanation that overviewscommon features of queries in the cluster, (2) a graph vi-sualization that presents these features in the context of anactual query.

1.1 Motivating Application: EttuIt is increasingly important for organizations to be able to

detect and respond to cyber attacks. An especially difficult

arX

iv:1

608.

0101

3v1

[cs

.DB

] 2

Aug

201

6

Page 2: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

class of cyber attack to detect is the so called insider at-tacks that occur when employees misuse legitimate access toa resource like a database. The difficulty arises because ap-parently anomalous behavior from a legitimate actor mightstill have legitimate intent. For example, a bank teller inBuffalo who withdraws a large sum for a client from Cali-fornia may be acting legitimately (e.g., if the client has justmoved and is purchasing a house), or may be committingfraud. The “U.S. State of Cybercrime Survey” [1] statesthat 37% of organizations have experienced an insider in-cident, while a 2015 study [2] identified insider attacks ashaving the longest average response time of any attack typesurveyed: 54.5 days.

The challenge of addressing of insider attacks lies in thedifficulty of precisely specifying access policies for shared re-sources such as databases. Coarse, permissive access policiesprovide opportunities for exploitation. Conversely, restric-tive fine-grained policies are expensive to create and limita legitimate actor’s ability to adapt to new or unexpectedtasks. In practice, enterprise database system administra-tors regularly eschew fine-grained database-level access con-trol. Instead, large companies commonly rely on reactivestrategies that monitor external factors like network activitypatterns and shared file transfers. In a corporate environ-ment, monitoring user actions requires less preparation andgives users a greater degree of flexibility. However, exter-nal factors do not always provide a strong attestation of thelegitimacy of a database user’s actions.

The Ettu1 system, currently under development at theUniversity at Buffalo [3], seeks to help analysts to monitorquery access patterns for signs of insider attack. Databaselogging and monitoring is expensive, so Ettu needs to be ableto identify normal, baseline database behaviors that can beeasily flagged as “safe” and ignored from normal logging andpost-mortem attack analysis. In this paper, we focus on oneconcrete part of the overall Ettu system, responsible for sum-marizing and visualizing query logs. In the complete system,this component serves to help analysts generate patterns ofsafe queries, and to quickly analyze large multi-day querylogs to identify potential attack activity.

1.2 ContributionsIn addition to enabling insider threat analysis [3], the abil-

ity to quickly visualize patterns in query logs has potentialapplications for new benchmark development [4], databasetuning [5], and privacy analysis [6]. This paper representsa first step towards effective summarization for query logs.Concretely, in this paper we: (1) identify and motivate thechallenge of log summarization, (2) adapt the Weisfeiler-Lehman approximate graph isomorphism algorithm [7] tothe task of feature extraction from SQL queries, (3) proposeEuclidean distance over the resulting feature vectors as asimilarity metric for SQL that can be used create clustersof structurally similar queries, (4) introduce techniques forvisualizing and summarizing the queries in each cluster, and(5) experimentally demonstrate that our similarity metricmirrors human intuition and that our clustering and visual-ization processes are scalable.

This paper is organized as follows. We provide back-

1Ettu is derived from the last words of the Roman emperorJulius Caesar, “Et tu, Brute?” in Latin, meaning “You, too,Brutus?” in English to emphasize that this system is meantto detect the unexpected betrayals of trusted people

ground information that creates a basis for our research inSection 2 and introduce our core contribution, a techniquefor query similarity evaluation, in Section 3. We next ex-plain our proposed clustering and summarization techniquesin Sections 4 and 5, respectively. In Section 6, we evaluatethe accuracy and performance of our proposed techniques.We compare our approach to related work in Section 7. Fi-nally, we conclude by identifying the steps needed to deployquery summarization into practice in Section 8.

2. BACKGROUNDAs a declarative language, the abstract syntax tree (AST)

of a SQL statement acts as a proxy for the intent of thequery author. We thus argue that structural similarity is ameaningful metric for query similarity. Naively, we wouldgroup a query Q with other queries that have nearly (orcompletely) the same AST as Q. This structural definitionof intent has seen substantial use already, particularly in thetranslation of natural language queries into SQL [8].

SELECT

COLS FROM WHERE

COL_ID A !=

A.a A.b A.a 5

Figure 1: An abstract syntax tree of query SELECT A.a, A.b

FROM A WHERE A.a != 5.

For the remainder of this paper, we will use Q to de-note both the query itself as well as its tree encoding. Anideal distance metric would measure the level of similarityor overlap between these tree encodings and their substruc-tures. Naively, for two SQL queries Q1 and Q2, one satis-factory metric might be to count the number of connectedsubgraphs of Q1 that are isomorphic to a subgraph of Q2.Subgraph isomorphism is NP-complete, but a computation-ally tractable simplification of this metric can be found inthe Weisfeiler-Lehman (WL) Algorithm [7] illustrated as Al-gorithm 1. Instead of comparing all possible subgraphs ofQ1 against all possible subgraphs of Q2, the WL algorithmrestricts itself to specific types of subgraphs.

Given a query Q, let N ∈ Q denote a node in Q. Nis initially labeled with the SQL grammar symbol that Nrepresents. The i-descendent tree of N : desc(N, i) is thesub-tree rooted at N , including all descendants of N in Qup to and including a depth of i.

Example 1. Given the tree in Figure 1, desc(COLS, 2) isthe tree containing the nodes COLS, COL_ID, A.a, and A.b.

The WL algorithm identifies a query Q by all possible i-descendent trees that can be generated from Q:

id(Q) = { desc(N, i) | N ∈ Q ∧ i ∈ [0, depth(Q)] }

Here depth(Q) is the maximum distance from the root of Qto a leaf. To make this comparison efficient, subtrees aredeterministically assigned a unique integer identifier, andthe query is described by the bag of Q’s i-descendent tree

Page 3: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

Algorithm 1 Weisfeiler–Lehman Algorithm

1: procedure Weisfeiler–Lehman2: for each tree t ∈ T do3: H ← depth(Q)4: for h← 1;h ≤ H;h + + do5: for each node N ∈ Q do6: s← CreateSet(labelsof{N}

⋃∑N.children)

7: if hashtable.get(s) 6= null then8: label(N)← hashtable.get(s)9: else

10: hashtable.put(s, label(N))11: end if12: if IsLeaf(

∑N.children) then

13: N.IsLeaf = true14: end if15: end for16: end for17: end for18: end procedure

identifiers. Thus two query trees with an isomorphic subtreewill both include the same identifier in their description.The bag of identifiers is encoded as a (sparse) feature vectorand allows Euclidean distance to measure the similarity (orrather dis-similarity) of two queries.

0

1 2 3

4

8 7

5 6

8 9

10

1 2 3

4

8 7

5 6

8 9

10

11 12 13

4

8 7

5 6

8 9

10

11 12 13

14

8 7

5 15

8 9

16

11 12 13

14

8 7

5 15

8 9

16

17 12 18

14

8 7

5 15

8 9

19

17 12 18

14

8 7

5 15

8 9

19

17 12 18

14

8 7

5 15

8 9

Initial ASTIteration: 1

Level: 0Iteration: 1

Level: 1

Iteration: 1Level: 2

Iteration: 2Level: 0

Iteration: 2Level: 1

Iteration: 3Level: 0 Encoded AST

10 implies 0,1,2,3 11 implies 1,412 implies 2,513 implies 3,6

14 implies 4,7,815 implies 6,8,9

16 implies 10,11,12,13 17 implies 11,1418 implies 13,15

19 implies 12,16,17,18

Figure 2: Weisfeiler-Lehman algorithm applied on ASTgiven in Figure 1.

Figure 2 shows how the WL algorithm is applied on theAST of the query given in Figure 1. First, every distinctnode of the AST gets labeled with a unique integer. Ifthe same node appears more than once, each instance islabeled with the same integer. As the algorithm progresses,the dotted box emphasizes the region being examined, whilethe text below represents new labels being synthesized fromexisting labels. Grey nodes have been fully labeled, whilewhite nodes are still being processed.

3. SYSTEM OUTLINEThe log summarization mechanism in Ettu operates in

three stages: (1) A vectorization phase where the ag-gregated query logs are processed into feature vectors, (2)An offline clustering phase where the feature vectors aregrouped together according to their structural similarity,and (3) A summarization phase where the clusters aresummarized to present a human user with a concise overviewof the cluster’s distinguishing features. These stages are il-lustrated in Figure 3.

Query Vectors

Query ClustersQuery ClustersQuery ClustersQuery Clusters

(1) Vectorization

(2) Clustering

(3) Summarization

Query Logs

Query ClustersQuery

ClustersQuery ClustersSummary

of Clusters

Figure 3: The typical query auditing workflow in Ettu.

The initial input to Ettu is a log of query activity pro-cessed by the target database. The log may be annotatedwith supplemental metadata like usernames and timestamps.The goal of the methodology described here is to produce aconcise, but precise sketch of the most common query struc-tures in the log. This summary contains information aboutgroups of similar queries aiming to explain what a query ineach group looks like and what kind of features are expectedto be encountered in a query.

Scenario 1. Jane feeds the logs collected to Ettu. Thesystem processes the file and groups similar queries togetherand passes them to the next stage of the operation.

3.1 Generalization of Weisfeiler-LehmanThe WL algorithm assumes zero knowledge about struc-

tural features of the trees it compares, limiting itself to i-descendent subtrees. Conversely, the grammar of SQL hasa very well-defined structure. In Ettu, we exploit this struc-ture to eliminate redundancy and create features that morereliably encode the query’s semantics. We specifically iden-tify and address three features of SQL that limit the effec-tiveness of WL.

First, the number of features created by the WL algorithmis large. Although clustering naturally prunes out features

Page 4: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

without discriminative power, we can use SQL’s semanticsto identify structures that are unlikely to be useful. For ex-ample, the subtree desc(SELECT, 1) in Figure 1 is common tovirtually all queries. Such features can be pruned preemp-tively.

Second, constants in the query require special treatmentin feature encoding. The output of this operation is a queryskeleton.

Example 2. Consider the query SELECT * FROM R WHERE

R.a = 1. Subtree R.a = 1 can be abstracted as R.a = ? be-cause there is much higher probability that this abstractionwill be shared thus it correctly captures commonality amongqueries while feature R.a = 1 can coexist offering distin-guishing power ( ? denotes the placeholder constant)

Finally, SQL makes frequent use of commutative and as-sociative operators. The semantics of such operators mayoverlap, even if their i-descendent subtrees do not.

Example 3. Consider the Boolean expressions A AND B

and A AND B AND C. The former AST is a AND node with2 children while the latter has 3. Although the two ASTshave 3 0-descendent subtrees in common, they share no 1-descendent subtrees; WL ignores the similarity between thetwo conjunctive expressions.

To address these challenges we first generalize WL algorithmto allow the creation of dynamic, language-specific features,and then use this facility to present a suite of features thatpermit the generalized W-L algorithm to reliably detect fea-tures of SQL queries.

3.2 Dynamic FeaturesFeature generation process can be explained in three steps:

(1) Based on the structure of the query AST and exist-ing set of features, use some rules to identify collectionsof existing features as candidates for new feature creation;(2) Use a set of constraints to filter out undesirable fea-tures; (3) Offer a function digest(·) to synthesize new fea-ture labels from these collections of existing features. To bemore precise, suppose we are given a query Q, each nodeN ∈ Q has a set of features FN , initially defined as thesingleton set containing its SQL grammar atom. A rule isapplied to construct FN from the node itself, features gener-ated by other rules, and the feature sets of its descendants:{ FC | C is a descendent of N }. A constraint acts as a filter,removing features from consideration. Rules and constraintsare applied in a pre-defined priority order, bottom-up to thenodes of Q to compute feature sets for each of Q’s nodes. Fi-nally, the feature vector of Q is defined by the bag

⊎N∈Q FN

of all features from all nodes of Q. Note that each node hasat most one instance of a feature (FN is a set), while anentire AST Q might have multiple instances of the feature(the feature vector of Q is a bag).

Let us consider the steps of feature generation in reverseorder. First, to define features by structural recursion, weneed a way to synthesize new feature identifiers from collec-tions of existing identifiers. We define a function digest(·)that deterministically constructs feature identifiers from ex-isting collections (whether bags, sets, or lists) of identi-fiers. The output of digest(·) is deterministic and unique:for any collection A, digest(A) = digest(A) and for anytwo collections A 6≡ B, digest(A) 6= digest(B). Unique-ness is achieved by maintaining a persistent table of all

Predicate DescriptionNode(N) Node N in query AST.Atom(N, f) Feature f encodes the grammar atom of

node N .

Child(Nc, Np, n) Node Nc is nth Child of Np.Own(N, f) N owns f or equivalently Feature f is in

set FN of Node N .

Table 1: Dynamic Feature Selection Predicates.

identifiers assigned, while determinism for bags and setsis achieved by first sorting features in the collection. Wewill denote list, bag, and set digests as digest`([a, b, a, c]),digestb({|a, a, b, c|}), and digests({a, b, c}), respectively.

Example 4. List-digests respect order, and as a resultdigest`([a, b, a, c]) 6= digest`([a, a, b, c]). Bag-digests re-spect multiplicity but not order, so digestb({|a, b, a, c|}) =digestb({|a, a, b, c|}), but also that digestb({|a, a, b, c|}) 6=digestb({|a, b, c|}). Finally, set digests differentiate neither,so digests({a, a, b, c}) = digests({a, b, c})

Second, we wish to ignore features that are semanticallyuninformative, like the 1-dependent subtree desc(SELECT, 1)in Figure 1. Given a list of these features and/or patternsdescribing these features, we can mark unwanted features byin the table of identifier assignments generated by digest(·).Unwanted feature identifiers are simply not added to FN asit is constructed.

Finally, we consider the construction of new features bystructural recursion. We have found that all of the cus-tomized features we are interested in creating can be ex-pressed through first–order logic, or more precisely thoughdatalog [9]. In datalog, logic rules are expressed in the form

Head(Y ) : −Body(XY ) ≡ ∀y∃x (Body(xy)→ Head(y))

Here, Body(XY ) is a conjunction of first–order predicatesover act as conditions, true for some set of valuations tothe variables XY . The rule indicates that the conjunctionof predicates Head(Y ) should be true for any valuation forwhich there exists some X for which Body(XY ) is true.

For rule generation, we employ two types of predicates:

• Structural predicates, which are defined according theAST’s graph structure. We use two structural pred-icates: Node(N) is true for all nodes N ∈ G andChild(Nc, Np, n) is true if Nc is the n-th child of Np.

• Feature predicates,which are defined by the rule sys-tem. The feature predicate Atom(N, f) is true if f isthe feature encoding the grammar atom for node Nin the AST. Rules define new features by populatingpredicate Own(N, f) with Atom(N, f) as initial ele-ments N would own.

For easy explanation, we present Table 1 with built-in pred-icates that we will use in our feature generating rules. Wewill first show that Weisfeiler-Lehman algorithm can be eas-ily re-formalized under datalog using recursion in 3.2.1. Fi-nally, in 3.2.2-3.2.3, the solutions for challenges brought byWL algorithm are re-formalized in datalog.

3.2.1 Weisfeiler-Lehman RevisitedWe first show that the base Weisfeiler-Lehman algorithm

can be easily re-formalized under datalog with recursion andaggregation. In practice, the model of datalog that we use

Page 5: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

is loosely based on that of LogiQL [10] which supports ag-gregate and grouping operations.

1. IterF (N, fN , 0) : − Atom(N, fN )2. with f = digest`([fN , digestb({|∀fC |})])

IterF (N, f, n) : −IterF (C, fC , n− 1), Child(C,N, ),Atom(N, fN ), (n < Height(N))

3. Own(N, f) : − IterF (N, f, )

Predicate IterF (N, f, n) is an intermediate store for fea-tures f created for node N at iteration n. denotes a wild-card that accepts anything and we denote aggregate group-ing by a ∀ symbol. Line 1 serves as initial condition. Startingfrom line 2, the datalog expression forms a group of all childlabels fC assigned during the prior iteration and computesa bag-digest over the result (digestb({|∀fC |})).

3.2.2 SkeletonsTo give special treatment to binary expressions which we

only care about its left hand side, Example 5 shows a rulethat works on Equality operator.

Example 5. Name : Equality Skeletonwith f = digest`([fp, fl, ?])Own(f,Np) : −

Atom(Np, fp), Np = Equals,Child(Nr, Np, 2), Atom(Nr, fr), Const(fr),Child(Nl, Np, 1), Atom(Nl, fl), ¬Const(fl)

? is a placeholder indicating a constant value. Const isa unitary predicate indicates whether a string represents aconstant value. This datalog expression seeks any node Np

in the AST whose grammar atom is Equals and whose rightchild’s atom is a constant. Right child is replaced by ? anddigest`([fp, fl, ?]) generates the resulting feature that en-codes the 1-descendent subtree AST rooted at Np.

3.2.3 Conjunctive Normal FormIn response to the challenge brought by commutative and

associative operators, one solution would be pre-processingthe query AST and converting the composite boolean for-mula into Conjunctive Normal Form (CNF). Since booleanformula in the query reveal the intent by which the query is-suer defines and limits the desired data records, algebraicallyequivalent boolean formula reflect the same intent. ThusCNF normalization which re-formalizes an equivalence classof boolean formula into a uniform structure helps to gener-ate features more informative in the sense of intent. CNFis preferred over Disjunctive Normal Form(DNF) as CNF isstructurally more sensitive to logic OR expressions which isone of the major tools of retrieving additional data records.For example, the structure of OR-piggybacked formula ((AAND B) OR C) in its CNF changes to ((A OR C) AND (BOR C)) comparing with AND-piggybacked formula (A ANDB AND C). In this section, we show that pre-processing thesubtree of some boolean formula into its CNF can be trans-lated into rules expressed in datalog.

First, we introduce the basic predicates and notationsused in this section. CNF is a conjunction of clauses inwhich each clause is a disjunction of literals. A literal is anatomic boolean formula or its negation. Accordingly, we de-fine unitary predicate L(N) to indicate that node N is theroot of the subtree in the AST that represents a literal. Anyliteral in a composite formula can be located by seeking aboolean operator node with zero logic connectives within its

operand’s subtree. Hence the predicate L(N) can be realizedby creating rules that enumerate all root nodes of literals ina boolean formula. As the AST subtree of a literal can befully represented by its root, we give it the special notationNL that distinguishes it from common node notation.

Logic OR operator is commutative, hence a DisjunctiveClause denoted by DC in CNF is a set of literals. The pred-icate that records the member literals of a DC is denoted asMember(NL, DC). For simplicity, we use notation DC andNL representing both the object and the unique ID of theobject. Identification of DC is generated by digesting allmember IDs in its set, denoted as digests({

⋃NL∈DC NL}).

Since logic AND operator is also commutative, the CNFof any formula rooted at node N can be viewed as a setof DCs in which each member is attached to N . We usea new predicate Attach(DC,N) indicating that DC is at-tached to N . This predicate enables Ettu to attach thegenerated CNF structure to the root of the original booleanformula. If negated formula are considered, the rule for CNFnormalization should be paired with the rule for DNF nor-malization. Turning a negated formula into CNF is the sameas turning it into its DNF then its CNF is equivalent to aconjunction of negated conjunctive clauses of its DNF. Forsimplicity we assume there is no negated formula and allnodes in the example 6 is a descendent of the root of sometarget boolean formula.

Example 6. Name: Conjunctive Normal Form(1) Base case: with DC =digests({NL})Member(NL, DC), Attach(DC,NL) : − L(NL)(2) For AND node:Attach(DC,N) : −

Atom(N, f), f = AND,Attach(DC,Nc), Child(Nc, N, )

(3) For OR node:with DC =digests({

⋃Member(NL,DCm)∨Member(NL,DCn) NL})

Member(Np, DC),Member(Nq, DC),Attach(DC,N) : −

Member(Np, DCm),Attach(DCm, Ni),Child(Ni, N, i),Member(Nq, DCn),Attach(DCn, Nj),Child(Nj , N, j),

Atom(N, f), f = OR, (i 6= j)

Since generating CNF for the sub-formula rooted at anynode only requires information from its children, the CNFnormalization process is bottom-up. For the base case, ex-pression (1) in the rule creates a singleton bag containing achosen literal itself as a new disjunctive clause and attachit to the root node of the literal. As the process is bottom-up, nodes assume that each of their children have alreadycompleted CNF and have the resulting disjunctive clausesattached to it. For root nodes of literals, since they do nothave sub-formula, they only need to provide their CNF up-wards while root nodes of logic connectives AND and ORneed only combine children CNFs with respect to their ownlogic grammar. Expression (2) locates a logic AND nodeand attaches all disjunctive clauses from its children to it-self. However, in expression (3) for logic OR node, we needto apply distributive law as a disjunction of CNFs need tobe re-formalized as a single CNF. We union any pair of lit-eral sets (DCm, DCn) attached to different children Ni,Nj

with i 6= j of logic OR node N and the resulting disjunctiveclause DC is attached to N .

Page 6: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

4. CLUSTERINGThe next step is to cluster query skeletons by their fea-

ture vectors. The result of this process is a set of clusters inwhich query skeletons with similar structures are groupedtogether. We considered two possible clustering approachesfor use in Ettu: k-means and hierarchical clustering [11]. K-means outputs a set of query skeletons for k clusters. Onthe other hand, hierarchical clustering outputs a dendro-gram – a tree structure which shows how each query can begrouped together. For our specific use case where we aim tosummarize groups of similar queries, the number of groupsor clusters is not known in advance. For this reason, weeventually elected to use hierarchical clustering because itgives us the flexibility to choose the number of clusters aftercomputing the dendrogram. In addition, a dendrogram is aconvenient way to visualize the relationship between queriesand how each query is grouped in the clustering process.

Hierarchical clustering recursively selects the closest pairof data points and merges them into a single virtual datapoint. This process requires an agglomerative rule, a for-mula for aggregating feature vectors together into the vec-tor of each virtual data point. There are several often usedpossibilities, including complete–linkage, single–linkage andaverage–linkage. In our experiments, the choice of agglom-erative rules does not significantly change the clustering re-sult; all commonly used agglomerative rules produce equallyreliable outputs. Consequently, we arbitrarily chose to usecomplete–linkage, in which the farthest distance from a datapoint in one cluster to a data point in another cluster is con-sidered as the distance between two clusters.

Another aspect that we should consider when doing hi-erarchical clustering is is the distance metric to compare apair of data points. In our experiments, we use Euclideandistance as the distance metric, though other metrics suchas the Manhattan distance gave similar clustering outputs.

5. SUMMARIZATIONTo help human inspectors reliably visualize and interpret

clusters, we need a step to present sets of similar queriesas compact summaries. An ideal presentation is in tree-like form: presenting a coarse view of the cluster at first,while allowing the inspector to easily unfold portions of thetree to obtain more detail. As a basis for this visualization,we observe that each feature is constructed, essentially as adigest of grammar atoms for a syntactically correct partialAST, and can be displayed as a subtree.

Each cluster can be characterized by a set of features com-mon to most, if not all of the queries grouped into it. Anideal form of summarization then, would first present a set ofcharacteristic features for the entire cluster, and then allowusers to iteratively subdivide the cluster into smaller andsmaller sub-clusters, each with a progressively larger set ofcharacteristic features. We call the resulting hierarchical vi-sualization a recursive feature characterization(RFC).

Generating a RFC requires us to overcome three chal-lenges: First, allowing a user to navigate through the RFCat an interactive speed requires scalable memory consump-tion and the ability to quickly identify clusters with largenumbers of common features. Second, the feature vectorsused for clustering are usually redundant; The same nodemay appear multiple times across features representing dis-tinct subtrees of the AST. We need a way to quickly and

efficiently identify and remove redundant features. Over-coming the two above challenges is sufficient to produce alist of features that can be presented to a user. The first andmost basic visualization is: a text-based summary of eachcluster. However, seeing features in isolation without thecontext of a surrounding query can make it difficult to un-derstand the implications of each feature. Consequently, ourthird challenge is to create a visualization that can presentfeature bags in context.

To address the first challenge, we adopt Frequent Pat-tern Trees (FP Trees) [12]. Normally, FP Trees is used tomine frequent patterns of item-sets, for example sets of itemsfrequently bought together. Individual items in each item-set are sorted into a sequence and an FP Tree is built byprefix-aligning these sorted sequences. Any path of nodesstarting from the root of the tree to an internal node in FPTree represents a shared prefix of item sequences with eachnode counting the number of prefixes that pass it. Eachfeature f is constructed as the digest of a set of nodes ofthe query AST, the lineage of the feature. The lineage canbe reassembled into a partial AST, denoted by AST f . Themultiplicity of f in the feature bag

⊎N∈Q FN generated for

query Q is denoted by multQ(f) . For each query we cre-ate an item-set by constructing 2-tuples from features in

the query{〈 f,multQ(f) 〉 | f ∈

⊎N∈Q FN

}. The result-

ing sets of items as 2-tuples are used to build an FP-Treefor the cluster.

FP Trees offer high compression rates; the ratio of thesize of an FP Tree to the total number of items appearingin the input item-sets is low. To increase the compressionrate, the item-sets can be sorted in the descending order ofthe number of occurrences of each feature tuple. Doing soincreases the chances that item-set prefixes can be shared,farther increasing the compression rate. This assumptionhas been validated experimentally [12] but can still fail, forexample when an item occurs frequently, but as an isolatedsingleton. As an optimization to counteract this case, weconsidered a metric related to frequency that we call totalpopularity. The popularity of a feature in the feature vectorof some query is the number of distinct features that coex-ist with it. Total popularity is the sum of popularities forthe feature across all queries in the cluster. We comparethe compression rate for FP Tree generated using both fre-quency and total popularity as sort orders in Figure 4. Thegraph shows how compression rate varies with the numberof queries incorporated into the tree. As the number ofqueries increases, the compression rate for both approachesincreases super-linearly, suggesting that FP Trees are a goodway to visualize large query clusters, while the use of totalpopularity as a sort order provides a small, but measurableimprovement in compression rate.

The FP Tree structure provides a tree-like sub-clusteringof queries that naturally tracks features common to queriesin the sub-clusters. Users can selectively traverse it by fold-ing or unfolding subtrees to settle on an appropriate level ofdetail, without being overwhelmed. Recall that each pathfrom the root to a current visited node in the FP tree cor-responds to a prefix of item sequences shared by real lifequeries. As the user traverses the tree, Ettu tries to reducethe length of the prefix by assembling its underlying fea-ture ASTs and ideally creates a single assembly AST withupdated multiplicity as the summary. In practice, the re-

Page 7: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

0 2 4 6 8 10 12Num of Queries #105

12

14

16

18

20

22

24

26

28

Com

pres

sion

Rat

e

Total PopularityFrequency

Figure 4: Comparison of compression rate of Total Popular-ity and Frequency in FP tree.

sulting summary is a set of assembly ASTs and we wouldlike to improve readability by reducing the set size. Furtherexpanding and visualizing an FP Tree is as intuitive as sim-ply merging more feature AST components to the existingassembly AST.

In general, reducing the set size contains two steps: (1)Shrink and (2) Merge. The shrink step removes redundantitems from the set as a preparation for merge step. Recallthat an item I is of the form 〈 f,multQ(f) 〉. Given a specificfeature fc ∈

⊎N∈Q FN , the set Sc is defined as{

〈 fp,multQ(fp) 〉 | AST fc ⊂ AST fp , fp ∈⊎

N∈Q

FN

}

covers all items in the bag whose AST fp fully contains AST fc .For any item Ip = 〈 fp,multQ(fp) 〉 ∈ Sc, suppose fc oc-curs mc times in the AST of feature p, by viewing itemIp, the inspector has already been informed of feature fc(multQ(fp) ∗ mc) number of times. If we sum this for allIp ∈ Sc and compare the sum with the multiplicity of fea-ture fc, namely if multQ(fc) <

∑Ip∈Sc

multQ(fp)∗mc, item

Ic can be safely removed as redundant. Example 7 shows asimple case of shrink.

Example 7. Consider a feature f that encodes the ex-pression A = B with multiplicity 2, 〈 f, 2 〉. Consider an-other 2-tuple 〈 f ′, 1 〉 in the same feature bag which encodesA = with multiplicity 1. It is possible that the multiplicityof A = is less than A = B as it is not necessary that eachsubgraph of A = B is included in the bag. By viewing A = Btwo times, the inspector is already informed of the existenceof A = structure two times which renders 〈 f ′, 1 〉 redundant.

Merge step further reduces the set size by a sequence ofpairwise merge operation on items. Merge of two items isnot merely the merge of their feature ASTs. The order ofmerge operations and the difference in feature multiplicitybetween two items should be carefully taken care of. In fact,the correctness of merge step needs to be verified by countingthe number of original queries that share the resulting set ofASTs. This count should match with the count of queries,recorded in the FP Tree, that pass the chosen prefix.

In practice, shrink step is much faster and reliable thanmerge step: to guarantee the correctness of merge step, ver-ification back on the original queries is required while shrinkonly involves local comparison on subtrees. In some cases

that shrink step already produces minimal set size, mergestep can be skipped.

Thus we recommend a property called Seal for featurebags. Part of its goal is skipping merge step.

Definition 1. Property: SealGeneric Seal: At least the complete subtree rooted at each

node in the AST should be encoded as a feature;Tailored Seal: For any pair of distinct features in the fea-

ture bag of Q, if they assemble to ASTasmbly which is alsocontained in query AST of Q, then structure ASTasmbly

should be included in the feature bag for Q.

Generic Seal guarantees that features generated fully recon-structs the original query AST. With Tailored Seal, mergestep can be skipped and features that encode assembly treesdescribed in Tailored Seal are kept for summarization. Thuswe are capable of providing the inspector a quick and a con-cise summary. It comes with the cost that depending on thenumber of additional features created, Seal would affect thescalability of Ettu with respect to memory consumption ofFP Tree and speed of clustering. It may also affect cluster-ing result as it changes the feature vector. Measuring andunderstanding the overall effect of applying property Sealare left for future work.

Figure 5 shows an example AST resulting from a rule thatonly encodes the parental relationship between any node inquery AST and the whole subtree rooted at one of its childnode. The AST is generated from a query and summariza-tion information of the query set that it belongs to. The10 features shown with colored nodes appears at least oncein every query in the query set. The legend shows thathow many times the feature appears in all queries as “Totaloccurrence”, and in the example query itself as “Query oc-currence.” What makes this query different from the otherones is the uncolored nodes. They either don’t appear inthe other queries, or appear infrequently. This visualizationmakes a human inspector’s work easier to spot importantfeatures and understand what the queries in the groups do,and identify why a query is different from the other queries.

6. EXPERIMENTSIn this section, we run experiments on our proposed frame-

work to verify its performance and feasibility. We are specif-ically interested in two properties of Ettu’s log classifier.First is whether or not our system can capture the notionof query intents through clustering queries by their similar-ity levels. To answer this question, we ran an experimentto measure the level of correspondence between clusteringresults from Ettu and one from manual human inspection.As we will show, there is a high correlation between Ettuand manual clustering. Second, especially given that clus-tering in general has O(N2) scaling in the data size, we wantto show that Ettu’s clustering performs well enough to bepractical for use in corporate settings. As shown in the ex-periments, Ettu can cope very well with large data sizes evenon commodity desktop computers.

The data we use to run our experiments is based on SQLquery logs that capture all query activity on the major-ity of databases at a major US bank over a period of ap-proximately 19 hours. Logs are anonymized by replacingall constants with hash values generated by SHA-256, andmanually vetted for safety. Table 2 and Table 3 show sum-maries of our dataset. Of the nearly 78 million database

Page 8: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

Feature 1

Feature 2

Feature 3

Feature 4

Feature 5

Feature 6

Feature 7

Feature 8

Feature 9

Feature 10

SELECT COLSFROMSELECT

COLS

FROM

WHERE

TOP

reportevent.*

eventuser.uid

displaytext.text

disptextref.text

step.stepname

reportevent

persons

displaytextdisplaytext

step

300

AND

OR

In

EqualsTo

OR

In

EqualsTo

OR

EqualsTo

EqualsTo

OR

EqualsTo

EqualsTo

EqualsTo

IS NOT NULL

EqualsTo

EqualsToEqualsTo

EqualsTo

EqualsTo

EqualsTo

reportevent.itemid?|?

reportevent.itemid

?

reportevent.itemid

?|?reporteven

t.scope

?

reportevent.scope

?

reportevent.itemid

?

reportevent.scope

?

reportevent.scope

? reportevent.class

?

reportevent.seqtext

reportevent.type

? eventuser.userid

reportevent.newby

displaytext.seq

reportevent.seqtext

displaytext.seq

reportevent.seqtextref

step.caseitem

reportevent.itemid

step.stepid

reportevent.stepid

rewrapped.*

Total occurrence: 17Query occurence: 1

Total occurence: 18Query occurence: 1

Total occurence: 56Query occurence: 2

Total occurence: 56Query occurence: 2

Total occurence: 64Query occurence: 2

Total occurence: 64Query occurence: 2

Total occurence: 140Query occurence: 1

Total occurence: 17Query occurence: 1

Total occurence: 18Query occurence: 1

Total occurence: 18Query occurence: 1

Legend

Figure 5: An example AST of a SQL query where the common features that appears in all queries in the query group arecolored and the uncolored nodes are specific only to that query.

operations captured, 61 million are not directly queries, butrather invocations of stored procedures. The aim of thispaper is to study query clustering specifically, so we ignorethese queries. Of the the 17 million SQL statements in thetrace, we were able to reliably parse 2.8 million. Of these,we base our analysis on the 1.35 million syntactically validSELECT queries.

Our first observation is that of these 1.35 million queries,there are only 1614 different query skeletons. The number ofdistinct structures that Ettu needs to cluster is quite small.Furthermore, this number grows sub-linearly over time. Asshown in Table 4, nearly half of all the skeletons arise in thefirst 1.5 hours of the trace.

Not a query 61,041,543Un-parsable queries 14,138,723Parsable queries 2,818,719Total 77,998,985

Table 2: Breakdown of bank data set.

In the set of parsable queries in Table 2, the distributionof different types of queries is shown in Table 3.

6.1 FeasibilityOur first goal is to understand whether clustering queries

based on structural similarity, and in particular our use ofthe generalized W-L algorithm, produces meaningful results:Is log summarization even feasible in the first place? Inorder to demonstrate the feasibility of log summarization,

SELECT 1,349,861UNION 1,306INSERT 1,173,140UPDATE 288,098DELETE 6,314Total 2,818,719

Table 3: Distribution of query types in bank dataset.

we compare Ettu’s query clustering with an“intuitive”querysimilarity metric applied.

To make a meaningful manual clustering tractable, weselected a sample of 140 query skeletons (about 10%) andgrouped them manually by attempting to infer the query’s“intent” and creating clusters of queries with similar intents.

Example 8. An example of two structurally similar queriesin the set of 140 query skeletons that are grouped together inmanual clustering:

SELECT historytran .*FROM historytran LEFT JOIN feestate AS feestateON feestate.seqhistorytran = historytran.seq

WHERE (historytran.caseid = ’’) ANDisnull(feestate.rechargestate , ’’) IN (’’, ’’)

ORDER BY historytran.txdate DESC ,historytran.txtime DESC

SELECT historytran .*FROM historytran LEFT JOIN feestate AS feestateON feestate.seqhistorytran = historytran.seq

Page 9: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

WHERE (historytran.caseid = ’’) ANDfeestate.reversestate = ’’ ANDisnull(feestate.rechargestate , ’’) NOT IN (’’, ’’)

ORDER BY historytran.txdate DESC ,historytran.txtime DESC

When observing these queries, one can notice that they areexactly the same, except that second query contains WHERE

expression with an additional AND phrase and 3rd term isNOT IN instead of IN . From these two queries, a human in-spector judged that they share similar intent and group theminto one cluster with the reason that they select the same setof columns, show the results in same order, get data fromsame tables, and have similar WHERE condition.

The chance of false positives in a sample is governed bythe birthday paradox; The chance of picking similar, but un-related skeletons for our sample grows with the square of thesample size. To make our sampled results meaningful andincrease the chance of false positives, we generated the 140skeleton sample biased towards skeletons that share commonterms and features (e.g., queries over the same relations).

We used Ettu to cluster the 140 query skeleton sample us-ing hierarchical clustering and manually grouped the sameset of queries. By hand, we identified 23 specific clustersof similar queries. To roughly visualize the correspondencesbetween manual and automated clustering, Figure 6 showsa tanglegram [13] view of the clusters. Ettu’s clusteringselections are shown as a dendrogram on the left of the tan-glegram. The x-coordinate of each branch signifies the dis-tance between the clusters being merged. The further to theright the branch appears, the more similar the clusters beingmerged. Each leaf is one of the 140 query skeletons. Themanual clustering appears on the right of the tanglegram;Group assignments are fixed, and hence there are only twolevels of the “tree”. Lines in the middle link skeletons inEttu’s clustering with the manual clustering.

Ettu

272839403843414217181920212223242526333432373536302931504951525354555658596061636465664514152378613416110111213444546471261304857621251289131135129127768910987118122728211295786912391677768707173747980818386971001061168410793941021051081249092889975111851039612198104101110113114115119117120132133136137138140139

manual grouping

123456789

101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899

100101102103104105106107108109110111112113114115116117118119120121122123124125126128129130127131132133136137138139140134135

Figure 6: Correspondences between hierarchical clusteringobtained from Ettu and from manual grouping.

One measure of degree of correspondences between two

trees is number of crossings of lines connecting two labels.The lower the number is, the higher correspondence betweentwo trees is. However, aligning two trees so that the numberof crossings is minimum is an NP-hard problem [14]. Fig-ure 6 was produced using the Dendextend R package [13],which minimizes the number of crossings but does not guar-antee an optimal alignment. However, even with this figure,we qualitatively observe a large number of parallel lines (anindicator of good alignment between clustering results).

To measure the correspondence more quantitatively, weuse a metric called entanglement [13], also implemented byDendextend. Entanglement score ranges from 0 (no entan-glement) and 1 (full entanglement). To compute an entan-glement score, we first construct a vector for each cluster-ing from the integer positions of each query in the dendro-gram. The entanglement score is computed as the L-normdistance between these two vectors. For the tanglegram inFigure 6, the L-2 entanglement score is 0.17, which indicatesa high degree of correlation between the two clusterings.This demonstrates that Ettu’s distance metrics, based onstructural similarity, do in fact correspond to more intuitivenotions of query similarity.

Surprisingly, a significant number of the alignment errorsdetected above are attributable (in retrospect) to human er-ror rather than to Ettu. After performing this experiment,we manually inspected the misalignments in Figure 6 to bet-ter understand how Ettu was breaking. In doing so, wefound that many of the misalignments were actually mis-takes in the manual grouping rather than the automatedclustering algorithm. The reason for this error was that ourmanual clustering decisions were heavily biased by similari-ties between the first few words of each query: SELECT andthe project targets. Conversely, Ettu’s automated clusteringis able to identify similarity in substructures as well as atthe root of the query.

6.2 ScalabilityOur second goal is to demonstrate the scalability of our

proposed framework. There are two components to thisframework: The initial clustering process which can be per-formed offline, and the interactive FP-tree explorer. Con-cretely, we aim to show three properties of this framework.First, the time taken by the offline component should bereasonable compared to the duration of the log. The systemis not useful if it takes the better part of a day to cluster aday’s worth of queries. Second, the added compute require-ments of additional queries should likewise be reasonable.Finally, the FP-tree explorer should be able to re-render afeature set at interactive speed.

The bank query log starts at about 11 AM. To test scala-bility, we vary the number of queries processed by truncatingthe log after varying lengths; After 1.5 hours, the log con-tains 250 thousand SELECT queries with 756 skeletons, justover 500 thousand queries after 3.3 hours, and so forth. Fullstatistics for each truncated log are shown in Table 4.

Experiments were performed on a commodity laptop com-puter with a 2.2 GHz Intel Core i7 processor and 16 GBRAM memory running OS X 10.11.3. We used the R imple-mentation of hierarchical clustering. The remaining compo-nents were implemented in Java using the single-threadedJDK 1.8.0. Reported running times are the average of 10trials for each phase.

The clustering process is performed in three distinct phases:

Page 10: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

End Time Total Time SELECTs Skeletons

12:34 PM 1.5 hours 250,740 75614:24 PM 3.3 hours 512,981 94616:06 PM 5 hours 789,050 1,05718:27 PM 7.3 hours 972,511 1,10221:23 PM 10.3 hours 1,079,427 1,13000:43 AM 13.6 hours 1,176,582 1,25604:04 AM 17 hours 1,270,984 1,57606:28 AM 19.4 hours 1,349,861 1,614

Table 4: Summary of query log sizes by absolute time.

(1) A preprocessing phase where the full query log isrewritten into a bag of query skeletons, (2) A relabelingphase where the generalized W-L algorithm is applied toeach query skeleton to extract feature vectors, (3) A clus-tering phase that actually groups queries by similarity, andfinally (4) A FP tree construction phase in which eachcluster of query skeletons is summarized using FP tree. Fig-ure 7 shows the running time for phases 1 and 3. Phase 2and phase 4 take almost no time by comparison and is shownindependently in Figure 8.

The running time of phase 1 grows sub-linearly with thelog size. We attribute this to the dominant costs being re-source allocation associated with defining and indexing newquery skeletons, and potentially JIT compilation overheads.The running time of phase 2 has (relatively) high upfrontcosts, predominantly digest(·) needing to index newly al-located feature identifiers, but quickly levels off into a lin-ear scaling with the number of queries. The running timeof phase 3 grows super-linearly with the number of queryskeletons. This is typical for clustering processes in gen-eral, which need to compute and evaluate a full (O(N2))pairwise distance matrix. Fortunately, the number of skele-tons grows sub-linearly with the number of queries, and theoverall scaling by time is roughly linear; We expect that thisgrowth curve would eventually become sub-linear for longertraces. Phase 4 runs quite fast because of relatively smallnumber of query skeletons.

start time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of log

0

5

10

12:00 15:00 18:00 21:00 00:00 03:00 06:00End time of log

runn

ing

time

(min

utes

)

PhaseClusteringPreprocessing

Figure 7: Running time of preprocessing phase and cluster-ing phase with respect to different log size.

With running time breaks down into steps as shown inFigure 7 and Figure 8, we can observe that the bottleneckof our system is in hierarchical clustering in which runningtime grows quadratically fast as the log size grows. How-ever, our experiments, even with single-threaded on com-modity hardware, were able to cluster the trace of parseablequeries in under 15 minutes. We can reasonably expect the

● ● ●●

● ●

●●

● ●●

start time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of logstart time of log

0.2

0.3

0.4

0.5

0.6

12:00 15:00 18:00 21:00 00:00 03:00 06:00End time of log

runn

ing

time

(sec

onds

)

PhaseFP tree constructionRelabeling

Figure 8: Running time of relabeling phase and FP treeconstruction phase with respect to different log size.

clustering process to scale to even larger workloads, espe-cially considering the embarrassingly parallelizable natureof clustering.

7. RELATED WORKThe basic idea behind Ettu is to profile normal user be-

havior, detect suspicious behavior using this information,and distinguish malicious behavior from benign intents [15].Indeed, this idea is not new; there are many anomaly detec-tion systems focusing on suspicious behavior of users. Spe-cific examples include file access [16] and transfers [17], on-line and social behavior [18], activities on a website [19],command-line statements [20] and SQL queries issued to adatabase [21].

SQL queries as a resource: As the basic unit of in-teraction between a database and its users, the sequence ofSQL queries that a user issues effectively models the user’sbehavior.

One approach relies on the syntax of queries [22] and itpermits fast query validation where they focus on detectionof potential intrusions on database systems. In their paper,they introduce a mechanism which analyzes audit logs ofdatabases with both defined user roles and undefined userroles. This system uses multiple techniques to attempt to de-tect the threats depending on the role distinction. They useNaive Bayes Classifier and clustering techniques, k-centersand k-means, in their experiments to build user profiles. Thetechniques were able to produce low false positives in ex-perimental testing, but the false negative rates were highfor both techniques. Their work shows that building userprofiles from database logs has potential for detecting in-trusions, especially in a system with defined roles. Ettu,however, takes a different approach in creating the featurevectors exploiting subtree similarities instead of individualfeature similarities, in order to address the problem when thesyntax is much different but the main intent of the query issimilar.

Another method is to use a data–centric approach, whichperforms better in detecting anomalies [21]. In the paper,a set of techniques are used to detect anomalies on the re-sults of SQL queries rather than the queries themselves. Indoing this, the authors are able to improve both where thesyntax–based methods miss a potentially harmful query andwhen there is a false positive. These errors can occur in the

Page 11: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

syntax–based system when two queries with seemingly sim-ilar structures have vastly different results, as well as whentwo queries appear unrelated but actually return the exactsame set of data. The limitation of this approach is that thedataset should remain static for effective results but still thepotential in the data–centric model is quite promising. How-ever, to evaluate the efficacy of such a system in a realisticway, access to data in a corporate database is required.

This brings us back to the problem that, many organiza-tions do not allow third party softwares to access their sensi-tive data, let alone letting them to be used for research pur-poses [15]. Hence, auditing SQL query logs of the databasesand having a sense of what the query intends to do correctlygains a lot of importance. Our eventual goal is extend oursyntactic visualization strategy to support data–aware visu-alization and query auditing, in this work we focus on thecore components of clustering queries and generating clustervisualizations.

Query visualization: Generating query cluster visual-izations is a complex task considering that even visualizingjust one query accurately to help users understand the intentbehind the query quickly is still a research challenge [23].

QueryViz [24] addresses query interpretation which is theproblem of understanding the goal of the query by visualiz-ing it. Query interpretation is usually as hard as writing anew query as the complexity of the query increases [23]. Ittakes SQL query and the schema of the database; and parsesthe query, builds an AST and creates a graph for users toview it. The aim is to present queries as simple as possiblefor the users to understand them, as it is easier to under-stand the relationships and references in a query when it isgraphically visualized.

Logos [25], on the other hand, is a system that has theability to translate SQL queries into natural language equiv-alents. This technology works by creating graph represen-tations of the query structure. These representations thenuse predefined relationships within the database schema toallow for construction of natural language expressions. De-spite the high overhead of maintaining the relationships forthe database, this technology shows promise in revealing theintent of user queries.

We take a similar approach as QueryViz [24], but we in-clude the common features of all queries in the query repre-sentation, so that the users can see what features are uniqueto the query and what features are shared by all the queriesin the query groups.

Distinguishing queries: There are different approachesto understand the intents behind SQL queries and what aquery returns. Analyzing query intents can help in differentresearch topics like creating indices, benchmark design, andmasquerade detection in databases.

QueryScope [26] starts with the hypothesis that visual-ization of queries would help identifying similar queries anddistinguishing different ones from each other, hence it wouldhelp finding better tuning opportunities. The system pro-vides a user interface to visualize SQL queries. This inter-face presents these queries as graphs in order to make thestructure more understandable. To construct these graphs,the structure of the query is broken down into XML formatprior to visualization. The proposed use of this interfaceis for tuning of database systems, although the breakdownof these queries proposes the potential for detecting simi-larity between different queries. Our approach is similar to

the individual queries but we focus on describing query sets.While they directly focus on similarities of queries by findingthe critical elements of the query using the table relation-ships and table cardinalities, we make use of the subtreesimilarities.

XData [27] is a system which uses a technique that al-lows SQL queries to be tested for mutations to the intendedquery. In doing this, queries that have slight differences andthat look to return the same result as a correct query will bemarked as incorrect if they produce different results. Thistechnology is intended for use in grading student work, al-though other uses are possible. In order to determine thecorrectness of a query, the correct query must have con-straints created to represent possible mutations. The au-thors explain each aspect of a SQL query for which a muta-tion can occur, and for each they use tuples to store the pos-sible mutations. By doing this, all possibly mutation com-binations can be presented for comparison against question-able queries. In testing against student work, this techniquewas able to correctly identify over 95 percent of incorrectmutants presented.

A method for maintaining privacy in databases via audit-ing is presented in [28] which discusses an auditing frame-work that allows for analysts to ensure compliance of certainprivacy guarantees. Audit expressions are used in their pa-per to establish this notion of privacy in the database alongwith the concept of an indispensable tuple. This means thatthe removal of this tuple will affect the result of the queryin question. By using this concept, the authors detail amethodology in which audit logs can be constructed thatwill reveal any privacy violations.

If the data for the target database is available, our pro-posed framework can be augmented by using methods pro-posed in above discussed data–centric approaches [27,28].

8. CONCLUSION AND FUTURE WORKThe focus of this paper is to summarize large query logs

by clustering them according to their similarity with thehypothesis that the similarity in query structure correspondsto the intent of the query.We utilize a revised Weisfeiler–Lehman algorithm to create features out of query logs, formfeature vectors out of them, and cluster the similar queriestogether by their feature vectors with hierarchical clusteringmethod. Finally, we exploit FP Trees to create summaries ofthe clustered query sets. In our experiments, we show that(1) the structural similarity of queries corresponds to theintent of the query by comparing the groupings performedby a human expert and our system, and (2) even with acommodity laptop and single-threaded implementation, theprocess could be performed in under 15 minutes for a set of1.35 million queries.

We plan to extend our work in several directions: First,we will explore new feature weighting strategies and newlabeling rules in order to capture the intent behind loggedqueries better. Second, we will examine user interfaces thatbetter present clusters of queries — Different feature sortingstrategies in an FP Tree in order to help the user distinguishimportant and irrelevant features, for example.

Scenario 2. Jane inspects the query group summariespresented by Ettu and labels these groups as safe, unsafe,and unknown. Ettu repeats the auditing process to elaboratethe groups on unknown clusters, so that they can be labeled

Page 12: Summarizing Large Query Logs in Ettu - arXiv · Summarizing Large Query Logs in Ettu Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan, Varun Chandola, Oliver Kennedy, Shambhu Upadhyaya

as safe and unsafe, too.

Third, further exploration on various kinds of statisticscaptured in FP Tree will help us in determining the qualityof the cluster, weighting of features, and identifying suspi-cious paths. Lastly, we will investigate the effect of tempo-rality on query clustering.

9. ACKNOWLEDGMENTSThis material is based in part upon work supported by

the National Science Foundation under award number CNS- 1409551. Usual disclaimers apply. We also would liketo thank Manish Gupta and Rick Jesse for their efforts inpreparation of the dataset we used in our experiments.

10. REFERENCES[1] CERT Insider Threat Center. 2014 U.S. State of

Cybercrime Survey. July 2014.

[2] Ponemon Institute. 2015 Cost of Cyber Crime Study:Global. October 2015.

[3] Gokhan Kul, Duc Luong, Ting Xie, Patrick Coonan,Varun Chandola, Oliver Kennedy, and ShambhuUpadhyaya. Ettu: Analyzing query intents incorporate databases. In ERMIS, 2016.

[4] Oliver Kennedy, Jerry Ajay, Geoffrey Challen, andLukasz Ziarek. Pocket Data: The need forTPC-MOBILE. In TPC-TC, 2015.

[5] Nicolas Bruno and Surajit Chaudhuri. Automaticphysical database tuning: A relaxation-basedapproach. In ACM SIGMOD, 2005.

[6] Cynthia Dwork. ICALP 2006, Proceedings, Part II,chapter Differential Privacy. Springer, 2006.

[7] Nino Shervashidze, Pascal Schweitzer, Erik Jan VanLeeuwen, Kurt Mehlhorn, and Karsten M. Borgwardt.Weisfeiler-lehman graph kernels. JMLR, 2011.

[8] Fei Li and H. V. Jagadish. Constructing an interactivenatural language interface for relational databases.pVLDB, 2014.

[9] Shan Shan Huang, Todd Jeffrey Green, andBoon Thau Loo. Datalog and emerging applications:an interactive tutorial. In ACM SIGMOD, 2011.

[10] Molham Aref, Balder ten Cate, Todd J. Green, BennyKimelfeld, Dan Olteanu, Emir Pasalic, Todd L.Veldhuizen, and Geoffrey Washburn. Design andimplementation of the logicblox system. In ACMSIGMOD, 2015.

[11] Rui Xu, Donald Wunsch, et al. Survey of clusteringalgorithms. Neural Networks, IEEE Transactions on,16(3):645–678, 2005.

[12] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao.Mining frequent patterns without candidategeneration: A frequent-pattern tree approach. DMKD,2004.

[13] Tal Galili. dendextend: an R package for visualizing,adjusting and comparing trees of hierarchicalclustering. Bioinformatics, 2015.

[14] Kevin Buchin, Maike Buchin, Jaroslaw Byrka, MartinNollenburg, Yoshio Okamoto, Rodrigo I Silveira, andAlexander Wolff. Drawing (complete) binarytanglegrams. In Graph Drawing. Springer, 2008.

[15] Gokhan Kul and Shambhu Upadhyaya. Towards acyber ontology for insider threats in the financialsector. JOWUA, 2015.

[16] Xiaobin Wang, Yongjun Wang, Qiang Liu, YonglinSun, and Peidai Xie. UCAWSN & PDCAT 2015,chapter Insider Detection by Analyzing ProcessBehaviors of File Access. Springer Singapore, 2016.

[17] Andrew Stephen McGough, Budi Arief, Carl Gamble,David Wall, John Brennan, John Fitzgerald, Aad vanMoorsel, Sujeewa Alwis, Georgios Theodoropoulos,and Ed Ruck-Keene. Ben-ware: Identifying anomaloushuman behaviour in heterogeneous systems usingbeneficial intelligent software. JOWUA, 2015.

[18] Gaurang Gavai, Kumar Sricharan, Dave Gunning,John Hanley, Mudita Singhal, and Rob Rolleston.Supervised and unsupervised methods to detectinsider threat from enterprise social and onlineactivity data. JOWUA, 2015.

[19] Eren Manavoglu, Dmitry Pavlov, and C Lee Giles.Probabilistic user behavior models. In ICDM, 2003.

[20] Roy Maxion and Tahlia N Townsend. Masqueradedetection using truncated command lines. In DSN,2002.

[21] Sunu Mathew, Michalis Petropoulos, Hung Q. Ngo,and Shambhu Upadhyaya. A data-centric approach toinsider attack detection in database systems. InRAID, 2010.

[22] Ashish Kamra, Evimaria Terzi, and Elisa Bertino.Detecting anomalous access patterns in relationaldatabases. VLDBJ, 2007.

[23] Wolfgang Gatterbauer. Databases will visualizequeries too. pVLDB, 2011.

[24] Jonathan Danaparamita and Wolfgang Gatterbauer.Queryviz: Helping users understand sql queries andtheir patterns. In EDBT/ICDT, 2011.

[25] Andreas Kokkalis, Panagiotis Vagenas, AlexandrosZervakis, Alkis Simitsis, Georgia Koutrika, and YannisIoannidis. Logos: A system for translating queries intonarratives. In ACM SIGMOD, 2012.

[26] Ling Hu, Kenneth A Ross, Yuan-Chi Chang,Christian A Lang, and Donghui Zhang. Queryscope:visualizing queries for repeatable database tuning.pVLDB, 2008.

[27] Bikash Chandra, Bhupesh Chawda, Biplab Kar, K. V.Reddy, Shetal Shah, and S. Sudarshan. Datageneration for testing and grading sql queries. TheVLDB Journal, 24, 2015.

[28] Rakesh Agrawal, Roberto Bayardo, ChristosFaloutsos, Jerry Kiernan, Ralf Rantzau, andRamakrishnan Srikant. Auditing compliance with ahippocratic database. In VLDB, 2004.