1 data mining chapter 9 moving on: applications and beyond kirk scott

1

Data MiningChapter 9

Moving on: Applications and Beyond

Kirk Scott

2

• So-called machine learning is a broad topic with many ramifications

• Data mining is just an applied subset of this overall field

• The book says the algorithms aren’t “abstruse or complicated” but they’re also not “completely obvious and trivial”

3

• The book identifies the challenge of the future as lying in the realm of applications

• In this sense, data mining has something in common with database management systems

• For some people the interesting part is figuring out how to apply the techniques to a given problem

4

• The book notes that the source of these applications are people working in the problem domains

• People specializing in data mining will continue to develop new algorithms

• But this doesn’t happen in a vacuum• Much of the real, interesting work will

come out of applications

5

9.1 Applying Data Mining

• The book lists the “Top 10” data mining algorithms

• These are given in Table 9.1, shown on the following overhead

• Recall that number 1, C4.5, was for decision tree induction

• Notice also that the majority of these algorithms are for classification

7

Progress in Data Mining

• There is a pitfall of applying algorithms to data sets, comparing results, and drawing broad conclusions about what is best in certain problem domains

• Reasoning from the specific to the general, without further information, is not necessarily correct

8

• Even statistically significant differences in outcomes may not be important in practice

• Quite often, simple methods get reasonably good results

• Complicated methods have their own shortcomings, including computational cost

9

• Something to always keep in mind:• There may just be a lot of noise in the data• Or, there may just be a lot of statistical

variation• There ARE limits on the ability to draw

inferences from data

10

• Also, training sets, by definition, are historical

• They can’t perfectly reflect new data in a changing world

11

• Another point to consider:• Recall that some classification schemes

give probabilities that an instance falls into a class

• However, in reality, classification categories might not be mutually exclusive

• There may be data points in the training set which are partially one and partially another

12

• However, the training set is considered to have instances that are rigidly classified as one or the other

• The training set doesn’t reflect probability• Thus, the training set which you are

basing your inferences on already contains inaccuracies

• You might think of this as conceptual noise resulting from forcing an instance into one class

13

• The book admits that tweaking (picking parameters) can affect performance

• As a result, small empirical differences in data mining results do not necessarily reflect actual differences in the quality of the algorithms

• In application, one might have been more successfully tweaked than another

14

• Another interesting point on the Occam’s Razor/Epicurus divide:

• Complicated methods may be harder to criticize than simple ones

• That fact alone doesn’t make them better• The authors still favor “All else being

equal, simpler is better”.

15

9.2 Learning from Massive Datasets

• Basic constraints: Computational space and time

• Data stream methods are unaffected by this (more later)

• In other types of algorithms implementation techniques like hashing, caching, indexing and other data structures may be critical to practicality

16

• Massive data sets typically imply large numbers of instances

• Any algorithm with time complexity > linear will eventually be swamped by massive data

• Depending on the algorithm, too many attributes may also render it impractical because the computational complexity is in the dimension of the problem space

17

• General ways to adapt to large data sets:• Train on a sample or subset of the data

only• Do a parallel implementation of the

algorithm in a multi-processor environment• Invent new algorithms…

18

• Training on samples or subsets can give you as good a result as training on the whole set

• The law of diminishing returns says that after a certain point, more instances don’t give significant increases in accuracy

19

Training on Samples

• There are two ways of looking at this:• If a problem is simple, a small data set

may encapsulate all there is to know about it

• If the problem is complex but the data mining algorithm is simple, the algorithm may max out on its predictive power no matter how many training instances there are

20

Parallelization

• Algorithms like nearest neighbor, tree formation, etc. can be parallelized

• Not only do you have to figure out how to parallelize

• Parallelization is no defense against combinatorial explosion

• If the complexity is exponential but the growth in the number of processors is linear, you eventually lose

21

New Algorithms

• This is where research comes in• In general, the sky is the limit• In some situations (tree building, for

example) there is a provable floor on complexity

• Even here, new methods may be simpler and still have the characteristic of approximating the solutions given by deterministic methods

22

• A side note on this:• Virtually everything we’ve looked at has

been a heuristic anyway—including greedy tree formation

• Exhaustive search would give genuinely optimal results

• Everything else is essentially an approximation approach

23

• Another aspect to improving algorithm performance:

• A lot of the data in a set, both instances and attributes, may be redundant

• Simply finding ways to throw out useless data may improve performance

• This idea will recur in the next section

24

9.3 Data Stream Learning

• For data streams, the overriding assumption for algorithms is that each instance will be examined at most one time

• The model of the data is updated incrementally based on the incoming instance

• Then the instance is discarded

25

• An example of an application area is sensor readings

• As long as the sensor is active, the readings just keep on coming

• It seems that things like Web transactions might be another example

26

• Both time and space are issues• Discarding instances saves space• The examination of each instance also has

to be fixed in time, with an average rate of examination no less than the arrival rate of instances

• This constraint rules out major changes or reorganization of the model

27

• Modifications to the model resulting from an instance either have to be counted in the examination time

• Or they have to occur infrequently enough that they are averaged out over the examination time of multiple instances

28

• You may come up with ways of throwing out unneeded data

• But the goal is not to throw out data simply because you couldn’t handle it fast enough

• (Although stay tuned for a later comment on throwing out data)

29

Algorithms That Are Directly Suited to Data Streams

• 1R• Naïve Bayes• Perceptrons• Multi-layer neural networks• Rules with exceptions (although you can’t

simply accumulate unlimited exceptions)

30

• The book notes that other kinds of algorithms can be adapted to data streams

• It spends some time explaining how this might be done with trees

• The details are obscure and not important at this late stage in the semester

31

• The key insight about throwing out instances is this:

• Will you lose important data if you throw out instances?

• In an unending data stream, if information or a pattern is significant, it will recur

• So in the long run, throwing out an instance doesn’t hurt the model that results from the algorithm

32

9.4 Incorporating Domain Knowledge

• The overall topic here is metadata—data about the data

• How to put this to use is an open area of research

• There can be various kinds of relationships between attributes

• They include semantic, causal, and functional

33

• Semantic relationships can be summarized in this way:

• If one attribute is included in a rule, another should also be included

• Informally, in the problem domain, this means that these two attributes aren’t (fully) meaningful without the other

• Somehow this could be included as a condition in a data mining scheme

34

• The idea of causality also comes from the problem domain

• The point is that if causality exists, the data mining scheme should be able to detect it

• This causality may go through multiple attributes

• ABC…

35

• Functional dependency in data mining refers to the same concept as in db design

• The point with respect to metadata is that if the functional dependency is already know, it’s not productive to have data mining “discover” it

36

• On the one hand, there may be ways of applying data mining to normalization

• On the other hand, if that’s not your purpose, functional dependencies that are mined will tend to have high confidence and most likely high support

• These associations will end up outweighing other, new associations that an algorithm might mine

37

• How should metadata be represented?• A straightforward approach is to list what

you already know about the data set using rules

• Logical deduction schemes can produce other rules resulting from the ones you already know

38

• Data mining scheme works with instances to produce other new rules that you didn’t know before

• The bodies of rules, merged together, give the sum total of knowledge gleaned about the problem

39

9.5 Text Mining

• Compare text mining to standard data mining

• Data sets roughly parallel database tables• There are identifiable instances and well-

defined attributes• Text is emphatically not structured in this

way

40

• An interesting comparison:• Data mining is supposed to find

information about data where that information was not known

• By definition, text is different• In text, the information is out in the open,

in the form language• It is simply not in a form suitable for easy

computerized analysis

41

• Data mining can be said to have as its goal the acquisition of “actionable” information

• Based on a training set you can classify or cluster future instances, for example

• In a derivative way, you can make decisions that earn make money, etc.

• Another goal of data mining is to develop a data model

• Again, this is out in the open with text

42

• There are several applications of text mining:

• Text summarization, document classification, and clustering

• Language identification and authorship ascription

• Assigning key descriptive phrases to documents

• Metadata, entity, and information extraction

43

• Document classification can be done based on the (count of) occurrences of words in the document

• There is a feature extraction aspect to this problem

• Frequent words don’t help classify• Infrequent words don’t help classify• There is still an overwhelming number of

words in the middle that have value

44

• More complex methods step up from counting words alone

• Context, word order, grammatical constructs, etc. all affect meaning

• At the very least phrases might be mined instead of words

• Natural language processing, syntax, and semantics may come into play

45

• Document classification may be done with predefined classes

• Document clustering doesn’t have predefined classes

• Mining techniques can be used to identify the language of a document

• n-grams, n-letter sequences correlate highly with different languages

• n = 3 is usually sufficient for this

46

• Authorship ascription is done by counting common (stylistic) words, not the content words which define classifications

• A more complex approach would again do more than just count words

47

• Assignment of key phrases to a document corresponds to the problem of assigning subject headings in the library catalog

• You start with established sets of phrases with defined meanings

• The goal is to assign one or more of these phrases to the document

48

• Metadata extraction is a related idea with further ramifications

• Is it possible to find specific information like author and title, automatically?

• Can you extract useful identifying key words and phrases?

• Note that the ability to do this may result in “actionable” information

49

• The next step is entity extraction• Not only do you want to extract more

obvious things like the author and title:• You want to identify any things that are

mentioned in the document

50

• How do you identify entities?• You can look them up in reference

resources like dictionaries, lists of names, etc.

• You may rely on simply things like capitalization or titles of address, etc.

• You may search for regular expressions or use simple grammars for expressions

51

• Information extraction refers to a situation similar to reading information from a form

• Certain documents may be limited in their scope and expected to contain a given set of data items

• If the informational items can be extracted, it may then be possible to infer rules or relationships involving them

52

• Text mining is complex• There are many potential applications• Ultimately, the algorithms fall into the

realm of natural language processing• The current state of the art of

computerized text processing is at the level where it is hard to match human abilities and understanding

53

9.6 Web Mining

• This is like a specialized area of text mining

• The basic content of pages is predominantly text

• There are two key differences:• Pages have internal markup, which

defines structure• Pages have links in and out, which help

classify by content and value

54

• In fact, large amounts of Web content are tabular, and presumably suited to standard data mining

• However, the tabular nature is only reflected in the html presentation

• A data mining technique called wrapper induction can be used to try and infer the tabular relationships based on formatting

55

• Wrapper induction can be extended:• To accommodate changes in formatting• To recognized when formats that differ

superficially really indicate the same structure

56

• Page rank is an important concept in Web mining

• It is only tangentially related to standard data mining

• Page rank is a measure of how “good” a match a page is to a certain search request based on links in and out of pages

57

• These are the general principles for page ranking:

• Many links to page x suggest that x is good

• A link from y to x is a good indicator if there are few links out of y

• A link from y to x is a good indicator if y itself is rated highly

58

• Having the ranks of pages depend on the ranks of other pages is somewhat circular

• However, iterative algorithms can run through networks of links converging on solutions

• The stopping condition is when the change of ranking values between iterations falls below a certain threshold

59

• Some relevant factoids:• The authors guess that for ranks

normalized between 0 and 1, the threshold for the stopping conditions is in the range 10-9-10-12

• An early experiment when the Web was simpler converged after ~50 iterations

60

• Rumor has it that Google’s page ranking program(s) cycle through the entire Web in a matter of days

• This process is repeated every few weeks

61

• There is a practical problem with this approach

• There may be Web pages with no in links or no out links

• These are known as page rank sinks• If the algorithm is caught in a sink it won’t

terminate or converge

62

• The algorithms are given probabilistic parameters

• In searching the web there is a small probability of “teleporting”, jumping, to another random page

• The parameter is tunable• Its size affects the speed and accuracy of

convergence

63

9.7 Adversarial Situations

• Spam filtering is a classic example of this• You can write algorithms to identify spam• The point is that spam senders will track

the algorithms and try to outwit them• This is a see-saw problem

64

• Similar kinds of problems:• Trying to trick page ranking algorithms• Trying to confound computer security

programs that detect intrusions and breaches based on transaction patterns

• There are many applications of data mining which, while not necessarily adversarial exactly, are “security” rather than money based

65

• Data mining may be used to detect fraud, money laundering, other criminal activity

• It is used to flag airline passengers for additional screening

• The government uses it to detect terrorists or terror-related activity

• This can be based on any obtainable record, whether computer, cell phone or other, which can be mined

66

• There are serious ethical, legal, privacy, constitutional, and civil liberties questions about these practices

• Data mining can also be applied in a collaborative/adversarial way

67

• The authors cite robo-soccer• Independent agents “learn” to act

independently and in concert with other agents to win a game or achieve some goal against other collections of agents

• The book also cites a project that was designed to determine the author of a document where the styled had presumably been intentionally altered

68

9.8 Ubiquitous Data Mining

• Ubiquitous data mining stems from the concept of ubiquitous computing

• More factoids:• The Web contains 10-20 billion documents• This is ~50 Terabytes• Suppose you combined the content of the

Web with the contents or actions of every device with an electronic brain, cell phone, iPOD, etc., and data mined across it all

69

• In a more limited sphere the authors cite a specific example:

• There are programs which follow a user’s actions with the goal of “learning” what tasks a user performs when using an operating system, in essence learning how to customize the O/S’s actions to the user’s desires

70

• The authors state that data mining will be ubiquitous in the future

• Doubtless this is true• The questions remaining:• Who wins, who makes money:• Will this make us more masters or (willing)

slaves of technology?• Do you want to play the game?• Will you have a choice?

71

The End

1 data mining chapter 9 moving on: applications and beyond kirk scott

Documents

data mining algorithms

data mining results

data sets

new data

ramifications data mining

data mining chapter

data points

kirk scott slide