The Analysis of Patterns Nello Cristianini

Post on 27-Dec-2015


Page 1: The Analysis of Patterns Nello Cristianini. The Value of Patterns Patterns are everywhere, and people have always been fascinated by them. Detecting patterns

The Analysis of Patterns
Nello Cristianini

The Value of Patterns

• Patterns are everywhere, and people have always been fascinated by them.

• Detecting patterns confers an advantage to an organism

Temperature and Rainfall in Lake Shasta over 5 years

Patterns Help Us in Many Ways…

e.g., compress, predict, remove errors

Benefits of Detecting Patterns

Patterns and Intelligence

• We care so much about pattern-finding skills that we even use them (partly) to quantify intelligence…

The Instinct for Patterns

• We see patterns everywhere

• Even where there are no patterns:

Patterns and Randomness

• We are poorly equipped to deal with randomness:

• 3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679...

• In the first million digits, Erice’s ZIP code appears 11 times…

• Does it mean anything?
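
A rough back-of-the-envelope check of that question, as a minimal sketch. It assumes the ZIP code is a 5-digit string and that the digits of π behave like independent, uniformly random digits; both are assumptions made for illustration, and the function name is hypothetical.

```python
def expected_occurrences(n_digits: int, pattern_length: int) -> float:
    """Expected number of times a fixed digit string appears in a random digit sequence.

    Assumes each digit is uniform on 0-9 and independent of the others.
    """
    positions = n_digits - pattern_length + 1      # possible starting positions
    probability = 10 ** (-pattern_length)          # chance of a match at one position
    return positions * probability


if __name__ == "__main__":
    # Any fixed 5-digit string in one million random digits: roughly 10 occurrences.
    print(expected_occurrences(1_000_000, 5))
```

Under these assumptions, any 5-digit code is expected to show up about ten times, so seeing it 11 times is what chance alone would produce.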

Patterns and Randomness

• ABRAHAM LINCOLN WAS ELECTED TO CONGRESS IN 1846.

• JOHN F. KENNEDY WAS ELECTED TO CONGRESS IN 1946.

• ABRAHAM LINCOLN WAS ELECTED PRESIDENT IN 1860.

• JOHN F. KENNEDY WAS ELECTED PRESIDENT IN 1960.

• THE NAMES LINCOLN AND KENNEDY EACH CONTAIN SEVEN LETTERS.

• BOTH WIVES LOST CHILDREN WHILE LIVING IN THE WHITE HOUSE.

• BOTH PRESIDENTS WERE SHOT ON FRIDAY.

• BOTH WERE SHOT IN THE HEAD.

• BOTH SUCCESSORS WERE NAMED JOHNSON.

• ANDREW JOHNSON, WHO SUCCEEDED LINCOLN, WAS BORN IN 1808.

• LYNDON JOHNSON, WHO SUCCEEDED KENNEDY, WAS BORN IN 1908.

• JOHN WILKES BOOTH, REPORTEDLY ASSASSINATED LINCOLN.

• LEE HARVEY OSWALD, REPORTEDLY ASSASSINATED KENNEDY.

• BOTH ASSASSINS WERE KNOWN BY THREE NAMES.

• BOTH NAMES CONTAINED FIFTEEN LETTERS.

• BOOTH AND OSWALD WERE ASSASSINATED BEFORE THEIR TRIALS.

Coincidences?

Visualizing Patterns

• We are naturally equipped to detect CERTAIN types of patterns, not others…

5.9400 8.6100 12.2800 11.6100 20.2800 23.8300 25.8300 27.3900
24.1700 19.0600 11.1700 8.8900 8.3300 7.8900 12.1100 15.3900
19.1100 24.6100 28.1100 25.7800 23.1100 16.8400 13.1700 8.4400
5.8900 10.8300 12.1100 15.7800 18.8300 26.5600 27.5600 25.0000
23.4400 15.5600 10.7200 7.1700 7.8300 11.1700 9.7800 14.9400
20.5000 23.3300 27.8300 29.2200 25.1100 20.6700 12.8900 11.8900
9.1700 9.8300 14.2800 18.5000 19.0000 26.3900 29.6100 26.7200
22.6700 20.3900 13.8900 8.8900
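
The regularity hidden in these numbers can be exposed with a simple computation. The sketch below assumes the 60 values are consecutive monthly readings over five years (which the earlier Lake Shasta slide suggests, but the slide itself does not state) and measures how strongly the series resembles a shifted copy of itself. The function name is illustrative.

```python
# The 60 values from the slide, in reading order.
values = [
    5.94, 8.61, 12.28, 11.61, 20.28, 23.83, 25.83, 27.39,
    24.17, 19.06, 11.17, 8.89, 8.33, 7.89, 12.11, 15.39,
    19.11, 24.61, 28.11, 25.78, 23.11, 16.84, 13.17, 8.44,
    5.89, 10.83, 12.11, 15.78, 18.83, 26.56, 27.56, 25.00,
    23.44, 15.56, 10.72, 7.17, 7.83, 11.17, 9.78, 14.94,
    20.50, 23.33, 27.83, 29.22, 25.11, 20.67, 12.89, 11.89,
    9.17, 9.83, 14.28, 18.50, 19.00, 26.39, 29.61, 26.72,
    22.67, 20.39, 13.89, 8.89,
]


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given lag (between -1 and 1)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return covariance / variance


if __name__ == "__main__":
    # A yearly cycle shows up as a positive value at lag 12 and a negative one at lag 6.
    for lag in (6, 12):
        print(lag, round(autocorrelation(values, lag), 2))
```

A plot would show the same thing at a glance; the point is that the cycle is hard to see in the raw table but easy to find once the right representation is chosen.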

Finding Patterns

• We are naturally interested in finding relations in data.

• We are naturally ill-equipped to deal with randomness.

• We have developed sophisticated technology to do this for us.

• In the last decade we have taken one more step…

• As a society we rely on pattern discovery technology, in many ways

Computational Pattern Finding

• We want to find relations

• They need to be reliable

• They need to be explored efficiently

• We want to do it automatically

• On MASSIVE amounts of data

The Analysis of Patterns

• Data-driven approach to
– Science
– Business
– Technology

• Modern society relies on our capability to automatically detect reliable patterns in vast sets of data

The Analysis of Patterns

• Science:
– The Genome Project
– Surveys of the Universe

• Business:
– Amazon automatically exploiting trends and relations in transactions database
– Fraud Detection in Credit Card Companies

• Technology:
– Voice recognition
– Handwriting recognition

A Scientific Gold Rush

   1 GATCACAGGT CTATCACCCT ATTAACCACT CACGGGAGCT CTCCATGCAT TTGGTATTTT
  61 CGTCTGGGGG GTGTGCACGC GATAGCATTG CGAGACGCTG GAGCCGGAGC ACCCTATGTC
 121 GCAGTATCTG TCTTTGATTC CTGCCTCATT CTATTATTTA TCGCACCTAC GTTCAATATT
 181 ACAGGCGAAC ATACCTACTA AAGTGTGTTA ATTAATTAAT GCTTGTAGGA CATAATAATA
 241 ACAATTGAAT GTCTGCACAG CCGCTTTCCA CACAGACATC ATAACAAAAA ATTTCCACCA
 301 AACCCCCCCC TCCCCCCGCT TCTGGCCACA GCACTTAAAC ACATCTCTGC CAAACCCCAA
 361 AAACAAAGAA CCCTAACACC AGCCTAACCA GATTTCAAAT TTTATCTTTA GGCGGTATGC
 421 ACTTTTAACA GTCACCCCCC AACTAACACA TTATTTTCCC CTCCCACTCC CATACTACTA
 481 ATCTCATCAA TACAACCCCC GCCCATCCTA CCCAGCACAC ACACACCGCT GCTAACCCCA
 541 TACCCCGAAC CAACCAAACC CCAAAGACAC CCCCCACAGT TTATGTAGCT TACCTCCTCA
 601 AAGCAATACA CTGAAAATGT TTAGACGGGC TCACATCACC CCATAAACAA ATAGGTTTGG
 661 TCCTAGCCTT TCTATTAGCT CTTAGTAAGA TTACACATGC AAGCATCCCC GTTCCAGTGA
 721 GTTCACCCTC TAAATCACCA CGATCAAAAG GGACAAGCAT CAAGCACGCA GCAATGCAGC
 781 TCAAAACGCT TAGCCTAGCC ACACCCCCAC GGGAAACAGC AGTGATTAAC CTTTAGCAAT
 841 AAACGAAAGT TTAACTAAGC TATACTAACC CCAGGGTTGG TCAATTTCGT GCCAGCCACC
 901 GCGGTCACAC GATTAACCCA AGTCAATAGA AGCCGGCGTA AAGAGTGTTT TAGATCACCC
 961 CCTCCCCAAT AAAGCTAAAA CTCACCTGAG TTGTAAAAAA CTCCAGTTGA CACAAAATAG
1021 ACTACGAAAG TGGCTTTAAC ATATCTGAAC ACACAATAGC TAAGACCCAA ACTGGGATTA
1081 GATACCCCAC TATGCTTAGC CCTAAACCTC AACAGTTAAA TCAACAAAAC TGCTCGCCAG
1141 AACACTACGA GCCACAGCTT AAAACTCAAA GGACCTGGCG GTGCTTCATA TCCCTCTAGA

Yeast protein interaction map (Barabasi)

Another Gold Rush

• The World Wide Web contains billions of pages, with text, images, data…

• Semantic web, XML-based, provides high quality annotated information…

• Soon all books ever written will be in digital form

• Are we ready?

[Figure: panels labeled 2001, 2004, 2005]

The Analysis of Patterns

• Traditionally, the role of analyzing data belongs to Statistics.

• Or does it?

• Data analysis is performed by physicists, biologists, engineers… each with their own set of tools.

• Even the task of making or validating these tools is not just part of statistics.

The Analysis of Patterns

• Signal processing

• Data mining

• Information retrieval

• Pattern recognition (*)

• Pattern matching

• Machine Learning

• …

The Analysis of Patterns

• Pattern Recognition
– Syntactical / Structural
– Statistical
– Visual

• Pattern Discovery vs Pattern Matching
– In sequences
– In graphs
– In images

The Analysis of Patterns

• Grammatical Inference

• Mining for Association Rules

• Patterns in Vector Data (classical multivariate statistics; neural networks; machine learning; etc.)

• Etc, Etc

The Analysis of Patterns

• Many communities working almost independently

• Occasionally re-discovering the same things

• A small and fairly stable set of ideas
– Efficient search for patterns in data
– Statistical validation issues
– Pattern visualization

• Often same tools and concepts

Searching for Patterns

• The search problem can be framed within Operational Research / Optimization.
– (e.g., Integer Programming, Convex Programming, etc…)

• Many key ideas from exact optimization have revolutionized this field in recent years

• Where exact solutions are theoretically impractical (and only then!) we can use approximations, then heuristic approaches.

• Again: same heuristics appear in many fields

Statistics

• How do we know that a relation found in a finite set of data is reliable, or significant, or even interesting?

• Many issues of hypothesis testing

• Classical statistics vs statistical learning theory

What Are Patterns?

• This is a rather difficult question to answer. I hope we will have an answer by the end of this meeting.

• I encourage all speakers and participants to suggest some definitions.

Gregory Chaitin: "Patterns, Randomness and Information"

• Information, Complexity, Patterns, Randomness and Compression.

• What are regularities in data? How can they be defined? And quantified?

• Predictability and Compressibility are connected.

• Randomness can be defined in algorithmic ways.
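
The link between regularity and compressibility can be illustrated with an off-the-shelf compressor. This is only a sketch: a general-purpose compressor such as zlib gives a crude upper bound on how short a description can be, not the algorithmic complexity the talk refers to, and the pseudo-random bytes here come from a seeded generator chosen purely for reproducibility.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means more regular)."""
    return len(zlib.compress(data, 9)) / len(data)


# A highly patterned sequence: the same two bytes repeated 5000 times.
patterned = b"AB" * 5000

# A sequence with no regularity that zlib can exploit (seeded for reproducibility).
rng = random.Random(0)
random_like = bytes(rng.randrange(256) for _ in range(10_000))

if __name__ == "__main__":
    print("patterned:  ", round(compression_ratio(patterned), 3))
    print("random-like:", round(compression_ratio(random_like), 3))
```

The patterned sequence shrinks to a tiny fraction of its length, while the random-like one does not shrink at all.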

Gregory Chaitin

• Chaitin will explain what it means that a sequence “has no pattern”, and some far-reaching consequences

• These ideas can be traced back through Hermann Weyl to Leibniz in 1686

• They connect with Gödel and Turing

• They bear on the question of how math compares and contrasts with physics and with biology

Patterns in Sets of Points (Vectors)

• Probably the most developed part of pattern analysis

• Includes much multivariate stats, much statistical pattern recognition (e.g., Duda and Hart) and machine learning

Tijl De Bie: Patterns in Sets of Points

• Patterns in sets of points: an overview
– The role of optimization
– Examples of patterns
– Dimensionality reduction, classification, clustering
– Emphasis on linear patterns (connects to the later kernels talk)

• Patterns in sets of points: the myriad virtues of eigenproblems
– The eigenvalue problem
– Principal component analysis, canonical correlation analysis, Fisher's discriminant, partial least squares, and spectral clustering
– More from this area will be covered in the Kernel Methods talk
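
As an illustration of how a linear pattern reduces to an eigenproblem, here is a minimal sketch of principal component analysis: the first principal direction is the leading eigenvector of the data's covariance matrix. Power iteration is used here only to keep the example dependency-free; it is not a method named in the slides, and the function names are illustrative.

```python
def covariance_matrix(points):
    """Sample covariance matrix of a list of equal-length numeric tuples."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenvector(matrix, iterations=200):
    """Leading eigenpair of a symmetric matrix by power iteration.

    Starts from the all-ones vector, so it assumes that vector is not
    orthogonal to the leading eigenvector.
    """
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * mv[i] for i in range(d))   # Rayleigh quotient
    return eigenvalue, v


if __name__ == "__main__":
    # Points lying exactly on the line y = 2x: the pattern is the direction (1, 2).
    points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    eigenvalue, direction = leading_eigenvector(covariance_matrix(points))
    print(eigenvalue, direction)
```

For the toy data, the recovered direction is proportional to (1, 2), which is exactly the linear relation the points satisfy.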

Patterns in Sequences

• After vectors, probably the most important type of data:
– DNA
– Text (web)

• How to find patterns within and among sequences?

• What data structures? What statistical models?

Esko Ukkonen: Suffix Tree and Hidden Markov techniques for pattern analysis

• Efficient Pattern Discovery in sequences requires appropriate data structures
– Suffix tree construction
– Linear-time array constructions
– Using suffix trees for finding motifs with gaps
– Finding cis-regulatory motifs by comparative genomics
– Hidden Markov techniques for haplotyping
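
To make the role of the data structure concrete, here is a minimal sketch of a suffix array with binary search for exact pattern matching. It is a deliberately naive construction (sorting all suffixes), not the linear-time algorithms the lectures cover, and the function names are illustrative.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting suffixes; fine for short strings only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """All start positions of pattern in text, found by binary search."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    dna = "GATCACAGGTCTATCACCCT"   # first 20 bases of the sequence shown earlier
    sa = build_suffix_array(dna)
    print(find_occurrences(dna, sa, "ATCAC"))
```

Once the array is built, each query costs a logarithmic number of string comparisons, which is why such indexes pay off when many patterns are searched in the same sequence.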

Dan Gusfield: Trees, Arrays, Networks and Optimization for Finding Patterns in Biological Sequences

a) The use of suffix trees and integer programming for finding optimal virus signatures.

b) A current treatment of suffix-arrays and their uses.

c) Algorithms for finding signatures (patterns) of historical recombination and gene-conversion in SNP (binary) sequences.

Raffaele Giancarlo: Patterns and Compression

• Patterns are not just necessary for prediction, they are also needed for data compression.

• Many relations between pattern analysis (PA) and data compression.

• Raffaele Giancarlo (University of Palermo) - On Indexing and Compression: Two Sides of the Same Coin

Conceptual Foundations

• Alberto Apostolico (University of Padova and Georgia Tech) - "Algorithmic and Combinatorial Foundations of Pattern Discovery"

• Will discuss various aspects of the interplay between algorithmics and statistics, as well as the notion itself of pattern.

Kernel Methods

• An idea: if we are so good at finding (linear) patterns in sets of points…

• Why not transform all other problems into a problem about points?

• Good idea…

• Kernel methods (from machine learning) can do this “automatically” …

Bernhard Schoelkopf: Kernel Methods

• Kernel methods: combine ideas from statistics and optimization

• State-of-the-art machine learning systems

• Operate on general types of data

• Work by embedding data into a Euclidean space

• The structure of the space is determined by choosing a special “kernel” function…

• Kernel methods connect various aspects of pattern analysis…
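
A small worked example of what "embedding by choosing a kernel" means, as a sketch. The degree-2 polynomial kernel on 2-D points is used only as a familiar illustration (the slides do not single out any particular kernel): evaluating the kernel gives the same number as an ordinary inner product after an explicit embedding into a 3-D Euclidean space.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel on 2-D points: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit embedding whose inner product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Ordinary Euclidean inner product."""
    return sum(p * q for p, q in zip(a, b))


if __name__ == "__main__":
    x, z = (1.0, 2.0), (3.0, -1.0)
    print(poly2_kernel(x, z))                     # kernel evaluated directly
    print(dot(feature_map(x), feature_map(z)))    # same value via the embedding
```

A linear method run on the embedded points therefore finds patterns that are quadratic in the original coordinates, without the embedding ever being computed explicitly.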

Patterns in Sets

• The most classic textbook example of data mining: you shop at the supermarket, and the market-basket contents are recorded by the computer system at check-out…

• Discover when some items are associated, when it is possible to predict your next purchase, etc…

• (this is what Amazon does automatically…)

Heikki Mannila: Finding frequent patterns

• Part I: Finding frequent patterns from data
– Discovery of frequent patterns = finding positive conjunctions that are true for a given fraction of the observations
– This basic idea can be instantiated in many ways: finding frequent sets from 0/1 data (association mining); finding frequent episodes in sequences; finding frequent subgraphs in graphs; etc.
– Efficient algorithms exist: the levelwise approach
– Theoretical analysis of the algorithms is not trivial (leads to connections to hypergraph transversals etc.)

• Part II: How can the patterns be used?
– Sometimes interesting in themselves
– Can be used to approximate the joint distribution
– Maximum entropy approaches
– Combining information from several patterns - ordering patterns
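
The levelwise idea from Part I can be sketched for frequent itemsets as follows. This is a minimal Apriori-style illustration, not the lecturer's code: candidates of size k+1 are built only from frequent sets of size k, because a set can be frequent only if all of its subsets are. The function name and the toy baskets are made up for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Map each frequent itemset (a frozenset) to its support, level by level."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(1 for b in baskets if itemset <= b) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: single items.
    level = keep_frequent({frozenset([item]) for b in baskets for item in b})
    result = {}
    k = 1
    while level:
        result.update(level)
        # Join step: unions of two frequent k-sets that have exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = keep_frequent(candidates)
        k += 1
    return result


if __name__ == "__main__":
    baskets = [{"bread", "milk"}, {"bread", "butter"},
               {"bread", "milk", "butter"}, {"milk"}]
    for itemset, s in sorted(frequent_itemsets(baskets, 0.5).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), s)
```

The pruning step is what keeps the search tractable: the three-item set is never counted against the data because one of its pairs is already known to be infrequent.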

When can we trust the patterns we found?

• Statistical issues:
– Patterns can be the result of chance
– Multiple testing increases this risk
– Small samples, interest in weak patterns, etc… are other factors

• Statistical learning theory and Classical statistics have developed tools to deal with this

• These criteria can also guide the search algorithms towards more reliable patterns…
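
The multiple-testing risk can be quantified with a short calculation. The sketch below assumes the tests are independent and uses the Bonferroni correction as one standard remedy; neither assumption nor remedy is specified in the slides, and the function names are illustrative.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive among independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    alpha, m = 0.05, 100
    # Testing 100 candidate patterns at 5% each almost guarantees a spurious "discovery".
    print(family_wise_error_rate(alpha, m))
    # With the corrected per-test level, the overall risk stays below 5%.
    print(family_wise_error_rate(bonferroni_threshold(alpha, m), m))
```

The price of the correction is power: weak but real patterns become harder to detect, which is the trade-off the bullets above point to.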

John Shawe-Taylor: Statistical Aspects of Pattern Analysis

• We want significant / reliable patterns

• Reliable: give us predictive power

• Significant: cannot be explained by chance

• Factors affecting pattern reliability:
– Pattern magnitude (how strong is the relation?)
– Sample size (how large is the support from data?)
– Multiple testing (how many other patterns have been tested at the same time?)

• This translates into classical machine learning and statistics themes

Nicolò Cesa-Bianchi: On-line linear learning algorithms

• Machine learning has various ways to model the pattern discovery process.

• An approach completely different from classical statistics: on-line learning.
– Prediction with expert advice
– Learning with linear experts
– The Perceptron algorithm and its extensions
– On-line learning with kernels
– Mistake bounds
– From mistake bounds to risk bounds
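
A minimal sketch of the on-line setting, using the classical Perceptron without a bias term: examples arrive one at a time, the learner predicts, and it updates only when it makes a mistake. Counting those mistakes is the quantity that mistake bounds control. The function name and the toy stream are illustrative assumptions.

```python
def perceptron_online(stream, dimension):
    """Run the Perceptron over a stream of (x, y) pairs with y in {-1, +1}.

    Returns the final weight vector and the number of prediction mistakes.
    """
    weights = [0.0] * dimension
    mistakes = 0
    for x, y in stream:
        score = sum(w * xi for w, xi in zip(weights, x))
        prediction = 1 if score > 0 else -1
        if prediction != y:
            # Mistake-driven update: move the weights toward the correct side.
            mistakes += 1
            weights = [w + y * xi for w, xi in zip(weights, x)]
    return weights, mistakes


if __name__ == "__main__":
    examples = [((1.0, 1.0), 1), ((2.0, 2.0), 1),
                ((-1.0, -1.0), -1), ((-2.0, -1.0), -1)]
    # Present the same linearly separable examples twice, as one stream.
    print(perceptron_online(examples * 2, dimension=2))
```

No distributional assumption is used anywhere in the procedure, which is what sets this analysis apart from the classical statistical one.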

Grammatical Inference

• Very classical theme in pattern recognition, based on Chomsky’s theory of formal languages and grammars:

• Given a finite sample from a language, infer the grammar that generates it (with various constraints).

• A child’s game…

Colin de la Higuera: "Grammatical Inference, a Tutorial"

• The lectures will introduce the key ideas of grammatical inference and concentrate especially on the algorithmic aspects.

• Some algorithms that will be described are:
– The "State merging" family: Gold, RPNI, EDSM...
– The "Window" languages: Local and k-testable
– Learning with queries

• This class of approaches often goes under the name “Syntactical Pattern Recognition”
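
As a taste of the "window" languages, here is a minimal sketch of learning a k-testable language from positive examples. It follows one common formulation (record the allowed prefixes, suffixes and length-k substrings seen in the sample, plus any strings shorter than k); details vary between presentations, and the names used are illustrative rather than taken from the tutorial.

```python
def learn_k_testable(sample, k):
    """Infer a k-testable description from a list of positive example strings."""
    model = {"k": k, "prefixes": set(), "suffixes": set(),
             "factors": set(), "short": set()}
    for word in sample:
        if len(word) < k:
            model["short"].add(word)          # too short to contain a full window
            continue
        model["prefixes"].add(word[:k - 1])
        model["suffixes"].add(word[len(word) - (k - 1):])
        for i in range(len(word) - k + 1):
            model["factors"].add(word[i:i + k])   # every window of length k
    return model


def accepts(model, word):
    """True if word uses only prefixes, suffixes and windows seen in the sample."""
    k = model["k"]
    if len(word) < k:
        return word in model["short"]
    return (word[:k - 1] in model["prefixes"]
            and word[len(word) - (k - 1):] in model["suffixes"]
            and all(word[i:i + k] in model["factors"]
                    for i in range(len(word) - k + 1)))


if __name__ == "__main__":
    model = learn_k_testable(["ab", "aab", "aaab"], k=2)
    for candidate in ["aaaab", "ba", "abb"]:
        print(candidate, accepts(model, candidate))
```

The learner generalizes from three examples to every string of the form a…ab, while still rejecting strings that contain a window it has never seen.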

Patterns in Graphs

• Edwin Hancock (University of York, UK) - “Pattern Analysis with Graphs and Trees”
– Spectral representations of graphs
– Pattern spaces from graph spectra
– Spectral approaches to matching
– Heat kernel methods
– Probabilistic and spectral methods for graph matching and clustering
– Applications in computer vision
