graphsandpaths: findingconnectionsand … › ~ptw › inaugural.pdflinked data linked railway data...
TRANSCRIPT
Graphs and paths: Finding connections andpatterns in data
Peter Wood
Department of Computer Science and Information Systems
Birkbeck, University of London
Graphs and paths: Finding connections and patterns in data 1
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 2
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 3
Introduction
What is a graph and what are they used for?
What is a query language?
What are some of the problems we study?
Graphs and paths: Finding connections and patterns in data 4
What is a graph?
A B C
D
A graph is an abstract representation of some real-world data,
where the data consists of connections between entities:
the entities are denoted by nodes
the connections are denoted by edges
Graphs have been studied since the 18th century
Graphs and paths: Finding connections and patterns in data 5
What is a graph?
A B C
D
A graph is an abstract representation of some real-world data,
where the data consists of connections between entities:
the entities are denoted by nodes
the connections are denoted by edges
Graphs have been studied since the 18th century
Graphs and paths: Finding connections and patterns in data 6
Bridges of Königsberg
The city of Königsberg straddled a river containing 2 islands.
The islands and mainland were connected by 7 bridges.
Was there a walk that crossed each bridge once and only once?
In 1736 Leonhard Euler proved that such a walk did not exist.Graphs and paths: Finding connections and patterns in data 7
Bridges of Königsberg as a graph
A B C
D
Graphs and paths: Finding connections and patterns in data 8
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 10
Airlines connecting airports
A graph of airports and airlines flying between them:
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
BABA
BA
BA
AA
AA
AA LA
LA
LA
AFAA
AC
IB
AC
LA IB AF
Graphs and paths: Finding connections and patterns in data 11
Large graphs
the previous examples were of “toy” graphs
real-life graphs can be very large
examples include
Facebook: 1.65 billion users (nodes) and 280 billion connections (edges)
Linked Open Data cloud: 295 datasets, with 31 billion facts
Graphs and paths: Finding connections and patterns in data 12
Linked data graph
Linked Datasets as of August 2014
Uniprot
AlexandriaDigital Library
Gazetteer
lobidOrganizations
chem2bio2rdf
MultimediaLab University
Ghent
Open DataEcuador
GeoEcuador
Serendipity
UTPLLOD
GovAgriBusDenmark
DBpedialive
URIBurner
Linguistics
Social Networking
Life Sciences
Cross-Domain
Government
User-Generated Content
Publications
Geographic
Media
Identifiers
EionetRDF
lobidResources
WiktionaryDBpedia
Viaf
Umthes
RKBExplorer
Courseware
Opencyc
Olia
Gem.Thesaurus
AudiovisueleArchieven
DiseasomeFU-Berlin
Eurovocin
SKOS
DNBGND
Cornetto
Bio2RDFPubmed
Bio2RDFNDC
Bio2RDFMesh
IDS
OntosNewsPortal
AEMET
ineverycrea
LinkedUser
Feedback
MuseosEspaniaGNOSS
Europeana
NomenclatorAsturias
Red UnoInternacional
GNOSS
GeoWordnet
Bio2RDFHGNC
CticPublic
Dataset
Bio2RDFHomologene
Bio2RDFAffymetrix
MuninnWorld War I
CKAN
GovernmentWeb Integration
forLinkedData
Universidadde CuencaLinkeddata
Freebase
Linklion
Ariadne
OrganicEdunet
GeneExpressionAtlas RDF
ChemblRDF
BiosamplesRDF
IdentifiersOrg
BiomodelsRDF
ReactomeRDF
Disgenet
SemanticQuran
IATI asLinked Data
DutchShips and
Sailors
Verrijktkoninkrijk
IServe
Arago-dbpedia
LinkedTCGA
ABS270a.info
RDFLicense
EnvironmentalApplications
ReferenceThesaurus
Thist
JudaicaLink
BPR
OCD
ShoahVictimsNames
Reload
Data forTourists in
Castilla y Leon
2001SpanishCensusto RDF
RKBExplorer
Webscience
RKBExplorerEprintsHarvest
NVS
EU AgenciesBodies
EPO
LinkedNUTS
RKBExplorer
Epsrc
OpenMobile
Network
RKBExplorerLisbon
RKBExplorer
Italy
CE4R
EnvironmentAgency
Bathing WaterQuality
RKBExplorerKaunas
OpenData
Thesaurus
RKBExplorerWordnet
RKBExplorer
ECS
AustrianSki
Racers
Social-semweb
Thesaurus
DataOpenAc Uk
RKBExplorer
IEEE
RKBExplorer
LAAS
RKBExplorer
Wiki
RKBExplorer
JISC
RKBExplorerEprints
RKBExplorer
Pisa
RKBExplorer
Darmstadt
RKBExplorerunlocode
RKBExplorer
Newcastle
RKBExplorer
OS
RKBExplorer
Curriculum
RKBExplorer
Resex
RKBExplorer
Roma
RKBExplorerEurecom
RKBExplorer
IBM
RKBExplorer
NSF
RKBExplorer
kisti
RKBExplorer
DBLP
RKBExplorer
ACM
RKBExplorerCiteseer
RKBExplorer
Southampton
RKBExplorerDeepblue
RKBExplorerDeploy
RKBExplorer
Risks
RKBExplorer
ERA
RKBExplorer
OAI
RKBExplorer
FT
RKBExplorer
Ulm
RKBExplorer
Irit
RKBExplorerRAE2001
RKBExplorer
Dotac
RKBExplorerBudapest
SwedishOpen Cultural
Heritage
Radatana
CourtsThesaurus
GermanLabor LawThesaurus
GovUKTransport
Data
GovUKEducation
Data
EnaktingMortality
EnaktingEnergy
EnaktingCrime
EnaktingPopulation
EnaktingCO2Emission
EnaktingNHS
RKBExplorer
Crime
RKBExplorercordis
Govtrack
GeologicalSurvey of
AustriaThesaurus
GeoLinkedData
GesisThesoz
Bio2RDFPharmgkb
Bio2RDFSabiorkBio2RDF
Ncbigene
Bio2RDFIrefindex
Bio2RDFIproclass
Bio2RDFGOA
Bio2RDFDrugbank
Bio2RDFCTD
Bio2RDFBiomodels
Bio2RDFDBSNP
Bio2RDFClinicaltrials
Bio2RDFLSR
Bio2RDFOrphanet
Bio2RDFWormbase
BIS270a.info
DM2E
DBpediaPT
DBpediaES
DBpediaCS
DBnary
AlpinoRDF
YAGO
PdevLemon
Lemonuby
Isocat
Ietflang
Core
KUPKB
GettyAAT
SemanticWeb
Journal
OpenlinkSWDataspaces
MyOpenlinkDataspaces
Jugem
Typepad
AspireHarperAdams
NBNResolving
Worldcat
Bio2RDF
Bio2RDFECO
Taxon-conceptAssets
Indymedia
GovUKSocietal
WellbeingDeprivation imd
EmploymentRank La 2010
GNULicenses
GreekWordnet
DBpedia
CIPFA
Yso.fiAllars
Glottolog
StatusNetBonifaz
StatusNetshnoulle
Revyu
StatusNetKathryl
ChargingStations
AspireUCL
Tekord
Didactalia
ArtenueVosmedios
GNOSS
LinkedCrunchbase
ESDStandards
VIVOUniversityof Florida
Bio2RDFSGD
Resources
ProductOntology
DatosBne.es
StatusNetMrblog
Bio2RDFDataset
EUNIS
GovUKHousingMarket
LCSH
GovUKTransparencyImpact ind.Households
In temp.Accom.
UniprotKB
StatusNetTimttmy
SemanticWeb
Grundlagen
GovUKInput ind.
Local AuthorityFunding FromGovernment
Grant
StatusNetFcestrada
JITA
StatusNetSomsants
StatusNetIlikefreedom
DrugbankFU-Berlin
Semanlink
StatusNetDtdns
StatusNetStatus.net
DCSSheffield
AtheliaRFID
StatusNetTekk
ListaEncabezaMientosMateria
StatusNetFragdev
Morelab
DBTuneJohn PeelSessions
RDFizelast.fm
OpenData
Euskadi
GovUKTransparency
Input ind.Local auth.Funding f.
Gvmnt. Grant
MSC
Lexinfo
StatusNetEquestriarp
Asn.us
GovUKSocietal
WellbeingDeprivation ImdHealth Rank la
2010
StatusNetMacno
OceandrillingBorehole
AspireQmul
GovUKImpact
IndicatorsPlanning
ApplicationsGranted
Loius
Datahub.io
StatusNetMaymay
Prospectsand
TrendsGNOSS
GovUKTransparency
Impact IndicatorsEnergy Efficiency
new Builds
DBpediaEU
Bio2RDFTaxon
StatusNetTschlotfeldt
JamendoDBTune
AspireNTU
GovUKSocietal
WellbeingDeprivation Imd
Health Score2010
LoticoGNOSS
UniprotMetadata
LinkedEurostat
AspireSussex
Lexvo
LinkedGeoData
StatusNetSpip
SORS
GovUKHomeless-
nessAccept. per
1000
TWCIEEEvis
AspireBrunel
PlanetDataProject
Wiki
StatusNetFreelish
Statisticsdata.gov.uk
StatusNetMulestable
Enipedia
UKLegislation
API
LinkedMDB
StatusNetQth
SiderFU-Berlin
DBpediaDE
GovUKHouseholds
Social lettingsGeneral Needs
Lettings PrpNumber
Bedrooms
AgrovocSkos
MyExperiment
ProyectoApadrina
GovUKImd CrimeRank 2010
SISVU
GovUKSocietal
WellbeingDeprivation ImdHousing Rank la
2010
StatusNetUni
Siegen
OpendataScotland Simd
EducationRank
StatusNetKaimi
GovUKHouseholds
Accommodatedper 1000
StatusNetPlanetlibre
DBpediaEL
SztakiLOD
DBpediaLite
DrugInteractionKnowledge
BaseStatusNet
Qdnx
AmsterdamMuseum
AS EDN LOD
RDFOhloh
DBTuneartistslast.fm
AspireUclan
HellenicFire Brigade
Bibsonomy
NottinghamTrent
ResourceLists
OpendataScotland SimdIncome Rank
RandomnessGuide
London
OpendataScotland
Simd HealthRank
SouthamptonECS Eprints
FRB270a.info
StatusNetSebseb01
StatusNetBka
ESDToolkit
HellenicPolice
StatusNetCed117
OpenEnergy
Info Wiki
StatusNetLydiastench
OpenDataRISP
Taxon-concept
Occurences
Bio2RDFSGD
UIS270a.info
NYTimesLinked Open
Data
AspireKeele
GovUKHouseholdsProjectionsPopulation
W3C
OpendataScotland
Simd HousingRank
ZDB
StatusNet1w6
StatusNetAlexandre
Franke
DeweyDecimal
Classification
StatusNetStatus
StatusNetdoomicile
CurrencyDesignators
StatusNetHiico
LinkedEdgar
GovUKHouseholds
2008
DOI
StatusNetPandaid
BrazilianPoliticians
NHSJargon
Theses.fr
LinkedLifeData
Semantic WebDogFood
UMBEL
OpenlyLocal
StatusNetSsweeny
LinkedFood
InteractiveMaps
GNOSS
OECD270a.info
Sudoc.fr
GreenCompetitive-
nessGNOSS
StatusNetIntegralblue
WOLD
LinkedStockIndex
Apache
KDATA
LinkedOpenPiracy
GovUKSocietal
WellbeingDeprv. ImdEmpl. Rank
La 2010
BBCMusic
StatusNetQuitter
StatusNetScoffoni
OpenElection
DataProject
Referencedata.gov.uk
StatusNetJonkman
ProjectGutenbergFU-BerlinDBTropes
StatusNetSpraci
Libris
ECB270a.info
StatusNetThelovebug
Icane
GreekAdministrative
Geography
Bio2RDFOMIM
StatusNetOrangeseeds
NationalDiet Library
WEB NDLAuthorities
UniprotTaxonomy
DBpediaNL
L3SDBLP
FAOGeopolitical
Ontology
GovUKImpact
IndicatorsHousing Starts
DeutscheBiographie
StatusNetldnfai
StatusNetKeuser
StatusNetRusswurm
GovUK SocietalWellbeing
Deprivation ImdCrime Rank 2010
GovUKImd Income
Rank La2010
StatusNetDatenfahrt
StatusNetImirhil
Southamptonac.uk
LOD2Project
Wiki
DBpediaKO
DailymedFU-Berlin
WALS
DBpediaIT
StatusNetRecit
Livejournal
StatusNetExdc
Elviajero
Aves3D
OpenCalais
ZaragozaTurruta
AspireManchester
Wordnet(VU)
GovUKTransparency
Impact IndicatorsNeighbourhood
Plans
StatusNetDavid
Haberthuer
B3Kat
PubBielefeld
Prefix.cc
NALT
Vulnera-pedia
GovUKImpact
IndicatorsAffordable
Housing Starts
GovUKWellbeing lsoa
HappyYesterday
Mean
FlickrWrappr
Yso.fiYSA
OpenLibrary
AspirePlymouth
StatusNetJohndrink
Water
StatusNetGomertronic
Tags2conDelicious
StatusNettl1n
StatusNetProgval
Testee
WorldFactbookFU-Berlin
DBpediaJA
StatusNetCooleysekula
ProductDB
IMF270a.info
StatusNetPostblue
StatusNetSkilledtests
NextwebGNOSS
EurostatFU-Berlin
GovUKHouseholds
Social LettingsGeneral Needs
Lettings PrpHousehold
Composition
StatusNetFcac
DWSGroup
OpendataScotland
GraphSimd Rank
DNB
CleanEnergyData
Reegle
OpendataScotland SimdEmployment
Rank
ChroniclingAmerica
GovUKSocietal
WellbeingDeprivation
Imd Rank 2010
StatusNetBelfalas
AspireMMU
StatusNetLegadolibre
BlukBNB
StatusNetLebsanft
GADMGeovocab
GovUKImd Score
2010
SemanticXBRL
UKPostcodes
GeoNames
EEARodAspire
Roehampton
BFS270a.info
CameraDeputatiLinkedData
Bio2RDFGeneID
GovUKTransparency
Impact IndicatorsPlanning
ApplicationsGranted
StatusNetSweetie
Belle
O'Reilly
GNI
CityLichfield
GovUKImd
Rank 2010
BibleOntology
Idref.fr
StatusNetAtari
Frosch
Dev8d
NobelPrizes
StatusNetSoucy
ArchiveshubLinkedData
LinkedRailway
DataProject
FAO270a.info
GovUKWellbeing
WorthwhileMean
Bibbase
Semantic-web.org
BritishMuseum
Collection
GovUKDev LocalAuthorityServices
CodeHaus
Lingvoj
OrdnanceSurveyLinkedData
Wordpress
EurostatRDF
StatusNetKenzoid
GEMET
GovUKSocietal
WellbeingDeprv. imdScore '10
MisMuseosGNOSS
GovUKHouseholdsProjections
totalHouseolds
StatusNet20100
EEA
CiardRing
OpendataScotland Graph
EducationPupils by
School andDatazone
VIVOIndiana
University
Pokepedia
Transparency270a.info
StatusNetGlou
GovUKHomelessness
HouseholdsAccommodated
TemporaryHousing Types
STWThesaurus
forEconomics
DebianPackageTrackingSystem
DBTuneMagnatune
NUTSGeo-vocab
GovUKSocietal
WellbeingDeprivation ImdIncome Rank La
2010
BBCWildlifeFinder
StatusNetMystatus
MiguiadEviajesGNOSS
AcornSat
DataBnf.fr
GovUKimd env.
rank 2010
StatusNetOpensimchat
OpenFoodFacts
GovUKSocietal
WellbeingDeprivation Imd
Education Rank La2010
LODACBDLS
FOAF-Profiles
StatusNetSamnoble
GovUKTransparency
Impact IndicatorsAffordable
Housing Starts
StatusNetCoreyavisEnel
Shops
DBpediaFR
StatusNetRainbowdash
StatusNetMamalibre
PrincetonLibrary
Findingaids
WWWFoundation
Bio2RDFOMIM
Resources
OpendataScotland Simd
GeographicAccess Rank
Gutenberg
StatusNetOtbm
ODCLSOA
StatusNetOurcoffs
Colinda
WebNmasunoTraveler
StatusNetHackerposse
LOV
GarnicaPlywood
GovUKwellb. happy
yesterdaystd. dev.
StatusNetLudost
BBCProgram-
mes
GovUKSocietal
WellbeingDeprivation Imd
EnvironmentRank 2010
Bio2RDFTaxonomy
Worldbank270a.info
OSM
DBTuneMusic-brainz
LinkedMarkMail
StatusNetDeuxpi
GovUKTransparency
ImpactIndicators
Housing Starts
BizkaiSense
GovUKimpact
indicators energyefficiency new
builds
StatusNetMorphtown
GovUKTransparency
Input indicatorsLocal authorities
Working w. tr.Families
ISO 639Oasis
AspirePortsmouth
ZaragozaDatos
AbiertosOpendataScotland
SimdCrime Rank
Berlios
StatusNetpiana
GovUKNet Add.Dwellings
Bootsnall
StatusNetchromic
Geospecies
linkedct
Wordnet(W3C)
StatusNetthornton2
StatusNetmkuttner
StatusNetlinuxwrangling
EurostatLinkedData
GovUKsocietal
wellbeingdeprv. imd
rank '07
GovUKsocietal
wellbeingdeprv. imdrank la '10
LinkedOpen Data
ofEcology
StatusNetchickenkiller
StatusNetgegeweb
DeustoTech
StatusNetschiessle
GovUKtransparency
impactindicatorstr. families
Taxonconcept
GovUKservice
expenditure
GovUKsocietal
wellbeingdeprivation imd
employmentscore 2010
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak.
http://lod-cloud.net/
Graphs and paths: Finding connections and patterns in data 13
Linked Open Data
Linked Open Data stores facts such as
“Hilary Mantel won the Man Booker Prize"
as triples of the form subject — predicate (or property) — object
e.g. ( Hilary Mantel , hasWonPrize, Man Booker ) or
Hilary Mantel Man BookerhasWonPrize
This language is called RDF (Resource Description Framework)
Graphs and paths: Finding connections and patterns in data 14
What are graphs used for?
social networks
knowledge representation (e.g. semantic web)
transportation and other networks (e.g WWW)
geographical information
(hyper)document structure
semantic associations in criminal investigations
bibliographic citation analysis
pathways in biological processes
computer program analysis
workflow systems
data provenance
. . .Graphs and paths: Finding connections and patterns in data 15
Query languages
A query language is a declarative language in which to express an
information request from a data set.
“declarative” means that we shouldn’t need to specify precisely how
the information is to be retrieved
that is left up to the (database) system to work out
e.g. What is the length of the shortest route (path) from LHR to LIM
Graphs and paths: Finding connections and patterns in data 16
Query languages
A query language is a declarative language in which to express an
information request from a data set.
“declarative” means that we shouldn’t need to specify precisely how
the information is to be retrieved
that is left up to the (database) system to work out
e.g. What is the length of the shortest route (path) from LHR to LIM
Graphs and paths: Finding connections and patterns in data 17
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 18
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 19
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 20
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 21
Airline flights as a graph
A graph of airports and flight distances (in hundreds of miles):
SEA
YVR
LHR
CDGIAH MAD
LIM SCL
250
47
8
1
19
31 59
15
72
Graphs and paths: Finding connections and patterns in data 22
Shortest path algorithmFunction Dijkstra(Graph, source)dist[LHR] ← 0
create node set Q
foreach node v in Graph do
if v 6= source then dist[v] ← INFINITY
Q.add_with_priority(v, dist[v])
while Q is not empty do
u ← Q.extract_min()
foreach neighbour v of u do
alt ← dist[u] + length(u, v)
if alt < dist[v] then
dist[v] ← alt
Q.decrease_priority(v, alt)
return dist[]Graphs and paths: Finding connections and patterns in data 23
Shortest path as a query
LHR LIM( )+(D)
shortestPath(min(sum(D))
variable D collects up the distance values
sum adds up the distances along a path
min asks for the minimum summed distance
Graphs and paths: Finding connections and patterns in data 24
Problems to study
3 main topics of investigation:
what are appropriate query language constructs?
how efficiently can queries be evaluated?
how can query evaluation be optimised?
Graphs and paths: Finding connections and patterns in data 25
Measures of efficiency
we can run queries to see how fast they execute
we can study their computational complexity
polynomial-time algorithms
NP-complete (or worse) problems — intractable
we can try to find polynomial-time subclasses (occurring in practice)
we can try to find approximation algorithms or heuristics
we can claim that the problem size is small enough
Graphs and paths: Finding connections and patterns in data 26
Measures of efficiency
we can run queries to see how fast they execute
we can study their computational complexity
polynomial-time algorithms
NP-complete (or worse) problems — intractable
we can try to find polynomial-time subclasses (occurring in practice)
we can try to find approximation algorithms or heuristics
we can claim that the problem size is small enough
Graphs and paths: Finding connections and patterns in data 27
Measures of efficiency
we can run queries to see how fast they execute
we can study their computational complexity
polynomial-time algorithms
NP-complete (or worse) problems — intractable
we can try to find polynomial-time subclasses (occurring in practice)
we can try to find approximation algorithms or heuristics
we can claim that the problem size is small enough
Graphs and paths: Finding connections and patterns in data 28
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 29
Many years ago . . .
my PhD was about “Queries on Graphs”
one contribution was the use of regular expressions to specify the
paths of interest
also contributions in terms of complexity of query evaluation
25 years later
regular expressions were added to SPARQL 1.1, the standard query
language for RDF
Graphs and paths: Finding connections and patterns in data 30
Many years ago . . .
my PhD was about “Queries on Graphs”
one contribution was the use of regular expressions to specify the
paths of interest
also contributions in terms of complexity of query evaluation
25 years later
regular expressions were added to SPARQL 1.1, the standard query
language for RDF
Graphs and paths: Finding connections and patterns in data 31
Regular expressions
Used to express patterns of interest, using 3 (or more) operators:
“choice” (|): a parent is a father or mother
(father | mother)
“sequence” (·): a grandparent is a parent of a parent
(parent · parent)
“repetition” (∗): an ancestor is a parent of a parent of a . . .
(parent)∗
These can be combined: so (mother | father)∗ specifies any sequence of
mother and/or father relationships.
Graphs and paths: Finding connections and patterns in data 32
Regular expressions
Used to express patterns of interest, using 3 (or more) operators:
“choice” (|): a parent is a father or mother
(father | mother)
“sequence” (·): a grandparent is a parent of a parent
(parent · parent)
“repetition” (∗): an ancestor is a parent of a parent of a . . .
(parent)∗
These can be combined: so (mother | father)∗ specifies any sequence of
mother and/or father relationships.
Graphs and paths: Finding connections and patterns in data 33
Regular expressions
Used to express patterns of interest, using 3 (or more) operators:
“choice” (|): a parent is a father or mother
(father | mother)
“sequence” (·): a grandparent is a parent of a parent
(parent · parent)
“repetition” (∗): an ancestor is a parent of a parent of a . . .
(parent)∗
These can be combined: so (mother | father)∗ specifies any sequence of
mother and/or father relationships.
Graphs and paths: Finding connections and patterns in data 34
Regular expressions
Used to express patterns of interest, using 3 (or more) operators:
“choice” (|): a parent is a father or mother
(father | mother)
“sequence” (·): a grandparent is a parent of a parent
(parent · parent)
“repetition” (∗): an ancestor is a parent of a parent of a . . .
(parent)∗
These can be combined: so (mother | father)∗ specifies any sequence of
mother and/or father relationships.
Graphs and paths: Finding connections and patterns in data 35
Regular expressions
Used to express patterns of interest, using 3 (or more) operators:
“choice” (|): a parent is a father or mother
(father | mother)
“sequence” (·): a grandparent is a parent of a parent
(parent · parent)
“repetition” (∗): an ancestor is a parent of a parent of a . . .
(parent)∗
These can be combined: so (mother | father)∗ specifies any sequence of
mother and/or father relationships.
Graphs and paths: Finding connections and patterns in data 36
Example of a query
given a graph
nodes representing people
edges labelled with m (for motherOf) or f (for fatherOf)
following query finds parents followed by pairs of people who have a
common ancestor
x z y x y
x y x y
p* p* a
m|f p
x, y and z are variables, for matching nodes in the graphGraphs and paths: Finding connections and patterns in data 37
Example of a query
given a graph
nodes representing people
edges labelled with m (for motherOf) or f (for fatherOf)
following query finds parents followed by pairs of people who have a
common ancestor
x z y x y
x y x y
p* p* a
m|f p
x, y and z are variables, for matching nodes in the graphGraphs and paths: Finding connections and patterns in data 38
Regular simple path queries
a path p is simple if no node is repeated on p
Regular Simple Path Problem
Given a graph, a pair of nodes x and y and a regular expression r , is
there a simple path from x to y satisfying r?
Regular Simple Path Problem is NP-complete, even for fixed
expressions
the complexity is in the size of the graph
Graphs and paths: Finding connections and patterns in data 39
Regular simple path queries
a path p is simple if no node is repeated on p
Regular Simple Path Problem
Given a graph, a pair of nodes x and y and a regular expression r , is
there a simple path from x to y satisfying r?
Regular Simple Path Problem is NP-complete, even for fixed
expressions
the complexity is in the size of the graph
Graphs and paths: Finding connections and patterns in data 40
Regular simple path queries
a path p is simple if no node is repeated on p
Regular Simple Path Problem
Given a graph, a pair of nodes x and y and a regular expression r , is
there a simple path from x to y satisfying r?
Regular Simple Path Problem is NP-complete, even for fixed
expressions
the complexity is in the size of the graph
Graphs and paths: Finding connections and patterns in data 41
Regular simple path problem
solvable in polynomial time when restricted to regular expressions
closed under abbreviations (or deletions)
if we take a sequence satisfying regular expression r and delete some
labels from the sequence, we get another sequence satisfying r
many commonly used regular expressions, such as p* and (m|f)* are
closed under abbreviations
Graphs and paths: Finding connections and patterns in data 42
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 43
Trees
trees are commonly used for representing data
one language for trees is called XML (eXtensible Modelling Language)
XML is used for data exchange between systems/applications
e.g., HESA (Higher Education Statistics Agency) requires student
data to be submitted in XML
Graphs and paths: Finding connections and patterns in data 44
Calendar data in XML
<diary>
<event>
<date>3/12/88</date> <description>anniversary</description>
<repeat><frequency>yearly</frequency></repeat>
</event>
<event>
<date>6/6/16</date> <description>inaugural</description>
<time>17:00</time> <duration>1:00</duration>
</event>
<event>
<date>5/1/2016</date> <description>lecture</description>
<time>18:00</time> <duration>3:00</duration>
<repeat><until>19/3/2016</until><frequency>weekly</frequency></repeat>
</event>
</diary>Graphs and paths: Finding connections and patterns in data 45
Calendar data as a tree
diary
event event event
date desc repeat date desc time duration date desc time duration repeat
3/12/88 anniversary frequency 6/6/16 inaugural 17:00 1:00 5/1/16 lecture 18:00 3:00 until frequency
yearly 19/3/16 weekly
Graphs and paths: Finding connections and patterns in data 46
A query language for XML
XPath (XML Path language) is a query language for XML
it is part of many other languages for processing XML
it makes sense to study optimisation of XPath queries
particularly in the presence of schema information
a schema is a set of constraints that restricts the permissable
relationships between data items
one way to define schemas for XML is to use
Document Type Definitions (DTDs)
Graphs and paths: Finding connections and patterns in data 47
A query language for XML
XPath (XML Path language) is a query language for XML
it is part of many other languages for processing XML
it makes sense to study optimisation of XPath queries
particularly in the presence of schema information
a schema is a set of constraints that restricts the permissable
relationships between data items
one way to define schemas for XML is to use
Document Type Definitions (DTDs)
Graphs and paths: Finding connections and patterns in data 48
DTD example
Consider an XML DTD for our diary application:
diary → (event)*
event → (date · description · (time · duration)?, repeat?)
repeat → ((until | occurrences)? · frequency)
note the use of regular expressions above
? means that what precedes it is optional
some constraints specified by the DTD are:
– every event must have a date
– every event with a time must have a duration
– every event has at most one description
Graphs and paths: Finding connections and patterns in data 49
DTD example
Consider an XML DTD for our diary application:
diary → (event)*
event → (date · description · (time · duration)?, repeat?)
repeat → ((until | occurrences)? · frequency)
note the use of regular expressions above
? means that what precedes it is optional
some constraints specified by the DTD are:
– every event must have a date
– every event with a time must have a duration
– every event has at most one description
Graphs and paths: Finding connections and patterns in data 50
Query satisfiability
an XPath query may be unsatisfiable (with respect to a DTD)
this means that the answer to the query will always be empty
e.g., asking for diary events which repeat both until some date and
some number of occurrences
checking satisfiability first can yield savings in overall query
processing time
Graphs and paths: Finding connections and patterns in data 51
XPath satisfiability results
(joint work with PhD student Manizheh Montazerian)
we showed that checking satisfiability, even for simple XPath
fragments, is NP-complete
we also examined several real-world DTDs and discovered a new
property, called covering, which most of them preserved
we showed that query satisfiability for a fragment of XPath can be
tested in polynomial time when the underlying DTD is covering
Graphs and paths: Finding connections and patterns in data 52
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 53
Flexible querying
(joint work with Alex Poulovassilis, Andrea Calì, Carlos Hurtado, Petra
Selmer, Riccardo Frosini)
with LOD, many heterogeneous data sets are available
the standard query language is called SPARQL
regular expressions (property paths in SPARQL 1.1 (2013)) already
allow for some flexibility
but when users don’t know the vocabulary or structure, they may
need help
in particular, they may pose unsatisfiable queries
Graphs and paths: Finding connections and patterns in data 54
Example LOD graph
Graph of authors , prizes they have won, and countries where they
were born:
Nobel Booker
Neruda Gordimer Coetzee Carey
Chile South Africa Australia
hasWonPrize
wasBornIn
Graphs and paths: Finding connections and patterns in data 55
Example SPARQL query
Which authors born in South Africa have won both the
Nobel Prize in Literature and the Man Booker prize?
xBooker Nobel
South Africa
hasWonPrize hasWonPrize
wasBornIn
x is a variable
Graphs and paths: Finding connections and patterns in data 56
Matching answers
Two matchings for variable x : Gordimer and Coetzee
Nobel Booker
Neruda Gordimer Coetzee Carey
Chile South Africa Australia
Graphs and paths: Finding connections and patterns in data 57
Matching answers
Two matchings for variable x : Gordimer and Coetzee
Nobel Booker
Neruda Gordimer Coetzee Carey
Chile South Africa Australia
Gordimer
Graphs and paths: Finding connections and patterns in data 58
Matching answers
Two matchings for variable x : Gordimer and Coetzee
Nobel Booker
Neruda Gordimer Coetzee Carey
Chile South Africa Australia
Coetzee
Graphs and paths: Finding connections and patterns in data 59
in the real data set, wasBornIn connects a person to a city, not a
country
so the previous query will return no answers
a city is connected to a country by an edge labelled with isLocatedIn
so the correct regular expression to connect x to South Africa is:
wasBornIn · isLocatedIn
the query system can automatically make changes to a query so as to
help the user find relevant information
Graphs and paths: Finding connections and patterns in data 60
Approximate matching
the user’s original query is modified by applying edit operations to theregular expressions, e.g.,
insertions
deletions
substitutions
so isLocatedIn would be automatically inserted after wasBornIn
such edit operations have also been used for DNA sequence alignment
each operation may have a different cost associated with it
answers to queries are returned in ranked order, in increasing total
cost from the original query
Graphs and paths: Finding connections and patterns in data 61
Approximate matching
the user’s original query is modified by applying edit operations to theregular expressions, e.g.,
insertions
deletions
substitutions
so isLocatedIn would be automatically inserted after wasBornIn
such edit operations have also been used for DNA sequence alignment
each operation may have a different cost associated with it
answers to queries are returned in ranked order, in increasing total
cost from the original query
Graphs and paths: Finding connections and patterns in data 62
Optimisation
applying edit operations generates many additional queries
so we have been investigating ways to speed up query execution
one is to construct an approximate summary schema from the data
this allows the system to detect and ignore unsatisfiable queries
Graphs and paths: Finding connections and patterns in data 63
Schema summary optimisation
basic idea is to record all the path sequences up to a certain
maximum length that actually appear in the data
choose a small maximum length, e.g. 2
only an approximation: if we discover two paths a · b and b · c in the
data, the summary records that a · b · c is a possible path, even if it
doesn’t actually appear in the data
nevertheless, if a path does not appear in the summary, then it does
not appear in the data
this allows the system to check for unsatisfiable queries
and save time by not executing them
Graphs and paths: Finding connections and patterns in data 64
Schema summary optimisation
basic idea is to record all the path sequences up to a certain
maximum length that actually appear in the data
choose a small maximum length, e.g. 2
only an approximation: if we discover two paths a · b and b · c in the
data, the summary records that a · b · c is a possible path, even if it
doesn’t actually appear in the data
nevertheless, if a path does not appear in the summary, then it does
not appear in the data
this allows the system to check for unsatisfiable queries
and save time by not executing them
Graphs and paths: Finding connections and patterns in data 65
Sample schema summary results
for one query with a maximum edit cost set to 3,
112 additional queries were generated via edit operations
using a schema summary on a data set of 6.7M facts,
59 of these queries were unsatisfiable
without the summary, the queries did not complete execution within 8
hours
with a schema summary of size 2, they completed in under 4 seconds
Graphs and paths: Finding connections and patterns in data 66
Sample schema summary results
for one query with a maximum edit cost set to 3,
112 additional queries were generated via edit operations
using a schema summary on a data set of 6.7M facts,
59 of these queries were unsatisfiable
without the summary, the queries did not complete execution within 8
hours
with a schema summary of size 2, they completed in under 4 seconds
Graphs and paths: Finding connections and patterns in data 67
Contents
Introduction
The complexity of querying graphs (1987–1995)
Optimising queries on trees (1999–2007)
Adding flexibility to graph queries (2006–
Ongoing and future work
Graphs and paths: Finding connections and patterns in data 68
Ongoing and future work
further investigation of suitable query constructs, particularly for
social network applications
returning recommended paths to users, rather than simply nodes
provide explanations to users about how answers to queries were
derived
other methods to improve query execution time
use mappings between LOD data sets to allow for automatic rewriting
of queries in distributed scenarios
Graphs and paths: Finding connections and patterns in data 69