
Creating and federating radically distributed web databases using XML

Fred Howell

Institute for Adaptive and Neural Computation

Division of Informatics

University of Edinburgh

Outline

- Prelude: How do we get the XML?

- Interlude: "web scraping art galleries"

- Fugue: How do we actually implement radically distributed databases?

- Finale: Where is this technology going?

The bioinformatics database success story

Will this centralised approach work as the focus of bioinformatics moves towards systems issues? (large networks of interacting proteins, crazy amounts of experimental data in weird and wonderful formats).

Why won't this work?

What's the incentive for a researcher to put effort into packaging up their data for someone else's database? Who gets the credit?

What if each lab has a different experimental setup / preferred way to describe / structure their data? Do they have to change their thinking to fit an approach designed by a database programmer?

What about all the information in the lab books for which there isn't an appropriate field in the database?

Database technologies

Aim: store and retrieve data and metadata

1) the hierarchical file system (windows/unix)
+ simple, copes with any mix of file types
- no easy searching, single hierarchy:

/models/Neuron/sept2002/dynamic-synapse.mod
/papers/pdf/dynsynpaper.pdf
why not /interests/dynamic-synapses/*.mod *.pdf?

2) relational databases (Oracle, MySQL, SQL Server)
+ fast searching, complex queries
- hard and expensive to set up. Doesn't support complex data structures well. Fussy about precise data formats.


Database technologies (2)

3) Spreadsheet (Excel, OpenOffice)
+ simple, includes graphs/calculations
- hard to restructure data beyond initial layout

4) Pen/paper lab notebook
+ simple, flexible
- not searchable

5) text files

6) XML

E.F. Codd (1970). A relational model of data for large shared data banks. CACM 13(6), pp. 377-387

How to build a web database using Oracle/MySQL

What’s wrong with relational databases?

- The Relational Data Model is optimised for the computer's convenience, not the user's convenience. It doesn't handle tree structures at all well (you have to split them into one table for each different type of object and make the parent/child relationships into untyped pointers. This is unnatural; see the sketch after this list.)

- They were intended to be used by typing in SQL commands. But this power is beyond most users – who get shielded behind a basic web form interface (e.g. using PHP) which restricts their searches to a predefined subset of operations on the data.
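A minimal sketch of the tree problem, with invented names throughout. A nested model description is one natural tree in XML:

  <cell name="pyramidal">
    <compartment name="soma">
      <channel type="Na"/>
      <channel type="K"/>
    </compartment>
  </cell>

but a relational database forces it into one table per type of object, with the parent/child links reduced to untyped integer pointers:

  cell:        id=7  name="pyramidal"
  compartment: id=1  cell_id=7  name="soma"
  channel:     id=1  compartment_id=1  type="Na"
  channel:     id=2  compartment_id=1  type="K"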

Why not just use the Web?

1. put the data on your web / ftp site
2. make an html index file by hand

+ quick and easy, you don't need to ask anyone's permission for how to structure your own data.

- you don't get search / sort (but you could use Google...)

How do we get to "give me a global list of all known models of hippocampus"? Google again?

"Download results from all voltage clamp experiments on neuron X"?

The web at present

- .html pages

- .html generated on the fly from databases

- Free text search with Google


Napster / KaZaA

The "semantic web"?

"The next generation web will look like a big database rather than a big document" - Tim Berners-Lee, "Weaving the Web"

Important features:

- machine interpretable - XML not just HTML

- anyone can say anything about anything (just like the web)

Some possible criticisms:

- lots of academic arguments: put logic / predicate calculus (if A and B then C), ontologies, dull fights over XML formats (RDF / RDFS / XML-Schema / ...), only CS freaks will understand it

- starts to look like a repeat of traditional AI / Cyc

- not much (any?) actual useful software yet

How could we make a useful, global database of all scientific data?

* not relational databases - they're too expensive and inflexible

* not just the web + google - it will give you garbage

* not a centralised system - people don't like giving away their data

* has to allow quality control / tracking

* need all info which currently goes in lab books - not the filtered data which ends up in papers

What if…

- Every experimenter maintained their own XML indexes of all their experiments as XML files on their websites, with pointers to the raw data

- Every lab maintained a list of their members' sites

- Someone set up an XML index of labs (a sketch of these index files follows this list)

- … would we be able to perform quicker "browsing" of all data relating to protein XYZ, all behavioural experiments where gene X has been knocked out, compare results between species etc etc?
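A hedged sketch of what those three levels of index might look like (every tag name and URL here is invented for illustration):

  <!-- labs.xml : somebody's index of labs -->
  <labs>
    <lab name="Example Lab" url="http://example.org/examplelab/members.xml"/>
  </labs>

  <!-- members.xml : maintained by each lab -->
  <members>
    <member name="A. Researcher" url="http://example.org/~aresearcher/experiments.xml"/>
  </members>

  <!-- experiments.xml : maintained by each experimenter, pointing at the raw data -->
  <experiments>
    <experiment protein="XYZ" type="voltage clamp"
                rawdata="http://example.org/~aresearcher/data/run42.dat"/>
  </experiments>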

What’s needed to bring about this happy situation?

1) get data out of lab books into structured XML files

2) provide a “data browser” to surf / spider all this data conveniently

So how do we get the XML files?

n "Web scraping"

n Export relational database as an XML file

n A text editor

n Excel

n Custom scripts / programs

n A markup tool (filling in forms)n (see www.axiope.org, Microsoft XDocs/InfoPath)

Web scraping

"I want to access the data in your database in a way you hadn't expected"

1) your database generates HTML web pages from its internal structure

--- to get at the actual data from a program I need to web scrape

--- or to ask you for a database account, learn how to log on to your database server, etc etc

2) your database also exports its info as XML

--- now I can do all kinds of interesting things...

Database maintainers are starting to do (2), but they're not entirely sure why...
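A minimal Perl sketch of what (2) buys the consumer. The export URL and the <entry> record format are invented for illustration, not any real database's interface:

  use LWP::Simple qw(get);
  use XML::Simple;

  # Fetch a (hypothetical) XML export and use the data directly -
  # no database account, no SQL, no scraping of HTML.
  my $xml  = get("http://example.org/db/export.xml");
  my $data = XMLin($xml, ForceArray => ['entry']);
  for my $entry (@{ $data->{entry} }) {
      print "$entry->{name}: $entry->{description}\n";
  }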


Web scraping

Web scraping II

Web scraping III

Web scraping IV

use LWP::Simple qw(get $ua);
$ua->timeout(0.5);

# Loop over artist index letters A..Z, fetching result batches for each
# letter until the returned page no longer contains a "next" link.
for ($let = ord('A'); $let <= ord('Z'); $let++) {
    $c = chr($let);
    $more = 1;
    for ($batch = 1; $more == 1; $batch++) {
        $url = "http://www.nationalgallery.org.uk/cgi-bin/WebObjects.dll/CollectionPublisher.=&searchString=&workBatchIndex=0&artistBatchIndex=" . $batch . "&artistName=&indexLetter=$c";
        $fname = $c . "_" . $batch . ".html";
        $contents = get($url);
        # Cache each fetched page locally for the extraction step.
        open(my $fh, '>', "cache/$fname") or die "can't write cache/$fname: $!";
        print $fh $contents;
        close($fh);
        # A "next" link in the page means another batch remains.
        if ($contents =~ /next_hi/) { $more = 1; } else { $more = 0; }
    }
}

And the practical upshot...

- Start with national gallery's web site

- Write a perl script to fetch all the html pages & extract the nuggets of information (a sketch of the extraction step follows this list)

- Get an XML file with artist / painting title / url of image

- Convert the XML to a browseable website as late as possible / on the fly (once it's in html you can't concatenate any more without more scraping)
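A hedged sketch of the "extract the nuggets" step. The regular expression is a guess at plausible gallery markup, not the National Gallery's real HTML:

  use strict;
  use warnings;

  # Scan the pages cached by the fetch script and emit one
  # <painting> element per match.
  print qq{<paintings source="National Gallery">\n};
  for my $file (glob "cache/*.html") {
      open(my $fh, '<', $file) or next;
      my $html = do { local $/; <$fh> };
      close($fh);
      while ($html =~ m{<img src="([^"]+)"[^>]*>\s*<b>([^<]+)</b>\s*by\s*([^<]+)<}g) {
          my ($imageurl, $title, $artist) = ($1, $2, $3);
          print qq{  <painting artist="$artist" title="$title" imageurl="$imageurl"/>\n};
      }
  }
  print "</paintings>\n";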


Diagram of web scraping process

web site (linked .html pages: information + presentation (fonts etc)) --> web scraping program (fetches .html, strips out the good info, spits out XML) --> XML output (just contains the relevant (meta)data)

The points of transforming the data to XML

We can now run our own custom routines on the data (not whatever restricted facilities the website designer thought of)

We can now start to think about writing programs which combine together results from a large number of different sites in ways which the site creators hadn't considered

In an ideal world…

- the data wouldn't be buried in html

- the site creator would export info in XML too

- if the site is stored internally in a database, it should also export its data in XML (much easier than getting an account/password and reading about the specifics of their database system)

- … but in the meantime, it is actually possible to extract useful data from the .html

The need for a generic schema-aware XML browser

- One problem with XML – it's ugly to look at in raw form

- So we need to transform it back into (e.g.) HTML

- *BUT* we don't want to write a different transformation for each different XML format

- *THEREFORE* we need a generic "data" browser which we can point at any XML file and do all the usual sort/select/click operations

- *BUT* there isn't a good browser like this yet – MSc project anyone?

- Options:
  - XMLSpy (basic commercial XML editor)
  - prototype axiope server (www.axiope.org) - university research project
  - custom transformation routines (in PHP/Perl/Java/Python etc) tailored for a particular XML format ☹ (this is similar to most of the php glue between databases and the web at present)

What does the XML look like?

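A sketch, with invented tag names, of the sort of file the gallery scrape produces (cf. the extraction sketch above):

  <paintings source="National Gallery">
    <painting artist="Vincent van Gogh" title="Sunflowers"
              imageurl="http://www.nationalgallery.org.uk/..."/>
    <painting artist="Hans Holbein the Younger" title="The Ambassadors"
              imageurl="http://www.nationalgallery.org.uk/..."/>
  </paintings>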


Concatenation – *the* key advantage of XML

- Combine two web pages --> a mess

- Combine two databases --> a mess

- Combine two XML files --> an XML file
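A minimal Perl sketch of that concatenation. The file names are assumptions, and each input is assumed to hold one root element full of <painting> children:

  use XML::LibXML;

  # Build one document whose root holds the root element of each
  # per-gallery file, so queries can span every collection at once.
  my $merged = XML::LibXML::Document->new('1.0', 'UTF-8');
  my $root   = $merged->createElement('galleries');
  $merged->setDocumentElement($root);
  for my $f ('national_gallery.xml', 'metropolitan.xml') {
      my $doc = XML::LibXML->load_xml(location => $f);
      $root->appendChild($merged->importNode($doc->documentElement));
  }
  print $merged->toString(1);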

Adding the New York Metropolitan museum of art

- Another perl script (suspiciously similar to the first)

- Another XML file containing the art collection

- ... ditto for any other art gallery with a web site

And the upshot is:

n "show me all works by Picasso"

n I have a nasty hole on my wall that's 50 x 80 cm – show me allworks of art in european galleries which I could go steal to coverthe patch

n Possible, as we can sensibly concatenate XML files together,preserving the structure

n No fancy database programming required (assuming you have afancy XML browser)

n Galleries just need to tend a single text file rather than care andnurture a database
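Once the files are concatenated, "all works by Picasso" really is a one-liner. A sketch, with file and attribute names following the invented examples above:

  use XML::LibXML;

  # One XPath query over the merged file answers the question,
  # whichever gallery each painting came from.
  my $doc = XML::LibXML->load_xml(location => 'galleries.xml');
  for my $p ($doc->findnodes('//painting[@artist="Pablo Picasso"]')) {
      print $p->getAttribute('title'), "\n";
  }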

What is the "standard" way to do this?

n The "standard" approach is to try and link database servers - "we need todistribute the search". This is hard.... maybe that would work if the end users were happy typing SQL and understood

relational databases... also if the size of the *metadata* were huge (10s of Gbytes)but they don't - they're happy with a web interface

n There is already an infrastructure in place which can do most of this - the web

n All it takes is to put the data + metadata + structure as XML files on websites

n Everything else can be done by spidering
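A hedged sketch of that spidering, reusing the invented index format from the "What if…" slide (every URL and tag name is an assumption):

  use LWP::Simple qw(get);
  use XML::LibXML;

  sub fetch_xml { XML::LibXML->load_xml(string => get($_[0])) }

  # Walk the three levels: index of labs -> each lab's member list ->
  # each member's experiment index.
  my @experiments;
  my $labs = fetch_xml('http://example.org/labs.xml');
  for my $lab ($labs->findnodes('//lab')) {
      my $members = fetch_xml($lab->getAttribute('url'));
      for my $m ($members->findnodes('//member')) {
          my $idx = fetch_xml($m->getAttribute('url'));
          push @experiments, $idx->findnodes('//experiment');
      }
  }
  print scalar(@experiments), " experiment records found\n";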

XML standards???

- Attempt 1: get a bunch of interested parties to agree on "the standard" xml tags for situation X

- Attempt 2: realise that researchers are never going to agree on "the standard" and fall back on "just use XML, however you choose"... but does this buy you anything?


A wrinkle: what if the different sites used different <tags>?

- "Schema matching"

Schema mapping

- Just another XML file... (subject/verb/object) (cf RDF)
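A hedged sketch of such a mapping file (tag and field names invented; the two vocabularies stand for the galleries scraped earlier):

  <mappings>
    <map subject="nationalgallery:artist"   verb="sameAs" object="metmuseum:painter"/>
    <map subject="nationalgallery:imageurl" verb="sameAs" object="metmuseum:picture_href"/>
  </mappings>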

Result of schema mapping of two sites

And the scientific application?

- Individual researchers mark up their data however they think appropriate

- Labs publish concatenated / filtered views of their researchers' data

- Federated views of all related labs' data become available for data mining


Napster / KaZaA

Some key tricks to make it work

- Unlimited numbers of hierarchies when browsing – group by artist / title / picture size / art gallery / theme. (c.f. file systems which impose a single hierarchy, and the web which in practice imposes a division between different collections)

- Unrestricted data structures – if a gallery wants to record the "canvas thickness", let it, & cope with the resulting schema matching problem

- Realise that the interesting analyses will be done by downloading the databases onto researchers' PCs. Doing GRID style distributed computation is hard.

Some issues

Permanence of URLs

- how can we refer to a piece of data in a database in a paper, if someone might modify it / remove it?

Web of trust

"I trust Douglas's opinion concerning drosophila"

"show me all drosophila data relating to protein XYZ which Douglas or people whose opinion he trusts have recommended"