bio2rdf distributed querying model

URI based distributed querying

Peter Ansell

Aim

Access normalised RDF information located in multiple endpoints, using the concept of Public Namespaces and Private Record Identifiers, together with distributed SPARQL queries that are matched to the content in each endpoint.

Overall concepts

Query Types: Wrappers around SPARQL queries, each selected by a regular expression that is matched against the input query string.

Normalisation Rules: Rules that define the transformations from a standard normalised URI system to the system used by a particular endpoint, and the reverse if necessary.

Providers: The entities which provide the information. They can be SPARQL endpoints or even simple URLs. If they are proxied they should return RDF information, but redirects are also available for other providers.

URI resolution example

User enters HTTP URL into their user agent: http://mybio2rdf.local/namespace:identifier

Servlet receives request

Hostname: mybio2rdf.local

Query string: /namespace:identifier

Servlet performs URL rewriting to pass query string to the atlas2rdf.jsp page based on WEB-INF/urlrewrite.xml


The query string is matched against the regular expressions in the configured query types and the unique query titles which had successful matches are selected

/namespace:identifier matches at least http://qut.bio2rdf.org/query:construct and http://qut.bio2rdf.org/query:taglabels
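The matching step above can be sketched as follows. This is a minimal illustration, not the servlet's actual code: the regular expressions and the third query title (query:example-search) are assumptions made up for the example; only the construct and taglabels titles come from the slides.

```python
import re

# Hypothetical query type table: query title -> input regular expression.
# The patterns are illustrative assumptions, not the real configuration.
QUERY_TYPES = {
    "http://qut.bio2rdf.org/query:construct": r"^/?([\w-]+):(.+)$",
    "http://qut.bio2rdf.org/query:taglabels": r"^/?([\w-]+):(.+)$",
    "http://qut.bio2rdf.org/query:example-search": r"^/?search/(.+)$",
}


def match_query_types(query_string):
    """Return {query title: matching groups} for each query type whose regex matches."""
    matches = {}
    for title, pattern in QUERY_TYPES.items():
        m = re.match(pattern, query_string)
        if m:
            matches[title] = m.groups()
    return matches


selected = match_query_types("/namespace:identifier")
```

Here `selected` holds the construct and taglabels titles, each with the matching groups that later steps use as the namespace and identifier.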

URI resolution step

For each of the query types a namespace test is applied to determine which regular expression matching groups are relevant, and whether the query type matches the given namespace

URI resolution step

Namespace test:

Is the query type specific to namespaces? If false, include the query type.

See CUSTOM_QUERY_NAMESPACE_PROVIDER_SPECIFIC

If so, is the query type relevant to all namespaces? If true, include the query type.

See CUSTOM_QUERY_HANDLE_ALL_NAMESPACES

If not, check whether the query string matching groups, at the matching group numbers declared for the query type, matched either any or all of the query type's namespaces, as configured.

See CUSTOM_QUERY_NAMESPACES_TO_HANDLE, CUSTOM_QUERY_NAMESPACE_INPUT_INDEXES, and CUSTOM_QUERY_NAMESPACE_MATCH_METHOD


Both query:construct and query:taglabels are relevant to all namespaces, and both take the namespace from the first matching group. Since each has only one matching group that is a namespace, the match method is not relevant.
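The namespace test can be read as a small decision function. The sketch below is an assumption about how the three checks fit together; the `QueryType` class and its field names are hypothetical and only loosely mirror the configuration keys named in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class QueryType:
    # Hypothetical fields, loosely mirroring the configuration keys above.
    namespace_specific: bool = False        # cf. CUSTOM_QUERY_NAMESPACE_PROVIDER_SPECIFIC
    handle_all_namespaces: bool = True      # cf. CUSTOM_QUERY_HANDLE_ALL_NAMESPACES
    namespaces_to_handle: set = field(default_factory=set)
    namespace_input_indexes: tuple = (1,)   # 1-based matching group numbers
    namespace_match_method: str = "any"     # "any" or "all"


def passes_namespace_test(query_type, groups):
    """Apply the three checks in order: not specific, handles all, any/all match."""
    if not query_type.namespace_specific:
        return True
    if query_type.handle_all_namespaces:
        return True
    found = [
        groups[index - 1] in query_type.namespaces_to_handle
        for index in query_type.namespace_input_indexes
    ]
    if query_type.namespace_match_method == "all":
        return all(found)
    return any(found)
```

With a single namespace matching group, "any" and "all" give the same answer, which is why the match method does not matter for construct and taglabels.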

URI resolution step

For each of the chosen query types, get a list of providers which handle the query title

If a query type is namespace specific, filter its list of providers based on whether they match any or all of the namespaces according to the query title namespace matching configuration. This time the inclusion is based on the namespace test with the list of namespaces configured for the provider


The query titles construct and taglabels were chosen, so they are now matched against the total list of providers to gain an initial list

The construct query is namespace specific, so only construct providers which handle the given namespace will be included, whereas the taglabels query is not namespace specific, so any taglabels providers will be included in the final provider list

URI resolution step

Any of the providers which were defined as default and which handle the given query type are also included at this stage, without regard to the namespaces.

Default providers are intended to make it simpler to configure intermediate servers without having to know about all of the known namespaces
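Provider selection, including default providers, might look roughly like this. It is a simplified sketch under stated assumptions: the `Provider` class and provider names are hypothetical, and a single namespace is tested instead of the full any/all matching configuration.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    # Hypothetical provider record for illustration only.
    name: str
    query_titles: set
    namespaces: set = field(default_factory=set)
    is_default: bool = False


def select_providers(providers, query_title, namespace, namespace_specific):
    """Keep providers that handle the query title and pass the namespace filter."""
    chosen = []
    for provider in providers:
        if query_title not in provider.query_titles:
            continue
        # Default providers are included without regard to namespaces.
        if (provider.is_default
                or not namespace_specific
                or namespace in provider.namespaces):
            chosen.append(provider)
    return chosen


providers = [
    Provider("construct-geneid", {"construct"}, {"geneid"}),
    Provider("construct-uniprot", {"construct"}, {"uniprot"}),
    Provider("taglabels-any", {"taglabels"}),
    Provider("default-mirror", {"construct"}, is_default=True),
]
construct_providers = select_providers(providers, "construct", "geneid", True)
taglabels_providers = select_providers(providers, "taglabels", "geneid", False)
```

In this example the construct query keeps the geneid provider and the default provider, while the taglabels query keeps its provider regardless of namespace.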

URI resolution step

For each of the query types, and for each of the providers which remain:

If a provider needs a redirect, as opposed to proxying communication, replace any template variables on the endpoint URL and send an HTTP 302 redirect response as the result
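As a rough illustration of the redirect case, the sketch below fills the template variables in an endpoint URL and returns a status code with a Location header. The function name, the endpoint URL and the return shape are assumptions for the example; the real servlet would write the response through its own HTTP API.

```python
from string import Template


def redirect_response(endpoint_url_template, variables):
    """Fill ${...} variables in the endpoint URL and describe a 302 response."""
    # safe_substitute leaves unknown variables in place instead of raising.
    location = Template(endpoint_url_template).safe_substitute(variables)
    return 302, {"Location": location}


status, headers = redirect_response(
    "http://example.org/view?id=${input_2}",  # hypothetical provider endpoint
    {"input_1": "namespace", "input_2": "identifier"},
)
```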

URI resolution step

If there are no redirects, generate the actual queries based on the templates given in the query types and the normalisation rules for the provider

The normalisation rules are matched against the template variables and replaced as necessary in order to make them specific to the relevant endpoint

Query templates

Some of the template variables include:

${graphStart} and ${graphEnd} to allow for SPARQL graphs, or the lack of a graph

${endpointSpecificUri} to allow for the SPARQL endpoint to contain a different URI to the one which is desired

${input_1}, ${input_2}, etc., which correspond to the matching groups from the query type. ${input_1} is typically the namespace, although this is configurable.

Query templates

Some more template variables include:

${graphUri} if it doesn't exist it is empty

${endpointUrl} this can also have template variables inside it, which are replaced before the redirect check phase

${defaultHostAddress} the standard base URL for this configuration, i.e. http://bio2rdf.org/

${realHostName} the actual host being used, e.g. http://mymirror.local/bio2rdf/

Query templates

Some template variables are available in their encoded forms. For example:

${urlEncoded_endpointSpecificUri} a fully percent encoded version of the URI

${inputUrlEncoded_normalisedStandardUri} a version of the standard URI as given by the query type with the ${input_NN} sections internally percent encoded

${xmlEncoded_inputUrlEncoded_normalisedStandardUri} for use in RDF/XML templates

${inputUrlEncoded_privatelowercase_endpointSpecificUri} for use with endpoints which contain percent encoded URIs that have the private ${input_NN} variables completely in lowercase without regard to the case given in the ${queryString}

${queryString} The original input string which matched against the query type regular expression
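To make the template variables concrete, here is a minimal substitution sketch using Python's `string.Template`, which happens to use the same `${name}` syntax. The SPARQL template text and the expansion chosen for ${graphStart} and ${graphEnd} are assumptions for illustration, not the templates shipped with the configuration.

```python
from string import Template


def build_variables(query_string, groups, graph_uri=""):
    """Build a variable map from the matching groups (assumed expansions)."""
    variables = {"queryString": query_string, "graphUri": graph_uri}
    for number, value in enumerate(groups, start=1):
        variables[f"input_{number}"] = value
    # Assumption: the graph variables wrap the pattern in GRAPH <...> { }
    # when a graph is configured, and are empty otherwise.
    variables["graphStart"] = f"GRAPH <{graph_uri}> {{" if graph_uri else ""
    variables["graphEnd"] = "}" if graph_uri else ""
    return variables


# Hypothetical query template for a construct-style query type.
TEMPLATE = Template(
    "CONSTRUCT { ?s ?p ?o } WHERE { ${graphStart} ?s ?p ?o . "
    "FILTER(?s = <http://bio2rdf.org/${input_1}:${input_2}>) ${graphEnd} }"
)

groups = ("namespace1", "identifier1")
no_graph_query = TEMPLATE.substitute(
    build_variables("namespace1:identifier1", groups))
graph_query = TEMPLATE.substitute(
    build_variables("namespace1:identifier1", groups, "http://example.org/graph"))
```

The same template yields a query with or without a GRAPH clause, depending only on whether the provider declares a graph URI.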

Query templates example

For http://bio2rdf.org/namespace1:identifier1

${queryString} = namespace1:identifier1

The other variables will be different depending on whether the construct provider for namespace1 is being contacted, or

URI resolution step

For each query, check its communication method

If it is declared as nocommunication, ignore it for now. It will be used with the static RDF/XML insertion stage

If it is declared as httpgeturl then perform HTTP resolution on the provider endpoint URL after replacing the relevant template variables

URI resolution step

If the communication method is declared as httppostsparql then POST the replaced query template to the endpoint URL

The SPARQL query is matched to the endpoint at this stage by the use of a query type that contains the basic structure of the query, and normalisation rules to make sure the URIs in the SPARQL match the endpoint and graph combination
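The three communication methods could be dispatched along these lines. The sketch only builds the request objects and sends nothing; the function name, endpoint URL and Accept header are assumptions, while sending the query as a form-encoded `query` parameter follows the SPARQL protocol.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(method, endpoint_url, query=None):
    """Return a urllib Request for the communication method, or None to skip."""
    if method == "nocommunication":
        return None  # handled later, at the static RDF/XML insertion stage
    if method == "httpgeturl":
        return Request(endpoint_url, method="GET")
    if method == "httppostsparql":
        body = urlencode({"query": query}).encode("utf-8")
        return Request(
            endpoint_url,
            data=body,
            method="POST",
            headers={
                "Content-Type": "application/x-www-form-urlencoded",
                "Accept": "application/rdf+xml",  # assumed preferred format
            },
        )
    raise ValueError(f"unknown communication method: {method}")


post = build_request("httppostsparql", "http://example.org/sparql", "ASK {}")
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call.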

URI resolution step

The results of the httpgeturl and httppostsparql HTTP requests are passed through the list of RDF normalisation rules configured for the chosen provider, so that they are normalised to the desired output format

More than one provider may be attached to the same endpoint and graph combination, so a given URI may resolve using more than one query on the same endpoint and graph depending on the query needs
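Normalising results can be pictured as an ordered list of regular expression replacements applied to the returned RDF text. The rule below, including the target URI pattern, is a hypothetical example and not one of the configured rules.

```python
import re

# Hypothetical rule list: (pattern found in endpoint output, replacement).
RULES = [
    (r"http://dbpedia\.org/resource/", "http://bio2rdf.org/dbpedia:"),
]


def normalise(rdf_text, rules):
    """Apply each normalisation rule in order to the returned RDF text."""
    for pattern, replacement in rules:
        rdf_text = re.sub(pattern, replacement, rdf_text)
    return rdf_text


raw = '<rdf:Description rdf:about="http://dbpedia.org/resource/Aspirin"/>'
normalised = normalise(raw, RULES)
```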

Accessible databases

Each of the following databases has normalisation rules which normalise it back to bio2rdf.org URIs:

Dbpedia, Drugbank, LinkedCT, HCLS KB/Neurocommons, Diseasome, Dailymed, Bioguid DOI

These, together with the 40+ Bio2RDF SPARQL endpoints, form a very large accessible knowledge base!

RDF accessible configuration

The configuration, including all query types, RDF normalisation rules, providers and known namespaces is available in RDF

http://qut.bio2rdf.org/admin/configuration/rdfxml

Integrating user extensions

A clear use case for a system where arbitrary queries can be performed as part of a single URI resolution is to integrate novel datasources such as user tags

The only requirement is that the query type relevant to the tags, etc., matches the regular expression for the URI it is extending. For example, http://qut.bio2rdf.org/query:taglabels and http://qut.bio2rdf.org/query:construct both have regular expressions that match the basic http://bio2rdf.org/namespace:identifier URI

Future work

Content negotiation between RDF formats

HTML formatted results for easy browsing, possibly using Pubby as the rendering engine

Paged SPARQL calls using OFFSET and LIMIT

Alternative configurations for Dbpedia, SharedNames etc. that don't require http://bio2rdf.org/ as the base URI and have different basic queries

Import configuration from RDF similar to the current configuration output
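For the paged SPARQL item in the list above, one possible shape is a generator that appends LIMIT and OFFSET to a base query. This is only a sketch of the idea and not a planned design; the function name and parameters are hypothetical.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield the base query once per page with LIMIT and OFFSET appended."""
    for page in range(max_pages):
        yield f"{base_query} LIMIT {page_size} OFFSET {page * page_size}"


pages = list(paged_queries("SELECT * WHERE { ?s ?p ?o } ORDER BY ?s", 100, 2))
```

The base query includes an ORDER BY because SPARQL gives no guarantee of stable ordering between requests without one.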

Future work

Provide more pipes to perform integrated actions without having to put HTTP SPARQL requests into a workflow system when a URI resolution can perform the query in a distributed and normalised manner more efficiently

Bring together the current distributed efforts to provide a complete HTML redirection registry so that a large percentage of Bio2RDF namespaces can be redirected with http://bio2rdf.org/html/namespace:identifier

Form ontologies describing the query type, provider, rdf normalisation rule, namespace paradigm

Future work

Integrate http://rdf.myexperiment.org/sparql and similar workflow RDF endpoints so that scientific workflows can be linked to their data cleanly, and user enhancements such as tags and publications are cleanly integrated with the actual datasources they were derived from
