The Centralized Life Sciences Data (CLSD) service

Michael Grobe
Scientific Data Services

Research Computing
University Information Technology Services

Indiana University at Indianapolis
([email protected])

January 2007

Outline

Basic genome science processes and vocabulary

Basic relational algebra

Simple SQL as an expression of the relational algebra

DB2 and the Federated Server

CLSD data sources: “relationalized”, mirrored, and federated

Accessing CLSD

Directions for possible future work:
• Adding data sources
• Integrating more completely with the TeraGrid
• Integrating with other Grids

Questions, suggestions

Some chemistry

A “polymer” is a chemical composed of many similar units, e.g. polyvinyl chloride, starches, etc.

DNA is a (usually double-stranded) polymer composed of nucleotides, each carrying one of four bases:

Thymine, Adenine, Cytosine, and Guanine

DNA carries genetic information. Individual units of genetic information are stored in individual (possibly quite long) segments of DNA.

RNA is a (usually single-stranded) polymer composed of nucleotides, each carrying one of four bases:

Uracil, Adenine, Cytosine, and Guanine

There are many varieties of RNA (mRNA, snRNA, rRNA, snoRNA, etc.), and they serve different functions within a cell. For example, RNA “transfers” genetic information, catalyses reactions, and otherwise assists or interferes with reactions.

Some more chemistry

Polymers are synthesized by catalysts called “polymerases” in a process called “polymerization.”

Proteins are polymers composed of (over 20 different kinds of) amino acids, such as:

Methionine (M), Isoleucine (I), Cysteine (C), Histidine (H), Alanine (A), Glutamic acid (E), Leucine (L), etc.

Proteins:
• provide structure:
    • microfilaments (polymers of actin),
    • microtubules (polymers of tubulins),
    • channels through the cell membrane, etc.
• catalyse and co-catalyse reactions, as “enzymes,”
• bind with DNA to enhance or inhibit “transcription” and “translation”,
• are sometimes marked for transport or degradation.

Protein primary, secondary and tertiary structures are important.

Proteins are degraded within proteasomes.

Genetic material: 2 meters of DNA packaged into less than 1.4 microns

(From Atherly, et al., 1999)

The central model of molecular genetics

DNA can be reliably replicated during the process of cell division, by DNA-dependent DNA polymerases.

DNA can be “transcribed” to messenger RNA (mRNA) by DNA-dependent RNA polymerases. Transcription takes place in the nucleus (or equivalent).

mRNA is transported to the cytoplasm where it is used as a template for creating proteins by “ribosomes” in a process called “translation.”

The translation process produces 1 amino acid for each group of 3 consecutive bases (a “triplet,” or codon) in the sequence.

The function mapping each of the 64 possible triplets to an amino acid is the “genetic code.”
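The triplet-to-amino-acid mapping can be sketched as a simple table lookup. This is only an illustration: the class name, the sample sequence, and the choice to write codons in the DNA alphabet (T rather than U) are assumptions, and the table holds just six of the 64 codons from the standard genetic code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

class CodonTranslator {
    // Partial codon table (standard genetic code, DNA alphabet).
    // Only six of the 64 triplets are listed; this is an illustrative subset.
    private static final Map<String, String> CODE = new HashMap<>();
    static {
        CODE.put("ACT", "Thr");
        CODE.put("GAA", "Glu");
        CODE.put("GAC", "Asp");
        CODE.put("CTG", "Leu");
        CODE.put("ATT", "Ile");
        CODE.put("CCT", "Pro");
    }

    // Reads the sequence three bases at a time and looks each triplet up.
    // Triplets missing from the partial table are reported as "???";
    // an incomplete trailing triplet is ignored.
    static String translate(String dna) {
        StringJoiner protein = new StringJoiner("-");
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            String codon = dna.substring(i, i + 3);
            protein.add(CODE.getOrDefault(codon, "???"));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ACTGAACTGATT")); // prints Thr-Glu-Leu-Ile
    }
}
```

A full implementation would carry all 64 triplets, including the stop codons that end translation.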

Ribosomes are complexes of RNA and protein.

The central model within the cell

Diagram from: http://www.ncbi.nih.gov/About/primer/images/proteinsynth4.GIF

(Don’t forget about degradation and recycling of AAs.)

The central model in more detail

(Graphics of DNA and RNA from Atherly, et al. 1999)

Mutations and polymorphisms

               Nucleotide sequence    Translated AA sequence
Wildtype:      ACTGAACTGATT           Thr-Glu-Leu-Ile
Substitution:  ACTGACCTGATT           Thr-Asp-Leu-Ile
Deletion:      ACTCTGATT              Thr-Leu-Ile
Insertion:     ACTGAACCTGAACTGATT     Thr-Glu-Pro-Glu-Leu-Ile

If mutations like these occur in genetic material within oocytes, they may be transmitted to offspring, and define “polymorphic” gene variations.

A Single Nucleotide Polymorphism (SNP) is a variation where one base is changed and passed on to offspring (and occurs with sufficient frequency).

A Deletion/Insertion Polymorphism (DIP) is a variation where multiple bases have been removed or inserted into a sequence.

dbSNP is a database of SNPs and DIPs containing millions of entries, and over 120K unique sequences that are inserted or deleted.
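As a small illustration of a single-base variation, the sketch below compares two equal-length sequences and reports the 1-based positions where they differ. The class and method names are hypothetical. It finds substitutions only: it does not detect insertions or deletions, and it says nothing about the population frequency that makes a variation a SNP.

```java
import java.util.ArrayList;
import java.util.List;

class SubstitutionFinder {
    // Returns the 1-based positions at which two equal-length sequences differ.
    static List<Integer> substitutionPositions(String reference, String variant) {
        if (reference.length() != variant.length()) {
            throw new IllegalArgumentException(
                "Sequences differ in length; an insertion or deletion is involved.");
        }
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < reference.length(); i++) {
            if (reference.charAt(i) != variant.charAt(i)) {
                positions.add(i + 1);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Wildtype and substitution sequences from the table on this slide.
        System.out.println(substitutionPositions("ACTGAACTGATT", "ACTGACCTGATT")); // prints [6]
    }
}
```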

Scale of human genome data

Total number of bases: 3.2Gbp (DNA from one half of one chromosome (chromatid) from each of 24 chromosomes: 22 autosomal chromosome pairs plus the sex chromosomes.)

Percentage of genome consisting of protein coding genes: < 2%

Average gene length: ~3Kbp (but up to 2.4Mbp)

Average exon length: 200bp

Average protein length: 500-600AA

Percentage of “junk” DNA: often said to be ~50%

Percentage of “junk” DNA now suspected to be transcribed (the “dark matter” of the genome): ~50 to 100%

Some of that “junk” is transcribed into microRNA (miRNA), which negatively regulates translation.

Process control: cancer-related reaction pathways from Hanahan, et al.

Basic relational algebra

The relational algebra operates on relations, which are sets of tuples of the same arity, which is to say, collections of lists of the same length. Here are two 4-tuples:

( 1, 2, 3, 4 )
( 8, 7, 9, 4 )

Relations are commonly represented as tables.

There are 5 primitive operations within the relational algebra:

Projection: extract specific columns from a relation

Selection: extract specific rows

Set union: create a new table composed of all the rows of two other tables

Set difference: remove the rows in one relation that appear in another

Cartesian product: “multiply” two tables to create a third

Cartesian product in more detail

Relation 1 (arity 4; length 3)

8 7 9 1
1 2 3 4
7 6 2 3

Relation 2 (arity 3; length 2)

3 4 7
1 9 8

Cartesian product (arity: 4 + 3; length: 3 * 2)

8 7 9 1 3 4 7
8 7 9 1 1 9 8
1 2 3 4 3 4 7
1 2 3 4 1 9 8
7 6 2 3 3 4 7
7 6 2 3 1 9 8
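The construction above can be sketched in a few lines of Java: every tuple of the first relation is concatenated with every tuple of the second. The class and method names are illustrative, and representing a relation as a list of int arrays is an assumption made for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CartesianProduct {
    // Pairs every tuple of r1 with every tuple of r2.
    // Result arity = arity(r1) + arity(r2); result length = length(r1) * length(r2).
    static List<int[]> product(List<int[]> r1, List<int[]> r2) {
        List<int[]> result = new ArrayList<>();
        for (int[] a : r1) {
            for (int[] b : r2) {
                int[] row = Arrays.copyOf(a, a.length + b.length);
                System.arraycopy(b, 0, row, a.length, b.length);
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> relation1 = Arrays.asList(
            new int[] {8, 7, 9, 1}, new int[] {1, 2, 3, 4}, new int[] {7, 6, 2, 3});
        List<int[]> relation2 = Arrays.asList(
            new int[] {3, 4, 7}, new int[] {1, 9, 8});
        for (int[] row : product(relation1, relation2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```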

Relational databases and query languages

Database management systems based on the relational algebra were described by Edgar F. Codd, working for IBM, in the early 1970s.

Codd’s formulation included:

• indexes and keys,
• decomposition into normal forms, and
• integrity constraints.

Multiple languages and interfaces were developed to query and modify collections of relations, among them the Structured English Query Language, SEQUEL, developed by Chamberlin and Boyce.

SQL as an implementation of the relational algebra

The most successful such language, SQL, was based on SEQUEL. SQL requires that each relation has a tablename, and each tuple position has a “fieldname”:

Players (arity 4; length 3)

Player  Innings  Hits  Teamnumber
8       7        9     1
1       2        3     4
7       6        2     3

Teams (arity 3; length 2)

t_num  games  rank
3      4      7
1      9      8

SQL as an implementation of the relational algebra (continued)

SQL commands map to the relational primitives as follows, where “*” stands for all fields in a table:

Projection
    select fieldname_list from tablename
    ex: select t_num, rank from Teams

Selection
    select * from tablename where <logical expression>
    ex: select * from Players where Teamnumber = 1

Union
    (select fieldname_list from tablename1)
    union
    (select fieldname_list from tablename2)
    (use “union all” to keep duplicates)

Set difference
    (select * from tablename1) except (select * from tablename2)

Cartesian product
    select * from tablename1, tablename2

Note that SQL does not specify how to perform a query; only what the result should be. It is a “declarative,” rather than “procedural,” language.

The relational join operation

An SQL “join” is a Cartesian product followed by a selection, as in:

select * from Players, Teams where Players.Teamnumber = Teams.t_num

which keeps only the 2 rows of the Cartesian product table where Teamnumber equals t_num (marked “<--” here; red in the original slide):

Player  Innings  Hits  Teamnumber  t_num  games  rank
8       7        9     1           3      4      7
8       7        9     1           1      9      8     <--
1       2        3     4           3      4      7
1       2        3     4           1      9      8
7       6        2     3           3      4      7     <--
7       6        2     3           1      9      8
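The “product followed by selection” definition can be written out directly. The sketch below is illustrative only: the names are hypothetical, and relations are modeled as lists of int arrays. A real database engine would normally avoid building the full product and use an index or hash join instead, which is exactly the freedom a declarative language leaves to the query planner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JoinSketch {
    // Cartesian product: concatenate every left tuple with every right tuple.
    static List<int[]> product(List<int[]> left, List<int[]> right) {
        List<int[]> result = new ArrayList<>();
        for (int[] a : left) {
            for (int[] b : right) {
                int[] row = Arrays.copyOf(a, a.length + b.length);
                System.arraycopy(b, 0, row, a.length, b.length);
                result.add(row);
            }
        }
        return result;
    }

    // Selection: keep only the rows whose two named columns hold equal values.
    // Column indexes are 0-based positions within the combined row.
    static List<int[]> select(List<int[]> rows, int colA, int colB) {
        List<int[]> kept = new ArrayList<>();
        for (int[] row : rows) {
            if (row[colA] == row[colB]) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<int[]> players = Arrays.asList(
            new int[] {8, 7, 9, 1}, new int[] {1, 2, 3, 4}, new int[] {7, 6, 2, 3});
        List<int[]> teams = Arrays.asList(
            new int[] {3, 4, 7}, new int[] {1, 9, 8});
        // Teamnumber is column 3 and t_num is column 4 of the combined row.
        for (int[] row : select(product(players, teams), 3, 4)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```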

IBM’s DB2 and WebSphere Federated Server, nee Information Integrator, nee DiscoveryLink

DB2 is a fully-featured relational database system that can house and serve large databases.

Data is usually imported in relational form, structured as rows composed of individual data values, possibly identified by unique IDs (keys).

DB2 can also access data in tables managed by other, usually physically remote, database management systems, such as Oracle, MySQL or DB2.

This process is known as “data federation.”

DB2 can also federate some external resources that are not normally accessed as relational tables (e.g. Blast). Such resources are transformed, or “relationalized” on-the-fly by “wrappers”.

Once these resources have been registered with their wrappers they may be referred to within SQL queries as is any other resource.

WFS diagram from Del Prete

Some WFS jargon

Wrapper: a library to access a particular class of data sources or protocols.

Each wrapper contains information about data source characteristics. There are BLAST and PubMed wrappers, and now a “generic Script wrapper” that talks to user scripts.

Server: represents a specific data source (user mappings may be required for authentication).

Nickname: a local table name (alias) for data on a server (mapped to rows and columns).

A nickname looks like a table, but links to a server, which links to a wrapper/data source, where the wrapper knows how to process the data from the source.

Using NCBI data within DB2: More than just mirroring

Mirroring usually implies maintaining exact copies of data sources.

Most data mirrored by CLSD must not only be copied, but also inserted into the CLSD relational structure.

This is accomplished by a series of scripts that:

• Download the data from its external site,
• Convert it to a form that can be used to update CLSD tables,
• Insert the data into tables, and
• Monitor the overall process to identify and log errors.

These scripts are run regularly from crontab entries, and monitoring results are examined after every run.

CLSD “relationalized” data sources

BIND -- Pathways, Gene interactions

ENZYME -- Enzyme nomenclature

ePCR -- ePCR results of UniSTS vs Homo sapiens

KEGG data sources:
    LIGAND -- Pathways, Reactions, & Compounds
    PATHWAY -- Pathway map coordinates

NCBI data sources:
    LocusLink -- Genetic Loci. (LocusLink has been inactive since July 1, 2005, when it was retired in favor of Entrez Gene.)
    UniGene -- Gene clusters

SGD -- Saccharomyces Genome Database

KEGG datasource info

PATHWAY: 42,273 pathways generated from 306 reference pathways

LIGAND: 14,238 compounds, 4,111 drugs, 10,951 glycans, 6,810 reactions, 7,127 reactant pairs

CLSD federated data sources

Federated NCBI data sources (subject to hit rate throttling):

• Nucleotide -- Nucleotide sequences
• PubMed -- Journal abstracts

Federated local mirrors of NCBI data sources (not throttled):

• Blast (updated monthly) is mirrored by UITS
• dbSNP (updated at major builds) is mirrored by IUSM

Some KEGG resources are federated via the FS KEGG user-defined functions

Examples from the CLSD web site

http://scidata.iu.edu/CLSD/sql-in-db2.shtml

• To get a list of genes containing "brain" in their LOCUS_NAME in dbSNP126_shared:

select * from DBSNP126_SHARED.GENEIDTONAME where locus_name like '%brain%'

• To get a list of Bind Genes and their species:

select GeneNameA,Organism from bind.bind_interaction

• To get a list of genes mentioning "HUMAN" in their descriptions in KEGG:

select * from KEGG.GENE where description like '%HUMAN%'

• To get some info from PubMed:

select PMID, ArticleTitle FROM NCBI.pmarticles where entrez.contains (ArticleTitle, 'granulation') = 1 AND entrez.contains (PubDate, '1992') = 1

BLAST: Both mirrored and federated

NCBI Blast is typically accessed via a web page at NCBI, or some mirrored site.

Data is returned in a typical web interface format suitable for users.

Within CLSD, BLAST is accessed via an SQL query and data is returned as a table that can be manipulated as is any other DB2 table.

For example, here is an SQL query that invokes a blastall process running on libra00 from within DB2:

select GB_ACC_NUM, description, e_value from ncbi.BLASTN_NT where BlastSeq = 'AGTACTAGCTAGCTAGCTACTAGCTGACTGACTGACTGATGCATCGATGATGC'

The local version of blastall conducts the search and returns results encoded within XML (by specifying the -m 7 parameter).

The DB2 federation software converts the XML encoded results into something like this:

GB_ACC_NUM  DESCRIPTION                                           E_VALUE
(VARCHAR)   (VARCHAR)                                             (DOUBLE)

AE003644    Drosophila melanogaster chromosome 2L, section 53     0.00666475
            of 83 of the complete sequence

AE003410    Drosophila melanogaster, chromosome 2L, region        0.00666475
            34C4-36A7 (Adh region), section 4 of 10 of the comple

AC092228    Drosophila melanogaster, chromosome 2L, region        0.00666475
            35X-35X, BAC clone BACR21J17, complete sequence

AP008207    Oryza sativa (japonica cultivar-group) genomic DNA,   0.0263349
            chromosome 1, complete sequence

AP003197    Oryza sativa (japonica cultivar-group) genomic DNA,   0.0263349
            chromosome 1, BAC clone:B1015E06

AP003105    Human DNA sequence from chromosome 1, putative        0.0263349
            argumentativeness gene GROBE1

Modifying BLAST search settings via SQL

Parameters sent to blastall can be set by using equality comparisons as assignment statements within SQL conditionals, as in:

select Score, E_Value, HSP_Info, HSP_Q_Seq, HSP_H_Seq

from ncbi.BLASTN_NT

where BlastSeq = 'gagttgtcaatggcgagg'

and gapcost=8 and E_Value < .0005

which will pass gapcost and e-value settings on to blastall.

BLAST data sources available via CLSD

Here is a list showing which search types are supported by the DB2 BLAST wrapper within CLSD.

BLAST search type: Data sources

BLASTN: NT, EST_HUMAN, EST_MOUSE, and EST_OTHER
    A nucleotide sequence is compared with the contents of a nucleotide sequence database.

BLASTP: NR, SP
    An amino acid sequence is compared with the contents of an amino acid database.

BLASTX: NR, SP
    A nucleotide sequence is compared with the contents of an amino acid sequence database. Query is translated in all six reading frames.

Examples from IBM

Query 1: Given a search sequence, search nucleotide (NT), and return the hits for only those sequences not associated with a Cloning Vector. For each hit, display the Cluster ID and Title from UniGene, in addition to the Accession Number and E-Value. Only show the top 5 hits, based on the ones with the lowest E-values.

Select nt.GB_ACC_NUM, nt.DESCRIPTION, nt.E_VALUE, useq.CLUSTER_ID, ugen.TITLE
From ncbi.BLASTN_NT nt, unigene.SEQUENCE useq, unigene.GENERAL ugen
Where BLASTSEQ = 'GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCTGGCTAACACGGTGAAACCCCGTC'
And nt.DESCRIPTION not like '%cloning vector%'
And nt.GB_ACC_NUM = useq.ACC
And useq.CLUSTER_ID = ugen.CLUSTER_ID
Order by E_VALUE
FETCH FIRST 5 ROWS ONLY

User-defined functions (supplied by IBM)

There exist special functions for manipulating sequence patterns:

• LSPatternMatch
• LSPrositePattern

To get a list of (aspartate aminotransferase) BLAST results filtered by a (pyridoxal phosphate attachment site) pattern specified in PROSITE pattern language:

select gb_acc_num, HSP_H_SEQ
from ncbi.blastp_nr
where blastseq='MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTE\
AFKKDTNPKKINLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKEYATIIGI\
PEFYNKAIELALGKGSKRLAAKHNVTAQSISGTGALRIGAAFLAKFWQGNREI\
YIPSPSWGNHVAIFEHAGLPVNRYRYYDKDT'
and DB2LS.LSPatternMatch(HSP_H_SEQ, DB2LS.LSPrositePattern( '[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].' ) ) > 0

Note the use of the period (.) to terminate the PROSITE pattern, and that the LSPatternMatch function returns the character position of the left-most substring matching the pattern, or zero if there is no match.
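To make the described behavior concrete, here is a rough Java sketch that converts a PROSITE pattern into a regular expression and returns the position of the left-most match, or zero if there is none. It is not IBM's implementation: the class and method names are hypothetical, positions are assumed to be 1-based, and only a subset of the PROSITE syntax is handled (x, [..], {..}, repetition counts, and the terminating period; the “<” and “>” anchors are not supported).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrositeMatcher {
    // Converts a subset of PROSITE pattern syntax to a Java regular expression.
    static String toRegex(String prosite) {
        String p = prosite.trim();
        if (p.endsWith(".")) {
            p = p.substring(0, p.length() - 1); // drop the terminating period
        }
        StringBuilder regex = new StringBuilder();
        for (String element : p.split("-")) {
            String e = element
                .replace("{", "[^").replace("}", "]") // {ABC}: any residue except A, B, C
                .replace("(", "{").replace(")", "}")  // (n) or (n,m): repetition count
                .replace('x', '.');                   // x: any residue
            regex.append(e);
        }
        return regex.toString();
    }

    // Returns the 1-based position of the left-most match, or 0 if none.
    static int firstMatchPosition(String sequence, String prosite) {
        Matcher m = Pattern.compile(toRegex(prosite)).matcher(sequence);
        return m.find() ? m.start() + 1 : 0;
    }

    public static void main(String[] args) {
        String pattern = "[GS]-[LIVMFYTAC]-[GSTA]-K-x(2)-[GSALVN].";
        System.out.println(toRegex(pattern));
        // The sequence below is a made-up test string, not real protein data.
        System.out.println(firstMatchPosition("AAGLGKAAGZ", pattern));
    }
}
```

Because the function returns zero when nothing matches, a test such as “> 0” in a where clause acts as a filter that keeps only the matching rows.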

Accessing CLSD: getting an account

To access CLSD you must have an account on the Libra Cluster at IU (aka libra00.uits.iu.edu).

If you don’t have an account and are associated with Indiana University, request an account by filling out a Research Systems Account Application at

http://rac.uits.iu.edu/rats/forms/application.php.

In the comments section of the account request, add that you need a local and persistent password for use with CLSD.

Once you have a Libra account, send email to SDS at data @ indiana.edu and request instructions for defining a local and persistent password for use with CLSD.

TeraGrid users should send e-mail to SDS at data @ indiana.edu explaining how CLSD will be used, and describing their TeraGrid activities. SDS will then arrange for an appropriate Libra account and send instructions for defining a suitable password.

Accessing CLSD: options

DB2 can be accessed in a variety of ways:

• DB2 Command Line Processor (Unix, Windows)
• DB2 Control Center (wherever JRE is running)
• DB2 driver for Perl DBI
• DB2 drivers for the Java Database Connectivity (JDBC) Application Program Interface (API), especially the JDBC Universal Driver
• Demonstration Web page (invokes a Java servlet that uses JDBC): http://discover.uits.indiana.edu:8421/access/
• Demonstration WebService (invoked as a function call via JAX-RPC): http://discover.uits.indiana.edu:8421/axis/CLSDservice.jws?wsdl
• Demonstration Web page (invokes a Java servlet that invokes the CLSD WebService): http://discover.uits.indiana.edu:8421/access/index-for-service.html
• Experimental WSRF Resource (using WSRF within a GT4 container)
• Experimental OGSA-DAI service (running within a GT4 container)

JDBC access

Connect to the CLSD:

Class.forName( "com.ibm.db2.jcc.DB2Driver" );

con = DriverManager.getConnection("jdbc:db2://libra00.uits.iu.edu:50000/clsd2",

accountName, accountPassword );

Prepare a query, send it to the db, and receive a result:

statement = con.createStatement();

resultSet = statement.executeQuery( query );

Get some query meta-data (column labels and column data types):

ResultSetMetaData rsmd = resultSet.getMetaData();

result = rsmd.getColumnLabel( colCount );

result2 = rsmd.getColumnTypeName( colCount );

JDBC access (continued)

Get a row of data:

for( int colCount = 1; colCount <= numcols; colCount++ )
{
    String returnedString = ""; // Must be predefined.
    returnedString = resultSet.getString( colCount ) + "";
    out.println( "<td>" + returnedString + "</td>\n" );
}

Accessing CLSD thru a WebService (JAX-RPC)

The Java API for XML-based Remote Procedure Calls, or JAX-RPC, is a specification that defines a system for building distributed services (so-called “WebServices”) within the client-server model.

JAX-RPC makes it possible for a function invocation in a client like:

a_variable = function_name( parameter_list)

to cause the function, “function_name,” to run on a remote server and return a response containing the value to be assigned to the variable “a_variable”, and a function invocation in a client like:

returnString = queryCLSD( "select * from syscat.tables", "1", "5", "accountName", "accountPassword", "table" )

will return a (possibly very long) string containing the response to the query (given that various linkages have been prearranged).

Outline of the CLSDservice

public class CLSDservice
{
    // Full source at:
    // http://scidata.iu.edu/CLSD/examples/CLSDservice.jws.txt
    public String queryCLSD( String query, String startingRowToPrint,
                             String maxRows, String account,
                             String password, String format )
    {
        // Get a query string, etc. from the command line or Web
        // browser.

        // Declare JDBC drivers and connect to DB2.

        // Prepare a JDBC statement containing the SQL query, submit
        // it to DB2, and capture the returned JDBC result set.

        // Query result set metadata for column names and types to
        // return as the first row, and then collect the contents of
        // each data row.

        return theResponse;
    } // end queryCLSD
} // end Class CLSDservice

SOAP and WSDL

JAX-RPC uses SOAP and WSDL to establish the various linkages required to implement remote procedure calls.

SOAP messages are usually encoded as XML messages within HTTP requests where:

• A SOAP request is an HTTP POST request with an XML body.
• A SOAP response is an HTTP response header followed by an XML body.
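The shape of such a request can be sketched by assembling one by hand. The code below is a simplified illustration, not what Axis actually sends: the parameter element names (arg0, arg1, ...) are placeholders, real toolkits add type attributes and namespaces on the operation element, and the class and method names are hypothetical. The envelope namespace and the Content-Type and SOAPAction headers follow SOAP 1.1.

```java
import java.nio.charset.StandardCharsets;

class SoapRequestSketch {
    // Escapes the characters that would otherwise break the XML body.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Builds a minimal SOAP 1.1 envelope that calls one operation.
    // Element names arg0, arg1, ... are illustrative placeholders.
    static String buildEnvelope(String operation, String... parameters) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<soapenv:Envelope xmlns:soapenv=")
           .append("\"http://schemas.xmlsoap.org/soap/envelope/\">\n");
        xml.append("  <soapenv:Body>\n");
        xml.append("    <").append(operation).append(">\n");
        for (int i = 0; i < parameters.length; i++) {
            xml.append("      <arg").append(i).append('>')
               .append(escapeXml(parameters[i]))
               .append("</arg").append(i).append(">\n");
        }
        xml.append("    </").append(operation).append(">\n");
        xml.append("  </soapenv:Body>\n");
        xml.append("</soapenv:Envelope>\n");
        return xml.toString();
    }

    // Wraps the envelope in an HTTP POST request: headers, blank line, XML body.
    static String buildHttpRequest(String host, String path, String envelope) {
        int length = envelope.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Content-Type: text/xml; charset=utf-8\r\n"
             + "SOAPAction: \"\"\r\n"
             + "Content-Length: " + length + "\r\n"
             + "\r\n"
             + envelope;
    }

    public static void main(String[] args) {
        String envelope = buildEnvelope("queryCLSD",
            "select * from syscat.tables", "1", "5",
            "accountName", "accountPassword", "table");
        System.out.println(buildHttpRequest(
            "discover.uits.indiana.edu:8421", "/axis/CLSDservice.jws", envelope));
    }
}
```

In practice a toolkit such as Axis or SOAP::Lite generates and parses these messages, so client code rarely builds them directly.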

Such RPC functions are “exposed” as “operations” when described within web pages using the Web Services Description Language (WSDL).

Java command-line client to access CLSD via CLSDservice

import javax.xml.namespace.QName;
import org.apache.axis.client.Call;
import org.apache.axis.client.Service;

public class testCLSDClient
{
  public static void main(String [] args)
  {
    try
    {
      String endpoint =
        "http://discover.uits.indiana.edu:8421/axis/CLSDservice.jws";

      // Create an Axis call object aimed at the service endpoint.
      Service service = new Service();
      Call call = (Call) service.createCall();
      call.setTargetEndpointAddress( new java.net.URL( endpoint ) );
      call.setOperationName(
        new QName( "http://soapinterop.org/", "queryCLSD" ) );

      // Invoke the remote operation and print the returned string.
      String returnString = (String) call.invoke( new Object[] {
        "select * from syscat.tables", "1", "5",
        "accountName", "accountPassword", "table" } );
      System.out.println( returnString );
    }
    catch (Exception e)
    {
      System.err.println( e.toString() );
    }
  }
}

Perl command-line client to access CLSD via CLSDservice

#!perl -w

use SOAP::Lite;

# Set up the call to CLSD using SOAP.
$host = "discover.uits.indiana.edu";

$service = SOAP::Lite -> service( "http://$host:8421/axis/CLSDservice.jws?wsdl" );

# Make the call to CLSD.
$result = $service->queryCLSD( "select tabschema,tabname from syscat.tables", 1, 5, "DB2account", "password", "table" );

print $result;

OGSA

The Open Grid Services Architecture (OGSA) is an “architecture” for building computational grids.

In particular, OGSA “…defines a set of core capabilities and behaviors that address key concerns in Grid systems.” [2] It does not, however, implement or define how to implement such core capabilities.

OGSA is NOT layered or object oriented. However, both will be exploited naturally in some implementations.

OGSA provides an architecture for building services such as:

• “Service-Based distributed query processing,”
• “Grid Workflow”,
• “Grid Monitoring Architecture”
• etc.

OGSA-DAI

OGSA-Data Access and Integration (OGSA-DAI) is a very flexible and powerful data access framework that can be used within an OGSA grid environment.

It provides various data movement, virtualization, and manipulation services that transform the use of data into a higher-level workflow.

The OGSA-DAI client shown in the next slide uses the OGSA-DAI Client Toolkit to send a hard-coded query to CLSD (here known as the “DB2Resource”).

The Toolkit allows clients to use JDBC by creating a JDBC ResultSet object from an OGSA-DAI WebRowSet.

The response is encoded using XML and may be retrieved as a single string, or as individual fields by using individual JDBC calls as shown below.

Java command-line client to access CLSD via OGSA-DAI

public class queryCLSD
{
  public static void main(String[] args) throws Exception
  {
    // Create an instance of the data service.
    String handle = "http://localhost:8080/wsrf/services/ogsadai/DataService";
    String id = "DB2Resource";
    DataService service =
      GenericServiceFetcher.getInstance().getDataService( handle, id );

    // Define a request composed of one activity.
    SQLQuery query = new SQLQuery( "select tabschema,tabname from syscat.tables" );
    WebRowSet rowset = new WebRowSet( query.getOutput() );
    ActivityRequest request = new ActivityRequest();
    request.add( query );
    request.add( rowset );

Java command-line client to access CLSD via OGSA-DAI (continued)

    // Submit the request and retrieve results.
    Response response = service.perform( request );
    ResultSet result = rowset.getResultSet();
    ResultSetMetaData rsmd = result.getMetaData();
    int numCols = rsmd.getColumnCount();

    // Display each column from each row.
    while( result.next() )
    {
      for( int colCount = 1; colCount <= numCols; colCount++ )
      {
        System.out.print( " " + result.getString( colCount ) );
      }
      System.out.println();
    }
  }
}

This client displays a small part of the functionality provided by OGSA-DAI. In addition, an OGSA-DAI service can be configured to:

•operate on XML or text data sources, as well as relational data sources,

•perform a series of operations (also known as “activities”) as part of a single request,

•deliver results to a third party (via FTP, GridFTP, SMTP, etc.) or to another data service,

•deliver results asynchronously, which can be very useful for long-running requests, and

•utilize authentication methods supported by WSRF to provide grid-based security.

Also, exposing a database via OGSA-DAI makes it available for OGSA Distributed Query Processing (OGSA-DQP), so that its use may be further virtualized within the DQP model.

In some cases, however, OGSA-DAI and DQP may introduce performance penalties.

Current and possible directions

Adding data sources: mirrored and federated

• Requests for mirroring or federating will be gladly entertained.

• DB2 now provides a user-configurable script wrapper that connects to a remote DB2 daemon that can start any co-located arbitrary script and return data encoded in XML (restricted to one foreign key per table).

  Such a script could be built to relay any web resource that returns XML meeting key restrictions.

• Wrappers could be constructed to relay some OGSA-DAI resources.

Implementing the OGSA-DAI service in production mode.

Integrating with the TeraGrid

CLSD is currently accessible from the TeraGrid, but authentication is local. It may be possible to enforce TeraGrid-based X.509 authentication, using either WSRF or OGSA-DAI interfaces.

References:

– Atherly, Alan G., et al., The Science of Genetics, 1999.
– Apache Foundation, AXIS User’s Guide, http://ws.apache.org/axis/java/user-guide.html
– Codd, Edgar F., A Relational Model of Data for Large Shared Data Banks, http://www.acm.org/classics/nov95/toc.html (See also: http://en.wikipedia.org/wiki/Edgar_F._Codd)
– CLSD web page: http://rac.uits.iu.edu/clsd/
– Del Prete, Doug, Efficient access to Blast using IBM DB2 Information Integrator, http://www-03.ibm.com/industries/healthcare/doc/content/bin/blast.pdf
– Foster, Ian, et al., “The Open Grid Services Architecture, Version 1.5”.
– Sotomayor, Borja and Lisa Childers, Globus Toolkit 4: Programming Java Services
– Sundaram, Babu, Understanding WSRF, http://www-128.ibm.com/developerworks/edu/gr-dw-gr-wsrf1-i.html

Questions, comments, suggestions?