Peter Fox
Data Science – CSCI/ERTH/ITWS-4350/6350
Week 11, November 11, 2014
Academic Basis for Data and Information Science, Data Models, Schema, Data Tools and Data as Service Paradigms




Contents

• Reading

• Informatics

• Data models

• Schema

• Tools

• Markup languages

• Data as service

• How are the projects going?


Definitions (revisited)

• Data - are pieces of <x> that represent the qualitative or quantitative attributes of a variable or set of variables.

• Data (plural of "datum", which is seldom used) - are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables.

• Data - are often viewed as the lowest level of abstraction from which information and knowledge are derived

Definitions ctd.

• Information
– Representations (of facts? data?) in a form that lends itself to human use

• Knowledge
– … Meaning – but watch how this may become so very important


Data-Information-Knowledge Ecosystem

[Figure: the data, information, knowledge ecosystem between producers and consumers. Labels: Creation, Gathering; Presentation, Organization; Integration, Conversation; Context; Experience.]


Mind the gap

• As we aim to use modern technology to advance data science:

• There is often a gap between science and the underlying infrastructure and technology that is available

• Cyberinfrastructure is the new research environment(s) that support advanced data acquisition, data storage, data management, data integration, data mining, data visualization and other computing and information processing services over the Internet.

Informatics - information science includes the science of (data and) information, the practice of information processing, and the engineering of information systems. Informatics studies the structure, behavior, and interactions of natural and artificial systems that store, process and communicate (data and) information. It also develops its own conceptual and theoretical foundations. Since computers, individuals and organizations all process information, informatics has computational, cognitive and social aspects, including study of the social impact of information technologies. (Wikipedia)


A moment of history

• In the late 1950’s (actually around 1957-1958) the modern informatics term was coined

• It existed as one field for a while, then split into library science and computer science, which developed as separate fields and became disconnected

• Now coming back to be relevant to science

• Informatics IS NOT just having a scientist work with an “IT/ICT” person


Advertisement

• Spring 2015 – Xinformatics

• See last year: http://tw.rpi.edu/web/course/Xinformatics/2014


Library science

• Curates the artifacts of knowledge

• Organizes and manages them for consumers
– Cataloging and classification

• Preservation
– ‘maintaining or restoring access to artifacts, documents and records through the study, diagnosis, treatment and prevention of decay and damage’ (Wikipedia)

• Digital age
– Curation and preservation


Cognitive Science

• Cognitive science is an interdisciplinary study of the mind and intelligence

• It operates at the intersection of psychology, philosophy, computer science, linguistics, anthropology, and neuroscience.

• Of relevance for data and information science are three significant theoretical underpinnings
– mental representation,
– the nature of expertise,
– and intuition

• Very relevant to model, data/metadata choice


Social Science

• Branch of humanities

• Especially as it relates to networks of scientists

• Exploits sociology of groups, teams

• Cultural norms as well as discipline norms
– Modes of what and how rewards are given
– Between those who produce and those who consume data (and information)
– More


Information theory

• Semiotics, also called semiotic studies or semiology, is the study of sign processes (semiosis), or signification and communication, signs and symbols. It is usually divided into three branches:
– Syntactics: Relation of signs to each other in formal structures
– Semantics: Relation between signs and the things to which they refer; their denotata
– Pragmatics: Relation of signs to their impacts on those who use them


Note: we have theories for…

• Knowledge -> various forms of logic(s)

• Information (Shannon, Weaver, Peirce…)

• But not ‘Data’ (except for …)

– … discuss


Mealy’s Introduction

• “We do not, it seems, have a very clear and commonly agreed upon set of notions about data - either what they are, how they should be fed and cared for, or their relation to the design of programming languages and operating systems. This paper sketches a theory of data which may serve to clarify these questions. It is based on a number of old ideas and may, as a result, seem obvious. Be that as it may, some of these old ideas are not common currency in our field, either separately or in combination; it is hoped that rehashing them in a somewhat new form may prove to be at least suggestive.”


Three elements and connections

• Relations

• Data Maps

• Access Functions

• The data itself

• Procedures

• Storage and representation

• Descriptors

Wickett et al…

• “Heterogeneous digital data that has been produced by different communities with varying practices and assumptions, and that is organized according to different representation schemes, encodings, and file formats, presents substantial obstacles to efficient integration, analysis, and preservation. This is a particular impediment to data reuse and interdisciplinary science. An underlying problem is that we have no shared formal conceptual model of information representation that is both accurate and sufficiently detailed to accommodate the management and analysis of real world digital data in varying formats. Developing such a model involves confronting extremely challenging foundational problems in information science.”


Premise

[Figure: data, information, knowledge diagram. Labels: Creation, Gathering; Presentation, Organization; Integration, Conversation; Context; Experience.]


1. Assume context free

• Content and Structure
• D = f(x; p)
• D = data, f = transduction function, x = thing, p = parametric dependence (e.g. time of transduction)

• HAVE – Syntax
• DO NOT HAVE - Semantics – no meaning without context
• OR - Pragmatics – no use without meaning??

• What about - Uncertainty, quality, bias (error) – none without context?
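
A minimal sketch of D = f(x; p) may help make the context-free case concrete. It assumes a hypothetical linear transducer; the function name `transduce` and the `gain` and `drift` values are invented for illustration and do not come from the lecture.

```python
# Hypothetical transduction function f: maps a "thing" x to a datum D,
# with parametric dependence p (here, the time of transduction).
def transduce(x: float, p: float, gain: float = 2.0, drift: float = 0.01) -> float:
    """D = f(x; p): an assumed linear sensor whose offset drifts with time p."""
    return gain * x + drift * p

# The same thing x transduced at two different times yields two different data values.
d_early = transduce(x=10.0, p=0.0)
d_late = transduce(x=10.0, p=100.0)

# Context free, all we hold is syntax: a float with a value.
# Nothing in d_early says what was measured, in which units, or how well.
print(type(d_early).__name__, d_early, d_late)
```

The two values differ only because p differs, yet nothing in the bare numbers records that; this is the sense in which semantics, and with it any statement about quality or bias, is absent without context.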


2. Assume minimal context

• Minimal = incomplete?
• E.g. know instrument but not when, or of what
• E.g. know what but not how

• Partial uncertainty? Conditional entropy?

• Constructive induction?
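
One way to read the conditional entropy question is as the uncertainty about a datum D that remains once a piece of context C (say, which instrument produced it) is known. The sketch below computes H(D) and H(D|C) for a toy joint distribution; the probabilities and labels are invented for illustration only.

```python
import math
from collections import defaultdict

# Toy joint distribution P(context, datum); the numbers are invented for illustration.
joint = {
    ("instrument_A", "low"): 0.4,
    ("instrument_A", "high"): 0.1,
    ("instrument_B", "low"): 0.1,
    ("instrument_B", "high"): 0.4,
}

def entropy(dist):
    """Shannon entropy in bits of a mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint_dist, axis):
    """Sum the joint distribution down to one variable (axis 0 = context, 1 = datum)."""
    out = defaultdict(float)
    for key, p in joint_dist.items():
        out[key[axis]] += p
    return dict(out)

h_datum = entropy(marginal(joint, 1))                 # H(D): no context known
h_context = entropy(marginal(joint, 0))               # H(C)
h_datum_given_context = entropy(joint) - h_context    # H(D|C) = H(C, D) - H(C)

print(f"H(D) = {h_datum:.3f} bits, H(D|C) = {h_datum_given_context:.3f} bits")
```

In this toy case knowing the instrument lowers the uncertainty about the datum but does not remove it, which is one possible reading of "partial uncertainty" under minimal context.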


Pulling things over from Informatics

[Figure: data, information, knowledge diagram. Labels: Creation, Gathering; Presentation, Organization; Integration, Conversation; Context; Experience.]


Information Models

• Conceptual models, sometimes called domain models, are typically used to explore domain concepts

• High-level conceptual models are often created as part of initial requirements envisioning efforts as they are used to explore the high-level static business or science or medical structures and concepts.


(Information) Architecture

• Definition:
– “is the art of expressing a model or concept of information used in activities that require explicit details of complex systems” (Wikipedia)

– “… I mean architect as in the creating of systemic, structural, and orderly principles to make something work - the thoughtful making of either artifact, or idea, or policy that informs because it is clear.” Wurman


Data Models

• Conceptual data models, sometimes called domain models, are typically used to explore domain concepts

• Conceptual data models are often created as the precursor to logical data models or as alternatives to them.

• http://en.wikipedia.org/wiki/Data_modelling
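
As a rough illustration of a conceptual model acting as precursor to a logical one, the sketch below states two domain concepts as plain classes, restates them as relational tables, and realises those tables in an in-memory SQLite database. The `Specimen` and `Measurement` names, their fields, and the sample values are all hypothetical and are not taken from the models shown in the lecture.

```python
import sqlite3
from dataclasses import dataclass

# Conceptual level: domain concepts and their relationship (names are hypothetical).
@dataclass
class Specimen:
    specimen_id: str
    rock_type: str

@dataclass
class Measurement:
    specimen_id: str   # "a measurement is made on a specimen"
    analyte: str
    value: float
    unit: str

# Logical level: the same concepts as relational tables with keys and types.
LOGICAL_MODEL = """
CREATE TABLE specimen (
    specimen_id TEXT PRIMARY KEY,
    rock_type   TEXT NOT NULL
);
CREATE TABLE measurement (
    id          INTEGER PRIMARY KEY,
    specimen_id TEXT NOT NULL REFERENCES specimen(specimen_id),
    analyte     TEXT NOT NULL,
    value       REAL NOT NULL,
    unit        TEXT NOT NULL
);
"""

# Physical level: one concrete realisation, here an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript(LOGICAL_MODEL)

s = Specimen("S-001", "basalt")                 # invented sample values
m = Measurement("S-001", "SiO2", 49.5, "wt%")
conn.execute("INSERT INTO specimen VALUES (?, ?)", (s.specimen_id, s.rock_type))
conn.execute(
    "INSERT INTO measurement (specimen_id, analyte, value, unit) VALUES (?, ?, ?, ?)",
    (m.specimen_id, m.analyte, m.value, m.unit),
)
rows = conn.execute(
    "SELECT s.rock_type, m.analyte, m.value, m.unit "
    "FROM measurement m JOIN specimen s USING (specimen_id)"
).fetchall()
print(rows)
```

The conceptual classes say nothing about keys or storage; those decisions only appear at the logical and physical levels.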


Observation and Measurement


Mapping model to geochemistry


Specimen Model


Conceptual model


Logical model


Physical model


Conceptual model – shoreline photos


Logical model – shoreline photos


However as a consumer

• Do you ever really see these data models?

• What’s the most common form of making data available to others?

• What’s the most common means? Second most common?


Example XML

<?xml version="1.0" encoding="ISO-8859-1"?>
<shiporder orderid="889923"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire </title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>
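
To show what a consumer typically does with such a document, here is a small sketch that parses the shiporder with Python's standard `xml.etree.ElementTree` module and totals the order. The variable names are illustrative; only the XML itself comes from the slide.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The shiporder document from the slide, passed as bytes so the parser
# honours the encoding named in the XML declaration.
SHIPORDER_XML = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<shiporder orderid="889923"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="shiporder.xsd">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>4000 Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire </title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>"""

root = ET.fromstring(SHIPORDER_XML)
order_id = root.get("orderid")                       # attribute on the root element
titles = [item.findtext("title").strip() for item in root.iter("item")]
# Decimal avoids binary floating-point rounding on prices.
total = sum(
    Decimal(item.findtext("price")) * int(item.findtext("quantity"))
    for item in root.iter("item")
)
print(order_id, titles, total)
```

Note that everything arrives as text: the consumer decides that `price` is a decimal and `quantity` an integer, which is exactly the information a schema makes explicit.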


Very simple schema

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="shiporder">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="orderperson" type="xs:string"/>
        <xs:element name="shipto">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="address" type="xs:string"/>
              <xs:element name="city" type="xs:string"/>
              <xs:element name="country" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="item" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="note" type="xs:string" minOccurs="0"/>
              <xs:element name="quantity" type="xs:positiveInteger"/>
              <xs:element name="price" type="xs:decimal"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="orderid" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
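
The sketch below spells out, by hand, a few of the constraints this schema states: the required `orderid` attribute, at least one `item`, a positive integer `quantity`, a decimal `price`, and an optional `note`. It is an illustration of what the schema promises, not an XSD validator; the function name `check_shiporder` and the two test documents are hypothetical, and a real project would more likely hand the `.xsd` file to a validating library.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

def check_shiporder(root):
    """Return a list of violations of a few constraints the schema states.

    Hand-written illustration only; it does not check orderperson, shipto,
    element order, or the finer lexical rules of xs:decimal.
    """
    problems = []
    if root.tag != "shiporder":
        problems.append("root element must be shiporder")
    if root.get("orderid") is None:                       # use="required"
        problems.append("missing required attribute orderid")
    items = root.findall("item")
    if not items:                                         # minOccurs defaults to 1
        problems.append("at least one item is required")
    for n, item in enumerate(items, start=1):
        quantity = item.findtext("quantity", default="").strip()
        if not (quantity.isdecimal() and int(quantity) > 0):   # xs:positiveInteger
            problems.append(f"item {n}: quantity must be a positive integer")
        try:
            Decimal(item.findtext("price", default="").strip())   # xs:decimal
        except InvalidOperation:
            problems.append(f"item {n}: price must be a decimal")
        # <note> has minOccurs="0", so its absence is not a problem.
    return problems

good = ET.fromstring(
    '<shiporder orderid="1"><item><title>A</title>'
    "<quantity>1</quantity><price>9.90</price></item></shiporder>"
)
bad = ET.fromstring(
    "<shiporder><item><title>A</title>"
    "<quantity>0</quantity><price>cheap</price></item></shiporder>"
)
print(check_shiporder(good))
print(check_shiporder(bad))
```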


Markup Languages

• Reminder:
– Mixes data and metadata, and yes, information
– Tag structure does not always model the underlying data structure
– Modeling the XML itself, i.e. the schema, is another task
– Does have the potential benefit that it is more for use than storage

• Parsing the file:
– Incomplete versus complete tags
– Empty or optional fields
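
The two parsing concerns can be seen with Python's standard `xml.etree.ElementTree` parser. This is a small sketch on invented fragments: an unclosed tag makes the parser reject the whole document, while empty and missing elements parse without complaint and must be told apart by the caller.

```python
import xml.etree.ElementTree as ET

# An incomplete tag: <title> is never closed, so the document is not well-formed.
incomplete = "<item><title>Empire<quantity>1</quantity></item>"
try:
    ET.fromstring(incomplete)
    well_formed = True
except ET.ParseError as err:
    well_formed = False
    print("not well-formed:", err)

# Empty and optional fields parse fine, but they are not the same thing.
item = ET.fromstring("<item><title>Hide your heart</title><note/></item>")
note = item.find("note")      # element present but empty: an Element whose text is None
price = item.find("price")    # element absent: find() returns None
price_text = item.findtext("price", default="n/a")   # default applies only when absent
print(note is not None, note.text, price, price_text)
```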


Data tools (just a few)

• Models
– http://www.datamodel.org/
– MSDN: http://msdn.microsoft.com/en-us/library/bb399249.aspx

• Schema
– The Schematron differs in basic concept from other schema languages in that it is not based on grammars but on finding tree patterns in the parsed document. This approach allows many kinds of structures to be represented which are inconvenient and difficult in grammar-based schema languages. If you know XPath or the XSLT expression language, you can start to use The Schematron immediately.
– http://www.schematron.com/


Markup Language tools

• Any context-sensitive editor

• XMLSpy, XML Notepad, XML Editor, oXygen


Data as Service

• Modern internet architectures allow for
– Service oriented architectures
– Resource oriented architectures

• Why is this important for data models, schema, etc.?
– Hides/obscures underlying model, schemas
– Service interfaces are often a poor/hybrid match for underlying models

• UML and ISO 19xxx family of standards, e.g. 19135 are changing the landscape

• Mature in certain settings.


Open Geospatial Consortium

• Web Feature Service (WFS)
– http://www.opengeospatial.org/standards/wfs
– supports INSERT, UPDATE, DELETE, LOCK, QUERY and DISCOVERY operations on geographic features using HTTP as the distributed computing platform
– Built on Geography Markup Language (GML)

• Tutorial
– http://docs.codehaus.org/display/MAP/WFS+Tutorial
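
A QUERY operation against a WFS is commonly issued as an HTTP GET with key-value parameters. The sketch below only builds such a GetFeature request for WFS 1.1.0; the endpoint and the feature type name `demo:shorelines` are hypothetical placeholders, and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute the URL a real WFS server advertises.
WFS_ENDPOINT = "https://example.org/geoserver/wfs"

def wfs_getfeature_url(type_name, max_features=10):
    """Build a key-value-pair GetFeature request for a WFS 1.1.0 server."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,        # feature type to query (hypothetical name below)
        "maxFeatures": max_features,  # cap on the number of features returned
    }
    return f"{WFS_ENDPOINT}?{urlencode(params)}"

url = wfs_getfeature_url("demo:shorelines")
query = parse_qs(urlsplit(url).query)   # decode again to inspect what was sent
print(url)
```

The client names a feature type and gets GML back; the database tables or files behind the service never appear in the interface.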


WFS examples


Open Geospatial Consortium

• Web Map Service (WMS)
– http://www.opengeospatial.org/standards/wms
– produces maps of spatially referenced data dynamically from geographic information (a "map" is a portrayal of geographic information as a digital image file suitable for display on a computer screen). A map is not the data itself. WMS-produced maps are generally rendered in a pictorial format such as PNG, GIF or JPEG, or occasionally as vector-based graphical elements in Scalable Vector Graphics (SVG) format.
– http://www.intl-interfaces.com/cookbook/WMS/
– http://oceanesip.jpl.nasa.gov/esipde/guide.html


Open Geospatial Consortium

• Web Coverage Service (WCS)
– http://www.opengeospatial.org/standards/wcs
– supports electronic interchange of geospatial data as "coverages" – that is, digital geospatial information representing space-varying phenomena


Open Geospatial Consortium

• Sensor Observation Service (SOS)
– http://www.opengeospatial.org/standards/sos

• SWE Common
– http://www.opengeospatial.org/projects/groups/swecommonswg
– Get_capabilities


IVOA (www.ivoa.net)

• Simple Image Access Protocol

– http://ivoa.net/Documents/SIA/20091008/PR-SIA-1.0-20091008.pdf

– This specification defines a protocol for retrieving image data from a variety of astronomical image repositories through a uniform interface. The interface is meant to be reasonably simple to implement by service providers. A query defining a rectangular region on the sky is used to query for candidate images.

– The service returns a list of candidate images formatted as a VOTable. For each candidate image an access reference URL may be used to retrieve the image. Images may be returned in a variety of formats including FITS and various graphics formats. Referenced images are often computed on the fly, e.g., as cutouts from larger images.
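
The rectangular-region query described above can be sketched as a URL with the protocol's `POS`, `SIZE` and `FORMAT` parameters. The service base URL and the sky coordinates below are hypothetical, and the sketch stops at building the query; retrieving and reading the returned VOTable is left out.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical SIA service base URL; real services publish their own.
SIA_BASE_URL = "https://example.org/sia"

def sia_query_url(ra_deg, dec_deg, size_deg, image_format="image/fits"):
    """Build a Simple Image Access query for a rectangular region on the sky.

    POS is the region centre (RA, Dec in decimal degrees), SIZE its angular
    extent in degrees, FORMAT the image type requested.
    """
    params = {
        "POS": f"{ra_deg},{dec_deg}",
        "SIZE": f"{size_deg}",
        "FORMAT": image_format,
    }
    return f"{SIA_BASE_URL}?{urlencode(params)}"

url = sia_query_url(180.0, -30.0, 0.5)      # invented coordinates
query = parse_qs(urlsplit(url).query)       # decode again to inspect what was sent
print(url)
```

Each row of the VOTable response would then carry an access reference URL for one candidate image, as the specification text above describes.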


IVOA (www.ivoa.net)

• E.g. Simple Spectrum Access Protocol
– http://ivoa.net/Documents/REC/DAL/SSA-20080201.pdf
– The Simple Spectrum Access (SSA) Protocol (SSAP) defines a uniform interface to remotely discover and access one dimensional spectra. SSA is a member of an integrated family of data access interfaces altogether comprising the Data Access Layer (DAL) of the IVOA.

– SSA is based on a more general data model capable of describing most tabular spectrophotometric data, including time series and spectral energy distributions (SEDs) as well as 1-D spectra; however the scope of the SSA interface as specified in this document is limited to simple 1-D spectra, including simple aggregations of 1-D spectra.


Discussion

• Theoretical concepts - do we have any hope?

• Data models – could you develop one?

• Forms of Schema?

• Service paradigms?

• Relation to data management?


Summary

• Informatics in relation to data science
– Discuss?

• Data models and schema and the tools that go with them are plentiful

• Modern use of XML and specific markup languages obscures the underlying data structure (physical and logical) but has other advantages

• Data as service carries this to another level


What is next

• Next lecture (#12) – Nov. 18th.
– Webs of data, data on the web, deep Web, data discovery, data citation

• NO CLASS on Nov. 25th – written projects due!

• Reading:
– See web site for NEXT week (pre-reading)


How about those projects?
