
Building a Geospatial Data Dictionary: Enhanced Data Description

NEARC, Fall 2011

Brian Hebert, Solutions Architect

ScribeKey, LLC

www.scribekey.com

Workshop Outline

• Review goals and requirements for producing enhanced data description materials
• Look at approaches to data description
• US Census data as sample
• Review ScribeKey shareware tools
• Discussion and Q&A

Goals

• Make data as easy to understand and use as possible; reduce the learning curve.
• Learning about data takes a lot of time and effort, and a given dataset is often only one part of a larger data use and mission.
• Make full use of the tools we have.
• Apply these ideas to your own use cases.
• Whether you are a user, provider, broker, or creator of data, help people use it in the best way.

Lessons Learned

• Global FGDC metadata and data description materials for large-volume commercial geospatial datasets containing 1000s of data layers and tables.
• Assess, describe, and standardize a large collection of geospatial datasets and metadata.
• Borrow from data warehousing, business intelligence, and library science approaches.

200+ Countries, 72 Layers, 100s of Attributes, 100s of Domains, Quarterly Updates

50+ States, 400 Layers, 1000s of Attributes, 100s of Domains, Annual Updates

Background: Industrial Strength Metadata Generation

• Sample data is reviewed and profiled. Any existing metadata is imported into the repository.
• From the profile, existing user documentation, technical support staff, and the website, a metadata repository is populated and metadata document templates are developed.
• FGDC/ISO metadata is generated from the metadata repository as XML/HTML reports.

[Diagram labels: Metadata Repository, Metadata Templates, Metadata Export App; outputs: FGDC XML, HTML, PDF, DOC]

Sample Data: US Census

• US Census data for Saratoga County, NY
• Good example
• Lots of detail
• Has CSDGM metadata
• Has its own vocabulary

[Figure: Saratoga County, NY, Personal GeoDb]

How Do People Learn About Data?

[Diagram: the User, surrounded by Website, Documentation, Metadata, Tech Support, Email, and the Data Itself]

Users learn how to use data through a variety of sources.

Challenges

• Documentation: Large volume, time consuming
• FGDC Metadata: Sets of separate XML documents, redundancy, cumbersome, different format than the data being described, etc.
• Website: Lots of great info, somewhat unstructured
• Tech Support: Availability, cost
• Data Itself: Familiarity takes time
• How can we consolidate all of this information in a single place in an easy-to-use format?

Solution: 2 Data Dictionary Formats

1) HTML Pages: Lightweight, Flexible, Familiar, Static or Dynamic

2) GIS Metalayers: Integrated Data/Metadata, Flexible, Familiar, Simplification

Essentials: It’s All Metadata

Meaning, Structure, Contents

Q: What does it mean to be familiar with data?

A: Users know where to find something and how to make detailed maps and reports.

Creating FGDC CSDGM Metadata

Identification_Information:
  Citation:
    Citation_Information:
      Originator: John Hancock
      Publication_Date: 2008
      Title: Boston Streets
  Description:
    Abstract: The Boston Streets dataset provides a complete set of single line street segments for the town of Boston, Massachusetts.
    Purpose: The purpose of the Boston Streets dataset is to provide a basic street base map for general purpose use by the town and its people.
  Time_Period_of_Content:
    Time_Period_Information:
      Single_Date/Time:
        Calendar_Date: 2008
    Currentness_Reference: Publication Date
  Status:
    Progress: Complete
    Maintenance_and_Update_Frequency: Quarterly
  Spatial_Domain:
    Bounding_Coordinates:
      West_Bounding_Coordinate: -70.00
      East_Bounding_Coordinate: -69.00
      North_Bounding_Coordinate: 45.00
      South_Bounding_Coordinate: 44.00
  Keywords:
    Theme:
      Theme_Keyword_Thesaurus: None
      Theme_Keyword: Streets
  Access_Constraints: This dataset may be freely accessed by the public.
  Use_Constraints: This dataset may be freely used by the public.
Metadata_Reference_Information:
  Metadata_Date: 20080219
  Metadata_Contact:
    Contact_Information:
      Contact_Person_Primary:
        Contact_Person: Sam Adams
      Contact_Address:
        Address_Type: Mailing
        Address: 100 Beacon Street
        City: Boston
        State_or_Province: MA
        Postal_Code: 02108
      Contact_Voice_Telephone: 508-429-1234
  Metadata_Standard_Name: FGDC Content Standards for Digital Geospatial Metadata
  Metadata_Standard_Version: FGDC-STD-001-1998

Checklist:

• CSDGM Core
• Only 26 Values
• Attribute Definitions
• Domain Values and Definitions
• Use USGS MP Tool

Geospatial Metadata Issues

• There is no real support for non-geometric entities, e.g., tables. For example, the record count element is buried inside a geospatial element. So, there is no place to put a record count for a simple table.

• There is incomplete representation for domains. Domains can’t be shared. Domains have no name of their own, but exist only as info added to an attribute. Domains can only have 2 values, so can’t support 3 related values, e.g., MA, Massachusetts, 25.

• Attribute information is optional. Unlike the most basic RDBMS metadata available in any system, there are no elements for attribute data type and length.

• There are no elements at the entity level for specifying relationships, through joins, etc.

• Metadata at the individual feature record level is not supported.

• Describing data layers resulting from combinations of N source datasets is not supported.


Geospatial Metadata Issues (cont.)

• Because they are managed using two different physical implementations, geospatial data and metadata get out of sync.

• Metadata is available as separate, independent documents. It cannot easily be queried as a set. For example, getting a simple list of features/tables requires a custom XML application.

• The XML-based FGDC CSDGM standard is complex and difficult for end users and tool vendors to understand. Because it is based on XML with variable-length records and nesting, it is basically the schema for an object-oriented database, not a relational or object-relational database.

• The new ISO standards are even more confusing and difficult to understand. ISO layer metadata and entity, attribute, and domain metadata are also now separated into two different standards. The current FGDC recommendation is to continue using CSDGM.

• http://www.fgdc.gov/metadata/geospatial-metadata-standards

CSDGM Physical Implementation Guidelines

• The FGDC/CSDGM standard clearly states that the standard describes content, and not physical implementation. From the CSDGM Workbook:

The standard specifies information content, but not how to organize this information in a computer system or in a data transfer, or how to transmit, communicate, or present the information to a user. There are several reasons for this approach:

There are many means by which metadata could be organized in a computer. These include incorporating data as part of a geographic information system, in a separate data base, and as a text file. Organizations can choose the approach which suits their data management strategy, budget, and other institutional and technical factors.

In spite of these statements, geospatial metadata implementation has not been approached using industrial strength RDBMS data access technology, but rather relies on sets of separate XML files, using an entirely different data access and management paradigm than that used by the data it is describing.


Centralizing Meaning, Structure, and Content: The RDBMS Based Metadata Repository

[Diagram labels. Data and Metadata Sources: Roads, Parcels, Buildings, each with FGDC XML Metadata. Data Description Tools: Data Profiling, Metadata Import. Target: METADATA REPOSITORY.]

RDBMS: Structure & Contents

XML: Meaning & Geospatial

How Does Data Profiling Help?

An essential tool for enhanced metadata: it shows the end user actual sample values, data types, lengths, formats, percent complete, etc. This valuable contents information is typically not found in geospatial metadata.

NUM  FIELD          DESCRIPTION
1    DatasetId      A unique identifier for the dataset
2    DatabaseName   The name of the source database
3    TableName      The name of the source database table
4    RecordCount    The number of records in the table
5    ColumnCount    The number of columns in the table
6    NumberOfNulls  The number of null values in the table
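The kind of profile described on this slide can be sketched in a few lines. The following is a minimal illustration only, not the ScribeKey profiler: it assumes the data sits in a SQLite table, and the `Roads` table, its rows, and the per-column field names (`MaxLength`, `PercentComplete`, `SampleValues`) are hypothetical. Only `TableName`, `RecordCount`, `ColumnCount`, and `NumberOfNulls` come from the field list above.

```python
import sqlite3


def profile_table(conn, table):
    """Profile one table: a table-level summary plus one dict per column."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
    columns = [(row[1], row[2])
               for row in conn.execute(f'PRAGMA table_info("{table}")')]
    record_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

    column_profiles = []
    for name, declared_type in columns:
        # COUNT(col) skips NULLs, so record_count - non_null is the null count.
        non_null, max_len = conn.execute(
            f'SELECT COUNT("{name}"), MAX(LENGTH(CAST("{name}" AS TEXT))) '
            f'FROM "{table}"'
        ).fetchone()
        samples = [r[0] for r in conn.execute(
            f'SELECT DISTINCT "{name}" FROM "{table}" '
            f'WHERE "{name}" IS NOT NULL ORDER BY 1 LIMIT 3')]
        column_profiles.append({
            "ColumnName": name,
            "DataType": declared_type,
            "MaxLength": max_len,
            "NumberOfNulls": record_count - non_null,
            "PercentComplete": round(100.0 * non_null / record_count, 1)
                               if record_count else 0.0,
            "SampleValues": samples,
        })

    summary = {
        "TableName": table,
        "RecordCount": record_count,
        "ColumnCount": len(columns),
        "NumberOfNulls": sum(c["NumberOfNulls"] for c in column_profiles),
    }
    return summary, column_profiles


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Roads (RoadId INTEGER, FullName TEXT, Mtfcc TEXT)")
conn.executemany("INSERT INTO Roads VALUES (?, ?, ?)", [
    (1, "Main St", "S1400"),
    (2, None, "S1400"),
    (3, "I-87", "S1100"),
    (4, "Route 9", None),
])
summary, columns = profile_table(conn, "Roads")
print(summary)
for col in columns:
    print(col)
```

Storing the profile as rows (one per table, one per column) is what lets it be queried alongside imported metadata later.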

CSDGM Core into the RDB

NUM  ELEMENT
1    Originator
2    Publication_Date
3    Title
4    Abstract
5    Purpose
6    Calendar_Date
7    Currentness_Reference
8    Progress
9    Maintenance_and_Update_Frequency
10   West_Bounding_Coordinate
11   East_Bounding_Coordinate
12   North_Bounding_Coordinate
13   South_Bounding_Coordinate
14   Theme_Keyword_Thesaurus
15   Theme_Keyword
16   Access_Constraints
17   Metadata_Date
18   Contact_Person
19   Address_Type
20   Address
21   City
22   State_or_Province
23   Postal_Code
24   Contact_Voice_Telephone
25   Metadata_Standard_Name
26   Metadata_Standard_Version

[Diagram: several XML Metadata documents feeding an IMPORT step]

When metadata is imported into an RDB, the full power of SQL becomes available for flexible query and management of large-volume data description information.
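As a rough sketch of what such an import might look like (this is not the ScribeKey importer), the snippet below reads a small CSDGM XML fragment with Python's standard library and stores a few core elements as rows in SQLite. The table name `CsdgmCore`, its columns, and the dataset id are assumptions made for illustration; the XML short tags (`origin`, `pubdate`, `title`, `progress`, `update`, `metd`) are the standard CSDGM short names.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A fragment of the Boston Streets sample record, using CSDGM short tag names.
SAMPLE_XML = """<metadata>
  <idinfo>
    <citation><citeinfo>
      <origin>John Hancock</origin>
      <pubdate>2008</pubdate>
      <title>Boston Streets</title>
    </citeinfo></citation>
    <status><progress>Complete</progress><update>Quarterly</update></status>
  </idinfo>
  <metainfo><metd>20080219</metd></metainfo>
</metadata>"""

# CSDGM short tag -> long element name (only a subset of the 26 core elements).
CORE_ELEMENTS = {
    "origin": "Originator",
    "pubdate": "Publication_Date",
    "title": "Title",
    "progress": "Progress",
    "update": "Maintenance_and_Update_Frequency",
    "metd": "Metadata_Date",
}


def import_metadata(conn, dataset_id, xml_text):
    """Store each recognized core element as one (dataset, element, value) row."""
    root = ET.fromstring(xml_text)
    rows = [(dataset_id, CORE_ELEMENTS[el.tag], (el.text or "").strip())
            for el in root.iter() if el.tag in CORE_ELEMENTS]
    conn.executemany("INSERT INTO CsdgmCore VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
# Hypothetical repository table: one row per dataset/element pair.
conn.execute("CREATE TABLE CsdgmCore (DatasetId TEXT, Element TEXT, Value TEXT)")
imported = import_metadata(conn, "boston_streets", SAMPLE_XML)

# Once in the RDB, metadata can be queried as a set,
# e.g. titles of all datasets that are updated quarterly.
quarterly_titles = conn.execute("""
    SELECT t.Value
    FROM CsdgmCore t
    JOIN CsdgmCore u ON u.DatasetId = t.DatasetId
    WHERE t.Element = 'Title'
      AND u.Element = 'Maintenance_and_Update_Frequency'
      AND u.Value = 'Quarterly'
""").fetchall()
print(imported, quarterly_titles)
```

With many datasets loaded, the same query answers a question that would otherwise require opening every XML document in turn.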

Tools Demonstration

Data Profiling

Metadata Import


Windows Based

Batch Command Line

.NET

.mdb Files

Logging

Inside the Repository: Tables and Views

Data structure, contents, and meaning are housed in a table-centric RDBMS repository that is easy to access, query, and share. If you didn’t have CSDGM attribute metadata before, the data profile helps by providing a baseline.

PROFILE:
• DiTABLES
• DiCOLUMNS
• DiDOMAINS
• DiDomainValues

METADATA INGEST:
• CsdgmEnt
• CsdgmAtt
• CsdgmDomVal

VIEWS:
• EntRpt
• AttRpt
• DomRpt

Elements from Profile and Metadata Ingestion can be combined through SQL views.
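A view that combines the two sources might look roughly like the sketch below. The table and view names (`DiCOLUMNS`, `CsdgmAtt`, `AttRpt`) are taken from the slide, but every column name and every sample row is an assumption made for illustration; the real repository schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Profile side: what the data actually contains (hypothetical columns).
CREATE TABLE DiCOLUMNS (
    TableName TEXT, ColumnName TEXT, DataType TEXT,
    MaxLength INTEGER, PercentComplete REAL
);
-- Metadata side: what the provider says the attribute means (hypothetical columns).
CREATE TABLE CsdgmAtt (
    EntityLabel TEXT, AttributeLabel TEXT, AttributeDefinition TEXT
);

INSERT INTO DiCOLUMNS VALUES ('Edges', 'MTFCC', 'TEXT', 5, 100.0);
INSERT INTO DiCOLUMNS VALUES ('Edges', 'FULLNAME', 'TEXT', 100, 62.5);
INSERT INTO CsdgmAtt VALUES ('Edges', 'MTFCC', 'MAF/TIGER feature class code');

-- LEFT JOIN keeps profiled columns even when no attribute metadata exists,
-- so the profile still provides a baseline description.
CREATE VIEW AttRpt AS
SELECT c.TableName, c.ColumnName, c.DataType, c.MaxLength,
       c.PercentComplete, a.AttributeDefinition
FROM DiCOLUMNS c
LEFT JOIN CsdgmAtt a
       ON a.EntityLabel = c.TableName
      AND a.AttributeLabel = c.ColumnName;
""")

att_rpt = conn.execute("SELECT * FROM AttRpt ORDER BY ColumnName").fetchall()
for row in att_rpt:
    print(row)
```

The left join is a deliberate choice here: a column with no definition shows up with an empty definition rather than disappearing, which also makes gaps in the metadata easy to list.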

Helping with the Data Provider/End User Communication Gap

Data providers and users have different languages and understandings of data. The use of keywords, aliases, and definitions in a data dictionary helps bridge this gap by providing a translation.

User Language: “Layer, Table, Attribute, Map, Symbol, Centroid, Join, Report”

Provider Language: “Impute, FROMHN, EDGES, ADDRFN, Internal Point, MTFCC, S1100”

Schemas and Semantics

The Tower of Babel

GIS Users: Layers, Attributes, Symbols, Towns… ? What does this mean?

Data Modelers, ISO/OGC Schemas: UML, XSD, GML, ISO 19XXX, Ontologies, Abracadabra

Next Steps: Clarification and Completion

• We’ve integrated profile and metadata info

• Now need to refine this information

• Make sure everything is clear

• Make sure everything is complete

• Library Science to the rescue


Library Science Artifacts

• Indexing and Abstracting
• The Dictionary Hierarchy
• Types and Taxonomies
• The Thesaurus
• The Glossary

With the Metadata Repository loaded, a number of useful data description artifacts can be developed.

Indexing and Abstracting: The Overview Page

• The most essential information
• Clear, concise writing
• Links to details
• Automated tools are no substitute for subject matter expertise
• Limits of FGDC or ISO schemas as template
• Data driven

The Data Dictionary Hierarchy: Categories, Entities, Attributes, Domains

• Data typically falls into higher level categories

• Entities include layers and tables and relationships among them

• Attribute data types, lengths, domain contents provide the heart of data detail for query, reporting, and mapping

• A streamlined and flexible view of metadata

Feature Types and Taxonomies

• Users need to be able to search through metadata and data easily, using feature names they are familiar with.

• Domain profiles and metadata are starting points for developing a feature description typology.

• Isolated domain information doesn’t always present the entire picture.

This HTML page allows users to look up a feature name and find the corresponding layer and attribute SQL query that can be used to filter for it.

The Thesaurus: What’s in a name?


US Census MTFCC SDTS Entities

Choosing the Best Names

• If you’re developing a new set of names for data categories, entities, attributes, and domain values, use words that your data user audience is familiar with.

• Don’t invent new words when existing ones will do. Reuse taxonomies.

• “Consistency is the last refuge of the unimaginative.” (Oscar Wilde)

• Natural language is often inconsistent, but can still be very clear for end users.

Choosing the Best Names (cont.)


The Google Test

lon/lat: 201,000,000

lat/lon: 7,870,000

Tool Demonstration: Sql2Html


Glossaries

• Which words and terms need to be described?

• Text analysis tools are freely available for helping with this task.

• This list was generated from entity definitions.

• Can also be used as input to list of keywords for FGDC metadata.

http://textalyser.net/
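A simple word-frequency pass over entity definitions is enough to suggest glossary candidates. The sketch below is a minimal stand-in for a text analysis tool, assuming the definitions are available as plain strings; the three sample definitions and the stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Small, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to",
              "is", "or", "that", "by", "as"}


def candidate_terms(definitions, top_n=5):
    """Return the top_n most frequent non-stop words across all definitions."""
    words = []
    for text in definitions:
        words += [w for w in re.findall(r"[a-z]+", text.lower())
                  if w not in STOP_WORDS and len(w) > 2]
    return Counter(words).most_common(top_n)


# Hypothetical entity definitions.
definitions = [
    "A linear feature such as a road or railroad.",
    "A road feature classified by its MTFCC code.",
    "A polygon feature bounded by edges.",
]
top_terms = candidate_terms(definitions, top_n=2)
print(top_terms)
```

Frequent terms are only candidates: a subject matter expert still decides which ones need a glossary entry or belong in the keyword list.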


Metalayers: Metadata as GIS Data


Tables from the Metadata Repository can be easily accessed in ArcMap, and joined with polygon layers to provide access to fully integrated data/metadata

Metalayers: Metadata as GIS Data (cont.)

Metadata Repository layer/table information, as populated from data profiling and FGDC metadata ingestion, for US Census data, Saratoga County area, against a full backdrop of New York towns.

Table-Centric Metadata in ArcGIS


• Metadata tables can be added to your ArcMap .mxd files.

• If you have multiple sets of heterogeneous data, you can link metadata tables with polygons depicting data coverage areas.

• Metadata can now be used like any other geospatial data, as the basis for color shading, symbology, reports, etc.

• Metadata can be used to first find data through a lighter-weight wrapper, then drill through to the actual underlying data.

Are Data Aggregation Results Metadata?


• Data aggregation provides a key component of decision support information systems, AKA, Business Intelligence (BI).

• Provides a smaller, faster, high level summary and simplification of large volumes of data.

• Helps decision makers focus in on what’s important.

• Created using standard RDBMS SQL aggregation constructs (SUM, COUNT, GROUP BY) and OLAP technology.
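The aggregation step can be illustrated with a small, self-contained example. It uses SQLite through Python and loads only the first nine base rows shown on the aggregation slide, so the totals are for those nine rows and not the 100,000-row summary; the table name `BaseData` is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BaseData (Id INTEGER, Type TEXT, Amt INTEGER)")
# First nine rows of the base data table from the slide.
conn.executemany("INSERT INTO BaseData VALUES (?, ?, ?)", [
    (1, "D", 64), (2, "B", 24), (3, "F", 95),
    (4, "C", 98), (5, "D", 54), (6, "A", 26),
    (7, "C", 29), (8, "F", 84), (9, "C", 56),
])

# One summary row per Type: how many records, and their total amount.
aggregate = conn.execute("""
    SELECT Type, COUNT(*) AS Num, SUM(Amt) AS Total
    FROM BaseData
    GROUP BY Type
    ORDER BY Type
""").fetchall()

for row in aggregate:
    print(row)
```

The result has one row per type regardless of how many base records exist, which is why aggregates are small and fast enough to serve as a summary layer.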


Metalayers: Aggregation

BASE DATA

Id       Type  Amt
1        D     64
2        B     24
3        F     95
4        C     98
5        D     54
6        A     26
7        C     29
8        F     84
9        C     56
10       E     59
11       C     31
12       F     77
14       D     52
15       A     8
16       E     27
17       F     82
18       C     38
19       G     57
20       A     13
21       E     68
22       B     87
23       D     43
…        …     …
100,000  F     55

AGGREGATE

Type  Num     Total
A     15,713  386,623
B     15,631  605,963
C     15,258  613,823
D     15,591  685,496
E     15,739  916,551
F     15,929  1,120,066
G     15,634  1,123,990

Metalayer Drilldown and Rollup

COUNTY, TOWN, CENSUS TRACT: increasingly detailed views

Applying a pivot-table-like view, with drilldown and rollup, to hierarchical geography units.

Meta-Layer Geometry Creation and Management

Lon/Lat Bounding Boxes

Spatial_Domain:
  Bounding_Coordinates:
    West_Bounding_Coordinate: -167.946360
    East_Bounding_Coordinate: 179.001991
    North_Bounding_Coordinate: 71.298141
    South_Bounding_Coordinate: 17.678360

Three basic approaches to generating layer coverage polygons: 1) bounding boxes, 2) convex/concave hulls and tessellations, and 3) existing administrative or other polygons. The choice is based on presentation and data management requirements.
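The bounding-box approach is the simplest of the three to automate, because the four CSDGM bounding coordinates already define the polygon. A minimal sketch follows, using the coordinate values from the slide; the function names are hypothetical, and WKT is used here only as one convenient output format.

```python
def bbox_ring(west, east, north, south):
    """Closed ring of (lon, lat) corners, counter-clockwise from the south-west."""
    return [(west, south), (east, south), (east, north),
            (west, north), (west, south)]


def ring_to_wkt(ring):
    """Format a closed ring as a WKT POLYGON with six decimal places."""
    coords = ", ".join(f"{lon:.6f} {lat:.6f}" for lon, lat in ring)
    return f"POLYGON(({coords}))"


# Bounding coordinates as they appear in the Spatial_Domain section.
ring = bbox_ring(west=-167.946360, east=179.001991,
                 north=71.298141, south=17.678360)
wkt = ring_to_wkt(ring)
print(wkt)
```

The resulting polygon can be loaded as the geometry of a metalayer record, with the rest of the metadata carried as attributes.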

Convex Hull of Census Edges Layer

Convex hulls are useful for describing arbitrary Metalayer coverage areas when no existing political or administrative boundary polygons are available.
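For readers who want to see how such a hull can be computed, here is a short sketch using Andrew's monotone chain algorithm. It is not necessarily how the hull in the figure was produced (GIS packages provide their own tools), and it treats lon/lat values as planar coordinates, which is an approximation. The sample points are made up.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 for a left (counter-clockwise) turn o -> a -> b.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Hypothetical vertices: a square with one interior point and one edge point.
sample_points = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(sample_points)
print(hull)
```

Interior points and points lying on a hull edge are dropped, so the hull stays small even when the source layer has many vertices.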


Summary and Take-Aways: 5 Phases


1) Developed standardized geospatial metadata

2) Profiled data

3) Integrated profile results and metadata in an RDBMS repository

4) Refined information, using library science approach and artifacts

5) Exported metadata from repository in 2 convenient formats, HTML and geospatial data layers.

Take Away: Lightweight HTML Data Dictionary

Full descriptions of data categories, entities, attributes, and domain values. Information is integrated from documentation, data profiles, metadata, and the data provider website. Available as standalone HTML or on a web site.
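Generating a data dictionary page from a repository query can be sketched as follows. This is an illustrative stand-in, not the behavior of SkSql2Html: the function name, the `EntRpt` columns, and the sample rows are all assumptions.

```python
import html
import sqlite3


def sql_to_html(conn, sql, title):
    """Run a query and render the result set as a simple HTML table."""
    cur = conn.execute(sql)
    headers = [d[0] for d in cur.description]
    lines = [f"<h2>{html.escape(title)}</h2>", "<table>"]
    lines.append("<tr>" + "".join(f"<th>{html.escape(h)}</th>" for h in headers) + "</tr>")
    for row in cur:
        # Escape every value so characters like & and < render correctly.
        cells = "".join(f"<td>{html.escape(str(v))}</td>" for v in row)
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
# Hypothetical entity report table with two sample rows.
conn.execute("CREATE TABLE EntRpt (Entity TEXT, Definition TEXT, RecordCount INTEGER)")
conn.executemany("INSERT INTO EntRpt VALUES (?, ?, ?)", [
    ("Edges", "Roads, rails & hydrography lines", 52000),
    ("Faces", "Polygons bounded by edges", 18000),
])

page = sql_to_html(conn, "SELECT * FROM EntRpt ORDER BY Entity", "Entities")
print(page)
```

Because the page is produced from a query, it can be regenerated whenever the repository is updated, which keeps a static HTML dictionary in step with the data.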

Take Away: Metalayers

Use data profiles and metadata to create GIS layers that support a variety of map presentations, reports, etc., summarizing and highlighting datasets by metadata values.

Take Away: Data Description Checklist

• Is there a Data User Guide? A glossary and index?
• Are primary data categories and entities fully described?
• Are all acronyms, abbreviations, and provider vocabulary terms explained?
• Are short, cryptic database field names and values explained?
• Are data types, lengths, keys, nulls allowed, formats, and lists clear enough to help users form SQL queries?
• Is FGDC/ISO metadata available?
• Are sample values and data profiles available?
• Are data presentations, maps, symbols, and reports prepared for a quick start?
• Is all this info in one place?

Meaning, Structure, Contents

Complete metadata describes Meaning, Structure, and Contents. Maximize the end user’s understanding of details to help create queries, reports, and maps.

Take Away: Use a Geospatial Metadata Repository

The Metadata Repository, implemented as an RDBMS, is populated with automated tools and then used to generate metadata outputs, data dictionary content, schemas, maps, etc.

[Diagram labels. Sources: Data Layers, Metadata, Documents, Assessments. METADATA REPOSITORY tables: Areas, Entities, Attributes, Domains. Outputs: Derivative Datasets, Metalayers, Pivot Tables, New Schemas, Data Dictionary, Enhanced User Views.]

The Future: Structured vs. Unstructured Query/Access

Structured data queries require that a user know the exact entity.attribute=value construct to find data. Unstructured data queries can use underlying metadata tables, like the FeatureFilter, to locate the correct entity.attribute=value construct. Metadata is also generally much smaller in volume than the data it describes and can be queried very quickly.
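The lookup idea can be sketched with a small metadata table. The table name `FeatureFilter` comes from the slide, but its columns, the function name, and the sample rows are assumptions for illustration; the MTFCC codes shown are standard Census feature class codes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per familiar feature name, pointing at the
# entity.attribute=value construct that selects it.
conn.execute("""CREATE TABLE FeatureFilter (
    FeatureName TEXT, Entity TEXT, Attribute TEXT, Value TEXT)""")
conn.executemany("INSERT INTO FeatureFilter VALUES (?, ?, ?, ?)", [
    ("Primary Road", "Edges", "MTFCC", "S1100"),
    ("Secondary Road", "Edges", "MTFCC", "S1200"),
    ("Stream/River", "Edges", "MTFCC", "H3010"),
])


def find_filters(conn, term):
    """Map a free-text feature name to the structured queries that select it."""
    rows = conn.execute(
        "SELECT FeatureName, Entity, Attribute, Value FROM FeatureFilter "
        "WHERE FeatureName LIKE ? ORDER BY FeatureName",
        (f"%{term}%",),
    ).fetchall()
    return [(name, f"SELECT * FROM {entity} WHERE {attribute} = '{value}'")
            for name, entity, attribute, value in rows]


matches = find_filters(conn, "road")
for name, query in matches:
    print(name, "->", query)
```

The user searches with a familiar word, and the small metadata table supplies the exact structured query to run against the much larger data table.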

ScribeKey Shareware Tools

1) Data Profiler: SkProfile.exe
2) Metadata Importer: SkMtd2Db.exe
3) SQL To HTML Generator: SkSql2Html.exe
4) MS Access Metadata Repository

• Look at ReadMe.txt files
• Work with Personal Geodatabases
• Requires .NET runtime

Thank you

Q&A

Brian Hebert Solutions Architect

ScribeKey, LLC

www.scribekey.com