Data Science for DTIC Data Ecosystem. Dr. Brand Niemann, Director and Senior Data Scientist/Data Journalist, Semantic Community. http://semanticommunity.info/ http://www.meetup.com/Virginia-Big-Data-Meetup/ http://www.meetup.com/Federal-Big-Data-Working-Group/ http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup December 13, 2014


Page 1

Data Science for DTIC Data Ecosystem

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist

Semantic Community
http://semanticommunity.info/

http://www.meetup.com/Virginia-Big-Data-Meetup/
http://www.meetup.com/Federal-Big-Data-Working-Group/

http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

December 13, 2014

Page 2

Overview

• Master Data Repository Tool RFI: Due December 30th

• GPO's FDsys: 1 Billion Items Served and Ready For More

• Big Data: Google Page Rank

• Data Science: Questions, Data Mining, Invert Bath Tub, and Digital Government Strategy

• DTIC Site Map, Thesaurus, and Subject Categories

• Knowledge Base: MindTouch

• Knowledge Base Spreadsheet Linked Data Index: Excel

• Analytics and Visualization: Spotfire

• Semantic Search: Semantic Insights

• Results: Conclusions and Next Steps

Page 3

Master Data Repository Tool RFI: Due December 30th

• DTIC’s goal is to consolidate, unify, manage, control, search, analyze, and disseminate scientific and technical data using a single tool.

• If DTIC really wants a single tool, GPO's FDsys is probably the closest match among the systems I have worked on.

• DTIC Needs Data Scientists to Build a Data Ecosystem with Data Science.

• Map RFI Requirements to Digital Government Strategy to Show How This Data Science Pilot Meets and Exceeds Those Requirements.

Page 4

GPO's FDsys: 1 Billion Items Served and Ready For More

• But, “We are not resting on our laurels,” said GPO’s Chief Technology Officer Ric Davis. A major refresh of FDsys now is in the planning stages, which will include an updated search engine and improved support for mobile devices.

• FDsys uses Extensible Markup Language (XML) and an ISO standard format for archival information to enable searching across multiple collections, a feature not available in the original GPO website, GPO Access.

• “It was really a flat store of files with a search engine on top of it,” LaPlant said of the old Access. “We needed something to better manage and preserve.”

• The agency is evaluating cloud-based technology for FDsys as part of its upcoming major refresh, along with a new open source search engine, Solr, which promises fault-tolerant performance on a large scale.

• In 2012 GPO began replacing its 30-year-old composition engine called Microcomp with XML Professional Publisher (XPP) to enable the direct XML formatting of documents for both electronic and print publication. This eliminated the step of transforming documents for publication in XML.

• My Comment: So I taught XML Training at GPO, showed them how to “author once and use many” (print, CD, Web, mobile), and suggested MindTouch, the state-of-the-art Wiki with Solr (Lucene) in the Amazon Cloud that I use for all of my work, so I am still ahead of them after 15 years! I use four tools: MindTouch, Spotfire, Semantic Insights, and Be Informed.

http://gcn.com/Articles/2014/05/08/GPO-FDsys.aspx?Page=3&p=1

Page 5

Big Data: Google Page Rank

• PageRank is an algorithm used by Google Search to rank websites in their search engine results. PageRank was named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages. According to Google:

– PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

• PageRank is now one of 200 ranking factors that Google uses to determine a page’s popularity. Google Panda is one of the other strategies Google now relies on to rank popularity of pages. Even though PageRank is no longer directly important for SEO purposes, the existence of back-links from more popular websites continues to push a webpage higher up in search rankings.

• My Comment: Why not create big data pages that are data in relational and graph format?

http://en.wikipedia.org/wiki/PageRank
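The link-counting idea described above can be sketched as a short power iteration. This is a minimal illustration only, not Google's production algorithm: the page names in `toy_links`, the damping value of 0.85, and the fixed iteration count are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a dict mapping page -> list of linked pages.

    Illustrative sketch: damping and iterations are assumed defaults.
    """
    # Collect every page that appears as a source or a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
toy_links = {
    "home": ["reports", "thesaurus"],
    "reports": ["home"],
    "thesaurus": ["home", "reports"],
    "orphan": ["home"],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "home" receives links from every other page and ends up ranked highest, while "orphan" receives none and keeps only the baseline share.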

Page 6

Data Science: Questions, Data Mining, Invert Bath Tub, and Digital Government Strategy

Answer Four Questions:
• How is the data collected?
• Where is it stored?
• What are the results?
• Why should we believe them?

Follow Data Mining Process:
• Business Understanding
• Data Understanding
• Data Preparation
• Modeling
• Evaluation
• Deployment

Invert the Activity Level Bathtub:
• Collection (Easy and Fast)
• Analysis (Maximize Time Spent)
• Communications (Easy and Fast)

Digital Government Strategy:
• Unstructured is Structured
• Unstructured and Structured Are Integrated
• Well-defined URLs
• Content (XML, Java, and APIs with Non-Web Formats Like PDF Converted)
• Data Ecosystem

Page 7

DTIC Site Map

http://www.dtic.mil/dtic/sitemap.html

Page 8

DTIC Thesaurus

http://www.dtic.mil/dtic/stresources/techreports/dticSearchTools/thesaurus_desc.html

Page 9

DTIC Subject Categories

http://www.dtic.mil/dtic/stresources/techreports/subjCategory/index.html

Page 11

Knowledge Base Spreadsheet Linked Data Index: Excel

http://semanticommunity.info/@api/deki/files/31840/DTIC.xls?origin=mt-web
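One way such a spreadsheet index of links could be assembled is sketched below, using only the Python standard library. This is an assumption about the general approach, not a description of how DTIC.xls was actually produced: the `SAMPLE_HTML` markup is hypothetical, the two paths are borrowed from URLs shown on earlier slides, and the `title`/`url` column names are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the site's base URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._text).split())
            self.rows.append((title, self._href))
            self._href = None


# Hypothetical site-map fragment; the real page markup is not shown in the slides.
SAMPLE_HTML = """
<ul>
  <li><a href="/dtic/sitemap.html">Site Map</a></li>
  <li><a href="/dtic/stresources/techreports/subjCategory/index.html">Subject Categories</a></li>
</ul>
"""


def build_index(html, base_url):
    """Return one (title, url) row per link found in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.rows


def to_csv(rows):
    """Serialize index rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "url"])  # assumed column names
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_index(SAMPLE_HTML, "http://www.dtic.mil/")
print(to_csv(rows))
```

Each row pairs a human-readable title with a well-defined URL, which is the kind of table that can then be loaded into Excel or Spotfire.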

Page 13

Semantic Search: Semantic Insights

http://www.semanticinsights.com/applications/index.htm

My Note: I requested use of Research Assistant and Research Librarian on DTIC Content.

Page 14

Results: Conclusions and Next Steps

• A Data Scientist Has Built a DTIC Data Ecosystem That Answers Four Basic Questions, Supports Data Mining, Inverts the Bath Tub, and Complies With the Digital Government Strategy.

• The DTIC Data Ecosystem Was Built From the DTIC Web Site Map and Satisfies the RFI Requirements.

• The DTIC Data Ecosystem Provides Semantic Search and Visualizations in MindTouch, Excel, and Spotfire.

• Semantic Community Has Requested the Use of Research Assistant and Research Librarian Betas from Semantic Insights For Use on DTIC Content.