Applied Enterprise Semantic Mining
Mark Tabladillo, Ph.D., Data Mining Architect
SQL Saturday Atlanta, April 14, 2012

Upload: mark-tabladillo

Post on 05-Dec-2014

DESCRIPTION

SQL Server 2012 debuts a new Semantic Platform, commonly known by its applied task, Semantic Search. This text mining technology builds on the established Full-Text Index and creates semantic indexes in a two-phase process. This presentation provides a scientific description and a demo of the Enterprise implementation of the Tag Index and the Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips on how best to combine the technology with existing Microsoft text mining and data mining.

TRANSCRIPT

Page 1: Sql Saturday 111 Atlanta applied enterprise semantic mining

Applied Enterprise Semantic Mining
Mark Tabladillo, Ph.D.
Data Mining Architect
SQL Saturday Atlanta, April 14, 2012

Page 2: Sql Saturday 111 Atlanta applied enterprise semantic mining

About MarkTab

20+ Years in Atlanta
Consulting since 1998; Incorporated 2003
Part-Time Faculty at University of Phoenix

SAS and Microsoft Expert
Presenter since 1998 at conferences like TechEd and SAS Global Forum

http://marktab.com @MarkTabNet

Page 3: Sql Saturday 111 Atlanta applied enterprise semantic mining

Introduction

SQL Server 2012 has new Programmability Enhancements

Statistical Semantic Search
FileTables
Full-Text Search Improvements

These combined technologies make SQL Server 2012 a strong contender in text mining

Page 4: Sql Saturday 111 Atlanta applied enterprise semantic mining

PROBLEM STATEMENT

Page 5: Sql Saturday 111 Atlanta applied enterprise semantic mining

Challenges

Building and Maintaining Applications with relational and non-relational data is hard

Complex integration
Duplicated functionality
Compensation for unavailable services

80% of all data is not stored in databases! Most of it is “unstructured”

(2012, Michael Rys, Microsoft)

Page 6: Sql Saturday 111 Atlanta applied enterprise semantic mining

MICROSOFT AND GOOGLE

Page 7: Sql Saturday 111 Atlanta applied enterprise semantic mining

History

July 2008
Microsoft purchases Powerset for US$100 Million
Google Dismisses Semantic Search
http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html

Page 8: Sql Saturday 111 Atlanta applied enterprise semantic mining

History

March 2009
Google announces “snippets” as relevant to search
The media picks this story up as “semantic search”
http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html

Page 9: Sql Saturday 111 Atlanta applied enterprise semantic mining

History

February 2012
Google announces Knowledge Graph, an explicit application of semantic search
http://mashable.com/2012/02/13/google-knowledge-graph-change-search/

Page 10: Sql Saturday 111 Atlanta applied enterprise semantic mining

History

April 2012
Microsoft purchases 800+ patents from AOL for US$1 Billion
Among the patents are semantic search and metadata querying – older than Google
http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/

Page 11: Sql Saturday 111 Atlanta applied enterprise semantic mining

PURPOSE STATEMENT (SQL SERVER)

Page 12: Sql Saturday 111 Atlanta applied enterprise semantic mining

Goals

Reduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top

(2012, Michael Rys, Microsoft)

Page 13: Sql Saturday 111 Atlanta applied enterprise semantic mining

NEW IN SQL SERVER 2012
http://msdn.microsoft.com/en-us/library/cc645577.aspx

Page 14: Sql Saturday 111 Atlanta applied enterprise semantic mining

Statistical Semantic Search

Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
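The two ideas on this slide, scoring key phrases and then scoring similar documents, can be illustrated with a small Python sketch. This is not SQL Server's algorithm, which the slides do not describe; it substitutes plain TF-IDF weighting and cosine similarity as a stand-in, and it works on single words rather than phrases. All function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using TF-IDF weighting."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def key_terms(vector, top_n=3):
    """Highest-weighted terms of one document as (term, score) pairs."""
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_documents(vectors, index):
    """Rank all other documents by similarity to the document at `index`."""
    scores = [
        (i, cosine_similarity(vectors[index], v))
        for i, v in enumerate(vectors)
        if i != index
    ]
    return sorted(scores, key=lambda pair: -pair[1])

# Hypothetical sample corpus, for illustration only.
sample_docs = [
    "full text index stores keywords for search",
    "semantic search finds similar documents using key phrases",
    "key phrases summarize documents for semantic search",
]
sample_vectors = tfidf_vectors(sample_docs)
```

Calling `key_terms(sample_vectors[1])` returns the top-weighted terms for the second document, and `similar_documents(sample_vectors, 1)` ranks the other two documents against it by score. A term that appears in every document (here, "search") gets weight zero, which is the sense in which the remaining terms are "statistically relevant" in this simplified model.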

Page 15: Sql Saturday 111 Atlanta applied enterprise semantic mining

FileTables

Built on existing SQL Server FILESTREAM technology
Files and documents

Stored in special tables in SQL Server
Accessed as if they were stored in the file system

Page 16: Sql Saturday 111 Atlanta applied enterprise semantic mining

Full-Text Search Enhancements

Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
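The idea behind a customizable NEAR search, matching two terms only when they occur close to one another, can be sketched in Python. This is an illustration of proximity matching in general, not of SQL Server's implementation; the function name `near`, its parameters, and the definition of distance (the number of tokens between the two matches) are assumptions made for this example.

```python
PUNCTUATION = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return [w.strip(PUNCTUATION).lower() for w in text.split()]

def near(text, term_a, term_b, max_distance, ordered=False):
    """Return True if term_a and term_b occur with at most `max_distance`
    tokens between them. With ordered=True, term_a must come first."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    for a in positions_a:
        for b in positions_b:
            if a == b:
                continue  # same token cannot match both terms
            if ordered and b < a:
                continue  # term_b appears before term_a
            if abs(b - a) - 1 <= max_distance:
                return True
    return False
```

For the sentence "Semantic search builds on the full-text index in SQL Server", `near(sentence, "search", "index", 5)` is true because four tokens separate the two terms, while a `max_distance` of 3 rejects the match.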

Page 17: Sql Saturday 111 Atlanta applied enterprise semantic mining

HOW DOES SEMANTIC SEARCH WORK?

Page 18: Sql Saturday 111 Atlanta applied enterprise semantic mining

From Documents to Output

[Diagram: documents (Varchar, NVarchar, Office, PDF) lead to rowset output with scores]

Page 19: Sql Saturday 111 Atlanta applied enterprise semantic mining

“Beyond Relational” vs. “Adoption”

Start with unstructured (meaning non-relational) data
Use Windows technology

Reading and Writing Files (Win32 API)
iFilters for reading proprietary formats

Develop indexed structure from unstructured data

Page 20: Sql Saturday 111 Atlanta applied enterprise semantic mining

[Diagram components: Documents; iFilters (iFilter required); Full-Text Keyword Index “FTI”; Semantic Database; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”]

Page 21: Sql Saturday 111 Atlanta applied enterprise semantic mining

“iFilter”?

IFilters are components that allow search services to index content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL, Exchange, Windows Search).

Page 22: Sql Saturday 111 Atlanta applied enterprise semantic mining

Microsoft Office 2010 Filters Pack

Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote Filter
Visio Filter
Publisher Filter
Open Document Format Filter

Page 23: Sql Saturday 111 Atlanta applied enterprise semantic mining

Adobe PDF iFilter 9 for 64-bit platforms

Allows PDF search
Not currently supported for Windows 7

But I used it anyway ☺
Add the Bin directory to your path

Computer (right click), Properties, Advanced System Settings, Environment Variables

Page 24: Sql Saturday 111 Atlanta applied enterprise semantic mining

“Semantic Language Statistics Database”?

This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.

Page 25: Sql Saturday 111 Atlanta applied enterprise semantic mining

Languages Currently Supported

Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)

Page 26: Sql Saturday 111 Atlanta applied enterprise semantic mining

PERFORMANCE

Page 27: Sql Saturday 111 Atlanta applied enterprise semantic mining

Phases of Semantic Indexing

Full Text Keyword Index “FTI”

Semantic Key Phrase Index – Tag Index “TI”

Semantic Document Similarity Index “DSI”

http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing

Page 28: Sql Saturday 111 Atlanta applied enterprise semantic mining

Integrated Full Text Search (iFTS)

Improved Performance and Scale:
Scale-up to 350M documents for storage and search
iFTS query performance 7-10 times faster than in SQL Server 2008
Worst-case iFTS query response times less than 3 sec for corpus
Similar or better than main database search competitors

(2012, Michael Rys, Microsoft)

Page 29: Sql Saturday 111 Atlanta applied enterprise semantic mining

Linear Scale of FTI/TI/DSI

First known linearly scaling end-to-end Search and Semantic product in the industry

[Chart: Time in Seconds vs. Number of Documents]
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)

Page 30: Sql Saturday 111 Atlanta applied enterprise semantic mining

Conclusion

SQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft invites millions of documents for enterprise-level applications

Page 31: Sql Saturday 111 Atlanta applied enterprise semantic mining

Network

MarkTab Consulting: http://marktab.com

Blog: http://marktab.net

Twitter: @marktabnet

Page 32: Sql Saturday 111 Atlanta applied enterprise semantic mining

APPENDIX

Page 33: Sql Saturday 111 Atlanta applied enterprise semantic mining

References

Video
http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32

Semantic Search (Books Online) – explains the demo
http://msdn.microsoft.com/en-us/library/gg492075.aspx

Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf

Page 34: Sql Saturday 111 Atlanta applied enterprise semantic mining

Demo: My Semantic Search Sample

http://mysemanticsearch.codeplex.com/

Requires:

iFilters
Semantic Language Statistics Database
IIS7, IIS6, with Windows Authentication
.NET 4.0
Silverlight 4.0
FILESTREAM (complete)

Page 35: Sql Saturday 111 Atlanta applied enterprise semantic mining

Demo: T-SQL and Documents

Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
