SQL Saturday 111 Atlanta: Applied Enterprise Semantic Mining

Post on 05-Dec-2014

DESCRIPTION

SQL Server 2012 debuts a new Semantic Platform (commonly known by its applied task, Semantic Search). This text mining technology builds on the established Full-Text Index and creates semantic indexes in a two-phase process. This presentation provides a scientific description and a demo of the Enterprise implementation of the Tag Index and the Document Similarity Index. At RTM, the indexes support 15 languages. Also included are strategy tips on how best to combine the technology with existing Microsoft text mining and data mining.

TRANSCRIPT

Applied Enterprise Semantic Mining
Mark Tabladillo Ph.D.
Data Mining Architect
SQL Saturday Atlanta, April 14, 2012

About MarkTab

20+ Years in Atlanta
Consulting since 1998; Incorporated 2003
Part-Time Faculty at University of Phoenix

SAS and Microsoft Expert
Presenter since 1998 at conferences like TechEd and SAS Global Forum

http://marktab.com @MarkTabNet

Introduction

SQL Server 2012 has new Programmability Enhancements

Statistical Semantic Search
File Tables
Full-Text Search Improvements

These combined technologies make SQL Server 2012 a strong contender in text mining

PROBLEM STATEMENT

Challenges

Building and maintaining applications with relational and non-relational data is hard

Complex integration
Duplicated functionality
Compensation for unavailable services

80% of all data is not stored in databases! Most of it is “unstructured”

(2012, Michael Rys, Microsoft)

MICROSOFT AND GOOGLE

History

July 2008
Microsoft purchases Powerset for US$100 Million
Google dismisses Semantic Search
http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html

History

March 2009
Google announces “snippets” as relevant to search
The media picks this story up as “semantic search”
http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html

History

February 2012
Google announces Knowledge Graph, an explicit application of semantic search
http://mashable.com/2012/02/13/google-knowledge-graph-change-search/

History

April 2012
Microsoft purchases 800+ patents from AOL for US$1 Billion
Among the patents are semantic search and metadata querying – older than Google
http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/

PURPOSE STATEMENT (SQL SERVER)

Goals

Reduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing Unstructured Data and allow building Rich Application Experience on top

(2012, Michael Rys, Microsoft)

NEW IN SQL SERVER 2012
http://msdn.microsoft.com/en-us/library/cc645577.aspx

Statistical Semantic Search

Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
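The two capabilities above map onto two rowset functions in SQL Server 2012, SEMANTICKEYPHRASETABLE and SEMANTICSIMILARITYTABLE. The following is a minimal sketch, not taken from the slides. It assumes a hypothetical table dbo.Documents with an integer key and a column Content that already has a full-text index created with STATISTICAL_SEMANTICS; the table, column, and key value are illustrative only.

```sql
-- Assumption: dbo.Documents(DocumentID int, Content ...) is a hypothetical table
-- whose Content column is full-text indexed with STATISTICAL_SEMANTICS.
DECLARE @DocID int = 1;  -- hypothetical key of the source document

-- Key phrases found in one document, highest score first.
SELECT TOP (10) kp.keyphrase, kp.score
FROM SEMANTICKEYPHRASETABLE(dbo.Documents, Content, @DocID) AS kp
ORDER BY kp.score DESC;

-- Documents most similar to that document, by similarity score.
SELECT TOP (10) sim.matched_document_key, sim.score
FROM SEMANTICSIMILARITYTABLE(dbo.Documents, Content, @DocID) AS sim
ORDER BY sim.score DESC;
```

Both functions return ordinary rowsets, so the results can be joined back to the base table on the document key to show titles or file names next to the scores.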

FileTables

Built on existing SQL Server FILESTREAM technology
Files and documents

Stored in special tables in SQL Server
Accessed as if they were stored in the file system
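As a rough illustration of the FileTable idea, the sketch below creates one and lists its contents. It assumes FILESTREAM is already enabled on the instance and that the database has a FILESTREAM filegroup; the database name DocDemo, the directory names, and the table name are hypothetical.

```sql
-- Assumption: FILESTREAM is enabled on the instance and DocDemo (hypothetical)
-- already has a FILESTREAM filegroup.
ALTER DATABASE DocDemo
SET FILESTREAM (NON_TRANSACTED_ACCESS = FULL, DIRECTORY_NAME = N'DocDemo');
GO

USE DocDemo;
GO

-- A FileTable has a fixed schema, so no column list is given.
CREATE TABLE dbo.DocumentStore AS FILETABLE
WITH (
    FILETABLE_DIRECTORY = N'DocumentStore',
    FILETABLE_COLLATE_FILENAME = database_default
);
GO

-- Files copied into the FileTable share through Windows appear as rows here.
SELECT name,
       file_type,
       cached_file_size,
       file_stream.GetFileNamespacePath(1) AS unc_path
FROM dbo.DocumentStore
WHERE is_directory = 0;
```

The file_stream column of a FileTable can itself be full-text indexed, with file_type serving as the type column.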

Full-Text Search Enhancements

Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
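The first two enhancements can be sketched in T-SQL as follows. This is an illustration under assumptions, not the presenter's demo: dbo.Documents, its Content column, and the property list name DocProperties are hypothetical, and property search only returns results when the installed iFilter for the file type extracts that property.

```sql
-- Assumption: dbo.Documents(DocumentID, Content) is a hypothetical table
-- that already has a full-text index on Content.

-- Property search needs a search property list that names the property.
CREATE SEARCH PROPERTY LIST DocProperties;
GO
ALTER SEARCH PROPERTY LIST DocProperties
ADD 'Title'
WITH (
    PROPERTY_SET_GUID = 'F29F85E0-4FF9-1068-AB91-08002B27B3D9',
    PROPERTY_INT_ID = 2,
    PROPERTY_DESCRIPTION = 'System.Title - Title of the item.'
);
GO
-- Attaching the list triggers a repopulation of the index by default.
ALTER FULLTEXT INDEX ON dbo.Documents SET SEARCH PROPERTY LIST DocProperties;
GO

-- Property search: match only within the Title property of each document.
SELECT DocumentID
FROM dbo.Documents
WHERE CONTAINS(PROPERTY(Content, 'Title'), 'semantic');

-- Customizable NEAR: both terms within 5 search terms of each other,
-- in the order given (TRUE enforces the order).
SELECT DocumentID
FROM dbo.Documents
WHERE CONTAINS(Content, 'NEAR((semantic, search), 5, TRUE)');
```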

HOW DOES SEMANTIC SEARCH WORK?

From Documents to Output

[Diagram: input documents (Varchar, NVarchar, Office, PDF) lead to rowset output with scores]

“Beyond Relational” vs. “Adoption”

Start with unstructured (meaning non-relational) data
Use Windows technology

Reading and Writing Files (Win32 API)
iFilters for reading proprietary formats

Develop indexed structure from unstructured data

(iFilter Required)

[Diagram components: Documents; iFilters; Full-Text Keyword Index “FTI”; Semantic Key Phrase Index – Tag Index “TI”; Semantic Document Similarity Index “DSI”; Semantic Database]

“iFilter”?

iFilters are components that allow search services to index the content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL Server, Exchange, Windows Search).

Microsoft Office 2010 Filters Pack

Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote Filter
Visio Filter
Publisher Filter
Open Document Format Filter

Adobe PDF iFilter 9 for 64-bit platforms

Allows PDF search
Not currently supported for Windows 7

But I used it anyway ☺
Add the Bin directory to your path

Computer (right click), Properties, Advanced System Settings, Environment Variables
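After installing a filter pack, SQL Server still has to be told to load filters registered with the operating system. The sketch below shows one common way to do that and to check the result. It is an assumption about the setup steps rather than something stated in the slides, and the file extensions in the check are examples only.

```sql
-- Allow SQL Server full-text search to load iFilters and word breakers
-- that are registered with the operating system.
EXEC sp_fulltext_service @action = 'load_os_resources', @value = 1;
GO

-- Restart the SQL Server instance (outside this script) so the
-- filter daemon host picks up the newly installed filters.

-- Then confirm that the expected document types are visible.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN ('.docx', '.pptx', '.xlsx', '.pdf');
```

If an extension is missing from the result, documents of that type in a varbinary(max) or FileTable column will not be full-text indexed, and so will not be semantically indexed either.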

“Semantic Language Statistics Database”?

This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
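The Semantic Language Statistics Database is installed separately, then attached and registered on the instance. A minimal sketch follows. The file paths are assumptions and should be replaced with wherever the database files were extracted.

```sql
-- Assumption: the .mdf/.ldf files were extracted to this (hypothetical) folder.
CREATE DATABASE semanticsdb
ON (FILENAME = N'C:\Microsoft Semantic Language Database\semanticsdb.mdf'),
   (FILENAME = N'C:\Microsoft Semantic Language Database\semanticsdb_log.ldf')
FOR ATTACH;
GO

-- Register the attached database as the instance's language statistics database.
EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = N'semanticsdb';
GO

-- Verify the registration; an empty result means nothing is registered.
SELECT * FROM sys.fulltext_semantic_language_statistics_database;
```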

Languages Currently Supported

Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
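The list of supported languages can be checked on a given instance with a catalog view. This sketch assumes a semantic language statistics database is already registered; without one, the view returns no rows.

```sql
-- Languages whose statistics models are available for semantic indexing.
SELECT lcid, name
FROM sys.fulltext_semantic_languages
ORDER BY name;
```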

PERFORMANCE

Phases of Semantic Indexing

Full Text Keyword Index “FTI”

Semantic Key Phrase Index – Tag Index “TI”

Semantic Document Similarity Index “DSI”

http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
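The phases above are triggered by a single full-text index definition that includes the STATISTICAL_SEMANTICS keyword. The sketch below is illustrative only: the table, catalog, and constraint names are hypothetical, and LANGUAGE 1033 (English) is an assumed choice.

```sql
-- Hypothetical table holding documents as binary content plus a file extension.
CREATE TABLE dbo.Documents (
    DocumentID    int            NOT NULL CONSTRAINT PK_Documents PRIMARY KEY,
    FileExtension nvarchar(8)    NOT NULL,  -- e.g. '.docx'; selects the iFilter
    Content       varbinary(max) NOT NULL
);
GO

CREATE FULLTEXT CATALOG DocCatalog AS DEFAULT;
GO

-- STATISTICAL_SEMANTICS requests the key phrase and document similarity
-- indexes in addition to the full-text keyword index.
CREATE FULLTEXT INDEX ON dbo.Documents (
    Content TYPE COLUMN FileExtension LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX PK_Documents
ON DocCatalog
WITH CHANGE_TRACKING AUTO;
GO

-- Watch the keyword and key phrase population.
SELECT table_id, status_description
FROM sys.dm_fts_index_population;

-- Watch the document similarity population.
SELECT table_id, document_count, document_processed_count, status_description
FROM sys.dm_fts_semantic_similarity_population;
```

Semantic results for a document become available only after both populations have processed it, so queries issued right after loading data may return fewer rows than expected.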

Integrated Full Text Search (iFTS)

Improved Performance and Scale:
Scale-up to 350M documents for storage and search
iFTS query performance 7-10 times faster than in SQL Server 2008
Worst-case iFTS query response times less than 3 sec for corpus
Similar or better than main database search competitors

(2012, Michael Rys, Microsoft)

Linear Scale of FTI/TI/DSI

First known linearly scaling end-to-end Search and Semantic product in the industry

[Chart: Time in Seconds vs. Number of Documents]
(2011 – K. Mukerjee, T. Porter, S. Gherman – Microsoft)

Conclusion

SQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft invites millions of documents for enterprise-level applications

Network

MarkTab Consulting: http://marktab.com

Blog: http://marktab.net

Twitter: @marktabnet

APPENDIX

References

Video
http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32

Semantic Search (Books Online) – explains the demo

http://msdn.microsoft.com/en-us/library/gg492075.aspx

Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf

Demo: My Semantic Search Sample

http://mysemanticsearch.codeplex.com/

Requires:
iFilters
Semantic Language Statistics Database
IIS7, IIS6, with Windows Authentication
.NET 4.0
Silverlight 4.0
FILESTREAM (complete)

Demo: T-SQL and Documents

Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx
