SQL Saturday 111 Atlanta: Applied Enterprise Semantic Mining
DESCRIPTION
SQL Server 2012 debuts a new Semantic Platform (commonly known as the applied task, Semantic Search). This text mining technology leverages the already established Full Text Index, and builds semantic indexes in a two-phase process. This presentation provides a science description and demo for the Enterprise implementation of Tag Index and Document Similarity Index. At present (RTM), the indexes work for 15 languages. Included are strategy tips for how to best leverage the technology along with already-existing Microsoft text mining and data mining.
TRANSCRIPT
Applied Enterprise Semantic Mining
Mark Tabladillo Ph.D.
Data Mining Architect
SQL Saturday Atlanta, April 14, 2012
About MarkTab
20+ Years in Atlanta
Consulting since 1998; Incorporated 2003
Part-Time Faculty at University of Phoenix
SAS and Microsoft Expert
Presenter since 1998 at conferences like TechEd and SAS Global Forum
http://marktab.com @MarkTabNet
Introduction
SQL Server 2012 has new Programmability Enhancements
Statistical Semantic Search
FileTables
Full-Text Search Improvements
These combined technologies make SQL Server 2012 a strong contender in text mining
PROBLEM STATEMENT
Challenges
Building and Maintaining Applications with relational and non-relational data is hard
Complex integration
Duplicated functionality
Compensation for unavailable services
80% of all data is not stored in databases! Most of it is “unstructured”
(2012, Michael Rys, Microsoft)
MICROSOFT AND GOOGLE
History
July 2008
Microsoft purchases Powerset for US$100 million
Google dismisses semantic search
http://venturebeat.com/2008/06/26/microsoft-to-buy-semantic-search-engine-powerset-for-100m-plus/
http://www.forbes.com/2008/07/01/powerset-msft-search-tech-intel-cx_ag_0701powerset.html
History
March 2009
Google announces “snippets” as relevant to search
The media picks this story up as “semantic search”
http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html#!/2009/03/two-new-improvements-to-google-results.html
History
February 2012
Google announces Knowledge Graph, an explicit application of semantic search
http://mashable.com/2012/02/13/google-knowledge-graph-change-search/
History
April 2012
Microsoft purchases 800+ patents from AOL for US$1 billion
Among the patents are semantic search and metadata querying patents that predate Google
http://www.theregister.co.uk/2012/04/09/aol_microsoft_patent_deal/
PURPOSE STATEMENT (SQL SERVER)
Goals
Reduce the cost of managing all data
Simplify the development of applications over all data
Provide management and programming services for all data
Make SQL Server the preferred choice for managing unstructured data and allow building rich application experiences on top
(2012, Michael Rys, Microsoft)
NEW IN SQL SERVER 2012
http://msdn.microsoft.com/en-us/library/cc645577.aspx
Statistical Semantic Search
Identifies statistically relevant key phrases
Based on these phrases, can identify (by score) similar documents
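To make the idea of scored key phrases and scored document similarity concrete, here is a minimal Python sketch. It is an illustration only: it uses plain TF-IDF weights and cosine similarity, which are stand-in techniques chosen for this example and are not claimed to be what SQL Server uses internally. The function names and sample documents are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and keep purely alphabetic words; a real word breaker is far richer.
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

def top_phrases(vector, k=3):
    """Highest-weighted terms of one document, ties broken alphabetically."""
    return sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample corpus (not from the presentation).
docs = [
    "semantic search finds key phrases in documents",
    "semantic search ranks similar documents by score",
    "file tables store files inside the database",
]
vectors = tfidf_vectors(docs)
print(top_phrases(vectors[0]))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus the first two documents share vocabulary, so they score as more similar to each other than either does to the third.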
FileTables
Built on existing SQL Server FILESTREAM technology
Files and documents
Stored in special tables in SQL Server
Accessed as if they were stored in the file system
Full-Text Search Enhancements
Property search: search on tagged properties (such as author or title)
Customizable NEAR: find words or phrases close to one another
New Word Breakers and Stemmers (for many languages)
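As a rough illustration of what a proximity ("near") match means, the Python sketch below checks whether two words occur within a chosen number of intervening words. The function name `near` and the definition of distance as the count of words between the two terms are assumptions made for this example; it does not reproduce SQL Server's query syntax or its exact matching rules.

```python
def near(text, first, second, max_distance):
    """Return True if `first` and `second` occur with at most
    `max_distance` other words between them (assumed definition)."""
    # Strip simple punctuation and lowercase each word.
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        i != j and abs(i - j) - 1 <= max_distance
        for i in first_pos
        for j in second_pos
    )

# Hypothetical sample sentence.
text = "SQL Server full text search supports proximity queries"
print(near(text, "full", "search", 1))   # one word ("text") lies between them
print(near(text, "sql", "queries", 3))   # six words lie between them
```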
HOW DOES SEMANTIC SEARCH WORK?
[Diagram labels: Rowset Output with Scores; Varchar, NVarchar, Office]
From Documents to Output
“Beyond Relational” vs. “Adoption”
Start with unstructured (meaning non-relational) data
Use Windows technology
Reading and Writing Files (Win32 API)
iFilters for reading proprietary formats
Develop indexed structure from unstructured data
[Diagram labels: Documents (iFilter Required); iFilters; Full-Text Keyword Index “FTI”; Semantic Key Phrase Index (Tag Index, “TI”); Semantic Document Similarity Index “DSI”; Semantic Database]
“iFilter”?
iFilters are components that allow search services to index the content of specific file types, letting you search for content in those files. They are intended for use with Microsoft Search Services (SharePoint, SQL Server, Exchange, Windows Search).
Microsoft Office 2010 Filters Pack
Legacy Office Filter (97-2003; .doc, .ppt, .xls)
Metro Office Filter (2007; .docx, .pptx, .xlsx)
Zip Filter
OneNote Filter
Visio Filter
Publisher Filter
Open Document Format Filter
Adobe PDF iFilter 9 for 64-bit platforms
Allows PDF search
Not currently supported for Windows 7
But I used it anyway ☺
Add the Bin directory to your path
Computer (right click), Properties, Advanced System Settings, Environment Variables
“Semantic Language Statistics Database”?
This database contains the statistical language models required by semantic search. A single semantic language statistics database contains the language models for all the languages that are supported for semantic indexing.
Languages Currently Supported
Traditional Chinese
German
English
French
Italian
Brazilian
Russian
Swedish
Simplified Chinese
British English
Portuguese
Chinese (Hong Kong SAR, PRC)
Spanish
Chinese (Singapore)
Chinese (Macau SAR)
PERFORMANCE
Phases of Semantic Indexing
Full Text Keyword Index “FTI”
Semantic Key Phrase Index (Tag Index, “TI”)
Semantic Document Similarity Index “DSI”
http://msdn.microsoft.com/en-us/library/gg492085.aspx#SemanticIndexing
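The layering of these indexes can be pictured with a small Python sketch. It assumes, purely for illustration, that each index is derived only from the one before it: an inverted keyword index is built from the documents, a tag index is derived from the keyword index, and a similarity index is derived from the tag index. The scoring rules (count divided by document frequency, Jaccard overlap) and all names are hypothetical simplifications, not the actual SQL Server implementation.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_keyword_index(docs):
    """Inverted index mapping term -> {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

def build_tag_index(keyword_index, tags_per_doc=3):
    """Pick each document's top terms using only the keyword index."""
    per_doc = defaultdict(dict)
    for term, postings in keyword_index.items():
        for doc_id, count in postings.items():
            # Down-weight terms that appear in many documents.
            per_doc[doc_id][term] = count / len(postings)
    return {
        doc_id: set(sorted(scores, key=lambda t: (-scores[t], t))[:tags_per_doc])
        for doc_id, scores in per_doc.items()
    }

def build_similarity_index(tag_index):
    """Score document pairs by tag overlap (Jaccard), using only the tag index."""
    sim = {}
    for a, b in combinations(sorted(tag_index), 2):
        union = tag_index[a] | tag_index[b]
        sim[(a, b)] = len(tag_index[a] & tag_index[b]) / len(union) if union else 0.0
    return sim

# Hypothetical sample corpus keyed by document id.
docs = {
    1: "semantic search semantic index",
    2: "semantic index tables",
    3: "file tables file stream",
}
fti = build_keyword_index(docs)
ti = build_tag_index(fti)
dsi = build_similarity_index(ti)
print(ti)
print(dsi)
```

The point of the sketch is the dependency order: the later structures never reread the documents, only the index built before them.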
Integrated Full Text Search (iFTS)
Improved performance and scale:
Scale-up to 350M documents for storage and search
iFTS query performance 7-10 times faster than in SQL Server 2008
Worst-case iFTS query response times less than 3 sec for corpus
Similar or better than main database search competitors
(2012, Michael Rys, Microsoft)
Linear Scale of FTI/TI/DSI
First known linearly scaling end-to-end Search and Semantic product in the industry
[Chart: Time in Seconds vs. Number of Documents]
(2011, K. Mukerjee, T. Porter, S. Gherman, Microsoft)
Conclusion
SQL Server 2012 adds new text processing capabilities
This technology scales linearly
Microsoft targets collections of millions of documents for enterprise-level applications
Network
MarkTab Consulting: http://marktab.com
Blog: http://marktab.net
Twitter: @marktabnet
APPENDIX
References
Video
http://channel9.msdn.com/Shows/DataBound/DataBound-Episode-2-Semantic-Search
http://www.microsoftpdc.com/2009/SVR32
Semantic Search (Books Online) – explains the demo
http://msdn.microsoft.com/en-us/library/gg492075.aspx
Paper
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p213.pdf
Demo: My Semantic Search Sample
http://mysemanticsearch.codeplex.com/
Requires:
iFilters
Semantic Language Statistics Database
IIS7, IIS6, with Windows Authentication
.NET 4.0
Silverlight 4.0
FILESTREAM (complete)
Demo: T-SQL and Documents
Naveen Garg
Requires Adventure Works (from Codeplex)
http://blogs.msdn.com/b/sqlfts/archive/2011/07/21/introducing-fulltext-statistical-semantic-search-in-sql-server-codename-denali-release.aspx