"updates on semantic fingerprinting", francisco webber, inventor and co-founder of...
TRANSCRIPT
© cortical.io inc. 2016
Semantic Folding
Language Intelligence made easy
Francisco De Sousa Webber
Co-Founder and General Manager
What is Cortical.io?
We explore & expand Semantic Folding Theory
We spread & sell Semantic Folding Technology
We build & grow Cortical.io as the “Oracle” for Semantic Processing
NLP Market Problem
Natural Language Processing Technology:
• All systems are based on statistics - low differentiability
• Hard to build - high level of expertise needed
• Inaccurate compared to humans - low precision
• Have complex tuning procedures - hard to deploy
• Slow and inefficient compared to humans - hard to scale
Weakness Compensated with:
• Human metadata management - for differentiability
• Human specialists - for expertise
• Human correction - for precision
• Human generated gold-standards - for tuning
Business-NLP is currently very expensive.
Solution: Language Intelligence
Hierarchical Temporal Memory Theory
• By Jeff Hawkins (Silicon Valley, California)
• numenta.com technical implementation & IP
• Processing algorithm of the human brain (neo-cortex)
+
Semantic Folding Theory
• By Francisco De Sousa Webber (Vienna, Austria)
• cortical.io technical implementation & IP
• Processing language-data like the human brain
Cortical Constraints
• Neocortex is a 2D sheet of repeating Modular Assemblies of neurons with binary inputs.
• Neocortex is a Memory System, not a processor.
• Neocortex stores Pattern Sequences.
• Neocortex is an Online Learning system.
• Neocortex is only Trained by Exposure to real-world data.
• All data fed into the neocortex must have Sparse Distributed Representation (SDR) format:
  • SDRs are very long Binary Vectors with max. 2% of “1”.
  • Every SDR-bit is a self-contained Semantic Feature of the world (via sensorial input).
  • Every SDR-bit is an Explicit Part of the signal.
  • Similar “things” have similar SDRs.
  • The Union of SDRs maintains all information of its members.
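The SDR properties listed above can be illustrated with a small sketch. It is not the HTM or Cortical.io implementation: the vector length of 16384 bits, the random bit patterns and the helper names are assumptions chosen for illustration. An SDR is modeled as the set of positions whose bit is "1".

```python
import random

N_BITS = 16384                      # assumed vector length, not stated on the slide
MAX_ACTIVE = int(N_BITS * 0.02)     # at most 2% of the bits are "1"

def random_sdr(rng, n_active=MAX_ACTIVE):
    """Return a toy SDR as a frozenset of active bit positions."""
    return frozenset(rng.sample(range(N_BITS), n_active))

def sparsity(sdr):
    """Fraction of bits that are set."""
    return len(sdr) / N_BITS

def overlap(a, b):
    """Number of bit positions set in both SDRs."""
    return len(a & b)

rng = random.Random(0)
cat, dog, car = (random_sdr(rng) for _ in range(3))

# The union of two SDRs still contains every bit of each member.
union = cat | dog
union_keeps_members = cat <= union and dog <= union
```

Because the vectors are so sparse, an unrelated random SDR such as `car` shares only a handful of bits with `union`, while each member shares all of its bits.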
Virtual Word Layer
Virtualization into Retina
[Figure labels: hear, see, touch; word (SDR); Word sensor stream; Word production stream; Symbol input; Symbol output; Muscles; Motor output]
Offline Process
RetinaDB Generation
Retina Training defines the Semantic Universe.
Training Collection specifies all vocabulary, linguistic properties and knowledge.
The Semantic Folding Engine generates a Semantic Map.
Every utterance is positioned within the Semantic Space.
Every term is defined by its distributed selection of utterances/contexts.
A topographic bit-vector is generated for each term of the corpus.
[Figure labels: Training Collection, Preprocessing, Semantic Folding Engine, Fingerprint Generation]
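The idea that every term is defined by the contexts it occurs in can be sketched in a few lines. This is a toy illustration, not the Semantic Folding Engine: here each context simply gets the next free position in reading order, whereas the slide describes a Semantic Map on which utterances are positioned by meaning. The function name and the example contexts are hypothetical.

```python
from collections import defaultdict

def build_fingerprints(contexts):
    """Map each term to the set of positions of the contexts it occurs in."""
    fingerprints = defaultdict(set)
    for position, context in enumerate(contexts):
        # Assumption: position = reading order; a real map would place
        # similar contexts close to each other.
        for term in set(context.lower().split()):
            fingerprints[term].add(position)
    return dict(fingerprints)

contexts = [
    "the organ played in the church",
    "the piano and the organ are instruments",
    "the liver is an organ of the body",
]
fps = build_fingerprints(contexts)
```

In this toy corpus `fps["organ"]` covers all three contexts, while `fps["church"]` and `fps["liver"]` each cover one, so the ambiguous word shares bits with both of its senses.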
Retina-API Operation
The generated topographic bit-vectors are called Semantic Fingerprints.
The Semantic Fingerprints are stored in the highly performant RetinaDB.
The RetinaDB is a complete Language Model.
The Retina-API provides functions to convert, compare, dissect, classify and extract text.
The user application interacts via a REST Interface.
[Figure labels: Offline Process (Training Collection, Preprocessing, Semantic Folding Engine, Fingerprint Generation); Online Process (RetinaDB, Retina API, REST call, User Application); outputs: Functions out, Fingerprint out, Compare out]
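A user application talking to the REST interface might prepare a compare call roughly as follows. This is a sketch under assumptions: the host comes from the API link at the end of this deck, but the `/rest/compare` path, the `retina_name` parameter, the `api-key` header and the JSON body layout are not given on the slides and should be checked against the API documentation. The request is only built here, not sent.

```python
import json
import urllib.request

# Host from the deck's API link; the "/rest" prefix is an assumption.
BASE_URL = "http://api.cortical.io/rest"

def build_compare_request(text_a, text_b, api_key, retina="en_associative"):
    """Build (but do not send) a POST request that compares two texts."""
    # Assumed endpoint and query parameter names.
    url = f"{BASE_URL}/compare?retina_name={retina}"
    body = json.dumps([{"text": text_a}, {"text": text_b}]).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

req = build_compare_request(
    "teens like using itunes on their iphone",
    "he consumes chart hits on his notebook",
    api_key="YOUR_API_KEY",  # placeholder
)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` and decoding the JSON response.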
Tuning The Retina
“cholecystitis”
Aligning Semantic Spaces
philosophy philosophie filosofía философия فلسفة
Concepts and their representations are stable across languages.
EN FR ES RU AR ZH
The Cortical.io Retina Technology …
… converts any text into a semantic fingerprint.
teens like playing good music with their mobile phones
Fingerprint Generation
Step 1: Word Fingerprints
[Figure: word fingerprints for organ, piano, church, liver]
Step 2: Text Fingerprints
teens like to hear music on their mobile phones
aggregation + sparsification
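One way to read "aggregation + sparsification" is sketched below: count how many word fingerprints set each bit, then keep only the most frequently set positions. The slides do not spell out the exact rule, so the top-k cut-off, the function name and the tiny example fingerprints are assumptions.

```python
from collections import Counter

def text_fingerprint(word_fingerprints, max_bits):
    """Aggregate word fingerprints and keep only the most frequently set bits."""
    counts = Counter()
    for fp in word_fingerprints:
        counts.update(fp)  # aggregation: how many words set each bit
    # Sparsification: keep the max_bits positions with the highest counts
    # (ties broken by bit position so the result is deterministic).
    ranked = sorted(counts, key=lambda bit: (-counts[bit], bit))
    return set(ranked[:max_bits])

# Hypothetical word fingerprints with only a few bits each.
words = {
    "teens": {1, 2, 3, 10},
    "music": {2, 3, 4, 11},
    "phones": {3, 4, 5, 12},
}
fp = text_fingerprint(words.values(), max_bits=3)
```

Here bits 2, 3 and 4 are shared by at least two words and survive, while bits set by a single word are dropped.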
Similar meanings …
… look similar
37% overlap
teens like using itunes on their iphone
he consumes chart hits on his notebook
Different meanings …
… look different
5% overlap
teens like using itunes on their iphone
the fishermen are sailing out of the harbor
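An overlap percentage between two fingerprints could be computed as in this sketch. The slides do not say how the overlap is normalized; dividing by the smaller fingerprint is an assumption, and the three fingerprints are made-up bit sets constructed to reproduce the 37% and 5% figures, not real Retina output.

```python
def overlap_percent(fp_a, fp_b):
    """Shared bits as a percentage of the smaller fingerprint (assumed measure)."""
    if not fp_a or not fp_b:
        return 0.0
    return 100.0 * len(fp_a & fp_b) / min(len(fp_a), len(fp_b))

# Toy fingerprints of 100 bits each, built to mirror the slide numbers.
itunes = set(range(0, 100))      # "teens like using itunes on their iphone"
charts = set(range(63, 163))     # shares 37 positions with itunes
harbor = set(range(95, 195))     # shares 5 positions with itunes

similar = overlap_percent(itunes, charts)
different = overlap_percent(itunes, harbor)
```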
Evaluation
There are very few comparable algorithms: a couple of academic ones that cannot be readily used for production purposes, and Google’s Word2Vec.

                              MEN-3K   RG-65   WS-353
word2vec (Google)             55,2     44,8    54,7
Yu & Dredze (2014)            50,1     47,1    53,7
cortical.io Retina            67,4     71,3    62,2
% better than word2vec        18,1     37,2    12,1
% better than Yu & Dredze     25,7     33,9    13,7

Sources:
The MEN Test Collection: http://clic.cimec.unitn.it/~elia.bruni/MEN.html
The RG-65 Test Collection: http://www.aclweb.org/aclwiki/index.php?title=RG-65_Test_Collection_(State_of_the_art)
The WordSimilarity-353 Test Collection: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
Yu & Dredze 2014: http://arxiv.org/pdf/1411.4166.pdf
Distributed representations of words and phrases: http://papers.nips.cc/paper/5021-di
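The two "% better" rows are consistent with the gap between the Retina score and the competing score, expressed as a percentage of the Retina score. The sketch below recomputes them from the three score rows; the variable and function names are illustrative.

```python
# Scores per benchmark in the order MEN-3K, RG-65, WS-353 (from the table).
scores = {
    "word2vec (Google)": (55.2, 44.8, 54.7),
    "Yu & Dredze (2014)": (50.1, 47.1, 53.7),
}
retina = (67.4, 71.3, 62.2)

def gap_percent(retina_score, other_score):
    """Gap to the Retina score as a percentage of the Retina score."""
    return round(100.0 * (retina_score - other_score) / retina_score, 1)

improvement = {
    name: tuple(gap_percent(r, o) for r, o in zip(retina, row))
    for name, row in scores.items()
}
```

For example, on MEN-3K against word2vec: (67.4 - 55.2) / 67.4 ≈ 0.181, which is the 18,1 shown in the table.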
Semantic Folding Products
Enterprise Semantic Search
• Document Retrieval
• Expert Finding
• Knowledge Management
Big Text Data
• Semantic Streaming Text Filter
• (Social) Media Monitoring
• Business Intelligence & Analytics
Semantic Matching
• Natural Language based Automation
• Content Personalisation
• Semantic Profiling
[Figure labels: similarity engine, example document, query, document index, result set ranking, most similar documents ordered along the user's information need; #finance, #markets, #mobile, #movies, #products, #trend, Topic of interest, Analytics; Match Making, Enterprise Application, Web App]
Cloudera Integration
Semantic Search
[Figure labels: query (example document), similarity engine, document index, result set ranking, most similar documents ordered along the user's information need]
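Ranking a document index against an example document can be sketched as sorting by the number of shared fingerprint bits. This is an illustrative toy, not the product's similarity engine: the document ids, the fingerprints and the plain overlap-count score are assumptions.

```python
def rank_by_similarity(query_fp, index, top_k=3):
    """Return document ids ordered by shared fingerprint bits with the query."""
    scored = [(len(query_fp & fp), doc_id) for doc_id, fp in index.items()]
    # Highest overlap first; ties broken alphabetically by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Documents without any shared bit are left out of the result set.
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]

# Hypothetical document index: id -> fingerprint (set of active bits).
index = {
    "doc_music": {1, 2, 3, 4},
    "doc_phones": {3, 4, 5, 6},
    "doc_fishing": {20, 21, 22},
}
query = {1, 2, 3, 5}  # fingerprint of the example document
results = rank_by_similarity(query, index)
```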
Semantic Content Filter
real-time, across languages, intelligent, meaning based
[Figure labels: #finance, #markets, #mobile, #movies, #products, #trend; Topic of interest; Analytics]
Example: Twitter Filter
The State of the Art
Desired topic: every tweet related to smart phones
200 catch words: mobile phone, iPhone, cell, Android, …, sim-card, text message, network, Verizon, Apple, Google, …
5 words per tweet
20,000 tweets/sec
Required throughput for one filter: 200 × 5 × 20,000 = 20,000,000 comparisons per second
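The throughput figure follows from comparing every word of every tweet with every catch word. The sketch below reproduces the arithmetic and shows a deliberately naive matcher that counts its own comparisons; the function name and the sample tweet are illustrative.

```python
CATCH_WORDS = 200
WORDS_PER_TWEET = 5
TWEETS_PER_SECOND = 20_000

# Every tweet word is compared with every catch word, for every tweet.
comparisons_per_second = CATCH_WORDS * WORDS_PER_TWEET * TWEETS_PER_SECOND

def naive_match(tweet_words, catch_words):
    """Compare each tweet word with each catch word; return (matched, comparisons)."""
    comparisons = 0
    matched = False
    for word in tweet_words:
        for catch in catch_words:
            comparisons += 1
            if word == catch:
                matched = True
    return matched, comparisons

result = naive_match(["my", "new", "android", "is", "great"], ["android", "verizon"])
```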
The State of the Art
Cost per Filter: $ 10,000+ per Month
Example: Twitter Filter
Semantic Fingerprinting
[Figure labels: twitter firehose, stream of semantic fingerprints, Filter Fingerprint, match, not matching, realtime content sub-stream]
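Filtering a stream of fingerprints against one filter fingerprint needs a single set comparison per tweet instead of one comparison per catch word. A minimal sketch, assuming a fixed overlap threshold; the threshold, the function name and the toy fingerprints are not from the slides.

```python
def filter_stream(fingerprints, filter_fp, min_overlap):
    """Yield (id, fingerprint) pairs whose overlap with filter_fp reaches min_overlap."""
    for item_id, fp in fingerprints:
        if len(fp & filter_fp) >= min_overlap:
            yield item_id, fp

# Hypothetical filter fingerprint and incoming tweet fingerprints.
filter_fp = {1, 2, 3, 4, 5}
stream = [
    ("t1", {1, 2, 3, 9}),    # 3 shared bits: match
    ("t2", {7, 8, 9}),       # 0 shared bits: not matching
    ("t3", {3, 4, 5, 6}),    # 3 shared bits: match
]
matches = [item_id for item_id, _ in filter_stream(stream, filter_fp, min_overlap=3)]
```

Because `filter_stream` is a generator, it can consume an unbounded stream and emit the matching sub-stream as items arrive.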
Cortical.io Streaming Text Filter
Convert 100.000+ tweets per second (one per firehose)
+
1.000+ semantic filters (scalable with number of Filters)
Cost per Filter: $ 10 per Month
Dynamic Topic Pattern Analysis
Topic Monitoring
Unseen topics or sudden topic jumps are detected
Compliance Monitoring
Appearance of unseen topic clusters
[Figure: ongoing e-mail conversation plotted over time]
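Detecting an unseen topic in an ongoing conversation could look like the sketch below: keep the union of all fingerprints seen so far and flag a message when too few of its bits are already known. The threshold, the function name and the toy fingerprints are assumptions, not the method used in the product.

```python
def detect_unseen_topics(messages, max_known_fraction=0.5):
    """Return ids of messages whose fingerprint lies mostly outside everything seen so far."""
    seen = set()
    flagged = []
    for msg_id, fp in messages:
        known = len(fp & seen) / len(fp) if fp else 1.0
        # The first message cannot be a jump, since nothing has been seen yet.
        if seen and known < max_known_fraction:
            flagged.append(msg_id)
        seen |= fp  # the union keeps the bits of every earlier message
    return flagged

# Hypothetical e-mail conversation as (id, fingerprint) pairs in time order.
conversation = [
    ("m1", {1, 2, 3, 4}),
    ("m2", {2, 3, 4, 5}),
    ("m3", {40, 41, 42, 43}),   # sudden jump to a new topic
    ("m4", {41, 42, 43, 44}),   # continues the new topic
]
flagged = detect_unseen_topics(conversation)
```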
Bridging the Vocabulary Gap
Similar meanings “look” similar
Special “Financial Retina”
Words: fraud
Expressions: corruption AND mafia
Idioms: “anti human trafficking”
Text: Money laundering is the process of transforming the proceeds of crime into ostensibly legitimate money or other assets.
Combine Fingerprints with AI Algorithms
Text Anomaly Detection
7. Enabling Artificial Intelligence Applications
Realtime Anomaly Detection in Text Streams:
• Chat
• Message Forums
• Blog Posts
• Facebook Posts
• any Text Stream
Combine Fingerprints with AI Algorithms
http://www.cortical.io/demos/semantic-anomaly-detection/
website: cortical.io
product: https://aws.amazon.com/marketplace/pp/B00T5794P6/
twitter: http://twitter.com/cortical_io
video: https://www.youtube.com/watch?v=g3ZxJokDpds
demos: http://www.cortical.io/demos.html
API: http://api.cortical.io
Numenta: http://numenta.com