Real Time Semantic Search Engine for Social TV Streams
DESCRIPTION
Social TV, the use of social networks to comment on TV programs, is a growing phenomenon. TV channels and brands are turning to social networks to look for real time insights about their programs. Understanding the global conversation about a program is useful for both. For broadcasters, acquiring insights while a program is on air enables them to produce new content formats that include the social conversation. For brands, it helps to prevent reputation crises and to increase the reach of their marketing efforts. Viewers, who increasingly use second screen devices, should benefit from tools that help them understand opinions around the main content and connect with peers during TV programs or live events.
We present a system that combines natural language processing (Textalytics API) and a scalable semi-structured database/search engine (SenseiDB) to provide semantic and faceted search, real time analytics and support for visualizations in this kind of application. In the first part, we will present some of the NLP methods that can be used to tame unstructured big data such as Twitter or Facebook comments. We will describe tasks like text categorization, sentiment analysis and named entity recognition. We will also see how this data can be related to external data such as Linked Data points. While the description will be general, the examples will be illustrated using the Textalytics API.
Then we will present how this data can be ingested and made available for search in real time using a semi-structured database like SenseiDB. We will present key features of SenseiDB, including high performance real time indexing with simultaneous querying, distribution, and support for full-text and faceted search. We will also discuss how facets can be pushed beyond plain filtering to provide real time analytics and enable semantic search. Finally, we will discuss the advantages, problems and current limitations of SenseiDB.
Takeaway points:
- Analyzing and searching text in social streams
- Integrating text analytics services (Textalytics) and a semi-structured database (SenseiDB)
- Key features of SenseiDB
TRANSCRIPT
Textalytics: Meaning-as-a-Service
Real Time Semantic Search for Social TV streams
César de Pablo Sánchez
Daedalus
8/11 2013
Big Data Spain (Madrid)
The plot
1. What's Social TV?
2. Monitoring Social TV conversations. A preliminary architecture
3. Understanding the buzz. Textalytics
4. Organizing the mess. SenseiDB
5. Lessons learned
Social TV
Second Screen
Transmedia
Not just TV
Sports
Elections
Alerts
Big Data?
Volume
Velocity
Variety
Users?
Viewers
Channels
Brands
Viewers?
Participate
Vote
Influence
Confirm beliefs
Keep updated
Belong to group
Channels?
Understand
React
Measure
Brands?
Select programs
Reputation
Find an audience
Brands?
Example from Bluefin Labs
Monitoring Social TV conversations.
The architecture
[Architecture diagram; labels: tracker, gateway, HTTP Stream, pipeline, Pull, EPG]
Understanding the buzz. Textalytics API
Core API
Topics Extraction
Text Classification
Sentiment Analysis
Language Identification
Lemmatization, POS and Parsing
Speech Recognition and Speaker Diarization
Semantic Linked Data Viewer
Spell, Grammar and Style
User Demographics
Language identification
● Given a text, identify a list of languages, or just one
● 62 languages
● Using language n-gram signatures
● Social TV
  ● Filter – a TV hashtag often implies a language
  ● Sometimes hashtags are multilingual – but not relevant for users
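The n-gram signature idea can be sketched in a few lines of Python. This is only an illustration, not the Textalytics implementation: the two training sentences, the function names and the simple overlap score are all assumptions made for the example, and a real system would build its signatures from far larger corpora.

```python
from collections import Counter

def ngram_signature(text, n=3, top=300):
    """Return the set of the `top` most frequent character n-grams of `text`."""
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram for gram, _ in counts.most_common(top)}

def rank_languages(text, signatures, n=3):
    """Rank language codes by the share of the text's n-grams found in each signature."""
    doc = ngram_signature(text, n)

    def overlap(lang):
        return len(doc & signatures[lang]) / max(len(doc), 1)

    return sorted(signatures, key=overlap, reverse=True)

# Toy training samples (assumption): one sentence per language.
SAMPLES = {
    "es": "esta noche vemos el programa de cocina en la tele y comentamos la cena",
    "en": "tonight we are watching the cooking show on tv and talking about dinner",
}
SIGNATURES = {lang: ngram_signature(sample) for lang, sample in SAMPLES.items()}

best_guess = rank_languages("la cena de esta noche", SIGNATURES)[0]
```

Returning the full ranking rather than a single label mirrors the "language list, or just one" option: a tracker can keep the first entry for filtering, or inspect the whole list when a hashtag attracts tweets in several languages.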
Text Classification
● Theme labels – IPTC
● Relevance
● Multiple labels
● Tailored for short text (tweets)
● Define your own models and categories
● Social TV – filter on topic content
Sentiment analysis
● Document level classification
● Positive/Negative/Neutral
● Subjective/Objective
● Tailored for short texts
● Handles Twitter jargon – RT, @, hashtags, emoticons, spelling errors, disfluencies
● Other features
  ● Entity level sentiment
  ● Segment level sentiment
Topics Extraction
➔ People: Ben Bernanke, Mariano Rajoy…
➔ Companies, organizations: BBVA, Bankia, Goldman Sachs, Coca-Cola, Federal Reserve…
➔ Financial entities: Ibex35, Dax Xetra…
➔ Locations: London, USA, Paris…
➔ Concepts: risk premium, Prime Minister, parliamentary address, stock market index, economic situation…
➔ Time references: today, yesterday, around 11 in the morning…
➔ Money amounts: 104 dollars, 1 euro…
● 12 main types
● Ontology with > 200 types
● Instances – BBVA
● Classes – bank
● fictional/historic
● Social TV:
  ● populate custom dictionaries – programs, celebrities, fictional characters
● relationship
Entity Linking
● Linking entities to their 'real' representation
● Linking to several LOD sources
API
● NLP and Semantics API
● Multilingual: EN, ES (FR, IT, PT, CA)
● REST service: JSON and XML
● Combine best of all worlds
  ● Deep language analysis
  ● Comprehensive resources: linguistics and DBs
  ● Ontology
  ● Rule based methods
  ● Statistics and machine learning methods
● High level semantic API – close to business scenarios
● Core API – building blocks
[API diagram; labels: Topics, Sentiment, Classif., Linked Data, POS; Configuration and linguistic resources; Media Analysis API; Semantic Publishing API; …; API]
Organizing the mess. SenseiDB
SenseiDB
● Open source, distributed, realtime, semi-structured database
● From LinkedIn SNA: powering the LinkedIn home page and LinkedIn Signal
● Integrates other open source technologies:
  – Zoie – Lucene based search engine
  – Bobo – faceted search
  – Apache Kafka – pub-sub system
● http://www.senseidb.com/
SenseiDB features
● 'Hybrid' Information Retrieval – Database
● Full text search
● Structured and faceted search
● Fast real time updates with low latency and high throughput
  – pull model
● Single table/collection
● BQL – a SQL-like language
● Eventual consistency
● Distributed – sharding and partitioning
● Hadoop integration
Faceted search
● Amazon.com?
● Identify relevant attributes to use as filters
● Predefined facets
● Define a table schema
● Define fields as facets
  – facet schema
● Efficient – in memory
Faceted search in depth
● Field types
  ● Basic: string, int, short, long, float, double, char
  ● Complex: date and text (analyzed, term vectors)
● Facet types
  ● Simple: 1 row – 1 value
  ● Hierarchical – path c>b>a
  ● Range – define ranges
  ● Multi: 1 row – n values
  ● Histogram – define bins and their size
  ● TimeRange – for real time data
  ● Custom
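To make the facet types concrete, here is a minimal Python sketch of how simple, multi and range facets turn rows into counts. It does not use SenseiDB or Bobo; the rows, field names and helper functions are assumptions chosen for illustration.

```python
from collections import Counter

# Toy rows (assumption): one dict per indexed tweet.
ROWS = [
    {"lang": "es", "hashtags": ["TopChef", "cena"], "retweets": 0},
    {"lang": "es", "hashtags": ["TopChef"], "retweets": 12},
    {"lang": "en", "hashtags": [], "retweets": 150},
]

def simple_facet(rows, field):
    """Simple facet: each row contributes exactly one value."""
    return Counter(row[field] for row in rows)

def multi_facet(rows, field):
    """Multi facet: each row may contribute zero or more values."""
    return Counter(value for row in rows for value in row[field])

def range_facet(rows, field, ranges):
    """Range facet: `ranges` holds (label, low, high), low inclusive, high exclusive."""
    counts = Counter()
    for row in rows:
        for label, low, high in ranges:
            if low <= row[field] < high:
                counts[label] += 1
    return counts

RETWEET_RANGES = [("0-9", 0, 10), ("10-99", 10, 100), ("100+", 100, float("inf"))]

by_lang = simple_facet(ROWS, "lang")
by_hashtag = multi_facet(ROWS, "hashtags")
by_retweets = range_facet(ROWS, "retweets", RETWEET_RANGES)
```

The counts per facet value are what a faceted interface shows next to each filter, which is why the same structure doubles as a basic analytics tool.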
Real time indexing
● Data events – add and delete
● Data streams – succession of data events
● Gateways
  ● Read data events from data streams
  ● File
  ● JDBC
  ● JMS
  ● Kafka
  ● Custom: Twitter
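The relationship between data events, data streams and gateways can be sketched as follows. This is a conceptual Python illustration, not the SenseiDB gateway interface: `DataEvent`, `list_gateway` and `TinyIndex` are hypothetical names, and the index is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional

@dataclass
class DataEvent:
    op: str                      # "add" or "delete"
    uid: int
    doc: Optional[dict] = None   # payload, only needed for "add"

def list_gateway(records: Iterable[dict]) -> Iterator[DataEvent]:
    """Hypothetical gateway: turns raw records from a source into data events."""
    for record in records:
        if record.get("deleted"):
            yield DataEvent("delete", record["id"])
        else:
            yield DataEvent("add", record["id"], {"text": record["text"]})

class TinyIndex:
    """In-memory stand-in for the index; it pulls events from the gateway."""

    def __init__(self):
        self.docs = {}

    def consume(self, events: Iterable[DataEvent]) -> None:
        for event in events:
            if event.op == "add":
                self.docs[event.uid] = event.doc
            elif event.op == "delete":
                self.docs.pop(event.uid, None)

    def search(self, term: str) -> list:
        """Return the ids of documents containing `term` as a whitespace-separated token."""
        term = term.lower()
        return sorted(uid for uid, doc in self.docs.items()
                      if term in doc["text"].lower().split())

STREAM = [
    {"id": 1, "text": "Great dinner tonight"},
    {"id": 2, "text": "topchef again"},
    {"id": 1, "deleted": True},
]

index = TinyIndex()
index.consume(list_gateway(STREAM))
```

Because the index iterates over the gateway, the consumer sets the pace, which is one way to read the "pull model" listed among the features.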
BQL – search, filter and facets
● Search – common boolean and phrase operators
● Filters – where conditions
● Facets support basic analytics tasks defined on facets
● Relevance
  ● Default – recency
  ● Ad-hoc – may be defined in query
BQL Query Example on Tweets
SELECT *
WHERE hashtags IN ("TopChef")
BROWSE BY
hashtags, user_screen_name, urls
Tweet Query example
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
BROWSE BY entities, sentiment
Query examples
SELECT *
WHERE QUERY IS "relaxing cup of coffee"
AND time IN LAST 2 hours
BROWSE BY entities, sentiment
Using facets for semantic search
● Define a facet for:
  ● entities/concepts → tweets about Chicote – include all variants + user + hashtags
  ● each entity type → navigate by type – popular people
  ● classification/sentiment/emotions → positive tweets about Chicote
  ● users or hashtags → popular users / popular mentions / correlated hashtags
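A rough Python sketch of the idea: collapse every surface variant of an entity into one canonical facet value at indexing time, then combine that facet with the sentiment facet at query time. The variant dictionary, the user handle in it, the sample tweets and the sentiment labels are all invented for illustration; in the system described here they would come from the text analytics step.

```python
from collections import Counter

# Hypothetical dictionary: surface variants (name, user handle, hashtag) -> canonical entity.
VARIANTS = {
    "chicote": "Alberto Chicote",
    "@albertochicote": "Alberto Chicote",   # illustrative handle, an assumption
    "#chicote": "Alberto Chicote",
}

def entity_facet(text):
    """Return the canonical entities whose variants occur in the text (crude substring match)."""
    lowered = text.lower()
    return sorted({entity for variant, entity in VARIANTS.items() if variant in lowered})

def index_tweets(tweets):
    """Attach an `entities` multi-valued facet to every tweet."""
    return [dict(tweet, entities=entity_facet(tweet["text"])) for tweet in tweets]

def browse(rows, entity, facet):
    """Filter rows by entity, then count the values of another facet."""
    return Counter(row[facet] for row in rows if entity in row["entities"])

# Sentiment labels are assumed to be supplied by an upstream analysis step.
TWEETS = [
    {"text": "Chicote is great tonight", "sentiment": "positive"},
    {"text": "Not convinced by #chicote this week", "sentiment": "negative"},
    {"text": "Watching the match instead", "sentiment": "neutral"},
    {"text": "Love @albertochicote!", "sentiment": "positive"},
]

rows = index_tweets(TWEETS)
chicote_sentiment = browse(rows, "Alberto Chicote", "sentiment")
```

Searching on the canonical value instead of the raw text is what makes the search "semantic": a plain keyword query would miss tweets that only use the handle or the hashtag.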
Architecture
Scalability
● ZooKeeper to coordinate replicas
● Low indexing latency (no batch commit)
● Low search latency – even with indexing bursts
● Horizontally scalable – shards
● Shards may be replicated N times
● Elastic – nodes can be added to accommodate growth
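Horizontal scaling through shards can be illustrated with a small scatter-gather sketch in Python. The modulo routing rule, the class names and the sample documents are assumptions for the example; replication and ZooKeeper coordination are deliberately left out.

```python
from collections import Counter

class Shard:
    """One partition of the index, holding a subset of the documents."""

    def __init__(self):
        self.docs = {}

    def add(self, uid, doc):
        self.docs[uid] = doc

    def facet_counts(self, field):
        return Counter(doc[field] for doc in self.docs.values())

class Cluster:
    """Routes documents to shards and merges per-shard results."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, uid, doc):
        # Assumed routing rule: document id modulo the number of shards.
        self.shards[uid % len(self.shards)].add(uid, doc)

    def facet_counts(self, field):
        # Scatter the request to every shard, then gather and sum the partial counts.
        total = Counter()
        for shard in self.shards:
            total.update(shard.facet_counts(field))
        return total

cluster = Cluster(num_shards=2)
for uid, sentiment in enumerate(["positive", "negative", "positive", "neutral", "positive"], start=1):
    cluster.add(uid, {"sentiment": sentiment})

merged = cluster.facet_counts("sentiment")
```

Facet counts merge by simple addition, so adding shards spreads both indexing and counting work without changing the query result.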
Other features
● Batch indexing via Hadoop – ETL
● Simple analytics by batch indexing
● Customized relevance models
● MapReduce functions over facets
  ● Sum, avg, min, max
  ● DistinctCount
● Activity values – volatile values – likes
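The aggregation functions listed above can be pictured as a group-by over a facet. The following Python sketch shows the computation only, not how SenseiDB executes it; the rows, field names and program hashtags are assumptions.

```python
from collections import defaultdict

# Toy rows (assumption): one dict per tweet.
ROWS = [
    {"hashtag": "TopChef", "user": "a", "retweets": 2},
    {"hashtag": "TopChef", "user": "b", "retweets": 4},
    {"hashtag": "TopChef", "user": "a", "retweets": 0},
    {"hashtag": "LaVoz", "user": "c", "retweets": 5},
]

def aggregate_by_facet(rows, facet, metric):
    """Group rows by facet value and compute sum, avg, min and max of a numeric field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[facet]].append(row[metric])
    return {
        value: {
            "sum": sum(numbers),
            "avg": sum(numbers) / len(numbers),
            "min": min(numbers),
            "max": max(numbers),
        }
        for value, numbers in groups.items()
    }

def distinct_count(rows, facet, field):
    """Count the distinct values of `field` seen for each facet value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[facet]].add(row[field])
    return {value: len(items) for value, items in seen.items()}

retweet_stats = aggregate_by_facet(ROWS, "hashtag", "retweets")
unique_users = distinct_count(ROWS, "hashtag", "user")
```

For Social TV, a distinct count of users per hashtag separates a program that many people mention once from one that a few accounts mention repeatedly.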
Lessons learned
Conclusions
● SenseiDB is fast at searching/indexing – no variance
● A couple of nodes are enough to handle Spanish Social TV volume
● We love the query language and its time operators – BQL
● Supports real time exploration
Limitations
● SenseiDB
  ● Documentation is still scarce
  ● Single table model – flat users and reputation
  ● Tricks to store complex facets
  ● Manageability
● Social TV Tracker
  ● Group and disambiguate entity mentions across tweets
  ● Relevance is tricky – ad hoc
Comparison
● Solr
  ● Near real time updates
    – Soft commits
  ● Simple facets
  ● Popular – great tools
● Storm, S4?
● ElasticSearch
  ● Batch/realtime commits
  ● Online facets
  ● Aggregation after facets
  ● Much better plugin system
Thanks and QA
@zdepablo #bigdata #socialtv #2ndscreen #nlp @textalytics