Unstructured Big Data - IHBI webinar b…
Post on 24-Jun-2020
chp.cmich.edu/ihbi
UNSTRUCTURED BIG DATA
2016-03-24
Customers:
- Gender/age
- Socioeconomic Status
- Geography: Weather/Culture/Terrain
- Satisfaction

Products/Services:
- Features: Benefit, Irritant, Undesirable
- Cost: Purchase, Environmental
- Influencers: Promotion, Competitors, Antagonists

Producers:
- Reputation
- Innovation
- Quality
- Social Contract

Big Data: Internet of Things
Internet of Things (IoT): the world through the eyes of sensors

Big Data: Unstructured Communications
Unstructured Data (Texts): the world through the eyes of humans
In this talk we focus on text, while recognizing that audio, image, and gesture analysis are also active research areas.

Topics
- Business Value
- Data Sources
- Data Selection
- Taxonomy & Ontology
- Analytics
- Knowledge Extraction
- Knowledge Presentation
- RoadMap

Business Value
- Competition is forcing companies to be much quicker at detecting changing customer values and expectations:
  - Performance
  - Environmental impact
  - Safety
  - Aesthetics
  - Cost
- Text mining promises rapid, continuing, detailed identification of market niches and recent technology developments:
  - Geographic and weather related
  - Age group, gender
  - Ethnic background
  - Socio-economic background

Requirements
- To obtain and maintain a capability to keep abreast of these dynamics, the successful organization must:
  - Monitor the communications between members of the market segments
  - Identify individuals (persons and organizations) with high impact
  - Be able to extract and quantify changes in the values and sentiments relevant to the organization's goals
  - Identify new solutions for its business problems
  - Inform the appropriate business decision makers about important changes
- This implies hardware, communications, and software investments matched by people with advanced IT and business analytics skills

Organizational Implications
- Deployment of Capabilities (Skills & Hardware)
- Obtaining and using data from outside the organization
- Data Security (Knowing your sources – hacking)
- Data Quality Assessment methods (Contamination)
- Retention Policies (Where to keep what data)
- Data Stewardship

Deployment
- Outsourcing
- Funding projects
  - Investments to learn how to extract value
- Centralization of the capability
  - Until tools are available to reduce the learning curves
  - Requires project prioritization to pursue realistic business value

Data Sources

Collection Factors
- Primary Data Sources
- Data Products from 3rd parties
- Extractors
- Filters
- Text Cleansing/Transforms
- Synchronize, harmonize, and integrate
- Data staging

Attributes
- Language (varies by ‘mother tongue’)
  - Spelling accuracy
  - Grammar
  - Translation
- Source purpose
  - Observations
  - Opinion
  - Analysis
  - Emotional reaction
- Timeline
- Granularity
- Target audience
- Author and Author Affiliations

Data Collections
[Diagram: Crawlers, Filters & Cleaners; Topical Repositories (Geography, Time frame); Enhanced Data Warehouse; Meta Data Directory]

Data Sources
- Social Media
- Blogs
- Emails
- Search Engines
- Web pages
- Click-streams
- RSS feeds
- Newspapers, Trade journals, Magazines
- Patents, Reports
- Peer-reviewed journals
- Government/Industry data sources

Collection Management
- Meta-data of specific sources
  - Language, Frequency, Filter details
- Crawlers
- Third-party extractors
- Scheduler

How / Who will manage this?

Data Selection Criteria
- Subject of interest
- Granularity
  - Time (When)
  - Geography (Where)
  - Population segment (Who)
  - Technology (What)
- Values (Why)
  - Performance
  - Cost
  - Sustainability
  - Environmental impact
  - etc.
- Keywords
  - Connected documents (links, citations, site-maps, ..)
- Tags
  - Explicit (Annotators)
  - Inferred
- Fuzzy Logic (convert quantitative data to words)
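
To illustrate the fuzzy-logic item, here is a minimal Python sketch that converts a numeric reading into words using triangular membership functions. The variable (air temperature in degrees Celsius), the three terms, and their breakpoints are assumptions chosen for the example, not values from the talk.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at or beyond left/right, 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms as (left, peak, right) breakpoints.
TERMS = {
    "cold": (-20.0, 0.0, 15.0),
    "mild": (5.0, 18.0, 28.0),
    "hot": (22.0, 35.0, 50.0),
}


def to_words(value, terms=TERMS):
    """Return (word, membership) pairs with non-zero membership, strongest first."""
    scored = [(word, triangular(value, *abc)) for word, abc in terms.items()]
    return sorted([(w, m) for w, m in scored if m > 0], key=lambda pair: -pair[1])


def best_word(value, terms=TERMS):
    """Return the single best-fitting word, or None if no term applies."""
    words = to_words(value, terms)
    return words[0][0] if words else None
```

With these assumed breakpoints, a reading of 2.0 maps to "cold" and 18.0 to "mild"; a reading outside every range maps to no word. The resulting words can then be searched and tagged alongside text data.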

Search Logic
- Generate new extraction processes for new subjects
- Ongoing maintenance of data collections for specific business purposes
- Modification of previous projects’ data collection processes
- Shut down of obsolete data collections

- Meta-Data of Past Studies
  - Sources
  - Search logic
  - Search engine identification
  - Post-project Assessment
- Granularity transforms
  - Time, Geography, ..
- Keyword lists for specific concepts
- Previously generated Annotators

How / Who will manage this?

Data Extraction for Business Goals
[Diagram elements:]
- Topical Repositories (Geography, Time frame)
- Unique requirements
- Previous selection logic
- When / Where / Who / What / Why
- Taxonomy & Ontology
- Clusters
- Annotators: Time frame, Geography, Author, Inheritance, Sentiment, Concept titles, …
- Harmonization: Part of speech, Synonyms, Extractors, Start/Stop lists, …

Taxonomy, Ontology and Annotators
- Taxonomy – set of unique concepts (with tags) that cover the subject of interest
- Ontology – relationships between terms of the taxonomy:
  - Contained within / is part of
  - Sequence in time
- Annotators – generation of tags to specify non-obvious attributes
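
One way to picture the three definitions is as small data structures. In this Python sketch, the concepts, the `part_of` relation name, and the "-free" tagging rule are illustrative assumptions rather than the speakers' implementation.

```python
import re

# Taxonomy: unique concepts, each with tags (hypothetical examples).
taxonomy = {
    "vehicle": {"product"},
    "engine": {"component"},
    "piston": {"component"},
}

# Ontology: relationships between taxonomy terms as (child, relation, parent).
ontology = [
    ("engine", "part_of", "vehicle"),
    ("piston", "part_of", "engine"),
]


def ancestors(concept, relation="part_of"):
    """Follow one relation transitively, e.g. everything a concept is part of."""
    found = []
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, rel, parent in ontology:
            if child == current and rel == relation and parent not in found:
                found.append(parent)
                frontier.append(parent)
    return found


def infer_tags(text, component):
    """Annotator: tag a document '<component>-free' when the component is never mentioned."""
    words = re.findall(r"[a-z]+", text.lower())
    return set() if component in words else {component + "-free"}
```

Here `ancestors("piston")` walks the "is part of" chain up to the vehicle, and `infer_tags` produces a tag for an attribute that the text never states explicitly.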

Taxonomy & Ontology Development
- What is a document?
  - Sentence? Paragraph?
- Word x Document matrix
  - Parse & Stem
- Taxonomy Generation
- Clustering
- Naming the Clusters
- Sentiment Assignment
- Residuals (not clearly interesting)

- Synonyms
- Concepts
- Start words/concepts
- Stop words/concepts
- Sentiment words & phrases
- Annotators
  - Parts of Speech
  - UIMA rules (Unstructured Information Management Architecture)
  - Inference (products without a component of interest might be tagged as ‘component-free’)
  - Inheritance
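
The "Parse & Stem" and "Word x Document matrix" steps can be sketched in a few lines of Python. The suffix-stripping stemmer and the stop list below are deliberately crude stand-ins, assumed for illustration; a real project would likely use an established stemmer and a curated stop list.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to"}  # hypothetical stop list


def stem(word):
    """Crude suffix stripping; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Lower-case, split into alphabetic tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def word_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word i in document j."""
    counts = [Counter(parse(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for c in counts] for w in vocabulary]
    return vocabulary, matrix


docs = ["The battery charges quickly", "Charging the battery is slow"]
vocabulary, matrix = word_document_matrix(docs)
```

Stemming maps "charges" and "Charging" to the same row, which is the point of the step: variants of one word are counted together before clustering. What counts as a "document" (sentence, paragraph, whole post) only changes what is passed in as `documents`.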

Normalization
- Define the base for counts:
  - Put in terms of communication intensity
    - Segment population
    - School in session, or not
  - Number of ‘touches’
    - Followers, readers, subscribers
  - Responses
    - Likes, replies, forum thread length, …
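
A small Python sketch shows why the base matters. The segment names and numbers are invented for illustration, and "per 1,000 people" is just one assumed choice of base.

```python
def normalize(mentions, base, per=1000):
    """Express a raw count per `per` units of the chosen base (population, followers, ...)."""
    if base <= 0:
        raise ValueError("base must be positive")
    return mentions * per / base


# Hypothetical segments: raw mention counts and segment populations.
segments = {
    "students": {"mentions": 120, "population": 40000},
    "retirees": {"mentions": 90, "population": 15000},
}

rates = {
    name: normalize(s["mentions"], s["population"])
    for name, s in segments.items()
}
```

In raw counts the students segment looks more active (120 versus 90 mentions), but per 1,000 people the retirees segment is twice as intense (6.0 versus 3.0). The same function works with followers or subscribers as the base.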

Clustering Process
- Iterative clustering and cluster/document naming by subject
- Multiple dimensions:
  - Verbs
  - Nouns
- Hierarchic clustering
  - Features, Benefits
  - Values
- Self-organizing, k-means, ..

- Organized lists of:
  - Synonyms
  - Stop Lists
  - Start Lists
  - Common concepts
- Sentiment Rules for weights
- Annotators & Data Supplements

How / Who will manage this?
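
Of the clustering methods named on this slide, k-means is the simplest to sketch. The following plain-Python version is a minimal illustration under stated assumptions: documents are already term-count vectors, the vocabulary and counts are invented, and the first k documents seed the centroids so the run is repeatable.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append(mean(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical term counts over the vocabulary (battery, charge, price, cost).
docs = [[3, 2, 0, 0], [0, 0, 2, 3], [2, 3, 0, 1], [0, 1, 3, 2]]
labels, centroids = k_means(docs, 2)
```

The two clusters separate the battery/charge documents from the price/cost documents. Naming the clusters remains a human step, which is why the slide describes the process as iterative.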

Venn Diagrams - Relationships
- Universe of documents
- Classes of interest in hierarchical fashion
- Classes that have high chronological correlation
- Classes that are ‘competitors’
  - Within same higher level
  - In different higher levels of hierarchy
  - Differences in Feature frequencies between ‘competitors’
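
The Venn-diagram view maps directly onto set operations. In this Python sketch the document IDs, the two competing classes, and the "battery" feature are all hypothetical.

```python
# Universe of documents and two 'competitor' classes, as sets of document IDs.
universe = set(range(1, 11))
brand_a = {1, 2, 3, 4, 5}
brand_b = {4, 5, 6, 7}

both = brand_a & brand_b                # documents discussing both competitors
only_a = brand_a - brand_b              # documents discussing only brand A
neither = universe - (brand_a | brand_b)


def feature_share(class_docs, docs_with_feature):
    """Fraction of a class's documents that mention a given feature."""
    if not class_docs:
        return 0.0
    return len(class_docs & docs_with_feature) / len(class_docs)


# Documents that mention a hypothetical feature of interest.
mentions_battery = {1, 2, 4, 6}

difference = feature_share(brand_a, mentions_battery) - feature_share(
    brand_b, mentions_battery
)
```

A positive `difference` means the feature comes up more often in documents about brand A than in documents about brand B. Whether such a gap is statistically meaningful is a separate question.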

Modeling
- Map classes together to define interest in differences
- Analysis of existing data in paired clusters over time
- Implement statistical process control (SPC) to trigger alerts when values fall outside control intervals

- Mapping tool to show classes and relationships
- Process to convert maps to statistical comparisons of frequencies within classes
- SPC triggers:
  - Big change in frequency of {concept} in {class}
  - Emerging/fading {concepts} in {class}
  - Big change in relationship between {class1} and {class2}
- Find high impact documents that have affected, or will likely affect, {concept} frequency in {classes} in future
- Business user feedback module to identify false positives/negatives for model improvements
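
The first SPC trigger, a big change in the frequency of a concept within a class, can be sketched as a simple control chart. This Python sketch assumes Shewhart-style limits at three sample standard deviations around a baseline mean; the weekly shares are invented, and a production chart might instead use a p-chart or moving-range limits.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits from a baseline period."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def spc_alerts(baseline, new_values, sigmas=3.0):
    """Return (index, value) for each new observation outside the control limits."""
    lower, upper = control_limits(baseline, sigmas)
    return [
        (i, v) for i, v in enumerate(new_values) if v < lower or v > upper
    ]


# Hypothetical weekly share of documents in a class that mention a concept.
baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.11, 0.12, 0.10]
alerts = spc_alerts(baseline, [0.11, 0.13, 0.19])
```

Only the third new week (a share of 0.19) falls outside the limits and raises an alert. Feedback from business users on false positives and negatives could then be used to adjust `sigmas` or the length of the baseline.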

Analytics
[Diagram elements:]
- When / Where / Who / What / Why
- Clusters
- Class Relationships: Competitive, Parent-Child, Sibling, ..
- Classifiers: Neural Networks, Support Vector Machines
- Frequency: Ratios across Classes, Ratios within Classes, SPC over time
- Networks: Influencers
- New Docs
- Decision Trees: Scoring Leaves
- Sequence Analysis: Events that are followed shortly by shifts in distributions: Event → Impacts (duration)
- Feature Correlations: Within a Class, Between Classes
- Sentiment: Pos/Neg/Neutral

Analytics
- Classification
- Trends within classes
- Organizing Classes
  - Competing (Sentiment, with/without, ..)
  - Independent
  - Hierarchical
  - Chronological
- Sentiment measure for classes
- Statistical differences in related class contents
- Emerging/dying concepts
- Influence Tracking/Measurement

- Neural Networks (classifiers)
- Normalization
- SPC
- Integration with structured data
- Network Analysis
- Sequence Analysis
- False positives/negatives
- Sentiment measures
- Issues:
  - Ambiguity
  - Sarcasm
  - Analogies
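
A lexicon-based scorer is one simple way to produce a sentiment measure for a class, and it also makes the listed issues concrete. In this Python sketch the word lists and the one-word negation rule are assumptions for illustration only.

```python
import re
from collections import Counter

POSITIVE = {"good", "great", "reliable", "love"}   # hypothetical lexicon
NEGATIVE = {"bad", "poor", "broken", "hate"}
NEGATORS = {"not", "never", "no"}


def sentiment(text):
    """Label a text Pos, Neg, or Neutral from word polarity with simple negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        # Flip polarity when the preceding word is a negator ("not good").
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "Pos"
    if score < 0:
        return "Neg"
    return "Neutral"


def class_sentiment(documents):
    """Count sentiment labels across the documents of one class."""
    return Counter(sentiment(d) for d in documents)
```

The sarcastic "Great, another broken charger" scores as Neutral because the positive and negative words cancel, although a human reader would call it negative. This is the kind of error that the false positive/negative feedback is meant to catch.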

Knowledge Extraction
- Dynamics of past to current Classes
  - Size of subsets
  - Relationships of competitors
  - Relationships between peer classes
- Role of specific authors or groups of authors
  - Age, gender
  - Geography, organizational association
- Relationship of Class statistics to events (chronology)
  - What events change perceptions?

- Important events
  - With impacts
- Important sources/authors
  - Influence
- Strong associations
  - Classes that change synchronously
- Sequence rules
  - Possible cause-effects
- SPC
  - Dynamics of the field
- Insight Delivery: Alerting sub-system to route specified insights to roles → people (email addresses)
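
The insight-delivery chain (insight → roles → people) amounts to two lookups. The Python sketch below uses hypothetical insight types, role names, and placeholder addresses; actually sending the email is left out.

```python
# Hypothetical routing tables: which roles care about which kind of insight,
# and which people (email addresses) currently hold each role.
INSIGHT_ROLES = {
    "sentiment_shift": ["product_manager", "marketing_lead"],
    "emerging_concept": ["research_lead"],
}
ROLE_PEOPLE = {
    "product_manager": ["pm@example.com"],
    "marketing_lead": ["marketing@example.com"],
    "research_lead": ["research@example.com"],
}


def route(insight_type):
    """Resolve an insight type to a de-duplicated, ordered list of recipients."""
    recipients = []
    for role in INSIGHT_ROLES.get(insight_type, []):
        for address in ROLE_PEOPLE.get(role, []):
            if address not in recipients:
                recipients.append(address)
    return recipients
```

Keeping roles separate from people means a staffing change only touches `ROLE_PEOPLE`, while the decision about which insights matter to which role stays stable.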

Knowledge Presentation
- Flow charts of data through processing steps
- Texts (Examples or summaries)
- Responsibilities
- Class Relationships
- Timelines
- Related Class attributes
- Document and Agent relationships

- PPTX of Procedures
- Narratives of conclusions
- Swimlanes
- Venn Diagrams
- Line graphs (Timelines)
- Tables of Class Attributes
  - Pie charts, Gauges
- Network Diagrams
- Alerts

Alerts
[Diagram elements:]
- Classifiers: Neural Networks, Support Vector Machines
- Updated Knowledge Bases
- Business Analysts
High Impact Agents
- People – who, how identified, when
- Organizations
- Events
- Document types
- Source Types
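
One common, simple way to identify high-impact people is to count how often others mention or reply to them. The Python sketch below assumes a list of (author, mentioned author) pairs with invented names; in-degree is only one of several possible influence measures, and the talk does not say which one it used.

```python
from collections import Counter

# Hypothetical (author, mentioned_author) pairs extracted from posts.
mentions = [
    ("ann", "bob"),
    ("cat", "bob"),
    ("dan", "bob"),
    ("bob", "ann"),
    ("cat", "ann"),
    ("dan", "eve"),
]


def rank_influencers(edges, top=3):
    """Rank people by how many times others mention them (in-degree)."""
    in_degree = Counter(
        target for source, target in edges if source != target
    )
    return in_degree.most_common(top)


top_two = rank_influencers(mentions, top=2)
```

The same counting works for organizations, events, or sources by changing what the second element of each pair refers to.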

Analysis → Prediction → Prescription
- If sub-markets are unhappy with your ‘product’:
- Use predictive models to estimate the value involved.
  - If customer education is the answer (“you have an appropriate product”), send education messages via the Influencers showing evidence of mis-education.
  - If the answer is to evolve your product features or costs, do that and then educate your potential customers.

RoadMap Swimlanes
- Executive Mgt
  - Sponsor sets budget + timeline for PoC
  - Set budget + timeline for production environment
  - Monitor budget & returns of Big Data investments
- Knowledge Mgt
  - Find case histories
  - Establish methodology for Taxonomy & Ontology
  - Manage Taxonomy & Ontology growth
- Bus. Process Mgt
  - Specify PoC scope and goals
  - Prioritize projects of scope & goals
  - Deploy capability as capacity grows
- IT & Procurement
  - Find trusted consultants & vendors
  - Identify & train resources; procure hardware
  - Maintain skills & hardware to support capability

Topics Discussed
- Business Value
- Data Sources
- Data Selection
- Taxonomy & Ontology
- Analytics
- Knowledge Extraction
- Knowledge Presentation
- RoadMap

Contacts
- Dr. Imad Haidar, Sr Researcher & Data Scientist, IHBI, haida1i@cmich.edu
- Chunxia (Shar) Tang, Senior Research Analyst, IHBI, tang1c@cmich.edu
- James Mentele, Senior Research Fellow, IHBI, james.mentele@cmich.edu