Analyzing the power of Tweets in predicting Commodity Futures

Mar 17, 2014. Srivatsan Ramanujam, Senior Data Scientist, Pivotal. © Copyright 2013 Pivotal. All rights reserved.


TRANSCRIPT

  • 1 © Copyright 2013 Pivotal. All rights reserved.

    @gopivotal @being_bayesian

    Analyzing the power of Tweets in predicting Commodity Futures

    Mar 17, 2014 Srivatsan Ramanujam Senior Data Scientist

    Pivotal

  • 2

    Problem Definition

    •  Can we predict corn, soybean and wheat futures based on social chatter on Twitter?

    •  The Customer: A major Agricultural Cooperative

  • 3

    Data

  • 4

    Obtaining Data

    •  GNIP was used to fetch five years of historical tweets matching any of a list of keywords of interest

    [Figure: Tweets Table and Poster Information]

  • 5

    GNIP

    •  As plugged-in partners, we’ve worked with GNIP before, and the experience was great!

    •  We needed historical data, and GNIP’s Historical PowerTrack came in handy

    •  Clean API, quick quotes, and convenient download of the results of historical jobs

  • 6

    Grain Futures Vs. Volume of Tweets

  • 7

    The Platform

  • 8

    Data Science Toolkit

    •  Appliance
       –  Full Rack DCA with Greenplum Database
    •  ETL
       –  Python
    •  Modeling
       –  SQL
       –  MADlib
       –  PL/Python, PL/Java
       –  Ark-Tweet-NLP [1] with PL/Java wrappers
    •  Visualization
       –  Tableau

    [1] CMU ARK Twitter part-of-speech tagger: http://www.ark.cs.cmu.edu/TweetNLP (GPL 2)

  • 9

    Pivotal Greenplum MPP DB

    Think of it as multiple PostgreSQL servers

    [Diagram: a Master and multiple Segments/Workers]

    Rows are distributed across segments by a particular field (or randomly)
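To make the row-distribution idea concrete, here is a minimal Python sketch of hashing a distribution key to pick a segment. It is an illustration only: `segment_for`, `distribute`, the sample rows and the use of CRC32 are all assumptions made for this example, not Greenplum's actual hash function or API.

```python
import zlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index using a stable hash."""
    # CRC32 is used only because it is deterministic across runs;
    # it is an assumption for this sketch, not what Greenplum uses.
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_field, num_segments):
    """Group rows (dicts) by the segment their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return dict(segments)


# Hypothetical sample rows, made up for illustration.
rows = [
    {"tweet_id": 1, "poster": "alice", "body": "corn prices up"},
    {"tweet_id": 2, "poster": "bob", "body": "wheat harvest delayed"},
    {"tweet_id": 3, "poster": "alice", "body": "soybean exports rising"},
    {"tweet_id": 4, "poster": "carol", "body": "rain in the corn belt"},
]

placement = distribute(rows, key_field="poster", num_segments=4)
for segment_id in sorted(placement):
    print(segment_id, [r["tweet_id"] for r in placement[segment_id]])
```

Rows that share a key value always land on the same segment, which is why the choice of distribution field matters for joins and aggregations on that field.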

  • 10

    PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}

    •  The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database cluster

    •  Data parallelism: PL/X piggybacks on Greenplum’s MPP architecture

    •  Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C

    [Diagram: SQL reaches the Master Host (with a Standby Master), which connects over the Interconnect to four Segment Hosts, each running two Segments]
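The data-parallel pattern described on this slide can be sketched in plain Python. This is a simulation under stated assumptions, not Greenplum code: the thread pool stands in for the segments, and `count_keyword_rows`, `run_on_all_segments` and the sample tweets are hypothetical names and data invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_keyword_rows(segment_rows, keyword):
    """Stand-in for a PL/Python function body: it sees one segment's rows only."""
    return sum(1 for body in segment_rows if keyword in body.lower())


def run_on_all_segments(segments, keyword):
    """The 'master' runs the same function on every segment and combines the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda rows: count_keyword_rows(rows, keyword), segments))
    return sum(partials)


# Hypothetical tweets, already split across four simulated segments.
segments = [
    ["Corn harvest looks strong", "Wheat futures slipping"],
    ["corn and soybean planting delayed"],
    ["Nothing about grains here"],
    ["CORN basis widening", "soybean exports rise"],
]

total = run_on_all_segments(segments, "corn")
print(total)
```

Each worker touches only its own slice of the data, so adding segments increases throughput without changing the function itself.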

  • 11

    Scalable, in-database ML

    •  Open Source! https://github.com/madlib/madlib
    •  Works on Greenplum DB and PostgreSQL
    •  Active development by Pivotal
       -  Latest Release: 1.4 (Dec 2013)
    •  Downloads and Docs: http://madlib.net/

  • 12

    MADlib In-Database Functions

    Predictive Modeling Library

    Linear Systems
    •  Sparse and Dense Solvers

    Matrix Factorization
    •  Singular Value Decomposition (SVD)
    •  Low-Rank

    Generalized Linear Models
    •  Linear Regression
    •  Logistic Regression
    •  Multinomial Logistic Regression
    •  Cox Proportional Hazards Regression
    •  Elastic Net Regularization
    •  Sandwich Estimators (Huber-White, clustered, marginal effects)

    Machine Learning Algorithms
    •  Principal Component Analysis (PCA)
    •  Association Rules (Affinity Analysis, Market Basket)
    •  Topic Modeling (Parallel LDA)
    •  Decision Trees
    •  Ensemble Learners (Random Forests)
    •  Support Vector Machines
    •  Conditional Random Field (CRF)
    •  Clustering (K-means)
    •  Cross Validation

    Descriptive Statistics

    Sketch-based Estimators
    •  CountMin (Cormode-Muthukrishnan)
    •  FM (Flajolet-Martin)
    •  MFV (Most Frequent Values)

    Correlation

    Summary

    Support Modules

    •  Array Operations
    •  Sparse Vectors
    •  Random Sampling
    •  Probability Functions

  • 13

    The Models

  • 14

    The Approach

    •  In addition to identifying textual cues in tweets that were correlated with commodity futures, we also wanted to analyze whether tweet sentiment was correlated with commodity futures

  • 15

    Sentiment Analysis – Challenges

    •  Language on Twitter doesn’t adhere to rules of grammar, syntax or spelling

    •  We don’t have labeled data for our problem. The tweets aren’t tagged with sentiment

    •  Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile!

    “Cool”
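A small Python sketch may help show why a context-free dictionary look-up falls short. The lexicon, the `lexicon_score` helper and both example sentences are hypothetical and made up for this illustration; they are not taken from the talk.

```python
# Hypothetical toy lexicon; a real one would have thousands of entries.
LEXICON = {"cool": 1, "great": 1, "good": 1, "bad": -1, "poor": -1, "terrible": -1}


def lexicon_score(text):
    """Sum per-token polarities with no regard for context."""
    tokens = text.lower().split()
    return sum(LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)


approving = "That new planter is cool!"          # "cool" expresses approval
weather = "Cool nights will slow corn growth."   # "cool" describes temperature

print(lexicon_score(approving))
print(lexicon_score(weather))
```

Both sentences receive the same positive score, even though only the first one expresses a positive opinion. That is the kind of error a context-aware approach has to avoid.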

  • 16

    Sentiment Analysis – Approach

    Stages shown in the slide diagram:

    •  Part-of-speech tagger [1]: break up tweets into tokens and tag their parts of speech
    •  Phrase Extraction
    •  Semi-Supervised Sentiment Classification
    •  Phrasal Polarity Scoring
    •  Sentiment Scored Tweets: use learned phrasal polarities to score sentiment of new tweets

    •  Parallelized ArkTweetNLP to achieve fast part-of-speech tagging on tweets

    •  Custom (patent pending) algorithm to extract contextual cues & score sentiment of tweets

    [1] Part-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
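The custom algorithm itself is not described in the slides, so the following Python sketch is only a generic stand-in for the phrase-extraction and scoring stages. It assumes tokens have already been tagged (the single-letter tags imitate the ARK tagset, with "A" for adjective, "N" for noun, "R" for adverb and "V" for verb), and the tag patterns, the `extract_phrases` and `score_tweet` helpers and the polarity values are all hypothetical.

```python
def extract_phrases(tagged_tokens, patterns=(("A", "N"), ("R", "A"), ("R", "V"))):
    """Return adjacent token pairs whose tag pair matches one of the patterns."""
    phrases = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in patterns:
            phrases.append(f"{w1.lower()} {w2.lower()}")
    return phrases


def score_tweet(tagged_tokens, phrase_polarity):
    """Average the polarity of the known phrases; 0.0 when none are known."""
    known = [
        phrase_polarity[p]
        for p in extract_phrases(tagged_tokens)
        if p in phrase_polarity
    ]
    return sum(known) / len(known) if known else 0.0


# Hand-tagged example tweet (in practice the tags would come from the tagger).
tagged = [("Record", "A"), ("harvest", "N"), ("but", "&"), ("weak", "A"), ("demand", "N")]

# Hypothetical phrase polarities, standing in for values learned from data.
phrase_polarity = {"record harvest": 0.8, "weak demand": -0.6}

print(extract_phrases(tagged))
print(score_tweet(tagged, phrase_polarity))
```

Scoring phrases rather than single tokens is one simple way to bring in context: "weak demand" carries a polarity that neither word carries reliably on its own.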

  • 17

    Text Analytics Pipeline with GNIP stream

    •  Tweet Stream: stored on HDFS

    •  (gpfdist) Loaded as external tables into GPDB

    •  Parallel parsing of JSON and extraction of fields using PL/Python

    •  Topic Analysis through MADlib pLDA

    •  Sentiment Analysis through custom PL/Python functions

    •  D3.js
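The JSON parsing step could look roughly like the Python sketch below, which is the kind of logic a PL/Python function body might contain. The field names (`postedTime`, `body`, `actor.preferredUsername`) are assumptions modeled on an Activity Streams style payload, and `extract_fields` and the sample record are hypothetical.

```python
import json


def extract_fields(raw_line):
    """Parse one JSON-encoded tweet and keep a few fields; return None for malformed input."""
    try:
        tweet = json.loads(raw_line)
    except json.JSONDecodeError:
        return None
    if not isinstance(tweet, dict):
        return None
    actor = tweet.get("actor")
    if not isinstance(actor, dict):
        actor = {}
    return {
        "id": tweet.get("id"),
        "posted_time": tweet.get("postedTime"),
        "body": tweet.get("body"),
        "poster": actor.get("preferredUsername"),
    }


# Hypothetical record, made up for illustration.
sample = json.dumps({
    "id": "tag:example:1",
    "postedTime": "2013-06-01T12:00:00.000Z",
    "body": "Corn planting behind schedule",
    "actor": {"preferredUsername": "farmer_joe"},
})

print(extract_fields(sample))
print(extract_fields("{not valid json"))
```

Returning None for malformed lines lets the caller filter bad records out instead of failing a whole load.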

  • 18

    Key Take-Aways

    •  There is significant signal in Tweets in predicting commodity futures

    •  Sentiment analysis of tweets can provide an additional signal in predicting commodity futures. In the sample we analyzed, Twitter sentiment was negatively correlated with commodity futures

    •  A blended model of Text Regression, Sentiment Analysis and Tweet Actor information gave us encouraging results, and we believe that combining it with market fundamentals like weather or yield will give better models
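As an illustration of what "negatively correlated" means here, the sketch below computes a Pearson correlation between a daily sentiment series and a daily price series. The numbers are synthetic and chosen only to show the mechanics; they are not results from the study, and the `pearson` helper is written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Synthetic values for illustration only; not data from the talk.
daily_sentiment = [0.4, 0.1, -0.2, -0.5, 0.3]
daily_close = [410.0, 425.0, 440.0, 455.0, 415.0]

r = pearson(daily_sentiment, daily_close)
print(r)
```

A value of r below zero means the two series tend to move in opposite directions. It says nothing by itself about which one leads the other.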

  • 19

    What’s in it for me?

  • 20

    Pivotal Open Source Contributions

    http://gopivotal.com/pivotal-products/open-source-software

    •  MADlib – In-database parallel ML -  https://github.com/madlib/madlib

    •  PyMADlib – Python Wrapper for MADlib -  https://github.com/gopivotal/pymadlib

    •  PivotalR – R wrapper for MADlib -  https://github.com/madlib-internal/PivotalR

    •  Part-of-speech tagger for Twitter via SQL -  http://vatsan.github.io/gp-ark-tweet-nlp/

    Questions? @being_bayesian