Using Identity Data Mining
Securing & Personalizing Commerce
Jonathan LeBlanc
Developer Evangelist (PayPal)
Github: http://github.com/jcleblanc
Twitter: @jcleblanc
The Problem
Commerce Relies on Static Data Contributions
Premise
You can determine the personality profile of a person based on their usage habits
Personalization == Security
Technology was the Solution!
Then I Read This…
Us & Them
The Science of Identity
By David Berreby
The Different States of Knowledge
What a person knows
What a person knows they don’t know
What a person doesn’t know they don’t know
Technology was NOT the Solution
Identity and discovery are NOT a technology solution
Our Subject Material
HTML content is poorly structured
There are some pretty bad web practices on the interwebz
You can’t trust that anything semantically valid will be present
How We’ll Capture This Data
Start with base linguistics
Extend with available extras
The Components
The Basic Pieces
Page Data
Scrapey Scrapey
Keywords
Without all the fluff
Weighting
Word diets FTW
Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key
Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word? Stop words = sad panda
Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others
Simple Extraction Engine
Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?
Fetching the Data: The Request
The Simple Way
$html = file_get_contents('URL');
The Controlled Way
$c = curl_init('URL');
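Fetching the Data: cURL
//set up the request handle
$req = curl_init($url);
//$header is a boolean: true includes the response headers in the returned output
$options = array(
    CURLOPT_URL => $url,
    CURLOPT_HEADER => $header,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_AUTOREFERER => true,
    CURLOPT_TIMEOUT => 15,
    CURLOPT_MAXREDIRS => 10
);
curl_setopt_array($req, $options);
//run the request and capture the response as a string
$page_content = curl_exec($req);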
//list of findable / replaceable string characters
$find = array('/\r/', '/\n/', '/\s\s+/');
$replace = array(' ', ' ', ' ');
//perform page content modification: drop script and style blocks
$mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
$mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
//strip remaining tags, lowercase, collapse whitespace, and split into words
$mod_content = strip_tags($mod_content);
$mod_content = strtolower($mod_content);
$mod_content = preg_replace($find, $replace, $mod_content);
$mod_content = trim($mod_content);
$mod_content = explode(' ', $mod_content);
natcasesort($mod_content);
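//set up list of stop words and the final found stopped list
$common_words = array('a', ..., 'zero');
$searched_words = array();
//extract list of keywords with number of occurrences
foreach($mod_content as $word) {
    $word = trim($word);
    //discard any token that contains a non-letter character
    if (preg_match('/[^a-zA-Z]/', $word) == 1){ $word = ''; }
    if(strlen($word) > 2 && !in_array($word, $common_words)){
        //initialize the counter on first sight to avoid an undefined index notice
        if (!isset($searched_words[$word])){ $searched_words[$word] = 0; }
        $searched_words[$word]++;
    }
}
//sort by number of occurrences, highest first
arsort($searched_words, SORT_NUMERIC);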
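The cleanup and counting steps above can be pulled together into one function. This is only a sketch under a few assumptions: extract_keywords() is a hypothetical name, the stop word list is a tiny stand-in for the full list elided on the slide, and tokens are split on non-letters rather than discarded when they contain one. It also pads each tag with a space before stripping, since strip_tags() would otherwise glue adjacent words together.

```php
<?php
// Hypothetical helper (not from the slides): returns keyword => count, highest first.
function extract_keywords(string $html, array $stop_words): array
{
    // Drop script and style blocks entirely.
    $text = preg_replace('#<script\b[^>]*>.*?</script>#is', ' ', $html);
    $text = preg_replace('#<style\b[^>]*>.*?</style>#is', ' ', $text);

    // Pad tags with a space so "<h1>Tea</h1><p>Coffee" does not become "TeaCoffee".
    $text = strtolower(strip_tags(str_replace('<', ' <', $text)));

    // Assumption: split on anything that is not a letter, dropping empty pieces.
    $words = preg_split('/[^a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array();
    foreach ($words as $word) {
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Sample page and a deliberately tiny stop word list, both made up for illustration.
$html = '<html><head><style>p { color: red; }</style></head>'
      . '<body><h1>Coffee and Tea</h1><p>Coffee beans and coffee mugs</p>'
      . '<script>var tea = 1;</script></body></html>';
$keywords = extract_keywords($html, array('and', 'the'));
print_r($keywords);
```

Keeping the stop word list as a parameter makes it easy to swap in a longer list, or a different language, without touching the extraction logic.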
Scraping Site Meta Data
//load scraped page data as a valid DOM document
$dom = new DOMDocument();
@$dom->loadHTML($page_content);
//scrape title (empty string if the page has no title tag)
$titles = $dom->getElementsByTagName("title");
$title = $titles->length > 0 ? $titles->item(0)->nodeValue : "";
//loop through all found meta tags
$dataReturn = array();
$metas = $dom->getElementsByTagName("meta");
for ($i = 0; $i < $metas->length; $i++){
    $meta = $metas->item($i);
    if($meta->getAttribute("property")){
        //Open Graph description
        if ($meta->getAttribute("property") == "og:description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        }
    } else {
        //standard description and keywords meta tags
        if($meta->getAttribute("name") == "description"){
            $dataReturn["description"] = $meta->getAttribute("content");
        } else if($meta->getAttribute("name") == "keywords"){
            $dataReturn["keywords"] = $meta->getAttribute("content");
        }
    }
}
Extending the Engine
Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
Weighting Important Tags
//our keyword weights
$weights = array("keywords" => "3.0", "meta" => "2.0", "header1" => "1.5", "header2" => "1.2");
//add modifier here
if(strlen($word) > 2 && !in_array($word, $common_words)){ $searched_words[$word]++; }
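One way the "add modifier here" step might look: instead of adding 1 per occurrence, add the weight of the place the word was found. This is a sketch, not the talk's implementation. The add_weighted_words() helper is hypothetical, the weights are the slide's values written as floats, and the "body" weight of 1.0 for plain page text is an assumption.

```php
<?php
// Weights from the slide; "body" => 1.0 is an assumed default for plain page text.
$weights = array('keywords' => 3.0, 'meta' => 2.0, 'header1' => 1.5, 'header2' => 1.2, 'body' => 1.0);

// Hypothetical helper: adds the source's weight for each word instead of 1.
function add_weighted_words(array $scores, array $words, string $source, array $weights, array $stop_words): array
{
    // Unknown sources fall back to a weight of 1.0 (assumption).
    $weight = $weights[$source] ?? 1.0;
    foreach ($words as $word) {
        $word = strtolower(trim($word));
        if (strlen($word) > 2 && !in_array($word, $stop_words, true)) {
            $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
        }
    }
    return $scores;
}

// Made-up word lists standing in for the scraped meta keywords, h1, and body text.
$stop_words = array('and', 'the');
$scores = array();
$scores = add_weighted_words($scores, array('coffee', 'tea'), 'keywords', $weights, $stop_words);
$scores = add_weighted_words($scores, array('coffee', 'and', 'mugs'), 'header1', $weights, $stop_words);
$scores = add_weighted_words($scores, array('coffee', 'beans', 'mugs'), 'body', $weights, $stop_words);
arsort($scores, SORT_NUMERIC);
print_r($scores);
```

Content location modifiers (for example, text near the top of the page) could be handled the same way by adding more source names to the weights array.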
Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
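A rough sketch of phrase extraction, assuming the word list still contains its stop words so that phrases read naturally. The extract_phrases() helper and its rule (skip any phrase that starts or ends with a stop word, but allow stop words in the middle) are assumptions for illustration, not something the slides specify.

```php
<?php
// Hypothetical helper: counts phrases of $n adjacent words ($n should be 2 or 3).
function extract_phrases(array $words, int $n, array $stop_words): array
{
    $counts = array();
    $total = count($words);
    for ($i = 0; $i + $n <= $total; $i++) {
        $gram = array_slice($words, $i, $n);
        // Assumed rule: a phrase may contain a stop word, but not begin or end with one.
        if (in_array($gram[0], $stop_words, true) || in_array($gram[$n - 1], $stop_words, true)) {
            continue;
        }
        $phrase = implode(' ', $gram);
        $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
    }
    arsort($counts, SORT_NUMERIC);
    return $counts;
}

// Made-up input: already lowercased and split, stop words left in place.
$words = explode(' ', 'bread and butter with bread and butter on toast');
$phrases = extract_phrases($words, 3, array('and', 'with', 'on', 'the'));
print_r($phrases);
```

Removing stop words before building phrases would turn "bread and butter" into "bread butter", which is the wonkiness the slide warns about.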
Working with Unknown Users
The majority of users won’t be immediately targetable
Use HTML5 LocalStorage & Cookie backup
Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations
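The points above could be combined into a single score per page visit. Everything in this sketch is an assumption made for illustration: the interest_score() name, the 300 second cap, the logarithmic interaction factor, and the 30 day half-life. The idea is only that time counts when it comes with interaction, and that older visits (such as a burst of gift shopping) fade.

```php
<?php
// Hypothetical scoring function: every constant here is an assumption, not from the talk.
function interest_score(float $seconds_on_page, int $interactions, float $days_ago, float $half_life_days = 30.0): float
{
    // No clicks, scrolls, or other interactions: treat the time as idle, not interest.
    if ($interactions <= 0) {
        return 0.0;
    }
    // Cap dwell time so a tab left open overnight does not dominate.
    $dwell = min($seconds_on_page, 300.0);
    // Diminishing returns for additional interactions.
    $engagement = $dwell * log(1 + $interactions, 2);
    // Older visits count for less: the score halves every $half_life_days.
    return $engagement * pow(0.5, $days_ago / $half_life_days);
}

// Ten minutes on a page with no interaction scores nothing.
var_dump(interest_score(600, 0, 0));
// One minute with a single interaction, today versus a month ago.
var_dump(interest_score(60, 1, 0));
var_dump(interest_score(60, 1, 30));
```

A shorter half-life makes the profile react faster to seasonal swings, at the cost of forgetting long-term interests sooner.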
Grouping Using Commonality
[Venn diagram: User A Interests and User B Interests, overlapping in Common Interests]
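The overlap between two users' interests can be sketched as a key intersection of their keyword profiles. The common_interests() helper is hypothetical, the profiles are made up, and scoring each shared keyword by the lower of the two users' scores is just one possible choice.

```php
<?php
// Hypothetical helper: keywords present in both profiles, scored by the lower of the two scores.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();
    // array_intersect_key() keeps the entries of $user_a whose keys also exist in $user_b.
    foreach (array_intersect_key($user_a, $user_b) as $keyword => $score_a) {
        $common[$keyword] = min($score_a, $user_b[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}

// Made-up keyword => score profiles for two users.
$user_a = array('coffee' => 5.5, 'cycling' => 2.0, 'php' => 4.0);
$user_b = array('php' => 3.0, 'coffee' => 1.5, 'chess' => 6.0);
$shared = common_interests($user_a, $user_b);
print_r($shared);
```

Taking the minimum means a keyword only ranks highly in the shared set when both users care about it.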
www.slideshare.com/jcleblanc
Thank You! Questions?