COPYRIGHT © 2011 ALCATEL-LUCENT. ALL RIGHTS RESERVED. ALCATEL-LUCENT — INTERNAL PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION
Profiling User Activities With Minimal Traffic Traces
Tiep Mai, Deepak Ajwani and Alessandra Sala
Bell Laboratories, Ireland
Outline
Telecom data and privacy issue
Truncated URL dataset
User behavior analysis on limited data
• Micro-action burst decomposition
• Representative URL selection
Future work and Conclusions
End-to-End View of the Telecom Network
Mobile user
Webservices
Client-side data
Server-side data
Telecom data
Huge data but with limited features
Empower telecom data analysis with this data
Providing Personalized Services
• Personalized services require user activity profiling
Traditional approaches rely on features extracted from rich data sources
Server-side data: full URLs of visited pages, page categories, transaction data, search queries, click-through rate, etc.
Client-side data: full URLs (cookies), application data (web browsing), etc.
Network-side data: full URLs, HTTP packet content, etc.
• Our goal: provide medium-grained user profiling from a privacy-preserving, limited dataset for a large user pool
User privacy considerations
Mobile Web Traces: User Behavioral Analysis from Timestamped Data
• Mobile traces provide precious insights into user behavior
Critical to enable service personalization and enrich the user's online experience
• Complete mobile web traces risk revealing sensitive information
http://finance.yahoo.com/q?s=BAC (Bank of America Corp. stock price)
https://www.google.ie/#q=postnatal+depression (sensitive health condition)
http://www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA (specific purchased product)
Removing Sensitive Data from URL Traces
• Telecom operators are subject to restrictive privacy legislation
• Conservative approach to sharing data
Anonymized, truncated and sampled data
Traces from 10,000 anonymized users over 30 days, i.e. more than 130 million records
• Focus on the dataset of truncated URLs or IP addresses
• Resulting data:
1. Truncated: www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA becomes www.amazon.com
2. Noisy: unintentional web traffic such as advertisements, web analytics, etc.
Quality of behavior analysis depends on effectively separating unintentional traffic from user activities on truncated URLs
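The truncation step can be illustrated with a minimal sketch. It assumes that truncation keeps only the host name of each URL and drops scheme, path, query and fragment; the slides do not spell out the exact rule, and the helper name `truncate_url` is hypothetical.

```python
from urllib.parse import urlsplit


def truncate_url(url: str) -> str:
    """Keep only the host of a URL (assumed truncation rule, not confirmed by the slides)."""
    # urlsplit only recognises the host when a scheme or a leading '//' is present,
    # so bare entries such as 'www.example.com/page' get a '//' prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlsplit(url).hostname
    return host or ""


if __name__ == "__main__":
    for full in (
        "http://finance.yahoo.com/q?s=BAC",
        "https://www.google.ie/#q=postnatal+depression",
        "www.amazon.com/Dell-Inspiron-i15R-15-6-inch-Laptop/dp/B009US2BKA",
    ):
        print(truncate_url(full))
```

Under this assumption the stock symbol, the search query and the product identifier all disappear, while the domain remains available for analysis.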
Web Browsing Behaviors Across Time & Users
• Collection of web traces of several URL types
• Aim: filter out traces that do not represent explicit user actions
Identify features to drive detection of unintentional traces
Validate across different users
• Diversity of web domains:
High diversity in user activities
High diversity across users
Methodology Approach
• User activities as a collection of micro user actions, i.e. bursts
Web clicks, chat replies
• Assumption: each burst represents an atomic user activity
Combination of intended and unintended web traffic
• Methodology
1. Burst decomposition
2. Activity extraction:
Domain classification: leverage specialized features of domain appearance in the burst
Online representative URL selection and activity association
Increases prediction accuracy by 20%
Burst Decomposition – Statistical Parametric Distribution Fitting
• Goal: Decompose the web-trace back into constituent data bursts
• A threshold on packet inter-arrival time (IAT) is needed to separate traces into bursts
• Study the inter-arrival time distribution
• No parametric distribution matches most user traces
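Once an IAT threshold is available, splitting a trace into bursts is straightforward. The sketch below assumes a trace is a list of record timestamps in seconds; the function name `split_into_bursts` and the example values are illustrative and not taken from the dataset.

```python
from typing import List, Sequence


def split_into_bursts(timestamps: Sequence[float], threshold: float) -> List[List[float]]:
    """Group timestamps into bursts: a gap larger than `threshold` starts a new burst."""
    bursts: List[List[float]] = []
    for t in sorted(timestamps):
        # Compare against the last record of the current burst.
        if bursts and t - bursts[-1][-1] <= threshold:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


if __name__ == "__main__":
    trace = [0.0, 0.5, 0.9, 10.0, 10.2, 30.0]  # made-up timestamps
    print(split_into_bursts(trace, threshold=1.0))
```

The hard part, as the slide notes, is choosing the threshold, since the IAT distribution does not follow a convenient parametric form.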
Burst Decomposition Algorithm
• Robust burst decomposition algorithm that is independent of the distribution shape
• Starting from the smallest IAT value, find the value at which the extra probability gained by increasing the decaying point is insignificant compared to the probability accumulated up to that point
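One possible reading of this rule, working directly on the empirical IAT distribution, is sketched below. It is not the authors' algorithm: the multiplicative step `growth`, the cut-off `tolerance` and the function name are all assumptions introduced for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def find_iat_threshold(iats: Sequence[float], growth: float = 2.0, tolerance: float = 0.05) -> float:
    """Return the smallest positive IAT t at which extending the threshold to
    growth * t adds at most `tolerance` times the probability accumulated at t.

    `growth` and `tolerance` are hypothetical tuning parameters.
    """
    values = sorted(iats)
    if not values:
        raise ValueError("need at least one inter-arrival time")
    n = len(values)
    for candidate in values:
        if candidate <= 0:
            continue  # a zero gap cannot be extended multiplicatively
        accumulated = bisect_right(values, candidate) / n            # empirical P(IAT <= t)
        extended = bisect_right(values, candidate * growth) / n - accumulated
        if extended <= tolerance * accumulated:
            return candidate
    return values[-1]


if __name__ == "__main__":
    # Made-up sample: many short gaps inside bursts, a few long gaps between them.
    sample = [0.1] * 50 + [0.2] * 30 + [0.3] * 15 + [60.0] * 5
    print(find_iat_threshold(sample))
```

Because the rule only compares empirical probabilities, it makes no assumption about the shape of the distribution, which is the property the slide asks for.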
Domain Classification – Initial Insight
• Goal: automatically identify URLs representing user activities
• Measurements are aggregated over all users for each domain
Record-level measurements
Burst-level measurements
Domain Classification - Methodology
• Logistic regression
• Validation error and AIC, BIC
• Two discriminating features
o_{b,j=1} - u_{b,j=1} (~ 22.87): probability that a domain comes first in bursts with more than one unique domain
u_{b,j=2} (~ -9.51): probability that a domain appears in bursts with two unique domains
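The verbal descriptions of the two features can be turned into a small burst-level computation. The sketch below is one interpretation only: it assumes each burst is an ordered list of domain names, it estimates the two described probabilities per domain, and it does not reproduce the exact o and u quantities, which the slides do not define. The function name and the `.example` domains are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def burst_features(bursts: Sequence[Sequence[str]]) -> Dict[str, Tuple[float, float]]:
    """Return {domain: (p_first_in_multi_domain_burst, p_in_two_domain_burst)}."""
    seen = defaultdict(int)         # bursts containing the domain
    multi = defaultdict(int)        # multi-domain bursts containing the domain
    first_multi = defaultdict(int)  # multi-domain bursts in which the domain comes first
    two = defaultdict(int)          # bursts with exactly two unique domains containing it
    for burst in bursts:
        unique = set(burst)
        for domain in unique:
            seen[domain] += 1
            if len(unique) > 1:
                multi[domain] += 1
            if len(unique) == 2:
                two[domain] += 1
        if len(unique) > 1:
            first_multi[burst[0]] += 1
    return {
        d: (first_multi[d] / multi[d] if multi[d] else 0.0, two[d] / seen[d])
        for d in list(seen)
    }


if __name__ == "__main__":
    bursts = [
        ["news.example", "ads.example"],
        ["news.example", "ads.example", "cdn.example"],
        ["news.example"],
        ["shop.example", "ads.example"],
    ]
    for domain, feats in sorted(burst_features(bursts).items()):
        print(domain, feats)
```

In this toy input the advertising domain never opens a burst, while the news and shopping domains always do, which is the kind of contrast the first feature is meant to capture.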
Trade-offs of Domain Classification Results
• Trade-off between accuracy, sensitivity, precision and specificity
Maximizing accuracy
Maximizing sensitivity and specificity
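The trade-off can be made concrete by computing the four measures at different decision cut-offs of a classifier score. The sketch below uses the standard definitions of the metrics; the scores, labels and helper names are made up for illustration and are not results from the study.

```python
from typing import Callable, Dict, Sequence


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics from confusion-matrix counts (0.0 when a denominator is empty)."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
    }


def metrics_at(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> Dict[str, float]:
    """Classify score >= cutoff as positive and compare with the 0/1 labels."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= cutoff
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return confusion_metrics(tp, fp, tn, fn)


def best_cutoff(scores: Sequence[float], labels: Sequence[int],
                objective: Callable[[Dict[str, float]], float]) -> float:
    """Pick the observed score that maximizes the given objective."""
    return max(sorted(set(scores)), key=lambda c: objective(metrics_at(scores, labels, c)))


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # made-up classifier outputs
    labels = [1, 1, 0, 1, 0]             # made-up ground truth
    print(best_cutoff(scores, labels, lambda m: m["accuracy"]))
    print(best_cutoff(scores, labels, lambda m: m["sensitivity"] + m["specificity"]))
```

Swapping the objective passed to `best_cutoff` is enough to move between the two operating points named on the slide.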
Future Work
• Mapping domains to activities (reading, shopping, browsing) and identifying user activities online
• Activity query and recommendation
• Correlating truncated URL data with user location data
Spatio-temporal study of user activities
Conclusions and Remarks
• Telecom data: Huge but limited; Strict privacy regulations
• URL trace data:
Privacy preservation with truncation
Noisy data
Burst property of micro user actions
• Goal: Perform activity extraction and behavior analysis for a large user pool with limited and noisy data
• Method:
Burst decomposition and feature extraction
Representative URL identification and activity extraction
• Medium-grained behavior analysis is feasible with limited, noisy, privacy-preserving URL data