© 2011 Columbia University
E6885 Network Science Lecture 13:
Large-Scale Analysis and Advanced Network
Analysis Applications
E 6885 Topics in Signal Processing -- Network Science
Ching-Yung Lin, Dept. of Electrical Engineering, Columbia University
December 12th, 2011
© 2010 Columbia University2 E6885 Network Science – Lecture 13: Analysis of Network Flow Data
Course Structure
Class Number   Class Date   Topics Covered
 1             09/12/11     Overview – Social, Information, and Cognitive Network Analysis
 2             09/19/11     Network Representations and Characteristics
 3             09/26/11     Network Partitioning, Clustering, and Use Case
 4             10/03/11     Network Visualization, Sampling and Estimation
 5             10/10/11     Network Modeling
 6             10/17/11     Network Topology Inference
 7             10/24/11     Dynamic Networks -- I
 8             10/31/11     Dynamic Networks -- II
 9             11/14/11     Final Project Proposal Presentation
10             11/21/11     Information Diffusion in Networks
11             11/28/11     Graphical Models and Analysis
12             12/05/11     Social and Economical Issues of Network Analysis
13             12/12/11     Large-Scale Network Processing System
14             12/19/11     Final Project Presentation
Scientific Challenges of Large-Scale and Real-Time Network Mining Infrastructure
• Speed-Up of Network Mining Algorithms (with Christos Faloutsos and U Kang)
• Network Sampling Theory (with Xifeng Yan)
• High Performance Computing for Network Analysis (with Ted Brown, Nitesh Chawla, Jaideep Srivastava, Ido Rosen, and David Hsu)
• Social Network Storage and Indexing (with Ted Brown)
Example: Centralities in Large Networks
"Who are the most important actors?"
Three centralities:
• Degree: number of neighbors
• Closeness: average shortest-path length
• Betweenness: number of times a node sits on a shortest path between other nodes
Applications: measuring financial company value; network attack monitoring
Centrality     15th-century Florentine families   Internet Web
               (|V| = 15, |E| = 19)               (|V| = billions, |E| = billions)
Degree         Easy                               Easy
Closeness      Easy                               Hard
Betweenness    Easy                               Hard
Computational costs listed on the slide: O(|E|), O(|V|^2 log |V|), O(|V|^3)
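To make the easy/hard contrast concrete, here is a minimal Python sketch (not from the lecture) of degree and closeness on a toy graph. It assumes an unweighted, undirected graph stored as an adjacency dictionary; the function names and the toy data are hypothetical. Degree needs one pass over the adjacency lists, while closeness, taken here as the average shortest-path length, needs a breadth-first search from every node, which is what stops scaling at billions of nodes.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def degree_centrality(adj):
    """Degree: number of neighbors (one pass over the adjacency lists)."""
    return {v: len(neighbors) for v, neighbors in adj.items()}


def closeness_centrality(adj):
    """Closeness as average shortest-path length (lower means more central)."""
    result = {}
    for v in adj:  # one BFS per node: the expensive part on large graphs
        dist = bfs_distances(adj, v)
        others = [d for node, d in dist.items() if node != v]
        result[v] = sum(others) / len(others) if others else float("inf")
    return result


# Hypothetical toy graph: the path a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
degree = degree_centrality(toy)
closeness = closeness_centrality(toy)
```

On this path graph the two middle nodes have both the highest degree and the smallest average distance.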
Large-Scale Graph Indexing and Network Management
[Figure: two-stage pipeline. Indexing Stage: raw graph, graph after shuffle, compressed (zip) blocks, graph DBs. User Query Stage: query vectors, Unified Query Execution Engine, resulting vectors.]
Graph Storage and Indexing
• 1. Block Formulation
  – Any partitioning algorithm (METIS, Disco, etc.) can be used
• 2. Block Compression
  – Compress each block using gzip.
• 3. Block Placement
  – Vertical: inefficient for out-neighbor queries
  – Horizontal: inefficient for in-neighbor queries
  – Grid: efficient for in/out-neighbor queries (our choice)
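The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GBase implementation: contiguous node-ID ranges stand in for a real partitioner such as METIS, blocks live in an in-memory dictionary instead of files, and all function names are hypothetical. The point is that with grid placement an out-neighbor query decompresses only one row of blocks and an in-neighbor query only one column.

```python
import gzip
import json
from collections import defaultdict


def build_grid_blocks(edges, block_size):
    """Group edges into (row_block, col_block) cells and gzip each cell."""
    cells = defaultdict(list)
    for src, dst in edges:
        cells[(src // block_size, dst // block_size)].append((src, dst))
    return {
        key: gzip.compress(json.dumps(sorted(cell)).encode("utf-8"))
        for key, cell in cells.items()
    }


def load_block(blocks, key):
    """Decompress one block back into a list of (src, dst) edges."""
    raw = gzip.decompress(blocks[key]).decode("utf-8")
    return [tuple(edge) for edge in json.loads(raw)]


def out_neighbors(blocks, node, block_size):
    """Read only the grid row that contains the node's outgoing edges."""
    row = node // block_size
    touched = [key for key in blocks if key[0] == row]
    found = sorted(
        dst for key in touched for src, dst in load_block(blocks, key) if src == node
    )
    return found, len(touched)


def in_neighbors(blocks, node, block_size):
    """Read only the grid column that contains the node's incoming edges."""
    col = node // block_size
    touched = [key for key in blocks if key[1] == col]
    found = sorted(
        src for key in touched for src, dst in load_block(blocks, key) if dst == node
    )
    return found, len(touched)


# Hypothetical toy graph with 8 nodes, split into a 2 x 2 grid of blocks.
edges = [(0, 1), (0, 5), (1, 2), (4, 0), (5, 6), (6, 7), (7, 3)]
blocks = build_grid_blocks(edges, block_size=4)
```

Both queries for node 0 touch two of the four blocks, where a purely vertical or horizontal layout would force one of the two query types to scan every file.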
Graph Analytics Queries
• SQL-like queries
  – Global queries: degree, PageRank, RWR, connected components
  – Targeted queries: 1-step in-neighbors, 1-step out-neighbors, 1-step in/out-neighbors, k-step neighbors, induced subgraph, egonet, k-core, cross-edges
• Query Execution Engine
  – Main tool: generalized matrix-vector multiplication
  – Grid selection
Applications of existing GBase supported queries
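Applications (table columns): Browsing, Ranking, Finding Community, Anomaly Detection, Viz.
Supported queries (table rows): Connected Component, Radius, PageRank, RWR, Induced Subgraph, K-Nh, K-Egonet, K-core, Cross-edges
[Table: the marks showing which query serves which application were lost in extraction]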
Graph Mining on Hadoop
• Extract Graph Features at Different Levels – Scale Up & Speed
  – By parallelism
  – By scalable algorithm design
Level       Examples                  How-To
Global      PageRank, centralities    (Generalized) matrix-vector multiplication
Sub-graph   community, role           Matrix factorization, partition
Local       degree, edge, weight      Ego-net
• Our other existing Hadoop-based functions
  – "Non-Negative Residual Matrix Factorization with Application to Graph Anomaly Detection", SI of Best Papers of SIAM Data Mining Conf. 2011
  – Large-scale graph analysis of up to 1B nodes & 7B edges [AAAI'11]
Our algorithms
• We proposed two new centralities ("effective closeness" and "LineRank") and efficient large-scale algorithms for billion-scale graphs
[Figure panels: Scalability Results; Effective Closeness vs. Closeness; Analysis of Real-World Graph]
For 2 billion edges:
– standard closeness: 30,000 years
– effective closeness: ~1 day
1,000,000 times faster!
Challenges and Core Ideas
• Challenge 1: Scalability
  – Core Idea 1: Indexing at Block-level + Parallelism
• Challenge 2: Application Heterogeneity
  – Core Idea 2: Unified Query Execution Engine
Key Features of GBase:
- F1. Algorithms. Define common, core algorithms to satisfy various graph applications
- F2. Storage. Store and manage huge graphs in distributed settings to answer queries efficiently
- F3. Query Optimization. Exploit the storage and the general algorithm to answer queries quickly
Unified Query Run-Time Execution Engine in GBase
• Q: Given a graph, can we compute connected components, PageRank, Random Walk with Restart, and diameter/radius with one algorithm?
• A: Yes, by expanding GIM-V for run-time SQL queries [GBase, KDD 2011]
  – Generalized Iterative Matrix-Vector Multiplication [Pegasus, ICDM 2009]
  – Extension of plain matrix-vector multiplication
Main Idea: Intuition
• Plain M-V multiplication: $M \times v = v'$
  $v'_4 = \sum_{i=1}^{4} m_{4,i}\, v_i$
  • Weighted combination of colors
  • ~ Message passing
[Figure: a row of M with weights 1, 1, 0.1 multiplied by v to produce one entry of v']
Main Idea
• Plain M-V multiplication: $M \times v = v'$
  $v'_i = \sum_{j=1}^{4} m_{i,j}\, v_j$
Three implicit operations here:
  – combine2: multiply $m_{i,j}$ and $v_j$ (message sending)
  – combineAll: sum the n multiplication results (message combination)
  – assign: update $v'_i$
[Figure: a row of M with weights 1, 1, 0.1 multiplied by v to produce one entry of v']
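The decomposition into three operations can be written down directly. The following Python sketch is an illustration, not the Pegasus or GBase code: it assumes a small dense matrix stored as a list of lists, and the function name `gimv` and its parameter names are chosen here for readability. Plugging in multiply, sum, and overwrite recovers the ordinary matrix-vector product.

```python
def gimv(matrix, vector, combine2, combine_all, assign):
    """One generalized matrix-vector step on a dense list-of-lists matrix."""
    new_vector = []
    for i, row in enumerate(matrix):
        # combine2: build one message per (m_ij, v_j) pair.
        messages = [combine2(m_ij, vector[j]) for j, m_ij in enumerate(row)]
        # combineAll: merge the messages; assign: decide the new value of v_i.
        new_vector.append(assign(vector[i], combine_all(messages)))
    return new_vector


# Hypothetical 3 x 3 example.
M = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
v = [1, 2, 3]

plain = gimv(
    M,
    v,
    combine2=lambda m_ij, v_j: m_ij * v_j,  # multiply
    combine_all=sum,                        # sum the products
    assign=lambda old, new: new,            # overwrite v_i
)
```

Keeping the three operations as parameters is the design choice that lets a single execution engine serve several graph algorithms.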
Generalize MV to GIM-V
• GIM-V: Customizing the three operations leads to many algorithms
Operations   Standard MV   PageRank            RWR                     Diameter (approx.)    Con. Cmpt.
combine2     Multiply      Multiply with c     Multiply with c         Multiply bit-vector   Multiply
combineAll   Sum           Sum with rj prob.   Sum with restart prob   BIT-OR()              MIN
assign       Assign        Assign              Assign                  BIT-OR()              MIN
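Two columns of the table can be instantiated in a short Python sketch. It is a hedged illustration rather than the Pegasus implementation: the graph is a list of nonzero entries `(i, j, m_ij)`, the helper names `gimv_step` and `iterate` are hypothetical, the toy graphs are made up, and the PageRank version assumes a column-normalized matrix with no dangling nodes and takes the random-jump term to be `(1 - c) / n`.

```python
def gimv_step(entries, vector, combine2, combine_all, assign):
    """One GIM-V step; entries holds the nonzeros (i, j, m_ij), messages go j -> i."""
    inbox = [[] for _ in vector]
    for i, j, m_ij in entries:
        inbox[i].append(combine2(m_ij, vector[j]))
    return [assign(vector[i], combine_all(inbox[i])) for i in range(len(vector))]


def iterate(entries, vector, ops, max_iters=200, tol=1e-12):
    """Repeat the step until the vector stops changing or max_iters is reached."""
    for _ in range(max_iters):
        new_vector = gimv_step(entries, vector, *ops)
        if max(abs(a - b) for a, b in zip(new_vector, vector)) <= tol:
            return new_vector
        vector = new_vector
    return vector


# Connected components: Multiply / MIN / MIN over node labels.
undirected = [(0, 1), (1, 2), (3, 4)]
cc_entries = [(a, b, 1) for a, b in undirected] + [(b, a, 1) for a, b in undirected]
cc_ops = (
    lambda m_ij, v_j: m_ij * v_j,
    lambda msgs: min(msgs, default=float("inf")),
    lambda old, new: min(old, new),
)
components = iterate(cc_entries, [0, 1, 2, 3, 4], cc_ops)

# PageRank: multiply with c / sum plus random-jump term / assign.
c, n = 0.85, 3
# Column-normalized links 0->1, 0->2, 1->2, 2->0, stored as (target, source, weight).
pr_entries = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 1.0), (0, 2, 1.0)]
pr_ops = (
    lambda m_ij, v_j: c * m_ij * v_j,
    lambda msgs: (1 - c) / n + sum(msgs),
    lambda old, new: new,
)
pagerank = iterate(pr_entries, [1.0 / n] * n, pr_ops)
```

In the first run every node ends up with the smallest node ID in its component; in the second the scores stay normalized because each column of the matrix sums to one.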
Fast Run-Time Query Algorithms for GIM-V
• Solution 1: Decrease the file size and shuffle time with block indexes
• Solution 2: Achieve sub-linear response time with grid placement and selection
[Figure panels: Grid Placement; Grid Selection]
Performance
• Experiments done on the M45 Hadoop cluster
  – Provided by Yahoo!
  – One of the top 50 supercomputers in the world
  – 500 nodes, 4000 cores, 3TB RAM, 1.5PB disk
• Data Sets
Performance
Example: Bayesian Network Model – LDA on Blue Gene
• LDA (latent Dirichlet allocation) is a 3-level hierarchical Bayesian model for content analysis; it is important to social network analysis research but presents computational challenges
• Large-scale content analysis; design of new architecture
Applications of LDA:
  – Finding friends in blog data [1]
  – Entity resolution [2]
  – Mining source code [3]
  – Generating email keywords [4]
  – Mining graphs [5]
Speeding Up LDA
Parameter estimation
• Variational Expectation-Maximization (EM)
  • An alternating procedure in which each of the two steps can potentially run in parallel
  • E-step: find the optimizing values of the variational parameters (used to compute the posterior distribution of the hidden variables)
  • M-step: find maximum likelihood estimates under the posterior from the E-step
Inference
• In the E-step, for each iteration, data access (reference) could be in parallel
[Figure: in the E-step, inference (Inf.) runs independently for word 1 … word n … word N over topic k; in the M-step, the corresponding MLE updates run]
Optimization across all (computing and I/O) nodes
Hardware architecture:
1. Instruction-level support for computation of 1st, 2nd, and 3rd derivatives
2. Functional units for big but simple loops traversing rows and columns
Making LDA much faster on Hadoop or Blue Gene
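The parallelism in the E-step can be illustrated with a minimal Python sketch. It is not the lecture's implementation: it assumes the standard variational updates for LDA (each word's topic responsibilities are proportional to the topic-word probability times exp(digamma(gamma)), and gamma is alpha plus the summed responsibilities), a symmetric scalar `alpha`, a fixed strictly positive topic-word table `beta`, and made-up toy documents. The `digamma` helper is a hand-written approximation so that the block needs only the standard library, and the thread pool merely stands in for the separate workers a cluster would use.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def digamma(x):
    """Approximate psi(x) for x > 0 using the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step_document(word_ids, beta, alpha, iters=50):
    """Variational E-step for one document; touches no other document's state."""
    k = len(beta)
    gamma = [alpha + len(word_ids) / k] * k
    phi = []
    for _ in range(iters):
        weights = [math.exp(digamma(g)) for g in gamma]
        phi = []
        for w in word_ids:
            row = [beta[t][w] * weights[t] for t in range(k)]
            total = sum(row)
            phi.append([r / total for r in row])  # responsibilities sum to 1
        gamma = [alpha + sum(p[t] for p in phi) for t in range(k)]
    return gamma, phi


# Hypothetical model: 2 topics over a 4-word vocabulary.
beta = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
alpha = 0.5
docs = [[0, 1, 0], [2, 3, 3, 2], [0, 3]]

# Documents are independent in the E-step, so they can be mapped across workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda d: e_step_document(d, beta, alpha), docs))
gammas = [gamma for gamma, _ in results]
```

The M-step, which would re-estimate `beta` from the collected responsibilities, is omitted here.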
Understanding Human
Goal: a novel system for detecting and predicting behaviors through large-scale social network analytics and data mining, for security applications that decrease insider threats (such as colleague shootings, suicide, data leakage, and malware propagation) or for commerce applications such as marketing and sales.
[Figure labels. Inputs: emails, chats, meetings, web page clicks, DB server logs, social media data. Collectors: feed subscription, social sensors, database access, click-stream capturer. Analyses: graph analysis, behavior analysis, semantics analysis, psychological analysis, multimodality analysis. Output: detecting & predicting.]
Beyond the traditional security framework: leverage IBM Research strengths in
1. Semantics: IBM Watson Q&A Framework
2. Graphs and Machine Learning: SmallBlue (IBM Atlas) Social Network Analysis
3. Large-Scale Processing: IBM BigInsights
4. Stream Processing: IBM InfoSphere Streams
Cybersecurity Applications of Social Network Analytics
• Hypothesis: Interaction between people is where the major activities are, and it is the major target for cybersecurity
• Security Applications:
  – Spam and Malware Propagation
  – Bipolar communications in Targeted Attack – Human vs. Malware
  – Data Exfiltration through Social Networks – finding sensitive data that have been leaked
  – Entity Resolution by linking Social Media Networks and Organization Networks
• Detection and Early Warning of Anomalies:
  – Suspicious Communications
  – Collusive Behaviors
  – Inappropriate Online Social Network Postings
Breakthrough Needed: Science advances and an advanced multimodality analytics platform to model, learn, and predict relation networks and dynamic people behavior
[Figure: Use scenario 1 contrasts an attacker/spammer network (near-star) with a normal network ((1) clique-like, (2) two-way links). Use scenario 2: data leakage prediction; malware propagation prediction.]
Cognitive Network
• Cognitive Network, e.g.:
  – 30,000 nodes of dynamic brain MRI functional networks
[Figure: Cognitive Network]
New Type of EEG Detector and Signal Analysis
• Fundamental Research on EEG Signal Processing
• New Dry Sensing
  – Classifying Attention, Relaxation, etc.
  – Classifying Target – P300 signals
  – Classifying Visual Cortex Signals
• Breakthrough Non-Contact Sensing – suitable for everyday/normal use
• Cognitive Wireless Sensor becomes possible
[Figure: EEG wireless sensor developed by an IBM partner]
Our Multi-Modality Analysis Platform
Task 0: Infrastructure
Task 1: Network Analytics
Task 2: Semantic Analytics
Task 3: Multimodality Learning
Task 4: Behavior Reasoning & Graphical Models
Task 5: Interface
Application Example: Composite Social-Cognitive-Info Wireless Network
[Figure: devices connected over 3G provide EEG / audio signal detection, GPS / location detection, and information display; the network supports status monitoring, visualization, personalized information recommendation, routing, etc.]
New Social Visual Sensor for Trustworthy Face Detection
• In fMRI and EEG studies of human faces, people "unconsciously" judge the trustworthiness of a novel face within 100 ms.
• The rating difference between "consensus trustworthy" faces and "consensus untrustworthy" faces is on average 0.65 points on a 5-point scale.
• Is it possible to detect the "trustworthy" rating of a human face automatically from visual signals?
[Figures: Face responses at the amygdala (Engell, 2007); Faces vs. descriptions (Todorov, 2007)]
Privacy Lesson
Perception > Policy > Law
Privacy laws worldwide
• European Union: European Data Protection Directive (1995)
• Canada: PIPEDA (2001 - 2004)
• U.S. – Sectoral:
  – Children's Privacy; COPPA (1999)
  – Financial Sector; GLB (2001)
  – Health Sector; HIPAA (2002)
  – California Privacy (2005)
• Taiwan: Computer-Processed PD Protection Law (1995)
• South Korea: Info & Comm Network Util. & Info Protection Law (2000)
• Japan: Personal Data Protection Act (2005)
• APEC: Guidelines (2004)
• Russia: Federal law on Pers Data (January 2007)
• Australia: Privacy Amendment Act (2001)
• New Zealand: Privacy Act (1993)
• Chile: Protection of Private Life Law (1999)
• Argentina: Protection of PD Law (2000)
• Dubai: Data Protection Law (January 2007)
[Map legend: Existing Private Sector Privacy Laws; Emerging Private Sector Privacy Laws]
The most important two elements of privacy
– The claim of individuals, groups or institutions to define for
themselves when, how and to what extent information
about them is communicated to others
[Michael, Privacy, in Harris and Joseph eds., The International Covenant on Civil and Political Rights and UK Law, 1995]
– The individual's ability to control the circulation of information relating to him.
[Miller, Assault on Privacy, 1971]
Privacy is a human right and a personal perception
For social software -- a root of conflict
(United Nations) Universal Declaration of Human
Rights [1948]
Article 12: No one shall be subjected to arbitrary
interference with his privacy, family, home or
correspondence, nor to attacks upon his honor and
reputation. Everyone has the right to the protection of
the law against such interference or attacks.
(United Nations) Universal Declaration of
Human Rights [1948]
Article 19: Everyone has the right to freedom
of opinion and expression; this right includes
freedom to hold opinions without interference
and to seek, receive and impart information
and ideas through any media and regardless
of frontiers.
European Convention on Human Rights (ECHR) Article 10
• Everyone has the right to freedom of expression. This right shall include freedom to hold opinions and to receive and impart information and ideas without interference by public authority and regardless of frontiers. This article shall not prevent States from requiring the licensing of broadcasting, television or cinema enterprises.
• The exercise of these freedoms, since it carries with it duties and responsibilities, may be subject to such formalities, conditions, restrictions or penalties as are prescribed by law and are necessary in a democratic society, in the interests of national security, territorial integrity or public safety, for the prevention of disorder or crime, for the protection of health or morals, for the protection of the reputation or the rights of others, for preventing the disclosure of information received in confidence, or for maintaining the authority and impartiality of the judiciary.
Privacy Legislation -- Europe
• Three main legal instruments:
  – The European Convention on Human Rights (ECHR) Article 8, which protects the right to privacy [Rome, November 1950]
  – Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data [1995]
  – Directive 02/58/EC, concerning the Processing of Personal Data and the Protection of Privacy in the Electronic Communications Sector [2002]
The European Convention on Human Rights [1950]
• Article 8:
  1. Everyone has the right to respect for his private and family life, his home and his correspondence.
  2. There shall be no interference by a public authority with the exercise of this right except such as is in accordance with the law and is necessary in a democratic society in the interests of national security, public safety or the economic well-being of the country, for the prevention of disorder or crime, for the protection of health or morals, or for the protection of the rights and freedoms of others.
Workplace privacy
• Company as a legal person -- it can sue and be sued
• Needs from the employers' viewpoint:
  1. Employer Liability for Employee Misuse of Company-Owned Technology
  2. Protection of Trade Secrets and Avoidance of Corporate Defamation
  3. Discovery in Litigation
  4. Productivity of Personnel and Systems
• Needs from the employees' viewpoint:
  – Human Rights
Timing of Legislation
• Legislation mostly lags behind what has already happened; it is a collective social action that seeks a common-ground resolution to existing issues.
• Most countries' privacy laws were enacted in the technical era of:
  – Database technology – no concept of search engine
  – Web 1.0 – no concept of social software
  – Combating spam
  – Regulating the behavior of government or personal-data collection companies
  – Regulating marketing behavior
Definition of Personal Data in EU Directive 95/46/EC
• Article 2 (a):
  – Personal data shall mean
    • any information
      – Both objective and subjective information about a person
      – Irrespective of the technical medium
    • relating to
      – Check 'content', 'purpose' or 'result'
    • an identified or identifiable
      – Depends on the means likely reasonably to be used by the controller or by any other person to identify that person
      – The particular context and circumstances play an important role
    • natural person
      – About a living individual
      – In principle, not including a 'legal person'. However, member states may extend legislation to legal persons
Criteria for Making Data Processing Legitimate (in 95/46/EC)
• Article 7:
  – Personal data may be processed only if:
    • the data subject has unambiguously given his consent; or
    • for the performance of a contract to which the data subject is party or in order to take steps at the request of the data subject prior to entering into a contract; or
    • for compliance with a legal obligation to which the controller is subject; or
    • in order to protect the vital interests of the data subject; or
    • for the performance of a task carried out in the public interest….; or
    • for the legitimate interests pursued by the controller or by the third party or parties to whom the data are disclosed, except where such interests are overridden by the interests for fundamental rights and freedoms of the data subject which require protection under Article 1(1) (i.e., fundamental rights and freedoms of natural persons).
Controller shall mean the natural or legal person, public authority, agency or any other body which alone or jointly with others determines the purposes and means of the processing of personal data; where the purposes and means of processing are determined by national or Community laws or regulations….
Information Collection (in 95/46/EC)
• Article 11: Information where the data have not been obtained from the data subject
  – The controller or his representative must, at the time of undertaking the recording of personal data or, if a disclosure to a third party is envisaged, no later than the time when the data are first disclosed, provide the data subject with at least the following information:
    • The identity of the controller and of his representative
    • The purpose of the processing
    • Any further information such as
      – The categories of data concerned
      – The recipients or categories of recipients
      – The existence of the right of access to and the right to rectify the data concerning him
Types of personal data (in 95/46/EC)
• Article 8:
  – Member states shall prohibit the processing of these sensitive personal data:
    • Racial or ethnic origin
    • Political opinions
    • Religious or philosophical beliefs
    • Trade-union membership
    • Health or sex life
  – The above can be processed if:
    • The data subject has given explicit consent, except where prohibited by law
    • For carrying out the obligations and specific rights of the controller in the field of employment law
    • To protect the vital interests of the data subject
    • In the course of its legitimate activities with appropriate guarantees by a foundation, association, or any other non-profit-seeking body… solely to the members of the body…
    • Data are manifestly made public by the data subject
Case Law
• Lindqvist v. Jonkoping [2002]
• Opinion of Advocate General:
  – Mrs. Lindqvist was a part-time, voluntary catechist in a parish in Sweden.
  – She set up a webpage for the parish. The homepage of the Swedish church linked directly to it.
  – It contained information about the parish including: the names, and on some occasions the full names, of other employees and herself; her colleagues' jobs and hobbies; telephone numbers and other personal information.
  – Additionally, it mentioned that one of her colleagues worked part time because she had health problems.
  – Mrs. Lindqvist did not notify her colleagues about the webpage, nor did she inform the Datainspektionen (Information Commissioner for Sweden).
Questions in the Lindqvist v Jonkoping case
1. Is it in the scope of the Directive? Does it constitute the processing of personal data by automatic means to list on a webpage a number of persons with comments about their jobs and hobbies?
2. Can the act of loading such information onto a webpage be regarded as outside the scope of the Directive, under the exceptions?
3. Is information on a webpage stating that a named colleague has injured her foot and is on half-time on medical grounds personal data concerning health which may not be processed?
4. If a person in Sweden uses a computer to load personal data onto a webpage stored on a server not in Sweden, does that constitute a transfer of data to a third country?
5. Can the provisions of the Directive, in a case such as the above, be regarded as contradictory with the general principles of freedom of expression or other freedoms and rights, which are applicable within the EU and are enshrined in ECHR Article 10?
Processing of personal data and the protection of privacy in the electronic communications sector (02/58/EC)
• Regulates ISPs and telecommunication providers
• Specifically focuses on marketing
• Direct marketing email messages may be sent only to subscribers who have given their prior consent ("opt-in"). Prior permission is required for B2C communication covering all "natural persons".
• For B2B communication, EU member states are free to make "opt-out" the minimum legislation.
• In the US, the CAN-SPAM Act allows direct marketing email messages to be sent to anyone, without permission, until the recipient explicitly requests that they cease ("opt-out").
US vs. Europe
• The US relies on a mix of legislation, regulation, and self-regulation.
• The European Union relies on comprehensive legislation that requires creation of government data protection agencies, registration of databases with those agencies and in some instances prior approval before personal data processing may begin.
• Safe Harbor Program (US Dept. of Commerce and European Commission):
  – The EU would prohibit the transfer of personal data to non-European Union nations that do not meet the European "adequacy" standard for privacy
Safe Harbor Program – 7 requirements
• Notice
• Choice
• Onward Transfer
• Access
• Security
• Data Integrity
• Enforcement
IBM signed up on 8/15/2002
Questions?