Post on 11-May-2015
SNS Analysis using Cloud Computing Services
DHT-based Key-Value Storage and MapReduce-based Analysis
DongWoo Lee
oiko.cloud@gmail.com
SocialFlowOikoLabDSOiko
Laboratory 2CloudKR
PlatformDay2009
Agenda
‣ Introduction
  • Social Network Service
  • Motivation : Visualization, Social Network Analysis
  • SocialFlow
  • Scale Out Technologies : Cloud Computing
‣ SNS Analysis Architecture based on Cloud
  • Overall Process
  • Crawling
  • DHT Storage (CouchDB)
  • MapReduce
  • Pair-Wise Similarity
‣ Cloud Computing Service
  • Amazon Web Service
  • EC2 / S3 / Elastic MapReduce
  • Tips
‣ References
Introduction
Mobile Device / Cloud Computing / Social Network
Social Network Service
“Social Applications = Social Networks”
“A social network is a collection of people bound together through a specific set of social relations.”
“A collection of people is a social network if and only if it is possible for something to spread virally through that collection.”
Social Network Services : Twitter, Facebook
Social Applications
Social Networks
http://www.vincos.it/world-map-of-social-networks/
Social Network Analysis
‣ Social Graph Analysis
‣ Visualization
‣ Person-to-Person Relationship
‣ Temporal Mind Mining (Content Clustering)
‣ Post-Mortem Log Processing
Social Network Analysis : Visualization
‣ Social Graph (50 People)
Social Network Analysis : Visualization
‣ Social Graph (100 People)
Social Network Analysis : Visualization
‣ Social Graph (200 People)
‣ Limitations
  • Visualization
  • Computational Complexity
Social Network Analysis : Visualization
‣ Social 3D Graph
SocialFlow
‣ Thoughts, Feelings, Interests, Relationship and Information of SNS
‣ Real-time Massive Social Data Streams
‣ Difficult to follow the Social Streams
‣ Need a way to get a summary or clustered information based on Common Interests
SocialFlow
‣ Getting Common Flows of people through Content Similarities
‣ Reflecting Short-Term Interests of People
‣ Extracting Hot Issues
‣ Revealing Relationships among In/Out Resources
‣ Implementing Scale-Out Technologies
‣ Evolving toward Recommendation System based on Collective Intelligence
Scale Out Technologies : Cloud Computing
Why Cloud Computing?
‣ SPOF (Single Point of Failure)
‣ Cluster Administration (Who does this?)
‣ Initial Infrastructure Investment (Risk Management)
‣ Focus on Main Thing (Intelligence)
‣ Enable Highly Scalable Services
New resource provision paradigms for Grid Infrastructures: Virtualization and Cloud / ISGC 2009
http://tinyurl.com/nacgu7
Cloud Computing: e.g. Storage Failure
Failure Trends in a Large Disk Drive Population, by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, Google Inc.
SNS Analysis Architecture based on Cloud
Experimental Project
SocialFlow
‣ Python / Django / Boto
‣ ML / Data Mining
‣ DHT / CouchDB
‣ Cloud / AWS S3, EC2, Hadoop MapReduce
Workflow
[Diagram: SNS, Crawler, MapReduce, Post-Processing, CDN, User; the stages are split between an In-house Cluster (Local DataCenter) and a Cloud Service]
Technologies : Before
[Diagram: layers Key-Value Storage, Consistent DHT, MapReduce, Machine Learning, Crawlers; components CouchDB, CouchJS, Hash_ring, HomeMade]
Technologies : After
[Diagram: layers Key-Value Storage, Consistent DHT, MapReduce, Machine Learning, Crawlers, Storage; components CouchDB, EC2 Hadoop, Hash_ring, HomeMade, S3]
Crawling
[Diagram: four Crawlers write through the DHT (with replication) to four DBs; an Indexer (Mapper) reads the DBs and writes an Index File of [ term, doc ] pairs]
‣ Fetching recent postings of SNS
‣ Storing fetched postings to CouchDB Storage through the DHT Layer (which selects a server)
‣ Pushing raw data into the Cloud to process it with MapReduce
Consistent DHT (Distributed Hash Table)
‣ Uniform key distribution and load balancing with a good hash function
‣ Minimizing the effects of a storage crash or temporary downtime
‣ High availability with a replication scheme
[Diagram: ring of nodes 0, 1, 2, ..., k-1, k, k+1, ..., N-1; node k keeps replicas on its neighbors: Replicate(k, k-1, k+1)]
‣ Notice: A real node owns non-contiguous portions of the total key space.
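The ring described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Hash_ring library the deck uses: the MD5 hash, the `vnodes=64` default, and the `HashRing` class itself are assumptions made for the example. Each real node is placed on the ring at many virtual points, which is one common way a node ends up owning scattered portions of the key space.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring in which every real node owns many virtual points."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> real node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place `vnodes` virtual points for this node on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        """Drop every virtual point owned by this node (e.g. after a crash)."""
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key):
        """Return the node owning the first ring point at or after hash(key)."""
        idx = bisect.bisect_left(self._points, _hash(key))
        return self._owner[self._points[idx % len(self._points)]]
```

With a ring built from hypothetical server names such as `HashRing(["couch-a", "couch-b", "couch-c", "couch-d"])`, removing one server only remaps the keys that server owned; every other key keeps its node, which is the property that limits the effect of a storage crash.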
Consistent DHT (Distributed Hash Table)
[Diagram labels: DHT ring (Node 0, 1, 2, ..., k-1, k, k+1, ..., N-1), Memory Cache, DHT Front End, AWS S3 (html, image), SNS Analysis, SNS Crawler, Admin View, User View, Anonymous User Traffic, Admin Traffic, Generated Contents]
Consistent DHT : Replication
[Diagram: ring of nodes A, B, C, D; each node stores its own data plus replicas from its two neighbors, so data B lives on node B and on two replica nodes]
* Replica = 2
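A small sketch of the neighbor replication rule, assuming nodes are numbered 0 to N-1 around the ring as in the earlier diagram. The function names and the failover policy (try the primary first, then the two neighbors) are illustrative assumptions; the deck only states Replicate(k, k-1, k+1) and Replica = 2.

```python
def replica_nodes(k: int, n: int) -> list[int]:
    """Replicate(k, k-1, k+1): the primary node k and its two ring neighbors."""
    if n < 3:
        raise ValueError("need at least 3 nodes for two distinct replicas")
    return [k % n, (k - 1) % n, (k + 1) % n]


def read_with_failover(k: int, n: int, alive: set[int]) -> int:
    """Return the first live node among the primary and its replicas."""
    for node in replica_nodes(k, n):
        if node in alive:
            return node
    raise RuntimeError("primary and both replicas are down")
```

For a four-node ring, `replica_nodes(0, 4)` wraps around to `[0, 3, 1]`, and a read for node 1 still succeeds on node 0 when node 1 is down.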
CouchDB (Key-Value Storage)
‣ Erlang-based Key-Value Storage
‣ Storage Engine (MVCC, B-tree)
‣ RESTful API
‣ Server-side JavaScript Engine (MapReduce)
‣ View Engine
‣ Futon Web UI
CouchDB: Server-side JavaScript
‣ Purpose
  • Local Computations on Local Data Sets
‣ Features
  • Mozilla’s SpiderMonkey
  • MapReduce Framework with JavaScript
  • Fork External Process (couchjs)
‣ Performance Enhancements Expected
  • Google’s V8 (Chrome’s JavaScript Engine / JIT)
http://tinyurl.com/m76sx3
CouchDB: MapReduce
doc = (d1, d2, fq)
dx: { di }
Map & Reduce : Pair-Wise Similarity
[Diagram: DBs → Indexer (Mapper) → Index File [ term, doc ] → DocGrouper (Reducer) → Group File [ term, { docs } ] → DocCombinator (Mapper) → Candidate File [ doc1, doc2 ] → DocPairCounter (Reducer) → Result File [ freq, doc1, doc2 ]; a Doc File is also shown]
‣ Indexer and Grouper for Processing Korean.
‣ No NLP and No Structural Analysis.
‣ Produces a pairwise similarity between two postings.
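The four stages named on this slide can be simulated in memory as follows. This is a sketch under assumptions: terms are split on whitespace (the deck's indexer is built for Korean and is not shown), the stage functions are named after the diagram labels, and everything runs in one process rather than on Hadoop.

```python
from collections import Counter, defaultdict
from itertools import combinations


def index_map(doc_id, text):
    """Indexer (map): emit one [term, doc] pair per distinct term."""
    for term in set(text.split()):
        yield term, doc_id


def group_reduce(pairs):
    """DocGrouper (reduce): collect [term, {docs}]."""
    groups = defaultdict(set)
    for term, doc_id in pairs:
        groups[term].add(doc_id)
    return groups


def combine_map(groups):
    """DocCombinator (map): emit every [doc1, doc2] pair that shares a term."""
    for docs in groups.values():
        for d1, d2 in combinations(sorted(docs), 2):
            yield d1, d2


def count_reduce(doc_pairs):
    """DocPairCounter (reduce): [freq, doc1, doc2], highest frequency first."""
    counts = Counter(doc_pairs)
    return sorted(((f, d1, d2) for (d1, d2), f in counts.items()), reverse=True)


def pairwise_similarity(postings):
    """Run the four stages over a {doc_id: text} mapping."""
    pairs = (p for doc_id, text in postings.items() for p in index_map(doc_id, text))
    return count_reduce(combine_map(group_reduce(pairs)))
```

The frequency is simply the number of terms two postings share, which matches the "No NLP and No Structural Analysis" constraint: two postings about the same topic score high without any parsing.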
Map & Reduce : Optimization
‣ Concerns
  • Consider Key Group Size Distribution
  • Data Load Balancing
  • Barrier Point
‣ Sample Data
  • Two months of postings from my friends
  • Reachable graph: 4,060 People
  • Total Postings: 206,115
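Why the key group size distribution matters can be shown with a little arithmetic: a term shared by n postings expands into n(n-1)/2 candidate pairs, so a few very common terms can dominate a reducer. The greedy balancing below is an illustrative assumption, not the method used in the deck; it hands the largest groups out first, each to the least-loaded reducer.

```python
import heapq


def pair_count(n: int) -> int:
    """Number of doc pairs a term group of n docs expands into."""
    return n * (n - 1) // 2


def balance(group_sizes: dict[str, int], reducers: int) -> list[list[str]]:
    """Greedy assignment: largest groups first, each to the least-loaded reducer."""
    heap = [(0, r) for r in range(reducers)]   # (estimated load, reducer index)
    buckets = [[] for _ in range(reducers)]
    for term, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        load, r = heapq.heappop(heap)
        buckets[r].append(term)
        heapq.heappush(heap, (load + pair_count(size), r))
    return buckets
```

A group of 1,000 postings produces 499,500 pairs while a group of 10 produces 45, so balancing by pair count rather than by number of keys keeps one reducer from holding up the barrier.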
Pair-Wise Similarity and its TreeMap
Postings: 110,008
Users: 2,691
Score >= 6
Pair-Wise Similarity and its Cluster
➡ One issue and different opinions among people
Pair-Wise Similarity and its Cluster
➡ Common Interest / Hot Issue
Pair-Wise Similarity and its Cluster
➡ One person and similar content patterns (specialty)
Pair-Wise Similarity and its Cluster
➡ Similar sentence structure (trendy, parody)
Deployment
www
Flickr
S3/CloudFront
EC2
Cloud Computing Service
Before the Cloud Age
‣ Smart Shell Guru’s Daily Work : Parallel Sort
Split the data on the master:
$ wc -l data
$ split -l 1000k data
Distribute the pieces to the workers (scp / NFS), then on each worker:
$ nohup ./work.sh data1 > data1.processed
$ nohup sort -r data1.processed > data1.sorted
Collect the sorted pieces (scp / NFS) and merge them on the master:
$ sort -rm data*.sorted > data.sorted
➡ Need to prepare/maintain physical machines and resources
➡ Need to monitor job progress (wait and see job’s status)
➡ Need to cope with machine failure (slave nodes / storages / networks)
➡ Need to schedule multiple jobs
Complexity
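The same split, sort, and merge pattern can be written as a short Python sketch. Threads stand in for the separate machines here, which is an assumption for illustration only; the function names are hypothetical, and `heapq.merge` plays the role of `sort -rm`.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split(lines, chunk_size):
    """Like `split -l`: cut the input into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def sort_chunk(chunk):
    """Like `sort -r` on one worker: sort a chunk in descending order."""
    return sorted(chunk, reverse=True)


def parallel_sort(lines, chunk_size=1000, workers=4):
    """Split, sort the chunks concurrently, then merge like `sort -rm`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sort_chunk, split(lines, chunk_size)))
    return list(heapq.merge(*sorted_chunks, reverse=True))
```

What the sketch leaves out is exactly what the slide lists as pain points: provisioning the machines, watching job progress, and recovering when a worker dies.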
Amazon Web Service : Overview
‣ EC2 (Elastic Compute Cloud)
  • Instances launched from an AMI (Machine Image)
  • Clients over HTTP, Admin over SSH
‣ EBS (Elastic Block Store): 1 GB to 1 TB, mounted on EC2
‣ S3 (Simple Storage Service): Buckets, Objects, Permissions
‣ SimpleDB: key-value
‣ CloudFront: Edges
‣ SQS (Simple Queue Service): Messages
‣ Elastic MapReduce: Instant EC2 Hadoop Cluster
‣ Monitoring: CloudWatch, Auto Scaling, Elastic Load Balancing
‣ Import/Export: Offline, eSATA/USB
‣ Access: API over HTTP, Mgmt Console, EC2 CLI
‣ Credentials: Access Key ID, Secret Access Key, Key Pair
Amazon Web Service
‣ AWS Management Console
AWS : AMI
AMI: Amazon Machine Image
AWS : Paid AMI / The Cloud Market
AMI: Amazon Machine Image
Paid AMI
AWS : How to make an AMI (1)
Loopback File
# dd if=/dev/zero of=new_image.fs bs=1M count=1024
Make ext3 file system
# mke2fs -F -j new_image.fs
# mkdir /mnt/ec2-fs
# mount -o loop new_image.fs /mnt/ec2-fs
# mkdir /mnt/ec2-fs/dev
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x console
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x null
# /sbin/MAKEDEV -d /mnt/ec2-fs/dev -x zero
# mkdir /mnt/ec2-fs/etc
Create /mnt/ec2-fs/etc/fstab (Add /dev/sda1 --> /, /dev/pts, shm, /proc, /sys)
Create yum-xen.conf
# mkdir /mnt/ec2-fs/proc
# mount -t proc none /mnt/ec2-fs/proc
# yum -c yum-xen.conf --installroot=/mnt/ec2-fs -y groupinstall Base
Edit /mnt/ec2-fs/etc/sysconfig/network-scripts/ifcfg-eth0
Edit /mnt/ec2-fs/etc/sysconfig/network
Edit /mnt/ec2-fs/etc/fstab (Add /dev/sda2 --> /mnt, /dev/sda3 --> swap)
# chroot /mnt/ec2-fs /bin/sh
Edit services
AWS : How to make an AMI (2)
Building an AMI
# yum install ruby
# rpm -i ec2-ami-tools-noarch.rpm (Download from public s3 bucket)
# ec2-bundle-image -i new_image.fs -k my-private-key.key -u aws-user-id
Local Machine Root File System
# ec2-bundle-vol -k my-private-key.key -s 1000 -u aws-user-id
Upload to S3
# ec2-upload-bundle -b my-bucket -m image.manifest -a my-aws-access-key-id -s my-secret-key-id
Register AMI
# ec2-register my-bucket/image.manifest
IMAGE ami-xxxx
Testing
# ec2-describe-images ami-xxxx
Deregister AMI
# ec2-deregister ami-xxxx
Running AMI
# ec2-run-instances ami-xxxx -n 1
http://docs.amazonwebservices.com/AWSEC2/2006-06-26/DeveloperGuide/
AWS : EC2 Running Instance
‣ AWS Management Console
Amazon Web Service: Access Methods
‣ Access Key ID / Secret Access Key / Key Pairs
‣ AWS Management Console
‣ EC2 API (WSDL) / EC2 CLI (Command Line Interface)
‣ SSH
‣ Firefox Extensions
  • S3 Firefox Organizer
  • Elasticfox
‣ S3
  • DNS: s3 CNAME s3.amazonaws.com.
  • e.g. Bucket Name: s3.xyz.com, so http://s3.xyz.com ---> S3’s s3.xyz.com bucket
‣ s3cmd (python)
‣ s3cmd.rb / s3sync.rb (ruby)
‣ S3Hub (Mac)
Amazon Web Service: Elasticfox
‣ Firefox’s Extension: Elasticfox
Amazon Web Service: Elasticfox
‣ Key Pairs
‣ Private Key
‣ SSH
Amazon Web Service: Elasticfox
‣ Security Groups
‣ Open Network Ports
AWS: Elastic MapReduce
‣ EC2 + Hadoop
‣ Tools
  • Management Console
  • elastic-mapreduce CLI
‣ Preparation
  • Code --> S3
  • Data --> S3
  • Log Folder
  • Output Folder
‣ Job Flow
  • Streaming
  • Custom Jar
  • Sample Applications
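For the Streaming job flow type, the code uploaded to S3 is typically a pair of scripts that read lines and write tab-separated key/value lines. The sketch below shows that contract with a term count, written as plain generator functions so it can be checked locally; the function names are hypothetical, and in a real streaming job each function would be wrapped in a script that reads `sys.stdin` and prints its output.

```python
from itertools import groupby


def mapper(lines):
    """Emit "term<TAB>1" for every whitespace-separated term."""
    for line in lines:
        for term in line.split():
            yield f"{term}\t1"


def reducer(lines):
    """Sum counts per term; assumes the input is sorted by key, as Hadoop delivers it."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for term, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{term}\t{sum(int(count) for _, count in group)}"
```

Sorting the mapper output before feeding it to the reducer, for example `reducer(sorted(mapper(lines)))`, imitates the shuffle phase well enough to debug the logic before paying for a cluster.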
AWS: Elastic MapReduce : Web UI
AWS: Elastic MapReduce : CLI for Workflow
jobflow #id
input/* → Step1 → output1/part-000** → Step2 → output2/part-000** → Step3 → output3/part-000**
AWS: Elastic MapReduce
‣ Failed tasks will be rescheduled on other Hadoop slaves.
‣ Once a task finishes, other running instances of the same task are killed by the tracker.
AWS: SocialFlow Automation
[Diagram: three zones (Home IDC, Amazon, Wild World) with Local and Global scopes; components: DHT, S3, an EC2 pool launched from Python with boto, Results, Renderer; access: Users Read Only, Admin Read/Write]
AWS: EC2, EMR Price Model
Service / Type                       Size  Per Instance Hour  1 Week (7 Days)  1 Week (7 Days, KRW)
EC2 On-Demand                        (S)   $0.10              $16.80           KRW 20,865
EC2 On-Demand                        (L)   $0.40              $67.20           KRW 83,462
EC2 On-Demand                        (E)   $0.80              $134.40          KRW 166,924
EC2 Reserved (1yr $325, 3yr $500)    (S)   $0.03              $5.04            KRW 6,259
EC2 Reserved (1yr $325, 3yr $500)    (L)   $0.12              $20.16           KRW 25,038
EC2 Reserved (1yr $325, 3yr $500)    (E)   $0.24              $40.32           KRW 50,077
Elastic MapReduce On-Demand          (S)   $0.10 + $0.015     $19.32           KRW 23,995
Elastic MapReduce On-Demand          (L)   $0.40 + $0.06      $77.28           KRW 95,981
Elastic MapReduce On-Demand          (E)   $0.80 + $0.12      $154.56          KRW 191,963
1 USD = 1242 KRW
(S) = Small, (L) = Large, (E) = Extra Large
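The weekly figures follow from hourly price × 24 × 7, converted at the stated 1,242 KRW per USD. A short check, assuming the KRW amounts are truncated rather than rounded (which is what reproduces the slide's numbers); `weekly_cost` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_DOWN

HOURS_PER_WEEK = 24 * 7
KRW_PER_USD = Decimal("1242")


def weekly_cost(hourly_usd: str, surcharge_usd: str = "0"):
    """Return (USD, KRW) for one instance running a full week."""
    usd = (Decimal(hourly_usd) + Decimal(surcharge_usd)) * HOURS_PER_WEEK
    krw = (usd * KRW_PER_USD).to_integral_value(rounding=ROUND_DOWN)
    return usd, int(krw)
```

For Elastic MapReduce the second argument carries the per-hour EMR charge on top of the EC2 price, for example `weekly_cost("0.10", "0.015")` for a Small instance.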
AWS: Performance
http://tinyurl.com/qj6ao7
AWS: Performance
http://tinyurl.com/p9jsyz
AWS: Performance
http://tinyurl.com/cqqxgl
10 Cent Tips
‣ AWS EC2
  • Minimizing set-up time with prepared shell scripts
  • Use Boto for automating deployments
  • Use S3 (Free of Charge between S3 and EC2 in the same region)
  • $0.030 per GB through June 30, 2009 ($0.1 per GB normal price)
‣ AWS Elastic MapReduce
  • Enabling the SSH port (22) and Hadoop related ports (9100, 9101)
  • Access to Master Node: ssh -i keypair hadoop@public_dns_name
  • Double Check (PATH, etc)
  • Debug, Debug, Debug
  • Use EC2 for Hadoop (e.g. Cloudera’s Hadoop AMI) (No extra cost for Hadoop!)
10 Cent Tips
‣ AWS S3
  • Setting HTTP header for images and static resources.
    Cache-Control: max-age=31536000
  • Block Search Bots: robots.txt at the root of a Bucket
    User-agent: *
    Disallow: /
  • Using BitTorrent for large files
    http://s3.xyz.com/xfile.zip?torrent
  • Compress Rendered HTML with gzip
    Content-Encoding: gzip
$ s3cmd put index.html s3://s3.xyz.com/www \
    --mime-type "text/html" \
    --add-header "Content-Encoding: gzip" \
    --acl-public
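The `Content-Encoding: gzip` header only describes the object; S3 serves the stored bytes as they are, so the rendered HTML has to be compressed before it is uploaded. A minimal sketch of that step with the standard library, where `gzip_html` is a hypothetical helper and the returned headers mirror the s3cmd flags above:

```python
import gzip


def gzip_html(html: str):
    """Return (body, headers) for a pre-compressed HTML object."""
    body = gzip.compress(html.encode("utf-8"))
    headers = {
        "Content-Type": "text/html",
        "Content-Encoding": "gzip",
    }
    return body, headers
```

Writing `body` to a file and uploading that file with the headers shown lets browsers decompress the page transparently, while the stored object and the transfer stay small.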
Amazon Web Service : Limitations
References
‣ 10 MapReduce Tips, Cloudera, http://tinyurl.com/pxuqup
‣ Christian Charras, Thierry Lecroq, Handbook of Exact String-Matching Algorithms
‣ Dan Pritchett (eBay), BASE: An Acid Alternative, p.48-55, ACM Queue May/June 2008
‣ Edward Chang (Google Research), Mining Large Scale Social Networks, MMDS ’08
‣ Edward Walker, Benchmarking Amazon EC2 for high-performance scientific computing
‣ Matei Zaharia et al, Improving MapReduce Performance in Heterogeneous Environments, OSDI ’08
‣ Following Twitter
  • http://twitter.com/AmazonEC2
  • http://twitter.com/AmazonS3S3