1 insights into access patterns of internet media systems lei guo

46
1 Insights into Access Pat terns of Internet Media Systems Lei Guo

Upload: virgil-lindsey

Post on 03-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

  • Insights into Access Patterns of Internet Media SystemsLei Guo

  • Media content on the InternetVideo applications are mainstream

    Video traffic is doubling every 3 to 4 monthsYouTube: more than 100 M daily streams since July 2006No. 3 until nowGoogle Yahoo!YouTube

  • Existing media delivery systemsP2P networkHigh CPU and bandwidth consumption, or unsatisfactory QoSCaching for performance improvementA lot of research studiesCommercial productsAccess patterns play an important roleExploit temporal localityHigh performance with cost effective hardware

  • Zipf model of Internet traffic patternsZipf distribution (power law)Characterizes the property of scale invarianceHeavy tailed, scale free

    80-20 ruleIncome distribution: 80% of social wealth owned by 20% people (Pareto law)Web traffic: 80% Web requests access 20% pages (Breslau, INFOCOM99)

    System implicationsObjectively caching the working set in proxySignificantly reduce network traffica: 0.6~0.8Reference rank distribution

    i : rank of objectsyi : number of references

  • Does Internet media traffic follow Zipfs law?Chesire, USITS01: Zipf-likeCherkasova, NOSSDAV02: non-ZipfAcharya, MMCN00: non-ZipfYu, EUROSYS06: Zipf-likeVeloso, IMW02: Zipf-likeSripanidkulchai, IMC04: non-ZipfGummadi, SOSP03: non-ZipfIamnitchi, INFOCOM04: Zipf-like

  • Inconsistent media access pattern modelsStill based on the Zipf modelZipf with exponential cutoffZipf-Mandelbrot distributionGeneralized Zipf-like distributionTwo-mode Zipf distributionFetch-at-most-once effectParabolic fractal distributionAll case studiesBased on one or two workloadsDifferent from or even conflict with each otherAn insightful understanding is essential to guideContent delivery system designInternet resource provisioningPerformance optimization

  • OutlineMotivation and objectivesStretched exponential of Internet media trafficDynamics of access patterns in media systemsCaching implicationsConclusion and future work

  • Workload summary16 workloads in different media systemsWeb, VoD, P2P, and live streamingBoth client side and server sideDifferent delivery techniquesDownloading, streaming, pseudo streamingOverlay multicast, P2P exchange, P2P swarmingData set characteristicsWorkload duration: 5 days - two years Number of users: 103 - 105Number of requests: 104 - 108Number of objects: 102 - 106nearly all workloads available on the Internetall major delivery techniquesdata sets of different scales

  • Stretched exponential distributionMedia reference rank follows stretched exponential distribution (verified by Chi-square test)i : rank of media objects (N objects)y : number of referencesReference rank distribution: fat head and thin tail in log-log scale straight line in logx-yc scale (SE scale)c: stretch factor

  • Set 1: Web media systems (server logs)ST-SVR-01 (15 MB)*HPC-98 (14 MB)*HPLabs-99 (120 MB)HPC-98: enterprise streaming media server logs of HP corporation (29 months)HPLabs: logs of video streaming server for employees in HP Labs (21 months)ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months)log scale in x axispowered scale yclog scalefat headthin tailc = 0.22R2 = 0.977Y left: y^c scaleY right: log scaleParameters: maximum likelihood methodR2: coefficient of determination (1 means a perfect fit)x: rank of media object y: number of references to the object

  • Set 2: Web media systems (req packets)ST-CLT-05 (4.5 MB)PS-CLT-04 (1.5 MB)ST-CLT-04 (2 MB)PS-CLT-04: HTTP downloading/pseudo streaming requests, 9 daysST-CLT-04: RTSP/MMS on-demand streaming requests, 9 daysST-CLT-05: RTSP/MMS on-demand streaming requests, 11 daysAll collected from a big cable network hosted by a large ISPpowered scale yclog scalefat headthin taillog scale in x axisx: rank of media object y: number of references to the object

  • Set 3: VoD media systemsmMoD-98: logs of a multicast Media-on-Demand video server, 194 days

    CTVoD-04: streaming serer logs of a large VoD system by China telecom, 219 days, reported as Zipf in EUROSYS06

    IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure)

    YouTube-06: cumulative number of requests to YouTube video clips, by crawling on web pages publishing the data*mMoD-98 (125 MB)*CTVoD-04 (300 MB)IFILM-06 (2.25 MB)YouTube-06 (3.4 MB)powered scale yclog scalefat headthin taillog scale in x axis

  • Set 4: P2P media systemsBT-03 (636 MB)*KaZaa-02 (300 MB)*KaZaa-03 (5 MB)KaZaa-02: large video file (> 100 MB. Files smaller than 100 MB are intensively removed) transferring in KaZaa network, collected in a campus network, 203 days.

    KaZaa-03: music files, movie clips, and movie files downloading in KaZaa network, 5 days,reported as Zipf in INFOCOM04.

    BT-03: 48 days BitTorrent file downloading (large video and DVD images) recorded by two tracker sitesfat headthin tail

  • Set 5: Live streaming and movie picturesIMDB-06Akamai-03Movie-02Akamai-03: server logs of live streaming media collected from akamai CDN, 3 months, reported as two-mode Zipf in IMC04

    Movie-02: US movie box office ticket sales of year 2002.

    IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web sitefat head thin tail

  • Why Zipf observed before?Media traffic is driven by user requestsIntermediate systems may affect traffic patternEffect of extraneous traffic (ads video insertion)Filtering effect due to caching (through a proxy)Biased measurements may cause Zipf observationcachead servermedia serverproxy

  • Extraneous media trafficmeta file link web serverstreamingmedia serverads serverads cliplogo clipvideoprogramad and logo video are pushed to clients mandatorily

  • Effects of extraneous traffic on reference rank distributionsDo not represent user access patternsHigh request rate (high popularity)High total number of requestsNot necessary Zipf with extraneous trafficExtraneous traffic changeAlways SE without extraneous traffic Small object sizes, small traffic volume200420052004: 2 objects2005: mergedinto 1 objectNon-ZipfZipfwith extraneous traffic

  • Caching effectWeb workload: caching can cause a flattened head in log-log scaleStretched exponential is not caused by caching effectLocal replay events can be traced by commercial streaming media protocols

    Play summaryCache validationServer logspacket snifferClient side workload

  • Why media access pattern is not ZipfRich-get-richer and Zipf / power lawPareto law: income distributionWeb access is ZipfPopular pages can attract more usersYahoo! ranks No.1 more than six yearsPages update to keep popularZipf-like for long durationMedia access is notPopularity decreases with time exponentiallyMedia objects are immutableRich-get-richer not presentNon-Zipf in long duration--- Web--- VideoNumber of distinct weekly top N popular objects in 16 weeks

    Top 1 Web object never changesTop 1 video object changes every week161

  • OutlineMotivation and objectivesStretched exponential of Internet media trafficDynamics of access patterns in media systemsCaching implicationsConclusion and future work

  • Dynamics of access patterns in media systemsMedia reference rank distribution in log-log scaleDifferent systems have different access patternsThe distribution evolves over time in a system (NOSSDAV02)

    All follow stretched exponential distributionStretch factor: cMinus of slope: a

    Dynamics and physical meaningsMedia file sizesAging effects of media objectsDeviation from the Zipf model

  • Stretch factors of different systems

  • Stretch factor and media file sizeOther issuesDifferent encoding rates (file length in time is a better metric)Different content type: entertainment, educational, businessStretch factor c reflects the view time of a video objectEDUBIZfile size vs. stretch factor c

    0 5 MB: c 100 MB: c >= 0.3c increases with file size,despite the underlying system

  • Stretched exponential parameters over timeIn a media system over time tConstant request rate reqConstant object birth rate objConstant median file sizeStretch factor c is a time invariant constantParameter a increases with time tObjects created in [0, t)Objects created in (-, 0]: ~O(log t)Popularity decreases exponentially

  • Evolution of media reference rank distributionVoD (IFILM)P2P (BT)Reference rank distribution (slope = -a)Parameter aWeb mediat0: system upgradeSlow down12Model confirmed:a increases with time, anda converges to a constant

    Caused by objects created before workload collectionSpeed up

  • Deviation from the Zipf modela increases with c (c < 2)a increases with req/obj a increases with t Deviation increases with time Big file workloads have larger deviationOEEFOrank (log)reference # (log)Deviation: In which case SE is close to Zipf?

  • Deviation from Zipf for different sized movie videosMovie trailer (IFILM)VoD (CTVoD)US Movie (income)2.25 Mdeviation = 0.25300 Mdeviation = 0.45movie theaterdeviation = 0.62reference # (log)reference # (log)reference # (log)rank (log)rank (log)rank (log)Big/long media files have large deviationLarge stretch factor cHigh req/obj (movie has low production rate)

  • Example: YouTube Video Measurements in IMC0780 days in a campus networkAll downloads for all timeLarge a log N, large deviationSmall a log N, small deviationShort time, old objects dominant: N(t) >> objtNo old objectsN(t) = 0In campus: small request rate reqEntire YouTube: huge request rate reqZipfnon-ZipfN: small # of objectsN: all objectsTwo aspects of the same model: stretched exponential

  • OutlineMotivation and objectivesStretched exponential of Internet media trafficDynamics of access patterns in media systemsCaching implicationsConclusion and future work

  • Caching analysis methodologyAnalyze media caching for short term workloadDistribution is stationary (evolution takes time)Requests are independentObject has unit sizeIntuition from the distribution shapeHighly concentrated requestsLess concentrated requests

  • Modeling caching performanceAsymptotic analysis for small cache size k (k
  • Potential of long term media cachingShort term: Requests dominated by old objects, dilute concentrationLong term: Requests dominated by new objectsRequest concentration: significantly increasedRequest correlation: old objects can be replaced when unpopularTo achieve maximal concentrationVery long time (months to years) and huge amount of storageClient-server based proxy systems cannot scalePeer-to-peer systems are scalable for this purposeCDF of object agesCDF of requests in 11 daysold objectsnew objectsobject popularity decreases exponentially

  • Applications of media access pattern modelResource allocation for large scale media distributionCDN systemsPeer-to-peer systemsBenchmarking the performance of media serversSynthetic workload generationSPECweb benchmark: pendingTraffic engineering and management of the InternetAnalyze flow patterns of video trafficThrottle illegal digital movie downloadingAnalyzing user request patterns for video searchBetter relevance rankingSearch intent analysis

  • OutlineMotivation and objectivesStretched exponential of Internet media trafficDynamics of access patterns in media systemsCaching implicationsConclusion and future work

  • Concluding RemarksMedia access patterns do not fit Zipf model Media access patterns are stretched exponentialOur findings implies thatClient-server based proxy systems are not effective to deliver media contentsP2P systems are most suitable for this purposeWe provide an analytical basis for the effectiveness of P2P media content delivery infrastructure

  • Future work: User behavior models for social networksInternet systems are user-drivenUser networks and cloud services on the InternetUser behavior models: Zipf-like, stretched exponential, or others?Social networks are powered by user contributionsUser generated content (UGC)Video sharing, blog, bookmark sharing, question/answer, Wikipedia, Understanding UGC creation patterns in social networksCommon users vs. core usersValidate the correctness of user provided knowledgeIdentifying spam content in social networksDistinguish normal users and spam usersEvaluate the health of a social networkEfficient resource management in the underlying supporting system

  • Future work: User behavior models for search engine systemsTypical Internet cloud servicesUser behavior models forKeyword selectionQuery sequence, page click throughQuery frequency, page dwelling timeAnalyze user search intentBetter rank boosting algorithms and query re-writing functionsServe more accurate informationIndentify search engine abuse and attacksImprove search engine performanceMining co-occurrence patterns of search keywordsExploiting inter-query locality on search index

  • Other related projectsInternet measurement and modelingThe stretched exponential distribution of Internet media access patterns, PODC 2008Delving into Internet streaming media delivery: A quality and resource utilization perspective, IMC 2006 Measurements, analysis, and modeling of BitTorrent-like systems, IMC 2005Analysis of multimedia workloads with implications for Internet streaming, WWW 2005Internet media systemsASAP: an AS-aware peer-relay protocol for high quality VoIP with low overhead, ICDCS 2006 DISC: Dynamic Interleaved Segment Caching for interactive streaming, ICDCS 2005PROP: a scalable and reliable P2P assisted proxy streaming system, ICDCS 2004

  • Other related projects (cont.)Wireless InternetPSM-throttling: Minimizing energy consumption for bulk data communications in WLANs , ICNP 2007Cooperative Relay Service in a Wireless LAN, JSAC 2007SCAP: Smart caching in wireless access points to improve P2P streaming, ICDCS 2007Exploiting idle communication power to improve wireless network performance and energy ffficiency, INFOCOM 2006

    Social networks and Web 2.0Tag-based social interest discovery, WWW 2008Understanding instant messaging traffic characteristics, ICDCS 2007

    Peer-to-peer and overlay networksLightFlood: Minimizing redundant messages and maximizing scope of peer-to-peer search, TPDS 2008Fast and low cost search schemes by exploiting localities in P2P networks, JPDC 2005

  • Thank you!

  • Segment caching performanceSegment hit ratio byte hit ratioMuch lower than object hit ratioLRU is less efficient for media cachingLess request concentrationLarger working setSegment LFU is even worseSequential order in an object not capturedST-CLT-04ST-CLT-05LRU efficiencyConventional replacement policies are not efficient(i): minimum cache size to hold object i

  • Ongoing and future workMedia traffic is driven by user requestsHow users request media objectsHow media access patterns affect system performanceSocial networks are powered by user contributionsuser generated content (UGC)video sharing, blog, bookmark sharing, question/answer, How users contribute UGC content in social networks?How UGC creation patterns affect the growth and popularity of a social network?

  • UGC creation patterns in online social networksUGC creation patterns in social networks are stretched exponentialx: user ranky: number of postsYahoo! Blog (article posting)Yahoo! Blog (photo posting)

  • UGC creation patterns in online social networks (continue)Yahoo! Delicious (bookmark)Yahoo! Answer (answer)Media request is content consumptionStretch factor vs. file size (length): UGC creation is content productionStretch factor vs. effort to create a UGC object:

  • Segment-based streaming media cachingStreaming media are often partially accessedSegment caching is efficientIdeal segment reference rank distributionM segments per object, no partial accessTwo mode SE distribution: same c, same a in the 2nd modeMslope = -aslope = -aobjectsegment

  • Segment-based streaming media caching (cont.)Segmentation: 5 seconds of media dataSegments rank distribution: two-mode SE Same stretch factor cSmaller a than object rank distributionLess temporal localityST-SVR-01ST-CLT-04ST-CLT-05

    Prospects of future work Content consumptionContent productionA great amount of content is created by common users, called user generated content (UGC)

    Do core users account for most content?How common users contribute UGC content?How to validate the correctness of user provided knowledge?Machine learning and information retrieval

    How user behaviors affect their networks, the underlying system, and the Internet? Do user behavior models follow a pattern like Zipf or SE?

    Search, click, , query reformulation, , click, stopProspects of future work Content consumptionContent productionA great amount of content is created by common users, called user generated content (UGC)