Fran Berman, Data and Society, CSCI 4370/6370
Data and Society
Lecture 7: Data Infrastructure
3/30/18
Announcements 3/30
• Make sure you sign up and do 2 presentations by the end of the semester.
• Check what you think your grades are (attendance, op-ed, and presentation scores) with Fran during office hours. You are responsible for being sure that these are accurate.
Wednesday Section | Friday Lecture: First Half of Class | Friday Lecture: Second Half of Class | Assignments
January 17: NO class | January 19 L1: CLASS INTRO AND LOGISTICS | Presentation Model / Op-Ed Instructions | Op-Ed instructions
January 24: NO class | January 26 L2: BIG DATA 1 | 4 Presentations |
January 31: NO class | February 2 L3: BIG DATA 2 -- IoT | 4 Presentations |
February 7: NO class | February 9 L4: DATA AND SCIENCE | 4 Presentations | Op-Ed due Feb. 9
February 14: 5 Presentations | February 16 L5: DATA AND HEALTH / LESLIE McINTOSH GUEST SPEAKER | 4 Presentations | Op-Ed drafts returned Feb. 21
February 21: 5 Presentations | February 23 L6: DATA STEWARDSHIP AND PRESERVATION | 4 Presentations | Research Paper instructions
February 28: 5 Presentations | March 2: CLASS CANCELED DUE TO SNOW | |
March 7: 5 Presentations | March 9: NO CLASS / PAPER PREPARATION | | Op-Ed Final due March 7
March 14: Spring Break | March 16: SPRING BREAK | |
March 21: NO class | March 23: NO CLASS / PAPER PREPARATION | |
March 28: 4 Presentations | March 30 L7: INFRASTRUCTURE | 4 Presentations | Research Paper due March 28
April 4: NO class | April 6 L8: DATA RIGHTS, POLICY, REGULATION | 4 Presentations |
April 11: 4 Presentations | April 13 L9: DATA AND ETHICS | 4 Presentations |
April 18: 4 Presentations | April 20 L10: DATA AND COMMUNICATION | 4 Presentations |
April 25: 4 Presentations | April 27 L11: DATA FUTURES | 4 Presentations |
Today (3/30/18)
• Lecture 7: Data Infrastructure
– Technology-driven science
– Data and the LHC
– Data and Entertainment
• Break
• 4 Student Presentations
Technology-driven Science: Many different kinds of infrastructure (cyberinfrastructure, e-infrastructure) needed for modern science
[Figure: classes of applications placed along three infrastructure axes: COMPUTE (more FLOPS), DATA (more BYTES), NETWORK (more BW). Regions shown: Home, Lab, Campus, Desktop Applications; Compute-intensive HPC Applications; Data-intensive and Compute-intensive HPC Applications; Compute-intensive Grid, Distributed, and Cloud Applications; Data-oriented Grid, Distributed, and Cloud Applications; Data-intensive Applications]
Longitudinal evolution:
• ‘80’s, 90’s +: Computational Science, first national networks
• ‘90’s, 00’s +: Development of integrated cyberinfrastructure, emerging focus on data
• 00’s, 10’s +: Increasing integrated cyberinfrastructure workflows, emergence of data science
Cyberinfrastructure evolution: Technology-driven science increasing focus in the 80’s and 90’s (data issues often in the background …)
• Many reports in 80’s and early 90’s focused on the potential of information technologies (primarily computers and high-speed networks) to address key scientific and societal challenges
• First federal “Blue Book” in 1992 focused on key computational problems including
– Weather forecasting
– Cancer genes
– Predicting new superconductors
– Aerospace vehicle design
– Air pollution
– Energy conservation and turbulent combustion
– Microsystems design and packaging
– Earth’s biosphere
– Broader education resources
In the beginning … The Branscomb Pyramid, circa 1993
Branscomb Pyramid provides a framework to associate computational power with community use.
Original Branscomb Committee Report (“From Desktop to TeraFlop”) at http://www.csci.psu.edu/docs/branscomb.txt
The Branscomb Pyramid, circa 2018
Pyramid levels (top to bottom), with typical performance range:
• Leadership Class: PF, EF
• Large-scale campus/commercial resources, Center supercomputers: TF, PF
• Small-scale Campus/Commercial Clusters: TF
• Small-scale devices and personal computers: MF, GF
Opportunities for Innovation at all levels …
Prefixes:
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
Exa 10^18
Zetta 10^21
Yotta 10^24
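The prefix table above can be turned into a small conversion helper. This is a minimal sketch for illustration only; the function names `with_prefix` and `ratio` are hypothetical and do not come from the lecture.

```python
# SI prefixes from the slide, stored as powers of ten (largest first).
PREFIXES = [
    ("Yotta", 24), ("Zetta", 21), ("Exa", 18), ("Peta", 15),
    ("Tera", 12), ("Giga", 9), ("Mega", 6), ("Kilo", 3),
]


def with_prefix(value, unit="flop/s"):
    """Express a raw value using the largest prefix that keeps it >= 1."""
    for name, exponent in PREFIXES:
        scale = 10 ** exponent
        if value >= scale:
            return f"{value / scale:g} {name}{unit}"
    return f"{value:g} {unit}"


def ratio(prefix_a, prefix_b):
    """How many units of prefix_b fit into one unit of prefix_a."""
    table = dict(PREFIXES)
    return 10 ** (table[prefix_a] - table[prefix_b])


if __name__ == "__main__":
    print(with_prefix(1.2e10))     # a 12 Gflop/s laptop
    print(ratio("Exa", "Peta"))    # one exaflop in petaflops
```

Each step up the table is a factor of 1000, so moving from the PF level of the pyramid to the EF level means three more orders of magnitude.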
Also in 1993: The Top500 List created to rank supercomputers
• TOP500 list ranks and details the 500 most powerful supercomputers in the world
• Most powerful = best performance on the LINPACK benchmark.
• Rankings provide invaluable statistics on supercomputer trends by country, vendor, sector, processor characteristics, etc.
• List compiled by Hans Meuer of University of Mannheim, Jack Dongarra of University of Tennessee, and Erich Strohmaier and Horst Simon of NERSC / LBNL. List comes out in November and June each year.
http://top500.org/
Top500 List for November 2017
What the Top500 List measures
Rmax and Rpeak values are in TFlops
• Computers assessed based on their performance on the LINPACK Benchmark – calculating the solution to a dense system of linear equations.
– User may scale the size of the problem and optimize the software in order to achieve the best performance for a given machine
– Algorithm used must conform to LU factorization with partial pivoting (operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations).
• Rpeak values calculated using the advertised clock rate of the CPU. (theoretical performance)
• Rmax = maximal LINPACK performance achieved (actual performance)
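The ideas on this slide can be sketched in a few lines of Python. This is an illustrative toy, not the actual benchmark code: it solves a dense system by Gaussian elimination with partial pivoting (the elimination underlying LU factorization), counts only the leading 2/3 n^3 term of the operation count (the O(n^2) term is omitted as an assumption), and computes the Rmax/Rpeak ratio. All function names are hypothetical.

```python
def solve_partial_pivot(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies so the inputs are untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: choose the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k from all rows below the pivot row.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def leading_flops(n):
    """Leading term of the operation count: 2/3 n^3."""
    return 2.0 * n ** 3 / 3.0


def measured_gflops(n, seconds):
    """Achieved rate in Gflop/s for an n x n solve taking `seconds`."""
    return leading_flops(n) / seconds / 1e9


def efficiency(rmax, rpeak):
    """Fraction of theoretical peak actually achieved (same units for both)."""
    return rmax / rpeak


if __name__ == "__main__":
    print(solve_partial_pivot([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
    print(efficiency(80.0, 100.0))
```

Rmax corresponds to the best measured rate over all problem sizes and tunings a site tries, while Rpeak is a property of the hardware alone, so the efficiency ratio is always at most 1.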
Rensselaer CCI Blue Gene Q on current Top500 list (November 2017):
• 229th most powerful supercomputer in the world
• 45th most powerful Academic supercomputer in the world
• 6th most powerful Academic supercomputer in the US (of 15 on Top500 list)
[Table columns: Rank, System, Cores, Rmax, Rpeak, Power]
Performance Development (Slide courtesy of Jack Dongarra)
[Figure: Top500 performance over time, 1993 to 2011, on a logarithmic axis from 100 Mflop/s to 100 Pflop/s. Three series: SUM (1.17 TFlop/s rising to 74 PFlop/s), N=1 (59.7 GFlop/s rising to 10.5 PFlop/s), and N=500 (400 MFlop/s rising to 51 TFlop/s). Annotations: “6-8 years”; Jack’s Laptop (12 Gflop/s); Jack’s iPad2 & iPhone 4s (1.02 Gflop/s)]
Cyberinfrastructure evolution: Broader national cyberinfrastructure development
• Publication of the Atkins report from NSF’s Blue Ribbon Task Force on Cyberinfrastructure accelerated CI as a critical national focus within federal R&D investments and especially at NSF
• Report and follow-on programs and projects evolved existing efforts and provided the seed for a new era of Cyberinfrastructure innovations in the research community whose impact can still be seen today
– NSF Partnerships for Advanced Computing Infrastructure
– NSF TeraGrid, XSEDE
– DOE Science Grid
– NIH Big Data to Knowledge, etc.
Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
Data and the LHC
The Large Hadron Collider (LHC)
• LHC is the world’s most powerful particle collider.
• LHC’s goal is to allow physicists to test the predictions of different theories of particle physics and high-energy physics, in particular the properties of the Higgs boson and the large family of new particles predicted by supersymmetric theories.
• LHC contains seven detectors, each designed for a different kind of research.
• LHC was built near Geneva between 1998 and 2008 in collaboration with over 10,000 scientists and engineers from over 100 countries.
• LHC lies in a 17 mile circumference tunnel beneath the France-Switzerland border.
• LHC collisions produce 10’s of PBs of data per year.
– Subset of data analyzed by a distributed grid of 170+ computing centers in 36 countries
A collider is a type of particle accelerator with two directed beams of particles.
In particle physics, colliders are used as a research tool: they accelerate particles to very high kinetic energies and let them impact other particles.
Analysis of the byproducts of these collisions gives scientists good evidence of the structure of the subatomic world and the laws of nature governing it.
Many of these byproducts are produced only by high energy collisions, and they decay after very short periods of time. Thus many of them are hard or nearly impossible to study in other ways.
Information from Jamie Shiers and Wikipedia
What happens at CERN?
• Accelerators create particle collisions
– Protons circulate at close to the speed of light
– 10’s of millions of collisions every second
– Collisions recreate the conditions of the first moments of the universe
• Detectors study collisions and the thousands of particles emerging from them.
• Worldwide network of computers filters, records and processes the data from the collisions
– LHC computing grid processes PBs of data each year
• Physicists throughout the world analyze the data
Information from http://home.cern/
CERN's current and future accelerators
• Linear accelerator 2: Linac 2 is the starting point for the protons used in physics experiments at CERN
• Linear accelerator 3: Linac 3 is the starting point for the ions used in physics experiments at CERN
• Linear accelerator 4: Linac 4 boosts negative hydrogen ions to high energies. It will become the source of proton beams for the Large Hadron Collider in 2020
• The Antiproton Decelerator: Not all accelerators increase a particle's speed. The AD slows down antiprotons so they can be used to study antimatter
• The Large Hadron Collider: The 27-kilometre LHC is the world's largest particle accelerator. It collides protons or lead ions at speeds approaching the speed of light
• The Low Energy Ion Ring: LEIR takes long pulses of lead ions from Linac 3 and transforms them into the short, dense bunches suitable for injection to the Large Hadron Collider
• The Proton Synchrotron: A workhorse of CERN's accelerator complex, the Proton Synchrotron has juggled many types of particle since it was first switched on in 1959
• The Proton Synchrotron Booster: Four superimposed synchrotron rings receive protons from the linear accelerator, boost them to 800 MeV and inject them into the Proton Synchrotron
• The Super Proton Synchrotron: The second-largest machine in CERN’s accelerator complex provides a stepping stone between the Proton Synchrotron and the LHC
Worldwide LHC Computing Grid
Image from http://wlcg.web.cern.ch/
LHC – Stewardship and Preservation Challenges
• Significant volumes of high energy physics data are thrown away “at birth” – i.e. via very strict filters (aka triggers) before writing to storage. To a first approximation, all remaining data needs to be preserved for a few decades.
– LHC data particularly valuable as reproducibility of experiments is tremendously expensive and almost impossible to achieve
• Tier 0 and 1 sites currently provide bit preservation at scale
– Data more usable and accessible when services coupled with bit preservation
– Tier 0 and Tier 1 sites are in the process of “self certification” according to ISO 16363.
Slide adapted from Jamie Shiers, CERN 2016
Post-collision
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25 2012 | Page 6
After the collisions have stopped
• Finish the analyses! But then what do you do with the data?
– Until recently, there was no clear policy on this in the HEP community
– It’s possible that older HEP experiments have in fact simply lost the data
• Data preservation, including long term access, is generally not part of the planning, software design or budget of an experiment
– So far, HEP data preservation initiatives have mostly not been planned by the original collaborations, but have rather been the effort of a few knowledgeable people
• The conservation of tapes is not equivalent to data preservation!
– “We cannot ensure data is stored in file formats appropriate for long term preservation”
– “The software for exploiting the data is under the control of the experiments”
– “We are sure most of the data are not easily accessible!”
Slide adapted from Jamie Shiers, CERN 2016
Data: Outlook for HL-LHC
• The LHC – including all foreseen upgrades – will run until circa 2040. By that time between 10 and 100 EB of data will have been gathered.
• These data (the uninteresting stuff has already been discarded) should be preserved for a number of decades.
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
At least 0.5 EB / year (x 10 years of data taking)
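The arithmetic behind this estimate can be written out directly. The sketch below is illustrative only: it uses the slide's figures (0.5 EB per year, 10 years of data taking, 10 to 100 EB gathered by circa 2040) and assumes decimal prefixes (1 EB = 1000 PB). The function name is hypothetical.

```python
PB_PER_EB = 1000  # assumption: decimal prefixes, 1 EB = 10^18 bytes = 1000 PB


def raw_volume_eb(rate_eb_per_year, years):
    """Total RAW volume accumulated at a constant yearly rate."""
    return rate_eb_per_year * years


# Figures taken from the slide.
raw_total_eb = raw_volume_eb(0.5, 10)        # at least 5 EB of RAW data
raw_total_pb = raw_total_eb * PB_PER_EB      # the same volume in PB

# RAW data as a share of the 10 to 100 EB expected by circa 2040.
share_of_low_estimate = raw_total_eb / 10
share_of_high_estimate = raw_total_eb / 100

if __name__ == "__main__":
    print(raw_total_eb, raw_total_pb)
    print(share_of_low_estimate, share_of_high_estimate)
```

The gap between the RAW total and the overall range is where the derived data, simulation, and user data noted on the slide would be counted.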
[Figure: projected RAW data volume per year in PB (axis from 0 to 450 PB) for Run 1 through Run 4, broken down by experiment: CMS, ATLAS, ALICE, LHCb. Annotation: “We are here!”]
Slide adapted from Jamie Shiers, CERN in 2016
Digitally-enabled Movies
Data Stewardship and Preservation especially important as the Arts become more digitally-enabled
• Consumers must move (migrate) downloaded digital music to new media players when old players are too full, sometimes requiring re-registration of Digital Rights Management authorization to ensure they do not lose access to favorite songs
• Authors must find applications interoperable with old word processing software to read manuscripts written with obsolete software
• Digital photos recorded on floppies can’t be accessed on modern computers without floppy disk drives
• Old video games may only run on obsolete game systems
Digital movies
• Most movies are not shot on film but recorded through digital media
– More than 80% of the movie theaters in the U.S. no longer handle film …
• Many digital technologies used in film-making:
– Image capture
– Visual effects
– Mastering and final color grading
– Sound capture
– Sound effects
– Sound editing and mixing
– Digital distribution to theaters and other platforms, etc.
• Film industry has been adopting digital technologies in piecemeal fashion over the last 25+ years
The Girl with the Dragon Tattoo was produced entirely in digital format
Many components of movie process archived beyond the film itself
Images from http://spectrum.ieee.org/consumer-electronics/standards/will-todays-digital-movies-exist-in-100-years#
Avatar: Digital tour-de-force
• Film released in 2009 and distributed by 20th Century Fox
• Directed and written by James Cameron, produced by James Cameron and Jon Landau
• Became highest grossing film of all time (>$2B)
• Won Academy Awards for Best Art Direction, Best Cinematography and Best Visual Effects
• Sequel coming
Avatar image from Film Education http://www.filmeducation.org/resources/film_library/getfilm.php?film=2037
Avatar both data-intensive and compute-intensive
• Avatar technologies developed by Weta Digital Ltd and partners.
– Weta Digital Ltd. Data Center in New Zealand
– (Weta Digital also responsible for computer-rendered scenes in Lord of the Rings Trilogy, King Kong, etc.)
• Avatar’s IT was equal parts computing power in the data center (creating the visual effects) and data management of artistic processes (driving the film experience)
• Every minute of Avatar represents 17.28 GB of data (~ 3TB in all)
• Avatar used 1 PB of storage space for rendering
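The per-minute figure can be checked against the total with a short calculation. This is a rough sketch under two assumptions that are not stated on the slide: a theatrical running time of 162 minutes and a frame rate of 24 frames per second. It also assumes decimal units (1 TB = 1000 GB). The function names are hypothetical.

```python
GB_PER_MINUTE = 17.28      # from the slide
RUNTIME_MINUTES = 162      # assumption: theatrical running time
FRAMES_PER_SECOND = 24     # assumption: standard cinema frame rate
STORAGE_TB = 1000          # 1 PB of rendering storage, from the slide


def total_tb(gb_per_minute, minutes):
    """Size of the finished film in TB (decimal units)."""
    return gb_per_minute * minutes / 1000


def mb_per_frame(gb_per_minute, fps):
    """Average data per frame in MB implied by the per-minute figure."""
    return gb_per_minute * 1000 / (60 * fps)


if __name__ == "__main__":
    film_tb = total_tb(GB_PER_MINUTE, RUNTIME_MINUTES)
    print(round(film_tb, 2))                                       # close to the slide's ~3 TB
    print(round(mb_per_frame(GB_PER_MINUTE, FRAMES_PER_SECOND), 2))
    print(round(STORAGE_TB / film_tb))                             # storage vs. finished film
```

Under these assumptions the finished film is a small fraction of the storage used during rendering, which is consistent with the slide's point that production is far more data-intensive than the final product.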
Avatar -- Innovative IT
• Technological innovations included:
– Performance capture process: actors wore special gear and cameras that translated live action into realistic animation in real-time
– 3D Fusion Camera: 2 high-definition cameras in a single camera body to create depth perception
– Virtual camera system: shows actors’ virtual counterparts in their digital surroundings in real time
– Motion capture stage, etc.
Avatar image from Wikipedia article with caption “Cameron pioneered a specially designed camera built into a 6-inch boom that allowed the facial expressions of the actors to be captured and digitally recorded for the animators to use later.”
http://en.wikipedia.org/wiki/Avatar_(2009_film)
Avatar Technologies
http://www.youtube.com/watch?v=OJ1JzYPjcj0
(8:39)
CGI = Computer-generated imagery
Weta Digital IT Environment
• Computing core included 40K processors and 104TB of RAM
– 10K square foot server farm with 34 racks of 32 HP Blade servers each
– Center uses water-cooled racks and leverages chilly climate of New Zealand
– Interconnected by 10 gigabit network so that storage seems local
• Data storage leveraged partnerships with NetApp and Fujitsu to develop storage system which
– reduced the amount of manual data management in the process of rendering files
– balanced the throughput requirements of the renderwall (compute) to maximize access to commonly used files
• Digital Asset Management System “Gaia” developed by Microsoft
Avatar (Weta Digital) computers occupied spots 193-197 on the Top500 List in November 2009
Great reads …
Lecture 7 Sources
• Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf
• LHC, www.wikipedia.com
• Worldwide LHC Computing Grid website, http://wlcg-public.web.cern.ch/tier-centres
• The Digital Dilemma, Strategic Issues in Archiving and Accessing Digital Motion Picture Materials, Science and Technology Council of the Academy of Motion Picture Arts and Sciences, http://www.scribd.com/doc/55498058/The-Digital-Dilemma
• Processing Avatar, Information Management http://www.information-management.com/newsletters/avatar_data_processing-10016774-1.html
• Data Plays a Supporting Role in Avatar, ComputerWorld http://www.computerworld.com/s/article/346361/Data_center_plays_supporting_role_in_i_Avatar_i_
• Wikipedia article on Avatar http://en.wikipedia.org/wiki/Avatar_(2009_film)
• “Will Today’s Digital Movies Exist in 100 Years”, IEEE Spectrum, http://spectrum.ieee.org/consumer-electronics/standards/will-todays-digital-movies-exist-in-100-years#
• “The Afterlife is Expensive for Digital Movies”, The New York Times, http://www.nytimes.com/2007/12/23/business/media/23steal.html?pagewanted=all
Discussion article for Today
• “Is big data racist? Why policing by data isn’t necessarily objective”, Ars Technica, https://arstechnica.com/tech-policy/2017/12/is-big-data-racist-why-policing-by-data-isnt-necessarily-objective/2/
Presentations
Presentation Articles for April 6
• “Embedding a tweet could be copyright infringement, says new court ruling”, The Verge, https://www.theverge.com/2018/2/16/17020278/tweet-embed-copyright-infringement-justin-goldman-tom-brady-photo-ruling [Ben H]
• “Hatch introduces bipartisan bill to clarify cross-border data policies”, The Hill, http://thehill.com/policy/technology/372637-hatch-introduces-bipartisan-bill-to-clarify-cross-border-data-policies [Alex C]
• “Canada’s Privacy Commissioner contemplates new online erasure, data protection rules”, Reuters, https://www.reuters.com/article/bc-finreg-data-protection-rules-canada/canadas-privacy-commissioner-contemplates-new-online-erasure-data-protection-rules-idUSKCN1GD66F [Ethan S]
• “How a fight over Star Wars download codes could reshape copyright law,” Ars Technica, https://arstechnica.com/tech-policy/2018/02/judge-slaps-down-disney-effort-to-stop-resale-of-star-wars-download-codes/ [Yishan D]
Presentation articles for April 11
• “Click here to kill everyone,” New York Magazine, http://nymag.com/selectall/2017/01/the-internet-of-things-dangerous-future-bruce-schneier.html [Ethan G]
• “Where does blockchain fit into digital rights management,” IPWatchdog, http://www.ipwatchdog.com/2018/02/06/blockchain-fit-digital-rights-management/id=93024/ [Lindsay Z]
• “French news site L'Express exposed reader data online, weeks before GDPR deadline”, Zdnet, http://www.zdnet.com/article/french-magazine-lexpress-exposed-reader-data/ [Peter K]
• “Everything you need to know about Led Zeppelin’s “Stairway to Heaven” copyright trial”, LA Times, http://www.latimes.com/entertainment/music/la-et-ms-led-zeppelin-copyright-trial-info-20160614-snap-story.html [Kayla C]
Presentation articles for April 13
• “The Follower Factory,” New York Times, https://www.nytimes.com/interactive/2018/01/27/technology/social-media-bots.html [Wei P.]
• “Is it too late for big data ethics?” Forbes, https://www.forbes.com/sites/kalevleetaru/2017/10/16/is-it-too-late-for-big-data-ethics/#4fd4e33f3a6d [Daniel C]
• “Your Roomba already maps your home. Now the CEO plans to sell that map,” USA Today, https://www.usatoday.com/story/tech/nation-now/2017/07/25/roomba-plans-sell-maps-users-homes/508578001/ [Michelle H]
• “Racist, sexist AI could be a bigger problem than lost jobs,” Forbes, https://www.forbes.com/sites/parmyolson/2018/02/26/artificial-intelligence-ai-bias-google/#fd91bbf1a015 [Halley F]
Presentation articles for Today
• “The world’s most valuable resource is no longer oil but data,” The Economist, https://www.economist.com/news/leaders/21721656-data-economy-demands-new-approach-antitrust-rules-worlds-most-valuable-resource [Zimo X]
• “Data is infrastructure; how is data transforming UK construction and infrastructure?”, Lexology, https://www.lexology.com/library/detail.aspx?g=f29218ee-9027-44b2-8d93-df3b5fa06e5e [Sarah M]
• “America’s digital infrastructure is crumbling, too” Bloomberg View, https://www.bloomberg.com/view/articles/2018-02-01/america-s-digital-infrastructure-is-crumbling-too [Diego C]
• “The quest for digital equity”, Gov Tech, http://www.govtech.com/civic/The-Quest-for-Digital-Equity.html [Trulee]