leveraging the mapreduce application model to run text
TRANSCRIPT
![Page 1: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/1.jpg)
Leveraging theMapReduce Application Model
to Run Text Analytics in HPC Clusters
Cyril Briquet, McMaster University
![Page 2: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/2.jpg)
Overview
1. Text Analytics with Voyeur Tools
2. HPC burst computing
3. MapReduce
4. Challenges
![Page 3: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/3.jpg)
Text analytics with Voyeur Tools
Voyeur Tools
* interactive web app
* toolbox (word frequencies, collocates, concordances, ...)
* for text corpora (primary scope: Digital Humanities)
http://voyeurtools.org (you can try now)
![Page 4: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/4.jpg)
Text analytics with Voyeur Tools
![Page 5: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/5.jpg)
Overview
1. Text Analytics with Voyeur Tools
2. HPC burst computing
3. MapReduce
4. Challenges
![Page 6: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/6.jpg)
HPC burst computing
* next generation of Voyeur Tools:
scale to process large text corpora upon user request,
so-called “burst computing”
* use of HPC resources to process bursts of user requests
so that interactive web app remains responsive
* SHARCNET-sponsored project & Postdoctoral Fellowship
(started: January 2010)
![Page 7: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/7.jpg)
Burst computing opportunities
3 opportunities to use HPC resources in Voyeur Tools:
* initial data importing
* data indexing
* data analysis
=> multithreading “OK”, but limited data parallelism
![Page 8: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/8.jpg)
Overview
1. Text Analytics with Voyeur Tools
2. HPC burst computing
3. MapReduce
4. Challenges
![Page 9: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/9.jpg)
MapReduce [Dean et al. 2004]
![Page 10: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/10.jpg)
MapReduce features [Dean et al. 2004]
* automatic parallelization and distribution
* I/O scheduling (+ data-aware scheduling)
* fault-tolerance
* status and monitoring
![Page 11: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/11.jpg)
Using MapReduce in Voyeur Tools
* initial data importing:
import text corpora into MapReduce filesystem
* data indexing: online or (preferably) batch indexing jobs
* data analysis: online analysis jobs
![Page 12: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/12.jpg)
Overview
1. Text Analytics with Voyeur Tools
2. HPC burst computing
3. MapReduce
4. Challenges
![Page 13: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/13.jpg)
Challenge: data model
impact of data model
* current Voyeur Tools: file-based
(O.S.-level, “local” filesystem)
vs.
* MapReduce: record-based
(application-level, “distributed” filesystem)
![Page 14: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/14.jpg)
Challenge: computation model
impact of computation model
* current Voyeur Tools: multiple nested loops
vs.
* MapReduce: 2-phase record-based processing
![Page 15: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/15.jpg)
Challenge: cluster job queue
impact of cluster job queue
* middleware daemons (for data distribution, computation)
should ideally be always-on (to reduce web app latency)
* application malleability (to # available cores):
supplementary challenge...
![Page 16: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/16.jpg)
Challenge: global filesystem
impact of cluster-level global filesystem
* very convenient for many applications...
but we'd rather preposition data to local disks,
in order to maximize parallelism of data access
![Page 17: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/17.jpg)
Challenge: firewalls
impact of firewalls
* can communicate <===> SHARCNET nodes, clusters
* cannot download data from the web
(= limited parallelism of initial data import from web servers)
![Page 18: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/18.jpg)
Challenge: public web front-end
requirement for public web front-end
* scalable servlet container
* DNS entry
![Page 19: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/19.jpg)
Conclusion
* project getting started
* will rely on Open Source software:
. Apache Hadoop
. Apache Tomcat (maybe Eclipse Jetty)
. and of course Voyeur Tools
* importance of flexible data model
![Page 20: Leveraging the MapReduce Application Model to Run Text](https://reader031.vdocuments.us/reader031/viewer/2022022405/621695551506c329b76799a6/html5/thumbnails/20.jpg)
Leveraging theMapReduce Application Model
to Run Text Analytics in HPC Clusters
Cyril Briquet, McMaster University