
Prov-Vis: Large-Scale Scientific Data Visualization Using Provenance*

Felipe Horta1, Jonas Dias1, Renato Elias1, Daniel de Oliveira2, Alvaro L. G. A. Coutinho1 and Marta Mattoso1

1 COPPE - Federal University of Rio de Janeiro, 2 IC - Fluminense Federal University
{fhorta, jonasdias, marta}@cos.ufrj.br, {renato, alvaro}@nacad.ufrj.br, [email protected]

Abstract: Large-scale scientific computing often relies on intensive tasks chained through a workflow. Scientists need to check the status of the execution at particular points, to discover whether anything odd has happened and to take action. To do so, they need to track partial result files, which is usually complex and laborious. When a scientific workflow system is used, provenance data keeps track of every step of the execution. If provenance data can be traversed at runtime, it is easier to monitor and analyze partial results. However, partial results must be visualized in sync with the workflow provenance. Prov-Vis is a scientific data visualization tool for large-scale workflows that uses runtime provenance queries to organize and aggregate data for visualization. Prov-Vis helps scientists follow the steps of the running workflow and visualize the partial results produced. This is a novel capability because several systems execute workflows “offline” and do not allow for runtime analysis and workflow steering. To evaluate Prov-Vis, a finite element computational fluid dynamics workflow was executed on a supercomputer. Prov-Vis supported the visualization, on a tiled wall, of several simulation steps and different views based on runtime provenance queries.

Keywords: HPC; provenance; visualization; cluster; large-scale; scientific workflow; user-steering.

I. INTRODUCTION

Several large-scale experiments demand High Performance Computing (HPC). These experiments usually rely on compute-intensive activities that may be chained as a scientific workflow. The workflow activities call computer programs and scripts that consume and produce data of different types. Unfortunately, as the workflow grows in number of activities or in complexity, it becomes harder to follow the execution and the data produced. Besides the complexity of data management, visualizing partial results may also be difficult in large-scale scenarios. When scientists are able to monitor and steer the workflow, they may discover that computations are incorrect or that a lot of information has to be filtered to achieve better results. In another scenario, they may need to know which input produced which output.

Managing the visualization and analysis of results of large-scale workflows manually is laborious and error prone. Besides, data transfers can be costly, and some retrieved results may not be relevant for analysis. This happens for two reasons: data fragments are dissociated from their metadata, so it is hard to identify their use; and each analysis may focus on a subset of the outputs, so the relevance of the data can be obscured.

Scientific Workflow Management Systems (SWfMS) may improve data management in scientific workflows. They make it easier to represent and reference workflow data through provenance [1]. Provenance is a key feature in SWfMS since it keeps track of everything that happened during workflow execution. Scientists can submit high-level, domain-specific provenance queries such as: “What are the maximum values for velocity and pressure on a given simulation exploration?” or “For a given simulation exploration, are the residual values of pressure increasing or decreasing?”. Provenance data enables a powerful association between workflow metadata and strategic workflow results. With provenance it is possible to attach different types of metadata to the workflow results, making it easier to analyze the data and draw conclusions from it.
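The first example query above can be sketched as follows. This is only a minimal illustration: the table `activation_output`, its columns, and the sample values are hypothetical and do not reflect the actual provenance schema of any SWfMS; an in-memory SQLite database stands in for the provenance repository.

```python
import sqlite3


def build_demo_provenance():
    """Create an in-memory stand-in for a provenance database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activation_output ("
        " exploration INTEGER, step INTEGER,"
        " velocity REAL, pressure REAL, output_file TEXT)"
    )
    # Made-up sample rows: one row per simulation step recorded so far.
    rows = [
        (1, 1, 0.8, 101.0, "exp1/step1.vtk"),
        (1, 2, 1.4, 99.5, "exp1/step2.vtk"),
        (2, 1, 0.6, 102.3, "exp2/step1.vtk"),
    ]
    conn.executemany("INSERT INTO activation_output VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def max_velocity_pressure(conn, exploration):
    """Return (max velocity, max pressure) recorded so far for one exploration."""
    cur = conn.execute(
        "SELECT MAX(velocity), MAX(pressure)"
        " FROM activation_output WHERE exploration = ?",
        (exploration,),
    )
    return cur.fetchone()


conn = build_demo_provenance()
print(max_velocity_pressure(conn, 1))
```

Because the query runs against whatever rows exist at the time, the same call can be repeated while the workflow is still executing, provided the engine exposes its provenance at runtime.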

In particular, scientists should be able to perform these analyses at runtime, so they can follow the workflow execution and check whether something is going wrong [2]. Scientists can monitor a time-consuming workflow execution at specific simulation exploration points, analyze data at runtime and decide to stop or re-execute some activities. However, most systems execute workflows “offline” and do not allow for runtime analysis and workflow steering. To address this problem, we designed Prov-Vis as a visualization tool that uses provenance as a guide to the analysis process during the execution. Prov-Vis integrates with the Chiron [3] and SciCumulus [4] workflow engines, which allow provenance data to be queried at runtime. Prov-Vis enables users to navigate through workflow data, select data of interest, and visualize partial results at runtime enriched with provenance. Such runtime analysis is an important step to stage out partial results and visually support the steering of workflows by scientists [2].

II. PROV-VIS VISUALIZATION AND USER-STEERING

Visualization may support complex analysis of scientific workflows. Scientists need navigation tools to traverse data enhanced with provenance so they can filter results and stage data selectively. This way, they can stage out only relevant results and quickly reach the results they need to analyze. Keeping track of partial results enables scientists to monitor and analyze workflow execution. Therefore, scientists can be aware of the current workflow status; identify and solve problems during execution; recover from failures and mistakes; reduce the time spent processing low-quality data; and, finally, cut execution time and reduce financial costs by not staging useless data out of remote environments such as clusters or clouds.

* Work partially sponsored by CNPq, CAPES and FAPERJ.

After collecting relevant data for analysis, scientists need to choose the best visualization tool for each case. There are several types of data structures to represent a produced output. Some results also need to be aggregated and consolidated prior to visualization. Thus, even after obtaining all the desired outputs, scientists sometimes need to trace each file, format it, move it to the visualization environment and open it in third-party software. This task is complex and error prone when the number of outputs is large.

To tackle visualization and user-steering needs, Prov-Vis provides efficient data selection and filtering, followed by consolidation and staging of the relevant data needed to draw preliminary conclusions. This solution is based on user-defined visualization actions such as: (i) associating certain behavior with a given data type; (ii) making use of specific statistical or visualization tools; (iii) enriching results with provenance data; and (iv) triggering some action on a dedicated visualization environment.
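One way to picture such user-defined actions is a small registry that maps a data type to a behavior. The sketch below is an assumption for illustration, not the Prov-Vis implementation: the tags (`image`, `table`), the record fields, and the target names (`tiled-wall`, `stats`) are all hypothetical.

```python
from typing import Callable, Dict, List

# Registry mapping a data type (here, a file tag) to a user-defined action.
ACTIONS: Dict[str, Callable[[dict], dict]] = {}


def action(tag: str):
    """Decorator that associates a behavior with a given data type tag."""
    def register(func):
        ACTIONS[tag] = func
        return func
    return register


def caption(record: dict) -> str:
    """Build a caption from the provenance fields attached to a result."""
    return f"exploration {record['exploration']}, step {record['step']}"


@action("image")
def show_image(record: dict) -> dict:
    # Hypothetical request aimed at a dedicated visualization environment.
    return {"target": "tiled-wall", "file": record["file"], "caption": caption(record)}


@action("table")
def summarize_table(record: dict) -> dict:
    # Hypothetical hand-off to a statistical tool.
    return {"target": "stats", "file": record["file"], "caption": caption(record)}


def plan_actions(records: List[dict]) -> List[dict]:
    """Keep only records whose tag has a registered action; build one request each."""
    return [ACTIONS[r["tag"]](r) for r in records if r["tag"] in ACTIONS]


records = [
    {"tag": "image", "file": "exp1/step2.png", "exploration": 1, "step": 2},
    {"tag": "log", "file": "exp1/solver.log", "exploration": 1, "step": 2},
    {"tag": "table", "file": "exp1/residuals.csv", "exploration": 1, "step": 2},
]
for request in plan_actions(records):
    print(request)
```

Records without a registered action (the `log` entry here) are simply filtered out, which mirrors the idea of staging only the results that are relevant for analysis.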

After staging data out, scientists may need to visualize multiple results in order to compare them (mainly in parameter exploration scenarios), or to analyze highly detailed images and simulations enhanced by provenance data. Figure 1 shows how each parameter exploration can be accessed for visualization during parallel computational fluid dynamics (CFD) simulations. In most of these cases, ordinary visualization environments, restricted to standard computer displays, are not enough to analyze such sensitive data. For such cases, Prov-Vis can use tiled wall display technologies to improve the visualization experience.

Figure 1. Web interface to provenance data and visualization environment.

Prov-Vis displays provenance data with an interactive web-based application, which browses data across a user-friendly and cross-platform interface at runtime (Figure 1).

Taking advantage of a web service, a remote application integrates the results selected at the web client with a visualization environment, such as a tiled wall display cluster. This remote application provides an interface that is accessed remotely at runtime and, based on user configuration, interprets the results to be visualized with third-party visualization software. Scientists can take advantage of available tiled display technologies such as TACC DisplayCluster, SAGE, CGLX and ParaView.
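The interpretation step performed by the remote application can be sketched as follows. This is a hedged illustration under stated assumptions: the JSON message format, the configuration keyed by file extension, and the executable names `wall-image-viewer` and `wall-mesh-viewer` are placeholders invented here, not the real launchers of DisplayCluster, SAGE, CGLX or ParaView. The commands are only built, not executed.

```python
import json
import os

# Hypothetical user configuration: file extension -> viewer command prefix.
# The executable names are placeholders, not real tool launchers.
VIEWER_CONFIG = {
    ".png": ["wall-image-viewer"],
    ".vtk": ["wall-mesh-viewer", "--fullscreen"],
}


def build_commands(message, config=VIEWER_CONFIG):
    """Turn a JSON selection sent by the web client into viewer command lines.

    Returns (commands, skipped), where skipped lists the files whose
    extension has no configured viewer.
    """
    selection = json.loads(message)
    commands, skipped = [], []
    for path in selection["files"]:
        ext = os.path.splitext(path)[1].lower()
        if ext in config:
            commands.append(config[ext] + [path])
        else:
            skipped.append(path)
    return commands, skipped


message = json.dumps({"files": ["exp1/step2.vtk", "exp2/step2.PNG", "notes.txt"]})
commands, skipped = build_commands(message)
print(commands)
print(skipped)
```

Keeping the mapping in configuration rather than in code is one way to let each scientist choose the third-party software used for a given kind of result.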

III. RELATED WORK

Prov-Vis integrates HPC workflow execution support with visualization needs, enriching results with provenance data, but there are other solutions in SWfMS or co-processing libraries that support a subset of these features.

VisTrails [5] offers strong visualization features enriched by provenance support, but it has no HPC support. Thus, with VisTrails it is not possible to visualize data directly from clusters or clouds. SWfMS such as Swift/Turbine [6] or Pegasus [7] can execute on HPC resources at very large scale with provenance, but they have no runtime provenance support, so data visualization at runtime is very complex. On the other hand, a solution such as the ParaView Coprocessing Library [8] focuses on in situ visualization and analysis coprocessing, but does not support workflow provenance.

IV. CONCLUSIONS

Large-scale experiments in computational science and engineering rely on compute-intensive tasks that can be chained through a workflow. Prov-Vis offers a web application that helps to visualize workflow results and partial results enriched with provenance data. Prov-Vis eases the navigation and selection of the result data to be visualized. Based on user-defined action scripts and file tag associations, several procedures can be automated up to the point where results (or partial results) are projected on dedicated visualization environments. Our solution has a remote application responsible for integrating dedicated visualization environments to support data visualization enriched with provenance, allowing scientists to keep track of which data produced which result. Together, these features provide a broader view of the workflow.

REFERENCES

[1] J. Freire, D. Koop, E. Santos, and C. T. Silva, “Provenance for Computational Tasks: A Survey,” Computing in Science and Engineering, vol. 10, no. 3, pp. 11–21, 2008.

[2] M. Mattoso, J. Dias, D. Oliveira, K. Ocaña, E. Ogasawara, F. Costa, F. Horta, V. Silva, and I. Araújo, “User-Steering of HPC Workflows: State of the Art and Future Directions,” in Proceedings of the 2nd International Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET’13), New York, NY, USA, 2013.

[3] E. Ogasawara, J. Dias, V. Silva, F. Chirigati, D. Oliveira, F. Porto, P. Valduriez, and M. Mattoso, “Chiron: A Parallel Engine for Algebraic Scientific Workflows,” Concurrency and Computation, 2013.

[4] D. Oliveira, E. Ogasawara, F. Baião, and M. Mattoso, “SciCumulus: A Lightweight Cloud Middleware to Explore Many Task Computing Paradigm in Scientific Workflows,” in 3rd International Conference on Cloud Computing, Washington, DC, USA, 2010, pp. 378–385.

[5] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo, “VisTrails: visualization meets data management,” in SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, 2006, pp. 745–747.

[6] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, “Swift: Fast, Reliable, Loosely Coupled Parallel Computation,” in 3rd IEEE World Congress on Services, Salt Lake City, USA, 2007, pp. 199–206.

[7] E. Deelman, G. Mehta, G. Singh, M.-H. Su, and K. Vahi, “Pegasus: Mapping Large-Scale Workflows to Distributed Resources,” in Workflows for e-Science, Springer, 2007, pp. 376–394.

[8] N. Fabian, K. Moreland, D. Thompson, A. C. Bauer, P. Marion, B. Geveci, M. Rasquin, and K. E. Jansen, “The ParaView Coprocessing Library: A scalable, general purpose in situ visualization library,” in 2011 IEEE Symposium on Large Data Analysis and Visualization (LDAV), Oct., pp. 89–96.

[Figure 1 labels recovered from the image. CFD Workflow on nodes with 8 cores: Mesh Processing (./edgeCFDMesh), Domain Partitioning (mpirun -n 8 edgeCFDPre; input mesh i partitioned in M parts) and Parallel CFD Solver (mpirun -n M edgeCFD; 16 partitions of mesh i, solver executed with 16 cores for case i, job i). Chiron is running in each core of each node: managing scheduling, fault-tolerance, provenance data gathering, ... runtime provenance queries. Prov-Vis components: Workflow Engine, Provenance, Web Client and Remote Application, with DisplayCluster and a Tiled Wall Display showing Exploration 1, Exploration 2, Exploration 3, ..., Exploration N.]
