people.uwplatt.edupeople.uwplatt.edu/.../s13/se_1/reesejo_datavisualizatio…  · web viewdata...

Software and Data Visualization

Jon Reese

Abstract

Data visualization is not a field unique to software development. However, with the age of information comes an issue known as big data. The amount of data generated vastly outweighs the amount of data we can understand. To deal with this problem, visual representations can be generated automatically through software. There is a great amount of utility in data visualization software, and there is demand for it. The goal is often for users to interpret big data intuitively to gain insight. There are also times when data visualization could be used to improve a system’s interface. This paper explores the utility of data visualization in understanding big data and how it is implemented. Methods of data visualization are explored for their strengths and weaknesses.

Introduction

A common goal of data visualization is to represent data in a relatable and meaningful manner. A table of places mapped to latitude and longitude points may be an effective way to store locations. However, it is not an effective way to show where locations are in relation to each other and the earth. A map is a visualization of that data and communicates it to a person in a relatable way: the distance on the map is proportional to the actual distance between locations.

Visually representing data is not a method brought about by the Digital Age. World maps, mathematical graphs, and charts existed long before software. However, the age of information brings the problem of “Big data.” The rate at which data is generated often greatly outweighs the rate at which we can comprehend it. Software involving data visualization provides a way to learn things from “Big data” that might otherwise not be feasible.

For an interesting example, Randall Munroe posted a survey on his blog asking people to name colors. The survey showed the user a color based on a random RGB value and asked for a name of the color. Its purpose was to find correlations between color names and sections of the RGB spectrum. The survey received five million responses from 225,500 users. The data was stored in a SQLite database and released to the public.

This is an example of the “Big data” problem. To select only one hundred users and view the color names and RGB values that they provided would be time consuming and difficult to interpret. Even if interpreted, it would yield an incomplete and warped concept of color names and color values. A data visualization solution was made to visualize the entire pool of data. It displays each common name using a color derived from the many RGB values given for that name, producing the following rotatable 3D data visualization. This visualization allows the user to explore the spectrum of mutually understood color names and their color values. The 3D visualization was created in JavaScript by Howard Yeend.

Figure 1: Color Spectrum
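The step of deriving one color from many RGB values can be sketched as follows. This is a minimal illustration that assumes a plain per-channel mean; the paper does not state how the visualization in Figure 1 actually combines the values, and the function name average_colors and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_colors(responses):
    """Map each color name to the mean (r, g, b) of all values given that name.

    A plain per-channel mean is an assumption made for illustration.
    """
    sums = defaultdict(lambda: [0, 0, 0, 0])  # r, g, b totals and a count
    for name, (r, g, b) in responses:
        entry = sums[name.strip().lower()]  # treat "Teal" and "teal" as one name
        entry[0] += r
        entry[1] += g
        entry[2] += b
        entry[3] += 1
    return {
        name: (round(r / n), round(g / n), round(b / n))
        for name, (r, g, b, n) in sums.items()
    }


# Hypothetical sample rows, not taken from the real survey data.
sample = [
    ("teal", (0, 128, 128)),
    ("Teal", (10, 140, 130)),
    ("red", (255, 0, 0)),
]
print(average_colors(sample))
```

Grouping by a normalized name before averaging is what lets millions of individual answers collapse into one point per color name.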

Data visualization is not the only way to obtain information from big data. Data mining is the automated process of finding patterns and trends in big data. The mined data is extracted from the big data to be used. To compare, data mining extracts interesting bits of information from big data, while data visualization makes the data easy for the viewer to interpret. In this case, data mining would encompass analyzing the raw data for correlations between names and color values. Data visualization would take that data and create the 3D visual. Data visualization often cooperates with data mining, but neither requires the other.

Software Engineering and Data Visualization

Figure 2: Mapping the Internet

Data visualization is not unique to software engineering. In the example of Figure 2, an MIT study visualized the connections between ISPs and networks to demonstrate the congestion of the Internet. This is data visualization in a computer science study dealing with massive amounts of data. In this figure the points are nodes of the Internet. The larger and more central nodes carry more traffic, and the lines between nodes represent the volume of data exchanged between them. In this case, big data was visualized to demonstrate a broader idea: the core of the Internet is a chokepoint, which suggests we could take advantage of more peer-to-peer connections.

Data visualization can be implemented in a piece of software’s GUI to make the interface more informative. Visualizations can help monitor massive systems, provide informative geographical maps, and help find correlations in big data. They are effective at taking information that is overwhelming and making it manageable and intuitive. This enables the viewer to find something in big data quickly and in a relatable manner. A classic example would be selecting “Properties” of a hard drive on Windows and seeing a pie chart showing available space on the hard drive. To a common user this gives a satisfactory amount of information quickly. The user can compare used space to free space and, from experience, get a rough idea of how much room is left.

Visualization also does not need to remove functionality. Visualizations in software interfaces can be interactive, ideally allowing the user to inspect anything that catches their attention.
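The hard drive example reduces to a single ratio of used space to total space. The sketch below is a hedged illustration of that idea using a text bar instead of a pie chart; the function name usage_bar and the "/" path are assumptions, and shutil.disk_usage is the Python standard library call that reports the totals.

```python
import shutil


def usage_bar(used, total, width=20):
    """Return a text bar such as '[#####---------------] 25% used'."""
    fraction = used / total if total else 0.0
    filled = round(fraction * width)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {fraction:.0%} used"


if __name__ == "__main__":
    # The path is an assumption; pass a drive such as "C:\\" on Windows.
    usage = shutil.disk_usage("/")
    print(usage_bar(usage.used, usage.total))
```

The exact byte counts are still available to the program, but the bar conveys the proportion at a glance, which is the point the paragraph above makes about the pie chart.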

Choosing When and How to Visualize

Determining whether one should visualize data can be very situational. Data visualization may be important when there is an overwhelming amount of data, when there is an interest in understanding that data, and when the specific numbers are less important than how the numbers relate to one another. It should also be safe to assume we know what the user is looking for in the data. Many people find data visualizations striking, but in software a visualization should be more than aesthetic. Data visualization for its own sake will complicate an interface. It should be used to take an overwhelming amount of data and make it manageable and intuitive. It can also take data that is not relatable for the user and put it in a relatable context. If the result is not relatable or intuitive but data visualization is still desired, an accompanying explanation would be useful.

Choosing how to visualize data is an important step. A good guideline is to show only relevant information in a way that gives it context. When creating any visualization one should consider what the viewer is looking to discover. Suppose the management of a convenience store chain wants to discover correlations between items purchased together in order to plan the store organization. A useful data visualization feature in this company’s system would be a correlation matrix of items purchased together. An example of a correlation matrix can be seen in Figure 3.

Figure 3: Correlation Matrix example

A correlation matrix is generated by listing all but the first item down the vertical axis, and all but the last item along the horizontal axis. This gives each pair of items exactly one cell at or beneath the diagonal, without repetition. For a large set of items, only the pairs with strong correlations could be shown.

While this scenario may be contrived, there is real demand for data visualization. In fact, there are software products and software companies that exist for the sole purpose of visualizing large amounts of data. The software company SAS (originally Statistical Analysis System) provides data visualization software that lets businesses understand and learn from their gathered data so they can make more informed decisions.
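The triangular layout of a correlation matrix can be sketched in a few lines. This is an illustrative example only: the item names and baskets are hypothetical, the helper names pearson and lower_triangle are assumptions, and Pearson correlation of purchase indicators is one reasonable measure among several that a real system might use.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def lower_triangle(items, baskets):
    """Rows are items[1:], columns are items[:-1]; each pair appears once."""
    # 1 if the item is in the basket, 0 otherwise, for every basket.
    present = {item: [1 if item in b else 0 for b in baskets] for item in items}
    rows = {}
    for i in range(1, len(items)):
        rows[items[i]] = {
            items[j]: round(pearson(present[items[i]], present[items[j]]), 2)
            for j in range(i)
        }
    return rows


# Hypothetical purchase data for illustration.
items = ["chips", "salsa", "soda"]
baskets = [
    {"chips", "salsa"},
    {"chips", "salsa", "soda"},
    {"soda"},
    {"chips", "salsa"},
]
for row, cells in lower_triangle(items, baskets).items():
    print(row, cells)
```

Each row holds only the columns to the left of the diagonal, so no pair of items is computed or displayed twice.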

Notice the color shades in Figure 3 are based on a single hue and a variation of intensity. When visualizing quantitative data using colors, differences in intensity are easily read as higher and lower, while differences in hue can come across as simply a different color.

A correlation matrix helps find individual correlations between many different variables. To view the correlation between two variables, a scatterplot is an effective visualization tool. If geographical location gives the data context, then a geographical visualization is worth considering. If hierarchy and quantity (especially volume) give the data context, then a treemap visualization is most likely appropriate. Network visualizations show a large set of items and how they relate to each other. Word clouds are very simple and are often used to view which topics are trending in social media. A word cloud is a collage of words in which each word’s font size is based on how often the word is repeated.
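The word cloud rule, font size based on repetition, can be sketched as a linear scaling from word counts to point sizes. The function name font_sizes and the 10 to 48 point range are assumptions chosen for illustration; real word cloud tools may scale differently and usually filter out common words first.

```python
from collections import Counter


def font_sizes(text, min_pt=10, max_pt=48):
    """Scale each word's count linearly into the range [min_pt, max_pt]."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_pt + (count - low) * (max_pt - min_pt) / span
        for word, count in counts.items()
    }


print(font_sizes("data data data map map chart"))
```

Layout of the words on the page is a separate problem; this sketch covers only the size calculation described above.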

Geographical Visualization

Figure 4: Verizon’s coverage map (March 2013)

A prospective Verizon customer may want to know whether he or she will be covered. Without a visualization, checking every location where coverage is needed would seem overwhelming, and the customer might remain uncertain. By looking at the map in Figure 4, the customer can quickly see whether each location is covered, is not covered, or is questionable.

Geographical visualization is the incorporation of a map into the data visualization. The map gives the user a relatable context for the data. Geographical visualization is therefore worth considering when the data’s location is relevant. It is important that these visualizations feature the data, and anything that doesn’t provide context for the data is probably better left out. Satellite images are often avoided in this kind of visualization because they can be very distracting. A satellite view adds a massive amount of detail that distracts from the data that is important, and it can reduce the clarity of color coding.

There are many possible ways to show quantitative information on a map. One option would be to place a bar graph on each section. If these bar graphs were all the same size, they would be too small for some landmasses and too large for others, skewing our perception of the quantity. If they were sized to fit each area, they would no longer share a common scale. Filling the areas with different intensities of a color solves this problem. It is easy to compare the shade of one area to another, and the size of the area does not affect our perception of the quantity. Maps shaded in this way are known as choropleth maps.

In Figure 5, police records and the Google Maps API are used to map the safety of areas within the city of Milwaukee. The method is rather simple. An area with zero reported crimes gets the darkest shade of blue. The area with the highest amount of crime gets the lightest shade. All areas between the two are color coded accordingly. This simple method turns available data into a useful map for those concerned. A geographical visualization is important in this case because physical location is very relevant.

Figure 5: Safety of Milwaukee
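The shading rule described for Figure 5 amounts to a linear interpolation between two shades of one hue. The sketch below is an illustration under stated assumptions: the two RGB endpoints are hypothetical values, not the colors used in the actual map, and the function names shade and to_hex are invented for this example.

```python
DARK_BLUE = (8, 48, 107)      # assumed shade for zero reported crimes
LIGHT_BLUE = (222, 235, 247)  # assumed shade for the highest crime count


def shade(count, max_count):
    """Interpolate from DARK_BLUE (count 0) to LIGHT_BLUE (count == max_count)."""
    t = count / max_count if max_count else 0.0
    return tuple(
        round(dark + t * (light - dark))
        for dark, light in zip(DARK_BLUE, LIGHT_BLUE)
    )


def to_hex(rgb):
    """Format an (r, g, b) tuple as a CSS-style hex color."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical crime counts per area.
counts = {"Area A": 0, "Area B": 20, "Area C": 40}
highest = max(counts.values())
for area, count in counts.items():
    print(area, to_hex(shade(count, highest)))
```

Because only intensity varies, every area can be compared with every other on a single scale regardless of its size on the map.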

Treemap Visualization

Treemaps are a common type of data visualization. They display information through nested rectangles. The visualization should not be confused with the TreeMap class in the Java standard library, which is a sorted map backed by a red-black tree and has nothing to do with drawing rectangles. Treemaps communicate in three main ways. First, based on the nested rectangles, a treemap communicates a hierarchical data structure. For example, the rectangles could represent folders in the file system of a computer. The outermost rectangle would represent the main directory, rectangles with more rectangles inside would represent folders, and a single rectangle would represent a file. Second, the rectangle size could represent its portion of the whole, like a slice of a pie graph. In the same example the size of the rectangle would be an excellent representation of the hard drive space taken up. Folders either take up virtually no space of their own or hold files, so a folder’s rectangle is made up of the rectangles of its contents. Third, the color of the rectangles could represent the type of file. This example is implemented in the useful open source software “WinDirStat,” shown in Figure 6.

Figure 6: WinDirStat

WinDirStat (Windows Directory Statistics) is designed as a cleanup tool for versions of Windows. Thanks to the treemap visualization, no matter how deeply files are buried, they stand out based on size. When dealing with the potentially overwhelming amount of data on a hard drive, this treemap visualization makes it easy to spot a massive zip file buried beneath directories, or a folder whose files take up a significant amount of space. The selected rectangle in Figure 6, the one with the white border, represents a folder.
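The nesting of rectangles by size can be sketched with the slice-and-dice layout, the simplest treemap algorithm. This is an illustration only: the paper does not describe which layout algorithm WinDirStat uses, and the dictionary format, the file names, and the function names size and layout are assumptions.

```python
def size(node):
    """A file's size is stored on the node; a folder's size is the sum of its contents."""
    if "children" in node:
        return sum(size(child) for child in node["children"])
    return node["size"]


def layout(node, x, y, w, h, depth=0, out=None):
    """Assign every node a rectangle (x, y, w, h) by slice-and-dice subdivision."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(size(child) for child in children)
    offset = 0.0  # fraction of the parent already handed out
    for child in children:
        share = size(child) / total if total else 0.0
        if depth % 2 == 0:
            # Even depth: divide the parent's width among the children.
            layout(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd depth: divide the parent's height among the children.
            layout(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical directory tree; sizes are in arbitrary units.
tree = {
    "name": "root",
    "children": [
        {"name": "docs", "children": [
            {"name": "a.txt", "size": 10},
            {"name": "b.txt", "size": 30},
        ]},
        {"name": "big.zip", "size": 60},
    ],
}
for rect in layout(tree, 0, 0, 100, 100):
    print(rect)
```

Each rectangle's area is proportional to its share of the whole, so a single large file remains visible no matter how deep it sits in the hierarchy.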

Another example of the treemap visualization being used effectively is Newsmap. Newsmap (Figure 7) is a news feed that pulls from Google News and arranges the articles using a treemap. It takes full advantage of the format. Colors represent categories of news, shades of colors represent how recent the news article is, and size represents the popularity of the article. These articles are nested within a larger rectangle representing the location for the news.

Figure 7: Newsmap

While the concept of news headlines in a treemap may seem overwhelming and foreign at first, with familiarity the format can provide many advantages. If the user is familiar with the environment, information can be interpreted quickly.

Marcos Weskamp, creator of Newsmap, posted the following:

Currently, the internet presents a highly disorganized collage of information. Many of us are working in an information-soaked world. There is too much of everything. We are subject everywhere to a sensory overload of images, bombarded with information; in magazines and advertisements, on TV, radio, in the cityscape. The internet is a wonderful communication tool, but day after day we find ourselves constantly dealing with information overload. Today, the internet presents a new challenge, the wide and unregulated distribution of information requires new visual paradigms to organize, simplify and analyze large amounts of data. New user interface challenges are arising to deal with all that overwhelming quantity of information.

Conclusion

Data is being recorded at a rate that far exceeds our ability to comprehend it. This is the problem called “Big data,” and in many situations human interpretation can be aided by data visualization in software interfaces. Showing users what they are looking for quickly and in a relatable context can be insightful and can make for a popular software system. While the implementation of data visualization in an interface is situational, those situations should be considered opportunities.

References

[1] Statistical Analysis Solutions (2012). Data Visualization Techniques. Retrieved from http://www.sas.com/reg/wp/corp/51989

[2] Stephen Few (2009). Introduction to Geographical Data Visualization. Retrieved from http://www.perceptualedge.com/articles/visual_business_intelligence/geographical_data_visualization.pdf

[3] Crime safety map (2013). [Geographical data visualization of Milwaukee March 19, 2013] Location Inc. Retrieved from http://www.neighborhoodscout.com/wi/milwaukee/crime/

[4] Newsmap news feed (2013). [Treemap data visualization of Google News March 19, 2013] Marcos Weskamp. Retrieved from http://newsmap.jp

[5] Marcos Weskamp. Newsmap [pg 6]. Message posted to http://marumushi.com/projects/newsmap

[6] WinDirStat Developers (Open Source) (2007). WinDirStat (Version 1.1.2) [Software]. Available from http://windirstat.info/index.html

[7] Verizon coverage locator (2013). [Verizon mobile phone coverage locator March 19, 2013] Verizon. Retrieved from http://www.verizonwireless.com/b2c/support/coverage-locator

[8] Howard Yeend (May 4, 2010). XKCD Colour Survey – a 3D visualization. [Web log post]. Retrieved from http://www.puremango.co.uk/2010/05/xkcd-color-survey-3d-visualization/

[9] Duncan Graham-Rowe (2007). Mapping the Internet. MIT Technology Review. Retrieved from http://www.technologyreview.com/news/408104/mapping-the-internet/

[10] Itoh, T., Yamaguchi, Y., Ikehata, Y., Kajinaga, Y. (2004). Hierarchical data visualization using a fast rectangle-packing algorithm. IEEE Transactions on Visualization and Computer Graphics, vol. 10, no. 3, pp. 302-313, May 2004. Retrieved from http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=1272729&contentType=Journals+%26+Magazines&searchField%3DSearch_All%26queryText%3D.QT.Data+visualization.QT.

[11] Yau, N. (2012). Visualize this, the flowingdata guide to design, visualization, and statistics. Wiley.