Processing and Visualization of Collected Data Based on Open-Source Tools and Principles

Sanja Grbac Babić*, Kristijan Cetina*
* Istarsko veleučilište - Università Istriana di scienze applicate, Pula, Croatia

[email protected], [email protected]

Summary - This paper demonstrates the use of open-source tools and libraries to process collected data and visualize the results of the analysis. The data were collected using a custom-built assembly with a GPS receiver and an Arduino microcontroller, and in a second example with the help of a mobile device and the free Phyphox app. Using the Python programming language and the Matplotlib, NumPy and Pandas libraries, the analysis and visualization of the collected data are carried out within a Jupyter notebook.

In this paper, we demonstrate how simple data analysis can be performed and applied for learning purposes, but also in further research. Different graphical representations of data are used in the paper to demonstrate the importance of selecting the correct representation for different types of data. In an age when the amount of collected data is growing exponentially, it is important to be able to extract useful information from it and to make the results of data analysis as accurate as possible.

The complete software part of the work was created using open-source tools, and all the details and results of this work are freely available on GitHub1.

Keywords - Python, Matplotlib, open-source, data analysis

I. INTRODUCTION

Today, data collection is increasingly important. As technological development made it possible to collect and store data, it became not only necessary but also very useful to collect, process and analyse data. With the ever-increasing amount of available data, the challenge is to analyse those data and visualize them as properly as possible, so that they can be used further. In this way, data processing can predict future trends, and a lot of information can be deduced from a proper analysis. To obtain information that is as accurate as possible, appropriate tools should be used.

This paper describes a possible way to build a data analysis using a Jupyter notebook with Python, with the help of the NumPy and Matplotlib libraries.

The paper is organized in six sections. In Section II we discuss the technology stack. In Section III we describe the data acquisition methods, and the results of data processing and visualization are given in Section IV. In Section V a case study and intentions for future work are presented. Finally, the conclusion is found in Section VI.

1 https://github.com/KristijanCetina/BachelorThesis/tree/master/dataAnalysis

II. TECHNOLOGY STACK

A. Jupyter notebook

The Jupyter Notebook is an open-source web application that can be used to create and share documents that contain live code, equations, visualizations, and text. Project Jupyter is a non-profit, open-source project, born out of the IPython Project in 2014 as it evolved to support interactive data science and scientific computing [1]. Besides Python, Jupyter Notebook supports Julia, R and many other languages [2].

One of the benefits of using an interactive notebook over some other IDE (Integrated Development Environment) or code editor for Python is its ease of use and the ability to quickly iterate over certain cells before continuing with the analysis, without the need to run the whole script every time. This greatly facilitates script development for beginners and less experienced coders.

Another benefit is easy sharing and reproducibility of the work. The Jupyter Notebook can be exported to PDF as a final document, to LaTeX for future work and for publishing the results as part of a research project, or even to HTML.

Several papers have been published with supporting notebooks that reproduce the analysis or the creation of key plots. The detection of gravitational waves by the LIGO experiment (LIGO Scientific Collaboration and Virgo Collaboration et al., 2016) is one such example: the researchers posted a notebook on their website illustrating in detail how to filter and process the data to reveal the signature of a distant black hole merger (LIGO collaboration) [3].

A Jupyter Notebook contains a series of cells that support adding rich content. There are four types of cells: Code, Markdown, Heading and Raw NBConvert. As the Heading cell is no longer supported, using a Markdown cell is the preferred way to create headings.

The Raw NBConvert cell type is intended only for special use cases when using the nbconvert command line tool. Basically, it allows control over the formatting in a very specific way when converting a notebook to another format.
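As an illustration, conversion with the documented nbconvert command line tool looks like this (a minimal sketch; the notebook name analysis.ipynb is ours, not taken from the original work):

jupyter nbconvert --to html analysis.ipynb
jupyter nbconvert --to pdf analysis.ipynb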


The most used cell types are Code, which is used to run source code, and Markdown, which is used for headings, user comments and explanations.

B. NumPy library

NumPy is the fundamental package for scientific computing with Python. It contains, among other things:

• a powerful N-dimensional array object,
• sophisticated (broadcasting) functions,
• tools for integrating C/C++ and Fortran code,
• useful linear algebra, Fourier transform, and random number capabilities [4].

In addition to significantly facilitating the numerical and statistical analysis of a data set, NumPy, due to its structure and representation of data arrays, enables a significant improvement in data processing performance. To demonstrate the difference in performance between pure Python and the NumPy library, a simple experiment was done, consisting of summing the elements of an array of 10⁶ elements and measuring the time it takes to complete the task. Our test shows a result of 6.07 ms ± 117 µs per loop for pure Python and 749 µs ± 6.74 µs per loop for NumPy2. There is a significant difference in the time it takes to complete the task, and industry experience shows an even greater difference in more complex tasks [5].
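A minimal sketch of such a benchmark, using the %timeit magic available in a Jupyter notebook (the variable names are illustrative, not taken from the original notebook):

import numpy as np

python_list = list(range(10**6))     # plain Python list of one million integers
numpy_array = np.arange(10**6)       # equivalent NumPy array

%timeit sum(python_list)             # pure Python summation
%timeit np.sum(numpy_array)          # vectorised NumPy summation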

C. Plotting and version control

Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms [6]. All plots in this paper are made using Matplotlib, whose documentation, with hundreds of examples, can be consulted for this type of application.

Git is used for version control. Git is a free and open-source distributed version control system designed to handle everything from small to very large projects with speed and efficiency [7].

III. DATA ACQUISITION

The data used in this paper were obtained using custom-made Arduino-based hardware, built by the authors in their previous work3, and the free phyphox app [8]. This paper does not provide details on the hardware used, since the data analysis and visualization are independent of the data source and acquisition method. Instead, just a brief overview of the hardware is given to show the various possibilities.

The data logger used was created as part of the author's final thesis. The device is based on an Arduino UNO microcontroller with a TMP364 temperature sensor added on top of an Adafruit Ultimate GPS Logger Shield5. The shield has an integrated SD card slot which is used to write the data to a card.

An example of a record from the Arduino microcontroller is:

2 Results may vary depending on the system used.
3 https://github.com/KristijanCetina/BachelorThesis
4 https://github.com/KristijanCetina/BachelorThesis/blob/master/resources/TMP35_36_37.pdf

$GPRMC,052730.000,A,4458.8300,N,01356.1724,E,12.41,283.84,310719,,,A*54,26.38

where (a decoding sketch is given after this list):

• $GPRMC - Sentence Identifier,

• 052730.000 – Time stamp (UTC),

• A - validity (A-ok, V-invalid),

• 4458.8300 – Current Latitude,

• N – North / South,

• 01356.1724 – Current Longitude,

• E – East / West,

• 12.41 – Speed in knots,

• 283.84 – True course,

• 310719 – Date stamp,

• A*54 – checksum,

• 26.38 – temperature in °C.
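As an illustration of how such a record can be decoded, the following minimal sketch splits the record above into its fields; the variable names are ours, and only this exact record layout is handled:

# Split one GPRMC record (as logged above) into its comma-separated fields.
record = '$GPRMC,052730.000,A,4458.8300,N,01356.1724,E,12.41,283.84,310719,,,A*54,26.38'
fields = record.split(',')

speed_knots = float(fields[7])       # speed over ground in knots
speed_kmh = speed_knots * 1.852      # 1 knot = 1.852 km/h
temperature = float(fields[13])      # temperature appended by the logger, in °C
print(speed_kmh, temperature)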

The other data source was a smartphone with the phyphox application installed, which allows the smartphone's sensors to be used for various experiments. In this case, accelerometer data were recorded during the travel of a lift in an apartment building.

All the data were recorded in comma-separated values (.csv) files and transferred to a computer for further analysis.

Phyphox accelerometer data are given as:

• Time (s);
• Linear Acceleration x (m/s²);
• Linear Acceleration y (m/s²);
• Linear Acceleration z (m/s²);
• Absolute acceleration (m/s²).

The coordinate system used in the application is shown in Figure 1.

5 https://learn.adafruit.com/adafruit-ultimate-gps-logger-shield?view=all

Figure 1. Coordinate system used in phyphox application


IV. DATA PROCESSING AND VISUALISATION

As a first example of plotting data, the speed of a car carrying the data logger was plotted as a function of time. To be able to use the libraries in code, they first needed to be imported as follows:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

It is a common naming convention to name the imported libraries as shown and to use these aliases in code. The next step is to import the data used for the analysis. As there is no header in the used .csv file, we must name the attributes (columns):

filename = 'GPSLOG10.CSV'
data = pd.read_csv(filename, header=None, delimiter=',',
                   names=['Sentence', 'Time', 'Validity',
                          'Latitude', 'NS', 'Longitude', 'EW',
                          'Speed', 'Direction', 'Date', 'NA1',
                          'NA2', 'Checksum', 'Temperature'])

and finally, we can plot our data as shown in Figure 2 with:

plt.plot(data['Time'], data['Speed']*1.852, 'b-', label='Speed')
plt.legend(loc='upper left')
plt.xlabel('time')
plt.ylabel('v [km/h]')
plt.title('Vehicle speed')

It should be noted that the data are in knots (nautical miles per hour) but are plotted in km/h, so the raw data had to be multiplied by 1.852 to convert the units. Statistical analysis can also be done, though for more advanced analyses additional libraries may be needed. One of the simpler statistics is the minimum and maximum temperature, which we can get with:

print('Min temp: ', np.min(data['Temperature']), '°C')
print('Max temp: ', np.max(data['Temperature']), '°C')

As a response we get "Min temp: 25.73 °C; Max temp: 27.99 °C" for the given dataset.

Using colormaps, it is possible to encode additional information on a plot. As an example, Figure 3 shows the position of the vehicle on the x-y axes and its speed in colour, with a scale alongside.

One should be mindful, when choosing a colormap, of what data will be shown, as explained in [9] and [10].
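A minimal sketch of how such a plot can be produced from the data frame defined above; the perceptually uniform 'viridis' colormap is our illustrative choice, and note that for an undistorted track the raw NMEA coordinates would first have to be converted from degrees-and-minutes form to decimal degrees:

plt.scatter(data['Longitude'], data['Latitude'],
            c=data['Speed']*1.852, cmap='viridis', s=5)
plt.colorbar(label='v [km/h]')       # colour scale encodes the speed
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.title('Position and speed of vehicle')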

For the second example we plotted the accelerometer data acquired using the previously mentioned phyphox app on a smartphone. The smartphone was firmly attached to the lift wall, taking care to align the axis of motion with the axis of the device. The result is shown in Figure 4.

The key point of this plot is to show the influence of the individual components of the axis movement on the total force exerted on the body. The first three plots show the linear movement on the x, y and z axes respectively, while the fourth plot shows their linear combination. It is easy to notice the very small amplitude of the movements on the x-axis and that the majority of the total force comes from the y-axis. That is to be expected, as this axis is aligned with the lift's travel path.

The subplots of the linear movements can be plotted independently of each other, but in this case it is important to have them plotted on the same scale, to avoid misunderstanding due to different scales. In the setup we used the following code:

fig, axs = plt.subplots(4, sharex=True, sharey=True)
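A minimal sketch of how the four components can then be drawn onto these axes; the file name and the exact column headers of the phyphox export are assumptions based on the field list in Section III:

acc = pd.read_csv('Accelerometer.csv')          # hypothetical name of the phyphox export
components = ['Linear Acceleration x (m/s^2)',
              'Linear Acceleration y (m/s^2)',
              'Linear Acceleration z (m/s^2)',
              'Absolute acceleration (m/s^2)']
for ax, col in zip(axs, components):            # one component per subplot, shared scale
    ax.plot(acc['Time (s)'], acc[col])
    ax.set_ylabel(col.split('(')[0].strip())
axs[-1].set_xlabel('t [s]')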

Figure 2: Plot of vehicle speed

Figure 3: Position and speed of vehicle

Figure 4: Accelerometer data


By default, the parameter sharey is set to False, which scales each subplot to the full range of its motion component and gives a misleading impression of the amplitude of the motion. By setting it to True, all subplots use the same range, which gives a realistic representation of the actual forces.

V. CASE STUDY AND FUTURE WORK

In everyday life, one often requires the monitoring of events, that is, the collection of data that can help us find useful information and predict future situations and needs; this is especially needed in technical systems. As today's technology has enabled real-time remote monitoring, the authors would like to put these principles into practice in their future work.

In his work [11], Grgin implemented remote monitoring of a battery on board a vessel using a microcontroller. The lead-acid battery on board the vessel is used to start the main propulsion and to supply various electrical and electronic devices, and it may be damaged during periods of non-use due to insufficient maintenance. Lead batteries have a relatively high degree of self-discharge, especially in colder winter conditions. Replacing a battery has both a financial and an environmental cost.

Since batteries already pollute the environment during their production, it is logical to give such devices adequate maintenance, to prevent an unnecessary waste of funds and additional environmental pollution. The expected life of the lead-acid batteries commonly used on board vessels varies greatly depending on the operating conditions and the level of maintenance. It is certainly not advisable for lead batteries to remain at a relatively low charge for long periods of time, as their battery life is then rapidly reduced, so it is advisable to check the battery status regularly.

That is why it is advisable to monitor the battery remotely in real time. In addition to monitoring, it is also desirable to collect and subsequently process the collected data.

The purpose of the work [11] was to create, with the help of a microcontroller, a device for remote monitoring of the battery voltage and any charging current, so that the battery could be adequately maintained.

Continuing from the work [11], it is planned to monitor and analyse the data over time; based on the obtained data, it would be possible to predict the maintenance cycle and ultimately extend the life of the battery itself.

In the future, we will explore the possibilities of the tools and techniques used for visualizing more complex data and structures, as well as their statistical properties.

VI. CONCLUSION

In this paper, we explored the possibilities of data analysis and visualization using open-source tools, with the main focus on ease of use and on the sharing and reproducibility of the analysis. With the availability of open-source tools and resources, this is achievable and usable by a novice for similar work as well as by an expert for more complex analyses.

REFERENCES

[1] Project Jupyter. [Online]. Available at: https://jupyter.org/ [Accessed: 8.1.2020.]
[2] List of supported kernels. [Online]. Available at: https://github.com/jupyter/jupyter/wiki/Jupyter-kernels [Accessed: 9.1.2020.]
[3] T. Kluyver, B. Ragan-Kelley, F. Pérez, B. E. Granger, M. Bussonnier, J. Frederic, K. Kelley et al., "Jupyter Notebooks - a publishing format for reproducible computational workflows," in ELPUB, 2016, pp. 87-90.
[4] NumPy library homepage. [Online]. Available at: https://numpy.org [Accessed: 8.1.2020.]
[5] S. Van Der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: a structure for efficient numerical computation," Computing in Science & Engineering, vol. 13, no. 2, p. 22, 2011.
[6] J. D. Hunter, "Matplotlib: A 2D Graphics Environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, 2007.
[7] Git homepage. [Online]. Available at: https://git-scm.com [Accessed: 8.1.2020.]
[8] Phyphox app project homepage. [Online]. Available at: https://phyphox.org/ [Accessed: 8.1.2020.]
[9] K. Moreland, "Why we use bad color maps and what you can do about it," Electronic Imaging, no. 16, pp. 1-6, 2016.
[10] A Better Default Colormap for Matplotlib | SciPy 2015. [Online]. Available at: https://www.youtube.com/watch?v=xAoljeRJ3lU [Accessed: 10.1.2020.]
[11] S. Grgin, "Daljinsko praćenje stanja baterije na plovilu pomoću ESP8266 mikrokontrolera" (Remote monitoring of the battery on a vessel using an ESP8266 microcontroller), final thesis, Istarsko veleučilište, Pula, 2019.
