traces Documentation Release 0.5.1 Mike Stringer Dec 29, 2019


Contents

1 Why?

2 Installation

3 Quickstart: using traces
    3.1 Adding more data...
    3.2 It’s flexible

4 More info
    4.1 Examples
    4.2 API Reference

5 Contributing

Index


A Python library for unevenly-spaced time series analysis.


CHAPTER 1

Why?

Taking measurements at irregular intervals is common, but most tools are primarily designed for evenly-spaced measurements. Also, in the real world, time series have missing observations, or you may have multiple series with different frequencies: it can be useful to model these as unevenly-spaced.

Traces aims to make it simple to write readable code to:

• Wrangle. Read, write, and manipulate unevenly-spaced time series data

• Explore. Perform basic analyses of unevenly-spaced time series data without making an awkward / lossy transformation to evenly-spaced representations

• Convert. Gracefully transform unevenly-spaced time series data to evenly-spaced representations

Traces was designed by the team at Datascope based on several practical applications in different domains, because it turns out unevenly-spaced data is actually pretty great, particularly for sensor data analysis.


CHAPTER 2

Installation

To install traces, run this command in your terminal:

$ pip install traces


CHAPTER 3

Quickstart: using traces

To see a basic use of traces, let’s look at these data from a light switch, also known as Big Data from the Internet of Things.

The main object in traces is a TimeSeries, which you create just like a dictionary, adding the five measurements at 6:00am, 7:45:56am, etc.

>>> import traces
>>> from datetime import datetime
>>> time_series = traces.TimeSeries()
>>> time_series[datetime(2042, 2, 1, 6, 0, 0)] = 0 # 6:00:00am
>>> time_series[datetime(2042, 2, 1, 7, 45, 56)] = 1 # 7:45:56am
>>> time_series[datetime(2042, 2, 1, 8, 51, 42)] = 0 # 8:51:42am
>>> time_series[datetime(2042, 2, 1, 12, 3, 56)] = 1 # 12:03:56pm
>>> time_series[datetime(2042, 2, 1, 12, 7, 13)] = 0 # 12:07:13pm

What if you want to know if the light was on at 11am? Unlike a Python dictionary, you can look up the value at any time even if it’s not one of the measurement times.

>>> time_series[datetime(2042, 2, 1, 11, 0, 0)] # 11:00am
0
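The lookup above returns the most recent measurement at or before the requested time. As a rough illustration of that idea only (this is not the library's implementation, and value_at is a hypothetical helper), the same behaviour can be sketched with the standard library:

```python
from bisect import bisect_right
from datetime import datetime

# Measurement times (sorted) and values from the light switch example.
times = [
    datetime(2042, 2, 1, 6, 0, 0),
    datetime(2042, 2, 1, 7, 45, 56),
    datetime(2042, 2, 1, 8, 51, 42),
    datetime(2042, 2, 1, 12, 3, 56),
    datetime(2042, 2, 1, 12, 7, 13),
]
values = [0, 1, 0, 1, 0]


def value_at(t, default=None):
    """Return the most recent measurement at or before t (hypothetical helper)."""
    i = bisect_right(times, t)  # number of measurement times <= t
    return values[i - 1] if i > 0 else default


print(value_at(datetime(2042, 2, 1, 11, 0, 0)))  # light state at 11:00am
```

Keeping the times sorted is what makes a binary search possible, so each lookup stays cheap even with many measurements.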

The distribution function gives you the fraction of time that the TimeSeries is in each state.

>>> time_series.distribution(
...     start=datetime(2042, 2, 1, 6, 0, 0), # 6:00am
...     end=datetime(2042, 2, 1, 13, 0, 0) # 1:00pm
... )
Histogram({0: 0.8355952380952381, 1: 0.16440476190476191})

The light was on about 16% of the time between 6am and 1pm.
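To make the 16% figure concrete, here is a minimal sketch of a time-weighted distribution in plain Python. It assumes each value holds until the next measurement and that start is not earlier than the first measurement; the distribution function below is a hypothetical stand-in, not the library's code.

```python
from collections import defaultdict
from datetime import datetime

# The five light switch measurements as (time, value) pairs, sorted by time.
measurements = [
    (datetime(2042, 2, 1, 6, 0, 0), 0),
    (datetime(2042, 2, 1, 7, 45, 56), 1),
    (datetime(2042, 2, 1, 8, 51, 42), 0),
    (datetime(2042, 2, 1, 12, 3, 56), 1),
    (datetime(2042, 2, 1, 12, 7, 13), 0),
]


def distribution(measurements, start, end):
    """Fraction of [start, end] spent in each state (hypothetical helper)."""
    seconds = defaultdict(float)
    for i, (t, value) in enumerate(measurements):
        # Each value holds until the next measurement, or until `end`.
        t_next = measurements[i + 1][0] if i + 1 < len(measurements) else end
        lo, hi = max(t, start), min(t_next, end)
        if hi > lo:
            seconds[value] += (hi - lo).total_seconds()
    total = sum(seconds.values())
    return {value: s / total for value, s in seconds.items()}


result = distribution(
    measurements,
    start=datetime(2042, 2, 1, 6, 0, 0),
    end=datetime(2042, 2, 1, 13, 0, 0),
)
print(result)
```

The light is on for 3946 seconds in the morning and 197 seconds around noon, out of 25200 seconds in total, which gives the fraction of about 0.164 reported above.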

3.1 Adding more data...

Now let’s get a little more complicated and look at the sensor readings from forty lights in a building.

How many lights are on throughout the day? The merge function takes the forty individual TimeSeries and efficiently merges them into one TimeSeries where each value is a list of the states of all the lights.

>>> trace_list = [... list of forty traces.TimeSeries ...]
>>> count = traces.TimeSeries.merge(trace_list, operation=sum)

We also applied a sum operation to the list of states to get the TimeSeries of the number of lights that are on.
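The idea behind merging with an operation can be sketched in plain Python: take the union of all measurement times, look up every series at each of those times, and reduce the list of states. This is a naive illustration under that assumption, with hypothetical helpers value_at and merge and made-up data; the library's own merge is described as efficient, which this sketch does not try to be.

```python
from bisect import bisect_right


def value_at(series, t, default=0):
    """Most recent value at or before t in a sorted list of (time, value) pairs."""
    times = [time for time, _ in series]
    i = bisect_right(times, t)
    return series[i - 1][1] if i > 0 else default


def merge(series_list, operation=None):
    """Combine step functions on the union of their measurement times."""
    all_times = sorted({t for series in series_list for t, _ in series})
    merged = []
    for t in all_times:
        states = [value_at(series, t) for series in series_list]
        merged.append((t, operation(states) if operation else states))
    return merged


# Three hypothetical lights; times are hours since midnight, 1 means "on".
lights = [
    [(6, 0), (8, 1), (17, 0)],
    [(6, 0), (9, 1), (12, 0)],
    [(6, 1), (7, 0)],
]
count = merge(lights, operation=sum)
print(count)
```

Without an operation, each merged value is the list of individual states; passing sum turns that list into the number of lights that are on.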

How many lights are typically on during business hours, from 8am to 6pm?

>>> histogram = count.distribution(
...     start=datetime(2042, 2, 1, 8, 0, 0), # 8:00am
...     end=datetime(2042, 2, 1, 12 + 6, 0, 0) # 6:00pm
... )
>>> histogram.median()
17

The distribution function returns a Histogram that can be used to get summary metrics such as the mean or quantiles.
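One way to read a median from such a histogram is to treat the time spent in each state as a weight and find the value at which half of the total weight has accumulated. The sketch below assumes that definition; weighted_quantile and the sample numbers are hypothetical, and the library may handle ties or interpolation differently.

```python
def weighted_quantile(histogram, q):
    """Smallest value whose cumulative weight reaches a fraction q of the total."""
    total = sum(histogram.values())
    cumulative = 0.0
    for value in sorted(histogram):
        cumulative += histogram[value]
        if cumulative >= q * total:
            return value
    raise ValueError("histogram is empty")


# Hypothetical data: hours during which a given number of lights were on.
hours_by_count = {15: 2.0, 16: 1.5, 17: 3.0, 18: 2.5, 20: 1.0}
print(weighted_quantile(hours_by_count, 0.5))
```

Because the weights are durations, a state that lasts a long time counts for more than one that is only briefly observed.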

3.2 It’s flexible

The measurement points (keys) in a TimeSeries can be in any units as long as they can be ordered. The values can be anything.

For example, you can use a TimeSeries to keep track of the contents of a grocery basket by the number of minutes within a shopping trip.

>>> time_series = traces.TimeSeries()
>>> time_series[1.2] = {'broccoli'}
>>> time_series[1.7] = {'broccoli', 'apple'}
>>> time_series[2.2] = {'apple'} # puts broccoli back
>>> time_series[3.5] = {'apple', 'beets'} # mmm, beets

To learn more, check the examples and the detailed reference.


CHAPTER 4

More info

4.1 Examples

Traces aims to make it simple to write readable code to:

• Wrangle. Read, write, and manipulate unevenly-spaced time series data

• Explore. Perform basic analyses of unevenly-spaced time series data without making an awkward / lossy transformation to evenly-spaced representations

• Convert. Gracefully transform unevenly-spaced time series data to evenly-spaced representations

This section has a few examples of how to do these things.

4.1.1 Read and manipulate

Say we have a directory with a bunch of CSV files with information about light bulbs in a home. Each CSV file has the wattage used by the bulb as a function of time. Some of the light bulbs only send a signal when the state changes, but some send a signal every minute. We can read them with this code.

import glob
import sys
from datetime import datetime

import traces


def parse_iso_datetime(value):
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")


def read_all(pattern='data/lightbulb-*.csv'):
    """Read all of the CSVs in a directory matching the filename pattern
    as TimeSeries.

    """
    result = []
    for filename in glob.iglob(pattern):
        print('reading', filename, file=sys.stderr)
        ts = traces.TimeSeries.from_csv(
            filename,
            time_column=0,
            time_transform=parse_iso_datetime,
            value_column=1,
            value_transform=int,
            default=0,
        )
        ts.compact()
        result.append(ts)
    return result


ts_list = read_all()

The call to ts.compact() will remove any redundant measurements. Depending on how often your data changes compared to how often it is sampled, this can reduce the size of the data dramatically.
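To show what "redundant" means here, the following sketch drops every measurement that repeats the value before it, so the step function is unchanged at all times. The compact function and the readings are hypothetical; this illustrates the idea rather than the library's code.

```python
def compact(measurements):
    """Drop measurements that repeat the previous value (hypothetical helper)."""
    result = []
    for t, value in measurements:
        if not result or result[-1][1] != value:
            result.append((t, value))
    return result


# A hypothetical bulb that reports every minute, even when nothing changes.
readings = [(0, 0), (1, 0), (2, 60), (3, 60), (4, 60), (5, 0), (6, 0)]
print(compact(readings))
```

Seven readings shrink to three, and a bulb that reports every minute but rarely changes state would shrink far more.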

4.1.2 Basic analysis

Now, let’s say we want to do some basic exploratory analysis of how much power is used in the whole home. We’ll first take all of the individual traces and merge them into a single TimeSeries where each value is the total wattage.

total_watts = traces.TimeSeries.merge(ts_list, operation=sum)

The merged time series has times that are the union of all times in the individual series. Since each time series is the wattage of the lightbulb, the values after the sum are the total wattage used over time. Here’s how to check the mean power consumption in January.

histogram = total_watts.distribution(
    start=datetime(2016, 1, 1),
    end=datetime(2016, 2, 1),
)
print(histogram.mean())

Let’s say we want to break this down to see how the distribution of power consumption varies by time of day.

for hour, distribution in total_watts.distribution_by_hour_of_day():
    print(hour, distribution.quantiles([0.25, 0.5, 0.75]))

Or day of week.

for day, distribution in total_watts.distribution_by_day_of_week():
    print(day, distribution.quantiles([0.25, 0.5, 0.75]))

Finally, we just want to look at the distribution of power consumption during business hours on each day in January.

from datetime import datetime, timedelta

from traces.utils import datetime_range  # assumed import path for this helper

for t in datetime_range(datetime(2016, 1, 1), datetime(2016, 2, 1), 'days'):
    biz_start = t + timedelta(hours=8)
    biz_end = t + timedelta(hours=18)
    histogram = total_watts.distribution(start=biz_start, end=biz_end)
    print(t, histogram.quantiles([0.25, 0.5, 0.75]))

In practice, you’d probably be plotting these distributions and time series using your tool of choice.


4.1.3 Transform to evenly-spaced

Now, let’s say we want to do some forecasting of the power consumption of this home. There is probably some seasonality that needs to be accounted for, among other things, and we know that statsmodels and pandas are tools with some batteries included for that type of thing. Let’s convert to a pandas Series.

regular = total_watts.moving_average(300, pandas=True)

That will convert to a regularly-spaced time series using a moving average to avoid aliasing (more info here). At this point, a good next step is the excellent tutorial by Tom Augspurger, starting with the Modeling Time Series section.
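As a rough picture of what such a conversion involves, the sketch below samples a step function at regular times and reports the time-weighted mean over a window centered on each sample, rather than the instantaneous value. It assumes the window is as wide as the sampling period and that the first value applies before the first measurement; all function names and data are hypothetical, and the library's moving_average may differ in its defaults.

```python
def value_at(series, t):
    """Most recent value at or before t; the first value is used before that."""
    current = series[0][1]
    for time, value in series:
        if time > t:
            break
        current = value
    return current


def window_mean(series, lo, hi):
    """Time-weighted mean of a step function over [lo, hi]."""
    # Break the window at every measurement time that falls inside it.
    cuts = [lo] + [t for t, _ in series if lo < t < hi] + [hi]
    area = sum(value_at(series, a) * (b - a) for a, b in zip(cuts, cuts[1:]))
    return area / (hi - lo)


def moving_average(series, sampling_period, start, end):
    """Sample every sampling_period with a centered window of the same width."""
    half = sampling_period / 2
    samples = []
    t = start
    while t <= end:
        samples.append((t, window_mean(series, t - half, t + half)))
        t += sampling_period
    return samples


# Hypothetical wattage readings; times are in seconds.
watts = [(0, 0), (100, 60), (450, 0)]
regular = moving_average(watts, 300, start=0, end=600)
print(regular)
```

Averaging over the window means a short burst of power between two sample times still contributes to the output, which is the point of avoiding aliasing.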

4.2 API Reference

4.2.1 TimeSeries

In traces, a TimeSeries is similar to a dictionary that contains measurements of something at different times. One difference is that you can ask for the value at any time: it doesn’t need to be at a measurement time. Let’s say you’re measuring the contents of a grocery cart by the number of minutes within a shopping trip.

>>> cart = traces.TimeSeries()
>>> cart[1.2] = {'broccoli'}
>>> cart[1.7] = {'broccoli', 'apple'}
>>> cart[2.2] = {'apple'}
>>> cart[3.5] = {'apple', 'beets'}

If you want to know what’s in the cart at 2 minutes, you can simply get the value using cart[2] and you’ll see {'broccoli', 'apple'}. By default, if you ask for a time before the first measurement, you’ll get None.

>>> cart = traces.TimeSeries()
>>> print(cart[-1])
None

If, however, you set the default when creating the TimeSeries, you’ll get that instead:

>>> cart = traces.TimeSeries(default=set())
>>> cart[-1]
set()

In this case, it might also make sense to add the t=0 point as a measurement with cart[0] = set().

Performance note

Traces is not designed for maximal performance, but it’s no slouch since it uses the excellent sortedcontainers.SortedDict under the hood to store sparse time series.

class traces.TimeSeries(data=None, default=None)
    A class to help manipulate and analyze time series that are the result of taking measurements at irregular points in time. For example, here would be a simple time series that starts at 8am and goes to 9:59am:

    >>> ts = TimeSeries()
    >>> ts['8:00am'] = 0
    >>> ts['8:47am'] = 1
    >>> ts['8:51am'] = 0
    >>> ts['9:15am'] = 1
    >>> ts['9:59am'] = 0

    The value of the time series is the last recorded measurement: for example, at 8:05am the value is 0 and at 8:48am the value is 1. So:

    >>> ts['8:05am']
    0

    >>> ts['8:48am']
    1

There are also a bunch of things for operating on another time series: sums, difference, logical operators andsuch.

compact()
    Convert this instance to a compact version: the value will be the same at all times, but repeated measurements are discarded.

default
    Return the default value of the time series.

difference(other)
    difference(x, y) = x(t) - y(t).

distribution(start=None, end=None, normalized=True, mask=None, interpolate='previous')
    Calculate the distribution of values over the given time range from start to end.

    Parameters

        • start (orderable, optional) – The lower time bound of when to calculate the distribution. By default, the first time point will be used.

        • end (orderable, optional) – The upper time bound of when to calculate the distribution. By default, the last time point will be used.

        • normalized (bool) – If True, distribution will sum to one. If False and the time values of the TimeSeries are datetimes, the units will be seconds.

        • mask (TimeSeries, optional) – A domain on which to calculate the distribution.

        • interpolate (str, optional) – Method for interpolating between measurement points: either "previous" (default) or "linear". Note: if "previous" is used, then the resulting histogram is exact. If "linear" is given, then the values used for the histogram are the average value for each segment; the mean of this histogram will be exact, but higher moments (variance) will be approximate.

    Returns Histogram with the results.

exists()
    Returns False when the time series has a None value, True otherwise.

first_item()
    Returns the first (time, value) pair of the time series.

first_key()
    Returns the first time recorded in the time series.

first_value()
    Returns the first recorded value in the time series.


get(time, interpolate='previous')
    Get the value of the time series, even in-between measured values.

get_item_by_index(index)
    Get the (t, value) pair of the time series by index.

items()
    Return a list of the (key, value) pairs in ts, as 2-tuples.

classmethod iter_merge(timeseries_list)
    Iterate through several time series in order, yielding (time, list) tuples where list is the values of each individual TimeSeries in the list at time t.

iterintervals(n=2)
    Iterate over groups of n consecutive measurement points in the time series.

iterperiods(start=None, end=None, value=None)
    This iterates over the periods (optionally, within a given time span) and yields (interval start, interval end, value) tuples.

    TODO: add mask argument here.

last_item()
    Returns the last (time, value) pair of the time series.

last_key()
    Returns the last time recorded in the time series.

last_value()
    Returns the last recorded value in the time series.

logical_and(other)
    logical_and(t) = self(t) and other(t).

logical_or(other)
    logical_or(t) = self(t) or other(t).

logical_xor(other)
    logical_xor(t) = self(t) ^ other(t).

mean(start=None, end=None, mask=None, interpolate='previous')
    This calculates the average value of the time series over the given time range from start to end, when mask is truthy.

classmethod merge(ts_list, compact=True, operation=None)
    Iterate through several time series in order, yielding (time, value) where value is either the list of the values of each individual TimeSeries in the list at time t (in the same order as in ts_list) or the result of the optional operation on that list of values.

moving_average(sampling_period, window_size=None, start=None, end=None, placement='center', pandas=False)
    Averaging over regular intervals.

multiply(other)
    mul(t) = self(t) * other(t).

n_measurements()
    Return the number of measurements in the time series.

n_points(start=-inf, end=inf, mask=None, include_start=True, include_end=False, normalized=False)
    Calculate the number of points over the given time range from start to end.

    Parameters

        • start (orderable, optional) – The lower time bound of when to count the points. By default, start is -infinity.

        • end (orderable, optional) – The upper time bound of when to count the points. By default, the end is +infinity.

        • mask (TimeSeries, optional) – A domain on which to count the points.

    Returns int with the result

operation(other, function, **kwargs)
    Calculate an "elementwise" operation either between this TimeSeries and another one, i.e.

    operation(t) = function(self(t), other(t))

    or between this TimeSeries and a constant:

    operation(t) = function(self(t), other)

    If it’s another time series, the measurement times in the resulting TimeSeries will be the union of the sets of measurement times of the input time series. If it’s a constant, the measurement times will not change.

remove(time)
    Allow removal of measurements from the time series. This throws an error if the given time is not actually a measurement point.

remove_points_from_interval(start, end)
    Allow removal of all points from the time series within an interval [start:end].

sample(sampling_period, start=None, end=None, interpolate='previous')
    Sampling at regular time periods.

set(time, value, compact=False)
    Set the value for the time series. If compact is True, only set the value if it’s different from what it would be anyway.

set_interval(start, end, value, compact=False)
    Set the value for the time series on an interval. If compact is True, only set the value if it’s different from what it would be anyway.

slice(start, end)
    Return an equivalent TimeSeries that only has points between start and end (always starting at start).

sum(other)
    sum(x, y) = x(t) + y(t).

threshold(value, inclusive=False)
    Return True if greater than the threshold value (or >= the threshold value if inclusive=True).

to_bool(invert=False)
    Return the truth value of each element.
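The rule stated for operation() (union of measurement times for two series, unchanged times for a constant) can be illustrated with a small plain-Python sketch. The helpers below are hypothetical and use a default of 0 before the first measurement; they are meant to show the rule, not to reproduce the library's implementation.

```python
import operator
from bisect import bisect_right


def value_at(series, t, default=0):
    """Most recent value at or before t in a sorted list of (time, value) pairs."""
    times = [time for time, _ in series]
    i = bisect_right(times, t)
    return series[i - 1][1] if i > 0 else default


def operation(series, other, function):
    """Apply function pointwise; other may be a series (list) or a constant."""
    if isinstance(other, list):
        # Two series: evaluate both on the union of their measurement times.
        times = sorted({t for t, _ in series} | {t for t, _ in other})
        return [(t, function(value_at(series, t), value_at(other, t))) for t in times]
    # Constant: the measurement times do not change.
    return [(t, function(value, other)) for t, value in series]


a = [(0, 1), (4, 3)]
b = [(2, 10), (4, 20)]
total = operation(a, b, operator.add)
doubled = operation(a, 2, operator.mul)
print(total)
print(doubled)
```

Methods such as sum, difference, multiply, and the logical operators listed above follow the same pattern with a different function.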

4.2.2 Histogram

class traces.Histogram(data=(), **kwargs)

max(include_zero=False)
    Maximum observed value with non-zero count.

mean()
    Mean of the distribution.

min(include_zero=False)
    Minimum observed value with non-zero count.

normalized()
    Return a normalized version of the histogram where the values sum to one.

standard_deviation()
    Standard deviation of the distribution.

total()
    Sum of values.

variance()
    Variance of the distribution.
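For intuition about these summary statistics, the sketch below treats a histogram as a mapping from value to weight and computes a weighted mean, variance, and standard deviation. It assumes population-style weighting (dividing by the total weight); the function names are hypothetical and the library's exact definitions may differ.

```python
import math


def mean(histogram):
    """Weighted mean of a {value: weight} mapping."""
    total = sum(histogram.values())
    return sum(value * weight for value, weight in histogram.items()) / total


def variance(histogram):
    """Weighted variance around the weighted mean (population form)."""
    mu = mean(histogram)
    total = sum(histogram.values())
    return sum(w * (v - mu) ** 2 for v, w in histogram.items()) / total


def standard_deviation(histogram):
    """Square root of the weighted variance."""
    return math.sqrt(variance(histogram))


# Hypothetical light that is off 75% of the time and on 25% of the time.
light = {0: 0.75, 1: 0.25}
print(mean(light), variance(light), standard_deviation(light))
```

For an on/off signal the mean is simply the fraction of time spent on, which is why the distribution of a light switch and its mean tell the same story.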


CHAPTER 5

Contributing

Contributions are welcome and greatly appreciated! Please visit the repository for more info.


Index

C
compact() (traces.TimeSeries method)

D
default (traces.TimeSeries attribute)
difference() (traces.TimeSeries method)
distribution() (traces.TimeSeries method)

E
exists() (traces.TimeSeries method)

F
first_item() (traces.TimeSeries method)
first_key() (traces.TimeSeries method)
first_value() (traces.TimeSeries method)

G
get() (traces.TimeSeries method)
get_item_by_index() (traces.TimeSeries method)

H
Histogram (class in traces)

I
items() (traces.TimeSeries method)
iter_merge() (traces.TimeSeries class method)
iterintervals() (traces.TimeSeries method)
iterperiods() (traces.TimeSeries method)

L
last_item() (traces.TimeSeries method)
last_key() (traces.TimeSeries method)
last_value() (traces.TimeSeries method)
logical_and() (traces.TimeSeries method)
logical_or() (traces.TimeSeries method)
logical_xor() (traces.TimeSeries method)

M
max() (traces.Histogram method)
mean() (traces.Histogram method)
mean() (traces.TimeSeries method)
merge() (traces.TimeSeries class method)
min() (traces.Histogram method)
moving_average() (traces.TimeSeries method)
multiply() (traces.TimeSeries method)

N
n_measurements() (traces.TimeSeries method)
n_points() (traces.TimeSeries method)
normalized() (traces.Histogram method)

O
operation() (traces.TimeSeries method)

R
remove() (traces.TimeSeries method)
remove_points_from_interval() (traces.TimeSeries method)

S
sample() (traces.TimeSeries method)
set() (traces.TimeSeries method)
set_interval() (traces.TimeSeries method)
slice() (traces.TimeSeries method)
standard_deviation() (traces.Histogram method)
sum() (traces.TimeSeries method)

T
threshold() (traces.TimeSeries method)
TimeSeries (class in traces)
to_bool() (traces.TimeSeries method)
total() (traces.Histogram method)

V
variance() (traces.Histogram method)