memory interoperability in analytics and machine learning

27
www.twosigma.com Memory Interoperability for Analytics and Machine Learning 6/19/22 All Rights Reserved Wes McKinney @wesmckinn ScaledML @ Stanford March 25, 2017

Upload: wes-mckinney

Post on 05-Apr-2017

1.917 views

Category:

Technology


1 download

TRANSCRIPT

Page 1: Memory Interoperability in Analytics and Machine Learning

www.twosigma.com

May 2, 2023All Rights Reserved

Memory Interoperability for Analytics and Machine Learning

Wes McKinney @wesmckinnScaledML @ StanfordMarch 25, 2017

Page 2: Memory Interoperability in Analytics and Machine Learning

2May 2, 2023

Me

• Currently: Software Architect at Two Sigma Investments• Creator of Python pandas project• PMC member for Apache Arrow and Apache Parquet• Author of Python for Data Analysis

• Other Python projects: Ibis, Feather, statsmodels

All Rights Reserved

Page 3: Memory Interoperability in Analytics and Machine Learning

3May 2, 2023

Important Legal Information

The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws.  Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved

All Rights Reserved

Page 4: Memory Interoperability in Analytics and Machine Learning

4May 2, 2023

This talk

• Benefits of interoperable data and metadata• Challenges to sharing memory between runtime environments• Apache Arrow: Purpose and C++ architecture• Opportunities for collaboration• Example application: pandas 2.0

All Rights Reserved

Page 5: Memory Interoperability in Analytics and Machine Learning

5May 2, 2023

Changing hardware landscape

• Intel has released first production 3D Xpoint SSD• Reported 1000x faster than NAND, less expensive than RAM

• Convergence between RAM vs. shared memory / mmap performance

All Rights Reserved

Page 6: Memory Interoperability in Analytics and Machine Learning

6May 2, 2023

Changing software landscape

• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)• DIY open source architectures for machine learning in production

• Streaming / batch data processing pipelines• Data cleaning and feature engineering• Model fitting / scoring / serving

All Rights Reserved

Page 7: Memory Interoperability in Analytics and Machine Learning

7May 2, 2023

“Zero-copy” memory interfaces

• Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space

• Can do random access on a dataset that does not fit in RAM

• Another interpretation: reading a dataset is a metadata-only conversion

All Rights Reserved

Page 8: Memory Interoperability in Analytics and Machine Learning

8May 2, 2023

Challenges to zero-copy memory sharing

• Cross-language issues• Type metadata + logical types• Byte/bit-level memory layout

• Language-specific issues• In-memory data structures• Memory allocation and sharing constructs

All Rights Reserved

Page 9: Memory Interoperability in Analytics and Machine Learning

9May 2, 2023

What is pandas?

• Popular in-memory data manipulation tool for Python• Focused on tabular datasets (“data frames”)

• Sprawling codebase spanning multiple areas• IO for many data formats• Array manipulations / data preparation• OLAP-style analytics

• Internals implemented using NumPy array objects

All Rights Reserved

Page 10: Memory Interoperability in Analytics and Machine Learning

10May 2, 2023

NumPy

• Tensor memory model ("ndarray") for numeric data• Strided, homogeneously-typed, byte-addressable memory• APL-inspired semantics• Zero-copy construction from compatible memory layouts

• Computational tools support both strided and contiguous memory access

All Rights Reserved

Page 11: Memory Interoperability in Analytics and Machine Learning

11May 2, 2023

pandas: Technical debt + Architectural issues

• Tensor library like NumPy awkward fit for pandas use cases• Multidimensionality + strided memory access complicated

algorithms• Lack of built-in missing value support• Weak on native string, variable length, or nested types

• pandas at core a “in-memory columnar” problem, similar to analytical SQL engines

All Rights Reserved

Page 12: Memory Interoperability in Analytics and Machine Learning

12May 2, 2023

Thesis: Tensors and Tables

• 2 data structures best suited for zero-copy sharing• Tensors: N-dimensional, homogeneously-typed arrays• Tables: Column-oriented, heterogeneously typed

• These data structures can be defined using common memory and metadata primitives

All Rights Reserved

Page 13: Memory Interoperability in Analytics and Machine Learning

13May 2, 2023

Observations

• A Tensor is semantically a multidimensional view of a 1D block of memory

• Writing computational code targeting arbitrary tensors is much more difficult than 1D contiguous arrays

• Tensors of non-fixed size types (e.g. strings) occur less frequently

All Rights Reserved

Page 14: Memory Interoperability in Analytics and Machine Learning

14May 2, 2023

Apache Arrow

• github.com/apache/arrow• Collaboration amongst broad set of OSS projects around language-

agnostic shared data structures• Initial focus

• In-memory columnar tables• Canonical metadata• Interoperability between JVM and native code (C/C++) ecosystem

All Rights Reserved

Page 15: Memory Interoperability in Analytics and Machine Learning

15May 2, 2023

High performance data interchange

All Rights Reserved

Today With Arrow

Source: Apache Arrow

Page 16: Memory Interoperability in Analytics and Machine Learning

16May 2, 2023

What does Apache Arrow give you?

• Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access

• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats

• Complex schema support: Flat and nested data types

• Main implementations in C++ and Java: with integration tests• Bindings / implementations for C, Python, Ruby, Javascript in

various stages of development

All Rights Reserved

Page 17: Memory Interoperability in Analytics and Machine Learning

17May 2, 2023

Arrow in C++

• Reusable memory management and IO subsystem for native code applications

• Layered in multiple components• Memory management• Type metadata / schemas• Array / Table containers• IO interfaces• Zero-copy IPC / messaging

All Rights Reserved

Page 18: Memory Interoperability in Analytics and Machine Learning

18May 2, 2023

Arrow C++: Memory management

• arrow::Buffer• RAII-based memory lifetime with std::shared_ptr<Buffer>• arrow::MemoryMappedBuffer: for memory maps

• arrow::MemoryPool• Abstract memory allocator for tracking all allocations

All Rights Reserved

Page 19: Memory Interoperability in Analytics and Machine Learning

19May 2, 2023

Arrow C++: Type metadata

• arrow::DataType• Base class for fixed size, variable size, and nested datatypes

• arrow::Field• Type + name + additional metadata

• arrow::Schema• Collection of fields

All Rights Reserved

Page 20: Memory Interoperability in Analytics and Machine Learning

20May 2, 2023

Arrow C++: Array / Table containers

• arrow::Array• 1-dimensional columnar arrays: Int32Array, ListArray,

StructArray, etc.• Support for dictionary-encoded arrays

• arrow::RecordBatch• Collection of equal-length arrays

• arrow::Column• Logical table “column” as chunked array

• arrow::Table• Collection of columns

All Rights Reserved

Page 21: Memory Interoperability in Analytics and Machine Learning

21May 2, 2023

Arrow C++: IO interfaces

• arrow::{InputStream, OutputStream}• arrow::RandomAccessFile

• Abstract file interface• arrow::MemoryMappedFile

• Zero-copy reads to arrow::Buffer• Specific implementations for OS files, HDFS, etc.

All Rights Reserved

Page 22: Memory Interoperability in Analytics and Machine Learning

22May 2, 2023

Arrow C++: Messaging / IPC

• Metadata read/write using Google’s Flatbuffers library• Encapsulated Message type

• Write record batches, read with zero-copy• arrow::{FileWriter, FileReader}

• Random access / “batch” binary format• arrow::{StreamWriter, StreamReader}

• Streaming binary format

All Rights Reserved

Page 23: Memory Interoperability in Analytics and Machine Learning

23May 2, 2023

In development: arrow::Tensor

• Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks

• data: arrow::Buffer• shape: dimension sizes• strides: memory ordering

• Zero-copy reads using Arrow’s shared memory tools• Support Tensor math libraries for C++ like xtensor

All Rights Reserved

Page 24: Memory Interoperability in Analytics and Machine Learning

24May 2, 2023

Example use: Ray ML framework from Berkeley RISELab

All Rights Reserved

Source: https://arxiv.org/abs/1703.03924

• Shared memory-based object store

• Zero-copy tensor reads using Arrow libraries

Page 25: Memory Interoperability in Analytics and Machine Learning

25May 2, 2023

Example use: pandas 2.0

• In-planning rearchitecture of pandas’s internals• libpandas — largely Python-agnostic C++11 library• Decoupling pandas data structures from NumPy tensors

• Support analytics targeting native Arrow memory• Multicore / parallel algorithms• Leverage latest SIMD intrinsics

• Lazy-loading DataFrames from primary input formats• CSV, JSON, HDF5, Apache Parquet

All Rights Reserved

Page 26: Memory Interoperability in Analytics and Machine Learning

26May 2, 2023

Other examples

• Spark integration (SPARK-13534)• Weld integration (ARROW-649)

All Rights Reserved

Page 27: Memory Interoperability in Analytics and Machine Learning

27May 2, 2023

Thank you

• Building code and community around• IO subsystems• Metadata• Data structures and in-memory formats

All Rights Reserved