Distributed Edge Learning for Big Data Analytics: Challenges and Trends Song Guo The Hong Kong Polytechnic University [email protected] https://www4.comp.polyu.edu.hk/~cssongguo/


Post on 01-Sep-2020


Page 1: Distributed Edge Learning for Big Data Analytics: Challenges and …cssongguo/EdgeAI-short.pdf · 2020. 8. 25. · Worker 1 Worker 2 Worker 3 Worker 4 Server1 2. Distributed Machine

Distributed Edge Learning for Big Data Analytics: Challenges and Trends

Song Guo
The Hong Kong Polytechnic University

[email protected]
https://www4.comp.polyu.edu.hk/~cssongguo/


Acknowledgement

RGC Research Impact Fund (RIF), “Edge Learning: the Enabling Technology for Distributed Big Data Analytics in Cloud-Edge Environment”, 2020-2025, Project Coordinator, 7,640,000 HKD.


Outline

• Background and Preliminaries

• Challenge Analysis

• Approaches and Results

• Future Directions


Booming Era of Intelligence

Smart Home, Self-Driving, Smart Health, Smart Grid, Multimedia Service, Smart Surveillance

Smart Decision Making, Automation, and Optimization

Big Data, AI


From Cloud Intelligence to Edge Intelligence

[Figure: Cloud Intelligence (Cloud) vs. Edge Intelligence (Cloud and Edge); diagram labels: Model Training & Inference]

• Save bandwidth: 847 ZB vs. 19.5 ZB per year
• Reduce latency: s-level to ms/μs-level
• Ensure privacy


Distributed Machine Learning

Single-machine training

Weakness - Slow
• A single machine has limited hardware resources
• The size of the training data is large
• The machine learning model is large

Multiple-machine training

Advantage - Fast
• Train the model on multiple machines in parallel

Challenge: How to configure these machines?


Distributed Edge Learning

[Figure: Distributed Machine Learning in a Data Center (Worker 1, Worker 2, Worker 3, Worker 4; Server1, Server2)]

[Figure: Distributed Machine Learning with Edge Devices]

The principle of edge computing naturally facilitates distributed edge learning by leveraging edge/device resources.


Basic Algorithm and Architecture

• Gradient aggregation and model synchronization: Stochastic Gradient Descent (SGD)
• Architecture: Parameter Server
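The combination of SGD and a parameter server can be illustrated with a small sketch. Everything in it is an assumption made for illustration and does not come from the slides: the 1-D model y = w * x, the class name `ParameterServer`, its `pull`/`push` methods, the learning rate, and the synthetic data shards. Workers pull the current weight, compute a gradient on a local mini-batch, and the server averages the gradients before taking one SGD step.

```python
import random


def local_gradient(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


class ParameterServer:
    """Hypothetical server that holds the global weight (illustration only)."""

    def __init__(self, w0, lr):
        self.w = w0
        self.lr = lr

    def pull(self):
        # Workers fetch the current global weight.
        return self.w

    def push(self, grads):
        # Aggregate by averaging, then apply one SGD step.
        self.w -= self.lr * sum(grads) / len(grads)


def make_shard(rng, size):
    # Synthetic, noiseless data from y = 3 * x (an assumption for the demo).
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    return [(x, 3.0 * x) for x in xs]


def train(num_workers=4, rounds=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    shards = [make_shard(rng, 32) for _ in range(num_workers)]
    server = ParameterServer(w0=0.0, lr=lr)
    for _ in range(rounds):
        w = server.pull()
        # Each worker computes a gradient on a mini-batch of its own shard.
        grads = [local_gradient(w, rng.sample(shard, 8)) for shard in shards]
        server.push(grads)
    return server.w


final_w = train()
```

In this toy setting `final_w` ends up close to the true slope of 3. A real deployment would replace the in-process calls with network communication between workers and servers.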


Synchronization Methods

• BSP (Bulk Synchronous Parallel): Models at all workers are strictly synchronized in every iteration.

• ASP (Asynchronous Parallel): Workers update the global model asynchronously under a given threshold of staleness.

[Figure panels: BSP, ASP]
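The effect of the staleness threshold can be explored with a small timing simulation. This is a sketch under stated assumptions, not the scheme used in any particular system: the function name `updates_within`, the fixed per-worker compute times, and the time horizon are all hypothetical. A threshold of 0 forces every worker to wait for the slowest one, which reproduces BSP; a very large threshold removes the limit. Asynchronous training with a bounded threshold of this kind is often discussed under the name Stale Synchronous Parallel (SSP).

```python
import heapq


def updates_within(compute_times, horizon, staleness):
    """Count updates pushed by all workers up to time `horizon`.

    A worker may start its next iteration only while it is at most
    `staleness` iterations ahead of the slowest worker.
    """
    clocks = [0] * len(compute_times)  # finished iterations per worker
    events = [(t, i) for i, t in enumerate(compute_times)]  # (finish time, worker)
    heapq.heapify(events)
    blocked = []  # workers waiting for slower peers
    while events and events[0][0] <= horizon:
        now, i = heapq.heappop(events)
        clocks[i] += 1  # worker i pushes one update
        # Re-check every waiting worker plus the one that just finished.
        waiting, blocked = blocked + [i], []
        slowest = min(clocks)
        for j in waiting:
            if clocks[j] - slowest <= staleness:
                heapq.heappush(events, (now + compute_times[j], j))
            else:
                blocked.append(j)
    return sum(clocks)


speeds = [1, 1, 1, 4]  # three fast workers and one straggler (hypothetical)
bsp = updates_within(speeds, horizon=40, staleness=0)
bounded = updates_within(speeds, horizon=40, staleness=3)
unbounded = updates_within(speeds, horizon=40, staleness=10**9)
```

The simulation only counts updates; it says nothing about their quality. Updates computed on stale models can slow convergence, which is the trade-off the threshold is meant to control.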



Challenges

[Figure: cloud-edge environment with Cloud, Network, Edge, and Data Stream]

• Hard to train: Constrained Communication, Constrained Computation, Unstable Environment
• Hard to customize: Hardware Diversity, Data Heterogeneity, QoS Diversity
• Hard to sustain: No Incentive for Participation
• Hard to protect: Vulnerable Edge Devices, Privacy of Data

It is quite challenging to realize distributed edge learning in an efficient, secure, sustainable, and customized manner due to the inherent characteristics of the cloud-edge environment.


Hard to Train: Constrained Resource

• Limited bandwidth vs. increased communication overhead
• Limited power, computing capacity and memory space

[Figure: Image throughput for the Inception-V3 model with CPU & GPU [AI Benchmark]]

[Figure: Prediction by Deloitte Insights]

Presentation Notes
The number of edge devices is expected to double in the next four years. The growth in edge devices increases bandwidth and communication requirements. Mobile devices are constrained by CPU, GPU, storage, and so on, so AI inference and training on them are slow. Performance evolution of mobile AI accelerators: image throughput for the float Inception-V3 model. Note that Inception-V3 is a relatively small network, and for bigger models the advantage of Nvidia GPUs over other silicon might be larger.

Hard to Train: Unstable Environment

• Unguaranteed network Quality of Service (QoS)
• Unstable runtime of the edge device (computation/OS/battery/mobility)

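Presentation Notes
Note: Edge devices are prone to failure, their network bandwidth fluctuates, and they are mobile, so their stability is poor and severe stragglers appear. Baseline: learning curve in a homogeneous system. Assuming heterogeneity covers only differences in hardware computing power, and considering only CPUs and GPUs, the figure shows the effect on PS computation as the share of GPUs in the cluster goes from 0 to 100%. Bottom-right figure: Changes in the distribution of latencies across 512 different flows binned in 1-hour intervals between the Oregon and Virginia regions. We can see the distribution of the minimum, 5th, 25th, 50th, 75th, and 95th percentiles and maximum latencies observed during each 1-hour time interval.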

Hard to Protect: Security

• Cloud-edge environment suffers from malicious attacks.

[Figure: a server holding the global model W exchanges updates ∆W1, ∆W2, ∆W3, …, ∆WN with clients; good data vs. poisoned data; attacked model]

[Figure: More malicious workers, lower accuracy]

[Figure: Higher poisoning rate, higher error]
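A toy example can show why even one malicious worker matters when the server simply averages updates. The numbers below are made up for illustration, and the coordinate-wise median is a swapped-in robust aggregator that does not appear on the slide; it is shown only as a point of comparison and is not a complete defense.

```python
import statistics


def aggregate_mean(updates):
    """Coordinate-wise mean of the workers' update vectors."""
    return [statistics.fmean(column) for column in zip(*updates)]


def aggregate_median(updates):
    """Coordinate-wise median (illustrative alternative, not from the slides)."""
    return [statistics.median(column) for column in zip(*updates)]


# Hypothetical 2-D updates from four honest workers.
honest = [[1.0, -2.0], [1.1, -1.9], [0.9, -2.1], [1.0, -2.0]]
# Hypothetical poisoned update from a single malicious worker.
malicious = [[-50.0, 50.0]]

mean_clean = aggregate_mean(honest)
mean_attacked = aggregate_mean(honest + malicious)
median_attacked = aggregate_median(honest + malicious)
```

With plain averaging, the single poisoned update flips the sign of both coordinates of the aggregate, while the median stays at the honest workers' typical value in this particular example.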

Presentation Notes
[1] A Little Is Enough: Circumventing Defenses For Distributed Learning, NeurIPS’19
[2] Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning, SP’18

Hard to Protect: Privacy

• Two mainstream approaches for privacy protection in cloud-based learning are differential privacy and homomorphic encryption.

• Differential privacy: add a perturbation to the gradient. Drawback: a degraded convergence rate.

• Homomorphic encryption: execute on encrypted data. Drawback: high complexity; only supports simple operations.

• How to design lightweight mechanisms to protect data privacy on vulnerable and resource-constrained edge devices in distributed edge learning?
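The idea of adding a perturbation to the gradient can be sketched as follows. The function name `privatize_gradient`, the clipping step, and the Gaussian noise are assumptions for illustration (clipping followed by Gaussian noise is a common recipe), and the parameter values are not calibrated to any formal privacy budget.

```python
import math
import random


def privatize_gradient(grad, clip_norm, noise_std, rng):
    """Clip the gradient to `clip_norm`, then add Gaussian noise per coordinate."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Scale down only when the gradient is longer than the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale + rng.gauss(0.0, noise_std) for g in grad]


rng = random.Random(0)
grad = [3.0, 4.0]  # hypothetical local gradient with L2 norm 5

# With zero noise only the clipping is visible.
clipped_only = privatize_gradient(grad, clip_norm=1.0, noise_std=0.0, rng=rng)
# With noise, the worker uploads a perturbed gradient instead of the exact one.
noisy = privatize_gradient(grad, clip_norm=1.0, noise_std=0.5, rng=rng)
```

The noise hides individual contributions but also makes each update less accurate, which is one way to see why the convergence rate degrades.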


Hard to Sustain: Lack of Participation

• Clients will not participate without incentive
• Hard to avoid malicious behavior
• Hard to classify the contribution of A & B
• Hard to guarantee the sustainability

[Figure labels: resource consumption, data assets; contribution of Client A and Client B; contribution over time under learning strategy A and learning strategy B]

• How to build a benign ecosystem for sustainable development of edge learning?


Hard to Customize: Edge Diversity

Hard to adapt to different hardware and environments:

• QoS diversity
• Hardware diversity
• Data heterogeneity



Approaches

Challenges and example approaches:

• Hard to train, constrained resources: OSP [Wang@ICPP’19], Falcon [Zhou@ICDCS’19], Petrel [Zhou@ICDCS’20], [Zhou@TC’20], [Zhou@ICDCS’19], [Zhang@UbiComp’20]

• Hard to train, unstable environment: Heter-aware gradient coding [Wang@ICDCS’19], L4L [Zhan@IPDPS’20]

• Hard to protect, security guarantee: DGSB [Wang@20], [Yu@TBD'20]

• Hard to protect, privacy protection: [Du@TSC'19], [Yu@CPS'18]

• Hard to sustain, lack of participation: Incentive mechanism with DRL [Zhan@INFOCOM’20], [Zhan@IoT’20], [Zhan@NETWORK’19], [Zhan@NETWORK’20]


Conclusion

Critical Scientific Issues

Performance: How to improve learning performance under system constraints (i.e., high communication cost, device-level resource constraints)?

Security and Privacy: How to enhance data/model privacy and how to design lightweight security mechanisms?

Incentive: How to stimulate effective and efficient collaboration among all organizations of the federated learning ecosystem?

Goals: Fast, Secure, Win-Altogether


Hardware Diversity

• Hardware of edge devices: CPU (Central Processing Unit), GPU (Graphics Processing Unit), NPU (Neural Processing Unit)

[Figure labels: Flexibility, Efficiency]

Edge devices are equipped with various processing units with different computing capabilities.

How to adapt edge learning to different hardware environments?


Data Heterogeneity

• Data distributions of workers are not identical.
• Applying the same learning strategy for all workers fails to work efficiently.

[Figure: Cloud connected to workers holding Dataset1, Dataset2, Dataset3, and Dataset4]
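To make "not identical" data distributions concrete, the sketch below splits one labeled dataset across workers in two ways. The function names and the extreme label-skew split (each worker sees a single class) are assumptions for illustration; real edge data is usually skewed less sharply.

```python
import random
from collections import Counter


def partition_iid(labels, num_workers, rng):
    """Shuffle the sample indices and deal them out evenly."""
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    return [indices[w::num_workers] for w in range(num_workers)]


def partition_label_skew(labels, num_workers):
    """Sort indices by label, then cut into contiguous shards (non-IID)."""
    indices = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(indices) // num_workers
    return [indices[w * size:(w + 1) * size] for w in range(num_workers)]


def label_histogram(labels, shard):
    """Count how often each class appears in one worker's shard."""
    return Counter(labels[i] for i in shard)


labels = [i % 4 for i in range(400)]  # synthetic labels: 4 classes, 100 each
rng = random.Random(0)
iid_shards = partition_iid(labels, 4, rng)
skewed_shards = partition_label_skew(labels, 4)

iid_hists = [label_histogram(labels, s) for s in iid_shards]
skewed_hists = [label_histogram(labels, s) for s in skewed_shards]
```

Under the skewed split every worker's histogram contains exactly one class, so a gradient computed locally pulls the model toward that class only, which hints at why one shared learning strategy can work poorly.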


QoS Diversity

Data are generated separately by the distributed edge devices. This means:

• Different training environments
• Different data qualities

But current edge learning systems usually use one model for all clients.

• Different clients have different measurements of QoS in terms of accuracy, latency, cost, etc.
• One model will never be the optimal choice for all clients.


Wireless Optimization for Edge Learning

• How to jointly optimize the edge learning process and the wireless transmission over diverse modes of wireless communication over a complex network topology?

• Over-the-air computation and gradient compression: simultaneous transmission and decoding the average/sum of gradients

• Channel allocation and transmission scheduling in complex wireless environment: jointly optimize the training algorithm and transmission scheduling

• Synchronization mechanism in multi-layer architecture: decentralized synchronization and layer by layer synchronization
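Gradient compression can take many forms; the sketch below uses top-k sparsification as one assumed example, since the slide does not name a specific scheme. Each worker transmits only its k largest-magnitude gradient entries, and the receiver averages what arrives. The physical-layer superposition behind over-the-air computation is not modeled here; the averaging function is only a digital stand-in for decoding the average of the gradients.

```python
def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries as an {index: value} mapping."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in order[:k]}


def average_sparse(sparse_grads, dim):
    """Sum the sparse gradients coordinate-wise, then divide by the worker count."""
    total = [0.0] * dim
    for sparse in sparse_grads:
        for i, value in sparse.items():
            total[i] += value
    return [t / len(sparse_grads) for t in total]


# Hypothetical gradients from two workers.
g1 = [0.1, -2.0, 0.05, 1.5]
g2 = [1.0, 0.2, -3.0, 0.1]

s1 = top_k_sparsify(g1, 2)
s2 = top_k_sparsify(g2, 2)
avg = average_sparse([s1, s2], dim=4)
```

Here each worker sends two values instead of four, at the cost of dropping the small entries. Practical schemes often accumulate the dropped entries locally and send them later, which this sketch omits.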


Thank you!

Edge Learning: A revolutionary learning paradigm enabling ubiquitous intelligence!

[email protected]
https://www4.comp.polyu.edu.hk/~cssongguo/