
FaceNet: A Unified Embedding for Face Recognition and Clustering

Florian Schroff, Dmitry Kalenichenko, James Philbin ({fschroff, dkalenichenko, jphilbin}@google.com), Google Inc.

Figure 1: Face Clustering. Shown is an exemplar cluster for one user. All these images in the user's personal photo collection were clustered together.

Figure 2: Illumination and Pose invariance. This figure shows the output distances of FaceNet between pairs of faces of the same and a different person in different pose and illumination combinations (the distances shown in the figure are 0.78, 1.04, 1.22, and 1.33). A distance of 0.0 means the faces are identical; 4.0 corresponds to the opposite end of the spectrum, two different identities. You can see that a threshold of 1.1 would classify every pair correctly.
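The verification rule implied by this caption is a plain distance threshold; note that the 0.0 to 4.0 range matches the squared Euclidean distance between unit-norm vectors. A minimal sketch in NumPy, where `embed` would be a stand-in for a trained model producing unit-norm embeddings (nothing here is prescribed by the abstract beyond the 1.1 cutoff):

```python
import numpy as np

THRESHOLD = 1.1  # the cutoff that, per Figure 2, classifies every pair correctly

def same_person(emb_a: np.ndarray, emb_b: np.ndarray) -> bool:
    """Verify two unit-norm embeddings by thresholding their squared
    Euclidean distance, which ranges from 0.0 (identical) to 4.0
    (opposite points on the unit hypersphere)."""
    return float(np.sum((emb_a - emb_b) ** 2)) < THRESHOLD
```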

Overview

Despite significant recent advances in the field of face recognition [1, 2], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
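To make "standard techniques" concrete: with embeddings as feature vectors, recognition reduces to nearest-neighbor lookup and clustering to any off-the-shelf algorithm. A hedged sketch using scikit-learn, where the random `gallery` stands in for embeddings from a trained model:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import AgglomerativeClustering

# Hypothetical gallery: 200 precomputed 128-D unit-norm embeddings with labels.
gallery = np.random.randn(200, 128).astype(np.float32)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
labels = np.random.randint(0, 10, size=200)

# Recognition: nearest-neighbor lookup in embedding space.
knn = KNeighborsClassifier(n_neighbors=1, metric="euclidean")
knn.fit(gallery, labels)
query = gallery[0:1]  # stand-in for a new face's embedding
print("predicted identity:", knn.predict(query)[0])

# Clustering: group faces with a distance cutoff mirroring the 1.1
# verification threshold (1.1 is a squared distance, hence sqrt here).
clust = AgglomerativeClustering(
    n_clusters=None, distance_threshold=float(np.sqrt(1.1)), linkage="average"
)
print("clusters found:", clust.fit_predict(gallery).max() + 1)
```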

Figure 1 shows one cluster in a user's personal photo collection. It is a clear showcase of the incredible invariance to occlusion, lighting, pose and even age. See Figure 2 for another illustration of its robustness to lighting and pose.

This is an extended abstract. The full paper is available at the Computer Vision Foundation webpage.

Figure 3: Model structure. Our network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the embedding, and the triplet loss.

Figure 4: FLOPS vs. Accuracy trade-off. Shown is the trade-off between FLOPS (multiply-adds, x-axis) and accuracy (validation rate, VAL, at 10^-3 FAR, y-axis) for a wide range of different model sizes and architectures (NN1, NN2, NN3, NN4, NNS1, NNS2). Highlighted are the four models that we focus on in our experiments.


Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches (see Figure 3). To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. We learn an embedding f(x), from an image x into a feature space R^d, such that the squared distance between all faces of the same identity is small, whereas the squared distance between a pair of face images from different identities is large.
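A minimal sketch of the structure in Figure 3, in PyTorch: the final layer projects backbone features to d = 128 dimensions and L2-normalizes them, so every embedding f(x) lies on the unit hypersphere. The backbone and the `in_features` size are placeholders; this illustrates the normalization step, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Projects backbone features to d dimensions, then L2-normalizes
    so that ||f(x)||_2 = 1 for every input (cf. Figure 3)."""

    def __init__(self, in_features: int = 512, d: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_features, d)  # in_features is a placeholder

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.proj(feats), p=2, dim=1)
```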

The loss that is being minimized is then

L = \sum_{i}^{N} \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+ .   (1)
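Eq. (1) translates nearly line for line into code (the superscripts a, p, n are the anchor, positive, and negative, as explained next); the hinge [.]_+ becomes a clamp at zero. The margin value below is illustrative, since the abstract does not specify alpha:

```python
import torch

def triplet_loss(anchor, positive, negative, alpha: float = 0.2):
    """Eq. (1): inputs are batches of unit-norm embeddings, one row per
    triplet i. alpha = 0.2 is an assumed value, not from the abstract."""
    pos = (anchor - positive).pow(2).sum(dim=1)   # ||f(x_i^a) - f(x_i^p)||^2
    neg = (anchor - negative).pow(2).sum(dim=1)   # ||f(x_i^a) - f(x_i^n)||^2
    return torch.clamp(pos - neg + alpha, min=0).sum()  # [.]_+ , summed over i
```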

α is a margin that is enforced between positive and negative pairs. Superscripts a, p, and n denote the anchor, positive and negative, respectively. The summation is over the set of all possible triplets in the training set.
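That set grows cubically with the number of faces, which is why triplet mining matters. As a plain enumeration of what Eq. (1) sums over (the online mining method itself is not described in this abstract, so no attempt is made to reproduce it here):

```python
import itertools
import numpy as np

def all_triplets(labels: np.ndarray):
    """Enumerate every (anchor, positive, negative) index triple for a
    labeled set of faces -- the full set that Eq. (1) sums over."""
    out = []
    for a, p in itertools.permutations(range(len(labels)), 2):
        if labels[a] == labels[p]:                         # same identity: valid (a, p)
            for n in np.flatnonzero(labels != labels[a]):  # any other identity
                out.append((a, p, int(n)))
    return out

print(len(all_triplets(np.array([0, 0, 1, 1, 2]))))  # 12 valid triplets
```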

The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face.
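The 128-byte figure corresponds to a 128-dimensional embedding stored at one byte per dimension. A hedged illustration of such a quantization; the abstract does not specify the scheme, so simple linear quantization is assumed here:

```python
import numpy as np

def quantize(emb: np.ndarray) -> bytes:
    """Map a unit-norm float embedding (each coordinate in [-1, 1]) to one
    uint8 per dimension: exactly 128 bytes for d = 128. Linear quantization
    is an illustrative choice, not the paper's scheme."""
    return np.round((emb + 1.0) * 127.5).astype(np.uint8).tobytes()

def dequantize(buf: bytes) -> np.ndarray:
    q = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    emb = q / 127.5 - 1.0
    return emb / np.linalg.norm(emb)  # re-project onto the unit sphere

emb = np.random.randn(128).astype(np.float32)
emb /= np.linalg.norm(emb)
assert len(quantize(emb)) == 128
```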

We evaluate these architectures on a wide range of hyperparameters, leading to different trade-offs of computation and accuracy; see Figure 4.

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result [1] by 30% on both datasets.
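The 30% figure is consistent with the reported accuracies. Working backwards on LFW as a quick arithmetic check (the prior-best numbers themselves are not stated in this abstract):

```python
facenet_err = 1 - 0.9963              # FaceNet LFW error: 0.37%
prior_err = facenet_err / (1 - 0.30)  # a 30% relative cut implies ~0.53% before
print(f"implied previous-best LFW accuracy: {1 - prior_err:.2%}")  # ~99.47%
```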

[1] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deeply learned face representations are sparse, selective, and robust. CoRR, abs/1412.1265, 2014.

[2] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In IEEE Conf. on CVPR, 2014.