# Review Notes - DeepFace: Closing the Gap to Human-Level Performance in Face Verification

## Introduction

• The paper describes a CNN architecture for face verification.
• Face verification: given an image of a face, verify that it belongs to a particular person. A system that can do this can likely handle other facial recognition tasks as well. This is the typical pipeline for facial recognition:
• Detect where the face is, and crop the image so that it contains just the face.
• Align the face, so that the various parts of the face (eyes, mouth, nose, etc.) sit at roughly the same area of the image; this is done through affine transformations (translations and rotations).
• Represent the face: a vector representation which can easily be used for comparison.
• Classify using the representation.
• The paper makes drastic changes to the align step, by doing explicit 3D modeling of the face, and to the represent step, by using a nine-layer deep neural network with more than 120 million parameters.

## Face Alignment / Frontalization

The idea is to remove variations within the images/faces so that every face appears to look straight into the camera ("frontalized").

#### 2D alignment

• The alignment process starts by detecting 6 fiducial/landmark points with a Support Vector Regressor (SVR).
• The fiducial points are used to approximate the scale, rotation and translation that warp the image onto six anchor locations; the SVRs (features: LBPs) are re-run on the newly warped image, and this is iterated until there is no substantial change, eventually composing a final 2D similarity transformation (a minimal sketch of the fit follows this list).
• With this iterative process, the locations of the fiducial points are gradually refined, and the aggregated transformation generates a 2D-aligned face crop.
• The face crop is anchored at the centers of the eyes (2 points), the tip of the nose (1 point) and the mouth (3 points).
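The paper does not publish this fitting code; the following is a minimal NumPy sketch of one plausible step, estimating a 2D similarity transform (scale, rotation, translation) that maps detected fiducial points onto fixed anchor locations via linear least squares. The `anchors` coordinates are made up for illustration.

```python
import numpy as np

def fit_similarity(src, dst):
    """Least-squares 2D similarity transform mapping src points onto dst points.

    src, dst: (N, 2) arrays of corresponding 2D points.
    Parameterized as x' = a*x - b*y + tx,  y' = b*x + a*y + ty.
    """
    n = len(src)
    A = np.zeros((2 * n, 4))
    A[0::2] = np.c_[src[:, 0], -src[:, 1], np.ones(n), np.zeros(n)]
    A[1::2] = np.c_[src[:, 1],  src[:, 0], np.zeros(n), np.ones(n)]
    a, b, tx, ty = np.linalg.lstsq(A, dst.ravel(), rcond=None)[0]
    return np.array([[a, -b, tx], [b, a, ty]])   # 2x3 warp matrix

# Six anchor locations (illustrative): two eye centers, nose tip, three mouth points.
anchors = np.array([[55., 65.], [97., 65.], [76., 90.],
                    [60., 115.], [76., 118.], [92., 115.]])
detected = anchors + np.random.randn(6, 2) * 2.0  # stand-in for SVR detections
T = fit_similarity(detected, anchors)             # warp to apply to the image
```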

#### 3D alignment

• The 2D alignment can only normalize variations within the 2D plane, not out-of-plane variations (e.g., seeing the face from its left/right side). To normalize out-of-plane variations, the paper uses a 3D transformation.
• An additional 67 fiducial points are detected on the face (again via SVRs).
• 67 anchor points are manually placed on a 3D shape, which is an average of the 3D scans from the USF Human-ID database; this establishes full correspondence between the detected fiducial points and their 3D references.
• The 67 detected landmarks are mapped to that mesh.
• An affine 3D-to-2D camera P is then fitted using the generalized least squares solution (a simplified sketch follows this list).
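A minimal NumPy sketch of the camera fit, assuming ordinary least squares; the paper uses a *generalized* least-squares solution weighted by a covariance estimated from the landmark detector, which is omitted here. The point data below is synthetic.

```python
import numpy as np

def fit_affine_camera(X3d, x2d):
    """Fit an affine 3D-to-2D camera P (2x4) from point correspondences.

    X3d: (N, 3) reference points on the 3D mesh; x2d: (N, 2) image landmarks.
    """
    Xh = np.c_[X3d, np.ones(len(X3d))]          # homogeneous coordinates (N, 4)
    # Solve Xh @ P.T ~= x2d in the least-squares sense.
    P_T, *_ = np.linalg.lstsq(Xh, x2d, rcond=None)
    return P_T.T                                 # (2, 4) camera matrix

X3d = np.random.rand(67, 3)                              # stand-in mesh anchors
x2d = X3d @ np.random.rand(3, 2) + np.random.rand(2)     # synthetic projections
P = fit_affine_camera(X3d, x2d)
```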

#### Deep CNN architecture

The CNN receives the frontalized face images (152x152, RGB).

##### Convolution-pooling-convolution filtering
• C1-Convolution, 32 filters, 11x11, ReLU (-> 32x142x142, CxHxW)

• M2-Maxpooling over 3x3, stride 2 (-> 32x71x71)

• C3-Convolution, 16 filters, 9x9, ReLU (-> 16x63x63)

• A 3D-aligned 3-channel (RGB) face image of size 152 by 152 pixels is given to a convolutional layer (C1) with 32 filters of size 11x11x3. The resulting 32 feature maps are then fed to a max-pooling layer (M2) which takes the max over 3x3 spatial neighborhoods with a stride of 2, separately for each channel. M2 is followed by another convolutional layer (C3) that has 16 filters of size 9x9x16.

• The convolution-pooling-convolution filtering is responsible for extracting low-level face features like texture and edges.

• Max-pooling layers make the output of convolution networks more robust to local translations. However, several levels of pooling would cause the network to lose information about the precise position of detailed facial structures and micro-textures. A minimal sketch of this front end follows.
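A minimal PyTorch sketch of the C1 -> M2 -> C3 front end described above, assuming unpadded ("valid") convolutions; `ceil_mode=True` in the pooling reproduces the 71x71 map size quoted in the notes.

```python
import torch
import torch.nn as nn

frontend = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=11),                        # C1: 3x152x152 -> 32x142x142
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True),   # M2: -> 32x71x71
    nn.Conv2d(32, 16, kernel_size=9),                        # C3: -> 16x63x63
    nn.ReLU(),
)

x = torch.randn(1, 3, 152, 152)   # a frontalized RGB face crop
print(frontend(x).shape)          # torch.Size([1, 16, 63, 63])
```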

##### Locally-connected layers
• L4-Local Convolution, 16 filters, 9x9, ReLU (-> 16x55x55)
• L5-Local Convolution, 16 filters, 7x7, ReLU (-> 16x25x25)
• L6-Local Convolution, 16 filters, 5x5, ReLU (-> 16x21x21)
• L4, L5, and L6 apply a filter bank like convolutional layers, but every location in the feature map learns a different set of filters. This is because different regions of an aligned image have different local statistics, so the spatial stationarity assumption of convolution cannot hold (a sketch of such a layer follows this list).
• The use of local layers does not affect the computational burden of feature extraction, but it does affect the number of parameters subject to training.
• Local Convolutions use a different set of learned weights at every “pixel” (while a normal convolution uses the same set of weights at all locations).
• They can afford to use local convolutions because of their frontalization, which roughly forces specific landmarks to be at specific locations.
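A minimal PyTorch sketch of a locally connected ("untied") layer, assuming stride 1 and no padding; the name `LocallyConnected2d` is ours, as PyTorch has no built-in equivalent. L5 above would additionally need stride 2 to match its quoted output size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    """Like a convolution, but with a separate filter bank at every output location."""

    def __init__(self, in_ch, out_ch, in_size, kernel):
        super().__init__()
        out_size = in_size - kernel + 1           # 'valid' convolution, stride 1
        self.out_ch, self.out_size, self.kernel = out_ch, out_size, kernel
        # One filter bank per output position: (positions, out_ch, in_ch*k*k).
        self.weight = nn.Parameter(
            torch.randn(out_size * out_size, out_ch, in_ch * kernel * kernel) * 0.01)

    def forward(self, x):
        patches = F.unfold(x, self.kernel)        # (batch, in_ch*k*k, positions)
        # Multiply each patch by the filters specific to its position.
        y = torch.einsum('bkp,pok->bop', patches, self.weight)
        return y.reshape(x.shape[0], self.out_ch, self.out_size, self.out_size)

# L4 above: 16 -> 16 channels, 9x9 filters on a 63x63 map -> 16x55x55.
l4 = LocallyConnected2d(16, 16, in_size=63, kernel=9)
print(l4(torch.randn(1, 16, 63, 63)).shape)       # torch.Size([1, 16, 55, 55])
```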
##### Fully-connected layers
• F7-Fully Connected, 4096, ReLU
• F8-Fully Connected, 4030, SoftMax
• The two layers (F7 and F8) are fully connected and are able to capture correlations between features captured in distant parts of the face image, e.g., the position and shape of the eyes and the mouth.
• The output of F7 is used as the raw face representation feature vector.
• The output of the last fully-connected layer is fed to a K-way SoftMax which produces a distribution over the class labels.
• The network uses dropout (apparently only after the first fully connected layer).
• Features are normalized to decrease the sensitivity to illumination differences (probably the 4096-dimensional F7 output); a sketch follows this list.
• Each component is divided by its maximum value across a training set. Additionally, the whole vector is L2-normalized. The goal of this step is to make the network less sensitive to illumination changes.
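A minimal NumPy sketch of this normalization, assuming the per-component maxima are computed over the training set's F7 features; the small epsilon guarding against division by zero is our addition.

```python
import numpy as np

def normalize_features(train_feats, feats, eps=1e-10):
    """train_feats: (N, 4096) training-set features; feats: (M, 4096) to normalize."""
    max_per_dim = train_feats.max(axis=0) + eps       # per-component training max
    f = feats / max_per_dim                           # scale each component
    return f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)  # L2-normalize

train = np.abs(np.random.randn(1000, 4096))   # stand-in F7 outputs (ReLU: non-negative)
probe = np.abs(np.random.randn(2, 4096))
f = normalize_features(train, probe)
print(np.linalg.norm(f, axis=1))              # ~[1., 1.]
```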
##### Training

• The network receives images, each showing a face, and is trained on SFC as a multi-class classification problem using a GPU-based engine, implementing standard back-propagation on feed-forward nets by stochastic gradient descent (SGD). A generic sketch of this setup follows.

• The net includes more than 120 million parameters and took three days to train for roughly 15 epochs.
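A generic PyTorch sketch of one SGD step of this setup; `deepface_net` is a stand-in for the full C1..F8 stack, and the hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

deepface_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 152 * 152, 4030))  # stand-in
opt = torch.optim.SGD(deepface_net.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()                  # softmax + negative log-likelihood

images = torch.randn(8, 3, 152, 152)             # dummy mini-batch of face crops
labels = torch.randint(0, 4030, (8,))            # one of the SFC identities each
opt.zero_grad()
loss = loss_fn(deepface_net(images), labels)     # multi-class classification loss
loss.backward()                                  # standard back-propagation
opt.step()
```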

##### Face verification metrics
• To decide whether two face images show the same person, the authors try three different methods.

• Each of these relies on the vector extracted by the first fully connected layer in the network.

• Let these vectors be $f_1$ (image 1) and $f_2$ (image 2). The methods are then:

1. Inner product between $f_1$ (image 1) and $f_2$ (image 2). The classification (same person/not same person) is then done by a simple threshold.

2. Weighted chi-squared $(\chi^2)$ distance. Per vector component $i$: $w_i \frac{(f_1[i] - f_2[i])^2}{f_1[i] + f_2[i]}$. The vector of these terms is then fed into an SVM.

3. Siamese network. A weighted absolute distance between $f_1$ and $f_2$ is calculated as $d(f_1, f_2) = \sum_i \alpha_i \, |f_1[i] - f_2[i]|$: each component is weighted by a learned weight and the components are summed. If the result is above a threshold, the faces are considered to show the same person. The $\alpha_i$ are trained by standard cross-entropy loss and back-propagation of the error. (A sketch of all three scores follows.)
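A minimal NumPy sketch of the three scores on F7 feature vectors; the weights, threshold, and the downstream SVM / cross-entropy training are placeholders, not the paper's learned values.

```python
import numpy as np

def inner_product(f1, f2):
    return f1 @ f2                                    # method 1: threshold this score

def chi2_terms(f1, f2, eps=1e-10):
    # Method 2: per-component chi^2 terms; the resulting vector goes to a linear SVM,
    # which effectively learns the weights w_i.
    return (f1 - f2) ** 2 / (f1 + f2 + eps)

def siamese_distance(f1, f2, alpha):
    # Method 3: weighted absolute difference with learned weights alpha_i.
    return np.sum(alpha * np.abs(f1 - f2))

f1, f2 = np.random.rand(4096), np.random.rand(4096)   # stand-in normalized features
alpha = np.ones(4096) / 4096                           # placeholder learned weights
print(inner_product(f1, f2), chi2_terms(f1, f2).shape, siamese_distance(f1, f2, alpha))
```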

### Results

The network was trained on the Social Face Classification (SFC) dataset. That is a Facebook-internal dataset with 4.4 million faces of 4,030 people, each with 800 to 1,200 faces, where the most recent 5% of each identity's face images are held out for testing.

• LFW dataset: 13,323 web photos of 5,749 celebrities which are divided into 6,000 face pairs in 10 splits.

• It is used to validate the model trained on SFC.
• The proposed method reaches an accuracy of 97% with a single model and 97.35% with an ensemble on the LFW dataset.

• Face recognition (“which person is shown in the image”) (apparently, they retrained the whole model on LFW for this task?):

• Simple SVM with LBP features (i.e., not their network): 91.4% mean accuracy.
• With frontalization and 2D alignment: no value reported.
• No frontalization (only 2D alignment): 94.3% mean accuracy.
• No frontalization, no 2D alignment: 87.9% mean accuracy.
• Face verification (two images -> same/not same person) (apparently also trained on LFW? unclear):

• Method 1 (inner product + threshold): 95.92% mean accuracy.
• Method 2 ($\chi^2$ vector + SVM): 97.00% mean accuracy.
• Method 3 (Siamese): apparently 96.17% accuracy alone, and 97.25% when used in an ensemble with other methods (under a particular training schedule using the SFC dataset).

• YTF dataset (YouTube video frames): 3,425 YouTube videos of 1,595 subjects, divided into 5,000 video pairs and 10 splits; used to evaluate video-level face verification.

• 92.5% accuracy on YTF.

#### Reference

[1] Taigman, Y., Yang, M., Ranzato, M. A., & Wolf, L. (2014). DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1701-1708).

All results and images are directly taken from the reference paper, for the purpose of better understanding.