YuNet: A Tiny Millisecond-level Face Detector

Great progress has been made toward accurate face detection in recent years. However, heavy models and expensive computation costs make it difficult to deploy many detectors on mobile and embedded devices, where model size and latency are highly constrained. This paper presents a millisecond-level anchor-free face detector, YuNet, which is specifically designed for edge devices. Several key contributions improve the efficiency-accuracy trade-off. First, the paper analyzes influential state-of-the-art face detectors of recent years and summarizes rules for reducing model size. Then, a lightweight face detector, YuNet, is introduced. The detector contains a tiny and efficient feature extraction backbone and a simplified pyramid feature fusion neck. YuNet achieves the best trade-off between accuracy and speed: it has only 75,856 parameters, less than 1/5 of other small-size detectors. In addition, a training strategy is presented for the tiny face detector, which effectively trains models with the same face-size distribution as the training set. The proposed YuNet achieves 81.1% mAP (single-scale) on the WIDER FACE validation hard track with high inference efficiency (Intel i7-12700K: 1.6 ms per frame at 320×320). Because of these advantages, the repository for YuNet and its predecessors has been popular on GitHub, gaining more than 11K stars.

From Springer





Face detection has been an attractive topic in computer vision for decades. It is a prerequisite step for many face-related applications such as face recognition, face beautification, face alignment and face tracking. Given an image, face detection locates the face regions with bounding boxes. Many methods have been proposed to improve face detection performance, from early hand-crafted features to current CNN-based features.


Face detection is less challenging than generic object detection, and accuracy has reached saturation on the challenging WIDER FACE benchmark. Some may therefore consider face detection a solved problem. However, it is not: the top-ranked methods all use large pre-trained backbone networks, complex feature enhancement modules and heavy test-time augmentations (TTAs) for better ranks. For example, one of the best detectors, MogFace, achieves state-of-the-art accuracy with 711M parameters and 808 GFLOPs (for VGA images). This impressive accuracy comes at the cost of considerable storage and computation resources.


However, in real-world applications, face detection is widely deployed on edge devices such as cell phones, service robots, surveillance cameras and Internet of things (IoT) devices. These devices have limited storage and computing capability due to their cost. In addition, in many applications only a few prominent faces need to be detected, and tiny faces in the background are not needed. Even when deployed on a central server, a fast and efficient detector can save considerable energy and allow the server to handle more data concurrently. Compared with a huge face detector that improves the average precision (AP) only slightly on some benchmarks, an efficient tiny detector is more urgently needed.


The backbone network in a face detector is essential for performance. Some popular backbone networks, such as VGG-16 from the VGGNet series, ResNet-50/101/152 from the ResNet series and MobileNet, were originally designed for image classification on ImageNet. As shown in Fig. 1, face detection differs from image classification, which takes the output of the deepest layer as the feature vector. To handle objects of different scales, feature maps from different layers are employed for detection. Large faces are normally detected from a deeper feature map and, due to their richer information, are easier to detect than smaller faces. This strongly suggests that the backbone of a face detector should focus on small faces.



Fig. 1 To handle faces of different sizes, large faces are normally detected from a deeper feature map and small faces from a shallower feature map, since a pixel on different feature maps has a different receptive field.
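The size-to-level mapping described above can be sketched as follows. The strides, base size and doubling rule here are illustrative assumptions for a typical three-level pyramid, not YuNet's exact configuration.

```python
# Illustrative sketch: assign a face to a detection pyramid level by its size.
# STRIDES and BASE_SIZE are assumed values, not YuNet's actual settings.

STRIDES = [8, 16, 32]   # downsampling factor of each feature map
BASE_SIZE = 16          # largest face size handled by the stride-8 map is 2x this (assumed)

def assign_level(face_size: float) -> int:
    """Return the index of the feature map that should detect a face.

    Larger faces go to deeper (more downsampled) feature maps, since a
    pixel there has a larger receptive field.
    """
    for level, stride in enumerate(STRIDES):
        # Each level covers faces up to twice the previous level's range.
        if face_size <= BASE_SIZE * (2 ** (level + 1)):
            return level
    return len(STRIDES) - 1  # very large faces fall to the deepest map

print(assign_level(20))   # -> 0 (stride-8 map: small faces)
print(assign_level(50))   # -> 1 (stride-16 map)
print(assign_level(200))  # -> 2 (stride-32 map: large faces)
```

Under this kind of rule, the shallow high-resolution map carries most of the detection load for datasets dominated by small faces, which is why a tiny backbone can still perform well if it preserves shallow features.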


The distribution of face sizes should also be noted. In the WIDER FACE dataset, most faces are small, less than 20 pixels. This is similar in many face-related applications. Many data augmentation operations, especially random cropping, change the distribution of face sizes. If a model is trained on a dataset with a different distribution (distributions A, B and C in Fig. 2), the AP drops noticeably; the further the distribution is from the original one, the lower the AP will be.



Fig. 2 If training a face detector with datasets of different distributions (A in red, B in green, and C in blue), the AP tends to decrease as the distribution moves further away from the original distribution.
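One way to keep the post-augmentation distribution aligned with the original, as the observation above suggests, is to sample a target face size from the original distribution and choose the crop scale accordingly. The bins, probabilities and function below are illustrative assumptions, not the paper's exact sampling strategy.

```python
import random

# Illustrative sketch of a distribution-preserving sampling strategy:
# rather than cropping at an arbitrary scale (which shifts face sizes
# toward distributions A/B/C in Fig. 2), sample a target size from an
# assumed original training-set distribution and scale the crop so a
# chosen face matches it. SIZE_BINS and BIN_PROBS are assumed values.

SIZE_BINS = [8, 16, 32, 64, 128]             # face-size bin centers (pixels)
BIN_PROBS = [0.35, 0.30, 0.20, 0.10, 0.05]   # skewed toward small faces, WIDER-FACE-like

def sample_crop_scale(face_size: float, rng: random.Random) -> float:
    """Pick a scale factor so that the face's size after cropping and
    resizing follows the original distribution rather than a shifted one."""
    target = rng.choices(SIZE_BINS, weights=BIN_PROBS, k=1)[0]
    return target / face_size

rng = random.Random(0)
scale = sample_crop_scale(40.0, rng)
print(round(40.0 * scale))  # the resized face lands on one of SIZE_BINS
```

Applied over a whole epoch, this keeps the empirical face-size histogram of the augmented data close to the original one, which is the condition under which Fig. 2 shows the AP is highest.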


A tiny millisecond-level face detector, YuNet, is designed and presented in the remainder of the paper. The contributions of the paper are as follows.

1) Based on the authors' understanding of face detection, this paper designs a tiny face detector with a very limited number of parameters, very low latency and promising accuracy.

2) This paper proposes a data sampling strategy for model training. It can noticeably improve the accuracy of a deep detector, especially a lightweight one.

3) The proposed YuNet achieves an AP of 81.1% on the WIDER FACE validation hard set, making it, to the authors' knowledge, the best tiny face detector, and its repository has gained more than 11K stars on GitHub for its effectiveness.





Wei Wu, Hanyang Peng, Shiqi Yu


@article{yunet2023,
  author  = {Wei Wu and Hanyang Peng and Shiqi Yu},
  journal = {Machine Intelligence Research},
  title   = {YuNet: A Tiny Millisecond-level Face Detector},
  year    = {2023},
  volume  = {20},
  number  = {5},
  pages   = {656-665},
  doi     = {10.1007/s11633-023-1423-y}
}

Release Date: 2023-10-18