Current Issue
2023, vol. 20, no. 6, pp. 753-782, doi: 10.1007/s11633-022-1364-x
Abstract:
Lung cancer is the leading cause of cancer-related deaths worldwide. Medical imaging technologies such as computed tomography (CT) and positron emission tomography (PET) are routinely used for non-invasive lung cancer diagnosis. In clinical practice, physicians investigate tumor characteristics such as size, shape and location from CT and PET images to make decisions. Recently, scientists have proposed various computational image features that can capture more information than is directly perceivable by the human eye, which has promoted the rise of radiomics. Radiomics is a research field on the conversion of medical images into high-dimensional features with data-driven methods to support subsequent data mining for better clinical decision support. Radiomic analysis has four major steps: image preprocessing, tumor segmentation, feature extraction and clinical prediction. Machine learning, notably deep learning, facilitates the development and application of radiomic methods. Various radiomic methods have been proposed recently, such as the construction of radiomic signatures, tumor habitat analysis, cluster pattern characterization and end-to-end prediction of tumor properties. These methods have been applied in many studies on lung cancer diagnosis, treatment and monitoring, shedding light on future non-invasive evaluations of nodule malignancy, histological subtypes, genomic properties and treatment responses. In this review, we summarize and categorize studies on the general workflow, methods for clinical prediction and clinical applications of machine learning in lung cancer radiomics, introduce some commonly used software tools, and discuss the limitations of current methods and possible future directions.
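The feature-extraction step can be illustrated with a minimal sketch. The snippet below is a toy example, not taken from any cited study: it computes a few first-order radiomic features (mean, variance, skewness, histogram entropy) from a CT slice restricted to a binary tumor mask; the array names and bin count are illustrative assumptions.

```python
import numpy as np

def first_order_features(ct_slice: np.ndarray, mask: np.ndarray, bins: int = 32) -> dict:
    """Toy first-order radiomic features over the masked tumor region."""
    voxels = ct_slice[mask > 0].astype(np.float64)
    mean, std = voxels.mean(), voxels.std()
    skewness = ((voxels - mean) ** 3).mean() / (std ** 3 + 1e-12)
    # histogram-based entropy of the intensity distribution inside the mask
    hist, _ = np.histogram(voxels, bins=bins)
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {"mean": mean, "variance": std ** 2, "skewness": skewness, "entropy": entropy}

# usage with synthetic data (fake CT intensities in HU, square ROI)
img = np.random.normal(-300, 80, size=(64, 64))
roi = np.zeros((64, 64), dtype=np.uint8); roi[20:40, 20:40] = 1
print(first_order_features(img, roi))
```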
2023, vol. 20, no. 6, pp. 783-798, doi: 10.1007/s11633-022-1399-z
Abstract:
Most modern consumer-grade cameras are equipped with a rolling shutter mechanism, which is becoming increasingly important in computer vision, robotics and autonomous driving applications. However, its temporally dynamic imaging nature leads to the rolling shutter effect, which manifests as geometric distortion. Over the years, researchers have made significant progress in developing tractable rolling shutter models, optimization methods and learning approaches aimed at removing geometric distortion and improving visual quality. In this survey, we review recent advances in rolling shutter cameras from two perspectives: motion modeling and deep learning. To the best of our knowledge, this is the first comprehensive survey of rolling shutter cameras. In the part on rolling shutter motion modeling and optimization, the principles of various rolling shutter motion models are elaborated and their typical applications are summarized. Then, the applications of deep learning in rolling shutter-based image processing are presented. Finally, we conclude this survey with a discussion of future research directions.
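A minimal sketch of the row-wise exposure that underlies most rolling shutter motion models: each image row is read out at a slightly later time, so under motion each row observes a different geometry. The toy simulation below (illustrative assumptions: constant horizontal velocity measured in pixels per row, wrap-around resampling) warps a global-shutter frame into its rolling shutter counterpart.

```python
import numpy as np

def simulate_rolling_shutter(frame: np.ndarray, vx: float) -> np.ndarray:
    """Row r is read out at time r * t_row, so a horizontally moving scene
    is displaced by vx * r pixels in that row -- the classic skew effect."""
    h, w = frame.shape[:2]
    out = np.empty_like(frame)
    cols = np.arange(w)
    for r in range(h):
        shift = int(round(vx * r))             # displacement accumulated by row r's readout time
        out[r] = frame[r, (cols - shift) % w]  # wrap-around for simplicity
    return out

# usage: a vertical bar becomes slanted under horizontal motion
img = np.zeros((100, 100)); img[:, 48:52] = 1.0
skewed = simulate_rolling_shutter(img, vx=0.2)
```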
2023, vol. 20, no. 6, pp. 799-821, doi: 10.1007/s11633-022-1400-x
Abstract:
Photorealistic rendering of the virtual world is an important and classic problem in computer graphics. With the development of GPU hardware and continuous research on computer graphics, representing and rendering virtual scenes has become easier and more efficient. However, there are still unresolved challenges in efficiently rendering global illumination effects. At the same time, machine learning and computer vision provide real-world image analysis and synthesis methods that can be exploited by computer graphics rendering pipelines. Deep learning-enhanced rendering combines techniques from deep learning and computer vision with the traditional graphics rendering pipeline to enhance existing rasterization or Monte Carlo integration renderers. This state-of-the-art report summarizes recent studies of deep learning-enhanced rendering in the computer graphics community. Specifically, we focus on works in which the renderer is represented by neural networks, whether the scene itself is represented by neural networks or by traditional scene files. These works target either general scenes or specific scenes, distinguished by whether the network must be retrained for new scenes.
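Since the report targets enhancements of Monte Carlo integration renderers, a one-integral sketch helps fix ideas: rendering estimates integrals of the form ∫ f(x) dx by averaging random samples, and the estimator's variance, visible as image noise, shrinks only as O(1/N); learned denoisers and learned sampling try to sidestep exactly this slow convergence. The integrand below is an arbitrary stand-in, not taken from the report.

```python
import numpy as np

def mc_estimate(f, n: int, rng: np.random.Generator) -> float:
    """Plain Monte Carlo estimate of the integral of f over [0, 1]."""
    x = rng.random(n)
    return float(f(x).mean())

f = lambda x: np.sin(np.pi * x) ** 2          # ground-truth integral = 0.5
rng = np.random.default_rng(0)
for n in (16, 256, 4096):
    est = mc_estimate(f, n, rng)
    print(f"N={n:5d}  estimate={est:.4f}  abs.error={abs(est - 0.5):.4f}")
```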
2023, vol. 20, no. 6, pp. 822-836, doi: 10.1007/s11633-023-1466-0
Abstract:
While recent years have witnessed a dramatic upsurge in exploiting deep neural networks for image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspectives of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block that incorporates the local modeling ability of the residual convolutional layer and the non-local modeling ability of the Swin Transformer block, and then plug it as the main building block into the widely used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model that takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression and processed camera sensor noise) as well as resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet.
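A rough PyTorch sketch of the idea behind a swin-conv-style block: split the channels into a local residual-convolution branch and an attention branch, then fuse. For brevity, plain global multi-head attention stands in for the shifted-window (Swin) attention used in the paper; the dimensions and layer choices are our assumptions, not the released SCUNet code.

```python
import torch
import torch.nn as nn

class SwinConvBlockSketch(nn.Module):
    """Half the channels take a residual conv branch (local modeling),
    the other half take self-attention (non-local modeling); outputs are fused."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        half = dim // 2
        self.proj_in = nn.Conv2d(dim, dim, 1)
        self.conv = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1),
        )
        self.norm = nn.LayerNorm(half)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xc, xa = self.proj_in(x).chunk(2, dim=1)
        xc = xc + self.conv(xc)                             # local branch
        t = xa.flatten(2).transpose(1, 2)                   # (B, H*W, C/2) tokens
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]   # non-local branch
        xa = t.transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.proj_out(torch.cat([xc, xa], dim=1))

y = SwinConvBlockSketch()(torch.randn(1, 64, 32, 32))  # shape-preserving: (1, 64, 32, 32)
```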
2023, vol. 20, no. 6, pp. 837-854, doi: 10.1007/s11633-023-1458-0
Abstract:
This paper addresses the problem of supervised monocular depth estimation. We start with a meticulous pilot study demonstrating that long-range correlation is essential for accurate depth estimation. Moreover, the Transformer and convolution are good at long-range and close-range depth estimation, respectively. Therefore, we propose to adopt a parallel encoder architecture consisting of a Transformer branch and a convolution branch. The former models global context with an effective attention mechanism, while the latter preserves local information, as the Transformer lacks the spatial inductive bias needed to model such content. However, independent branches lead to a shortage of connections between features. To bridge this gap, we design a hierarchical aggregation and heterogeneous interaction module to enhance the Transformer features and model the affinity between the heterogeneous features in a set-to-set translation manner. Owing to the prohibitive memory cost of global attention on high-resolution feature maps, we adopt a deformable scheme to reduce the complexity. Extensive experiments on the KITTI, NYU and SUN RGB-D datasets demonstrate that our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods by prominent margins. The effectiveness of each proposed module is evaluated through meticulous and intensive ablation studies.
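The parallel-encoder idea can be sketched as follows: a convolution branch keeps local detail while a Transformer branch models long-range context, and a small fusion layer relates the two heterogeneous feature sets. This is a schematic stand-in only; DepthFormer's actual hierarchical aggregation and deformable interaction module are considerably more involved.

```python
import torch
import torch.nn as nn

class ParallelEncoderSketch(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv_stem = nn.Sequential(                    # local branch: spatial inductive bias
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        self.patch_embed = nn.Conv2d(3, dim, 4, stride=4)  # tokens at the same 1/4 scale
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)             # simple fusion stand-in

    def forward(self, x):
        b = x.shape[0]
        local = self.conv_stem(x)                           # (B, D, H/4, W/4)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, HW/16, D)
        glob = self.transformer(t).transpose(1, 2).reshape(b, -1, *local.shape[-2:])
        return self.fuse(torch.cat([local, glob], dim=1))   # fused depth features

feats = ParallelEncoderSketch()(torch.randn(1, 3, 64, 64))  # (1, 64, 16, 16)
```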
2023, vol. 20, no. 6, pp. 855-871, doi: 10.1007/s11633-023-1432-x
Abstract:
Cancelable biometrics are a group of techniques that intentionally transform an input biometric into an irreversible feature using a transformation function and, usually, a key, in order to provide security and privacy in biometric recognition systems. This transformation is repeatable, enabling subsequent biometric comparisons. This paper introduces a new idea to be exploited as a transformation function for cancelable biometrics, aimed at protecting templates against iterative optimization attacks. Our proposed scheme is based on time-varying keys (random biometrics in our case) and morphing transformations. An experimental implementation of the proposed scheme is given for face biometrics. The results confirm that the proposed approach is able to withstand leakage attacks while improving recognition performance.
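As a numerical toy of the morphing-plus-time-varying-key idea (the paper morphs at the face-image level; the feature-level convex combination below is our simplified stand-in, with all names and parameters hypothetical): a fresh random "key biometric" is drawn at each enrollment and morphed with the genuine template, so the stored template can be revoked and reissued simply by changing the key.

```python
import numpy as np

def protect(template: np.ndarray, key: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Morph the genuine template with a random key template (convex combination)."""
    mixed = alpha * template + (1.0 - alpha) * key
    return mixed / np.linalg.norm(mixed)

def compare(protected_enrolled: np.ndarray, probe: np.ndarray, key: np.ndarray) -> float:
    """Apply the same repeatable transform to the probe, then cosine similarity."""
    return float(protected_enrolled @ protect(probe, key))

rng = np.random.default_rng(1)
template = rng.normal(size=512); template /= np.linalg.norm(template)
probe = template + 0.1 * rng.normal(size=512)            # same subject, some noise
key = rng.normal(size=512); key /= np.linalg.norm(key)   # time-varying: redrawn per enrollment
enrolled = protect(template, key)
print(compare(enrolled, probe, key))                     # high for genuine pairs
```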
2023, vol. 20, no. 6, pp. 872-883, doi: 10.1007/s11633-023-1430-z
Abstract:
3D shape recognition has drawn much attention in recent years, and among existing approaches, view-based methods perform best. However, current multi-view methods are almost all fully supervised, and their pretraining models are almost all based on ImageNet. Although the pretraining results on ImageNet are quite impressive, there is still a significant discrepancy between multi-view datasets and ImageNet: multi-view datasets naturally retain rich 3D information. In addition, large-scale datasets such as ImageNet require considerable cleaning and annotation work, so it is difficult to build a second such dataset. In contrast, unsupervised learning methods can learn general feature representations without any extra annotation. To this end, we propose a three-stage unsupervised joint pretraining model. Specifically, we decouple the final representations into three fine-grained representations: data augmentation is utilized to obtain pixel-level representations within each view; we then enhance spatially invariant features at the view level; finally, we exploit global information at the shape level through a novel extract-and-swap module. Experimental results demonstrate that the proposed method gains significantly in 3D object classification and retrieval tasks and generalizes to cross-dataset tasks.
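A sketch of the kind of view-level contrastive objective such unsupervised pretraining typically relies on: embeddings of two views of the same shape are pulled together while other shapes in the batch are pushed away. The NT-Xent/InfoNCE loss below is the standard formulation, not the paper's exact three-stage objective.

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """z1[i] and z2[i] embed two views of shape i; other rows serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau              # (B, B) cosine similarities
    targets = torch.arange(z1.shape[0])     # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```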
2023, vol. 20, no. 6, pp. 884-896, doi: 10.1007/s11633-022-1403-7
Abstract:
Score-based multimodal biometric fusion has been shown to be successful in addressing the vulnerability of unimodal techniques to attack and their poor performance on low-quality data. However, it remains difficult to unify the meaning of heterogeneous scores effectively. Aside from the matching scores themselves, the ranking information they contain has been undervalued in previous studies. This study concentrates on matching scores and their ranking information and proposes the ranking partition collision (RPC) theory from the standpoint of the worth of scores. To meet both forensic and judicial needs, this paper proposes a method that employs a neural network to fuse biometrics at the score level. In addition, this paper constructs a virtual homologous dataset and conducts experiments on it. Experimental results demonstrate that the proposed method achieves an accuracy of 100% in both mAP and Rank-1. To show the efficiency of the proposed method in practical applications, this work carries out further experiments on real-world data. The results show that the proposed approach maintains a Rank-1 accuracy of 99.2% on a million-scale database. It offers a novel approach to fusion at the score level.
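A minimal sketch of neural score-level fusion (the architecture and the min-max normalization are our assumptions, not the RPC method itself): matching scores from each modality are normalized to a common range, and a small MLP learns the fused genuine/impostor decision.

```python
import torch
import torch.nn as nn

def minmax(scores: torch.Tensor) -> torch.Tensor:
    """Map each modality's scores to [0, 1] so heterogeneous scales become comparable."""
    lo = scores.min(dim=0, keepdim=True).values
    hi = scores.max(dim=0, keepdim=True).values
    return (scores - lo) / (hi - lo + 1e-8)

fusion = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))  # 3 modalities -> 1 logit

scores = torch.rand(32, 3)                  # batch of (face, fingerprint, iris) scores
labels = torch.randint(0, 2, (32, 1)).float()
logit = fusion(minmax(scores))
loss = nn.functional.binary_cross_entropy_with_logits(logit, labels)
```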
2023, vol. 20, no. 6, pp. 897-908, doi: 10.1007/s11633-022-1392-6
Abstract:
Automatic segmentation and classification of brain tumors are of great importance to clinical treatment. However, they are challenging due to the varied and small morphology of the tumors. In this paper, we propose a multitask multiscale residual attention network (MMRAN) to simultaneously segment and classify brain tumors accurately. The proposed MMRAN is based on U-Net, with a parallel branch added at the end of the encoder as the classification network. First, we propose a novel multiscale residual attention module (MRAM) that better aggregates contextual features and combines channel attention with spatial attention, and we add it to the shared parameter layers of MMRAN. Second, we propose a dynamic weight training method that improves model performance while minimizing the need for multiple experiments to determine the optimal weight for each task. Finally, prior knowledge of brain tumors is added to the postprocessing of segmented images to further improve segmentation accuracy. We evaluated MMRAN on a brain tumor dataset containing meningioma, glioma and pituitary tumors. In terms of segmentation performance, our method achieves Dice, Hausdorff distance (HD), mean intersection over union (MIoU) and mean pixel accuracy (MPA) values of 80.03%, 6.649 mm, 84.38% and 89.41%, respectively. In terms of classification performance, our method achieves accuracy, recall, precision and F1-score of 89.87%, 90.44%, 88.56% and 89.49%, respectively. Compared with other networks, MMRAN performs better in segmentation and classification, which significantly aids medical professionals in brain tumor management. The code and dataset are available at https://github.com/linkenfaqiu/MMRAN.
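One common way to realize dynamic task weighting in a segmentation-plus-classification network is the learned homoscedastic-uncertainty weighting of Kendall et al.; the sketch below uses it as a plausible stand-in, since the abstract does not spell out MMRAN's exact scheme.

```python
import torch
import torch.nn as nn

class DynamicWeightedLoss(nn.Module):
    """total = sum_i exp(-s_i) * L_i + s_i, with log-variances s_i learned jointly,
    so the network balances segmentation and classification without a grid search."""
    def __init__(self, n_tasks: int = 2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):
        total = 0.0
        for s, loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * loss + s
        return total

criterion = DynamicWeightedLoss()
total = criterion([torch.tensor(0.8), torch.tensor(1.5)])  # (seg_loss, cls_loss)
```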
2023, vol. 20, no. 6, pp. 909-922, doi: 10.1007/s11633-022-1385-5
Abstract:
To achieve automatic, fast and accurate severity classification of bulbar conjunctival hyperemia, we propose a novel prior knowledge-based framework called the mask distillation network (MDN). The proposed MDN consists of a segmentation network and a classification network with teacher-student branches. The segmentation network generates a bulbar conjunctival mask, and the classification network divides the severity of bulbar conjunctival hyperemia into four grades. In the classification network, we feed the original image and the masked image into the student and teacher branches, respectively, and an attention consistency loss and a classification consistency loss are used to enforce a similar learning mode across the two branches. This design of “different input but same output”, named mask distillation (MD), introduces the regional prior knowledge that bulbar conjunctival hyperemia severity classification depends only on the bulbar conjunctiva region. Extensive experiments on 5117 anterior segment images demonstrate the effectiveness of mask distillation: 1) The accuracy of the MDN student branch is 3.5% higher than that of the single optimal baseline network and 2% higher than that of the baseline network combination. 2) In the test phase, only the student branch is needed and no additional segmentation network is required; the framework takes only 0.003 s to classify a single image, the fastest among all compared methods. 3) Compared with a single baseline network, the attention of both teacher and student branches in the MDN is visibly improved.
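The “different input but same output” design can be sketched as a two-branch training step: the teacher sees only the bulbar conjunctiva region (image times mask) while the student sees the full image, and a consistency loss pulls the student's predictions toward the teacher's. The network, loss form and loop details below are illustrative assumptions; the paper additionally aligns attention maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = lambda: nn.Sequential(                       # tiny stand-in classifier
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 4))                     # 4 hyperemia grades
teacher, student = backbone(), backbone()

def mask_distill_step(img, mask, labels):
    t_logits = teacher(img * mask)                      # teacher: masked conjunctiva only
    s_logits = student(img)                             # student: original image
    cls_loss = F.cross_entropy(t_logits, labels) + F.cross_entropy(s_logits, labels)
    consist = F.kl_div(F.log_softmax(s_logits, 1),      # classification consistency
                       F.softmax(t_logits, 1).detach(), reduction="batchmean")
    return cls_loss + consist

loss = mask_distill_step(torch.randn(2, 3, 64, 64),
                         torch.ones(2, 1, 64, 64), torch.tensor([0, 2]))
```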
2023, vol. 20, no. 6, pp. 923-936, doi: 10.1007/s11633-022-1368-6
Abstract:
Weakly supervised object localization mines pixel-level location information from image-level annotations. Traditional weakly supervised object localization approaches exploit the last convolutional feature map to locate discriminative regions with abundant semantics. Although this shows the localization ability of the classification network, the process ignores shallow edge and texture features and thus cannot meet the requirement of object integrity in the localization task. Therefore, we propose a novel shallow feature-driven dual-edges localization (DEL) network, in which two kinds of shallow edges are utilized to mine entire target object regions. Specifically, we design an edge feature mining (EFM) module that extracts shallow edge details through a similarity measurement between the original class activation map and the shallow features. We exploit the EFM module to extract two kinds of edges, namely the edge of the shallow feature map and the edge of the shallow gradients, to enhance the edge details of the target object in the last convolutional feature map. The whole process operates at the inference stage, incurring no extra training cost. Extensive experiments on the ILSVRC and CUB-200-2011 datasets show that the DEL method obtains consistent and substantial performance improvements over existing methods.
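The similarity measurement at the heart of such an edge-mining step can be sketched as follows: each shallow-feature channel is weighted by its cosine similarity to the upsampled class activation map, so channels carrying edges of the localized object are amplified. Tensor shapes and the aggregation rule are our assumptions; the paper applies this idea to two kinds of shallow edges.

```python
import torch
import torch.nn.functional as F

def edge_feature_mining(cam: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
    """cam: (B, 1, h, w) coarse class activation map; shallow: (B, C, H, W) early features.
    Returns an edge-enhanced localization map of size (B, 1, H, W)."""
    cam_up = F.interpolate(cam, size=shallow.shape[-2:], mode="bilinear", align_corners=False)
    # per-channel similarity between each shallow feature map and the CAM
    w = F.cosine_similarity(shallow.flatten(2), cam_up.flatten(2), dim=2)      # (B, C)
    refined = torch.relu((w[:, :, None, None] * shallow).sum(dim=1, keepdim=True))
    return refined / (refined.amax(dim=(2, 3), keepdim=True) + 1e-6)           # normalize to [0, 1]

out = edge_feature_mining(torch.rand(1, 1, 7, 7), torch.rand(1, 64, 56, 56))   # (1, 1, 56, 56)
```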
2023, vol. 20, no. 6, pp. 937-951, doi: 10.1007/s11633-022-1357-9
Abstract:
Automated machine learning (AutoML) pruning methods aim to automatically search for a pruning strategy that reduces the computational complexity of deep convolutional neural networks (deep CNNs). However, previous work has found that the results of many AutoML pruning methods cannot even surpass those of uniform pruning. In this paper, we show that this ineffectiveness of AutoML pruning is caused by unfull and unfair training of the supernet. A deep supernet suffers from unfull training because it contains too many candidates. To overcome the unfull training, a stage-wise pruning (SWP) method is proposed, which splits a deep supernet into several stage-wise supernets to reduce the number of candidates and utilizes inplace distillation to supervise the stage training. Besides, a wide supernet suffers from unfair training because the sampling probability of each channel is unequal. Therefore, the fullnet and the tinynet are sampled in each training iteration to ensure that every channel is fully trained. Remarkably, the proxy performance of subnets trained with SWP is closer to their actual performance than in most previous AutoML pruning work. Furthermore, experiments show that SWP achieves state-of-the-art results on both CIFAR-10 and ImageNet under the mobile setting.
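The "always sample the fullnet and the tinynet" rule with inplace distillation can be sketched on a toy slimmable layer: every iteration trains the full width on the hard labels and distills randomly sampled narrower subnets (always including the minimum width) from the fullnet's own soft predictions. This is a schematic of the training signal only, under our own toy network, without the stage-wise supernet splitting.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlimmableNet(nn.Module):
    """Toy net whose hidden width can be sliced at run time."""
    def __init__(self, d_in=32, d_hidden=64, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, width=1.0):
        k = max(1, int(self.fc1.out_features * width))      # active channel count
        h = F.relu(F.linear(x, self.fc1.weight[:k], self.fc1.bias[:k]))
        return F.linear(h, self.fc2.weight[:, :k], self.fc2.bias)

net, min_w = SlimmableNet(), 0.25
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

full_logits = net(x, width=1.0)                        # fullnet: supervised by labels
loss = F.cross_entropy(full_logits, y)
soft = F.softmax(full_logits, 1).detach()              # inplace distillation targets
for w in [min_w] + [random.uniform(min_w, 1.0) for _ in range(2)]:
    sub_logits = net(x, width=w)                       # tinynet plus random subnets
    loss = loss + F.kl_div(F.log_softmax(sub_logits, 1), soft, reduction="batchmean")
loss.backward()
```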