Pre-training in Medical Data: A Survey

Medical data refers to health-related information collected during patient care or as part of a clinical trial program. Such data fall into many categories, including clinical imaging data, bio-signal data, electronic health records (EHRs), and multi-modality medical data. With the development of deep neural networks over the last decade, the pre-training paradigm has become dominant because it significantly improves the performance of machine learning methods in data-limited scenarios. In recent years, studies of pre-training in the medical domain have made significant progress. To summarize these technological advancements, researchers from The University of Queensland and The University of Adelaide provide a comprehensive survey of recent advances in pre-training on several major types of medical data. The survey summarizes a large number of related publications and the existing benchmarks in the medical domain. In particular, it briefly describes how pre-training methods are applied to or developed for medical data. From a data-driven perspective, it examines the extensive use of pre-training in many medical scenarios. Moreover, based on its summary of recent pre-training studies, it identifies several challenges in this field to provide insights for future studies. The survey was published in the second issue of Machine Intelligence Research in 2023. The full text is open access.






Artificial intelligence (AI) has become a ubiquitous technology that shapes our lives. AI-based applications assist users in making decisions and thus influence their daily lives. These advances would not have been possible without the rapid development of deep learning (DL), and in particular the much wider adoption of convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention-based neural networks. These deep neural networks have been integrated into a wide range of research, including sub-fields such as computer vision (CV) and natural language processing (NLP).


Medical data analysis is one of the most important sub-fields of AI. It focuses on processing and analysing medical data from various modalities to extract essential information that helps physicians make precise decisions during diagnosis. Computer-aided systems are expected to become influential tools in health monitoring and disease diagnosis. Current studies have already succeeded in processing and analysing medical imaging, electronic health records (EHRs), bio-signals, and data consisting of multiple modalities. Hou et al. utilised a CNN to diagnose tumours in the early stages, allowing for early intervention and treatment planning that can greatly improve the patient's survival rate. A medicine recommendation system was developed to improve patient care by providing personalized recommendations based on electronic health records. Qiu et al. supported caregivers in identifying cardiac arrhythmias effectively and efficiently, saving more lives. Wang et al. utilised chest X-rays and the corresponding diagnosis reports to train a model for disease diagnosis, similarity search, and image regeneration.


Although existing works have achieved remarkable success, several of them found that data hunger is one of the primary challenges in applying deep neural networks (DNNs) to medical data. On the one hand, some kinds of medical data can be obtained easily, but annotating the collected data requires a substantial amount of labour and money. On the other hand, in many diagnosis tasks for rare or new diseases, the data are insufficient because cases are too rare to collect or because of privacy issues. Insufficient data makes it hard to train a satisfactory model, because it can cause overfitting and poor generalization. To address this issue, some large-scale datasets have been proposed to make it possible to train satisfactory models. However, constructing large-scale annotated datasets is labour-intensive and expensive, which makes it impractical in many cases.


Motivated by the way humans learn, researchers proposed pre-training to address the lack of annotated data. Human learners can pick up a new skill by building on knowledge they have already acquired. For example, learning to play tennis can help in learning badminton.


As summarized in [25], the pre-training technique is closely related to transfer learning and self-supervised learning. As one of the most critical milestones in solving the data-hunger issue, transfer learning techniques have explored how to use labelled data and leverage unlabelled data effectively. Transfer learning is a sub-field of machine learning inspired by the process of human learning. It acquires knowledge for the target domain by transferring information from the same or a related domain. The process of transfer learning consists of two steps, pre-training and fine-tuning. Pre-training learns universal feature representations, and the pre-trained model is then used in the downstream tasks, as Fig. 1 shows.
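The two steps can be illustrated with a deliberately small sketch. It is not taken from the survey: the logistic model, the toy source and target datasets, the learning rate, and the step counts are all assumptions chosen for illustration. A model is first pre-trained on a larger source task, and its weights are then used as the starting point for a few fine-tuning steps on a small target task.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(weights, data):
    """Mean binary cross-entropy of a logistic model on (x1, x2, label) rows."""
    total = 0.0
    for x1, x2, y in data:
        z = weights[0] * x1 + weights[1] * x2 + weights[2]
        signed = z if y == 1 else -z
        total += math.log1p(math.exp(-signed))
    return total / len(data)

def train(weights, data, steps, lr=0.5):
    """Full-batch gradient descent; returns a new weight list [w1, w2, bias]."""
    w = list(weights)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x1, x2, y in data:
            err = sigmoid(w[0] * x1 + w[1] * x2 + w[2]) - y
            grad[0] += err * x1
            grad[1] += err * x2
            grad[2] += err
        w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]
    return w

# Hypothetical source task: plenty of labelled points, label = 1 when x1 + x2 > 0.
grid = [-1.0, -0.5, 0.5, 1.0]
source = [(a, b, 1 if a + b > 0 else 0) for a in grid for b in grid if a + b != 0]

# Hypothetical target task: a related rule, but only four labelled points.
target = [(0.9, 0.3, 1), (-0.2, -0.7, 0), (0.4, 0.1, 1), (-0.6, 0.2, 0)]

zeros = [0.0, 0.0, 0.0]
pretrained = train(zeros, source, steps=200)    # step 1: pre-training
fine_tuned = train(pretrained, target, steps=5)  # step 2: fine-tuning
from_scratch = train(zeros, target, steps=5)     # baseline without pre-training

print("target loss, pre-trained start:", mean_loss(pretrained, target))
print("target loss, after fine-tuning:", mean_loss(fine_tuned, target))
print("target loss, trained from scratch:", mean_loss(from_scratch, target))
```

In this toy setting the source and target tasks share the same decision rule, so the pre-trained starting point is expected to help. That benefit should not be assumed when the two domains are unrelated.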




The recently emerging self-supervised learning is another pre-training paradigm that is attracting growing attention from researchers. This learning paradigm aims to extract rich knowledge from unlabelled data. Instead of relying on manual annotations, self-supervised learning derives the supervision signal from the data itself. At present, transfer learning and self-supervised learning are the two mainstream pre-training approaches. In this paper, we introduce these two approaches at a high level and explore their applications in the medical domain.
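As a rough illustration of how supervision can come from the data itself, the sketch below builds training pairs from an unlabelled one-dimensional signal by hiding the centre sample of each window and using the hidden value as the target. The function name `make_masked_pairs`, the `MASK` placeholder, and the example signal are hypothetical and are not taken from the survey. Masked reconstruction is only one of many possible pretext tasks.

```python
MASK = 0.0  # placeholder written over the hidden sample (an illustrative choice)

def make_masked_pairs(signal, window=3):
    """Turn an unlabelled 1-D signal into (masked_window, target) pairs."""
    centre = window // 2
    pairs = []
    for start in range(len(signal) - window + 1):
        chunk = list(signal[start:start + window])
        target = chunk[centre]   # the label comes from the data itself
        chunk[centre] = MASK     # hide the value the model must predict
        pairs.append((chunk, target))
    return pairs

# Hypothetical unlabelled recording; no manual annotation is involved.
unlabelled = [0.1, 0.4, 0.9, 0.3, 0.2]
pairs = make_masked_pairs(unlabelled)
for masked_window, target in pairs:
    print(masked_window, "->", target)
```

A model trained to predict the hidden values can then serve as the pre-trained starting point for a downstream task that has only a few annotated examples.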


Why pre-training?


The emergence of pre-training makes it possible to train an effective model efficiently from a small amount of labelled data. This section lists the reasons why pre-training is essential. Firstly, pre-training arose from the lack of information in data, which generally takes two forms: a lack of labels and a lack of data volume. A lack of data volume means that many types of data are too scarce to meet the needs of model training, such as data on rare diseases confined to particular regions. Pre-training can effectively compensate for this lack of information. Through pre-training, the model extracts clusters or latent features from the data, which gives it better generalization ability on the specific content.


Secondly, the utilization of pre-trained models can significantly accelerate the convergence process on downstream tasks. This is particularly beneficial in scenarios where computational resources are constrained.


Thirdly, in the past 20 years, with the rapid development of various industries and the arrival of high-performance hardware, large amounts of data have been generated daily in different industries, including the medical industry. However, the cost of manually annotating datasets increases exponentially. Supervised pre-training methods are therefore limited by the lack of annotated data. Self-supervised pre-training allows us to leverage abundant unlabelled data to obtain a good initialization before the downstream tasks.


Also, with the recent advances in self-supervised learning, many studies have shown that self-supervised pre-training can alleviate the effect of training on class-imbalanced datasets.


There are many applications of pre-training in the medical field. Pre-training was first applied in the medical domain in 2014 by Schlegl et al., who proposed a semi-supervised learning approach to improve lung tissue classification. Specifically, they pre-trained the model with an unsupervised strategy that injects information from images without annotation. We mainly focus on three data modalities that have been processed successfully with pre-training: medical image data, bio-signals, and EHRs. The multi-modality scenario is also considered. For example, the pre-trained BERT model used in semantic analysis can be applied to predict diagnoses from EHR data. Self-supervised pre-trained models can perform tasks such as classification and segmentation on CT and MRI images. Models can be pre-trained on electrical bio-signals to extract features, thereby helping with prediction or diagnosis. Compared with conventional models, the use of pre-trained models in the medical field has significantly improved efficiency and accuracy.


Why is this survey necessary?


There are two reasons why we have organised this survey. First, many works using pre-trained models have achieved satisfactory results in the medical domain in the past few years, but there are few systematic and comprehensive introductions to pre-training models.


Second, [25] is a comprehensive survey of pre-training, but no such survey exists for pre-training in the medical domain. The existing surveys in the medical field each investigate pre-training models for one specific modality. In particular, most of them review pre-training in medical imaging, and few cover bio-signals and EHRs. Therefore, it is important to review pre-training approaches in the medical domain systematically.


To the best of our knowledge, this paper is the first systematic and comprehensive summary of the recent pre-training innovations in the medical field, covering medical imaging analysis, electrical bio-signal data (electroencephalograms (EEG), electrocardiograms (ECG), etc.), EHRs, and multi-modality data.


This survey presents the techniques and analysis in a simple manner that is suitable for a variety of audiences. Its core target audience, however, consists of two groups. The first is experts from the medical field who are interested in developing computer-aided diagnosis systems. The second is experts in machine learning and deep learning who want to learn about current developments in pre-training in medicine.


Our contributions


This survey aims to present a systematic introduction to recent advances and new frontiers of pre-training based techniques in the medical domain. We summarized more than 200 contributions in this field that use pre-training, covering the period from the first emergence of pre-training approaches onwards. The main contributions of this survey are as follows.


It is the first to systematically summarize the pre-training techniques that are used for medical and clinical scenarios.


It summarized the medical pre-training models used on four main data types: medical images, bio-signal data, EHR data, and multi-modality.


It summarized the benchmark datasets for medical images, bio-signals, and EHRs.


The rest of this survey is structured as follows. Section 2 briefly introduces the benchmark datasets in the medical domain and the basic models and methods for pre-training. Section 3 summarises pre-training on medical imaging analysis for different datasets. Section 4 introduces pre-training for bio-signals. Section 5 summarises the state-of-the-art pre-training methods for EHRs. In Section 6, we discuss the challenges and future directions. Finally, Section 8 gives the conclusion of the survey.




Pre-training in Medical Data: A Survey

Yixuan Qiu, Feng Lin, Weitong Chen & Miao Xu

Release Date: 2023-04-06