Nongnuch Poolsawad, Lisa Moore and Chandrasekhar Kambhampati. Issues in the Mining of Heart Failure Datasets. International Journal of Automation and Computing, vol. 11, no. 2, pp. 162-179, 2014.

Issues in the Mining of Heart Failure Datasets

doi: 10.1007/s11633-014-0778-5
  • Received Date: 2012-10-29
  • Revision Received Date: 2013-07-18
  • Publish Date: 2014-04-01
  • Abstract: This paper investigates the characteristics of a typical clinical dataset, using a combination of feature selection and classification methods to handle missing values and to understand the dataset's underlying statistical characteristics. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes, all of which pose inherent problems for feature selection and classification algorithms. With most clinical datasets, an initial exploration is carried out and attributes with more than a certain percentage of missing values are eliminated. Prognostic and diagnostic models are then developed with the help of missing value imputation, feature selection, and classification algorithms. The paper reaches two main conclusions: 1) Despite the nature and large size of clinical datasets, the choice of missing value imputation method does not affect final performance. What is crucial is that the dataset accurately represents the clinical problem; the particular imputation method is not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning proves more suitable for mining clinical data than unsupervised methods. Non-parametric classifiers such as decision trees are also shown to give better results than parametric classifiers such as radial basis function networks (RBFNs).
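
The preprocessing steps mentioned in the abstract (eliminating attributes with too many missing values, then imputing what remains) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 40% threshold, the toy data, and the use of column-mean imputation are all assumptions chosen for illustration, and the paper compares several imputation methods rather than prescribing one.

```python
from statistics import mean


def drop_sparse_columns(rows, max_missing=0.4):
    """Keep columns whose fraction of missing (None) values is <= max_missing.

    The 0.4 default is an illustrative assumption, not a value from the paper.
    Returns the filtered rows and the indices of the columns that were kept.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    keep = [
        j for j in range(n_cols)
        if sum(row[j] is None for row in rows) / n_rows <= max_missing
    ]
    return [[row[j] for j in keep] for row in rows], keep


def impute_mean(rows):
    """Replace each None with the mean of the observed values in its column.

    Assumes every column has at least one observed value, which holds after
    drop_sparse_columns when max_missing < 1.
    """
    n_cols = len(rows[0])
    col_means = [
        mean(row[j] for row in rows if row[j] is not None)
        for j in range(n_cols)
    ]
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy data: 4 patients, 3 attributes; None marks a missing value.
data = [
    [63.0, None, 1.2],
    [71.0, None, None],
    [58.0, 140.0, 0.9],
    [None, None, 1.5],
]

filtered, kept = drop_sparse_columns(data, max_missing=0.4)
completed = impute_mean(filtered)
```

Here the second attribute is missing for three of four patients and is dropped, while the remaining gaps are filled with column means. A feature selection step and a classifier would then be applied to `completed`; those stages are omitted because the abstract does not specify them in enough detail to sketch faithfully.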

