Salman Ahmed Shaikh and Hiroyuki Kitagawa. Top-k Outlier Detection from Uncertain Data. International Journal of Automation and Computing, vol. 11, no. 2, pp. 128-142, 2014. https://doi.org/10.1007/s11633-014-0775-8

Top-k Outlier Detection from Uncertain Data

doi: 10.1007/s11633-014-0775-8
Funds: This work was partly supported by Grant-in-Aid for Scientific Research (A) (#24240015A).

  • Received Date: 2013-07-31
  • Revision Received Date: 2013-11-11
  • Publish Date: 2014-04-01

Abstract:

Uncertain data are common due to the increasing use of sensors, radio frequency identification (RFID), GPS and similar devices for data collection. Causes of uncertainty include measurement limitations, noise, inconsistent supply voltage, and delay or loss of data in transfer. Managing, querying or mining such data requires taking this uncertainty into account. This paper therefore studies the problem of top-k distance-based outlier detection from uncertain data objects, where each uncertain object is modelled by the probability density function of a Gaussian distribution. The naive approach to distance-based outlier detection uses a nested loop, which is very costly because the distance function between two uncertain objects is expensive to evaluate. A populated-cells list (PC-list) approach to outlier detection is therefore proposed. Using the PC-list, the proposed top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. Two approximate top-k outlier detection algorithms are presented to further increase efficiency. An extensive empirical study on synthetic and real datasets is also presented to demonstrate the accuracy, efficiency and scalability of the proposed algorithms.
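
The naive nested-loop baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes one-dimensional objects, each given as a hypothetical `(mean, std_dev)` pair of an independent Gaussian, and it assumes that outliers are ranked by the smallest expected number of neighbours within a distance threshold `d`. The function names and the parameters `d` and `k` are illustrative only. Under these assumptions the difference of two objects is itself Gaussian, so the neighbour probability has a closed form.

```python
import math
from typing import List, Tuple

# Assumed representation: (mean, std_dev) of a 1-D Gaussian, with std_dev > 0.
UncertainObject = Tuple[float, float]


def std_normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def neighbor_probability(a: UncertainObject, b: UncertainObject, d: float) -> float:
    """P(|A - B| <= d) for independent 1-D Gaussians A and B.

    A - B is Gaussian with mean (mu_a - mu_b) and variance (sd_a^2 + sd_b^2).
    """
    delta = a[0] - b[0]
    s = math.sqrt(a[1] ** 2 + b[1] ** 2)
    return std_normal_cdf((d - delta) / s) - std_normal_cdf((-d - delta) / s)


def top_k_outliers(objects: List[UncertainObject], d: float, k: int) -> List[int]:
    """Nested-loop baseline: indices of the k objects with the fewest expected neighbours."""
    scores = []
    for i, a in enumerate(objects):
        # Inner loop: evaluate the distance function against every other object.
        expected = sum(
            neighbor_probability(a, b, d)
            for j, b in enumerate(objects)
            if j != i
        )
        scores.append((expected, i))
    scores.sort()  # ascending: fewest expected neighbours first
    return [i for _, i in scores[:k]]


objects = [(0.0, 0.1), (0.1, 0.1), (0.2, 0.1), (5.0, 0.1)]
print(top_k_outliers(objects, d=0.5, k=1))
```

Every object is compared with every other object, so the sketch evaluates the distance function a quadratic number of times. This is the cost that the abstract says the PC-list approach avoids by considering only a fraction of the dataset objects.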

     

