
  • T. Hussain Department of Computer Sciences, Muhammad Ali Jinnah University (MAJU), Islamabad, Pakistan
  • S. Asghar Institute of Information Technology, University of Arid Agriculture, Rawalpindi, Pakistan




Similarity among objects is a fundamental concept in almost all technical fields, such as information retrieval, data mining, mathematics, and bioinformatics. A similarity measure is a function that computes the degree of similarity between a pair of objects, which can be documents, queries, or features of any database. Similarity measures help to rank objects according to their importance in a specific data mining application, and similarity-based applications are countless. Data mining is used to build a knowledge base from large data repositories for human inference and analysis, and its techniques are common in all technical fields where similarity is required. The proper selection of a similarity or distance measure is key to many data mining techniques, such as clustering, classification, and outlier detection. For categorical data, computing a similarity measure is complex. The measures used for continuous data, such as Euclidean measures, are generalized to some extent and can be applied in any continuous data domain. Euclidean measures are also widely applied to categorical data without considering domain knowledge or the nature of categorical data. Due to the complex nature of categorical data, no standard measure comparable to the Euclidean measure is available in the literature. In this paper, we evaluate different categorical measures according to their usage in different data mining applications and techniques. We also propose the chi-fuzzy measure to address the categorical data issue.
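
To illustrate why Euclidean measures sit uneasily with categorical data, the following minimal sketch contrasts a simple matching (overlap) similarity with a Euclidean distance computed on integer-coded categories. It is not the chi-fuzzy measure proposed in the paper; the function names, the colour attribute, and the integer coding are hypothetical and chosen only for illustration.

```python
from math import sqrt


def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def euclidean_on_codes(a, b, codes):
    """Euclidean distance after mapping categories to integers.

    The mapping `codes` is an arbitrary (hypothetical) encoding; the result
    depends on it, which is the problem being illustrated.
    """
    return sqrt(sum((codes[x] - codes[y]) ** 2 for x, y in zip(a, b)))


# Hypothetical integer coding of a nominal attribute (no natural order).
codes = {"red": 0, "green": 1, "blue": 2}

# Overlap treats every mismatch the same way.
print(overlap_similarity(("red",), ("green",)))
print(overlap_similarity(("red",), ("blue",)))

# Euclidean distance on the codes makes "blue" look farther from "red"
# than "green" is, purely because of the arbitrary numbering.
print(euclidean_on_codes(("red",), ("green",), codes))
print(euclidean_on_codes(("red",), ("blue",), codes))
```

Under this assumed coding, the Euclidean distance imposes an ordering on the colours that the data do not contain, whereas the overlap measure only records whether the values match.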


T. Hussain and S. Asghar, “EVALUATION OF SIMILARITY MEASURES FOR CATEGORICAL DATA”, The Nucleus, vol. 50, no. 4, pp. 387–394, Nov. 2013.


