EVALUATION OF SIMILARITY MEASURES FOR CATEGORICAL DATA

T. Hussain, S. Asghar

Abstract


Similarity among objects is a fundamental concept in almost all technical fields, such as information retrieval, data mining, mathematics, and bioinformatics. A similarity measure represents the relation among objects, which can be documents, queries, or features of any database. It is a function that computes the degree of similarity between a pair of objects, and it helps to rank objects according to their importance in a specific data mining application. Similarity-based applications are countless. Data mining is used to build the knowledge base of large data repositories for human inference and analysis, and its techniques are used frequently in every technical field where similarity is required. The proper selection of a similarity or distance measure is key to many data mining techniques, such as clustering, classification, and outlier detection. For categorical data, computing similarity is complex. The measures used for continuous data, such as Euclidean measures, are generalized to some extent and can be applied in any continuous data domain. Euclidean measures are also widely applied to categorical data without considering the domain knowledge and the nature of categorical data. Due to the complex nature of categorical data, no standard measure comparable to the Euclidean measure is available in the literature. In this paper, we evaluate different categorical measures according to their usage in different data mining applications and techniques. We also propose the chi-fuzzy measure to address the categorical data issue.
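
As a minimal illustration of why Euclidean distance is a poor fit for categorical data, the Python sketch below compares it with the overlap (simple matching) measure, a common baseline for categorical attributes. The records, the two integer encodings, and the function names are hypothetical and chosen only for illustration; this is not the chi-fuzzy measure proposed in the paper.

```python
import math


def overlap_similarity(x, y):
    """Fraction of attributes on which two categorical records agree."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("records must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def euclidean_on_codes(x, y, encoding):
    """Euclidean distance after mapping each category to an integer code."""
    return math.sqrt(sum((encoding[a] - encoding[b]) ** 2 for a, b in zip(x, y)))


# Hypothetical records with two categorical attributes: colour and size.
r1 = ("red", "small")
r2 = ("blue", "small")
r3 = ("green", "small")

# Two equally arbitrary integer encodings of the same categories.
enc_a = {"red": 0, "green": 1, "blue": 2, "small": 0}
enc_b = {"red": 0, "blue": 1, "green": 2, "small": 0}

# Overlap treats every mismatch alike, so r2 and r3 are equally similar to r1.
sim_12 = overlap_similarity(r1, r2)
sim_13 = overlap_similarity(r1, r3)

# Euclidean distance depends on the encoding: the nearer record flips.
d12_a = euclidean_on_codes(r1, r2, enc_a)
d13_a = euclidean_on_codes(r1, r3, enc_a)
d12_b = euclidean_on_codes(r1, r2, enc_b)
d13_b = euclidean_on_codes(r1, r3, enc_b)
```

Under the first encoding `r3` appears closer to `r1` than `r2` does, and under the second the order reverses, although the data are unchanged. The overlap measure is unaffected by the choice of codes because it uses only equality between categories.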


