June 16, 2016
Dear Faculty, graduate and undergraduate students,
You are cordially invited to my Master's thesis defense.
Title: Automatic K-Expectation-Maximization (K-EM) clustering algorithm for data-mining applications
When: Thursday, June 23, 2016 at 1:00 PM
Where: Simrall Hall, Room 228
Candidate: Archit Harsh
Degree: Master's, Electrical and Computer Engineering
Committee:
Dr. John E. Ball
Assistant Professor of Electrical and Computer Engineering (Major Professor)
Dr. Nicolas H. Younan
Professor of Electrical and Computer Engineering (Committee Member)
Dr. Mahalingam Ramkumar
Associate Professor of Computer Science and Engineering (Committee Member)
Abstract:
This thesis presents a non-parametric data clustering technique for clustering data efficiently and improving the choice of the number of clusters. Specifically, two methods are proposed: Automatic K-Means and K-Expectation-Maximization (K-EM). The computational task of partitioning a data set into k clusters is often referred to as k-clustering. The K-Means and Expectation-Maximization algorithms have been widely deployed in data-clustering applications in relational databases. Findings reported in the literature show that both algorithms have shortcomings. K-Means does not guarantee convergence, and the choice of clusters heavily influences the results. Expectation-Maximization’s premature convergence does not assure the optimality of results and, as with K-Means, the choice of clusters influences the results. To overcome these shortcomings, a fast automatic K-EM algorithm is developed and implemented that can guarantee both convergence and optimality of results. As an advantage of a non-parametric clustering technique, the proposed method provides the optimal number of clusters by utilizing various internal cluster validity metrics, thereby making it independent of the choice of clusters and providing unbiased results. The algorithm is applied to a wide array of data sets, both real and synthetic, to validate the accuracy of the results and the efficiency of the algorithm.
Best Regards,
Archit Harsh