---
id: K-Modes
aliases:
  - K-Modes: Clustering Categorical Data (Youtube)
tags: []
---

## K-Modes: Clustering Categorical Data [(Youtube)](https://www.youtube.com/watch?v=b39_vipRkUo)

- _K-Means_ cannot directly handle non-numerical (categorical) data
  - ==How would the mean be calculated? What would it even mean?==
  - Mapping categorical values to 0/1 does not produce quality clusters in high-dimensional space
- _**K-Modes**_: An extension of _K-Means_ that replaces the means of clusters with _**modes**_
  - Mode: The value that appears most often in a **set** of data values
- Dissimilarity measure between an object $X$ and the center $Z_l$ of cluster $l$
  - $\Phi(x_j, z_j) = 1 - \dfrac{n_j^r}{n_l}$ when $x_j = z_j = r$; $\Phi(x_j, z_j) = 1$ when $x_j \ne z_j$
  - where $x_j$ is the value of attribute $j$ in $X$, $z_j$ is the categorical value of attribute $j$ in $Z_l$, $n_l$ is the number of objects in cluster $l$, and $n_j^r$ is the number of objects in cluster $l$ whose value for attribute $j$ is $r$
  - This dissimilarity measure (distance function) is _**frequency-based**_
- The algorithm is still based on iterative _object_ cluster assignment and _centroid_ update
- For a mixture of categorical and numerical data: use a _**K-Prototype**_ method
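
A minimal Python sketch of the frequency-based dissimilarity and the assign/update loop described above. It reads the per-attribute measure as $1 - n_j^r / n_l$ on a match and $1$ on a mismatch, and it assumes the per-attribute values are summed to get the object-to-cluster distance. The names `mode_of`, `dissimilarity`, and `k_modes`, the caller-supplied initial modes, and the plain mismatch count used for the very first assignment are all illustrative assumptions, not something the lecture specifies.

```python
from collections import Counter


def mode_of(cluster):
    """Attribute-wise mode of a non-empty list of equal-length tuples."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))


def dissimilarity(x, cluster):
    """Sum of Phi(x_j, z_j) over attributes, with z the mode of `cluster`.

    Assumption: per-attribute dissimilarities are simply added up.
    """
    z = mode_of(cluster)
    n_l = len(cluster)
    total = 0.0
    for j, (x_j, z_j) in enumerate(zip(x, z)):
        if x_j != z_j:
            total += 1.0
        else:
            # n_j^r: objects in this cluster whose attribute j equals r = x_j
            n_j_r = sum(1 for obj in cluster if obj[j] == x_j)
            total += 1.0 - n_j_r / n_l
    return total


def k_modes(data, initial_modes, max_iter=100):
    """Simplified K-Modes loop: assign objects, then update modes."""
    k = len(initial_modes)
    # First assignment: no clusters exist yet, so count plain mismatches
    # against the supplied initial modes (an assumption of this sketch).
    labels = [
        min(range(k), key=lambda l: sum(a != b for a, b in zip(x, initial_modes[l])))
        for x in data
    ]
    for _ in range(max_iter):
        clusters = [[x for x, lab in zip(data, labels) if lab == l] for l in range(k)]
        # Reassign each object to the closest non-empty cluster.
        new_labels = [
            min((l for l in range(k) if clusters[l]),
                key=lambda l: dissimilarity(x, clusters[l]))
            for x in data
        ]
        if new_labels == labels:  # converged: no object changed cluster
            break
        labels = new_labels
    clusters = [[x for x, lab in zip(data, labels) if lab == l] for l in range(k)]
    modes = [mode_of(c) if c else None for c in clusters]
    return labels, modes


data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
labels, modes = k_modes(data, initial_modes=[("a", "x", "p"), ("b", "z", "r")])
```

Because the measure is frequency-based, an object that matches the mode on an attribute is still penalized when that value is not unanimous within the cluster, so tighter clusters yield smaller distances.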