---
id: K-Modes
aliases:
- K-Modes: Clustering Categorical Data (Youtube)
tags: []
---
## K-Modes: Clustering Categorical Data [(Youtube)](https://www.youtube.com/watch?v=b39_vipRkUo)
- _K-Means_ cannot directly handle non-numerical (categorical) data: ==how
would the mean of a set of categories be calculated, and what would it mean?==
- Mapping each categorical value to a 0/1 indicator does not generate quality
clusters (the encoded data lives in a high-dimensional space)
- _**K-Modes**_: An extension to _K-Means_ by replacing means of clusters with
_**modes**_
- Mode: The value that appears the most often in a **set** of data values
- <u>Dissimilarity</u> measure between object X and the center of a cluster
$Z_l$
- $\Phi(x_j, z_j) = 1 - \dfrac{n_j^r}{n_l}$ when $x_j = z_j = r$; 1 when
$x_j \ne z_j$
- where $z_j$ is the categorical value of attribute $j$ in $Z_l$, $n_l$ is the
number of objects in cluster $l$, and $n_j^r$ is the number of objects in
cluster $l$ whose value for attribute $j$ is $r$
- This dissimilarity measure (distance function) is _**frequency-based**_
- Algorithm is still based on iterative _object_ cluster assignment and
_centroid_ update
- For a mixture of categorical and numerical data: use a _**K-Prototype**_ method
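
The mode-based center and the frequency-based dissimilarity above can be sketched in a few lines of Python. This is a minimal illustration, not a full K-Modes implementation: the function names (`mode_center`, `attribute_counts`, `dissimilarity`, `nearest_cluster`) and the toy data are hypothetical, and summing $\Phi$ over all attributes to get the object-to-center distance is an assumption, since the notes only give the per-attribute term.

```python
from collections import Counter

def mode_center(cluster):
    """Center Z_l: the most frequent value of each attribute in the cluster."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def attribute_counts(cluster):
    """One Counter per attribute j, giving n_j^r for every value r."""
    return [Counter(col) for col in zip(*cluster)]

def dissimilarity(x, center, counts, n_l):
    """Assumed total distance: sum over attributes of Phi(x_j, z_j),
    where Phi = 1 - n_j^r / n_l if x_j == z_j == r, and 1 otherwise."""
    total = 0.0
    for x_j, z_j, count_j in zip(x, center, counts):
        if x_j == z_j:
            total += 1.0 - count_j[x_j] / n_l
        else:
            total += 1.0
    return total

def nearest_cluster(x, clusters):
    """Assignment step: index of the cluster whose mode is least dissimilar to x."""
    scores = [
        dissimilarity(x, mode_center(c), attribute_counts(c), len(c))
        for c in clusters
    ]
    return min(range(len(clusters)), key=scores.__getitem__)

# Toy data (made up for illustration): objects are (color, size) tuples.
cluster_a = [("red", "small"), ("red", "large"), ("blue", "small")]
cluster_b = [("blue", "large"), ("blue", "large")]

print(mode_center(cluster_a))
print(nearest_cluster(("red", "small"), [cluster_a, cluster_b]))
```

In `cluster_a` the mode is `("red", "small")`, and each of those values occurs in 2 of the 3 objects, so an object equal to the mode still has dissimilarity $(1 - 2/3) + (1 - 2/3) = 2/3$ under this measure. A matching value counts as closer the more dominant it is within the cluster, which is the frequency-based behaviour the notes describe.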