Diffstat (limited to 'SI/Resource/Fundamentals of Data Mining/Content/K-Modes.md')
-rw-r--r-- SI/Resource/Fundamentals of Data Mining/Content/K-Modes.md 27
1 files changed, 27 insertions, 0 deletions
diff --git a/SI/Resource/Fundamentals of Data Mining/Content/K-Modes.md b/SI/Resource/Fundamentals of Data Mining/Content/K-Modes.md
new file mode 100644
index 0000000..6dea96c
--- /dev/null
+++ b/SI/Resource/Fundamentals of Data Mining/Content/K-Modes.md
@@ -0,0 +1,27 @@
+---
+id: K-Modes
+aliases:
+ - K-Modes: Clustering Categorical Data (Youtube)
+tags: []
+---
+
+## K-Modes: Clustering Categorical Data [(Youtube)](https://www.youtube.com/watch?v=b39_vipRkUo)
+
+- _K-Means_ cannot directly handle non-numerical (categorical) data - ==how
+ would the mean be calculated, and what would it mean?==
+ - Mapping categorical values to 0/1 does not produce quality clusters (in
+ high-dimensional space)
+- _**K-Modes**_: An extension of _K-Means_ that replaces the means of clusters
+ with _**modes**_
+ - Mode: The value that appears the most often in a **set** of data values
+- <u>Dissimilarity</u> measure between object X and the center of a cluster
+ $Z_l$
+ - $\Phi(x_j, z_j) = 1 - \dfrac{n_j^r}{n_l}$ when $x_j = z_j = r$; 1 when
+ $x_j \ne z_j$
+ - where $z_j$ is the categorical value of attribute $j$ in $Z_l$, $n_l$ is the
+ number of objects in cluster $l$, and $n_j^r$ is the number of objects in
+ cluster $l$ whose attribute $j$ has value $r$
+- This dissimilarity measure (distance function) is _**frequency-based**_
+- The algorithm is still based on iterative _object_ cluster assignment and
+ _centroid_ update
+- For a mixture of categorical and numerical data: use a _**K-Prototype**_ method
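
Not part of the commit: a minimal Python sketch of the frequency-based dissimilarity and the mode (center) update described in the note, for illustration only. The function names `cluster_modes` and `dissimilarity` and the toy data are hypothetical. Summing $\Phi$ over all attributes to get one distance per object is an assumption here, since the note only defines the per-attribute term.

```python
from collections import Counter


def cluster_modes(cluster):
    """Return the mode of each attribute (column) over the objects in a cluster."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*cluster)]


def dissimilarity(x, z, cluster):
    """Frequency-based dissimilarity between object x and cluster center z.

    Per attribute j: 1 if x_j != z_j, otherwise 1 - n_j^r / n_l, where n_l is
    the cluster size and n_j^r counts cluster members whose attribute j is r.
    Assumption: the per-attribute terms are summed into a single distance.
    """
    n_l = len(cluster)
    total = 0.0
    for j, (x_j, z_j) in enumerate(zip(x, z)):
        if x_j != z_j:
            total += 1.0
        else:
            n_jr = sum(1 for obj in cluster if obj[j] == x_j)
            total += 1.0 - n_jr / n_l
    return total


# Toy cluster of (color, size) objects; purely illustrative data.
cluster = [("red", "small"), ("red", "large"), ("blue", "small"), ("red", "small")]
center = cluster_modes(cluster)
print(center)                                           # ['red', 'small']
print(dissimilarity(("red", "large"), center, cluster))  # 1.25
print(dissimilarity(("red", "small"), center, cluster))  # 0.5
```

A matching value is penalized less the more common it is inside the cluster, so an object that agrees with a strongly dominant mode sits closer to the center than one that agrees with a weak mode.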