summaryrefslogtreecommitdiff
path: root/SI/Resource/Fundamentals of Data Mining/Content/K-Medoids.md
blob: 6dc59fea70cea36028d8a0d51beade37db2943c3 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
---
id: K-Medoids
aliases:
  - Handling Outliers: From _K-Means_ to _K-Medoids_ (Youtube)(Youtube-E.g.)
tags:
  - Compare-and-Contrast
  - ""
  - Complexity
---

## Handling Outliers: From _K-Means_ to _K-Medoids_ [(Youtube)](https://www.youtube.com/watch?v=OFELCn-6r2o) [(Youtube-E.g.)](https://www.youtube.com/watch?v=ChBxx4aR-bY&t=0s)

- K-Medoids: Instead of taking the **mean** value of the objects in a cluster as
  a reference point, **medoids** can be used, which is the **most centrally
  located** object in a cluster

- The _K-Medoids_ clustering algorithm:
  - Select _K_ points as the initial ==representative== objects (i.e., as
    initial _K medoids_)
  - **Repeat**
    - Assigning each point to the cluster with the closest medoid
    - Randomly select a ==non-representative== object $o_i$
    - Compute ==the total cost _S_== of swapping the medoid $m$ with
      $o_i$(_==e.g.,useSSEtomeasure==_)
    - If $S<0$ (_==e.g., new SSE < previous SSE==_), then swap $m$ with $o_i$ to
      form the new set of medoids
  - **Until** convergence criterion is satisfied

### Discussion on _K-Medoids_ Clustering

- _K-Medoids_ Clustering: Find _representative_ objects (<u>medoids</u>) in
  clusters
- _PAM_ (Partitioning Around Medoids)
  - Starts from an initial set of medoids, and
  - Iteratively replaces one of the medoids by one of the non-mdedoids if it
    improves the total sum of the squared errors (SSE) of the resulting
    clustering
  - _PAM_ works effectively for small data sets but does not scale well for
    large data sets (due to the computational complexity)
  - Computational [[Complexity]]: PAM: $O(K(n - k)^2)$ (quite expensive!)
- Efficiency improvements on PAM
  - _**CLARA**_ (Kaufmann & Rousseeuw, 1990):
    - PAM on samples; $O(Ks^2 + K(n - K))$, $s$ is the sample size
  - _**CLARANS**_ (Ng & Han, 1994): ==Randomised re-sampling==, ensuring
    efficiency + quality