--- id: K-Medoids aliases: - Handling Outliers: From _K-Means_ to _K-Medoids_ (Youtube)(Youtube-E.g.) tags: - Compare-and-Contrast - "" - Complexity --- ## Handling Outliers: From _K-Means_ to _K-Medoids_ [(Youtube)](https://www.youtube.com/watch?v=OFELCn-6r2o) [(Youtube-E.g.)](https://www.youtube.com/watch?v=ChBxx4aR-bY&t=0s) - K-Medoids: Instead of taking the **mean** value of the objects in a cluster as a reference point, **medoids** can be used, which is the **most centrally located** object in a cluster - The _K-Medoids_ clustering algorithm: - Select _K_ points as the initial ==representative== objects (i.e., as initial _K medoids_) - **Repeat** - Assigning each point to the cluster with the closest medoid - Randomly select a ==non-representative== object $o_i$ - Compute ==the total cost _S_== of swapping the medoid $m$ with $o_i$(_==e.g.,useSSEtomeasure==_) - If $S<0$ (_==e.g., new SSE < previous SSE==_), then swap $m$ with $o_i$ to form the new set of medoids - **Until** convergence criterion is satisfied ### Discussion on _K-Medoids_ Clustering - _K-Medoids_ Clustering: Find _representative_ objects (medoids) in clusters - _PAM_ (Partitioning Around Medoids) - Starts from an initial set of medoids, and - Iteratively replaces one of the medoids by one of the non-mdedoids if it improves the total sum of the squared errors (SSE) of the resulting clustering - _PAM_ works effectively for small data sets but does not scale well for large data sets (due to the computational complexity) - Computational [[Complexity]]: PAM: $O(K(n - k)^2)$ (quite expensive!) - Efficiency improvements on PAM - _**CLARA**_ (Kaufmann & Rousseeuw, 1990): - PAM on samples; $O(Ks^2 + K(n - K))$, $s$ is the sample size - _**CLARANS**_ (Ng & Han, 1994): ==Randomised re-sampling==, ensuring efficiency + quality