---
id: external
aliases:
  - "Measuring Clustering Quality: ==External== Methods"
tags: []
---

## Measuring Clustering Quality: ==External== Methods

- Given the **ground truth** _T_, _Q(C, T)_ is the **quality measure** for a clustering _C_
- _Q(C, T)_ is good if it satisfies the following **four** essential criteria
    - **Cluster homogeneity**
        - The purer the clusters, the better
    - **Cluster completeness**
        - Objects belonging to the same category in the ground truth should be assigned to the same cluster
    - **Rag bag better than alien**
        - Putting a heterogeneous object into a pure cluster should be penalized **more** than putting it into a _rag bag_ (i.e., a "miscellaneous" or "other" category)
    - **Small cluster preservation**
        - Splitting a small category into pieces is more harmful than splitting a large category into pieces

## Commonly Used External Measures

- **Matching-based measures**
    - Purity, maximum matching, [[F-measure]]
- **Entropy-based measures**
    - Conditional entropy
    - Normalized mutual information (NMI)
    - Variation of information
- **Pairwise measures**
    - Four possibilities for each pair of objects: true positive (TP), false negative (FN), false positive (FP), true negative (TN)
    - Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
- **Correlation measures**
    - Discretized Hubert statistic, normalized discretized Hubert statistic

- Purity vs Maximum Matching

![[CleanShot 2023-10-25 at 15.57.30@2x.png]]

- [[F-measure]]

![[CleanShot 2023-10-25 at 15.57.51@2x.png]]

![[CleanShot 2023-10-25 at 15.58.04@2x.png]]

![[CleanShot 2023-10-25 at 15.58.19@2x.png]]

![[CleanShot 2023-10-25 at 15.58.40@2x.png]]
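
To make several of the measures listed above concrete, here is a minimal Python sketch covering purity, the F-measure, the pairwise counts (TP, FN, FP, TN) with the Jaccard, Rand, and Fowlkes-Mallows measures, and the entropy-based measures. It is an illustration under assumptions, not a reference implementation: the function names (`contingency`, `purity`, `f_measure`, `pair_counts`, `pairwise_measures`, `entropy_measures`) are hypothetical, the F-measure follows one common definition (the mean per-cluster F1 against each cluster's majority class), NMI is normalized by the geometric mean of the two entropies, and natural logarithms are used. Maximum matching and the Hubert statistics are not covered.

```python
from collections import Counter
from math import comb, log, sqrt


def contingency(clusters, truth):
    """Count objects for each (cluster label, ground-truth label) pair."""
    return Counter(zip(clusters, truth))


def purity(clusters, truth):
    """Fraction of objects that belong to the majority class of their cluster."""
    best = {}
    for (c, _), n in contingency(clusters, truth).items():
        best[c] = max(best.get(c, 0), n)
    return sum(best.values()) / len(truth)


def f_measure(clusters, truth):
    """Mean over clusters of F1 against each cluster's majority class (assumed definition)."""
    table = contingency(clusters, truth)
    class_sizes = Counter(truth)
    scores = []
    for c, size in Counter(clusters).items():
        # Majority ground-truth class t inside cluster c, with its count n.
        t, n = max(((t, n) for (ci, t), n in table.items() if ci == c),
                   key=lambda item: item[1])
        precision, recall = n / size, n / class_sizes[t]
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)


def pair_counts(clusters, truth):
    """Return (TP, FN, FP, TN) over all unordered pairs of objects."""
    tp = sum(comb(v, 2) for v in contingency(clusters, truth).values())
    same_cluster = sum(comb(v, 2) for v in Counter(clusters).values())
    same_class = sum(comb(v, 2) for v in Counter(truth).values())
    fp = same_cluster - tp   # same cluster, different class
    fn = same_class - tp     # same class, different cluster
    tn = comb(len(truth), 2) - tp - fp - fn
    return tp, fn, fp, tn


def pairwise_measures(clusters, truth):
    """Return (Jaccard, Rand, Fowlkes-Mallows); assumes at least one same-cluster and one same-class pair."""
    tp, fn, fp, tn = pair_counts(clusters, truth)
    jaccard = tp / (tp + fn + fp)
    rand = (tp + tn) / (tp + fn + fp + tn)
    fowlkes_mallows = tp / sqrt((tp + fn) * (tp + fp))
    return jaccard, rand, fowlkes_mallows


def entropy_measures(clusters, truth):
    """Return (H(T|C), NMI, variation of information) using natural logs."""
    n = len(truth)
    pc = {c: v / n for c, v in Counter(clusters).items()}
    pt = {t: v / n for t, v in Counter(truth).items()}
    h_c = -sum(p * log(p) for p in pc.values())
    h_t = -sum(p * log(p) for p in pt.values())
    mi = sum((v / n) * log((v / n) / (pc[c] * pt[t]))
             for (c, t), v in contingency(clusters, truth).items())
    nmi = mi / sqrt(h_c * h_t) if h_c > 0 and h_t > 0 else 0.0
    return h_t - mi, nmi, h_c + h_t - 2 * mi


# Toy data (made up for illustration): two clusters, two ground-truth classes.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
print(purity(clusters, truth))       # 5/6, about 0.833
print(pair_counts(clusters, truth))  # (4, 3, 2, 6)
print(pairwise_measures(clusters, truth))
print(entropy_measures(clusters, truth))
```

For a clustering that matches the ground truth exactly, this sketch yields purity 1, conditional entropy 0, NMI 1, and variation of information 0 (up to floating-point error), which is a quick sanity check on the definitions.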