---
id: external
aliases:
  - "Measuring Clustering Quality: ==External== Methods"
tags: []
---

## Measuring Clustering Quality: ==External== Methods

- Given the **ground truth** _T_, _Q(C, T)_ is the **quality measure** for a
  clustering _C_
- _Q(C, T)_ is good if it satisfies the following **four** essential criteria:
  - **Cluster homogeneity**
    - The purer each cluster (the more its objects come from a single
      ground-truth category), the better
  - **Cluster completeness**
    - Assign objects belonging to the same category in the ground truth to the
      same cluster
  - **Rag bag better than alien**
    - Putting a heterogeneous object into a pure cluster should be penalized
      **more** than putting it into a _rag bag_ (i.e., "miscellaneous" or
      "other" category)
  - **Small cluster preservation**
    - Splitting a small category into pieces is more harmful than splitting a
      large category into pieces
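
The first two criteria can be made concrete with a small sketch. The helper `purity` below is an illustrative name, not something defined in these notes. It scores homogeneity as plain purity and, as an assumption, approximates completeness by swapping the roles of clusters and categories (often called inverse purity). The toy labels are made up for illustration.

```python
from collections import Counter, defaultdict


def purity(clusters, truth):
    """Fraction of objects that belong to the majority category of their cluster."""
    members = defaultdict(list)
    for cluster_id, category in zip(clusters, truth):
        members[cluster_id].append(category)
    # For each cluster, count how many objects share its most common category.
    majority_total = sum(
        Counter(categories).most_common(1)[0][1] for categories in members.values()
    )
    return majority_total / len(truth)


# Hypothetical toy data: two ground-truth categories of three objects each.
truth = ["a", "a", "a", "b", "b", "b"]
pure_but_split = [0, 0, 1, 2, 2, 2]  # every cluster is pure, but "a" is split
merged = [0, 0, 0, 0, 0, 0]          # a single cluster mixes both categories

print(purity(pure_but_split, truth))  # homogeneity view: 1.0
print(purity(truth, pure_but_split))  # completeness view (inverse purity): 5/6
print(purity(merged, truth))          # homogeneity view: 0.5
print(purity(truth, merged))          # completeness view (inverse purity): 1.0
```

On this toy data the split clustering is perfectly pure but scores below 1 once the roles are swapped, while the merged clustering shows the opposite pattern, so neither number alone captures both criteria.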

## Commonly Used External Measures

- **Matching-based measures**
  - Purity, maximum matching, [[F-measure]]
- **Entropy-based measures**
  - Conditional entropy
  - <u>Normalized mutual information (NMI)</u>
  - Variation of information
- **Pairwise measures**
  - Four possibilities for each pair of objects: true positive (TP), false
    negative (FN), false positive (FP), true negative (TN)
  - Jaccard coefficient, Rand statistic, Fowlkes-Mallows measure
- **Correlation measures**
  - Discretized Hubert statistic, normalized discretized Hubert statistic
- Purity vs Maximum Matching ![[CleanShot 2023-10-25 at 15.57.30@2x.png]]
- [[F-measure]] ![[CleanShot 2023-10-25 at 15.57.51@2x.png]]
  ![[CleanShot 2023-10-25 at 15.58.04@2x.png]]
  ![[CleanShot 2023-10-25 at 15.58.19@2x.png]]
  ![[CleanShot 2023-10-25 at 15.58.40@2x.png]]
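
The pairwise measures listed above can be sketched by classifying every pair of objects into one of the four cases. This is a minimal sketch, assuming the usual definitions: Jaccard = TP / (TP + FN + FP), Rand = (TP + TN) / (number of pairs), and Fowlkes-Mallows = the geometric mean of pairwise precision and recall. The function names and toy labels are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(clusters, truth):
    """Count TP, FN, FP, TN over all unordered pairs of objects."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_category = truth[i] == truth[j]
        if same_cluster and same_category:
            tp += 1  # same category, same cluster
        elif same_category:
            fn += 1  # same category, different clusters
        elif same_cluster:
            fp += 1  # different categories, same cluster
        else:
            tn += 1  # different categories, different clusters
    return tp, fn, fp, tn


def pairwise_measures(clusters, truth):
    """Jaccard, Rand, and Fowlkes-Mallows scores; 0.0 when a denominator is empty."""
    tp, fn, fp, tn = pair_counts(clusters, truth)
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "jaccard": tp / (tp + fn + fp) if tp + fn + fp else 0.0,
        "rand": (tp + tn) / total if total else 0.0,
        "fowlkes_mallows": sqrt(precision * recall),
    }


# Hypothetical toy data: one object of category "a" lands in the "b" cluster.
truth = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]

print(pair_counts(clusters, truth))        # (4, 2, 3, 6)
print(pairwise_measures(clusters, truth))  # jaccard = 4/9, rand = 2/3
```

The loop visits all n(n-1)/2 pairs, which is fine for small examples. For large data sets the same counts are usually derived from a contingency table instead.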