---
id: Midterm - CS663
aliases:
- Review
tags: []
---
# Review
## Types of Questions
- True or false
- Multiple choice
- Explain (e.g., [[K-Means]], [[NMI]])
- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
- Computational questions (e.g., [[DBSCAN]] and [[OPTICS]], similar to your
assignment questions; [[FP-growth|fp-tree]] and [[pattern discovery]],
examples from the lecture; [[clustering evaluation]])
## Subjects
- [[Distance measures (mixted types of attributes)]]
- How to handle [[nominal]] attributes …
- [[nominal|Match or no-match]] (as a whole or individually)
- [[nominal|One-hot encoding]]
- [[Target encoding]]
- Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
- Clustering techniques:
- [[K-Means]] and its [[variants]]
- [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
- [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
- [[Complexity]], [[distance functions]]
- How to measure clustering quality ([[internal]] and [[external]] measures,
[[F-measure]] and its averaging/combining options when applied to multiple
classes/clusters)
- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])
---
## [[Data Matrix and Dissimilarity Matrix]]
- Data matrix
- A data matrix of $n$ data points with $l$ dimensions ![[CleanShot 2023-10-23 at 17.37.59@2x.png]]
- Dissimilarity (distance) matrix (n by n)
- $n$ data points, but registers only the distance _d(i,j)_ (typically a metric) ![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
- Usually symmetric, thus stored as a triangular matrix
- **[[Distance functions]]** are usually different for real, boolean,
categorical, ordinal, ratio, and vector variables
- Weights can be associated with different variables based on applications and
data semantics
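
As a rough illustration of how a data matrix relates to its dissimilarity matrix, here is a minimal sketch in plain Python. It assumes Euclidean distance and purely numeric attributes; the function name `dissimilarity_matrix` and the sample points are made up for this example and do not come from the lecture.

```python
import math


def dissimilarity_matrix(data):
    """Return the n-by-n matrix of Euclidean distances d(i, j) between rows of `data`."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Only the lower triangle is computed; the upper triangle is mirrored
        # because d(i, j) == d(j, i) for a symmetric distance.
        for j in range(i):
            d = math.dist(data[i], data[j])
            dist[i][j] = d
            dist[j][i] = d
    return dist


# Hypothetical data matrix: n = 3 points, l = 2 dimensions
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(X)

for row in D:
    print(row)
```

Because the matrix is symmetric with a zero diagonal, storing only the lower triangle would be enough; the full matrix is kept here just to make it easy to read.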
### Standardizing Numeric Data
- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
- $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$:
standard deviation of the population
- The z-score is the distance between the raw score and the population mean, in
units of the standard deviation
- Negative when the raw score is below the mean, positive when above
- An alternative way: Calculate the mean absolute deviation $S_{f} =
\dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)$ where
$m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + ... + x_{nf})$
- standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
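- **Using mean absolute deviation is more robust to outliers than using standard
deviation**: deviations are not squared, so extreme values inflate the scale
less and outliers keep a larger standardized score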
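
The two standardization options above can be compared with a short sketch. It assumes a single numeric attribute with at least two distinct values (so the scale is non-zero) and uses the population standard deviation; the function names and sample values are hypothetical, chosen only for illustration.

```python
import statistics


def z_scores_std(values):
    """Standardize with the population mean and population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumes sigma > 0 (values are not all equal)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation S_f."""
    m_f = statistics.fmean(values)
    s_f = sum(abs(x - m_f) for x in values) / len(values)  # assumes s_f > 0
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values: mean 5, standard deviation 2, mean absolute deviation 1.5
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_std = z_scores_std(values)
z_mad = z_scores_mad(values)

print(z_std[-1])  # standardized score of the largest value, using sigma
print(z_mad[-1])  # standardized score of the largest value, using S_f
```

Since the mean absolute deviation is never larger than the standard deviation, the scores based on $S_f$ are at least as large in magnitude, so extreme values stay easier to spot.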
### Proximity Measure for [[Binary|Binary Attributes]]
### Proximity Measure for [[nominal|Categorical Attributes]]
### [[Ordinal]] Variables