SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md



---
id: Midterm - CS663
aliases:
  - Review
tags: []
---
# Review
## Types of Questions
- True or false
- Multiple choice
- Explain (e.g., [[K-Means]], [[NMI]])
- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
- Computational questions, e.g., [[DBSCAN]] and [[OPTICS]] (similar to your
  assignment questions), [[FP-growth|FP-tree]] and [[pattern discovery]]
  (examples from the lecture), and [[clustering evaluation]]

## Subjects
- [[Distance measures (mixted types of attributes)|Distance measures (mixed types of attributes)]]
  - How to handle [[nominal]] attributes …
    - [[nominal|Match or no-match]] (as a whole or individually)
    - [[nominal|One-hot encoding]]
    - [[Target encoding]]
- Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
- Clustering techniques:
  - [[K-Means]] and its [[variants]]
  - [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
  - [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
  - [[Complexity]], [[distance functions]]
- How to measure clustering quality ([[internal]] and [[external]] measures,
  [[F-measure]] and its averaging/combining options when applied to multiple
  classes/clusters)
- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])

---

## [[Data Matrix and Dissimilarity Matrix]]

- Data matrix
  - A data matrix of n data points with _l_ dimensions ![[CleanShot 2023-10-23 at 17.37.59@2x.png]]
- Dissimilarity (distance) matrix (n by n)
  - Covers the same n data points, but registers only the distance _d(i,j)_ (typically a metric) ![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
  - Usually symmetric, so only a triangular matrix needs to be stored
  - **[[Distance functions]]** are usually different for real, boolean,
    categorical, ordinal, ratio, and vector variables
  - Weights can be associated with different variables based on applications and
    data semantics
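
As a rough illustration of the two structures above, the following sketch builds an n-by-n dissimilarity matrix from a small data matrix. The sample values and the function name `dissimilarity_matrix` are made up for this note, and Euclidean distance is only one possible choice for _d(i,j)_.

```python
import math


def dissimilarity_matrix(data):
    """Return the n-by-n matrix of pairwise Euclidean distances.

    `data` is a data matrix: a list of n points, each a list of l numbers.
    """
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # compute the lower triangle only
            d = math.dist(data[i], data[j])
            dist[i][j] = d
            dist[j][i] = d  # mirror it, since d(i,j) = d(j,i)
    return dist


# Hypothetical data matrix: n = 3 points, l = 2 dimensions
data = [[1.0, 2.0], [4.0, 6.0], [1.0, 6.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

The diagonal stays at zero and the inner loop visits only the pairs with j < i, which is why storing the triangular part is enough in practice.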

### Standardizing Numeric Data

- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
  - $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$:
    standard deviation of the population
  - $z$ is the distance between the raw score and the population mean in units
    of the standard deviation
  - negative when the raw score is below the mean, positive when above
- An alternative: calculate the mean absolute deviation $S_f = \dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$, where $m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$
  - standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
- **Using the mean absolute deviation is more robust to outliers than using the
  standard deviation**, because the deviations are not squared and extreme
  values therefore carry less weight
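
A minimal sketch of both standardizations, assuming a single numeric attribute held in a plain Python list. The function names and sample values are hypothetical, and the standard deviation here is the population version (dividing by n), which is an assumption rather than something the notes specify.

```python
import math


def z_scores_std(values):
    """Standardize with the mean and the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean m_f and the mean absolute deviation S_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_std = z_scores_std(values)
z_mad = z_scores_mad(values)
print(z_std)
print(z_mad)
```

For this sample the mean is 5, the standard deviation is 2, and the mean absolute deviation is 1.5. The point farthest from the mean (9) gets a larger standardized value under the mean absolute deviation, so it remains easier to spot as an outlier.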

### Proximity Measure for [[Binary|Binary Attributes]]

### Proximity Measure for [[nominal|Categorical Attributes]]

### [[Ordinal]] Variables