1 files changed, 69 insertions, 0 deletions
diff --git a/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
new file mode 100644
index 0000000..358ed89
--- /dev/null
+++ b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
@@ -0,0 +1,69 @@
+---
+id: Midterm - CS663
+aliases:
+  - Review
+tags: []
+---
+# Review
+## Types of Questions
+- True or false
+- Multiple choice
+- Explain (e.g., [[K-Means]], [[NMI]])
+- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
+- Computational questions (e.g., [[DBSCAN]] and [[OPTICS]], similar to your
+  assignment questions; [[FP-growth|fp-tree]] and [[pattern discovery]],
+  examples from the lecture; [[clustering evaluation]])
+
+## Subjects
+- [[Distance measures (mixted types of attributes)|Distance measures (mixed types of attributes)]]
+  - How to handle [[nominal]] attributes …
+    - [[nominal|Match or no-match]] (as a whole or individually)
+    - [[nominal|One-hot encoding]]
+    - [[Target encoding]]
+- Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
+- Clustering techniques:
+  - [[K-Means]] and its [[variants]]
+  - [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
+  - [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
+  - [[Complexity]], [[distance functions]]
+- How to measure clustering quality ([[internal]] and [[external]] measures,
+  [[F-measure]] and its averaging/combining options when applied to multiple
+  classes/clusters)
+- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])
+
+---
+
+## [[Data Matrix and Dissimilarity Matrix]]
+
+- Data matrix
+  - A data matrix of n data points with $l$ dimensions
+    ![[CleanShot 2023-10-23 at 17.37.59@2x.png]]
+- Dissimilarity (distance) matrix (n by n)
+  - n data points, but registers only the distance _d(i,j)_ (typically a
+    metric) ![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
+  - Usually symmetric, so only a triangular matrix needs to be stored
+  - **[[Distance functions]]** are usually different for real, boolean,
+    categorical, ordinal, ratio, and vector variables
+  - Weights can be associated with different variables based on applications and
+    data semantics
+
+### Standardizing Numeric Data
+
+- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
+  - $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$:
+    standard deviation
+  - the distance between the raw score and the population mean in units of the
+    standard deviation
+  - negative when the raw score is below the mean, positive when above
+- An alternative way: Calculate the mean absolute deviation $S_{f} =
+  \dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)$ where
+  $m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + ... + x_{nf})$
+  - standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
+- **Using the mean absolute deviation is more robust to outliers than using
+  the standard deviation**
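+
+A small worked example of the two standardizations, using made-up values
+(not from the lecture) for one attribute with a single outlier,
+$x_f = (2, 4, 6, 8, 30)$:
+
+- $m_f = \dfrac{1}{5}(2 + 4 + 6 + 8 + 30) = 10$
+- $S_f = \dfrac{1}{5}(8 + 6 + 4 + 2 + 20) = 8$
+- $\sigma_f = \sqrt{\dfrac{1}{5}(64 + 36 + 16 + 4 + 400)} = \sqrt{104} \approx 10.2$
+  (population standard deviation, assumed here)
+- For the outlier $x = 30$: $z = \dfrac{30 - 10}{8} = 2.5$ with $S_f$, versus
+  $z = \dfrac{30 - 10}{10.2} \approx 1.96$ with $\sigma_f$
+- Because the deviations are not squared, the outlier inflates $S_f$ less than
+  $\sigma_f$, so its standardized score stays larger and it remains easier to
+  spot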
+
+### Proximity Measure for [[Binary|Binary Attributes]]
+
+### Proximity Measure for [[nominal|Categorical Attributes]]
+
+### [[Ordinal]] Variables