Diffstat (limited to 'SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md')
-rw-r--r-- SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md | 69
1 files changed, 69 insertions, 0 deletions
diff --git a/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
new file mode 100644
index 0000000..358ed89
--- /dev/null
+++ b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
@@ -0,0 +1,69 @@
+---
+id: Midterm - CS663
+aliases:
+ - Review
+tags: []
+---
+# Review
+## Types of Questions
+- True or false
+- Multiple choice
+- Explain (e.g., [[K-Means]], [[NMI]])
+- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
+- Computational questions (e.g., [[DBSCAN]] and [[OPTICS]], similar to your
+ assignment questions; [[FP-growth|fp-tree]] and [[pattern discovery]],
+ examples from the lecture; [[clustering evaluation]])
+
+## Subjects
+- [[Distance measures (mixted types of attributes)|Distance measures (mixed types of attributes)]]
+ - How to handle [[nominal]] attributes …
+ - [[nominal|Match or no-match]] (as a whole or individually)
+ - [[nominal|One-hot encoding]]
+ - [[Target encoding]]
+- Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
+- Clustering techniques:
+ - [[K-Means]] and its [[variants]]
+ - [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
+ - [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
+ - [[Complexity]], [[distance functions]]
+- How to measure clustering quality ([[internal]] and [[external]] measures,
+ [[F-measure]] and its averaging/combining options when applied to multiple
+ classes/clusters)
+- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])
+
+---
+
+## [[Data Matrix and Dissimilarity Matrix]]
+
+- Data matrix
+ - A data matrix of n data points with $l$ dimensions
+ ![[CleanShot 2023-10-23 at 17.37.59@2x.png]]
+- Dissimilarity (distance) matrix (n by n)
+ - Stores only the pairwise distance _d(i,j)_ (typically a metric) for the n data points
+ ![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
+ - Usually symmetric with a zero diagonal, so only a triangular half needs to be stored
+ - **[[Distance functions]]** are usually different for real, boolean,
+ categorical, ordinal, ratio, and vector variables
+ - Weights can be associated with different variables based on applications and
+ data semantics
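
As a rough illustration of how the two matrices relate, here is a minimal sketch that builds the n-by-n dissimilarity matrix from a small data matrix. It assumes Euclidean distance on purely numeric attributes; the function name `dissimilarity_matrix` and the sample points are hypothetical and not taken from the lecture.

```python
import math


def dissimilarity_matrix(data):
    """Return the n-by-n matrix of Euclidean distances between the rows of `data`."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]  # diagonal stays 0 because d(i, i) = 0
    for i in range(n):
        for j in range(i):  # compute the lower triangle only
            dist = math.dist(data[i], data[j])
            d[i][j] = dist
            d[j][i] = dist  # mirror the value, since d(i, j) = d(j, i)
    return d


# Hypothetical data matrix: n = 3 data points, l = 2 dimensions
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(data)
for row in D:
    print(row)
```

Only the loop over `j < i` does real work, which is why a triangular matrix is enough in practice. A different distance function (for example one for nominal or ordinal attributes) could be swapped in for `math.dist`.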
+
+### Standardizing Numeric Data
+
+- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
+ - $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$:
+ standard deviation of the population
+ - the distance between the raw score and the population mean in units of the
+ standard deviation
+ - negative when the raw score is below the mean, positive when above
+- An alternative: calculate the mean absolute deviation $S_{f} =
+ \dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$ where
+ $m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$
+ - standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
+- **Using mean absolute deviation is more robust to outliers than using standard
+ deviation**, because the deviations are not squared and extreme values inflate $S_f$ less
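
The two standardizations above can be compared side by side. The following is a minimal sketch, assuming a single numeric attribute stored as a plain list; the function names and the sample values (with one deliberate outlier) are hypothetical.

```python
import math


def z_scores_std(values):
    """Standardize with the population standard deviation: z = (x - mu) / sigma."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]


def z_scores_mad(values):
    """Standardize with the mean absolute deviation: z_if = (x_if - m_f) / S_f."""
    n = len(values)
    m_f = sum(values) / n
    s_f = sum(abs(x - m_f) for x in values) / n
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values; the last one is an outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]
z_std = z_scores_std(values)
z_mad = z_scores_mad(values)
print("std-based:", [round(z, 3) for z in z_std])
print("MAD-based:", [round(z, 3) for z in z_mad])
```

In this sample the outlier inflates $\sigma$ more than $S_f$, so its MAD-based z-score is larger in magnitude and the outlier stays easier to spot.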
+
+### Proximity Measure for [[Binary|Binary Attributes]]
+
+### Proximity Measure for [[nominal|Categorical Attributes]]
+
+### [[Ordinal]] Variables