---
id: Midterm - CS663
aliases:
  - Review
tags: []
---

# Review

## Types of Questions
- True or false
- Multi-choice
- Explain (e.g., [[K-Means]], [[NMI]])
- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
- Computational questions (e.g., [[DBSCAN]] and [[OPTICS]] (similar to your assignment questions), [[FP-growth|fp-tree]] and [[pattern discovery]] (examples from the lecture), [[clustering evaluation]])

## Subjects
- [[Distance measures (mixted types of attributes)]]
  - How to handle [[nominal]] attributes …
    - [[nominal|Match or no-match]] (as a whole or individually)
    - [[nominal|One-hot encoding]]
    - [[Target encoding]]
  - Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
- Clustering techniques:
  - [[K-Means]] and its [[variants]]
  - [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
  - [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
  - [[Complexity]], [[distance functions]]
- How to measure clustering quality ([[internal]] and [[external]] measures, [[F-measure]] and its averaging/combining options when applied to multiple classes/clusters)
- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])

---

## [[Data Matrix and Dissimilarity Matrix]]
- Data matrix
  - A data matrix of n data points with _l_ dimensions
    ![[CleanShot 2023-10-23 at 17.37.59@2x.png]]
- Dissimilarity (distance) matrix (n by n)
  - n data points, but registers only the distance _d(i,j)_ (typically a metric)
    ![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
  - Usually symmetric, so only a triangular half needs to be stored
  - **[[Distance functions]]** are usually different for real, boolean, categorical, ordinal, ratio, and vector variables
  - Weights can be associated with different variables based on applications and data semantics

### Standardizing Numeric Data
- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
  - $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  - the distance between the raw score and the population mean in units of the standard deviation
  - negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation
  $S_f = \dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$
  where $m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$
  - standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
  - **Using the mean absolute deviation is more robust to outliers than using the standard deviation**: the deviations are not squared, so extreme values inflate $S_f$ less than they inflate $\sigma$

### Proximity Measure for [[Binary|Binary Attributes]]

### Proximity Measure for [[nominal|Categorical Attributes]]

### [[Ordinal]] Variables
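
Returning to the two standardizations listed under Standardizing Numeric Data, here is a minimal sketch that computes both for a single attribute. The function names `z_scores` and `mad_scores` and the sample values are hypothetical, chosen only for illustration; the code assumes the attribute is not constant (otherwise the divisor is zero).

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Standardize with the population mean (mu) and standard deviation (sigma)."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        raise ValueError("constant attribute: standard deviation is zero")
    return [(x - mu) / sigma for x in values]


def mad_scores(values):
    """Standardize with the mean absolute deviation S_f instead of sigma."""
    m_f = fmean(values)
    s_f = fmean([abs(x - m_f) for x in values])
    if s_f == 0:
        raise ValueError("constant attribute: mean absolute deviation is zero")
    return [(x - m_f) / s_f for x in values]


# Hypothetical attribute values with one outlier (10).
sample = [1, 2, 3, 4, 10]
print(round(z_scores(sample)[-1], 3))    # 1.897
print(round(mad_scores(sample)[-1], 3))  # 2.5
```

In this sample the outlier squares to a large term that inflates $\sigma$ (about 3.16) more than $S_f$ (2.4), so its standardized value is compressed less under the mean absolute deviation and it stays easier to spot.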