Diffstat (limited to 'SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md')
| -rw-r--r-- | SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md | 69 |
1 files changed, 69 insertions, 0 deletions
diff --git a/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
new file mode 100644
index 0000000..358ed89
--- /dev/null
+++ b/SI/Resource/Fundamentals of Data Mining/Midterm - CS663.md
@@ -0,0 +1,69 @@
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+id: Midterm - CS663
+aliases:
+  - Review
+tags: []
+------------------------------------------------------------------------------------------------------------
+# Review
+## Types of Questions
+- True or false
+- Multiple choice
+- Explain (e.g., [[K-Means]], [[NMI]])
+- [[Compare and Contrast]] (e.g., [[clustering algorithms]])
+- Computational questions (e.g., [[DBSCAN]] and [[OPTICS]] (similar to your
+  assignment questions), [[FP-growth|fp-tree]] and [[pattern discovery]]
+  (examples from the lecture), [[clustering evaluation]])
+
+## Subjects
+- [[Distance measures (mixted types of attributes)]]
+  - How to handle [[nominal]] attributes …
+  - [[nominal|Match or no-match]] (as a whole or individually)
+  - [[nominal|One-hot encoding]]
+  - [[Target encoding]]
+- Normalization ([[z-score]], [[mixted types of attributes|min-max]], …)
+- Clustering techniques:
+  - [[K-Means]] and its [[variants]]
+  - [[Hierarchical Clustering]] ([[Hierarchical Clustering|Agglomerative]])
+  - [[Density-based Clustering]] ([[DBSCAN]], [[OPTICS]])
+  - [[Complexity]], [[distance functions]]
+- How to measure clustering quality ([[internal]] and [[external]] measures,
+  [[F-measure]] and its averaging/combining options when applied to multiple
+  classes/clusters)
+- Frequent pattern mining ([[Apriori]] Algorithm, [[FP-growth]])
+
+---
+
+############################################################################ [[Data Matrix and Dissimilarity Matrix]]
+
+- Data matrix
+  - A data matrix of n data points with _l_ dimensions ![[CleanShot 2023-10-23 at
+    17.37.59@2x.png]]
+- Dissimilarity (distance) matrix (n by n)
+  - n data points, but registers only the distance _d(i,j)_ (typically
+    metric)![[CleanShot 2023-10-23 at 17.41.47@2x.png]]
+  - Usually symmetric, thus a triangular matrix
+  - **[[Distance functions]]** are usually different for real, boolean,
+    categorical, ordinal, ratio, and vector variables
+  - Weights can be associated with different variables based on applications and
+    data semantics
+
+### Standardizing Numeric Data
+
+- [[Z-score]]: $z = \dfrac{x - \mu}{\sigma}$
+  - $x$: raw score to be standardized, $\mu$: mean of the population, $\sigma$:
+    standard deviation
+  - the distance between the raw score and the population mean in units of the
+    standard deviation
+  - negative when the raw score is below the mean, positive when above
+- An alternative way: Calculate the mean absolute deviation $S_{f} =
+  \dfrac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)$ where
+  $m_f = \dfrac{1}{n}(x_{1f} + x_{2f} + ... + x_{nf})$
+  - standardized measure (z-score): $z_{if} = \dfrac{x_{if} - m_f}{S_f}$
+- **Using mean absolute deviation is more robust to outliers than using standard
+  deviation**
+
+### Proximity Measure for [[Binary|Binary Attributes]]
+
+############################################################################ Proximity Measure for [[nominal|Categorical Attributes]]
+
+############################################################################ [[Ordinal]] Variables
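The two formulas under "Standardizing Numeric Data" above can be checked with a short sketch. It is not part of the committed note. The function names `z_score` and `mad_standardize` and the sample values are hypothetical, chosen only for illustration, and the standard deviation is assumed to be the population form, matching $\mu$ and $\sigma$ in the note.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Standardize with the population standard deviation: z = (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population form, as assumed in the lead-in
    return [(x - mu) / sigma for x in values]


def mad_standardize(values):
    """Standardize with the mean absolute deviation: z_if = (x_if - m_f) / S_f."""
    m_f = fmean(values)
    s_f = fmean(abs(x - m_f) for x in values)  # S_f: mean absolute deviation
    return [(x - m_f) / s_f for x in values]


# Hypothetical sample attribute: mean 5, sigma 2, S_f 1.5
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(z_score(values)[0])          # -1.5
print(mad_standardize(values)[0])  # -2.0
```

Because $S_f$ sums absolute rather than squared deviations, a single extreme value inflates it less than it inflates $\sigma$, which is one way to read the robustness remark in the note.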
