This course covers the basics of the knowledge discovery process and data mining. It focuses on three main aspects of knowledge discovery: mathematical tools, programming infrastructure for manipulating large datasets, and a well-defined, structured knowledge discovery process consisting of a number of interactive and iterative steps. The examples concentrate on discovering knowledge in large collections of text documents.

In recent years the amount of data that we produce has increased dramatically. Very soon we will produce more data than we are able to store with current technological solutions. Therefore, making sense of these huge amounts of data, i.e. extracting useful, valid, understandable, and novel patterns from them, is of crucial importance. Knowledge discovery and data mining is one approach to tackling this problem. Other similar, but somewhat different, approaches include database technology, machine learning, and statistics.

In this course we will investigate, analyze, and discuss a well-defined process for knowledge discovery in such large data. Apart from the process, we will also discuss the mathematics needed for data mining, as well as new programming methods such as MapReduce that have been designed to process large-scale data.
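To give a flavor of the MapReduce style mentioned above, here is a minimal, illustrative word-count sketch in plain Python (the shuffle step between map and reduce is simulated in a single process; real frameworks distribute it across machines):

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word occurrence."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reducer: sum the counts per word (the shuffle/grouping is simulated here)."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["knowledge discovery in data", "data mining and knowledge discovery"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts["knowledge"])  # 2
```

The point of the pattern is that both phases operate on independent key-value pairs, so they can be parallelized over very large inputs.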

- Denis Helic (website)
- Roman Kern

Course topics include:

- Review of mathematics needed in data mining
- Knowledge discovery process
- Infrastructure for large scale data mining
- Text classification and clustering
- Semantic analysis of text documents

In this course the students will:

- Learn about the mathematical basics of data mining algorithms
- Learn about the infrastructure for large scale data mining
- Learn about the steps from a knowledge discovery process
- Learn about selected data mining algorithms

At the end of this course the students will know how to:

- Analyze and design a typical knowledge discovery project (without the implementation, which is covered in KDDM2)

- 06.10.2016: Course organization, Introduction and Motivation
- 13.10.2016: Review of probability theory and linear algebra
- 20.10.2016: Preprocessing
- 27.10.2016: Feature Extraction
- 03.11.2016: Feature Engineering
- 10.11.2016: Partial Exam 1 / Project presentations
- 17.11.2016: Data Matrices
- 24.11.2016: Principal Component Analysis
- 01.12.2016: SVD and Latent Semantic Analysis
- 15.12.2016: Recommender Systems: Matrix Factorization
- 12.01.2017: Classification
- 19.01.2017: Clustering
- 26.01.2017: Partial Exam 2 / Project presentations
- 02.02.2017: Examination (Sample questions)

- Mining massive datasets
- Advanced Data Analysis from an Elementary Point of View
- Probability primer YouTube series
- Lecture slides "Stochastic Systems" from University of Virginia
- Lecture slides "Mining Massive Datasets" from Stanford University
- Introduction to Information Retrieval
- Machine Learning Course by Andrew Ng
- Machine Learning Course by Pedro Domingos

- Probability Essentials by Jacod and Protter
- Machine Learning by Tom Mitchell

There will be two partial examinations written during the classes. Each exam takes place at the beginning of a lecture and lasts 45 minutes. Each partial examination consists of 3 questions, with the difficulty adjusted so that all questions can be solved in approx. 35 minutes. You can get a maximum of 15 points per question, resulting in a total of 90 points across both exams.

Apart from the partial examinations, there will be a standard written examination at the end of the course. It consists of 4 questions with a maximum of 20 points each, so the total number of points that can be reached is 80.

The grading scheme is as follows:

- 0-40 points: 5
- 41-50 points: 4
- 51-60 points: 3
- 61-70 points: 2
- 71-80 points: 1
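The scheme above can be expressed as a small lookup function (purely illustrative, not part of the course material):

```python
def grade(points):
    """Map examination points (0-80) to a grade according to the scheme above."""
    if points <= 40:
        return 5
    elif points <= 50:
        return 4
    elif points <= 60:
        return 3
    elif points <= 70:
        return 2
    else:
        return 1

print(grade(55))  # 3
```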

Points from study year: 2016

The instructions for the practical project.