Knowledge Discovery and Data Mining 2

VU (706.715)

The goal of this course is to continue and build upon the theory covered in Knowledge Discovery and Data Mining 1. For KDDM2 the emphasis lies on practical aspects, and therefore the practical exercise is an integral part of this course. In addition, a number of algorithmic approaches related to the project topics will be covered in detail. Participants may choose one out of a number of proposed projects, covering different stages of the Knowledge Discovery process and different data sets.

Knowledge Discovery and Data Mining are closely related to the concepts of Big Data, Data Science and Data Analytics. Data science encompasses a number of diverse skills, ranging from software engineering to data mining and statistics, following a scientific, evaluation-driven approach. This course should help to develop some of these skills, with a focus on the areas of Natural Language Processing, Machine Learning and Information Retrieval. In addition, strong skills in analysing and preprocessing big data sets are a necessary prerequisite for many Knowledge Discovery applications.

The slides and resources from the previous years are available here: 2016, 2015, 2014

Lecturers

  • Roman Kern
  • Denis Helic

Content

Course topics include:

  • Data Mining
  • Information Retrieval
  • Pattern Mining
  • Machine Learning
  • Text Mining
  • Time Series

Theoretical Goals

In this course the students will learn about:

  • The KDD process in detail
  • Working with big data sets
  • Real-world problem settings
  • Advanced statistics and algorithms

Practical Goals

At the end of this course the students will know how to:

  • Preprocess (big) data sets
  • Perform feature engineering on (textual) data
  • Apply clustering and classification algorithms
  • Apply information retrieval and recommender algorithms

Lectures

Work Plan

Created with GanttProject: kddm2-project-plan-2017.gan

Scientific Publications

E-Mail Processing

  • Kern, R., Seifert, C., Zechner, M., & Granitzer, M. (2011, September). Vote/Veto Meta-Classifier for Authorship Identification Notebook for PAN at CLEF 2011. In CLEF (Notebook Papers/Labs/Workshop) (Vol. 11).
  • Lampert, A., Dale, R., & Paris, C. (2009, August). Segmenting email message text into zones. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2 (pp. 919-928). Association for Computational Linguistics.
  • Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Web Resources

All students are required to register for the VU in TUGOnline by 09.03.2017, 23:59.

There will be no written (or oral) examination, as the focus lies on the practical exercise. Therefore, the grading will depend entirely on how the exercise is conducted and on its results.

Overview

There are several practical projects from different phases of the Knowledge Discovery process to choose from.

The work on the projects will be conducted by single students on their own (groups of one). There is also the possibility to form groups of two, in which case the project scope is expanded appropriately (see the Advanced part of the respective topic). The students are expected to present the progress of their work in two presentations: i) an analysis of the data set and an outline of the algorithmic approach, and ii) a presentation of the conducted work and the results.

For all projects the evaluation is considered to be an integral part. One needs to be able to state how well a solution works and what its expected limitations are.

There are data sets proposed for each of the projects, but students are free to come up with data sets of their own, or to make project proposals.

Grading is based on these criteria:

  • How is the problem tackled?
  • What is the complexity of the solution?
  • How is the solution evaluated?
  • How are the problem and solution presented?

Presentations

Interim Presentation

  • What problem are you working on?
  • What data set do you plan to use?
  • What is the planned approach?
  • Why did you choose this approach?
  • What are your expected results?
  • What problems have you encountered so far?

Final Presentation

  • How have you tackled the problem?
  • What are your evaluation results (is the problem solved)?
  • What have you learnt (new insights)?
  • Did something unexpected happen?
  • Would the solution apply to other scenarios (and how well)?

Topics

1. Document Representation

Task: Build and evaluate different approaches to combine multiple word vectors into a single document vector.

Data-Set: Stanford Sentiment Treebank Dataset

Advanced: Demonstrate its application in the context of sentiment detection

Keywords: Word2Vec, Doc2Vec
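
A minimal sketch of one common baseline, averaging the word vectors of all tokens into a document vector (assuming gensim and numpy; the toy corpus and parameters are placeholders, in the project the Stanford Sentiment Treebank would be used):

  import numpy as np
  from gensim.models import Word2Vec

  # Toy corpus standing in for the real training data.
  sentences = [["the", "movie", "was", "great"],
               ["a", "dull", "and", "slow", "film"]]
  model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

  def doc_vector(tokens, model):
      # Average the vectors of all in-vocabulary tokens.
      vecs = [model.wv[t] for t in tokens if t in model.wv]
      if not vecs:
          return np.zeros(model.vector_size)
      return np.mean(vecs, axis=0)

  print(doc_vector(["great", "movie"], model))

More elaborate combinations (e.g. weighted averages, or Doc2Vec as a learned alternative) can then be evaluated against this baseline.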

Questions about the topic: Roman Kern

2. E-Mail Parsing

Task: Parse a semi-structured data set with the goal of adding more structure to the data. In particular, the goal is to separate correct sentences from other textual fragments, e.g. ASCII tables, greetings, etc.
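
A minimal heuristic sketch of this separation, classifying each line of an e-mail body by simple surface features (the patterns and the 0.5 letter-ratio threshold are illustrative assumptions, not tuned values):

  import re

  def classify_line(line):
      stripped = line.strip()
      if not stripped:
          return "empty"
      if stripped.startswith(">"):
          return "quoted"
      if re.match(r"^[-_=+|*]{3,}", stripped):
          return "separator"
      # Lines dominated by non-letter characters are likely tables or ASCII art.
      letters = sum(c.isalpha() for c in stripped)
      if letters / len(stripped) < 0.5:
          return "fragment"
      return "prose"

  body = "Hi all,\n> old reply\n+----+----+\nSee you tomorrow."
  for line in body.splitlines():
      print(classify_line(line), "|", line)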

Use-Case: Pre-processing of textual data for further processing, e.g. for prediction of potential receivers.

Data-Set: The data being worked on are e-mails, which are already semi-structured; the header contains information such as the sender.

Suggested data-sets:

Advanced: Write a prediction algorithm that is able to recover the sender and the receivers once they have been removed from the mail.

Questions about the topic: Roman Kern

3. Sensor Analytics

Task: Collect sensor data and detect certain states (only for teams)

Option A: Biosensors (e.g. heart rate, ...), detect positions

Option B: Industrial sensors (e.g. temperature, ...), estimate the number of people within a room

Option C: Fluid & gas sensors (e.g. CO2, ...), detect certain liquids

Data-Set: Needs to be collected in course of the project
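
Independent of the chosen option, state detection can start from something as simple as windowed features with a threshold. A minimal sketch, assuming numpy (window size and threshold are illustrative, not tuned values):

  import numpy as np

  def detect_states(signal, window=10, threshold=1.0):
      # Label each window 'active' if its variance exceeds the threshold.
      states = []
      for start in range(0, len(signal) - window + 1, window):
          chunk = signal[start:start + window]
          states.append("active" if np.var(chunk) > threshold else "idle")
      return states

  # Synthetic stream: a calm phase followed by a noisy (active) phase.
  rng = np.random.default_rng(0)
  stream = np.concatenate([rng.normal(0, 0.1, 50), rng.normal(0, 3.0, 50)])
  print(detect_states(stream))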

Questions about the topic: Roman Kern

4. Machine Learning

Task: Automatic tagging of resources, using either an unsupervised or a supervised approach. The goal is to apply tags to an unseen resource.

Approach: For this project the approach may vary widely. A supervised approach requires a training dataset and may include classification algorithms. Unsupervised approaches may either look at a set of resources or a single resource at a time.
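
A minimal sketch of the supervised variant, casting tagging as multi-label classification with TF-IDF features and one-vs-rest logistic regression (assuming scikit-learn; the toy documents and tags are placeholders for e.g. Stack Exchange data):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.multiclass import OneVsRestClassifier
  from sklearn.preprocessing import MultiLabelBinarizer
  from sklearn.pipeline import make_pipeline

  docs = ["how to join two tables in sql",
          "python list comprehension example",
          "sql query optimization tips",
          "sorting a list in python"]
  tags = [["sql"], ["python"], ["sql"], ["python"]]

  # Encode tag sets as a binary indicator matrix (one column per tag).
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(tags)

  # One binary classifier per tag on top of shared TF-IDF features.
  clf = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression()))
  clf.fit(docs, y)

  pred = clf.predict(["index tuning for sql queries"])
  print(mlb.inverse_transform(pred))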

Suggested data-sets:

  • Stack Exchange
    One example of tagged data is the Stack Exchange pages, which can be downloaded here: Stack Exchange Dump
  • Last.fm
    The Million Song Dataset contains tracks, similar tracks, and tags. The dataset is already split into a training and a testing set and can be accessed here: Last.fm Dataset

Advanced: Implement an unsupervised and a supervised approach, and then compare the two approaches. Measure their differences in accuracy as well as discuss their individual strengths and weaknesses.

Questions about the topic: Roman Kern

5. Information Retrieval - Query Completion

Scenario: A user is starting to search by typing in some words...

Task: The system should automatically suggest word completions, depending on the already entered words.
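
A minimal sketch of such a completion system, backed by a bigram model: completions are ranked by how often they follow the previous word and filtered by the typed prefix (the toy corpus stands in for e.g. Wikipedia text):

  from collections import defaultdict

  corpus = ["graz university of technology",
            "graz is the capital of styria",
            "university of vienna"]

  # Count how often each word follows each other word.
  bigrams = defaultdict(lambda: defaultdict(int))
  for line in corpus:
      tokens = line.split()
      for prev, cur in zip(tokens, tokens[1:]):
          bigrams[prev][cur] += 1

  def complete(prev_word, prefix, k=3):
      # Suggest up to k completions that follow prev_word and start with prefix.
      candidates = bigrams[prev_word]
      ranked = sorted(candidates, key=candidates.get, reverse=True)
      return [w for w in ranked if w.startswith(prefix)][:k]

  print(complete("graz", "u"))   # e.g. ['university']
  print(complete("of", "s"))     # e.g. ['styria']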

Suggested data-sets:

  • Wikipedia
    Wikipedia can be downloaded as a dump, either as XML or as a MySQL database, from the Wikimedia website.
  • Europeana
    Instead of using a dataset to retrieve relevant items from, one can directly use a search engine. The Europeana project directly supports a JSON query interface, which can be accessed with an API key: Europeana API Portal

Framework:

  • For processing of the text you might use: Sensium

Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.

Questions about the topic: Roman Kern

6. Information Retrieval - Blog Search

Task: Provide a list of matching resources for a given piece of text. The goal is to produce a ranked list of items relevant to a context.

Use-Case: Consider a user writing a text, for instance a blog post. While typing the user is presented a list of suggested items, which might be relevant or helpful.
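
A minimal sketch of this ranking, scoring the items of a collection against the current writing context via TF-IDF and cosine similarity (assuming scikit-learn; the item list is a placeholder for the real collection):

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  items = ["history of graz and its university",
           "machine learning for text classification",
           "travel guide for styria and graz"]

  # Index the collection once.
  vectorizer = TfidfVectorizer()
  item_vecs = vectorizer.fit_transform(items)

  def rank(context, k=2):
      # Return the k items most similar to the current writing context.
      query_vec = vectorizer.transform([context])
      scores = cosine_similarity(query_vec, item_vecs)[0]
      order = scores.argsort()[::-1][:k]
      return [(items[i], round(float(scores[i]), 3)) for i in order]

  print(rank("writing a blog post about graz"))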

Suggested data-sets: Same as previous task.

Framework:

  • For processing of the text you might use: Sensium

Advanced: Identify Wikipedia concepts within the written text. For example, if the text contains the word Graz it should be linked to the corresponding Wikipedia page.

Questions about the topic: Roman Kern

7. Timeseries Prediction

Task: Given a set of sensor data for multiple streams, e.g. temperature and power consumption, the goal is to predict the future values of these signals, optimally including a confidence range.

Approach: Take the stream of data and build a prediction algorithm that is able to predict the future values of the streams as accurately as possible.
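
A minimal sketch of one such algorithm, autoregression: a linear model is fit on sliding windows of past values to predict the next value (assuming numpy and scikit-learn; the synthetic signal and window size are illustrative):

  import numpy as np
  from sklearn.linear_model import LinearRegression

  # Synthetic stream standing in for a real sensor signal.
  t = np.arange(200)
  signal = np.sin(t / 10.0) + 0.05 * np.random.default_rng(0).normal(size=200)

  # Each row of X is a window of past values; y is the value that follows.
  window = 20
  X = np.array([signal[i:i + window] for i in range(len(signal) - window)])
  y = signal[window:]

  model = LinearRegression().fit(X[:150], y[:150])
  pred = model.predict(X[150:])
  print("mean absolute error:", np.mean(np.abs(pred - y[150:])))

A confidence range could then be estimated, for instance, from the variance of the residuals on held-out data.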

Suggested data-sets:

  • Intel Berkeley Research Lab
    Take for example the data from the sensors of the Intel Berkeley Research Lab, see Stream Data Mining Repository. The data is in a format used by many machine learning frameworks, e.g. Weka.
  • Powersupply Stream
    Use the power supply stream from the same data source: Stream Data Mining Repository. Here the challenge is to integrate seasonality into the analysis.
  • UCI Repository
    Repository of multiple data-sets, including timeseries.

Advanced: Try to detect events (e.g. meetings) within the data. This is a hard task, as there is no ground truth to evaluate against; thus it is part of the project to come up with strategies on how to measure the quality of the algorithms.

Questions about the topic: Roman Kern

8. Pattern Mining in Timeseries

Task: Given a set of sensor data for multiple streams, e.g. temperature and power consumption, apply sequential pattern mining on the time series data-set (optionally apply SAX beforehand).
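
A minimal sketch of the optional SAX step: z-normalize the series, reduce it with piecewise aggregate approximation (PAA), and map segment means to symbols via Gaussian breakpoints (assuming numpy; the breakpoints shown are those for an alphabet of size four):

  import numpy as np

  def sax(series, n_segments=8, alphabet="abcd"):
      # Z-normalize so the Gaussian breakpoints apply.
      series = (series - series.mean()) / series.std()
      # Piecewise Aggregate Approximation: mean of each segment.
      segments = np.array_split(series, n_segments)
      paa = np.array([seg.mean() for seg in segments])
      # Breakpoints dividing N(0, 1) into four equiprobable regions.
      breakpoints = np.array([-0.67, 0.0, 0.67])
      return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

  t = np.linspace(0, 4 * np.pi, 64)
  print(sax(np.sin(t)))

The resulting symbol sequences can then be fed into a sequential pattern mining algorithm.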

Suggested data-sets:

  • UCI Repository
    Repository of multiple data-sets, including timeseries.

Advanced: Pre-process the data via piecewise linear approximation.

Questions about the topic: Roman Kern

9. Spectral and K-Means Clustering

Task: Spectral clustering is a traditional method for partitioning (graph) data into different groups. The algorithm relies on calculating the eigenvectors of the graph Laplacian matrix and dividing the graph according to the values in these eigenvectors. The traditional algorithm analyses the second-smallest eigenvector of the graph Laplacian. Extend this approach to include more eigenvectors and apply K-means on the eigenvectors.

Approach: Calculate the eigenvectors of the (normalized) graph Laplacian. Analyse the spectrum of the graph by visually investigating the values in the eigenvectors. Select a number of eigenvectors and map the graph to the coordinate space spanned by those eigenvectors. Apply K-means clustering on the points from the new space.
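
A minimal sketch of this pipeline on a toy graph with two obvious communities (assuming numpy and scikit-learn; shown with the unnormalized Laplacian, whereas the normalized variant is part of the analysis):

  import numpy as np
  from sklearn.cluster import KMeans

  # Adjacency matrix of a small graph with two dense blocks.
  A = np.array([[0, 1, 1, 0, 0, 0],
                [1, 0, 1, 0, 0, 0],
                [1, 1, 0, 1, 0, 0],
                [0, 0, 1, 0, 1, 1],
                [0, 0, 0, 1, 0, 1],
                [0, 0, 0, 1, 1, 0]], dtype=float)

  D = np.diag(A.sum(axis=1))
  L = D - A                        # unnormalized graph Laplacian

  eigvals, eigvecs = np.linalg.eigh(L)
  k = 2
  embedding = eigvecs[:, 1:k + 1]  # skip the trivial constant eigenvector

  labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
  print(labels)                    # nodes 0-2 vs. 3-5 should separate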

Suggested data-sets:

Paper:

Questions about the topic: Denis Helic