Databases 2

VU (706.711 / 707.030)

This course discusses the database systems that deal with large-scale semi-structured data. Typical examples of such data include large hypertext collections such as Wikipedia or the Web. Another example is the data generated by various user activities in distributed database systems, such as Web access logs.

In recent years the amount of semi-structured data that users generate increased significantly. For example, in various Web applications users either generate content or leave huge amount of trails when interacting with the system, e.g. access logs. When dealing with such data the traditional relational database systems very soon hit their limits. Therefore, a new database technology has emerged recently to manipulate these huge data collections of semi-structured and unstructured data. Typically, this new technology is called "NoSQL".

In this course we will investigate some of the properties of NoSQL systems and we will discuss in more details technologies such as Map-Reduce or graph databases.

  • Roman Kern (website)
  • Mark Kröll

Content

Course topics include:

  • NoSQL
  • Map-Reduce
  • Stream processing
  • Graph databases
  • Timeseries databases

Theoretical Goals

In this course the students will:

  • Understand the properties of semi-structured data
  • Learn about NoSQL database systems
  • Learn about Map-Reduce programming method
  • Understand the workings of graph databases

Practical Goals

At the end of this course the students will know how to:

  • To use traditional, relational databases and contemporary, NoSQL databases
  • Work with large datasets
  • Conduct analytics with (sensor) data
Mondays, 13:00 - 13:45, HS i9

Data Sets

Dataset used for the practical project.

Online resources

About

Goal of the practical project is to experience the differences between traditional databases and NoSQL databases first hand. There is a practical project which needs to be conducted during the lecture duration. You will need to compare traditional database concepts with NoSQL database concepts.

The Studentlab dataset

  • Approx. 8 GB of sensor data recorded by approx. 80 sensors
    • Mainly located in our Studentlab and offices
  • Used tinkerforge, gembird and openweathermap to record
    • Indoor temperature, illuminance, CO2 levels, etc.
    • Hardware performance values such as voltage, current, power, etc.
    • Outdoor temperature, cloudiness, wind, rain volume, air pressure, etc.

Steps

There are two sets of requests: i) mandatory requests, and ii) hypothesis requests. All mandatory requests need to be implemented, while you a free in how many hypothesis you want to define (minimum of one). You may choose one (or multiple) of the suggested hypothesis, or define your own.

  • Define a number of hypothesis
    • Each hypothesis should be validated via a number of database requests
  • Pick a relational database of your choice
    • Populate the relational database with the data from the dump
    • Conduct the database requests (mandatory request + hypothesis requests) with the relational database
    • Collect measures of performance, e.g., memory consumption, runtime, ...
  • Pick a NoSQL database of your choice
    • Populate the NoSQL database with the data from the dump
    • Conduct the database requests (mandatory request + hypothesis requests) with the NoSQL database
    • Collect measures of performance, e.g., memory consumption, runtime, ...
  • Optionally, use a distributed technique with NoSQL database
  • Collect and present results
    • Prepare slides for a presentation of about 10 minutes
    • Present your results in the "Presentation of student projects" lecture

Mandatory Requests

  • Retrieval all data within a given time range for a single sensor
  • Retrieve a single data point for a given time stamp and sensor
    • Interpolation vs. forward fill (latest value)
  • Maximum, minimum and average within a time range for a single sensor
  • Generate a "soft sensor"
    • New sensor as a combination of at least two other sensors

Potential Hypothesis

  • In the room "PZ2 05 050" there are two sets of sensors (prefix "04/pz205050" and "05/pz205050")
    • Are both sets necessary (i.e., can one set nearly perfectly predict the other set)?
  • Are there "office-hours" in the sensor data?
    • Are the differences between the rooms?
  • Are correlations between the temperature sensors and i) the outside temperature, and ii) the intensity of the sun?
    • For which room/sensor is the correlation the highest?
  • What is the relation between "power" and "apparent power"?
    • Is the difference dependant on the "power" value?
  • Is there a relation between the activity of the server ("id: studentlab/server01") and its power consumption ("id: html/p1eg028f/server01")?
    • Is this relation time dependant?
  • Are there reliable/unreliable sensors?
    • Which sensor-ids have the fewest or most missing values?
  • Are differences in the duration of the sensors?
    • Are there sensors which continuously produce data, and sensors that only generate data on occasions?

Method, grading

Total points for the practical project is 40. At the end of the term you will also have a final examination with two questions (20 points each for total of 40). You will have to have at least 10 points for the project and 10 points for the examination and 41 points combined to get a positive grade.

Students who do not conduct/present the project will get zero points on the first part and will automatically fail the course!

The grading scheme is as follows:

  • 0-40 points: 5
  • 41-50 points: 4
  • 51-60 points: 3
  • 61-70 points: 2
  • 71-80 points: 1