When you start processing streams of events, there always comes a time to decide how to group them. There are a few kinds of window functions that we can use for such grouping.

First, I have to state the obvious: all window operations output their result at the end of the window. Of course they do; a five-second-long sliding window cannot see events from the future.

When we think about grouping events for a while, it becomes clear that this is the only possible option, but some people forget about it during job interviews. Hopefully, you will remember it when you hear a tricky question about window functions.

The other trap is that some projects use the names of window functions inconsistently. For example, what Apache Flink calls a sliding window is described as a hopping window in Azure Stream Analytics.

Window functions

Now, let’s move to the actual topic. I’m going to begin with the most popular type of window function: the sliding window. It comes in two variants; we can have either a time-based sliding window or an eviction-based sliding window.

If the tool you use implements the sliding window as an actual sliding window, you will never get an empty set as the output. Of course, if the “sliding window” you are using is, in fact, a hopping window in disguise, empty results may occur.

The time-based sliding window gives us the events that happened during the last t seconds. Let’s look at an example. We have a ten-second-long stream of events, which we group into five-second-long sliding windows.
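
To make this concrete, here is a minimal Python sketch of a time-based sliding window. It is my own illustration, not the implementation of any particular streaming engine. I assume that events arrive as (timestamp, value) pairs sorted by timestamp, that a result is emitted every time an event arrives, and that the window covers the interval (t - length, t]. The function name and the sample events are made up.

```python
from collections import deque


def time_sliding_windows(events, length):
    """Yield (timestamp, values) for every incoming event.

    Assumption: `events` is an iterable of (timestamp, value) pairs
    sorted by timestamp, and `length` is a positive number of seconds.
    """
    window = deque()
    for ts, value in events:
        window.append((ts, value))
        # Evict everything that is `length` seconds old or older,
        # measured from the newest event.
        while window[0][0] <= ts - length:
            window.popleft()
        yield ts, [v for _, v in window]


# Hypothetical ten-second-long stream of events.
events = [(1, "a"), (2, "b"), (4, "c"), (7, "d"), (8, "e"), (10, "f")]
results = list(time_sliding_windows(events, 5))

for ts, values in results:
    print(ts, values)
```

In this sketch the newest event is always inside its own window, so no result is ever empty.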

An eviction-based sliding window always contains n elements. For example, I can apply the sliding window function to get five-element slices of the most recent events.
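
The five-element case can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name is hypothetical, the events are plain numbers, and I emit nothing until the first n elements have arrived, so that every emitted window really has n elements.

```python
from collections import deque


def count_sliding_windows(events, size):
    """Yield the last `size` events after each new event arrives."""
    # A deque with maxlen evicts the oldest element on its own
    # when a new one is appended to a full window.
    window = deque(maxlen=size)
    for event in events:
        window.append(event)
        if len(window) == size:  # assumption: skip partial windows
            yield list(window)


# Hypothetical stream of seven events.
windows = list(count_sliding_windows(range(1, 8), 5))

for w in windows:
    print(w)
```

Every new event pushes the oldest one out, which is where the name "eviction-based" comes from.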

The third kind of window function is a hopping window. As the name suggests, it is a window that “jumps.” Because of that, we must specify both the length of the window and the length of the jump. They do not need to be the same number!

For example, I can specify 2.5-second-long jumps and a five-second-long window.
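
Here is a rough Python sketch of that configuration. It is an illustration, not how any specific engine aligns its windows: I assume the first window starts at second 0, each window covers [start, start + length), and only windows that fit entirely inside the ten-second stream are produced. The function name and the sample events are hypothetical.

```python
def hopping_windows(events, length, hop, start, end):
    """Return a list of ((window_start, window_end), values) pairs.

    Assumption: `events` is a list of (timestamp, value) pairs and
    windows are aligned to `start`, not to the epoch.
    """
    windows = []
    i = 0
    while start + i * hop + length <= end:
        lo = start + i * hop
        hi = lo + length
        # An event belongs to every window whose interval covers it.
        values = [v for ts, v in events if lo <= ts < hi]
        windows.append(((lo, hi), values))
        i += 1
    return windows


# Hypothetical ten-second-long stream of events.
events = [(1, "a"), (2, "b"), (4, "c"), (7, "d"), (8, "e"), (9.5, "f")]
windows = hopping_windows(events, length=5, hop=2.5, start=0, end=10)

for bounds, values in windows:
    print(bounds, values)
```

Because the jump is shorter than the window, neighboring windows overlap and some events show up in two of them. A window that covers no events would come out as an empty list.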

The last kind of window function is the tumbling window. It is a hopping window whose jump is equal to its length, so the windows never overlap. In this case, my example events get grouped into only two windows.
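
A tumbling window is simple enough to sketch as a plain bucketing function. As before, this is my own minimal illustration: the function name and the events are made up, I assume windows are aligned to second 0, and this version only returns windows that contain at least one event.

```python
from collections import defaultdict


def tumbling_windows(events, length):
    """Group (timestamp, value) pairs into non-overlapping windows.

    Returns a dict that maps each window start to its values.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Every timestamp falls into exactly one window.
        window_start = (ts // length) * length
        buckets[window_start].append(value)
    return dict(sorted(buckets.items()))


# Hypothetical ten-second-long stream of events.
events = [(1, "a"), (2, "b"), (4, "c"), (7, "d"), (8, "e"), (9, "f")]
windows = tumbling_windows(events, 5)

for start, values in windows.items():
    print(start, start + 5, values)
```

With five-second windows over a ten-second stream, every event lands in exactly one of the two buckets.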
