This article shows you how to use the window function and random sorting to select a random sample of rows grouped by a column.

First, we will use the window function to group the rows by a given column and order them randomly. Let’s assume that I have an Athena table called the_table with a column called column_A. In this case, the window function looks like this:

ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn

I will put that window function in a subquery:

WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)

If I want to get 1000 random samples in every group, I have to select the rows with the rn parameter equal or less than 1000:

WITH random_order AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY RANDOM()) AS rn
    FROM the_table
)
SELECT *
FROM random_order
WHERE rn <= 1000

Want to build AI systems that actually work?

Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.

Want to build AI systems that actually work?

Download my expert-crafted GenAI Transformation Guide for Data Teams and discover how to properly measure AI performance, set up guardrails, and continuously improve your AI solutions like the pros.

Older post

Use HttpSensor to pause an Airflow DAG until a website is available

Pause an Airflow DAG until an HTTP endpoint returns 200 OK

Newer post

Copy directories in S3 using s3-dist-cp

How to copy files in S3 and preserve the directory structure

Are you looking for an experienced AI consultant? Do you need assistance with your RAG or Agentic Workflow?
Book a Quick Consultation, send me a message on LinkedIn. Book a Quick Consultation or send me a message on LinkedIn

>