Imagine that I have a SQL query that calculates the number of rows grouped by some column:

Table of Contents

  1. Get Weekly AI Implementation Insights

SELECT column_A, count(*) as cnt
FROM table_name
GROUP BY column_A

I want to get the top 10 rows in every group. To do this, I have to use the query above as a subquery and use the ROW_NUMBER() function and a window function. The window function will group the rows by the column_A and order them by the number of rows:

SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
    SELECT column_A, count(*) as cnt
    FROM table_name
    GROUP BY column_A
) t

To get only the top 10 rows, I am going to put all of that in another subquery and use WHERE to select only top values:

SELECT t2.column_A as column_A, t2.cnt as cnt FROM (
    SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_A ORDER BY cnt DESC) as rank FROM (
        SELECT column_A, count(*) as cnt
        FROM table_name
        GROUP BY column_A
        ) t
    ) t2
WHERE t2.rank <= 10;

Get Weekly AI Implementation Insights

Join engineering leaders who receive my analysis of common AI production failures and how to prevent them. No fluff, just actionable techniques.

Get Weekly AI Implementation Insights

Join engineering leaders who receive my analysis of common AI production failures and how to prevent them. No fluff, just actionable techniques.

Older post

How to configure both core and spot instances in EMR using Terraform

Use EMR instance group to add spot instances to an EMR cluster

Newer post

Run a command on a remote server using SSH in Airflow

How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.

Engineering leaders: Is your AI failing in production? Take the 10-minute assessment
>