Engineering Blog

Bartosz Mikulski's articles on AI, MLOps, and data engineering

Fine-tuning Mistral-7B LLM using QLoRA in Axolotl

A comprehensive tutorial on fine-tuning Mistral-7B using QLoRA and Axolotl, covering data preparation, model configuration, and text classification optimization.
Published on: 11 Nov 2024

Prompt Management and Request Tracking for LLM Applications Using Langfuse

Learn how to use Langfuse to manage prompts and track LLM requests in your AI applications. Discover how to version prompts, monitor usage, and improve your LLM applications with detailed analytics.
Published on: 04 Nov 2024

How to use the OpenAI Swarm Agentic Framework

Learn to use OpenAI's Swarm Agentic Framework to build intelligent AI workflows. This guide covers the basics of agents, defining functions, and defining interactions between agents.
Published on: 28 Oct 2024

How to set up student Jupyter Notebook servers for a workshop using CoCalc

Learn how to set up Jupyter Notebook servers with GPU access for students using CoCalc. A guide for educators and workshop organizers looking to provide hardware for their students without (too much) hassle.
Published on: 21 Oct 2024

AI Receptionist for Apartment Rentals built with CrewAI

Learn how to build an AI-powered receptionist for apartment rentals using CrewAI. This tutorial covers implementing key features like access code management, WiFi password updates, and automated guest assistance, enhancing the rental experience while reducing manual workload.
Published on: 10 Sep 2024

How to fine-tune a super-fast small language model SmolLM from HuggingFace

You don't always need a large language model to get good results. A fine-tuned small language model is often just as good, much faster, and cheaper.
Published on: 30 Jul 2024

Enhancing RAG System Accuracy - Advanced RAG techniques explained

Discover advanced techniques to enhance the accuracy of your Retrieval-Augmented Generation (RAG) systems. Learn about semantic search, query expansion, HyDE, and keyword search to improve data retrieval and answer quality.
Published on: 16 Jul 2024

How to prevent LLM hallucinations from reaching the users in RAG systems

Hallucinations erode trust, so we should prevent them from reaching the users. In this article, I show how to use the FaithfulnessEvaluator from the Llama-Index library to determine whether the documents retrieved by vector search contain the answer to the user's question.
Published on: 30 Jun 2024

How can we measure improvement in information retrieval quality in RAG systems?

Every RAG system starts with retrieval. How do you know if your retrieval code is good enough? You measure it. The article shows how to use the ir_measures library to calculate Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to quantify the performance of your retrieval code.
Published on: 20 Jun 2024

Building a Data Retrieval Workflow for AI with Structured Output Libraries like Marvin and Instructor

How to use Marvin and Instructor to define structured output for LLMs and build a data retrieval workflow that answers users' questions about data. The workflow checks whether the required data is available in the database, plans what data has to be retrieved, generates the query, executes it, and produces a human-readable answer.
Published on: 10 Jun 2024

Building an agentic AI workflow with Llama 3 open-source LLM using LangGraph

How to build an agentic AI workflow using the open-source Llama 3 LLM and LangGraph. We will create an autonomous multi-step process that handles a data retrieval task and answers users' questions using multiple specialized AI agents.
Published on: 20 May 2024

Building a chatbot with a custom GPT assistant using OpenAI Assistant API and Streamlit

Build a chatbot that lets you chat with a YouTube video or a PDF in one hour (or less) with OpenAI Assistant API and Streamlit
Published on: 10 May 2024

The Ultimate 2024 Guide to Prompt Engineering

Discover the difference between proven prompt engineering techniques and tricks
Published on: 30 Apr 2024

Monitoring employees leaking secret data to AI in ChatGPT with GPTBoost and HuggingChat

How to monitor whether your employees leak company secrets, PII, or passwords to AI
Published on: 10 Jan 2024

Language learning with AI: building an AI-powered Anki plugin

AI in everyday usage: How to build an Anki plugin for generating example sentences in a foreign language using AI
Published on: 20 Dec 2023

Debugging, controlling OpenAI usage cost, and monitoring AI applications using Langsmith and GPTBoost

Discover how to effectively monitor AI applications, track costs, and optimize API usage with Langsmith and GPTBoost. Get insights into managing OpenAI API interactions for improved efficiency and cost control.
Published on: 30 Nov 2023

GPTBoost Case-Study: Blending AI with Human Creativity in Content Creation

Explore the innovative approach of using AI in content creation through the GPTBoost case-study. Learn how Veselina Staneva masterfully blends AI efficiency with human creativity in her writing process.
Published on: 22 Nov 2023

What mistakes do product managers make while using AI, and what to do instead

How product managers can use AI to make their work easier without suffering the consequences of AI misuse
Published on: 20 Nov 2023

Integrate OpenAI custom GPT with your business automation workflow using REST API and webhooks

How to integrate OpenAI custom GPT with make.com scenario using webhooks - a tutorial on using GPTs in business automation workflows.
Published on: 10 Nov 2023

Chat with a YouTube Video: How to build an AI chatbot using YouTube video transcripts

How to build an AI chatbot that retrieves answers to users' questions from transcripts of YouTube videos
Published on: 30 Oct 2023

AI-Powered Topic Modeling: Using Word Embeddings and Clustering for Document Analysis

Explore the seamless integration of artificial intelligence with classical machine learning techniques for effective topic modeling and document clustering. Learn how word embeddings enable higher accuracy, semantic context preservation, and robust results.
Published on: 30 Sep 2023

Build a question-answering service with AI and vector databases in JavaScript

How to use Langchain.js to build a question-answering service that uses vector databases to store the documents and AI to generate the answers
Published on: 20 Sep 2023

What's the difference between Langchain Agents and OpenAI Functions?

What should you use when your AI needs access to external systems? Is it better to use Langchain Agents or OpenAI Functions?
Published on: 10 Sep 2023

Save time and money by caching OpenAI (and other LLM) API calls with Langchain

How to use Langchain model response and document embeddings caching to save time and money when using Large Language Models
Published on: 30 Aug 2023

Generate questions and answers from any document using AI

How to use OpenAI GPT models, Langchain, and Doctran to generate questions and answers from long documents
Published on: 25 Aug 2023

What to do when a document doesn't fit in AI prompt window

Using Langchain MapReduceChain to handle documents longer than the prompt limit
Published on: 20 Aug 2023

Finding information in long documents with AI using vector databases and MapReduceChain from Langchain

How to find information in long documents with AI, vector databases, and Langchain using MapReduceChain and ParentDocumentRetriever
Published on: 15 Aug 2023

Building a classification service with Llama2 in Python

How to use the Llama2 AI model in Python to build a text classification service
Published on: 10 Aug 2023

Monitoring AI applications with Langsmith

Monitor interactions with LLM in Langchain and gather feedback about the model's performance using Langsmith
Published on: 30 Jul 2023

Using AI and automation to keep up with industry news

Discover how AI and automation can streamline the process of gathering relevant information from newsletters, summarizing it, and delivering a weekly summary
Published on: 30 Jun 2023

Use OpenAI API Function Calling to Build a Chatbot for Slack with Access to a REST API (updated for OpenAI SDK version 1.1.1+)

Build an AI-powered chatbot that can interact with REST API using the Function Calling feature of OpenAI Completion API. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 16 Jun 2023

How to use AI in Python with LMQL?

Explore LMQL, a powerful SQL-like language designed for machine learning tasks.
Published on: 10 Jun 2023

Which index should you use while building an application with LlamaIndex?

Which Llama index should you use? When is it better to use GPTVectorStoreIndex, GPTListIndex, GPTKeywordTableIndex, or GPTKnowledgeGraphIndex?
Published on: 30 May 2023

How to fine-tune an OpenAI model using custom data

How to prepare the training data for an OpenAI model and how to fine-tune OpenAI's GPT model in Python
Published on: 08 May 2023

Deploy LLMs with Confidence: A Comprehensive Guide to Software Architecture for Production-Ready AI

Learn the essentials of deploying large language models in production with our comprehensive guide on software architecture for AI
Published on: 30 Apr 2023

Improve AI Output Using the Guardrails Library with Custom Validators

Use AI to validate another AI's output. Learn how to create custom validators and corrections using the Guardrails library.
Published on: 20 Apr 2023

How to Build a ChatGPT Plugin in Python?

A step-by-step guide to building a ChatGPT plugin in Python to retrieve data from the knowledge base stored in a vector database
Published on: 15 Apr 2023

Don't use AI to generate tests for your code or how to do test-driven development with AI

How to use AI to generate test cases for your code
Published on: 10 Apr 2023

Alternatives to OpenAI GPT model: using an open-source Cerebras model with LangChain

Discover how to leverage the powerful open-source Cerebras model with LangChain in this comprehensive guide, featuring step-by-step instructions for loading the model with HuggingFace Transformers, creating prompt templates, and integrating it with LangChain Agents.
Published on: 30 Mar 2023

AI-Powered Pair Programming: Enhance Your Web Development Skills with GPT-4 Assistance

Improve your coding skills and elevate your writing with GPT-4 as your AI-driven pair programming partner, guiding you through the process of building a web application that functions as a user-friendly reverse dictionary.
Published on: 20 Mar 2023

Build an AI-powered Newsletter Generator with dust.tt and OpenAI

How to create an AI-powered newsletter generator using dust.tt and the OpenAI API. You'll learn how to set up a website with the AI application, use the few-shot in-context learning technique to train the AI model, and deploy the API to generate newsletters.
Published on: 10 Mar 2023

Get Started with ChatGPT API: A Step-by-Step Guide for Python Programmers (updated for OpenAI SDK version 1.1.1+)

A step-by-step tutorial on ChatGPT API (versions 1.1.1+) in Python. You'll also learn about prompt engineering, interactivity, optimizing API calls, and using parameters to get better results. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 01 Mar 2023

Maximize Customer Support Efficiency: Build an AI Chatbot to Answer Common Client Questions

How to build an AI-powered Facebook chatbot using GPT-3 from OpenAI and vector databases to answer client questions using your documentation - a tutorial with step-by-step instructions. You will learn how to set up a database, create text embeddings, use MLOps and prompt engineering to retrieve answers, and build a web application to connect with the Facebook API.
Published on: 28 Feb 2023

Detection of Text Duplicates and Text Search with Word Embeddings and Vector Databases

Discover how word embeddings and vector databases can revolutionize text search and duplicate detection. Learn how to implement it with OpenAI GPT-3 and Milvus vector database.
Published on: 20 Feb 2023

Automating Git Commit Messages with GPT-3 for Faster Software Development Workflows

Learn how to use GPT-3 to automate Git commit message generation and speed up your development workflows.
Published on: 15 Feb 2023

Connect GPT-3 to the Internet: Create a Slack Bot and Perform Web Search, Calculations, and More

Unleash the potential of GPT-3 and make it access the Internet. Learn how to use Langchain and build a Slack bot that can do a web search, extract text from websites, and perform calculations.
Published on: 10 Feb 2023

Unlocking the Power of In-Context Learning With Zero-Shot, One-Shot, and Few-Shot Prompt Engineering for GPT

How does the in-context learning prompt engineering technique improve GPT-3 results, and why does it work? What's the difference between zero-shot, one-shot, and few-shot prompting?
Published on: 30 Jan 2023

Create an AI Data Analyst bot for Slack that can look up data in your database

Build an AI-powered Slack bot that reads data from your production database and answers simple analytics questions
Published on: 23 Jan 2023

Generate a landing page for a newsletter in 17 minutes using ChatGPT or GPT-3

Are you looking for a way to generate high-quality content quickly and effectively? This article outlines how to use AI to create a landing page. You will learn more about the AIDA marketing model and the use of OpenAI's ChatGPT and GPT-3 to generate the text. Also, I will show you how to incorporate audience research into the prompt to improve the result. Get all the information you need to create high-quality content quickly and effectively with AI.
Published on: 20 Jan 2023

Why do you need a text summarization service, and how to deploy a text summarization model in 15 minutes using HuggingFace and Qwak?

Why do you need text summarization services in your business? How can you deploy a model downloaded from HuggingFace using the Qwak ML platform in 15 minutes?
Published on: 10 Jan 2023

What does modern software architecture look like in 2022?

Do architecture diagrams still matter? How do we deal with constant changes? How to design software architecture?
Published on: 20 Dec 2022

Using Abstraction Layers to Tackle Common Problems with Legacy Code

Are you struggling to manage and update your legacy codebase? In this article, I'll show you how to leverage the power of abstraction layers to overcome common challenges with legacy code.
Published on: 10 Dec 2022

What kills IT projects?

What kills IT projects? What you should avoid, at all costs, to ensure the success of your startup or software project
Published on: 30 Nov 2022

How to write a growth plan as a programmer?

How to write a growth plan that helps you get promoted and doesn't get in the way when you want to focus on your hobbies
Published on: 20 Nov 2022

Test-Driven Development in Python with Pytest

How to set up and use Pytest to test Python code
Published on: 10 Nov 2022

Marketing for SaaS startups: how to describe your product?

How to use the "benefits over features" technique to advertise your SaaS product and get more clients than your competition
Published on: 30 Oct 2022

How to pitch your idea

What a co-founder of DeepMind teaches us about pitching our ideas to investors
Published on: 20 Oct 2022

MLOps at small companies

How to do MLOps while working on a small data engineering team
Published on: 10 Oct 2022

Why should you practice TDD?

What are the benefits of TDD for programmers and companies that hire them?
Published on: 30 Sep 2022

How to debug code

How to debug code and solve problems as fast as possible
Published on: 20 Sep 2022

CUPID properties in data engineering

Does it make sense to use SOLID principles in data engineering? What about CUPID properties in data pipelines?
Published on: 10 Sep 2022

How to add tests to existing code in data transformation pipelines

How data engineers can write tests for legacy code in their ETL pipelines without breaking the existing implementation
Published on: 30 Aug 2022

Software engineering practices in data engineering and data science

How to produce high-quality software in data teams by applying software engineering practices to data science and data engineering
Published on: 20 Aug 2022

How to sort a Pandas DataFrame by month name

How to use an ordered categorical variable to sort a Pandas DataFrame by months while displaying their names
Published on: 15 Aug 2022

How to become a data engineer for free

What do you need to know to become a data engineer? Does a data engineer need a degree? How can you get your first data engineering job?
Published on: 10 Aug 2022

A comprehensive guide to Kappa Architecture

What is Kappa Architecture? When should we use Kappa Architecture? What's the difference between Kappa Architecture and Lambda Architecture? And way, way more!
Published on: 30 Jul 2022

The secret of working with legacy code on a software team

How to work with code written by other people? What to do when you join a new team?
Published on: 20 Jul 2022

Functional programming in Python

Does functional programming in Python make sense? How to do functional programming in Python?
Published on: 10 Jul 2022

How to write technical documentation

How to document a software project?
Published on: 20 Jun 2022

ETL vs ELT - what's the difference? Which one should you choose?

Should you use a data warehouse or build a data lake? When is a data warehouse a better choice? When is it better to build a data lake?
Published on: 10 Jun 2022

Selecting rows in Pandas

How to use loc, iloc, slice, and row filtering in Pandas
Published on: 27 May 2022

Python decorators explained

How can we define a Python decorator, and when should we use Python decorators?
Published on: 25 May 2022

What is shuffling in Apache Spark, and when does it happen?

When does an Apache Spark cluster perform the shuffle operation?
Published on: 20 May 2022

What is the root cause of problems in software engineering?

What is the primary, unrepairable cause of almost all bugs, data leaks, human problems, etc.?
Published on: 13 May 2022

How to become a better programmer

What's stopping us from getting better at coding
Published on: 06 May 2022

How to teach programming workshops to adults

How to prepare an enjoyable programming workshop that teaches people the skills they need without overwhelming them with new knowledge.
Published on: 15 Apr 2022

What does a bad interview look like in data engineering

What you should avoid when you interview programmers for a data engineer position
Published on: 08 Apr 2022

How to throw useful exceptions

How to make debugging easier by paying attention to the errors you report
Published on: 01 Apr 2022

Why are programmers slow, and what to do about it?

The one practice that makes every team faster (in the long run)
Published on: 25 Mar 2022

How to advertise to software engineers, or how do we make terrible tech choices

Why do programmers make wrong decisions when they choose the tools they use?
Published on: 18 Mar 2022

Data engineers are data librarians or how to upgrade your data lake to 2500 BCE technology.

What can data engineers learn from (ancient) librarians?
Published on: 11 Mar 2022

I worked as a data scientist and that was the worst job I have ever had.

I believed in the "Sexiest Job of the 21st Century" hype. I was wrong.
Published on: 04 Mar 2022

MLOps engineer, you will need those three books every day!

Don't reinvent the wheel as an MLOps engineer. The 3 books you must read in 2022
Published on: 25 Feb 2022

Why should you use a feature store

Benefits of having a feature store and what happens when you don't have one
Published on: 18 Feb 2022

Data pipeline documentation without wasting your time

How to document an ETL pipeline or ML inference pipeline without doing useless work
Published on: 11 Feb 2022

How to run batch inference using Sagemaker Batch Transform Jobs

Running a batch machine learning job using Sagemaker and data stored in S3.
Published on: 04 Feb 2022

How to build maintainable software by abstracting the business rules in data engineering

Are we building the right abstractions in software?
Published on: 28 Jan 2022

Testing legacy data pipelines

Do you struggle with maintaining your legacy data pipelines? Check out our article on how to add tests and refactor your code while working with legacy data pipelines.
Published on: 21 Jan 2022

Secrets of mentoring junior software engineers

How to quickly train junior engineers to make them as productive as the rest of the team
Published on: 14 Jan 2022

What does your data pipeline need in production?

When you're debugging a failing production pipeline at 2 am, what do you need?
Published on: 07 Jan 2022

How to pass a machine learning engineer interview

Trivial (and easily fixable) mistakes that will make you fail a job interview
Published on: 31 Dec 2021

Why do data engineers quit?

Why do data engineers quit their jobs?
Published on: 24 Dec 2021

What is the essential KPI of an MLOps team?

What KPI to measure in an MLOps team
Published on: 17 Dec 2021

Deploying your first ML model in production

The minimal setup for ML deployment without the things you DON'T need yet
Published on: 10 Dec 2021

Is it overengineered?

What's the difference between reasonable future-proof architecture and overengineering? Is there a difference?
Published on: 04 Dec 2021

Pattern matching in Python vs Scala

What is the difference between pattern matching in Python and Scala?
Published on: 26 Nov 2021

Should you use machine learning in your product?

How to put AI in production without overengineering your system
Published on: 19 Nov 2021

How does the Atlan data platform help you ensure data quality?

Atlan - a tool for facilitating a collaborative data culture
Published on: 15 Nov 2021

What should you learn as a data engineer?

Should you spend time learning data engineering tools and libraries?
Published on: 12 Nov 2021

Shadow deployment vs. canary release of machine learning models

What is shadow deployment in machine learning? What is a canary release? What is the difference?
Published on: 05 Nov 2021

How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML

Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML
Published on: 01 Oct 2021

How to teach your team to write automated tests?

How to teach writing automated tests: TDD, BDD, and other techniques
Published on: 24 Sep 2021

Using AWS Deequ in Python with Python-Deequ

How to use Python-Deequ to validate Spark DataFrames
Published on: 17 Sep 2021

Building and deploying ML models using Qwak ML platform

What is Qwak ML platform and how does it work?
Published on: 03 Sep 2021

How to learn TDD

Learning Test-Driven Development is hard and there is nothing we can do about it
Published on: 27 Aug 2021

Data Engineering - the first principles

What is true in every data engineering project?
Published on: 20 Aug 2021

How to deploy MLFlow on Heroku

How to deploy MLFlow on Heroku using PostgreSQL as the database, S3 as the artifact storage and with BasicAuth authentication
Published on: 06 Aug 2021

What is MLOps? Do we need MLOps?

A complete definition of MLOps. No, MLOps isn't just DevOps applied to machine learning!
Published on: 30 Jul 2021

How to add a new dataset to the Feast feature store

How to use Feast feature store in a local environment
Published on: 09 Jul 2021

Building trustworthy data pipelines

How to build a trustworthy data pipeline?
Published on: 02 Jul 2021

Theory of constraints in data engineering

Are you busy, but nothing ever gets done? Perhaps the theory of constraints will help you
Published on: 25 Jun 2021

How writing can improve your programming skills

How writing texts for people makes you a better programmer
Published on: 18 Jun 2021

The ugly truth about product demo storytelling in data teams

How to make data product demos more engaging and persuade people to care about the data
Published on: 11 Jun 2021

Multimodel deployment in Sagemaker Endpoints

How to deploy multiple models in a single Sagemaker Endpoint?
Published on: 28 May 2021

How to speed up Pandas?

Is the Pandas library too slow? Here are two methods to speed it up!
Published on: 21 May 2021

Data versioning with LakeFS

Why you should use LakeFS to build a data lake that supports data versioning
Published on: 14 May 2021

How to add custom preprocessing code to a Sagemaker Endpoint running a Tensorflow model

How to customize input/output of a Sagemaker Endpoint running a Tensorflow model
Published on: 07 May 2021

How to A/B test Tensorflow models using Sagemaker Endpoints

How to deploy multiple model versions as one Sagemaker Endpoint
Published on: 30 Apr 2021

How to predict the value of time series using Tensorflow and RNN

How to train the RNN model in Tensorflow to predict time series?
Published on: 23 Apr 2021

How to deploy a REST API AWS Lambda using Chalice and AWS Code Pipeline

How to create a REST API Endpoint using AWS Lambda, Chalice, and AWS Code Pipeline
Published on: 16 Apr 2021

How to deploy a Tensorflow model using Sagemaker Endpoints and AWS Code Pipeline

How to build a Docker image using AWS Code Pipeline and deploy it as a Sagemaker Endpoint
Published on: 09 Apr 2021

How to deal with days of the week in machine learning

How to encode week days as features for machine learning models
Published on: 26 Mar 2021

On technical blogging

How to start blogging as a programmer
Published on: 19 Mar 2021

Why do we use dropout in artificial neural networks?

How does dropout work in artificial neural networks?
Published on: 12 Mar 2021

How to measure Spark performance and gather metrics about written data

How to track Spark metrics in AWS CloudWatch
Published on: 05 Mar 2021

How to use AWS Batch to run a Python script

How to build a Docker image, define an AWS Batch job using Terraform, and run the AWS Batch job using Airflow
Published on: 26 Feb 2021

Anomaly detection in Airflow DAG using Prophet library

How to detect problems in an Airflow pipeline using Prophet for time series anomaly detection
Published on: 12 Feb 2021

How to test REST API contract using BDD

Testing a REST API using Behave in Python
Published on: 05 Feb 2021

Testing data products: BDD for data engineers

How to use BDD to test PySpark code
Published on: 29 Jan 2021

Definition of done for data engineers

When can data engineers be sure that they have done the task?
Published on: 14 Jan 2021

Don't learn another programming language

Should you learn a new programming language this year?
Published on: 07 Jan 2021

How to read from a SQL table in PySpark using a query instead of specifying a table

Fetching data using a SQL query in PySpark
Published on: 01 Jan 2021

How to restart a stuck Airflow DAG

What to do when an Airflow DAG gets stuck and does not want to run
Published on: 31 Dec 2020

Why does the DayOfWeekSensor exist in Airflow?

How to make an Airflow DAG wait until a specified day of the week
Published on: 30 Dec 2020

Send SMS from an Airflow DAG using AWS SNS

How to configure an SNS subscription to send SMS messages and use Airflow to send them
Published on: 29 Dec 2020

How to emulate temporary tables in Athena

Use CTAS to create a temporary table in Athena
Published on: 28 Dec 2020

How to enable S3 bucket versioning using Terraform

How to configure S3 bucket versioning in Terraform
Published on: 27 Dec 2020

How to get a notification when a new file is uploaded to an S3 bucket

Get a Slack notification when a file is uploaded to an S3 bucket
Published on: 26 Dec 2020

Get an XCom value in the Airflow on_failure_callback function

How to get the task instance in the on_failure_callback to get access to XCom
Published on: 25 Dec 2020

Add the row insertion time to a MySQL table

Automatically add the insertion and update time in MySQL
Published on: 24 Dec 2020

Best practices about partitioning data in S3 by date

How to partition data in S3 by date in a way that makes your life easier
Published on: 23 Dec 2020

How to write to a SQL database using JDBC in PySpark

How to use a JDBC driver in PySpark to write a DataFrame to a SQL database
Published on: 22 Dec 2020

How to add dependencies as jar files or Python scripts to PySpark

How to add a jar file or a Python file as a PySpark dependency
Published on: 21 Dec 2020

How to reset the consumer offset in an Apache Kafka topic

How to use kafka-consumer-groups.sh to reset topic offsets
Published on: 20 Dec 2020

How to purge a Kafka topic

How to remove all messages from a Kafka topic
Published on: 19 Dec 2020

Get the last day of the month in Redshift

How to use the last_day function in Redshift
Published on: 18 Dec 2020

How to make an unconditional join in Redshift

LEFT OUTER JOIN ON 1=1 in Redshift
Published on: 17 Dec 2020

How to count the number of rows that match a condition in Redshift

How to count the rows by multiple conditions at the same time in SQL
Published on: 16 Dec 2020

How to index data in Redshift

How to create an equivalent of an index in Redshift
Published on: 15 Dec 2020

How to generate a sequence of dates in Redshift

How to use the generate_series function to generate a sequence of dates
Published on: 14 Dec 2020

How to assign rows to ranked groups in AWS Athena

How to use the NTILE function in Athena
Published on: 13 Dec 2020

How to define an AWS Athena view using Airflow

How to use the AWSAthenaOperator
Published on: 12 Dec 2020

How to write Hive queries with column position number in the GROUP BY or ORDER BY clauses

How to enable column position support in Hive GROUP BY or ORDER BY
Published on: 11 Dec 2020

How to check whether a regular expression matches a string in Hive

What is the equivalent of Athena/Presto regexp_like in Hive
Published on: 10 Dec 2020

How to check whether a YARN application has finished

How to use Airflow PythonSensor to check whether a YARN application finished running
Published on: 09 Dec 2020

How to use CASE WHEN queries in AWS Athena

Using conditions in AWS Athena queries
Published on: 08 Dec 2020

How to decode base64 to text in AWS Athena

How to use from_base64 in AWS Athena
Published on: 07 Dec 2020

How to combine two DataFrames with no common columns in Apache Spark

Use full outer join to combine two Apache Spark DataFrames with no common columns
Published on: 06 Dec 2020

How to get names of columns with missing values in PySpark

How to get the names of missing properties for every row in a PySpark DataFrame
Published on: 05 Dec 2020

How to set a different retry delay for every task in an Airflow DAG

How to use a different retry delay in every Airflow task
Published on: 04 Dec 2020

How to find the Hive partition closest to a given date

How to use Airflow to find the Hive partition closest to a given date
Published on: 03 Dec 2020

Get the date of the previous successful DAG run in Airflow.

Get the start time or the execution date of the previous successful DAG run in Airflow
Published on: 02 Dec 2020

How to prevent Airflow from backfilling old DAG runs

How to disable backfilling of an Airflow DAG or skip a part of the DAG during a backfill
Published on: 01 Dec 2020

What is s3:TestEvent, and why does it break my event processing?

S3 sends s3:TestEvent to SQS after setting up the bucket notifications
Published on: 30 Nov 2020

Making OFFSET LIMIT queries in AWS Athena

How to use OFFSET in AWS Athena queries
Published on: 29 Nov 2020

How to get an alert if an AWS Lambda function has not been invoked in the last 24 hours

How to get a notification when AWS Lambda stops being used
Published on: 28 Nov 2020

How to set Airflow variables while creating a dev environment

How to use the command line to set Airflow variables
Published on: 27 Nov 2020

How to check when an Athena table was updated

How to track the time when an Athena table was updated
Published on: 26 Nov 2020

How to run an Airflow DAG in a loop

How to keep running an Airflow DAG indefinitely
Published on: 25 Nov 2020

How to use xcom_pull to get a variable from another DAG

Get an XCOM variable from another DAG
Published on: 24 Nov 2020

What to do when Airflow BashOperator fails with TemplateNotFound error

How to fix TemplateNotFound error when using Airflow BashOperator
Published on: 23 Nov 2020

Copy directories in S3 using s3-dist-cp

How to copy files in S3 and preserve the directory structure
Published on: 22 Nov 2020

How to select a random sample of rows using Athena

How to use a window function to select random rows from Athena
Published on: 21 Nov 2020

Use HttpSensor to pause an Airflow DAG until a website is available

Pause an Airflow DAG until an HTTP endpoint returns 200 OK
Published on: 20 Nov 2020

How to add an EMR step in Airflow and wait until it finishes running

How to use AwsHook and EmrStepSensor to add an EMR step and wait until it finishes running
Published on: 19 Nov 2020

How to use Virtualenv to prepare a separate environment for Python function running in Airflow

How to use the PythonVirtualenvOperator in Airflow
Published on: 18 Nov 2020

Remove a directory from S3 using Airflow S3Hook

How to remove files with a common prefix from S3
Published on: 17 Nov 2020

Run a command on a remote server using SSH in Airflow

How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.
Published on: 16 Nov 2020

Use the ROW_NUMBER() function to get top rows by partition in Hive

How to calculate row number by partition in Hive and use it to filter rows
Published on: 15 Nov 2020

How to configure both core and spot instances in EMR using Terraform

Use EMR instance group to add spot instances to an EMR cluster
Published on: 14 Nov 2020

How to temporarily disable an AWS Lambda function using AWS CLI without removing the function

Disable an AWS Lambda using AWS CLI
Published on: 13 Nov 2020

How to add an EMR step from AWS Lambda

How to configure a new EMR step using AWS Lambda in Python
Published on: 12 Nov 2020

Send event to AWS Lambda when a file is added to an S3 bucket

Trigger AWS Lambda when a file is created in an S3 bucket
Published on: 11 Nov 2020

Select Serverless configuration variables using the stage parameter

How to pass environment parameters to Serverless that depend on the deployment stage
Published on: 10 Nov 2020

Use a custom function in Airflow templates

How to add a custom function to Airflow and use it in a template
Published on: 08 Nov 2020

Speed up counting the distinct elements in a Spark DataFrame

Use HyperLogLog to calculate the approximate number of distinct elements in Apache Spark
Published on: 07 Nov 2020

Pass parameters to SQL query when using PostgresOperator in Airflow

How to pass parameters to SQL template when using PostgresOperator in Airflow
Published on: 06 Nov 2020

Use regexp_replace to replace a matched string with a value of another column in PySpark

Use regex to replace the matched string with the content of another column in PySpark
Published on: 05 Nov 2020

How to read multiple Parquet files with different schemas in Apache Spark

What to do when Apache Spark skips Parquet files with incompatible schemas
Published on: 04 Nov 2020

How to determine the partition size in Apache Spark

How to choose the proper partition size and the number of partitions to run an Apache Spark job
Published on: 03 Nov 2020

How to download all available values from DynamoDB using pagination

How to use pagination to retrieve all DynamoDB values
Published on: 02 Nov 2020

How to make sure that you did not leave an EMR cluster running

How to get notifications about running EMR cluster
Published on: 01 Nov 2020

How to automatically remove files from S3 using lifecycle rules defined in Terraform

How to define S3 lifecycle rules using Terraform
Published on: 31 Oct 2020

How to retry a Python function call

How to retry a Python function call in case of an error
Published on: 30 Oct 2020

Send a Slack message from an Airflow DAG

How to use the SlackAPIPostOperator to send a templated message to a Slack channel
Published on: 29 Oct 2020

How to delay an Airflow DAG until a given hour using the DateTimeSensor

How to use the DateTimeSensor in Airflow
Published on: 28 Oct 2020

How to run PySpark code using the Airflow SSHOperator

How to submit a PySpark job using SSHOperator in Airflow
Published on: 27 Oct 2020

How to add a manual step to an Airflow DAG using the JiraOperator

How can you add a human action to an Airflow DAG?
Published on: 26 Oct 2020

How Data Mechanics can reduce your Apache Spark costs by 70%

Stop wasting time and money tuning Apache Spark parameters
Published on: 26 Oct 2020

Conditionally pick an Airflow DAG branch using an SQL query

How to use the BranchSQLOperator to choose a DAG branch to execute
Published on: 25 Oct 2020

How to trigger an Airflow DAG from another DAG

How to trigger another DAG from an Airflow DAG
Published on: 24 Oct 2020

Why does the ExternalTaskSensor get stuck?

How to fix the stuck ExternalTaskSensor
Published on: 23 Oct 2020

How to render an Airflow template for testing

How to generate the code of an Airflow task from a template and a given execution date
Published on: 22 Oct 2020

How to check the next execution date of an Airflow DAG

How to use Airflow CLI to get the next execution date of a DAG
Published on: 21 Oct 2020

Doing data quality checks using the SQLCheckOperator

How to use SQLCheckOperator to verify that the database contains an expected number of rows
Published on: 20 Oct 2020

How to deal with the jinja2 TemplateNotFound error in Airflow

How to fix the TemplateNotFound error while using a custom Airflow operator
Published on: 19 Oct 2020

How to postpone Airflow DAG until files get uploaded into an S3 bucket

How to use Airflow sensors to detect that files have been uploaded into an S3 bucket
Published on: 18 Oct 2020

What is the difference between a transformation and an action in Apache Spark?

What is an action in Apache Spark? What do you understand as transformations in Apache Spark?
Published on: 17 Oct 2020

Use LatestOnlyOperator to skip some tasks while running a backfill in Airflow

How to skip some tasks when backfilling a DAG in the past
Published on: 16 Oct 2020

Christopher Bergh - How the DataOps principles help data engineers make data pipelines trustworthy

An interview with Christopher Bergh who explains how the DataOps principles help data engineers make data pipelines trustworthy
Published on: 16 Oct 2020

How to retrieve the statuses of the recent DAG executions from Airflow database

How to make a dashboard that displays Airflow DAG statuses
Published on: 15 Oct 2020

How to find and terminate an idle Redshift session

How to find the idle session that is blocking the connection pool in Redshift
Published on: 14 Oct 2020

How to configure Spark to maximize resource usage while using AWS EMR

How to configure EMR to use all available resources when running a Spark cluster
Published on: 13 Oct 2020

How to use AWSAthenaOperator in Airflow to verify that a DAG finished successfully

How to check that an AWS Athena table contains data after running an Airflow DAG.
Published on: 12 Oct 2020

How to start an AWS Glue Crawler to refresh Athena tables using boto3

How to create and start an AWS Glue Crawler from Python code using boto3
Published on: 11 Oct 2020

How to retrieve the table descriptions from Glue Data Catalog using boto3

How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
Published on: 10 Oct 2020

How to perform a batch write to DynamoDB using boto3

How to write multiple DynamoDB objects at once using boto3
Published on: 09 Oct 2020

How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3

How to upload S3 data into RDS tables
Published on: 08 Oct 2020

How to concatenate multiple MySQL rows into a single field?

How to concatenate multiple rows into a string in MySQL
Published on: 07 Oct 2020

How to get an array/bag of elements from the Hive group by operator?

How to get an array of elements from one column when grouping by another column in Hive
Published on: 06 Oct 2020

Working with dates and time in Apache Spark

How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
Published on: 05 Oct 2020

How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive

How to use the saveAsTable function to create a partitioned table
Published on: 04 Oct 2020

When to cache an Apache Spark DataFrame?

Should we cache everything in Apache Spark, or are there any rules?
Published on: 03 Oct 2020

How to flatten a struct in a Spark DataFrame?

How to convert DataFrame fields into separate columns.
Published on: 02 Oct 2020

What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?

How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup?
Published on: 01 Oct 2020

How to concatenate columns in a PySpark DataFrame

How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
Published on: 30 Sep 2020

How to derive multiple columns from a single column in a PySpark DataFrame

Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Published on: 29 Sep 2020

Broadcast variables and broadcast joins in Apache Spark

How to speed up joins of small DataFrames by using the broadcast join
Published on: 28 Sep 2020

How to use the window function to get a single row from each group in Apache Spark

How to group values by a key and extract a single row from each group in Apache Spark
Published on: 27 Sep 2020

How to make a pivot table in AWS Athena or PrestoSQL

How to make a pivot table in AWS Athena, and why the pivot function does not exist
Published on: 26 Sep 2020

What is the difference between repartition and coalesce in Apache Spark?

When should you use coalesce instead of repartition in Apache Spark
Published on: 25 Sep 2020

How to pivot an Apache Spark DataFrame

How to turn an Apache Spark or PySpark DataFrame into a pivot table.
Published on: 24 Sep 2020

What is the difference between cache and persist in Apache Spark?

When should you use the cache function, and when should you use the persist function?
Published on: 23 Sep 2020

Why your company should use PrestoSQL

Should your team use PrestoSQL?
Published on: 16 Sep 2020

Is counting rows all we can do?

How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
Published on: 08 Sep 2020

How to Speed Up AWS Athena Queries Using Partition Projection

How to define partition projection while creating an Athena table
Published on: 30 Aug 2020

How to send a customized Slack notification when an Airflow task fails

How to customize a Slack notification before sending it to the Slack incoming webhook.
Published on: 27 Aug 2020

How to use one SparkSession to run all Pytest tests

How to speed up Pytest tests by reusing the same SparkSession in all of them
Published on: 20 Jul 2020

How to send AWS CloudWatch Alerts to a Slack channel using Terraform

How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
Published on: 13 Jul 2020

Check-Engine - data quality validation for PySpark 3.0.0

A PySpark library for data quality checks and data validation.
Published on: 06 Jul 2020

Measuring data quality using AWS Deequ

How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
Published on: 29 Jun 2020

How to conditionally skip tasks in an Airflow DAG

How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
Published on: 22 Jun 2020

The problem with software testing in data engineering

Why don't data engineers write unit tests?
Published on: 15 Jun 2020

How does Kafka Connect work?

How does a Connector work? What is a Worker in Kafka Connect? How does the data get processed inside Kafka Connect, and why does it need internal Kafka topics?
Published on: 08 Jun 2020

Why my Airflow tasks got stuck in "no_status" and how I fixed it

A story about debugging an Airflow DAG that was not starting tasks
Published on: 01 Jun 2020

What is Kafka log compaction, and how does it work?

How the log compaction is implemented in Apache Kafka and how to configure Kafka log compaction properly
Published on: 22 May 2020

How does a Kafka Cluster work?

What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
Published on: 18 May 2020

Athena performance tips explained

How to use query execution plans to speed up Athena queries
Published on: 11 May 2020

Data flow - what functional programming and Unix philosophy can teach us about data streaming

How to write data stream processing code that is easy to maintain
Published on: 04 May 2020

AWS IAM roles and policies explained

How to give users access rights in AWS
Published on: 06 Apr 2020

How to be happy at work - lessons learned from "Career superpowers" book

What can you learn from the book "Career superpowers" by James Whittaker
Published on: 30 Mar 2020

How to send metrics to AWS CloudWatch from custom Python code

How to use boto3 to send custom metrics to AWS CloudWatch from Python
Published on: 23 Mar 2020

How to unit test PySpark

How to speed up development by unit testing PySpark DAGs
Published on: 24 Feb 2020

How to speed up a PySpark job

Why one Spark executor is running much longer than others and what you can do about it
Published on: 17 Feb 2020

How does MapReduce work, and how is it similar to Apache Spark?

The explanation of the original MapReduce paper and a description of similarities between MapReduce and Apache Spark
Published on: 10 Feb 2020

Data streaming with Apache Kafka - guide for data engineers

Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
Published on: 03 Feb 2020

Data streaming: what is the difference between the tumbling and sliding window?

There are many kinds of sliding windows. Which one should you use?
Published on: 27 Jan 2020

I put a carnivorous plant on the Internet of Things to save its life, and it did not survive

This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in Berlin, Germany).
Published on: 23 Jan 2020

What are the 4 V's of big data, and which one is the most important?

Volume, velocity, variety, and veracity
Published on: 20 Jan 2020

10x software architecture: high cohesion

How to achieve high cohesion and a few common problems.
Published on: 12 Jan 2020

How to add dependencies to AWS Lambda

How to deploy an AWS Lambda with dependencies
Published on: 08 Jan 2020

Four books to boost your programmer career

I quit my dream job because of a book
Published on: 06 Jan 2020

What is the difference between data lake, data warehouse, and data mart

We can easily distinguish between them by focusing on three qualities: data structure (schema), data quality, and ownership.
Published on: 18 Dec 2019

Three biggest traps to avoid while setting Spark executor memory

Apache Spark is wasting a lot of RAM!
Published on: 16 Dec 2019

How to use Airflow backfill to run DAGs for a specified date in the past?

How does Airflow backfill work?
Published on: 11 Dec 2019

What do you need to know about storing passwords in AWS?

How to use the AWS Secrets Manager
Published on: 09 Dec 2019

Apache Spark: should we use RDD, Dataset, or DataFrame?

Is there a difference between Dataset and DataFrame? Why do we even have both?
Published on: 04 Dec 2019

What a data engineer can learn from The Unicorn Project?

Review of The Unicorn Project by Gene Kim
Published on: 02 Dec 2019

AI in production: Roobits Events360

What would you do if you were writing an application which had to process one billion events per day?
Published on: 18 Nov 2019

AI in production: Carta Healthcare

AI systems in healthcare
Published on: 11 Nov 2019

Using Exponentially Weighted Moving Average for anomaly detection


Published on: 05 Nov 2019

Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models

There is a whole spectrum of exploration strategies between random and greedy policies.
Published on: 04 Nov 2019

Data engineering principles according to Gatis Seja

Lessons learnt from Gatis Seja's presentation about data engineering principles
Published on: 15 Oct 2019

How to remove outliers from Seaborn boxplot charts

Hide outliers when displaying boxplot in Seaborn
Published on: 14 Oct 2019

Python memory management in Jupyter Notebook

How to avoid memory leaks in Jupyter Notebook
Published on: 08 Oct 2019

AI in production: make data as easy as using your phone

Interview with Gautam Bakshi - the CEO of 15 Rock
Published on: 07 Oct 2019

How to connect a Dask cluster (in Docker) to Amazon S3

How to connect a Dask cluster (in Docker) to Amazon S3
Published on: 01 Oct 2019

A.I. in production: your next stylist is going to be a neural network

What if your phone could tell you what you should wear?
Published on: 30 Sep 2019

Loading TensorFlow models from Amazon S3 with TensorFlow Serving

How to save the model in a file, upload it to S3, and serve it using the Docker image of Tensorflow Serving
Published on: 24 Sep 2019

Pandas stack and unstack explained

How to use the stack and unstack functions in Pandas
Published on: 23 Sep 2019

How to split a data frame into time-series for LSTM deep neural network

How to prepare data for LSTM model
Published on: 15 Sep 2019

How to monitor Scrapy spiders using InfluxDB and Grafana

How to write Scrapy statistics to InfluxDB and set up Grafana alerts
Published on: 10 Sep 2019

What is the difference between training, validation, and test sets in machine learning

Training a machine learning model is like learning before an exam.
Published on: 09 Sep 2019

How to write to a Parquet file in Python

Define a schema, write to a file, partition the data
Published on: 03 Sep 2019

Numpy reshape explained

How to use the reshape function in Numpy
Published on: 02 Sep 2019

Human bias in A/B testing

Underpowered tests, true negatives, and ignored test results
Published on: 28 Aug 2019

How to plot the decision trees from XGBoost classifier

How to plot the decision rules of XGBoost
Published on: 26 Aug 2019

Smoothing time series in Python using Savitzky–Golay filter

Smoothing Bitcoin price time-series
Published on: 21 Aug 2019

XGBoost hyperparameter tuning in Python using grid search

Using GridSearchCV from Scikit-Learn to tune XGBoost classifier
Published on: 19 Aug 2019

Forecasting time series: using lag features

How to generate lag features from time series
Published on: 16 Aug 2019

How to turn Pandas data frame into time-series input for RNN

From Pandas dataframe to RNN input
Published on: 14 Aug 2019

How to automatically select the hyperparameters of a ResNet neural network

Training ResNet network for multiclass image classification using keras-tuner
Published on: 12 Aug 2019

Using Hyperband for TensorFlow hyperparameter tuning with keras-tuner

Tuning TensorFlow with Hyperband
Published on: 09 Aug 2019

Using keras-tuner to tune hyperparameters of a TensorFlow model

Tuning Keras hyperparameters with keras-tuner
Published on: 07 Aug 2019

Understanding the Keras layer input shapes

What is the input_shape in Keras/TensorFlow?
Published on: 05 Aug 2019

How to train a model in TensorFlow 2.0

TensorFlow 2 - example
Published on: 02 Aug 2019

How to train a Reinforcement Learning Agent using Tensorflow Agents

The reinforcement learning loop with Tensorflow Agents
Published on: 31 Jul 2019

How to use a custom metric with Tensorflow Agents

How to define a new Tensorflow Agents metric and add it to the driver
Published on: 29 Jul 2019

How to use a behavior policy with Tensorflow Agents

Random and scripted behavior policies
Published on: 26 Jul 2019

How to create an environment for a Tensorflow Agent?

Implementing a Tensorflow Agent environment to play a board game
Published on: 24 Jul 2019

Deep Q-network terminology in plain English

The terminology used in the paper "Human-level control through deep reinforcement learning"
Published on: 22 Jul 2019

Bellman equation explained

The fundamental equation of reinforcement learning
Published on: 19 Jul 2019

Dependencies between DAGs: How to wait until another DAG finishes in Airflow?

How to trigger Airflow DAG when another DAG is completed
Published on: 17 Jul 2019

How to run Airflow in Docker (with a persistent database)

How to configure Airflow in a Docker container
Published on: 15 Jul 2019

Using machine learning for software testing

How to sample production data to get representative testing dataset?
Published on: 12 Jul 2019

How to measure the similarity of sequence values

Levenshtein distance and Kendall tau distance
Published on: 10 Jul 2019

Measuring document similarity in machine learning

How to measure the similarity of two datasets?
Published on: 08 Jul 2019

Minkowski distance explained

Manhattan distance, Euclidean distance, and Chebyshev distance are types of Minkowski distances
Published on: 05 Jul 2019

Why do most data science projects fail?

What are the biggest challenges in data science?
Published on: 03 Jul 2019

Product/market fit - building a data-driven product

How to test a product idea?
Published on: 30 Jun 2019

How to assign people to groups in a fair way using genetic algorithms

Using Helisa and Jenetics in Scala
Published on: 21 Jun 2019

Genetic algorithms in Scala - solving optimization problems

Using Helisa and Jenetics to help Fallout players
Published on: 19 Jun 2019

Re: DataOps Principles: How Startups Do Data The Right Way

Team vs. a bunch of individuals reporting work time in the same spreadsheet
Published on: 17 Jun 2019

From Scala to Python - Python dataclasses

Domain model in Python
Published on: 14 Jun 2019

Notetaking for data science

How to document a project?
Published on: 12 Jun 2019

Wilson score in Python - example

How to calculate page popularity using the Wilson Score
Published on: 10 Jun 2019

Using a surrogate model to interpret a machine learning model

How to explain a machine learning model?
Published on: 07 Jun 2019

Generalized Linear Models — Using linear regression when the dependent variable does not follow Gaussian distribution

Understanding the GLM from the statsmodels package
Published on: 05 Jun 2019

PCA — how to choose the number of components?

How many principal components do we need when using Principal Component Analysis?
Published on: 03 Jun 2019

How to avoid bias against underrepresented target classes while training a machine learning model

The difference between KFold and StratifiedKFold in Scikit-learn
Published on: 31 May 2019

How to get the value by rank from a grouped Pandas dataframe

How to rank a grouped data frame in Pandas
Published on: 29 May 2019

The difference between the expanding and rolling window in Pandas

How to use rolling window with datetime (and other types) in Pandas
Published on: 27 May 2019

Write everything down

Lessons learnt from "Practical Data Cleaning" by Lee Baker
Published on: 24 May 2019

Understanding layer size in Convolutional Neural Networks

Filter size, padding, and stride explained
Published on: 22 May 2019

Calculating the cumulative sum of a group using Apache Spark

How to use the window function to calculate a cumulative sum
Published on: 20 May 2019

How to write to a Parquet file in Scala without using Apache Spark

How to use Parquet4s to write Parquet files in Scala
Published on: 10 May 2019

Row number in Apache Spark window — row_number, rank, and dense_rank

This article is mostly a “note to self” because I don’t want to google that anymore ;)
Published on: 08 May 2019

How to display all columns of a Pandas DataFrame in Jupyter Notebook

How to set the max columns in Pandas
Published on: 06 May 2019

Review of “Conversations On Data Science” by Roger D. Peng and Hilary Parker

Data Science book recommendation
Published on: 03 May 2019

The silly mistakes in exploratory data analysis

My most interesting Data Analysis failures
Published on: 01 May 2019

Understanding the softmax activation function

Softmax function explained
Published on: 29 Apr 2019

How to increase accuracy of a deep learning model

Debugging a machine learning model
Published on: 26 Apr 2019

Smoothing time series in Pandas

How to use the exponentially weighted window functions in Pandas
Published on: 24 Apr 2019

Which hyperparameters of deep learning model are important and how to find them

How to speed up finding the right hyperparameters of a machine learning model
Published on: 22 Apr 2019

How to choose the right mini-batch size in deep learning

Andrew Ng's recommendation about mini-batch size
Published on: 19 Apr 2019

How to deal with underfitting and overfitting in deep learning

The lessons learned from Andrew Ng’s online course
Published on: 17 Apr 2019

How to reduce memory usage in Pandas

Fit more data in the same amount of memory
Published on: 15 Apr 2019

How Airflow scheduler works

Explanation of the Airflow interval and start_date parameters
Published on: 12 Apr 2019

Guidelines for data science teams — a summary of Daniel Molnar’s talks

Avoiding over-engineering in machine learning
Published on: 10 Apr 2019

Ludwig machine learning model in Kaggle

My first attempt to use Ludwig
Published on: 08 Apr 2019

The problem of large categorical variables in machine learning

How to use FeatureHasher in Scikit-learn
Published on: 05 Apr 2019

Encoding categorical variables in machine learning

One-hot encoding, dummy coding, and effect coding in Scikit-learn and Pandas
Published on: 03 Apr 2019

How To Avoid Data Leakage While Building A Machine Learning Model

What to do when your model works perfectly during testing but fails in production
Published on: 01 Apr 2019

Using scikit-automl for building a classification model

My first attempt to use scikit-automl and how I got it working
Published on: 29 Mar 2019

How to return rows with missing values in Pandas DataFrame

How does it work and why the most popular solution is wrong
Published on: 27 Mar 2019

Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn

How to encode text/categorical variables and scale numerical values using only one Scikit-learn class
Published on: 25 Mar 2019

How to install scikit-automl in a Kaggle notebook

error: command ‘swig’ failed with exit status 1 while installing scikit-automl
Published on: 22 Mar 2019

Predicting customer lifetime value using the Pareto/NBD model and Gamma-Gamma model

How to estimate the CLV from a list of customer transactions using the lifetimes library in Python
Published on: 20 Mar 2019

Predicting customer churn using the Pareto/NBD model

How to use the Python lifetimes library to build a Pareto/NBD model.
Published on: 18 Mar 2019

Business metrics that make no sense

There are three kinds of metrics that won’t destroy your business.
Published on: 15 Mar 2019

Nested cross-validation in time series forecasting using Scikit-learn and Statsmodels

Tweaking the parameters of Statsmodels
Published on: 13 Mar 2019

How to perform an A/B test correctly in Python

What can we expect from a correctly performed A/B test?
Published on: 11 Mar 2019

[book review] The hundred-page machine learning book

I have mixed feelings about this book.
Published on: 08 Mar 2019

A few useful things to know about machine learning

Pedro Domingos’s observations about feature engineering
Published on: 06 Mar 2019

Recommendations vs. raw data — what is better?

Should we suggest an action when we visualize data?
Published on: 04 Mar 2019

How to interpret ROC curve and AUC metrics

In my opinion, AUC is a metric that is both easy to use and easy to misuse. Do you want to know why? Keep reading ;)
Published on: 01 Mar 2019

How to display mathematical equations in Jupyter Notebook

LaTeX support in Jupyter Notebook
Published on: 27 Feb 2019

Apriori algorithm explained

Using association rule learning to make recommendations
Published on: 25 Feb 2019

How to change plot size in Jupyter Notebook

Pyplot parameter that configures the chart size
Published on: 22 Feb 2019

Looking for structure in data — Andrews curves plot explained

How to read Andrews curves chart
Published on: 20 Feb 2019

Making your Scrapy spider undetectable by applying basic statistics

How to delay scraper requests to make it look like a human visiting the website
Published on: 18 Feb 2019

Finding seasonality in time series using autocorrelation plot

How to interpret autocorrelation plot?
Published on: 15 Feb 2019

My favourite data science podcasts

I was asked for some podcast recommendations, so here is my very short list ;)
Published on: 13 Feb 2019

A podcast that changed my perspective on exploratory data analysis

How to avoid bad science
Published on: 11 Feb 2019

How to read a confusion matrix

Predicted labels are in columns, right? Or maybe in rows? Do you remember? ;)
Published on: 08 Feb 2019

The optimal learning rate during fine-tuning of an artificial neural network

How to set the learning rate after you unfreeze the network layers in fast.ai
Published on: 06 Feb 2019

F1 score explained

The mathematics behind F1 score.
Published on: 04 Feb 2019

How to display a progress bar in Jupyter Notebook

Display a progress bar with no additional dependencies, just Python + Jupyter Notebook
Published on: 01 Feb 2019

Save and restore a Tensorflow model using Keras for continuous model training

How to run the fit function multiple times and improve the model?
Published on: 30 Jan 2019

A comprehensive guide to putting a machine learning model in production using Flask, Docker, and Kubernetes

How to use Docker and Flask to put a Scikit model in production as a microservice.
Published on: 28 Jan 2019

Query string validation in Fastify

How to validate query parameters using Fastify
Published on: 25 Jan 2019

Mental models: inversion

Solve the opposite problem to avoid stupidity.
Published on: 23 Jan 2019

How to save a machine learning model into a file

Saving a Scikit-learn model using the joblib library in Python
Published on: 21 Jan 2019

Music and other distractions

Why is it difficult to work in the office?
Published on: 18 Jan 2019

[book review] You had me at Hello, World

Did you ever want to have a mentor?
Published on: 16 Jan 2019

[book review] Deep work by Cal Newport

How to focus on the high outcome tasks and avoid being distracted
Published on: 14 Jan 2019

Bootstrapping vs. bagging

The difference explained
Published on: 11 Jan 2019

Git fixup explained

How to change the commit history
Published on: 09 Jan 2019

[book review] So good they can’t ignore you

A polarizing book
Published on: 07 Jan 2019

[book review] The effective engineer

What is the best investment of your time?
Published on: 21 Dec 2018

How to remove all Docker images and containers

An explanation of removing Docker images and containers.
Published on: 19 Dec 2018

Understanding uncertainty intervals generated by Prophet

How to tweak uncertainty intervals in Prophet.
Published on: 17 Dec 2018

A Python HTTP server for serving static content

How to easily serve static content on localhost or in the local network
Published on: 14 Dec 2018

Assert object pattern

The easiest way to make your tests more readable and easier to maintain
Published on: 12 Dec 2018

5 best books I read in 2018

A list that surprised even me…
Published on: 10 Dec 2018

How to run a single test in SBT

Why does testOnly not work?
Published on: 07 Dec 2018

How to use Scrapy to follow links on the scraped pages

A web spider that does not follow links is not very useful. Let’s fix that.
Published on: 05 Dec 2018

How to scrape a single web page using Scrapy in Jupyter Notebook?

Scrapy Spiders and processing pipelines 101
Published on: 03 Dec 2018

What is inside a Docker image?

How to unpack a Docker image
Published on: 30 Nov 2018

What is wrong with tech conferences?

Why are tech conferences boring?
Published on: 28 Nov 2018

Java performance testing — Epsilon garbage collector

How to make sure that GC does not stop the JVM during a test?
Published on: 21 Nov 2018

Turning greenfield projects into brownfield projects

What happens when the team lacks software craft skills?
Published on: 19 Nov 2018

"The war of art" and other books I did not finish reading

You can read more good books if you skip the lousy ones.
Published on: 16 Nov 2018

Prophet plot explained

How to read the Prophet forecast plot
Published on: 14 Nov 2018

Brain dump — programmer productivity experiment #2

How to generate new ideas instead of thinking about the same thing over and over again
Published on: 12 Nov 2018

Programmer diary — programmer productivity experiment #1

One of the most intriguing ideas described in the book "How Google works" is writing "snippets."
Published on: 09 Nov 2018

Smart creative — the new role model

It may look like a unicorn, but it is real
Published on: 07 Nov 2018

"The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger" by Marc Levinson

What happens when one invention makes the whole industry obsolete?
Published on: 05 Nov 2018

How to visualise prediction errors

How to explain the errors of a linear regression model
Published on: 02 Nov 2018

User story mapping for developers

A natural way of splitting work into small, but useful parts
Published on: 29 Oct 2018

Is programming art or science?

On a quest to find the right metaphor
Published on: 24 Oct 2018

Test-driven development in Jupyter Notebook

TDD for data scientists working with Jupyter Notebook
Published on: 22 Oct 2018

Machine learning cheat sheets

A collection of machine learning cheat sheets I find useful and google repeatedly.
Published on: 19 Oct 2018

Does a tester break the product?

How does a name influence our attitude?
Published on: 17 Oct 2018

Dealing with dates and time in Pandas

How to use Pandas to parse dates or calculate time in a different timezone.
Published on: 15 Oct 2018

Fill missing values using Random Forest

How to predict the missing values using Scikit-Learn
Published on: 12 Oct 2018

The one important thing I learned from "Beyond Developer" by Dan North

How to motivate software engineers?
Published on: 10 Oct 2018

Programmers love new toys but hate new habits

We talk about toys. We love new buzzwords. We adore things that sound cool. Yes, we do.
Published on: 08 Oct 2018

A "known bug" is still a bug

What does a "known bug" or an update say to our users?
Published on: 05 Oct 2018

How to build a project inside a Docker container

How to safely run code downloaded from the Internet
Published on: 03 Oct 2018

[book review] Dichotomy of leadership

The follow-up to “Extreme ownership”
Published on: 01 Oct 2018

Box and whiskers plot

How to plot and interpret the box and whiskers plot
Published on: 28 Sep 2018

[book review] Team Geek

This book deserves a 3-star review on Amazon for many reasons.
Published on: 26 Sep 2018

How I failed to plot parallel coordinates in Matplotlib

Built-in matplotlib functions are not enough in this case
Published on: 24 Sep 2018

Import Jupyter Notebook from GitHub

The easiest way to access someone else’s code in your own notebook
Published on: 21 Sep 2018

Fill missing values in Pandas

Use the next or previous value to fill the missing values in Pandas
Published on: 19 Sep 2018

Forward feature selection in Scikit-Learn

Two workarounds to get an equivalent of forward feature selection in Scikit-Learn
Published on: 17 Sep 2018

Heat map with Matplotlib

A short tutorial about generating a heat map of the values stored in a Pandas dataframe
Published on: 14 Sep 2018

Language is all about nouns

Programmers are afraid of nouns. We often replace them with poorly written descriptions of things.
Published on: 12 Sep 2018

Outlier detection with Scikit Learn

Z-score and Density-Based Spatial Clustering of Applications with Noise
Published on: 10 Sep 2018

How to set the global random_state in Scikit Learn

What to do if you keep forgetting to set the random_state?
Published on: 31 Aug 2018

JUG Thüringen meetup - retrospective

My opinion about my presentation at a meetup in Erfurt, Germany.
Published on: 29 Aug 2018

How to split a list inside a Dataframe cell into rows in Pandas

Step by step instructions to "explode" a list into DataFrame rows.
Published on: 27 Aug 2018

Interactive plots in Jupyter Notebook

How to create a plot that supports zooming
Published on: 24 Aug 2018

[book review] James Whittaker's Little Book of the Future

Read this book if you believe we can use A.I. and IoT to build a bright future.
Published on: 22 Aug 2018

Probability plot - visually compare probability distributions

How to visually check whether your sample is normally distributed?
Published on: 20 Aug 2018

Count unique elements of an infinite stream of objects

HyperLogLog - probabilistic counting algorithm
Published on: 19 Aug 2018

Live unit testing with sbt

Can I have the coolest Visual Studio feature in IntelliJ?
Published on: 18 Aug 2018

Monte Carlo simulation in Python

How to make business decisions using the Monte Carlo simulation?
Published on: 17 Aug 2018

Word cloud from a Pandas data frame

Create a nice visualization of the most popular words in your data frame
Published on: 07 Aug 2018

Scala structural types with generics

A short example of defining a structural type which matches a generic class
Published on: 05 Aug 2018

Visualize common elements of two datasets using NetworkX

How to use an undirected graph to visualize common elements of two Pandas data frames
Published on: 03 Aug 2018

How to load data from Google Drive to Pandas running in Google Colaboratory

How to import a CSV file from Google Drive into Google Colab
Published on: 14 Jul 2018

Precision vs. recall - explanation

How to understand the difference between precision and recall?
Published on: 15 Jun 2018

The cake pattern is a lie

Cake pattern was a terrible idea.
Published on: 12 Jun 2018

Can we make it more generic?

What can we learn from a horrible mistake made by a programmer who wanted to make the code more generic?
Published on: 06 Jun 2018

[JUG Thüringen] Effortless Domain-Driven Design - The real Power of Scala

How to use some parts of Domain-Driven Design to create maintainable code in Scala
Published on: 11 May 2018

Buzzwords, buzzwords everywhere

Do we behave like a child in a toy store?
Published on: 07 Apr 2018

Can’t learn anything? You’re doing it wrong

Have you been trying to learn something for a few months? What to do when you keep learning but still don’t understand anything?
Published on: 15 Mar 2018

Re: “I Don’t Want To Maintain Their Code”

How can we facilitate knowledge sharing? Will easily accessible documentation foster cooperation?
Published on: 03 Mar 2018

Discipline Equals Freedom — Jocko Willink

A review of Jocko Willink’s book: “Discipline Equals Freedom.” Should you read it even if you don’t want to run a marathon?
Published on: 24 Feb 2018

Prevent accidental deployments on Friday

You feel you should not deploy your code on Fridays but nothing stops you. Can you prevent accidental deployments?
Published on: 18 Feb 2018

Developers just wanna have fun

Software maintenance is painful because of hype-driven development.
Published on: 22 Jan 2018

Support for old browsers — is it necessary?

Do you think that every web page should support all existing browsers? How about all versions of those browsers?
Published on: 17 Jan 2018

The beauty of properly used statically typed languages

The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
Published on: 14 Jan 2018

Extreme ownership and software engineering

What can a software engineer learn from the book “Extreme Ownership”? How can it influence your daily work?
Published on: 11 Jan 2018

Perpetually dysfunctional software

What happens when we release another beta version? Are users happy or angry? What if the reality is different than we think?
Published on: 26 Jul 2017

4 reasons why TDD slows you down

It is easy to announce that TDD slows you down, but have you ever wondered why it happens? Is there anything you can do better?
Published on: 08 May 2017

Always stop unused Akka actors

Akka actors do not magically disappear when you no longer need them.
Published on: 03 May 2017

Scalar 2017

Scalar Conference 2017 — everything I liked
Published on: 12 Apr 2017

Reversing a binary tree and other great interview questions

We do not like being asked to write an algorithm on a whiteboard during job interviews, but is there a better way?
Published on: 19 Mar 2017

The importance of documenting things

What happens when someone asks you about your code and you cannot answer because you have no idea how it works? That happened to me… again.
Published on: 12 Mar 2017

One thing can improve LambdaDays

One thing that can significantly improve LambdaDays.
Published on: 12 Feb 2017

Can we write cheaper software?

Can we write our web application in a way which saves our company money? Can we make our software cheaper?
Published on: 29 Jan 2017

A year of Poznan Scala User Group

How can a group created because of a tweet exist for over a year? Have we learned anything in a year? What are we going to do now?
Published on: 23 Jan 2017

Rage against unprofessionalism in software engineering

Today software engineers disappointed another person…
Published on: 15 Jan 2017