Why is it so hard to correctly estimate AI projects?
Why can't you estimate an AI project correctly and can you do anything about it?
Published on: 17 Feb 2025
Bartosz Mikulski's articles on AI, MLOps, and data engineering
Learn how to properly test AI systems using familiar software testing concepts. Discover key metrics, alignment checks, and robustness testing strategies for reliable AI deployment.
Published on: 10 Feb 2025
API wrapper or production-ready AI? Learn how proper LLMOps separates prototypes from reliable applications
Published on: 27 Jan 2025
Expert strategies for improving AI agent performance through better data retrieval, query generation, automated decision-making process, and response generation. The article covers data collection, metrics, and techniques to improve the agent's performance.
Published on: 20 Jan 2025
Learn how to implement AI workflows and autonomous agents with PydanticAI. This guide shows an example implementation of patterns described in the Anthropic article 'Building effective agents' such as prompt chaining, routing, parallelization, and orchestrator-workers.
Published on: 13 Jan 2025
A data-driven approach for improving RAG performance. Learn how to gather data and how much data you need for RAG, fine-tuning LLM, and training a specialized LLM from scratch.
Published on: 06 Jan 2025
A step-by-step tutorial on implementing HyDE technique to improve RAG retrieval accuracy, with code examples and performance evaluation using Ragas.
Published on: 30 Dec 2024
A comprehensive guide to evaluating SQL-generating AI agents using Ragas metrics, focusing on query equivalence and output format validation.
Published on: 23 Dec 2024
Published on: 16 Dec 2024
Published on: 09 Dec 2024
A comprehensive tutorial on fine-tuning Mistral-7B using QLoRA and Axolotl, covering data preparation, model configuration, and text classification optimization.
Published on: 11 Nov 2024
Learn how to use Langfuse to manage prompts and track LLM requests in your AI applications. Discover how to version prompts, monitor usage, and improve your LLM applications with detailed analytics.
Published on: 04 Nov 2024
Learn to use OpenAI's Swarm Agentic Framework to build intelligent AI workflows. This guide covers the basics of agents, defining functions, and defining interactions between agents.
Published on: 28 Oct 2024
Learn how to set up Jupyter Notebook servers with GPU access for students using CoCalc. A guide for educators and workshop organizers looking to provide hardware for their students without (too much) hassle.
Published on: 21 Oct 2024
Published on: 10 Sep 2024
You don't always need a large language model to get good results. A fine-tuned small language model is often just as good, much faster, and cheaper.
Published on: 30 Jul 2024
Discover advanced techniques to enhance the accuracy of your Retrieval-Augmented Generation (RAG) systems. Learn about semantic search, query expansion, HyDE, and keyword search to improve data retrieval and answer quality.
Published on: 16 Jul 2024
Hallucinations erode trust, so we should prevent them from reaching the users. In this article, I show how to use the FaithfulnessEvaluator from the Llama-Index library to determine whether the documents retrieved by vector search contain the answer to the user's question.
Published on: 30 Jun 2024
Every RAG system starts with retrieval. How do you know if your retrieval code is good enough? You measure it. The article shows how to use the ir_measures library to calculate Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to quantify the performance of your retrieval code.
Published on: 20 Jun 2024
How to use Marvin and Instructor to define a structured output for LLMs and build a data retrieval workflow that can answer user's questions about data by checking if the required data is available in the database, planning what data has to be retrieved, generating the query, executing it, and generating a human-readable answer.
Published on: 10 Jun 2024
How to build an agentic AI workflow using the Llama 3 open-source LLM and LangGraph. We will create an autonomous multi-step process that autonomously handles a data retrieval task and answers users' questions using multiple specialized AI agents.
Published on: 20 May 2024
Build a chat with a YouTube video or a PDF chatbot in one hour (or less) with OpenAI Assistant API and Streamlit
Published on: 10 May 2024
Discover the difference between proven prompt engineering techniques and tricks
Published on: 30 Apr 2024
How to monitor whether your employees leak company secrets, PII, or passwords to AI
Published on: 10 Jan 2024
AI in everyday usage: How to build an Anki plugin for generating example sentences in a foreign language using AI
Published on: 20 Dec 2023
Discover how to effectively monitor AI applications, track costs, and optimize API usage with Langsmith and GPTBoost. Get insights into managing OpenAI API interactions for improved efficiency and cost control.
Published on: 30 Nov 2023
Explore the innovative approach of using AI in content creation through the GPTBoost case-study. Learn how Veselina Staneva masterfully blends AI efficiency with human creativity in her writing process.
Published on: 22 Nov 2023
How product managers can use AI to make their work easier without suffering the consequences of AI misuse
Published on: 20 Nov 2023
How to integrate OpenAI custom GPT with make.com scenario using webhooks - a tutorial on using GPTs in business automation workflows.
Published on: 10 Nov 2023
How to build an AI chatbot that retrieves answers to user's questions from transcripts of YouTube videos
Published on: 30 Oct 2023
Explore the seamless integration of artificial intelligence with classical machine learning techniques for effective topic modeling and document clustering. Learn how word embeddings enable higher accuracy, semantic context preservation, and robust results.
Published on: 30 Sep 2023
How to use Langchain.js to build a question-answering service that uses vector databases to store the documents and AI to generate the answers
Published on: 20 Sep 2023
What should you use when your AI needs access to external systems? Is it better to use Langchain Agents or OpenAI Functions?
Published on: 10 Sep 2023
How to use Langchain model response and document embeddings caching to save time and money when using Large Language Models
Published on: 30 Aug 2023
How to use OpenAI GPT models, Langchain, and Doctran to generate questions and answers from long documents
Published on: 25 Aug 2023
Using Langchain MapReduceChain to handle documents longer than the prompt limit
Published on: 20 Aug 2023
How to find information in long documents with AI, vector databases, and Langchain using MapReduceChain and ParentDocumentRetriever
Published on: 15 Aug 2023
How to use the Llama2 AI model in Python to build a text classification service
Published on: 10 Aug 2023
Monitor interactions with LLM in Langchain and gather feedback about the model's performance using Langsmith
Published on: 30 Jul 2023
Discover how AI and automation can streamline the process of gathering relevant information from newsletters, summarizing it, and delivering a weekly summary
Published on: 30 Jun 2023
Build an AI-powered chatbot that can interact with REST API using the Function Calling feature of OpenAI Completion API. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 16 Jun 2023
Explore LMQL, a powerful SQL-like language designed for machine learning tasks.
Published on: 10 Jun 2023
Which Llama index should you use? When is it better to use GPTVectorStoreIndex, GPTListIndex, GPTKeywordTableIndex, or GPTKnowledgeGraphIndex?
Published on: 30 May 2023
How to prepare the training data for an OpenAI model and how to fine-tune OpenAI's GPT model in Python
Published on: 08 May 2023
Learn the essentials of deploying large language models in production with our comprehensive guide on software architecture for AI
Published on: 30 Apr 2023
Use AI to validate another AI's output. Learn how to create custom validators and corrections using the Guardrails library.
Published on: 20 Apr 2023
A step-by-step guide to building a ChatGPT plugin in Python to retrieve data from the knowledge base stored in a vector database
Published on: 15 Apr 2023
How to use AI to generate test cases for your code
Published on: 10 Apr 2023
Discover how to leverage the powerful open-source Cerebras model with LangChain in this comprehensive guide, featuring step-by-step instructions for loading the model with HuggingFace Transformers, creating prompt templates, and integrating it with LangChain Agents.
Published on: 30 Mar 2023
Improve your coding skills and elevate your writing with GPT-4 as your AI-driven pair programming partner, guiding you through the process of building a web application that functions as a user-friendly reverse dictionary.
Published on: 20 Mar 2023
How to create an AI-powered newsletter generator using the dust.tt and OpenAI API. You'll learn how to set up a website with the AI application, use the few-shot in-context learning technique to train the AI model, and deploy the API to generate newsletters.
Published on: 10 Mar 2023
A step-by-step tutorial on ChatGPT API (versions 1.1.1+) in Python. You'll also learn about prompt engineering, interactivity, optimizing API calls, and using parameters to get better results. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 01 Mar 2023
How to build an AI-powered Facebook chatbot using GPT-3 from OpenAI and vector databases to answer client questions using your documentation - a tutorial with step-by-step instructions. You will learn how to set up a database, create text embeddings, use MLOps and prompt engineering to retrieve answers, and build a web application to connect with the Facebook API.
Published on: 28 Feb 2023
Discover how word embeddings and vector databases can revolutionize text search and duplicate detection. Learn how to implement it with OpenAI GPT-3 and Milvus vector database.
Published on: 20 Feb 2023
Learn how to use GPT-3 to automate Git commit message generation and speed up your development workflows.
Published on: 15 Feb 2023
Unleash the potential of GPT-3 and make it access the Internet. Learn how to use Langchain and build a Slack bot that can do a web search, extract text from websites, and perform calculations.
Published on: 10 Feb 2023
How does the in-context learning prompt engineering technique improve GPT-3 results, and why does it work? What's the difference between zero-shot, one-shot, and few-shot prompting?
Published on: 30 Jan 2023
Build an AI-powered Slack bot that reads data from your production database and answers simple analytics questions
Published on: 23 Jan 2023
Are you looking for a way to generate high-quality content quickly and effectively? This article outlines how to use AI to create a landing page. You will learn more about the AIDA marketing model and the use of OpenAI's ChatGPT and GPT-3 to generate the text. Also, I will show you how to incorporate audience research into the prompt to improve the result. Get all the information you need to create high-quality content quickly and effectively with AI.
Published on: 20 Jan 2023
Why do you need text summarization services in your business? How can you deploy a model downloaded from HuggingFace using the Qwak ML platform in 15 minutes?
Published on: 10 Jan 2023
Do architecture diagrams still matter? How do we deal with constant changes? How to design software architecture?
Published on: 20 Dec 2022
Are you struggling to manage and update your legacy codebase? In this article, I'll show you how to leverage the power of abstraction layers to overcome common challenges with legacy code.
Published on: 10 Dec 2022
What kills IT projects? What you should avoid, at all costs, to ensure the success of your startup or software project
Published on: 30 Nov 2022
How to write a growth plan that helps you get promoted and doesn't get in the way when you want to focus on your hobbies
Published on: 20 Nov 2022
How to set up and use Pytest to test Python code
Published on: 10 Nov 2022
How to use the "benefits over features" technique to advertise your SaaS product and get more clients than your competition
Published on: 30 Oct 2022
What a co-founder of DeepMind teaches us about pitching our ideas to investors
Published on: 20 Oct 2022
How to do MLOps while working on a small data engineering team
Published on: 10 Oct 2022
What are the benefits of TDD for programmers and companies that hire them?
Published on: 30 Sep 2022
How to debug code and solve problems as fast as possible
Published on: 20 Sep 2022
Does it make sense to use SOLID principles in data engineering? What about CUPID properties in data pipelines?
Published on: 10 Sep 2022
How data engineers can write tests for legacy code in their ETL pipelines without breaking the existing implementation
Published on: 30 Aug 2022
How to produce high-quality software in data teams by applying software engineering practices to data science and data engineering
Published on: 20 Aug 2022
How to use an ordered categorical variable to sort a Pandas Dataframe by months while displaying their names
Published on: 15 Aug 2022
What do you need to know to become a data engineer? Does a data engineer need a degree? How can you get your first data engineering job?
Published on: 10 Aug 2022
What is Kappa Architecture? When should we use Kappa Architecture? What's the difference between Kappa Architecture and Lambda Architecture? And way, way more!
Published on: 30 Jul 2022
How to work with code written by other people? What to do when you join a new team?
Published on: 20 Jul 2022
Does functional programming in Python make sense? How to do functional programming in Python?
Published on: 10 Jul 2022
How to document a software project?
Published on: 20 Jun 2022
Should you use a data warehouse or build a data lake? When is a data warehouse a better choice? When is it better to build a data lake?
Published on: 10 Jun 2022
How to use loc, iloc, slice, and row filtering in Pandas
Published on: 27 May 2022
How can we define a Python decorator, and when should we use Python decorators?
Published on: 25 May 2022
When does an Apache Spark cluster perform the shuffle operation?
Published on: 20 May 2022
What is the primary, unrepairable cause of almost all bugs, data leaks, human problems, etc.?
Published on: 13 May 2022
What's stopping us from getting better at coding
Published on: 06 May 2022
How to prepare an enjoyable programming workshop that teaches people the skills they need without overwhelming them with new knowledge.
Published on: 15 Apr 2022
What you should avoid when you interview programmers for a data engineer position
Published on: 08 Apr 2022
How to make debugging easier by paying attention to the errors you report
Published on: 01 Apr 2022
The one practice that makes every team faster (in the long run)
Published on: 25 Mar 2022
Why do programmers make wrong decisions when they choose the tools they use?
Published on: 18 Mar 2022
What can data engineers learn from (ancient) librarians?
Published on: 11 Mar 2022
I believed in the Sexiest Job of the 21st Century hype. I was wrong.
Published on: 04 Mar 2022
Don't reinvent the wheel as an MLOps engineer. The 3 books you must read in 2022
Published on: 25 Feb 2022
Benefits of having a feature store and what happens when you don't have one
Published on: 18 Feb 2022
How to document an ETL pipeline or ML inference pipeline without doing useless work
Published on: 11 Feb 2022
Running a batch machine learning job using Sagemaker and data stored in S3.
Published on: 04 Feb 2022
Are we building the right abstractions in software?
Published on: 28 Jan 2022
Do you struggle with maintaining your legacy data pipelines? Check out our article on how to add tests and refactor your code while working with legacy data pipelines.
Published on: 21 Jan 2022
How to quickly train junior engineers to make them as productive as the rest of the team
Published on: 14 Jan 2022
When you're debugging a failing production pipeline at 2 am, what do you need?
Published on: 07 Jan 2022
Trivial (and easily fixable) mistakes that will make you fail a job interview
Published on: 31 Dec 2021
Why do data engineers quit their jobs?
Published on: 24 Dec 2021
What KPI to measure in an MLOps team
Published on: 17 Dec 2021
The minimal setup for ML deployment without the things you DON'T need yet
Published on: 10 Dec 2021
What's the difference between reasonable future-proof architecture and overengineering? Is there a difference?
Published on: 04 Dec 2021
What is the difference between pattern matching in Python and Scala?
Published on: 26 Nov 2021
How to put AI in production without overengineering your system
Published on: 19 Nov 2021
Atlan - a tool for facilitating a collaborative data culture
Published on: 15 Nov 2021
Should you spend time learning data engineering tools and libraries?
Published on: 12 Nov 2021
What is shadow deployment in machine learning? What is a canary release? What is the difference?
Published on: 05 Nov 2021
Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML
Published on: 01 Oct 2021
How to teach writing automated tests: TDD, BDD, and other techniques
Published on: 24 Sep 2021
How to use Python-Deequ to validate Spark Dataframes
Published on: 17 Sep 2021
What is Qwak ML platform and how does it work?
Published on: 03 Sep 2021
Learning Test-Driven Development is hard and there is nothing we can do about it
Published on: 27 Aug 2021
What is true in every data engineering project?
Published on: 20 Aug 2021
How to deploy MLFlow on Heroku using PostgreSQL as the database, S3 as the artifact storage and with BasicAuth authentication
Published on: 06 Aug 2021
A complete definition of MLOps. No, MLOps isn't just DevOps applied to machine learning!
Published on: 30 Jul 2021
How to use Feast feature store in a local environment
Published on: 09 Jul 2021
How to build a trustworthy data pipeline?
Published on: 02 Jul 2021
Are you busy, but nothing ever gets done? Perhaps the theory of constraints will help you
Published on: 25 Jun 2021
How writing texts for people makes you a better programmer
Published on: 18 Jun 2021
How to make data product demos more engaging and persuade people to care about the data
Published on: 11 Jun 2021
How to deploy multiple models in a single Sagemaker Endpoint?
Published on: 28 May 2021
Is the Pandas library too slow? Here are two methods to speed it up!
Published on: 21 May 2021
Why you should use LakeFS to build a data lake that supports data versioning
Published on: 14 May 2021
How to customize input/output of a Sagemaker Endpoint running a Tensorflow model
Published on: 07 May 2021
How to deploy multiple model versions as one Sagemaker Endpoint
Published on: 30 Apr 2021
How to train the RNN model in Tensorflow to predict time series?
Published on: 23 Apr 2021
How to create a REST API Endpoint using AWS Lambda, Chalice, and AWS Code Pipeline
Published on: 16 Apr 2021
How to build a Docker image using AWS Code Pipeline and deploy it as a Sagemaker Endpoint
Published on: 09 Apr 2021
How to encode week days as features for machine learning models
Published on: 26 Mar 2021
How to start blogging as a programmer
Published on: 19 Mar 2021
How does dropout work in artificial neural networks?
Published on: 12 Mar 2021
How to track Spark metrics in AWS CloudWatch
Published on: 05 Mar 2021
How to build a Docker image, define an AWS Batch job using Terraform, and run the AWS Batch job using Airflow
Published on: 26 Feb 2021
How to detect problems in Airflow pipeline using Prophet for time series anomaly detection
Published on: 12 Feb 2021
Testing a REST API using Behave in Python
Published on: 05 Feb 2021
How to use BDD to test PySpark code
Published on: 29 Jan 2021
When can data engineers be sure that they have done the task?
Published on: 14 Jan 2021
Should you learn a new programming language this year?
Published on: 07 Jan 2021
Fetching data using a SQL query in PySpark
Published on: 01 Jan 2021
What to do when an Airflow DAG gets stuck and does not want to run
Published on: 31 Dec 2020
How to make an Airflow DAG wait until a specified day of the week
Published on: 30 Dec 2020
How to configure SNS subscription to send SMS messages and use Airflow to send them
Published on: 29 Dec 2020
Use CTAS to create a temporary table in Athena
Published on: 28 Dec 2020
How to configure S3 bucket versioning in Terraform
Published on: 27 Dec 2020
Get a Slack notification when a file is uploaded to an S3 bucket
Published on: 26 Dec 2020
How to get the task instance in the on_failure_callback to get access to XCom
Published on: 25 Dec 2020
Automatically add the insertion and update time in MySQL
Published on: 24 Dec 2020
How to partition data in S3 by date in a way that makes your life easier
Published on: 23 Dec 2020
How to use JDBC driver in PySpark to write a DataFrame to a SQL database
Published on: 22 Dec 2020
How to add a jar file or a Python file as a Pyspark dependency
Published on: 21 Dec 2020
How to use kafka-consumer-groups.sh to reset topic offsets
Published on: 20 Dec 2020
How to remove all messages from a Kafka topic
Published on: 19 Dec 2020
How to use the last_day function in Redshift
Published on: 18 Dec 2020
LEFT OUTER JOIN ON 1=1 in Redshift
Published on: 17 Dec 2020
How to count the rows by multiple conditions at the same time in SQL
Published on: 16 Dec 2020
How to create an equivalent of an index in Redshift
Published on: 15 Dec 2020
How to use the generate_series function to generate a sequence of dates
Published on: 14 Dec 2020
How to use the NTILE function in Athena
Published on: 13 Dec 2020
How to use the AWSAthenaOperator
Published on: 12 Dec 2020
How to enable column position support in Hive GROUP BY or ORDER BY
Published on: 11 Dec 2020
What is the equivalent of Athena/Presto regexp_like in Hive
Published on: 10 Dec 2020
How to use Airflow PythonSensor to check whether a YARN application finished running
Published on: 09 Dec 2020
Using conditions in AWS Athena queries
Published on: 08 Dec 2020
How to use from_base64 in AWS Athena
Published on: 07 Dec 2020
Use full outer join to combine two Apache Spark DataFrames with no common columns
Published on: 06 Dec 2020
How to get the names of missing properties for every row in a PySpark Dataframe
Published on: 05 Dec 2020
How to use a different retry delay in every Airflow task
Published on: 04 Dec 2020
How to use Airflow to find the Hive partition closest to a given date
Published on: 03 Dec 2020
Get the start time or the execution date of the previous successful DAG run in Airflow
Published on: 02 Dec 2020
How to disable backfilling of an Airflow DAG or skip a part of the DAG during a backfill
Published on: 01 Dec 2020
S3 sends s3:TestEvent to SQS after setting up the bucket notifications
Published on: 30 Nov 2020
How to use OFFSET in AWS Athena queries
Published on: 29 Nov 2020
How to get a notification when AWS Lambda stops being used
Published on: 28 Nov 2020
How to use command-line to set Airflow variables
Published on: 27 Nov 2020
How to track the time when an Athena table was updated
Published on: 26 Nov 2020
How to keep running an Airflow DAG indefinitely
Published on: 25 Nov 2020
Get an XCOM variable from another DAG
Published on: 24 Nov 2020
How to fix TemplateNotFound error when using Airflow BashOperator
Published on: 23 Nov 2020
How to copy files in S3 and preserve the directory structure
Published on: 22 Nov 2020
How to use a window function to select random rows from Athena
Published on: 21 Nov 2020
Pause an Airflow DAG until an HTTP endpoint returns 200 OK
Published on: 20 Nov 2020
How to use AwsHook and EmrStepSensor to add an EMR step and wait until it finishes running
Published on: 19 Nov 2020
How to use the PythonVirtualenvOperator in Airflow
Published on: 18 Nov 2020
How to remove files with a common prefix from S3
Published on: 17 Nov 2020
How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.
Published on: 16 Nov 2020
How to calculate row number by partition in Hive and use it to filter rows
Published on: 15 Nov 2020
Use EMR instance group to add spot instances to an EMR cluster
Published on: 14 Nov 2020
Disable an AWS Lambda using AWS CLI
Published on: 13 Nov 2020
How to configure a new EMR step using AWS Lambda in Python
Published on: 12 Nov 2020
Trigger AWS Lambda when a file is created in an S3 bucket
Published on: 11 Nov 2020
How to pass environment parameters to Serverless that depend on the deployment stage
Published on: 10 Nov 2020
How to add a custom function to Airflow and use it in a template
Published on: 08 Nov 2020
Use HyperLogLog to calculate the approximate number of distinct elements in Apache Spark
Published on: 07 Nov 2020
How to pass parameters to SQL template when using PostgresOperator in Airflow
Published on: 06 Nov 2020
Use regex to replace the matched string with the content of another column in PySpark
Published on: 05 Nov 2020
What to do when Apache Spark skips Parquet files with incompatible schemas
Published on: 04 Nov 2020
How to choose the proper partition size and the number of partitions to run an Apache Spark job
Published on: 03 Nov 2020
How to use pagination to retrieve all DynamoDB values
Published on: 02 Nov 2020
How to get notifications about a running EMR cluster
Published on: 01 Nov 2020
How to define S3 lifecycle rules using Terraform
Published on: 31 Oct 2020
How to retry a Python function call in case of an error
Published on: 30 Oct 2020
How to use the SlackAPIPostOperator to send a templated message to a Slack channel
Published on: 29 Oct 2020
How to use the DateTimeSensor in Airflow
Published on: 28 Oct 2020
How to submit a PySpark job using SSHOperator in Airflow
Published on: 27 Oct 2020
How can you add a human action to an Airflow DAG?
Published on: 26 Oct 2020
Stop wasting time and money tuning Apache Spark parameters
Published on: 26 Oct 2020
How to use the BranchSQLOperator to choose a DAG branch to execute
Published on: 25 Oct 2020
How to trigger another DAG from an Airflow DAG
Published on: 24 Oct 2020
How to fix the stuck ExternalTaskSensor
Published on: 23 Oct 2020
How to generate the code of an Airflow task from a template and a given execution date
Published on: 22 Oct 2020
How to use Airflow CLI to get the next execution date of a DAG
Published on: 21 Oct 2020
How to use SQLCheckOperator to verify that the database contains an expected number of rows
Published on: 20 Oct 2020
How to fix the TemplateNotFound error while using a custom Airflow operator
Published on: 19 Oct 2020
How to use Airflow sensors to detect that files have been uploaded into an S3 bucket
Published on: 18 Oct 2020
What is an action in Apache Spark? What do you understand as transformations in Apache Spark?
Published on: 17 Oct 2020
How to skip some tasks when backfilling a DAG in the past
Published on: 16 Oct 2020
An interview with Christopher Bergh who explains how the DataOps principles help data engineers make data pipelines trustworthy
Published on: 16 Oct 2020
How to make a dashboard that displays Airflow DAG statuses
Published on: 15 Oct 2020
How to find the idle session that is blocking the connection pool in Redshift
Published on: 14 Oct 2020
How to configure EMR to use all available resources when running a Spark cluster
Published on: 13 Oct 2020
How to check that an AWS Athena table contains data after running an Airflow DAG.
Published on: 12 Oct 2020
How to create and start an AWS Glue Crawler from Python code using boto3
Published on: 11 Oct 2020
How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
Published on: 10 Oct 2020
How to write multiple DynamoDB objects at once using boto3
Published on: 09 Oct 2020
How to upload S3 data into RDS tables
Published on: 08 Oct 2020
How to concatenate multiple rows into a string in MySQL
Published on: 07 Oct 2020
How to get an array of elements from one column when grouping by another column in Hive
Published on: 06 Oct 2020
How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
Published on: 05 Oct 2020
How to use the saveAsTable function to create a partitioned table
Published on: 04 Oct 2020
Should we cache everything in Apache Spark, or are there any rules?
Published on: 03 Oct 2020
How to convert DataFrame fields into separate columns.
Published on: 02 Oct 2020
How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup?
Published on: 01 Oct 2020
How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
Published on: 30 Sep 2020
Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Published on: 29 Sep 2020
How to speed up joins of small DataFrames by using the broadcast join
Published on: 28 Sep 2020
How to group values by a key and extract a single row from each group in Apache Spark
Published on: 27 Sep 2020
How to make a pivot table in AWS Athena, and why the pivot function does not exist
Published on: 26 Sep 2020
When should you use coalesce instead of repartition in Apache Spark?
Published on: 25 Sep 2020
How to turn an Apache Spark or PySpark DataFrame into a pivot table.
Published on: 24 Sep 2020
When should you use the cache function, and when should you use the persist function?
Published on: 23 Sep 2020
Should your team use PrestoSQL?
Published on: 16 Sep 2020
How to detect problems in data pipelines before they turn into hard-to-debug bugs? I wish I knew.
Published on: 08 Sep 2020
How to define partition projection while creating an Athena table
Published on: 30 Aug 2020
How to customize a Slack notification before sending it to the Slack incoming webhook.
Published on: 27 Aug 2020
How to speed up Pytest tests by reusing the same SparkSession in all of them
Published on: 20 Jul 2020
How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
Published on: 13 Jul 2020
A PySpark library for data quality checks and data validation.
Published on: 06 Jul 2020
How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
Published on: 29 Jun 2020
How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
Published on: 22 Jun 2020
Why don't data engineers write unit tests?
Published on: 15 Jun 2020
How does a Connector work? What is a Worker in Kafka Connect? How does the data get processed inside Kafka Connect, and why does it need internal Kafka topics?
Published on: 08 Jun 2020
A story about debugging an Airflow DAG that was not starting tasks
Published on: 01 Jun 2020
How the log compaction is implemented in Apache Kafka and how to configure Kafka log compaction properly
Published on: 22 May 2020
What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
Published on: 18 May 2020
How to use query execution plans to speed up Athena queries
Published on: 11 May 2020
How to write data stream processing code that is easy to maintain
Published on: 04 May 2020
How to give users access rights in AWS
Published on: 06 Apr 2020
What can you learn from the book "Career superpowers" by James Whittaker
Published on: 30 Mar 2020
How to use boto3 to send custom metrics to AWS CloudWatch from Python
Published on: 23 Mar 2020
How to speed up development by unit testing PySpark DAGs
Published on: 24 Feb 2020
Why one Spark executor is running much longer than others and what you can do about it
Published on: 17 Feb 2020
The explanation of the original MapReduce paper and a description of similarities between MapReduce and Apache Spark
Published on: 10 Feb 2020
Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
Published on: 03 Feb 2020
There are many kinds of sliding windows. Which one should you use?
Published on: 27 Jan 2020
This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in Berlin, Germany).
Published on: 23 Jan 2020
Volume, velocity, variety, and veracity
Published on: 20 Jan 2020
How to achieve high cohesion and a few common problems.
Published on: 12 Jan 2020
How to deploy an AWS Lambda with dependencies
Published on: 08 Jan 2020
I quit my dream job because of a book
Published on: 06 Jan 2020
We can easily distinguish between them by focusing on three qualities: data structure (schema), data quality, and ownership.
Published on: 18 Dec 2019
Apache Spark is wasting a lot of RAM!
Published on: 16 Dec 2019
How does Airflow backfill work?
Published on: 11 Dec 2019
How to use the AWS Secrets Manager
Published on: 09 Dec 2019
Is there a difference between Dataset and DataFrame? Why do we even have both?
Published on: 04 Dec 2019
Review of The Unicorn Project by Gene Kim
Published on: 02 Dec 2019
What would you do if you were writing an application which had to process one billion events per day?
Published on: 18 Nov 2019
AI systems in healthcare
Published on: 11 Nov 2019
Published on: 05 Nov 2019
There is a whole spectrum of exploration strategies between random and greedy policies.
Published on: 04 Nov 2019
Lessons learnt from Gatis Seja's presentation about data engineering principles
Published on: 15 Oct 2019
Hide outliers when displaying boxplot in Seaborn
Published on: 14 Oct 2019
How to avoid memory leaks in Jupyter Notebook
Published on: 08 Oct 2019
Interview with Gautam Bakshi - the CEO of 15 Rock
Published on: 07 Oct 2019
How to connect a Dask cluster (in Docker) to Amazon S3
Published on: 01 Oct 2019
What if your phone could tell you what you should wear?
Published on: 30 Sep 2019
How to save the model in a file, upload it to S3, and serve it using the Docker image of Tensorflow Serving
Published on: 24 Sep 2019
How to use the stack and unstack functions in Pandas
Published on: 23 Sep 2019
How to prepare data for an LSTM model
Published on: 15 Sep 2019
How to write Scrapy statistics to InfluxDB and set up Grafana alerts
Published on: 10 Sep 2019
Training a machine learning model is like learning before an exam.
Published on: 09 Sep 2019
Define a schema, write to a file, partition the data
Published on: 03 Sep 2019
How to use the reshape function in Numpy
Published on: 02 Sep 2019
Underpowered tests, true negatives, and ignored test results
Published on: 28 Aug 2019
How to plot the decision rules of XGBoost
Published on: 26 Aug 2019
Smoothing Bitcoin price time-series
Published on: 21 Aug 2019
Using GridSearchCV from Scikit-Learn to tune XGBoost classifier
Published on: 19 Aug 2019
How to generate lag features from time series
Published on: 16 Aug 2019
From Pandas dataframe to RNN input
Published on: 14 Aug 2019
Training ResNet network for multiclass image classification using keras-tuner
Published on: 12 Aug 2019
Tuning TensorFlow with Hyperband
Published on: 09 Aug 2019
Tuning Keras hyperparameters with keras-tuner
Published on: 07 Aug 2019
What is the input_shape in Keras/TensorFlow?
Published on: 05 Aug 2019
TensorFlow 2 - example
Published on: 02 Aug 2019
The reinforcement learning loop with Tensorflow Agents
Published on: 31 Jul 2019
How to define a new Tensorflow Agents metric and add it to the driver
Published on: 29 Jul 2019
Random and scripted behavior policies
Published on: 26 Jul 2019
Implementing a Tensorflow Agent environment to play a board game
Published on: 24 Jul 2019
The terminology used in the paper "Human-level control through deep reinforcement learning"
Published on: 22 Jul 2019
The fundamental equation of reinforcement learning
Published on: 19 Jul 2019
How to trigger Airflow DAG when another DAG is completed
Published on: 17 Jul 2019
How to configure Airflow in a Docker container
Published on: 15 Jul 2019
How to sample production data to get a representative testing dataset?
Published on: 12 Jul 2019
Levenshtein distance and Kendall tau distance
Published on: 10 Jul 2019
How to measure the similarity of two datasets?
Published on: 08 Jul 2019
Manhattan distance, Euclidean distance, and Chebyshev distance are types of Minkowski distances
Published on: 05 Jul 2019
What are the biggest challenges in data science?
Published on: 03 Jul 2019
How to test a product idea?
Published on: 30 Jun 2019
Using Helisa and Jenetics in Scala
Published on: 21 Jun 2019
Using Helisa and Jenetics to help Fallout players
Published on: 19 Jun 2019
Team vs. a bunch of individuals reporting work time in the same spreadsheet
Published on: 17 Jun 2019
Domain model in Python
Published on: 14 Jun 2019
How to document a project?
Published on: 12 Jun 2019
How to calculate page popularity using the Wilson Score
Published on: 10 Jun 2019
How to explain a machine learning model?
Published on: 07 Jun 2019
Understanding the GLM from the statsmodels package
Published on: 05 Jun 2019
How many principal components do we need when using Principal Component Analysis?
Published on: 03 Jun 2019
The difference between KFold and StratifiedKFold in Scikit-learn
Published on: 31 May 2019
How to rank a grouped data frame in Pandas
Published on: 29 May 2019
How to use rolling window with datetime (and other types) in Pandas
Published on: 27 May 2019
Lessons learnt from "Practical Data Cleaning" by Lee Baker
Published on: 24 May 2019
Filter size, padding, and stride explained
Published on: 22 May 2019
How to use the window function to calculate a cumulative sum
Published on: 20 May 2019
How to use Parquet4s to write Parquet files in Scala
Published on: 10 May 2019
This article is mostly a “note to self” because I don’t want to google that anymore ;)
Published on: 08 May 2019
How to set the max columns in Pandas
Published on: 06 May 2019
Data Science book recommendation
Published on: 03 May 2019
My most interesting Data Analysis failures
Published on: 01 May 2019
Softmax function explained
Published on: 29 Apr 2019
Debugging a machine learning model
Published on: 26 Apr 2019
How to use the exponentially weighted window functions in Pandas
Published on: 24 Apr 2019
How to speed up finding the right hyperparameters of a machine learning model
Published on: 22 Apr 2019
Andrew Ng's recommendation about mini-batch size
Published on: 19 Apr 2019
The lessons learned from Andrew Ng’s online course
Published on: 17 Apr 2019
Fit more data in the same amount of memory
Published on: 15 Apr 2019
Explanation of the Airflow interval and start_date parameters
Published on: 12 Apr 2019
Avoiding over-engineering in machine learning
Published on: 10 Apr 2019
My first attempt to use Ludwig
Published on: 08 Apr 2019
How to use FeatureHasher in Scikit-learn
Published on: 05 Apr 2019
One-hot encoding, dummy coding, and effect coding in Scikit-learn and Pandas
Published on: 03 Apr 2019
What to do when your model works perfectly during testing but fails in production
Published on: 01 Apr 2019
My first attempt to use scikit-automl and how I got it working
Published on: 29 Mar 2019
How does it work and why the most popular solution is wrong
Published on: 27 Mar 2019
How to encode text/categorical variables and scale numerical values using only one Scikit-learn class
Published on: 25 Mar 2019
error: command ‘swig’ failed with exit status 1 while installing scikit-automl
Published on: 22 Mar 2019
How to estimate the CLV from a list of customer transactions using the lifetimes library in Python
Published on: 20 Mar 2019
How to use a Python lifetimes library to build a Pareto/NBD model.
Published on: 18 Mar 2019
There are three kinds of metrics that won’t destroy your business.
Published on: 15 Mar 2019
Tweaking the parameters of Statsmodels
Published on: 13 Mar 2019
What can we expect from a correctly performed A/B test?
Published on: 11 Mar 2019
I have mixed feelings about this book.
Published on: 08 Mar 2019
Pedro Domingos’s observations about feature engineering
Published on: 06 Mar 2019
Should we suggest an action when we visualize data?
Published on: 04 Mar 2019
In my opinion, AUC is a metric that is both easy to use and easy to misuse. Do you want to know why? Keep reading ;)
Published on: 01 Mar 2019
LaTeX support in Jupyter Notebook
Published on: 27 Feb 2019
Using association rule learning to make recommendations
Published on: 25 Feb 2019
Pyplot parameter that configures the chart size
Published on: 22 Feb 2019
How to read an Andrews curves chart
Published on: 20 Feb 2019
How to delay scraper requests to make it look like a human visiting the website
Published on: 18 Feb 2019
How to interpret an autocorrelation plot?
Published on: 15 Feb 2019
I was asked for some podcast recommendations, so here is my very short list ;)
Published on: 13 Feb 2019
How to avoid bad science
Published on: 11 Feb 2019
Predicted labels are in columns, right? Or maybe in rows? Do you remember? ;)
Published on: 08 Feb 2019
How to set the learning rate after you unfreeze the network layers in fast.ai
Published on: 06 Feb 2019
The mathematics behind F1 score.
Published on: 04 Feb 2019
Display a progress bar with no additional dependencies, just Python + Jupyter Notebook
Published on: 01 Feb 2019
How to run the fit function multiple times and improve the model?
Published on: 30 Jan 2019
How to use Docker and Flask to put a Scikit model in production as a microservice.
Published on: 28 Jan 2019
How to validate query parameters using Fastify
Published on: 25 Jan 2019
Solve the opposite problem to avoid stupidity.
Published on: 23 Jan 2019
Saving a Scikit-learn model using the joblib library in Python
Published on: 21 Jan 2019
Why is it difficult to work in the office?
Published on: 18 Jan 2019
Did you ever want to have a mentor?
Published on: 16 Jan 2019
How to focus on the high outcome tasks and avoid being distracted
Published on: 14 Jan 2019
The difference explained
Published on: 11 Jan 2019
How to change the commit history
Published on: 09 Jan 2019
A polarizing book
Published on: 07 Jan 2019
What is the best investment of your time?
Published on: 21 Dec 2018
An explanation of removing Docker images and containers.
Published on: 19 Dec 2018
How to tweak uncertainty intervals in Prophet.
Published on: 17 Dec 2018
How to easily serve static content on localhost or in the local network
Published on: 14 Dec 2018
The easiest way to make your tests more readable and easier to maintain
Published on: 12 Dec 2018
A list that surprised even me…
Published on: 10 Dec 2018
Why testOnly does not work?
Published on: 07 Dec 2018
A web spider that does not follow links is not very useful, let’s fix that.
Published on: 05 Dec 2018
Scrapy Spiders and processing pipelines 101
Published on: 03 Dec 2018
How to unpack a Docker image
Published on: 30 Nov 2018
Why are tech conferences boring?
Published on: 28 Nov 2018
How to make sure that GC does not stop the JVM during a test?
Published on: 21 Nov 2018
What happens when the team lacks software craft skills?
Published on: 19 Nov 2018
You can read more good books if you skip the lousy ones.
Published on: 16 Nov 2018
How to read the Prophet forecast plot
Published on: 14 Nov 2018
How to generate new ideas instead of thinking about the same thing over and over again
Published on: 12 Nov 2018
One of the most intriguing ideas described in the book "How Google works" is writing "snippets."
Published on: 09 Nov 2018
It may look like a unicorn, but it is real
Published on: 07 Nov 2018
What happens when one invention makes the whole industry obsolete?
Published on: 05 Nov 2018
How to explain the errors of a linear regression model
Published on: 02 Nov 2018
A natural way of splitting work into small, but useful parts
Published on: 29 Oct 2018
On a quest to find the right metaphor
Published on: 24 Oct 2018
TDD for data scientists working with Jupyter Notebook
Published on: 22 Oct 2018
A collection of machine learning cheat sheets I find useful and google repeatedly.
Published on: 19 Oct 2018
How does a name influence our attitude?
Published on: 17 Oct 2018
How to use Pandas to parse dates or calculate time in a different timezone.
Published on: 15 Oct 2018
How to predict the missing values using Scikit-Learn
Published on: 12 Oct 2018
How to motivate software engineers?
Published on: 10 Oct 2018
We talk about toys. We love new buzzwords. We adore things that sound cool. Yes, we do.
Published on: 08 Oct 2018
What does a "known bug" or an update say to our users?
Published on: 05 Oct 2018
How to safely run code downloaded from the Internet
Published on: 03 Oct 2018
The follow-up to “Extreme ownership”
Published on: 01 Oct 2018
How to plot and interpret the box and whiskers plot
Published on: 28 Sep 2018
This book deserves a 3-star review on Amazon for many reasons.
Published on: 26 Sep 2018
Built-in matplotlib functions are not enough in this case
Published on: 24 Sep 2018
The easiest way to access someone else’s code in your own notebook
Published on: 21 Sep 2018
Use the next or previous value to fill the missing values in Pandas
Published on: 19 Sep 2018
Two workarounds to get an equivalent of forward feature selection in Scikit-Learn
Published on: 17 Sep 2018
A short tutorial about generating a heat map of the values stored in a Pandas dataframe
Published on: 14 Sep 2018
Programmers are afraid of nouns. We often replace them with poorly written descriptions of things.
Published on: 12 Sep 2018
Z-score and Density-Based Spatial Clustering of Applications with Noise
Published on: 10 Sep 2018
What to do if you keep forgetting to set the random_state?
Published on: 31 Aug 2018
My opinion about my presentation at a meetup in Erfurt, Germany.
Published on: 29 Aug 2018
Step by step instructions to "explode" a list into DataFrame rows.
Published on: 27 Aug 2018
How to create a plot that supports zooming
Published on: 24 Aug 2018
Read this book if you believe we can use A.I. and IoT to build a bright future.
Published on: 22 Aug 2018
How to visually check whether your sample is normally distributed?
Published on: 20 Aug 2018
HyperLogLog - probabilistic counting algorithm
Published on: 19 Aug 2018
Can I have the coolest Visual Studio feature in IntelliJ?
Published on: 18 Aug 2018
How to make business decisions using the Monte Carlo simulation?
Published on: 17 Aug 2018
Create a nice visualization of the most popular words in your data frame
Published on: 07 Aug 2018
A short example of defining a structural type which matches a generic class
Published on: 05 Aug 2018
How to use an undirected graph to visualize common elements of two Pandas data frames
Published on: 03 Aug 2018
How to import a CSV file from Google Drive into Google Colab
Published on: 14 Jul 2018
How to understand the difference between precision and recall?
Published on: 15 Jun 2018
Cake pattern was a terrible idea.
Published on: 12 Jun 2018
What can we learn from a horrible mistake made by a programmer who wanted to make the code more generic?
Published on: 06 Jun 2018
How to use some parts of Domain Driven Design to create maintainable code in Scala?
Published on: 11 May 2018
Do we behave like a child in a toy store?
Published on: 07 Apr 2018
Have you been trying to learn something for a few months? What to do when you keep learning but still don’t understand anything?
Published on: 15 Mar 2018
How can we facilitate knowledge sharing? Will easily accessible documentation foster cooperation?
Published on: 03 Mar 2018
A review of Jocko Willink’s book: “Discipline Equals Freedom.” Should you read it even if you don’t want to run a marathon?
Published on: 24 Feb 2018
You feel you should not deploy your code on Fridays but nothing stops you. Can you prevent accidental deployments?
Published on: 18 Feb 2018
Software maintenance is painful because of hype driven development.
Published on: 22 Jan 2018
Do you think that every web page should support all existing browsers? How about all versions of those browsers?
Published on: 17 Jan 2018
The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
Published on: 14 Jan 2018
What can a software engineer learn from the book “Extreme Ownership”? How can it influence your daily work?
Published on: 11 Jan 2018
What happens when we release another beta version? Are users happy or angry? What if the reality is different than we think?
Published on: 26 Jul 2017
It is easy to announce that TDD slows you down, but have you ever wondered why it happens? Is there anything you can do better?
Published on: 08 May 2017
Akka actors do not magically disappear when you no longer need them.
Published on: 03 May 2017
Scalar Conference 2017 — everything I liked
Published on: 12 Apr 2017
We do not like being asked to write an algorithm on a whiteboard during job interviews, but is there a better way?
Published on: 19 Mar 2017
What happens when someone asks you about your code and you cannot answer because you have no idea how it works? That happened to me… again.
Published on: 12 Mar 2017
One thing that can significantly improve LambdaDays.
Published on: 12 Feb 2017
Can we write our web application in a way which saves our company money? Can we make our software cheaper?
Published on: 29 Jan 2017
How can a group created because of a tweet exist for over a year? Have we learned anything in a year? What are we going to do now?
Published on: 23 Jan 2017
Today software engineers disappointed another person…
Published on: 15 Jan 2017