Engineering Blog

AI LLM Guardrails

How to Detect and Block AI Hallucinations in Chatbots

A step-by-step guide to creating custom input and output guardrails that keep your language model from inventing facts.
Published on: 22 Jul 2025

AI LLM

Stop LLM Hallucinations in Fintech Apps: A CTO’s Guide to Risk-Proof AI Evaluation

A step-by-step guide to aligning AI with human expectations
Published on: 14 Jul 2025

Startup

Disposable Company Syndrome

What went wrong with startup culture? A coder reveals the truth behind “move fast,” buggy releases, and founder exits over impact.
Published on: 07 Jul 2025

AI LLM

LLM Sampling Demystified: How to Stop Hallucinations in Your Stack

Understand top-k, top-p, min-p, and temperature settings to make LLMs more reliable
Published on: 10 Jun 2025

AI Fintech Hallucinations

The AI Hallucination Crisis in Fintech

AI hallucinations are fintech's biggest risk. Discover why most deployments stall and what CTOs can do to secure their AI stack.
Published on: 26 May 2025

AI Cursor GitHub Copilot

How to Use AI Coding Tools Effectively: A Developer's Guide to GitHub Copilot, Cursor, and Beyond

Master AI coding tools with this comprehensive guide. Learn how to use GitHub Copilot, Cursor, and other AI coding tools effectively, write better prompts, and implement AI coding best practices. Transform your development workflow with practical tips and proven strategies.
Published on: 18 May 2025

AI Hallucinations

A Framework for Measuring and Fixing AI Hallucinations

AI hallucinations are killing trust in your product. This guide helps you measure, debug, and prevent them — starting today.
Published on: 12 May 2025

AI Error Analysis

Fixing AI Hallucinations: What a Crashing Arduino Car Taught Me About Error Analysis

AI hallucinations don’t come from lack of data, but from the wrong data. This article reveals how systematic error analysis helps fix hallucinations and improve data quality, with lessons from DIY disasters and enterprise AI alike.
Published on: 05 May 2025

AI LLM Evals

AI Evaluation Best Practices: Why Data Analysis Matters For Systematic AI Improvements

Discover how data analysis helps engineering teams improve AI applications more effectively than focusing on model selection and prompt engineering. Learn proven best practices for systematic AI evaluation.
Published on: 31 Mar 2025

AI MLOps BAML

Stop Explaining Failed AI Projects: BAML Turns Prompt Engineering into Real Engineering

Engineering leaders are silently wasting $50K on unreliable AI implementations. Learn how BAML brings fault tolerance, structured outputs, and automated testing to production AI systems.
Published on: 24 Mar 2025

AI Evaluation MLOps

How to Make AI Evaluation Affordable: Research-Backed Methods to Cut LLM Evaluation Costs

Why are your AI evaluation costs spiraling out of control, and what are the proven methods to reduce them without sacrificing quality?
Published on: 10 Mar 2025

AI RAG Vector Databases

The Hidden Reason Your RAG System Is Failing - The Problems Caused by Approximate Nearest Neighbor Search in Vector Databases

Discover why your RAG system might be failing due to Approximate Nearest Neighbor search limitations in vector databases. Learn how compute budgets affect search accuracy, why metadata filters complicate retrieval, and implement practical solutions to dramatically improve your RAG performance.
Published on: 03 Mar 2025

AI PydanticAI

How to use Pydantic Graph (an alternative to LangGraph)

A guide to using Pydantic Graph - a type-safe alternative to LangGraph for building AI workflows.
Published on: 24 Feb 2025

AI

Why is it so hard to correctly estimate AI projects?

Why can't you estimate an AI project correctly, and can you do anything about it?
Published on: 17 Feb 2025

AI MLOps

Building Reliable AI: A Testing-First Approach

Learn how to properly test AI systems using familiar software testing concepts. Discover key metrics, alignment checks, and robustness testing strategies for reliable AI deployment.
Published on: 10 Feb 2025

AI MLOps

From API Wrappers to Reliable AI: Essential MLOps Practices for LLM Applications

API wrapper or production-ready AI? Learn how proper LLMOps separates prototypes from reliable applications
Published on: 27 Jan 2025

AI Agents RAG

Troubleshooting AI Agents: Advanced Data-Driven Techniques of Improving AI Agent Performance

Expert strategies for improving AI agent performance through better data retrieval, query generation, automated decision-making process, and response generation. The article covers data collection, metrics, and techniques to improve the agent's performance.
Published on: 20 Jan 2025

AI RAG AI Agents PydanticAI

Comprehensive Guide to AI Workflow Design Patterns with PydanticAI code examples

Learn how to implement AI workflows and autonomous agents with PydanticAI. This guide shows an example implementation of patterns described in the Anthropic article 'Building effective agents' such as prompt chaining, routing, parallelization, and orchestrator-workers.
Published on: 13 Jan 2025

AI RAG

How Much Data Do You Need to Improve RAG Performance?

A data-driven approach to improving RAG performance. Learn how to gather data and how much data you need for RAG, fine-tuning an LLM, and training a specialized LLM from scratch.
Published on: 06 Jan 2025

AI RAG PydanticAI Evals

Improving RAG Retrieval Accuracy: A Practical Implementation Guide with PydanticAI and Ragas

A step-by-step tutorial on implementing HyDE technique to improve RAG retrieval accuracy, with code examples and performance evaluation using Ragas.
Published on: 30 Dec 2024

AI AI Agents Evals

Measuring SQL Generation Performance of AI Agents with Ragas

A comprehensive guide to evaluating SQL-generating AI agents using Ragas metrics, focusing on query equivalence and output format validation.
Published on: 23 Dec 2024

AI RAG PydanticAI

Using PydanticAI to obtain Structured Output from RAG in Python

Discover how PydanticAI transforms RAG outputs into structured data, ensuring type safety and validation while simplifying AI response handling in your applications.
Published on: 16 Dec 2024

AI Chatbot

Building Safer AI Systems with Content Moderation - an example with LLama-Guard

A comprehensive guide to implementing AI content moderation with LLama-Guard to build safer chatbots. Learn how to prevent inappropriate responses, protect brand reputation, and handle user interactions responsibly with practical Python examples.
Published on: 09 Dec 2024

AI Fine-tuning QLoRA

Fine-tuning Mistral-7B LLM using QLoRA in Axolotl

A comprehensive tutorial on fine-tuning Mistral-7B using QLoRA and Axolotl, covering data preparation, model configuration, and text classification optimization.
Published on: 11 Nov 2024

AI Langfuse

Prompt Management and Request Tracking for LLM Applications Using Langfuse

Learn how to use Langfuse to manage prompts and track LLM requests in your AI applications. Discover how to version prompts, monitor usage, and improve your LLM applications with detailed analytics.
Published on: 04 Nov 2024

AI OpenAI

How to use the OpenAI Swarm Agentic Framework

Learn to use OpenAI's Swarm Agentic Framework to build intelligent AI workflows. This guide covers the basics of agents, defining functions, and defining interactions between agents.
Published on: 28 Oct 2024

Jupyter Notebook Teaching

How to set up student Jupyter Notebook servers for a workshop using CoCalc

Learn how to set up Jupyter Notebook servers with GPU access for students using CoCalc. A guide for educators and workshop organizers looking to provide hardware for their students without (too much) hassle.
Published on: 21 Oct 2024

AI CrewAI

AI Receptionist for Apartment Rentals built with CrewAI

Learn how to build an AI-powered receptionist for apartment rentals using CrewAI. This tutorial covers implementing key features like access code management, WiFi password updates, and automated guest assistance, enhancing the rental experience while reducing manual workload.
Published on: 10 Sep 2024

AI

How to fine-tune a super-fast small language model SmolLM from HuggingFace

You don't always need a large language model to get good results. A fine-tuned small language model is often just as good, much faster, and cheaper.
Published on: 30 Jul 2024

RAG AI

Enhancing RAG System Accuracy - Advanced RAG techniques explained

Discover advanced techniques to enhance the accuracy of your Retrieval-Augmented Generation (RAG) systems. Learn about semantic search, query expansion, HyDE, and keyword search to improve data retrieval and answer quality.
Published on: 16 Jul 2024

AI RAG

How to prevent LLM hallucinations from reaching the users in RAG systems

Hallucinations erode trust, so we should prevent them from reaching the users. In this article, I show how to use the FaithfulnessEvaluator from the Llama-Index library to determine whether the documents retrieved by vector search contain the answer to the user's question.
Published on: 30 Jun 2024

AI RAG

How can we measure improvement in information retrieval quality in RAG systems?

Every RAG system starts with retrieval. How do you know if your retrieval code is good enough? You measure it. The article shows how to use the ir_measures library to calculate Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to quantify the performance of your retrieval code.
Published on: 20 Jun 2024

AI

Building a Data Retrieval Workflow for AI with Structured Output Libraries like Marvin and Instructor

How to use Marvin and Instructor to define a structured output for LLMs and build a data retrieval workflow that answers users' questions about data by checking if the required data is available in the database, planning what data has to be retrieved, generating the query, executing it, and generating a human-readable answer.
Published on: 10 Jun 2024

AI LLama

Building an agentic AI workflow with Llama 3 open-source LLM using LangGraph

How to build an agentic AI workflow using the Llama 3 open-source LLM and LangGraph. We will create an autonomous multi-step process that handles a data retrieval task and answers users' questions using multiple specialized AI agents.
Published on: 20 May 2024

OpenAI AI

Building a chatbot with a custom GPT assistant using OpenAI Assistant API and Streamlit

Build a chatbot for chatting with a YouTube video or a PDF in one hour (or less) with OpenAI Assistant API and Streamlit
Published on: 10 May 2024

AI Prompt Engineering

The Ultimate 2025 Guide to Prompt Engineering

Discover the difference between proven prompt engineering techniques and tricks
Published on: 30 Apr 2024

AI ChatGPT

Monitoring employees leaking secret data to AI in ChatGPT with GPTBoost and HuggingChat

How to monitor whether your employees leak company secrets, PII, or passwords to AI
Published on: 10 Jan 2024

AI ChatGPT

Language learning with AI: building an AI-powered Anki plugin

AI in everyday usage: How to build an Anki plugin for generating example sentences in a foreign language using AI
Published on: 20 Dec 2023

AI in business Langchain

Debugging, controlling OpenAI usage cost, and monitoring AI applications using Langsmith and GPTBoost

Discover how to effectively monitor AI applications, track costs, and optimize API usage with Langsmith and GPTBoost. Get insights into managing OpenAI API interactions for improved efficiency and cost control.
Published on: 30 Nov 2023

AI in Business Case-study

GPTBoost Case-Study: Blending AI with Human Creativity in Content Creation

Explore the innovative approach of using AI in content creation through the GPTBoost case-study. Learn how Veselina Staneva masterfully blends AI efficiency with human creativity in her writing process.
Published on: 22 Nov 2023

AI AI in business

What mistakes do product managers make while using AI, and what to do instead

How product managers can use AI to make their work easier without suffering the consequences of AI misuse
Published on: 20 Nov 2023

OpenAI ChatGPT

Integrate OpenAI custom GPT with your business automation workflow using REST API and webhooks

How to integrate OpenAI custom GPT with make.com scenario using webhooks - a tutorial on using GPTs in business automation workflows.
Published on: 10 Nov 2023

AI

Chat with a YouTube Video: How to build an AI chatbot using YouTube video transcripts

How to build an AI chatbot that retrieves answers to users' questions from transcripts of YouTube videos
Published on: 30 Oct 2023

Machine Learning AI

AI-Powered Topic Modeling: Using Word Embeddings and Clustering for Document Analysis

Explore the seamless integration of artificial intelligence with classical machine learning techniques for effective topic modeling and document clustering. Learn how word embeddings enable higher accuracy, semantic context preservation, and robust results.
Published on: 30 Sep 2023

Langchain.js AI

Build a question-answering service with AI and vector databases in JavaScript

How to use Langchain.js to build a question-answering service that uses vector databases to store the documents and AI to generate the answers
Published on: 20 Sep 2023

Langchain AI

What's the difference between Langchain Agents and OpenAI Functions?

What should you use when your AI needs access to external systems? Is it better to use Langchain Agents or OpenAI Functions?
Published on: 10 Sep 2023

AI Langchain

Save time and money by caching OpenAI (and other LLM) API calls with Langchain

How to use Langchain model response and document embeddings caching to save time and money when using Large Language Models
Published on: 30 Aug 2023

AI AI in business

Generate questions and answers from any document using AI

How to use OpenAI GPT models, Langchain, and Doctran to generate questions and answers from long documents
Published on: 25 Aug 2023

AI

What to do when a document doesn't fit in AI prompt window

Using Langchain MapReduceChain to handle documents longer than the prompt limit
Published on: 20 Aug 2023

AI AI in business

Finding information in long documents with AI using vector databases and MapReduceChain from Langchain

How to find information in long documents with AI, vector databases, and Langchain using MapReduceChain and ParentDocumentRetriever
Published on: 15 Aug 2023

AI

Building a classification service with Llama2 in Python

How to use the Llama2 AI model in Python to build a text classification service
Published on: 10 Aug 2023

AI Langchain

Monitoring AI applications with Langsmith

Monitor interactions with an LLM in Langchain and gather feedback about the model's performance using Langsmith
Published on: 30 Jul 2023

Business Process Automation AI

Using AI and automation to keep up with industry news

Discover how AI and automation can streamline the process of gathering relevant information from newsletters, summarizing it, and delivering a weekly summary
Published on: 30 Jun 2023

AI ChatGPT

Use OpenAI API Function Calling to Build a Chatbot for Slack with Access to a REST API (updated for OpenAI SDK version 1.1.1+)

Build an AI-powered chatbot that can interact with REST API using the Function Calling feature of OpenAI Completion API. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 16 Jun 2023

AI LMQL

How to use AI in Python with LMQL?

Explore LMQL, a powerful SQL-like language designed for machine learning tasks.
Published on: 10 Jun 2023

LlamaIndex AI

Which index should you use while building an application with LlamaIndex?

Which Llama index should you use? When is it better to use GPTVectorStoreIndex, GPTListIndex, GPTKeywordTableIndex, or GPTKnowledgeGraphIndex?
Published on: 30 May 2023

AI OpenAI

How to fine-tune an OpenAI model using custom data

How to prepare the training data for an OpenAI model and how to fine-tune OpenAI's GPT model in Python
Published on: 08 May 2023

AI Software Architecture

Deploy LLMs with Confidence: A Comprehensive Guide to Software Architecture for Production-Ready AI

Learn the essentials of deploying large language models in production with our comprehensive guide on software architecture for AI
Published on: 30 Apr 2023

Guardrails AI in business

Improve AI Output Using the Guardrails Library with Custom Validators

Use AI to validate another AI's output. Learn how to create custom validators and corrections using the Guardrails library.
Published on: 20 Apr 2023

ChatGPT AI in business

How to Build a ChatGPT Plugin in Python?

A step-by-step guide to building a ChatGPT plugin in Python to retrieve data from the knowledge base stored in a vector database
Published on: 15 Apr 2023

GPT Software Craft

Don't use AI to generate tests for your code or how to do test-driven development with AI

How to use AI to generate test cases for your code
Published on: 10 Apr 2023

GPT AI in business

Alternatives to OpenAI GPT model: using an open-source Cerebras model with LangChain

Discover how to leverage the powerful open-source Cerebras model with LangChain in this comprehensive guide, featuring step-by-step instructions for loading the model with HuggingFace Transformers, creating prompt templates, and integrating it with LangChain Agents.
Published on: 30 Mar 2023

AI in business GPT

AI-Powered Pair Programming: Enhance Your Web Development Skills with GPT-4 Assistance

Improve your coding skills and elevate your writing with GPT-4 as your AI-driven pair programming partner, guiding you through the process of building a web application that functions as a user-friendly reverse dictionary.
Published on: 20 Mar 2023

dust.tt AI in business

Build an AI-powered Newsletter Generator with dust.tt and OpenAI

How to create an AI-powered newsletter generator using dust.tt and the OpenAI API. You'll learn how to set up a website with the AI application, use the few-shot in-context learning technique to train the AI model, and deploy the API to generate newsletters.
Published on: 10 Mar 2023

ChatGPT

Get Started with ChatGPT API: A Step-by-Step Guide for Python Programmers (updated for OpenAI SDK version 1.1.1+)

A step-by-step tutorial on ChatGPT API (versions 1.1.1+) in Python. You'll also learn about prompt engineering, interactivity, optimizing API calls, and using parameters to get better results. Updated to cover the changes introduced after OpenAI DevDay 2023!
Published on: 01 Mar 2023

GPT AI in business

Maximize Customer Support Efficiency: Build an AI Chatbot to Answer Common Client Questions

How to build an AI-powered Facebook chatbot using GPT-3 from OpenAI and vector databases to answer client questions using your documentation - a tutorial with step-by-step instructions. You will learn how to set up a database, create text embeddings, use MLOps and prompt engineering to retrieve answers, and build a web application to connect with the Facebook API.
Published on: 28 Feb 2023

Embeddings AI in business

Detection of Text Duplicates and Text Search with Word Embeddings and Vector Databases

Discover how word embeddings and vector databases can revolutionize text search and duplicate detection. Learn how to implement it with OpenAI GPT-3 and Milvus vector database.
Published on: 20 Feb 2023

GPT

Automating Git Commit Messages with GPT-3 for Faster Software Development Workflows

Learn how to use GPT-3 to automate Git commit message generation and speed up your development workflows.
Published on: 15 Feb 2023

GPT AI in business

Connect GPT-3 to the Internet: Create a Slack Bot and Perform Web Search, Calculations, and More

Unleash the potential of GPT-3 and make it access the Internet. Learn how to use Langchain and build a Slack bot that can do a web search, extract text from websites, and perform calculations.
Published on: 10 Feb 2023

GPT Prompt Engineering

Unlocking the Power of In-Context Learning With Zero-Shot, One-Shot, and Few-Shot Prompt Engineering for GPT

How does the in-context learning prompt engineering technique improve GPT-3 results, and why does it work? What's the difference between zero-shot, one-shot, and few-shot prompting?
Published on: 30 Jan 2023

GPT AI in business

Create an AI Data Analyst bot for Slack that can look up data in your database

Build an AI-powered Slack bot that reads data from your production database and answers simple analytics questions
Published on: 23 Jan 2023

Marketing AI in business

Generate a landing page for a newsletter in 17 minutes using ChatGPT or GPT-3

Are you looking for a way to generate high-quality content quickly and effectively? This article outlines how to use AI to create a landing page. You will learn more about the AIDA marketing model and the use of OpenAI's ChatGPT and GPT-3 to generate the text. Also, I will show you how to incorporate audience research into the prompt to improve the result. Get all the information you need to create high-quality content quickly and effectively with AI.
Published on: 20 Jan 2023

MLOps AI in business

Why do you need a text summarization service, and how to deploy a text summarization model in 15 minutes using HuggingFace and Qwak?

Why do you need text summarization services in your business? How can you deploy a model downloaded from HuggingFace using the Qwak ML platform in 15 minutes?
Published on: 10 Jan 2023

Software Architecture

What does modern software architecture look like in 2022?

Do architecture diagrams still matter? How do we deal with constant changes? How to design software architecture?
Published on: 20 Dec 2022

Software Craft

Using Abstraction Layers to Tackle Common Problems with Legacy Code

Are you struggling to manage and update your legacy codebase? In this article, I'll show you how to leverage the power of abstraction layers to overcome common challenges with legacy code.
Published on: 10 Dec 2022

Management

What kills IT projects?

What kills IT projects? What you should avoid, at all costs, to ensure the success of your startup or software project
Published on: 30 Nov 2022

Career

How to write a growth plan as a programmer?

How to write a growth plan that helps you get promoted and doesn't get in the way when you want to focus on your hobbies
Published on: 20 Nov 2022

TDD Python

Test-Driven Development in Python with Pytest

How to set up and use Pytest to test Python code
Published on: 10 Nov 2022

Marketing Copywriting

Marketing for SaaS startups: how to describe your product?

How to use the "benefits over features" technique to advertise your SaaS product and get more clients than your competition
Published on: 30 Oct 2022

Marketing Copywriting

How to pitch your idea

What a co-founder of DeepMind teaches us about pitching our ideas to investors
Published on: 20 Oct 2022

MLOps

MLOps at small companies

How to do MLOps while working on a small data engineering team
Published on: 10 Oct 2022

Software Craft TDD

Why should you practice TDD?

What are the benefits of TDD for programmers and companies that hire them?
Published on: 30 Sep 2022

Software Craft

How to debug code

How to debug code and solve problems as fast as possible
Published on: 20 Sep 2022

Data Engineering Software Craft

CUPID properties in data engineering

Does it make sense to use SOLID principles in data engineering? What about CUPID properties in data pipelines?
Published on: 10 Sep 2022

Software Engineering Data Engineering

How to add tests to existing code in data transformation pipelines

How data engineers can write tests for legacy code in their ETL pipelines without breaking the existing implementation
Published on: 30 Aug 2022

Data Engineering

Software engineering practices in data engineering and data science

How to produce high-quality software in data teams by applying software engineering practices to data science and data engineering
Published on: 20 Aug 2022

Pandas

How to sort a Pandas DataFrame by month name

How to use an ordered categorical variable to sort a Pandas DataFrame by months while displaying their names
Published on: 15 Aug 2022

Career

How to become a data engineer for free

What do you need to know to become a data engineer? Does a data engineer need a degree? How can you get your first data engineering job?
Published on: 10 Aug 2022

Data Engineering Stream Processing

A comprehensive guide to Kappa Architecture

What is Kappa Architecture? When should we use Kappa Architecture? What's the difference between Kappa Architecture and Lambda Architecture? And way, way more!
Published on: 30 Jul 2022

Software Craft

The secret of working with legacy code on a software team

How to work with code written by other people? What to do when you join a new team?
Published on: 20 Jul 2022

Python Functional programming

Functional programming in Python

Does functional programming in Python make sense? How to do functional programming in Python?
Published on: 10 Jul 2022

Software Craft

How to write technical documentation

How to document a software project?
Published on: 20 Jun 2022

Data Engineering

ETL vs ELT - what's the difference? Which one should you choose?

Should you use a data warehouse or build a data lake? When is a data warehouse a better choice? When is it better to build a data lake?
Published on: 10 Jun 2022

Python

Selecting rows in Pandas

How to use loc, iloc, slice, and row filtering in Pandas
Published on: 27 May 2022

Python

Python decorators explained

How can we define a Python decorator, and when should we use Python decorators?
Published on: 25 May 2022

Data Engineering Apache Spark

What is shuffling in Apache Spark, and when does it happen?

When does an Apache Spark cluster perform the shuffle operation?
Published on: 20 May 2022

Software Craft

What is the root cause of problems in software engineering?

What is the primary, unrepairable cause of almost all bugs, data leaks, human problems, etc.?
Published on: 13 May 2022

Software Craft

How to become a better programmer

What's stopping us from getting better at coding
Published on: 06 May 2022

Dev Rel

How to teach programming workshops to adults

How to prepare an enjoyable programming workshop that teaches people the skills they need without overwhelming them with new knowledge.
Published on: 15 Apr 2022

Career

What does a bad interview look like in data engineering

What you should avoid when you interview programmers for a data engineer position
Published on: 08 Apr 2022

Software Craft

How to throw useful exceptions

How to make debugging easier by paying attention to the errors you report
Published on: 01 Apr 2022

Software Craft

Why are programmers slow, and what to do about it?

The one practice that makes every team faster (in the long run)
Published on: 25 Mar 2022

Software Engineering

How to advertise to software engineers, or how do we make terrible tech choices

Why do programmers make wrong decisions when they choose the tools they use?
Published on: 18 Mar 2022

Data Engineering

Data engineers are data librarians or how to upgrade your data lake to 2500 BCE technology.

What can data engineers learn from (ancient) librarians?
Published on: 11 Mar 2022

Career Data Science

I worked as a data scientist and that was the worst job I have ever had.

I believed in the Sexiest Job of the 21st Century hype. I was wrong.
Published on: 04 Mar 2022

MLOps

MLOps engineer, you will need those three books every day!

Don't reinvent the wheel as an MLOps engineer. The 3 books you must read in 2022
Published on: 25 Feb 2022

MLOps

Why should you use a feature store

Benefits of having a feature store and what happens when you don't have one
Published on: 18 Feb 2022

Data Engineering

Data pipeline documentation without wasting your time

How to document an ETL pipeline or ML inference pipeline without doing useless work
Published on: 11 Feb 2022

MLOps

How to run batch inference using Sagemaker Batch Transform Jobs

Running a batch machine learning job using Sagemaker and data stored in S3.
Published on: 04 Feb 2022

Software Craft

How to build maintainable software by abstracting the business rules in data engineering

Are we building the right abstractions in software?
Published on: 28 Jan 2022

TDD Data Engineering

Testing legacy data pipelines

Do you struggle with maintaining your legacy data pipelines? Check out our article on how to add tests and refactor your code while working with legacy data pipelines.
Published on: 21 Jan 2022

Career

Secrets of mentoring junior software engineers

How to quickly train junior engineers to make them as productive as the rest of the team
Published on: 14 Jan 2022

Data Engineering

What does your data pipeline need in production?

When you're debugging a failing production pipeline at 2 am, what do you need?
Published on: 07 Jan 2022

Career

How to pass a machine learning engineer interview

Trivial (and easily fixable) mistakes that will make you fail a job interview
Published on: 31 Dec 2021

Career

Why do data engineers quit?

Why do data engineers quit their jobs?
Published on: 24 Dec 2021

MLOps

What is the essential KPI of an MLOps team?

What KPI to measure in an MLOps team
Published on: 17 Dec 2021

MLOps

Deploying your first ML model in production

The minimal setup for ML deployment without the things you DON'T need yet
Published on: 10 Dec 2021

Data Engineering

Is it overengineered?

What's the difference between reasonable future-proof architecture and overengineering? Is there a difference?
Published on: 04 Dec 2021

Python

Pattern matching in Python vs Scala

What is the difference between pattern matching in Python and Scala?
Published on: 26 Nov 2021

AI

Should you use machine learning in your product?

How to put AI in production without overengineering your system
Published on: 19 Nov 2021

AI in production

How does the Atlan data platform help you ensure data quality?

Atlan - a tool for facilitating a collaborative data culture
Published on: 15 Nov 2021

Data Engineering

What should you learn as a data engineer?

Should you spend time learning data engineering tools and libraries?
Published on: 12 Nov 2021

MLOps

Shadow deployment vs. canary release of machine learning models

What is shadow deployment in machine learning? What is a canary release? What is the difference?
Published on: 05 Nov 2021

MLOps AWS

How to deploy a Transformer-based model with custom preprocessing code to Sagemaker Endpoints using BentoML

Deploy a machine learning model with custom inference code to a Sagemaker Endpoint using BentoML
Published on: 01 Oct 2021

TDD

How to teach your team to write automated tests?

How to teach writing automated tests: TDD, BDD, and other techniques
Published on: 24 Sep 2021

Python

Using AWS Deequ in Python with Python-Deequ

How to use Python-Deequ to validate Spark DataFrames
Published on: 17 Sep 2021

AI in production MLOps

Building and deploying ML models using Qwak ML platform

What is the Qwak ML platform and how does it work?
Published on: 03 Sep 2021

TDD Software Craft

How to learn TDD

Learning Test-Driven Development is hard and there is nothing we can do about it
Published on: 27 Aug 2021

Data Engineering

Data Engineering - the first principles

What is true in every data engineering project?
Published on: 20 Aug 2021

MLOps

How to deploy MLFlow on Heroku

How to deploy MLFlow on Heroku using PostgreSQL as the database, S3 as the artifact storage and with BasicAuth authentication
Published on: 06 Aug 2021

MLOps

What is MLOps? Do we need MLOps?

A complete definition of MLOps. No, MLOps isn't just DevOps applied to machine learning!
Published on: 30 Jul 2021

MLOps

How to add a new dataset to the Feast feature store

How to use Feast feature store in a local environment
Published on: 09 Jul 2021

Data Engineering

Building trustworthy data pipelines

How to build a trustworthy data pipeline?
Published on: 02 Jul 2021

Data engineering Project management

Theory of constraints in data engineering

Are you busy, but nothing ever gets done? Perhaps the theory of constraints will help you
Published on: 25 Jun 2021

Career Writing

How writing can improve your programming skills

How writing texts for people makes you a better programmer
Published on: 18 Jun 2021

Storytelling

The ugly truth about product demo storytelling in data teams

How to make data product demos more engaging and persuade people to care about the data
Published on: 11 Jun 2021

AWS MLOps

Multimodel deployment in Sagemaker Endpoints

How to deploy multiple models in a single Sagemaker Endpoint?
Published on: 28 May 2021

Machine Learning

How to speed up Pandas?

Is the Pandas library too slow? Here are two methods to speed it up!
Published on: 21 May 2021

Data Lake LakeFS

Data versioning with LakeFS

Why you should use LakeFS to build a data lake that supports data versioning
Published on: 14 May 2021

Tensorflow Machine Learning

How to add custom preprocessing code to a Sagemaker Endpoint running a Tensorflow model

How to customize input/output of a Sagemaker Endpoint running a Tensorflow model
Published on: 07 May 2021

Tensorflow Machine Learning

How to A/B test Tensorflow models using Sagemaker Endpoints

How to deploy multiple model versions as one Sagemaker Endpoint
Published on: 30 Apr 2021

Tensorflow Machine Learning

How to predict the value of time series using Tensorflow and RNN

How to train the RNN model in Tensorflow to predict time series?
Published on: 23 Apr 2021

AWS

How to deploy a REST API AWS Lambda using Chalice and AWS Code Pipeline

How to create a REST API Endpoint using AWS Lambda, Chalice, and AWS Code Pipeline
Published on: 16 Apr 2021

Data Engineering MLOps

How to deploy a Tensorflow model using Sagemaker Endpoints and AWS Code Pipeline

How to build a Docker image using AWS Code Pipeline and deploy it as a Sagemaker Endpoint
Published on: 09 Apr 2021

Machine Learning

How to deal with days of the week in machine learning

How to encode week days as features for machine learning models
Published on: 26 Mar 2021

Blogging

On technical blogging

How to start blogging as a programmer
Published on: 19 Mar 2021

Deep learning

Why do we use dropout in artificial neural networks?

How does dropout work in artificial neural networks?
Published on: 12 Mar 2021

Apache Spark

How to measure Spark performance and gather metrics about written data

How to track Spark metrics in AWS CloudWatch
Published on: 05 Mar 2021

AWS

How to use AWS Batch to run a Python script

How to build a Docker image, define an AWS Batch job using Terraform, and run the AWS Batch job using Airflow
Published on: 26 Feb 2021

Airflow Prophet

Anomaly detection in Airflow DAG using Prophet library

How to detect problems in Airflow pipeline using Prophet for time series anomaly detection
Published on: 12 Feb 2021

BDD

How to test REST API contract using BDD

Testing a REST API using Behave in Python
Published on: 05 Feb 2021

Data engineering PySpark BDD

Testing data products: BDD for data engineers

How to use BDD to test PySpark code
Published on: 29 Jan 2021

Data engineering

Definition of done for data engineers

When can data engineers be sure that they have done the task?
Published on: 14 Jan 2021

Learning

Don't learn another programming language

Should you learn a new programming language this year?
Published on: 07 Jan 2021

PySpark

How to read from SQL table in PySpark using a query instead of specifying a table

Fetching data using a SQL query in PySpark
Published on: 01 Jan 2021

Airflow

How to restart a stuck Airflow DAG

What to do when an Airflow DAG gets stuck and does not want to run
Published on: 31 Dec 2020

Airflow

Why does the DayOfWeekSensor exist in Airflow?

How to make an Airflow DAG wait until a specified day of the week
Published on: 30 Dec 2020

Airflow AWS

Send SMS from an Airflow DAG using AWS SNS

How to configure SNS subscription to send SMS messages and use Airflow to send them
Published on: 29 Dec 2020

Athena

How to emulate temporary tables in Athena

Use CTAS to create a temporary table in Athena
Published on: 28 Dec 2020

S3 Terraform

How to enable S3 bucket versioning using Terraform

How to configure S3 bucket versioning in Terraform
Published on: 27 Dec 2020

AWS

How to get a notification when a new file is uploaded to an S3 bucket

Get a Slack notification when a file is uploaded to an S3 bucket
Published on: 26 Dec 2020

Airflow

Get an XCom value in the Airflow on_failure_callback function

How to get the task instance in the on_failure_callback to get access to XCom
Published on: 25 Dec 2020

SQL

Add the row insertion time to a MySQL table

Automatically add the insertion and update time in MySQL
Published on: 24 Dec 2020

AWS

Best practices about partitioning data in S3 by date

How to partition data in S3 by date in a way that makes your life easier
Published on: 23 Dec 2020

PySpark

How to write to a SQL database using JDBC in PySpark

How to use JDBC driver in PySpark to write a DataFrame to a SQL database
Published on: 22 Dec 2020

PySpark

How to add dependencies as jar files or Python scripts to PySpark

How to add a jar file or a Python file as a PySpark dependency
Published on: 21 Dec 2020

Apache Kafka

How to reset the consumer offset in Apache Kafka topic

How to use kafka-consumer-groups.sh to reset topic offsets
Published on: 20 Dec 2020

Apache Kafka

How to purge a Kafka topic

How to remove all messages from a Kafka topic
Published on: 19 Dec 2020

Redshift

Get the last day of the month in Redshift

How to use the last_day function in Redshift
Published on: 18 Dec 2020

Redshift

How to make an unconditional join in Redshift

LEFT OUTER JOIN ON 1=1 in Redshift
Published on: 17 Dec 2020

Redshift SQL

How to count the number of rows that match a condition in Redshift

How to count the rows by multiple conditions at the same time in SQL
Published on: 16 Dec 2020

Redshift

How to index data in Redshift

How to create an equivalent of an index in Redshift
Published on: 15 Dec 2020

Redshift

How to generate a sequence of dates in Redshift

How to use the generate_series function to generate a sequence of dates
Published on: 14 Dec 2020

AWS

How to assign rows to ranked groups in AWS Athena

How to use the NTILE function in Athena
Published on: 13 Dec 2020

Airflow

How to define an AWS Athena view using Airflow

How to use the AWSAthenaOperator
Published on: 12 Dec 2020

Hive

How to write Hive queries with column position number in the GROUP BY or ORDER BY clauses

How to enable column position support in Hive GROUP BY or ORDER BY
Published on: 11 Dec 2020

Hive

How to check whether a regular expression matches a string in Hive

What is the equivalent of Athena/Presto regexp_like in Hive
Published on: 10 Dec 2020

Airflow

How to check whether a YARN application has finished

How to use Airflow PythonSensor to check whether a YARN application finished running
Published on: 09 Dec 2020

AWS

How to use WHEN CASE queries in AWS Athena

Using conditions in AWS Athena queries
Published on: 08 Dec 2020

AWS

How to decode base64 to text in AWS Athena

How to use from_base64 in AWS Athena
Published on: 07 Dec 2020

Apache Spark

How to combine two DataFrames with no common columns in Apache Spark

Use full outer join to combine two Apache Spark DataFrames with no common columns
Published on: 06 Dec 2020

Apache Spark

How to get names of columns with missing values in PySpark

How to get the names of missing properties for every row in a PySpark Dataframe
Published on: 05 Dec 2020

Airflow

How to set a different retry delay for every task in an Airflow DAG

How to use a different retry delay in every Airflow task
Published on: 04 Dec 2020

Airflow Hive

How to find the Hive partition closest to a given date

How to use Airflow to find the Hive partition closest to a given date
Published on: 03 Dec 2020

Airflow

Get the date of the previous successful DAG run in Airflow.

Get the start time or the execution date of the previous successful DAG run in Airflow
Published on: 02 Dec 2020

Airflow

How to prevent Airflow from backfilling old DAG runs

How to disable backfilling of an Airflow DAG or skip a part of the DAG during a backfill
Published on: 01 Dec 2020

AWS

What is s3:TestEvent, and why does it break my event processing?

S3 sends s3:TestEvent to SQS after setting up the bucket notifications
Published on: 30 Nov 2020

AWS

Making OFFSET LIMIT queries in AWS Athena

How to use OFFSET in AWS Athena queries
Published on: 29 Nov 2020

AWS

How to get an alert if an AWS Lambda does not get invoked during the last 24 hours

How to get a notification when AWS Lambda stops being used
Published on: 28 Nov 2020

Airflow

How to set Airflow variables while creating a dev environment

How to use command-line to set Airflow variables
Published on: 27 Nov 2020

AWS

How to check when an Athena table was updated

How to track the time when an Athena table was updated
Published on: 26 Nov 2020

Airflow

How to run an Airflow DAG in a loop

How to keep running an Airflow DAG indefinitely
Published on: 25 Nov 2020

Airflow

How to use xcom_pull to get a variable from another DAG

Get an XCOM variable from another DAG
Published on: 24 Nov 2020

Airflow

What to do when Airflow BashOperator fails with TemplateNotFound error

How to fix TemplateNotFound error when using Airflow BashOperator
Published on: 23 Nov 2020

AWS

Copy directories in S3 using s3-dist-cp

How to copy files in S3 and preserve the directory structure
Published on: 22 Nov 2020

AWS

How to select a random sample of rows using Athena

How to use a window function to select random rows from Athena
Published on: 21 Nov 2020

Airflow

Use HttpSensor to pause an Airflow DAG until a website is available

Pause an Airflow DAG until an HTTP endpoint returns 200 OK
Published on: 20 Nov 2020

Airflow

How to add an EMR step in Airflow and wait until it finishes running

How to use AwsHook and EmrStepSensor to add an EMR step and wait until it finishes running
Published on: 19 Nov 2020

Airflow

How to use Virtualenv to prepare a separate environment for Python function running in Airflow

How to use the PythonVirtualenvOperator in Airflow
Published on: 18 Nov 2020

AWS Airflow

Remove a directory from S3 using Airflow S3Hook

How to remove files with a common prefix from S3
Published on: 17 Nov 2020

Airflow

Run a command on a remote server using SSH in Airflow

How to use the SSHHook in a PythonOperator to connect to a remote server from Airflow using SSH and execute a command.
Published on: 16 Nov 2020

Hive

Use the ROW_NUMBER() function to get top rows by partition in Hive

How to calculate row number by partition in Hive and use it to filter rows
Published on: 15 Nov 2020

Terraform

How to configure both core and spot instances in EMR using Terraform

Use EMR instance group to add spot instances to an EMR cluster
Published on: 14 Nov 2020

AWS

How to temporarily disable an AWS Lambda function using AWS CLI without removing the function

Disable an AWS Lambda using AWS CLI
Published on: 13 Nov 2020

AWS

How to add an EMR step from AWS Lambda

How to configure a new EMR step using AWS Lambda in Python
Published on: 12 Nov 2020

AWS

Send event to AWS Lambda when a file is added to an S3 bucket

Trigger AWS Lambda when a file is created in an S3 bucket
Published on: 11 Nov 2020

Serverless

Select Serverless configuration variables using the stage parameter

How to pass environment parameters to Serverless that depend on the deployment stage
Published on: 10 Nov 2020

Airflow

Use a custom function in Airflow templates

How to add a custom function to Airflow and use it in a template
Published on: 08 Nov 2020

PySpark

Speed up counting the distinct elements in a Spark DataFrame

Use HyperLogLog to calculate the approximate number of distinct elements in Apache Spark
Published on: 07 Nov 2020

Airflow

Pass parameters to SQL query when using PostgresOperator in Airflow

How to pass parameters to SQL template when using PostgresOperator in Airflow
Published on: 06 Nov 2020

PySpark

Use regexp_replace to replace a matched string with a value of another column in PySpark

Use regex to replace the matched string with the content of another column in PySpark
Published on: 05 Nov 2020

Apache Spark

How to read multiple Parquet files with different schemas in Apache Spark

What to do when Apache Spark skips Parquet files with incompatible schemas
Published on: 04 Nov 2020

Apache Spark

How to determine the partition size in Apache Spark

How to choose the proper partition size and the number of partitions to run an Apache Spark job
Published on: 03 Nov 2020

DynamoDB

How to download all available values from DynamoDB using pagination

How to use pagination to retrieve all DynamoDB values
Published on: 02 Nov 2020

EMR

How to make sure that you did not leave an EMR cluster running

How to get notifications about running EMR cluster
Published on: 01 Nov 2020

S3 Terraform

How to automatically remove files from S3 using lifecycle rules defined in Terraform

How to define S3 lifecycle rules using Terraform
Published on: 31 Oct 2020

Python

How to retry a Python function call

How to retry a Python function call in case of an error
Published on: 30 Oct 2020

Airflow

Send a Slack message from an Airflow DAG

How to use the SlackAPIPostOperator to send a templated message to a Slack channel
Published on: 29 Oct 2020

Airflow

How to delay an Airflow DAG until a given hour using the DateTimeSensor

How to use the DateTimeSensor in Airflow
Published on: 28 Oct 2020

Apache Spark Airflow

How to run PySpark code using the Airflow SSHOperator

How to submit a PySpark job using SSHOperator in Airflow
Published on: 27 Oct 2020

Airflow

How to add a manual step to an Airflow DAG using the JiraOperator

How can you add a human action to an Airflow DAG?
Published on: 26 Oct 2020

Apache Spark

How Data Mechanics can reduce your Apache Spark costs by 70%

Stop wasting time and money tuning Apache Spark parameters
Published on: 26 Oct 2020

Airflow

Conditionally pick an Airflow DAG branch using an SQL query

How to use the BranchSQLOperator to choose a DAG branch to execute
Published on: 25 Oct 2020

Airflow

How to trigger an Airflow DAG from another DAG

How to trigger another DAG from an Airflow DAG
Published on: 24 Oct 2020

Airflow

Why does the ExternalTaskSensor get stuck?

How to fix the stuck ExternalTaskSensor
Published on: 23 Oct 2020

Airflow

How to render an Airflow template for testing

How to generate the code of an Airflow task from a template and a given execution date
Published on: 22 Oct 2020

Airflow

How to check the next execution date of an Airflow DAG

How to use Airflow CLI to get the next execution date of a DAG
Published on: 21 Oct 2020

Airflow

Doing data quality checks using the SQLCheckOperator

How to use SQLCheckOperator to verify that the database contains an expected number of rows
Published on: 20 Oct 2020

Airflow

How to deal with the jinja2 TemplateNotFound error in Airflow

How to fix the TemplateNotFound error while using a custom Airflow operator
Published on: 19 Oct 2020

Airflow AWS

How to postpone Airflow DAG until files get uploaded into an S3 bucket

How to use Airflow sensors to detect that files have been uploaded into an S3 bucket
Published on: 18 Oct 2020

Apache Spark

What is the difference between a transformation and an action in Apache Spark?

What is an action in Apache Spark? What do you understand as transformations in Apache Spark?
Published on: 17 Oct 2020

Airflow

Use LatestOnlyOperator to skip some tasks while running a backfill in Airflow

How to skip some tasks when backfilling a DAG in the past
Published on: 16 Oct 2020

Interviews DataOps

Christopher Bergh - How the DataOps principles help data engineers make data pipelines trustworthy

An interview with Christopher Bergh who explains how the DataOps principles help data engineers make data pipelines trustworthy
Published on: 16 Oct 2020

Airflow

How to retrieve the statuses of the recent DAG executions from Airflow database

How to make a dashboard that displays Airflow DAG statuses
Published on: 15 Oct 2020

Redshift

How to find and terminate an idle Redshift session

How to find the idle session that is blocking the connection pool in Redshift
Published on: 14 Oct 2020

Apache Spark AWS EMR

How to configure Spark to maximize resource usage while using AWS EMR

How to configure EMR to use all available resources when running a Spark cluster
Published on: 13 Oct 2020

Athena Airflow

How to use AWSAthenaOperator in Airflow to verify that a DAG finished successfully

How to check that an AWS Athena table contains data after running an Airflow DAG.
Published on: 12 Oct 2020

AWS Glue

How to start an AWS Glue Crawler to refresh Athena tables using boto3

How to create and start an AWS Glue Crawler from Python code using boto3
Published on: 11 Oct 2020

AWS

How to retrieve the table descriptions from Glue Data Catalog using boto3

How to get the comments from the create table statements when the metadata is stored in the Glue Data Catalog
Published on: 10 Oct 2020

DynamoDB

How to perform a batch write to DynamoDB using boto3

How to write multiple DynamoDB objects at once using boto3
Published on: 09 Oct 2020

RDS AWS

How to populate a PostgreSQL (RDS) database with data from CSV files stored in AWS S3

How to upload S3 data into RDS tables
Published on: 08 Oct 2020

SQL

How to concatenate multiple MySQL rows into a single field?

How to concatenate multiple rows into a string in MySQL
Published on: 07 Oct 2020

Hive

How to get an array/bag of elements from the Hive group by operator?

How to get an array of elements from one column when grouping by another column in Hive
Published on: 06 Oct 2020

PySpark Apache Spark

Working with dates and time in Apache Spark

How to get relative dates (yesterday, tomorrow) in Apache Spark, and how to calculate the difference between two dates
Published on: 05 Oct 2020

PySpark Apache Spark

How to save an Apache Spark DataFrame as a dynamically partitioned table in Hive

How to use the saveAsTable function to create a partitioned table
Published on: 04 Oct 2020

Apache Spark

When to cache an Apache Spark DataFrame?

Should we cache everything in Apache Spark, or are there any rules?
Published on: 03 Oct 2020

Apache Spark PySpark

How to flatten a struct in a Spark DataFrame?

How to convert DataFrame fields into separate columns.
Published on: 02 Oct 2020

Apache Spark PySpark

What is the difference between CUBE and ROLLUP and how to use it in Apache Spark?

How to use the cube and rollup functions in Apache Spark or PySpark. What is the difference between a cube and a rollup?
Published on: 01 Oct 2020

Apache Spark PySpark

How to concatenate columns in a PySpark DataFrame

How to use the concat and concat_ws functions to merge multiple columns into one in PySpark
Published on: 30 Sep 2020

Apache Spark PySpark

How to derive multiple columns from a single column in a PySpark DataFrame

Extract multiple columns from a single column using the withColumn function and a PySpark UDF
Published on: 29 Sep 2020

Apache Spark

Broadcast variables and broadcast joins in Apache Spark

How to speed up joins of small DataFrames by using the broadcast join
Published on: 28 Sep 2020

Apache Spark

How to use the window function to get a single row from each group in Apache Spark

How to group values by a key and extract a single row from each group in Apache Spark
Published on: 27 Sep 2020

PrestoSQL AWS Athena

How to make a pivot table in AWS Athena or PrestoSQL

How to make a pivot table in AWS Athena, and why the pivot function does not exist
Published on: 26 Sep 2020

Apache Spark

What is the difference between repartition and coalesce in Apache Spark?

When should you use coalesce instead of repartition in Apache Spark
Published on: 25 Sep 2020

Apache Spark

How to pivot an Apache Spark DataFrame

How to turn an Apache Spark or PySpark DataFrame into a pivot table.
Published on: 24 Sep 2020

Apache Spark

What is the difference between cache and persist in Apache Spark?

When should you use the cache function, and when should you use the persist function?
Published on: 23 Sep 2020

Presto Data Engineering

Why your company should use PrestoSQL

Should your team use PrestoSQL?
Published on: 16 Sep 2020

Data Engineering Data Quality

Is counting rows all we can do?

How to detect problems in data pipelines before they turn into hard to debug bugs? I wish I knew.
Published on: 08 Sep 2020

AWS AWS Athena

How to Speed Up AWS Athena Queries Using Partition Projection

How to define partition projection while creating an Athena table
Published on: 30 Aug 2020

Apache Airflow

How to send a customized Slack notification when an Airflow task fails

How to customize a Slack notification before sending it to the Slack incoming webhook.
Published on: 27 Aug 2020

Apache Spark Pytest

How to use one SparkSession to run all Pytest tests

How to speed up Pytest tests by reusing the same SparkSession in all of them
Published on: 20 Jul 2020

AWS Monitoring

How to send AWS CloudWatch Alerts to a Slack channel using Terraform

How to use Terraform to configure a CloudWatch alert and send the message to a Slack channel.
Published on: 13 Jul 2020

Data Engineering Apache Spark Check-Engine

Check-Engine - data quality validation for PySpark 3.0.0

A PySpark library for data quality checks and data validation.
Published on: 06 Jul 2020

Data Engineering Data quality AWS Deequ

Measuring data quality using AWS Deequ

How to measure data quality in Athena tables using AWS Deequ running on an EMR cluster.
Published on: 29 Jun 2020

Data Engineering Airflow

How to conditionally skip tasks in an Airflow DAG

How to use XCom and PythonSensor to skip remaining tasks in an Airflow DAG.
Published on: 22 Jun 2020

Data Engineering Software Craft

The problem with software testing in data engineering

Why don't data engineers write unit tests?
Published on: 15 Jun 2020

Data Engineering Apache Kafka

How does Kafka Connect work?

How does a Connector work? What is a Worker in Kafka Connect? How does the data get processed inside Kafka Connect, and why does it need internal Kafka topics?
Published on: 08 Jun 2020

Airflow Data Engineering

Why my Airflow tasks got stuck in "no_status" and how I fixed it

A story about debugging an Airflow DAG that was not starting tasks
Published on: 01 Jun 2020

Apache Kafka Data Engineering

What is Kafka log compaction, and how does it work?

How the log compaction is implemented in Apache Kafka and how to configure Kafka log compaction properly
Published on: 22 May 2020

Data engineering Apache Kafka

How does a Kafka Cluster work?

What is the difference between a leader and a replica broker? What is the cluster controller? How is the controller elected?
Published on: 18 May 2020

Data Engineering AWS

Athena performance tips explained

How to use query execution plans to speed up Athena queries
Published on: 11 May 2020

Data Engineering Stream processing

Data flow - what functional programming and Unix philosophy can teach us about data streaming

How to write data stream processing code that is easy to maintain
Published on: 04 May 2020

AWS

AWS IAM roles and policies explained

How to give users access rights in AWS
Published on: 06 Apr 2020

Book Learning

How to be happy at work - lessons learned from "Career superpowers" book

What can you learn from the book "Career superpowers" by James Whittaker
Published on: 30 Mar 2020

AWS Data Engineering

How to send metrics to AWS CloudWatch from custom Python code

How to use boto3 to send custom metrics to AWS CloudWatch from Python
Published on: 23 Mar 2020

Data Engineering Apache Spark

How to unit test PySpark

How to speed up development by unit testing PySpark DAGs
Published on: 24 Feb 2020

Data Engineering Apache Spark

How to speed up a PySpark job

Why one Spark executor is running much longer than others and what you can do about it
Published on: 17 Feb 2020

Data Engineering Papers We Love

How does MapReduce work, and how is it similar to Apache Spark?

The explanation of the original MapReduce paper and a description of similarities between MapReduce and Apache Spark
Published on: 10 Feb 2020

Data Streaming Data engineering

Data streaming with Apache Kafka - guide for data engineers

Are you preparing for a data engineer job interview? Here are my answers to job interview questions about data streaming.
Published on: 03 Feb 2020

Big data Event Streaming

Data streaming: what is the difference between the tumbling and sliding window?

There are many kinds of sliding windows. Which one should you use?
Published on: 27 Jan 2020

Arduino IoT

I put a carnivorous plant on the Internet of Things to save its life, and it did not survive

This article is a text version of my talk, "I put a carnivorous plant on the Internet of Things," which I presented during the DataNatives conference (November 25-26, 2019 in Berlin, Germany).
Published on: 23 Jan 2020

Big data Data engineering

What are the 4 V's of big data, and which one is the most important?

Volume, velocity, variety, and veracity
Published on: 20 Jan 2020

Software architecture

10x software architecture: high cohesion

How to achieve high cohesion and a few common problems.
Published on: 12 Jan 2020

AWS Serverless

How to add dependencies to AWS Lambda

How to deploy an AWS Lambda with dependencies
Published on: 08 Jan 2020

Book

Four books to boost your programmer career

I quit my dream job because of a book
Published on: 06 Jan 2020

Data engineering

What is the difference between data lake, data warehouse, and data mart

We can easily distinguish between them by focusing on three qualities: data structure (schema), data quality, and ownership.
Published on: 18 Dec 2019

Spark Data engineering

Three biggest traps to avoid while setting Spark executor memory

Apache Spark is wasting a lot of RAM!
Published on: 16 Dec 2019

Airflow

How to use Airflow backfill to run DAGs for a specified date in the past?

How does Airflow backfill work?
Published on: 11 Dec 2019

AWS

What do you need to know about storing passwords in AWS?

How to use the AWS Secrets Manager
Published on: 09 Dec 2019

Apache Spark Data engineering

Apache Spark: should we use RDD, Dataset, or DataFrame?

Is there a difference between Dataset and DataFrame? Why do we even have both?
Published on: 04 Dec 2019

Book review

What a data engineer can learn from The Unicorn Project?

Review of The Unicorn Project by Gene Kim
Published on: 02 Dec 2019

AI in production Data Engineering

AI in production: Roobits Events360

What would you do if you were writing an application which had to process one billion events per day?
Published on: 18 Nov 2019

AI in production AI

AI in production: Carta Healthcare

AI systems in healthcare
Published on: 11 Nov 2019

Time Series Anomaly Detection

Using Exponentially Weighted Moving Average for anomaly detection

Published on: 05 Nov 2019

Reinforcement Learning TensorFlow

Using Boltzmann distribution as the exploration policy in TensorFlow-agent reinforcement learning models

There is a whole spectrum of exploration strategies between random and greedy policies.
Published on: 04 Nov 2019

Data engineering Principles

Data engineering principles according to Gatis Seja

Lessons learnt from Gatis Seja's presentation about data engineering principles
Published on: 15 Oct 2019

Data Science

How to remove outliers from Seaborn boxplot charts

Hide outliers when displaying boxplot in Seaborn
Published on: 14 Oct 2019

Python Performance Engineering

Python memory management in Jupyter Notebook

How to avoid memory leaks in Jupyter Notebook
Published on: 08 Oct 2019

Data engineering AI in production

AI in production: make data as easy as using your phone

Interview with Gautam Bakshi - the CEO of 15 Rock
Published on: 07 Oct 2019

Dask AWS Amazon S3

How to connect a Dask cluster (in Docker) to Amazon S3

How to connect a Dask cluster (in Docker) to Amazon S3
Published on: 01 Oct 2019

Machine learning AI in production

A.I. in production: your next stylist is going to be a neural network

What if your phone could tell you what you should wear?
Published on: 30 Sep 2019

Tensorflow AWS Amazon S3

Loading tensorflow models from Amazon S3 with Tensorflow Serving

How to save the model in a file, upload it to S3, and serve it using the Docker image of Tensorflow Serving
Published on: 24 Sep 2019

Data Science

Pandas stack and unstack explained

How to use the stack and unstack functions in Pandas
Published on: 23 Sep 2019

Tensorflow Deep Learning

How to split a data frame into time-series for LSTM deep neural network

How to prepare data for LSTM model
Published on: 15 Sep 2019

Scrapy Grafana

How to monitor Scrapy spiders using InfluxDB and Grafana

How to write Scrapy statistics to InfluxDB and set up Grafana alerts
Published on: 10 Sep 2019

Machine learning

What is the difference between training, validation, and test sets in machine learning

Training a machine learning model is like learning before an exam.
Published on: 09 Sep 2019

Parquet Python

How to write to a Parquet file in Python

Define a schema, write to a file, partition the data
Published on: 03 Sep 2019

Data Science Numpy

Numpy reshape explained

How to use the reshape function in Numpy
Published on: 02 Sep 2019

Data Science

Human bias in A/B testing

Underpowered tests, true negatives, and ignored test results
Published on: 28 Aug 2019

Machine learning

How to plot the decision trees from XGBoost classifier

How to plot the decision rules of XGBoost
Published on: 26 Aug 2019

Data Science

Smoothing time series in Python using Savitzky–Golay filter

Smoothing Bitcoin price time-series
Published on: 21 Aug 2019

Data Science Machine Learning

XGBoost hyperparameter tuning in Python using grid search

Using GridSearchCV from Scikit-Learn to tune XGBoost classifier
Published on: 19 Aug 2019

Data Science Machine Learning

Forecasting time series: using lag features

How to generate lag features from time series
Published on: 16 Aug 2019

Data Science TensorFlow

How to turn Pandas data frame into time-series input for RNN

From Pandas dataframe to RNN input
Published on: 14 Aug 2019

TensorFlow Deep learning

How to automatically select the hyperparameters of a ResNet neural network

Training ResNet network for multiclass image classification using keras-tuner
Published on: 12 Aug 2019

TensorFlow Deep learning

Using Hyperband for TensorFlow hyperparameter tuning with keras-tuner

Tuning TensorFlow with Hyperband
Published on: 09 Aug 2019

TensorFlow Deep learning

Using keras-tuner to tune hyperparameters of a TensorFlow model

Tuning Keras hyperparameters with keras-tuner
Published on: 07 Aug 2019

Deep learning TensorFlow

Understanding the Keras layer input shapes

What is the input_shape in Keras/TensorFlow?
Published on: 05 Aug 2019

Deep learning TensorFlow

How to train a model in TensorFlow 2.0

TensorFlow 2 - example
Published on: 02 Aug 2019

TensorFlow Reinforcement learning

How to train a Reinforcement Learning Agent using Tensorflow Agents

The reinforcement learning loop with Tensorflow Agents
Published on: 31 Jul 2019

TensorFlow Reinforcement learning

How to use a custom metric with Tensorflow Agents

How to define a new Tensorflow Agents metric and add it to the driver
Published on: 29 Jul 2019

TensorFlow Reinforcement learning

How to use a behavior policy with Tensorflow Agents

Random and scripted behavior policies
Published on: 26 Jul 2019

TensorFlow Reinforcement learning

How to create an environment for a Tensorflow Agent?

Implementing a Tensorflow Agent environment to play a board game
Published on: 24 Jul 2019

Reinforcement learning

Deep Q-network terminology in plain English

The terminology used in the paper "Human-level control through deep reinforcement learning"
Published on: 22 Jul 2019

Math Reinforcement learning

Bellman equation explained

The fundamental equation of reinforcement learning
Published on: 19 Jul 2019

Airflow Data engineering

Dependencies between DAGs: How to wait until another DAG finishes in Airflow?

How to trigger Airflow DAG when another DAG is completed
Published on: 17 Jul 2019

Airflow Data engineering

How to run Airflow in Docker (with a persistent database)

How to configure Airflow in a Docker container
Published on: 15 Jul 2019

Machine learning

Using machine learning for software testing

How to sample production data to get representative testing dataset?
Published on: 12 Jul 2019

Math Data Science

How to measure the similarity of sequence values

Levenshtein distance and Kendall tau distance
Published on: 10 Jul 2019

Math Data Science

Measuring document similarity in machine learning

How to measure the similarity of two datasets?
Published on: 08 Jul 2019

Math Data science

Minkowski distance explained

Manhattan distance, Euclidean distance, and Chebyshev distance are types of Minkowski distances
Published on: 05 Jul 2019

Data Science

Why most data science projects fail?

What are the biggest challenges in data science?
Published on: 03 Jul 2019

Data Science Startup YouTube

Product/market fit - building a data-driven product

How to test a product idea?
Published on: 30 Jun 2019

Genetic algorithms

How to assign people to groups in a fair way using genetic algorithms

Using Helisa and Jenetics in Scala
Published on: 21 Jun 2019

Genetic algorithms

Genetic algorithms in Scala - solving optimization problems

Using Helisa and Jenetics to help Fallout players
Published on: 19 Jun 2019

Teamwork Startup

Re: DataOps Principles: How Startups Do Data The Right Way

Team vs. a bunch of individuals reporting work time in the same spreadsheet
Published on: 17 Jun 2019

Python

From Scala to Python - Python dataclasses

Domain model in Python
Published on: 14 Jun 2019

Data Science

Notetaking for data science

How to document a project?
Published on: 12 Jun 2019

Data Science Statistics

Wilson score in Python - example

How to calculate page popularity using the Wilson Score
Published on: 10 Jun 2019

Machine learning

Using a surrogate model to interpret a machine learning model

How to explain a machine learning model?
Published on: 07 Jun 2019

Machine learning

Generalized Linear Models — Using linear regression when the dependent variable does not follow Gaussian distribution

Understanding the GLM from the statsmodels package
Published on: 05 Jun 2019

Machine learning

PCA — how to choose the number of components?

How many principal components do we need when using Principal Component Analysis?
Published on: 03 Jun 2019

Machine learning

How to avoid bias against underrepresented target classes while training a machine learning model

The difference between KFold and StratifiedKFold in Scikit-learn
Published on: 31 May 2019

Data Science

How to get the value by rank from a grouped Pandas dataframe

How to rank a grouped data frame in Pandas
Published on: 29 May 2019

Data Science

The difference between the expanding and rolling window in Pandas

How to use rolling window with datetime (and other types) in Pandas
Published on: 27 May 2019

Data Science

Write everything down

Lessons learnt from "Practical Data Cleaning" by Lee Baker
Published on: 24 May 2019

Deep learning

Understanding layer size in Convolutional Neural Networks

Filter size, padding, and stride explained
Published on: 22 May 2019

Data engineering

Calculating the cumulative sum of a group using Apache Spark

How to use the window function to calculate a cumulative sum
Published on: 20 May 2019

Data engineering

How to write to a Parquet file in Scala without using Apache Spark

How to use Parquet4s to write Parquet files in Scala
Published on: 10 May 2019

Data engineering

Row number in Apache Spark window — row_number, rank, and dense_rank

This article is mostly a “note to self” because I don’t want to google that anymore ;)
Published on: 08 May 2019

Data Science

How to display all columns of a Pandas DataFrame in Jupyter Notebook

How to set the max columns in Pandas
Published on: 06 May 2019

Book

Review of “Conversations On Data Science” by Roger D. Peng and Hilary Parker

Data Science book recommendation
Published on: 03 May 2019

Data Science

The silly mistakes in exploratory data analysis

My most interesting Data Analysis failures
Published on: 01 May 2019

Deep learning

Understanding the softmax activation function

Softmax function explained
Published on: 29 Apr 2019

Deep learning

How to increase accuracy of a deep learning model

Debugging a machine learning model
Published on: 26 Apr 2019

Data Science

Smoothing time series in Pandas

How to use the exponentially weighted window functions in Pandas
Published on: 24 Apr 2019

Deep learning

Which hyperparameters of deep learning model are important and how to find them

How to speed up finding the right hyperparameters of a machine learning model
Published on: 22 Apr 2019

Deep learning

How to choose the right mini-batch size in deep learning

Andrew Ng’s recommendation about mini-batch size
Published on: 19 Apr 2019

Deep learning

How to deal with underfitting and overfitting in deep learning

The lessons learned from Andrew Ng’s online course
Published on: 17 Apr 2019

Data Science

How to reduce memory usage in Pandas

Fit more data in the same amount of memory
Published on: 15 Apr 2019

Data engineering

How Airflow scheduler works

Explanation of the Airflow interval and start_date parameters
Published on: 12 Apr 2019

Data Science

Guidelines for data science teams — a summary of Daniel Molnar’s talks

Avoiding over-engineering in machine learning
Published on: 10 Apr 2019

Deep learning

Ludwig machine learning model in Kaggle

My first attempt to use Ludwig
Published on: 08 Apr 2019

Machine learning

The problem of large categorical variables in machine learning

How to use FeatureHasher in Scikit-learn
Published on: 05 Apr 2019

Machine learning

Encoding categorical variables in machine learning

One-hot encoding, dummy coding, and effect coding in Scikit learn and Pandas
Published on: 03 Apr 2019

Machine learning

How To Avoid Data Leakage While Building A Machine Learning Model

What to do when your model works perfectly during testing but fails in production
Published on: 01 Apr 2019

Machine learning

Using scikit-automl for building a classification model

My first attempt to use scikit-automl and how I got it working
Published on: 29 Mar 2019

Data Science

How to return rows with missing values in Pandas DataFrame

How does it work and why the most popular solution is wrong
Published on: 27 Mar 2019

Machine learning

Preprocessing the input Pandas DataFrame using ColumnTransformer in Scikit-learn

How to encode text/categorical variables and scale numerical values using only one Scikit-learn class
Published on: 25 Mar 2019

Machine learning

How to install scikit-automl in a Kaggle notebook

error: command ‘swig’ failed with exit status 1 while installing scikit-automl
Published on: 22 Mar 2019

Data Science

Predicting customer lifetime value using the Pareto/NBD model and Gamma-Gamma model

How to estimate the CLV from a list of customer transactions using the lifetimes library in Python
Published on: 20 Mar 2019

Data Science

Predicting customer churn using the Pareto/NBD model

How to use the Python lifetimes library to build a Pareto/NBD model.
Published on: 18 Mar 2019

Data Science

Business metrics that make no sense

There are three kinds of metrics that won’t destroy your business.
Published on: 15 Mar 2019

Machine learning

Nested cross-validation in time series forecasting using Scikit-learn and Statsmodels

Tweaking the parameters of Statsmodels
Published on: 13 Mar 2019

Data Science

How to perform an A/B test correctly in Python

What can we expect from a correctly performed A/B test?
Published on: 11 Mar 2019

Book

[book review] The hundred-page machine learning book

I have mixed feelings about this book.
Published on: 08 Mar 2019

Machine learning

A few useful things to know about machine learning

Pedro Domingos’s observations about feature engineering
Published on: 06 Mar 2019

Data Science

Recommendations vs. raw data — what is better?

Should we suggest an action when we visualize data?
Published on: 04 Mar 2019

Machine learning

How to interpret ROC curve and AUC metrics

In my opinion, AUC is a metric that is both easy to use and easy to misuse. Do you want to know why? Keep reading ;)
Published on: 01 Mar 2019

Data Science

How to display mathematical equations in Jupyter Notebook

LaTeX support in Jupyter Notebook
Published on: 27 Feb 2019

Data Science

Apriori algorithm explained

Using association rule learning to make recommendations
Published on: 25 Feb 2019

Data Science

How to change plot size in Jupyter Notebook

Pyplot parameter that configures the chart size
Published on: 22 Feb 2019

Data Science

Looking for structure in data — Andrews curves plot explained

How to read Andrews curves chart
Published on: 20 Feb 2019

Data engineering

Making your Scrapy spider undetectable by applying basic statistics

How to delay scraper requests to make it look like a human visiting the website
Published on: 18 Feb 2019

Data Science

Finding seasonality in time series using autocorrelation plot

How to interpret autocorrelation plot?
Published on: 15 Feb 2019

Data Science

My favourite data science podcasts

I was asked for some podcast recommendations, so here is my very short list ;)
Published on: 13 Feb 2019

Data Science

A podcast that changed my perspective on exploratory data analysis

How to avoid bad science
Published on: 11 Feb 2019

Data Science

How to read a confusion matrix

Predicted labels are in columns, right? Or maybe in rows? Do you remember? ;)
Published on: 08 Feb 2019

Deep learning fast.ai

The optimal learning rate during fine-tuning of an artificial neural network

How to set the learning rate after you unfreeze the network layers in fast.ai
Published on: 06 Feb 2019

Data Science Machine learning

F1 score explained

The mathematics behind F1 score.
Published on: 04 Feb 2019

Data Science

How to display a progress bar in Jupyter Notebook

Display a progress bar with no additional dependencies, just Python + Jupyter Notebook
Published on: 01 Feb 2019

TensorFlow Deep learning

Save and restore a Tensorflow model using Keras for continuous model training

How to run the fit function multiple times and improve the model?
Published on: 30 Jan 2019

Machine learning

A comprehensive guide to putting a machine learning model in production using Flask, Docker, and Kubernetes

How to use Docker and Flask to put a Scikit model in production as a microservice.
Published on: 28 Jan 2019

Software engineering

Query string validation in Fastify

How to validate query parameters using Fastify
Published on: 25 Jan 2019

Problem solving

Mental models: inversion

Solve the opposite problem to avoid stupidity.
Published on: 23 Jan 2019

Data Science Machine learning

How to save a machine learning model into a file

Saving a Scikit-learn model using the joblib library in Python
Published on: 21 Jan 2019

Productivity

Music and other distractions

Why is it difficult to work in the office?
Published on: 18 Jan 2019

Book

[book review] You had me at Hello, World

Did you ever want to have a mentor?
Published on: 16 Jan 2019

Book

[book review] Deep work by Cal Newport

How to focus on the high outcome tasks and avoid being distracted
Published on: 14 Jan 2019

Data Science

Bootstrapping vs. bagging

The difference explained
Published on: 11 Jan 2019

Git

Git fixup explained

How to change the commit history
Published on: 09 Jan 2019

Book

[book review] So good they can’t ignore you

A polarizing book
Published on: 07 Jan 2019

Book

[book review] The effective engineer

What is the best investment of your time?
Published on: 21 Dec 2018

Docker

How to remove all Docker images and containers

An explanation of removing Docker images and containers.
Published on: 19 Dec 2018

Machine learning Data Science

Understanding uncertainty intervals generated by Prophet

How to tweak uncertainty intervals in Prophet.
Published on: 17 Dec 2018

Python

A Python HTTP server for serving static content

How to easily serve static content on localhost or in the local network
Published on: 14 Dec 2018

Software engineering

Assert object pattern

The easiest way to make your tests more readable and easier to maintain
Published on: 12 Dec 2018

Book

5 best books I read in 2018

A list that surprised even me…
Published on: 10 Dec 2018

Scala

How to run a single test in SBT

Why testOnly does not work?
Published on: 07 Dec 2018

Data engineering

How to use Scrapy to follow links on the scraped pages

A web spider that does not follow links is not very useful, let’s fix that.
Published on: 05 Dec 2018

Data engineering

How to scrape a single web page using Scrapy in Jupyter Notebook?

Scrapy Spiders and processing pipelines 101
Published on: 03 Dec 2018

Docker

What is inside a Docker image?

How to unpack a Docker image
Published on: 30 Nov 2018

Public Speaking

What is wrong with tech conferences?

Why are tech conferences boring?
Published on: 28 Nov 2018

JVM

Java performance testing — Epsilon garbage collector

How to make sure that GC does not stop the JVM during a test?
Published on: 21 Nov 2018

Software craft

Turning greenfield projects into brownfield projects

What happens when the team lacks software craft skills?
Published on: 19 Nov 2018

Book

"The war of art" and other books I did not finish reading

You can read more good books if you skip the lousy ones.
Published on: 16 Nov 2018

Data Science Machine learning

Prophet plot explained

How to read the Prophet forecast plot
Published on: 14 Nov 2018

Productivity

Brain dump — programmer productivity experiment #2

How to generate new ideas instead of thinking about the same thing over and over again
Published on: 12 Nov 2018

Productivity

Programmer diary — programmer productivity experiment #1

One of the most intriguing ideas described in the book "How Google works" is writing "snippets."
Published on: 09 Nov 2018

Book Productivity

Smart creative — the new role model

It may look like a unicorn, but it is real
Published on: 07 Nov 2018

Book

"The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger" by Marc Levinson

What happens when one invention makes the whole industry obsolete?
Published on: 05 Nov 2018

Data Science

How to visualise prediction errors

How to explain the errors of a linear regression model
Published on: 02 Nov 2018

Project management

User story mapping for developers

A natural way of splitting work into small, but useful parts
Published on: 29 Oct 2018

Software craft

Is programming art or science?

On a quest to find the right metaphor
Published on: 24 Oct 2018

TDD Data Science

Test-driven development in Jupyter Notebook

TDD for data scientists working with Jupyter Notebook
Published on: 22 Oct 2018

Machine learning

Machine learning cheat sheets

A collection of machine learning cheat sheets I find useful and google repeatedly.
Published on: 19 Oct 2018

Software craft

Does a tester break the product?

How does a name influence our attitude?
Published on: 17 Oct 2018

Data Science

Dealing with dates and time in Pandas

How to use Pandas to parse dates or calculate time in a different timezone.
Published on: 15 Oct 2018

Data Science

Fill missing values using Random Forest

How to predict the missing values using Scikit-Learn
Published on: 12 Oct 2018

Teamwork

The one important thing I learned from "Beyond Developer" by Dan North

How to motivate software engineers?
Published on: 10 Oct 2018

Software craft

Programmers love new toys but hate new habits

We talk about toys. We love new buzzwords. We adore things that sound cool. Yes, we do.
Published on: 08 Oct 2018

Software engineering

A "known bug" is still a bug

What does a "known bug" or an update say to our users?
Published on: 05 Oct 2018

Docker

How to build a project inside a Docker container

How to safely run code downloaded from the Internet
Published on: 03 Oct 2018

Book

[book review] Dichotomy of leadership

The follow-up to “Extreme ownership”
Published on: 01 Oct 2018

Data Science

Box and whiskers plot

How to plot and interpret the box and whiskers plot
Published on: 28 Sep 2018

Teamwork

[book review] Team Geek

This book deserves a 3-star review on Amazon for many reasons.
Published on: 26 Sep 2018

Data Science

How I failed to plot parallel coordinates in Matplotlib

Built-in matplotlib functions are not enough in this case
Published on: 24 Sep 2018

Data Science

Import Jupyter Notebook from GitHub

The easiest way to access someone else’s code in your own notebook
Published on: 21 Sep 2018

Data Science

Fill missing values in Pandas

Use the next or previous value to fill the missing values in Pandas
Published on: 19 Sep 2018

Machine learning

Forward feature selection in Scikit-Learn

Two workarounds to get an equivalent of forward feature selection in Scikit-Learn
Published on: 17 Sep 2018

Data Science

Heat map with Matplotlib

A short tutorial about generating a heat map of the values stored in a Pandas dataframe
Published on: 14 Sep 2018

Software engineering

Language is all about nouns

Programmers are afraid of nouns. We often replace them with poorly written descriptions of things.
Published on: 12 Sep 2018

Data Science

Outlier detection with Scikit Learn

Z-score and Density-Based Spatial Clustering of Applications with Noise
Published on: 10 Sep 2018

Machine learning

How to set the global random_state in Scikit Learn

What to do if you keep forgetting to set the random_state?
Published on: 31 Aug 2018

Meetup

JUG Thüringen meetup - retrospective

My opinion about my presentation at a meetup in Erfurt, Germany.
Published on: 29 Aug 2018

Data Science

How to split a list inside a Dataframe cell into rows in Pandas

Step by step instructions to "explode" a list into DataFrame rows.
Published on: 27 Aug 2018

Data Science

Interactive plots in Jupyter Notebook

How to create a plot that supports zooming
Published on: 24 Aug 2018

Book review

[book review] James Whittaker's Little Book of the Future

Read this book if you believe we can use A.I. and IoT to build a bright future.
Published on: 22 Aug 2018

Data Science

Probability plot - visually compare probability distributions

How to visually check whether your sample is normally distributed?
Published on: 20 Aug 2018

Algorithms

Count unique elements of an infinite stream of objects

HyperLogLog - probabilistic counting algorithm
Published on: 19 Aug 2018

Scala

Live unit testing with sbt

Can I have the coolest Visual Studio feature in IntelliJ?
Published on: 18 Aug 2018

Data Science

Monte Carlo simulation in Python

How to make business decisions using the Monte Carlo simulation?
Published on: 17 Aug 2018

Data Science NLP

Word cloud from a Pandas data frame

Create a nice visualization of the most popular words in your data frame
Published on: 07 Aug 2018

Scala

Scala structural types with generics

A short example of defining a structural type which matches a generic class
Published on: 05 Aug 2018

Data Science

Visualize common elements of two datasets using NetworkX

How to use an undirected graph to visualize common elements of two Pandas data frames
Published on: 03 Aug 2018

Data Science Machine learning

How to load data from Google Drive to Pandas running in Google Colaboratory

How to import a CSV file from Google Drive into Google Colab
Published on: 14 Jul 2018

Data science Machine learning

Precision vs. recall - explanation

How to understand the difference between precision and recall?
Published on: 15 Jun 2018

Software engineering Scala

The cake pattern is a lie

Cake pattern was a terrible idea.
Published on: 12 Jun 2018

Software engineering

Can we make it more generic?

What can we learn from a horrible mistake made by a programmer who wanted to make the code more generic?
Published on: 06 Jun 2018

Software engineering Domain Driven Design Scala

[JUG Thüringen] Effortless Domain-Driven Design - The real Power of Scala

How to use some parts of Domain Driven Design to create maintainable code in Scala?
Published on: 11 May 2018

Conference Software craft

Buzzwords, buzzwords everywhere

Do we behave like a child in a toy store?
Published on: 07 Apr 2018

Learning

Can’t learn anything? You’re doing it wrong

Have you been trying to learn something for a few months? What to do when you keep learning but still don’t understand anything?
Published on: 15 Mar 2018

Software architecture

Re: “I Don’t Want To Maintain Their Code”

How can we facilitate knowledge sharing? Will easily accessible documentation foster cooperation?
Published on: 03 Mar 2018

Book review

Discipline Equals Freedom — Jocko Willink

A review of Jocko Willink’s book: “Discipline Equals Freedom.” Should you read it even if you don’t want to run a marathon?
Published on: 24 Feb 2018

Software engineering Git

Prevent accidental deployments on Friday

You feel you should not deploy your code on Fridays but nothing stops you. Can you prevent accidental deployments?
Published on: 18 Feb 2018

Software engineering Project management

Developers just wanna have fun

Software maintenance is painful because of hype driven development.
Published on: 22 Jan 2018

Software engineering Project management

Support for old browsers — is it necessary?

Do you think that every web page should support all existing browsers? How about all versions of those browsers?
Published on: 17 Jan 2018

Software engineering Scala Domain Driven Design

The beauty of properly used statically typed languages

The real power of programming in Scala is not in mimicking Haskell and overusing monads, but in taking advantage of its type system.
Published on: 14 Jan 2018

Software craft Project management

Extreme ownership and software engineering

What can a software engineer learn from the “Extreme ownership” book? How can it influence your daily work?
Published on: 11 Jan 2018

Software craft

Perpetually dysfunctional software

What happens when we release another beta version? Are users happy or angry? What if the reality is different than we think?
Published on: 26 Jul 2017

Software engineering TDD

4 reasons why TDD slows you down

It is easy to announce that TDD slows you down, but have you ever wondered why it happens? Is there anything you can do better?
Published on: 08 May 2017

Scala Akka

Always stop unused Akka actors

Akka actors do not magically disappear when you no longer need them.
Published on: 03 May 2017

Conference

Scalar 2017

Scalar Conference 2017 — everything I liked
Published on: 12 Apr 2017

Software engineering Recruiting

Reversing a binary tree and other great interview questions

We do not like being asked to write an algorithm on a whiteboard during job interviews, but is there a better way?
Published on: 19 Mar 2017

Software engineering Software craft

The importance of documenting things

What happens when someone asks you about your code and you cannot answer because you have no idea how it works? That happened to me… again.
Published on: 12 Mar 2017

Software engineering Conference

One thing can improve LambdaDays

One thing that can significantly improve LambdaDays.
Published on: 12 Feb 2017

Software engineering

Can we write cheaper software?

Can we write our web application in a way which saves our company money? Can we make our software cheaper?
Published on: 29 Jan 2017

Scala Meetup

A year of Poznan Scala User Group

How can a group created because of a tweet exist for over a year? Have we learned anything in a year? What are we going to do now?
Published on: 23 Jan 2017

Software engineering Software craft

Rage against unprofessionalism in software engineering

Today software engineers disappointed another person…
Published on: 15 Jan 2017