This blog post contains a summary of Andrew Ng’s advice regarding choosing the mini-batch size for gradient descent while training a deep learning model. Fortunately, this hint is not complicated, so the blog post is going to be extremely short ;)
Andrew Ng recommends not using mini-batches at all if the number of observations is smaller than 2,000; in that case, plain (full-batch) gradient descent is fine.
In all other cases, he suggests using a power of 2 as the mini-batch size, so the mini-batch should contain 64, 128, 256, 512, or 1024 elements.
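As a rough illustration, here is a minimal Python sketch of that rule of thumb. The function name and the `max_batch` parameter (standing in for the largest batch your hardware can hold) are my own, not part of Ng's advice:

```python
def choose_batch_size(num_observations: int, max_batch: int = 512) -> int:
    """Pick a mini-batch size following the rule of thumb described above."""
    if num_observations < 2000:
        # Small dataset: plain (full-batch) gradient descent is fine.
        return num_observations
    # Otherwise pick a power of 2 in the 64-1024 range, no larger than
    # what fits in memory (represented here by max_batch).
    size = 64
    while size * 2 <= min(max_batch, 1024):
        size *= 2
    return size


print(choose_batch_size(1500))    # 1500 -> full batch
print(choose_batch_size(50000))   # 512
```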
The most important part of the advice is making sure that the mini-batch fits in CPU/GPU memory! When it does, the hardware can make full use of its memory hierarchy (caches and fast device memory), which significantly reduces the time required to train a model.
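If you work with PyTorch, one pragmatic way to check this is to attempt a single forward/backward pass and watch for an out-of-memory error. The helper below is only a sketch under that assumption, not something from Ng's advice:

```python
import torch


def batch_fits_in_memory(model, batch_size, input_shape, device="cuda"):
    """Run one forward/backward pass with random data; an out-of-memory
    error means the mini-batch is too large for the device."""
    model = model.to(device)
    try:
        x = torch.randn(batch_size, *input_shape, device=device)
        model(x).sum().backward()
        return True
    except RuntimeError as err:
        if "out of memory" in str(err).lower():
            torch.cuda.empty_cache()
            return False
        raise
    finally:
        # Clear gradients so the probe does not affect real training.
        model.zero_grad(set_to_none=True)
```

You can call it with increasing powers of 2 (64, 128, 256, ...) and keep the largest batch size that still returns `True`.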