In this article, I am going to explain the original MapReduce paper “MapReduce: Simplified Data Processing on Large Clusters,” published in 2004 by Jeffrey Dean and Sanjay Ghemawat.
First, I will describe the MapReduce concept and explain how it works. After that, I will focus on the parts that make MapReduce fast and fault-tolerant. In the end, I will show how the concepts from the original MapReduce paper apply to the current implementation of Apache Spark.
What is MapReduce?
MapReduce is a programming model that simplifies the fast processing of large data sets by providing an abstraction over the underlying complexity of handling partitioning of input data, scheduling the execution across multiple machines, handling failures, and managing communication between workers.
The authors of the paper realized that most data-intensive computations involve transforming the input data to a different representation (mapping) and subsequent aggregation of the data derived in the transformation step (reducing).
That observation allowed them to define a computation model that requires users to specify only two functions: map and reduce. The MapReduce library automatically handles everything else.
How does MapReduce work?
The MapReduce library takes two functions from the user. The map function takes key/value pairs and produces a set of output key/value pairs:
map(k1, v1) -> list(k2, v2)
MapReduce uses the output of the map function as a set of intermediate key/value pairs. The library automatically groups all intermediate values associated with the same key and passes them to the reduce function.
The reduce function gets a key and a set of values as the input. That function is supposed to merge the values to form a smaller set of values:
reduce(k2, list(v2) -> list(v2)
Note that the input keys and values are not from the same domain as the output of the map function (the intermediate key/values), but the reduce function produces output that belongs to the same domain as its input.
What does happen when you run it?
When programmers use the original MapReduce library, they specify the mappings and reducing functions. In addition to that, they have to specify the partitioning parameters: the number of partitions for mappings (the number of M splits) and reducing (the number of R splits). Optionally, the programmers may also provide a custom partitioning function.
First, the MapReduce library splits the input files into a given number of partitions and copies the functions to cluster machines.
The cluster consists of worker nodes and one master node, which coordinates the execution of the program. A worker gets a map task from the master node and reads the corresponding input partition as a set of key/value pairs.
The worker node passes each pair to the map function and buffers the intermediate key/value pairs in memory. Periodically, the worker writes the buffered data to the local disk.
The output location of the intermediate key/value pairs is partitioned into R regions that correspond to the number of reducing nodes that will be later used to aggregate the data. The worker node reports the location of the intermediate pairs back to the master node, who will forward the location to reduce workers.
When the mapping phase is finished, the master node starts the reduce workers. The reduce workers read the buffered data from map workers using remote procedure calls. Before running the reduce phase, the reduce workers sort the intermediate keys to group together all occurrences of the same key.
The reduce workers pass the key and value iterator to the reduce function, which produces the final result of aggregating all values associated with the given key. The result of the reduce function is appended to the output file that corresponds to the reduce partition.
Every reduce worker produces one file, so the output is split into the number of files equal to the number of reduce partitions.
Why is MapReduce fast?
MapReduce speeds up data processing because most such computations are straightforward but operate on vast amounts of data. The speed-up is achieved by distributing that simple computation over a large cluster of machines.
Data partitioning - number of partitions
The MapReduce library splits the input data into a given number of partitions. Map workers process those partitions in parallel and store the results on their local disc. That approach allows programmers to process amounts of data larger than the capability of a single worker.
Additionally, by creating more partitions than the worker nodes, we help the library to optimally use the available resources (all nodes are working all the time) and reduce the time required for reprocessing in case of failures (we have to reprocess smaller tasks).
Data partitioning - load-balancing
When partitioning data for the reduce phase, it is crucial to produce splits that can be processed at a similar time. We should not have partitions that contain significantly more data than the others because such skewed partitions slow down the computation and are problematic when we have to repeat a task due to a worker failure.
Of course, this requirement is use-case specific. The output of a MapReduce job is partitioned into the same number of files as the number of reducing phase partitions. In some cases, it may make sense to create skewed partitions on purpose to have data stored in a more user-friendly way.
In the paper, the authors inform us that they store the input files in the Google File System, which is conceptually similar to HDFS. Because of that, copies of the input data are distributed across all worker nodes, and multiple workers may process the same data without sending the files across the network.
The network bandwidth is the biggest bottleneck of the data processing clusters. Because of that, the master node considers the data location when scheduling the map tasks and attempts to use the machine that has a replica of the required data. If it is not possible, it attempts to copy the data from the nearest worker (for example, a worker connected to the same network switch).
There are cases when the reduce function is both commutative and associative. That allows us to reduce the amount of data transferred over the network by applying the reduce function separately to all outputs of the map function, transferring partially reduced data to the reduce workers, and finishing reducing by using the reduce function again.
If a worker node has a damaged disc, it may get extremely slow. This faulty node may slow down the whole computation. To avoid such problems, MapReduce schedules multiple executions of the same task if there are idle worker nodes. The task is marked as finished when any node completes it.
This feature is enabled when the whole MapReduce job is close to completion, so it does not cause duplicate work for the whole cluster.
Speed-up subsequent processing
Finally, the subsequent usage of the output is speed-up by producing sorted output files. Having sorted outputs helps to perform efficient lookups by key.
Why is MapReduce fault-tolerant?
In this section, I am going to describe the methods used by MapReduce authors to ensure that it can successfully finish data processing even if some of the worker nodes fail.
First and foremost, the MapReduce library allows fault-tolerant processing of large data sets by hiding the complexity of data distribution and load balancing.
It automatically takes care of partitioning the data, sharing the result of the map phase with the workers running the reduce function, and repeating only the lost computations in case of worker failures.
Because the programmers don’t need to worry about the details of such complex tasks, the whole application is less likely to fail because of a programmer error.
Tracking progress and reprocessing tasks
The master node keeps the state of every map and reduce task in an internal data structure. It also pings the worker nodes periodically. If no response is received from the worker, the master node marks it as failed.
All of the map tasks completed by that failed worker (and the task it was processing at the time of the failure) have to be reprocessed because the results of map tasks are stored in the worker’s local memory, which may be unavailable or corrupted after the failure.
Fortunately, reduce tasks do not need to be reprocessed because their output is stored in an external shared file system.
By design, both map and reduce function should produce deterministic results that depend only on the input data. Because of that, MapReduce can reprocess any failed tasks and still produce the correct output.
The ability to produce deterministic output relies on the atomic commits feature. When a map task produces the intermediate key/value pairs and stores them in the temporary file, the file is not shared with anyone until the task finishes. When the map task is completed, the worker node sends the file locations to the master node (one file per reduce phase partition). The master node stores the file locations and shares them with the reduce nodes.
In the case of reduce tasks, the reduce worker renames the temporary file to the expected final output name after finishing the task.
This feature is called atomic commit because nobody knows about the files until they are ready to be consumed.
Dealing with corrupted data
The original MapReduce implementation had an interesting feature that allowed the library to skip corrupted data automatically.
The MapReduce library stores the sequence number of the processed key/value pair. If the function code fails, it reports the sequence number to the master node. When the master node gets more than one failure report with the same sequence number, it marks that key/value pair as corrupted, and it gets skipped during the re-execution of the failed task.
The authors of the paper described the counter feature that allows the programmers to count occurrences of various events. The programmers may use that feature for sanity checks by counting the number of elements of a particular type. In addition to that, the MapReduce library automatically counts the number of input and output key/value pairs.
This feature does not count duplicate event occurrences when a task is re-executed because the local counter values are aggregated only if the task has completed successfully.
Is MapReduce similar to Apache Spark?
It seems that we made huge progress between the original MapReduce and the current implementation of Apache Spark, but those tools still have a lot in common.
First of all, in both cases, the most time-consuming operation is shuffling data between worker nodes. In Apache Spark, we minimize shuffling by minimizing the number of operations that cause the creation of a new stage (data within a single stage is processed without shuffling).
Second, we still have a partitioning problem. It is still a good practice to have significantly more partitions than the worker nodes, so fair load-balancing is possible, and there are no idle nodes.
In the case of MapReduce, it was possible to implement own data readers to retrieve only the relevant data and reduce the size of input files. In Apache Spark, we tend to use columnar files to get only the data we need, and we attempt to push filtering predicates to the data source to avoid retrieving redundant information.
The Combiner function from MapReduce was implemented in Apache Spark as the
reduceByKey function, for precisely the same purpose. In both tools, we wanted to reduce the amount of data transferred between workers.
For sanity checks and basic data validation, Spark offers Accumulators, which work similarly to MapReduce counters.
It is important to remember that Spark distinguishes between transformation and action functions. Accumulators work correctly only when we use them inside action functions. An accumulator used in transformations may count duplicate events if the transformation gets reprocessed.
Does MapReduce still matter, or is it just an interesting historical fact? In my opinion, it is still the basis of many tools we use today. After all, a Spark job may be represented as a series of MapReduce jobs running in a sequence.
It does not mean that the technology has not evolved. Apache Spark still has the RDD API, but we tend to use the more advanced Datasets and DataFrames because they handle code optimization.
On the other hand, we are still limited by the same constraints as MapReduce users. For example, repartitioning data and shuffling it between nodes is still the most significant bottleneck.