MapReduce notes

MapReduce’s model requires a programmer to specify two functions for computation: Map and Reduce. The Map function takes an input pair and produces a set of intermediate key/value pairs. The MapReduce system then groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function accepts an intermediate key and the set of values generated for that key, and merges those values together, possibly forming a smaller set of values.
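As a minimal sketch of the model, the classic word-count example can be written as a pair of user-supplied functions plus an in-memory stand-in for the system's grouping step (the function names and driver here are illustrative, not part of any real MapReduce API):

```python
from collections import defaultdict

def map_fn(_doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the input.
    for word in contents.split():
        yield word, 1

def reduce_fn(_word, counts):
    # Reduce: merge all values for one intermediate key into a smaller result.
    return sum(counts)

def run_mapreduce(inputs):
    # Stand-in for the system's shuffle phase: group intermediate
    # values by their intermediate key before calling Reduce.
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

print(run_mapreduce([("d1", "a b a"), ("d2", "b c")]))
# {'a': 2, 'b': 2, 'c': 1}
```

The point of the sketch is the division of labor: the programmer writes only `map_fn` and `reduce_fn`; grouping, distribution, and fault tolerance belong to the system.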
Google’s MapReduce implementation was targeted to run on large clusters, over data that is already available on disk. Map invocations are distributed across multiple machines by splitting the input data into several splits, which can then be processed in parallel by different machines. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function specified by the user.
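A partitioning function of this kind can be sketched as hashing the intermediate key modulo R (a hash-mod-R partitioner is the typical default; the stable CRC32 hash and the value of R below are illustrative choices, not details from the source):

```python
import zlib

R = 4  # number of reduce tasks, chosen here for illustration

def partition(key: str, r: int = R) -> int:
    # Use a stable hash so the same intermediate key always maps to
    # the same reduce task, regardless of which machine computes it.
    return zlib.crc32(key.encode("utf-8")) % r
```

The essential property is determinism: every map task, on any machine, must send all values for a given key to the same one of the R reduce tasks, so that the reduce function sees the complete set of values for that key.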
Fault tolerance is fundamental in MapReduce, since it works over large amounts of data distributed across hundreds or thousands of machines. The master checks for worker failures using periodic pings, and tasks on a failed worker can be rescheduled to run on another machine. Because a map task stores its intermediate results on its machine's local disk, a map task must be re-run on another machine if that machine fails. Completed reduce tasks do not need to be re-executed, because their output is stored in Google’s global file system rather than on local disk.
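The ping-and-reschedule logic can be illustrated with a small master-side bookkeeping sketch (the class, method names, and timeout are invented for illustration; this is not Google's implementation):

```python
class Master:
    """Toy master: track worker pings and re-run map tasks on failure."""

    def __init__(self, timeout: float = 10.0):
        self.timeout = timeout
        self.last_ping = {}     # worker id -> timestamp of last ping
        self.assignments = {}   # worker id -> map task ids assigned to it

    def heartbeat(self, worker: str, now: float) -> None:
        # Workers ping periodically; record when each was last heard from.
        self.last_ping[worker] = now

    def assign(self, worker: str, task: str) -> None:
        self.assignments.setdefault(worker, []).append(task)

    def reschedule(self, now: float) -> list:
        # A worker that has not pinged within the timeout is presumed dead.
        # Its map tasks must be re-run elsewhere, because their intermediate
        # output lived on the failed machine's local disk and is now lost.
        dead = [w for w, t in self.last_ping.items() if now - t > self.timeout]
        to_rerun = []
        for w in dead:
            to_rerun.extend(self.assignments.pop(w, []))
            del self.last_ping[w]
        return to_rerun
```

Note the asymmetry the sketch captures: only map tasks on dead workers are re-queued, since reduce output already sits in the global file system and survives the failure.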