Affiliations: [a] Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, Sichuan, China | [b] University of Chinese Academy of Sciences, Beijing, China | [c] Key Laboratory of Advanced Manufacturing Technology, Ministry of Education, Guizhou University, Guiyang, Guizhou, China
Abstract: Data skew in parallel joins results in poor load balancing which can lead to significantly varying execution times for the reducers in MapReduce. The performance of join operation is severely degraded in the presence of heavy skew in the datasets to be joined. Previous work mainly focuses on either input or output load imbalance among reducers, which is ineffective for load balancing. In this paper, we present a new data skew handling method based on Cluster Cost Partitioning (CCP) for optimizing parallel joins in MapReduce. A new cost model which considers the properties of both input and output is defined to estimate the cost of the parallel join. CCP employs clusters instead of join keys from input relations to create join matrix. Using the cost model, CCP identifies and splits heavy cells in the cluster join matrix. Then CCP assigns a set of non-heavy cells to reducers for join load-balancing. For different applications, the input and output weight values in the cost model could be dynamically adjusted to depict the join costs more precisely. The experimental results demonstrate that CCP achieves a more accurate load balancing result among reducers.