Category Archives: Paper Summary

Topic: tunable pattern-based parallelization

[title: confidential]

This paper presents a tool that detects parallelizable locations in sequential software and transforms them into parallel code. First, the authors establish that the work is feasible and beneficial through a pilot study on 31 Java programs totaling about 300,000 lines of code. On average, one parallelizable location occurs every 60 lines of code, and parallelizing these locations with Java's Future construct yields an average speedup of 1.8 (the maximum speedup attainable with Future is limited to 2.0).
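The 2.0 ceiling presumably comes from each transformation overlapping at most one asynchronous task with the calling thread. Here is a minimal sketch of that style of parallelization using Java's Future (my own illustration, not code from the paper; the workload methods are placeholders):

```java
import java.util.concurrent.*;

public class FutureSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Submit one independent computation asynchronously.
        Future<Long> partA = pool.submit(() -> expensiveComputationA());

        // The caller overlaps its own work with the pending Future, so at
        // most two computations run concurrently (hence the 2.0 speedup cap).
        long partB = expensiveComputationB();

        long result = partA.get() + partB;  // wait for the asynchronous part
        System.out.println("result = " + result);
        pool.shutdown();
    }

    // Placeholder workloads standing in for the independent regions
    // the tool would detect in real sequential code.
    static long expensiveComputationA() { return slowSum(0, 50_000_000L); }
    static long expensiveComputationB() { return slowSum(50_000_000L, 100_000_000L); }

    static long slowSum(long from, long to) {
        long s = 0;
        for (long i = from; i < to; i++) s += i;
        return s;
    }
}
```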

Because Future does not scale well on multicores, the actual tool instead transforms sequential code into two common parallel patterns: Master/Worker and Pipeline. Tunable parameters are added to the descriptions of these patterns; they determine how workers or stages are combined or split and how many threads are assigned to each worker or stage. All tuning information is stored in a configuration file. The tool works in four steps: 1) build a code-profiling graph that describes data dependencies and control flow; 2) analyze the graph statically and dynamically to identify subgraphs that meet the conditions for Master/Worker or Pipeline; 3) construct the corresponding tunable parallel architecture from the matched subgraphs; 4) perform the code transformation.
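To make the tunable parameters concrete, here is a minimal Master/Worker sketch in Java (my own illustration, not the tool's generated code); the worker-thread count is read from a system property to mimic the externalized configuration the paper describes:

```java
import java.util.*;
import java.util.concurrent.*;

public class MasterWorkerSketch {
    public static void main(String[] args) throws Exception {
        // Tunable parameter: in the paper's approach this would come
        // from the generated configuration file, not a hard-coded value.
        int workerThreads = Integer.getInteger("workers", 4);

        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        List<Future<Integer>> results = new ArrayList<>();

        // Master: split the work into independent tasks and hand them out.
        for (int chunk = 0; chunk < 16; chunk++) {
            final int c = chunk;
            results.add(workers.submit(() -> process(c)));
        }

        // Master: collect and combine the worker results.
        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        System.out.println("total = " + total);
        workers.shutdown();
    }

    // Stand-in for a parallelizable region detected by the tool.
    static int process(int chunk) { return chunk * chunk; }
}
```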

The evaluation supports two conclusions. First, on average 99.98% of the parallelizable locations are detected, and 65.56% of them yield a speedup. Second, the tool can even outperform manual refactoring by a skilled, experienced engineer by 4%. Different settings of the tunable parameters also lead to different performance, which demonstrates their value for performance optimization on different target platforms. The paper advances the authors' previous work mainly in 1) automatic detection and transformation and 2) higher detection accuracy.

Regarding related work, some approaches detect parallelizable code and exchange it for a parallel counterpart; since an alternative parallel implementation must already exist, these methods cannot perform code refactoring. Other work focuses only on loop and pipeline parallelism, under the assumption that most of the runtime is spent in loops, whereas this paper covers different forms of parallelism. Another important aspect is that the authors detect parameters that influence runtime behavior, which lets the generated parallel code adapt to current and future multicore hardware. Finally, some commercial tools locate the regions with the highest runtime share by sampling or instrumentation; these regions may bear the highest parallel potential, but such tools neither determine whether parallelization is possible given the data and control dependencies, nor do they help in deciding how to parallelize the location.

For automatic code transformation, the authors use the Tunable Architecture Description Language (TADL) to represent the detected tunable parallel patterns, and a source-to-source compiler then performs the transformation based on that description. Related techniques include parallel languages (or language extensions) and parallelizing compilers. A parallel language still cannot tell the programmer where and how to parallelize, while a parallelizing compiler's parallel potential is bound to parallel loops with a fixed number of iterations and no dependencies.
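The Pipeline pattern can be parameterized in the same spirit. Below is a hypothetical two-stage pipeline in plain Java with a tunable thread count per stage; it is not TADL output or the tool's generated code, just an illustration of the kind of structure such a description encodes:

```java
import java.util.concurrent.*;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // Tunable parameters: threads per stage (would come from a config file).
        int stage1Threads = Integer.getInteger("stage1", 2);
        int stage2Threads = Integer.getInteger("stage2", 2);

        BlockingQueue<Integer> between = new LinkedBlockingQueue<>();
        ExecutorService stage1 = Executors.newFixedThreadPool(stage1Threads);
        ExecutorService stage2 = Executors.newFixedThreadPool(stage2Threads);

        // Stage 2: consume items produced by stage 1.
        for (int t = 0; t < stage2Threads; t++) {
            stage2.submit(() -> {
                try {
                    while (true) {
                        int item = between.take();
                        if (item < 0) break;  // poison pill terminates the worker
                        System.out.println("stage2: " + item * item);
                    }
                } catch (InterruptedException ignored) { }
            });
        }

        // Stage 1: produce items.
        for (int i = 0; i < 10; i++) {
            final int item = i;
            stage1.submit(() -> between.add(item + 1));
        }
        stage1.shutdown();
        stage1.awaitTermination(1, TimeUnit.MINUTES);

        // One poison pill per stage-2 worker, then shut down.
        for (int t = 0; t < stage2Threads; t++) between.add(-1);
        stage2.shutdown();
    }
}
```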

LLC management on integrated CPU and GPU architecture

Previous last-level cache (LLC) management policies do not apply to heterogeneous architectures that combine CPU and GPU cores, because GPU applications have two important characteristics: 1) they can dominate the LLC through their high access rate, and 2) they can tolerate memory access latency through high thread-level parallelism (TLP). Based on these facts, the following papers propose new LLC management policies.

[1] “TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture”, HPCA 2012, Georgia Institute of Technology; Jaekyu Lee and Hyesoon Kim.
[2] “Managing Shared Last-Level Cache in a Heterogeneous Multicore Processor”, PACT 2013, University of Minnesota; Vineeth Mekkat, Anup Holey, Pen-Chung Yew, and Antonia Zhai.

The first paper claims to be the first to address the cache-sharing problem in a CPU-GPU heterogeneous architecture. The proposed method, TAP (TLP-aware cache management policy for CPU-GPU heterogeneous architectures), outperforms LRU by 12%. TAP uses a technique called core sampling to identify the cache sensitivity of a GPU application: two cores running the GPU application are given different underlying cache policies, and if the resulting performance difference is insignificant, the GPU application is considered cache-insensitive. In that case the cache is partitioned and the GPU application is allocated only a limited number of ways; the other cores follow this decision for one time period. In addition, GPU blocks are not promoted, and GPU blocks are replaced first when both CPU and GPU blocks are replaceable. A second technique, Cache Block Lifetime Normalization, balances the lifetimes of CPU and GPU blocks in the LLC by measuring the ratio of their access rates; a GPU block's lifetime is reduced by dividing it by this ratio.
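A simplified sketch of the two TAP decisions described above, written as plain Java helpers (my own approximation of the idea, not the paper's simulator code; thresholds and ratios are illustrative):

```java
public class TapSketch {
    // Core sampling: the GPU workload runs on two sampled cores with
    // opposing cache policies and their performance is compared.
    static boolean gpuIsCacheSensitive(double perfPolicyA, double perfPolicyB,
                                       double threshold) {
        double diff = Math.abs(perfPolicyA - perfPolicyB) / perfPolicyA;
        return diff > threshold;  // a small difference => cache-insensitive
    }

    // Cache Block Lifetime Normalization: scale down the lifetime of a
    // GPU block by the measured GPU-to-CPU access ratio so that a
    // high-rate GPU stream does not keep its blocks alive longer.
    static long normalizedGpuLifetime(long rawLifetime, long gpuAccesses, long cpuAccesses) {
        long ratio = Math.max(1, gpuAccesses / Math.max(1, cpuAccesses));
        return rawLifetime / ratio;
    }

    public static void main(String[] args) {
        System.out.println(gpuIsCacheSensitive(100.0, 101.0, 0.05)); // false: insensitive
        System.out.println(normalizedGpuLifetime(64, 8_000, 1_000)); // 64 / 8 = 8
    }
}
```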

The second paper tries to beat the first. The proposed method, HeLM (Heterogeneous LLC Management), outperforms LRU by 12.5% and TAP by 5.6%. The authors argue that TAP has two problems: 1) with core sampling, GPU dead blocks can occupy up to 40% of the LLC, which degrades cache-sensitive CPU applications; 2) core sampling makes a coarse-grained decision about a GPU application's cache sensitivity and is therefore slow to adapt to runtime variations in the application's behavior.

HeLM uses TLP as a runtime metric to determine the sensitivity of CPU and GPU applications. If bypassing the LLC for the GPU application causes no significant performance difference for the GPU application itself, that GPU application is cache-insensitive; if bypassing it causes a significant performance difference for a CPU application, that CPU application is cache-sensitive. Based on this combined information and the degree of the GPU application's sensitivity, a bypass ratio for GPU accesses is determined for each time period; the algorithm that computes this ratio/threshold is called the Threshold Selection Algorithm (TSA). All of these mechanisms rely on set dueling, which applies two opposing techniques to two distinct groups of sets and identifies the application's characteristics from the performance difference between them; prior work has shown that sampling a small number of LLC sets can indicate cache access behavior with high accuracy. The experiments also show that HeLM scales with the number of CPU cores in a heterogeneous multicore.
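As a rough illustration of the set-dueling idea underlying these mechanisms (this is a generic saturating-counter sketch of my own, not HeLM's actual Threshold Selection Algorithm): two groups of sample sets use opposing policies, here bypassing versus caching GPU requests, and a counter driven by CPU misses in the sampled sets decides which policy the follower sets adopt.

```java
public class SetDuelingSketch {
    private int selector = 512;          // saturating counter, midpoint of [0, 1024)
    private static final int MAX = 1024;

    // Called on a CPU miss in one of the sampled sets.
    void onSampledCpuMiss(boolean inBypassGroup) {
        if (inBypassGroup) {
            selector = Math.max(0, selector - 1);        // the bypass group missed
        } else {
            selector = Math.min(MAX - 1, selector + 1);  // the no-bypass group missed
        }
    }

    // Follower sets bypass GPU requests when the bypass group is winning
    // (i.e., the no-bypass sample sets are accumulating more CPU misses).
    boolean shouldBypassGpuRequests() {
        return selector > MAX / 2;
    }

    public static void main(String[] args) {
        SetDuelingSketch d = new SetDuelingSketch();
        for (int i = 0; i < 100; i++) d.onSampledCpuMiss(false); // no-bypass sets miss often
        System.out.println(d.shouldBypassGpuRequests());         // true
    }
}
```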

Performance Implications of SSDs in Virtualized Hadoop Clusters

Virtualization in a Hadoop cluster causes a fragmented I/O pattern (more seeks and smaller I/O block sizes), which creates an I/O bottleneck in HDD-based clusters. SSDs, by contrast, suffer only small degradation because they have no mechanical parts and thus no seek overhead; however, SSDs favor large I/O block sizes that exploit the parallelism of their internal structure, so a slight degradation remains. Regarding VM deployment, performance in an HDD-based environment worsens as the number of VMs grows, whereas an SSD-based environment is largely insensitive to the number of VMs because there is no I/O bottleneck and CPU utilization stays high thanks to the added parallelism.