LLC management on integrated CPU and GPU architecture

Earlier last-level cache (LLC) management policies do not carry over well to heterogeneous architectures that integrate CPU and GPU cores, because GPU applications have two important features: 1. they can dominate the LLC through their high access rate; 2. they can tolerate memory access latency through high thread-level parallelism (TLP). Based on these observations, the following papers propose new LLC management policies.

[1] “TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture”, HPCA 2012, Georgia Institute of Technology, Jaekyu Lee and Hyesoon Kim
[2] “Managing Shared Last-Level Cache in a Heterogeneous Multicore Processor”, PACT 2013, University of Minnesota, Vineeth Mekkat (PhD), Anup Holey, Pen-Chung Yew, and Antonia Zhai

The first paper claims to be the first to address the cache-sharing problem in a CPU-GPU heterogeneous architecture. The authors propose TAP (TLP-aware cache management policy for CPU-GPU heterogeneous architectures), which outperforms LRU by 12%. TAP uses a technique called core sampling to identify the cache sensitivity of a GPU application: it picks two cores that run the GPU application under different underlying cache policies. If the performance difference between them is not significant, the GPU application is considered cache insensitive. In that case the cache is partitioned and only a limited number of ways are allocated to the GPU application; the other cores follow this decision for one time period. In addition, GPU blocks are not promoted, and GPGPU blocks are replaced first when both CPU and GPGPU blocks are replaceable. The second technique is Cache Block Lifetime Normalization, which balances the lifetime of CPU and GPU blocks in the LLC by measuring the ratio of their access rates; the lifetime of a GPU block is reduced by dividing it by this ratio.
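The two TAP mechanisms described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's implementation: the function names, the 5% significance threshold, and the choice to clamp the ratio at 1 are all assumptions made for this example.

```python
def is_cache_sensitive(perf_policy_a, perf_policy_b, threshold=0.05):
    """Core-sampling decision (sketch).

    perf_policy_a / perf_policy_b: performance of the two sampled GPU cores,
    each running under a different LLC policy.
    threshold: hypothetical cutoff for a "significant" relative difference.
    """
    top = max(perf_policy_a, perf_policy_b)
    if top <= 0:
        return False  # no useful measurement, treat as insensitive
    rel_diff = abs(perf_policy_a - perf_policy_b) / top
    return rel_diff > threshold


def normalized_gpu_lifetime(base_lifetime, gpu_accesses, cpu_accesses):
    """Cache Block Lifetime Normalization (sketch).

    Divides the GPU block lifetime by the GPU-to-CPU access ratio.
    Clamping the ratio at 1 (never lengthening the lifetime) is an
    assumption of this example.
    """
    ratio = max(1.0, gpu_accesses / max(1, cpu_accesses))
    return base_lifetime / ratio


# Two sampled cores perform almost the same, so the GPU application
# looks cache insensitive and would get only a limited number of ways.
print(is_cache_sensitive(1.00, 0.98))
# The GPU accesses the LLC 4x as often as the CPU, so its block
# lifetime is cut to a quarter.
print(normalized_gpu_lifetime(16, gpu_accesses=400, cpu_accesses=100))
```

With these made-up numbers, the first call reports an insensitive application and the second shortens a 16-unit lifetime to 4 units.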

The second paper sets out to improve on the first. The authors propose HeLM (Heterogeneous LLC Management), which outperforms LRU by 12.5% and TAP by 5.6%. They argue that TAP has two problems: 1. Core sampling leaves the LLC occupied by dead GPU blocks (the paper cites 40%), which degrades cache-sensitive CPU applications. 2. Core sampling decides GPU cache sensitivity at a coarse granularity, so it is slow to adapt to runtime variations in the application’s behavior.

HeLM uses TLP as a runtime metric to determine the cache sensitivity of CPU and GPU applications. If letting the GPU application bypass the LLC makes no significant difference to the GPU application’s performance, the GPU application is cache insensitive. If it makes a significant difference to a CPU application’s performance, that CPU application is cache sensitive. Combining these two results with the degree of the GPU application’s sensitivity, HeLM determines the GPU application’s bypassing ratio for each time period. The algorithm that computes this ratio (threshold) is called the Threshold Selection Algorithm (TSA). All of the above relies on set dueling, which applies two opposing techniques to two distinct groups of sets and identifies the characteristics of the application from the performance difference between them. The paper cites theoretical results showing that sampling a small number of sets in the LLC can indicate cache access behavior with high accuracy. The experiments also show that HeLM scales with the number of CPU cores in a heterogeneous multicore.
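The set-dueling classification described above can be sketched as follows. This is a toy illustration, not HeLM’s actual Threshold Selection Algorithm: the function names, the 5% tolerance, and the three bypass levels are placeholders invented for this example. The inputs are performance measurements taken from two groups of sampled LLC sets, one where GPU requests bypass the LLC and one where they are inserted.

```python
def relative_diff(a, b):
    """Relative difference between two performance measurements."""
    top = max(abs(a), abs(b))
    return 0.0 if top == 0 else abs(a - b) / top


def classify(gpu_bypass, gpu_insert, cpu_bypass, cpu_insert, tol=0.05):
    """Set-dueling classification (sketch).

    *_bypass: performance measured on sets where GPU requests bypass the LLC.
    *_insert: performance measured on sets where GPU requests are inserted.
    tol: hypothetical cutoff for a "significant" difference.
    Returns (gpu_insensitive, cpu_sensitive).
    """
    gpu_insensitive = relative_diff(gpu_bypass, gpu_insert) <= tol
    cpu_sensitive = relative_diff(cpu_bypass, cpu_insert) > tol
    return gpu_insensitive, cpu_sensitive


def toy_bypass_ratio(gpu_insensitive, cpu_sensitive):
    """Placeholder mapping to a bypass ratio for the next period.

    The three levels are arbitrary and stand in for the real TSA.
    """
    if gpu_insensitive and cpu_sensitive:
        return 1.0  # GPU does not need the LLC, CPU does: bypass everything
    if gpu_insensitive:
        return 0.5  # nobody benefits much: bypass part of the GPU traffic
    return 0.0      # GPU benefits from the LLC: do not bypass


flags = classify(gpu_bypass=1.00, gpu_insert=0.99,
                 cpu_bypass=1.20, cpu_insert=0.90)
print(flags, toy_bypass_ratio(*flags))
```

In this made-up case the GPU application barely notices bypassing while the CPU application speeds up noticeably, so the sketch chooses to bypass all GPU traffic for the next period.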

