Author Archive: jianchenshan

Use sudo without password

When you do experiments in your own computer or lab, security is not a big issue, but every time you use sudo commmand, the password is needed; or if you want to run a script which would run remote sudo command in remote computer, under that circumstance, the script cannot be runned automatically since user would be promoted to input the password, which is quite inconvenient. So let’s make your life easier.

Do following things then you can run sudo command without password:

sudo visudo

insert this line to the file


That’s it!


Understanding the cpu id in multi-core system with hyperthreading

When you do experiment in multi-core system with Hyperthreading enabled, you may wish to set the cpu affinity to get different settings. Then, the first thing you need to know is the meaning of cpu id (here cpu id is actually the id of logic core) because you need to know which logical core are in the same cpu or which logic core are sharing the same physical core. We can get all these infomation from the /proc/cpuinfo. But we only need some relevant infomation to understand the cpu id, they are processor #, core id and physical id.

The processor # is the cpu id, each logic core has an unique processor #. The core id is the physical core id, each physical core has an unique core id. The logic cores that have same core id would share the same physical core. The physical id refer to the socket id, each socket has an unique physical id. The logic cores that have same physical id would be in the same cpu.

The command to retrive these data is: egrep “(( id|processo).*:|^ *$)” /proc/cpuinfo. Following is the sample output:

processor: 0
physical id: 0
core id: 0

processor: 1
physical id: 1
core id: 0

processor: 2
physical id: 0
core id: 1

Based on these infomation, you can understand the cpu id numbering. I draw a figure to show the relationship of cpu id in my system which use Intel(R) Xeon(R) CPU E5-2665: two sockets(cpus), each has 8 cores, each core can emulate 2 logic cores.



BTW, when you want to check each logic core’s performance like cpu usage, you can easily use top command, after type top you then press 1, here is a example:



Hijack the divide_error (Interrupt 0) exception handler

My technique for hijacking the divide_error interrupt 0 follows these steps:

1. Obtain original IDT pointer from a specific register to retrieve the address and size of original IDT.
2. Create new IDT by allocating one page and memory copy from original IDT.
3. In new IDT, modify the divide_error entry with the new address that point to my assembly handler.
4. My assembly handler would call my C handler first and then just simply jump to the original assembly handler.
5. The address of original assembly handler is obtained in and is hardcoded in the source code.
6. Create new IDT pointer based on new IDT.
7. Active new IDT by loading new IDT pointer to that specific register.
8. Recover by loading original IDT pointer to that specific register.

All these steps are implemented in my module called hook. Loading my module would hijack the IDT and removing my module would recover the IDT and report the total number of divide error interrupts handled during hijacking. And my C handler would just simply maintain and print out a counter each time it is invoked. I write a C program called float to generate divide error interrupt. It looks like:

int a,b;
a = 1;
b = 0;

All source code could be found in Appendix at last. Moreover, the following snapshot would demonstrate my work, which is tested in Linux kernel 3.12.6.



File: hook.c

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/tty.h>
#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <asm/desc.h>

#define DIVIDE_ERROR 0x00

/* global varaible */
char msg[200];
char *str = "test";
struct desc_ptr newidtr, oldidtr,tempidtr;
gate_desc *newidt, *oldidt, *tempidt;
int counter = 0;
unsigned long old_stub = 0xc15d9c64;
struct tty_struct *my_tty;

/* global function */
extern asmlinkage void new_stub(void);

/* print message to console */
int write_console (char *str)
         struct tty_struct *my_tty;
         if((my_tty=current->signal->tty) != NULL)
                ((my_tty->driver->ops->write) (my_tty,str,strlen(str)));
                return 0;
         else return -1;

/* active idt_table by loading new idt pointer to the register */
static void load_IDTR(void *addr)
	asm volatile("lidt %0"::"m"(*(unsigned short *)addr));

/* my C handler */
void my_func(void)
	/* add the counter and send messge to console */
	sprintf(msg, "Counter = %d \r\n", ++counter);

/* my Assembly handler */
void my_dummy(void)
        __asm__ (
        ".globl new_stub    \n\t"
        ".align 4, 0x90     \n\t"
        "new_stub:	    \n\t"
        "pushfl	            \n\t"
        "pushal	            \n\t"
        "call my_func 	    \n\t"
        "popal	            \n\t"
        "popfl	            \n\t"
        "jmp *old_stub      \n\t"
int __init hook_init(void){

	/* message */
	write_console("Jianchen hijacked interrupt_0\r\n");

	/* initialize tty for console print */
	my_tty = current->signal->tty;

	/* create new idt_table copied from old one */
	oldidt = (gate_desc *)oldidtr.address;
	newidtr.address = __get_free_page(GFP_KERNEL);
		return -1;
	newidtr.size = oldidtr.size;
	newidt = (gate_desc *)newidtr.address; 
	memcpy(newidt, oldidt, oldidtr.size);

	/* modify the divide_error entry to point to my assembly handler */ 
	pack_gate(&newidt[DIVIDE_ERROR], GATE_INTERRUPT, (unsigned long)new_stub, 0, 0, __KERNEL_CS);

	/* active the new idt_table */
	load_IDTR((void *)&newidtr);

	/* for smp architecture */
     //smp_call_function(load_IDTR,(void *)&newidtr, 0);

   	return 0; 
void __exit hook_exit(void){

	/* message */
	write_console("Jianchen recovered interrupt_0 \r\n");
	sprintf(msg, "Interrupt_0 handled during hijacking = %d \r\n", counter);
	/* active old idt_table */

	/* for smp architecture */
     //smp_call_function(load_IDTR, (void *)&oldidtr, 0);

	/* free the allocated page for new idt_table */

File: Makefile

obj-m += hook.o

	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

File: float.c

#include <stdio.h>
#include <iostream>

using namespace std;

int main()
   	int a,b;
   	a = 1;
   	b = 0;
   	return 0; 


linux !g command

!g would run most recent gcc/g++ compilation command
#1: gcc test.c -o test
#2: ls
#3: !g
In the example, the !g would run #1 again. This would make thing efficient.

linux fg command

When you run a job in background, you can bring it to foreground by command fg
./test &
fg 1
Then the job “test” would be in foreground now.
fg [%job_id], job_id specifies the job that you want to run in the foreground and 1 means most recently background job.
This command is useful when you don’t want to open new terminal to issue some other command, you can now do this in same window by putting one job in the background and pulling back it after issuing another command.

My own function to print slabinfo of kmalloc_cache as the form of /proc/slabinfo

Routine prototype: void my_get_kmalloc_cache_slabinfo(void)
Definition location: source/mm/slub.c
Declaration location: source/include/linux/slab.h
Call location:
Kernel_init( )
Source code:

void my_get_kmalloc_cache_slabinfo(void)
    int i, j;
    struct kmem_cache *s;
    struct kmem_cache_node *n;
    struct page *pos;
    //statistics parameters
    unsigned long nr_active_objs, nr_objs, obj_size, objs_per_slab, pages_per_slab, nr_active_slabs, nr_slabs, nr_free;
    //KMALLOC_SHIFT_HIGH is 13 when using CONFIG_SLUB
    for (i = 0; i <= KMALLOC_SHIFT_HIGH; i++) {
        nr_active_objs    = 0;
        nr_objs           = 0;
        obj_size          = 0;
        objs_per_slab     = 0;
        pages_per_slab    = 0;
        nr_active_slabs   = 0;
        nr_slabs          = 0;
        nr_free           = 0;
        s = kmalloc_caches[i];
        if (!s) {
        for (j = 0; j <= MAX_NUMNODES; j++) {
            n = s->node[j];
            if (!n) {
            nr_slabs += (long)n->nr_slabs.counter;
            nr_objs += (long)n->total_objects.counter;
            //iterate node->partial to get struct page and page.nr_free
            list_for_each_entry(pos, &n->partial, lru)
                nr_free += pos->objects - pos->inuse;
        nr_active_objs  = nr_objs - nr_free;
        obj_size        = s->object_size;//without metadata
        objs_per_slab   = nr_objs / nr_slabs;
        pages_per_slab  = 1 + obj_size * objs_per_slab / (1<<12); //pagesize is 4KB
        nr_active_slabs = nr_slabs;
        printk("%s -> %lu %lu %lu %lu %lu 
                           :tunables %d %d %d 
                           :slabdata %lu %lu %d\n",
                           s->name, nr_active_objs, nr_objs, obj_size,
                           objs_per_slab, pages_per_slab, 0, 0, 0, 
                           nr_active_slabs, nr_slabs, 0);
        //parameters with value 0 doesn't apply to slub allocator

Topic: tunable pattern-based parallelization

[title: confidential]

This paper implements a tool that can detect parallelizable codes and transform them into parallel codes in sequential software. At first the paper proves that their work is feasible and beneficial by conducting pilot experiments on 31 Java programs with about 300,000 lines of code. It shows that every 60 lines of code has a parallelizable location and by using Java’s parallel data type Future the average speedup is 1.8 (The maximum speedup gain of Future is limited to 2.0).

Since Future doesn’t have good scalability on multicores. So the real implementation of tool would transform sequential codes into two common parallel patterns: Master/Worker and Pipeline. And tunable parameters are added to the description of these two parallel patterns. The tunable parameter would decide how to combine or divide each worker or stage and how many threads would be assigned to each work or stage. All tunable information is stored in configuration file. Basically the tool consists four steps. 1) Create code-profiling graph that describe data dependency and control flow. 2) Statically and dynamically analyze the graph and identify the subgraphs that meet conditions for Master/Worker or Pipeline. 3) Based on step 2, construct according tunable parallel architecture. 4) Code transform.

The evaluations show two conclusions. First, on average 99.98% parallelizable locations are detected and 65.56% of them can result in speedup. Second, the tool can even outperform skilled and experienced engineer’s manually code refactoring by 4%. Also different tunable parameters would lead to different performances, which demonstrate the value of tunable parameters in performance optimization on different target platform. The paper advances its previous work mainly in 1) automatic detection and transform. 2) Higher detection accuracy.

For related work, some researches try to detect and exchange parallelizable code to its parallel counterpart. But the alternative parallel implementation has to exist, so this method cannot deal with code refactoring. Some researches only focus on loop and pipeline parallelism with assumption that most runtime is spent in program loops. But this paper covers different form of parallelism. Another important aspect is that they detect parameters that influence the runtime behavior to enable the parallel code to adapt to current and future multicore hardware. At last some commercial tool try to get the region of highest runtime share by sampling or instrumentation. These regions might bear the highest parallel potential. But these tools neither answer if a parallelization is possible concerning the data or control dependencies, nor do they help in identifying how to parallelize this location.

For auto code transform, they use the Tunable Architecture Description Language (TADL) to represent detected tunable parallel patterns, which is also a source-to-source compiler finally executes the transformation on the base of the description. Related technique includes parallel language (or extension) and parallel compiler. The parallel language still cannot tell where and how to achieve parallelization. The parallel compiler’s parallel potential is bound to parallel loops with a fixed number of iterations and no dependencies.

Implement function to print buddyinfo

Following function would print the buddyinfo, which can also be got from /proc/buddyinfo

int my_buddyinfo_show()
    int i;
    int j;
    struct pglist_data* node_0;
    struct zone* pzone;
    node_0 = NODE_DATA(0);
    if (node_0 == NULL)
        return 0;
    for (i = 0; i < MAX_NR_ZONES; ++i)
        pzone = &node_0->node_zones[i];
        if (pzone == NULL)
        printk("Node 0 Zone %s", pzone->name);
        if (pzone->free_area == NULL)
        for (j = 0; j < MAX_ORDER; ++j)
            printk("%5lu", pzone->free_area[j].nr_free);
    return 0;

LLC management on integrated CPU and GPU architecture

Previous last level cache (LLC) manage policies are not apply to heterogeneous core architecture that contains CPU and GPU. Because of GPU application’s two important features: 1. can dominate LCC by high access rate; 2. can tolerate memory access latency by high thread-level parallelism (TLP). Based on these facts following papers try to propose new LLC manage policy.

[1] “TAP: A TLP-Aware Cache Management Policy for a CPU-GPU Heterogeneous Architecture” 2012 HPCA, Georgia Institute of Technology, Jaekyu Lee Hyesoon Kim
[2] “Managing Shared Last-Level Cache in a Heterogeneous Multicore Processor” 2013 PACT, University of Minnesota, Vineeth Mekkat (PHD) Anup Holey Pen-Chung Yew and Antonia Zhai

The first paper declares that they are the first to address cache-sharing problem in GPU-CPU heterogeneous architecture. They proposed method called TAP (TLP-aware cache management policy for CPU-GPU heterogeneous architectures.) which outperforms LRU by 12%. They use a technique called core sampling to identify the cache sensitivity of GPU application. It picks two cores employing different underlying cache policies to run the GPU application. If the performance difference is not significant then it indicates that GPU application is cache insensitive. It would partition cache and allocate limited number of ways to GPU application. Other cores would follow this decision in one period time. At the same time GPU block wouldn’t be promoted and GPGPU blocks will be replaced first when both CPU and GPGPU blocks are replaceable. Another technique they use called Cache Block Lifetime Normalization. It balances the lifetime of CPU and GPU block in LLC by measuring the access ratio of them. The GPU block’s lifetime would be reduced by divided by this ratio.

The second paper tries to beat the first one. They proposed method called HeLM (Heterogeneous LLC Management) which outperforms LRU by 12.5% and TAP by 5.6%. They argued that TAP has two problems: 1. Core sampling would leave LLC be occupied by GPU dead blocks by 40%, which would degrade cache sensitive CPU application. 2. Core sampling is coarse-grained in GPU application cache sensitivity decision, which is slow to adapt to the runtime variations in the application’s behavior.

HelM uses TLP as a runtime metric to define the sensitivity of CPU and GPU applications. If letting GPU application bypasses LLC or not doesn’t causes significant performance difference of GPU application, this GPU application is cache insensitive. If letting GPU application bypasses LLC or not causes significant performance difference of CPU application, this CPU application is cache sensitive. Based on above combined information and the degree of GPU application’s sensitivity, in one time period the GPU application’s bypassing ratio would be determined. The algorithm to compute this ratio/threshold called Threshold Selection Algorithm (TSA). All above method are based on set dueling which applies two opposing techniques to two distinct sets, and identifies the characteristic of the application from the performance difference among the sets. The paper says that some theory show that sampling a small number of sets in the LLC can indicate the cache access behavior with high accuracy. Also the experiment shows that HeLM is able to scale with the number of CPU cores in a heterogeneous multicore.

Performance Implications of SSDs in Virtualized Hadoop Clusters

Virtualization in Hadoop cluster would cause fragmented I/O pattern (more seeks and smaller I/O block size). This pattern would create I/O bottleneck in Hadoop cluster using HDD. In the contrary, SSD get small degradation because it has no mechanical part causing seek overhead. But SSD favor large I/O block size that can make use of parallelism supported by the internal structure of SSD, so there is still slight degradation. When concern about deployment of VMs, it shows the more VMs the worse performance in HDD-based environment. But the SSD-based environment almost ignores the increased number of VMs because of no I/O bottleneck and high CPU utilization for parallelism.