Using the Cycle Counter Registers on the Raspberry Pi 3

ARM processors support various performance monitoring registers, the most basic being a cycle count register. This is how to make use of it on the Raspberry Pi 3 with its ARM Cortex-A53 processor. The A53 implements the ARMv8 architecture, which can operate in both 64- and 32-bit modes. The Pi 3 uses the 32-bit AArch32 mode, which is more or less backwards compatible with the ARMv7-A architecture, as implemented for example by the Cortex-A7 (used in the early Pi 2s) and Cortex-A8. I hope I’ve got that right; all these names are confusing.

The performance counters are made available through coprocessor registers, read with the mrc instruction and written with the mcr instruction. The precise registers used depend on the particular architecture.

By default, use of these instructions is only possible in “privileged” mode, i.e. from the kernel, so the first thing we need to do is to enable register access from userspace. This can be done through a simple kernel module that can also set up the cycle counter parameters needed (we could do this from userspace after the kernel module has enabled access, but it’s simpler to do everything at once).

To compile a kernel module, you need a set of header files compatible with the kernel you are running. Fortunately, if you have installed a kernel with the raspberrypi-kernel package, the corresponding headers should be in raspberrypi-kernel-headers – if you have used rpi-update, you may need to do something else to get the right headers, and of course if you have built your own kernel, you should use the headers from there. So:

$ sudo apt-get install raspberrypi-kernel
$ sudo apt-get install raspberrypi-kernel-headers

Our Makefile is just:

obj-m += enable_ccr.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

and the kernel module source is:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>

void enable_ccr(void *info) {
  // Set the User Enable register (PMUSERENR), bit 0
  asm volatile ("mcr p15, 0, %0, c9, c14, 0" :: "r" (1));
  // Enable all counters in the PMCR (formerly PMNC) control register, bit 0
  asm volatile ("mcr p15, 0, %0, c9, c12, 0" :: "r" (1));
  // Enable cycle counter specifically (PMCNTENSET)
  // bit 31: enable cycle counter
  // bits 0-3: enable performance counters 0-3
  asm volatile ("mcr p15, 0, %0, c9, c12, 1" :: "r" (0x80000000));
}

int init_module(void) {
  // Each cpu has its own set of registers, so run on all of them
  // (final argument 1: wait until every cpu has finished)
  on_each_cpu(enable_ccr, NULL, 1);
  printk (KERN_INFO "User-level access to CCR has been turned on\n");
  return 0;
}

void cleanup_module(void) {
}
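
The value written to the count enable register in enable_ccr() is just a bit mask, so it can be built up rather than hard-coded. Here is a minimal sketch, following the bit layout given in the module's comments (bit 31 for the cycle counter, bits 0-3 for performance counters 0-3). The helper name pmu_enable_mask and the two macros are made up for illustration:

```c
#include <stdint.h>

#define CCNT_ENABLE_BIT 31u   /* bit 31: cycle counter */
#define EVENT_COUNTERS   4u   /* bits 0-3: performance counters 0-3 */

/* Hypothetical helper: build the mask that would be written with
 * "mcr p15, 0, <mask>, c9, c12, 1".
 * cycle_counter: non-zero to enable the cycle counter.
 * event_bits:    one bit per performance counter; bits above 3 are dropped. */
static inline uint32_t pmu_enable_mask(int cycle_counter, uint32_t event_bits)
{
  uint32_t mask = event_bits & ((1u << EVENT_COUNTERS) - 1u);
  if (cycle_counter)
    mask |= (uint32_t)1 << CCNT_ENABLE_BIT;
  return mask;
}
```

pmu_enable_mask(1, 0) gives 0x80000000, the value used in the module, and pmu_enable_mask(1, 0xf) gives 0x8000000f, which would switch on performance counters 0-3 as well.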

To build the module, just use make:

$ make

and if all goes well, the module itself should be built as enable_ccr.ko.

Install it:

$ sudo insmod enable_ccr.ko

$ dmesg | tail

should show something like:

[ 430.244803] enable_ccr: loading out-of-tree module taints kernel.
[ 430.244820] enable_ccr: module license 'unspecified' taints kernel.
[ 430.244824] Disabling lock debugging due to kernel taint
[ 430.245300] User-level access to CCR has been turned on

(It should go without saying that making your own kernel modules and allowing normally forbidden access from userspace may result in all sorts of potential vulnerabilities that you should be wary of.)

Now we can use the cycle counters in user code:

#include <stdio.h>
#include <stdint.h>

// Read the cycle count register (PMCCNTR)
static inline uint32_t ccnt_read (void)
{
  uint32_t cc = 0;
  __asm__ volatile ("mrc p15, 0, %0, c9, c13, 0":"=r" (cc));
  return cc;
}

int main() {
  // Two back-to-back reads show the cost of reading the counter
  uint32_t t0 = ccnt_read();
  uint32_t t1 = ccnt_read();
  printf("%u\n", t1-t0);
  volatile uint64_t n = 100000000;
  while(n > 0) n--;
  t1 = ccnt_read();
  printf("%u\n", t1-t0);
  return 0;
}

We use a volatile loop counter so the loop isn’t optimized away completely.
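
One thing to keep in mind with the t1-t0 subtraction: the value read is only 32 bits wide, so the counter wraps around every 2^32 cycles, roughly 3.6 seconds at 1.2GHz. Unsigned arithmetic in C is done modulo 2^32, so the difference is still correct if the counter wraps once between the two reads, though not if the interval is longer than a full period. A small sketch (the name ccnt_elapsed is just for illustration):

```c
#include <stdint.h>

/* Cycles elapsed between two counter readings. uint32_t arithmetic
 * wraps modulo 2^32, so the result stays correct if the counter has
 * wrapped around once between t0 and t1. */
static inline uint32_t ccnt_elapsed(uint32_t t0, uint32_t t1)
{
  return t1 - t0;
}
```

For example, a reading of 0xfffffff0 followed by a reading of 0x10 gives an elapsed count of 0x20.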

Using taskset to keep the process on one CPU:

$ gcc -Wall -O3 cycles.c -o cycles
$ time taskset 0x1 ./cycles

real 0m0.712s
user 0m0.700s
sys 0m0.010s

Looks like we can count a single cycle, and since the Pi 3 has a 1.2GHz clock the loop time looks about right. The clock seems to be scaled down if the processor is idle, so we don’t necessarily get a full 1.2 billion cycles per second, for example if we replace the loop above with a sleep.
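
As a rough sanity check on the timing, a cycle count can be converted to seconds by dividing by the clock frequency. The sketch below assumes a fixed 1.2GHz clock, which as noted only holds when the clock is not being scaled down. The macro and function names are made up for illustration:

```c
#include <stdint.h>

#define PI3_CLOCK_HZ 1200000000.0  /* assumed: nominal 1.2GHz, no scaling */

/* Convert a cycle count to seconds at the assumed clock rate. */
static inline double cycles_to_seconds(uint32_t cycles)
{
  return (double)cycles / PI3_CLOCK_HZ;
}

/* Average number of cycles spent per loop iteration. */
static inline double cycles_per_iteration(uint32_t cycles, uint64_t iterations)
{
  return (double)cycles / (double)iterations;
}
```

Going the other way, the 0.7 seconds of user time reported above corresponds to about 840 million cycles at 1.2GHz, or around 8.4 cycles for each of the 100000000 iterations of the volatile loop.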


ARMv8 coprocessor registers:

A useful forum discussion (which includes some details on accessing other performance counters):

Cycle counters on the Pi 1: