Performance measurements with RDTSC

This article explain a method for comparing optimized functions and choosing the fastest one.

The scope of the method

This method is useful for measuring the tasks that are mostly computation-bounded, e.g. various calculations, cryptographic functions, fractal generation, some text-processing tasks, etc.
Your data should be located in L1 or L2 cache entirely. If you are measuring memcpy function for several megabytes of data, you should use a different method. Ditto for most image-processing functions and other memory-intensive tasks.
If your task involves reading or writing files, calling "heavy" API functions, you should use a different method.
This method is ideal if you have a simple function that is called very often, doesn't call any API functions, and operates on a small data set.

Using RDTSC instruction

The RDTSC instruction returns a 64-bit time stamp counter (TSC), which is increased on every clock cycle. It's the most precise counter available on x86 architecture.

MSVC++ 2005 compiler supports a handy __rdtsc intrinsic that returns the result in 64-bit variable. However, you should flush the instruction pipeline before using RDTSC, so you usually have to use inline assembly function shown below.

Serializing the instruction stream

RDTSC can be executed out-of-order, so you should flush the instruction pipeline to prevent the counter from stopping measurement before the code has actually finished executing.

unsigned __int64 inline GetRDTSC() {
   __asm {
      ; Flush the pipeline
      XOR eax, eax
      CPUID
      ; Get RDTSC counter in edx:eax
      RDTSC
   }
}

Dealing with context switches

Many people worry about context switches that may occur during the measurement. Context switches on Windows NT take several thousand clock cycles, so they might bias your results. The best way to avoid this problem is to arrange a small test case, so that your thread will be rarely interrupted.

The scheduling quant on Windows NT is 20 msec. If your function takes 60 000 clock cycles on 1 GHz processor, there is 0.3% probability that the context switch will happen. On the other hand, if your function takes 100 times more, it will be interrupted with 30% probability.

Some people use SetPriorityClass with REALTIME_PRIORITY_CLASS, SetThreadPriority with THREAD_PRIORITY_TIME_CRITICAL, or SetProcessPriorityBoost to prevent their threads from being preempted. This can help sometimes, but it's not a panacea. Please don't try to measure the functions that take several seconds with this method.

Handling multi-core processors

Time-stamp counters on different cores or different processors are not synchonized with each other. Use SetThreadAffinityMask function to prevent your function from executing on different cores.

Repeating the measurements

To detect the context switches and eliminate cache warm-up effects, you should repeat the measurements at least 5 times. The last measurements should give constant results:

     strlen:       1489 ticks  <== cache warm-up
     strlen:       1041 ticks
     strlen:       1041 ticks
     strlen:       1034 ticks
     strlen:       1013 ticks
     strlen:       1019 ticks  <== constant performance is reached
     strlen:       1019 ticks
     strlen:       1019 ticks
     strlen:       1019 ticks
     strlen:       1019 ticks
     strlen:       1019 ticks

The result for this function will be 1019 clock cycles.

Subtracting overhead

Intel and Agner Fog recommend measuring the overhead of RDTSC function and subtracting it from your result. The overhead is relatively low (150-200 clock cycles) and it occurs in all tested functions, so you can neglect it when measuring long functions (e.g., 100 000 clock cycles).

If you are measuring a short function, you should subtract the overhead using the method described in Intel's paper.

Preventing frequency changes

If your processor supports Intel SpeedStep (usually supported on laptop computers), you should set the power management scheme in Windows to "Always on" before starting long measurements. Otherwise, the processor will change its frequency, and the process of switching to another frequency ("power state transition", in Intel terminology) may bias your results.

Reporting the result in clock cycles, not in seconds

You should not convert your results to seconds. Report them in clock cycles.

From user's point of view, execution time in seconds makes more sense than clock cycles. But remember that:

time_in_seconds = number_of_clock_cycles / frequency

Frequency is a constant (we are comparing the functions on the same processor), so both methods will give the same result.

Clock cycles are more useful, because you can calculate the theoretical number of clock cycles using the Agner Fog's instruction tables and compare it with the real number. If you also monitor performance events (see below), you will know not only which function is faster, but also why it is faster and how to improve it.

Also note that your processor counts in clock cycles, not in seconds. For example, your function takes 5000 clock cycles on 1.5 GHz Pentium M processor. If you will switch the processor to 600 MHz (using Intel SpeedStep), it will take the same 5000 clock cycles. Moreover, on another Pentium M with different frequency, the function will take 5000 clock cycles again.

The time in clock cycles in a consistent, predicable measure, which is independent of frequency changes.

Getting not only timings, but also performance event counters

Reporting only execution time is not enough. If you wish to know the reasons of low performance, you should use performance events monitor (RDPMC instruction), which report you the precise reasons of slow down (for example, is it slow because of cache misses? partial register stalls? instruction fetch stalls?). You can get performance events data with these programs:

AMD CodeAnalyst (freeware, for AMD processors only);
Intel VTune (very expensive and buggy bloatware);
Agner Fog's TestP.Zip package (open source, for all kinds of processors).

Common misunderstandings of the concept

Responses collected from private communication and articles by other authors:

> You should repeat the test 1000 times, so it will take several seconds, and then average the time.

You will get a lot of context switches. The possible implications:

Because the real code does not call the function 1000 times on the same data, your cache usage pattern will be different from the real code. You must be careful enough to reproduce the real cache usage pattern in your tests. It's hard.
The processes executing in background may do crazy things, for example, your Vista indexing service will decide it's a good time to index your folders.
You cannot monitor performance events, because you don't know which performance events are from your program, and which are not. So you will have to optimize blindly, using trial-and-error method without understanding the real reasons of bad performance.

Instead of such tests, you should run the whole program on real data. Your cache usage pattern will then be absolutely real and reliable. Also, the execution time on real data is more valuable to the end user than the time that some synthesized test takes.

So, you should have two tests: the one with the small function, which is known to be a bottleneck, and the one with the large program and the real data for estimating the overall effect of your optimization. In the first case, the time will be measured with RDTSC in clock cycles. In the second case, it will be measured in seconds (you can even use GetTickCount for this case).

> Microsoft recommends using QueryPerformanceCounter instead of RDTSC.

Note that Microsoft does not recommend using RDTSC for game timing, not for performance measurements. Timing in games is a completely different topic.
QueryPerformanceCounter can report data from various counters, and one of them is RDTSC. On single-core processors without Intel SpeedStep, QueryPerformanceCounter is implemented using RDTSC. The function just adds a lot of overhead and some additional problems. As the developer of VirtualDub says: "So, realistically, using QueryPerformanceCounter actually exposes you to all of the existing problems of the time stamp counter AND some other bugs".
When used with care and understanding, RDTSC is more precise than QueryPerformanceCounter. See the thread where a person tried both methods and finally chose RDTSC.
Microsoft itself uses RDTSC in Visual Studio profiler.

Error in Intel's paper

On page 5:

rdtsc
sub             eax, time_low  
sub             edx, time_high

should be

rdtsc
sub             eax, time_low  
sbb             edx, time_high  ; Subtract with borrow from low 32 bits

27 comments

ace, 16 years ago

Re: quantums

"In Windows Server (that uses a multi-level feedback queue algorithm, FYI), the default quantum is a fixed 120ms (...) Compare this to the workstation-level products (Windows Vista/XP/2000 Pro) that have a variable quantum that’s much shorter and also provide a quantum (not priority) boost to the foreground process (the process in the currently active window). In the workstation products, the quantum ranges from 20-60ms typically, with the background processes always relegated to the smallest possible quantum, ensuring that the application one is currently using “feels” responsive and that no background task hampers perceived performance too much."

according to:

http://recoverymonkey.net/wordpress/?p=42

Peter Kankowski, 16 years ago

Thank you. I took "20 msec" figure from Tanenbaum's book "Modern OSes" (he also says that server versions have 120 msec). The page that you found is more recent, so it may be more accurate.

Assembly, OCaml e o memory dump « blog, 16 years ago

[...] porque estava mexendo com umas rotinazinhas em assembly e medindo o desempenho com a instrução rdtsc. Mas me deparei com um comportamento bizarro ao fazer a subtração dos tempos final e inicial que [...]

Marcus Aurelius, 16 years ago

Be careful that the cpuid changes eax, ebx, ecx and edx. Since ebx should not be changed by functions, you must save and restore ebx with push and pop (this is not needed for the other registers).

Otherwise, functions written in C might be using ebx to store something and they do not expect ebx (or esi, or edi) to be changed by GetRDTSC.

I was bitten by this (my blog is in Portuguese, as you see in the pingback above), and had a hard time trying to discover why i could measure my assembly routine but got weird results measuring optimized C routines. I think i will post something about this in my blog soon.

Hello, Marcus,

MSVC++ compiler is clever enough to insert PUSH and POP in such cases. You don't need to worry about saving registers in inline assembly block.

If you are using a stand-alone assembler (MASM or FASM), then it's necessary to save registers. I once was bitten by this, too :).

Hmmm... I didn't know that. Sometimes I use __declspec(naked), and that was the problem. I could not have guessed it would make such a difference! :-)

Using __declspec(naked), I thought I only had to add the ret instruction and manipulate esp/ebp (when necessary), but I didn't know that the compiler was so clever as to detect that ebx is modified and add push and pop for it (but only when you do not use __declspec(naked)!)

Simon (enhzflep), 16 years ago

Great site Peter, I found this after following a link from your CodeProject article on determining the correct form of a plural to use.
It's so refreshing to see somebody that still cares about small code size, efficient execution. Just learned about the concept of cache-warmup from here today. Many thanks, and keep up the good work.

Thank you for kind words, Simon!

henrin, 14 years ago

Here is how I use rdtsc with VC6:

--1-- In some perf.h file:

#define GCLK(m) {__int64 _t=rdtsc(); m; _t=rdtsc()-_t; printf("\n:CLK: %-50.50s :%14I64d",#m,_t);}
static __int64 rdtsc(void) { __asm rdtsc }		/* read CPU clock counter */

--2-- In the source file where I need to trace the execution time:

#include "perf.h"
..
GLCK(some code)
..

Remarks:
- ease of use: bracket the tested code and put GCLK before the 1st bracket.
- this piece of code is the m parameter of the macro GCLK
- if cannot exceed about 512 characters
- is used as compiled code and as a text string (only the 1st 50 characters are printed)
- the 64-bit integer declaration and format may vary with other compilers

Stein0r, 14 years ago

Which instructions / functions do you recommend for the performance measurement of bigger program parts involving I/O and "heavy" API functions?

If I e.g. want to know how long it averagely takes my program to process a certain file...
Can't I simply take a rdtsc measurement before and after and do this very often to eliminate caching effects, etc. ?

sry, my previous post might be misunderstood so i have to rectify it: i don't want to "eliminate" caching effects, just average them as they are a "natural part" of my process :-).

i can't find a reason why rdtsc might be unsuitable to measure the time my program needs to perform certain operations involving I/O and lots of memory :-).

am i right to suppose that the "scope of the method" is only targeted at people who directly want to compare the performance of two implementations of the same function ... ?

great site btw!

Peter Kankowski, 14 years ago

Thank you for kind words :) RDTSC can be useful for longer intervals if you take some measures:

prevent switches to another core (call SetThreadAffinityMask),
eliminate changes in processor frequency (use processors with a constant rate TSC),
convert the result to milliseconds (because it's meaningless to measure, say, HDD access time in processor clock cycles).

I would use QueryPerformanceCounter for I/O and heavy API functions. The resolution is less than 1 microsecond; the result can be easily converted to microseconds (10⁻⁶ sec) or milliseconds (10⁻³ sec) using QueryPerformanceFrequency. There is no problem with changing processor frequency.

thanks for your quick response...

...but i still have some questions :-) .

my measurements are taken in different threads (main thread, I/O thread, ...)
why do i have to make sure my code is executed on one core only?
the measurements are taken inside of threads running on one of the cores---measuring durations.
so even if the time stamp counters on different cores are not synchronized that shouldn't matter imho
because each of them is calculating a relative value ... (?)

isn't it possible to eliminate changes to processor frequency by adjusting the power settings of the operating system, as you said?

the conversion to milliseconds is done in the last step by measuring how many cycles it takes to wait one second (several times).
even if there is a slight measurement error in this, this error is still relative as all measured values are affected in the same way :-) ...

i still think that rdtsc is the best choice for those kinds of measurements under controlled measuring conditions. as you said: it is very fast and precise and has no overhead or bugs associated with it. it's just a pretty light weight, readily available cpu instruction, not an os api call.

best regards

you really should implement a possibility to edit comments...
i wanted to add: are my assumptions correct? :-)

Your thread may be scheduled to different cores. Normally, OS selects the first available core, so you may end up starting the counter on one core, and finishing on another core (if there is a context switch during measurement). SetThreadAffinityMask prevents this.

On Pentium M, you can turn off SpeedStep using power settings. Newer processors have Turbo Boost, but they also feature a constant-rate counter, which is independent of frequency changes.

Generally, you are right: RDTSC is a good choice if you are careful to avoid the problem with multiple cores.

I will add the ability to edit comments later. Thanks :)

ace, 14 years ago

IMHO, using RDTSC "for the performance measurement of bigger program parts involving I/O and "heavy" API functions" with multiple threads is almost like using a single 15-inch ruler to measure the distance between Moscow and Paris. It's technically possible but it doesn't have much sense.

pritesh, 13 years ago

can i use rdtsc for arm processors? If not, whats the equivalent on arm? I am trying to do kernel instrumentation on android device - samsung nexus s which has a single core arm processor.

Peter Kankowski, 13 years ago

Please use a search engine.

ace, 13 years ago

How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures

http://download.intel.com/embedded/software/IA/324264.pdf

Thank you! The paper basically says RDTSCP is more precise than RDTSC. I will compare them.

What I understood is that they use both -- one to start and one to end the measurement, since one combination of cpuid and one prevents instructions to be executed before, and another combination prevents instructions to be executed after, therefore reducing deviation in the kind of measurements where only a few instructions are measured. For longer time periods, it should matter less.

On another side, I don't even know if CPU I use here (i5) has RDTSCP, and if it's one of those newest where RDTSC counter started to increment even when the power management brings the CPU to sleep. Anybody has the code to write such properties?

BTW, I typically don't measure things that lasts only a few instructions. Then even the resolution of GetTickCount is good enough for me. If I measure something that takes around 1 second, and GetTickCount changes once in 1/60th of second, my accuracy is better than 2%. This allows me to measure systems that provide only such time resolution, i.e. I don't need RDTSC to calculate that IE8 on my computer can do order of 6M FPU calculations per second, Firefox order of 240M, Opera order of 600M and C with classic FPU order of 700M (everything on one core of course). What I'm interested in are typically "order of" numbers.

TS, 11 years ago

Is it possible to use "RDTSC" alone without calling "CPUID"? "RSTSC" takes around 20-25 cycles, but CPUID takes 100+ cycles.

The path that I want to measure roughly takes around 2000 cycles.

"RDTSC" with "CPUID" is around 200 cycles, so the overhead is 10% approx. If RDTSC used alone without CPUID, the overhead will be around 1%. The question is how accurate if RDTSC is used alone?

Peter Kankowski, 11 years ago

RDTSC can be executed out-of-order (before or after the nearby instructions), so the measurement will be inaccurate. CPUID flushes the instruction pipeline; all preceding instructions are completed before starting RDTSC.

You can subtract the bias that RDTSC adds to the results. Agner Fog does such subtraction in his test programs.

Morris, 11 years ago

How can measure the nhz speed of rdtsc counter? How to compare hpet to rdtsc to verify which is the faster?

Morris,

please save the time-stamp counter value (using RDTSC) and wait for one second. Find the difference between the current value and the saved value.

Note that the frequency (i.e., CPU speed) may change because of power saving on older processors. Newer processors have a constant-rate time-stamp counter. See https://en.wikipedia.org/wiki/Time_Stamp_Counter

zzzzzz, 11 years ago

minimum on old (iP2) PC maybe replacent cpuid => cld (L1 not flushes, but for cmd-anti-parallesation, for cmd micro-tests)