Author |
Message |
bfc_forever n00b
Joined: 10 Aug 2006 Posts: 49
|
Posted: Thu Mar 22, 2007 12:32 am Post subject: Why RDTSC is so slow? |
|
|
Hi,
I tested the following code on a Core 2 Duo E6400 computer running Gentoo Linux and found that the time between two consecutive reads of the time stamp counter was 88 cycles. This is much slower than I expected. I wonder whether the result is normal or whether I did something wrong. Thanks a lot for your help.
Code: |
#include <stdio.h>
#include <stdint.h>
extern "C" {
    __inline__ uint64_t rdtsc() {
        uint32_t lo, hi;
        /* We cannot use "=A", since this would use %rax on x86_64 */
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return (uint64_t)hi << 32 | lo;
    }
}
int main(int argc, char* argv[])
{
    unsigned long long a, b;
    a = rdtsc();
    b = rdtsc();
    printf("%llu\n", b-a);
    return 0;
}
|
Best,
bfc_forever |
|
|
|
reillyeon n00b
Joined: 26 Mar 2003 Posts: 44 Location: Boston (ish)
|
Posted: Fri Mar 23, 2007 4:35 am Post subject: |
|
|
If you look at the generated assembly for this code, you will see that, without any explicit optimization, GCC doesn't inline the function, so there is quite a bit of code between the two calls to rdtsc(), and thus between the two executions of the instruction. When compiled with -O1, the difference drops from about 45 cycles (on an Athlon X2) to 8 cycles, which makes sense given that there are still 4 instructions between the two reads. You should see a similar improvement on your machine. _________________ Linux user #309501 |
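To make the inlining point above concrete, here is a minimal sketch, assuming GCC or Clang on x86 or x86_64. The names `combine_tsc`, `measure_rdtsc_overhead` and `TscStats` are made up for illustration and do not come from the original post. The `always_inline` attribute asks the compiler to inline the read even when no optimization level is given, and the measurement is repeated so that one-off effects (cache misses, interrupts) do not dominate a single sample.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Combine the two 32-bit halves that RDTSC leaves in EDX:EAX.
static inline __attribute__((always_inline))
uint64_t combine_tsc(uint32_t hi, uint32_t lo) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Read the time stamp counter (x86 only). always_inline requests
// inlining even at -O0, so no call/return sits between two reads.
static inline __attribute__((always_inline))
uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return combine_tsc(hi, lo);
}

// Hypothetical summary type: smallest and median delta, in TSC ticks.
struct TscStats {
    uint64_t min;
    uint64_t median;
};

// Take `samples` back-to-back read pairs and summarise the deltas.
// Assumes samples > 0.
static TscStats measure_rdtsc_overhead(std::size_t samples) {
    std::vector<uint64_t> deltas(samples);
    for (std::size_t i = 0; i < samples; ++i) {
        uint64_t a = rdtsc();
        uint64_t b = rdtsc();
        deltas[i] = b - a;
    }
    std::sort(deltas.begin(), deltas.end());
    return TscStats{deltas.front(), deltas[samples / 2]};
}
```

Calling `measure_rdtsc_overhead(1000)` from `main` and printing `min` and `median` gives a steadier figure than a single pair of reads. Even with forced inlining, an unoptimized build may still store to the stack between the two reads, so building with -O1 or higher remains the better comparison.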
|
|
|
bfc_forever n00b
Joined: 10 Aug 2006 Posts: 49
|
Posted: Fri Mar 23, 2007 4:47 am Post subject: |
|
|
Thanks a lot for the information. I had noticed that the compiler didn't inline the function, so I enabled optimization to get rid of the function call. But on my computer, a Core 2 Duo E6400, the time only drops from 88 cycles to 64 cycles. I will look into it further to see whether there is any more room to optimize.
reillyeon wrote: | If you look at the generated assembly for this code, you will see that, without any explicit optimization, GCC doesn't inline the function, so there is quite a bit of code between the two calls to rdtsc(), and thus between the two executions of the instruction. When compiled with -O1, the difference drops from about 45 cycles (on an Athlon X2) to 8 cycles, which makes sense given that there are still 4 instructions between the two reads. You should see a similar improvement on your machine. |
Best,
bfc_forever |
|