The question comes up often, and the answer is absolutely yes: there is a skewing of time and resource usage values with virtualization. It's an artifact of the way the hypervisor works. You can think of time in any individual VM as being "eventually consistent". Under a hypervisor, the guest OS is scheduled to run in much the same way that the processes in your application are scheduled to run.
Since the OS is not running all the time, as it would be on a physical server without a hypervisor, the values you see from the OS are often skewed. The OS simply isn't aware that it isn't running continuously. For example, Linux uses a periodic timer interrupt to maintain what it calls jiffies; this is just a counter that is incremented each time the timer interrupt fires. The interrupt typically occurs every 10 ms (the interval varies with hardware architecture and kernel configuration), so at 10 ms per tick, 100 jiffies is 1 second of elapsed time. You can refer to the constant HZ (param.h) for the specific ticks-per-second value. Windows is a bit more thorough in the way it uses timer hardware to account for the potential skewing. I won't bore you with all the details, but its performance counter values can be relied on to be correct to a large degree.
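To make the jiffies idea concrete, here's a minimal user-space sketch (my own illustration, not from the kernel) of how Linux exposes tick-based time: the tick rate comes from sysconf(_SC_CLK_TCK), the user-visible counterpart of HZ, and process times are reported in those ticks.

```c
/* Illustrative sketch: how Linux exposes tick-based time to user space.
 * The kernel counts jiffies at HZ ticks per second; user space sees a
 * related rate via sysconf(_SC_CLK_TCK) (typically 100). */
#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>

int main(void)
{
    long ticks_per_sec = sysconf(_SC_CLK_TCK);   /* user-visible tick rate */
    struct tms t;
    times(&t);                                   /* CPU times, in ticks */

    printf("tick rate: %ld ticks/sec\n", ticks_per_sec);
    printf("process CPU time: %.2f sec (user) + %.2f sec (system)\n",
           (double)t.tms_utime / ticks_per_sec,
           (double)t.tms_stime / ticks_per_sec);

    /* If the hypervisor deschedules this VM, these tick counts stop
     * advancing until the guest runs again and catches up. */
    return 0;
}
```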
So, given that the OS relies on running continuously in order to calculate time and time-related resource utilization, things can get skewed. The better hypervisors attempt to account for the missing time. For example, while an OS sits in a wait queue and the hypervisor schedules other OSs to run, a period of time elapses. When the OS is finally scheduled to run, it is not aware of how much time has actually passed. The hypervisor tries to fix this by, for example, simulating (virtualizing) a number of additional timer interrupts so that the OS catches up; in the case of Linux, by incrementing jiffies a bunch of times. The OS eventually catches up, more or less. That's what I meant by eventually consistent.
This virtualization of timer hardware works reasonably well for wall-clock time. As you have experienced, it's not very good for real-time and small, time-critical measurements; the values calculated by the OS can be misleading. It's an artifact of virtualization. Here's something to think about: what happens if you want to know the response time on a socket connection? It needs to be accurate with respect to wall-clock time, because that's what the client sees as response time. If you use calculations that do not overcome this skew, you get really messed up values, especially when you are dealing with microsecond measurements.
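To illustrate why this matters, here's a minimal sketch (again my own, for illustration) of the usual way such a measurement is taken inside a guest: bracket the operation with clock_gettime(CLOCK_MONOTONIC) and subtract. The measure_roundtrip() function is a hypothetical stand-in for real socket I/O; the point is that inside a VM, the clock behind this call is the virtualized timer whose skew we've been discussing, so microsecond-level results deserve suspicion.

```c
/* Illustrative sketch: timing a request/response from inside a guest OS. */
#include <stdio.h>
#include <time.h>

static void measure_roundtrip(void)
{
    /* placeholder for real socket I/O */
    struct timespec pause = { 0, 2000000 };  /* pretend the server took ~2 ms */
    nanosleep(&pause, NULL);
}

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    measure_roundtrip();
    clock_gettime(CLOCK_MONOTONIC, &end);

    long usec = (end.tv_sec - start.tv_sec) * 1000000L
              + (end.tv_nsec - start.tv_nsec) / 1000L;

    /* In a VM, this number is only as good as the virtualized clock source
     * behind CLOCK_MONOTONIC; microsecond precision is not guaranteed. */
    printf("measured response time: %ld us\n", usec);
    return 0;
}
```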
So, what do you do if you want to get viable timing and resource usage values in virtualized environments, including clouds? You need tools that understand this issue and provide you with correct values. Such a solution cannot rely directly on OS interfaces for those values. However, that is exactly what most monitoring solutions do: they use OS capabilities to gather the information for monitoring.
The solution turns out to be pretty low-level and somewhat complex. I won't bore you with all the details (unless someone wants to get into it), so let me try to summarize it this way: 1) an effective cloud monitoring solution has to obtain its raw values independently of the OS and do its own calculations; 2) an effective cloud monitoring solution must use a hardware mechanism that is not skewed by the fact that the OS is not running constantly. Number 1 is kind of difficult to explain, but if there is interest it can perhaps be another blog post.
Number 2 can be summarized, maybe. At AppFirst we use the TSC counter on Intel and AMD CPUs. Before you stop reading and tell me I'm wrong, let me clarify. A few Google searches will point you at various docs describing why this timing mechanism is not safe to use. There are potential problems in a couple of areas: does the counter increment consistently when the CPU's frequency varies (which happens a lot with power management and other activities), and do all cores start with the same value and increment in the same way as every other core? With older CPUs these were serious limitations. With newer CPUs, both Intel and AMD have resolved these issues: where a CPU supports what Intel calls an Invariant TSC, you can rely on the counter to be consistent. There is a lot of confusion about this. I've found that the only information to rely on for this topic is section 16.11 of the Intel Systems Programming Guide.
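For the curious, here's a minimal sketch (my illustration, not AppFirst's actual code) of how that capability is typically detected on x86: CPUID leaf 0x80000007 reports the Invariant TSC flag in bit 8 of EDX.

```c
/* Illustrative sketch: detect Invariant TSC support on x86 (GCC/Clang).
 * CPUID leaf 0x80000007, EDX bit 8 indicates an invariant TSC, i.e. one
 * that ticks at a constant rate regardless of power-management states. */
#include <stdio.h>
#include <cpuid.h>

int has_invariant_tsc(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* Make sure the extended leaf exists before querying it. */
    if (__get_cpuid_max(0x80000000, NULL) < 0x80000007)
        return 0;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;

    return (edx >> 8) & 1;   /* bit 8: Invariant TSC */
}

int main(void)
{
    printf("Invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
    return 0;
}
```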
A solution attempting to use an Invariant TSC has to take care in several ways. We check to see whether the CPU supports an Invariant TSC; where it does, values can be quite accurate and they avoid the skew created by virtualization. Of course, it's a lot of fairly heavy lifting to collect all the necessary data and make the calculations independently of the OS, but frankly, it's the only way to get accurate information. So, that's what we do at AppFirst.
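As a rough sketch of the idea (again mine, with an assumed TSC frequency rather than the real calibration step a production solution needs), the counter can be read directly and differences converted to time:

```c
/* Illustrative sketch: timing with the TSC directly (x86, GCC/Clang).
 * tsc_hz is assumed here; a real solution must calibrate it and should
 * first confirm the TSC is invariant (see the CPUID check above). */
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const double tsc_hz = 3.0e9;          /* assumed 3 GHz; calibrate in practice */

    unsigned long long start = __rdtsc(); /* read time-stamp counter */

    volatile double x = 0.0;              /* some work to measure */
    for (int i = 0; i < 1000000; i++)
        x += i * 0.5;

    unsigned long long end = __rdtsc();

    double usec = (double)(end - start) / tsc_hz * 1e6;
    printf("elapsed: %.1f us (by TSC, assuming %.1f GHz)\n", usec, tsc_hz / 1e9);
    return 0;
}
```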
In order to really use the cloud and virtual environments you need the real data; guesses aren't something you should be running your business on.