xen-tscmode



  • OVERVIEW
        As of Xen 4.0, a new config option called tsc_mode may be specified for
        each domain. The default for tsc_mode handles the vast majority of
        hardware and software environments. This document is targeted for Xen
        users and administrators that may need to select a non-default tsc_mode.
    
        Proper selection of tsc_mode depends on an understanding not only of the
        guest operating system (OS), but also of the application set that will
        ever run on this guest OS. This is because tsc_mode applies equally to
        both the OS and ALL apps that are running on this domain, now or in the
        future.
    
        Key questions to be answered for the OS and/or each application are:
    
        *   Does the OS/app use the rdtsc instruction at all? (We will explain
            below how to determine this.)
    
        *   At what frequency is the rdtsc instruction executed by either the OS
            or any running apps? If the sum exceeds about 10,000 rdtsc
            instructions per second per processor, we call this a
            "high-TSC-frequency" OS/app/environment. (This is relatively rare, and
            developers of OS's and apps that are high-TSC-frequency are usually
            aware of it.)
    
        *   If the OS/app does use rdtsc, will it behave incorrectly if "time goes
            backwards" or if the frequency of the TSC suddenly changes? If so, we
            call this a "TSC-sensitive" app or OS; otherwise it is
            "TSC-resilient".
    
        This last is the US$64,000 question as it may be very difficult (or, for
        legacy apps, even impossible) to predict all possible failure cases. As a
        result, unless proven otherwise, any app that uses rdtsc must be assumed
        to be TSC-sensitive and, as we will see, this is the default starting in
        Xen 4.0.
    
        Xen's new tsc_mode parameter determines the circumstances under which the
        family of rdtsc instructions are executed "natively" vs emulated. Roughly
        speaking, native means rdtsc is fast but TSC-sensitive apps may, under
        unpredictable circumstances, run incorrectly; emulated means there is some
        performance degradation (unobservable in most cases), but TSC-sensitive
        apps will always run correctly. Prior to Xen 4.0, all rdtsc instructions
        were native: "fast but potentially incorrect." Starting at Xen 4.0, the
        default is that all rdtsc instructions are "correct but potentially slow".
        The tsc_mode parameter in 4.0 provides an intelligent default but allows
        system administrator's to adjust how rdtsc instructions are executed
        differently for different domains.
    
        The non-default choices for tsc_mode are:
    
        *   tsc_mode=1 (always emulate).
    
            All rdtsc instructions are emulated; this is the best choice when
            TSC-sensitive apps are running and it is necessary to understand
            worst-case performance degradation for a specific hardware
            environment.
    
        *   tsc_mode=2 (never emulate).
    
            This is the same as prior to Xen 4.0 and is the best choice if it is
            certain that all apps running in this VM are TSC-resilient and highest
            performance is required.
    
        *   tsc_mode=3 (PVRDTSCP).
    
            High-TSC-frequency apps may be paravirtualized (modified) to obtain
            both correctness and highest performance; any unmodified apps must be
            TSC-resilient.
    
        If tsc_mode is left unspecified (or set to tsc_mode=0), a hybrid algorithm
        is utilized to ensure correctness while providing the best performance
        possible given:
    
        *   the requirement of correctness,
    
        *   the underlying hardware, and
    
        *   whether or not the VM has been saved/restored/migrated To understand
            this in more detail, the rest of this document must be read.
    
    DETERMINING RDTSC FREQUENCY
        To determine the frequency of rdtsc instructions that are emulated, an
        "xm" command can be used by a privileged user of domain0. The command:
    
            # xm debug-key s; xm dmesg | tail
    
        provides information about TSC usage in each domain where TSC emulation is
        currently enabled.
    
    TSC HISTORY
        To understand tsc_mode completely, some background on TSC is required:
    
        The x86 "timestamp counter", or TSC, is a 64-bit register on each
        processor that increases monotonically. Historically, TSC incremented
        every processor cycle, but on recent processors, it increases at a
        constant rate even if the processor changes frequency (for example, to
        reduce processor power usage). TSC is known by x86 programmers as the
        fastest, highest-precision measurement of the passage of time so it is
        often used as a foundation for performance monitoring. And since it is
        guaranteed to be monotonically increasing and, at 64 bits, is guaranteed
        to not wraparound within 10 years, it is sometimes used as a random number
        or a unique sequence identifier, such as to stamp transactions so they can
        be replayed in a specific order.
    
        On most older SMP and early multi-core machines, TSC was not synchronized
        between processors. Thus if an application were to read the TSC on one
        processor, then was moved by the OS to another processor, then read TSC
        again, it might appear that "time went backwards". This loss of
        monotonicity resulted in many obscure application bugs when TSC-sensitive
        apps were ported from a uniprocessor to an SMP environment; as a result,
        many applications -- especially in the Windows world -- removed their
        dependency on TSC and replaced their timestamp needs with OS-specific
        functions, losing both performance and precision. On some more recent
        generations of multi-core machines, especially multi-socket multi-core
        machines, the TSC was synchronized but if one processor were to enter
        certain low-power states, its TSC would stop, destroying the synchrony and
        again causing obscure bugs. This reinforced decisions to avoid use of TSC
        altogether. On the most recent generations of multi-core machines,
        however, synchronization is provided across all processors in all power
        states, even on multi-socket machines, and provide a flag that indicates
        that TSC is synchronized and "invariant". Thus TSC is once again useful
        for applications, and even newer operating systems are using and depending
        upon TSC for critical timekeeping tasks when running on these recent
        machines.
    
        We will refer to hardware that ensures TSC is both synchronized and
        invariant as "TSC-safe" and any hardware on which TSC is not (or may not
        remain) synchronized as "TSC-unsafe".
    
        As a result of TSC's sordid history, two classes of applications use TSC:
        old applications designed for single processors, and the most recent
        enterprise applications which require high-frequency high-precision
        timestamping.
    
        We will refer to apps that might break if running on a TSC-unsafe machine
        as "TSC-sensitive"; apps that don't use TSC, or do use TSC but use it in a
        way that monotonicity and frequency invariance are unimportant as
        "TSC-resilient".
    
        The emergence of virtualization once again complicates the usage of TSC.
        When features such as save/restore or live migration are employed, a guest
        OS and all its currently running applications may be invisibly transported
        to an entirely different physical machine. While TSC may be "safe" on one
        machine, it is essentially impossible to precisely synchronize TSC across
        a data center or even a pool of machines. As a result, when run in a
        virtualized environment, rare and obscure "time going backwards" problems
        might once again occur for those TSC-sensitive applications. Worse, if a
        guest OS moves from, for example, a 3GHz machine to a 1.5GHz machine,
        attempts by an OS/app to measure time intervals with TSC may without
        notice be incorrect by a factor of two.
    
        The rdtsc (read timestamp counter) instruction is used to read the TSC
        register. The rdtscp instruction is a variant of rdtsc on recent
        processors. We refer to these together as the rdtsc family of
        instructions, or just "rdtsc". Instructions in the rdtsc family are
        non-privileged, but privileged software may set a cpuid bit to cause all
        rdtsc family instructions to trap. This trap can be detected by Xen, which
        can then transparently "emulate" the results of the rdtsc instruction and
        return control to the code following the rdtsc instruction.
    
        To provide a "safe" TSC, i.e. to ensure both TSC monotonicity and a fixed
        rate, Xen provides rdtsc emulation whenever necessary or when explicitly
        specified by a per-VM configuration option. TSC emulation is relatively
        slow -- roughly 15-20 times slower than the rdtsc instruction when
        executed natively. However, except when an OS or application uses the
        rdtsc instruction at a high frequency (e.g. more than about 10,000 times
        per second per processor), this performance degradation is not noticeable
        (i.e. <0.3%). And, TSC emulation is nearly always faster than OS-provided
        alternatives (e.g. Linux's gettimeofday). For environments where it is
        certain that all apps are TSC-resilient (e.g. "TSC-safeness" is not
        necessary) and highest performance is a requirement, TSC emulation may be
        entirely disabled (tsc_mode==2).
    
        The default mode (tsc_mode==0) checks TSC-safeness of the underlying
        hardware on which the virtual machine is launched. If it is TSC-safe,
        rdtsc will execute at hardware speed; if it is not, rdtsc will be
        emulated. Once a virtual machine is save/restored or migrated, however,
        there are two possibilities: TSC remains native IF the source physical
        machine and target physical machine have the same TSC frequency (or, for
        HVM/PVH guests, if TSC scaling support is available); else TSC is
        emulated. Note that, though emulated, the "apparent" TSC frequency will be
        the TSC frequency of the initial physical machine, even after migration.
    
        For environments where both TSC-safeness AND highest performance even
        across migration is a requirement, application code can be specially
        modified to use an algorithm explicitly designed into Xen for this
        purpose. This mode (tsc_mode==3) is called PVRDTSCP, because it requires
        app paravirtualization (awareness by the app that it may be running on top
        of Xen), and utilizes a variation of the rdtsc instruction called rdtscp
        that is available on most recent generation processors. (The rdtscp
        instruction differs from the rdtsc instruction in that it reads not only
        the TSC but an additional register set by system software.) When a
        pvrdtscp-modified app is running on a processor that is both TSC-safe and
        supports the rdtscp instruction, information can be obtained about
        migration and TSC frequency/offset adjustment to allow the vast majority
        of timestamps to be obtained at top performance; when running on a
        TSC-unsafe processor or a processor that doesn't support the rdtscp
        instruction, rdtscp is emulated.
    
        PVRDTSCP (tsc_mode==3) has two limitations. First, it applies to all apps
        running in this virtual machine. This means that all apps must either be
        TSC-resilient or pvrdtscp-modified. Second, highest performance is only
        obtained on TSC-safe machines that support the rdtscp instruction; when
        running on older machines, rdtscp is emulated and thus slower. For more
        information on PVRDTSCP, see below.
    
        Finally, tsc_mode==1 always enables TSC emulation, regardless of the
        underlying physical hardware. The "apparent" TSC frequency will be the TSC
        frequency of the initial physical machine, even after migration. This mode
        is useful to measure any performance degradation that might be encountered
        by a tsc_mode==0 domain after migration occurs, or a tsc_mode==3 domain
        when it is running on TSC-unsafe hardware.
    
        Note that while Xen ensures that an emulated TSC is "safe" across
        migration, it does not ensure that it continues to tick at the same rate
        during the actual migration. As an oversimplified example, if TSC is
        ticking once per second in a guest, and the guest is saved when the TSC is
        1000, then restored 30 seconds later, TSC is only guaranteed to be greater
        than or equal to 1001, not precisely 1030. This has some OS implications
        as will be seen in the next section.
    
    TSC INVARIANT BIT and NO_MIGRATE
        Related to TSC emulation, the "TSC Invariant" bit is architecturally
        defined in a cpuid bit on the most recent x86 processors. If set, TSC
        invariance ensures that the TSC is "safe", that is it will increment at a
        constant rate regardless of power events, will be synchronized across all
        processors, and was properly initialized to zero on all processors at
        boot-time by system hardware/BIOS. As long as system software never writes
        to TSC, TSC will be safe and continuously incremented at a fixed rate and
        thus can be used as a system "clocksource".
    
        This bit is used by some OS's, and specifically by Linux starting with
        version 2.6.30(?), to select TSC as a system clocksource. Once selected,
        TSC remains the Linux system clocksource unless manually overridden. In a
        virtualized environment, since it is not possible to synchronize TSC
        across all the machines in a pool or data center, a migration may "break"
        TSC as a usable clocksource; while time will not go backwards, it may not
        track wallclock time well enough to avoid certain time-sensitive
        consequences. As a result, Xen can only expose the TSC Invariant bit to a
        guest OS if it is certain that the domain will never migrate. As of Xen
        4.0, the "no_migrate=1" VM configuration option may be specified to
        disable migration. If no_migrate is selected and the VM is running on a
        physical machine with "TSC Invariant", Linux 2.6.30+ will safely use TSC
        as the system clocksource. But, attempts to migrate or, once saved,
        restore this domain will fail.
    
        There is another cpuid-related complication: The x86 cpuid instruction is
        non-privileged. HVM domains are configured to always trap this instruction
        to Xen, where Xen can "filter" the result. In a PV OS, all cpuid
        instructions have been replaced by a paravirtualized equivalent of the
        cpuid instruction ("pvcpuid") and also trap to Xen. But apps in a PV guest
        that use a cpuid instruction execute it directly, without a trap to Xen.
        As a result, an app may directly examine the physical TSC Invariant cpuid
        bit and make decisions based on that bit. This is still an unsolved
        problem, though a workaround exists as part of the PVRDTSCP tsc_mode for
        apps that can be modified.
    
    MORE ON PVRDTSCP
        Paravirtualized OS's use the "pvclock" algorithm to manage the passing of
        time. This sophisticated algorithm obtains information from a memory page
        shared between Xen and the OS and selects information from this page based
        on the current virtual CPU (vcpu) in order to properly adapt to TSC-unsafe
        systems and changes that occur across migration. Neither this shared page
        nor the vcpu information is available to a userland app so the pvclock
        algorithm cannot be directly used by an app, at least without performance
        degradation roughly equal to the cost of just emulating an rdtsc.
    
        As a result, as of 4.0, Xen provides capabilities for a userland app to
        obtain key time values similar to the information accessible to the PV OS
        pvclock algorithm. The app uses the rdtscp instruction which is defined in
        recent processors to obtain both the TSC and an auxiliary value called
        TSC_AUX. Xen is responsible for setting TSC_AUX to the same value on all
        vcpus running any domain with tsc_mode==3; further, Xen tools are
        responsible for monotonically incrementing TSC_AUX anytime the domain is
        restored/migrated (thus changing key time values); and, when the domain is
        running on a physical machine that either is not TSC-safe or does not
        support the rdtscp instruction, Xen is responsible for emulating the
        rdtscp instruction and for setting TSC_AUX to zero on all processors.
    
        Xen also provides pvclock information via a "pvcpuid" instruction. While
        this results in a slow trap, the information changes (and thus must be
        reobtained via pvcpuid) ONLY when TSC_AUX has changed, which should be
        very rare relative to a high frequency of rdtscp instructions.
    
        Finally, Xen provides additional time-related information via other
        pvcpuid instructions. First, an app is capable of determining if it is
        currently running on Xen, next whether the tsc_mode setting of the domain
        in which it is running, and finally whether the underlying hardware is
        TSC-safe and supports the rdtscp instruction.
    
        As a result, a pvrdtscp-modified app has sufficient information to compute
        the pvclock "elapsed nanoseconds" which can be used as a timestamp. And
        this can be done nearly as fast as a native rdtsc instruction, much faster
        than emulation, and also much faster than nearly all OS-provided time
        mechanisms. While pvrtscp is too complex for most apps, certain enterprise
        TSC-sensitive high-TSC-frequency apps may find it useful to obtain a
        significant performance gain.
    
    HARDWARE TSC SCALING
        Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read by
        guest rdtsc/p increasing in a different frequency than the host TSC
        frequency.
    
        If a HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
        (tsc_mode=3) is created on a host that provides constant TSC, its guest
        TSC frequency will be the same as the host. If it is later migrated to
        another host that provides constant TSC and supports Intel VMX TSC
        scaling/AMD SVM TSC ratio, its guest TSC frequency will be the same before
        and after migration.
    
        For above HVM container in default TSC mode (tsc_mode=0), if above hosts
        support rdtscp, both guest rdtsc and rdtscp instructions will be executed
        natively before and after migration.
    
        For above HVM container in PVRDTSCP mode (tsc_mode=3), if the destination
        host does not support rdtscp, the guest rdtscp instruction will be
        emulated with the guest TSC frequency.
    
    AUTHORS
        Dan Magenheimer <[email protected]>
    

Log in to reply
 

© Lightnetics 2024