The Memory Manager in Windows Server 2003 and Windows Vista
Landy Wang, Software Design Engineer, Windows Kernel Team, Microsoft Corporation
Outline
- Memory Manager (MM) improvements in Windows Server 2003 SP1
  - 64-bit Windows features and enhancements
  - NUMA and large page support added
  - Performance enhancements
  - Support for No Execute (NX) capability
© 2005 Microsoft Corporation
Outline (cont.)
- Memory Manager improvements planned for Windows Vista
  - Dynamic system address space
  - Kernel page table pages allocated on demand
  - Support for very large registries
  - NUMA and large page support enhancements
  - Advanced video model support
  - I/O and section access improvements
  - Performance improvements
  - Terminal Server improvements
  - Robustness and diagnosability improvements
Server 2003 SP1: 64-bit Windows
- 64-bit Windows memory limits
  - 8TB user address space
  - 8TB kernel address space
  - 128GB pools
  - 128GB system page table entries (PTEs)
  - 1TB system cache
- Support for the x64 platform added
- 4GB virtual address space added for 32-bit large-address-aware applications
- Further performance increases for the WOW64 layer on both Itanium and x64 systems
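A 32-bit image opts into the larger address space through the IMAGE_FILE_LARGE_ADDRESS_AWARE flag (0x0020) in the Characteristics field of its PE file header. The Python sketch below is a hedged illustration, not how Windows itself loads images: it reads that flag from raw PE bytes and maps it to the user address space a 32-bit process gets under WOW64 on 64-bit Windows (4GB if large-address-aware, otherwise 2GB). The helper make_fake_image is hypothetical and only builds a minimal header for demonstration.

```python
import struct

# COFF file header Characteristics bit (value from winnt.h).
IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020
GB = 1024 ** 3


def is_large_address_aware(image: bytes) -> bool:
    """Return True if a PE image sets IMAGE_FILE_LARGE_ADDRESS_AWARE."""
    if image[:2] != b"MZ":
        raise ValueError("missing DOS header")
    # e_lfanew, the file offset of the PE signature, is stored at offset 0x3C.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The COFF header follows the 4-byte signature; Characteristics is at +18.
    (characteristics,) = struct.unpack_from("<H", image, pe_offset + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)


def wow64_user_va_size(image: bytes) -> int:
    """User VA for a 32-bit process on 64-bit Windows: 4GB if LAA, else 2GB."""
    return 4 * GB if is_large_address_aware(image) else 2 * GB


def make_fake_image(characteristics: int) -> bytes:
    """Hypothetical helper: a minimal DOS + PE header, enough for the check above."""
    buf = bytearray(0x100)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, 0x80)
    buf[0x80:0x84] = b"PE\0\0"
    struct.pack_into("<H", buf, 0x80 + 4 + 18, characteristics)
    return bytes(buf)


# 0x0002 is IMAGE_FILE_EXECUTABLE_IMAGE; adding 0x0020 marks the image LAA.
print(wow64_user_va_size(make_fake_image(0x0002)) // GB)  # 2
print(wow64_user_va_size(make_fake_image(0x0022)) // GB)  # 4
```

Reading the header directly keeps the sketch independent of any Windows-only API, so it runs on any platform.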
Server 2003 SP1 NUMA & Large Page Support
- Large page support added for user images and pagefile-backed sections
- Large pages now also used on 32-bit systems, even when booted with the /3GB switch, for:
  - Kernel page frame number (PFN) database
  - Initial nonpaged pool
- Prior large page support (added in Server 2003) covered the following:
  - User private memory
  - Device driver image mappings
  - Kernel, when not booted with the /3GB switch
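Large pages matter for structures like the PFN database because each large page replaces hundreds of small-page mappings, so far fewer TLB entries are needed to keep the whole region mapped at once. A minimal arithmetic sketch, assuming 4KB small pages and the usual large page sizes (4MB for non-PAE x86, 2MB for PAE and x64); the 64MB region in the usage lines is an arbitrary example, not a figure from the slides:

```python
KB = 1024
MB = 1024 * KB

SMALL_PAGE = 4 * KB
# Large page size depends on the paging mode.
LARGE_PAGE = {
    "x86": 4 * MB,      # 32-bit non-PAE paging
    "x86-pae": 2 * MB,  # 32-bit PAE paging
    "x64": 2 * MB,
}


def translations_needed(region_bytes: int, page_size: int) -> int:
    """Page mappings (and TLB entries, to keep all of them cached) for a region."""
    return -(-region_bytes // page_size)  # ceiling division


region = 64 * MB  # arbitrary example size
print(translations_needed(region, SMALL_PAGE))         # 16384
print(translations_needed(region, LARGE_PAGE["x64"]))  # 32
```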
Server 2003 SP1 NUMA & Large Page Support
- Pages zeroed in parallel and in a node-aware fashion during boot
  - Reduces boot time on large NUMA systems
- Physical pages initially consumed in top-down order, instead of bottom-up
  - Keeps more pages below 4GB available for drivers that require them
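The effect of top-down consumption can be shown with a toy model. This sketch hands out page frame numbers (PFNs) from a scaled-down pool in which a hypothetical boundary_pfn stands in for the 4GB line; it illustrates the ordering policy only and is not the Windows page allocator.

```python
from collections import deque


class PhysicalPagePool:
    """Toy pool of free page frame numbers (PFNs), lowest PFN first."""

    def __init__(self, total_pages: int, boundary_pfn: int, top_down: bool):
        self.free = deque(range(total_pages))
        self.boundary_pfn = boundary_pfn  # stands in for the 4GB line
        self.top_down = top_down

    def allocate(self) -> int:
        # Top-down takes the highest free PFN; bottom-up takes the lowest.
        return self.free.pop() if self.top_down else self.free.popleft()

    def free_below_boundary(self) -> int:
        return sum(1 for pfn in self.free if pfn < self.boundary_pfn)


for top_down in (False, True):
    pool = PhysicalPagePool(total_pages=16, boundary_pfn=8, top_down=top_down)
    for _ in range(6):
        pool.allocate()
    print("top-down" if top_down else "bottom-up", pool.free_below_boundary())
# bottom-up 2
# top-down 8
```

After the same six allocations, the top-down pool still has every low page free for drivers that need memory below the boundary.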
Server 2003 SP1 Performance Enhancements
- Working set management performance increased, especially in:
  - Areas of large shared memory
  - Systems booted with the /3GB switch
- Premature self-trimming and linear hash table walks eliminated
  - Major performance increases for applications like Exchange and SAP
Server 2003 SP1 Performance Enhancements
- Pool tagging paths parallelized
  - Introduced a shared acquire mode for spinlocks
  - Employed for tag table updates
- Hash table for tagging large pages expanded as soon as we detect that searches are occurring, instead of waiting for the table to be entirely filled
- Overlapped asynchronous flushing for user requests to maximize I/O throughput
- Pagefiles zeroed in parallel instead of serially
  - Faster shutdown when "zero my pagefile" is set
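One reading of the hash table bullet is an open-addressing table that grows as soon as inserts start probing past their home slot, rather than only when it is full. The sketch below assumes that interpretation; the BigPageTable class, its linear probing, the page-number hash, and the max_probes threshold are all hypothetical choices made for illustration, not the kernel's data structure.

```python
PAGE_SHIFT = 12


class BigPageTable:
    """Hypothetical open-addressing table mapping allocation VA -> pool tag."""

    def __init__(self, capacity: int = 8, max_probes: int = 2):
        self.slots = [None] * capacity
        self.max_probes = max_probes
        self.expansions = 0

    def _probe(self, va: int):
        """Linear probe from the home slot; returns (slot index or None, probes used)."""
        n = len(self.slots)
        home = (va >> PAGE_SHIFT) % n
        for i in range(n):
            idx = (home + i) % n
            entry = self.slots[idx]
            if entry is None or entry[0] == va:
                return idx, i + 1
        return None, n

    def _expand(self) -> None:
        old = [e for e in self.slots if e is not None]
        self.slots = [None] * (2 * len(self.slots))
        self.expansions += 1
        for va, tag in old:
            idx, _ = self._probe(va)
            self.slots[idx] = (va, tag)

    def insert(self, va: int, tag: str) -> None:
        idx, probes = self._probe(va)
        # Grow as soon as inserts have to search, not only when the table is full.
        if idx is None or probes > self.max_probes:
            self._expand()
            idx, _ = self._probe(va)
        self.slots[idx] = (va, tag)

    def lookup(self, va: int):
        idx, _ = self._probe(va)
        entry = None if idx is None else self.slots[idx]
        return entry[1] if entry else None


table = BigPageTable(capacity=8, max_probes=2)
for va in (0x0000, 0x8000, 0x10000):  # all three hash to home slot 0
    table.insert(va, "Pool")
print(table.expansions, len(table.slots))  # 1 16
```

The table doubles on the third insert even though only two of its eight slots are occupied, which is the "grow before full" behavior the bullet describes.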
Server 2003 SP1 Performance Enhancements
- Per-process working set lock used to synchronize PTE updates and working set list changes to an address space (system, session, or process)
  - This lock was converted from a mutex to a pushlock
  - Pushlocks support both shared and exclusive acquire modes; mutexes support only exclusive acquisition
- In conjunction with 2-byte interlocked operations, this allows parallelization of many operations:
  - MmProbeAndLockPages: the PFN lock acquire is completely removed from this very hot routine
  - MmUnlockPages
  - VirtualQuery, etc.
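The practical difference between a mutex and a pushlock-style lock is the shared acquire mode: many readers (for example, concurrent queries of an address space) can hold the lock together, while updates still get exclusive access. This Python reader/writer lock is a conceptual sketch built on threading.Condition, not the kernel's pushlock implementation; it favors readers and can starve writers, which a production design would have to address.

```python
import threading


class SharedExclusiveLock:
    """Reader/writer lock illustrating shared vs. exclusive acquire modes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self) -> None:
        with self._cond:
            while self._writer:  # readers wait only for an active writer
                self._cond.wait()
            self._readers += 1

    def release_shared(self) -> None:
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed

    def acquire_exclusive(self) -> None:
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def try_acquire_exclusive(self) -> bool:
        with self._cond:
            if self._writer or self._readers:
                return False
            self._writer = True
            return True

    def release_exclusive(self) -> None:
        with self._cond:
            self._writer = False
            self._cond.notify_all()


lock = SharedExclusiveLock()
lock.acquire_shared()
lock.acquire_shared()                # a second reader is admitted concurrently
print(lock.try_acquire_exclusive())  # False: readers hold the lock
lock.release_shared()
lock.release_shared()
print(lock.try_acquire_exclusive())  # True
lock.release_exclusive()
```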
Server 2003 SP1 Performance Enhancements
- Major PFN lock reduction to improve scalability
  - Reducing hold time
  - Replacing acquisitions with lock-free or alternative lock sequences in many places and APIs
- Translation look-aside buffer (TLB) optimizations
Server 2003 SP1 Other MM Enhancements
- Support for the no-execute (NX) capability
- New Win32 SetThreadStackGuarantee API
  - Allows user applications and the CLR to specify guaranteed stack space requirements
  - Requirements honored even in low-resource scenarios
- Support for hot-patching a running system
  - Patches the system without a reboot to reduce downtime
  - Backported to Windows XP SP2
Windows Vista Dynamic Address Space
- System virtual address (VA) space allocated on demand
  - Instead of at boot time based on registry and configuration information
  - Region sizes bounded only by VA limitations
  - Applies to nonpaged pool, paged pool, session space, mapped views, etc.
- Kernel page tables allocated on demand
  - No longer preallocated at system boot, which saves:
    - 1.5MB on x86 systems
    - 3MB on PAE systems
    - 16MB to 2.5GB on 64-bit machines
- Boot with very large registries on 32-bit machines
  - With and without the /3GB switch
  - Important for large multipath LUN machines
  - MM locates the registry VA space used by the boot loader and reuses it as dynamic kernel virtual address space
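On-demand page table allocation means a page table page exists only once some address in the range it covers is actually mapped. The sketch below assumes a simplified 32-bit non-PAE layout (1,024 four-byte PTEs per 4KB page table page, so each page table page maps 4MB); the class and the 0x80000000 base in the usage lines are illustrative, and the model does not attempt to reproduce the exact savings figures on the slide.

```python
PAGE_SIZE = 4096
PTES_PER_TABLE = 1024                    # 32-bit non-PAE: 1,024 4-byte PTEs per page
TABLE_SPAN = PAGE_SIZE * PTES_PER_TABLE  # each page table page maps 4MB


class OnDemandPageTables:
    """Page table pages are created on first use instead of preallocated."""

    def __init__(self):
        self.tables = {}  # page table index -> list of PFNs (None = not present)

    def map_page(self, va: int, pfn: int) -> None:
        index = va // TABLE_SPAN
        table = self.tables.get(index)
        if table is None:
            # First mapping in this 4MB range: allocate its page table page now.
            table = [None] * PTES_PER_TABLE
            self.tables[index] = table
        table[(va % TABLE_SPAN) // PAGE_SIZE] = pfn

    def translate(self, va: int):
        """Return the physical address for va, or None if it is not mapped."""
        table = self.tables.get(va // TABLE_SPAN)
        if table is None:
            return None
        pfn = table[(va % TABLE_SPAN) // PAGE_SIZE]
        return None if pfn is None else pfn * PAGE_SIZE + va % PAGE_SIZE

    def page_table_bytes(self) -> int:
        return len(self.tables) * PAGE_SIZE


KERNEL_BASE = 0x80000000  # illustrative 32-bit kernel VA
pt = OnDemandPageTables()
pt.map_page(KERNEL_BASE, pfn=5)
pt.map_page(KERNEL_BASE + 0x1000, pfn=6)
pt.map_page(KERNEL_BASE + 0x10000000, pfn=7)    # 256MB further up
print(pt.page_table_bytes())                    # 8192
print(hex(pt.translate(KERNEL_BASE + 0x1234)))  # 0x6234
```

Here three mappings cost two page table pages (8KB), whereas preallocating tables for an entire 2GB region in this layout would take 512 pages (2MB).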
Key Benefits of Dynamic Address Space
- No registry editing and reboots to reconfigure systems due to resource imbalances
- Maximum resources available in a wide range of scenarios, with no human intervention:
  - Desktop heap exhaustion
  - Terminal Server maximum scaling
  - Large video clients
  - /3GB SQL and Exchange machines
  - HTTP servers, NFS servers, etc.
- Features enabled without a reboot, yet have no cost if not used
- 64-bit systems grow to the maximum limit regardless of the underlying physical configuration:
  - 128GB paged pool and nonpaged pool
  - 1TB system cache, system PTEs, and special pool
  - 128GB session pool
  - 128GB session views (desktop heaps), etc.
Windows Vista Planned Enhancements for NUMA, Large System, Large Page Support
- Initial nonpaged pool now NUMA aware, with separate VA ranges for each node
- Per-node look-asides for full pages
- Page table allocation for system PTEs, the system cache, etc. distributed across nodes
  - More even locality
  - Avoids exhausting free pages from the boot node
- NUMA-related APIs for device drivers
  - MmAllocateContiguousMemorySpecifyCacheNode
  - MmAllocatePagesForMdlEx
  - If no node is specified, the default has been changed from the current processor to the thread's ideal processor
- Zeroing of pages for these APIs bounds the number of threads more intelligently
Windows Vista Planned Enhancements for NUMA, Large System, Large Page Support
- Win32 APIs that specify nodes for allocations and mapped views on a per-VAD and per-section basis
  - VirtualAllocExNuma
  - CreateFileMappingNuma
  - MapViewOfFileExNuma
- Scalable query
  - QueryWorkingSetEx
- Higher performance for very physically sparse machines
  - Example: Hewlett-Packard Superdome, with 1TB gaps between chunks of physical memory
- PFN database and initial nonpaged pool always mapped with large pages, regardless of physical memory sparseness
Windows Vista Planned Enhancements for NUMA, Large System, Large Page Support
- /3GB mode on 32-bit systems supports up to 64GB of RAM
  - Booting in /3GB mode on 32-bit systems now supports up to 64GB of RAM instead of just 16GB
  - Booting without /3GB on 32-bit systems continues to support up to 128GB of RAM
Windows Vista Planned Enhancements for NUMA, Large System, Large Page Support
- Much faster large page allocations in kernel and user mode
- Support for cache-aligned pool allocation directives
- Data structures describing the nonpaged pool free list converted from a linked list to a bitmap
  - Reduced lock contention by over 50%
  - Bitmaps can be searched opportunistically lock-free
  - Costly combining of adjacent allocations on free no longer necessary
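With one bit per page, finding free space is a search for a run of clear bits, and freeing is just clearing bits: adjacent free ranges merge implicitly, which is why explicit coalescing on free is no longer needed. The kernel exposes RTL_BITMAP routines such as RtlFindClearBitsAndSet for this kind of search, but the slides do not say which routines MM uses internally. The Python sketch below uses a first-fit scan over a list of booleans purely as an illustration and does not model the opportunistic lock-free search.

```python
class BitmapPool:
    """Tracks free/used pages of a pool region with one flag per page."""

    def __init__(self, num_pages: int):
        self.bits = [False] * num_pages  # False = free, True = in use

    def allocate(self, count: int) -> int:
        """First fit: find `count` consecutive clear bits, set them, return start (or -1)."""
        run_start, run_len = 0, 0
        for i, used in enumerate(self.bits):
            if used:
                run_start, run_len = i + 1, 0
                continue
            run_len += 1
            if run_len == count:
                for j in range(run_start, run_start + count):
                    self.bits[j] = True
                return run_start
        return -1

    def free(self, start: int, count: int) -> None:
        # Clearing bits is enough: neighboring free runs need no explicit merge,
        # because the next search simply sees consecutive clear bits.
        for j in range(start, start + count):
            self.bits[j] = False


pool = BitmapPool(8)
a = pool.allocate(3)     # 0
b = pool.allocate(3)     # 3
print(pool.allocate(3))  # -1: only 2 free pages remain
pool.free(a, 3)
pool.free(b, 3)
print(pool.allocate(6))  # 0: the two freed runs are already contiguous
```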
Windows Vista New Video Model Support
- Dramatically different video architecture in Windows Vista
  - More fully exploits modern GPUs and virtual memory
- MM provides a new mapping type: rotate virtual address descriptors (VADs)
  - Allow video drivers to quickly switch user views from regular application memory into cached, non-cached, or write-combined AGP or video RAM mappings
  - Allow the video architecture to use the GPU to rotate unneeded clients in and out on demand
- First time a Windows-based OS has supported fully pageable mappings with arbitrary cache attributes
Windows Vista I/O and Section Access Improvements
- Pervasive prefetch-style clustering for all types of page faults and system cache read-ahead
- Major benefits over previous clustering:
  - Unlimited read-ahead size instead of a 64KB maximum
  - Dummy page usage, so a single large I/O is always issued regardless of valid pages encountered in the cluster
  - Pages for the I/O are put in transition (not valid), so no VA space is required
  - If the pages are not subsequently referenced, no working set trim and TLB flush is needed either
- Further emphasizes that driver writers must be aware that MDL pages can have their contents change!
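The dummy page technique can be sketched as follows: when a read cluster spans pages that are already resident, those positions are filled with a single shared scratch page instead of splitting the request, so one contiguous I/O is issued and the data landing in the dummy page is discarded. Because many I/Os may target that same dummy page, its contents change underneath any driver that inspects MDL pages, which is the caveat on the slide. This is a hypothetical model in which page numbers stand in for MDL entries.

```python
DUMMY_PAGE = "dummy"  # stands in for one shared scratch page


def build_read_cluster(start_page: int, count: int, resident: set):
    """Build one contiguous read; resident pages are covered by the dummy page.

    Returns (page_list, fresh_pages): page_list describes the single I/O, and
    fresh_pages are the pages that receive new data (placed in transition).
    """
    page_list, fresh_pages = [], []
    for page in range(start_page, start_page + count):
        if page in resident:
            page_list.append(DUMMY_PAGE)  # data read here is simply discarded
        else:
            page_list.append(page)
            fresh_pages.append(page)
    return page_list, fresh_pages


def io_count_without_dummy(start_page: int, count: int, resident: set) -> int:
    """How many separate reads are needed if the cluster must skip resident pages."""
    runs, in_run = 0, False
    for page in range(start_page, start_page + count):
        if page in resident:
            in_run = False
        elif not in_run:
            runs += 1
            in_run = True
    return runs


page_list, fresh = build_read_cluster(10, 6, resident={12, 14})
print(page_list)                                # [10, 11, 'dummy', 13, 'dummy', 15]
print(io_count_without_dummy(10, 6, {12, 14}))  # 3
```

With the dummy page, the six-page cluster is one read; without it, the same cluster would need three.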
Windows Vista I/O and Section Access Improvements
- Significant changes in pagefile writing
  - Larger clusters, up to 4GB
  - Align near neighbors
  - Sort by virtual address (VA)
  - Reduced fragmentation
  - Improved reads
- Cache manager read-ahead size limitations in the thread structure removed
- Improved synchronization between cache manager and memory manager data flushing to maximize filesystem/disk throughput and efficiency
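Sorting modified pages by virtual address before writing lets neighbors be grouped into larger contiguous pagefile writes. A minimal sketch of that grouping step, assuming 4KB pages; max_cluster_pages is a hypothetical cap (the slide cites clusters of up to 4GB), and pagefile space allocation is not modeled.

```python
def build_write_clusters(dirty_vas, page_size=4096, max_cluster_pages=16):
    """Sort modified pages by VA and group runs of adjacent pages into clusters."""
    clusters = []
    for va in sorted(dirty_vas):
        current = clusters[-1] if clusters else None
        if (current is not None
                and va == current[-1] + page_size
                and len(current) < max_cluster_pages):
            current.append(va)     # extends the run: same write I/O
        else:
            clusters.append([va])  # gap (or cap reached): start a new write
    return clusters


dirty = [0x5000, 0x1000, 0x3000, 0x2000, 0x9000]
print([[hex(va) for va in c] for c in build_write_clusters(dirty)])
# [['0x1000', '0x2000', '0x3000'], ['0x5000'], ['0x9000']]
```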
Windows Vista I/O and Section Access Improvements
- Mapped file writing and file flushing performance increases
  - Support for writes of any size up to 4GB instead of the previous 64KB limit per write
  - Multiple asynchronous flushes can be issued, both internally and by the caller, to satisfy a single call
- Pagefile fragmentation improvements
  - On dirty bit faults, we use an interlocked queuing operation to free the pagefile space of the corresponding page
  - Avoids PFN lock acquisitions
  - Reduces needless pagefile fragmentation
Windows Vista I/O and Section Access Improvements
- Elimination of pagefile writes, and potential subsequent re-reads, of completely zero pages
  - Pages are checked at trim time to see if they are all zero
  - An optimization makes this nearly free:
    - The user virtual address is used to check whether the first and last ULONG_PTR are zero
    - If they both are, then after the page is trimmed and the TLB invalidated, a kernel mapping is used to make the final check of the entire page
  - Avoids needless scans and TLB flushes
  - We've measured over a 90% success rate with this algorithm
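The two-step zero check can be sketched as follows, assuming a 64-bit system where ULONG_PTR is 8 bytes. In the real system the first check runs through the user virtual address and the full scan through a kernel mapping after trimming; in this sketch both functions simply take the page contents as bytes.

```python
import struct

PAGE_SIZE = 4096
PTR_SIZE = 8  # sizeof(ULONG_PTR) on 64-bit Windows


def cheap_zero_hint(page: bytes) -> bool:
    """Step 1: are the first and last ULONG_PTR of the page zero?"""
    (first,) = struct.unpack_from("<Q", page, 0)
    (last,) = struct.unpack_from("<Q", page, PAGE_SIZE - PTR_SIZE)
    return first == 0 and last == 0


def page_is_all_zero(page: bytes) -> bool:
    """Step 2: full scan of the page."""
    return page == bytes(PAGE_SIZE)


def can_skip_pagefile_write(page: bytes) -> bool:
    # The full scan only runs for pages that pass the cheap hint.
    return cheap_zero_hint(page) and page_is_all_zero(page)


zero_page = bytes(PAGE_SIZE)
dirty_middle = bytearray(PAGE_SIZE)
dirty_middle[2048] = 1
print(can_skip_pagefile_write(zero_page))            # True
print(cheap_zero_hint(bytes(dirty_middle)))          # True: the hint passes...
print(can_skip_pagefile_write(bytes(dirty_middle)))  # False: ...the full scan rejects it
```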
Windows Vista I/O and Section Access Improvements
- Access to large sections: performance increases
  - A subsection is the data structure used to describe on-disk file spans for sections
  - The subsection structure was converted from a singly linked list (i.e., a linear walk was required) to a balanced AVL tree
  - Enables huge performance gains for sections mapping large files: user mappings and flushes; system cache mappings, flushes, and purges; section-based backups; etc.
- Mapped page writer does flushing based on a sweep hand
  - Data is written out much sooner than under the prior model of flushing everything every 5 minutes
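The gain from the tree comes from replacing a linear walk with a logarithmic search ordered by file offset. The sketch below swaps in a plainly different structure, a sorted Python list searched with bisect, to show the O(log n) lookup that a balanced AVL tree provides; an AVL tree additionally keeps insertion and deletion logarithmic, which a sorted list does not. The Subsection fields here are hypothetical and do not reflect the kernel's actual structure layout.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsection:
    start_page: int  # first file page this subsection describes
    page_count: int


class SubsectionIndex:
    """Ordered index of subsections: sorted array plus binary search."""

    def __init__(self, subsections):
        self.items = sorted(subsections, key=lambda s: s.start_page)
        self.starts = [s.start_page for s in self.items]

    def find(self, file_page: int):
        # Rightmost subsection starting at or before file_page: O(log n) comparisons.
        i = bisect.bisect_right(self.starts, file_page) - 1
        if i >= 0:
            s = self.items[i]
            if file_page < s.start_page + s.page_count:
                return s
        return None


def find_linear(subsections, file_page):
    """What a singly linked list forces: walk from the head."""
    for s in subsections:
        if s.start_page <= file_page < s.start_page + s.page_count:
            return s
    return None


subsections = [Subsection(start_page=i * 100, page_count=100) for i in range(1000)]
index = SubsectionIndex(subsections)
print(index.find(99_950))  # Subsection(start_page=99900, page_count=100)
print(find_linear(subsections, 99_950) == index.find(99_950))  # True
```

Both searches find the same subsection, but the linear walk visits about 1,000 entries while the binary search needs about 10 comparisons.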
Windows Vista I/O and Section Access Improvements
- Dependencies between the modified page writer and the mapped page writer removed to:
  - Increase parallelism
  - Reduce filesystem deadlock rules
  - Provide the cache manager with a way to influence which portions of files get written first, to optimize disk seeks as well as avoid valid data length extension costs
Windows Vista I/O and Section Access Improvements
- Core support for Superfetch
  - Enables significantly faster application launch by deciding which pages should be prioritized
  - Provides mechanisms to prefetch pages and prevent premature cannibalization
  - Includes support for:
    - Per-page priorities
    - Access bit tracing
    - Private page prefetching
    - Section (including pagefile-backed) prefetching
Windows Vista Fast S4 Support
- Hibernation converted to use memory management mirroring facilities
  - Hibernation time reduced by 2x, with a 50% smaller hiberfile
  - Resume time reductions
Windows Vista Internal Data Structure and Algorithmic Performance Enhancements
- PFN lock time reduction is constant and ongoing, and has included areas like:
  - User address space trimming and deletion
  - MEM_RESET
  - Page allocations
  - The PFN share count, which now uses interlocked updates instead of requiring the PFN lock
  - Page faults
  - Modified writes
  - Page color generation
  - MDL construction for fault I/Os, and so on
- Translation look-aside buffer (TLB) optimizations
Windows Vista Internal Data Structure and Algorithmic Performance Enhancements
- The per-process address space lock is used to synchronize creation, deletion, and changes to user address spaces
  - This lock was converted from a mutex to a pushlock
  - Pushlocks support both shared and exclusive acquire modes; mutexes support only exclusive acquisition
  - This allowed parallelization of many operations, such as VirtualAlloc
- VirtualAlloc support has been revamped, improving performance of:
  - Conventional (non-AWE) allocations by over 30%
  - AWE allocations by over 2500% (not a typo)
- Address Windowing Extensions (AWE) non-zeroed allocations are more than 10x faster than in SP1
  - They can therefore now be used for HTTP responses, for example
Windows Vista Internal Data Structure and Algorithmic Performance Enhancements
- The PFN database contains information about all physical memory in the machine
- In the past, whenever a new page was needed:
  - The PFN spinlock was acquired
  - The new page was removed from the appropriate list chained through the PFN database
- This has been improved by adding a zeroed and free page SLIST for every NUMA node and page color
  - We now obtain the page without needing the PFN lock in many instances where we need a single page: demand zero faults, copy-on-write faults, etc.
  - For example, the fault processing path length is cut in half
  - Alleviates pressure on both the working set pushlock and the PFN lock
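Bucketing free pages by (NUMA node, page color) lets a fault take a single page of the right color from the local node without searching a global list. In the kernel these buckets are SLISTs, whose interlocked push and pop (as in InterlockedPushEntrySList and InterlockedPopEntrySList) are what remove the need for the PFN lock; the Python sketch below uses plain lists in a single-threaded model, its PFN-modulo coloring is a simplification, and NUM_COLORS is a hypothetical placeholder, since the real number of colors depends on the processor cache.

```python
from collections import defaultdict

NUM_COLORS = 4  # hypothetical; the real count depends on the processor cache


def page_color(pfn: int) -> int:
    # Simplified coloring: consecutive PFNs cycle through the colors.
    return pfn % NUM_COLORS


class FreePageLists:
    """Free pages bucketed by (NUMA node, page color); each bucket is a LIFO stack."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def push(self, node: int, pfn: int) -> None:
        self.buckets[(node, page_color(pfn))].append(pfn)

    def pop(self, node: int, color: int):
        # Only the requested (node, color) bucket is consulted.
        stack = self.buckets.get((node, color))
        return stack.pop() if stack else None


lists = FreePageLists()
for pfn in range(16):
    lists.push(node=pfn // 8, pfn=pfn)  # PFNs 0-7 on node 0, 8-15 on node 1
print(lists.pop(node=0, color=1))  # 5
print(lists.pop(node=1, color=1))  # 13
```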
Windows Vista Terminal Services Improvements
- Added Terminal Server session objects
  - Enables various components to have secure session IDs and implement compartment IDs, for example
- Major overhaul of Terminal Server global-per-session image support
  - Eliminated multiple image control areas, to provide a single image cache and fix flush/purge/truncate races
  - Only the shared subsections themselves are now per-session, instead of the entire image
  - Shared subsections use an AVL tree instead of a linked list, for faster searches
- Support for hot-patching of session-space drivers
- 64-bit Windows uses demand zero pages instead of pool for WOW64 page table bitmaps
Windows Vista Additional Robustness and Diagnosability
- Capability to mark system cache views as read-only
  - Used by the registry to protect views from inadvertent driver corruption
- Reduced data loss in the face of crashes
  - Flush all modified data to its backing store (local and remote) if we are going to bugcheck due to a failed inpage
  - Only failed inpages of the kernel and/or drivers are fatal
  - Failed inpages of user process code/data merely result in an inpage exception being handed to the application
- Commit thresholds now reflected in global named events
  - Applications can use these to monitor the system
Windows Vista Additional Robustness and Diagnosability
- .pagein debugger support added for kernel/driver addresses
  - Allows viewing memory addresses that have been paged out to disk when debugging crashes
Call to Action
- Consider these significant Memory Manager enhancements as you develop drivers for Windows Server 2003 and Windows Vista
- Use new APIs when available in Windows Vista
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.