-
It’s been over two years since I blogged. Although I remain happily (perhaps even ecstatically) working at Microsoft, I left the CLR team and the Developer Division about a year ago. I’m now on an incubation team, exploring evolution and revolution in operating systems. This is a fascinating area that includes devices, concurrency, scheduling, security, distribution, application model, programming model and even some aspects of user interaction (where I am totally out of my depth). And, as you might expect with my background, our effort also includes managed programming.
Anyway, this blog will remain available indefinitely. It continues to be useful for certain technical details which are unavailable elsewhere.
In the meantime, if any readers are interested in working on a deep systems incubation with me and a team of truly outstanding developers, please send me email (cbrumme). We are holding to some very high standards for this effort in terms of insight, experience and hard work. But if you are like me, I am confident you will find it a dream opportunity.
|
-
My original posts on Finalization and Hosting had some hokey XXXXX markers in place of content, where that content hadn't already been disclosed in some form. Now that the Visual Studio 2005 Community Preview is available, I've gone back to those two posts and replaced the XXXXX markers with real text.
Also, it's obviously been a while since my last post. I started writing something this weekend, but the weather here has been spectacular and I was compelled to go outside and play. I'll try to have something in the next couple of weeks.
|
-
Hosting
My prior three blogs were supposed to be on Hosting. Each time I got sidetracked, first on Exceptions, then on Application Compatibility and finally on Finalization. I refuse to be sidetracked this time… much.
Also, I need to explain why it’s taken so long to get this blog out. Part of the reason is vacation. I spent Thanksgiving skiing in Whistler. Then I took a quick side trip to Scottsdale for a friend’s surprise birthday party and to visit my parents. Finally, I spent over three weeks on Maui getting a break from the Seattle winter.
Another reason for the delay is writer’s block. This topic is so huge. The internal specification for the Whidbey Hosting Interfaces is over 100 pages. And that spec only covers the hosting interfaces themselves. There are many other aspects of hosting, like how to configure different security policy in different AppDomains, or how to use COM or managed C++ to stitch together the unmanaged host with the managed applications. There’s no way I can cover the entire landscape.
Anyway, here goes.
Mostly I was tourist overhead at the PDC. But one of the places I tried to pay for my ticket was a panel on Hosting. The other panelists included a couple of Program Managers from the CLR, another CLR architect, representatives from Avalon / Internet Explorer, SQL Server, Visual Studio / Office, and – to my great pleasure – a representative from IBM for DB2.
One thing that was very clear at that panel is that the CLR team has done a poor job of defining what hosting is and how it is done. Depending on your definition, hosting could be:
- Mixing unmanaged and managed code in the same process.
- Running multiple applications, each in its own specially configured AppDomain.
- Using the unmanaged hosting interfaces described in mscoree.idl.
- Configuring how the CLR runs in the process, like disabling the concurrent GC through an application config file.
Even though the hosting interfaces described in mscoree.idl are a small part of what could be hosting, I’m going to concentrate on those interfaces.
In V1 and V1.1 of the CLR, we provided some APIs that allowed an unmanaged process host to exercise some limited control over the CLR. This limited control included the ability to select the version of the CLR to load, the ability to create and configure AppDomains from unmanaged code, access to the ThreadPool, and a few other fundamental operations.
Also, we knew we eventually needed to support hosts which manage all the memory in the process and which use non-preemptive scheduling of tasks and perhaps even light-weight fibers rather than OS threads. So we added some rudimentary (and alas inadequate) APIs for fibers and memory control. That kind of inadequacy is what invariably happens when you add features that you think you will eventually need, rather than features that someone is actually using and giving feedback on.
If you look closely at the V1 and V1.1 hosting APIs, you really see what we needed to support ASP.NET and a few other scenarios, like ones involving EnterpriseServices, Internet Explorer or VSA, plus some rudimentary guesses at what we might need to coexist properly inside SQL Server.
Obviously in Whidbey we have refined those guesses about SQL Server into hard requirements. And we tried very hard to generalize each extension that we added for SQL Server, so that it would be applicable to many other hosting scenarios. In fact, it’s amazing that the SQL Server team still talks to us – whenever they ask for anything, we always say No and give them something that works a lot better for other hosts and not nearly so well for SQL Server.
In our next release (Whidbey), we’ve made a real effort to clean up the existing hosting support and to dramatically extend it for a number of new scenarios. Therefore I’m not going to spend any more time discussing those original V1 & V1.1 hosting APIs, except to the extent that they are still relevant to the following Whidbey hosting discussion.
Also I’m going to skip over all the general introductory topics like “When to host” since they were the source of my writer’s block. Instead, I’m going to leap into some of the more technically interesting topics. Maybe after we’ve studied various details we can step back and see some general guidelines.
Threading and Synchronization
One of the most interesting challenges we struggled with during Whidbey was the need to cooperate with SQL Server’s task scheduling. SQL Server can operate in either thread mode or fiber mode. Most customers run in thread mode, but SQL Server can deliver its best numbers on machines with lots of CPUs when it’s running in fiber mode. That gap between thread and fiber mode has been closing as the OS addresses issues with its own preemptive scheduler.
A few years ago, I ran some experiments to see how many threads I could create in a single process. Not surprisingly, after almost 2000 threads I ran out of address space in the process. That’s because the default stack size on NT is 1 MB and the default user address space is 2 GB. (Starting with V1.1, the CLR can load into LARGEADDRESSAWARE processes and use up to 3 GB of address space). If you shrink the default stack size, you can create more than 2000 threads before hitting the address space limit. I see stack sizes of 256 KB in the SQL Server process on my machine, clearly to reduce this impact on process address space.
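The arithmetic behind those thread counts can be sketched as follows. This is only an upper bound under the assumption that stack reservations are the sole consumer of address space (they are not, which is why the limit was hit a little before 2048 threads); the function and constant names are made up for illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kGiB = 1024ull * kMiB;

// Upper bound on the number of threads if stack reservations were the only
// thing consuming user address space. Real processes also hold code, heaps
// and mapped files, so the practical limit is lower.
constexpr std::uint64_t MaxThreadsByAddressSpace(std::uint64_t addressSpaceBytes,
                                                 std::uint64_t stackReserveBytes) {
    return addressSpaceBytes / stackReserveBytes;
}

// 2 GB of user address space with the default 1 MB stack reservation.
static_assert(MaxThreadsByAddressSpace(2 * kGiB, 1 * kMiB) == 2048,
              "default stacks exhaust 2 GB at about 2000 threads");

// The same address space with 256 KB stack reservations.
static_assert(MaxThreadsByAddressSpace(2 * kGiB, 256 * 1024) == 8192,
              "smaller stacks allow roughly four times as many threads");
```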
Of course, address space isn’t the only limit you can hit. Even on the 4 CPU server box I was experimenting with, the real memory on the system was inadequate for the working set being used. With enough threads, I exceeded real memory and experienced paging. (Okay, it was actually thrashing). But nowadays there are plenty of servers with several GB of real – and real cheap – memory, so this doesn’t have to be an issue.
In my experiments, I simulated server request processing using an artificial work load that combined blocking, allocation, CPU-intensive computation, and a reasonable memory reference set using a mixture of both shared and per-request allocations. In the first experiments, all the threads were ready to run and all of them had equal priority. The result of this was that all threads were scheduled in a round-robin fashion on those 4 CPUs. Since the Windows OS schedules threads preemptively, each thread would execute until it either needed to block or it exceeded its quantum. With hundreds or even thousands of threads, each context switch was extremely painful. That’s because most of the memory used by that thread was so cold in the cache, having been fully displaced by the hundreds of threads that ran before it.
As we all know, modern CPUs are getting faster and faster at raw computation. And they have more and more memory available to them. But access to that memory is getting relatively slower each year. By that, I mean that a single memory access costs the equivalent of an increasing number of instructions. One of the ways the industry tries to mitigate that relative slowdown is through a cache hierarchy. Modern X86 machines have L1, L2 and L3 levels of cache, ordered from fastest and smallest to slowest and largest.
(Other ways we try to mitigate the slowdown are by increasing the locality of our data structures and by pre-fetching. If you are a developer, hopefully you already know about locality. In the unmanaged world, locality is entirely your responsibility. In the managed world, you get some locality benefits from our environment – notably the garbage collector, but also the auto-layout of the class loader. Yet even in managed code, locality remains a major responsibility of each developer).
Unfortunately, context switching between such a high number of threads will largely invalidate all those caches. So I changed my simulated server to be smarter about dispatching requests. Instead of allowing 1000 requests to execute concurrently, I would block 996 of those requests and allow 4 of them to run. This makes life pretty easy for the OS scheduler! There are four CPUs and four runnable threads. It’s pretty obvious which threads should run.
Not only will the OS keep those same four threads executing, it will likely keep them affinitized to the same CPUs. When a thread moves from one CPU to another, the new CPU needs to fill all the levels of cache with data appropriate to the new thread. However, if we can remain affinitized, we can enjoy all the benefits of a warm cache. The OS scheduler attempts to run threads on the CPU that last ran them (soft affinity). But in practice this soft affinity is too soft. Threads tend to migrate between CPUs far more than we would like. When the OS only has 4 runnable threads for its 4 CPUs, the amount of migration seemed to drop dramatically.
Incidentally, Windows also supports hard affinity. If a thread is hard affinitized to a CPU, it either runs on that CPU or it doesn’t run. The CLR can take advantage of this when the GC is executing in its server mode. But you have to be careful not to abuse hard affinity. You certainly don’t want to end up in a situation where all the “ready to run” threads are affinitized to one CPU and all the other CPUs are necessarily stalled.
Also, it’s worth mentioning the impact of hyper-threading or NUMA on affinity. On traditional SMP, our choices were pretty simple. Either our thread ran on its ideal processor, where we are most likely to see all the benefits of a warm cache, or it ran on some other processor. All those other processor choices can be treated as equally bad for performance. But with hyper-threading or NUMA, some of those other CPUs might be better choices than others. In the case of hyper-threading, some logical CPUs are combined into a single physical CPU and so they share access to the same cache memory at some level in the cache hierarchy. For NUMA, the CPUs may be arranged in partitions (e.g. hemispheres on some machines), where each partition has faster access to some memory addresses and slower access to other addresses. In all these cases, there’s some kind of gradient from the very best CPU(s) for a thread to execute on, down to the very worst CPU(s) for that particular thread. The world just keeps getting more interesting.
Anyway, remember that my simulated server combined blocking with other operations. In a real server, that blocking could be due to a web page making a remote call to get rows from a database, or perhaps it could be blocking due to a web service request. If my server request dispatcher only allows 4 requests to be in flight at any time, such blocking will be a scalability killer. I would stall a CPU until my blocked thread is signaled. This would be intolerable.
Many servers address this issue by releasing some multiple of the ideal number of requests simultaneously. If I have 4 CPUs dedicated to my server process, then 4 requests is the ideal number of concurrent requests. If there’s “moderate” blocking during the processing of a typical request, I might find that 8 concurrent requests and 8 threads is a good tradeoff between more context switching and not stalling any CPUs. If I pick too high a multiple over the number of CPUs, then context switching and cache effects will hurt my performance. If I pick too low a multiple, then blocking will stall a CPU and hurt my performance.
If you look at the heuristics inside the managed ThreadPool, you’ll find that we are constantly monitoring the CPU utilization. If we notice that some CPU resources are being wasted, we may be starving the system by not doing enough work concurrently. When this is detected, we are likely to release more threads from the ThreadPool in order to increase concurrency and make better use of the CPUs. This is a decent heuristic, but it isn’t perfect. For instance, CPU utilization is “backwards looking.” You actually have to stall a CPU before we will notice that more work should be executed concurrently. And by the time we’ve injected extra threads, the stalling situation may already have passed.
The OS has a better solution to this problem. IO Completion Ports have a direct link to the blocking primitives in Win32. When a thread is processing a work item from a completion port, if that thread blocks efficiently through the OS, then the blocking primitive will notify the completion port that it should release another thread. (Busy waiting instead of efficient blocking can therefore have a substantial impact on the amount of concurrency in the process). This feedback mechanism with IO Completion Ports is far more immediate and effective than the CLR’s heuristic based on CPU utilization. But in fairness I should point out that if a managed thread performs managed blocking via any of the managed blocking primitives (contentious Monitor.Enter, WaitHandle.WaitOne/Any/All, Thread.Join, GC.WaitForPendingFinalizers, etc.), then we have a similar feedback mechanism. We just don’t have hooks into the OS, so we cannot track all the blocking operations that occur in unmanaged code.
Of course, in my simulated server I didn’t have to worry about “details” like how to track all OS blocking primitives. Instead, I postulated a closed world where all blocking had to go through APIs exposed by my server. This gave me accurate and immediate information about threads either beginning to block or waking up from a blocking operation. Given this information, I was able to tweak my request dispatcher so it avoided any stalling by injecting new requests as necessary.
Although it’s possible to completely prevent stalling in this manner, it’s not possible to prevent context switches. Consider what happens on a 1 CPU machine. We release exactly one request which executes on one thread. When that thread is about to block, we release a second thread. So far, it’s perfect. But when the first thread resumes from its blocking operation, we now have two threads executing concurrently. Our request dispatcher can “retire” one of those threads as soon as it’s finished its work. But until then we have two threads executing on a single CPU and this will impact performance.
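The bookkeeping of such a closed-world dispatcher can be modeled with a few counters. The sketch below is not the simulated server's actual code; `RequestDispatcher` and its method names are hypothetical, and it tracks only counts rather than real threads. It is enough to reproduce the 1 CPU scenario: no stall when a request blocks, but two runnable requests once the blocked one wakes up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a dispatcher in a "closed world" where every blocking
// operation is reported through the server's own APIs.
class RequestDispatcher {
public:
    RequestDispatcher(std::size_t cpuCount, std::size_t pendingRequests)
        : cpuCount_(cpuCount), pending_(pendingRequests) {
        Refill();
    }

    // A running request is about to block: release another so no CPU stalls.
    void OnBeginBlock() {
        assert(runnable_ > 0);
        --runnable_;
        ++blocked_;
        Refill();
    }

    // A blocked request wakes up. Its replacement is still running, so the
    // number of runnable requests can now exceed the number of CPUs.
    void OnEndBlock() {
        assert(blocked_ > 0);
        --blocked_;
        ++runnable_;
    }

    // A request finished: this is the point where the dispatcher can "retire"
    // a thread, or release new work if a CPU would otherwise be idle.
    void OnRequestDone() {
        assert(runnable_ > 0);
        --runnable_;
        Refill();
    }

    std::size_t Runnable() const { return runnable_; }
    std::size_t Blocked() const { return blocked_; }
    std::size_t Pending() const { return pending_; }
    bool Oversubscribed() const { return runnable_ > cpuCount_; }

private:
    // Release pending requests until there is one runnable request per CPU.
    void Refill() {
        while (runnable_ < cpuCount_ && pending_ > 0) {
            --pending_;
            ++runnable_;
        }
    }

    std::size_t cpuCount_;
    std::size_t pending_;
    std::size_t runnable_ = 0;
    std::size_t blocked_ = 0;
};
```

With `RequestDispatcher d(1, 10)`, calling `d.OnBeginBlock()` followed by `d.OnEndBlock()` leaves two runnable requests on one CPU until one of them reports `OnRequestDone()`.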
I suppose we could try to get ruthless in this situation, perhaps by suspending one of the threads or reducing its priority. In practice, it’s never a good idea to suspend an executing thread. If that thread holds any locks that are required by other concurrent execution, we may have triggered a deadlock. Reducing the priority might help and I suspect I played around with that technique. To be honest, I can’t remember that far back.
We’ll see that SQL Server can even solve this context switching problem.
Oh yeah, SQL Server
So what does any of this have to do with SQL Server?
Not surprisingly, the folks who built SQL Server know infinitely more than me about how to get the best performance out of a server. And when the CLR is inside SQL Server, it must conform to their efficient design. Let’s look at their thread mode, first. Fiber mode is really just a refinement over this.
Incoming requests are carried on threads. SQL Server handles a lot of simultaneous requests, so there are a lot of threads in the process. With normal OS “free for all” scheduling, this would result in way too many context switches, as we have seen. So instead those threads are affinitized to a host scheduler / CPU combination. The scheduler tries to ensure that there is one unblocked thread available at any time. All the other threads are ideally blocked. This gives us the nirvana of 100% busy CPUs and minimal context switches. To achieve this nirvana, all the blocking primitives need to cooperate with the schedulers. Even if an event has been signaled and a thread is considered by the application to be “ready to run”, the scheduler may not choose to release it, if the scheduler’s corresponding CPU is already executing another thread. In this manner, the blocking primitive and the scheduler are tightly integrated.
When I built my simulated server, I was able to achieve an ideal “closed world” where all the synchronization primitives were controlled by me. SQL Server attempts the same thing. If a thread needs to block waiting for a data page to be read, or for a page or row latch to be released, that blocking occurs through the SQL Server scheduler. This guarantees that exactly one thread is available to run on each CPU, as we’ve seen.
Of course, execution of managed code also hits various blocking points. Monitor.Enter (‘lock’ in C# and ‘SyncLock’ in VB.NET) is a typical case. Other cases include waiting for a GC to complete, waiting for class construction or assembly loading or type loading to occur, waiting for a method to be JITted, or waiting for a remote call or web service to return. For SQL Server to hit their performance goals and to avoid deadlocks, the CLR must route all of these blocking primitives to SQL Server (or any other similar host) through the new Whidbey hosting APIs.
Leaving the Closed World
But what about synchronization primitives that are used for coordination with unmanaged code and which have precise semantics that SQL Server cannot hope to duplicate? For example, WaitHandle and its subtypes (like Mutex, AutoResetEvent and ManualResetEvent) are thin wrappers over the various OS waitable handles. These primitives provide atomicity guarantees when you perform a WaitAll operation on them. They have special behavior related to message pumping. And they can be used to coordinate activity across multiple processes, in the case of named primitives. It’s unrealistic to route operations on WaitHandle through the hosting APIs to some equivalent host-provided replacements.
This issue with WaitHandle is part of a more general problem. What happens if I PInvoke from managed code to an OS service like CoInitialize or LoadLibrary or CryptEncrypt? Do those OS services block? Well, I know that LoadLibrary will have to take the OS loader lock somewhere. I could imagine that CoInitialize might need to synchronize something, but I have no real idea. One thing I am sure of: if any blocking happens, it isn’t going to go through SQL Server’s blocking primitives and coordinate with their host scheduler. The idealized closed world that SQL Server needs has just been lost.
The solution here is to alert the host whenever a thread “leaves the runtime”. In other words, if we are PInvoking out, or making a COM call, or the thread is otherwise transitioning out to some unknown unmanaged execution, we tell the host that this is happening. If the host is tracking threads as closely as SQL Server does, it can use this event to disassociate the thread from the host scheduler and release a new thread. This ensures that the CPU stays busy. That’s because even if the disassociated thread blocks, we’ve released another thread. This newly released thread is still inside our closed world, so it will notify before it blocks so we can guarantee that the CPU won’t stall.
Wait a second. The CLR did a ton of work to re-route most of its blocking operations through the host. But we could have saved almost that entire ton of engineering effort if we had just detached the thread from the host whenever SQL Server called into managed code. That way, we could freely block and we wouldn’t disrupt the host’s scheduling decisions.
This is true, but it won’t perform as well as the alternative. Whenever a thread disassociates from a host scheduler, another thread must be released. This guarantees that the CPU is busy, but it has sacrificed our nirvana of only having a single runnable thread per CPU. Now we’ve got two runnable threads for this CPU and the OS will be preemptively context-switching between them as they run out of quantum.
If a significant amount of the processing inside a host is performed through managed code, this would have a serious impact on performance.
Indeed, if a significant amount of the processing inside a host is performed in unmanaged code, called via PInvokes or COM calls or other mechanisms that “leave the runtime”, this too can have a serious impact on performance. But, for practical purposes, we expect most execution to remain inside the host or inside managed code. The amount of processing that happens in arbitrary unmanaged code should be low, especially over time as our managed platform grows to fill in some of the current gaps.
Of course, some PInvokes or COM calls might be to services that were exported from the host. We certainly don’t want to disassociate from the host scheduler every time the in-process ADO provider performs a PInvoke back to SQL Server to get some data. This would be unnecessary and expensive. So there’s a way for the host to control which PInvoke targets perform a “leave runtime” / “return to runtime” pair and which ones are considered to remain within the closed world of our integrated host + runtime.
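As a rough model of that decision, consider the sketch below. It is not the Whidbey hosting interface; `HostSchedulerModel`, its methods and the target names used with it are all hypothetical, and it only counts threads for a single host scheduler / CPU. The point it illustrates is that a call to an arbitrary unmanaged target costs an extra runnable thread, while a call to a service the host has marked as its own does not.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical model of one host scheduler / CPU pair.
class HostSchedulerModel {
public:
    // The host declares which call targets are its own exported services.
    void MarkAsHostService(const std::string& target) {
        hostServices_.insert(target);
    }

    // Called before a thread transitions out to unmanaged code.
    // Returns true if the thread was disassociated from the scheduler, in
    // which case another thread is released to keep the CPU busy.
    bool LeaveRuntime(const std::string& target) {
        if (hostServices_.count(target) != 0) {
            return false;  // still inside the closed world
        }
        ++disassociated_;
        return true;
    }

    // Called when the unmanaged call returns.
    void ReturnToRuntime(bool wasDisassociated) {
        if (!wasDisassociated) {
            return;
        }
        assert(disassociated_ > 0);
        --disassociated_;
    }

    // Threads the OS may consider runnable for this CPU: the one thread inside
    // the closed world, plus every thread currently disassociated from it.
    std::size_t OsRunnableThreads() const { return 1 + disassociated_; }

private:
    std::set<std::string> hostServices_;
    std::size_t disassociated_ = 0;
};
```

A real host would also have to decide when a returning thread may rejoin its scheduler; the model skips that step.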
Even if we were willing to tolerate the substantial performance impact of considering all of the CLR to be outside the host’s closed world (i.e. we disassociated from the host’s scheduler whenever we ran managed code), this approach would be inadequate when running in fiber mode. That’s because of the nasty effects which thread affinity can have on a fiber-based system.
Fiber Mode
As we’ve seen, SQL Server and other “extreme” hosts can ensure that at any time each CPU has only a single thread within the closed world that is ready to run. But when SQL Server is in thread mode, there are still a large number of threads that aren’t ready to run. It turns out that all those blocked threads impose a modest cost upon the OS preemptive scheduler. And that cost becomes an increasing consideration as the number of CPUs increases. For 1, 2, 4 and probably 8 CPU machines, fiber mode isn’t worth the headaches we’re about to discuss. But by the time you get to a larger machine, you might achieve something like a 20% throughput boost by switching to fiber mode. (I haven’t seen real numbers in a year or two, so please take that 20% as a vague ballpark).
Fiber mode simply eliminates all those extra threads from any consideration by the OS. If you stay within the idealized nirvana (i.e. you don’t perform a “leave runtime” operation), there is only one thread for each host scheduler / CPU. Of course, there are many stacks / register contexts and each such stack / register context corresponds to an in-flight request. When a stack is ready to run, the single thread switches away from whatever stack it was running and switches to the new stack. But from the perspective of the OS scheduler, it just keeps running the only thread it knows about.
So in both thread mode and fiber mode, SQL Server uses non-preemptive host scheduling of these tasks. This scheduling happens in user mode, which is a distinct advantage over the OS preemptive scheduling which happens in kernel mode. The only difference is whether the OS scheduler is aware of all the tasks on the host scheduler, or whether they all look like a single combined thread – albeit with different stacks and register contexts.
But the impact of this difference is significant. First, it means that there is an M:N relationship between stacks (logical CLR threads) and OS threads. This is M:N because multiple stacks will execute on a single thread, and because the specially nominated thread that carries those stacks can change over time. This change in the nominated thread occurs as a consequence of those “leave runtime” calls. Remember that when a thread leaves the runtime, we inform the host which disassociates the thread from the host scheduler. A new thread is then created or obtained from a short list of already-created threads. This new thread then picks up the next stack that is ready to run. The effect is that this stack has migrated from the original disassociated thread to the newly nominated thread.
This M:N relationship between stacks and OS threads causes problems everywhere that thread affinity would normally occur. I’ve already mentioned CPU affinity when discussing how threads are associated with CPUs. But now I’m talking about a different kind of affinity. Thread affinity is the association between various programmatic operations and the thread that these operations must run on. For example, if you take an OS critical section by calling EnterCriticalSection, the resulting ownership is tied to your thread. Sometimes developers say that the OS critical section is scoped to your thread. You must call LeaveCriticalSection from that same thread.
None of this is going to work properly if your logical thread is asynchronously and apparently randomly migrating between different physical threads. You’ll successfully take the critical section on one logical thread. If you attempt to recursively acquire this critical section, you will deadlock if a migration has intervened. That’s because it will look like a different physical thread is actually the owner.
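The failure can be reproduced in portable C++ with a toy lock that, like an OS critical section, records the physical thread as its owner. This is a sketch under stated assumptions: `ThreadOwnedLock` and `RecursiveAcquireSucceeds` are illustrative names, the lock returns false where a real lock would block forever, and the "migration" is simulated by running the second acquire on a separate `std::thread`.

```cpp
#include <cassert>
#include <thread>

// Toy lock whose ownership is tied to the physical thread.
// It is not thread-safe; the example only touches it from one thread at a time.
class ThreadOwnedLock {
public:
    bool TryEnter() {
        const std::thread::id self = std::this_thread::get_id();
        if (depth_ > 0 && owner_ != self) {
            return false;  // a real lock would block here: deadlock
        }
        owner_ = self;
        ++depth_;
        return true;
    }

private:
    std::thread::id owner_;
    int depth_ = 0;
};

// One logical task acquires the lock twice. If the task "migrates" between the
// two acquires, the second acquire runs on a different physical thread.
inline bool RecursiveAcquireSucceeds(bool migrateBetweenAcquires) {
    ThreadOwnedLock lock;
    const bool first = lock.TryEnter();  // runs on the caller's thread
    assert(first);

    bool second = false;
    if (migrateBetweenAcquires) {
        // The caller's thread stays alive, so the carrier has a distinct id.
        std::thread carrier([&] { second = lock.TryEnter(); });
        carrier.join();
    } else {
        second = lock.TryEnter();
    }
    return first && second;
}
```

`RecursiveAcquireSucceeds(false)` returns true, while `RecursiveAcquireSucceeds(true)` returns false because the lock sees a different physical thread as the owner.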
Imagine writing some hypothetical code inside the CLR:
    EnterCriticalSection(pCS);
    if (pGlobalBlock == NULL)
        pGlobalBlock = Alloc(count);
    LeaveCriticalSection(pCS);
Obviously any real CLR code would be full of error handling, including a ‘finally’ clause to release the lock. And we don’t use OS critical sections directly since we typically reflect them to an interested host as we’ve discussed. And we instrument a lot of this stuff, including spinning during lock acquisition. And we wrap the locks with lots of logic to avoid deadlocks, including GC-induced deadlocks. But let’s ignore all of the goop that would be necessary for real CLR code.
It turns out that the above code has a thread affinity problem. Even though SQL Server’s fiber scheduling is non-preemptive, scheduling decisions can still occur whenever we call into the host. For reasons that I’ll explain later, all memory allocations in the CLR have the potential to call into the host and result in scheduling. Obviously most allocations will be satisfied locally in the CLR without escalation to the host. And most escalations to the host still won’t cause a scheduling decision to occur. But from a correctness perspective, all allocations have the potential to cause scheduling.
Other places where thread affinity can bite us include:
- The OS Mutex and the managed System.Threading.Mutex wrapper.
- LoadLibrary and DllMain interactions. As I’ve explained in my blog entry on Shutdown, DllMain notifications occur on a thread which holds the OS loader lock.
- TLS (thread local storage). It’s worth mentioning that, starting with Windows Server 2003, there are new FLS (fiber local storage) APIs. These APIs allow you to associate state with the logical rather than the physical thread. When a fiber is associated with a thread for execution (SwitchToFiber), the FLS is automatically moved from the fiber onto the thread. For managed TLS, we now move this automatically. But we cannot do this unconditionally for all the unmanaged TLS.
- Thread culture or locale, the impersonation context or user identity, the COM+ transaction context, etc. In some sense, these are just special cases of thread local storage. However, for historical reasons it isn’t possible to solve these problems by moving them to FLS.
- Taking control of a thread for GC, Abort, etc. via the OS SuspendThread() service.
- Any use of ThreadId or Thread Handle. This includes all debugging.
- “Hand-rolled” locks that we cannot discover or reason about, and which you have inadvertently based on the physical OS thread rather than the logical thread or fiber.
- Various PInvokes or COM calls that might end up in unmanaged code with affinity requirements. For instance, MSHTML can only be called on STA threads which are necessarily affinitized. Of course, there is no list of all the APIs that have odd threading behavior. It’s a minefield out there.
Solving affinity issues is relatively simple. The hard part is identifying all the places. Note that the last two bullet items are actually the application’s responsibility to identify. Some application code might appear to execute correctly when logical threads and OS threads are 1:1. But when a host creates an M:N relationship, any latent application bugs will be exposed.
In many cases, the easiest solution to a thread affinity issue is to disassociate the thread from the host’s scheduler until the affinity is no longer required. The hosting APIs provide for this, and we’ve taken care of it for you in many places – like System.Threading.Mutex.
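The TLS item in the list above can be illustrated without any Win32 calls. In this sketch, `thread_local` stands in for unmanaged TLS and a plain struct stands in for state that travels with the logical task, which is the role FLS plays for fibers. All names are hypothetical, and the migration is simulated by resuming the task on a second `std::thread`.

```cpp
#include <cassert>
#include <thread>

// State stored against the physical thread (what TLS gives you).
thread_local int tlsRequestId = 0;

// State carried along with the logical task (the role FLS plays for fibers).
struct LogicalTask {
    int requestId = 0;
};

struct Observed {
    int fromTls;
    int fromTask;
};

// The first half of the task runs on the caller's thread and records the
// request id in both places. The second half resumes on another thread.
inline Observed ResumeOnAnotherThread() {
    LogicalTask task;
    tlsRequestId = 42;
    task.requestId = 42;

    Observed seen{};
    std::thread carrier([&] {
        seen.fromTls = tlsRequestId;     // the carrier's own, untouched slot
        seen.fromTask = task.requestId;  // state that moved with the task
    });
    carrier.join();

    assert(tlsRequestId == 42);  // the original thread still holds the value
    tlsRequestId = 0;            // restore the caller's slot
    return seen;
}
```

After the simulated migration, `fromTls` is 0 and `fromTask` is 42: the per-thread value stayed behind on the original thread.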
Before we finish our discussion of locking, there’s one more aspect worth mentioning. In an earlier blog, I mentioned the limited deadlock detection and deadlock breaking which the CLR performs when executing class constructors or JITting.
Except for this limited case, the CLR doesn’t concern itself with application-level deadlocks. If you write some managed code that takes a set of locks in random order, resulting in a potential deadlock, we consider that to be your application bug. But some hosts may be more helpful. Indeed, SQL Server has traditionally detected deadlocks in all data accesses. When a deadlock occurs, SQL Server selects a victim and aborts the corresponding transaction. This allows the other requests implicated in the deadlock to proceed.
With the new Whidbey hosting APIs, it’s possible for the host to walk all contentious managed locks and obtain a graph of the participants. This support extends to locking through our Monitor and our ReaderWriterLock. Clearly, an application could perform locking through other means. For example, an AutoResetEvent can be used to simulate mutual exclusion. But it’s not possible for such locks to be included in the deadlock algorithms, since there isn’t a strong notion of lock ownership that we can use.
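What a host might do with such a graph can be sketched as a walk over "waits for" edges. This is not the hosting API or SQL Server's algorithm; the types and the function name are made up, and the model assumes each lock has a single owner and each blocked task waits on a single lock (a reader/writer lock with several readers would need a more general graph).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using TaskId = int;
using LockId = int;

// Hypothetical snapshot of contended locks, as a host might assemble it.
struct LockGraph {
    std::map<LockId, TaskId> owner;       // lock -> task that holds it
    std::map<TaskId, LockId> waitingFor;  // blocked task -> lock it wants
};

// Follows task -> wanted lock -> owning task until the chain ends or revisits
// a task. Returns the tasks on the cycle reached from `start`, or an empty
// vector if the chain ends at a task that is not blocked.
inline std::vector<TaskId> FindDeadlockCycle(const LockGraph& g, TaskId start) {
    std::vector<TaskId> path;
    std::map<TaskId, std::size_t> position;  // task -> index in path
    TaskId current = start;
    for (;;) {
        auto seen = position.find(current);
        if (seen != position.end()) {
            assert(seen->second < path.size());
            return std::vector<TaskId>(
                path.begin() + static_cast<std::ptrdiff_t>(seen->second),
                path.end());
        }
        position[current] = path.size();
        path.push_back(current);

        auto wait = g.waitingFor.find(current);
        if (wait == g.waitingFor.end()) {
            return {};  // current task is not blocked: no deadlock on this chain
        }
        auto own = g.owner.find(wait->second);
        if (own == g.owner.end()) {
            return {};  // the wanted lock is free
        }
        current = own->second;
    }
}
```

Any task on the returned cycle is a candidate victim. An AutoResetEvent used as a lock never appears in `owner`, which is why such locks cannot take part in the analysis.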
Once the host has selected a deadlock victim, it must cause that victim to abort its forward progress somehow. If the victim is executing managed code, some obvious ways to do this include failing the lock attempt (since the thread is necessarily blocking), aborting the thread, or even unloading the AppDomain. We’ll return to the implications of this choice in the section on Reliability below.
Finally, it’s interesting to consider how one might get even better performance than what SQL Server has achieved. We’ve seen how fiber mode eliminates all the extra threads, by multiplexing a number of stacks / register contexts onto a single thread. What happens if we then eliminate all those fibers? For a dedicated server, we can achieve even better performance by forcing all application code to maintain its state outside of a thread’s stack. This allows us to use a single thread per CPU which executes user requests by processing them on its single dedicated stack. All synchronous blocking is eliminated by relying on asynchronous operations. The thread never yields while holding its stack pinned. The amount of memory required to hold an in-flight request will be far less than a 256 KB stack reservation. And the cost of processing an asynchronous completion through polling can presumably be less than the cost of a fiber context switch.
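A minimal sketch of that style follows, assuming a hypothetical three-step request and an asynchronous read whose completion is signaled by setting a flag. Each in-flight request is a small struct rather than a stack, and one loop per CPU polls the requests and advances whichever can make progress without blocking. None of the names come from a real server.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

enum class Step { ParseRequest, AwaitRead, BuildReply, Done };

// All per-request state lives here instead of on a thread's stack.
struct Request {
    Step step = Step::ParseRequest;
    bool readCompleted = false;  // set by a (hypothetical) async completion
    std::string reply;
};

// Advances a request by at most one step and never blocks.
inline void Advance(Request& r) {
    switch (r.step) {
    case Step::ParseRequest:
        r.step = Step::AwaitRead;  // a real server would start the read here
        break;
    case Step::AwaitRead:
        if (r.readCompleted) {
            r.step = Step::BuildReply;
        }
        break;
    case Step::BuildReply:
        assert(r.readCompleted);
        r.reply = "ok";
        r.step = Step::Done;
        break;
    case Step::Done:
        break;
    }
}

// One pass of the per-CPU loop. Returns how many requests finished this pass.
inline std::size_t PollOnce(std::deque<Request>& inFlight) {
    std::size_t finished = 0;
    for (Request& r : inFlight) {
        if (r.step == Step::Done) {
            continue;
        }
        Advance(r);
        if (r.step == Step::Done) {
            ++finished;
        }
    }
    return finished;
}
```

The cost of this model shows up in `Advance`: every point where the logic would have blocked becomes an explicit state, and the application author has to write each one by hand.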
If all you care about is performance, this is an excellent way to build a server. But if you need to accommodate 3rd party applications inside the server, this approach is questionable. Most developers have a difficult time breaking their logic into segments which can be separately scheduled with no stack dependencies. It’s a tedious programming model. Also, the underlying Windows platform still contains a lot of blocking operations that don’t have asynchronous variants available. WMI is one example.
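A minimal sketch of what "state outside of a thread’s stack" looks like in practice. The LookupRequest and SingleThreadedServer types are hypothetical, and the in-memory dictionary stands in for an asynchronous read that has already completed. The point is only the shape: each request is cut into segments, and between segments nothing about the request lives on the stack.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical request whose state lives on the heap, not on a thread stack.
public sealed class LookupRequest
{
    public int Stage;          // which segment runs next
    public string Key;
    public string Result;

    // Runs one non-blocking segment; returns true when the request is complete.
    public bool RunNextSegment(Dictionary<string, string> store)
    {
        switch (Stage)
        {
            case 0:                     // segment 0: validate, then "issue" the read
                Key = Key.Trim();
                Stage = 1;
                return false;           // yield: the stack unwinds completely
            default:                    // segment 1: consume the completed read
                string value;
                Result = store.TryGetValue(Key, out value) ? value : "(missing)";
                return true;
        }
    }
}

public static class SingleThreadedServer
{
    // One loop per CPU: dequeue a ready request, run one segment, requeue if unfinished.
    public static void Drain(Queue<LookupRequest> ready, Dictionary<string, string> store)
    {
        while (ready.Count > 0)
        {
            LookupRequest request = ready.Dequeue();
            if (!request.RunNextSegment(store)) ready.Enqueue(request);
        }
    }
}
```

Even in this toy, the logic of one request is smeared across a switch statement and a Stage field, which hints at why the model gets tedious for real application code.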
Memory Management
Servers must not page.
Like all rules, this one isn’t strictly true. It is actually okay to page briefly now and then, when the work load transitions from one steady state to another. But if you have a server that is routinely paging, then you have driven that server beyond its capacity. You need to reduce the load on the server or increase the server’s memory capacity.
At the same time, it’s important to make effective use of the memory capacity of a server. Ideally, a database would store the entire database contents in memory. This would allow it to avoid touching the disk, except to write the durable log that protects it from data loss and inconsistency in the face of catastrophic failure. Of course, the 2 or 3 GB limit of Win32 is far too restrictive for most interesting databases. (SQL Server can use AWE to escape this limit, at some cost.) And even the address limits of Win64 are likely to be exceeded by databases presently. That’s because Win64 does not give you a full 64 bits of addressing and databases are already heading into the petabytes.
So a database needs to consider all the competing demands for memory and make wise decisions about which ones to satisfy. Historically, those demands have included the buffer cache which contains data pages, compiled query plans, and all those thread stacks. When the CLR is loaded into the process, significant additional memory is required for the GC heap, application code, and the CLR itself. I’m not sure what techniques SQL Server uses to trade off the competing demands for memory. Some servers carve memory up based on fixed ratios for the different broad uses, and then rely on LRU within each memory chunk. Other servers assign a cost to each memory type, which indicates how expensive it would be to regenerate that memory. For example, in the case of a data page, that cost is an IO.
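The cost-based approach can be sketched as follows. This is an illustration under assumptions, not a description of SQL Server: the CacheEntry and MemoryBroker types, the cost units, and the "cheapest to regenerate per byte goes first" rule are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CacheEntry
{
    public string Name;
    public long Bytes;
    public double RegenerationCost;   // hypothetical units, e.g. estimated milliseconds to rebuild
}

public static class MemoryBroker
{
    // Evicts the entries that are cheapest to regenerate per byte until
    // at least 'bytesNeeded' has been recovered. Returns the evicted entries.
    public static List<CacheEntry> ChooseEvictions(IEnumerable<CacheEntry> entries, long bytesNeeded)
    {
        var evicted = new List<CacheEntry>();
        long recovered = 0;
        foreach (CacheEntry entry in entries.OrderBy(e => e.RegenerationCost / e.Bytes))
        {
            if (recovered >= bytesNeeded) break;
            evicted.Add(entry);
            recovered += entry.Bytes;
        }
        return evicted;
    }
}
```

Because every consumer is expressed in the same currency, a data page (one IO to reload) and a compiled query plan (a recompile) can compete in a single ranking instead of in separately sized pools.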
Some servers use elaborate throttling of inbound requests, to keep the memory load reasonable. This is relatively easy to do when all requests are comparable in terms of their memory and CPU requirements. But if some queries access a single database page and other queries touch millions of rows, it would be hard to factor this into a throttling decision that is so far upstream from the query processor. Instead, SQL Server tends to accept a large number of incoming requests and process them “concurrently.” We’ve already seen in great detail why this concurrent execution doesn’t actually result in preemptive context switching between all the corresponding tasks. But it is still the case that each request will hold onto some reference set of memory, even when the host’s non-preemptive scheduler has that request blocked.
If enough requests are blocked while holding onto significant unshared memory, then the server process may find itself over-committed on memory. At this point, it could page – which hurts performance. Or it could kill some of the requests and free up the resources they are holding onto. This is an unfortunate situation, because we’ve presumably already devoted resources like the CPU to get the request to its current state of partial completion. If we throw away the request, all that work was wasted. And the client is likely to resubmit the request, so we will have to repeat all that work soon.
Nevertheless, if the server is over-committed and it’s not practical to recover more memory by e.g. shrinking the number of pages devoted to the buffer cache, then killing in-flight requests is a sound strategy. This is particularly reasonable in database scenarios, since the transactional nature of database operations means that we can kill requests at any time and with impunity.
Unfortunately, the world of arbitrary managed execution has no transactional foundation we can rely on. We’ll pick up this issue again below, in the section on Reliability.
It should be obvious that, if SQL Server or any other host is going to make wise decisions about memory consumption on a “whole process” basis, that host needs to know exactly how much memory is being used and for what purposes. For example, before the host unloads an AppDomain as a way of backing out of an over-committed situation, the host needs some idea of how many megabytes this unload operation is likely to deliver.
In the reverse direction, the host needs to be able to masquerade as the operating system. For instance, the CLR’s GC monitors system memory load and uses this information in its heuristics for deciding when to schedule a collection. The host needs a way to influence these collection decisions.
SQL Server and ASP.NET
Clearly a lot of work went into threading, synchronization and memory management in SQL Server. One obvious question to ask is how ASP.NET compares. They are both server products from Microsoft and they both execute managed code. Why didn’t we need to add all this support to the hosting interfaces in V1 of the CLR, so we could support ASP.NET?
I think it’s fair to say that ASP.NET took a much simpler approach to the problem of building a scalable server. To achieve efficient threading, they rely on the managed ThreadPool’s heuristics to keep the CPUs busy without driving up too many context switches. And since the bulk of memory allocations are due to the application, rather than the ASP.NET infrastructure (in other words, they aren’t managing large shared buffer pools for data pages), it’s not really possible for ASP.NET to act as a broker for all the different memory consumers. Instead, they just monitor the total memory load, and recycle the worker process if a threshold is exceeded.
(Incidentally, V1 of ASP.NET and the CLR had an unfortunate bug with the selection of this threshold. The default point at which ASP.NET would recycle the process was actually a lower memory load than the point at which the CLR’s GC would switch to a more aggressive schedule of collections. So we were actually killing the worker process before the CLR had a chance to deliver more memory back to the application. Presumably in Whidbey this selection of default thresholds is now coordinated between the two systems.)
How can ASP.NET get away with this simpler approach?
It really comes down to their fundamental goals. ASP.NET can scale out, rather than having to scale up. If you have more incoming web traffic, you can generally throw more web servers at the problem and load balance between them. Whereas SQL Server can only scale out if the data supports this. In some cases, it does. There may be a natural partitioning of the data, like access to the Hotmail mailbox for a particular incoming user. But in too many other cases, the data cannot be sufficiently partitioned and the server must be scaled up. On x86 Windows, the practical limit is a 32-way CPU with a hard limit of 3 GB of user address space. If you want to keep increasing your work load on a single box, you need to use every imaginative trick – like fibers or AWE – to eke out all possible performance.
There’s also an availability issue. ASP.NET can recycle worker processes quite quickly. And if they have scaled out, recycling a worker process on one of the computers in the set will have no visible effect on the availability of the set of servers. But SQL Server may be limited to a single precious process. If that process must be recycled, the server is unavailable. And recycling a database is more expensive than recycling a stateless ASP.NET worker process, because transaction logs must be replayed to move the database forwards or backwards to a consistent state.
The short answer is, ASP.NET didn’t have to do all the high tech fancy performance work. Whereas SQL Server was forced down this path by the nature of the product they must build.
Reliability
Well, if you haven’t read my earlier blogs on asynchronous exceptions, or if – like me – you read the Reliability blog back in June and don’t remember what it said – you might want to review it quickly at http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx.
The good news is that we’ve revisited the rules for ThreadAbortException in Whidbey, so that there is now a way to abort a thread without disturbing any backout code that it is currently running. But it’s still the case that asynchronous exceptions can intrude at fairly arbitrary spots in the execution.
Anyway, the availability goals of SQL Server place some rather difficult requirements on the CLR. Sure, we were pretty solid in V1 and V1.1. We ran a ton of stress and – if you avoided stack overflow, running out of memory, and any asynchronous exceptions like Thread.Abort – we could run applications indefinitely. We really were very clean.
One problem with this is that “indefinitely” isn’t long enough for SQL Server. They have a noble goal of chasing 5 9’s and you can’t get there with loose statements like “indefinitely”. Another problem is that we can no longer exclude OutOfMemoryException and ThreadAbortException from our reliability profile. We’ve already seen that SQL Server tries to use 100% of memory, without quite triggering paging. The effect is that SQL Server is always on the brink of being out of memory, so allocation requests are frequently being denied. Along the same lines, if the server is loaded it will allow itself to become over-committed on all resources. One strategy for backing out of an over-commitment is to abort a thread (i.e. kill a transaction) or possibly unload one or more AppDomains.
Despite this stressful abuse, at no time can the process terminate.
The first step to achieve this was to harden the CLR so that it was resilient to any resource failures. Fortunately we have some extremely strong testers. One tester built a system to inject a resource failure in every allocator, for every unique logical call stack. This tests every distinct backout path in the product. This technique can be used for unmanaged and managed (FX) code. That same tester is also chasing any unmanaged leaks by applying the principles of a tracing garbage collector to our unmanaged CLR data structures. This technique has already exposed a small memory leak that we shipped in V1 of the CLR – for the “Hello World” application!
With testers like that, you better have a strong development team too. At this point, I think we’ve annotated the vast majority of our unmanaged CLR methods with reliability contracts. These are a bit like Eiffel pre- and post-conditions and they provide machine-verifiable statements about each method’s behavior with respect to GC, exceptions, and other fundamental operations. These contracts can be used during test coverage (and, in some cases, during static scans of the binary images) to test for conformance.
The bottom line is that the next release of the CLR should be substantially more robust in the face of resource errors. Leaving aside stack overflows and focusing entirely on the unmanaged runtime, we are shooting for perfection. Even for stack overflow, we expect to get very, very close. And we have the mechanisms in place that allow us to be rigorous in chasing after these goals.
But what about all of the managed code?
Will FX be as robust as the unmanaged CLR? And how can we possibly hold 3rd party authors of stored procedures or user defined functions to that same high bar? We want to enable a broad class of developers to write this sort of code, and we cannot expect them to perform many hundreds of hours of stress testing and fault injection on each new stored procedure. If we’re chasing 5 9’s by requiring every external developer to write perfect code, we should just give up now.
Instead, SQL Server relies on something other than perfect code. Consider how SQL Server worked before it started hosting the CLR:
The vast majority of execution inside SQL Server was via Transact SQL or TSQL. Any application written in TSQL is inherently scalable, fiber-aware, and robust in the face of resource errors. Any computation in TSQL can be terminated with a clean transaction abort.
Unfortunately, TSQL isn’t expressive enough to satisfy all application needs. So the remaining applications were written in extended stored procedures or xprocs. These are typically unmanaged C++. Their authors must be extremely sophisticated, because they are responsible for integrating their execution with the unusual threading environment and resource rules that exist inside SQL Server. Throw in the rules for data access and security (which I won’t be discussing in this blog) and it takes superhuman knowledge and skill to develop a bug-free xproc.
In other words, you had a choice of well-behaved execution and limited expression (TSQL), or the choice of arbitrary execution coupled with a very low likelihood that you would get it right (xprocs).
One of the shared goals of the SQL Server and CLR teams in Whidbey was to eliminate the need for xprocs. We wanted to provide a spectrum of choices to managed applications. In Whidbey, that spectrum consists of three buckets for managed code:
1) Safe: Code in this bucket is the most constrained. In fact, the host constrains it beyond what the CLR would normally allow to code that’s only granted SecurityPermissionFlag.Execution. So this code must be verifiably typesafe and has a reduced grant set. But it is further constrained from defining mutable static fields, from creating or controlling threads, from using the threadpool, etc. The goal here is to guide the code to best practices for scalability and robustness within the SQL Server or similar hosted environments. In the case of SQL Server, this means that all state should be stored in the database and that concurrency is controlled through transactions against the data. However, it’s important to realize that these additional constraints are not part of the Security system and they may well be subvertible. The constraints are simply speedbumps (not roadblocks) which guide the application code away from potentially non-scalable coding techniques and which encourage best practices.
2) External Access: Code in this bucket should be sufficient for replacing most xprocs. Such code must also be verifiably typesafe, but it is granted some additional permissions. The exact set of permissions is presumably subject to change until Yukon ships, but it’s likely to allow access to the registry, the file system, and the network.
3) Unsafe: This is the final managed escape hatch for writing code inside SQL Server. This code does not have to be verifiable. It has FullTrust (with the possible exception of UIPermission, which makes no sense within the database). This means that it can do anything the most arbitrary xproc can do. However, it is much more likely to work properly, compared to that xproc. First, it sits on top of a framework that has been designed to work inside the database. Second, the code has all the usual benefits of managed code, like a memory manager that’s based on accurate reachability rather than on programmer correctness. Finally, it is executing on a runtime that understands the host’s special rules for resource management, synchronization, threading, security, etc.
For code in the Safe bucket, you may be wondering how a host could constrain code beyond SecurityPermissionFlag.Execution. There are two techniques available for this:
1) Any assembly in the ‘Safe’ subset could be scanned by a host-provided pre-verifier, to check for any questionable programming constructs like the definition of mutable static fields, or the use of reflection. This raises the obvious question of how the host can interject itself into the binding process and guarantee that only pre-verified assemblies are loaded. The new Whidbey hosting APIs contain a Fusion loader hook mechanism, which allows the host to abstract the notion of an assembly store, without disturbing all our normal loader policy. You can think of this as the natural evolution of the AppDomain.AssemblyResolve event. SQL Server can use this mechanism to place all application assemblies into the database and then deliver them to the loader on demand. In addition to enabling pre-verification, the loader hooks can also be used to ensure that applications inside the database are not inadvertently broken or influenced by changes outside the database (e.g. changes to the GAC). In fact, you could even copy a database from one machine to another and theoretically this could automatically transfer all the assemblies required by that database.
2) The Whidbey hosting APIs provide controls over a new Host Protection Attribute (HPA) feature. Throughout our frameworks, we’ve decorated various unprotected APIs with an appropriate HPA. These HPAs indicate that the decorated API performs a sensitive operation like Synchronization or Thread Control. For instance, use of the ThreadPool isn’t considered a secure operation. (At some level, it is a risk for Denial of Service attacks, but DOS remains an open design topic for our managed platform). If code is running outside of a host that enables these HPAs, they have no effect. Partially trusted code, including code that only has Execution permission, can still call all these APIs. But if a host does enable these attributes, then code with insufficient trust can no longer call these APIs directly. Indirect calls are still permitted, and in this sense the HPA mechanism is similar to the mechanism for LinkDemands.
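As an illustration of the first technique, here is a minimal sketch of one check a pre-verifier might perform: flagging mutable static fields. It is an assumption-laden toy, not the host’s actual verifier. The SafeSubsetPreVerifier and Counters names are hypothetical, and it uses ordinary reflection over an already loaded assembly, whereas a real pre-verifier would presumably inspect the image before allowing it to load.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class SafeSubsetPreVerifier
{
    // Returns "Type.Field" for every static field that can be reassigned after
    // initialization (neither a compile-time constant nor readonly).
    public static List<string> FindMutableStatics(Assembly assembly)
    {
        var offenders = new List<string>();
        const BindingFlags flags = BindingFlags.Static | BindingFlags.Public |
                                   BindingFlags.NonPublic | BindingFlags.DeclaredOnly;
        foreach (Type type in assembly.GetTypes())
            foreach (FieldInfo field in type.GetFields(flags))
                if (!field.IsLiteral && !field.IsInitOnly)
                    offenders.Add(type.FullName + "." + field.Name);
        return offenders;
    }
}

// Sample type for the scan: one offender, one readonly field, one constant.
public static class Counters
{
    public static int Hits;                   // flagged: mutable static state
    public static readonly int Limit = 10;    // not flagged: readonly
    public const int Max = 5;                 // not flagged: compile-time constant
}
```

Note the limits of such a check. A readonly static that refers to a mutable object (a list, say) still carries shared mutable state, and compiler-generated types can contain static fields the author never wrote. That fits the "speedbump, not roadblock" framing: the scan guides code toward good practice without claiming to be airtight.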
Although HPAs use a mechanism that is similar to LinkDemands, it’s very important to distinguish the HPA feature – which is all about programming model guidance – from any Security feature. A great way to illustrate this distinction is Monitor.Enter.
Ignoring HPAs, any code can call Monitor.Enter and use this API to synchronize with other threads. Naturally, SQL Server would prefer that most developers targeting their environment (including all the naïve ones) should rely on database locks under transaction control for this sort of thing. Therefore they activate the HPA on this class:
[HostProtection(Synchronization=true, ExternalThreading=true)]
public sealed class Monitor
{
    ...
    [MethodImplAttribute(MethodImplOptions.InternalCall)]
    public static extern void Enter(Object obj);
    ...
}
However, devious code in the ‘Safe’ bucket could use a Hashtable as an alternate technique for locking. If you create a synchronized Hashtable and then perform inserts or lookups, your Object.Equals and GetHashCode methods will be called within the lock that synchronizes the Hashtable. The BCL developers were smart enough to realize this, and they added another HPA:
public class Hashtable : IDictionary, ISerializable,
                         IDeserializationCallback, ICloneable
{
    ...
    [HostProtection(Synchronization=true)]
    public static Hashtable Synchronized(Hashtable table) {
        if (table==null)
            throw new ArgumentNullException("table");
        return new SyncHashtable(table);
    }
    ...
}
Are there other places inside the frameworks where it’s possible to trick an API into providing synchronization for its caller? Undoubtedly there are, but we aren’t going to perform exhaustive audits of our entire codebase to discover them all. As we find additional APIs, we will decorate them with HPAs, but we make no guarantees here.
This would be an intolerable situation for a Security feature, but it’s perfectly acceptable when we’re just trying to increase the scalability and reliability of naively written database applications.
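The Hashtable trick is easy to observe. The sketch below uses a hypothetical ProbeKey class whose GetHashCode records whether the calling thread currently holds the table’s SyncRoot. It assumes the synchronized wrapper locks on the SyncRoot it exposes, and it only exercises an insert; whether lookups also run under the lock is an implementation detail this sketch does not rely on.

```csharp
using System;
using System.Collections;
using System.Threading;

// A key that records whether its GetHashCode ran while the table's lock was held.
public sealed class ProbeKey
{
    private readonly object _lockToProbe;
    public bool HashedUnderLock;

    public ProbeKey(object lockToProbe) { _lockToProbe = lockToProbe; }

    public override int GetHashCode()
    {
        // Monitor.IsEntered reports whether the current thread holds this lock.
        HashedUnderLock = Monitor.IsEntered(_lockToProbe);
        return 42;
    }

    public override bool Equals(object obj) { return ReferenceEquals(this, obj); }
}

public static class SynchronizedHashtableDemo
{
    // Returns true if user code (GetHashCode) executed inside the wrapper's lock.
    public static bool InsertRunsUserCodeUnderLock()
    {
        Hashtable table = Hashtable.Synchronized(new Hashtable());
        var key = new ProbeKey(table.SyncRoot);
        table.Add(key, "value");      // the wrapper takes its lock, then hashes the key
        return key.HashedUnderLock;
    }
}
```

Once user code runs inside somebody else’s lock, that lock can be borrowed for mutual exclusion, which is exactly why Synchronized carries its own HPA.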
Escalation Policy
I chose the HPA on System.Threading.Monitor for a reason, in the above example. If you’ve read my earlier blogs on Thread.Abort, you know that it’s dangerous to asynchronously abort another thread. That thread could be executing a class constructor, in which case that class is now unavailable throughout the AppDomain. That thread could be in the middle of an update to some shared application state, which would leave the application in an inconsistent state.
In V1 & V1.1, it was not really possible to write code that is robust in the face of asynchronous exceptions like Abort. In Whidbey, we’re now introducing some constructs (Constrained Execution Regions and Critical Finalization) which make it possible to do this. I’m not going to discuss those constructs in this blog. But suffice it to say that, although they make it possible to write entirely robust code, they don’t make it easy. Without a higher level programmatic construct, like transactions, it’s very difficult to write entirely robust code. You must acquire all the resources required for forward progress, tolerating exceptions during this acquisition phase. Then you enter a forward progress phase, which either cannot fail or which unconditionally triggers some compensating backout code upon failure. If compensation is triggered, it must guarantee that the system is returned to a consistent state before it completes.
If you’ve successfully written that sort of code, you know that it’s an onerous discipline. There’s no way that we can expect the greater population of developers to write large bodies of bug-free code based on this plan.
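The acquire / forward progress / compensate shape can be sketched like this. The Transfer class and the in-memory balances are hypothetical. More importantly, the sketch only handles synchronous failures: it does not use Constrained Execution Regions, so an asynchronous exception landing inside the finally block would still defeat it. It shows the discipline, not a complete solution.

```csharp
using System;
using System.Collections.Generic;

public static class Transfer
{
    // Moves 'amount' between two hypothetical in-memory accounts.
    public static void Move(Dictionary<string, int> balances, string source, string target, int amount)
    {
        // Phase 1: acquire everything needed. Failures here are harmless
        // because no shared state has been modified yet.
        int sourceBefore = balances[source];      // may throw KeyNotFoundException
        int targetBefore = balances[target];
        if (sourceBefore < amount) throw new InvalidOperationException("insufficient funds");

        // Phase 2: forward progress, with unconditional compensation on failure.
        bool committed = false;
        try
        {
            balances[source] = sourceBefore - amount;
            balances[target] = checked(targetBefore + amount);   // may throw OverflowException
            committed = true;
        }
        finally
        {
            if (!committed)
            {
                // Backout: restore the consistent state captured in phase 1.
                balances[source] = sourceBefore;
                balances[target] = targetBefore;
            }
        }
    }
}
```

Even for two integers, the code is mostly bookkeeping. Scaling this up to real application state is where the discipline becomes onerous.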
That’s why, in V1 & V1.1, we recommend using Abort either on the current thread (in which case it is not asynchronous) or in conjunction with an AppDomain.Unload (in which case any inconsistent application state is likely to be discarded).
In Whidbey, it is possible to avoid inducing asynchronous Aborts onto threads that are performing backout (i.e. filter, finally, catch or fault blocks) or that hold locks. Our definition of a lock is pretty broad. It includes execution of a class constructor, since all .cctor execution is synchronized according to elaborate rules by the CLR. It also includes Monitor.Enter, Mutex, ReaderWriterLock, etc. Finally, it includes any “hand-rolled” locks that you build, so long as you properly identify them to us.
Our rationale here is that any thread holding a lock may be updating shared state. If a thread isn’t holding a lock, then any update it performs against shared state must be atomic or at least it never leaves that shared state in an inconsistent state. This is strictly a heuristic, but it’s a pretty good one.
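For a "hand-rolled" lock, identification might look like the sketch below. The text above does not name the mechanism; this sketch assumes it is the Thread.BeginCriticalRegion / Thread.EndCriticalRegion pair, which exists for notifying a host that a thread is in a region where an abort could jeopardize other tasks. The AnnouncedSpinLock class itself is hypothetical.

```csharp
using System;
using System.Threading;

// A hand-rolled spin lock that tells the host when it is (or may be) held.
public sealed class AnnouncedSpinLock
{
    private int _held;   // 0 = free, 1 = held

    public void Enter()
    {
        // Announce before acquiring, so there is no window in which the
        // lock is held but the host does not know about it.
        Thread.BeginCriticalRegion();
        while (Interlocked.CompareExchange(ref _held, 1, 0) != 0)
            Thread.Sleep(0);   // give up the time slice and retry
    }

    public void Exit()
    {
        Volatile.Write(ref _held, 0);
        Thread.EndCriticalRegion();   // only now is the thread safe to abort again
    }

    public bool IsHeld { get { return Volatile.Read(ref _held) == 1; } }
}
```

The ordering is the design choice worth noting: announce first and retract last, so the window in which the host believes a lock is held is always a superset of the window in which it really is.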
If we believe this heuristic, it means that we can use Abort without consequently unloading an AppDomain, if that thread doesn’t hold any locks and isn’t performing any backout. And it just so happens that the bulk of all managed code executing inside SQL Server is in the ‘Safe’ subset – which coincidentally is highly discouraged via HPAs from taking or holding locks.
In other words, code in the ‘Safe’ subset can almost always take an asynchronous exception without affecting any of the execution on other threads in the same AppDomain. This is the case, even though that code was written by developers who don’t understand the deep issues involved with asynchronous exceptions. It further means that if we should catch such a thread at a point where it isn’t safe to inject an asynchronous exception without also unloading the AppDomain, we can identify this window. Once this window is identified, we can either hold off from injecting the exception until this unsafe window has closed, or we can unload the entire AppDomain to eliminate the application inconsistency. The host can decide whether to hold off on the injection or alternatively to proceed with an AppDomain unload, based on criteria like how resource-constrained the host is.
The hosting APIs for making these decisions imperatively would be rather complicated. So the Whidbey hosting APIs provide a declarative mechanism called an escalation policy. This allows the host to express transitions and timeouts that take effect during error conditions. For instance, SQL Server might state that any attempt to Abort a thread should delay if the victim thread holds a lock. But if that delay exceeds 30 seconds, the Abort attempt should be escalated to an AppDomain.Unload. Of course, the feature is more general than SQL Server’s needs. Indeed, the V1 ASP.NET process recycling feature should now be expressible as a particular Whidbey escalation policy.
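To show what "declarative" means here, the following sketch models a single escalation rule as data. It is purely hypothetical: the FailureAction enum, the EscalationRule class, and the SamplePolicy helper are invented for illustration and are not the shape of the Whidbey hosting APIs, which are unmanaged and subject to change.

```csharp
using System;

public enum FailureAction { Delay, AbortThread, UnloadAppDomain, RecycleProcess }

// Hypothetical declarative rule: what to try first, how long to wait, what to escalate to.
public sealed class EscalationRule
{
    public FailureAction Initial;
    public TimeSpan Timeout;
    public FailureAction Escalated;

    // A pure function of elapsed time, so the runtime (not the host) can evaluate it.
    public FailureAction ActionAt(TimeSpan elapsed)
    {
        return elapsed < Timeout ? Initial : Escalated;
    }
}

public static class SamplePolicy
{
    // "Delay an Abort while the victim holds a lock; after 30 seconds, unload the AppDomain."
    public static EscalationRule AbortWhileHoldingLock()
    {
        return new EscalationRule
        {
            Initial = FailureAction.Delay,
            Timeout = TimeSpan.FromSeconds(30),
            Escalated = FailureAction.UnloadAppDomain
        };
    }
}
```

The host states the rule once, up front. It never has to be called back at the moment of failure, which is precisely when a resource-constrained host is least able to run decision logic.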
Winding down
As usual, I didn’t get around to many of the interesting topics. For instance, those guidelines on when and how to host are noticeably absent. And I didn’t explain how to do any simple stuff, like picking concurrent vs. non-concurrent vs. server GC. The above text is completely free of any specific details of what our hosting APIs look like (partly because they are subject to change until Whidbey ships). And I didn’t touch on any hosting topics outside of the hosting APIs, like all of the AppDomain considerations. As you can imagine, there’s also plenty I could have said about Security. For instance, the hosting APIs allow the host to participate in role-based security and impersonation of Windows identities… Oh well.
Fortunately, one of the PMs involved in the Whidbey hosting effort is apparently writing a book on the general topic of hosting. Presumably all these missing topics will be covered there. And hopefully he won’t run into the same issues with writer’s block that I experienced on this topic.
(Indeed, the event that ultimately resolved my writer’s block was that my wife got the flu. When she’s not around, my weekends are boring enough for me to think about work. The reason I’m posting two blogs this weekend is that Kathryn has gone to Maui for the week and has left me behind.)
Finally, the above blog talks about SQL Server a lot.
Hopefully it’s obvious that the CLR wants to be a great execution environment for a broad set of servers. In V1, we focused on ASP.NET. Based on that effort, we automatically worked well in many other servers with no additional work. For example, EnterpriseServices dropped us into their server processes simply by selecting the server mode of our GC. Nothing else was required to get us running efficiently. (Well, we did a ton of other work in the CLR to support EnterpriseServices. But that work was related to the COM+ programming model and infrastructure, rather than their server architecture. We had to do that work whether we ran in their server process or were instead loading EnterpriseServices into the ASP.NET worker process or some other server).
In Whidbey we focused on extending the CLR to meet SQL Server’s needs. But at every opportunity we generalized SQL Server’s requirements and tried to build something that would be more broadly useful. Just as our ASP.NET work enabled a large number of base server hosting scenarios, we hope that our SQL Server work will enable a large number of advanced server hosting scenarios.
If you have a “commercially significant” hosting problem, whether on the server or the client, and you’re struggling with how to incorporate managed code, I would be interested in hearing from you directly. Feel free to drop me an email with the broad outline of what you are trying to achieve, and I’ll try to get you supported. That support might be something as lame as some suggestions from me on how I would tackle the problem. Or at the other extreme, I could imagine more formal support and conceivably some limited feature work. That other extreme really depends on how commercially significant your product is and on how well our business interests align. Obviously decisions like that are far outside my control, but I can at least hook you up with the right people if this seems like a sensible approach.
Okay, one more ‘Finally’. From time to time readers of my blog send me emails asking if there are jobs available on the CLR team. At this moment, there are. Drop me an email if you are interested. It’s an extremely challenging team to work on, but the problems are truly fascinating.
|
-
Earlier this week, I wrote an internal email explaining how Finalization works in V1 / V1.1, and how it has been changed for Whidbey. There’s some information here that folks outside of Microsoft might be interested in.
Costs
Finalization is expensive. It has the following costs:
1) Creating a finalizable object is slower, because each such object must be placed on a RegisteredForFinalization queue. In some sense, this is a bit like having an extra pointer-sized field in your object that the system initializes for you. However, our current implementation uses a slower allocator for every finalizable object, and this impact can be measured if you allocate small objects at a high rate.
2) Each GC must do a weak pointer scan of this queue, to find out whether any finalizable objects are now collectible. All such objects are then moved to a ReadyToFinalize queue. The cost here is small.
3) All objects in the ReadyToFinalize queue, and all objects reachable from them, are then marked. This means that an entire graph of objects which would normally die in one generation can be promoted to the next generation, based on a single finalizable root to this graph. Note that the size of this graph is potentially huge.
4) The older generation will be collected at some fraction of the frequency of the younger generation. (The actual ratio depends on your application, of course). So promotion of the graph may have increased the time to live of this graph by some large multiple. For large graphs, the combined impact of this item and #3 above will dominate the total cost of finalization.
5) We currently use a single high priority Finalizer thread to walk the ReadyToFinalize queue. This thread dequeues each object, executes its Finalize method, and proceeds to the next object. This is the one cost of finalization which customers actually expect.
6) Since we dedicate a thread to calling finalizers, we inflict an expense on every managed process. This can be significant in Terminal Server scenarios where the high number of processes multiplies the number of finalizer threads.
7) Since we only use a single thread for finalization, we are inherently non-scalable if a process is allocating finalizable objects at a high rate. One CPU performing finalization might not keep up with 31 other CPUs allocating those finalizable objects.
8) The single finalizer thread is a scarce resource. There are various circumstances where it can become blocked indefinitely. At that point, the process will leak resources at some rate and eventually die. See http://blogs.msdn.com/cbrumme/archive/2004/02/02/66219.aspx for extensive details.
9) Finalization has a conceptual cost to managed developers. In particular, it is difficult to write correct Finalize methods as I shall explain.
Eventually we would like to address #5 thru #8 above by scheduling finalization activity over our ThreadPool threads. We have also toyed with the idea of reducing the impact of #3 and #4 above, by pruning the graph based on reachability from your Finalize method and any code that it might call. Due to indirections that we cannot statically explore, like interface and virtual calls, it’s not clear whether this approach will be fruitful. Also, this approach would cause an observable change in behavior if resurrection occurs. Regardless, you should not expect to see any of these possible changes in our next release.
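Costs #2 through #4 are easier to see in a toy model. The sketch below is a simulation, not the GC: SimObject and SimHeap are hypothetical, every collection is treated as a full collection, and every survivor is promoted by one generation. It exists only to show how one finalizable root keeps an otherwise dead graph alive.

```csharp
using System;
using System.Collections.Generic;

public sealed class SimObject
{
    public string Name;
    public int Generation;
    public List<SimObject> References = new List<SimObject>();
}

public sealed class SimHeap
{
    public readonly List<SimObject> RegisteredForFinalization = new List<SimObject>();
    public readonly Queue<SimObject> ReadyToFinalize = new Queue<SimObject>();

    // One simulated collection. 'roots' are the application's live references.
    // Returns the objects that survive (each is promoted one generation).
    public HashSet<SimObject> Collect(IEnumerable<SimObject> roots)
    {
        var live = new HashSet<SimObject>();
        foreach (SimObject root in roots) Mark(root, live);

        // Cost #2: scan the registration queue for objects that are no longer reachable.
        foreach (SimObject candidate in RegisteredForFinalization.ToArray())
        {
            if (live.Contains(candidate)) continue;
            RegisteredForFinalization.Remove(candidate);
            ReadyToFinalize.Enqueue(candidate);
        }

        // Cost #3: everything reachable from the ready queue is marked and survives.
        foreach (SimObject pending in ReadyToFinalize) Mark(pending, live);

        // Cost #4: survivors move to an older, less frequently collected generation.
        foreach (SimObject survivor in live) survivor.Generation++;
        return live;
    }

    private static void Mark(SimObject obj, HashSet<SimObject> live)
    {
        if (!live.Add(obj)) return;
        foreach (SimObject child in obj.References) Mark(child, live);
    }
}
```

In the model, an unreachable finalizable object that references a large graph drags that whole graph through the first collection. Only after its finalizer has run (the entry is dequeued) does a later collection reclaim anything.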
Reachability
One of the guidelines for finalization is that a Finalize method shouldn’t touch other objects. People sometimes incorrectly assume that this is because those other objects have already been collected. Yet, as I have explained, the entire reachable graph from a finalizable object is promoted.
The real reason for the guideline is to avoid touching objects that may have already been finalized. That’s because finalization is unordered.
So, like most guidelines, this one is made to be broken under certain circumstances. For instance, if your object “contains” a private object that is not itself finalizable, clearly you can refer to it from your own Finalize method without risk.
In fact, a sophisticated developer might even create a cycle between two finalizable objects and coordinate their finalization behavior. Consider a buffer and a file. The Finalize method of the buffer will flush any pending writes. The Finalize method of the file will close the handle. Clearly it’s important for the buffer flush to precede the handle close. One legitimate but brittle solution is to create a cycle of references between the buffer and the file. Whichever Finalize method is called first will execute a protocol between the two objects to ensure that both side effects happen in order. The subsequent Finalize call on the second object should do nothing.
I should point out that Whidbey solves the buffer and file problem differently, relying on the semantics of critical finalization. And I should also point out that any protocol for sequencing the finalization of two objects should anticipate that one day we may execute these two Finalize methods concurrently on two different threads. In other words, the protocol must be thread-safe.
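One way to picture such a protocol is the sketch below. It is a Python model rather than CLR code, and the names Coordinator, Buffer, File and finalize are all hypothetical; the explicit finalize methods stand in for Finalize calls made by the runtime. The only point is that a shared lock and a one-shot flag keep the flush ahead of the close no matter which object is finalized first, even if the two calls arrive on different threads.

```python
import threading

class Coordinator:
    """Shared by one buffer and one file; runs their cleanup once, in order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False
        self.log = []              # records the order of side effects

    def cleanup(self):
        # The lock makes the check-and-run step safe even if both
        # finalizers run concurrently on different threads.
        with self._lock:
            if self._done:
                return             # the second finalizer does nothing
            self._done = True
            self.log.append("flush pending writes")
            self.log.append("close handle")

class Buffer:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.file = None           # filled in by File to form the cycle

    def finalize(self):
        self.coordinator.cleanup()

class File:
    def __init__(self, coordinator, buffer):
        self.coordinator = coordinator
        self.buffer = buffer
        buffer.file = self         # cycle: buffer <-> file

    def finalize(self):
        self.coordinator.cleanup()

def demo(file_first):
    """Finalize the pair in either order and return the observed side effects."""
    c = Coordinator()
    b = Buffer(c)
    f = File(c, b)
    order = [f, b] if file_first else [b, f]
    for obj in order:
        obj.finalize()
    return c.log
```

Calling demo(True) and demo(False) should yield the same two-step log, which is the property the protocol exists to provide.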
Ordering
This raises the question of why finalization is unordered.
In many cases, no natural order is even possible. Finalizable objects often occur in cycles. You could imagine decorating some references between objects, to indicate the direction in which finalization should proceed. This would add a sorting cost to finalization. It would also cause complexity when these decorated references cross generation boundaries. And in many cases the decorations would not fully eliminate cycles. This is particularly true in component scenarios, where no single developer has sufficient global knowledge to create an ordering:
Your component would achieve its guarantees when tested by you, prior to deployment. Then in some customer application, additional decorated references would create cycles and your guarantees would be lost. This is a recipe for support calls and appcompat issues.
Unordered finalization is substantially faster. Not only do we avoid sorting (which might involve metadata access and marking through intermediate objects), but we can also efficiently manage the RegisteredForFinalization and ReadyToFinalize queues without ever having to memcpy. Finally, there’s value in forcing developers to write Finalize methods with minimal dependencies on any other objects. This is key to our eventual goal of making Finalization scalable by distributing it over multiple threads.
Based on the above and other considerations like engineering complexity, we made a conscious decision that finalization should be unordered.
Partial Trust
There are no security permissions associated with the definition of a Finalize method. As we’ve seen, it’s possible to mount a denial of service attack via finalization. However, many other denial of service attacks are possible from partial trust, so this is uninteresting.
Customers and partners sometimes ask why partially trusted code is allowed to participate in finalization. After all, Finalize methods are typically used to release unmanaged resources. Yet partially trusted code doesn’t have direct access to unmanaged resources. It must always go through an API provided by an assembly with UnmanagedCodePermission or some other effective equivalent to FullTrust.
The reason is that finalization can also be used to control pure managed resources, like object pools or caches. I should point out that techniques based on weak handles can be more efficient than techniques based on finalization. Nevertheless, it’s quite reasonable for partially trusted code to use finalization for pure managed resources.
SQL Server has a set of constraints that they place on partially trusted assemblies that are loaded into their environment. I believe that these constraints prevent definition of static fields (except for initonly and literal static fields), use of synchronization, and the definition of Finalize methods. However, these constraints are not related to security. Rather, they are to improve scalability and reliability of applications by simplifying the threading model and by moving all shared state into the database where it can be transacted.
It’s hard to implement Finalize perfectly
Even when all Finalize methods are authored by fully trusted developers, finalization poses some problems for processes with extreme availability requirements, like SQL Server. In part, this is because it’s difficult to write a completely reliable Finalize method – or a completely reliable anything else.
Here are some of the concerns specifically related to finalization. I’ll explain later how some of these concerns are addressed in the context of a highly available process like SQL Server.
Your Finalize method must tolerate partially constructed instances
It’s possible for partially trusted code to subtype a fully trusted finalizable object (with APTCA) and throw an exception from the constructor. This can be done before chaining to the base class constructor. The result is that a zero-initialized object is registered for finalization.
Even if partially trusted code isn’t intentionally causing finalization of your partially constructed instances, asynchronous problems like StackOverflowException, OutOfMemoryException or AppDomainUnloadedException can cause your constructor to be interrupted at a fairly arbitrary location.
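The defensive style this calls for can be sketched as follows. This is a Python model under stated assumptions, not CLR code: NativeResource, its handle field and the explicit finalize method are hypothetical, and a missing attribute plays the role of a zero-initialized field. The instance created with __new__ simulates an object whose constructor never ran.

```python
class NativeResource:
    """Hypothetical wrapper whose cleanup tolerates a constructor that never ran."""

    def __init__(self):
        self.handle = 42           # pretend this came from the OS
        self.released = False

    def finalize(self):
        # Never assume the constructor completed: look the field up defensively.
        handle = getattr(self, "handle", None)
        if handle is None:
            return "nothing to release"
        if getattr(self, "released", False):
            return "already released"
        self.released = True
        return "released"

# A fully constructed instance and one whose constructor was skipped entirely.
complete = NativeResource()
partial = NativeResource.__new__(NativeResource)
```

Here partial.finalize() returns without touching a handle that was never assigned, while complete.finalize() releases exactly once.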
Your Finalize method must consider the consequence of failure
It’s possible for partially trusted code to subtype a fully trusted finalizable object (with APTCA) and fail to chain to the base Finalize method. This causes the fully trusted encapsulation of the resource to leak.
Even if partially trusted code isn’t intentionally causing finalization of your object to fail, the aforementioned asynchronous exceptions can cause your Finalize method to be interrupted at a fairly arbitrary location.
In addition, the CLR exposes a GC.SuppressFinalize method which can be used to prevent finalization of any object. Arguably we should have made this a protected method on Object or demanded a permission, to prevent abuse of this method. However, we didn’t want to add a member to Object for such an obscure feature. And we didn’t want to add a demand, since this would have prevented efficient implementation of IDisposable from partial trust.
Your object is callable after Finalization
We’ve already seen how all the objects in a closure can access each other during finalization. Indeed, if any one of those objects re-establishes its reachability from a root (e.g. it places itself into a static field or a handle), then all the other objects it reaches will also become re-established. This is referred to as resurrection. If you have a finalizable object that is publicly exposed, you cannot prevent your object from becoming resurrected. You are at the mercy of all the other objects in the graph.
One possible solution here is to set a flag to indicate that your object has been finalized. You can pepper your entire API with checks to this flag, throwing an ObjectDisposedException if you are subsequently called. Yuck.
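For illustration, here is roughly what that peppering looks like. The sketch is a Python model, not CLR code; Connection, send, finalize and ObjectDisposedError are hypothetical names, with the last one standing in for ObjectDisposedException.

```python
class ObjectDisposedError(Exception):
    """Stand-in for ObjectDisposedException in this model."""

class Connection:
    def __init__(self):
        self._finalized = False

    def _check_alive(self):
        # Every public entry point has to start with this check.
        if self._finalized:
            raise ObjectDisposedError("Connection was already finalized")

    def send(self, data):
        self._check_alive()
        return len(data)

    def finalize(self):
        self._finalized = True

# Some other object in the graph resurrects the connection by storing it
# in a root (here, a module-level list) and calls it after finalization.
resurrected = []
conn = Connection()
resurrected.append(conn)
conn.finalize()
```

After this, resurrected[0].send(b"x") raises ObjectDisposedError instead of operating on released state. The cost is one check per public member, which is the "Yuck" part.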
Your object is callable during Finalization
It’s true that finalization is currently single-threaded (though this may well change in the future). And it’s true that the finalizer thread will only process instances that – at some point – were discovered to be unreachable from the application. However, the possibility of resurrection means that your object may become visible to the application before its Finalize method is actually called. This means that application threads and the finalizer thread can simultaneously be active in your object.
If your finalizable object encapsulates a protected resource like an OS handle, you must carefully consider whether you are exposed to threading attacks. Shortly before we shipped V1, we fixed a number of handle recycling attacks that were due to race conditions between the application and Finalization. See http://blogs.msdn.com/cbrumme/archive/2003/04/19/51365.aspx for more details.
Your Finalizer could be called multiple times
Just as there is a GC.SuppressFinalize method, we also expose a GC.ReRegisterForFinalize method. And the same arguments about protected accessibility or security demands apply to the ReRegisterForFinalize method.
Your Finalizer runs in a delicate security context
As I’ve explained in prior blogs, the CLR flows the compressed stack and other security information around async points like ThreadPool.QueueUserWorkItem or Control.BeginInvoke. Indeed, in Whidbey we include more security information by default. However, we do not flow any security information from an object’s constructor to an object’s Finalize method. So (to use an absurd example) if you expose a fully trusted type that accepts a filename string in its constructor and subsequently opens that file in its Finalize method, you have created a security bug.
Clearly it’s hard to write a correct Finalize method. And the managed platform is supposed to make hard things easier. I’ll return to this when I discuss the new Whidbey features of SafeHandles, Critical Finalizers and Constrained Execution Regions.
But what guarantees do I get if I don’t use any of those new gizmos? What happens in a V1 or V1.1 process?
V1 & V1.1 Finalization Guarantees
If you allocate a finalizable object, we guarantee that it will be registered for finalization. Once this has happened, there are several possibilities:
1) As part of the natural sequence of garbage collection and finalization, the finalizer thread dequeues your object and finalizes it.
2) The process can terminate without cooperating with the CLR’s shutdown code. This can happen if you call TerminateProcess or ExitProcess directly. In those cases, the CLR’s first notification of the shutdown is via a DllMain DLL_PROCESS_DETACH notification. It is not safe to call managed code at that time, and we will leak all the finalizers. Of course, the OS will do a fine job of reclaiming all its resources (including abandonment of any cross-process shared resources like Mutexes). But if you needed to flush some buffers to a file, your final writes have been lost.
3) The process can terminate in a manner that cooperates with the CLR’s shutdown code. This includes calling exit() or returning from main() in any unmanaged code built with VC7 or later. It includes System.Environment.Exit(). It includes a shutdown triggered from a managed EXE when all the foreground threads have completed. And it includes shutdown of processes that are CLR-aware, like Visual Studio. In these cases, the CLR attempts to drain both the ReadyToFinalize and the RegisteredForFinalization queues, processing all the finalizable objects.
4) The AppDomain containing your object is unloaded. Prior to Whidbey, the AppDomain will not unload until we have scanned the ReadyToFinalize and the RegisteredForFinalization queues, processing all the finalizable objects that live in the doomed AppDomain.
There are a few points to note here.
· Objects are always finalized in the AppDomain they were created in. A special case exists for any finalizable objects that are agile with respect to AppDomains. To my knowledge, the only such type that exists is System.Threading.Thread.
· I have heard that there is a bug in V1 and V1.1, where we get confused on AppDomain transitions in the ReadyToFinalize queue. The finalization logic attempts to minimize AppDomain transitions by noticing natural partitions in the ReadyToFinalize queue. I’m told there is a bug where we may occasionally skip finalizing the first object of a partition. I don’t believe any customers have noticed this and it is fixed in Whidbey.
· Astute readers will have noticed that during process shutdown and AppDomain unloading we actually finalize objects in the RegisteredForFinalization queue. Such objects are still reachable and would not normally be subject to finalization. Normally a Finalize method can rely on safely accessing finalizable state that is rooted via statics or some other means. You can detect when this is no longer safe by checking AppDomain.IsFinalizingForUnload or Environment.HasShutdownStarted.
· Since there is no ordering of finalization, critical infrastructure is being finalized along with application objects. This means that WaitHandles, remoting infrastructure and even security infrastructure are disappearing underneath you. This is a potential security concern and a definite reliability concern. We have spot-fixed a few cases of this. For example, we never finalize our Thread objects during process shutdown.
· Finalization during process termination will eventually time out. If a particular Finalize method gets stuck, or if the queue isn’t reducing in size over time (i.e. you create 2 new finalizable instances out of each execution of your Finalize method), we will eventually time out and terminate the process. The exact timeouts depend on whether a profiler is attached and other details.
· The thread that initiates process shutdown performs the duties of “watchdog.” It is responsible for detecting timeouts during process termination. If this thread is an STA thread, we cause it to pump COM calls in and out of the STA while it blocks as watchdog. If the application has a deadlock that implicates the STA thread while it is executing these unmanaged COM calls, then the timeout mechanism is defeated and the process will hang. This is fixed in Whidbey.
· Subject to all of the above, we guarantee that we will dequeue your object and initiate a call to the Finalize method. We do not guarantee that your Finalize method can be JITted without running out of stack or memory. We do not guarantee that the execution of your Finalize method will complete without being aborted. We do not guarantee that any types you require can be loaded and have their .cctors run. All you get is a “best effort” attempt. We’ll soon see how Whidbey extensions allow you to do better than this and guarantee full execution.
· (If you want to know more about the shutdown of managed processes, see http://blogs.msdn.com/cbrumme/archive/2003/08/20/51504.aspx.)
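The timeout behavior described in the bullets above can be pictured with a toy model. This is a Python sketch, not the CLR's algorithm: drain, the step budget and the two sample finalizers are hypothetical, and the budget is a deterministic stand-in for the wall-clock timeouts the runtime actually uses.

```python
from collections import deque

def drain(queue, budget):
    """Run queued finalizers until the queue is empty or the budget runs out.

    Returns True if the queue drained, False if the watchdog gave up.
    """
    steps = 0
    while queue:
        if steps >= budget:
            return False           # watchdog: give up and terminate anyway
        finalizer = queue.popleft()
        finalizer(queue)           # a finalizer may enqueue more work
        steps += 1
    return True

def well_behaved(queue):
    pass                           # releases its resource and creates nothing

def runaway(queue):
    # Creates two new finalizable instances per execution, so the
    # queue grows instead of shrinking.
    queue.append(runaway)
    queue.append(runaway)
```

A queue of well-behaved finalizers drains comfortably within the budget, while a single runaway finalizer exhausts any finite budget and forces the give-up path.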
SafeHandle
Whidbey contains some mechanisms that address many of the V1 and V1.1 issues with finalization. Let’s start with SafeHandle, since it’s the easiest to understand. Conceptually, this is just an encapsulation of an OS handle. You should read the documentation of this feature for details. Briefly, SafeHandle provides the following benefits:
1) Someone else wrote it and is maintaining it. Using it is much easier than building equivalent functionality yourself.
2) It prevents races between an application thread and the finalizer thread in unmanaged code. And it does this in a manner that leverages the type system. Specifically, clients are forced to deal with SafeHandles rather than IntPtrs or value types which don’t have strong identity and lifetime semantics.
3) It prevents handle-recycling attacks. You can read more details about finalization races (#2 above) and this bullet on handle-recycling attacks by reading http://blogs.msdn.com/cbrumme/archive/2003/04/19/51365.aspx. In that blog from last April, I allude to the existence of SafeHandle without giving details.
4) It discourages promotion of large graphs of objects, by placing the finalizable resources in a tiny leaf instance.
5) It participates with the PInvoke marshaler to ensure that unmarshaled instances will be registered for finalization.
6) For the handful of bizarre APIs that aren’t covered by our standard marshaling styles, Constrained Execution Regions (CERs) can be used to guarantee that unmarshaled instances will be registered for finalization.
7) It uses the new Critical Finalization mechanism to guarantee that leaks cannot occur. This means that we not only guarantee we will initiate execution of your Finalize method, but we also make some strong guarantees that allow you to ensure that it actually completes execution.
8) In order to guarantee that there will be no leaks, we necessarily leave the system open to denial of service and hangs. This is the halting problem. The Critical Finalization mechanism addresses this dilemma by making the leak protection explicit, restricting it to small regions of carefully written code, and by using the security system. Only trusted code can achieve strong guarantees about leakage. Such code is trusted not to create denial of service problems, whether maliciously or inadvertently, over small blocks of explicitly identified code.
9) Since SafeHandle uses Critical Finalization, it solves the problem of sequencing buffer flushing before handle closing that I mentioned earlier.
So what is this Critical Finalization thing?
Critical Finalization (CF) and CERs
Any object that derives from CriticalFinalizerObject (including SafeHandle) obtains critical finalization. This means:
1) Before any objects of this type are created, the CLR will “prepare” any resources that will be necessary for the Finalize method to run. Preparation includes JITting the code, running class constructors and – most importantly – traversing the static reachability of other methods and types that will be required during execution and making sure that they are likewise prepared. However, the CLR cannot statically traverse through indirections like interface calls and virtual calls. So there is a mechanism for the developer to guide the CLR through these opaque indirections.
2) The CLR will never time out on the execution of one of these Finalize methods. As I mentioned, we rely on the limited amount of code written via this discipline combined with the trust decisions of the security system to avoid hangs here.
3) When the Finalize method is called, it is called in a protected state that prevents the CLR from injecting Thread.Aborts or other optional asynchronous exceptions. Because of our special preparation, we also prevent other asynchronous exceptions like OutOfMemoryExceptions due to JITting or type loading and TypeInitializationExceptions due to .cctor failures. Of course, if the application tries to allocate an object it may get an OutOfMemoryException. This is application-induced rather than system-induced and therefore is not considered the CLR’s responsibility. The Finalize method can use standard exception handling to protect itself here.
4) All normal finalizable objects are either executed or discarded without finalization, before any critical finalizers are executed. This means that a buffer flush can precede the close of the underlying handle.
The first 3 bullet points above are not restricted to CF. These bullet points apply to all CERs. The fundamental difference between CF and other CERs is the funky flow control from the instantiation of an object to the execution of its Finalize method via registration on our finalization queues. Other CERs can use normal block scopes in a single method to express the same reliability concepts. For normal CERs, the preparation phase, the forward execution phase and the backout phase are all contained in a single method.
A full description of CERs is beyond the scope of a note that is ostensibly about finalization. However, a brief description makes sense.
Essentially, CERs address issues with asynchronous exceptions. I have already mentioned asynchronous exceptions, which is the CLR’s term for all the pesky problems that manifest themselves as surprising exceptions. These are distinct from the application-level exceptions, which presumably are anticipated by and handled by the application.
You can read about asynchronous exceptions and the novel problems introduced by a managed execution environment that virtualizes resources so aggressively at http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx.
In V1 and V1.1, the CLR does a poor job of distinguishing asynchronous exceptions from application exceptions. In Whidbey, we are starting to make this separation but it remains one of the weak design points for our hosting and exception stories.
Anyway, I’m sure that many readers are familiar with the difficulty of writing reliable unmanaged code that is guaranteed to complete in the face of limited resources (e.g. memory or stack), free threading, and other facts of life. And by now, if you’ve read all the blog articles I’ve mentioned, you are also familiar with the additional problems caused by a highly virtualized execution environment.
CERs allow you to declare regions of code where the CLR is constrained from injecting any system-generated errors. And the author of the code is constrained from performing certain actions if he wants to avoid additional exceptions. An obvious example is that he shouldn’t new up an object if he is not prepared to deal with an OutOfMemoryException from that operation.
In addition to CERs, Whidbey provides reliability contracts. These contracts can be used to annotate methods with their guarantees and requirements with respect to reliability. Using these contracts, it’s possible to compose reliable execution out of components written by different authors. This is necessary for building reliable applications that make use of framework services. If the reliability requirements and guarantees of the framework services were not themselves explicit, the client applications could not remain reliable on top of them.
Finalization in SQL Server and other high availability hosts
Back to finalization.
In a normal unhosted process, there isn’t a strong distinction between normal and critical finalization. Normal processes won’t run out of memory, and if they do they should probably Fail Fast. It’s unlikely that the risk of trying to continue execution after resource exhaustion is worth the increased risk of subsequent crashes, hangs or other corruptions. Normal processes won’t experience Thread.Aborts that are injected across threads. (This is in contrast to a thread aborting itself, which is no more dangerous than throwing any other exception.)
So the only real concern is whether all the finalizable objects will drain during process exit, before the timeouts kick in. The timeouts are quite generous and in practice this is not a concern.
However, a hosted process like SQL Server is quite different. Because of SQL Server’s availability requirements, it is vital that the process not FailFast for something innocuous like OutOfMemoryExceptions. Indeed, SQL Server tries to run on the brink of memory exhaustion for performance reasons, so these memory exceptions are a constant fact of life in that environment. Furthermore, SQL Server uses Thread.Abort explicitly across threads to terminate long-running requests and it uses Thread.Abort implicitly to unload AppDomains. On a heavily loaded system, AppDomains may be unloaded to relieve resource pressure.
I have a lengthy blog on this topic, but I have not been able to post it because it talks about undisclosed Whidbey features. At some point (no later than shipping Beta1), you will find it at http://blogs.msdn.com/cbrumme with a title of Hosting. Until then, I’ll just mention that the Whidbey APIs support an escalation policy. This is a declarative mechanism by which the host can express timeouts for normal finalization, normal AppDomain unload, normal Abort attempts, etc. In addition to timeouts, the escalation policy can indicate appropriate actions whenever these timeouts expire. So a normal AppDomain unload could (for example) be escalated to a rude AppDomain unload or a normal process exit or a rude process exit.
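To make the idea of an escalation policy concrete, here is a small Python model. It is not the Whidbey hosting API: the POLICY table, the timeout values and the escalate function are all invented for illustration, and the chain shown is just one policy a host might choose.

```python
# Each entry is (action, timeout in seconds). None means "no timeout":
# the last resort is expected to succeed. All values are made up.
POLICY = [
    ("AppDomain unload", 10.0),
    ("rude AppDomain unload", 5.0),
    ("process exit", 5.0),
    ("rude process exit", None),
]

def escalate(policy, try_action):
    """Try each action in turn until one completes within its timeout.

    try_action(action, timeout) is supplied by the host and returns True
    if the action finished in time. Returns (winning action, attempts).
    """
    attempted = []
    for action, timeout in policy:
        attempted.append(action)
        if try_action(action, timeout):
            return action, attempted
    return None, attempted

def stubborn_host(action, timeout):
    # Hypothetical host in which nothing short of a process exit works.
    return action == "process exit"
```

With the stubborn host, escalate walks past both unload attempts and stops at the normal process exit, never reaching the rude one.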
The distinction between polite/normal and rude involves several aspects beyond finalization. If we just consider finalization, polite/normal means that we execute both normal and critical finalization. Contrast this with a rude scenario where we will ignore the normal finalizers, which are discarded, and only execute the critical finalizers. As you might expect, a similar distinction occurs between executing normal exception backout on threads, vs. restricting ourselves to any backout that is associated with CERs.
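The finalization half of that distinction is simple enough to model in a few lines. The sketch below is Python, not CLR code; run_finalizers and the (name, is_critical) pairs are hypothetical. It also reflects the earlier ordering rule that normal finalizers are executed or discarded before any critical finalizer runs.

```python
def run_finalizers(objects, rude):
    """Return the names of the objects finalized, in execution order.

    objects is a list of (name, is_critical) pairs. In the polite case
    all normal finalizers run first, then the critical ones. In the rude
    case the normal finalizers are discarded and only critical ones run.
    """
    normal = [name for name, critical in objects if not critical]
    critical = [name for name, critical in objects if critical]
    executed = [] if rude else list(normal)
    executed.extend(critical)
    return executed

# A critical handle queued ahead of the normal buffer that writes to it.
pending = [("file handle", True), ("write buffer", False)]
```

In the polite case the buffer still gets to flush before the handle closes, even though the handle was queued first. In the rude case the pending writes are lost but the handle is not leaked.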
This allows a host to avoid solving the halting problem when performing normal finalization and exception backout, without putting the process at risk with respect to (critical) resource leakage or inconsistent state.
|
-
I’ve already written the much-delayed blog on Hosting, but I can’t post it yet because it mentions a couple of new Whidbey features, which weren’t present in the PDC bits. Obviously Microsoft doesn’t want to make product disclosures through my random blog articles.
I’m hoping this will be sorted out in another week or two.
While we’re waiting, I thought I would talk briefly(!) about pumping and apartments. The CLR made some fundamental decisions about OLE, thread affinity, reentrancy and finalization. These decisions have a significant impact on program correctness, server scalability, and compatibility with legacy (i.e. unmanaged) code. So this is going to be a blog like the one on Shutdown from last August (see http://blogs.msdn.com/cbrumme/archive/2003/08/20/51504.aspx). There will be more detail than you probably care to know about one of the more frustrating parts of the Microsoft software stack.
First, an explanation of my odd choice of terms. I’m using OLE as an umbrella which includes the following pieces of technology:
COM – the fundamental object model, like IUnknown and IClassFactory
DCOM – remoting of COM using IDL, NDR pickling and the SCM
Automation – IDispatch, VARIANT, Type Libraries, etc.
ActiveX – Protocols for controls and their containers
Next, some disclaimers:
I am not and have never been a GUI programmer. So anything I know about Windows messages and pumping is from debugging GUI applications, not from writing them. I’m not going to talk about WM_PENCTL notifications or anything else that requires UI knowledge.
Also, I’m going to point out a number of problems with OLE and apartments. The histories of the CLR and OLE are closely related. In fact, at one point COM+ 1.0 was known internally as COM98 and the CLR was known internally as COM99. We had some pretty aggressive ship targets back then!
In general, I love OLE and the folks who work on it. Although it is inappropriate for the Internet, DCOM is still the fastest and most enterprise-ready distributed object system out there. In a few ways the architecture of .NET Remoting is superior to DCOM, but we never had the time or resources to even approach the engineering effort that has gone into DCOM. Presumably Indigo will eventually change this situation. I also love COM’s strict separation of contract from implementation, the ability to negotiate for contracts, and so much more.
The bottom line is that OLE has had at least as much impact on Microsoft products and the industry, in its day, as .NET is having now.
But, like anything else, OLE has some flaws. In contrast to the stark architectural beauty of COM and DCOM, late-bound Automation is messy. At the time this was all rolled out to the world, I was at Borland and then Oracle. As an outsider, it was hard for me to understand how one team could have produced such a strange combination.
Of course, Automation has been immensely successful – more successful than COM and DCOM. My aesthetic taste is clearly no predictor of what people want. Generally, people want whatever gets the job done, even if it does so in an ad hoc way. And Automation has enabled an incredible number of application scenarios.
Apartments
If there’s another part of OLE that I dislike, it’s Single Threaded Apartments. Presumably everyone knows that OLE offers three kinds of apartments:
Single Threaded Apartment (STA) – one affinitized thread is used to call all the objects residing in the apartment. Any call on these objects from other threads must perform cross-thread marshaling to this affinitized thread, which dispatches the call. Although a process can have an arbitrary number of STAs (with a corresponding number of threads), most client processes have a single Main STA and the GUI thread is the affinitized thread that owns it.
Multiple Threaded Apartment (MTA) – each process has at most one MTA at a time. If the current MTA is not being used, OLE may tear it down. A different MTA will be created as necessary later. Most people think of the MTA as not having thread affinity. But strictly speaking it has affinity to a group of threads. This group is the set of all the threads that are not affinitized to STAs. Some of the threads in this group are explicitly placed in the MTA by calling CoInitializeEx. Other threads in this group are implicitly in the MTA because the MTA exists and because these threads haven’t been explicitly placed into STAs. So, by the strict rules of OLE, it is not legal for STA threads to call on any objects in the MTA. Instead, such calls must be marshaled from the calling STA thread over to one of the threads in the MTA before the call can legally proceed.
Neutral Apartment (NA) – this is a recent invention (Win2000, I think). There is one NA in the process. Objects contained in the NA can be called from any thread in the process (STA or MTA threads). There are no threads associated with the NA, which is why it isn’t called NTA. Calls into NA objects can be relatively efficient because no thread marshaling is ever required. However, these cross-apartment calls still require a proxy to handle the transition between apartments. Calls from an object in the NA to an object in an STA or the MTA might require thread marshaling. This depends on whether or not the current thread is suitable for calling into the target object. For example, a call from an STA object to an NA object and from there to an MTA object will require thread marshaling during the transition out of the NA into the MTA.
Threading
The MTA is effectively a free-threaded model. (It’s not quite a free-threaded model, because STA threads aren’t strictly allowed to call on MTA objects directly). From an efficiency point of view, it is the best threading model. Also, it imposes the least semantics on the application, which is also desirable. The main drawback with the MTA is that humans can’t reliably write free-threaded code.
Well, a few developers can write this kind of code if you pay them lots of money and you don’t ask them to write very much. And if you code review it very carefully. And you test it with thousands of machine hours, under very stressful conditions, on high-end MP machines like 8-ways and up. And you’re still prepared to chase down a few embarrassing race conditions once you’ve shipped your product.
But it’s not a good plan for the rest of us.
The NA model is truly free-threaded, in the sense that any thread in the process can call on these objects. All such threads must still transition through a proxy layer that maintains the apartment boundary. But within the NA all calls are direct and free-threaded. This is the only apartment that doesn’t involve thread affinity.
Although the NA is free-threaded, it is often used in conjunction with a lock to achieve rental threading. The rental model says that only one thread at a time can be active inside an object or a group of objects, but there is no restriction on which thread this might be. This is efficient because it avoids thread marshaling. Rather than marshaling a call from one thread to whatever thread is affinitized to the target objects, the calling thread simply acquires the lock (to rent the context) and then completes the call on the current thread. When the thread returns back out of the context, it releases the lock and now other threads can make calls.
If you call out of a rental context into some other object (as opposed to the return pathway), you have a choice. You can keep holding the lock, in which case other threads cannot rent the context until you fully unwind. In this mode, the rental context supports recursion of the current thread, but it does not support reentrancy from other threads. Alternatively, the thread could release the lock when it calls out of the rental context, in which case it must reacquire the lock when it unwinds back and returns to the rental context. In this mode, the rental context supports full reentrancy.
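The two modes can be compared side by side in a small model. This is a Python sketch under stated assumptions, not how OLE or the CLR implements contexts: RentalContext, call_in, call_out and try_rent are hypothetical names, and the reentrant mode assumes the renting thread entered the context only once before calling out.

```python
import threading

class RentalContext:
    def __init__(self, reentrant):
        self._lock = threading.RLock()   # RLock permits same-thread recursion
        self.reentrant = reentrant

    def call_in(self, fn):
        # Rent the context for the duration of the call.
        with self._lock:
            return fn()

    def call_out(self, fn):
        # Must be invoked by the thread that currently rents the context.
        if not self.reentrant:
            return fn()                  # recursion mode: keep holding the lock
        self._lock.release()             # reentrancy mode: give up the rental
        try:
            return fn()
        finally:
            self._lock.acquire()         # rent again while unwinding back

    def try_rent(self, timeout):
        # Report whether the calling thread could rent the context in time.
        got = self._lock.acquire(timeout=timeout)
        if got:
            self._lock.release()
        return got

def other_thread_can_rent_during_call_out(reentrant):
    ctx = RentalContext(reentrant)
    seen = []

    def outbound_call():
        # While we are "outside" the context, another thread tries to rent it.
        t = threading.Thread(target=lambda: seen.append(ctx.try_rent(0.2)))
        t.start()
        t.join()

    ctx.call_in(lambda: ctx.call_out(outbound_call))
    return seen[0]
```

In recursion mode the second thread times out because the caller still holds the lock across the call-out. In reentrancy mode it gets in, which is exactly the window in which the rented objects can change underneath the original caller.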
Throughout this blog, we’ll be returning to this fundamental decision of whether to support reentrancy. It’s a complex issue.
If only recursion is supported on a rental model, it’s clear that this is a much more forgiving world for developers than a free-threaded model. Once a thread has acquired the rental lock, no other threads can be active in the rented objects until the lock has been released. And the lock will not be released until the thread fully unwinds from the call into the context.
Even with reentrancy, the number of places where concurrency can occur is limited. Unless the renting thread calls out of the context, the lock won’t be released and the developer knows that other threads aren’t active within the rented objects. Unfortunately, it might be hard for the developer to know all the places that call out of the current context, releasing the lock. Particularly in a componentized world, or a world that combines application code with frameworks code, the developer can rarely have sufficient global knowledge.
So it sounds like limiting a rental context to same-thread recursion is better than allowing reentrancy during call outs, because the developer doesn’t have to worry about other threads mutating the state of objects in the rental context. This is true. But it also means that the resulting application is subject to more deadlocks. Imagine what can happen if two rental contexts are simultaneously making calls to each other. Thread T1 holds the lock to rent context C1. Thread T2 holds the lock to rent context C2. If T1 calls into C2 just as T2 calls into C1, and we are on the recursion plan, we have a classic deadlock. Two locks have been taken in different sequences by two different threads. Alternatively, if we are on a reentrancy plan, T1 will release the lock for C1 before contending for the lock on C2. And T2 will release the lock for C2 before contending for the lock on C1. The deadlock has been avoided, but T1 will find that the objects in C1 have been modified when it returns. And T2 will find similar surprises when it returns to C2.
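The T1/T2 scenario can be reproduced with two plain locks. The sketch below is a simplified model under stated assumptions: `cross_call` is a hypothetical helper, each context is just a `threading.Lock`, and a timeout is used so that the demonstration reports the standoff rather than hanging forever as a real deadlock would.

```python
import threading

def cross_call(reentrant, timeout=0.5):
    """Two threads each rent one context, then call into the other's context.

    Returns {thread_name: bool}. False means the call into the other context
    timed out; with a real lock and no timeout that would be a deadlock.
    """
    c1, c2 = threading.Lock(), threading.Lock()
    both_inside = threading.Barrier(2)   # both threads hold their own lock
    both_tried = threading.Barrier(2)    # both threads finished their attempt
    outcome = {}

    def run(name, mine, other):
        mine.acquire()                   # rent my own context
        both_inside.wait()
        if reentrant:
            mine.release()               # reentrancy plan: give up my lock first
        entered = other.acquire(timeout=timeout)
        if entered:
            other.release()
        outcome[name] = entered
        both_tried.wait()                # hold the standoff until both have tried
        if reentrant:
            mine.acquire()               # return into my own context
        mine.release()

    threads = [threading.Thread(target=run, args=("T1", c1, c2)),
               threading.Thread(target=run, args=("T2", c2, c1))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

On the recursion plan both calls fail. On the reentrancy plan both succeed, but nothing in this sketch protects the state of a context while its renting thread is away, which is exactly the surprise described above.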
Affinity
Anyway, we now understand the free-threaded model of the MTA and NA and we understand how to build a rental model on top of these via a lock. How about the single-threaded affinitized model of STAs? It’s hard to completely describe the semantics of an STA, because the complete description must incorporate the details of pages of OLE pumping code, the behavior of 3rd party IMessageFilters, etc. But generally an STA can be thought of as an affinitized rental context with reentrancy and strict stacking. By this I mean:
- It is affinitized rental because all calls into the STA must marshal to the correct thread and because only one logical call can be active in the objects of the apartment at any time. (This is necessarily the case, since there is only ever one thread).
- It has reentrancy because every callout from the STA thread effectively releases the lock held by the logical caller and allows other logical callers to either enter or return back to the STA.
- It has strict stacking because one stack (the stack of the affinitized STA thread) is used to process all the logical calls that occur in the STA. When these logical calls perform a callout, the STA thread reentrantly picks up another call in, and this pushes the STA stack deeper. When the first callout wants to return to the STA, it must wait for the STA thread’s stack to pop all the way back to the point of its own callout.
That point about strict stacking is a key difference between true rental and the affinitized rental model of an STA. With true rental, we never marshal calls between threads. Since each call occurs on its own thread, the pieces of stack for different logical threads are never mingled on an affinitized thread’s actual stack. Returns back into the rental context after a callout can be processed in any order. Returns back into an STA after a callout must be processed in a highly constrained order.
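Strict stacking is easier to see in a trace. The following is a single-threaded toy model, not real OLE pumping code: `StaSimulator`, `post`, `call_out` and `stacking_scenario` are hypothetical names, and the arrival of a reply is simulated by adding a name to a set.

```python
from collections import deque

class StaSimulator:
    """Toy model of an STA: one stack and one queue of incoming calls."""

    def __init__(self):
        self.incoming = deque()   # calls waiting to be marshaled in
        self.replies = set()      # callouts whose reply has already arrived
        self.log = []

    def post(self, name, body):
        self.incoming.append((name, body))

    def dispatch_next(self):
        name, body = self.incoming.popleft()
        self.log.append("enter " + name)
        body(self)
        self.log.append("exit " + name)

    def call_out(self, name):
        # Pump: service other incoming calls on this same stack until the
        # reply for this callout has arrived.
        while name not in self.replies:
            if not self.incoming:
                raise RuntimeError("nothing to pump while waiting for " + name)
            self.dispatch_next()
        self.log.append("return " + name)

def stacking_scenario():
    sta = StaSimulator()

    def call_a(s):
        s.call_out("A")

    def call_b(s):
        s.replies.add("A")        # A's reply arrives while B is on the stack
        s.call_out("B")

    def call_c(s):
        s.replies.add("B")        # B's reply arrives while C is on the stack

    for name, body in (("A", call_a), ("B", call_b), ("C", call_c)):
        sta.post(name, body)
    sta.dispatch_next()
    return sta.log
```

In the resulting log, A's reply is the first to arrive but the last to be processed, because B and C were pushed on top of A's callout and the stack has to pop back down to it.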
We’ve already seen a number of problems with STAs due to thread affinity, and we can add some more. Here’s the combined list:
- Marshaling calls between threads is expensive, compared to taking a lock.
- Processing returns from callouts in a constrained fashion can lead to inefficiencies. For instance, if the topmost return isn’t ready for processing yet, should the affinitized thread favor picking up a new incoming call (possibly leading to unconstrained stack growth), or should it favor waiting for the topmost return to complete (possibly idling the affinitized thread completely and conceivably resulting in deadlocks)?
- Any conventional locks held by an affinitized thread are worthless. The affinitized thread is processing an arbitrary number of logical calls, but a conventional lock (like an OS CRITICAL_SECTION or managed Monitor) will not distinguish between these logical calls. Instead, all lock acquisitions are performed by the single affinitized thread and are granted immediately as recursive acquisitions. If you are thinking of building a more sophisticated lock that avoids this issue, realize that you are making that classic reentrancy vs. deadlock decision all over again.
- Imagine a common server situation. The first call comes in from a particular client, creates a few objects (e.g. a shopping cart) and returns. Subsequent calls from that client manipulate that initial set of objects (e.g. putting some items into the shopping cart). A final call checks out the shopping cart, places the order, and all the objects are garbage collected. Now imagine that all those objects are affinitized to a particular thread. As a consequence, the dispatch logic of your server must ensure that all calls from the same client are routed to the same thread. And if that thread is busy doing other work, the dispatch logic must delay processing the new request until the appropriate affinitized thread is available. This is complicated and it has a severe impact on scalability.
- STAs must pump. (How did I get this far without mentioning pumping?)
- Any STA code that assumed a single-threaded world for the process, rather than just for the apartment, might not pump. Such code breaks when we introduce the CLR into the process, as we will see.
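The point in the list above about conventional locks can be shown with any recursive lock. In this sketch a Python `threading.RLock` stands in for a CRITICAL_SECTION or Monitor, and `second_call_gets_lock` is a hypothetical helper. It models a second logical call arriving while the first still holds the lock.

```python
import threading

def second_call_gets_lock(same_thread):
    """Does a second logical call acquire a lock the first call still holds?"""
    lock = threading.RLock()      # recursive, like CRITICAL_SECTION or Monitor
    granted = []

    def second_logical_call():
        got_it = lock.acquire(blocking=False)
        granted.append(got_it)
        if got_it:
            lock.release()

    with lock:                    # the first logical call believes it is alone
        if same_thread:
            # Models an STA: the second call is picked up reentrantly on the
            # same affinitized thread while the first is in a callout.
            second_logical_call()
        else:
            # Models true rental: the second call runs on its own thread.
            other = threading.Thread(target=second_logical_call)
            other.start()
            other.join()
    return granted[0]
```

On the same thread the acquisition is granted immediately as a recursive acquisition, so the lock protects nothing. On a different thread it is refused, as intended.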
Failure to Pump
Let’s look at those last two bullet points in more detail. When your STA thread is doing nothing else, it needs to be checking to see if any other threads want to marshal some calls into it. This is done with a Windows message pump. If the STA thread fails to pump, these incoming calls will be blocked. If the incoming calls are GUI SendMessages or PostMessages (which I think of as synchronous or asynchronous calls respectively), then failure to pump will produce an unresponsive UI. If the incoming calls are COM calls, then failure to pump will result in calls timing out or deadlocking.
If processing one incoming call is going to take a while, it may be necessary to break up that processing with intermittent visits to the message pump. Of course, if you pump you are allowing reentrancy to occur at those points. So the developer loses all his wonderful guarantees of single threading.
Unfortunately, there’s a whole lot of STA code out there which doesn’t pump adequately. For the most part, we see this in non-GUI applications. If you have a GUI application that isn’t pumping enough, it’s obvious right there on the screen. Those bugs tend to get fixed.
For non-GUI applications, a failure to pump may not be noticed in unmanaged code. When that code is moved to managed (perhaps by re-compiling some VB6 code as VB.NET), we start seeing bugs. Let’s look at a couple of real-world cases that we encountered during V1 of the CLR and how the lingering effects of these cases are still causing major headaches for managed developers and for Microsoft Support. I’ll describe a server case first, and then a client case.
ADO and ASP Compatibility Mode
ADO.NET and ASP.NET are a winning combination. But ASP.NET also supports an ASP compatibility mode. In this mode, legacy ASP pages can be served up by the managed ASP.NET pipeline. Such pages were written before we invented our managed platform, so they use ADO rather than ADO.NET for any data access. Also, in this mode the DCOM threadpool is used rather than the managed System.Threading.ThreadPool. Although all the threads in the managed ThreadPool are explicitly placed in the MTA (as you might hope and expect), the DCOM threadpool actually contains STA threads.
The purpose of this STA threadpool was to allow legacy STA COM objects in general, and VB6 objects in particular, to be moved from the client to the server. The result suffers from the scaling problems I alluded to before, since requests are dispatched on up to 100 STA threads with careful respect for any affinity. Also, VB6 has a variable scope which corresponds to “global” (I forget its name), but which is treated as per-thread when running on the server. If there are more than 100 clients using a server, multiple clients will share a single STA thread based on the whims of the request dispatch logic. This means that global variables are shared between sets of clients in a surprising fashion, based on the STA that they happen to correspond to.
A typical ASP page written in VBScript would establish a (hopefully pooled) database connection from ADO, query up a row, modify a field, and write the row back to the database. Since the page was likely written in VB, any COM AddRef and Release calls on the ADO row and field value objects were supplied through the magic of the VB6 runtime. This means they occur on the same thread and in a very deterministic fashion.
The ASP page contains no explicit pumping code. Indeed, at no point was the STA actually pumped. Although this is a strict violation of the rules, it didn’t cause any problems. That’s because there are no GUI messages or inter-apartment COM calls that need to be serviced.
This technique of executing ASP pages on STAs with ADO worked fairly well – until we tried to extend the model to ASP.NET running in ASP compatibility mode. The first problem that we ran into was that all managed applications are automatically multi-threaded. For any application of reasonable complexity, there are sure to be at least a few finalizable objects. These objects will have their Finalize methods called by one or more dedicated finalizer threads that are distinct from the application threads.
(It’s important that finalization occurs on non-application threads, since we don’t want to be holding any application locks when we call the Finalize method. And today the CLR only has a single Finalizer thread, but this is an implementation detail. It’s quite likely that in the future we will concurrently call Finalize methods on many objects, perhaps by moving finalization duties over to the ThreadPool. This would address some scalability concerns with finalization, and would also allow us to make stronger guarantees about the availability of the finalization service).
Our COM Interop layer ensures that we almost always call COM objects only in the correct apartment and context. The one place where we violate COM rules is when the COM object’s apartment or context has been torn down. In that case, we will still call IUnknown::Release on the pUnk to try to recover its resources, even though this is strictly illegal. We’ve gone backwards and forwards on whether this is appropriate, and we provide a Customer Debug Probe so that you can detect whether this is happening in your application.
Anyway, let’s pretend that we absolutely always call the pUnk in the correct apartment and context. In the case of an object living in an STA, this means that the Finalizer thread will marshal the call to the affinitized thread of that STA. But if that STA thread is not pumping, the Finalizer thread will block indefinitely while attempting to perform the cross-thread marshaling.
The effect on a server is crippling. The Finalizer thread makes no progress. The number of unreleased pUnks grows without bounds. Eventually some resource (usually memory) is exceeded and the process crashes.
One solution is to edit the original ASP page to pump the underlying STA thread that it is executing on. A light-weight way to pump is to call Thread.CurrentThread.Join(0). This causes the current thread to block until the current thread dies (which isn’t going to happen) or until 0 milliseconds have elapsed – whichever happens first. I’ll explain later why this also performs some pumping and why this is a controversial aspect of the CLR. A heavier-weight way to pump is to call GC.WaitForPendingFinalizers. This not only performs pumping, but it also waits for the Finalization queue to drain.
If you are porting a page that produces a modest number of COM objects, doing a simple Join on each page may be sufficient. If your page performs elaborate processing, perhaps creating an unbounded number of COM objects in a loop, then you may need to either add a Join within the loop or WaitForPendingFinalizers at the end of the page processing. The only way to really know is to experiment with both techniques, measuring the growth of the Finalization queue and the impact on server throughput.
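The interaction between a non-pumping STA and the Finalizer can be modeled outside the CLR. The Python sketch below is an assumption-laden stand-in: `StaThreadModel`, `finalizer_release` and `run_page` are hypothetical names, `pump()` plays the role that `Thread.CurrentThread.Join(0)` or `GC.WaitForPendingFinalizers` plays in managed code, and the Finalizer gives up after a timeout only so that the demonstration terminates.

```python
import queue
import threading
import time

class StaThreadModel:
    """Hypothetical stand-in for an STA thread's queue of marshaled calls."""

    def __init__(self):
        self.inbox = queue.Queue()

    def pump(self):
        """Service every pending cross-thread call; return how many ran."""
        handled = 0
        while True:
            try:
                work, done = self.inbox.get_nowait()
            except queue.Empty:
                return handled
            work()
            done.set()
            handled += 1

def finalizer_release(sta, released, timeout):
    """Marshal a Release to the STA and wait for it to complete."""
    done = threading.Event()
    sta.inbox.put((lambda: released.append("pUnk"), done))
    return done.wait(timeout)     # the real Finalizer would wait indefinitely

def run_page(pumps):
    """Return (finalizer_unblocked, number_of_pUnks_released)."""
    sta, released, result = StaThreadModel(), [], []
    finalizer = threading.Thread(
        target=lambda: result.append(finalizer_release(sta, released, 0.5)))
    finalizer.start()
    if pumps:
        while sta.pump() == 0:    # the page visits the pump until a call arrives
            time.sleep(0.01)
    finalizer.join()
    return result[0], len(released)
```

Without pumping, the marshaled Release is never serviced and the count of unreleased pUnks can only grow. With even one visit to the pump, the Finalizer is unblocked.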
ADO’s Threading Model
There was another problem with using ADO from ASP.NET’s ASP compatibility mode. Do you know what the threading model of ADO is? Well, if you check the registry for some ADO CLSIDs on your machine, you may find them registered as ThreadingModel=Single or you may find them registered as ThreadingModel=Both.
If these classes are registered as Single, OLE will carefully ensure that their instances can only be called from the thread that they were created on. This implies that the objects can assume a single-threaded view of the world and they do not need to be written in a thread-safe manner. If these classes are registered as Both, OLE will ensure that their instances are only called from threads in the right apartment. But if that apartment is the MTA, these objects better have been written in a thread-safe manner. For example, they had better be using InterlockedIncrement and Decrement, or an equivalent, for reference counting.
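As a rough illustration of what "thread-safe reference counting" means here, the sketch below guards the count with a lock. This is a Python analogy rather than COM code: `RefCounted` and `hammer` are hypothetical names, and the `threading.Lock` stands in for InterlockedIncrement and InterlockedDecrement.

```python
import threading

class RefCounted:
    """Toy object whose reference count may be touched from many threads."""

    def __init__(self):
        self._count = 1                  # the creator holds the first reference
        self._guard = threading.Lock()   # stands in for Interlocked operations
        self.destroyed = False

    def add_ref(self):
        with self._guard:
            self._count += 1
            return self._count

    def release(self):
        with self._guard:
            self._count -= 1
            remaining = self._count
        if remaining == 0:
            self.destroyed = True        # a COM object would delete itself here
        return remaining

def hammer(obj, threads=8, rounds=2000):
    """AddRef/Release from many threads, then drop the creator's reference."""
    def worker():
        for _ in range(rounds):
            obj.add_ref()
            obj.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return obj.release()
```

An object registered as Single can skip the guard because OLE promises it only one thread. An object registered as Both cannot make that assumption once it is created in the MTA.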
Unfortunately, the ADO classes are not thread-safe. Strictly speaking, they should never be registered as anything but Single. You may find them registered as Both on your machine because this improves scalability and throughput for some key scenarios. And those key scenarios happen to limit themselves to “one thread at a time” because of how ASP and VB6 work.
In fact, the legacy ADO classes don’t even support single-threaded access if there is reentrancy. They will randomly crash when used in this manner (and this is exactly the manner in which ADO was driven in the early days of V1). Here are the steps:
1) The page queries up an ADO row object, which enters managed code via COM Interop as an RCW (runtime-callable wrapper).
2) By making a COM call on this RCW, the page navigates to a field value. This field value also enters managed code via COM Interop as an RCW.
3) The page now makes a COM call via ADO which results in a call out to the remote database. At this point, the STA thread is pumped by the DCOM remote call. Since this is a remote call, it’s going to take a while before it returns.
4) The garbage collector decides that it’s time to collect. At this point, the RCW for the field value is still reachable and is reported. The RCW for the row object is no longer referenced by managed code and is collected.
5) The Finalizer thread notices that the pUnk underlying the row’s RCW is no longer in use, and it makes the cross-apartment call from the Finalizer thread’s apartment (MTA) to the ADO row object’s apartment (STA).
6) Recall that the STA thread is pumping for the duration of the remote database call (#3 above). It picks up the cross-thread call from the Finalizer (#5 above) and performs the Release on the Row object. This is the final Release and ADO deletes the unmanaged Row object from memory. This logical call unwinds and the Finalizer thread is unblocked (hurray). The STA thread returns to pumping.
7) The remote database call returns back to the server machine. The STA thread picks it up from its pumping loop and returns back to the page, unwinding the thread.
8) The page now updates the field value, which involves a COM call to the underlying ADO object.
9) ADO crashes or randomly corrupts memory.
What happened? The ADO developers made a questionable design decision when they implemented COM reference counting throughout their hierarchy. The field values refer to their owning row object, but they don’t hold a reference count on that row. Instead, they assume that the row will live as long as all of its associated field values. And yet, whenever the application makes an ADO call on a field value, the field value will access that (hopefully present) row.
This assumption worked fine in the days of ASP and VB6. So nobody even noticed the bug until the CLR violated those threading assumptions – without violating the underlying OLE rules, of course.
It was impractical to fix this by opening up ADO and rewriting the code. There are many different versions of ADO in existence, and many products that distribute it. Another option was to add GC.KeepAlive(row) calls at the bottom of each page, to extend the lifetime of the row objects until the field values were no longer needed. This would have been a nightmare for Support.
Instead, the ADO team solved the problem for managed code with a very elegant technique. (I invented it, so of course I think it was elegant). They opened up the assembly that was created by TlbImp’ing ADO. Then they added managed references from the RCWs of the field values to the RCWs of their owning rows. These managed references are completely visible to the garbage collector. Now the GC knows that if the field values are reachable then their owning rows must also be reachable. Problem solved.
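Both the bug and the fix can be modeled in a few lines. The sketch below is a Python analogy, not ADO or COM Interop code: every class name is hypothetical, a `weakref` stands in for the field's uncounted pointer to its row, and a `RuntimeError` stands in for what would really be a crash or memory corruption.

```python
import gc
import weakref

class NativeRow:
    def __init__(self):
        self.values = {"name": "old"}

class NativeField:
    def __init__(self, row, key):
        self._row = weakref.ref(row)   # back-pointer that holds no reference
        self._key = key

    def update(self, value):
        row = self._row()
        if row is None:
            # Real ADO would crash or corrupt memory here.
            raise RuntimeError("row was already released")
        row.values[self._key] = value
        return row.values[self._key]

class RowRcw:
    """Managed wrapper whose lifetime controls the native row."""
    def __init__(self):
        self.native = NativeRow()

class FieldRcw:
    def __init__(self, row_rcw, key, keep_row_alive):
        self.native = NativeField(row_rcw.native, key)
        # The fix: a managed reference from the field's RCW to the row's RCW.
        self.owner = row_rcw if keep_row_alive else None

def page(keep_row_alive):
    row = RowRcw()
    field = FieldRcw(row, "name", keep_row_alive)
    del row              # managed code no longer references the row
    gc.collect()         # the collector runs during the remote call
    try:
        return field.native.update("new")
    except RuntimeError as error:
        return str(error)
```

Without the managed reference, the row disappears as soon as the page drops it and the later field update fails. With it, the row stays reachable for as long as the field does.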
No Typelib Registered
Incidentally, we ran into another very common problem when we moved existing client or server COM applications over to managed code. Whenever an application uses a COM object, it tries hard to match the thread of the client to the ThreadingModel of the server. In other words, if the application needs to use a ThreadingModel=Main COM object, the application tries to ensure that the creating thread is in an STA. Similarly, if the application needs to use a ThreadingModel=Free COM object, it tries to create this object from an MTA thread. Even if a COM object is ThreadingModel=Both, the application will try to access the object from the same sort of thread (STA vs. MTA) as the thread that created the object.
One reason for doing this is performance. If you can avoid an apartment transition, your calls will be much faster. Another reason has to do with pumping and reentrancy. If you make a cross-apartment call into an STA, the STA better be pumping to pick up your call. And if you make a cross-apartment call out of an STA, your thread will start pumping and your application becomes reentrant. This is a small dose of free-threading, and many application assumptions start to break. A final reason for avoiding apartment transitions is that they often aren’t supported. For instance, most ActiveX scenarios require that the container and the control are in the same STA. If you introduce an apartment boundary (even between two STAs), bizarre cases like Input Synchronous messages stop working properly.
The net result is that a great many applications avoid using COM objects across apartment boundaries. And this means that – even if that COM object is nominally marshalable across an apartment boundary – this often isn’t being tested. So an application might install itself without ensuring that the typelib of the COM component is actually registered.
When the application is moved to managed code, developers are frustrated to see InvalidCastExceptions on the managed side. A typical sequence is that they successfully ‘new’ the COM object, implying that the CoCreate returned a pUnk which was wrapped in an RCW. Then when they cast it to one of the interfaces that they know is supported, a casting exception is thrown. This casting exception is due to a QueryInterface call failing with E_NOINTERFACE. Yet this HRESULT is not returned by the COM object, which does indeed support the interface. Instead, it is returned by a COM apartment proxy which sits between the RCW and that COM object. The COM apartment proxy is simply failing to marshal the interface across the apartment boundary – usually because the COM object is using the OLEAUT marshaler and the Typelib has not been properly registered.
This is a common failure, and it’s unfortunate that a generic E_NOINTERFACE doesn’t lead to better debuggability for this case.
Finally, I can’t help but mention that the COM Interop layer added other perturbations to many unmanaged COM scenarios that seemed to be working just fine. Common perturbations from managed code include garbage collection, a Finalizer thread, strict conformance to OLE marshaling rules, and the fact that managed objects are agile with respect to COM apartments and COM+ contexts (unless they derive from ServicedComponent).
For instance, Trident required that all calls on its objects occur on the correct thread. But Trident also had an extension model where 3rd party objects could be aggregated onto their base objects. Unfortunately, the aggregator performed blind delegation to the 3rd party objects. And – even more unfortunate – this blind delegation did not exclude QI’s for IMarshal. Of course, managed objects implement IMarshal to achieve their apartment and context agility. So if Trident aggregated a managed object as an extension, the containing Trident object would attempt to become partially agile in a very broken way.
Hopefully we found and dealt with most of these issues before we shipped V1.
Not Pumping a Client
I said I would describe two cases where non-pumping unmanaged code caused problems when we moved to managed code. The above explains, in great detail, how ADO and ASP compatibility mode caused us problems on the server. Now let’s look at the non-GUI client case.
We all know that a WinForms GUI client is going to put the main GUI thread into an STA. And we know that there’s a lot of pumping in a GUI application, or else not much is going to show on the screen.
Assume for a moment that a Console application also puts its main thread into an STA. If that main thread creates any COM objects via COM Interop, and if those COM objects are ThreadingModel=Main or Both, then the application better be pumping. If it fails to pump, we’ll have exactly the same situation as with our server running ASP compatibility mode. The Finalizer thread won’t be able to marshal calls into the STA to Release any pUnks.
On a well-loaded server, that failure is quickly noticed by the developer or by the folks in operations. But on a client, this might be just a mild case of constipation. The rate of creation of finalizable objects may be low enough that the problem is never noticed. Or it may be noticed as a gradual build up of resources. If the problem is reported to Microsoft Support, the customer generally categorizes it as a garbage collection bug.
So what is the apartment of a Console application’s main thread? Well, it depends.
If you build a Console application in Notepad, the main thread is likely to start off in the MTA. If you build a Console application with Visual Studio, then if you pick C# or VB.NET your main thread is likely to be in an STA. If you build a Console application with Visual Studio and you choose managed C++, your main thread is likely to be in an MTA for V1 or V1.1. I think it’s likely to be in an STA for our next release.
Wow. Why are we all over the place on this? Mostly, it’s because there is no correct answer. Either the developer is not going to use any COM objects in his Console application, in which case the choice doesn’t really matter, or the developer is going to use some COM objects and this should inform his decision.
For instance, if the developer will use COM objects with ThreadingModel=Main, he probably wants to put his main thread into an STA so he can use the COM objects directly without cross-thread marshaling and all the issues that this would imply. This means he should also pump that thread, if there are other threads (like the Finalizer!) active in the process. Alternatively, if the developer intends to use COM objects with ThreadingModel=Free, he probably wants to put his main thread in the MTA so he can access those objects directly. Now he doesn’t need to pump, but he does need to consider the implications of writing free-threaded code.
Either way, the developer has some responsibility.
Unfortunately, the choice of a default is typically made by the project type that he selects in Visual Studio, or is based on the CLR’s default behavior (which favors MTA). And realistically the subtleties of apartments and pumping are beyond the knowledge (or interest) of most managed developers. Let’s face it: nobody should have to worry about this sort of thing.
The Managed CoInitialize Mess
There are three ways to select an apartment choice for the main thread of your Console application. All three of these techniques have concerns associated with them.
1) You can place either an STAThreadAttribute or MTAThreadAttribute onto the main method.
2) You can perform an assignment to System.Threading.Thread.CurrentThread.ApartmentState as one of the first statements of your main method (or of your thread procedure if you do a Thread.Start).
3) You can accept the CLR’s default of MTA.
So what’s wrong with each of these techniques?
The first technique is the preferred method, and it works very well for C#. After some tweaks to the VB.NET compiler before we shipped V1, it worked well for VB too. Managed C++ still doesn’t properly support this technique. The reason is that the entrypoint of a managed C++ EXE isn’t actually your ‘main’ routine. Instead, it’s a method inside the C-runtime library. That method eventually delegates to your ‘main’ routine. But the CLR doesn’t scan through the closure of calls from the entrypoint when looking for the custom attribute that defines the threading model. If the CLR doesn’t find it on the method that is the EXE’s entrypoint, it stops looking. The net result is that your attribute is quietly ignored for C++.
I’m told that this will be addressed in Whidbey, by having the linker propagate the attribute from ‘main’ to the CRT entrypoint. And indeed this is how the VB.NET compiler works today.
What’s wrong with the second technique? Unfortunately, it is subject to a race condition. Before the CLR can actually call your thread procedure, it may first call some module constructors, class constructors, AssemblyLoad notifications and AssemblyResolve notifications. All of this execution occurs on the thread that was just created. What happens if some of these methods set the thread’s ApartmentState before you get a chance? What happens if they call Windows services like the clipboard that also set the apartment state? A more likely scenario is that one of these other methods will make a PInvoke call that marshals a BSTR, SAFEARRAY or VARIANT. Even these innocuous operations can force a CoInitializeEx on your thread and limit your ability to configure the thread from your thread procedure.
When you are developing your application, none of the above is likely to occur. The real nightmare scenario is that a future version of the CLR will provide a JIT that inlines a little more aggressively, so some extra class constructors execute before your thread procedure. In other words, you will ship an application that is balanced on a knife edge here, and this will become an App Compatibility issue for all of us. (See http://blogs.msdn.com/cbrumme/archive/2003/11/10/51554.aspx for more details on the sort of thing we worry about here).
In fact, for the next release of the CLR we are seriously considering making it impossible to set the apartment state on a running thread in this manner. At a minimum, you should expect to see a Customer Debug Probe warning of the risk here.
And the third technique from above has a similar problem. Recall that threads in the MTA can be explicitly placed there through a CoInitializeEx call, or they can be implicitly treated as being in the MTA because they haven’t been placed into an STA. The difference between these two cases is significant.
If a thread is explicitly in the MTA, any attempt to configure it as an STA thread will fail with an error of RPC_E_CHANGED_MODE. By contrast, if a thread is implicitly in the MTA it can be moved to an STA by calling CoInitializeEx. This is more likely than it may sound. If you attempt a clipboard operation, or you call any number of other Windows services, the code you call may attempt to place your thread in the STA. And when you accept the CLR default behavior, it currently leaves the thread implicitly in the MTA and therefore is subject to reassignment.
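The difference between explicit and implicit membership can be captured in a small state model. This is a simplified Python sketch, not the OLE implementation: `ThreadModel`, `clipboard_like_service` and `apartment_scenario` are hypothetical names, the HRESULTs are represented as strings, and the model assumes an MTA already exists for the implicit case.

```python
S_OK = "S_OK"
S_FALSE = "S_FALSE"
RPC_E_CHANGED_MODE = "RPC_E_CHANGED_MODE"

class ThreadModel:
    """Tracks only whether a thread has explicitly chosen an apartment."""

    def __init__(self):
        self.explicit = None              # None: CoInitializeEx never called

    @property
    def apartment(self):
        # Assumption of this sketch: an MTA exists for unconfigured threads.
        return self.explicit or "MTA (implicit)"

    def co_initialize_ex(self, requested):
        """requested is 'STA' or 'MTA'."""
        if self.explicit is None:
            self.explicit = requested     # the first caller decides
            return S_OK
        if self.explicit == requested:
            return S_FALSE                # already initialized the same way
        return RPC_E_CHANGED_MODE         # too late to change

def clipboard_like_service(thread):
    """Hypothetical library code that wants to run in an STA."""
    return thread.co_initialize_ex("STA")

def apartment_scenario(explicit_mta_first):
    thread = ThreadModel()
    if explicit_mta_first:
        thread.co_initialize_ex("MTA")
    result = clipboard_like_service(thread)
    return result, thread.apartment
```

A thread left implicitly in the MTA is silently moved to an STA by the first piece of code that asks. A thread placed there explicitly refuses the change.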
This is another place where we are seriously considering changing the rules in the next version of the CLR. Rather than place threads implicitly in the MTA, we are considering making this assignment explicit and preventing any subsequent reassignment. Once again, our motivation is to reduce the App Compat risk for applications after they have been deployed.
Speaking of race conditions and apartments, the CLR has a nasty bug which was introduced in V1 and which we have yet to remove. I’ve already mentioned that any threads that aren’t in STAs or explicitly in the MTA are implicitly in the MTA. That’s not strictly true. These threads are only in the MTA if there is an MTA for them to be in.
There is an MTA if OLE is active in the process and if at least one thread is explicitly in the MTA. When this is the case, all the other unconfigured threads are implicitly in the MTA. But if that one explicit thread should terminate or CoUninitialize, then OLE will tear down the MTA. A different MTA may be created later, when a thread explicitly places itself into it. And at that point, all the unconfigured threads will implicitly join it.
But this destruction and recreation of the MTA has some serious impacts on COM Interop. In fact, any changes to the apartment state of a thread can confuse our COM Interop layer, cause deadlocks on downlevel platforms, and lead to memory leaks and violation of OLE rules.
Let’s look at how this specific race condition occurs first, and then I’ll talk about the larger problems here.
1) An unmanaged thread CoInitializes itself for the MTA and calls into managed code.
2) While in managed code, that thread introduces some COM objects to our COM Interop layer in the form of RCWs, perhaps by ‘new’ing them from managed code.
3) The CLR notices that the current thread is in the MTA, and realizes that it must “keep the MTA alive.” We signal the Finalizer thread to put itself explicitly into the MTA via CoInitializeEx.
4) The unmanaged thread returns out to unmanaged code where it either dies or simply calls CoUninitialize. The MTA is torn down.
5) The Finalizer thread wakes up and explicitly CoInitializes itself into the MTA. Oops. It’s too late to keep the original MTA alive and it has the effect of creating a new MTA. At least this one will live until the end of the process.
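The five steps above reduce to an ordering problem, which the following sequential model makes explicit. It is a hedged sketch and not CLR or OLE code: `OleProcessModel` and `mta_race` are hypothetical names, and a generation counter is an invented device for telling the original MTA apart from a recreated one.

```python
class OleProcessModel:
    """Counts explicit MTA threads; the MTA exists only while the count > 0."""

    def __init__(self):
        self.explicit_mta_threads = 0
        self.mta_generation = 0           # bumped each time a fresh MTA is created

    def co_initialize_mta(self):
        if self.explicit_mta_threads == 0:
            self.mta_generation += 1      # no MTA existed, so this creates one
        self.explicit_mta_threads += 1

    def co_uninitialize(self):
        self.explicit_mta_threads -= 1    # at zero, OLE tears the MTA down

    def mta_exists(self):
        return self.explicit_mta_threads > 0

def mta_race(finalizer_wins):
    """Return True if the RCW's original MTA is still the current one."""
    ole = OleProcessModel()
    ole.co_initialize_mta()               # step 1: unmanaged thread joins the MTA
    rcw_generation = ole.mta_generation   # step 2: RCWs are created in this MTA
    # Step 3: the Finalizer is only signaled; nobody waits for it to act.
    if finalizer_wins:
        ole.co_initialize_mta()           # Finalizer gets there first
        ole.co_uninitialize()             # step 4: unmanaged thread leaves
    else:
        ole.co_uninitialize()             # step 4: last explicit thread leaves
        ole.co_initialize_mta()           # step 5: too late, this is a new MTA
    return rcw_generation == ole.mta_generation
```

When the Finalizer wins, the original apartment survives. When the unmanaged thread wins, the RCWs are left pointing into an apartment that no longer exists.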
As far as I know, this is the only race condition in the CLR that we haven’t fixed. Why have we ignored it all these years? First, we’ve never seen it reported from the field. This isn’t so surprising when you consider that the application often shares responsibility for keeping the MTA alive. Many applications are aware of this obligation and – if they use COM – they always keep an outstanding CoInitialize on one MTA thread so the apartment won’t be torn down. Second, I generally resist fixing bugs by adding inter-thread dependencies. It would be all too easy to create a deadlock by making step 3 wait for the Finalizer thread to CoInitialize itself, rather than just signaling it to do so. This is particularly true since the causality of calls from the Finalizer to other threads is often opaque to us, as I’ll explain later. And we certainly don’t want to create a dedicated thread for this purpose. Dedicated threads have a real impact on Terminal Server scenarios, where the cost of one thread in a process is multiplied by all the processes that are running. Even if we were prepared to pay this cost, we would want to create this thread lazily. But synchronizing with the creation of another thread is always a dangerous proposition. Thread creation involves taking the OS loader lock and making DLL_THREAD_ATTACH notifications to all the DllMain routines that didn’t explicitly disable these calls.
The bottom line is that the fix is expensive and distasteful. And it speaks to a more general problem, where many different components in a process may be individually spinning up threads to keep the MTA from being recycled. A better solution is for OLE to provide an API to keep this apartment alive, without requiring all those dedicated threads. This is the approach that we are pursuing for the long term.
In our general cleanup of the CLR’s treatment of CoInitialize, we are also likely to change the semantics of assigning the current thread’s ApartmentState to Unknown. In V1 & V1.1 of the CLR, any attempt to set the state to Unknown would throw an ArgumentOutOfRangeException, so we’re confident that we can make this change without breaking applications.
If the CLR has performed an outstanding CoInitializeEx on this thread, we may treat the assignment to Unknown as a request to perform a CoUninitialize to reverse the operation. Currently, the only way you can CoUninitialize a thread is to PInvoke to the OLE32 service. And such changes to the apartment state are uncoordinated with the CLR.
Now why does it matter if the apartment state of a thread changes, without the CLR knowing? It matters because:
1) The CLR may hold RCWs over COM objects in the apartment that is about to disappear. Without a notification, we cannot legally release those pUnks. As I’ve already mentioned, we break the rules here and attempt to Release anyway. But it’s still a very bad situation and sometimes we will end up leaking.
2) The CLR will perform limited pumping of STA threads when you perform managed blocking (e.g. WaitHandle.WaitOne). If we are on a recent OS, we can use the IComThreadingInfo interface to efficiently determine whether we should pump or not. But if we are on a downlevel platform, we would have to call CoInitialize prior to each blocking operation and check for a failure code to absolutely determine the current state of the thread. This is totally impractical from a performance point of view. So instead we cache what we believe is the correct apartment state of the thread. If the application performs a CoInitialize or CoUninitialize without informing us, then our cached knowledge is stale. So on downlevel platforms we might neglect to pump an STA (which can cause deadlocks). Or we may attempt to pump an MTA (which can cause deadlocks).
Incidentally, if you ever run managed applications under a diagnostic tool like AppVerifier, you may see complaints from that tool at process shutdown that we have leaked one or more CoInitialize calls. In a well-behaved application, each CoInitialize would have a balancing CoUninitialize. However, most processes are not so well-behaved. It’s typical for applications to terminate the process without unwinding all the threads of the process. There’s a very detailed description of the CLR’s shutdown behavior at http://blogs.msdn.com/cbrumme/archive/2003/08/20/51504.aspx.
The bottom line here is that the CLR is heavily dependent on knowing exactly when apartments are created and destroyed, or when threads become associated or disassociated with those apartments. But the CLR is largely out of the loop when these operations occur, unless they occur through managed APIs. Unfortunately, we are rarely informed. For an extreme example of this, the Shell has APIs which require an STA. If the calling thread is implicitly in the MTA, these Shell APIs CoInitialize that calling thread into an STA. As the call returns, the API will CoUninitialize and rip down the apartment.
We would like to do better here over time. But there are some pretty deep problems and most solutions end up breaking an important scenario here or there.
Back to Pumping
Enough of the CoInitialize mess. I mentioned above that managed blocking will perform some pumping when called on an STA thread.
Managed blocking includes a contentious Monitor.Enter, WaitHandle.WaitOne, WaitHandle.WaitAny, GC.WaitForPendingFinalizers, our ReaderWriterLock and Thread.Join. It also includes anything else in FX that calls down to these routines. One noticeable place where this happens is during COM Interop. There are pathways through COM Interop where a cache miss occurs on finding an appropriate pUnk to dispatch a call. At those points, the COM call is forced down a slow path and we use this as an opportunity to pump a little bit. We do this to allow the Finalizer thread to release any pUnks on the current STA, if the application is neglecting to pump. (Remember those ASP Compat and Console client scenarios?) This is a questionable practice on our part. It causes reentrancy at a place where it normally could never occur in pure unmanaged scenarios. But it allows a number of applications to successfully run without clogging up the Finalizer thread.
Anyway, managed blocking does not include PInvokes directly to any of the OS blocking services. And keep in mind that if you PInvoke to the OS blocking services directly, the CLR will no longer be able to take control of your thread. Operations like Thread.Interrupt, Thread.Abort and AppDomain.Unload will be indefinitely delayed.
Did you notice that I neglected to mention WaitHandle.WaitAll in the list of managed blocking operations? That’s because we don’t allow you to call WaitAll from an STA thread. The reason is rather subtle. When you perform a pumping wait, at some level you need to call MsgWaitForMultipleObjectsEx, or a similar Msg* based variant. But the semantics of a WAIT_ALL on an OS MsgWaitForMultipleObjectsEx call are rather surprising and not what you want at all. It waits for all the handles to be signaled AND for a message to arrive at the message queue. In other words, all your handles could be signaled and the application will keep blocking until you nudge the mouse! Ugh.
We’ve toyed with some workarounds for this case. For example, you could imagine spinning up an MTA thread and having it perform the blocking operation on the handles. When all the handles are signaled, it could set another event. The STA thread would do a WaitHandle.WaitOne on that other event. This gives us the desired behavior that the STA thread wakes up when all handles are signaled, and it still pumps the message queue. However, if any of those handles are “thread-owned”, like a Mutex, then we have broken the semantics. Our sacrificial MTA thread now owns the Mutex, rather than the STA thread.
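The helper-thread idea can be sketched in a few lines. This is a rough Python analogue, not what the CLR does: `wait_all_while_pumping`, `inbox` and `handle_message` are hypothetical names, a `queue.Queue` stands in for the message queue, and the events are assumed to stay set once signaled (manual-reset style). Instead of a second event, the helper posts a sentinel into the same queue, so a single blocking wait covers both messages and completion.

```python
import queue
import threading


def wait_all_while_pumping(events, inbox, handle_message):
    """Block until every event is set, dispatching queued messages meanwhile."""
    done = object()  # sentinel the helper posts when all events are signaled

    def helper():
        # Waiting one at a time is safe only because these events stay set
        # and carry no ownership. A thread-owned handle such as a mutex
        # would end up owned by this helper thread, not by the caller.
        for event in events:
            event.wait()
        inbox.put(done)

    threading.Thread(target=helper, daemon=True).start()
    while True:
        item = inbox.get()  # one blocking wait: a message or the sentinel
        if item is done:
            return
        handle_message(item)
```

The comment inside `helper` marks the spot where the semantics break for a Mutex: whatever the helper acquires, the helper owns.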
Another technique would be to put the STA thread into a loop. Each iteration would ping the handles with a brief timeout to see if it could acquire them. Then it would check the message queue with a PeekMessage or similar technique, and then iterate. This is a terrible solution for battery-powered devices or for Terminal Server scenarios. What used to be efficient blocking is now busily spinning in a loop. And if no messages actually arrive, we have disturbed the fairness guarantees of the OS blocking primitives by pinging.
A final technique would be to acquire the handles one by one, using WaitOne. This is probably the worst approach of all. The semantics of an OS WAIT_ALL are that you will either get no handles or you will get all of them. This is critical to avoiding deadlocks, if different parts of the application block on the same set of handles – but fill the array of handles in random order.
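The all-or-nothing property is easy to state in code. Below is a minimal sketch in Python, assuming plain `threading.Lock` objects stand in for OS handles; `try_acquire_all` is a hypothetical helper, and unlike a real WAIT_ALL it never blocks. It only shows the invariant that matters: a caller leaves holding either every lock or none of them, so two callers listing the same locks in different orders cannot each end up holding half.

```python
import threading


def try_acquire_all(locks):
    """Take every lock or none of them; never return holding a partial set."""
    acquired = []
    for lock in locks:
        if lock.acquire(blocking=False):
            acquired.append(lock)
        else:
            # Back out completely so another caller can make progress.
            for held in reversed(acquired):
                held.release()
            return False
    return True
```

Acquiring the same locks one by one with blocking waits has no such back-out step, which is exactly how two threads with differently ordered arrays deadlock.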
I keep saying that managed blocking will perform “some pumping” when called on an STA thread. Wouldn’t it be great to know exactly what will get pumped? Unfortunately, pumping is a black art which is beyond mortal comprehension. On Win2000 and up, we simply delegate to OLE32’s CoWaitForMultipleHandles service. And before we wrote the initial cut of our pumping code for NT4 and Win9X, I thought I would glance through CoWaitForMultipleHandles to see how it is done. It is many, many pages of complex code. And it uses special flags and APIs that aren’t even available on Win9X.
The code we finally wrote for the downlevel platforms is relatively simple. We gather the list of hidden OLE windows associated with the current STA thread and try to restrict our pumping to the COM calls which travel through them. However, a lot of the pumping complexity is in USER32 services like PeekMessage. Did you know that calling PeekMessage for one window will actually cause SendMessages to be dispatched on other windows belonging to the same thread? This is another example of how someone made a tradeoff between reentrancy and deadlocks. In this case, the tradeoff was made in favor of reentrancy by someone inside USER32.
By now you may be thinking “Okay. Pump more and I get reentrancy. Pump less and I get deadlocks.” But of course the world is more complicated than that. For instance, the Finalizer thread may synchronously call into the main GUI STA thread, perhaps to release a pUnk there, as we have seen. The causality from the Finalizer thread to the main GUI STA thread is invisible to the CLR (though the CLR Security Lead recently suggested using OLE channel hooks as a technique for making this causality visible). If the main GUI STA thread now calls GC.WaitForPendingFinalizers in order to pump, there’s a possibility of a deadlock. That’s because the GUI STA thread must wait for the Finalizer thread to drain its queue. But the Finalizer thread cannot drain its queue until the GUI thread has serviced its incoming synchronous call from the Finalizer.
Reentrancy, Avalon, Longhorn and the Client
Ah, reentrancy again. From time to time, customers inside or outside the company discover that we are pumping messages during managed blocking on an STA. This is a legitimate concern, because they know that it’s very hard to write code that’s robust in the face of reentrancy. In fact, one internal team completely avoids managed blocking, including almost any use of FX, for this reason.
Avalon was very upset, too. I’m not sure how much detail they have disclosed about their threading model. And it’s certainly not my place to reveal what they are doing. Suffice it to say that their model is an explicit rental model that does not presume thread affinity. If you’ve read this far, I’m sure you approve of their decision.
Avalon must necessarily coexist with STAs, but Avalon doesn’t want to require them. The CLR and Avalon have a shared long term goal of driving STAs out of the platform. But, realistically, this will take decades. Avalon’s shorter term goal is to allow some useful GUI applications to be written without STAs. Even this is quite difficult. If you call the clipboard today, you will have an STA.
Avalon also has made a conscious design choice to favor deadlocks over reentrancy. In my opinion, this is an excellent goal. Deadlocks are easily debugged. Reentrancy is almost impossible to debug. Instead, it results in odd inconsistencies that manifest over time.
In order to achieve their design goals, Avalon requires the ability to control the CLR’s pumping. And since we’ve had similar requests from other teams inside and outside the company, this is a reasonable feature for us to provide.
V1 of the CLR had a conscious goal of making as much legacy VB and C++ code work as was possible. When we saw the number of applications that failed to pump, we had no choice but to insert pumping for them – even at the cost of reentrancy. Avalon is in a completely different position. All Avalon code is new code. They are in a great position to define an explicit model for pumping, and then require that all new applications rigorously conform to that model.
Indeed, as much as I dislike STAs, I have a bigger concern about Longhorn and its client focus. Historically, Microsoft has built a ton of great functionality and added it to the platform. But that functionality is often mixed up with various client assumptions. STAs are probably the biggest of those assumptions. The Shell is an example of this. It started out as a user-focused set of services, like the namespace. But it’s growing into something that’s far more generally useful. To the extent that the Shell wants to take its core concepts and make them part of the base managed Longhorn platform, it needs to shed the client focus. The same is true of Office.
For instance, I want to write some code that navigates to a particular document through some namespace and then processes it in some manner. And I want that exact same code to run correctly on the client and on the server. On the client, my processing of that document should not make the UI unresponsive. On the server, my processing of that document should not cause problems with scalability or throughput.
Historically, this just hasn’t been the case. We have an opportunity to correct this problem once, with the major rearchitecture that is Longhorn. But although Longhorn will have both client and server releases, I worry that we might still have a dangerous emphasis on the client.
This may be one of the biggest risks we face in Longhorn.
Winding Down
Finally, I feel a little bad about picking something I don’t like and writing about it. But there’s a reason that this topic came up. Last week, a customer in Japan was struggling with using mshtml.dll to crack some HTML files from inside ASP.NET. It’s the obvious thing to do. Clearly ‘mshtml’ stands for Microsoft HTML and clearly this is how we expect customers to process files in this format.
Unfortunately, MSHTML was written as client-side functionality. In fact, I’m told that it drives its own initialization by posting Windows messages back to itself and waiting for them to be pumped. So if you aren’t pumping an STA, you aren’t going to get very far.
There’s that disturbing historical trend at Microsoft to combine generally useful functionality with a client bias again!
We explained to the customer the risks of using client components on a server, and the pumping behavior that is inherent in managed blocking on an STA. After we had been through all the grisly details, the customer made the natural observation: None of this is written down anywhere.
Well, I still never talked about a mysterious new flag to CoWaitForMultipleHandles. Or how custom implementations of IMessageFilter can cause problems. Or the difference between Main and Single. Or the relationship between apartments and COM+ contexts and ServicedComponents. Or the amazing discovery that OLE32 sometimes requires you to pump the MTA if you have DCOM installed on Win9X.
But I’m sure that at this point I’ve said far more than most people care to hear about this subject.
|
-
By default, old blogs are truncated from this web site. If you want to read
old entries that have scrolled off, go to the CATEGORIES section at the right hand
side of the web page. Select CLR (rss) and you'll see the full list.
|
-
The PDC has happened, which means two things. I
can post some of my (slightly self-censored) reactions to the show, and I can talk
about what we've disclosed about Whidbey and Longhorn more freely. In
this particular case, I had promised to talk about the deep changes we're making
in Whidbey to allow you to host the CLR in your process. As
you'll see, I got side tracked and ended up discussing Application Compatibility
instead.
But first, my impressions of the PDC:
The first keynote, with Bill, Jim
& Longhorn, was guaranteed to be good. It had all the coolness of Avalon,
WinFS and Indigo, so of course it was impressive. In fact, throughout all the
sessions I attended, I was surprised by the apparent polish
and maturity of Longhorn. In my opinion, Avalon looked like it is the most mature
and settled. Indigo also looked surprisingly real. WinFS looked good in
the keynote, where it was all about the justification for the technology. But
in the drill-down sessions, I had the sense that it's not as far along as the others.
Hopefully all the attendees realize
that Longhorn is still a long way off. It's
hard to see from the demos, but a lot of fundamental design issues and huge missing
pieces remain.
Incidentally, I still can't believe
that we picked WinFX to describe the extended managed frameworks and WinFS to describe
the new storage system. One of those
names has got to go.
I was worried that the Whidbey keynote
on Tuesday would appear mundane and old-fashioned by comparison. But to an audience
of developers, Eric's keynote looked very good indeed. Visual Studio looked
better than I've ever seen it. The device app was so easy to write that I feel
I could build a FedEx-style package tracking application in a weekend.
The high point
of this keynote was ASP.NET. I hadn't been paying attention to what they've
done recently, so I was blown away by the personalization system and by the user-customizable
web pages. If I had seen a site like that, I would have assumed the author spent
weeks getting it to work properly. It's
hard to believe this can all be done with drag-and-drop.
In V1, ASP.NET hit a home run by focusing
like a laser beam on the developer experience. Everyone put so much effort into
building apps, questioning why each step was necessary, and refining the process.
It's great to see that they continue to follow that same discipline. In the
drill-down sessions, over and over again I saw that focus resulting in a near perfect
experience for developers. There are
some other teams, like Avalon, that seem to have a similar religion and are obtaining
similar results. (Though Avalon desperately
needs some tools support. Notepad is
fine for authoring XAML in demos, but I wouldn't want to build a real application
this way).
Compared to ASP.NET, some other teams
at Microsoft are still living in the Stone Age. Those
teams are still on a traditional cycle of building features, waiting for customers
to build applications with those features, and then incorporating any feedback. Beta
is way too late to find out that the programming model is clumsy. We
shouldn't be shirking our design responsibilities like this.
Anyway, the 3rd keynote (from Rick
Rashid & Microsoft Research) should have pulled it all together. I think
the clear message should have been something like:
Whidbey
is coming next and has great developer features. After that, Longhorn will arrive
and will change everything. Fortunately, Microsoft Research is looking 10+ years
out, so you can be sure we will increasingly drive the whole industry.
This should have been an easy story
to tell. The fact is that MSR is a world class research institution. Browse
the Projects, Topics or People categories at http://research.microsoft.com and
you'll see many name brand researchers like Butler Lampson and Jim Gray. You
will see tremendous breadth on the areas under research, from pure math and algorithms
to speech, graphics and natural language. There
are even some esoterica like nanotech and quantum computing. We
should have used the number of published papers and other measurements to compare
MSR with other research groups in the software industry, and with major research universities. And
then we should have shown some whiz-bang demos of about 2 minutes each.
Unfortunately, I think instead we
sent a message that "Interesting technology comes from Microsoft product groups,
while MSR is largely irrelevant." Yet
nothing could be further from the truth. Even
if I restrict consideration to the CLR, MSR has had a big impact. Generics
is one of the biggest features added to the CLR, C# or the base Frameworks in Whidbey. This
feature was added to the CLR by MSR team members, who now know at least as much about
our code base as we do. All the CLR's
plans for significantly improved code quality and portable compilers depend on a
joint venture between MSR and the compiler teams. To
my knowledge, MSR has used the CLR to experiment with fun things like transparent
distribution, reorganizing objects based on locality, techniques for avoiding security
stack crawls, interesting approaches to concurrency, and more. SPOT
(Smart Object Personal Technology) is a wonderful example of what MSR has done with
the CLR's basic IL and metadata design, eventually leading to a very cool product.
In my opinion, Microsoft Research
strikes a great balance between long term speculative experimentation and medium term
product-oriented improvements. I wish
this had come across better at the PDC.
Trends
In the 6+ years I've been at Microsoft,
we've had 4 PDCs. This is the first
one I've actually attended, because I usually have overdue work items or too many
bugs. (I've missed all 6 of our mandatory
company meetings for the same reason). So
I really don't have a basis for comparison.
I guess I had expected to be beaten
up about all the security issues of the last year, like Slammer and Blaster.
And I had expected developers to be interested in all aspects of security. Instead,
the only times the topic came up in my discussions is when I raised it.
However, some of my co-workers did
see a distinct change in the level of interest in security. For
example, Sebastian Lange and Ivan Medvedev gave a talk on managed security to an audience
of 700-800. They reported a real upswing
in awareness and knowledge on the part of all PDC attendees.
But consider a talk I attended on
Application Compatibility. At a time
when most talks were overflowing into the hallways, this talk filled fewer than 50
seats of a 500 to 1000 seat meeting room. I
know that AppCompat is critically important to IT. And
it's a source of friction for the entire industry, since everyone is reluctant to
upgrade for fear of breaking something. But
for most developers this is all so boring compared to the cool visual effects we can
achieve with a few lines of XAML.
Despite a trend to increased interest
in security on the part of developers, I suspect that security remains more of an
IT operations concern than it does a developer concern. And
although the events of the last year or two have got more developers excited about
security (including me!), I doubt that we will ever get developers excited about more
mundane topics like versioning, admin or compatibility. This
latter stuff is dead boring.
That doesn't mean that the industry
is doomed. Instead, it means that modern
applications must obtain strong versioning, compatibility and security guarantees
by default rather than through deep developer involvement. Fortunately,
this is entirely in keeping with our long term goals for managed code.
With the first release of the CLR,
the guarantees for managed applications were quite limited. We
guaranteed memory safety through an accurate garbage collector, type safety through
verification, binding safety through strong names, and security through CAS. (However,
I think we would all agree that our current support for CAS still involves far too
much developer effort and not enough automated guarantees. Our
security team has some great long-term ideas for addressing this.)
More importantly, we expressed programs
through metadata and IL, so that we could expand the benefits of reasoning about these
programs over time. And we provided metadata
extensibility in the form of Custom Attributes and Custom Signature Modifiers, so
that others could add to the capabilities of the managed environment without depending
on the CLR team's schedule.
FxCop (http://www.gotdotnet.com/team/fxcop/)
is an obvious example of how we can benefit from this ability to reason about programs. All
teams developing managed code at Microsoft are religious about incorporating this
tool into their build process. And since
FxCop supports adding custom rules, we have added a large number of Microsoft-specific
or product-specific checks.
Churn and Application Breakage
We also have some internal tools that
allow us to compare different versions of assemblies so we can discover inadvertent
breaking changes. Frankly, these tools
are still maturing. Even in the Everett
timeframe, they did a good job of catching blatant violations like the removal of a public
method from a class or addition of a method to an interface. But
they didn't catch changes in serialization format, or changes to representation after
marshaling through PInvoke or COM Interop. As
a result, we shipped some unintentional breaking changes in Everett,
and until recently we were on a path to do so again in Whidbey.
As far as I know, these tools still
don't track changes to CAS constructs, internal dependency graphs, thread-safety
expectations, exception flow (including a static replacement for the checked exceptions
feature), reliability contracts, or other aspects of execution. Some
of these checks will probably be added over time, perhaps by adding additional metadata
to assemblies to reveal the developer's intentions and to make automated validation
more tractable. Other checks seem like
research projects or are more appropriate for dynamic tools rather than static tools. It's
very encouraging to see teams inside and outside of Microsoft working on this.
I expect that all developers will
eventually have access to these or similar tools from Microsoft or 3rd parties,
which can be incorporated into our build processes the way FxCop has been.
Sometimes applications break when
their dependencies are upgraded to new versions. The
classic example of this is Win95 applications which broke when the operating system
was upgraded to WinXP. Sometimes this
is because the new versions have made breaking changes to APIs. But
sometimes it's because things are just "different". The
classic case here is where a test case runs perfectly on a developer's machine, but
fails intermittently in the test lab or out in the field. The
difference in environment might be obvious, like a single processor box vs. an 8-way. Yet
all too often it's something truly subtle, like a DLL relocating when it misses its
preferred address, or the order of DllMain notifications on a DLL_THREAD_ATTACH. In
those cases, the change in environment is not the culprit. Instead,
the environmental change has finally revealed an underlying bug or fragility in the
application that may have been lying dormant for years.
The managed environment eliminates
a number of common fragilities, like the double-free of memory blocks or the use of
a file handle or Event that has already been closed. But
it certainly doesn't guarantee that a multi-threaded program which appears to run
correctly on a single processor will also execute without race conditions on a 32-way
NUMA box. The author of the program must
use techniques like code reviews, proof tools and stress testing to ensure that his
code is thread-safe.
The situation that worries me the most is when an application
relies on accidents of current FX and CLR implementations. These
dependencies can be exceedingly subtle.
Here are some examples of breakage that we have encountered,
listed in the random order they occur to me:
-
Between V1.1 and Whidbey, the implementation of reflection
has undergone a major overhaul to improve access times and memory footprint. One
consequence is that the order of members returned from APIs like Type.GetMethods has
changed. The old order was never documented
or guaranteed, but we've found programs, including our own tests, which assumed
stability here.
-
Structs and classes can specify Sequential, Explicit
or AutoLayout. In the case of AutoLayout,
the CLR is free to place members in any order it chooses. Except
for alignment packing and the way we chunk our GC references, our layout here is currently
quite predictable. But in the future
we hope to use access patterns to guide our layout for increased locality. Any
applications that predict the layout of AutoLayout structs and classes via unsafe
coding techniques are at risk if we pursue that optimization.
-
Today, finalization occurs on a single Finalizer thread. For
scalability and robustness reasons, this is likely to change at some point. Also,
the GC already perturbs the order of finalization. For
instance, a collection can cause a generation boundary to intervene between two instances
that are normally allocated consecutively. Within
a given process run, there will likely be some variation in finalization sequence. But
for two objects that are allocated consecutively by a single thread, there is a high
likelihood of predictable ordering. And
we all know how easy it is to make assumptions about this sort of thing in our code.
-
In an earlier blog (http://blogs.gotdotnet.com/cbrumme/PermaLink.aspx/e55664b4-6471-48b9-b360-f0fa27ab6cc0),
I talked about some of the circumstances that impact when the JIT will stop reporting
a reference to the GC. These include
inlining decisions, register allocation, and obvious differences like X86 vs. AMD64
vs. IA64. Clearly we want the freedom
to chase better code quality with JIT compilers and NGEN compilers in ways that will
substantially change these factors. Just
yesterday an internal team reported a "GC bug on multi-processor machines only"
that we quickly traced to confusion over lifetime rules and bad practice in the application. One
finalizable object was accessing some state in another finalizable object, in the
expectation that the first object was live because it was the "this" argument
of an active method call.
-
During V1.1 Beta testing, a customer complained about
an application we had broken. This application
contained unmanaged code that reached back into its caller's stack to retrieve a
GCHandle value at an offset that had been empirically discovered. The
unmanaged code then transitioned into managed and redeemed the supposed handle value
for the object it referenced. This usually
worked, though it was clearly dependent on filthy implementation details. Unfortunately,
the System.EnterpriseServices pathways leading to the unmanaged application were somewhat
variable. Under certain circumstances,
the stack was not what the unmanaged code predicted. In
V1, the value at the predicted spot was always a 0 and the redemption attempt failed
cleanly. In V1.1, the value at that stack
location was an unrelated garbage value. The
consequence was a crash inside mscorwks.dll and Fail Fast termination of the process.
-
In V1 and V1.1, Object.GetHashCode() can be used to obtain
a hashcode for any object. However, our
implementation happened to return values which tended to be small ascending integers. Furthermore,
these values happened to be unique across all reachable instances that were hashed
in this manner. In other words, these
values were really object identifiers or OIDs. Unfortunately,
this implementation was a scalability killer for server applications running on multi-processor
boxes. So in Whidbey Object.GetHashCode()
is now all we ever promised it would be: an integer with reasonable distribution but
no uniqueness guarantees. It's a great
value for use in HashTables, but it's sure to disappoint some existing managed applications
that relied on uniqueness.
-
In V1 and V1.1, all string literals are Interned as described
in http://blogs.gotdotnet.com/cbrumme/PermaLink.aspx/7943b9be-cca9-41e1-8a83-3d7a0dbba270. I
noted there that it is a mistake to depend on Interning across assemblies. That's
because the other assembly might start to compose a String value which it originally
specified as a literal. In Whidbey, assemblies
can opt-in or opt-out of our Interning behavior. This
new freedom is motivated by a desire to support faster loading of assemblies (particularly
assemblies that have been NGEN'ed). We've
seen some tests fail as a result.
-
I've seen some external developers use a very fragile
technique based on their examination of Rotor sources. They
navigate through one of System.Threading.Thread's private fields (DONT_USE_InternalThread)
to an internal unmanaged CLR data structure that represents a running managed thread. From
there, they can pluck interesting information like the Thread::ThreadState bit field. None
of these data structures are part of our contract with managed applications and all
of them are sure to change in future releases. The
only reason the ThreadState field is at a stable offset in our internal Thread struct
today is that its frequency of access merits putting it near the top of the struct
for good cache-line filling behavior.
-
Reflection allows highly privileged code to access private
members of arbitrary types. I am aware
of dozens of teams inside and outside of Microsoft which rely on this mechanism for
shipping products. Some of these uses
are entirely justified, like the way Serialization accesses private state that the
type author marked as [Serializable()]. Many
other uses are rather questionable, and a few are truly heinous. Taken
to the extreme, this technique converts every internal implementation detail into
a publicly exposed API, with the obvious consequences for evolution and application
compatibility.
-
Assembly loading and type resolution can happen on very
different schedules, depending on how your application is running. We’ve
seen applications that misbehave based on NGEN vs. JIT, domain-neutral vs. per-domain
loading, and the degree to which the JIT inlines methods. For
example, one application created an AppDomain and started running code in it. That
code subsequently modified the private application directory and then attempted to
load an assembly from that directory. Of
course, because of inlining the JIT had already attempted to load the assembly with
the original application directory and had failed. The
correct solution here is to disallow any changes to an AppDomain’s application directory
after code starts executing inside that AppDomain. This
directory should only be modifiable during the initialization of the AppDomain.
-
In prior blogs, I’ve talked about unhandled exceptions
and the CLR’s default policy for dealing with them. That
policy is quite involved and hard to defend. One
aspect of it is that exceptions that escape the Finalizer thread or any ThreadPool
threads are swallowed. This keeps the
process running, but it often leaves the application in an inconsistent state. For
example, locks may not have been released by the thread that took the exception, leading
to subsequent hangs. Now that the technology
for reporting process crashes via Watson dumps is maturing, we really want to change
our default policy for unhandled exceptions so that we Fail Fast with a process crash
and a Watson upload. However, any change
to this policy will undoubtedly cause many existing applications to stop working.
-
Despite the flexibility of CAS, most applications still
run with Full Trust. I truly believe
that this will change over time. For
example, in Whidbey we will have ClickOnce permission elevation and in Longhorn we
will deliver the Secure Execution Environment or SEE. Both
of these features were discussed at the PDC. When
we have substantial code executing in partial trust, we’re going to see some unfortunate
surprises. For example, consider message
pumping. If a Single Threaded Apartment
thread has some partial trust code on its stack when it blocks (e.g. Monitor.Enter
on a contentious monitor), then we will pump messages on that thread while it is blocked. If
the dispatching of a message requires a stack walk to satisfy a security Full Demand,
then the partially trusted code further back on the stack may trigger a security exception. Another
example is related to class constructors. As
you probably know, .cctor methods execute on the first thread that needs access to
a class in a particular AppDomain. If
the .cctor must satisfy a security demand, the success of the .cctor now depends on
the accident of what other code is active on the thread’s stack. Along
the same lines, the .cctor method may fail if there is insufficient stack space left
on the thread that happens to execute it. These
are all well understood problems and we have plans for fixing them. But
the fixes will necessarily change observable behavior for a class of applications.
I could fill a lot more pages with this sort of list. And
our platform is still in its infancy. Anyway,
one clear message from all this is that things will change and then applications will
break.
But can we categorize these failures and make some sense
of it all? For each failure, we need
to decide whether the platform or the application is at fault. And
then we need to identify some rules or mechanisms that can avoid these failures or
mitigate them. I see four categories.
Category 1: The application explicitly screws itself
The easiest category to dispense with is the one where
a developer intentionally and explicitly takes advantage of a behavior that s/he knows
is guaranteed to change. A perfect example
of this is #8 above. Anyone who navigates
through private members to unmanaged internal data structures is setting himself up
for problems in future versions. The
responsibility (or irresponsibility in this case) lies with the application. In
my opinion, the platform should have no obligations.
But consider #5 above. It’s
clearly in this same category, and yet opinions on our larger team were quite divided
on whether we needed to fix the problem. I
spoke to a number of people who definitely understood the incredible difficulty of
keeping this application running on new versions of the CLR and EnterpriseServices. But
they consistently argued that the operating system has traditionally held itself to
this sort of compatibility bar, that this is one of the reasons for Windows’ ubiquity,
and that the managed platform must similarly step up.
Also, we have to be realistic here. If
a customer issue like this involves one of our largest accounts, or has been escalated
through a very senior executive (a surprising number seem to reach Steve Ballmer),
then we’re going to pull out all the stops on a fix or a temporary workaround.
In many cases, our side-by-side support is an adequate
and simple solution. Customers can continue
to run problematic applications on their old bits, even though a new version of these
bits has also been installed. For instance,
the config file for an application can specify an old version of the CLR. Or
binding redirects could roll back a specific assembly. But
this technique falls apart if the application is actually an add-in that is dynamically
loaded into a process like Internet Explorer or SQL Server. It’s
unrealistic to lock back the entire managed stack inside Internet Explorer (possibly
preventing newer applications that use generics or other Whidbey features from running
there), just so older questionable applications can keep running.
It’s possible that we could provide lock back at finer-grained
scopes than the process scope in future versions of the CLR. Indeed,
this is one of the areas being explored by our versioning team.
Anyway, if we were under sufficient pressure I could
imagine us building a one-time QFE (patch) for an important customer in this category,
to help them transition to a newer version and more maintainable programming techniques. But
if you aren’t a Fortune 100 company or Steve Ballmer’s brother-in-law, I personally
hope we would be allowed to ignore any of your applications that are in this category.
Category 2: The platform explicitly screws the application
I would put #6, #7 and #11 above in a separate category. Here,
the platform team wants to make an intentional breaking change for some valid reason
like performance or reliability. In fact,
#10 above is a very special case of this category. In
#10, we would like to break compatibility in Whidbey so that we can provide a stronger
model that can avoid subsequent compatibility breakage. It’s
a paradoxical notion that we should break compatibility now so we can increase future
compatibility, but the approach really is sensible.
Anyway, if the platform makes a conscious decision to
break compatibility to achieve some greater goal, then the platform is responsible
for mitigation. At a minimum, we should
provide a way for broken applications to obtain the old behavior, at least for some
transition period. We have a few choices
in how to do this, and we’re likely to pick one based on engineering feasibility,
the impact of a breakage, the likelihood of a breakage, and schedule pressure:
-
Rely on side-by-side and explicit administrator intervention. In
other words, the admin notices the application no longer works after a platform upgrade,
so s/he authors a config file to lock the application back to the old platform bits. This
approach is problematic because it requires a human being to diagnose a problem and
intervene. Also, it has the problems
I already mentioned with using side-by-side on processes like Internet Explorer or
SQL Server.
-
For some changes, it shouldn’t be necessary to lock
back the entire platform stack. Indeed,
for many changes the platform could simultaneously support the old and new behaviors. If
we change our default policy for dealing with unhandled exceptions, we should definitely
retain the old policy… at least for one release cycle.
-
If we expect a significant percentage of applications
to break when we make a change, we should consider an opt-in policy for that change. This
eliminates the breakage and the human involvement in a fix. In
the case of String Interning, we require each assembly to opt-in to the new non-interned
behavior.
-
In some cases, we’ve toyed with the idea of having the
opt-in be implicit with a recompile. The
logic here is that when an application is recompiled against new platform bits, it
is presumably also tested against those new bits. The
developer, rather than the admin, will deal with any compatibility issues that arise. We’re
well set up for this, since managed assemblies contain metadata giving us the version
numbers of the CLR and the dependent assemblies they were compiled against. Unfortunately,
execution models like ASP.NET work against us here. As
you know, ASP.NET pages are recompiled automatically by the system based on dependency
changes. There is no developer available
when this happens.
Windows Shimming
Before we look at the next two categories of AppCompat
failure, it’s worth taking a very quick look at one of the techniques that the operating
system has traditionally used to deal with these issues. Windows
has an AppCompat team which has built something called a shimming engine.
Consider what happened when the company tried to move
consumers from Win95/Win98/WinMe over to WinXP. They
discovered a large number of programs which used the GetVersion or the preferred GetVersionEx
APIs in such a way that the programs refused to run on NT-based systems.
In fact, WinXP did such a good job of achieving compatibility
with Win9X systems that in many cases the only reason
the application wouldn’t run was the version check that the program made at start
up. The fix was to change GetVersion
or GetVersionEx to lie about the version number of the current operating system. Of
course, this lie should only be told to programs that need the lie in order to work
properly.
I’ve heard that this shim which lies about the operating
system version is the most commonly applied shim we have. As
I understand it, at process launch the shimming engine tries to match the current
process against any entries in its database. This
match could be based on the name, timestamp or size of the EXE, or of other files
found relative to that EXE like a BMP for the splash screen in a subdirectory. The
entry in the database lists any shims that should be applied to the process, like
the one that lies about the version. The
shimming engine typically bashes the IAT (import address table) of a DLL or EXE in
the process, so that its imports are bound to the shim rather than to the normal export
(e.g. Kernel32!GetVersionEx). In addition,
the shimming engine has other tricks it performs less frequently, like wrapping COM
objects up with intercepting proxies.
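The matching and rebinding steps described above can be sketched in portable C. This is only a toy model and not how the real shimming engine is built: the import table is a plain struct of function pointers, and every name here (ImportTable, ShimEntry, apply_shims, the sample EXE name and size, and the version encoding of major * 100 + minor) is a hypothetical chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an import address table: one slot per imported function. */
typedef struct {
    unsigned (*get_version)(void);
} ImportTable;

/* Hypothetical database entry: match criteria plus the shim to apply. */
typedef struct {
    const char *exe_name;
    long exe_size;
    unsigned (*version_shim)(void);
} ShimEntry;

/* The normal export reports the real (newer) OS version. */
static unsigned real_get_version(void) { return 501; }

/* The shim lies and reports an older version. */
static unsigned lying_get_version(void) { return 410; }

static const ShimEntry shim_database[] = {
    { "oldgame.exe", 123456L, lying_get_version },
};

/* Bind the import normally, then rebind it to the shim if a database
   entry matches this process. Returns the number of shims applied. */
static int apply_shims(ImportTable *iat, const char *exe_name, long exe_size)
{
    size_t i;
    int applied = 0;

    assert(iat != NULL && exe_name != NULL);
    iat->get_version = real_get_version;

    for (i = 0; i < sizeof shim_database / sizeof shim_database[0]; i++) {
        const ShimEntry *e = &shim_database[i];
        if (strcmp(e->exe_name, exe_name) == 0 && e->exe_size == exe_size) {
            iat->get_version = e->version_shim;
            applied++;
        }
    }
    return applied;
}
```

The point of the sketch is that the application never changes: it keeps calling through its import slot, and only programs that match a database entry are told the lie.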
It’s easy to see how this infrastructure can allow applications
for Win95 to execute on WinXP. However,
this approach has some drawbacks. First,
it’s rather labor-intensive. Someone
has to debug the application, determine which shims will fix it, and then craft some
suitable matching criteria that will identify this application in the shimming database. If
an appropriate shim doesn’t already exist, it must be built.
In the best case, the application has some commercial
significance and Microsoft has done all the testing and shimming. But
if the application is a line of business application that was created in a particular
company’s IT department, Microsoft will never get its hands on it. I’ve
heard we’re now allowing sophisticated IT departments to set up their own shimming
databases for their own applications, but this only allows them to apply existing
shims to their applications.
And from my skewed point of view the worst part of
all this is that it really won’t work for managed applications. For
managed apps, binding is achieved through strong names, Fusion and the CLR loader. Binding
is practically never achieved through DLL imports.
So it’s instructive to look at some of the techniques
the operating system has traditionally used. But
those techniques don’t necessarily apply directly to our new problems.
Anyway, back to our categories…
Category 3: The application accidentally screws itself
Category 4: The platform accidentally screws the application
Frankly, I’m having trouble distinguishing these two
cases. They are clearly distinct categories,
but it’s a judgment call where to draw the line. The
common theme here is that the platform has accidentally exposed some consistent behavior
which is not actually a guaranteed contract. The
application implicitly acquires a dependency on this consistent behavior, and is broken
when the consistency is later lost.
In the nirvana of some future fully managed execution
environment, the platform and tools would never expose consistent behavior unless
it was part of a guarantee. Let’s look
at some examples and see how practical this is.
In example #1 above, reflection used to deliver members
in a stable order. In Whidbey, that order
changes. In hindsight, there’s a simple
solution here. V1 of the product could
have contained a testing mode that randomized the returned order. This
would have exposed the developer to our actual guarantees, rather than to a stronger
accidental consistency. Within the CLR,
we’ve used this sort of technique to force us down code paths that otherwise wouldn’t
be exercised. For example, developers
on the CLR team all use NT-based (Unicode) systems and avoid Win9X (Ansi) systems. So
our Win9X Ansi/Unicode wrappers wouldn’t typically get tested by developers. To
address this, our checked/debug CLR build originally considered the day of the week
and used Ansi code paths every other day. But
imagine chasing a bug at 11:55 PM. When the bug magically disappears on
your next run at 1:03 AM the next morning, you are far too frazzled to think clearly about the reason. Today,
we tend to use low order bits in the size of an image like mscorwks.dll or the assembly
being tested, so our randomization is now more friendly to testing.
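A minimal sketch of that idea in C, under stated assumptions: the decision to perturb is keyed to the low-order bit of an image size, so it stays fixed for a given build and a failure reproduces on the next run. The function names and the choice of reversing the member order are hypothetical illustrations, not the CLR's actual test hooks.

```c
#include <assert.h>
#include <stddef.h>

/* Decide whether to perturb, using the low-order bit of an image size.
   The answer is stable for a given binary, unlike a decision keyed to
   the day of the week. */
static int perturbation_enabled(unsigned long image_size)
{
    return (int)(image_size & 1UL);
}

/* Hypothetical testing mode: when perturbation is on, hand members back
   in reverse order so callers cannot come to depend on the usual order. */
static void deliver_members(int *members, size_t count, unsigned long image_size)
{
    size_t i;

    assert(members != NULL || count == 0);
    if (!perturbation_enabled(image_size))
        return;

    for (i = 0; i < count / 2; i++) {
        int tmp = members[i];
        members[i] = members[count - 1 - i];
        members[count - 1 - i] = tmp;
    }
}
```

Because roughly half of all builds land on each side of the bit, both orders get exercised over time without any single test run flipping behavior underneath you.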
In example #2 above, you could imagine a similar perturbation
on our AutoLayout algorithms when executing a debug version of an application, or
when launched from inside a tool like Visual Studio.
For example #4, the CLR already has internal stress modes
that force different and aggressive GC schedules. These
can guarantee compaction to increase the likelihood of detecting stale references. They
can perform extensive checks of the integrity of the heap, to ensure that the write
barrier and other mechanisms are effective. And
they can ensure that every instruction of JITted managed code that can synchronize
with the GC will synchronize with the GC. I
suspect that these modes would do a partial job of eradicating assumptions about lifetimes
reported by the JIT. However, we will
remain exposed to significantly different code generators (like Rotor’s FJIT) or
execution on significantly different architectures (like CPUs with dramatically more
registers).
In contrast with the above difficulty, it’s easy to
imagine adding a new GC stress mode that perturbs the finalization queues, to uncover
any hidden assumptions about finalization order. This
would address example #3.
Customer Debug Probes, AppVerifier and other tools
It turns out that the CLR already has a partial mechanism
for enabling perturbation during testing and removing it on deployed applications. This
mechanism is the Customer Debug Probes feature that we shipped in V1.1. Adam
Nathan’s excellent blog site has a series of articles on CDPs, which are collected
together at http://blogs.gotdotnet.com/anathan/CategoryView.aspx/Debugging. The
original goal of CDPs was to counteract the black box nature of debugging certain
failures of managed applications, like corruptions of the GC heap or crashes due to
incorrect marshaling directives. These
probes can automatically diagnose common application errors, like failing to keep
a marshaled delegate rooted so it won’t be collected. This
approach is so much easier than wading through dynamically generated code without
symbols, because we tell you exactly where your bugs are. But
we’re now realizing that we can also use CDPs to increase the future compatibility
of managed applications if we can perturb current behavior that is likely to change
in the future.
Unfortunately, example #6 from above reveals a major
drawback with the technique of perturbation. When
we built the original implementation of Object.GetHashCode, we simply never considered
the difference between what we wanted to guarantee (hashing) and what we actually
delivered (OIDs). In hindsight, it is
obvious. But I’m not convinced that
we aren’t falling into similar traps in our new features. We
might be a little smarter than we were five years ago, but only a little.
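The gap between hashing and OIDs is easy to show in a small C sketch. Everything here is a hypothetical illustration (the Item type, the deliberately weak additive hash, and both lookup functions); it is not how Object.GetHashCode is implemented. The fragile lookup treats a hash code as an identifier, while the robust one uses the hash only to narrow the search and lets an equality test decide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
} Item;

/* Deliberately weak hash: fine for bucketing, no uniqueness guarantee. */
static unsigned item_hash(const Item *it)
{
    unsigned h = 0;
    const char *p;

    for (p = it->name; *p != '\0'; p++)
        h += (unsigned char)*p;
    return h;
}

/* Fragile: assumes a hash code identifies exactly one object. */
static const Item *find_by_hash_only(const Item *items, size_t n, unsigned hash)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (item_hash(&items[i]) == hash)
            return &items[i];
    return NULL;
}

/* Robust: the hash narrows the candidates, equality makes the decision. */
static const Item *find_item(const Item *items, size_t n, const char *name)
{
    Item probe;
    unsigned h;
    size_t i;

    assert(name != NULL);
    probe.name = name;
    h = item_hash(&probe);

    for (i = 0; i < n; i++)
        if (item_hash(&items[i]) == h && strcmp(items[i].name, name) == 0)
            return &items[i];
    return NULL;
}
```

With items named "ab" and "ba", both hash to the same value, so the hash-only lookup for "ba" hands back "ab". Code written against the V1 behavior would never have seen that collision.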
Example #10 worries me for similar reasons. I
just don’t think we were smart enough to predict that changing the binding configuration
of an AppDomain after starting to execute code in that AppDomain would be so fragile. When
a developer delivers a feature, s/he needs to consider security, thread-safety, programming
model, key invariants of the code base like GC reporting, correctness, and so many
other aspects. It would be amazing if
a developer consistently nailed each of these aspects for every new feature. We’re
kidding ourselves if we think that evolution and unintentional implicit contracts
will get adequate developer attention on every new feature.
Even if we had perfect foresight and sufficient resources
to add perturbation for all operations, we would still have a major problem. We
can’t necessarily rely on 3rd party developers to test their applications
with perturbation enabled. Consider the
unmanaged AppVerifier experience.
The operating system has traditionally offered a dynamic
testing tool called AppVerifier which can diagnose many common unmanaged application
bugs. For example, thanks to uploads
of Watson process dumps from the field, most unmanaged application crashes can now
be attributed to incorrect usage of dynamically allocated memory. Yet
AppVerifier can use techniques like placing each allocation in its own page or leaving
pages unmapped after release, to deterministically catch overruns, double frees, and
reads or writes of freed memory.
In other words, there is hard evidence that if every
unmanaged application had just used the memory checking support of AppVerifier, then
two out of every three application crashes would be eliminated. Clearly
this didn’t happen.
Of course, AppVerifier can diagnose far more than just
memory problems. And it’s very easy
and convenient to use.
Since testing with AppVerifier is part of the Windows
Logo compliance program, you would expect that it’s used fairly rigorously by ISVs. And,
given its utility, you would expect that most IT organizations would use this tool
for their internal applications. Unfortunately,
this isn’t the case. Many applications
submitted for the Windows Logo actually fail to launch under AppVerifier. In
other words, they violate at least one of the rules before they finish initializing.
The Windows AppCompat team recognizes that proactive
tools like AppVerifier are so much better than reactive mitigation like shimming broken
applications out in the field. That’s
why they made the AppVerifier tool a major focus of their poorly attended Application
Compatibility talk that I sat in on at the PDC. (Aha! I
really was going somewhere with all this.)
There’s got to be a reason why developers don’t use
such a valuable tool. In my opinion,
the reason is that AppVerifier is not integrated into Visual Studio. If
the Debug Properties in VS allowed you to enable AppVerifier and CDP checks, we would
have much better uptake. And if an integrated
project system and test system could monitor code coverage numbers, and suggest particular
test runs with particular probes enabled, we would be approaching nirvana.
Winding Down
Looking at development within Microsoft, one trend is
very clear: Automated tools and processes
are a wonderful supplement for human developers. Whether
we’re talking about security, reliability, performance, application compatibility
or any other measure of software quality, we’re now seeing that static and dynamic
analysis tools can give us guarantees that we will never obtain from human beings. Bill
Gates touched on this during his PDC keynote, when he described our new tools for
statically verifying device driver correctness, for some definition of correctness.
This trend was very clear to me during the weeks I spent
on the DCOM / RPCSS security fire drill. I
spent days looking at some clever marshaling code, eventually satisfying myself that
it worked perfectly. Then someone else
wrote an automated attacker and discovered real flaws in just a few hours. Other
architects and senior developers scrutinized different sections of the code. Then
some researchers from MSR who are focused on automatic program validation ran their
latest tools over the same code and gave us step-by-step execution models that led
up to crashes. Towards the end of the
fire drill, a virtuous cycle was established. The
code reviewers noticed new categories of vulnerabilities. Then
the researchers tried to evolve their tools to detect those vulnerabilities. Aspects
of this process were very raw, so the tools sometimes produced a great deal of noise
in the form of false positives. But it’s
clear that we were getting real value from Day One and the future potential here
is enormous.
One question that always comes up, when we talk about
adding significant value to Visual Studio through additional tools, is whether Microsoft
should give away these tools. It’s a
contentious issue, and I find myself going backwards and forwards on it. One
school of thought says that we should give away tools to promote the platform and
improve all the programs in the Windows ecology. In
the case of tools that make our customers’ applications more secure or more resilient
to future changes in the platform, this is a compelling argument. Another
school of thought says that Visual Studio is a profit center like any other part of
the company, and it needs the freedom to charge what the market will bear.
Given that my job is building a platform, you might expect
me to favor giving away Visual Studio. But
I actually think the profit motive is a powerful mechanism for making our tools competitive. If
Visual Studio doesn’t have P&L responsibility, their offering will deteriorate
over time. The best way to know whether
they’ve done all they can to make the best tools possible, is to measure how much
their customers are willing to pay. I
want Borland to compete with Microsoft on building the best tools at the best price,
and I want to be able to measure the results of that competition through revenue and
market penetration.
In all this, I have avoided really talking about the
issues of versioning. Of course, versioning
and application compatibility are enormously intertwined. Applications
break for many reasons, but the typical reason is that one component is now binding
to a new version of another component. We
have a whole team of architects, gathered from around the company, who have been meeting
regularly for about a year to grapple with the problems of a complete managed versioning
story. Unlike managed AppCompat, the
intellectual investment in managed versioning has been enormous.
Anyway, Application Compatibility remains a relatively
contentious subject over here. There’s
no question that it’s a hugely important topic which will have a big impact on
the longevity of our platform. But we
are still trying to develop techniques for achieving compatibility that will be more
successful than what Windows has done in the past, without limiting our ability to
innovate on what is still a very young execution engine and set of frameworks. I
have deliberately avoided talking about what some of those techniques might be, in
part because our story remains incomplete.
Also, we won’t realize how badly AppCompat will bite
us until we can see a lot of deployed applications that are breaking as we upgrade
the platform. At that point, it’s easier
to justify throwing more resources at the problem. But
by then the genie is out of the bottle… the deployed applications will already
depend on brittle accidents of implementation, so recovery will be painfully breaking. In
a world where we are always under intense resource and schedule pressure, the needs
of AppCompat must be balanced against performance, security, developer productivity,
reliability, innovation and all the other “must haves”.
You know, I really do want to talk about Hosting. It
is a truly fascinating subject. I’m
much more comfortable talking about non-preemptive fiber scheduling than I am talking
about uninteresting topics like implicit contracts and compatibility trends.
But Hosting is going to have to wait at least a few more
weeks.
|
-
I had
hoped this article would be on changes to the next version of the CLR which
allow it to be hosted inside SQL Server and other “challenging”
environments. This is more
generally interesting than you might think, because it creates an opportunity
for other processes (i.e. your
processes) to host the CLR with a similar level of integration and control. This includes control over memory usage,
synchronization, threading (including fibers), extended security models,
assembly storage, and more.
However,
that topic is necessarily related to our next release, and I cannot talk about
deep details of that next release until those details have been publicly
disclosed. In late October,
Microsoft is holding its PDC and I expect us to disclose many details at that
time. In fact, I’m signed up to be
a member of a PDC panel on this topic.
If you work on a database or an application server or a similarly
complicated product that might benefit from hosting the CLR, you may want to
attend.
After
we’ve disclosed the hosting changes for our next release, you can expect a blog
on hosting in late October or some time in November.
Instead,
this blog is on the managed exception model. This is an unusual topic for me. In the past, I’ve picked topics where I
can dump information without having to check any of my facts or do any
research. But in the case of
exceptions I keep finding questions I cannot answer. At the top level, the managed exception
model is nice and simple. But – as
with everything else in software – the closer you look, the more you
discover.
So for
the first time I decided to have some CLR experts read my blog entry before I
post it. In addition to pointing
out a bunch of my errors, all the reviewers were unanimous on one point: I
should write shorter blogs.
Of
course, we can’t talk about managed exceptions without first considering Windows
Structured Exception Handling (SEH).
And we also need to look at the C++ exception model. That’s because both managed exceptions
and C++ exceptions are implemented on top of the underlying SEH mechanism, and
because managed exceptions must interoperate with both SEH and C++
exceptions.
Windows
SEH
Since
it’s at the base of all exception handling on Windows, let’s look at SEH
first. As far as I know, the
definitive explanation of SEH is still Matt Pietrek’s excellent 1997 article for
Microsoft Systems Journal: http://www.microsoft.com/msj/0197/exception/exception.aspx. There have
been some extensions since then, like vectored exception handlers, some security
enhancements, and the new mechanisms to support IA64 and AMD64. (It’s hard to base exceptions on FS:[0]
chains if your processor doesn’t have an FS segment register). We’ll look at all these changes
shortly. But Matt’s 1997 article
remains a goldmine of information.
In fact, it was very useful to the developers who implemented exceptions
in the CLR.
The SEH
model is exposed by MSVC via two constructs:
- __try {…} __except(filter_expression) {…}
- __try {…} __finally {…}
Matt’s
article explains how the underlying mechanism of two passes over a chain of
single callbacks is used to provide try/except/finally semantics. Briefly, the OS dispatches an exception
by retrieving the head of the SEH chain from TLS. Since the head of this chain is at the
top of the TIB/TEB (Thread Information Block / Thread Environment Block,
depending on the OS and the header file you look at), and since the FS segment
register provides fast access to this TLS block on X86, the SEH chain is often
called the FS:[0] chain.
Each
entry consists of a next or a prev pointer (depending on how you look at it) and
a callback function. You can add
whatever data you like after that standard entry header. The callback function is called with all
sorts of additional information related to the exception that’s being
processed. This includes the
exception record and the register state of the machine which was captured at the
time of the exception.
To
implement the 1st form of MSVC SEH above (__try/__except), the
callback evaluates the filter expression during the first pass over the handler
chain. As exposed by MSVC, the
filter expression can result in one of three legal values:
EXCEPTION_CONTINUE_EXECUTION = -1
EXCEPTION_CONTINUE_SEARCH = 0 (false)
EXCEPTION_EXECUTE_HANDLER = 1 (true)
Of
course, the filter could also throw its own exception. That’s not generally desirable, and I’ll
discuss that possibility and other flow control issues later.
But if
you look at the underlying SEH mechanism, the handler actually returns an
EXCEPTION_DISPOSITION:
typedef enum _EXCEPTION_DISPOSITION
{
    ExceptionContinueExecution,
    ExceptionContinueSearch,
    ExceptionNestedException,
    ExceptionCollidedUnwind
} EXCEPTION_DISPOSITION;
So
there’s some mapping that MSVC is performing here. Part of that mapping is just a trivial
conversion between the MSVC filter values and the SEH handler values. For instance ExceptionContinueSearch has
the value 1 at the SEH handler level but the equivalent
EXCEPTION_CONTINUE_SEARCH has the value 0 at the MSVC filter level. Ouch.
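To make the mismatch concrete, here is a small C sketch of the kind of translation a compiler-supplied handler would have to perform. It is an assumption-laden model, not MSVC's runtime code: the constants and enum are redeclared under hypothetical names (FILTER_*, Disp*) so the snippet compiles without Windows headers, and map_filter_result is an invented helper. Since "execute the handler" has no SEH disposition of its own, the sketch reports it through a separate flag.

```c
#include <assert.h>
#include <stddef.h>

/* MSVC filter values, redeclared under hypothetical names. */
enum {
    FILTER_CONTINUE_EXECUTION = -1,
    FILTER_CONTINUE_SEARCH    = 0,
    FILTER_EXECUTE_HANDLER    = 1
};

/* SEH handler return values, in the same order as EXCEPTION_DISPOSITION. */
typedef enum {
    DispContinueExecution,   /* 0 */
    DispContinueSearch,      /* 1 */
    DispNestedException,     /* 2 */
    DispCollidedUnwind       /* 3 */
} Disposition;

/* Translate a filter result into a handler disposition.
   *run_except_clause is set when the filter asked for its __except block;
   in that case the returned disposition is only a placeholder, because a
   real handler would transfer control instead of returning. */
static Disposition map_filter_result(int filter_result, int *run_except_clause)
{
    assert(run_except_clause != NULL);
    *run_except_clause = 0;

    switch (filter_result) {
    case FILTER_CONTINUE_EXECUTION:
        return DispContinueExecution;
    case FILTER_EXECUTE_HANDLER:
        *run_except_clause = 1;
        return DispContinueSearch;   /* placeholder, see comment above */
    case FILTER_CONTINUE_SEARCH:
    default:                         /* assumption: treat unknowns as "not mine" */
        return DispContinueSearch;
    }
}
```

Note that "continue search" is 0 on the filter side and 1 on the handler side, which is exactly the trap described above.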
But the
other part of the mapping has to do with a difference in functionality. For example, ExceptionNestedException
and ExceptionCollidedUnwind are primarily used by the OS dispatch mechanism
itself. We’ll see the circumstances
in which they arise later. More
importantly, MSVC filters can indicate that the __except clause should run by
returning EXCEPTION_EXECUTE_HANDLER.
But we shall see that at the SEH level this decision is achieved by
having the exception dispatch routine fix up the register context and then
resuming execution at the right spot.
The
EXCEPTION_CONTINUE_EXECUTION case supports a rather esoteric use of SEH. This return value allows the filter to
correct the problem that caused the exception and to resume execution at the
faulting instruction. For example,
an application might be watching to see when segments are being written to so
that it can log this information.
This could be achieved by marking the segment as ReadOnly and waiting for
an exception to occur on first write.
Then the filter could use VirtualProtect to change the segment containing
the faulting address to ReadWrite and then restart the faulting
instruction. Alternatively, the
application could have two VirtualAllocs for each region of memory. One of these could be marked as ReadOnly
and the second could be a shadow that is marked as ReadWrite. Now the exception filter can simply
change the register state of the CPU that faulted, so that the register
containing the faulting address is changed from the ReadOnly segment to the
shadowed ReadWrite segment.
Obviously anyone who is playing these games must have a lot of
sophistication and a deep knowledge of how the program executes. Some of these games work better if you
can constrain the code that’s generated by your program to only touch faulting
memory using a predictable cliché like offsets from a particular
register.
I’ll
talk about this kind of restartable or resumable exception in the context of
managed code later. For now, let’s
pretend that the filter either returns “true – I would like my ‘except’ clause
to handle this exception” or “false – my ‘except’ clause is uninterested in this
exception”. If the filter returns
false, the next SEH handler is fetched from the chain and it is asked this same
question.
The OS
is pretty paranoid about corrupt stacks during this chain traversal. It checks that all chain entries are
within the bounds of the stack.
(These bounds are also recorded in the TEB). The OS also checks that all entries are
in ascending order on the stack. If
you violate these rules, the OS will consider the stack to be corrupt and will
be unable to process exceptions.
This is one of the reasons that a Win32 application cannot break its
stack into multiple disjoint segments as an innovative technique for dealing
with stack overflow.
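The two rules can be sketched as a simple check over the addresses of the chain entries. This is only an illustration of the rules as stated here, not the OS implementation; chain_looks_valid and its parameters are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical check modeled on the two rules described above.
// record_addresses: addresses of the chain entries, innermost handler first.
// stack_low / stack_high: the stack bounds (the real ones live in the TEB).
bool chain_looks_valid(const std::vector<std::uintptr_t>& record_addresses,
                       std::uintptr_t stack_low,
                       std::uintptr_t stack_high) {
    bool first = true;
    std::uintptr_t previous = 0;
    for (std::uintptr_t address : record_addresses) {
        // Rule 1: every entry must lie within the bounds of the stack.
        if (address < stack_low || address >= stack_high) return false;
        // Rule 2: entries must appear in ascending order on the stack.
        if (!first && address <= previous) return false;
        previous = address;
        first = false;
    }
    return true;
}
```

A chain that hops into a second, disjoint stack segment fails the first rule, which is why the multi-segment stack trick does not work.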
Anyway,
eventually a handler says “true – I would like my ‘except’ clause to handle this
exception”. That’s because there’s
a backstop entry at the end of the chain which is placed there by the OS when
the thread is created. This last
entry wants to handle all the exceptions, even if your application-level
handlers never do. That’s where you
get the default OS behavior of consulting the unhandled exception filter list,
throwing up dialog boxes for Terminate or Debug, etc.
As soon
as a filter indicates that it wants to handle an exception, the first pass of
exception handling finishes and the second pass begins. As Matt’s article explains, the handler
can use the poorly documented RtlUnwind service to deliver second pass
notifications to all the previous handlers and pop them off the handler
chain.
In other
words, no unwinding happened as the first pass progressed. But during the second pass we see two
distinct forms of unwind. The first
form involves popping SEH records from the chain that was threaded from
TLS. Each such SEH record is popped
before the corresponding handler gets called for the second pass. This leaves the SEH chain in a
reasonable form for any nested exceptions that might occur within a
handler.
The
other form of unwind is the actual popping of the CPU stack. This doesn’t happen as eagerly as the
popping of the SEH records. On X86,
EBP is used as the frame pointer for methods containing SEH. ESP points to the top of the stack, as
always. Until the stack is actually
unwound, all the handlers are executed on top of the faulting exception
frame. So the stack actually grows
when a handler is called for the first or second pass. EBP is set to the frame of the method
containing a filter or finally clause so that local variables of that method
will be in scope.
The
actual popping of the stack doesn’t occur until the catching ‘except’ clause is
executed.
So we’ve
got a handler whose filter announced in the first pass that it would handle this
exception via EXCEPTION_EXECUTE_HANDLER.
And that handler has driven the second pass by unwinding and delivering
all the second pass notifications.
Typically it will then fiddle with the register state in the exception
context and resume execution at the top of the appropriate ‘except’ clause. This isn’t necessarily the case, and
later we’ll see some situations where the exception propagation gets
diverted.
How
about the try/finally form of SEH?
Well, it’s built on the same underlying notion of a chain of
callbacks. During the first pass
(the one where the filters execute, to decide which except block is going to
catch), the finally handlers all say EXCEPTION_CONTINUE_SEARCH. They never actually catch anything. Then in the second pass, they execute
their finally blocks.
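The ordering of the two passes is easy to lose track of, so here is a small simulation. It is a sketch under simplifying assumptions: the chain is a plain vector, nothing is really unwound, and Handler and dispatch are hypothetical names. The log it returns shows that every filter runs before any finally block does.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// One entry in a simulated handler chain (innermost frame first).
struct Handler {
    std::string name;
    std::function<bool()> filter;  // empty: a try/finally frame, which never catches
    bool has_finally;
};

// Returns a log of what ran, in order.
std::vector<std::string> dispatch(const std::vector<Handler>& chain) {
    std::vector<std::string> log;

    // First pass: ask each filter in turn, innermost first. No unwinding yet.
    std::size_t catcher = chain.size();
    for (std::size_t i = 0; i < chain.size(); ++i) {
        if (!chain[i].filter) continue;  // finally-only frames continue the search
        log.push_back("filter:" + chain[i].name);
        if (chain[i].filter()) { catcher = i; break; }
    }
    if (catcher == chain.size()) {
        // In the real system the OS backstop entry would claim the exception.
        log.push_back("unhandled");
        return log;
    }

    // Second pass: frames nested inside the catcher run their finally blocks.
    for (std::size_t i = 0; i < catcher; ++i) {
        if (chain[i].has_finally) log.push_back("finally:" + chain[i].name);
    }
    // Only now does the catching 'except' clause execute.
    log.push_back("except:" + chain[catcher].name);
    return log;
}
```

With an inner try/finally, a middle filter that declines and an outer filter that accepts, the log reads filter:middle, filter:outer, finally:inner, except:outer.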
Subsequent additions to SEH
All of
the above – and a lot more – is in Matt’s article. There are a few things that aren’t in
his article because they were added to the model later.
For
example, Windows XP introduced the notion of a vectored exception handler. This allows the application to register
for a first crack at an exception, without having to wait for exception handling
to propagate down the stack to an embedded handler. Fortunately, Matt wrote an “Under The
Hood” article on this particular topic.
This can be found at http://msdn.microsoft.com/msdnmag/issues/01/09/hood/default.aspx.
Another
change to SEH is related to security.
Buffer overruns – whether on the stack or in heap blocks – remain a
favorite attack vector for hackers.
A typical buffer overrun attack is to pass a large string as an argument
to an API. If that API expected a
shorter string, it might have a local on the stack like “char
filename[256];”. Now if the API is
foolish enough to strcpy a malicious hacker’s argument into that buffer, then
the hacker can put some fairly arbitrary data onto the stack at addresses higher
(further back on the stack) than that ‘filename’ buffer. If those higher locations are supposed
to contain call return addresses, the hacker may be able to get the CPU to
transfer execution into the buffer itself.
Oops. The hacker is
injecting arbitrary code and then executing it, potentially inside someone
else’s process or under their security credentials.
There’s
a new speed bump that an application can use to reduce the likelihood of a
successful stack-based buffer overrun attack. This involves the /GS C++ compiler
switch, which uses a cookie check in the function epilog to determine whether a
buffer overrun has corrupted the return address before executing a return based
on its value.
However,
the return address trick is only one way to exploit buffer overruns. We’ve already seen that SEH records are
necessarily built on the stack. And
in fact the OS actually checks to be sure they are within the stack bounds. Those SEH records contain callback
pointers which the OS will invoke if an exception occurs. So another way to exploit a buffer
overrun is to rewrite the callback pointer in an SEH record on the stack. There’s a new linker switch (/SAFESEH)
that can provide its own speed bump against this sort of attack. Modules built this way declare that all
their handlers are embedded in a table in the image; they do not point to
arbitrary code sequences sprinkled in the stack or in heap blocks. During exception processing, the
exception callbacks can be validated against this table.
Of
course, the first and best line of defense against all these attacks is to never
overrun a buffer. If you are
writing in managed code, this is usually pretty easy. You cannot create a buffer overrun in
managed code unless the CLR contains a bug or you perform unsafe operations
(e.g. unverifiable MC++ or ‘unsafe’ in C#) or you use high-privilege unsafe APIs
like StructureToPtr or the various overloads of Copy in the
System.Runtime.InteropServices.Marshal class.
So, not
surprisingly and not just for this reason, I recommend writing in managed
code. But if you must write some
unmanaged code, you should seriously consider using a String abstraction that
eliminates all those by-rote opportunities for error. And if you must code each strcpy
individually, be sure to use strncpy instead!
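One caveat worth showing in code: strncpy does not write a terminating null when the source is at least as long as the count, so a bare strncpy can still leave an unterminated buffer. A minimal sketch of a bounded copy follows; bounded_copy is a hypothetical helper name, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies at most dst_size - 1 characters and always terminates the result.
// Returns true if the whole source fit, false if it was truncated.
bool bounded_copy(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && src != nullptr && dst_size > 0);
    std::strncpy(dst, src, dst_size - 1);  // never writes past dst_size - 1 bytes
    dst[dst_size - 1] = '\0';              // strncpy alone may omit the terminator
    return std::strlen(src) < dst_size;
}
```

Returning the truncation flag lets the caller reject an oversized argument instead of silently continuing with a clipped file name.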
A final
interesting change to the OS SEH model since Matt’s article is due to
Win64. Both IA64 and AMD64 have a
model for exception handling that avoids reliance on an explicit handler chain
that starts in TLS and is threaded through the stack. Instead, exception handling relies on
the fact that on 64-bit systems we can perfectly unwind a stack. And this ability is itself due to the
fact that these chips are severely constrained on the calling conventions they
support.
If you
look at X86, there is an unbounded number of possible calling conventions. Sure, there are a few common well-known
conventions like stdcall, cdecl, thiscall and fastcall. But optimizing compilers can invent
custom calling conventions based on inter-procedural analysis. And developers writing in assembly
language can make novel decisions about which registers to preserve vs. scratch,
how to use the floating point stack, how to encode structs into registers,
whether to back-propagate results by re-using the stack that contained in-bound
arguments, etc. Within the CLR, we
have places where we even unbalance the stack by encoding data after a CALL
instruction, which is then addressable via the return address. This is a particularly dangerous game
because it upsets the branch prediction code of the CPU and can cause prediction
misses on several subsequent RET instructions. So we are careful to reserve this
technique for low frequency call paths.
And we also have some stubs that compute indirect JMPs to out-of-line RET
‘n’ instructions in order to rebalance the stack.
It would
be impossible for a stack crawler to successfully unwind these bizarre stacks
for exception purposes, without completely simulating arbitrary code
execution. So on X86 the exception
mechanism must rely on the existence of a chain of crawlable FS:[0] handlers
that is explicitly maintained.
Incidentally, the above distinction between perfect stack crawling on
64-bit systems vs. hopeless stack crawling on X86 systems has deeper
repercussions for the CLR than just exception handling. The CLR needs the ability to crawl all
the managed portions of a thread’s stack on all architectures. This is a requirement for proper
enforcement of Code Access Security; for accurate reporting of managed
references to the GC; for hijacking return addresses in order to asynchronously
take control of threads; and for various other reasons. On X86, the CLR devotes considerable
resources to achieving this.
Anyway,
on 64-bit systems the correspondence between an activation record on the stack
and the exception record that applies to it is not achieved through an FS:[0]
chain. Instead, unwinding of the
stack reveals the code addresses that correspond to a particular activation
record. These instruction pointers
of the method are looked up in a table to find out whether there are any
__try/__except/__finally clauses that cover these code addresses. This table also indicates how to proceed
with the unwind by describing the actions of the method epilog.
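The table lookup can be pictured as a binary search over sorted code ranges. This is a simplified sketch: the real tables also carry unwind descriptions, and FunctionEntry and find_entry are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified stand-in for one row of the per-module table.
struct FunctionEntry {
    std::uintptr_t begin;  // first code address covered
    std::uintptr_t end;    // one past the last code address covered
    bool has_handler;      // whether try/except/finally clauses cover this range
};

// table must be sorted by 'begin' and contain non-overlapping ranges.
// Returns the entry covering 'ip', or nullptr if none does.
const FunctionEntry* find_entry(const std::vector<FunctionEntry>& table,
                                std::uintptr_t ip) {
    // Find the first entry that starts after ip, then step back one.
    auto it = std::upper_bound(
        table.begin(), table.end(), ip,
        [](std::uintptr_t value, const FunctionEntry& e) { return value < e.begin; });
    if (it == table.begin()) return nullptr;
    --it;
    return ip < it->end ? &*it : nullptr;
}
```

Because the lookup is keyed purely on the instruction pointer, no per-thread chain has to be maintained while the code runs normally.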
Managed Exceptions
Okay,
enough about SEH – for now. Let’s
switch to the managed exception model.
This model contains a number of constructs. Depending on the language you code in,
you probably only have access to a subset of these.
try {…} finally {…}
This is
pretty standard. All managed
languages should expose this, and it should be the most common style of
exception handling in user code. Of
course, in the case of MC++ the semantics of ‘finally’ is exposed through
auto-destructed stack objects rather than through explicit finally clauses. You should be using ‘finally’ clauses to
guarantee consistency of application state far more frequently than you use
‘catch’ clauses. That’s because
catch clauses increase the likelihood that developers will swallow exceptions
that should be handled elsewhere, or perhaps should even be left unhandled. And if catch clauses don’t actually
swallow an exception (i.e. they ‘rethrow’), they still create a poor debugging
experience as we shall see.
try {…} catch (Object o) {…}
This is
pretty standard, too. One thing
that might surprise some developers is that you can catch any instance that’s of
type Object or derived from Object.
However, there is a CLS rule that only subtypes of System.Exception
should be thrown. In fact, C# is so
eager for you to only deal with System.Exception that it doesn’t provide any
access to the thrown object unless you are catching Exception or one of its
subtypes.
When you
consider that only Exception and its subtypes have support for stack traces,
HRESULT mapping, standard access to exception messages, and good support
throughout the frameworks, then it’s pretty clear that you should restrict
yourself to throwing and processing exceptions that derive from
Exception.
In
retrospect, perhaps we should have limited exception support to Exception rather
than Object. Originally, we wanted
the CLR to be a useful execution engine for more run-time libraries than just
the .NET Frameworks. We imagined
that different languages would execute on the CLR with their own particular
run-time libraries. So we didn’t
want to couple the base engine operations too tightly with CLS rules and
constructs in the frameworks. Of
course, now we understand that the commonality of the shared framework classes
is a huge part of the value proposition of our managed environment. I suspect we would revisit our original
design if we still could.
try {…} catch (Object o) if (expression) {…}
This is
invented syntax, though I’m told it’s roughly what MC++ is considering. As far as I know, the only two .NET
languages that currently support exception filters are VB.NET and – of course –
ILASM. (We never build a managed
construct without exposing it via ILDASM and ILASM in a manner that allows these
two tools to round-trip between source and binary forms).
VB.NET
has sometimes been dismissed as a language that’s exclusively for less
sophisticated developers. But the
way this language exposes the advanced feature of exception filters is a great
example of why that position is too simplistic. Of course, it is true that VB has
historically done a superb job of providing an approachable toolset and
language, which has allowed less sophisticated developers to be highly
productive.
Anyway,
isn’t this cool:
Try
    …try statements…
Catch e As InvalidOperationException When expressionFilter
    …catch statements…
End Try
Of course, at the runtime
level we cannot separate the test for the exception type expression and the
filter expression. We only support
a bare expression. So the VB
compiler turns the above catch into something like this, where $exception_obj is the
implicit argument passed to the filter.
Catch When (IsInst($exception_obj, InvalidOperationException) && expressionFilter)
While
we’re on the topic of exception handling in VB, have you ever wondered how VB
.NET implements its On Error statement?
On Error { Goto { <line> | 0 | -1 } | Resume Next }
Me
neither. But I think it’s pretty
obvious how to implement this sort of thing with an interpreter. You wait for something to go wrong, and
then you consult the active “On Error” setting. If it tells you to “Resume Next”, you
simply scan forwards to the next statement and away you go.
But in
an SEH world, it’s a little more complicated. I tried some simple test cases with the
VB 7.1 compiler. The resulting
codegen is based on advancing a _Vb_t_CurrentStatement local variable to
indicate the progression of execution through the statements. A single try/filter/catch covers
execution of these statements. It
was interesting to see that the ‘On Error’ command only applies to exceptions
that derive from System.Exception.
The filter refuses to process any other exceptions.
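The statement-counter idea can be imitated in a few lines. This sketch is not the VB codegen; it only mirrors the shape described above, with a counter playing the role of _Vb_t_CurrentStatement and a catch of std::exception standing in for the System.Exception filter. run_resume_next is a hypothetical name.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <vector>

// Runs every statement, skipping past any that throws ("Resume Next").
// Returns how many statements failed.
std::size_t run_resume_next(const std::vector<std::function<void()>>& statements) {
    std::size_t failures = 0;
    std::size_t current = 0;  // which statement is executing right now
    while (current < statements.size()) {
        try {
            // A single try region covers all of the statements.
            for (; current < statements.size(); ++current) {
                statements[current]();
            }
        } catch (const std::exception&) {
            // Only exceptions from the standard hierarchy are processed here;
            // anything else propagates to the caller.
            ++failures;
            ++current;  // resume with the statement after the one that failed
        }
    }
    return failures;
}
```

The cost of the approach is visible even in the sketch: the counter has to be kept up to date on every statement, whether or not anything ever goes wrong.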
So VB is
nicely covered. But what if you did
need to use exception filters from C#?
Well, in V1 and V1.1, this would be quite difficult. But C# has announced a feature for their
next release called anonymous methods.
This is a compiler feature that involves no CLR changes. It allows blocks of code to be mentioned
inline via a delegate. This
relieves the developer from the tedium of defining explicit methods and state
objects that can be gathered into the delegate and the explicit sharing of this
state. This and other seductive
upcoming C# features are described at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dv_vstechart/html/vbconcprogramminglanguagefuturefeatures.asp.
Using a
mechanism like this, someone has pointed out that one could define delegates for
try, filter and catch clauses and pass them to a shared chunk of ILASM. I love the way the C# compiler uses type
inferencing to automatically deduce the delegate types. And it manufactures a state object to
ensure that the locals and arguments of DoTryCatch are available to the “try
statements”, “filter expression” and “catch statements”, almost as if everything
was scoped in a single method body.
(I say “almost” because any locals or arguments that are of byref,
argiterator or typedbyref types cannot be disassociated from a stack without
breaking safety. So these cases are
disallowed).
I’m
guessing that access to filters from C# could look something like
this:
public delegate void __Try();
public delegate Int32 __Filter();
public delegate void __Catch();
// this reusable helper would be defined in ILASM or VB.NET:
void DoTryCatch(__Try t, __Filter f, __Catch c)
// And C# could then use it as follows:
void m(…arguments…)
{
    …locals…
    DoTryCatch(
        delegate { …try statements… },
        delegate { return filter_expression; },
        delegate { …catch statements… }
    );
}
You may
notice that I cheated a little bit.
I didn’t provide a way for the ‘catch’ clause to mention the exception
type that it is catching. Of
course, this could be expressed as part of the filter, but that’s not really
playing fair. I suspect the
solution is to make DoTryCatch a generic method that has an unbound Type
parameter. Then DoTryCatch<T>
could be instantiated for a particular type. However, I haven’t actually tried this
so I hate to pretend that it would work.
I am way behind on understanding what we can and cannot do with generics
in our next release, how to express this in ILASM, and how it actually works
under the covers. Any blog on that
topic is years away.
While we
are on the subject of interesting C# codegen, that same document on upcoming
features also discusses iterators.
These allow you to use the ‘yield’ statement to convert the normal pull
model of defining iteration into a convenient push model. You can see the same ‘yield’ notion in
Ruby. And I’m told that both
languages have borrowed this from CLU, which pioneered the feature about the
time that I was born.
When you
get your hands on an updated C# compiler that supports this handy construct, be
sure to ILDASM your program and see how it’s achieved. It’s a great example of what a compiler
can do to make life easier for a developer, so long as we’re willing to burn a
few more cycles compared to a more prosaic loop construct. In today’s world, this is almost always a sensible
trade-off.
Okay,
that last part has nothing to do with exceptions, does it? Let’s get back to the managed exception
model.
try {…} fault {…}
Have you
ever written code like this, to restrict execution of your finally clause to
just the exceptional cases?
bool exceptional = true;
try
{
    …body of try…
    exceptional = false;
} finally
{
    if (exceptional) {…}
}
Or how
about a catch with a rethrow, as an alternate technique for achieving finally
behavior for just the exceptional cases:
try
{
    …
} catch
{
    …
    rethrow;
}
In each
case, you are compensating for the fact that your language doesn’t expose fault
blocks. In fact, I think the only
language that exposes these is ILASM.
A fault block is simply a finally clause that only executes in the
exceptional case. It never executes
in the non-exceptional case.
Incidentally, the first alternative is preferable to the second. The second approach terminates the first
pass of exception handling. This is
a fundamentally different semantics, which has a substantial impact on debugging
and other operations. Let’s look at
rethrow in more detail, to see why this is the case.
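For comparison, unmanaged C++ can approximate a fault block without catching anything, by using a stack object whose destructor acts only while an exception is propagating. This is a sketch of that idea, assuming a C++17 compiler for std::uncaught_exceptions; FaultGuard and run_body are hypothetical names.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Runs 'on_fault' from its destructor only if the scope is being left
// because an exception is propagating, never on normal exit.
class FaultGuard {
public:
    explicit FaultGuard(std::function<void()> on_fault)
        : on_fault_(std::move(on_fault)),
          depth_at_entry_(std::uncaught_exceptions()) {}

    ~FaultGuard() {
        // More exceptions in flight than at construction: we are unwinding.
        // The action must not throw, since destructors are noexcept by default.
        if (std::uncaught_exceptions() > depth_at_entry_) on_fault_();
    }

    FaultGuard(const FaultGuard&) = delete;
    FaultGuard& operator=(const FaultGuard&) = delete;

private:
    std::function<void()> on_fault_;
    int depth_at_entry_;
};

// Hypothetical demonstration: returns how many times the fault action ran.
int run_body(bool should_throw) {
    int fault_runs = 0;
    try {
        FaultGuard guard([&] { ++fault_runs; });
        if (should_throw) throw std::runtime_error("body failed");
    } catch (const std::runtime_error&) {
        // Swallowed only so the demonstration can report the count.
    }
    return fault_runs;
}
```

The guard is the mechanical equivalent of the 'exceptional' flag in the first alternative: the cleanup is attached to the scope, and no catch clause is involved in running it.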
Rethrow, restartable exceptions, debugging
Gee, my
language has rethrow, but no filter.
Why can’t I just treat the following constructs as equivalent?
try {…} filter (expression) catch (Exception e) {…}
try {…} catch (Exception e) { if (!expression) rethrow; …}
In fact,
‘rethrow’ tries hard to create the illusion that the initial exception handling
is still in progress. It uses the
same exception object. And it
augments the stack trace associated with that exception object, so that it
includes the portion of stack from the rethrow to the eventual catch.
Hmm, I
guess I should have already mentioned that the stack trace of an Exception is
intentionally restricted to the segment of stack from the throw to the
catch. We do this for performance
reasons, since part of the cost of an exception is linear with the depth of the
stack that we capture. I’ll talk
about the implications of exception performance later. Of course, you can use the
System.Diagnostics.StackTrace class to gather the rest of the stack from the
point of the catch, and then manually merge it into the stack trace from the
Exception object. But this is a
little clumsy and we have sometimes been asked to provide a helper to make this
more convenient and less brittle to changes in the formatting of stack
traces.
Incidentally, when you are playing around with stack traces (whether they
are associated with exceptions, debugging, or explicit use of the StackTrace
class), you will always find JIT inlining getting in your way. You can try to defeat the JIT inliner
through use of indirected calls like function pointers, virtual methods,
interface calls and delegates. Or
you can make the called method “interesting” enough that the JIT decides it
would be unproductive or too difficult to inline. All these techniques are flawed, and all
of them will fail over time. The
correct way to control inlining is to use the
MethodImpl(MethodImplOptions.NoInlining) pseudo-custom attribute from the
System.Runtime.CompilerServices namespace.
One way
that a rethrow differs from a filter is with respect to resumable or restartable
exceptions. We’ve already seen how
SEH allows an exception filter to return EXCEPTION_CONTINUE_EXECUTION. This causes the faulting instruction to
be restarted. Obviously it’s
unproductive to do this unless the filter has first taken care of the faulting
situation somehow. It could do this
by changing the register state in the exception context so that a different
value is dereferenced, or so that execution resumes at a different
instruction. Or it could have
modified the environment the program is running in, as with the VirtualProtect
cases that I mentioned earlier.
In V1
and V1.1, the managed exception model does not support restartable
exceptions. In fact, I think that
we set EXCEPTION_NONCONTINUABLE on some (but perhaps not all) of our exceptions
to indicate this. There are several
reasons why we don’t support restartable exceptions:
- In order to repair a faulting situation, the exception
handler needs intimate knowledge about the execution environment. In managed code, we’ve gone to great
lengths to hide these details.
For example, there is no architecture-neutral mapping from the IL
expression of stack-based execution to the register set of the underlying
CPU.
- Restartability is often desired for asynchronous
exceptions. By ‘asynchronous’ I
mean that the exception is not initiated by an explicit call to ‘throw’ in the
code. Rather, it results from a
memory fault or an injected failure like Abort that can happen on any
instruction. Propagating a
managed exception, where this involves execution of a managed filter,
necessarily involves the potential for a GC. A JIT has some discretion over the
GC-safe points that it chooses to support in a method. Certainly the JIT must gather GC
information to report roots accurately at all call-sites. But the JIT normally isn’t required to
maintain GC info for every instruction.
If any instruction might fault, and if any such fault could be resumed,
then the JIT would need GC info for all instructions in all methods. This would be expensive. Of course, ‘mov eax, ecx’ cannot fault
due to memory access issues. But
a surprising number of instructions are subject to fault if you consider all
of memory – including the stack – to be unmapped. And even ‘mov eax, ecx’ can fault due
to a Thread.Abort.
If you
were paying attention to that last bullet, you might be wondering how
asynchronous exceptions could avoid GC corruption even without resumption. After all, the managed filter will still
execute and we know that the JIT doesn’t have complete GC information for the
faulting instruction.
Our
current solution to this on X86 is rather ad hoc, but it does work. First, we constrain the JIT to never
flow the contents of the scratch registers between a ‘try’ clause and any of the
exception clauses (‘filter’, ‘finally’, ‘fault’ and ‘catch’). The scratch registers in this case are
EAX, ECX, EDX and sometimes EBP.
Our JIT compiler decides, method-by-method, whether to use EBP as a
stack-frame register or a scratch register. Of course, EBP isn’t really a scratch
register since callees will preserve it for us, but you can see where I’m
going.
Now when
an asynchronous exception occurs, we can discard the state of all the scratch
registers. In the case of EAX, ECX
& EDX, we can unconditionally zero them in the register context that is
flowed via exception propagation.
In the case of EBP, we only zero it if we aren’t using EBP as a frame
register. When we execute a managed
handler, we can now report GC roots based on the GC information that’s
associated with the handler’s instruction pointer.
The
downside to this approach, other than its ad hoc nature, is that it constrains
the codegen of any method that contains exception handlers. At some point we may have to model
asynchronous exceptions more accurately, or expand the GC information spewed by
the JIT compiler, or a combination, so that we can enable better code generation
in the presence of exceptions.
We’ve
already seen how VB.NET can use a filter and explicit logic flow from a catch
clause to create the illusion of restartable exceptions to support ‘On Error
Resume Next’. But this should not
be confused with true restartability.
Before
we leave the topic of rethrow, we should briefly consider the InnerException
property of System.Exception. This
allows one exception to be wrapped up in the state of another exception. A couple of important places where we
take advantage of this are reflection and class construction.
When you
perform late-bound invocation via reflection (e.g. Type.InvokeMember or
MethodInfo.Invoke), exceptions can occur in two places:
1) The reflection infrastructure may decide that it cannot satisfy your request, perhaps because you passed the wrong number of arguments, or the member lookup failed, or you are invoking on someone else’s private members. That last one sounds vaguely dirty.
2) The late-bound invocation might work perfectly, but the target method you called may throw an exception back at you. Reflection must faithfully give you that exception as the result of the call. Returning it as an outbound argument, rather than throwing it at you, would be dangerous. We would lose one of the wonderful properties of exceptions, which is that they are hard to ignore. Error codes are constantly being swallowed or otherwise ignored, leading to fragile execution.
The
problem is that these two sources of exceptions are ambiguous. There must be some way to tell whether
the invocation attempt failed or whether the target of the invocation
failed. Reflection
disambiguates these cases by using an instance of
System.Reflection.TargetInvocationException for the case where the invoked
method threw an exception. The
InnerException property of this instance is the exception that was thrown by the
invoked method. If you get any
exceptions from a late-bound invocation other than TargetInvocationException,
those other exceptions indicate problems with the late-bound dispatch attempt
itself.
Something similar happens with TypeInitializationException. If a class constructor (.cctor) method
fails, we capture that exception as the InnerException of a
TypeInitializationException.
Subsequent attempts to use that class in this AppDomain from this or
other threads will have that same TypeInitializationException instance thrown at
them.
So
what’s the difference between the following three constructs, where the
overloaded constructor for MyExcep is placing its argument into
InnerException:
try {…} catch (Exception e) { if (expr) rethrow; …}
try {…} catch (Exception e) { if (expr) throw new MyExcep(); …}
try {…} catch (Exception e) { if (expr) throw new MyExcep(e); …}
Well,
the 2nd form is losing information. The original exception has been
lost. It’s hard to recommend that
approach.
Between
the 1st and 3rd forms, I suppose it depends on whether the
intermediary can add important information by wrapping the original exception in
a MyExcep instance. Even if you are
adding value with MyExcep, it’s still important to preserve the original
exception information in the InnerException so that sophisticated programs and
developers can determine the complete cause of the error.
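The third form has a rough analog in standard C++, where std::throw_with_nested attaches the in-flight exception to a new one much as InnerException does. The sketch below is only that analog, not managed code; MyExcep here is a stand-in type, and low_level, intermediary and describe are hypothetical names.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Stand-in for the wrapping exception type discussed above.
struct MyExcep : std::runtime_error {
    using std::runtime_error::runtime_error;
};

void low_level() {
    throw std::invalid_argument("bad input");
}

// Wraps the original exception instead of discarding it (the 3rd form).
void intermediary() {
    try {
        low_level();
    } catch (const std::exception&) {
        // The exception currently being handled becomes the nested one.
        std::throw_with_nested(MyExcep("intermediary failed"));
    }
}

// Walks the chain of nested exceptions, outermost first.
std::string describe(const std::exception& e) {
    std::string text = e.what();
    try {
        std::rethrow_if_nested(e);  // no-op when nothing is nested
    } catch (const std::exception& inner) {
        text += " <- " + describe(inner);
    } catch (...) {
        text += " <- (non-standard exception)";
    }
    return text;
}
```

Catching MyExcep from intermediary() and passing it to describe() yields both messages, so the complete cause of the error is still recoverable by whoever finally handles it.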
Probably
the biggest impact from terminating the first pass of exception handling early,
as with the examples above, is on debugging. Have you ever attached a debugger to a
process that has failed with an unhandled exception? When everything goes perfectly, the
debugger pops up sitting in the context of the RaiseException or trap
condition.
That’s
so much better than attaching the debugger and ending up on a ‘rethrow’
statement. What you really care
about is the state of the process when the initial exception was thrown. But the first pass has terminated and
the original state of the world may have been lost. It’s clear why this happens, based on
the two pass nature of exception handling.
Actually, the determination of whether or not the original state of the
world has been lost or merely obscured is rather subtle. Certainly the current instruction
pointer is sitting in the rethrow rather than on the original fault. But remember how filter and finally
clauses are executed with an EBP that puts the containing method’s locals in
scope… and an ESP that still contains the original faulting method? It turns out that the catching handler
has some discretion on whether to pop ESP before executing the catch clause or
instead to delay the pop until the catch clause is complete. The managed handler currently pops the
stack before calling the catch clause, so the original state of the exception is
truly lost. I believe the unmanaged
C++ handler delays the pop until the catch completes, so recovering the state of
the world for the original exception is tricky but possible.
Regardless, every time you catch and rethrow, you inflict this bitter
disappointment on everyone who debugs through your code. Unfortunately, there are a number of
places in managed code where this disappointment is unavoidable.
The most
unfortunate place is at AppDomain boundaries. I’ve already explained at http://blogs.gotdotnet.com/cbrumme/PermaLink.aspx/56dd7611-a199-4a1f-adae-6fac4019f11b why the Isolation requirement of AppDomains forces us to
marshal most exceptions across the boundary. And we’ve just discussed how reflection
and class construction terminate the first pass by wrapping exceptions as the
InnerException of an outer exception.
One
alternative is to trap on all first-chance exceptions. That’s because debuggers can have first
crack at exceptions before the vectored exception handler even sees the
fault. This certainly gives you the
ability to debug each exception in the context in which it was thrown. But you are likely to see a lot of
exceptions in the debugger this way!
In fact,
throughout V1 of the runtime, the ASP.NET team ran all their stress suites with
a debugger attached and configured to trap on first-chance Access Violations
(“sxe av”). Normally an AV in
managed code is converted to a NullReferenceException and then handled like any
other managed exception. But
ASP.NET’s settings caused stress to trap in the debugger for any such AV. So their team enforced a rule that all
their suites (including all dependencies throughout FX) must avoid such
faults.
It’s an
approach that worked for them, but it’s hard to see it working more
broadly.
Instead,
over time we need to add new hooks to our debuggers so they can trap on just the
exceptions you care about. This
might involve trapping exceptions that are escaping your code or are being
propagated into your code (for some definition of ‘your code’). Or it might involve trapping exceptions
that escape an AppDomain or that are propagated into an AppDomain.
The
above text has described a pretty complete managed exception model. But there’s one feature that’s
conspicuously absent. There’s no
way for an API to document the legal set of exceptions that can escape from
it. Some languages, like C++,
support this feature. Other
languages, like Java, mandate it.
Of course, you could attach Custom Attributes to your methods to indicate
the anticipated exceptions, but the CLR would not enforce this. It would be an opt-in discipline that
would be of dubious value without global buy-in and guaranteed
enforcement.
This is
another of those religious language debates. I don’t want to rehash all the reasons
for and against documenting thrown exceptions. I personally don’t believe the
discipline is worth it, but I don’t expect to change the minds of any
proponents. It doesn’t
matter.
What
does matter is that disciplines like this must be applied universally to have
any value. So we either need to
dictate that everyone follow the discipline or we must so weaken it that it is
worthless even for proponents of it.
And since one of our goals is high productivity, we aren’t going to
inflict a discipline on people who don’t believe in it – particularly when that
discipline is of debatable value.
(It is debatable in the literal sense, since there are many people on
both sides of the argument).
To me,
this is rather like ‘const’ in C++.
People often ask why we haven’t bought into this notion and applied it
broadly throughout the managed programming model and frameworks. Once again, ‘const’ is a religious
issue. Some developers are fierce
proponents of it and others find that the modest benefit doesn’t justify the
enormous burden. And, once again,
it must be applied broadly to have value.
Now in
C++ it’s possible to ‘const-ify’ the low level runtime library and services, and
then allow client code to opt-in or not.
And when the client code runs into places where it must lose ‘const’ in
order to call some non-const-ified code, it can simply remove ‘const’ via a
dirty cast. We have all done this
trick, and it is one reason that I’m not particularly in favor of ‘const’
either.
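For anyone who hasn't had the pleasure, the dirty cast looks roughly like the sketch below. The function names (legacy_length, checked_length) are hypothetical and exist only for illustration. The cast is harmless here only because the legacy routine never writes through the pointer.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical legacy routine that was never const-ified: it takes a
// mutable pointer even though it only reads from the buffer.
std::size_t legacy_length(char* text) {
    return std::strlen(text);
}

// Const-correct client code that still has to call the legacy routine.
std::size_t checked_length(const char* text) {
    // The "dirty cast": const_cast strips the qualifier so the call compiles.
    // Nothing verifies that legacy_length really leaves the buffer alone.
    return legacy_length(const_cast<char*>(text));
}
```

The compiler accepts this without complaint, which is exactly why an unenforced 'const' offers weaker guarantees than its proponents would like.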
But in a
managed world, ‘const’ would only have value if it were enforced by the
CLR. That means the verifier would
prevent you from losing ‘const’ unless you explicitly broke type safety and were
trusted by the security system to do so.
Until more than 80% of developers are clamoring for an enforced ‘const’
model throughout the managed environment, you aren’t going to see us add
it.
Foray into C++ Exceptions
C++
exposes its own exception model, which is distinct from the __try / __except /
__finally exposure of SEH. This is
done through auto-destruction of stack-allocated objects and through the ‘try’
and ‘catch’ keywords. Note that
there are no double-underbars and there is no support for filters other than
through matching of exception types.
Of course, under the covers it’s still SEH. So there’s still an FS:[0] handler (on
X86). But the C++ compiler
optimizes this by only emitting a single SEH handler per method regardless of
how many try/catch/finally clauses you use. The compiler emits a table to indicate
to a common service in the C-runtime library where the various try, catch and
finally clauses can be found in the method body.
Of
course, one of the biggest differences between SEH and the C++ exception model
is that C++ allows you to throw and catch objects of types defined in your
application. SEH only lets you
throw 32-bit exception codes. You
can use _set_se_translator to map SEH codes into the appropriate C++ classes in
your application.
A large
part of the C++ exception model is implicit. Rather than use explicit try / finally /
catch clauses, this language encourages use of auto-destructed local
variables. Whether the method
unwinds via a non-exceptional return statement or an exception being thrown,
that local object will auto-destruct.
This is
basically a ‘finally’ clause that’s been wrapped up in a more useful language
construct. Auto-destruction occurs
during the second pass of SEH, as you would expect.
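Here is a minimal sketch of that implicit 'finally', in portable C++. The ScopeGuard type and the work / run_both_paths functions are made up for this illustration. The guard records its own destruction, so you can see that it runs on both the normal return path and the exceptional one.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical guard type: its destructor plays the role of a 'finally' clause.
class ScopeGuard {
public:
    ScopeGuard(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) {}
    ~ScopeGuard() { log_.push_back("destroy " + name_); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::vector<std::string>& log_;
    std::string name_;
};

// Leaves the scope either by returning normally or by throwing.
void work(std::vector<std::string>& log, bool fail) {
    ScopeGuard guard(log, fail ? "after throw" : "after return");
    if (fail) {
        throw std::runtime_error("simulated failure");
    }
    log.push_back("normal return");
}

// Exercises both exits and returns the recorded sequence of events.
std::vector<std::string> run_both_paths() {
    std::vector<std::string> log;
    work(log, false);
    try {
        work(log, true);
    } catch (const std::runtime_error&) {
        log.push_back("caught");
    }
    return log;
}
```

On the exceptional path the guard's destructor runs before the catch body does, which matches the unwind happening in the second pass.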
Have you
noticed that the C++ exception you throw is often a stack-allocated local? And that if you explicitly catch it,
you catch it into a stack-allocated object as well? Did you ever wake up at night in a cold
sweat, wondering whether a C++ in-flight exception resides on a piece of stack
that’s already been popped? Of
course not.
In fact,
we’ve now seen enough of SEH to understand how the exception always remains in a
section of the stack above ESP (i.e. within the bounds of the stack). Prior to the throw, the exception is
stack-allocated within the active frame.
During the first pass of SEH, nothing gets popped. When the filters execute, they are
pushed deeper on the stack than the throwing frame.
When a
frame declares it will catch the exception, the second pass starts. Even here, the stack doesn’t
unwind. Then, before resetting the
stack pointer, the C++ handler can copy-construct the original exception from
the piece of stack that will be popped into the activation frame that will be
uncovered.
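From the language side, the effect is simply that the caught object is intact even though the frame that created it is gone. The sketch below shows this in portable C++. ParseError, fail_at and describe_failure are hypothetical names, and how many copies are made along the way is left to the implementation.

```cpp
#include <string>

// Hypothetical application-defined exception type.
struct ParseError {
    std::string message;
    int line;
};

// The exception starts life as a stack-allocated local in this frame.
[[noreturn]] void fail_at(int line) {
    ParseError local{"unexpected token", line};
    throw local;  // the in-flight exception object is initialized from `local`
}

// Catching by value initializes `caught` in the catching frame, so it stays
// valid after the throwing frame has been popped.
std::string describe_failure(int line) {
    std::string description = "no failure";
    try {
        fail_at(line);
    } catch (ParseError caught) {
        description = caught.message + " at line " + std::to_string(caught.line);
    }
    return description;
}
```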
If you
are an expert in unmanaged C++ exceptions, you will probably be interested to
learn of the differences between managed C++ exceptions and unmanaged C++
exceptions. There’s a good write-up
of these differences at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vcmex/html/vccondifferencesinexceptionhandlingbehaviorundermanagedexceptionsforc.asp.
A Single Managed Handler
We’ve
already seen how the C++ compiler can emit one SEH handler per method and reuse
it for all the exception blocks in that method. The handler can do this by consulting a
side table that indicates how the various clauses map to instruction sequences
within that method.
In the
managed environment, we can take this even further. We maintain a boundary between managed
and unmanaged code for many reasons, like synchronization with the garbage
collector, to enable stack crawling through managed code, and to marshal
arguments properly. We have
modified this boundary to erect a single SEH handler at every unmanaged ->
managed call in. For the most part,
we must do this without compiler support since many of our transitions occur
through dynamically generated machine code.
The cost
of modifying the SEH chain during calls into managed code is quickly amortized
as we call freely between managed methods.
So the immediate cost of pushing FS:[0] handlers on method entry is
negligible for managed code. But
there is still an impact on the quality of the generated code. We saw part of this impact in the
discussion of register usage across exception clauses to remain
GC-safe.
Of
course, the biggest cost of exceptions is when you actually throw one. I’ll return to this near the end of the
blog.
Flow Control
Here’s
an interesting scenario that came up recently.
Let’s
say we drive the first pass of exception propagation all the way to the end of
the handler chain and we reach the unhandled exception backstop. That backstop will probably pop a dialog
in the first pass, saying that the application has suffered an unhandled
exception. Depending on how the
system is configured, the dialog may allow us to terminate the process or debug
it. Let’s say we choose
Terminate.
Now the
2nd pass begins. During
the 2nd pass, all our finally clauses can execute.
What if
one of those 2nd pass ‘finally’ clauses throws a new exception? We’re going to start a new exception
propagation from this location – with a new Exception instance. When we drive this new Exception up the
chain, we may actually find a handler that will swallow the second
exception.
If this
is the case, the process won’t terminate due to that first exception. This is despite the fact that SEH told
the user we had an unhandled exception, and the user told us to terminate the
process.
This is
surprising, to say the least. And
this behavior is possible, regardless of whether managed or unmanaged exceptions
are involved. The mechanism for SEH
is well-defined and the exception model operates within those rules. An application should avoid certain
(ab)uses of this mechanism, to avoid confusion.
Indeed,
we have prohibited some of those questionable uses in managed code.
In
unmanaged, you should never return from a finally. In an exceptional execution of a
finally, a return has the effect of terminating the exception processing. The catch handler never sees its
2nd pass and the exception is effectively swallowed. Conversely, in a non-exceptional
execution of a finally, a return has the effect of replacing the method’s return
value with the return value from the finally. This is likely to cause developer
confusion.
So in
managed code we’ve made it impossible for you to return from a finally
clause. The full rules for flow
control involving managed exception clauses should be found at Section 12.4.2.8
of ECMA Partition I (http://msdn.microsoft.com/net/ecma/).
However,
it is possible to throw from a managed finally clause. (In general, it’s very hard to
confidently identify regions of managed code where exceptions cannot be
thrown). And this can have the
effect of replacing the exception that was in flight with a new 1st
and 2nd pass sweep, as described above. This is the ExceptionCollidedUnwind
situation that is mentioned in the EXCEPTION_DISPOSITION enumeration.
The C++
language takes a different approach to exceptions thrown from the 2nd
pass. We’ve already seen that C++
autodestructors execute during the 2nd pass of exception
handling. If you’ve ever thrown an
exception from the destructor, when that destructor is executed as part of an
exception unwind, then you have already learned a painful lesson. The C++ behavior for this situation is
to terminate the process via a termination handler.
In
unmanaged C++, this means that developers must follow great discipline in the
implementation of their destructors.
Since eventually those destructors might run in the context of exception
backout, those destructors should never allow an exception to escape them. That’s painful, but presumably
achievable.
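One common way to follow that discipline is to contain any failure inside the destructor itself. The sketch below assumes a hypothetical Connection type whose cleanup step always fails. It is an illustration of the pattern, not a recommendation for how to report the suppressed error.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical resource whose cleanup step can fail.
class Connection {
public:
    explicit Connection(std::vector<std::string>& log) : log_(log) {}

    // If an exception escaped from here while another exception was
    // unwinding the stack, the process would be terminated. So the
    // destructor catches the failure and records it instead.
    ~Connection() {
        try {
            flush();
        } catch (const std::exception& e) {
            log_.push_back(std::string("suppressed: ") + e.what());
        }
    }

private:
    // Always fails, to simulate a cleanup error.
    void flush() { throw std::runtime_error("flush failed"); }

    std::vector<std::string>& log_;
};

// Throws while a Connection is live, so its destructor runs during unwind.
std::vector<std::string> unwind_through_connection() {
    std::vector<std::string> log;
    try {
        Connection c(log);
        throw std::logic_error("original failure");
    } catch (const std::logic_error& e) {
        log.push_back(std::string("caught: ") + e.what());
    }
    return log;
}
```

The original exception still reaches its handler, and the cleanup failure is recorded rather than colliding with the unwind.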
In
managed C++, I’ve already mentioned that it’s very hard to identify regions
where exceptions cannot occur. The
ability to prevent (asynchronous and resource) exceptions over limited ranges of
code is something we would like to enable at some point in the future, but it
just isn’t practical in V1 and V1.1.
It’s way too easy for an out-of-memory or type-load or
class-initialization or thread-abort or appdomain-unload or similar exception to
intrude.
Finally,
it’s possible for exceptions to be thrown during execution of a filter. When this happens in an OS SEH context,
it results in the ExceptionNestedException situation that is mentioned in the
EXCEPTION_DISPOSITION enumeration.
The managed exception model took a different approach here. We’ve already seen that an MSVC filter
clause has three legal return values (resume execution, continue search, and
execute handler). If a managed
filter throws an exception, we contain that exception and consider the filter to
have replied “No, I don’t want to handle this one. Continue searching for a
handler”.
This is
a reasonable interpretation in all cases, but it falls out particularly well for
stack overflow. With the historical
OS support for stack overflow, it’s very hard to reliably execute backout
code. As I’ve mentioned in other
blogs, you may only have one 4K page of stack available for this purpose. If you blow that page, the process is
terminated. It’s very hard to
execute managed filters reliably within such a limited region. So a reasonable approach is to consider
the filters to have themselves thrown a StackOverflowException and for us to
interpret this as “No, I don’t want to handle this one.”
In a
future version, we would like to provide a more defensible and useful mechanism
for handling stack overflow from managed code.
Error Handling without Exceptions
So we’ve
seen how SEH and C++ and managed exceptions all interoperate. But not all error handling is based on
exceptions. When we consider
Windows, there are two other error handling systems that the CLR can
interoperate with. These are the
Get/SetLastError mechanism used by the OS and the HRESULT / IErrorInfo mechanism
used by COM.
Let’s
look at the GetLastError mechanism first, because it’s relatively simple. A number of OS APIs indicate failure by
returning a sentinel value. Usually
this sentinel value is -1 or 0 or 1, but the details vary depending on the
API. This sentinel value indicates
that the client can call GetLastError() to recover a more detailed OS status
code. Unfortunately, it’s sometimes
hard to know which APIs participate in the GetLastError protocol. Theoretically this information is always
documented in MSDN and is consistent from one version of the OS to the next –
including between the NT and Win95-based OSes.
The real
issue occurs when you PInvoke to one of these methods. The OS API latches any failure codes
with SetLastError. Now on the
return path of the PInvoke, we may be calling various OS services and managed
services to marshal the outbound arguments. We may be synchronizing with a pending
GC, which could involve a blocking operation like WaitForSingleObject. Somewhere in here, we may call another
OS API that itself latches an error code (or the absence of an error code)
through its own call to SetLastError.
So by
the time we return to some managed code that can generate a new PInvoke stub
to call GetLastError, you can be sure that the original error code is long
gone. The solution is to tag your
PInvoke declaration to indicate that it should participate in the GetLastError
protocol. This tells the PInvoke
call to capture the error as part of the return path, before any other OS calls
on this thread have an opportunity to erase it or replace it.
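The same hazard exists with any per-thread "last error" slot, so it can be sketched in portable C++ using errno as a stand-in. To be clear, this is only an analogy: no Win32 or CLR API appears below, and intervening_work is a made-up placeholder for the marshaling and GC synchronization that happen on the PInvoke return path.

```cpp
#include <cerrno>
#include <cstdlib>

// Stand-in for the work done on the return path. Like any library code,
// it is free to overwrite the thread's error slot.
void intervening_work() { errno = 0; }

struct CallResult {
    long value;
    int error;
};

// Unreliable: the error slot is read only after other work has run.
int error_read_late(const char* text) {
    errno = 0;
    std::strtol(text, nullptr, 10);  // sets errno to ERANGE on overflow
    intervening_work();
    return errno;                    // the original code is already gone
}

// Reliable: latch the error immediately after the call, before anything
// else on this thread can erase or replace it.
CallResult parse_and_capture(const char* text) {
    errno = 0;
    long value = std::strtol(text, nullptr, 10);
    int captured = errno;
    intervening_work();
    return {value, captured};
}
```

Tagging the PInvoke declaration asks the stub to do the equivalent of the second function on your behalf.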
This
protocol works well for PInvokes.
Unfortunately, we do not have a way to tag IJW VTFixup stubs in the same
way. So when you make managed ->
unmanaged calls via MC++ IJW, there isn’t a convenient and reliable way to
recover a detailed OS status code on the return path. Obviously this is something we would
like to address in some future version, though without blindly inflicting the
cost of a GetLastError on all managed -> unmanaged transitions through
IJW.
COM Error Handling
To
understand how the CLR interoperates with COM HRESULTs, we must first review how
PreserveSig is used to modify the behavior of PInvoke and COM
Interop.
Normally, COM signatures return an HRESULT error code. If the method needs to communicate some
other result, this is typically expressed with an [out, retval] outbound
argument. Of course, there are
exceptions to this pattern. For
example, IUnknown::AddRef and Release both return a count of the outstanding
references, rather than an HRESULT.
More importantly, HRESULTs can be used to communicate success codes as
well as error codes. The two most
typical success codes are S_OK and S_FALSE, though any HRESULT with the high bit
clear is considered a success code.
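Because the severity lives in the high bit, the success test is just a sign check on a 32-bit signed value. The sketch below uses a stand-in HResult typedef and k-prefixed constants rather than the real Windows headers, so it compiles anywhere. The numeric values are the standard ones for S_OK, S_FALSE and E_FAIL.

```cpp
#include <cstdint>

using HResult = std::int32_t;  // stand-in for the Windows HRESULT typedef

constexpr HResult kOk    = 0;                                  // S_OK
constexpr HResult kFalse = 1;                                  // S_FALSE
constexpr HResult kFail  = static_cast<HResult>(0x80004005u);  // E_FAIL

// High bit clear means success; high bit set means failure.
constexpr bool succeeded(HResult hr) { return hr >= 0; }
```

Note that S_FALSE counts as a success here, which is why a caller that needs to tell S_OK from S_FALSE must see the raw HRESULT.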
COM
Interop normally transforms the unmanaged signature to create a managed
signature where the [out, retval] argument becomes the managed return
value. If there is no [out,
retval], then the return type of the managed method is ‘void’. Then the COM Interop layer maps between
failure HRESULTs and managed exceptions.
Here’s a simple example:
COM: HRESULT GetValue([out, retval] IUnknown **ppRet)
CLR: IUnknown GetValue()
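The shape of that transformation can be sketched in portable C++. Everything below is hypothetical (raw_get_value, get_value, HResultError), and it says nothing about how the Interop layer is actually implemented. It only shows the [out, retval] argument becoming the return value and a failure HRESULT becoming an exception.

```cpp
#include <cstdint>
#include <stdexcept>

using HResult = std::int32_t;  // stand-in for the Windows HRESULT typedef

constexpr HResult kOk         = 0;                                  // S_OK
constexpr HResult kInvalidArg = static_cast<HResult>(0x80070057u);  // E_INVALIDARG

// Hypothetical exception type that carries the failing HRESULT.
struct HResultError : std::runtime_error {
    HResult code;
    explicit HResultError(HResult hr)
        : std::runtime_error("call failed with a failure HRESULT"), code(hr) {}
};

// COM-style shape: the status is the return value and the real result
// comes back through an out-parameter.
HResult raw_get_value(int key, int* result) {
    if (result == nullptr || key < 0) {
        return kInvalidArg;
    }
    *result = key * 2;
    return kOk;
}

// Transformed shape: the out-parameter becomes the return value and a
// failure HRESULT becomes an exception. Success codes other than S_OK are
// indistinguishable to the caller of this wrapper.
int get_value(int key) {
    int result = 0;
    HResult hr = raw_get_value(key, &result);
    if (hr < 0) {
        throw HResultError(hr);
    }
    return result;
}
```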
However,
the return value might be a DWORD-sized integer that should not be interpreted
as an HRESULT. Or it might be an
HRESULT – but one which must sometimes distinguish between different success
codes. In these cases, PreserveSig
can be specified on the signature and it will be preserved on the managed side
as the traditional COM signature.
Of
course, the same can happen with PInvoke signatures. Normally a DLL export like Ole32.dll’s
CoGetMalloc would have its signature faithfully preserved. Presumably the transformation would be
something like this:
DLL: HRESULT CoGetMalloc(DWORD c, [out, retval] IMalloc **ppRet)
CLR: DWORD CoGetMalloc(DWORD c, ref IMalloc ppRet)
If OLE32
returns some sort of failure HRESULT from this call, it will be returned to the
managed caller. If instead the
application would prefer to get this error case automatically converted to a
managed Exception, it can use PreserveSig to indicate this.
Huh? In the COM case
PreserveSig means “give me the unconverted HRESULT signature”, but in the
PInvoke case PreserveSig means “convert my HRESULTs into exceptions.” Why would we use the same flag to
indicate exactly opposite semantic | |
|