9fans archive / 1997 / 04 / 76

From: presotto@pla...
Subject: how the pentium pro really works
Date: Tue, 22 Apr 1997 15:35:29 -0400

Here's 3 messages from Mike Haertel describing how the Pentium Pro works vis a vis synchronization. As he forcefully points out, the problem isn't speculative loads, it's just queued stores. As expected, surrounding shared accesses with spin locks is sufficient. Only iffy operations like our current version of sleep/wakeup have to be more carefully handled.

An interesting point is that the same model exists on the Pentium. However, the shorter pipelines and buffers in the Pentium are less likely to exacerbate the problem. We were just lucky.

===================================

>From ducky.net!mike Tue Apr 22 04:19:03 EDT 1997
To: research.bell-labs.com!presotto
Subject: Pentium Pro and coherence
Date: Tue, 22 Apr 1997 01:15:56 -0700
From: Mike Haertel <ducky.net!mike>

In article <199704211614.MAA02731@cse...>, you wrote:
>The Pro people have remained silent
>on the subject (we've sent email).

Hi, I am an architect at Intel. Who did you send email to? I'm surprised you got no response. In any case, perhaps I can clarify things a little.

>Of course, I could be totally wrong about the speculative reads and
>it may be the interlock instruction on the writer and not the
>reader that causes the processors to become coherent.

The caches are always coherent using an MESI protocol. The real problem is that not all written data in the system is in the cache(s).

The PPro's memory ordering model is called "processor ordering" and is a formalization of the 486's semantics. The 486 had a write-through cache with write queue to memory which was not snooped by loads on other processors. Loosely speaking, this means the ordering of events originating from any one processor in the system, as observed by other processors, is always the same. However, different observers are allowed to disagree on the interleaving of events from two or more processors.

The PPro does speculative and out-of-order loads. However, it has a mechanism called the "memory order buffer" to ensure that the above memory ordering model is not violated. Load and store instructions do not get retired until the processor can prove there are no memory ordering violations in the actual order of execution that was used. Stores do not get sent to memory until they are ready to be retired. If the processor detects a memory ordering violation, it discards all unretired operations (including the offending memory operation) and restarts execution at the oldest unretired instruction. I.e. when a violation is detected the MOB whacks the machine... :-)

For example, consider the following sequence:

	P1:				P2:
	load (1000) -> reg		store 10 -> (1000)
	load (1000) -> reg		store 20 -> (1000)

Suppose on P1, the 2nd load speculatively executes first (for whatever reason), and picks up 10 (the result of the first store on P2). Later, P2 executes the 2nd store (causing the cached copy of location 1000 on P1 to be invalidated), and finally P1 executes the 1st load. At this point, P1 discovers that a younger load has already read from the same location, and that the location was subsequently invalidated by P2. P1 says "a-ha! that violates the memory ordering model!", clobbers the speculative state of the machine from the offending instruction (the 1st load) onward, and resumes execution starting at the offending load.

Serializing instructions like CPUID force the machine to wait until all queued stores have been written out. (Actually, serializing instructions force the machine to wait until they are retired, but they cannot retire until all older stores have retired, which has an effect equivalent to draining a store queue.) Note that serializing instructions do not serialize the other processors, only the local processor.

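[Illustration added to this archive copy, not part of the original mail.] The two-load example above can be replayed in a toy model. This is only a sketch under loud assumptions: location 1000 is assumed to start at 0, a store by P2 is assumed to invalidate P1's cached copy the moment it appears in the schedule, and the replay is assumed to re-execute the younger load immediately. The function name `retire_p1_loads` and the schedule encoding are hypothetical; the real memory order buffer tracks far more state than this.

```python
def retire_p1_loads(schedule):
    """Return the (1st load, 2nd load) values that P1 finally retires.

    schedule is an ordered list of events:
      "load1" / "load2"  -- P1 executes its 1st / 2nd load of location 1000
      ("store", value)   -- a store by P2 becomes visible (invalidating P1's copy)
    """
    memory = 0            # assumed initial contents of location 1000
    load1 = load2 = None  # values currently held by P1's two loads
    invalidated = False   # P2 stored after load2 ran ahead of load1
    for event in schedule:
        if event == "load2":
            load2 = memory
        elif event == "load1":
            load1 = memory
            if load2 is not None and invalidated:
                # Ordering violation: the younger load saw an older value.
                # Squash it and replay it after the 1st load.
                load2 = memory
        else:
            memory = event[1]
            if load2 is not None and load1 is None:
                invalidated = True
    return load1, load2


if __name__ == "__main__":
    # The schedule from the text: 2nd load runs early, P2 stores again,
    # then the 1st load executes and triggers the replay.
    speculative = [("store", 10), "load2", ("store", 20), "load1"]
    print(retire_p1_loads(speculative))
```

Without the replay step the speculative schedule would retire 20 for the 1st load and 10 for the 2nd, a pair that no in-order execution of P1 could produce.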
You should be able to reproduce your bug by manually working through the possible processor-ordering-consistent interleavings of events from multiple processors. Note that you should think of a processor as also observing itself.

Finally, since the caches are actually fully coherent, you should be able to do correct locking without too many serializing instructions, perhaps without any.

Future Intel processors will implement the same memory ordering model.

===================================

>From ducky.net!mike Tue Apr 22 12:25:03 EDT 1997
To: research.bell-labs.com!presotto
Subject: Re: Pentium Pro and coherence
Date: Tue, 22 Apr 1997 09:24:42 -0700
From: Mike Haertel <ducky.net!mike>

>0,0 blows us away. If I understand correctly, putting a
>synchronizing instruction between the writes and subsequent read
>	P1:		P2:
>	x = 0		y = 0
>	x = 1		y = 1
>	cpuid		cpuid
>	read y		read x
>will cause the processor the instruction was executed on
>to wait until all processors have gotten out their
>queued stores and then blow away any inconsistencies
>caused by speculative loads.

The cpuid waits only until the *local* processor has gotten out its queued stores. It doesn't wait for any of the other processors.

However, in this example (where all processors do cpuid before any processor does a load) I think you're OK. The cpuid forces the local processor to wait until its queued writes have been globally observed. What this means is that you are effectively serializing access to "the bus" (really, the combination of the bus and the coherent caches--writes to M-state cache lines on the local processor count as "globally observed").

Some processor (say P2) is last to execute cpuid. This means that P1 has already executed cpuid, therefore P1's "x=1" has been globally observed, so P2's load is guaranteed to see x=1.

Finally, I'd like to emphasize: The inconsistencies are NOT caused by speculative loads, they are caused by queued writes on other processors.

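[Illustration added to this archive copy, not part of the original mail.] Working through the interleavings by hand, as suggested above, can also be done by brute force. The sketch below assumes each processor has a FIFO store queue that drains to shared memory at arbitrary moments, that memory starts at 0, and that a load checks the local queue before memory. cpuid is modelled as a hypothetical `drain` step that cannot complete until the local queue is empty. The names `outcomes` and `litmus` are invented for this example.

```python
def outcomes(programs):
    """Enumerate every interleaving and return the set of load results.

    programs holds one instruction list per processor. Instructions are
    ("store", addr, value), ("load", addr) and ("drain",).
    """
    n = len(programs)
    # State: program counters, store queues, memory, values loaded so far.
    start = ((0,) * n, ((),) * n, (), ((),) * n)
    seen, stack, results = {start}, [start], set()

    def put(tup, i, item):
        return tup[:i] + (item,) + tup[i + 1:]

    while stack:
        pcs, queues, mem, loaded = stack.pop()
        if all(pcs[i] == len(programs[i]) for i in range(n)):
            results.add(loaded)
        successors = []
        for i in range(n):
            queue = queues[i]
            if queue:
                # The oldest queued store becomes globally visible.
                addr, value = queue[0]
                new_mem = dict(mem)
                new_mem[addr] = value
                successors.append((pcs, put(queues, i, queue[1:]),
                                   tuple(sorted(new_mem.items())), loaded))
            if pcs[i] == len(programs[i]):
                continue
            op = programs[i][pcs[i]]
            next_pcs = put(pcs, i, pcs[i] + 1)
            if op[0] == "store":
                successors.append((next_pcs, put(queues, i, queue + (op[1:],)),
                                   mem, loaded))
            elif op[0] == "load":
                # Newest matching store in the local queue wins, else memory.
                value = dict(mem).get(op[1], 0)
                for addr, queued in queue:
                    if addr == op[1]:
                        value = queued
                successors.append((next_pcs, queues, mem,
                                   put(loaded, i, loaded[i] + (value,))))
            elif op[0] == "drain" and not queue:
                # Stand-in for cpuid: retires only once local stores are out.
                successors.append((next_pcs, queues, mem, loaded))
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return results


def litmus(drain):
    """Run the two-processor test, optionally with a drain before each load."""
    fence = [("drain",)] if drain else []
    p1 = [("store", "x", 0), ("store", "x", 1)] + fence + [("load", "y")]
    p2 = [("store", "y", 0), ("store", "y", 1)] + fence + [("load", "x")]
    # Each result is (y as read by P1, x as read by P2).
    return sorted((a[0], b[0]) for a, b in outcomes([p1, p2]))


if __name__ == "__main__":
    print("without drain:", litmus(False))
    print("with drain   :", litmus(True))
```

In this model the run without the drain step includes the 0,0 result, and the run with it does not, which matches the argument that the queued stores, not speculative loads, are what matter.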
>What we need is that if the following sequence is executed
>
>	P1:		P2:
>	x = 0		y = 0
>	x = 1		y = 1
>	read y		read x
>
>the values read will be one of
>
>	1 0
>	0 1
>	1 1
>
>0,0 blows us away.

You could get 0,0 even on the 486 or Pentium. The difference is that the PPro has such deep pipelines and buffers that it is more likely to expose such bugs.

===================================

>From ducky.net!mike Tue Apr 22 14:05:04 EDT 1997
To: research.bell-labs.com!presotto
Date: Tue, 22 Apr 1997 11:01:30 -0700
From: Mike Haertel <ducky.net!mike>

>Do you mind if I repost your mail to the 9fans
>list?

Sure, go ahead.

One other addendum I'd like to make: in your original post to 9fans, you mentioned some paranoia about similar problems possibly existing in other parts of the kernel. One bit of reassurance: any data structure protected by a spin lock is safe. Here's why:

	P1				P2
	[already holding lock]		wait for lock->busy == 0
	store data->x			grab lock
	store data->y			use data->x and ->y
	lock->busy = 0

Because of processor ordering, when P2 observes lock->busy == 0, it also has observed all prior stores by P1. Hence P2 never gets an inconsistent view of P1's updates.

This would not be the case if the PPro allowed speculative loads to violate processor ordering semantics.

This is also probably not the case on other processors with weaker memory ordering semantics. Digital's Alpha may be one such processor, not sure. On those processors, when releasing a spin lock you need a "lock release" synchronization instruction rather than a simple store.
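
[Illustration added to this archive copy, not part of the original mail.] The spin-lock argument can be checked with a small model. It is a sketch under stated assumptions: P1's three stores sit in a store queue, P2's loads happen in program order, and "weaker ordering" is modelled simply as the queued stores becoming visible in any order. That is a made-up stand-in, not a description of the Alpha or any other real processor. P2's own grab of the lock is left out, and the name `reader_views` is hypothetical.

```python
from itertools import permutations

STORES = (("x", 1), ("y", 1), ("busy", 0))  # P1's stores in program order
INITIAL = {"x": 0, "y": 0, "busy": 1}       # lock held, data not yet written


def reader_views(fifo):
    """Return every (x, y) pair P2 can read after it has seen busy == 0.

    fifo=True  : stores become visible in program order (processor ordering).
    fifo=False : stores may become visible in any order (assumed weak model).
    """
    orders = [STORES] if fifo else list(permutations(STORES))
    views = set()
    for order in orders:
        # memory[k] is shared memory after k of P1's stores are visible.
        memory = [dict(INITIAL)]
        for addr, value in order:
            memory.append({**memory[-1], addr: value})
        steps = range(len(memory))
        for t_busy in steps:
            if memory[t_busy]["busy"] != 0:
                continue  # P2 is still spinning on the lock
            # P2 then loads x and y, in that order, at any later moments.
            for t_x in steps[t_busy:]:
                for t_y in steps[t_x:]:
                    views.add((memory[t_x]["x"], memory[t_y]["y"]))
    return views


if __name__ == "__main__":
    print("in-order stores :", sorted(reader_views(True)))
    print("any-order stores:", sorted(reader_views(False)))
```

With in-order visibility the only view P2 can get is the fully updated one. Once the stores may pass each other, stale values of x or y show up, which is where a release-style synchronization instruction would be needed.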