Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

How is the Write-Combine buffer physically hooked up? I have seen block diagrams illustrating a number of variants:

  • Between L1 and Memory controller
  • Between CPU's store buffer and Memory controller
  • Between CPU's AGUs and/or store units

Is it microarchitecture-dependent?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
364 views
Welcome To Ask or Share your Answers For Others

1 Answer

Write buffers can have different purposes or different uses in different processors. This answer may not apply to processors not specifically mentioned. I'd like to emphasis that the term "write buffer" may mean different things in different contexts. This answer is about Intel and AMD processors only.

Write-Combining Buffers on Intel Processors

Each cache might be accompanied with zero or more line fill buffers (also called fill buffers). The collection of fill buffers at L2 are called the super queue or superqueue (each entry in the super queue is a fill buffer). If the cache is shared between logical cores or physical cores, then the associated fill buffers are shared as well between the cores. Each fill buffer can hold a single cache line and additional information that describes the cache line (if it's occupied) including the address of the cache line, the memory type, and a set of validity bits where the number of bits depends on the granularity of tracking the individual bytes of the cache line. In early processors (such as Pentium II), only one of the fill buffers is capable of write-combining (and write-collapsing). The total number of line buffers and those capable of write-combing has increased steadily with newer processors.

Nehalem up to Broadwell include 10 fill buffers at each L1 data cache. Core and Core2 have 8 LFBs per physical core. According to this, there are 12 LFBs on Skylake. @BeeOnRope has observed that there are 20 LFBs on Cannon Lake. I could not find a clear statement in the manual that says LFBs are the same as WCBs on all of these microarchitectures. However, this article written by a person from Intel says:

Consult the Intel? 64 and IA-32 Architectures Optimization Reference Manual for the number of fill buffers in a particular processor; typically the number is 8 to 10. Note that sometimes these are also referred to as "Write Combining Buffers", since on some older processors only streaming stores were supported.

I think the term LFB was first introduced by Intel with the Intel Core microarchitecture, on which all of the 8 LFBs are WCBs as well. Basically, Intel sneakily renamed WCBs to LFBs at that time, but did not clarify this in their manuals since then.

That same quote also says that the term WCB was used on older processors because streaming loads were not supported on them. This could be interpreted as the LFBs are also used by streaming load requests (MOVNTDQA). However, Section 12.10.3 says that streaming loads fetch the target line into buffers called streaming load buffers, which are apparently physically different from the LFBs/WCBs.

A line fill buffer is used in the following cases:

(1) A fill buffer is allocated on a load miss (demand or prefetch) in the cache. If there was no fill buffer available, load requests keep piling up in the load buffers, which may eventually lead to stalling the issue stage. In case of a load request, the allocated fill buffer is used to temporarily hold requested lines from lower levels of the memory hierarchy until they can be written to the cache data array. But the requested part of the cache line can still be provided to the destination register even if the line has not yet been written to the cache data array. According to Patrick Fay (Intel):

If you search for 'fill buffer' in the PDF you can see that the Line fill buffer (LFB) is allocated after an L1D miss. The LFB holds the data as it comes in to satisfy the L1D miss but before all the data is ready tobe written to the L1D cache.

(2) A fill buffer is allocated on a cacheable store to the L1 cache and the target line is not in a coherence state that allows modifications. My understanding is that for cacheable stores, only the RFO request is held in the LFB, but the data to be store waits in the store buffer until the target line is fetched into the LFB entry allocated for it. This is supported by the following statement from Section 2.4.5.2 of the Intel optimization manual:

The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.

This suggests that cacheable stores are not committed to the LFB if the target line is not in the L1D. In other words, the store has to wait in the store buffer until either the target line is written into the LFB, and then the line is modified in the LFB, or the target line is written into the L1D, and then the line is modified in the L1D.

(3) A fill buffer is allocated on a uncacheable write-combining store in the L1 cache irrespective of whether the line is in the cache or its coherence state. WC stores to the same cache line can be combined and collapsed (multiple writes to the same location in the same line will make the last store in program order overwrite previous stores before they become globally observable) in a single LFB. Ordering is not maintained among the requests currently allocated in LFBs. So if there are two WCBs in use, there is no guarantee which will be evicted first, irrespective of the order of stores with respect to program order. That's why WC stores may become globally observable out of order even if all stores are retired committed in order (although the WC protocol allows WC stores to be committed out of order). In addition, WCBs are not snooped and so only becomes globally observable when they reach the memory controller. More information can be found in Section 11.3.1 in the Intel manual V3.

There are some AMD processors that use buffers that are separate from the fill buffers for non-temporal stores. There were also a number of WCB buffers in the P6 (the first to implement WCBs) and P4 dedicated for the WC memory type (cannot be used for other memory types). On the early versions of P4, there are 4 such buffers. For the P4 versions that support hyperthreading, when hyperthreading is enabled and both logical cores are running, the WCBs are statically partitioned between the two logical cores. Modern Intel microarchitectures, however, competitively share the all the LFBs, but I think keep at least one available for each logical core to prevent starvation.

(4) The documentation of L1D_PEND_MISS.FB_FULL indicates that UC stores are allocated in the same LFBs (irrespective of whether the line is in the cache or its coherence state). Like cacheable stores, but unlike WC, UC stores are not combined in the LFBs.

(5) I've experimentally observed that requests from IN and OUT instructions are also allocated in LFBs. For more information, see: How do Intel CPUs that use the ring bus topology decode and handle port I/O operations.

Additional information:

The fill buffers are managed by the cache controller, which is connected to other cache controllers at other levels (or the memory controller in case of the LLC). A fill buffer is not allocated when a request hits in the cache. So a store request that hits in the cache is performed directly in the cache and a load request that hits in the cache is directly serviced from the cache. A fill buffer is not allocated when a line is evicted from the cache. Evicted lines are written to their own buffers (called writeback buffers or eviction buffers). Here is a patent from Intel that discusses write combing for I/O writes.

I've run an experiment that is very similar to the one I've described here to determine whether a single LFB is allocated even if there are multiple loads to the same line. It turns out that that is indeed the case. The first load to a line that misses in the write-back L1D cache gets an LFB allocated for it. All later loads to the same cache line are blocked and a block code is written in their corresponding load buffer entries to indicate that they are waiting on the same request being held in that LFB. When the data arrives, the L1D cache sends a wake-up signal to the load buffer and all entries that are waiting on that line are woken up (unblocked) and scheduled to be issued to the L1D cache when at least one load port is available. Obviously the memory scheduler has to choose between the unblocked loads and the loads that have just been dispatched from the RS. If the line got evicted for whatever reason before all waiting loads get the chance to be serviced, then they will be blocked again and an LFB will be again allocated for that line. I've not tested the store case, but I think no matter what the operation is, a single LFB is allocated for a line. The request type in the LFB can be promoted from prefetch to demand load to speculative RFO to demand RFO when required. I also found out empirically that speculative requests that were issued from uops on a mispredicted path are not removed when flushing the pipeline. They might be demoted to prefetch requests. I'm not sure.

Write-Combining Buffers on AMD Processors

I mentioned before according to an article that there are some AMD processors that use buffers that are separate from the fill buffers for non-temporal stores. I quote from the article:

On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses.

The "on the older AMD processors" part got me curious. Did this change on newer AMD processors? It seems to me that this is still true on all newer AMD processors including the most recent Family 17h Processors (Zen). The WikiChip article on the Zen mircoarchitecture includes two figures that mention WC buffers: this and this. In the first figure, it's not clear how the WCBs are used. However, in the second one it's clear that the WCBs shown are indeed specifically used for NT writes (there is no connection between the WCBs and the L1 data cache). The source for the second figure seems to be these slides1. I think that the first figure was made by WikiChip (which


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...