Spring 2014 CS-590.26 Lecture H Bruce Jacob University of Crete

#### CS-590.26, Spring 2014

#### High-Speed Memory Systems: Architecture and Performance Analysis

## Coherence and Consistency

#### The Problem is Multi-Fold

Cache Consistency (taken from web-cache community): In the presence of a cache, reads and writes behave (to a first order) no differently than if the cache were not there.

#### Three main issues:

- Consistent with backing store
- Consistent with self
- Consistent with other clients of same backing store

#### Consistency w/ Backing Store

For example, write-through vs. write-back

#### Consistency w/ Self

- Virtual cache synonym problem & hardware solutions
- Operating system solutions to aliasing problem
- Segmentation as a solution to the aliasing problem
- **ASID** remapping

#### Consistency w/ Other Clients

i.e. Cache Coherence & various Consistency Models

First, a look at some of the things that can go wrong, just inside a SINGLE CHIP:

```
/* Process B (producer) */
global char data[SIZE];
global int ready=0;
int fd = open("dev A");
while (1) {
    while (ready)
        ;                    /* wait until consumer has taken the data */
    dma(fd, data, SIZE);
    ready = 1;
}

/* Process C (consumer) */
global char data[SIZE];
global int ready=0;
while (1) {
    while (!ready)
        ;                    /* wait until producer has filled the buffer */
    process(data);
    ready = 0;
}
```

#### Problem: causal relationships

#### Problem scales with the system size

#### Solve a system of linear equations: $x_{i+1} = Ax_i + b$

```
while (!converged) {
    doparallel(N) {
        int i = myid();
        xtemp[i] = b[i];
        for (j=0; j<N; j++) {
            xtemp[i] += A[i,j] * x[j];
        }
    }
    /* implicit barrier sync */
    doparallel(N) {
        int i = myid();
        x[i] = xtemp[i];
    }
}
```

#### Some Consistency Models

**Strict Consistency:** A read operation shall return the value written by the most recent store operation.

**Sequential Consistency:** The result of an execution is the same as a single interleaving of sequential, program-order accesses from different processors.

**Processor Consistency:** Writes from a process are observed by other clients to be in program order; all clients observe a single interleaving of writes from different processors.
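A minimal sketch of the sequential-consistency definition above, assuming a hypothetical two-process test that is not from the slides (each process does one store, then one load). It enumerates every interleaving that preserves each process's program order and collects the results a sequentially consistent memory could return. The names `P1`, `P2`, `interleavings`, `run`, and `sc_outcomes` are illustrative only.

```python
from itertools import combinations

# Hypothetical test programs (an assumption, not taken from the lecture):
# each process performs one store followed by one load, in program order.
P1 = [("store", "x", 1), ("load", "y", "r1")]
P2 = [("store", "y", 1), ("load", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        # Positions in `slots` take the next op from a, the rest from b.
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(schedule):
    """Execute one interleaving against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, addr, arg in schedule:
        if op == "store":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]

def sc_outcomes():
    """Set of (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(P1, P2)}

print(sorted(sc_outcomes()))
```

In this sketch the result `(0, 0)` never appears: for both loads to return 0, each load would have to precede the other process's store, which no single program-order interleaving permits. A system that lets reads bypass buffered writes can produce it, which is why such a system is not sequentially consistent.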
#### Strict Consistency

Fails to satisfy strict consistency:

#### Sequential Consistency

Fails to satisfy strict consistency, but satisfies sequential consistency:

[Figure: A writes 1]

Handles our earlier problem.

Note: for this to work, memory controller may reorder internally, but not externally

#### Requirements:

- Everyone can reorder internally but not externally
- All I/O & memory references must go through the same sync point (e.g. memory-mapped I/O)
- Write of data and driver signal must come from the same client
- Write buffering presents significant problems
- Reads must be delayed by system latency
- ... let's look at this last one more closely

#### Really Famous Example (Goodman 1989)

#### Process P1:

#### Process P2:

```
Initially, B=0

B=1;
if (A==0) {
    kill P1;
}
```

Sequential Consistency allows 0 or 1 processes to die (not both)

#### Race-Condition Example

Fails to satisfy sequential consistency:

Satisfies sequential consistency:

#### In practice:

- speculate
- throw exception if problem occurs

#### HOWEVER, from Jaleel & Jacob [HPCA 2005]:

- increasing the reorder buffer from 80 to 512 entries results in an increase in memory traps by 6x and an increase in total execution overhead by 10–40%
- reordering memory instructions increases L1 data cache accesses by 10–60% and L1 data cache misses by 10–20%

#### Processor Consistency

Also called Total Store Order

All writes in program order, reads freely reordered

Both of these scenarios are satisfied in this model:

#### Some Other Consistency Models

**Partial store order:** a processor can freely reorder local writes with respect to other local writes

**Weak consistency:** a processor can freely reorder local writes ahead of local reads

**Release consistency:** different classes of synchronization ... enforces synchronization only w.r.t. *acquire/release* operations. On *acquire*, memory system updates all protected variables before continuing; on *release*, memory system propagates changes to the protected variables out to the rest of the system

#### Cache Coherence Schemes

#### Ways to implement a consistency model:

- in software (e.g.
in virtual memory system, via page table)
- in hardware
- combine hardware & software

The hardware component is called "cache coherence"

#### Coherence Implementations

#### Cache-Block States:

- I (Invalid)
- M (Modified): read-writable, forwardable, dirty
- S (Shared): read-only (can be clean or dirty)
- E (Exclusive): read-writable, clean
- O (Owned): read-only, forwardable, dirty

#### Coherence Implementations: SI

Works with write-through caches

Block is either present (Shared) or not (Invalid)

Problem: Nobody wants to use write-through caches

Both schemes require broadcast or multicast of coherence information and/or write data

Note: write-update and sequential consistency don't play nice together

#### Coherence Implementations: MSI

Write-back caches: dirty bit (Modified state)

Problem: when the app reads data and then writes, sends a second broadcast

[State-diagram label: xmit write miss]

#### Coherence Implementations: MESI

**Reduces write broadcasts**

Problem: When you ask for a block, potentially many clients may respond

[State-diagram labels: write hit; Modified; bus read-miss / send cached data]

#### Coherence Implementations: MESIF

Shared broken into two: Shared (1+) and Forwardable (1)

Compare MESI (left) vs. MESIF (right):

#### Coherence Implementations: MOESI

#### MESI vs MOESI (AMD) vs MESIF (Intel)

MESIF:

| | Clean/Dirty | Unique? | Can Write? | Can Forward? | Can Silent Transition to | Comments |
|------------|-------------|---------|------------|--------------|--------------------------|---------------------------------------|
| Modified | Dirty | Yes | Yes | Yes | | Must writeback to share or replace |
| Exclusive | Clean | Yes | Yes | Yes | MSIF | Transitions to M on write |
| Shared | Clean | No | No | No | | Does not forward |
| Invalid | NA | NA | NA | NA | | Cannot Read |
| Forwarding | Clean | Yes | No | Yes | SI | Must invalidate other copies to write |

MOESI:

| | Clean/Dirty | Unique? | Can Write? | Can Forward? | Can Silent Transition to | Comments |
|-----------|-------------|---------|------------|--------------|--------------------------|------------------------------|
| Modified | Dirty | Yes | Yes | Yes | O | Can share without writeback |
| Owned | Dirty | Yes | Yes | Yes | | Must writeback to transition |
| Exclusive | Clean | Yes | Yes | Yes | MSI | Transitions to M on write |
| Shared | Either | No | No | No | I | Shared can be dirty or clean |
| Invalid | NA | NA | NA | NA | | Cannot Read |

MESI:

| | Clean/Dirty | Unique? | Can Write? | Can Forward? | Can Silent Transition to | Comments |
|-----------|-------------|---------|------------|--------------|--------------------------|------------------------------------|
| Modified | Dirty | Yes | Yes | Yes | | Must writeback to share or replace |
| Exclusive | Clean | Yes | Yes | Yes | MSI | Transitions to M on write |
| Shared | Clean | No | No | Yes | | Shared implies clean, can forward |
| Invalid | NA | NA | NA | NA | | Cannot Read |

#### Some System Configurations

#### Directory-Based Protocols

Can run on any configuration; the main idea is to eliminate the need to broadcast every coherence event

- Each memory block has a directory entry
- P+1 bits per entry, where P is the number of processors
- One dirty bit per directory entry
- If dirty bit is on then only one presence bit can be on
- Nodes only communicate with other nodes that have the memory block

#### Flavors:

- Centralized Directory, Centralized Memory: poor scalability, but better than bus-based ...
- Decentralized Directory, Distributed Memory: each node stores a small piece of the entire directory, corresponding to the memory resident at that node. Directory queries are sent to the node that the address of the block corresponds to (i.e. its Home Node).
- Clustered: presence bits are coarse-grained

#### Problem with Scalability

What if latency to other procs > latency to local DRAM?
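The directory entry described under Directory-Based Protocols (presence bits plus one dirty bit per memory block) can be sketched as follows. This is an illustration under stated assumptions, not the protocol from the lecture: it assumes a simple invalidate-based scheme in which a read of a dirty block makes the owner write back and stay on as a sharer. The class and method names are hypothetical.

```python
class DirectoryEntry:
    """One entry per memory block: P presence bits plus one dirty bit."""

    def __init__(self, num_procs):
        self.presence = [False] * num_procs
        self.dirty = False

    def sharers(self):
        """Processors whose presence bit is set."""
        return [p for p, bit in enumerate(self.presence) if bit]

    def read_miss(self, proc):
        """Return the nodes that must be contacted to serve a read."""
        # Only a dirty block needs a remote node: its single owner.
        contacted = [p for p in self.sharers() if p != proc] if self.dirty else []
        self.dirty = False            # assumed: owner writes back, keeps a copy
        self.presence[proc] = True
        return contacted

    def write_miss(self, proc):
        """Return the nodes that must invalidate their copies."""
        contacted = [p for p in self.sharers() if p != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.dirty = True
        return contacted

    def invariant_holds(self):
        """If the dirty bit is on, exactly one presence bit may be on."""
        return (not self.dirty) or sum(self.presence) == 1


entry = DirectoryEntry(num_procs=4)
entry.read_miss(0)
entry.read_miss(2)
print(entry.write_miss(1))   # only the nodes holding the block are contacted
```

The point of the sketch is the return values: each miss contacts only the nodes whose presence bits are set, rather than broadcasting to all P processors.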