<div>
<blockquote>This page is quoted from <a href="http://www.vgleaks.com/durango-cpu-overview/">http://www.vgleaks.com/durango-cpu-overview/</a>.</blockquote>
</div>
<div class="body-wrapper">
<div class="container-wrapper">
<div class="content-wrapper main container">
<div class="page-wrapper single-blog single-sidebar left-sidebar">
<div class="row">
<div class="gdl-page-left twelve columns">
<div class="row">
<div class="gdl-page-item mb20 gdl-blog-full eight columns">
<h1 class="blog-title"><a href="http://www.vgleaks.com/durango-cpu-overview/">Durango CPU Overview</a></h1>
<div class="blog-content-wrapper">
<div class="blog-content">
<blockquote>The <a href="http://www.vgleaks.com/durango/" title="Durango">Durango</a> CPU brings a host of modern micro-architectural
performance features to console development. With Durango, a familiar
instruction set architecture and high performance silicon mean developers can
focus effort on content and features, not micro-optimization. The trend towards
more parallel power continues in this hardware; so, an effective strategy for
multi-core computing is more important than ever.</blockquote>
<blockquote>
<h2>Architectural Overview</h2>
<p>The Durango CPU is structured as two modules. A module contains four x64
cores, each running a single thread at 1.6 GHz. Each core contains a 32 KB
instruction cache (I-cache) and a 32 KB data cache (D-cache), and the 4 cores
in each module share a 2 MB level 2 (L2) cache. In total, the modules have 8
hardware threads and 4 MB of L2. The architecture is little-endian.</p>
</blockquote>
<p><img width="915" height="608" title="Durango CPU Overview" src="http://www.vgleaks.com/wp-content/uploads/2013/02/cpu.jpg" alt="cpu Durango CPU Overview" class="alignnone size-full wp-image-1832" /></p>
<blockquote>
<p>Four cores communicate with the module’s L2 via the L2 Interface (L2I), and
with the other module and the rest of the system (including main RAM) via the
Core Communication Interface (CCI) and the North Bridge.</p>
</blockquote>
<blockquote>
<h2>Caches</h2>
<p>The caches can be summarized as shown in the following table.</p>
<table width="659" cellspacing="0" cellpadding="0" border="0"><tbody><tr><td width="61" valign="top"><b>Cache</b></td>
<td width="230" valign="top"><b>Policy</b></td>
<td width="57" valign="top"><b>Ways</b></td>
<td width="76" valign="top"><b>Set Size</b></td>
<td width="85" valign="top"><b>Line Size</b></td>
<td width="151" valign="top"><b>Sharing</b></td>
</tr><tr><td width="61" valign="top"><i>L1 I</i></td>
<td width="230" valign="top">Read only</td>
<td width="57" valign="top">2</td>
<td width="76" valign="top">256</td>
<td width="85" valign="top">64 bytes</td>
<td width="151" valign="top">Dedicated to 1 core</td>
</tr><tr><td width="61" valign="top"><i>L1 D</i></td>
<td width="230" valign="top">Write-allocate, write-back</td>
<td width="57" valign="top">8</td>
<td width="76" valign="top">64</td>
<td width="85" valign="top">64 bytes</td>
<td width="151" valign="top">Dedicated to 1 core</td>
</tr><tr><td width="61" valign="top"><i>L2</i></td>
<td width="230" valign="top">Write-allocate, write-back, inclusive</td>
<td width="57" valign="top">16</td>
<td width="76" valign="top">2048</td>
<td width="85" valign="top">64 bytes</td>
<td width="151" valign="top">Shared by module</td>
</tr></tbody></table><p> </p>
</blockquote>
<blockquote>
<p>The 4 MB of L2 cache is split into two parts, one in each module. On an L2
miss from one module, the hardware checks if the required line is resident in
the other module—either in its L2 only, or any of its cores’ L1 caches.
Checking and retrieving data from the other module’s caches is quicker than
fetching it from main memory, but this is still much slower than fetching it
from the local L1 or L2. This makes choice of core and module very important
for processes that share data.</p>
<table width="659" cellspacing="0" cellpadding="0" border="0"><tbody><tr><td width="196" valign="top"><b>Memory access result</b></td>
<td width="66" valign="top"><b>Cycles</b></td>
<td width="397" valign="top"><b>Notes</b></td>
</tr><tr><td width="196" valign="top"><i>L1 hit</i></td>
<td width="66" valign="top">3</td>
<td width="397" valign="top">Required line is in this core’s L1</td>
</tr><tr><td width="196" valign="top"><i>L2 hit</i></td>
<td width="66" valign="top">17</td>
<td width="397" valign="top">Required line is in this module’s L2</td>
</tr><tr><td width="196" valign="top"><i>Remote L2 hit, remote L1 miss</i></td>
<td width="66" valign="top">100</td>
<td width="397" valign="top">Required line is in the other module’s L2</td>
</tr><tr><td width="196" valign="top"><i>Remote L2 hit, remote L1 hit</i></td>
<td width="66" valign="top">120</td>
<td width="397" valign="top">Required line is in the other module’s L2 & in
remote core’s L1</td>
</tr><tr><td width="196" valign="top"><i>Local L2 miss, remote L2 miss</i></td>
<td width="66" valign="top">144-160</td>
<td width="397" valign="top">Required line is not resident in any cache; load
from memory</td>
</tr></tbody></table><p> </p>
</blockquote>
<blockquote>
<p>Both L1 and L2 caches have hardware prefetchers that automatically predict
the next line required, based on the stream of load/store addresses generated
so far. The prefetchers can derive negative and positive strides from multiple
address sequences, and can make a considerable difference to performance. While
the x64 instruction set has explicit cache control instructions, in many
situations the prefetcher removes the need to manually insert these.</p>
</blockquote>
<blockquote>
<p>The Durango CPU does not support line or way locking in either L1 or L2, and
has no L3 cache.</p>
</blockquote>
<blockquote>
<p>This document does not cover memory paging or translation lookaside buffers
(TLBs) on the cores.</p>
</blockquote>
<blockquote>
<h2>Instruction Set Architecture</h2>
<p>The cores execute the x64 instruction set (also known as x86-64 or AMD64);
this instruction set will be familiar to developers working on AMD or Intel
based architectures, including that of desktop computers running Windows. x64
is a 64-bit extension to 32-bit x86, which is a complex instruction set
computer (CISC) with register-memory operations, variable-length instructions, and a long
history of binary backward compatibility; that is, some instruction encodings
have not changed since the 16-bit Intel 8086.</p>
</blockquote>
<blockquote>
<p>The x64 architecture requires SSE2 support, and Visual Studio makes
exclusive use of SSE instructions for all floating-point operations. x64
deprecates older instruction sets: x87, Intel MMX®, and AMD 3DNow!®. x64
supports the following instruction set extensions:</p>
<ul><li><b>SIMD/vector instructions</b>: SSE up to SSE4.2 (including SSSE3 for
packing and SSE4a), and AVX</li>
<li><b>F16C</b>: half-precision float conversion</li>
<li><b>BMI</b>: bit shifting and manipulation</li>
<li><b>AES+CLMULQDQ</b>: cryptographic function support</li>
<li><b>XSAVE</b>: extended processor state save</li>
<li><b>MOVBE</b>: byte swapping/permutation</li>
<li><b>VEX prefixing</b>: Permits use of 256-bit operands in support of AVX
instructions</li>
<li><b>LOCK prefix</b>: modifies selected integer instructions to be
system-wide atomic</li>
</ul><p> </p>
</blockquote>
<blockquote>
<p>The cores do not support XOP, AVX2, or FMA3/4 (fused multiply-add).</p>
<p>Architecturally, the cores each have sixteen 64-bit general purpose
registers, eight 80-bit floating point registers, and sixteen 256-bit
vector/SIMD registers. The 80-bit floating point registers are part of x87
legacy support.</p>
</blockquote>
<blockquote>
<h2>Performance</h2>
<p>Durango CPU cores run at 1.6 GHz; this is half the clock rate of the <a href="http://www.vgleaks.com/xbox-360/" title="Xbox 360">Xbox 360</a>’s cores.
Because of this, it is tempting to assume that the Xbox 360’s cores might
outperform Durango’s cores. However, this is emphatically not true, for the
reasons described in the following sections.</p>
</blockquote>
<blockquote>
<h3>Sub-ISA Parallelism and Micro-Operations</h3>
<p>Like most recent high-performance x64 processors, the cores do not execute
the x64 instruction set natively; instead, internally instructions are decoded
into micro-operations, which the processor executes. This translation provides
opportunities to parallelize beyond traditional superscalar execution.</p>
</blockquote>
<blockquote>
<p>Durango CPU cores have dual x64 instruction decoders, so they can decode two
instructions per cycle. On average, an x64 instruction is converted to 1.7
micro-operations, and many common x64 instructions are converted to 1
micro-operation. In the right conditions, the processor can simultaneously
issue six micro-operations: a load, a store, two ALU, and two vector floating
point. The core has corresponding pipelines: two identical 64-bit ALU
pipelines, two 128-bit vector float pipelines (one with float multiply, one
with float add), one load pipeline, and one store pipeline. A core can retire 2
micro-operations a cycle.</p>
</blockquote>
<blockquote>
<h3>Out of Order Execution</h3>
<p>Xbox 360 CPU cores execute in-order (also called <i>program order</i>): the
instructions run in exactly the order the compiler laid them out. The Xbox 360 CPU
has no opportunity to anticipate and avoid stalls caused by dependencies in the
incoming instruction stream, and no compiler can eliminate all possible
pipeline issues.</p>
</blockquote>
<blockquote>
<p>In contrast, the Durango CPU cores execute fully out of order (OOO), also
called <i>data order</i>, since execution order is determined by data
dependencies. This means the processor is able, while executing a sequence of
instructions, to re-order the micro-operations (<i>not</i> the x64 instructions)
via an internal 64-entry re-order buffer (ROB). This improves performance
by:</p>
<ul><li>Starting loads and stores as early as possible to avoid stalls.</li>
<li>Executing instructions in data-dependency order.</li>
<li>Fetching instructions from branch destination as soon as the branch address
is resolved.</li>
</ul></blockquote>
<blockquote>
<h3>Register Renaming</h3>
<p>A low count of registers can cause execution of instructions to be
unnecessarily serialized. Similar in concept to translating x64 instructions to
micro-operations, register names used in the x64 instruction stream are not
used as is, but are instead renamed to point at entries in a large internal
physical register file (PRF)—Durango cores have a 64-entry, 64-bit,
general-purpose PRF and a 72-entry, 128-bit, vector float PRF. With renaming,
the processor can disentangle instructions that are serialized only by register
name and, to improve throughput, push independent micro-operations to earlier
positions in the execution order via OOO.</p>
</blockquote>
<blockquote>
<h3>Speculative Execution</h3>
<p>Instruction streams can be regarded as being divided into basic blocks of
non-branching code by branches. CPUs with deep pipelines execute basic blocks
efficiently, but they face performance challenges around conditional branches.
The simplest approach—stall until the conditional is determined and the branch
direction is known—results in poor performance.</p>
</blockquote>
<blockquote>
<p>The Durango CPU is able to fetch ahead and predict through multiple
conditional branches and hold multiple basic blocks in its re-order buffer,
effectively executing ahead through the code from predicted branch outcomes.
This is made possible via the core tracking which registers in the PRF
represent speculative results—that is, those from basic blocks that are not
currently certain to be executed. Once a branch direction is determined, if the
core predicted the branch direction correctly, results from that basic block
are marked as valid. If the core mispredicted, speculative results (which may
include many basic blocks) are discarded, and fetching and execution then
begins from the correct address.</p>
</blockquote>
<blockquote>
<h3>Store Forwarding</h3>
<p>With in-order execution, a store to memory followed shortly by a load from
the same location can cause a stall while the contents of memory (usually via
an L1 line) are updated; the stall ensures that the load gets the correct
result, rather than a stale value. On Xbox 360, this commonly encountered
penalty is called Load-Hit-Store. On Durango, the cores have store-forwarding
hardware to deal with this situation. This hardware monitors the load store
queue, looking for memory accesses with the same size and address; when it
finds a match, it can short-cut the store and subsequent load via the physical
register file, and thereby avoid significant pipeline stalls.</p>
</blockquote>
<blockquote>
<h3>Highly Utilized Out of Order Load Store Engine</h3>
<p>A Durango core is able to drive its load store unit at around 80-90%
capacity, <i>even on typical code</i>, because the combination of OOO, register
renaming, and store forwarding massively reduces pipeline flushes and stalls,
permitting highly effective use of L1 bandwidth. This improvement is partly the
result of the load store unit being able to reorder independent memory accesses
to avoid data hazards: loads can be arbitrarily re-ordered, and stores may
bypass loads, but stores cannot bypass other stores.</p>
<p>By contrast, the load store hardware in the Xbox 360 is utilized at about
15% capacity on typical code, due to the many pipeline bubbles from in-order
execution on the cost-reduced PowerPC cores. In conjunction with pipeline
issues, the major factors in the Xbox 360’s throughput being as low as 0.2
instructions per cycle (IPC) are L1 miss, L2 miss, and waiting for data from
memory.</p>
</blockquote>
<blockquote>
<h3>Cache Performance</h3>
<p>The Durango CPU uses 64-byte cache lines, which makes a process less likely
to waste bandwidth loading unneeded data. On Xbox 360, ensuring effective use
of cache lines for 128-byte lines can be tricky. While a Durango core’s L1 data
cache is the same size as on Xbox 360, it is not shared between two hyper
threads, and it has better set associativity. L2 is effectively three times the
size, for each hardware thread, and it has better associativity: 512 KB per
hardware thread on Durango versus approximately 170 KB per hardware thread on
Xbox 360. L1 and L2 bandwidth will be more efficiently utilized on an automatic
basis via prefetching, smaller cache lines, register renaming, OOO, and store
forwarding.</p>
</blockquote>
<blockquote>
<h3>Advanced Branch Predictor</h3>
<p>Effective branch prediction increases the likelihood that speculative
execution will execute the right code path. The Durango CPU cores have an
advanced dynamic branch predictor, able to predict up to 2 branches per cycle.
Rather than a branch <i>direction</i>, an actual branch <i>address</i> is
predicted, meaning the instruction fetch unit can speculatively fetch
instructions without waiting for resolution of the branch instruction
dependencies and the resultant target. The first-level sparse predictor stores
information about the branch target for the first two branches in a cache line,
hashed by line address in 4 KB of storage. The sparse information also
indicates if more than 2 branches are present in that line, and indexes into a
second-level dense predictor, by using a 4-KB set-associative cache of
prediction information for branches in 8-byte chunks. A branch target address
calculator checks relative branch predictions as early as possible in the
pipeline to permit discarding incorrectly fetched instructions. In addition,
the prediction unit contains a 16-entry call/return stack and a 32-entry
out-of-page address predictor.</p>
</blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>