Final Architecture Project

Architecture Parameter Sensitivity Analysis of High Bandwidth Memory and 2.5D Memory Architectures

Core question: When does high peak memory bandwidth translate into useful achieved performance?

Zezhou Zhang, Leilei Ouyang, Duoxiang Zhao, Ruolin Yin, Yuzhou Chen
Introduction to Computer Architecture
DRAMsim3 final workflow
Background

The memory wall: compute outpaces off-chip memory bandwidth.

The memory wall problem

  • Processor compute capability grows ~50% per year; DRAM bandwidth grows ~7-10% per year.
  • Modern GPUs/accelerators demand >1 TB/s of memory bandwidth for AI/ML workloads.
  • HBM (High Bandwidth Memory) addresses this through 3D-stacked DRAM + wide I/O interfaces.
  • 2.5D integration uses a silicon interposer to place HBM and processor side-by-side.

Our approach

We use DRAMsim3 to systematically compare architectures and understand which parameters truly matter — not just peak bandwidth, but achieved bandwidth, latency, utilization, and energy.

10.43x
HBM over DDR4 achieved BW (stream)
3.27x
HBM over DDR4 achieved BW (random)
37.3%
DVFS energy saving (light workload)
DRAM core
Timing limits random access (new experiment)

Methodology

A unified DRAMsim3 workflow connects architectures, workloads, and metrics.

Experimental setup

  • Simulator: DRAMsim3 (cycle-accurate, open source).
  • Architectures: DDR4 baseline, HBM baseline, HMC/2.5D proxy.
  • HBM sensitivity: channels (1/2/4/8), frequency (800/1000/1250/2000 MHz), queue size (16/32/64).
  • Workloads: built-in stream (high locality) & random (low locality); timed synthetic traces (light/mixed/heavy) for DVFS.

Metrics

  • Achieved bandwidth (GB/s)
  • Average read latency (ns)
  • Bandwidth utilization = achieved / peak
  • Energy per completed request (pJ/req)
  • DVFS: energy saving %, EDP improvement %, bandwidth/latency change %
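The derived metrics above are simple ratios, and a minimal sketch of them might look like the following. The function names are illustrative and do not come from the project's parser. The energy-delay product is taken here as energy × delay, which is an assumption, since the deck does not say which delay term its EDP uses.

```python
def bandwidth_utilization(achieved_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak bandwidth actually delivered (achieved / peak)."""
    if peak_gbps <= 0:
        raise ValueError("peak bandwidth must be positive")
    return achieved_gbps / peak_gbps


def energy_per_request_pj(total_energy_pj: float, completed_requests: int) -> float:
    """Average energy spent per completed request, in picojoules."""
    if completed_requests <= 0:
        raise ValueError("need at least one completed request")
    return total_energy_pj / completed_requests


def percent_change(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def saving_percent(baseline: float, new: float) -> float:
    """Positive when `new` is lower than `baseline` (e.g. energy saving %)."""
    return -percent_change(baseline, new)


def edp(energy: float, delay: float) -> float:
    """Energy-delay product. ASSUMPTION: plain energy x delay."""
    return energy * delay
```

For example, `bandwidth_utilization(142.1, 409.6)` evaluates to about 0.347, matching the 34.7% stream utilization reported for 800 MHz later in the deck.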

Part 1 — Baseline Experiments

Three architectures compared under controlled workloads.

Architecture | Config source | Key characteristics | Role
DDR4 | DDR4_Baseline.ini | 1 channel, 64-bit bus, ~101.6 GB/s peak | Conventional memory baseline
HBM | HBM_Baseline.ini | 8 channels, 128-bit per channel, 4 dies, ~512 GB/s peak | Main target: stacked wide-I/O memory
HMC/2.5D proxy | HMC_4GB_4Lx16.ini | 16 channels (4 links × 16 lanes), ~120 GB/s peak | Proxy for 2.5D stacked/interposer memory

Workloads

Stream: sequential, row-buffer-friendly → tests peak bandwidth capability.

Random: uniform random, low locality → tests latency and queue behavior.

Experiment design principle

Each architecture runs both workloads for 200,000 cycles. We record achieved BW, average read latency, bandwidth utilization, and energy per request — all from the same DRAMsim3 output parser.


Part 1 — Baseline Data Analysis

HBM delivers 10.43x stream and 3.27x random bandwidth over DDR4.

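Metric | Workload | DDR4 | HBM | HMC/2.5D | HBM/DDR4 | HMC/HBM
Achieved BW (GB/s) | Stream | 16.9 | 176.4 | 110.6 | 10.43x | 0.63x
Achieved BW (GB/s) | Random | 18.5 | 60.6 | 79.9 | 3.27x | 1.32x
Avg Read Latency (ns) | Stream | 248.6 | 75.0 | 229.0 | 0.30x | 3.05x
Avg Read Latency (ns) | Random | 484.5 | 380.7 | 93.1 | 0.79x | 0.24x
Energy/Request (pJ) | Stream | 11178 | 1152 | 1059 | 0.10x | 0.92x
Energy/Request (pJ) | Random | 17299 | 2400 | 1196 | 0.14x | 0.50x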

Key finding 1

HBM dominates stream workloads — 10.43x bandwidth, dramatically lower latency and energy. The wide channel interface and channel-level parallelism convert peak BW into achieved BW.

Key finding 2

HMC/2.5D proxy gives better random access performance than HBM (1.32x BW, 93 ns vs. 381 ns latency). This suggests that 2.5D-style architectures sit at a different point on the latency-bandwidth tradeoff.


Part 2 — Parameter Sensitivity Analysis

Three HBM parameters varied one at a time: channels, frequency, queue size.

Parameter | Values tested | What it reveals | Significance
Channel count | 1, 2, 4, 8 | Channel-level parallelism; queue pressure distribution | Strongest result
Frequency (MHz) | 800, 1000, 1250, 2000 | Peak BW vs. achieved BW; diminishing returns; DRAM bottleneck hypothesis | Focus of this section
Transaction queue size | 16, 32, 64 | Memory-level parallelism exploitation | Workload-dependent

Frequency sensitivity is the most nuanced result, and it points to the fundamental DRAM core bottleneck.

Part 2 — Frequency Sensitivity (Focus)

Stream BW grows but saturates; random BW is nearly flat above 1000 MHz.

Figure: HBM frequency sweep (800/1000/1250/2000 MHz): achieved BW, average read latency, and bandwidth utilization.

Frequency (MHz) | Peak BW (GB/s) | Stream BW (GB/s) | Random BW (GB/s) | Stream Util. | Random Util. | Random Latency (ns)
800 | 409.6 | 142.1 | 51.1 | 34.7% | 12.5% | 156.2
1000 | 512.0 | 176.4 | 60.6 | 34.4% | 11.8% | 380.7
1250 | 640.0 | 186.6 | 60.1 | 29.2% | 9.4% | 477.1
2000 | 1024.0 | 215.4 | 61.0 | 21.0% | 6.0% | 524.1
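As a quick check on how unevenly peak and achieved bandwidth scale, the sweep rows can be normalized to the 800 MHz point. The sketch below only re-derives ratios from the table values; the structure and names are illustrative.

```python
# Rows copied from the frequency sweep table:
# (frequency MHz, peak GB/s, stream GB/s, random GB/s)
SWEEP = [
    (800, 409.6, 142.1, 51.1),
    (1000, 512.0, 176.4, 60.6),
    (1250, 640.0, 186.6, 60.1),
    (2000, 1024.0, 215.4, 61.0),
]


def scaling_summary(rows):
    """Normalize each row to the first row and recompute utilization."""
    _, base_peak, base_stream, base_random = rows[0]
    summary = []
    for mhz, peak, stream, rand in rows:
        summary.append({
            "mhz": mhz,
            "peak_x": peak / base_peak,        # growth in peak BW
            "stream_x": stream / base_stream,  # growth in achieved stream BW
            "random_x": rand / base_random,    # growth in achieved random BW
            "stream_util_pct": 100.0 * stream / peak,
            "random_util_pct": 100.0 * rand / peak,
        })
    return summary


if __name__ == "__main__":
    for row in scaling_summary(SWEEP):
        print(row)
```

At 2000 MHz the peak factor is 2.5x, while the stream and random factors are about 1.52x and 1.19x, which is the gap the following analysis discusses.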

Part 2 — Frequency Sensitivity: Deep Analysis

Why does random bandwidth saturate? The DRAM core timing is the bottleneck.

Stream workload: partial scaling

  • BW grows from 142.1 → 215.4 GB/s (+51.6% across 800→2000 MHz).
  • But utilization drops from 34.7% → 21.0%.
  • Peak BW grows 2.5x (409.6→1024 GB/s), achieved BW grows only 1.5x.
  • Diminishing returns: each frequency increment converts less into achieved BW.
  • Energy/request grows from 917 → 2656 pJ (+190%) — significant energy cost.

Random workload: complete saturation

  • BW gains little: 51.1 → 61.0 GB/s at 800→2000 MHz (+19%), all of it between 800 and 1000 MHz.
  • Above 1000 MHz: no meaningful gain (60.6 → 60.1 → 61.0 GB/s).
  • Utilization collapses: 12.5% → 6.0%.
  • Latency increases from 156 → 524 ns (3.4x).
  • Energy/request explodes from 1786 → 7784 pJ (+336%).

Random BW saturates at ~61 GB/s above 1000 MHz. Doubling the interface frequency adds no bandwidth, which indicates that the limit is not interface bandwidth. The likely limit is DRAM core timing: tRCD (row activation), tRP (precharge), and CL (column access). Every random access that misses the row buffer pays this timing cost.

Part 2 — Frequency Sensitivity: Conclusions

Key takeaways: frequency is not a free lunch.

+51.6%
Stream BW gain (800→2000 MHz)
+19%
Random BW gain (800→2000 MHz)
~0%
Random BW gain above 1000 MHz
-39% to -52%
Utilization drop, 800→2000 MHz (stream: 34.7% → 21.0%; random: 12.5% → 6.0%)

3 conclusions from frequency sensitivity

  1. Stream benefits but with diminishing returns. Each extra MHz of peak BW converts less efficiently into achieved BW. The sweet spot appears around 1000 MHz (34.4% utilization, 176.4 GB/s).
  2. Random BW is DRAM-core-limited. Above 1000 MHz, the interface can deliver more, but the DRAM timing (tRCD + CL + tRP) cannot service random requests any faster. Adding more peak BW is wasted.
  3. Higher frequency costs more in energy than it returns in bandwidth. For random: +100% peak BW (1000→2000 MHz) raises energy per request from 2400 to 7784 pJ (+224%) for a +0.7% bandwidth gain.

This directly motivates our new experiment

If DRAM core timing is the real bottleneck for random access, can we confirm this by varying the DRAM timing parameters (tRCD, tRP, CL) directly? See Part 3(c): DRAM Core as Bottleneck.


Part 2 — Channel Sensitivity

Channel count is the most reliable HBM tuning dimension.

Figure: HBM channel sweep (1/2/4/8): achieved BW increases and latency decreases monotonically in both workloads.

Bandwidth scaling

  • Stream: 26.5 → 176.4 GB/s = 6.66x from 1→8 channels
  • Random: 7.8 → 60.6 GB/s = 7.73x from 1→8 channels
  • Near-linear scaling: channels expose independent memory-level parallelism

Latency reduction

  • Stream: 188.9 → 75.0 ns = 60% reduction
  • Random: 920.3 → 380.7 ns = 59% reduction
  • Queue pressure distributes across channels → lower per-channel contention
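One way to quantify "near-linear" is to divide the measured speedup by the ideal speedup (the channel ratio). The sketch below uses the rounded endpoint values listed above, so its ratios can differ slightly from the quoted ones, which were presumably computed from unrounded data. The function name is illustrative.

```python
def scaling_efficiency(bw_start: float, bw_end: float,
                       units_start: int, units_end: int):
    """Return (speedup, efficiency), where efficiency = speedup / ideal speedup."""
    speedup = bw_end / bw_start
    ideal = units_end / units_start
    return speedup, speedup / ideal


if __name__ == "__main__":
    # Endpoints of the 1 -> 8 channel sweep, in GB/s.
    for name, bw1, bw8 in (("stream", 26.5, 176.4), ("random", 7.8, 60.6)):
        speedup, eff = scaling_efficiency(bw1, bw8, 1, 8)
        print(f"{name}: {speedup:.2f}x speedup, {100 * eff:.0f}% of ideal")
```

With these inputs, stream reaches roughly 83% of ideal 8x scaling and random roughly 97%.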

Part 2 — Transaction Queue Sensitivity

Larger queues help stream (1.62x) but not random (1.01x).

Figure: HBM transaction queue sweep (16/32/64): stream BW improves, random BW is flat.

Queue size | Stream BW (GB/s) | Random BW (GB/s) | Stream Latency (ns) | Random Latency (ns)
16 | 114.6 | 60.3 | 53.4 | 302.1
32 | 176.4 | 60.6 | 75.0 | 380.7
64 | 185.4 | 61.0 | 103.3 | 542.1

Interpretation

Queue capacity exposes memory-level parallelism. Stream benefits because its sequential access pattern has exploitable parallelism: more outstanding requests can be pipelined across banks and channels. Random does not benefit because its bottleneck is DRAM timing, not queue depth, so extra queued requests only wait longer (random latency rises from 302 to 542 ns while BW stays at ~61 GB/s). Larger queues trade latency for throughput: at queue=64, stream latency rises +38% but BW gains only +5.1% over queue=32.


Part 3(a) — Innovation: HBM-Aware DVFS

Load-aware frequency switching saves energy when the workload leaves slack.

Figure: HBM DVFS extension (10,000-request traces): energy and EDP changes across scenarios, comparing a fixed HBM 1250 MHz baseline with load-aware dynamic selection among 800/1000/1250 MHz.

Method

  • Split timed traces into 100K-cycle windows
  • Baseline probe: measure load index L ∈ [0,1] per window at 1250 MHz
  • Decision: L < 0.4 → 800 MHz, L < 0.7 → 1000 MHz, else → 1250 MHz
  • Simulate each window at selected frequency, aggregate results
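The decision step above can be sketched as a small threshold function plus a windowing helper. How the load index L is measured in the probe run is not specified here, so the sketch takes L as an input. The function names and the use of request issue cycles for windowing are assumptions for illustration.

```python
WINDOW_CYCLES = 100_000  # window length from the method description


def select_frequency_mhz(load_index: float) -> int:
    """Map a per-window load index L in [0, 1] to an HBM frequency."""
    if not 0.0 <= load_index <= 1.0:
        raise ValueError("load index must be in [0, 1]")
    if load_index < 0.4:
        return 800
    if load_index < 0.7:
        return 1000
    return 1250


def split_into_windows(request_cycles, window_cycles=WINDOW_CYCLES):
    """Group request issue cycles into consecutive fixed-size windows.

    Windows that contain no requests are omitted.
    """
    windows = {}
    for cycle in request_cycles:
        windows.setdefault(cycle // window_cycles, []).append(cycle)
    return [windows[index] for index in sorted(windows)]


def plan_frequencies(load_per_window):
    """Choose one frequency per window from its measured load index."""
    return [select_frequency_mhz(load) for load in load_per_window]


if __name__ == "__main__":
    # Illustrative load indices only, not measured data.
    print(plan_frequencies([0.2, 0.35, 0.55, 0.9]))
```

Because the thresholds use strict less-than comparisons, a window with L exactly 0.4 runs at 1000 MHz and one with L exactly 0.7 runs at 1250 MHz.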

4 workload scenarios

  • Light: sparse random, gap 200-800 cycles, L~0.2
  • Mixed: alternating dense/sparse phases, L∈[0.3,0.8]
  • Heavy: dense random with 20% priority requests, L>0.7
  • Baseline: dense random, no priority, L>0.7

Part 3(a) — DVFS: When is the tradeoff ideal?

DVFS saves 37.3% energy at light load, with 30.8% EDP improvement.

Scenario | Energy Saving | EDP Improvement | BW Change | Latency Change | Windows at 800 MHz | Windows at 1250 MHz | Verdict
Light | 37.3% | 30.8% | -36.1% | +5.2% | 51 | 0 | Ideal tradeoff
Mixed | 4.7% | 2.6% | -1.1% | +1.0% | 1 | 6 | Modest benefit
Heavy | 0% | 0% | 0% | 0% | 0 | 1 | Correctly stays high
Baseline | 0% | 0% | 0% | 0% | 0 | 1 | Correctly stays high

When is the tradeoff ideal?

Light workloads (L < 0.4): 37.3% energy saving, 30.8% EDP improvement, only 5.2% latency increase. The 36.1% bandwidth loss is acceptable because the workload doesn't need it. All 51 windows run at 800 MHz — the policy correctly identifies sustained low load.

When is DVFS not useful?

Heavy/sustained workloads: the policy correctly holds at 1250 MHz, claiming no artificial savings. Mixed workloads: modest 4.7% energy saving — the sparse windows are too few to accumulate significant savings.

DVFS is not a universal optimizer. It is a targeted energy tradeoff that works best when the load index stays consistently below 0.4. The roughly 1:1 energy-bandwidth tradeoff at light load (37.3% energy saved for 36.1% bandwidth given up) is the fundamental design tension.

Part 3(b) — Innovation: 2.5D Stacked HBM

2.5D integration trades stream bandwidth for better random access latency.

What is 2.5D stacked HBM?

2.5D integration places HBM stacks and the processor die side-by-side on a silicon interposer. The interposer provides ultra-dense routing (~2μm pitch vs. ~100μm on PCB), enabling thousands of short, low-power connections between memory and processor.

  • HBM dies are 3D-stacked with TSVs (through-silicon vias)
  • The stack sits on a base logic die connected to the interposer
  • Interposer routes signals from HBM to processor with minimal distance (~2-5mm)
  • Key benefit: shorter physical path → lower latency, lower power per bit

Our proxy: HMC/2.5D in DRAMsim3

We use the HMC (Hybrid Memory Cube) configuration as a proxy for 2.5D-style stacked memory. HMC shares key characteristics: 3D-stacked DRAM, serialized links through a logic base, and proximity to the processor via interposer-style connection. Caveat: DRAMsim3 does not model physical interposer routing delay or thermal effects.

0.63x
2.5D/HMC stream BW vs HBM (lower)
1.32x
2.5D/HMC random BW vs HBM (higher!)
93 ns
2.5D/HMC random latency (HBM: 381 ns)
1196 pJ
2.5D/HMC energy/req random (HBM: 2400 pJ)

Why this tradeoff?

HMC uses 4 narrow serial links of 16 lanes each (4 links × 16 lanes) instead of HBM's 8 wide parallel channels. The serialized, packet-based protocol has lower per-access overhead for random requests, but lower peak streaming bandwidth. This creates an architecture-consistent tradeoff: 2.5D/HMC is better for latency-sensitive, irregular workloads; HBM is better for bandwidth-hungry, regular workloads.


Part 3(c) — Innovation: DRAM Core as Bottleneck [NEW]

Proving that DRAM timing — not interface bandwidth — limits random access.

Motivation from existing data

Our frequency sweep showed random BW saturating at ~61 GB/s regardless of interface frequency (1000→2000 MHz: zero gain). This strongly suggests the bottleneck is not the HBM interface but the DRAM core timing parameters. Every random access pays tRCD (row activation) + CL (column read) + tRP (precharge), which are fixed real-time costs independent of interface speed.

Hypothesis: In random-access workloads, DRAM row cycle time (tRC = tRAS + tRP) dominates overall latency, making interface bandwidth irrelevant beyond a threshold.

Experiment design space: 3 proposals

(Choose one or combine; suggested priority order shown)

Proposal 1 (Recommended): Systematically vary DRAM core timing parameters

Setup: Fix HBM at 1250 MHz (8 channels, queue=32). Create 5 configs where tRCD, tRP, tCL are scaled by 0.5x / 0.75x / 1.0x / 1.5x / 2.0x relative to baseline.

Metrics: Achieved random BW, random latency, utilization.

Prediction: If DRAM core is the bottleneck, slowing timing by 2x should approximately halve random BW, while speeding by 2x should nearly double it. If interface is the bottleneck, changes will have minimal effect.

[Experiment results to be filled here]
Expected: table/plot showing BW vs. timing scale factor.
Configs: HBM_1250_tRCD_0p5x.ini through HBM_1250_tRCD_2p0x.ini.
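A minimal sketch of how the five configs for Proposal 1 could be generated from a baseline .ini file is shown below. It assumes the timing values live in a `[timing]` section as integer cycle counts and that the keys are named `tRCD`, `tRP`, and `CL`. These key names are assumptions and should be checked against the actual baseline file (HBM configs may split tRCD into separate read and write keys). The helper names are hypothetical.

```python
import configparser
import io
import math

# ASSUMPTION: section and key names; verify against the real baseline .ini.
TIMING_SECTION = "timing"
TIMING_KEYS = ("tRCD", "tRP", "CL")
SCALE_FACTORS = (0.5, 0.75, 1.0, 1.5, 2.0)


def scale_timing(ini_text, factor, keys=TIMING_KEYS, section=TIMING_SECTION):
    """Return (new_ini_text, scaled), scaling the listed timing keys by `factor`.

    Values are rounded up to whole cycles and kept at least 1.
    Keys that are absent from the section are skipped and not reported.
    """
    parser = configparser.ConfigParser(inline_comment_prefixes=(";", "#"))
    parser.optionxform = str  # keep key case (tRCD, not trcd)
    parser.read_string(ini_text)
    scaled = {}
    for key in keys:
        if parser.has_option(section, key):
            cycles = int(parser.get(section, key))
            new_cycles = max(1, math.ceil(cycles * factor))
            parser.set(section, key, str(new_cycles))
            scaled[key] = new_cycles
    out = io.StringIO()
    parser.write(out)
    return out.getvalue(), scaled


def config_name(factor):
    """File name following the HBM_1250_tRCD_0p5x.ini pattern."""
    return f"HBM_1250_tRCD_{str(factor).replace('.', 'p')}x.ini"


if __name__ == "__main__":
    # Illustrative values only, not the project's baseline timings.
    sample = "[timing]\ntCK = 0.8\nCL = 14\ntRCD = 14\ntRP = 14\ntRAS = 34\n"
    for factor in SCALE_FACTORS:
        _, scaled = scale_timing(sample, factor)
        print(config_name(factor), scaled)
```

Because the interface frequency is fixed, scaling cycle counts scales the real-time cost by the same factor. Dependent parameters such as tRAS may also need adjusting so the scaled configs stay self-consistent; the sketch leaves them untouched.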

Proposal 2: Row buffer locality sweep (tRC bottleneck proof)

Setup: Create 5 traces with controlled row buffer hit rates: 0%, 25%, 50%, 75%, 100%. Simulate at HBM 1250 MHz (8 channels).

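Prediction: BW should rise steadily with row hit rate, because each miss costs the full tRC (row close + open) while a hit pays only the column access. At 0% hit rate, even HBM's peak BW is irrelevant: the DRAM core tRC determines maximum throughput.

Key formula: Max random BW ≈ (bytes per access × banks per channel × channels) / tRC when row hit rate = 0, where bytes per access = bus width × burst length / 8. Each bank completes at most one access per tRC, so this is an upper bound; it ignores activation-rate limits such as tRRD and tFAW.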

[Experiment results to be filled here]
Expected: scatter plot of BW vs. row hit rate.
Note: trace generators still need to be written.
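As a starting point for the missing trace generators, the sketch below emits a read-only trace in which each request stays in the previous request's row with a chosen probability. Several things here are assumptions: the line format (`0xADDR READ cycle`), the row size, the request size, and above all the idea that a contiguous address block maps to one DRAM row. The real hit rate depends on the simulator's address mapping and scheduler, so the generated traces would need to be validated against measured row-hit statistics.

```python
import random

ROW_BYTES = 2048      # ASSUMPTION: bytes that map to one DRAM row
REQUEST_BYTES = 64    # ASSUMPTION: bytes per request
NUM_ROWS = 1 << 14    # ASSUMPTION: number of distinct rows to draw from


def generate_trace(num_requests, hit_rate, gap_cycles=10, seed=0):
    """Return (lines, intended_hit_fraction) for a synthetic read trace.

    With probability `hit_rate`, a request reuses the previous row;
    otherwise it jumps to a different, randomly chosen row.
    """
    rng = random.Random(seed)
    row = rng.randrange(NUM_ROWS)
    hits = 0
    lines = []
    for i in range(num_requests):
        if i > 0 and rng.random() < hit_rate:
            hits += 1  # stay in the current row
        else:
            # Draw from the other NUM_ROWS - 1 rows so the row always changes.
            candidate = rng.randrange(NUM_ROWS - 1)
            row = candidate if candidate < row else candidate + 1
        column = rng.randrange(ROW_BYTES // REQUEST_BYTES) * REQUEST_BYTES
        address = row * ROW_BYTES + column
        lines.append(f"0x{address:X} READ {i * gap_cycles}")
    return lines, hits / max(1, num_requests - 1)


if __name__ == "__main__":
    for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
        trace, fraction = generate_trace(10_000, rate)
        print(f"target {rate:.2f}, generated {fraction:.3f}, first line: {trace[0]}")
```

The returned fraction is the intended hit rate at the trace level, not what the memory controller will observe.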

Proposal 3: Bank count sensitivity (find bank-level parallelism limit)

Setup: Vary HBM bank groups (2/4/8) and banks per group (2/4/8) at fixed 1250 MHz. Total banks: 4, 8, 16, 32, 64.

Prediction: Random BW should saturate when #banks > #outstanding random requests that can be serviced in parallel. This identifies the bank-level parallelism ceiling.

[Experiment results to be filled here]
Expected: line chart of BW vs. total bank count.
Configs: edit bankgroups and banks_per_group.
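The sweep in Proposal 3 has nine (bank groups, banks per group) combinations but only five distinct total bank counts, so several configs share a total. A short enumeration, with illustrative names, makes the grouping explicit:

```python
from itertools import product

BANK_GROUPS = (2, 4, 8)
BANKS_PER_GROUP = (2, 4, 8)


def bank_configs():
    """Map total bank count -> list of (bankgroups, banks_per_group) pairs."""
    by_total = {}
    for groups, per_group in product(BANK_GROUPS, BANKS_PER_GROUP):
        by_total.setdefault(groups * per_group, []).append((groups, per_group))
    return dict(sorted(by_total.items()))


if __name__ == "__main__":
    for total, combos in bank_configs().items():
        print(total, combos)
```

Running configs that share a total (for example 2×8, 4×4, and 8×2 for 16 banks) would also show whether the bank-group split matters independently of the total bank count.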

Limitations & Scope

Strongest conclusions come from being clear about what we do not claim.

What we do NOT claim

  • The HMC/2.5D result is a proxy — it does not model physical interposer routing, thermal, or distance effects.
  • Frequency sensitivity is affected by cycle-driven workload generation; do not generalize "higher frequency = slower random" as a universal law.
  • The DVFS study does not measure P99 latency or SLA violation rate, so it makes no tail-latency guarantees.
  • Synthetic stream/random workloads show architectural sensitivity, not full application speedup.
  • DRAM core bottleneck experiment (Part 3c) is proposed but not yet executed.

Future work directions

  • Execute DRAM bottleneck experiment (Proposal 1: timing parameter sweep).
  • Add per-request latency logging for P99/tail analysis in DVFS.
  • Use application-derived memory traces (e.g., SPEC, ML benchmarks).
  • Model interposer effects more explicitly or cross-validate with another simulator.
  • Explore newer HBM generations (HBM3/HBM3e) with more channels and higher frequencies.
These limitations define the boundary of our conclusions — they do not invalidate the results within that boundary.

Conclusion

HBM's value comes from converting architecture resources into achieved bandwidth — and knowing what limits that conversion.

10.43x
HBM/DDR4 stream BW — large, reliable gain
~61 GB/s
Random BW ceiling — DRAM-core-limited
37.3%
DVFS energy saving at low load — ideal tradeoff
93 ns
2.5D/HMC random latency vs. HBM 381 ns

Baseline

HBM strongly outperforms DDR4. The advantage is workload-dependent: 10.43x stream, 3.27x random.

Sensitivity

Channels are the most reliable tuning dimension. Frequency has diminishing returns; random BW hits a DRAM core ceiling.

Innovation

DVFS saves energy when load is low. 2.5D changes the latency/BW tradeoff. DRAM bottleneck experiment will quantify the core timing limit.

Final contribution: a reproducible DRAMsim3 workflow that connects memory architecture parameters to measured bandwidth, latency, utilization, and energy — with a clear path to quantifying the DRAM core bottleneck.