SAFARI News
- December 2012 -- Energy-efficient Communication Substrates:
We are designing and evaluating on-chip interconnects with new designs that
provide high performance with very simple router hardware, and low energy and
area overhead. Our new technical report, "HiRD: A Low-Complexity,
Energy-Efficient Hierarchical Ring Interconnect," describes new mechanisms and
techniques to design a network using a hierarchy of rings without using buffers
in each ring router. The key novel ideas are to place buffers only in "bridge"
routers, which allow traffic to move between rings, and to use "deflections"
(allowing traffic to circle a ring again) when transfer buffers are full. In
this way, no in-ring flow control or buffering is necessary, leading to a
unique hierarchical network design. Our report rigorously argues that HiRD is
livelock- and deadlock-free, and compares its performance and energy-efficiency
against a comprehensive set of relevant baselines. The results show that HiRD
attains equal or better performance at better energy efficiency than
these baseline NoC topologies and router designs, including a
previous buffered hierarchical ring design. We conclude that HiRD is a
compelling design point which allows scalable, efficient performance while
retaining the simplicity and appeal of ring-based designs.
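To make the bridge-router idea concrete, here is a minimal sketch in Python of the transfer decision described above. It is our illustration, not the report's implementation; the buffer depth and the flit's needs_global_ring flag are assumptions.

```python
# A minimal sketch (ours, not the report's implementation) of a HiRD-style
# bridge router: a flit moves to the other ring only if a transfer buffer
# slot is free; otherwise it stays on its current ring and circles again
# (a "deflection"), so the rings need no in-ring flow control or buffering.
from collections import deque

TRANSFER_BUFFER_DEPTH = 4  # assumed depth, for illustration only

class BridgeRouter:
    def __init__(self):
        self.to_global = deque()  # transfer buffer: local ring -> global ring

    def on_local_ring_flit(self, flit):
        """Called for each flit passing the bridge on the local ring."""
        if flit.needs_global_ring and len(self.to_global) < TRANSFER_BUFFER_DEPTH:
            self.to_global.append(flit)  # transfer succeeds; flit leaves the ring
            return None
        return flit  # not transferring, or transfer buffer full: keep circling
```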
- October 2012 -- Application-aware Throttling Mechanism for NoCs: Our work on
"Heterogeneous Adaptive Throttling for On-Chip Networks" was presented at SBAC-PAD 2012. You can
view our slides here.
The network-on-chip (NoC) is a primary shared
resource in a chip multiprocessor (CMP) system. As core counts
continue to increase and applications become increasingly
data-intensive, the network load will also increase, leading
to more congestion in the network. This network congestion
can degrade system performance if the network load is not
appropriately controlled. In this paper, we present Heterogeneous
Adaptive Throttling (HAT), a new application-aware throttling mechanism
that reduces congestion to improve performance in NoC-based multi-core
systems. HAT achieves this improvement
by using two key principles. First, to improve system
performance, HAT observes applications' network intensity
and selectively throttles network-intensive applications, allowing
latency-sensitive applications to make fast progress
with reduced network congestion. Second, to minimize
the over- and under-throttling of applications, which limits
system performance, HAT observes the network load and
dynamically adjusts the throttling rate in a closed-loop
fashion. We show that
HAT outperforms two state-of-the-art congestion control
mechanisms, providing the best system performance
and fairness.
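As a rough illustration of these two principles, the sketch below is our simplification, with made-up thresholds and epoch structure, not the paper's exact algorithm.

```python
# Our simplification, not the paper's exact algorithm: classify applications
# by measured network intensity, throttle only the intensive ones, and adjust
# a global throttling rate each epoch so network load tracks a target.
TARGET_LOAD = 0.6          # assumed target network utilization
INTENSITY_THRESHOLD = 5.0  # assumed cutoff, e.g., injections per 1K instructions
STEP = 0.05                # assumed per-epoch adjustment step

def end_of_epoch(apps, measured_load, throttle_rate):
    """apps: list of (name, intensity) pairs measured over the last epoch.
    Returns (set of apps to throttle, new throttling rate)."""
    throttled = {name for name, intensity in apps
                 if intensity > INTENSITY_THRESHOLD}
    if measured_load > TARGET_LOAD:                      # over target:
        throttle_rate = min(1.0, throttle_rate + STEP)   # throttle harder
    else:                                                # under target:
        throttle_rate = max(0.0, throttle_rate - STEP)   # relax throttling
    return throttled, throttle_rate
```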
- June 2012 -- DRAM Refresh: Our work on
"Retention-Aware Intelligent DRAM Refresh," was presented at ISCA 2012. You can
view our slides here.
DRAM cells must be periodically refreshed to prevent loss of data. These
refresh operations interfere with memory accesses and waste energy. These
negative effects increase as DRAM device capacity increases, posing a
significant challenge to DRAM scaling. We propose RAIDR (Retention-Aware
Intelligent DRAM Refresh), a low-cost mechanism that reduces refresh overhead
by exploiting heterogeneity in DRAM cell retention times. DRAMs today are
refreshed at the worst-case refresh rate required by the weakest cells,
resulting in many unnecessary refreshes to all other cells. In contrast,
RAIDR
identifies the refresh rate required by each DRAM row, groups rows into bins
based on their required refresh rate, and refreshes rows in different bins at
different rates, allowing many of these unnecessary refreshes to be skipped.
RAIDR reduces the number of refresh operations by 74.6% at a modest storage
overhead of 1.25 KB in the memory controller for a 32 GB DRAM system.
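The sketch below illustrates the binning idea in Python. It is only our illustration: the bin names, the 64 ms tick, and the use of plain sets are assumptions (the actual design stores bin membership far more compactly).

```python
# Illustrative only: refresh rows in different retention-time bins at
# different rates. A row in the "256ms" bin is refreshed four times less
# often than a worst-case row in the "64ms" bin.
REFRESH_INTERVALS_MS = {"64ms": 64, "128ms": 128, "256ms": 256}

def rows_to_refresh(bins, time_ms):
    """bins: dict bin name -> set of row ids.
    Returns the rows due for refresh at this 64 ms tick."""
    due = set()
    for name, interval in REFRESH_INTERVALS_MS.items():
        if time_ms % interval == 0:
            due |= bins[name]
    return due

bins = {"64ms": {7}, "128ms": {1, 2}, "256ms": {3, 4, 5, 6}}
print(rows_to_refresh(bins, 64))   # only the weak row 7
print(rows_to_refresh(bins, 256))  # every bin is due
```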
- June 2012 -- Efficient DRAM Design: Our work on
"Exploiting Subarray-Level Parallelism
(SALP) in DRAM," was presented at ISCA 2012. You can
view our slides here.
Modern DRAMs have multiple banks to serve multiple memory
requests in parallel. However, when two requests go to the same
bank, they have to be served serially, exacerbating the high
latency of off-chip memory. Adding more banks to the system to
mitigate this problem incurs high system cost. Our paper
builds on the key observation that a DRAM bank internally
consists of multiple subarrays that operate largely
independently and have their own row-buffers. Hence, the latencies
of accesses to different subarrays within the same bank can
potentially be overlapped to a
large degree. The paper introduces three schemes that take
advantage of this fact and progressively increase the independence
of operation of subarrays by making small modifications to
the DRAM chip. Our schemes significantly improve system performance on both single-core and
multi-core systems on a variety of workloads while incurring little
(less than 0.15%) or no area overhead in the DRAM chip.
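To convey why this overlap matters, here is a toy latency model in Python. The timing numbers and the overlap fraction are our assumptions, not the paper's measurements.

```python
# A toy model (ours) of back-to-back row activations in one bank: accesses to
# the same subarray serialize on the row-cycle time, while accesses to
# different subarrays can be largely overlapped.
tRC_NS = 50             # assumed row-cycle time in nanoseconds
OVERLAP_FRACTION = 0.8  # assumed fraction of tRC hidden across subarrays

def gap_to_second_access(first_subarray, second_subarray):
    if first_subarray == second_subarray:
        return tRC_NS                          # strictly serialized
    return tRC_NS * (1 - OVERLAP_FRACTION)     # mostly overlapped

print(gap_to_second_access(0, 0))  # 50 ns: same subarray
print(gap_to_second_access(0, 1))  # 10.0 ns: different subarrays
```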
- June 2012 -- Scalable Memory Scheduling in
Heterogeneous Systems: Our work on
"Staged Memory Scheduling"
was presented at ISCA 2012. You can view our slides here.
When multiple processor (CPU) cores and a GPU integrated together on the
same chip share the off-chip main memory, requests from the GPU can
heavily interfere with requests from the CPU cores, leading to low system
performance and starvation of CPU cores. Unfortunately, state-of-the-art
application-aware memory scheduling algorithms are ineffective at solving
this problem at low complexity due to the large amount of GPU
traffic. We propose a fundamentally new approach that decouples the
memory controller's three primary tasks into three significantly
simpler structures that together improve system performance and fairness,
especially in integrated CPU-GPU systems. Our three-stage memory
controller first groups requests based on row-buffer locality. This
grouping allows the second stage to focus only on inter-application
request scheduling. These two stages enforce high-level policies regarding
performance and fairness, and therefore the last stage consists of simple
per-bank FIFO queues (no further command reordering within each bank) and
straightforward logic that deals only with low-level DRAM commands and
timing. Our evaluations show that SMS improves CPU performance without
degrading GPU frame rate beyond a generally acceptable level, while being
significantly less complex to implement than previous
application-aware schedulers.
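The sketch below shows the three-stage structure in Python. Only the decoupling is from the paper; the request fields, the first-come first-served stage-2 placeholder, and all sizes are our assumptions (SMS's actual stage-2 policy probabilistically favors latency-sensitive CPU applications).

```python
# Structure-only sketch of a staged memory scheduler: stage 1 forms
# same-row batches per source, stage 2 decides which source's batch to
# admit, stage 3 is a plain FIFO per bank with no further reordering.
from collections import defaultdict, deque

class StagedScheduler:
    def __init__(self, num_banks):
        self.batches = defaultdict(list)                        # stage 1
        self.bank_fifos = [deque() for _ in range(num_banks)]   # stage 3

    def stage1_form_batch(self, source, request):
        batch = self.batches[source]
        if batch and batch[-1].row != request.row:
            self.stage2_admit(source)        # row changed: close the batch
            batch = self.batches[source]     # start a fresh batch
        batch.append(request)

    def stage2_admit(self, source):
        # Placeholder inter-application policy (first-come first-served here).
        for req in self.batches.pop(source, []):
            self.bank_fifos[req.bank].append(req)  # stage 3: in-order per bank
```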
- April 2012 -- Energy-efficient Communication
Substrates: Our NOCS 2012 paper, "MinBD:
Minimally-Buffered Deflection Routing for Energy-Efficient
Interconnect," presents a NoC router design which has nearly
the performance (within 2.7%) of a conventional buffered
interconnect design (which uses large, power-hungry input
buffers), by performing deflection routing to handle contention, but also using a
small side-buffer to hold some of the traffic which would have been
deflected. Hence the simplicity and low cost of deflection-based
routers is retained, while deflection rate is significantly
reduced. We show that by designing the router in this way, and by
addressing several other bottlenecks in earlier bufferless deflection
routers (using our CHIPPER work in HPCA 2011 as an example),
minimally-buffered deflection routers show very good performance and
energy efficiency. We show in our evaluations that MinBD is the most
energy-efficient router among a wide variety of past router designs
which we evaluated, including input-buffered, pure bufferless, and a
hybrid buffered-bufferless router, with lower area and
power. You can view our slides here: pptx.
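As a concrete illustration of the side-buffer idea, the sketch below is ours; the buffer depth and the arbitration details are assumptions, not the paper's pipeline.

```python
# Ours, not the paper's pipeline: when two flits want the same output port,
# buffer the loser in a small side buffer instead of deflecting it, and fall
# back to deflection only when the side buffer is full.
SIDE_BUFFER_DEPTH = 4  # assumed depth of a few flits

def resolve_contention(winner, loser, side_buffer, free_ports):
    """Returns (flit, output port) assignments for this cycle."""
    out = [(winner, winner.desired_port)]
    if len(side_buffer) < SIDE_BUFFER_DEPTH:
        side_buffer.append(loser)               # buffered, not deflected
    else:
        out.append((loser, free_ports.pop()))   # deflected to a leftover port
    return out
```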
- Jan. 2012 -- Scalable, Energy-Efficient Memory Systems: Our
ICAC 2011 paper, "Memory Power
Management via Dynamic Voltage/Frequency Scaling,"
demonstrates that memory systems which are provisioned for high
performance with memory-intensive applications are often overkill
for many other applications which do not require as much memory
bandwidth. Running memory at a lower frequency has a minimal impact
on the performance of these applications, and also allows for an
operating voltage reduction, which significantly reduces memory
system power and thus increases energy efficiency. We demonstrate a
dynamic voltage/frequency scaling approach to increasing memory
system energy efficiency which observes memory bandwidth at runtime
and scales the memory frequency and voltage with this bandwidth
demand. Significantly, we evaluate this on a real server platform by
using memory controller timing registers in the Intel Nehalem, which
replicates the effect of dynamically adjustable memory frequency.
Combined with an analytical model for power savings, we show that
memory power can be reduced by 10.4% on average (20.5% max in one
workload) with an average performance impact of only 0.17%. You can view our slides
here: pptx, pdf.
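A control loop in this spirit could look like the Python sketch below. The frequency list, peak-bandwidth table, and utilization cap are all illustrative assumptions, not the paper's platform parameters.

```python
# Illustrative closed loop: each epoch, observe memory bandwidth demand and
# pick the lowest memory frequency that still leaves comfortable headroom,
# so performance impact stays small while voltage/frequency (and power) drop.
FREQS = [800, 1066, 1333]                           # assumed rates (MT/s)
PEAK_BW_GBPS = {800: 12.8, 1066: 17.1, 1333: 21.3}  # assumed two-channel peaks
UTIL_CAP = 0.5  # assumed: keep utilization below 50% of peak

def pick_frequency(observed_bw_gbps):
    for f in FREQS:  # try the lowest frequency first
        if observed_bw_gbps <= UTIL_CAP * PEAK_BW_GBPS[f]:
            return f
    return FREQS[-1]  # demand is high: run memory at full speed

print(pick_frequency(2.0))   # light workload -> 800
print(pick_frequency(10.0))  # heavy workload -> 1333
```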
- Dec. 2011 -- Scalable Memory Systems: Our latest work on memory
interference handling, "Reducing Memory
Interference in Multicore Systems via Application-Aware Memory Channel
Partitioning", was presented at MICRO 2011 in Porto Alegre, Brazil. You can
view our slides here.
Inter-application interference at the main memory is a major impediment to
individual application and system performance. Many past works, including ours,
have addressed this problem by application-aware request reordering in the memory controller. This
paper presents a fundamentally different approach to this
problem: application-aware Memory Channel Partitioning (MCP). The key idea of
MCP is to map the data of badly-interfering applications to different memory
channels. MCP performs slightly better than the current best memory request
scheduling policy while requiring no changes to the memory controller. We also
observe that inter-application interference can be mitigated even better with a
combination of memory channel partitioning and request scheduling. We propose an
Integrated Memory Partitioning and Scheduling (IMPS) mechanism that improves
system performance over the current best memory request scheduler, while
incurring minimal hardware complexity.
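The sketch below conveys the partitioning idea at its crudest. The intensity metric, the threshold, and the even channel split are our assumptions; the paper's algorithm uses more refined application characteristics.

```python
# Our crude illustration: split applications into badly-interfering groups
# (here, by memory intensity alone) and give each group its own channels, so
# the groups' requests never contend at the same memory channel.
def partition_channels(apps, num_channels, threshold=10.0):
    """apps: dict name -> memory intensity (e.g., misses per 1K instructions).
    Returns dict name -> channels the OS should allocate that app's pages from."""
    heavy = [a for a, mpki in apps.items() if mpki > threshold]
    half = max(1, num_channels // 2)
    heavy_ch, light_ch = list(range(half)), list(range(half, num_channels))
    return {a: (heavy_ch if a in heavy else light_ch) for a in apps}

print(partition_channels({"mcf": 40.0, "gcc": 2.0, "lbm": 25.0}, 4))
```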
- Oct. 2011 -- Efficient Cache Management:
Caches are critical to performance in modern microprocessors/systems.
Unfortunately, not all blocks inserted into the cache are reused later,
which significantly degrades the performance benefit of the cache. We propose a new
mechanism, VTS-cache, to predict how likely it is that a missed block will be
reused if it is inserted into the cache and use this prediction to decide at
what location in the cache the block should be inserted. VTS-cache uses the
recency of eviction of a block to predict its future reuse behavior. We
provide a practical, low-cost implementation of VTS-cache, without modifying
the existing cache structure. Our technical report, "Improving Cache
Performance using Victim Tag Stores," describes the mechanism and shows
that VTS-cache outperforms five state-of-the-art proposals.
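The sketch below illustrates the victim-tag-store idea in Python; the capacity and the FIFO-style aging of old victim tags are our assumptions, and the real design uses a much more compact structure that leaves the cache itself unmodified.

```python
# Ours, simplified: remember tags of recently evicted blocks; a missed block
# found here was evicted recently, so it is predicted to be reused and is
# inserted with high priority, while other blocks are inserted with low priority.
from collections import OrderedDict

class VictimTagStore:
    def __init__(self, capacity=1024):     # assumed capacity
        self.tags = OrderedDict()
        self.capacity = capacity

    def on_eviction(self, tag):
        self.tags[tag] = True
        self.tags.move_to_end(tag)
        if len(self.tags) > self.capacity:
            self.tags.popitem(last=False)  # forget the oldest victim

    def insertion_priority(self, missed_tag):
        if self.tags.pop(missed_tag, None):  # evicted recently -> likely reused
            return "high"                    # e.g., insert near MRU
        return "low"                         # e.g., insert near LRU
```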
- Oct. 2011 -- Energy-efficient Communication Substrates:
We are designing and evaluating on-chip interconnects with new designs
that provide high performance with very simple router hardware, and
low energy and area overhead. In our recent tech report,
"A High-Performance Hierarchical Ring On-Chip Interconnect with Low-Cost
Routers," we show that a hierarchy of rings on-chip provides nearly
the same performance as a baseline high-performance mesh network,
while using very simple ring routers and minimal buffering. The key
insight is to use a high-bandwidth global ring to join several smaller
local rings, and connect rings with simple transfer or "bridge"
routers. The global ring allows quick cross-chip journeys and
alleviates interference seen in both meshes and single-ring
designs. Our technical report provides solutions to ensure forward
progress in such a network (i.e., avoid livelock and deadlock), and
demonstrates with synthesis-based hardware modeling that such designs
are practical.
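As a tiny illustration of how traffic traverses such a hierarchy, the routing sketch below is ours, with made-up node addressing, not the report's routing logic.

```python
# Our illustration of the traversal: same-ring traffic stays local, while
# cross-ring traffic rides the local ring to a bridge, crosses the global
# ring, and descends at the destination's bridge.
def route(src, dst):
    """src/dst are (local_ring_id, node_id) pairs; returns ring segments used."""
    if src[0] == dst[0]:
        return [("local", src[0])]           # same local ring
    return [("local", src[0]),               # to the source ring's bridge
            ("global", None),                # cross-chip hop on the global ring
            ("local", dst[0])]               # bridge down to the destination

print(route((0, 3), (0, 5)))  # stays on local ring 0
print(route((0, 3), (2, 1)))  # local 0 -> global -> local 2
```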
- Sep. 2011 -- Heterogeneous Main Memory with New Technologies:
We are designing heterogeneous main memory systems that
consist of multiple memory technologies, e.g., Phase Change Memory
(PCM) and DRAM, to achieve high energy efficiency and overcome DRAM
technology scaling challenges. We have developed a new way of managing
data placement in such a memory system with the goal of achieving the
best characteristics of both PCM and DRAM technologies. The main idea
of our work is to dynamically identify and place data that cause
frequent row buffer miss accesses in DRAM, and data that do not in
PCM. The key insight behind this approach is that data which generally
hit in the row buffer can take advantage of the large memory capacity
that PCM has to offer, and still be accessed as quickly as if the data
were placed in DRAM. Our technical
report, "Row
Buffer Locality-Aware Data Placement in Hybrid Memories,"
describes our mechanism and results in detail.
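The sketch below captures the spirit of the placement policy; the miss-count threshold and the per-row counters are our assumptions, not the report's exact mechanism.

```python
# Our simplification: count row-buffer misses to each PCM row and migrate a
# row to DRAM once it misses often, since row-buffer hits are served equally
# fast from PCM and DRAM but misses are much costlier in PCM.
MISS_THRESHOLD = 4  # assumed migration threshold

def on_pcm_access(row, hit_in_row_buffer, miss_counts, migrate_to_dram):
    if hit_in_row_buffer:
        return                      # hits are fine to keep serving from PCM
    miss_counts[row] = miss_counts.get(row, 0) + 1
    if miss_counts[row] >= MISS_THRESHOLD:
        migrate_to_dram(row)        # frequent misses: better served by DRAM
        del miss_counts[row]
```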
- Feb. 2011 -- Energy-efficient Communication Substrates: Our
latest work on efficient router design, "CHIPPER: A Low-Complexity Bufferless
Deflection Router," was presented at HPCA 2011 in San Antonio, Texas.
You can view our slides here.
We are designing energy-efficient communication substrates to
enable the scaling of a parallel multiprocessor to a large number of
nodes under a given power budget. To this end, we are examining the
design of very efficient routers. This paper presents a simple
bufferless deflection router that is competitive in operating
frequency with buffered routers. It solves two key issues in
deflection router design, livelock freedom and packet reassembly, with
simple mechanisms.
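For readers new to deflection routing, the generic sketch below shows the basic invariant: every arriving flit leaves on some port each cycle, and arbitration losers are deflected. This is our illustration of the general scheme, not CHIPPER's hardware, which realizes it with a cheap permutation network and a "golden packet" priority for livelock freedom.

```python
# A generic deflection-routing cycle (ours, not CHIPPER's hardware): every
# flit must leave on some output port; flits that lose arbitration for their
# desired port are deflected to whatever ports remain.
def route_cycle(flits, ports=("N", "E", "S", "W")):
    """flits: list of (flit_id, desired_port), in priority order.
    Returns dict flit_id -> assigned output port."""
    assignment, free = {}, list(ports)
    for fid, want in flits:          # winners take their desired ports
        if want in free:
            free.remove(want)
            assignment[fid] = want
    for fid, _ in flits:             # losers are deflected, never buffered
        if fid not in assignment:
            assignment[fid] = free.pop()
    return assignment

print(route_cycle([(1, "N"), (2, "N"), (3, "E")]))  # flit 2 gets deflected
```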
- Dec. 2010 -- Scalable Memory Controllers: Our latest memory
scheduling algorithm, "Thread Cluster Memory Scheduling: Exploiting
Differences in Memory Access Behavior," was presented at MICRO 2010 in
Atlanta, Georgia. You can view our slides here.
Memory schedulers in multi-core systems should carefully schedule memory
requests from different threads to ensure high system performance and fast,
fair progress of each thread. The paper provides an application-aware memory
access scheduling algorithm that maximizes system throughput and fairness at
the same time, outperforming all previous algorithms in both metrics. The main
idea is to dynamically divide threads into two separate clusters
(latency-sensitive and bandwidth-sensitive) and employ different memory request
scheduling policies in each cluster such that the needs of different kinds of
threads are served separately.
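The clustering step can be sketched as below; the 10% bandwidth cap and the intensity metric are our assumptions standing in for the paper's tuned thresholds, and the distinct per-cluster scheduling policies (the paper's real substance) are omitted.

```python
# Our simplification of the clustering step: the least memory-intensive
# threads join the latency-sensitive cluster (prioritized for fast progress)
# until the cluster's combined bandwidth share reaches a small cap; all
# remaining threads form the bandwidth-sensitive cluster.
CLUSTER_BW_CAP = 0.10  # assumed cap on the latency cluster's bandwidth share

def cluster_threads(bw_share):
    """bw_share: dict thread -> fraction of total memory bandwidth used."""
    latency, used = [], 0.0
    for t in sorted(bw_share, key=bw_share.get):  # least intensive first
        if used + bw_share[t] > CLUSTER_BW_CAP:
            break
        latency.append(t)
        used += bw_share[t]
    bandwidth = [t for t in bw_share if t not in latency]
    return latency, bandwidth

print(cluster_threads({"A": 0.02, "B": 0.05, "C": 0.43, "D": 0.50}))
```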