What is SAFARI?
SAFARI is the research group of Professor Onur Mutlu in the Computer
Architecture Lab (CALCM) at Carnegie Mellon University. We investigate
safe, fair, robust, and intelligent computer architecture, finding novel
ways to provide a substrate with all of these properties for
next-generation multicore and manycore systems.
- Safe: We aim to provide secure execution and
performance isolation to concurrently executing programs in
manycore systems, minimizing the potential for starvation,
denial-of-service attacks, information leakage, and hardware-based
security vulnerabilities. We also investigate architectures that
provide support for the development and efficient execution of
secure software.
- Fair: We investigate mechanisms to provide
fairness in shared-resource management, ensuring that multi-user
and multi-programmed systems distribute computational power in a
manner that is consistent with the system's fairness and Quality of
Service policies. The mechanisms we develop provide a hardware
substrate that the system software or the user/programmer can control
to enforce a variety of Quality of Service policies, while
maximizing performance.
- Robust: Our systems should, in addition to
optimizing for safety and fairness, be robust against both
software and hardware faults that may occur, whether malicious or
natural.
- Intelligent: We believe that the most effective
approach to providing a system substrate with these properties is
to combine simple, flexible hardware measurement and control
mechanisms with intelligent, adaptive system software. This
hardware/software approach should preserve both the performance of
hardware-assisted techniques and the flexibility and capability of
software-based mechanisms. Scalability, energy-efficiency, and
high performance are also key concerns in an intelligent
architecture, and we strive to design novel, scalable techniques
that enable flexible performance/energy tradeoffs.
We thank our sponsors: AMD, CyLab, FCRP GSRC, Intel, NIH, NSF, Oracle,
and Samsung.
SAFARI News
- Jan. 2012 -- Scalable, Energy-Efficient Memory Systems: Our
ICAC 2011 paper, "Memory Power
Management via Dynamic Voltage/Frequency Scaling,"
demonstrates that memory systems which are provisioned for high
performance with memory-intensive applications are often overkill
for many other applications which do not require as much memory
bandwidth. Running memory at a lower frequency has a minimal impact
on the performance of these applications, and also allows for an
operating voltage reduction, which significantly reduces memory
system power and thus increases energy efficiency. We demonstrate a
dynamic voltage/frequency scaling approach to increasing memory
system energy efficiency which observes memory bandwidth at runtime
and scales the memory frequency and voltage with this bandwidth
demand. Significantly, we evaluate this on a real server platform by
using memory controller timing registers in the Intel Nehalem, which
replicates the effect of dynamically adjustable memory frequency.
Combined with an analytical model for power savings, we show that
memory power can be reduced by 10.4% on average (20.5% max in one
workload) with only a 0.17% impact on performance. You can view our slides
here: pptx, pdf.
- Dec. 2011 -- Scalable Memory Systems: Our latest work on memory
interference handling, "Reducing Memory
Interference in Multicore Systems via Application-Aware Memory Channel
Partitioning," was presented at MICRO 2011 in Porto Alegre, Brazil. You can
view our slides here.
Inter-application interference at the main memory is a major impediment to
individual application and system performance. Many past works, including ours,
have addressed this problem by application-aware request reordering in the memory controller. This
paper presents a fundamentally different alternative approach to address this
problem - application-aware Memory Channel Partitioning (MCP). The key idea of
MCP is to map the data of badly-interfering applications to different memory
channels. MCP performs slightly better than the current best memory request
scheduling policy while involving no changes to the memory controller. We also
observe that inter-application interference can be mitigated even better with a
combination of memory channel partitioning and request scheduling. We propose an
Integrated Memory Partitioning and Scheduling (IMPS) mechanism that improves
system performance over the current best memory request scheduler, while
incurring minimal hardware complexity.
- Oct. 2011 -- Efficient Cache Management:
Caches are critical to performance in modern microprocessors/systems.
Unfortunately, not all blocks inserted into the cache are reused later,
which significantly reduces the performance benefit of a cache. We propose a new
mechanism, VTS-cache, to predict how likely it is that a missed block will be
reused if it is inserted into the cache and use this prediction to decide at
what location in the cache the block should be inserted. VTS-cache uses the
recency of eviction of a block to predict its future reuse behavior. We
provide a practical, low-cost implementation of VTS-cache, without modifying
the existing cache structure. Our technical report, "Improving Cache
Performance using Victim Tag Stores," describes the mechanism and shows
that VTS-cache outperforms five state-of-the-art proposals.
- Oct. 2011 -- Energy-efficient Communication Substrates:
We are designing and evaluating new on-chip interconnects that
provide high performance with very simple router hardware and
low energy and area overhead. In our recent tech report,
"A High-Performance Hierarchical Ring On-Chip Interconnect with Low-Cost
Routers," we show that a hierarchy of rings on-chip provides nearly
the same performance as a baseline high-performance mesh network,
while using very simple ring routers and minimal buffering. The key
insight is to use a high-bandwidth global ring to join several smaller
local rings, and connect rings with simple transfer or "bridge"
routers. The global ring allows quick cross-chip journeys and
alleviates interference seen in both meshes and single-ring
designs. Our technical report provides solutions to ensure forward
progress in such a network (i.e., avoid livelock and deadlock), and
demonstrates with synthesis-based hardware modeling that such designs
are practical.
- Oct. 2011 -- Energy-efficient Communication Substrates:
We are investigating ways to improve the performance and energy
efficiency of bufferless deflection-based on-chip interconnects
further. Our recent technical report
"MinBD: A Minimally-Buffered Deflection Router Approaching
Conventional Buffered-Router Performance" presents a design
that comes within 4.6% of the performance of conventional
interconnects with large virtual-channel buffers. It does so by using
primarily deflection routing to handle contention, with only a small
buffer per router to assist under high load. We show
that by addressing several simple yet important bottlenecks in earlier
bufferless deflection routers (using our CHIPPER work in HPCA 2011 as
an example), bufferless or minimally-buffered deflection routers show
very good performance and energy efficiency. Finally, this report
shows that our router design principles are applicable to high-radix
routers as well, and that such routers can still provide energy
savings in networks where performance degradation is already addressed
by data-locality mapping techniques.
- Sep. 2011 -- Heterogeneous Main Memory with New Technologies:
We are designing heterogeneous main memory systems that
consist of multiple memory technologies, e.g., Phase Change Memory
(PCM) and DRAM, to achieve high energy efficiency and overcome DRAM
technology scaling challenges. We have developed a new way of managing
data placement in such a memory system with the goal of achieving the
best characteristics of both PCM and DRAM technologies. The main idea
of our work is to dynamically identify data that cause frequent row
buffer misses and place them in DRAM, while placing data that do not in
PCM. The key insight behind this approach is that data which generally
hit in the row buffer can take advantage of the large memory capacity
that PCM has to offer, and still be accessed as quickly as if the data
were placed in DRAM. Our technical
report, "Row
Buffer Locality-Aware Data Placement in Hybrid Memories,"
describes in detail our mechanism and results.
- Feb. 2011 -- Energy-efficient Communication Substrates: Our
latest work on efficient router design, "CHIPPER: A Low-complexity Bufferless
Deflection Router," was presented at HPCA 2011 in San Antonio, Texas.
You can view our slides here.
We are designing energy-efficient communication substrates to
enable the scaling of a parallel multiprocessor to a large number of
nodes under a given power budget. To this end, we are examining the
design of very efficient routers. The paper presents a simple
bufferless deflection router that is competitive in operating
frequency with buffered routers. It solves two key issues in
deflection router design, livelock freedom and packet reassembly, with
simple mechanisms.
- Dec. 2010 -- Scalable Memory Controllers: Our latest memory
scheduling algorithm, "Thread Cluster Memory Scheduling: Exploiting
Differences in Memory Access Behavior," was presented at MICRO 2010 in
Atlanta, Georgia. You can view our slides here.
Memory schedulers in multi-core systems should carefully schedule memory
requests from different threads to ensure high system performance and fast,
fair progress of each thread. The paper provides an application-aware memory
access scheduling algorithm that maximizes system throughput and fairness at
the same time, outperforming all previous algorithms in both metrics. The main
idea is to dynamically divide threads into two separate clusters
(latency-sensitive and bandwidth-sensitive) and employ different memory request
scheduling policies in each cluster such that the needs of different kinds of
threads are served separately.
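
As a rough illustration of the clustering step described in the Thread Cluster Memory Scheduling item above, the sketch below splits threads into a latency-sensitive and a bandwidth-sensitive cluster. It is a minimal sketch under stated assumptions, not the paper's implementation: the per-thread metrics (`mpki`, `bandwidth`), the `cluster_thresh` parameter, and the rule of filling the latency-sensitive cluster with the least memory-intensive threads up to a fraction of total bandwidth are all hypothetical choices made for this example. The per-cluster scheduling policies are left out.

```python
from dataclasses import dataclass


@dataclass
class ThreadStats:
    # All fields are hypothetical, chosen only for this illustration.
    name: str
    mpki: float       # memory intensity (misses per kilo-instruction)
    bandwidth: float  # bandwidth used in the last interval, arbitrary units


def cluster_threads(threads, cluster_thresh=0.2):
    """Split threads into (latency_sensitive, bandwidth_sensitive).

    Assumed rule: walk threads from least to most memory-intensive and
    admit them to the latency-sensitive cluster while their combined
    bandwidth stays within cluster_thresh of the total. Once one thread
    does not fit, it and all more intensive threads go to the
    bandwidth-sensitive cluster.
    """
    total_bw = sum(t.bandwidth for t in threads)
    budget = cluster_thresh * total_bw

    latency_sensitive, bandwidth_sensitive = [], []
    used = 0.0
    cluster_full = False
    for t in sorted(threads, key=lambda t: t.mpki):
        if not cluster_full and used + t.bandwidth <= budget:
            latency_sensitive.append(t)
            used += t.bandwidth
        else:
            cluster_full = True
            bandwidth_sensitive.append(t)
    return latency_sensitive, bandwidth_sensitive


if __name__ == "__main__":
    demo = [
        ThreadStats("a", mpki=0.5, bandwidth=1.0),
        ThreadStats("b", mpki=1.0, bandwidth=1.0),
        ThreadStats("c", mpki=20.0, bandwidth=8.0),
        ThreadStats("d", mpki=30.0, bandwidth=10.0),
    ]
    latency, bandwidth = cluster_threads(demo)
    print("latency-sensitive:", [t.name for t in latency])
    print("bandwidth-sensitive:", [t.name for t in bandwidth])
```

Because the clustering is meant to be dynamic, a function like this would be re-run at the start of every scheduling interval with fresh statistics, so a thread can move between clusters as its memory behavior changes.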