qos_bufferless:proposal - 15-740/18-740 Computer Architecture

- Project Proposal

Project Proposal

Problem Definition & Motivation

On-chip Interconnection networks increase bandwidth efficiency when routing by introducing buffers in the hardware. Despite the increased bandwidth efficiency enabled by the buffers, they consume between 30-40% of total system power [1], and up to 75% of total on-chip network area [2]. To overcome this, deflection/bufferless routing [3] has been proposed [1] which will place a flit (packet fragment) on an available output port, even if it is not the optimal output port due to contention. By doing so, energy is reduced by 37% at low flit injection rates, and 25% at high injection rates, without incurring significant additional latencies to the flits [1]. While bufferless routing helps in these areas, it complicates QoS and fairness in such a network since there is no notion of a flow: a flit comes in, and it is immediately deflected. The improvements made upon energy efficiency and on-chip area, coupled with the lack of QoS and fairness, serves as the motivation for our work. We wish to enable some level of guarantees in the on-chip network, while achieving greater efficiency. This is important in current networks where data from applications may be delay sensitive, or bandwidth sensitive. A process or a thread should be able to define such guarantees on data that it is using the on-chip interconnection network to transfer. Our goal is to develop new routing algorithms and mechanisms in bufferless on-chip interconnection networks which can provide bandwidth guarantees (e.g., 40% of utilization to flow X) and delay bounds (e.g., flit delivery <100us).

Related Work

While not designed for bufferless on-chip interconnection networks, the Core-Stateless Fair Queueing (CSFQ) algorithm [4] achieves fair queueing in the Internet network core without the need for per-flow state and management. This is a key requirement of of allowing QoS in bufferless networks, as per-flow state requires memory which we are trying to reduce to save energy and on-chip area. Although designed for data packet networks, the ideas can easily be adopted in on-chip interconnection networks. While CSFQ achieves fair queueing (i.e., all flows receive equal bandwidth), it does not allow for prioritzation of flows, where flows can be guaranteed different amounts of bandwidth. We look to possibly extend the CSFQ algorithm to provide this and better adopt it for on-chip interconnections. Additionally, CSFQ does not provide any mechanism for latency guarantees. Lee et al. [5] propose Globally-Synchronized Frames (GSF) for providing guarantees in on-chip interconnection networks, controlling a deadline assignment policy for prioritization. However, GSF requires multiple frame buffers to implement this deadline assignment policy, conflicting with our bufferless requirement. We look to extend GSF, along with CSFQ to allow guarantees in bufferless interconnects by overcoming their various conflicts with our requirements.

Solving the Problem

Our high-level approach to introducing guarantees in bufferless interconnects is as follows. First, we look to take the basic idea of inserting a time in our packet headers used by GSF for deadline assignment policy, but we will remove the frame buffers (satisfying the bufferless requirement) and focus on latency requirements rather than bandwidth requirements, as GSF had. To focus on latency requirements and not require per-flow state, we plan to allow a source (e.g., host or edge router) to place a maximum time to service (MTTS) value in the header, for example 100us. Additionally, another field is required in the header: last hop time (LHT). In an attempt to provide latency guarantees in the bufferless interconnects, each hop will use the MTTS and LHT values to determine which available output port the packet should take. At each hop, the LHT is subtracted from the current time to get the time in transit between hops, and then subtracts this from MTTS. The resulting MTTS is the approximate time until the latency deadline has passed. This can be used to prioritize the flit across output ports. This approach requires no per-flow state, satisfying the buffer/memory requirement, while achieving some latency guarantees (which we will quantify).
To achieve bandwidth guarantees, we look to somehow extend the CSFQ algorithm. This is rather difficult, as the basis of fair queuing is splitting the estimated available bandwidth evenly. To do so, the algorithm simply needs to know its link capacity and size of its queue in a data packet network router. However, bufferless interconnects have no queues and we want to achieve uneven bandwidth guarantees. We unfortunately do not have a solid idea we would like to take to achieve this yet, which we hope to discuss with the TAs and professor.

Experimental Setup

Our approach can be evaluated either in software, via simulation, or in hardware, via a field-programmable gate array (FPGA) implementation. There are several options for simulation environments, including the Garnet and Orion simulators, as well as an approach detailed by Moscibroda et al [1] that uses a cycle-accurate interconnection network simulator connected to a cycle-accurate x86 simulator. The Garnet simulator would allow us to make detailed modifications to interconnect logic and accurately evaluate the resulting network performance. The Orion simulator, while also allowing us to make changes to the interconnect logic, focuses more on evaluating the power usage of the network. Finally, employing the approach used by Moscibroda et al [1], while somewhat more complex, would allow us to make highly accurate evaluations of our new algorithms while providing a direct comparison of, and extension to, prior work.
Evaluating our approach via hardware would require us to assume the power benefits shown by Moscibroda et al [1] and focus purely on the accuracy of the guarantees provided by our new algorithms. An implementation on an FPGA provides realistic hardware flexibilities and limitations for designing a router, as well as possibly a small network of such routers. This would allow us to initially evaluate the feasibility of a hardware bufferless router, accurately evaluate the performance of such a router in a realistic on-chip hardware environment, as well as specifically evaluating the QoS characteristics of our algorithms.

Research Plan

Our goal is to implement novel QoS techniques for bufferless interconnect networks and evaluate their performance, defined as their ability to provide the defined guarantees. The first step towards achieving our goal is to outline in detail our new algorithms. We have a decent idea for latency guarantees, but bandwidth guarantees are much more difficult, which we need to research in more depth by possibly finding more related work. For milestone 1, we should have decided upon our experimental infrastructure, and even have it modified to provide us a bufferless interconnect network. For example, if we choose an FPGA, by milestone 1 we hope to build standard routing code to be bufferless. We could also hopefully make headway on implementing one of the two QoS mechanisms. By milestone 2, we should have both QoS mechanisms implemented and be able to provide preliminary evaluation on them. If we are unable to achieve our expected QoS guarantees based on the preliminary results of milestone 2, we need to further modify our QoS algorithms, possibly exploring completely new approaches.

[1] Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA '09 submission.
[2] P. Gratz, C. Kim, R. McDonald, S. W. Keckler, and D. Burger. Implementation and evaluation of on-chip network architectures. In ICCD, 2006.
[3] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[4] Ion Stoica, Scott Shenker, Hui Zhang, “Core-Stateless Fair Queueing: A Scalable Architecture to Approximate Fair Bandwidth Allocations in High Speed Networks”, SIGCOMM'98.
[5] Lee, Ng, and Asanovic, “Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks,” ISCA 2008.

Back to Project Page

Project Proposal

Problem Definition & Motivation

Related Work

Solving the Problem

Experimental Setup

Research Plan

Views

Navigation

Personal Tools

Search

Toolbox