
\chapter{Cor-C Architecture and Compiler} \label{sec:language}

\renewcommand{\epigraphflush}{flushright} \renewcommand{\epigraphwidth}{4.5in} \renewcommand{\epigraphrule}{0pt} \epigraph{\hfill\textit{I have stopped reading Stephen King novels. Now I just read C code instead.}}{Richard O'Keefe} \vspace{10pt}

The Cor-C architecture specification is an \textit{instance} of the CoRAM concept, which establishes all of the requisite details, data types, constraints, and semantics that are necessary for a real portable hardware/software interface\footnote{An appropriate analogy would be the MIPS ISA being an instance of the RISC concept.}. The Cor-C architecture specification defines a dialect of the C language that can be used to express the desired behavior of control thread programs. The use of a standard, high-level language such as C affords an application developer simpler and more natural expressions of control flow and memory pointer manipulations. It is important to note that Cor-C is not intended as a medium for expressing the computational components of an application; rather, it is a lightweight memory management interface that ``wraps'' a given application to facilitate portability and to reduce design effort.

This chapter begins by introducing the salient features of the Cor-C language, including data types, thread invocation and management, control actions, and the semantics of memory. A later section of this chapter describes a prototype compiler for the Cor-C specification, which compiles control thread programs into finite state machines. Chapter~\ref{sec:casestudy} will later present actual uses of the prototype compiler for developing real applications using the Cor-C language.

\section{Cor-C Overview} \label{sec:detail}

The standard collection of primitives in Cor-C is divided into \textit{static} versus \textit{dynamic} control actions. Table~\ref{tab:accessors} illustrates accessor control actions that are statically processed at compile-time, while Table~\ref{tab:memory_actions} illustrates control actions that are executed dynamically throughout the course of an application.
The control actions have the appearance of a memory management API and abstract away the details of the underlying hardware support, similar to the role served by the Instruction Set Architecture (ISA) between software and evolving hardware implementations. As will be shown later in Chapter~\ref{sec:casestudy}, the basic control actions are powerful building blocks that can be used to compose more sophisticated memory abstractions such as scratchpads, caches, and FIFOs, each of which is tailored to the memory patterns and desired interfaces of specific applications.

%The syntax and conventions shown are based on the \textbf{Cor-C architecture
%language specification}, which is a devised \textbf{instance} of the CoRAM
%architecture in this thesis and prototyped later in
%Chapter~\ref{sec:prototype}.

\begin{table}
\centering
\begin{tabular}{@{} l l @{}}
\toprule
Data type & Description \\
\midrule
\smalltt{bool} & 1-bit boolean \\
\smalltt{char, uchar} & 8-bit signed and unsigned integers \\
\smalltt{sint, suint} & 16-bit signed and unsigned integers \\
\smalltt{int, uint} & 32-bit signed and unsigned integers \\
\smalltt{int64, uint64} & 64-bit signed and unsigned integers \\
\smalltt{cpi\_channel\_ty} & An enumeration of channel object types \{reg, fifo\} \\
\smalltt{cpi\_addr} & 64-bit virtual address \\
\smalltt{cpi\_ram\_addr} & 16-bit local ram address \\
\smalltt{cpi\_hand} & Static handle for CoRAMs and channel objects \\
\smalltt{cpi\_tag} & Transaction tag for logical memory transactions \\
\bottomrule
\end{tabular}
\caption{Cor-C Data Types}
\label{tab:types}
\end{table}

\begin{table}
\centering
\small
\begin{tabular}{ l p{4.5in} }
\toprule
Control Action & Description \\
\midrule
\smalltt{cpi\_register} & Registers a control thread with name \smalltt{thread\_name} and replicates it \smalltt{N} times. \\
& \smalltt{void cpi\_register\_thread(cpi\_str thread\_name, cpi\_int N);} \\
\midrule
\smalltt{cpi\_instance} & Returns the thread ID as an \smalltt{int}. \\
& \smalltt{int cpi\_instance();} \\
\midrule
\smalltt{cpi\_get\_ram} & Returns a ram co-handle uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids. \\
& \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id);} \\
& \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, int sub\_id);} \\
& \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, ...);} \\
\midrule
\smalltt{cpi\_get\_rams} & Returns a co-handle that combines \smalltt{N} rams together as a single logical memory. When \smalltt{scatter} is enabled, the rams are combined in a word-interleaved fashion. If \smalltt{scatter} is disabled, the rams are composed linearly. The rams selected are based on \smalltt{N} consecutively numbered ids from \smalltt{id...id+N-1}, where \smalltt{id} is the last argument used in the control action. \\
& \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id);} \\
& \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, int sub\_id);} \\
& \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, ...);} \\
\midrule
\smalltt{cpi\_get\_channel} & Returns a channel co-handle based on the enumeration \smalltt{ty}. The channel is uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids. \\
& \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, int obj\_id);} \\
& \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, ...);} \\
\bottomrule
\end{tabular}
\caption{Cor-C Accessor Control Actions (Static).}
\label{tab:accessors}
\end{table}

\subsection{Control Threads}

Every application in CoRAM begins with a source-level description of control threads that act as a ``wrapper'' around the core processing logic. Control threads are written in the Cor-C language, which is syntactically identical to C~\cite{kc}. Table~\ref{tab:types} summarizes the types in the language, which include several types specific to the Cor-C language.
To begin an application, threads are declared using the \smalltt{cpi\_register} function from Table~\ref{tab:accessors}, which takes as argument a unique thread name and a scale factor that replicates the body of the containing function \smalltt{N} times. The code below illustrates how two separate Cor-C functions would be instantiated in a single program. In this example, a total of three threads would be executed during runtime (one of threadA, two of threadB).

{\footnotesize
\begin{verbatim}
// single thread
void threadA() {
  cpi_register("thread-A", 1);
  ...
}

// two threads
void threadB() {
  cpi_register("thread-B", 2);
  ...
}
\end{verbatim}
}

\subsection{Object Instantiation and Identification}

\begin{figure}

\centering
\begin{minipage}[t]{\columnwidth}
\lstinputlisting[label=lst:bbox_coram,caption=Verilog black-box definition for single-ported embedded CoRAM and Channel FIFO.]{code/bbox_coram.c}
\end{minipage}

\end{figure}

%\begin{figure} % \centering % \begin{minipage}[t]{\columnwidth} % \lstinputlisting[label=lst:bbox_cfifo,caption=Verilog black-box definition for Channel FIFO.]{code/bbox_cfifo.c} % \end{minipage} %\end{figure}

To utilize embedded CoRAMs, a designer begins with a pre-defined library of black-box module wrappers written in a specific hardware description language. Listing~\ref{lst:bbox_coram} (top) shows the Verilog port list of an embedded CoRAM with a single read-write SRAM port. Unlike a typical SRAM, the embedded CoRAM includes extra parameters specific to the Cor-C architecture specification. The \smalltt{THREAD} parameter is a string that names the control thread associated with the CoRAM. The additional field \smalltt{THREAD\_ID} is necessary for scale factors greater than 1 (i.e., when a thread is replicated with \smalltt{cpi\_register}). Finally, the \smalltt{OBJECT\_ID} and an optional list of \smalltt{SUB\_ID} parameters distinguish between multiple CoRAMs managed by a single control thread instance. In addition to the CoRAMs, users may also instantiate channel FIFOs (shown in Listing~\ref{lst:bbox_coram}, bottom) that enable the core logic to communicate with specific control threads in the application. The convention for identifying and acquiring channel objects is the same as that for CoRAMs.

\pdf{composition.pdf} {\columnwidth} {Linear and Scatter-Gather RAM Compositions.} {Linear and Scatter-Gather RAM Compositions.} {fig:composition}

When performing accesses to memory, a control thread typically gathers one or more instantiated CoRAMs into a single, program-level identifier called the \textbf{co-handle} (\smalltt{cpi\_hand} for short). The co-handle establishes a compile-time binding between an individual control thread and a collection of one or more CoRAMs that function as a single logical unit. As with the embedded memories of conventional FPGAs, CoRAMs can be combined to form flexible aspect ratios and capacities. Figure~\ref{fig:composition} illustrates how multiple CoRAMs are composed to form a single RAM with deeper entries (called linear) or a single RAM with wider data words (called scatter/gather). The composition is declared in a control thread using the \smalltt{cpi\_get\_rams} accessor from Table~\ref{tab:accessors}, which takes as arguments the number of CoRAMs \smalltt{N}, an option to compose the CoRAMs linearly or in scatter-gather mode, and the base \smalltt{obj\_id} (plus an optional list of sub-ids) that uniquely identifies a range of CoRAMs.
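The two compositions in Figure~\ref{fig:composition} can be read as two address mappings. The following is a minimal software sketch of that reading, not part of the Cor-C specification: the type \smalltt{ram\_slot}, the function \smalltt{map\_word}, and the \smalltt{depth} parameter (words per CoRAM) are hypothetical names, and the sketch assumes that ``word-interleaved'' means consecutive logical words rotate across the \smalltt{N} CoRAMs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model type: which CoRAM a logical word lands in,
 * and at which local word address. */
typedef struct {
    unsigned ram;   /* index of the CoRAM within the co-handle, 0..n-1 */
    unsigned addr;  /* word address inside that CoRAM */
} ram_slot;

/* Map a logical word index onto n CoRAMs of `depth` words each.
 * scatter == false: linear composition (fill one CoRAM, then the next).
 * scatter == true : word-interleaved composition (rotate across CoRAMs). */
static ram_slot map_word(unsigned word, unsigned n, unsigned depth, bool scatter)
{
    ram_slot s;
    assert(n > 0 && depth > 0 && word < n * depth);
    if (scatter) {
        s.ram  = word % n;      /* neighbouring words go to neighbouring CoRAMs */
        s.addr = word / n;      /* all CoRAMs share the same local address */
    } else {
        s.ram  = word / depth;  /* the first `depth` words stay in CoRAM 0 */
        s.addr = word % depth;
    }
    return s;
}
```

With two CoRAMs of 32 words each, logical word 3 maps to CoRAM 0 at address 3 in the linear case and to CoRAM 1 at address 1 in the scatter case, which is what lets the fabric read words 2 and 3 in the same cycle as one wide word.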

\subsection{Memory Control Actions}

The basic role of the control thread is to perform memory operations upon co-handles and to inform the core processing logic through channels when particular operations have completed. The most basic way to operate upon a co-handle is to pass it into a \smalltt{cpi\_write\_ram} memory control action (Table~\ref{tab:memory_actions}), which performs a logical memory transfer of \smalltt{N} bytes from the global memory address \smalltt{addr} to the local address \smalltt{ram\_addr} of the CoRAMs named by the co-handle. When completed, a sequential block of data from memory will have been split into RAM-sized words that are written in sequence according to the memory mapping of addresses across the individual CoRAMs (see Figure~\ref{fig:composition}).

\begin{table}
\centering
\small
\begin{tabular}{ l p{4.5in} }
\toprule
Control Action & Description \\
\midrule
\smalltt{cpi\_nb\_write\_ram} & Performs a non-blocking transfer of \smalltt{N} bytes from memory at address \smalltt{addr} to the \smalltt{rams} co-handle beginning at local address \smalltt{ram\_addr}. Returns a transaction tag \smalltt{cpi\_tag} which can be valid or invalid (\smalltt{CPI\_INVALID\_TAG}). If the last argument \smalltt{tag\_append} is set to a value equal to the tag of a previous non-blocking transfer, the current transfer will be appended to the previous transaction and will share the same tag. \\
& \smalltt{tag = cpi\_nb\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\
\midrule
\smalltt{cpi\_nb\_read\_ram} & Same as \smalltt{cpi\_nb\_write\_ram} except that transfers move from \smalltt{rams} to memory. \\
& \smalltt{tag = cpi\_nb\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\
\midrule
\smalltt{cpi\_write\_ram} & Same as \smalltt{cpi\_nb\_write\_ram} except that control threads suspend until the transaction completes. \\
& \smalltt{cpi\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\
\midrule
\smalltt{cpi\_read\_ram} & Same as \smalltt{cpi\_nb\_read\_ram} except that control threads suspend until the transaction completes. \\
& \smalltt{cpi\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\
\midrule
\smalltt{cpi\_test} & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and returns a \smalltt{bool} indicating whether previous transactions associated with \smalltt{cpi\_nb\_read\_ram} or \smalltt{cpi\_nb\_write\_ram} have completed. \\
& \smalltt{bool cpi\_test(cpi\_hand rams, cpi\_tag tag);} \\
\midrule
\smalltt{cpi\_wait} & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and blocks the control thread until the previous transactions associated with \smalltt{tag} have completed. \\
& \smalltt{void cpi\_wait(cpi\_hand rams, cpi\_tag tag);} \\
\midrule
\smalltt{cpi\_bind} & Establishes a static binding between a \smalltt{rams} co-handle and a \smalltt{channel} co-handle, which will automatically deliver notifications on transactions applied to \smalltt{rams} to \smalltt{channel}. Once a \smalltt{cpi\_bind} is established, the control thread is no longer permitted to use \smalltt{cpi\_test} or \smalltt{cpi\_wait} on \smalltt{rams}. The control thread also can no longer perform \smalltt{cpi\_write\_channel} to \smalltt{channel}. \\
\bottomrule
\end{tabular}
\caption{Memory Control Actions (Dynamic).}
\label{tab:memory_actions}
\end{table}

\begin{table}
\centering
\small
\begin{tabular}{ l p{4.5in} }
\toprule
Control Action & Description \\
\midrule
\smalltt{cpi\_read\_channel} & Reads from \smalltt{channel} and returns data of type \smalltt{cpi\_int64}. The control thread will block if the \smalltt{channel} is empty. \\
& \smalltt{cpi\_int64 cpi\_read\_channel(cpi\_hand channel);} \\
\midrule
\smalltt{cpi\_write\_channel} & Writes \smalltt{data} to \smalltt{channel}. The control thread will block if the \smalltt{channel} is full. \\
& \smalltt{void cpi\_write\_channel(cpi\_hand channel, cpi\_int64 data);} \\
\midrule
\smalltt{cpi\_test\_channel} & Takes as argument a \smalltt{channel} co-handle and returns a \smalltt{bool} indicating whether the \smalltt{channel} is either empty or full (depending on the boolean option \smalltt{check\_empty}). \\
& \smalltt{bool cpi\_test\_channel(cpi\_hand channel, bool check\_empty);} \\
\bottomrule
\end{tabular}
\caption{Channel Control Actions (Dynamic).}
\label{tab:channel_actions}
\end{table}

\pdf{bind.pdf} {\largewidth} {Supporting Automatic Notification with Channel-to-CoRAM Bindings.} {Supporting Automatic Notification with Channel-to-CoRAM Bindings.} {fig:bind}

\vspace{10pt} \noindent \textbf{Blocking vs. Non-Blocking.} Memory control actions are subdivided into blocking versus non-blocking behaviors (see Table~\ref{tab:memory_actions}). The CoRAM architecture presents a behavior where sequences of ``blocking'' control actions (\smalltt{cpi\_write\_ram}, \smalltt{cpi\_read\_ram}) will appear to execute atomically ``one-at-a-time'' from the perspective of a single control thread. In some circumstances, it is desirable from a performance perspective to explicitly allow multiple outstanding control actions to proceed in parallel (i.e., to pipeline multiple address requests). Non-blocking control actions support this by immediately returning control to the thread and providing a tag that must be tested later to determine when a transaction has completed (see \smalltt{cpi\_test} and \smalltt{cpi\_wait} in Table~\ref{tab:memory_actions}). Note that in some cases, the underlying hardware may return an invalid tag, which requires the control thread to retry the transaction at a later time. A tag is held indefinitely until \smalltt{cpi\_test} or \smalltt{cpi\_wait} is called, which has the side-effect of releasing the tag when the operation returns successfully.
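The issue, retry, and release behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the pool of two tags, the fixed per-transfer latency, and every \smalltt{model\_*} name are hypothetical and are not taken from the Cor-C specification. \smalltt{MODEL\_INVALID\_TAG} stands in for \smalltt{CPI\_INVALID\_TAG}.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_INVALID_TAG (-1)  /* stands in for CPI_INVALID_TAG */
#define MODEL_NUM_TAGS    2     /* assumed number of hardware tags */

typedef struct {
    bool in_use[MODEL_NUM_TAGS];
    int  cycles_left[MODEL_NUM_TAGS];  /* remaining latency per tag */
} tag_pool;

/* Non-blocking issue: returns a free tag, or the invalid tag if none is free. */
static int model_nb_issue(tag_pool *p, int latency)
{
    for (int t = 0; t < MODEL_NUM_TAGS; t++) {
        if (!p->in_use[t]) {
            p->in_use[t] = true;
            p->cycles_left[t] = latency;
            return t;
        }
    }
    return MODEL_INVALID_TAG;
}

/* Advance simulated time by one cycle. */
static void model_tick(tag_pool *p)
{
    for (int t = 0; t < MODEL_NUM_TAGS; t++)
        if (p->in_use[t] && p->cycles_left[t] > 0)
            p->cycles_left[t]--;
}

/* Test a tag; a successful test releases it, like the side effect of cpi_test. */
static bool model_test(tag_pool *p, int tag)
{
    assert(tag >= 0 && tag < MODEL_NUM_TAGS && p->in_use[tag]);
    if (p->cycles_left[tag] > 0)
        return false;
    p->in_use[tag] = false;
    return true;
}

/* Issue `count` transfers, retrying whenever the invalid tag comes back.
 * Returns the number of simulated cycles until all transfers completed. */
static int model_run(int count, int latency)
{
    tag_pool p = {{false}, {0}};
    int pending[MODEL_NUM_TAGS];
    int npending = 0, issued = 0, done = 0, cycles = 0;

    while (done < count) {
        if (issued < count) {
            int tag = model_nb_issue(&p, latency);
            if (tag != MODEL_INVALID_TAG) {
                pending[npending++] = tag;
                issued++;
            }
            /* else: no tag was available, so retry on a later cycle */
        }
        model_tick(&p);
        cycles++;
        for (int i = 0; i < npending; ) {
            if (model_test(&p, pending[i])) {
                pending[i] = pending[--npending];  /* drop the released tag */
                done++;
            } else {
                i++;
            }
        }
    }
    return cycles;
}
```

In this model, four transfers with a latency of three cycles overlap in pairs, so the loop finishes sooner than issuing them one at a time would; the exact numbers depend entirely on the assumed pool size and latency.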

%\textbf{Non-blocking control actions are not guaranteed to execute in a %well-defined order}. Any combination of read-write memory accesses that %overlap in global address ranges will generally result in undefined behavior. %It is the responsibility of the application developer to ``wait or ``test %on specific transactions when ordering or atomicity is desired.

A common task of the control thread is to periodically inform the core logic when specific memory transactions have completed. Table~\ref{tab:channel_actions} summarizes the channel control actions that enable bidirectional communication through FIFOs and registers. A typical synchronization pattern is shown in Figure~\ref{fig:bind} (top), where (1) a control thread issues a memory control action and receives a transaction tag, (2+3) tests the transaction tag for completion, (4) writes a token to the core logic through a channel FIFO, and (5) the core logic consumes the token and processes data from the CoRAM.
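The channel semantics in Table~\ref{tab:channel_actions} can be pictured with a small ring-buffer model. The sketch below makes several assumptions: the FIFO depth of four is arbitrary (the text does not state one), all \smalltt{chan\_*} names are hypothetical, and a blocking call is modeled as a function that returns \smalltt{false} at the point where the real control action would block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHAN_DEPTH 4  /* assumed FIFO depth; not specified in the text */

typedef struct {
    int64_t  slots[CHAN_DEPTH];
    unsigned head;   /* next slot to read */
    unsigned count;  /* tokens currently queued */
} chan_model;

/* Mirrors cpi_test_channel: reports "empty" when check_empty is true,
 * and "full" otherwise. */
static bool chan_test(const chan_model *c, bool check_empty)
{
    return check_empty ? (c->count == 0) : (c->count == CHAN_DEPTH);
}

/* Returns false where cpi_write_channel would block (FIFO full). */
static bool chan_try_write(chan_model *c, int64_t data)
{
    if (chan_test(c, false))
        return false;
    c->slots[(c->head + c->count) % CHAN_DEPTH] = data;
    c->count++;
    return true;
}

/* Returns false where cpi_read_channel would block (FIFO empty). */
static bool chan_try_read(chan_model *c, int64_t *data)
{
    if (chan_test(c, true))
        return false;
    *data = c->slots[c->head];
    c->head = (c->head + 1) % CHAN_DEPTH;
    c->count--;
    return true;
}
```

A control thread following steps (1) to (4) of the pattern would call the write side once per completed transaction, while the core logic drains tokens from the read side in the order they were written.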

\vspace{10pt} \noindent\textbf{Transaction Coalescing.} The use of non-blocking transactions requires a control thread to track multiple outstanding tags, which can lead to overheads in tag state management and cycles consumed by periodic testing. The Cor-C architecture provides an optimization to reduce this overhead by allowing a memory control action to coalesce multiple transactions to an existing tag held by the thread. For example:

{\footnotesize
\begin{verbatim}
cpi_tag reused_tag = CPI_INVALID_TAG;
for(int i=0; i < 10; i++) {
  /* passing the previous tag appends this transfer to the same transaction */
  reused_tag = cpi_nb_write_ram(ramA, i, i*4, 4, reused_tag);
}
cpi_wait(reused_tag);
\end{verbatim}
}

In the example shown above, 10 non-blocking memory transactions are executed by the control thread and coalesced into a single tag. At the end of the loop, only a single \smalltt{cpi\_wait} operation is required. When passing in a re-used tag, the memory control action will merge the new transaction with the prior ones.
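One way to picture coalescing is as a per-tag counter of outstanding transfers. The model below is an illustrative assumption rather than a description of the hardware: the pool size, the counter representation, and the \smalltt{model\_*} names are all hypothetical, and \smalltt{MODEL\_INVALID\_TAG} stands in for \smalltt{CPI\_INVALID\_TAG}.

```c
#include <assert.h>

#define MODEL_INVALID_TAG (-1)  /* stands in for CPI_INVALID_TAG */
#define MODEL_NUM_TAGS    4     /* assumed number of hardware tags */

typedef struct {
    int outstanding[MODEL_NUM_TAGS];  /* transfers in flight per tag */
    int in_use[MODEL_NUM_TAGS];
} coalesce_model;

/* Issue one transfer. If tag_append names a live tag, the transfer joins it;
 * otherwise a fresh tag is allocated (or the invalid tag is returned). */
static int model_nb_write(coalesce_model *m, int tag_append)
{
    if (tag_append != MODEL_INVALID_TAG) {
        assert(tag_append >= 0 && tag_append < MODEL_NUM_TAGS);
        assert(m->in_use[tag_append]);
        m->outstanding[tag_append]++;
        return tag_append;
    }
    for (int t = 0; t < MODEL_NUM_TAGS; t++) {
        if (!m->in_use[t]) {
            m->in_use[t] = 1;
            m->outstanding[t] = 1;
            return t;
        }
    }
    return MODEL_INVALID_TAG;
}

/* Wait on a tag: every transfer merged into it is covered by this one call.
 * Returns how many transfers were covered and releases the tag. */
static int model_wait(coalesce_model *m, int tag)
{
    assert(tag >= 0 && tag < MODEL_NUM_TAGS && m->in_use[tag]);
    int covered = m->outstanding[tag];
    m->outstanding[tag] = 0;
    m->in_use[tag] = 0;
    return covered;
}

/* Same shape as the loop in the text: `count` transfers, one tag, one wait. */
static int model_coalesced_loop(int count)
{
    coalesce_model m = {{0}, {0}};
    int tag = MODEL_INVALID_TAG;
    for (int i = 0; i < count; i++)
        tag = model_nb_write(&m, tag);
    return (tag == MODEL_INVALID_TAG) ? 0 : model_wait(&m, tag);
}
```

The point of the sketch is that the thread holds a single tag regardless of the loop count, so the tag bookkeeping stays constant while the number of in-flight transfers grows.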

\vspace{10pt} \noindent\textbf{Automatic Notifications.} Another feature of the Cor-C specification is the ability to completely eliminate the need for control threads to synchronize directly with core logic. The \smalltt{cpi\_bind} control action shown in Table~\ref{tab:memory_actions} allows a control thread to establish a static binding between a CoRAM co-handle and a channel FIFO. Figure~\ref{fig:bind} illustrates the method of operation: when a control action is performed upon a specific co-handle, the associated channel FIFO will automatically enqueue a token that signals to the core logic the completion of a memory transaction. Completions are placed into the channel FIFO in the same order that transactions are issued. The \smalltt{cpi\_bind} operation reduces the overall latency of a round-trip memory access and also allows a control thread to pipeline multiple non-blocking memory requests without having to periodically test for completions.

\vspace{10pt} \noindent\textbf{Thread-to-Thread Communication.} Thread-to-thread synchronization can be provided natively in the Cor-C specification for message-passing between multiple threads. In some applications, the need for synchronization arises when dependencies must be enforced between phases of computation and when there are multiple concurrent control threads. Custom forms of synchronization can also be facilitated through the use of channels. For example, to implement a fast barrier, users can instantiate channels as needed into the soft logic fabric to implement their own desired synchronization methods.

\section{Disallowed Behaviors} Although control threads have the appearance of general-purpose software threads, there are a number of restrictions in the Cor-C specification:

\begin{itemize}

\item The static control actions listed in Table~\ref{tab:accessors} can only be executed unconditionally (e.g., cannot be conditioned by a loop variable).

\item Control threads are limited to 64 CoRAMs per co-handle\footnote{The architectural limit is placed here due to physical constraints imposed by the cluster-style microarchitecture presented in Chapter~\ref{sec:microarch}.}.

\item Control threads may not test invalid tags or perform control actions with invalid arguments.

\item Threads may not dynamically allocate memory or instantiate global variables.

\item Control threads may not dereference memory pointers directly\footnote{When a control thread needs to directly access memory, a single CoRAM along with a channel FIFO can be allocated and ``wrapped'' together to form a simple load-store interface (see Chapter~\ref{sec:casestudy}).}.

\item Control threads may not execute floating point operations.

\item Function stacks are allowed but must be statically bounded.

\item No recursion is allowed.

\end{itemize}

Many of the language restrictions above are intended to reduce the likelihood of ``abusing'' control threads for computation purposes. The various restrictions also ensure that control threads are highly amenable to lightweight implementations in hardware (i.e., synthesized directly into hardware or executed on lightweight microprocessors).

\section{Simple Example: Vector Increment}

To concretely illustrate the features of the Cor-C language, the code below gives a complete top-to-bottom example of the vector increment kernel, where a sequential array of data is read in from memory, incremented, and written back to main memory. The particular kernel in this example performs two concurrent increments per clock cycle. The control thread is listed first, followed by the Verilog core logic.

{\footnotesize
\begin{verbatim}
void vector_increment_thread() {

  cpi_register_thread("vector_add", 1/*number of threads*/);
  cpi_hand data_store = cpi_get_rams(2/*numRams*/, true/*scatter*/, 0, 0);
  cpi_hand bind_channel = cpi_get_channel(cpi_fifo, 0);
  cpi_hand done_channel = cpi_get_channel(cpi_fifo, 1);
  cpi_bind(bind_channel, data_store);
  cpi_tag tag = CPI_INVALID_TAG;

  /* Read memory */
  for(int i=0; i < 128; i+=8)
      tag = cpi_nb_write_ram(data_store, i, i*8, 8, tag);

  /* Wait for computation to finish */
  while(!cpi_read_channel(done_channel)) {}
  tag = CPI_INVALID_TAG;  /* reset the tag declared above */

  /* Writeback to memory */
  for(int i=0; i < 128; i+=8)
      tag = cpi_nb_read_ram(data_store, i, i*8, 8, tag);
  cpi_wait(tag);

}

module vector_kernel(CLK, RST_N);

  input CLK, RST_N;
  reg busy, writeback, done, dout_en, wen;
  reg [5:0] addr, waddr;
  reg [31:0] din0, din1;
  wire [31:0] dout0, dout1;

  CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/,
          0/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/) 
          arr0 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen), 
                .waddr(waddr), .addr(addr), .din(din0), .dout(dout0));

  CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/,
          1/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/)
          arr1 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen), 
                .waddr(waddr), .addr(addr), .din(din1), .dout(dout1));

  ChannelFIFO cfifo(.CLK(CLK), .RST_N(RST_N),
                    .dout_en(dout_en), .dout(0), .../*unused signals*/);

  always@(posedge CLK) begin
      if(RST_N) begin
          if(dout_rdy && !busy) begin
              writeback <= 0;
              busy <= 1;
              addr <= 0;
              waddr <= 0;
          end
          else if(busy) begin
              addr <= addr + 1;
              writeback <= (addr < 32);
              busy <= (addr < 32);
          end
          else writeback <= 0;

          if(writeback) begin
              wen <= 1'b1;
              din0 <= dout0+1;
              din1 <= dout1+1;
              waddr <= waddr+1;
              if(waddr == 31) dout_en <= 1;
          end
          else begin
              wen <= 1'b0;
              dout_en <= 1'b0;
          end
      end
      else begin
          writeback <= 0;
          dout_en <= 0;
          busy <= 0;
          wen <= 0;
      end
  end

endmodule

\end{verbatim} }

In the first phase of the control thread program above, the thread sets up a programmed transfer that reads in 128B of data from memory into 2 separate embedded CoRAMs represented by a single co-handle. To present a wide 64-bit word interface to the fabric, the co-handle is composed with the scatter-gather argument set to true. The \smalltt{cpi\_bind} operation thereafter establishes an implicit channel between the memory system and the core logic, which is the ultimate consumer of the memory data. Within the core logic, two embedded CoRAMs and a single channel FIFO are instantiated as black-box modules. As memory transactions stream through one-at-a-time, the core logic will receive tokens through the channel FIFO, indicating that data is ready for access within the CoRAMs. In the simple example above, the core logic waits until all the tokens are received before performing the increment steps. During the compute phase, the core logic reads and writes 32 clock cycles worth of data from the embedded CoRAMs. Upon completion, a token is written back from the core logic to the control thread indicating that a writeback to memory is pending. The control thread, in the background, polls the channel until it receives the token and then performs the final set of memory control actions to write data from the CoRAMs to memory.

\vspace{10pt} \noindent \textbf{Summary.} It is not difficult to imagine that many variants of control actions could be added to the Cor-C architecture specification to support more sophisticated patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch, strided access, programmable patterns, etc.). In a commercial production setting, control actions, like instructions in an ISA, must be carefully defined and preserved to achieve the value of portability and compatibility. Optimizing compilers could also play a significant role in static optimization of control thread programs. Analysis could be used, for example, to identify non-conflicting control actions that are logically executed in sequence but can actually be executed concurrently without affecting correctness. The next section describes a compiler and proof-of-concept of the Cor-C architecture specification. Chapter~\ref{sec:casestudy} will later present concrete demonstrations of Cor-C-based applications.

%Without delving into details, the Cor-C language essentially follows same basic %lexical elements as the standard C language~\cite{krbook}. %\subsection{Thread State and Memory} %Control threads can declare state variables and bounded-sized arrays through %conventional C-based syntax (e.g., \textit{int x = 0}). Under limited %circumstances, structs may also be used to organize and group variables %together. %Global variables declared outside the scope of a function are \textbf{not %permitted} in the Cor-C specification. Control threads are also not permitted %to directly access the global main memory through dereferencing of pointers. %This limitation of control threads can be addressed through specialized %memory personalities, which will be later described in Section\ref{sec}.

\section{The CoRAM Control Compiler (CORCC)}

\pdf{control_options.pdf} {\largewidth} {Options for Synthesizing Control Threads.} {Options for Synthesizing Control Threads.} {fig:control_options}

\pdf{corcc_example.pdf} {\xcolumnwidth} {CORCC Example.} {CORCC Example.} {fig:corcc_example}

The CoRAM Control Compiler (CORCC) was developed in this thesis to explore the various implementation options for control threads. Figure~\ref{fig:control_options} shows several ways in which a control thread program can be mapped down into control logic in an FPGA with CoRAM support: (1) directly compiling control thread programs into soft logic state machines via high-level synthesis, (2) compiling control threads to pre-implemented soft microprocessor cores (e.g., Xilinx MicroBlaze~\cite{xilinx} or Altera Nios~\cite{altera}), or (3) compiling to a hard microprocessor serving as a dedicated microcontroller.

CORCC supports direct synthesis of control threads into synthesizable RTL from standard C code and can also be configured to model the cycle-time performance of a simple microprocessor. The implementation of CORCC leverages the Low Level Virtual Machine (LLVM) framework~\cite{llvm}, which is an open-source, end-to-end compiler with pluggable extensions for custom passes and backends. CORCC leverages the modularity of LLVM and its language-independent intermediate representation (IR) to implement a simple form of high-level synthesis with extensions for microprocessor performance modeling.

\vspace{10pt} \noindent \textbf{Implementation.} The CORCC LLVM extension is implemented in 6000 lines of C++ as a series of LLVM passes. CORCC extends LLVM with special objects and data types that are specific to the Cor-C architecture specification. These include CoRAM and channel accessors, co-handles, and the memory/channel control actions. LLVM includes front-ends for several popular languages such as C and C++ and can automatically generate an intermediate representation (IR) in Static Single Assignment (SSA) form. In SSA form, each variable in a routine is assigned exactly once, which is useful for various optimizations and for simplifying the properties of variables. An important feature of the IR is the LLVM type system, which makes high-level program information accessible at the assembly level. The first stage of CORCC is automatically handled by LLVM, which translates the high-level control thread program into IR organized into basic blocks. The assembly language employed by LLVM comprises about 70 instructions~\cite{llvm}, of which a subset of about 30 are supported in CORCC\footnote{Use of unsupported instructions results in compile-time errors in CORCC.}.

\vspace{10pt} \noindent \textbf{Thread-to-Hardware Interface.} The control threads of an application, which exist either in the form of soft finite state machines or as microcontrollers, can be viewed as clients that issue memory requests to the underlying memory subsystem comprising embedded CoRAMs, the network-on-chip, and the edge memory interfaces (as illustrated earlier in Figure~\ref{fig:control_options}). CORCC assumes that in a soft implementation of control threads, a well-defined request-response interface exists between the underlying subsystem and the control threads implemented in the fabric. The details of such interfaces are described further in Chapter~\ref{sec:microarch}.

\vspace{10pt} \noindent \textbf{Co-handle Pass.} The first stage of CORCC performs a sweep through the LLVM-generated IR and identifies \smalltt{call} instructions that match the function signatures of static control actions such as co-handle and channel accessors (see Table~\ref{tab:accessors}), which are used to establish bindings to various CoRAM-related objects. In LLVM, the return value of any function is assigned to a register identified by a unique integer. Within CORCC, the static pass creates an internal map between an identified co-handle and its corresponding destination register. When processing a co-handle, the function arguments are checked to be constant and valid values. During this pass, any dynamic control actions are annotated and linked against the detected co-handles. The link step backtraces through registers in the IR to identify the specific co-handle associated with a dynamic control action.

\vspace{10pt} \noindent \textbf{Thread Synthesis.} Once all co-handles have been identified, CORCC performs a synthesis step that translates the LLVM instructions and the Cor-C dynamic control actions into synthesizable Verilog. The basic approach taken by CORCC is to perform a direct mapping of basic blocks into single-cycle states in a finite state machine. To handle register state, CORCC instantiates a physical register for each assigned variable in a program. To implement logic, all of the instructions in a basic block are converted into combinational statements, where the inputs to the logic are read from registers in a single clock cycle (and in the same clock cycle, the output is written to the destination registers). The SSA form of LLVM guarantees that no registers are read and written at the same time within a single basic block.

To handle dynamic control actions, special states are introduced at locations in the LLVM IR where \smalltt{call} instructions are detected. For example, when a \smalltt{cpi\_write\_ram} function call is detected, the parent basic block will be split into two states, one containing the original instructions and the other handling the invocation of the control action. The predecessor basic block will always jump into the special state first, which handles the actual issue of the control action to the memory subsystem through a request-response interface; thereafter, the FSM jumps into the original basic block while returning the value of the control action. Figure~\ref{fig:corcc_example} gives a complete example of compiling the simple example from Chapter~\ref{sec:coram} into synthesizable hardware.
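The block-to-state mapping and the inserted issue state can be mimicked in software. The sketch below is not CORCC output; it is a hand-written C model of the FSM one might expect for a loop that performs one blocking control action per iteration. The state names, the single-request interface, and the fixed response latency (assumed to be at least one cycle) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One state per basic block, plus one special state for the control action. */
typedef enum { S_ENTRY, S_COND, S_ISSUE, S_BODY, S_EXIT } fsm_state;

typedef struct {
    fsm_state state;
    int i;         /* register holding the loop variable */
    int wait;      /* cycles until the memory response arrives */
    int req_sent;  /* 1 while a request is outstanding */
    int requests;  /* control actions issued so far */
} fsm;

/* One clock edge. `latency` is the assumed response time (>= 1 cycle). */
static void fsm_step(fsm *f, int n, int latency)
{
    switch (f->state) {
    case S_ENTRY:                 /* i = 0 */
        f->i = 0;
        f->state = S_COND;
        break;
    case S_COND:                  /* i < n ? */
        f->state = (f->i < n) ? S_ISSUE : S_EXIT;
        break;
    case S_ISSUE:                 /* special state: request, then wait */
        if (!f->req_sent) {
            f->req_sent = 1;
            f->wait = latency;
            f->requests++;
        } else if (--f->wait <= 0) {
            f->req_sent = 0;
            f->state = S_BODY;    /* response received: resume the block */
        }
        break;
    case S_BODY:                  /* remainder of the original block: i++ */
        f->i++;
        f->state = S_COND;
        break;
    case S_EXIT:
        break;
    }
}

/* Run the FSM to completion; returns elapsed cycles. */
static int fsm_run(int n, int latency, int *requests)
{
    fsm f = { S_ENTRY, 0, 0, 0, 0 };
    int cycles = 0;
    while (f.state != S_EXIT) {
        fsm_step(&f, n, latency);
        cycles++;
    }
    if (requests != NULL)
        *requests = f.requests;
    return cycles;
}
```

Every ordinary state takes one cycle, matching the single-cycle basic-block mapping, while the issue state is the only place where the thread stalls on the memory subsystem.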

\vspace{10pt} \noindent \textbf{Microprocessor Performance Modeling.} To explore the design space for microcontroller-based control threads in our evaluation in Chapter~\ref{sec:evaluation}, CORCC includes an additional feature that approximates the performance characteristics of a simple in-order microprocessor core. The core is modeled with a constant CPI value (cycles per LLVM instruction) and is assumed to have specialized logic that interfaces directly to the underlying memory subsystem described in Chapter~\ref{sec:microarch}. The programmed CPI value sets the rate at which control threads advance through the LLVM basic blocks in order to mimic the performance characteristics of an idealized microprocessor. Chapter~\ref{sec:evaluation} will later present simulation-driven results that compare direct synthesis by CORCC to soft and hard microprocessor cores.
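The constant-CPI model amounts to simple arithmetic over the executed basic blocks. The following sketch assumes a trace of executed blocks with known LLVM instruction counts; the \smalltt{block\_exec} type and both function names are hypothetical, and stalls on the memory subsystem are ignored in both estimates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical trace entry: one executed basic block and the number of
 * LLVM instructions it contains. */
typedef struct {
    unsigned instructions;
} block_exec;

/* Cycles spent by a core modeled with a constant CPI
 * (cycles per LLVM instruction). */
static unsigned long model_cpu_cycles(const block_exec *trace, size_t len,
                                      unsigned cpi)
{
    unsigned long cycles = 0;
    for (size_t k = 0; k < len; k++)
        cycles += (unsigned long)trace[k].instructions * cpi;
    return cycles;
}

/* Cycles spent by a directly synthesized FSM, where each executed basic
 * block occupies a single-cycle state. */
static unsigned long model_fsm_cycles(size_t len)
{
    return (unsigned long)len;
}
```

For a trace of three blocks with 4, 7, and 2 instructions, a CPI of 3 gives $(4+7+2)\times 3 = 39$ cycles, against 3 cycles for the single-cycle-state FSM under the same no-stall assumption.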

\vspace{10pt} \noindent \textbf{CORCC Limitations.} The CORCC compiler employs a relatively simple approach to high-level synthesis, which completely expands the basic blocks of an application into synthesizable hardware. This approach can have a detrimental effect on performance and area, especially if LLVM produces large basic blocks or allocates a large number of registers. A potential way to mitigate large critical paths within a basic block is to split basic blocks where necessary, which can either be supported automatically or guided by the user. More advanced high-level synthesis techniques can also be applied (e.g., constraining and scheduling the usage of resources). As will be shown later in Chapter~\ref{sec:evaluation}, without any optimizations, the FSMs generated by the CORCC compiler consume relatively modest area while operating at nominal FPGA clock frequencies.

\vspace{10pt} \noindent \textbf{Cor-C vs. Parallel Languages.} Our selection of the C language is not a fundamental requirement of the CoRAM paradigm. An area that merits further research is the use of functional or parallel languages to express higher levels of parallelism within control threads. A particular consequence of using a sequential language such as C is the serialization of requests during dynamic execution. Consider the for loop below, which generates a stream of requests to the memory subsystem:

{\footnotesize
\begin{verbatim}
for(int i=0; i < 8192; i+= BLOCK_BYTES) {
  tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag);
}
\end{verbatim}
}

In the example above, CORCC would not allow the multiple control actions to execute in parallel due to serialization on the coalesced tag variable. In such cases, parallel constructs such as \smalltt{forall} could explicitly declare that the loop body operations are independent.

\section{Summary}

This chapter presented the Cor-C architecture specification and compiler. Cor-C is a devised instance of the CoRAM concept, and provides an example of how the CoRAM concept is applied in a real-world environment. The CoRAM Control Compiler (CORCC) is a proof-of-concept that implements the Cor-C specification and is evaluated further in Chapter~\ref{sec:evaluation}.


docs.1327348131.txt.gz · Last modified: 2012/01/23 19:48 (external edit)
