Differences

This shows you the differences between two versions of the page.

--- docs [2012/01/23 19:46]
echung
+++ docs [2012/01/23 20:24] (current)
echung
@@ Line 1: / Line 1: @@
-\chapter{Cor-C Architecture and Compiler}
+====== Corflow Documentation and Examples ======
-\label{sec:language}
+^ Date ^ Documentation  ^ Version ^ Download ^
-\renewcommand{\epigraphflush}{flushright}
+| 23 Feb 2012 | Corflow Programming Guide | 1.0 | pdf |
-\renewcommand{\epigraphwidth}{4.5in}
+| 23 Feb 2012 | Simple Examples | 1.0 | zip |
-\renewcommand{\epigraphrule}{0pt}
-\epigraph{\hfill\textit{I have stopped reading Stephen King novels. Now I just read C code instead.}}{Richard O'Keefe}
-\vspace{10pt}
-The Cor-C architecture specification is an \textit{instance} of the CoRAM
-concept, which establishes all of the requisite details, data types,
-constraints, and semantics that are necessary for a real portable
-hardware/software interface\footnote{An appropriate analogy would be the MIPS
-ISA being an instance of the RISC concept.}. The Cor-C architecture
-specification defines a dialect of the C language that can be used to express
-the desired behavior of control thread programs.  The use of a standard,
-high-level language such as C affords an application developer not only simpler
-but also more natural expressions of control flow and memory pointer
-manipulations.  It is important to note that Cor-C is not intended to be used
-as a medium for expressing the computational components of an application but
-rather, to be used as a lightweight memory management interface that ``wraps''
-a given application to facilitate portability and to reduce design effort.
-This chapter begins by introducing the salient features of the Cor-C language,
-including data types, thread invocation and management, control actions, and
-the semantics of memory. Section~\ref{sec:language} will describe a prototype
-compiler for the Cor-C specification, which compiles control thread programs
-into finite state machines. Chapter~\ref{sec:casestudy} will later present
-actual uses of the prototype compiler for developing real applications using
-the Cor-C language.
-\section{Cor-C Overview}
-\label{sec:detail}
-The standard collection of primitives in Cor-C
-is divided into \textit{static} versus \textit{dynamic} control actions.
-Table~\ref{tab:accessors} illustrates accessor control actions that
-are statically processed at compile-time, while Table~\ref{tab:memory_actions}
-illustrates control actions that are executed dynamically throughout the
-course of an application. The control actions have the appearance of a memory
-management API, and abstract away the details of the underlying hardware
-support---similar to the role served by the Instruction Set Architecture (ISA)
-between software and evolving hardware implementations.  As will be shown later
-in Chapter~\ref{sec:casestudy}, the basic set of control actions defined are
-powerful building blocks that can be used to compose more sophisticated memory
-abstractions such as scratchpads, caches, and FIFOs---each of which is tailored to
-the memory patterns and desired interfaces of specific applications.
-%The syntax and conventions shown are based on the \textbf{Cor-C architecture
-%language specification}, which is a devised \textbf{instance} of the CoRAM
-%architecture in this thesis and prototyped later in
-%Chapter~\ref{sec:prototype}.
-\begin{table}
-\centering
-\begin{tabular}{@{}  l l@{}}
-\toprule
-Data type & Description \\ \midrule
-\smalltt{bool} & 1-bit boolean \\
-\smalltt{char, uchar} & 8-bit signed and unsigned integers \\
-\smalltt{sint, suint} & 16-bit signed and unsigned integers \\
-\smalltt{int, uint}  & 32-bit signed and unsigned integers \\
-\smalltt{int64, uint64} & 64-bit signed and unsigned integers \\
-\smalltt{cpi\_channel\_ty} & An enumeration of channel object types \{reg, fifo\} \\
-\smalltt{cpi\_addr} & 64-bit virtual address \\
-\smalltt{cpi\_ram\_addr} & 16-bit local RAM address \\
-\smalltt{cpi\_hand} & Static handle for CoRAMs and channel objects \\
-\smalltt{cpi\_tag} & Transaction tag for logical memory transactions \\
-\bottomrule
-\end{tabular}
-\caption{Cor-C Data Types}
-\label{tab:types}
-\end{table}
-\begin{table}
-\centering
-\small
-\begin{tabular}{ l p{4.5in} }
-\toprule
-Control Action & Description \\ \midrule
-\smalltt{cpi\_register}		  & Registers a control thread with name \smalltt{thread\_name} and replicates it \smalltt{N} times. \\
-				  & \smalltt{void cpi\_register\_thread(cpi\_str thread\_name, cpi\_int N);} \\ \midrule
-\smalltt{cpi\_instance}		  & Returns the thread ID as an \smalltt{int}. \\
-				  & \smalltt{int cpi\_instance();} \\ \midrule
-\smalltt{cpi\_get\_ram}		  & Returns a ram co-handle uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\
-				  & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id);} \\
-				  & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, int sub\_id);} \\
-				  & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, ...);} \\\midrule
-\smalltt{cpi\_get\_rams}	  & Returns a co-handle that combines \smalltt{N} rams together as a single logical memory.
-When \smalltt{scatter} is enabled, the rams are combined in a word-interleaved fashion. If \smalltt{scatter} is disabled, the rams are composed linearly. The rams selected are based on \smalltt{N} consecutively numbered ids from \smalltt{id...id+N-1}, where \smalltt{id} is the last argument used in the control action. \\
-				  & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id);} \\
-				  & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, int sub\_id);}\\
-				  & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, ...);} \\\midrule
-\smalltt{cpi\_get\_channel}	  & Returns a channel co-handle based on the enumeration \smalltt{ty}. The channel is uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\
-				  & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, int obj\_id);} \\
-				  & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, ...);} \\
-\bottomrule
-\end{tabular}
-\caption{Cor-C Accessor Control Actions (Static).}
-\label{tab:accessors}
-\end{table}
-\subsection{Control Threads}
-Every application in CoRAM begins with a source-level description of control
-threads to act as a ``wrapper'' around the core processing logic.  Control
-threads are written in the Cor-C language, which is syntactically identical to
-C~\cite{kc}.  Table~\ref{tab:types} summarizes the types in the
-language, which include several types specific to the Cor-C language.  To begin
-an application, threads are declared using the \smalltt{cpi\_register} function
-from Table~\ref{tab:accessors}, which takes as argument a unique thread name
-and a scale factor that replicates the body of the containing function
-\smalltt{N} times. The code below illustrates how two separate Cor-C functions
-would be instantiated in a single program. In this example, a total of three
-threads would be executed during runtime (one of threadA, two of threadB).
-{\footnotesize
-\begin{verbatim}
-// single thread
-void threadA() {
-    cpi_register("thread-A", 1);
-    ...
-}
-// two threads
-void threadB() {
-    cpi_register("thread-B", 2);
-    ...
-}
-\end{verbatim}
-}
-\subsection{Object Instantiation and Identification}
-\begin{figure}
-  \centering
-  \begin{minipage}[t]{\columnwidth}
-	\lstinputlisting[label=lst:bbox_coram,caption=Verilog black-box definition for single-ported embedded CoRAM and Channel FIFO.]{code/bbox_coram.c}
-  \end{minipage}
-\end{figure}
-%\begin{figure}
-%  \centering
-%  \begin{minipage}[t]{\columnwidth}
-%	\lstinputlisting[label=lst:bbox_cfifo,caption=Verilog black-box definition for Channel FIFO.]{code/bbox_cfifo.c}
-%  \end{minipage}
-%\end{figure}
-To utilize embedded CoRAMs, a designer begins with a pre-defined library
-of black-box module wrappers written in a specific hardware description
-language. Listing~\ref{lst:bbox_coram} (top) shows the Verilog port list of an
-embedded CoRAM with a single read-write SRAM port. Unlike a typical SRAM, the
-embedded CoRAM includes extra parameters specific to the Cor-C architecture
-specification.  The \smalltt{THREAD} parameter is a string that names a particular
-control thread associated with the CoRAM. The additional field,
-\smalltt{THREAD\_ID}, is necessary for scale factors greater than 1
-(i.e., when a thread is replicated with \smalltt{cpi\_register}).
-Finally, the \smalltt{OBJECT\_ID} and an optional list of \smalltt{SUB\_ID}
-parameters distinguish between multiple CoRAMs managed by a single control
-thread instance.  In addition to the CoRAMs, users may also instantiate channel
-FIFOs (shown in Listing~\ref{lst:bbox_coram}, bottom) that enable the core logic to
-communicate with specific control threads in the application. The convention for
-identifying and acquiring channel objects is the same as that for acquiring
-CoRAMs.
-\pdf{composition.pdf}
-{\columnwidth}
-{Linear and Scatter-Gather RAM Compositions.}
-{Linear and Scatter-Gather RAM Compositions.}
-{fig:composition}
-When performing accesses to memory, a control thread typically gathers one or
-more instantiated CoRAMs into a single, program-level identifier called the
-\textbf{co-handle}, or \smalltt{cpi\_hand} for short.  The co-handle
-establishes a compile-time binding between an individual control thread and a
-collection of one or more CoRAMs that are functioning as a single logical unit.
-As with the embedded memories of conventional FPGAs, CoRAMs can be combined to form flexible aspect ratios
-and capacities.  Figure~\ref{fig:composition} illustrates how multiple CoRAMs
-are composed to form a single RAM with deeper entries (called linear) or a
-single RAM with wider data words (called scatter/gather).  The composition of
-multiple RAMs can be declared in a control thread using the \smalltt{cpi\_get\_rams}
-accessor function, which returns a co-handle that represents one or more CoRAMs
-functioning as a single logical unit. The \smalltt{cpi\_get\_rams} accessor takes as
-arguments the number \smalltt{N} of CoRAMs, an option to compose the CoRAMs linearly
-or in scatter-gather mode, and the base \smalltt{obj\_id} (plus an optional
-list of sub-ids) to uniquely identify a range of CoRAMs.
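The two compositions can be pictured as two different address mappings. The sketch below is only an illustration in plain C, not part of the Cor-C specification: the type `coram_slot`, the functions `map_linear` and `map_scatter`, and the `depth` parameter are hypothetical names, and the word-interleaved rule for scatter mode is taken from the description of `cpi_get_rams` in the accessor table.

```c
#include <assert.h>

/* Hypothetical helper type (not part of Cor-C): where a logical word lands. */
typedef struct {
    unsigned ram;    /* index of the CoRAM inside the co-handle, 0..n_rams-1 */
    unsigned offset; /* word address inside that CoRAM */
} coram_slot;

/* Linear composition: the CoRAMs are stacked end to end, giving one
   deeper memory. Words 0..depth-1 live in RAM 0, the next depth in RAM 1. */
coram_slot map_linear(unsigned word, unsigned n_rams, unsigned depth)
{
    assert(depth > 0 && word < n_rams * depth);
    coram_slot s = { word / depth, word % depth };
    return s;
}

/* Scatter composition: consecutive words rotate across the CoRAMs
   (word-interleaved), so the fabric can read n_rams words side by side. */
coram_slot map_scatter(unsigned word, unsigned n_rams, unsigned depth)
{
    assert(n_rams > 0 && word < n_rams * depth);
    coram_slot s = { word % n_rams, word / n_rams };
    return s;
}
```

With two CoRAMs of 32 words each, logical word 33 falls at offset 1 of the second CoRAM under the linear mapping, and at offset 16 of the second CoRAM under the scatter mapping.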
-\subsection{Memory Control Actions}
-The basic role of the control thread is to perform memory operations upon
-co-handles and to inform the core processing logic through channels when
-particular operations have completed.  The most basic way to operate upon a
-co-handle is to pass it into a \smalltt{cpi\_write\_ram} memory control action, which
-performs a logical memory transfer of \smalltt{N} bytes from the global
-memory address \smalltt{addr} to the local address \smalltt{ram\_addr} of
-the CoRAMs named by the co-handle. When completed, a sequential block of data from
-memory will be split into RAM-sized words that are written in sequence
-according to the arranged memory-mapping of addresses of each individual CoRAM
-(see Figure~\ref{fig:composition}).
-\begin{table}
-\centering
-\small
-\begin{tabular}{ l p{4.5in} }
-\toprule
-Control Action & Description \\ \midrule
-\smalltt{cpi\_nb\_write\_ram}	  & Performs a non-blocking transfer of \smalltt{N} bytes from memory at address \smalltt{addr} to the \smalltt{rams} co-handle beginning at local address \smalltt{ram\_addr}. Returns a transaction tag \smalltt{cpi\_tag} which can be valid or invalid (\smalltt{CPI\_INVALID\_TAG}). If the last argument \smalltt{tag\_append} is set to a value equal to the tag of a previous non-blocking transfer, the current transfer will be appended to the previous transaction and will share the same tag.\\
-				  & \smalltt{tag = cpi\_nb\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\\midrule
-\smalltt{cpi\_nb\_read\_ram}	  & Same as \smalltt{cpi\_nb\_write\_ram} except that transfers move from \smalltt{rams} to memory.\\
-				  & \smalltt{tag = cpi\_nb\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\\midrule
-\smalltt{cpi\_write\_ram}	  & Same as \smalltt{cpi\_nb\_write\_ram} except that control threads suspend until the transaction completes.\\
-				  & \smalltt{cpi\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule
-\smalltt{cpi\_read\_ram}	  & Same as \smalltt{cpi\_nb\_read\_ram} except that control threads suspend until the transaction completes.\\
-				  & \smalltt{cpi\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule
-\smalltt{cpi\_test}		  & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and returns a \smalltt{bool} indicating whether previous transactions associated with \smalltt{cpi\_nb\_read\_ram} or \smalltt{cpi\_nb\_write\_ram} have completed. \\
-				  & \smalltt{bool cpi\_test(cpi\_hand rams, cpi\_tag tag);} \\\midrule
-\smalltt{cpi\_wait}		  & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and blocks the control thread until the previous transactions associated with \smalltt{tag} have completed.\\
-				  & \smalltt{void cpi\_wait(cpi\_hand rams, cpi\_tag tag);} \\\midrule
-\smalltt{cpi\_bind}		  & Establishes a static binding between a \smalltt{rams} co-handle and a \smalltt{channel} co-handle, which will automatically deliver notifications on transactions applied to \smalltt{rams} to \smalltt{channel}. Once a \smalltt{cpi\_bind} is established, the control thread is no longer permitted to use \smalltt{cpi\_test} or \smalltt{cpi\_wait} on \smalltt{rams}. The control thread also can no longer perform \smalltt{cpi\_write\_channel} to \smalltt{channel}. \\
-\bottomrule
-\end{tabular}
-\caption{Memory Control Actions (Dynamic).}
-\label{tab:memory_actions}
-\end{table}
-\begin{table}
-\centering
-\small
-\begin{tabular}{ l p{4.5in} }
-\toprule
-Control Action & Description \\ \midrule
-\smalltt{cpi\_read\_channel}	  & Reads from \smalltt{channel} and returns data of type \smalltt{cpi\_int64}. The control thread will block if the \smalltt{channel} is empty. \\
-				  & \smalltt{cpi\_int64 cpi\_read\_channel(cpi\_hand channel);} \\\midrule
-\smalltt{cpi\_write\_channel}	  & Writes \smalltt{data} to \smalltt{channel}. The control thread will block if the \smalltt{channel} is full.\\
-				  & \smalltt{void cpi\_write\_channel(cpi\_hand channel, cpi\_int64 data);} \\\midrule
-\smalltt{cpi\_test\_channel}		  & Takes as argument a \smalltt{channel} co-handle and returns a \smalltt{bool} indicating whether the \smalltt{channel} is empty or full, depending on the boolean option \smalltt{check\_empty}. \\
-				  & \smalltt{bool cpi\_test\_channel(cpi\_hand channel, bool check\_empty);} \\
-\bottomrule
-\end{tabular}
-\caption{Channel Control Actions (Dynamic).}
-\label{tab:channel_actions}
-\end{table}
-\pdf{bind.pdf}
-{\largewidth}
-{Supporting Automatic Notification with Channel-to-CoRAM Bindings.}
-{Supporting Automatic Notification with Channel-to-CoRAM Bindings.}
-{fig:bind}
-\vspace{10pt}
-\noindent
-\textbf{Blocking vs. Non-Blocking.}
-Memory control actions are subdivided into blocking versus non-blocking
-behaviors (see Table~\ref{tab:memory_actions}).  The CoRAM architecture
-presents a behavior where sequences of ``blocking'' control actions
-(\smalltt{cpi\_write\_ram}, \smalltt{cpi\_read\_ram}) will appear to execute
-atomically ``one-at-a-time'' from the perspective of a single control thread.
-In some circumstances, it is desirable from a performance perspective to
-explicitly allow multiple outstanding control actions to proceed in parallel
-(i.e., to pipeline multiple address requests).  Non-blocking control actions
-support this by immediately returning control to the thread and providing a tag
-that must be tested later to determine when a transaction has completed (see
-\smalltt{cpi\_test} and \smalltt{cpi\_wait} in Table~\ref{tab:memory_actions}).
-Note that in some cases, the underlying hardware may return an invalid
-tag, which requires the control thread to retry the transaction at a later time.
-A tag is held indefinitely until a \smalltt{cpi\_test} or \smalltt{cpi\_wait}
-is called, which has the side-effect of releasing the tag when the operation
-returns successfully.
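The tag lifecycle described above (issue, possible invalid tag, retry, release on a successful test) can be mimicked with a small host-side model. This is a sketch under stated assumptions, not the Cor-C runtime: the names `mock_tags`, `mock_nb_issue`, `mock_complete`, and `mock_test`, the stand-in constant `MOCK_INVALID_TAG`, and the pool size of two tags are all invented for illustration. The source does not say why the hardware returns an invalid tag; the model assumes it happens when every tag is already held.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_INVALID_TAG (-1)  /* stand-in for CPI_INVALID_TAG */
#define MOCK_NUM_TAGS 2        /* assumed size of the hardware tag pool */

typedef struct {
    bool in_use[MOCK_NUM_TAGS]; /* tag is currently held by the thread */
    bool done[MOCK_NUM_TAGS];   /* transfer behind the tag has finished */
} mock_tags;

/* Issue a non-blocking transfer: returns a free tag, or the invalid tag
   when every tag is held, in which case the caller must retry later. */
int mock_nb_issue(mock_tags *t)
{
    for (int i = 0; i < MOCK_NUM_TAGS; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            t->done[i] = false;
            return i;
        }
    }
    return MOCK_INVALID_TAG;
}

/* Memory-system side: mark the transfer behind a tag as finished. */
void mock_complete(mock_tags *t, int tag)
{
    assert(tag >= 0 && tag < MOCK_NUM_TAGS && t->in_use[tag]);
    t->done[tag] = true;
}

/* Test a tag: a successful test releases the tag as a side effect. */
bool mock_test(mock_tags *t, int tag)
{
    assert(tag >= 0 && tag < MOCK_NUM_TAGS && t->in_use[tag]);
    if (!t->done[tag])
        return false;
    t->in_use[tag] = false;
    return true;
}
```

In this model a third issue attempt fails while two tags are outstanding, and succeeds again only after one of them has completed and been tested.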
-%\textbf{Non-blocking control actions are not guaranteed to execute in a
-%well-defined order}. Any combination of read-write memory accesses that
-%overlap in global address ranges will generally result in undefined behavior.
-%It is the responsibility of the application developer to ``wait'' or ``test''
-%on specific transactions when ordering or atomicity is desired.
-A common task of the control thread is to periodically inform the core
-logic when specific memory transactions have completed.
-Table~\ref{tab:channel_actions} summarizes the channel control actions that
-enable bidirectional communication through FIFOs and registers. A very typical
-synchronization pattern is shown in Figure~\ref{fig:bind}(top), where (1) a
-control thread issues a memory control action and receives a transaction tag, (2+3)
-tests the transaction tag for completion, (4) writes a token to the core logic
-through a channel FIFO, and (5) the core logic consumes the token and processes
-data from the CoRAM.
-\vspace{10pt}
-\noindent\textbf{Transaction Coalescing.}
-The use of non-blocking transactions requires a control thread to track
-multiple outstanding tags, which can lead to overheads in tag state management
-and cycles consumed by periodic testing.  The Cor-C architecture provides an
-optimization to reduce this overhead by allowing a memory control action to
-coalesce multiple transactions to an existing tag held by the thread.  For
-example:
-{\footnotesize
-\begin{verbatim}
-cpi_tag reused_tag = CPI_INVALID_TAG;
-for(int i=0; i < 10; i++) {
-    reused_tag = cpi_nb_write_ram(ramA, i, i*4, 4, reused_tag);
-}
-cpi_wait(ramA, reused_tag);
-\end{verbatim}
-}
-In the example shown above, 10 non-blocking memory transactions are executed by
-the control thread and coalesced into a single tag. At the end of the loop,
-only a single \smalltt{cpi\_wait} operation is required. When passing in a
-re-used tag, the memory control action will merge the new transaction with the
-prior ones.
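One way to picture coalescing is a per-tag counter of transfers still in flight. The sketch below is a plain C model under that assumption and makes no claim about how the hardware tracks tags: `coal_table`, `coal_nb_write`, `coal_transfer_done`, `coal_is_complete`, and `coal_issue_n` are hypothetical names, and `COAL_INVALID_TAG` stands in for `CPI_INVALID_TAG`.

```c
#include <assert.h>
#include <stdbool.h>

#define COAL_INVALID_TAG (-1)  /* stand-in for CPI_INVALID_TAG */
#define COAL_NUM_TAGS 4        /* assumed tag pool size */

typedef struct {
    bool in_use[COAL_NUM_TAGS];
    int  pending[COAL_NUM_TAGS]; /* transfers still in flight per tag */
} coal_table;

/* Issue one transfer. If tag_append names a tag that is already held, the
   transfer joins that transaction; otherwise a fresh tag is allocated. */
int coal_nb_write(coal_table *t, int tag_append)
{
    if (tag_append >= 0 && tag_append < COAL_NUM_TAGS && t->in_use[tag_append]) {
        t->pending[tag_append]++;
        return tag_append;
    }
    for (int i = 0; i < COAL_NUM_TAGS; i++) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            t->pending[i] = 1;
            return i;
        }
    }
    return COAL_INVALID_TAG;
}

/* Memory-system side: one transfer belonging to the tag has finished. */
void coal_transfer_done(coal_table *t, int tag)
{
    assert(tag >= 0 && tag < COAL_NUM_TAGS && t->pending[tag] > 0);
    t->pending[tag]--;
}

/* The coalesced transaction is complete only when nothing is pending. */
bool coal_is_complete(const coal_table *t, int tag)
{
    assert(tag >= 0 && tag < COAL_NUM_TAGS && t->in_use[tag]);
    return t->pending[tag] == 0;
}

/* Issue n transfers that all share a single tag. */
int coal_issue_n(coal_table *t, int n)
{
    int tag = COAL_INVALID_TAG;
    for (int i = 0; i < n; i++)
        tag = coal_nb_write(t, tag);
    return tag;
}
```

Ten transfers issued this way occupy one tag, and that tag reports completion only after all ten transfers have finished.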
-\vspace{10pt}
-\noindent\textbf{Automatic Notifications.}
-Another feature of the Cor-C specification is the ability to completely
-eliminate the need for control threads to synchronize directly with
-core logic.  The \smalltt{cpi\_bind} control action shown in
-Table~\ref{tab:memory_actions} allows a control thread to establish a static
-binding between a CoRAM co-handle and a channel FIFO. Figure~\ref{fig:bind}
-illustrates the method of operation---when a control action is performed upon a
-specific co-handle, the associated channel FIFO will automatically enqueue a
-token that presents to the core logic the completion of a memory transaction.
-Completions are placed into the channel FIFO in the same order that
-transactions are issued.  The \smalltt{cpi\_bind} operation reduces the overall
-latency of a round-trip memory access and also allows a control thread to
-pipeline multiple non-blocking memory requests without having to periodically
-test for completions.
-\vspace{10pt}
-\noindent\textbf{Thread-to-Thread Communication.}
-Thread-to-thread synchronization can be provided natively in the Cor-C
-specification for message-passing between multiple threads. In some
-applications, the need for synchronization arises when dependencies must be
-enforced between phases of computation and when there are multiple concurrent
-control threads.  Custom forms of synchronization can also be facilitated
-through the use of channels.  For example, to implement a fast barrier, users
-can instantiate channels as needed into the soft logic fabric to implement
-their own desired synchronization methods.
-\section{Disallowed Behaviors}
-Although control threads have the appearance of general-purpose software
-threads, there are a number of restrictions in the Cor-C specification:
-\begin{itemize}
-\item The static control actions listed in Table~\ref{tab:accessors} can only
-be executed unconditionally (e.g., cannot be conditioned by a loop variable).
-\item Control threads are limited to 64 CoRAMs per co-handle\footnote{The
-architectural limit is placed here due to physical constraints
-imposed by the cluster-style microarchitecture presented in
-Chapter~\ref{sec:microarch}.}.
-\item Control threads may not test invalid tags or perform control actions with
-invalid arguments.
-\item Threads may not dynamically allocate memory or instantiate global variables.
-\item Control threads may not dereference memory pointers directly\footnote{When a
-control thread needs to directly access memory, a single CoRAM along with a
-channel FIFO can be allocated and ``wrapped'' together to form a simple
-load-store interface (see Chapter~\ref{sec:casestudy}).}.
-\item Control threads may not execute floating point operations.
-\item Function stacks are allowed but must be statically bounded.
-\item No recursion allowed.
-\end{itemize}
-Many of the language restrictions above are intended to reduce the likelihood
-of ``abusing'' control threads for computation purposes. The various
-restrictions also ensure that control threads are highly amenable to
-lightweight implementations in hardware (i.e., threads synthesized into logic or
-executed on lightweight microprocessors).
-\section{Simple Example: Vector Addition}
-To concretely illustrate the features
-of the Cor-C language, the code below gives a complete top-to-bottom example of
-the vector increment kernel, where a sequential array of data is read in from
-memory, incremented, and written back to main memory. The particular kernel in
-this example performs two concurrent increments per clock cycle.
-{\footnotesize
-\begin{verbatim}
-void vector_increment_thread()
-{
-    cpi_register_thread("vector_add", 1/*number of threads*/);
-    cpi_hand data_store = cpi_get_rams(2/*numRams*/, true/*scatter*/, 0, 0);
-    cpi_hand bind_channel = cpi_get_channel(cpi_fifo, 0);
-    cpi_hand done_channel = cpi_get_channel(cpi_fifo, 1);
-    cpi_bind(bind_channel, data_store);
-    cpi_tag tag = CPI_INVALID_TAG;
-    /* Read memory */
-    for(int i=0; i < 128; i+=8)
-        tag = cpi_nb_write_ram(data_store, i, i*8, 8, tag);
-    /* Wait for computation to finish */
-    while(!cpi_read_channel(done_channel)) {}
-    tag = CPI_INVALID_TAG;
-    /* Writeback to memory */
-    for(int i=0; i < 128; i+=8)
-        tag = cpi_nb_read_ram(data_store, i, i*8, 8, tag);
-    cpi_wait(tag);
-}
-module vector_kernel(CLK, RST_N);
-    input CLK, RST_N;
-    reg busy, writeback, done, dout_en, wen;
-    reg [5:0] addr, waddr;
-    reg [31:0] din0, din1;
-    wire [31:0] dout0, dout1;
-    CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/,
-0/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/)
-            arr0 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen),
-                  .waddr(waddr), .addr(addr), .din(din0), .dout(dout0));
-    CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/,
-1/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/)
-            arr1 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen),
-                  .waddr(waddr), .addr(addr), .din(din1), .dout(dout1));
-    ChannelFIFO cfifo(.CLK(CLK), .RST_N(RST_N),
-                      .dout_en(dout_en), .dout(0), .../*unused signals*/);
-    always@(posedge CLK) begin
-        if(RST_N) begin
-            if(dout_rdy && !busy) begin
-                writeback <= 0;
-                busy <= 1;
-                addr <= 0;
-                waddr <= 0;
-            end
-            else if(busy) begin
-                addr <= addr + 1;
-                writeback <= (addr < 32);
-                busy <= (addr < 32);
-            end
-            else writeback <= 0;
-            if(writeback) begin
-                wen <= 1'b1;
-                din0 <= dout0+1;
-                din1 <= dout1+1;
-                waddr <= waddr+1;
-                if(waddr == 31) dout_en <= 1;
-            end
-            else begin
-                wen <= 1'b0;
-                dout_en <= 1'b0;
-            end
-        end
-        else begin
-            writeback <= 0;
-            dout_en <= 0;
-            busy <= 0;
-            wen <= 0;
-        end
-    end
-endmodule
-\end{verbatim}
-}
-In the first phase of the control thread program above, the thread sets up a
-programmed transfer that reads in 128B of data from memory into 2 separate
-embedded CoRAMs represented by a single co-handle.  To present a wide 64-bit
-word interface to the fabric, the co-handle is composed with the scatter-gather
-argument set to true.  The \smalltt{cpi\_bind} operation thereafter establishes
-an implicit channel between the memory system and the core logic, which is the
-ultimate consumer of the memory data. Within the core logic, four embedded
-CoRAMs and a single channel FIFO are instantiated as black-box modules. As
-memory transactions stream through one-at-a-time, the core logic will receive
-tokens through the channel FIFO, indicating that data is ready for access
-within the CoRAMs. In the simple example above, the core logic waits until all
-the tokens are received before performing the increment steps. During the
-compute phase, the core logic reads and writes 32 clock cycles worth of data
-from the embedded CoRAMs. Upon completion, a token is written back from the
-core logic to the control thread indicating that a writeback to memory is
-pending.  The control thread in the background wait-polls on the channel until
-receiving the token and then performs the final set of memory control actions
-to write data from the CoRAMs to memory.
-\vspace{10pt}
-\noindent \textbf{Summary.} It is not difficult to imagine that many variants of control actions could be
-added to the Cor-C architecture specification to support more sophisticated
-patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch,
-strided access, programmable patterns, etc.). In a commercial production
-setting, control actions---like instructions in an ISA---must be carefully
-defined and preserved to achieve the value of portability and compatibility.
-Optimizing compilers could also play a significant role in static
-optimization of control thread programs. Analysis could be used, for example,
-to identify non-conflicting control actions that are logically executed in
-sequence but can actually be executed concurrently without affecting
-correctness. The next section describes a compiler and proof-of-concept of the
-Cor-C architecture specification. Chapter~\ref{sec:casestudy} will later
-present concrete demonstrations of Cor-C-based applications.
-%Without delving into details, the Cor-C language essentially follows same basic
-%lexical elements as the standard C language~\cite{krbook}.
-%\subsection{Thread State and Memory}
-%Control threads can declare state variables and bounded-sized arrays through
-%conventional C-based syntax (e.g., \textit{int x = 0}). Under limited
-%circumstances, structs may also be used to organize and group variables
-%together.
-%Global variables declared outside the scope of a function are \textbf{not
-%permitted} in the Cor-C specification. Control threads are also not permitted
-%to directly access the global main memory through dereferencing of pointers.
-%This limitation of control threads can be addressed through specialized
-%memory personalities, which will be later described in Section\ref{sec}.
-\section{The CoRAM Control Compiler (CORCC)}
-\pdf{control_options.pdf}
-{\largewidth}
-{Options for Synthesizing Control Threads.}
-{Options for Synthesizing Control Threads.}
-{fig:control_options}
-\pdf{corcc_example.pdf}
-{\xcolumnwidth}
-{CORCC Example.}
-{CORCC Example.}
-{fig:corcc_example}
-The CoRAM Control Compiler (CORCC) was developed in this thesis to explore the
-various implementation options for control threads.
-Figure~\ref{fig:control_options} shows the several ways in which a control
-thread program can be mapped down into control logic in an FPGA with CoRAM
-support: (1) directly compiling control thread programs into soft logic state
-machines via high-level synthesis, (2) compiling control threads to
-pre-implemented soft microprocessor cores (e.g., Xilinx
-MicroBlaze~\cite{xilinx} or Altera Nios~\cite{altera}), or (3) compiling to a
-hard microprocessor serving as a dedicated microcontroller.
-CORCC supports direct synthesis of control threads into synthesizable RTL from
-standard C code and can also be configured to model the cycle-time performance
-of a simple microprocessor.  The implementation of CORCC leverages the Low
-Level Virtual Machine (LLVM) framework~\cite{llvm}, which is an open-source,
-end-to-end compiler with pluggable extensions for custom passes and backends.
-CORCC leverages the modularity of LLVM and its language-independent
-intermediate representation (IR) to implement a simple form of high-level
-synthesis with extensions for microprocessor performance modeling.
-\vspace{10pt}
-\noindent \textbf{Implementation.}
-The CORCC LLVM extension is implemented in 6000 lines of C++ as a series of LLVM
-passes. CORCC extends LLVM with special objects and data types that are
-specific to the Cor-C architecture specification.  These include CoRAM and
-channel accessors, co-handles, and the memory/channel control actions.  LLVM
-includes front-ends for several popular languages such as C and C++ and can
-automatically generate an intermediate representation (IR) in Static Single
-Assignment (SSA) form. In SSA form, each variable in a routine is assigned
-exactly once, which is useful for various optimizations and simplifying the
-properties of variables.  An important feature of the IR is the LLVM type
-system, which provides high-level program-level information accessible at the
-assembly level.  The first stage of CORCC is automatically handled by LLVM,
-which translates the high-level control thread program into IR organized into
-basic blocks. The assembly employed by LLVM constitutes about 70
-instructions~\cite{llvm}, of which a subset of about 30 are supported in
-CORCC\footnote{Use of unsupported instructions results in compile-time
-errors in CORCC.}.
-\vspace{10pt}
-\noindent \textbf{Thread-to-Hardware Interface.}
-The control threads of an application, which exist either in the form of soft
-finite state machines or as microcontrollers, can be viewed as clients that
-issue memory requests to the underlying memory subsystem comprising embedded
-CoRAMs, the network-on-chip, and the edge memory interfaces (as illustrated
-earlier in Figure~\ref{fig:control_options}). CORCC assumes that in a soft
-implementation of control threads, a well-defined request-response interface
-exists between the underlying subsystem and the control threads implemented in
-the fabric. The details of such interfaces are described further in
-Chapter~\ref{sec:microarch}.
-\vspace{10pt}
-\noindent \textbf{Co-handle Pass.}
-The first stage of CORCC performs a sweep through the LLVM-generated IR and
-identifies \smalltt{call} instructions that match the function signatures of
-static control actions such as co-handle and channel accessors (see
-Table~\ref{tab:accessors}) that are used to establish bindings to various
-CoRAM-related objects. In LLVM, any function with a return value is assigned to
-a register identifier with a unique integer.  Within CORCC, the static pass
-creates an internal map between an identified co-handle and its corresponding
-destination register.  When processing a co-handle, the function arguments are
-checked to be constant and valid values. During this pass, any dynamic control
-actions are annotated and linked against the detected co-handles. The link step
-performs a backtracing through registers in the IR to identify the specific
-co-handle associated with a dynamic control action.
-\vspace{10pt}
-\noindent \textbf{Thread Synthesis.}
-Once all co-handles have been identified, CORCC performs a synthesis step that
-translates the LLVM instructions and the Cor-C dynamic control actions into
-synthesizable Verilog. The basic approach taken by CORCC is to perform a direct
-mapping of basic blocks into single-cycle states in a finite state machine.  To
-handle register state, CORCC instantiates a physical register for each assigned
-variable in a program.  To implement logic, all of the instructions in a
-basic block are converted into combinational statements, where the
-inputs to the logic are read from registers in a single clock cycle (and
-in the same clock cycle, the output is written to the destination registers).
-The SSA form of LLVM guarantees that no registers are read and written at the
-same time within a single basic block.
-To handle dynamic control actions, special states are introduced at locations in the
-LLVM IR where \smalltt{call} instructions are detected. For example, when a
-\smalltt{cpi\_write\_ram} function call is detected, the parent basic block
-is split into two states, one containing the original instructions and the
-other for invocation of the control action. The predecessor basic block always
-jumps into the special state first, which handles the actual issue of the
-control action to the memory subsystem through a request-response interface;
-thereafter, the FSM jumps into the original basic block while returning the
-value of the control action.  Figure~\ref{fig:corcc_example} gives a complete
-example of compiling the simple example from Chapter~\ref{sec:coram} into
-synthesizable hardware.
-\vspace{10pt}
-\noindent \textbf{Microprocessor Performance Modeling.}
-To explore the design space for microcontroller-based control threads in our
-evaluation in Chapter~\ref{sec:evaluation}, CORCC includes an additional
-feature that approximates the performance characteristics of a simple in-order
-microprocessor core. The core is modeled with a constant CPI value
-(cycles per LLVM instruction) and is assumed to have specialized logic that
-interfaces directly to the underlying memory subsystem described in
-Chapter~\ref{sec:microarch}.  The programmed CPI value sets the rate at which
-control threads advance through the LLVM basic blocks in order to mimic the
-performance characteristics of an idealized microprocessor.
-Chapter~\ref{sec:evaluation} will later present simulation-driven results that
-compare direct synthesis by CORCC to soft and hard microprocessor cores.
-\vspace{10pt}
-\noindent \textbf{CORCC Limitations.}
-The CORCC compiler employs a relatively simple approach to high-level
-synthesis, which completely expands the basic blocks of an application
-into synthesizable hardware. This simple approach can have a
-detrimental effect on performance and area, especially if LLVM produces large
-basic blocks or allocates a large number of registers. A potential way to
-mitigate long critical paths within a basic block is to split basic
-blocks where necessary, which can either be supported automatically or
-guided by the user. More advanced high-level synthesis techniques can also be
-applied---e.g., constraining and scheduling the usage of resources.
-As will be shown later in Chapter~\ref{sec:evaluation}, without any
-optimizations, the FSMs generated by the CORCC compiler consume relatively
-modest area while operating at nominal FPGA clock frequencies.
-\vspace{10pt}
-\noindent \textbf{Cor-C vs. Parallel Languages.}
-Our selection of the C language is not a fundamental requirement of the CoRAM
-paradigm.  An area that merits further research is the use of functional or
-parallel languages to express higher levels of parallelism within control
-threads.  A particular consequence of using a sequential language such as C is
-the serialization of requests during dynamic execution. Consider the
-\smalltt{for} loop below, which generates a stream of requests to the memory
-subsystem:
-{\footnotesize
-\begin{verbatim}
-for(int i=0; i < 8192; i+= BLOCK_BYTES) {
-    tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag);
-}
-\end{verbatim}
-}
-In the example above, CORCC would not allow the multiple control actions
-to execute in parallel because each call depends on the tag variable returned
-by the previous one. In such cases, parallel constructs such as
-\smalltt{forall} can explicitly declare that the loop body operations are
-independent.
-\section{Summary}
-This chapter presented the Cor-C architecture specification and compiler.
-Cor-C is one instance of the CoRAM concept and provides an example
-of how the CoRAM concept is applied in a real-world environment.
-The CoRAM Control Compiler (CORCC) is a proof-of-concept compiler that
-implements the Cor-C specification and is evaluated further in
-Chapter~\ref{sec:evaluation}.


docs.1327348006.txt.gz · Last modified: 2012/01/23 19:46 by echung
