
Differences

This shows you the differences between two versions of the page.

docs [2012/01/23 19:46]
echung
docs [2012/01/23 20:24] (current)
echung
Line 1: Line 1:
-\chapter{Cor-C Architecture ​and Compiler} +====== Corflow Documentation ​and Examples ======
-\label{sec:​language}+
  
- +^ Date ^ Documentation ​ ^ Version ^ Download ^ 
-\renewcommand{\epigraphflush}{flushright} +| 23 Feb 2012 | Corflow Programming Guide | 1.0 | pdf | 
-\renewcommand{\epigraphwidth}{4.5in} +| 23 Feb 2012 | Simple ​Examples | 1.0 | zip |
-\renewcommand{\epigraphrule}{0pt}  +
-\epigraph{\hfill\textit{I have stopped reading Stephen King novels. Now I just read C code instead.}}{Richard O'​Keefe} +
-\vspace{10pt} +
- +
-The Cor-C architecture specification is an \textit{instance} of the CoRAM +
-concept, which establishes all of the requisite details, data types, +
-constraints,​ and semantics that are necessary for a real portable +
-hardware/​software interface\footnote{An appropriate analogy would be the MIPS +
-ISA being an instance of the RISC concept.}. The Cor-C architecture +
-specification defines a dialect of the C language that can be used to express +
-the desired behavior of control thread programs. ​ The use of a standard, +
-high-level language such as C affords an application developer not only simpler +
-but also more natural expressions of control flow and memory pointer +
-manipulations. ​ It is important to note that Cor-C is not intended to be used +
-as a medium for expressing the computational components of an application but +
-rather, to be used as a lightweight memory management interface that ``wraps''​ +
-a given application to facilitate portability and to reduce design effort. +
- +
-This chapter begins by introducing the salient features of the Cor-C language, +
-including data types, thread invocation and management, control actions, and +
-the semantics of memory. Section~\ref{sec:​language} will describe a prototype +
-compiler for the Cor-C specification,​ which compiles control thread programs +
-into finite state machines. Chapter~\ref{sec:​casestudy} will later present +
-actual uses of the prototype compiler for developing real applications using +
-the Cor-C language. +
- +
-\section{Cor-C Overview} +
-\label{sec:​detail} +
-The standard collection of primitives in Cor-C +
-is divided into \textit{static} and \textit{dynamic} control actions. +
-Table~\ref{tab:accessors} lists the accessor control actions that +
-are statically processed at compile time, while Table~\ref{tab:memory_actions} +
-lists the control actions that are executed dynamically throughout the +
-course of an application. The control actions have the appearance of a memory +
-management API, and abstract away the details of the underlying hardware +
-support---similar to the role served by the Instruction Set Architecture (ISA) +
-between software and evolving hardware implementations. ​ As will be shown later +
-in Chapter~\ref{sec:​casestudy},​ the basic set of control actions defined are +
-powerful building blocks that can be used to compose more sophisticated memory +
-abstractions such as scratchpads, caches, and FIFOs, each of which is tailored to +
-the memory patterns and desired interfaces of specific applications. +
- +
-%The syntax and conventions shown are based on the \textbf{Cor-C architecture +
-%language specification},​ which is a devised \textbf{instance} of the CoRAM +
-%architecture in this thesis and prototyped later in +
-%Chapter~\ref{sec:​prototype}. ​  +
- +
- +
-\begin{table} +
-\centering +
-\begin{tabular}{@{} ​ l l@{}} +
-\toprule +
-Data type & Description \\ \midrule  +
-\smalltt{bool} & 1-bit boolean \\ +
-\smalltt{char,​ uchar} & 8-bit signed and unsigned integers \\ +
-\smalltt{sint,​ suint} & 16-bit signed and unsigned integers \\ +
-\smalltt{int,​ uint}  & 32-bit signed and unsigned integers \\  +
-\smalltt{int64,​ uint64} & 64-bit signed and unsigned integers \\  +
-\smalltt{cpi\_channel\_ty} & An enumeration of channel object types {reg, fifo} \\ +
-\smalltt{cpi\_addr} & 64-bit virtual address \\ +
-\smalltt{cpi\_ram\_addr} & 16-bit local ram address \\ +
-\smalltt{cpi\_hand} & Static handle for CoRAMs and channel objects \\ +
-\smalltt{cpi\_tag} & Transaction tag for logical memory transactions \\ +
-\bottomrule +
-\end{tabular} +
-\caption{Cor-C Data Types} +
-\label{tab:​types}  +
-\end{table} +
-  +
-  +
-\begin{table} +
-\centering +
-\small +
-\begin{tabular}{ l p{4.5in} } +
-\toprule +
-Control Action & Description \\ \midrule  +
-\smalltt{cpi\_register}  ​ & Registers a control thread with name \smalltt{thread\_name} and replicates it \smalltt{N} times. \\ +
-   & \smalltt{void cpi\_register\_thread(cpi\_str thread\_name,​ cpi\_int N);} \\ \midrule +
-\smalltt{cpi\_instance}  ​ & Returns the thread ID as an \smalltt{int}. \\  +
-   & \smalltt{int cpi\_instance();​} \\ \midrule +
-\smalltt{cpi\_get\_ram}  ​ & Returns a ram co-handle uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\ +
-   & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id);} \\ +
-   & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, int sub\_id);} \\ +
-   & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, ...);} \\\midrule +
-\smalltt{cpi\_get\_rams}  ​ & Returns a co-handle that combines \smalltt{N} rams together as a single logical memory.  +
-When \smalltt{scatter} is enabled, the rams are combined in a word-interleaved fashion. If \smalltt{scatter} is disabled, the rams are composed linearly. The rams selected are based on \smalltt{N} consecutively numbered ids from \smalltt{id...id+N-1},​ where \smalltt{id} is the last argument used in the control action. \\ +
-   & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id);} \\ +
-   & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, int sub\_id);​}\\ +
-   & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, ...);} \\\midrule +
-\smalltt{cpi\_get\_channel}  ​ & Returns a channel co-handle based on the enumeration \smalltt{ty}. The channel is uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\ +
-   & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, int obj\_id);} \\ +
-   & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, ...);} \\ +
- +
-\bottomrule +
-\end{tabular} +
-\caption{Cor-C Accessor Control Actions (Static).} +
-\label{tab:​accessors} +
-\end{table} ​                        +
-  +
-\subsection{Control Threads} +
-Every application in CoRAM begins with a source-level description of control +
-threads to act as a ``wrapper''​ around the core processing logic. ​ Control +
-threads are written in the Cor-C language, which is syntactically identical to +
-C~\cite{kc}. ​ Table~\ref{tab:​types} summarizes the types in the +
-language, which include several types specific to the Cor-C language. ​ To begin  +
-an application,​ threads are declared using the \smalltt{cpi\_register} function +
-from Table~\ref{tab:​accessors},​ which takes as argument a unique thread name +
-and a scale factor that replicates the body of the containing function +
-\smalltt{N} times. The code below illustrates how two separate Cor-C functions +
-would be instantiated in a single program. In this example, a total of three +
-threads would be executed during runtime (one of threadA, two of threadB). +
- +
-{\footnotesize +
-\begin{verbatim} +
-// single thread +
-void threadA() { +
-    cpi_register("thread-A", 1); +
-    ... +
-} +
- +
-// two threads +
-void threadB() { +
-    cpi_register("thread-B", 2); +
-    ... +
-} +
-\end{verbatim} +
-}  +
- +
- +
-\subsection{Object Instantiation and Identification} +
- +
-\begin{figure} +
-  \centering +
-  \begin{minipage}[t]{\columnwidth} +
- \lstinputlisting[label=lst:​bbox_coram,​caption=Verilog black-box definition for single-ported embedded CoRAM and Channel FIFO.]{code/​bbox_coram.c} +
-  \end{minipage} +
-\end{figure}  +
- +
- +
-%\begin{figure} +
-%  \centering +
-%  \begin{minipage}[t]{\columnwidth} +
-% \lstinputlisting[label=lst:​bbox_cfifo,​caption=Verilog black-box definition for Channel FIFO.]{code/​bbox_cfifo.c} +
-%  \end{minipage} +
-%\end{figure} ​  +
-  +
-To utilize embedded CoRAMs, a designer begins with a pre-defined library +
-of black-box module wrappers written in a specific hardware description +
-language. Listing~\ref{lst:​bbox_coram} (top) shows the Verilog port list of an +
-embedded CoRAM with a single read-write SRAM port. Unlike a typical SRAM, the +
-embedded CoRAM includes extra parameters specific to the Cor-C architecture +
-specification. ​ The \smalltt{THREAD} is a string that names a particular +
-control thread associated with the CoRAM. The additional field, +
-\smalltt{THREAD\_ID} is necessary for scale factors greater than 1 +
-(i.e., when a thread is replicated with \smalltt{cpi\_register}). +
-Finally, the \smalltt{OBJECT\_ID} and an optional list of \smalltt{SUB\_ID} +
-parameters distinguish between multiple CoRAMs managed by a single control +
-thread instance. ​ In addition to the CoRAMs, users may also instantiate channel +
-FIFOs (shown in Listing~\ref{lst:​bbox_coram},​ bottom) that enable the core logic to +
-communicate with specific control threads in the application. The convention for +
-identifying and acquiring channel objects is the same as that for acquiring +
-CoRAMs. +
- +
-\pdf{composition.pdf} +
-{\columnwidth} +
-{Linear and Scatter-Gather RAM Compositions.} +
-{Linear and Scatter-Gather RAM Compositions.} +
-{fig:​composition} ​  +
-  +
-When performing accesses to memory, a control thread typically gathers one or +
-more instantiated CoRAMs into a single, program-level identifier called the +
-\textbf{co-handle}--- or \smalltt{cpi\_hand} for short. ​ The co-handle +
-establishes a compile-time binding between an individual control thread and a +
-collection of one or more CoRAMs that are functioning as a single logical unit. +
-As with embedded SRAMs in conventional FPGAs, CoRAMs can be combined to form flexible aspect ratios +
-and capacities.  Figure~\ref{fig:composition} illustrates how multiple CoRAMs +
-are composed to form a single RAM with deeper entries (called linear) or a +
-single RAM with wider data words (called scatter/​gather). ​ The composition of +
-multiple RAMs can be declared in a control thread using the \smalltt{cpi\_get\_rams} +
-accessor function, which returns a co-handle that represents one or more CoRAMs +
-functioning as a single logical unit. The \smalltt{cpi\_get\_rams} accessor takes as +
-arguments the number \smalltt{N} of CoRAMs, an option to compose the CoRAMs linearly +
-or in scatter-gather mode, and the base \smalltt{obj\_id} (plus an optional +
-list of sub-ids) to uniquely identify a range of CoRAMs. +
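
The two composition modes can be pictured as a mapping from a logical word index to a (CoRAM, local address) pair. The sketch below is a host-side C model rather than Cor-C code. The names `coram_slot`, `map_linear`, and `map_scatter` are hypothetical, and the exact interleaving is an assumption based on the word-interleaved description of the scatter option.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model; not part of the Cor-C API. */
typedef struct {
    uint32_t ram;    /* index of the CoRAM within the co-handle, 0..n-1 */
    uint32_t local;  /* word address inside that CoRAM */
} coram_slot;

/* Linear composition: CoRAMs are stacked end to end, giving a deeper memory. */
static coram_slot map_linear(uint32_t word, uint32_t n, uint32_t depth) {
    assert(n > 0 && depth > 0 && word < n * depth);
    coram_slot s = { word / depth, word % depth };
    return s;
}

/* Scatter composition (assumed mapping): consecutive words rotate across
 * the CoRAMs, so each local address holds one slice of a wider word. */
static coram_slot map_scatter(uint32_t word, uint32_t n, uint32_t depth) {
    assert(n > 0 && depth > 0 && word < n * depth);
    coram_slot s = { word % n, word / n };
    return s;
}
```

Under these assumptions, with two 32-entry CoRAMs, logical word 33 lands in CoRAM 1 at local address 1 when composed linearly, and in CoRAM 1 at local address 16 when composed in scatter mode.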
- +
-\subsection{Memory Control Actions} +
- +
-The basic role of the control thread is to perform memory operations upon +
-co-handles and to inform the core processing logic through channels when +
-particular operations have completed.  The most basic way to operate upon a +
-co-handle is to pass it into a \smalltt{cpi\_write\_ram} memory control action, which +
-performs a logical memory transfer of \smalltt{N} bytes from the global +
-memory address \smalltt{addr} to the local address \smalltt{ram\_addr} of +
-the CoRAMs named by the co-handle. When completed, a sequential block of data from +
-memory will be split into RAM-sized words that are written in sequence +
-according to the arranged memory-mapping of addresses of each individual CoRAM +
-(see Figure~\ref{fig:​composition}). +
- +
- +
-\begin{table} +
-\centering +
-\small  +
-\begin{tabular}{ l p{4.5in} } +
-\toprule +
-Control Action & Description \\ \midrule ​  +
-\smalltt{cpi\_nb\_write\_ram}  ​ & Performs a non-blocking transfer of \smalltt{N} bytes from memory at address \smalltt{addr} to the \smalltt{rams} co-handle beginning at local address \smalltt{ram\_addr}. Returns a transaction tag \smalltt{cpi\_tag} which can be valid or invalid (\smalltt{CPI\_INVALID\_TAG}). If the last argument \smalltt{tag\_append} is set to a value equal to the tag of a previous non-blocking transfer, the current transfer will be appended to the previous transaction and will share the same tag.\\ +
-   & \smalltt{tag = cpi\_nb\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);​} \\\midrule +
-\smalltt{cpi\_nb\_read\_ram}  ​ & Same as \smalltt{cpi\_nb\_write\_ram} except that transfers move from \smalltt{rams} to memory.\\ +
-   & \smalltt{tag = cpi\_nb\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\\midrule +
-\smalltt{cpi\_write\_ram}  ​ & Same as \smalltt{cpi\_nb\_write\_ram} except that control threads suspend until the transaction completes.\\ +
- +
-   & \smalltt{cpi\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule +
-\smalltt{cpi\_read\_ram}  ​ & Same as \smalltt{cpi\_nb\_read\_ram} except that control threads suspend until the transaction completes.\\ +
- +
-   & \smalltt{cpi\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule  +
-\smalltt{cpi\_test}  ​ & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and returns a \smalltt{bool} indicating whether previous transactions associated with \smalltt{cpi\_nb\_read\_ram} or \smalltt{cpi\_nb\_write\_ram} have completed. \\ +
-   & \smalltt{bool cpi\_test(cpi\_hand rams, cpi\_tag tag);} \\\midrule +
-\smalltt{cpi\_wait}   & Takes as arguments a \smalltt{rams} co-handle and a \smalltt{tag} and blocks the control thread until the previous transactions associated with \smalltt{tag} have completed.\\ +
-   & \smalltt{void cpi\_wait(cpi\_hand rams, cpi\_tag tag);} \\\midrule +
-\smalltt{cpi\_bind}   & Establishes a static binding between a \smalltt{rams} co-handle and a \smalltt{channel} co-handle, which automatically delivers notifications of transactions applied to \smalltt{rams} to \smalltt{channel}. Once a \smalltt{cpi\_bind} is established, the control thread is no longer permitted to use \smalltt{cpi\_test} or \smalltt{cpi\_wait} on \smalltt{rams}. The control thread also can no longer perform \smalltt{cpi\_write\_channel} to \smalltt{channel}. \\ +
-\bottomrule +
-\end{tabular} +
-\caption{Memory Control Actions (Dynamic).} +
-\label{tab:​memory_actions}  +
-\end{table}  +
- +
-\begin{table} +
-\centering +
-\small  +
-\begin{tabular}{ l p{4.5in} } +
-\toprule +
-Control Action & Description \\ \midrule ​  +
-\smalltt{cpi\_read\_channel}  ​ & Reads from \smalltt{channel} and returns data of type \smalltt{cpi\_int64}. The control thread will block if channel is empty. \\ +
-   & \smalltt{cpi\_int64 cpi\_read\_channel(cpi\_hand channel);} \\\midrule +
-\smalltt{cpi\_write\_channel}  ​ & Writes \smalltt{data} to \smalltt{channel}. The control thread will block if the \smalltt{channel} is full.\\ & \smalltt{void cpi\_write\_channel(cpi\_hand channel, cpi\_int64 data);} \\\midrule +
-\smalltt{cpi\_test\_channel}   & Takes as argument a \smalltt{channel} co-handle and returns a \smalltt{bool} indicating whether the \smalltt{channel} is empty or full (which of the two is tested depends on the boolean option \smalltt{check\_empty}). \\ +
-   & \smalltt{bool cpi\_test\_channel(cpi\_hand channel, bool check\_empty);​} \\ +
-\bottomrule +
-\end{tabular} +
-\caption{Channel Control Actions (Dynamic).} +
-\label{tab:​channel_actions}  +
-\end{table} ​  +
-  +
-\pdf{bind.pdf} +
-{\largewidth} +
-{Supporting Automatic Notification with Channel-to-CoRAM Bindings.} +
-{Supporting Automatic Notification with Channel-to-CoRAM Bindings.} +
-{fig:​bind} ​   +
- +
-  +
-\vspace{10pt} +
-\noindent +
-\textbf{Blocking vs. Non-Blocking.} +
-Memory control actions are subdivided into blocking versus non-blocking +
-behaviors (see Table~\ref{tab:​memory_actions}). ​ The CoRAM architecture +
-presents a behavior where sequences of ``blocking''​ control actions +
-(\smalltt{cpi\_write\_ram},​ \smalltt{cpi\_read\_ram}) will appear to execute +
-atomically ``one-at-a-time''​ from the perspective of a single control thread. +
-In some circumstances,​ it is desirable from a performance perspective to +
-explicitly allow multiple outstanding control actions to proceed in parallel +
-(i.e., to pipeline multiple address requests). ​ Non-blocking control actions +
-support this by immediately returning control to the thread and providing a tag +
-that must be tested later to determine when a transaction has completed (see +
-\smalltt{cpi\_test} and \smalltt{cpi\_wait} in Table~\ref{tab:​memory_actions}). +
-Note that in some cases, the underlying hardware may return an invalid +
-tag, which requires the control thread to retry the transaction at a later time. +
-A tag is held indefinitely until a \smalltt{cpi\_test} or \smalltt{cpi\_wait} +
-is called, which has the side-effect of releasing the tag when the operation +
-returns successfully. +
-  +
-%\textbf{Non-blocking control actions are not guaranteed to execute in a +
-%well-defined order}. Any combination of read-write memory accesses that +
-%overlap in global address ranges will generally result in undefined behavior. +
-%It is the responsibility of the application developer to ``wait''​ or ``test''​ +
-%on specific transactions when ordering or atomicity is desired. +
- +
-A common task of the control thread is to periodically inform the core +
-logic when specific memory transactions have completed. +
-Table~\ref{tab:channel_actions} summarizes the channel control actions that +
-enable bidirectional communication through FIFOs and registers. A very typical +
-synchronization pattern is shown in Figure~\ref{fig:​bind}(top),​ where (1) a +
-control thread issues a memory control action and receives a transaction tag, (2+3) +
-tests the transaction tag for completion, (4) writes a token to the core logic +
-through a channel FIFO, and (5) the core logic consumes the token and processes +
-data from the CoRAM. +
- +
-\vspace{10pt} +
-\noindent\textbf{Transaction Coalescing.} +
-The use of non-blocking transactions requires a control thread to track +
-multiple outstanding tags, which can lead to overheads in tag state management +
-and cycles consumed by periodic testing. ​ The Cor-C architecture provides an +
-optimization to reduce this overhead by allowing a memory control action to +
-coalesce multiple transactions to an existing tag held by the thread. ​ For +
-example: +
- +
-{\footnotesize +
-\begin{verbatim} +
-cpi_tag reused_tag = CPI_INVALID_TAG; +
-for(int i=0; i < 10; i++) { +
-    reused_tag = cpi_nb_write_ram(ramA, i, i*4, 4, reused_tag); +
-} +
-cpi_wait(ramA, reused_tag); +
-\end{verbatim} +
-}   +
- +
-In the example shown above, 10 non-blocking memory transactions are executed by +
-the control thread and coalesced into a single tag. At the end of the loop, +
-only a single \smalltt{cpi\_wait} operation is required. When passing in a +
-re-used tag, the memory control action will merge the new transaction with the +
-prior ones. +
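
The tag bookkeeping behind coalescing can be illustrated with a small software model. This is only a sketch under stated assumptions: the `mock_` names, the pool of four tags, and the per-tag counter are hypothetical and do not describe the actual Cor-C runtime or hardware. The model assumes the returned tag is fed back as the append argument on each iteration, and that an invalid tag is returned when no tag is free.

```c
#include <assert.h>

/* Hypothetical software model of transaction tags; the real Cor-C
 * runtime and hardware are not specified at this level in the text. */
#define MOCK_NUM_TAGS 4
#define MOCK_INVALID_TAG (-1)

typedef int mock_tag;

static int mock_in_use[MOCK_NUM_TAGS];   /* 1 if the tag is held by the thread */
static int mock_pending[MOCK_NUM_TAGS];  /* transactions merged under the tag */

/* Issue one non-blocking transfer. Passing a held tag appends to it;
 * passing MOCK_INVALID_TAG allocates a fresh tag, or returns
 * MOCK_INVALID_TAG when none is free so the caller can retry later. */
static mock_tag mock_nb_issue(mock_tag tag_append) {
    if (tag_append != MOCK_INVALID_TAG) {
        assert(tag_append >= 0 && tag_append < MOCK_NUM_TAGS);
        assert(mock_in_use[tag_append]);
        mock_pending[tag_append]++;
        return tag_append;
    }
    for (mock_tag t = 0; t < MOCK_NUM_TAGS; t++) {
        if (!mock_in_use[t]) {
            mock_in_use[t] = 1;
            mock_pending[t] = 1;
            return t;
        }
    }
    return MOCK_INVALID_TAG;
}

/* Wait on a tag: all merged transactions complete and the tag is released.
 * Returns how many transactions were covered by this single wait. */
static int mock_wait(mock_tag tag) {
    assert(tag >= 0 && tag < MOCK_NUM_TAGS && mock_in_use[tag]);
    int done = mock_pending[tag];
    mock_pending[tag] = 0;
    mock_in_use[tag] = 0;
    return done;
}

/* Ten coalesced transfers followed by a single wait. */
static int mock_coalesced_loop(void) {
    mock_tag tag = MOCK_INVALID_TAG;
    for (int i = 0; i < 10; i++) {
        tag = mock_nb_issue(tag);
    }
    return mock_wait(tag);
}
```

In this model the ten transfers occupy a single tag, so one wait covers all of them and the other tags stay free for unrelated transactions.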
- +
-\vspace{10pt} +
-\noindent\textbf{Automatic Notifications.} +
-Another feature of the Cor-C specification is the ability to completely +
-eliminate the need for control threads to synchronize directly with +
-core logic. ​ The \smalltt{cpi\_bind} control action shown in +
-Table~\ref{tab:​memory_actions} allows a control thread to establish a static +
-binding between a CoRAM co-handle and a channel FIFO. Figure~\ref{fig:​bind} +
-illustrates the method of operation---when a control action is performed upon a +
-specific co-handle, the associated channel FIFO will automatically enqueue a +
-token that presents to the core logic the completion of a memory transaction. +
-Completions are placed into the channel FIFO in the same order that +
-transactions are issued. ​ The \smalltt{cpi\_bind} operation reduces the overall +
-latency of a round-trip memory access and also allows a control thread to +
-pipeline multiple non-blocking memory requests without having to periodically +
-test for completions.  +
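
The in-order delivery of completion tokens can be sketched with a small model. Everything here is hypothetical: the `bind_` names, the eight-entry window, and in particular the premise that transactions may complete out of order internally are assumptions made for illustration. The only property taken from the text is that tokens reach the channel FIFO in issue order.

```c
#include <assert.h>

/* Hypothetical model of a bound co-handle: transactions may finish in any
 * order (an assumption), but tokens are delivered strictly in issue order. */
#define BIND_SLOTS 8

static int bind_done[BIND_SLOTS];  /* completion flag per in-flight transaction */
static unsigned bind_issued;       /* number of transactions issued so far */
static unsigned bind_delivered;    /* number of tokens handed to the core logic */

/* Issue a transaction on the bound co-handle; returns its sequence number. */
static unsigned bind_issue(void) {
    assert(bind_issued - bind_delivered < BIND_SLOTS);
    bind_done[bind_issued % BIND_SLOTS] = 0;
    return bind_issued++;
}

/* Mark an in-flight transaction as complete. */
static void bind_complete(unsigned seq) {
    assert(seq - bind_delivered < bind_issued - bind_delivered);
    bind_done[seq % BIND_SLOTS] = 1;
}

/* Deliver the next token if the oldest undelivered transaction has
 * completed. Returns its sequence number, or -1 if no token is ready. */
static int bind_next_token(void) {
    if (bind_delivered == bind_issued) {
        return -1;
    }
    unsigned slot = bind_delivered % BIND_SLOTS;
    if (!bind_done[slot]) {
        return -1;
    }
    bind_done[slot] = 0;
    return (int)bind_delivered++;
}
```

If transaction 2 finishes before transaction 0 in this model, no token appears until transaction 0 completes, which is what lets the core logic interpret each token as "the next block of data is ready".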
- +
-\vspace{10pt} +
-\noindent\textbf{Thread-to-Thread Communication.} +
-Thread-to-thread synchronization can be provided natively in the Cor-C +
-specification for message-passing between multiple threads. In some +
-applications,​ the need for synchronization arises when dependencies must be +
-enforced between phases of computation and when there are multiple concurrent +
-control threads. ​ Custom forms of synchronization can also be facilitated +
-through the use of channels. ​ For example, to implement a fast barrier, users +
-can instantiate channels as needed into the soft logic fabric to implement +
-their own desired synchronization methods. +
- +
-\section{Disallowed Behaviors} +
-Although control threads have the appearance of general-purpose software +
-threads, there are a number of restrictions in the Cor-C specification:​ +
- +
-\begin{itemize} +
- +
-\item The static control actions listed in Table~\ref{tab:​accessors} can only +
-be executed unconditionally (e.g., cannot be conditioned by a loop variable). +
- +
-\item Control threads are limited to 64 CoRAMs per co-handle\footnote{The +
-architectural limit is placed here due to physical constraints +
-imposed by the cluster-style microarchitecture presented in +
-Chapter~\ref{sec:​microarch}.}. +
- +
-\item Control threads may not test invalid tags or perform control actions with +
-invalid arguments. +
- +
-\item Threads may not dynamically allocate memory or instantiate global variables. +
- +
-\item Control threads may not dereference memory pointers directly\footnote{When a +
-control thread needs to directly access memory, a single CoRAM along with a +
-channel FIFO can be allocated and ``wrapped''​ together to form a simple +
-load-store interface (see Chapter~\ref{sec:​casestudy}).}. +
- +
-\item Control threads may not execute floating point operations. +
- +
-\item Function stacks are allowed but must be statically bounded. +
- +
-\item No recursion allowed. +
- +
-\end{itemize} +
- +
-Many of the language restrictions above are intended to reduce the likelihood +
-of ``abusing''​ control threads for computation purposes. The various +
-restrictions also ensure that control threads are highly amenable to +
-lightweight implementations in hardware (i.e., threads synthesized into soft logic or executed +
-on lightweight microprocessors). +
- +
-\section{Simple ​Example: Vector Addition} To concretely illustrate the features +
-of the Cor-C language, the code below gives a complete top-to-bottom example of +
-the vector increment kernel, where a sequential array of data is read in from +
-memory, incremented,​ and written back to main memory. The particular kernel in +
-this example performs two concurrent increments per clock cycle. +
- +
-{\footnotesize +
-\begin{verbatim} +
-void vector_increment_thread()  +
-{ +
-    cpi_register_thread("vector_add", 1/*number of threads*/); +
-    cpi_hand data_store = cpi_get_rams(2/​*numRams*/,​ true/​*scatter*/,​ 0, 0); +
-    cpi_hand bind_channel = cpi_get_channel(cpi_fifo,​ 0); +
-    cpi_hand done_channel = cpi_get_channel(cpi_fifo,​ 1); +
-    cpi_bind(bind_channel,​ data_store);​ +
-    cpi_tag tag = CPI_INVALID_TAG;​ +
- +
-    /* Read memory */ +
-    for(int i=0; i < 128; i+=8) +
-        tag = cpi_nb_write_ram(data_store,​ i, i*8, 8, tag); +
- +
-    /* Wait for computation to finish */ +
-    while(!cpi_read_channel(done_channel)) {} +
-    tag = CPI_INVALID_TAG; /* reset the existing tag; redeclaring it would not compile */ +
- +
-    /* Writeback to memory */ +
-    for(int i=0; i < 128; i+=8) +
-        tag = cpi_nb_read_ram(data_store,​ i, i*8, 8, tag); +
-    cpi_wait(tag);​ +
-} +
- +
-module vector_kernel(CLK,​ RST_N); +
-    input CLK, RST_N; +
-    reg busy, writeback, done, dout_en, wen; +
-    reg [5:0] addr, waddr; +
-    reg [31:0] din0, din1; +
-    wire [31:0] dout0, dout1; +
- +
-    CORAM2#​("​vector_add",​ 0/​*thread-id*/,​ 0 /​*obj-id*/,​ +
-            0/​*sub-id*/,​ 32/*data width*/, 32 /*depth*/, 5/​*addr-width*/​)  +
-            arr0 (.CLK(CLK), .RST_N(RST_N),​ .en(1'​b1),​ .wen(wen),  +
-                  .waddr(waddr),​ .addr(addr),​ .din(din0), .dout(dout0));​ +
- +
-    CORAM2#​("​vector_add", ​0/​*thread-id*/,​ 0 /​*obj-id*/,​ +
-            1/​*sub-id*/,​ 32/*data width*/, 32 /*depth*/, 5/​*addr-width*/​) +
-            arr1 (.CLK(CLK), .RST_N(RST_N),​ .en(1'​b1),​ .wen(wen),  +
-                  .waddr(waddr),​ .addr(addr),​ .din(din1), .dout(dout1));​ +
-  +
-    ChannelFIFO cfifo(.CLK(CLK),​ .RST_N(RST_N),​  +
-                      .dout_en(dout_en),​ .dout(0), .../*unused signals*/​);​ +
- +
-    always@(posedge CLK) begin +
-        if(RST_N) begin +
-            if(dout_rdy && !busy) begin +
-                writeback <= 0; +
-                busy <= 1; +
-                addr <= 0; +
-                waddr <= 0; +
-            end +
-            else if(busy) begin +
-                addr <= addr + 1; +
-                writeback <= (addr < 32); +
-                busy <= (addr < 32); +
-            end +
-            else writeback <= 0; +
- +
-            if(writeback) begin +
-                wen <= 1'​b1;​ +
-                din0 <= dout0+1; +
-                din1 <= dout1+1; +
-                waddr <= waddr+1; +
-                if(waddr == 31) dout_en <= 1; +
-            end +
-            else begin +
-                wen <= 1'​b0;​ +
-                dout_en <= 1'​b0;​ +
-            end +
-        end +
-        else begin +
-            writeback <= 0; +
-            dout_en <= 0; +
-            busy <= 0; +
-            wen <= 0; +
-        end +
-    end +
- +
-endmodule +
- +
-\end{verbatim} +
-}  +
-  +
-In the first phase of the control thread program above, the thread sets up a +
-programmed transfer that reads in 128B of data from memory into 2 separate +
-embedded CoRAMs represented by a single co-handle. ​ To present a wide 64-bit +
-word interface to the fabric, the co-handle is composed with the scatter-gather +
-argument set to true.  The \smalltt{cpi\_bind} operation thereafter establishes +
-an implicit channel between the memory system and the core logic, which is the +
-ultimate consumer of the memory data. Within the core logic, two embedded +
-CoRAMs and a single channel FIFO are instantiated as black-box modules. As +
-memory transactions stream through one at a time, the core logic will receive +
-tokens through the channel FIFO, indicating that data is ready for access +
-within the CoRAMs. In the simple example above, the core logic waits until all +
-the tokens are received before performing the increment steps. During the +
-compute phase, the core logic reads and writes 32 clock cycles worth of data +
-from the embedded CoRAMs. Upon completion, a token is written back from the +
-core logic to the control thread indicating that a writeback to memory is +
-pending. ​ The control thread in the background wait-polls on the channel until +
-receiving the token and then performs the final set of memory control actions +
-to write data from the CoRAMs to memory. +
- +
-\vspace{10pt} +
-\noindent \textbf{Summary.} It is not difficult to imagine that many variants of control actions could be +
-added to the Cor-C architecture specification to support more sophisticated +
-patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch, +
-strided access, programmable patterns, etc.). In a commercial production +
-setting, control actions---like instructions in an ISA---must be carefully +
-defined and preserved to achieve the value of portability and compatibility. +
-Optimizing compilers could also play a significant role in static +
-optimization of control thread programs. Analysis could be used, for example, +
-to identify non-conflicting control actions that are logically executed in +
-sequence but can actually be executed concurrently without affecting +
-correctness. The next section describes a compiler and proof-of-concept of the +
-Cor-C architecture specification. Chapter~\ref{sec:​casestudy} will later +
-present concrete demonstrations of Cor-C-based applications. +
-   +
-%Without delving into details, the Cor-C language essentially follows same basic +
-%lexical elements as the standard C language~\cite{krbook}. ​  +
-%\subsection{Thread State and Memory} +
-%Control threads can declare state variables and bounded-sized arrays through +
-%conventional C-based syntax (e.g., \textit{int x = 0}). Under limited +
-%circumstances,​ structs may also be used to organize and group variables +
-%together.  +
-%Global variables declared outside the scope of a function are \textbf{not +
-%permitted} in the Cor-C specification. Control threads are also not permitted +
-%to directly access the global main memory through dereferencing of pointers. +
-%This limitation of control threads can be addressed through specialized +
-%memory personalities,​ which will be later described in Section\ref{sec}. +
- +
-\section{The CoRAM Control Compiler (CORCC)} +
- +
-\pdf{control_options.pdf} +
-{\largewidth} +
-{Options for Synthesizing Control Threads.} +
-{Options for Synthesizing Control Threads.} +
-{fig:​control_options} ​  +
- +
-\pdf{corcc_example.pdf} +
-{\xcolumnwidth} +
-{CORCC Example.} +
-{CORCC Example.} +
-{fig:​corcc_example} +
- +
-The CoRAM Control Compiler (CORCC) was developed in this thesis to explore the +
-various implementation options for control threads. +
-Figure~\ref{fig:​control_options} shows the several ways in which a control +
-thread program can be mapped down into control logic in an FPGA with CoRAM +
-support: (1) directly compiling control thread programs into soft logic state +
-machines via high-level synthesis, (2) compiling control threads to +
-pre-implemented soft microprocessor cores (e.g., Xilinx +
-Microblaze~\cite{xilinx} or Altera Nios~\cite{altera}) or (3) compiling to a +
-hard microprocessor serving as a dedicated microcontroller. +
- +
-CORCC supports direct synthesis of control threads into synthesizable RTL from +
-standard C code and can also be configured to model the cycle-time performance +
-of a simple microprocessor. ​ The implementation of CORCC leverages the Low +
-Level Virtual Machine (LLVM) framework~\cite{llvm},​ which is an open-source,​ +
-end-to-end compiler with pluggable extensions for custom passes and backends. +
-CORCC leverages the modularity of LLVM and its language-independent +
-intermediate representation (IR) to implement a simple form of high-level +
-synthesis with extensions for microprocessor performance modeling. +
- +
-\vspace{10pt} +
-\noindent \textbf{Implementation.} +
-The CORCC LLVM extension is implemented in 6000 lines of C++ as a series of LLVM +
-passes. CORCC extends LLVM with special objects and data types that are +
-specific to the Cor-C architecture specification. ​ These include CoRAM and +
-channel accessors, co-handles, and the memory/​channel control actions. ​ LLVM +
-includes front-ends for several popular languages such as C and C++ and can +
-automatically generate an intermediate representation (IR) in Static Single +
-Assignment (SSA) form. In SSA form, each variable in a routine is assigned +
-exactly once, which enables various optimizations and simplifies reasoning +
-about variables.  An important feature of the IR is the LLVM type +
-system, which provides high-level program-level information accessible at the +
-assembly level. ​ The first stage of CORCC is automatically handled by LLVM, +
-which translates the high-level control thread program into IR organized into +
-basic blocks. The LLVM instruction set comprises about 70 +
-instructions~\cite{llvm}, of which a subset of about 30 is supported in +
-CORCC\footnote{Use of unsupported instructions results in compile-time +
-errors in CORCC.}. +
- +
-\vspace{10pt} +
-\noindent \textbf{Thread-to-Hardware Interface.} +
-The control threads of an application, which exist either in the form of soft +
-finite state machines or as microcontrollers, can be viewed as clients that +
-issue memory requests to the underlying memory subsystem comprising embedded +
-CoRAMs, the network-on-chip, and the edge memory interfaces (as illustrated +
-earlier in Figure~\ref{fig:control_options}). CORCC assumes that in a soft +
-implementation of control threads, a well-defined request-response interface +
-exists between the underlying subsystem and the control threads implemented in +
-the fabric. The details of such interfaces are described further in +
-Chapter~\ref{sec:microarch}. +
- +
-\vspace{10pt} +
-\noindent \textbf{Co-handle Pass.} +
-The first CORCC-specific stage performs a sweep through the LLVM-generated IR and +
-identifies \smalltt{call} instructions that match the function signatures of +
-static control actions such as co-handle and channel accessors (see +
-Table~\ref{tab:accessors}) that are used to establish bindings to various +
-CoRAM-related objects. In LLVM, the result of any call that returns a value is +
-assigned to a register with a unique integer identifier. Within CORCC, the static pass +
-creates an internal map between an identified co-handle and its corresponding +
-destination register. When processing a co-handle, the function arguments are +
-checked to be constant and valid values. During this pass, any dynamic control +
-actions are annotated and linked against the detected co-handles. The link step +
-backtraces through registers in the IR to identify the specific +
-co-handle associated with a dynamic control action. +
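
The backtracing step can be pictured with a small software model. The sketch below is not CORCC's code: the \smalltt{inst} record, the three opcodes, and \smalltt{resolve\_cohandle} are hypothetical names introduced only for illustration. It assumes straight-line IR in which register copies are the only instructions between a co-handle accessor and the control action that uses it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified opcodes; real LLVM IR is far richer. */
enum op { OP_COHANDLE, OP_COPY, OP_ACTION };

struct inst {
    enum op op;
    int dest;  /* SSA register written by this instruction */
    int src;   /* SSA register read, or -1 if none */
};

/*
 * Walk backwards from the dynamic control action at index `at`, following
 * register copies, until the defining co-handle accessor is found.
 * Returns the index of that accessor, or -1 if none can be identified.
 */
int resolve_cohandle(const struct inst *ir, size_t n, size_t at)
{
    assert(ir != NULL);
    if (at >= n || ir[at].op != OP_ACTION)
        return -1;

    int reg = ir[at].src;
    for (size_t k = at; k-- > 0; ) {
        if (ir[k].dest != reg)
            continue;            /* defines some other register */
        if (ir[k].op == OP_COHANDLE)
            return (int)k;       /* found the static accessor */
        if (ir[k].op == OP_COPY) {
            reg = ir[k].src;     /* keep tracing through the copy */
            continue;
        }
        return -1;               /* defined by something we cannot trace */
    }
    return -1;
}

/* Sample IR: two co-handles, one copy, and two control actions. */
static const struct inst sample_ir[] = {
    { OP_COHANDLE, 1, -1 },
    { OP_COHANDLE, 2, -1 },
    { OP_COPY,     3,  1 },
    { OP_ACTION,   4,  3 },
    { OP_ACTION,   5,  2 },
};
```

Because SSA form gives every register exactly one definition, a single backward scan is enough in this simplified setting; handling control flow and \smalltt{phi} nodes would need more machinery than is shown here.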
- +
-\vspace{10pt} +
-\noindent \textbf{Thread Synthesis.} +
-Once all co-handles have been identified, CORCC performs a synthesis step that +
-translates the LLVM instructions and the Cor-C dynamic control actions into +
-synthesizable Verilog. The basic approach taken by CORCC is to perform a direct +
-mapping of basic blocks into single-cycle states in a finite state machine. To +
-handle register state, CORCC instantiates a physical register for each assigned +
-variable in a program. To implement logic, all of the instructions in a +
-basic block are converted into combinational statements, where the +
-inputs to the logic are read from registers in a single clock cycle (and +
-in the same clock cycle, the output is written to the destination registers). +
-The SSA form of LLVM guarantees that no registers are read and written at the +
-same time within a single basic block. +
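
To make the block-per-state mapping concrete, the following is a minimal software model of such a finite state machine for a loop that sums the integers below \smalltt{n}. It is a sketch under stated assumptions rather than CORCC output: the state names, the \smalltt{regs} structure, and \smalltt{run\_fsm} are hypothetical, and for brevity the loop-carried \smalltt{phi} values are folded into single registers, so unlike strict SSA a register may be read and written in the same state. The two-phase update (read the current registers, write the next ones) keeps that well defined and mirrors a clocked register update.

```c
#include <assert.h>

/* One FSM state per basic block (names are illustrative). */
enum state { BB_ENTRY, BB_LOOP, BB_EXIT, BB_DONE };

struct regs {
    enum state st;
    unsigned i;
    unsigned sum;
};

struct fsm_result {
    unsigned sum;
    unsigned cycles;
};

/* One clock cycle: inputs come from `cur`; outputs are latched in `next`. */
static struct regs step(struct regs cur, unsigned n)
{
    struct regs next = cur;
    switch (cur.st) {
    case BB_ENTRY:
        next.i = 0;
        next.sum = 0;
        next.st = BB_LOOP;
        break;
    case BB_LOOP:
        if (cur.i < n) {
            next.sum = cur.sum + cur.i;  /* whole block in one cycle */
            next.i = cur.i + 1;
        } else {
            next.st = BB_EXIT;
        }
        break;
    case BB_EXIT:
        next.st = BB_DONE;
        break;
    case BB_DONE:
        break;
    }
    return next;
}

/* Clock the FSM until it reaches BB_DONE, counting cycles. */
struct fsm_result run_fsm(unsigned n)
{
    struct regs r = { BB_ENTRY, 0, 0 };
    unsigned cycles = 0;
    while (r.st != BB_DONE) {
        r = step(r, n);
        assert(r.i <= n);  /* loop counter never overshoots the bound */
        cycles++;
    }
    struct fsm_result out = { r.sum, cycles };
    return out;
}
```

In this model every visit to a basic block costs exactly one cycle regardless of how many operations the block contains, which is the property that makes large basic blocks a critical-path concern.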
- +
-To handle dynamic control actions, special states are introduced at locations in the +
-LLVM IR where \smalltt{call} instructions are detected. For example, when a +
-\smalltt{cpi\_write\_ram} function call is detected, the parent basic block +
-will be split into two states, one containing the original instructions and the other +
-for invocation of the control action. The predecessor basic block will always +
-jump into the special state first, which handles the actual issue of the +
-control action to the memory subsystem through a request-response interface; +
-thereafter, the FSM jumps into the original basic block while returning the +
-value of the control action. Figure~\ref{fig:corcc_example} gives a complete +
-example of compiling the simple example from Chapter~\ref{sec:coram} into +
-synthesizable hardware. +
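
The split into a special state and the original block can be sketched in the same software-model style. Everything here is an assumption made for illustration: the \smalltt{memsys} structure, its fixed latency, the returned value 42, and the function names do not come from CORCC or the Cor-C specification. The special state issues the request once, stalls until the response is valid, and then hands the returned value to the original block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memory subsystem: answers `latency` cycles after a request. */
struct memsys {
    unsigned latency;
    unsigned countdown;
    unsigned value;  /* value the control action returns */
};

static void mem_request(struct memsys *m)
{
    assert(m != NULL);
    m->countdown = m->latency;
}

/* Polled once per cycle; returns true when the response is valid. */
static bool mem_response(struct memsys *m, unsigned *out)
{
    if (m->countdown > 0) {
        m->countdown--;
        return false;
    }
    *out = m->value;
    return true;
}

/* S_CALL is the inserted special state; S_BODY is the original block. */
enum state { S_CALL, S_BODY, S_DONE };

struct outcome {
    unsigned cycles;
    unsigned result;
};

struct outcome run_split_block(unsigned latency)
{
    struct memsys mem = { latency, 0, 42 };
    enum state st = S_CALL;
    bool issued = false;
    unsigned ret = 0;
    struct outcome out = { 0, 0 };

    while (st != S_DONE) {
        out.cycles++;
        switch (st) {
        case S_CALL:
            if (!issued) {
                mem_request(&mem);        /* issue the control action */
                issued = true;
            } else if (mem_response(&mem, &ret)) {
                st = S_BODY;              /* response valid: resume */
            }
            break;
        case S_BODY:
            out.result = ret + 1;         /* original block uses the value */
            st = S_DONE;
            break;
        case S_DONE:
            break;
        }
    }
    return out;
}
```

The cycle count grows with the modeled latency because the FSM stalls in the special state, while the original block still completes in a single cycle.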
- +
-\vspace{10pt} +
-\noindent \textbf{Microprocessor Performance Modeling.} +
-To explore the design space for microcontroller-based control threads in our +
-evaluation in Chapter~\ref{sec:evaluation}, CORCC includes an additional +
-feature that approximates the performance characteristics of a simple in-order +
-microprocessor core. The core is modeled with a constant CPI value +
-(cycles per LLVM instruction) and is assumed to have specialized logic that +
-interfaces directly to the underlying memory subsystem described in +
-Chapter~\ref{sec:microarch}. The programmed CPI value sets the rate at which +
-control threads advance through the LLVM basic blocks in order to mimic the +
-performance characteristics of an idealized microprocessor. +
-Chapter~\ref{sec:evaluation} will later present simulation-driven results that +
-compare direct synthesis by CORCC to soft and hard microprocessor cores. +
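
As a rough illustration of the constant-CPI model, the sketch below estimates cycle counts for a trace of executed basic blocks. The text does not spell out how CORCC applies the CPI value, so this assumes the simplest reading: each block costs its LLVM instruction count multiplied by the CPI, with memory latency ignored. The function names and the sample trace are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Modeled microprocessor: every LLVM instruction in every executed
 * basic block costs `cpi` cycles (memory latency is not modeled).
 */
unsigned long modeled_cpu_cycles(const unsigned *insts, size_t nblocks,
                                 unsigned cpi)
{
    assert(insts != NULL || nblocks == 0);
    unsigned long cycles = 0;
    for (size_t k = 0; k < nblocks; k++)
        cycles += (unsigned long)insts[k] * cpi;
    return cycles;
}

/* Direct synthesis: each executed basic block is one single-cycle state. */
unsigned long direct_fsm_cycles(size_t nblocks)
{
    return (unsigned long)nblocks;
}

/* Instruction counts of three executed basic blocks (made-up numbers). */
static const unsigned sample_trace[] = { 3, 5, 2 };
```

Comparing the two functions on the same trace shows why the directly synthesized FSM can advance faster than a modeled core: its cost depends on the number of blocks visited, not on the number of instructions inside them.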
- +
-\vspace{10pt} +
-\noindent \textbf{CORCC Limitations.} +
-The CORCC compiler employs a relatively simple approach to high-level +
-synthesis, which completely expands the basic blocks of an application +
-into synthesizable hardware. The simple approach taken here can have a +
-detrimental effect on performance and area, especially if LLVM produces large +
-basic blocks or allocates a large number of registers. A potential way to +
-shorten long critical paths within a basic block is to split basic +
-blocks where necessary, which can either be supported automatically or +
-guided by the user. More advanced high-level synthesis techniques can also be +
-applied, e.g., constraining and scheduling the usage of resources. +
-As will be shown later in Chapter~\ref{sec:evaluation}, without any +
-optimizations, the FSMs generated by the CORCC compiler consume relatively +
-modest area while operating at nominal FPGA clock frequencies. +
- +
-\vspace{10pt} +
-\noindent \textbf{Cor-C vs. Parallel Languages.}  +
-Our selection of the C language is not a fundamental requirement of the CoRAM +
-paradigm. An area that merits further research is the use of functional or +
-parallel languages to express higher levels of parallelism within control +
-threads. A particular consequence of using a sequential language such as C is +
-the serialization of requests during dynamic execution. Consider the for loop +
-below, which generates a stream of requests to the memory subsystem: +
- +
-{\footnotesize +
-\begin{verbatim} +
-for(int i=0; i < 8192; i+= BLOCK_BYTES) { +
-    tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag); +
-} +
-\end{verbatim} +
-}  +
- +
-In the example above, CORCC would not allow the multiple control actions +
-to execute in parallel due to serialization on the coalesced tag variable. +
-In such cases, parallel constructs such as \smalltt{forall} can explicitly +
-declare that the loop body operations are independent. +
- +
-\section{Summary} +
-This chapter presented the Cor-C architecture specification and compiler. +
-Cor-C is a concrete instance of the CoRAM concept and provides an example +
-of how the CoRAM concept is applied in a real-world environment. +
-The CoRAM Control Compiler (CORCC) is a proof-of-concept that implements the +
-Cor-C specification and is evaluated further in Chapter~\ref{sec:evaluation}. +
 
docs.1327348006.txt.gz · Last modified: 2012/01/23 19:46 by echung
 
 
CC Attribution-Noncommercial-Share Alike 4.0 International