This shows you the differences between two versions of the page.
docs [2012/01/23 19:46] echung |
docs [2012/01/23 20:24] (current) echung |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | \chapter{Cor-C Architecture and Compiler} | + | ====== Corflow Documentation and Examples ====== |
- | \label{sec:language} | + | |
- | + | ^ Date ^ Documentation ^ Version ^ Download ^ | |
- | \renewcommand{\epigraphflush}{flushright} | + | | 23 Feb 2012 | Corflow Programming Guide | 1.0 | pdf | |
- | \renewcommand{\epigraphwidth}{4.5in} | + | | 23 Feb 2012 | Simple Examples | 1.0 | zip | |
- | \renewcommand{\epigraphrule}{0pt} | + | |
- | \epigraph{\hfill\textit{I have stopped reading Stephen King novels. Now I just read C code instead.}}{Richard O'Keefe} | + | |
- | \vspace{10pt} | + | |
- | + | ||
- | The Cor-C architecture specification is an \textit{instance} of the CoRAM | + | |
- | concept, which establishes all of the requisite details, data types, | + | |
- | constraints, and semantics that are necessary for a real portable | + | |
- | hardware/software interface\footnote{An appropriate analogy would be the MIPS | + | |
- | ISA being an instance of the RISC concept.}. The Cor-C architecture | + | |
- | specification defines a dialect of the C language that can be used to express | + | |
- | the desired behavior of control thread programs. The use of a standard, | + | |
- | high-level language such as C affords an application developer not only simpler | + | |
- | but also more natural expressions of control flow and memory pointer | + | |
- | manipulations. It is important to note that Cor-C is not intended to be used | + | |
- | as a medium for expressing the computational components of an application but | + | |
- | rather, to be used as a lightweight memory management interface that ``wraps'' | + | |
- | a given application to facilitate portability and to reduce design effort. | + | |
- | + | ||
- | This chapter begins by introducing the salient features of the Cor-C language, | + | |
- | including data types, thread invocation and management, control actions, and | + | |
- | the semantics of memory. Section~\ref{sec:language} will describe a prototype | + | |
- | compiler for the Cor-C specification, which compiles control thread programs | + | |
- | into finite state machines. Chapter~\ref{sec:casestudy} will later present | + | |
- | actual uses of the prototype compiler for developing real applications using | + | |
- | the Cor-C language. | + | |
- | + | ||
- | \section{Cor-C Overview} | + |
- | \label{sec:detail} | + | |
- | The standard collection of primitives in Cor-C | + |
- | is divided into \textit{static} versus \textit{dynamic} control actions. | + |
- | Table~\ref{tab:accessors} illustrates accessor control actions that | + |
- | are statically processed at compile-time, while Table~\ref{tab:memory_actions} | + | |
- | illustrates control actions that are executed dynamically throughout the | + | |
- | course of an application. The control actions have the appearance of a memory | + | |
- | management API, and abstract away the details of the underlying hardware | + | |
- | support---similar to the role served by the Instruction Set Architecture (ISA) | + | |
- | between software and evolving hardware implementations. As will be shown later | + | |
- | in Chapter~\ref{sec:casestudy}, the basic set of control actions defined are | + | |
- | powerful building blocks that can be used to compose more sophisticated memory | + | |
- | abstractions such as scratchpads, caches, and FIFOs---each of which is tailored to | + |
- | the memory patterns and desired interfaces of specific applications. | + | |
- | + | ||
- | %The syntax and conventions shown are based on the \textbf{Cor-C architecture | + | |
- | %language specification}, which is a devised \textbf{instance} of the CoRAM | + | |
- | %architecture in this thesis and prototyped later in | + | |
- | %Chapter~\ref{sec:prototype}. | + | |
- | + | ||
- | + | ||
- | \begin{table} | + | |
- | \centering | + | |
- | \begin{tabular}{@{} l l@{}} | + | |
- | \toprule | + | |
- | Data type & Description \\ \midrule | + | |
- | \smalltt{bool} & 1-bit boolean \\ | + | |
- | \smalltt{char, uchar} & 8-bit signed and unsigned integers \\ | + | |
- | \smalltt{sint, suint} & 16-bit signed and unsigned integers \\ | + | |
- | \smalltt{int, uint} & 32-bit signed and unsigned integers \\ | + | |
- | \smalltt{int64, uint64} & 64-bit signed and unsigned integers \\ | + | |
- | \smalltt{cpi\_channel\_ty} & An enumeration of channel object types \{reg, fifo\} \\ | + |
- | \smalltt{cpi\_addr} & 64-bit virtual address \\ | + | |
- | \smalltt{cpi\_ram\_addr} & 16-bit local ram address \\ | + | |
- | \smalltt{cpi\_hand} & Static handle for CoRAMs and channel objects \\ | + | |
- | \smalltt{cpi\_tag} & Transaction tag for logical memory transactions \\ | + | |
- | \bottomrule | + | |
- | \end{tabular} | + | |
- | \caption{Cor-C Data Types} | + | |
- | \label{tab:types} | + | |
- | \end{table} | + | |
- | + | ||
- | + | ||
- | \begin{table} | + | |
- | \centering | + | |
- | \small | + | |
- | \begin{tabular}{ l p{4.5in} } | + | |
- | \toprule | + | |
- | Control Action & Description \\ \midrule | + | |
- | \smalltt{cpi\_register} & Registers a control thread with name \smalltt{thread\_name} and replicates it \smalltt{N} times. \\ | + | |
- | & \smalltt{void cpi\_register\_thread(cpi\_str thread\_name, cpi\_int N);} \\ \midrule | + | |
- | \smalltt{cpi\_instance} & Returns the thread ID as an \smalltt{int}. \\ | + | |
- | & \smalltt{int cpi\_instance();} \\ \midrule | + | |
- | \smalltt{cpi\_get\_ram} & Returns a ram co-handle uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id);} \\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, int sub\_id);} \\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_ram(int obj\_id, ...);} \\\midrule | + | |
- | \smalltt{cpi\_get\_rams} & Returns a co-handle that combines \smalltt{N} rams together as a single logical memory. | + | |
- | When \smalltt{scatter} is enabled, the rams are combined in a word-interleaved fashion. If \smalltt{scatter} is disabled, the rams are composed linearly. The rams selected are based on \smalltt{N} consecutively numbered ids from \smalltt{id...id+N-1}, where \smalltt{id} is the last argument used in the control action. \\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id);} \\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, int sub\_id);}\\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_rams(int N, bool scatter, int obj\_id, ...);} \\\midrule | + | |
- | \smalltt{cpi\_get\_channel} & Returns a channel co-handle based on the enumeration \smalltt{ty}. The channel is uniquely identified by \smalltt{obj\_id} and an optional list of sub-ids.\\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, int obj\_id);} \\ | + | |
- | & \smalltt{cpi\_hand cpi\_get\_channel(cpi\_channel\_ty ty, ...);} \\ | + | |
- | + | ||
- | \bottomrule | + | |
- | \end{tabular} | + | |
- | \caption{Cor-C Accessor Control Actions (Static).} | + | |
- | \label{tab:accessors} | + | |
- | \end{table} | + | |
- | + | ||
- | \subsection{Control Threads} | + | |
- | Every application in CoRAM begins with a source-level description of control | + | |
- | threads to act as a ``wrapper'' around the core processing logic. Control | + | |
- | threads are written in the Cor-C language, which is syntactically identical to | + | |
- | C~\cite{kc}. Table~\ref{tab:types} summarizes the types in the | + | |
- | language, which include several types specific to the Cor-C language. To begin | + | |
- | an application, threads are declared using the \smalltt{cpi\_register} function | + | |
- | from Table~\ref{tab:accessors}, which takes as argument a unique thread name | + | |
- | and a scale factor that replicates the body of the containing function | + | |
- | \smalltt{N} times. The code below illustrates how two separate Cor-C functions | + | |
- | would be instantiated in a single program. In this example, a total of three | + | |
- | threads would be executed during runtime (one of threadA, two of threadB). | + | |
- | + | ||
- | {\footnotesize | + | |
- | \begin{verbatim} | + | |
- | // single thread | + | |
- | void threadA() { | + | |
- | cpi_register("thread-A", 1); | + | |
- | ... | + | |
- | } | + | |
- | + | ||
- | // two threads | + | |
- | void threadB() { | + | |
- | cpi_register("thread-B", 2); | + | |
- | ... | + | |
- | } | + | |
- | \end{verbatim} | + | |
- | } | + | |
- | + | ||
- | + | ||
- | \subsection{Object Instantiation and Identification} | + | |
- | + | ||
- | \begin{figure} | + | |
- | \centering | + | |
- | \begin{minipage}[t]{\columnwidth} | + | |
- | \lstinputlisting[label=lst:bbox_coram,caption=Verilog black-box definition for single-ported embedded CoRAM and Channel FIFO.]{code/bbox_coram.c} | + | |
- | \end{minipage} | + | |
- | \end{figure} | + | |
- | + | ||
- | + | ||
- | %\begin{figure} | + | |
- | % \centering | + | |
- | % \begin{minipage}[t]{\columnwidth} | + | |
- | % \lstinputlisting[label=lst:bbox_cfifo,caption=Verilog black-box definition for Channel FIFO.]{code/bbox_cfifo.c} | + | |
- | % \end{minipage} | + | |
- | %\end{figure} | + | |
- | + | ||
- | To utilize embedded CoRAMs, a designer begins with a pre-defined library | + | |
- | of black-box module wrappers written in a specific hardware description | + | |
- | language. Listing~\ref{lst:bbox_coram} (top) shows the Verilog port list of an | + | |
- | embedded CoRAM with a single read-write SRAM port. Unlike a typical SRAM, the | + | |
- | embedded CoRAM includes extra parameters specific to the Cor-C architecture | + | |
- | specification. The \smalltt{THREAD} parameter is a string that names a particular | + |
- | control thread associated with the CoRAM. The additional field, | + |
- | \smalltt{THREAD\_ID}, is necessary for scale factors greater than 1 | + |
- | (i.e., when a thread is replicated with \smalltt{cpi\_register}). | + | |
- | Finally, the \smalltt{OBJECT\_ID} and an optional list of \smalltt{SUB\_ID} | + | |
- | parameters distinguish between multiple CoRAMs managed by a single control | + | |
- | thread instance. In addition to the CoRAMs, users may also instantiate channel | + | |
- | FIFOs (shown in Listing~\ref{lst:bbox_coram}, bottom) that enable the core logic to | + | |
- | communicate with specific control threads in the application. The convention for | + |
- | identifying and acquiring channel objects is the same as that for acquiring | + |
- | CoRAMs. | + |
- | + | ||
- | \pdf{composition.pdf} | + | |
- | {\columnwidth} | + | |
- | {Linear and Scatter-Gather RAM Compositions.} | + | |
- | {Linear and Scatter-Gather RAM Compositions.} | + | |
- | {fig:composition} | + | |
- | + | ||
- | When performing accesses to memory, a control thread typically gathers one or | + | |
- | more instantiated CoRAMs into a single, program-level identifier called the | + | |
- | \textbf{co-handle}---or \smalltt{cpi\_hand} for short. The co-handle | + |
- | establishes a compile-time binding between an individual control thread and a | + | |
- | collection of one or more CoRAMs that are functioning as a single logical unit. | + | |
- | As with embedded memories in conventional FPGAs, CoRAMs can be combined to form flexible aspect ratios | + |
- | and capacities. Figure~\ref{fig:composition} illustrates how multiple CoRAMs | + | |
- | are composed to form a single RAM with deeper entries (called linear) or a | + | |
- | single RAM with wider data words (called scatter/gather). The composition of | + | |
- | multiple RAMs can be declared in a control thread using the \smalltt{cpi\_get\_rams} | + |
- | accessor function, which returns a co-handle that represents one or more CoRAMs | + |
- | functioning as a single logical unit. The \smalltt{cpi\_get\_rams} accessor takes as | + |
- | arguments the number \smalltt{N} of CoRAMs, an option to compose the CoRAMs linearly | + |
- | or in scatter-gather mode, and the base \smalltt{obj\_id} (plus an optional | + |
- | list of sub-ids) to uniquely identify a range of CoRAMs. | + | |
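The two composition modes can be sketched as an address mapping. The block below is a minimal illustration and not part of the Cor-C API: the type word_location and the function locate_word are hypothetical names, and the mapping assumes that "linear" fills each CoRAM completely before moving to the next while "scatter" rotates consecutive RAM-sized words across the CoRAMs, as the word-interleaved description in Table~\ref{tab:accessors} suggests.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of the Cor-C API: the physical location
   of one RAM-sized word inside a composed co-handle. */
typedef struct {
    int ram;        /* which of the N CoRAMs holds the word */
    int local_addr; /* word address inside that CoRAM */
} word_location;

/* Map a logical word index onto n_rams CoRAMs of `depth` words each.
   scatter == true  : word-interleaved, consecutive words rotate across RAMs.
   scatter == false : linear, each RAM is filled before the next one is used. */
word_location locate_word(int word_index, int n_rams, int depth, bool scatter)
{
    word_location loc;
    if (scatter) {
        loc.ram = word_index % n_rams;
        loc.local_addr = word_index / n_rams;
    } else {
        loc.ram = word_index / depth;
        loc.local_addr = word_index % depth;
    }
    return loc;
}
```

Under these assumptions, with two 32-entry CoRAMs, logical word 5 lands in RAM 1 at local address 2 in scatter mode and in RAM 0 at local address 5 in linear mode.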
- | + | ||
- | \subsection{Memory Control Actions} | + | |
- | + | ||
- | The basic role of the control thread is to perform memory operations upon | + | |
- | co-handles and to inform the core processing logic through channels when | + | |
- | particular operations have completed. The most basic way to operate upon a | + | |
- | co-handle is to pass it into a \smalltt{cpi\_write\_ram} memory control action, which | + |
- | performs a logical memory transfer of \smalltt{size} bytes from the global | + | |
- | memory address \smalltt{mem\_addr} to the local address \smalltt{ram\_addr} of | + | |
- | the CoRAMs named by co-handle. When completed, a sequential block of data from | + | |
- | memory will be split into RAM-sized words that are written in sequence | + | |
- | according to the arranged memory-mapping of addresses of each individual CoRAM | + | |
- | (see Figure~\ref{fig:composition}). | + | |
- | + | ||
- | + | ||
- | \begin{table} | + | |
- | \centering | + | |
- | \small | + | |
- | \begin{tabular}{ l p{4.5in} } | + | |
- | \toprule | + | |
- | Control Action & Description \\ \midrule | + | |
- | \smalltt{cpi\_nb\_write\_ram} & Performs a non-blocking transfer of \smalltt{N} bytes from memory at address \smalltt{addr} to the \smalltt{rams} co-handle beginning at local address \smalltt{ram\_addr}. Returns a transaction tag \smalltt{cpi\_tag} which can be valid or invalid (\smalltt{CPI\_INVALID\_TAG}). If the last argument \smalltt{tag\_append} is set to a value equal to the tag of a previous non-blocking transfer, the current transfer will be appended to the previous transaction and will share the same tag.\\ | + | |
- | & \smalltt{tag = cpi\_nb\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\\midrule | + | |
- | \smalltt{cpi\_nb\_read\_ram} & Same as \smalltt{cpi\_nb\_write\_ram} except that transfers move from \smalltt{rams} to memory.\\ | + | |
- | & \smalltt{tag = cpi\_nb\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N, cpi\_tag tag\_append);} \\\midrule | + |
- | \smalltt{cpi\_write\_ram} & Same as \smalltt{cpi\_nb\_write\_ram} except that control threads suspend until the transaction completes.\\ | + | |
- | + | ||
- | & \smalltt{cpi\_write\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule | + | |
- | \smalltt{cpi\_read\_ram} & Same as \smalltt{cpi\_nb\_read\_ram} except that control threads suspend until the transaction completes.\\ | + | |
- | + | ||
- | & \smalltt{cpi\_read\_ram(cpi\_hand rams, cpi\_ram\_addr ram\_addr, cpi\_addr addr, cpi\_int N);} \\\midrule | + | |
- | \smalltt{cpi\_test} & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and returns a \smalltt{bool} indicating whether previous transactions associated with \smalltt{cpi\_nb\_read\_ram} or \smalltt{cpi\_nb\_write\_ram} have completed. \\ | + | |
- | & \smalltt{bool cpi\_test(cpi\_hand rams, cpi\_tag tag);} \\\midrule | + |
- | \smalltt{cpi\_wait} & Takes as argument a \smalltt{rams} co-handle and \smalltt{tag} and blocks the control thread until the previous transactions associated with \smalltt{tag} have completed.\\ | + |
- | & \smalltt{void cpi\_wait(cpi\_hand rams, cpi\_tag tag);} \\\midrule | + |
- | \smalltt{cpi\_bind} & Establishes a static binding between a \smalltt{rams} co-handle and a \smalltt{channel} co-handle, which will automatically deliver notifications on transactions applied to \smalltt{rams} to \smalltt{channel}. Once a \smalltt{cpi\_bind} is established, the control thread is no longer permitted to use \smalltt{cpi\_test} or \smalltt{cpi\_wait} on \smalltt{rams}. The control thread also can no longer perform \smalltt{cpi\_write\_channel} to \smalltt{channel}. \\ | + |
- | \bottomrule | + | |
- | \end{tabular} | + | |
- | \caption{Memory Control Actions (Dynamic).} | + | |
- | \label{tab:memory_actions} | + | |
- | \end{table} | + | |
- | + | ||
- | \begin{table} | + | |
- | \centering | + | |
- | \small | + | |
- | \begin{tabular}{ l p{4.5in} } | + | |
- | \toprule | + | |
- | Control Action & Description \\ \midrule | + | |
- | \smalltt{cpi\_read\_channel} & Reads from \smalltt{channel} and returns data of type \smalltt{cpi\_int64}. The control thread will block if channel is empty. \\ | + | |
- | & \smalltt{cpi\_int64 cpi\_read\_channel(cpi\_hand channel);} \\\midrule | + |
- | \smalltt{cpi\_write\_channel} & Writes \smalltt{data} to \smalltt{channel}. The control thread will block if the \smalltt{channel} is full.\\ & \smalltt{void cpi\_write\_channel(cpi\_hand channel, cpi\_int64 data);} \\\midrule | + | |
- | \smalltt{cpi\_test\_channel} & Takes as argument a \smalltt{channel} co-handle and returns a \smalltt{bool} indicating whether the \smalltt{channel} is empty or full (depending on the boolean option \smalltt{check\_empty}). \\ | + |
- | & \smalltt{bool cpi\_test\_channel(cpi\_hand channel, bool check\_empty);} \\ | + | |
- | \bottomrule | + | |
- | \end{tabular} | + | |
- | \caption{Channel Control Actions (Dynamic).} | + | |
- | \label{tab:channel_actions} | + | |
- | \end{table} | + | |
- | + | ||
- | \pdf{bind.pdf} | + | |
- | {\largewidth} | + | |
- | {Supporting Automatic Notification with Channel-to-CoRAM Bindings.} | + | |
- | {Supporting Automatic Notification with Channel-to-CoRAM Bindings.} | + | |
- | {fig:bind} | + | |
- | + | ||
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent | + | |
- | \textbf{Blocking vs. Non-Blocking.} | + | |
- | Memory control actions are subdivided into blocking versus non-blocking | + | |
- | behaviors (see Table~\ref{tab:memory_actions}). The CoRAM architecture | + | |
- | presents a behavior where sequences of ``blocking'' control actions | + | |
- | (\smalltt{cpi\_write\_ram}, \smalltt{cpi\_read\_ram}) will appear to execute | + | |
- | atomically ``one-at-a-time'' from the perspective of a single control thread. | + | |
- | In some circumstances, it is desirable from a performance perspective to | + | |
- | explicitly allow multiple outstanding control actions to proceed in parallel | + | |
- | (i.e., to pipeline multiple address requests). Non-blocking control actions | + | |
- | support this by immediately returning control to the thread and providing a tag | + | |
- | that must be tested later to determine when a transaction has completed (see | + | |
- | \smalltt{cpi\_test} and \smalltt{cpi\_wait} in Table~\ref{tab:memory_actions}). | + | |
- | Note that in some cases, the underlying hardware may return an invalid | + | |
- | tag, which requires the control thread to retry the transaction at a later time. | + | |
- | A tag is held indefinitely until a \smalltt{cpi\_test} or \smalltt{cpi\_wait} | + | |
- | is called, which has the side-effect of releasing the tag when the operation | + | |
- | returns successfully. | + | |
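The retry requirement for invalid tags can be sketched as follows. This is a host-side mock under stated assumptions, not the Cor-C runtime: mock_nb_write_ram stands in for \smalltt{cpi\_nb\_write\_ram}, the typedefs only mirror the widths given in Table~\ref{tab:types}, and the rule that the first two requests are rejected is invented purely so the retry loop has something to do.

```c
#include <stdint.h>

/* Mock types and constants; the real definitions belong to the Cor-C toolchain. */
typedef int      cpi_hand;
typedef int      cpi_tag;
typedef uint16_t cpi_ram_addr;   /* 16-bit local RAM address */
typedef uint64_t cpi_addr;       /* 64-bit virtual address */
#define CPI_INVALID_TAG (-1)

static int busy_calls = 2;  /* mock: the first two requests are rejected */
static int attempts   = 0;  /* how many times the action was issued */

/* Mock non-blocking write: returns CPI_INVALID_TAG while the mocked
   hardware is busy, then hands out tag 0. No data is moved. */
cpi_tag mock_nb_write_ram(cpi_hand rams, cpi_ram_addr ram_addr,
                          cpi_addr addr, int n, cpi_tag tag_append)
{
    (void)rams; (void)ram_addr; (void)addr; (void)n; (void)tag_append;
    attempts++;
    if (busy_calls > 0) {
        busy_calls--;
        return CPI_INVALID_TAG;
    }
    return 0;
}

/* Reissue the same transfer until a valid tag is returned. */
cpi_tag issue_with_retry(cpi_hand rams, cpi_ram_addr ram_addr,
                         cpi_addr addr, int n)
{
    cpi_tag tag;
    do {
        tag = mock_nb_write_ram(rams, ram_addr, addr, n, CPI_INVALID_TAG);
    } while (tag == CPI_INVALID_TAG);
    return tag;
}
```

A real control thread would probably do other useful work between attempts rather than spin; the loop is kept tight here only to keep the sketch short.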
- | + | ||
- | %\textbf{Non-blocking control actions are not guaranteed to execute in a | + | |
- | %well-defined order}. Any combination of read-write memory accesses that | + | |
- | %overlap in global address ranges will generally result in undefined behavior. | + | |
- | %It is the responsibility of the application developer to ``wait'' or ``test'' | + | |
- | %on specific transactions when ordering or atomicity is desired. | + | |
- | + | ||
- | A common task of the control thread is to periodically inform the core | + | |
- | logic when specific memory transactions have completed. | + | |
- | Table~\ref{tab:channel_actions} summarizes the channel control actions that | + |
- | enable bidirectional communication through FIFOs and registers. A very typical | + |
- | synchronization pattern is shown in Figure~\ref{fig:bind} (top), where (1) a | + |
- | control thread issues a memory control action and receives a transaction tag, (2+3) | + | |
- | tests the transaction tag for completion, (4) writes a token to the core logic | + | |
- | through a channel FIFO, and (5) the core logic consumes the token and processes | + | |
- | data from the CoRAM. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent\textbf{Transaction Coalescing.} | + | |
- | The use of non-blocking transactions requires a control thread to track | + | |
- | multiple outstanding tags, which can lead to overheads in tag state management | + | |
- | and cycles consumed by periodic testing. The Cor-C architecture provides an | + | |
- | optimization to reduce this overhead by allowing a memory control action to | + | |
- | coalesce multiple transactions to an existing tag held by the thread. For | + | |
- | example: | + | |
- | + | ||
- | {\footnotesize | + | |
- | \begin{verbatim} | + | |
- | cpi_tag reused_tag = CPI_INVALID_TAG; | + |
- | for(int i=0; i < 10; i++) { | + |
- | reused_tag = cpi_nb_write_ram(ramA, i, i*4, 4, reused_tag); | + |
- | } | + |
- | cpi_wait(ramA, reused_tag); | + |
- | \end{verbatim} | + | |
- | } | + | |
- | + | ||
- | In the example shown above, 10 non-blocking memory transactions are executed by | + | |
- | the control thread and coalesced into a single tag. At the end of the loop, | + | |
- | only a single \smalltt{cpi\_wait} operation is required. When passing in a | + | |
- | re-used tag, the memory control action will merge the new transaction with the | + | |
- | prior ones. | + | |
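The bookkeeping behind coalescing can be illustrated with a small host-side mock. Everything prefixed with mock_ is hypothetical and stands in for the real control actions; the tag table, its size MAX_TAGS, and the convention that the wait returns the number of retired transactions are assumptions made only for this sketch.

```c
#include <stdint.h>

/* Mock types and constants; the real definitions belong to the Cor-C toolchain. */
typedef int      cpi_hand;
typedef int      cpi_tag;
typedef uint16_t cpi_ram_addr;
typedef uint64_t cpi_addr;
#define CPI_INVALID_TAG (-1)
#define MAX_TAGS 4               /* assumed size of the mock tag table */

static int pending[MAX_TAGS];    /* outstanding transactions per tag */
static int tags_allocated = 0;   /* number of distinct tags handed out */

/* Mock non-blocking write: allocates a fresh tag when tag_append is
   invalid, otherwise appends the transaction to the existing tag. */
cpi_tag mock_nb_write_ram(cpi_hand rams, cpi_ram_addr ram_addr,
                          cpi_addr addr, int n, cpi_tag tag_append)
{
    (void)rams; (void)ram_addr; (void)addr; (void)n;
    cpi_tag tag = tag_append;
    if (tag == CPI_INVALID_TAG) {
        if (tags_allocated >= MAX_TAGS)
            return CPI_INVALID_TAG;      /* no free tag: caller must retry */
        tag = tags_allocated++;
    }
    pending[tag]++;
    return tag;
}

/* Mock wait: retires every transaction recorded under `tag` and
   returns how many there were. */
int mock_wait(cpi_hand rams, cpi_tag tag)
{
    (void)rams;
    if (tag == CPI_INVALID_TAG)
        return 0;
    int retired = pending[tag];
    pending[tag] = 0;
    return retired;
}

/* Issue `count` 4-byte transfers that all share one tag, then wait once. */
int coalesced_copy(cpi_hand rams, int count)
{
    cpi_tag tag = CPI_INVALID_TAG;
    for (int i = 0; i < count; i++)
        tag = mock_nb_write_ram(rams, (cpi_ram_addr)i, (cpi_addr)i * 4, 4, tag);
    return mock_wait(rams, tag);
}
```

The point of the sketch is that the value returned by each call is fed back in as the append argument of the next one, so ten transfers consume a single tag and a single wait.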
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent\textbf{Automatic Notifications.} | + | |
- | Another feature of the Cor-C specification is the ability to completely | + | |
- | eliminate the need for control threads to synchronize directly with | + | |
- | core logic. The \smalltt{cpi\_bind} control action shown in | + | |
- | Table~\ref{tab:memory_actions} allows a control thread to establish a static | + | |
- | binding between a CoRAM co-handle and a channel FIFO. Figure~\ref{fig:bind} | + | |
- | illustrates the method of operation---when a control action is performed upon a | + | |
- | specific co-handle, the associated channel FIFO will automatically enqueue a | + | |
- | token that presents to the core logic the completion of a memory transaction. | + | |
- | Completions are placed into the channel FIFO in the same order that | + | |
- | transactions are issued. The \smalltt{cpi\_bind} operation reduces the overall | + | |
- | latency of a round-trip memory access and also allows a control thread to | + | |
- | pipeline multiple non-blocking memory requests without having to periodically | + |
- | test for completions. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent\textbf{Thread-to-Thread Communication.} | + | |
- | Thread-to-thread synchronization can be provided natively in the Cor-C | + | |
- | specification for message-passing between multiple threads. In some | + | |
- | applications, the need for synchronization arises when dependencies must be | + | |
- | enforced between phases of computation and when there are multiple concurrent | + | |
- | control threads. Custom forms of synchronization can also be facilitated | + | |
- | through the use of channels. For example, to implement a fast barrier, users | + | |
- | can instantiate channels as needed into the soft logic fabric to implement | + | |
- | their own desired synchronization methods. | + | |
- | + | ||
- | \section{Disallowed Behaviors} | + | |
- | Although control threads have the appearance of general-purpose software | + | |
- | threads, there are a number of restrictions in the Cor-C specification: | + | |
- | + | ||
- | \begin{itemize} | + | |
- | + | ||
- | \item The static control actions listed in Table~\ref{tab:accessors} can only | + | |
- | be executed unconditionally (e.g., cannot be conditioned by a loop variable). | + | |
- | + | ||
- | \item Control threads are limited to 64 CoRAMs per co-handle\footnote{The | + | |
- | architectural limit is placed here due to physical constraints | + | |
- | imposed by the cluster-style microarchitecture presented in | + | |
- | Chapter~\ref{sec:microarch}.}. | + | |
- | + | ||
- | \item Control threads may not test invalid tags or perform control actions with | + | |
- | invalid arguments. | + | |
- | + | ||
- | \item Threads may not dynamically allocate memory or instantiate global variables. | + | |
- | + | ||
- | \item Control threads may not dereference memory pointers directly\footnote{When a | + | |
- | control thread needs to directly access memory, a single CoRAM along with a | + |
- | channel FIFO can be allocated and ``wrapped'' together to form a simple | + | |
- | load-store interface (see Chapter~\ref{sec:casestudy}).}. | + | |
- | + | ||
- | \item Control threads may not execute floating point operations. | + | |
- | + | ||
- | \item Function stacks are allowed but must be statically bounded. | + | |
- | + | ||
- | \item No recursion allowed. | + | |
- | + | ||
- | \end{itemize} | + | |
- | + | ||
- | Many of the language restrictions above are intended to reduce the likelihood | + | |
- | of ``abusing'' control threads for computation purposes. The various | + | |
- | restrictions also ensure that control threads are highly amenable to | + | |
- | lightweight implementations in hardware (i.e., synthesized threads or executed | + | |
- | on lightweight microprocessors). | + | |
- | + | ||
- | \section{Simple Example: Vector Increment} To concretely illustrate the features | + |
- | of the Cor-C language, the code below gives a complete top-to-bottom example of | + | |
- | the vector increment kernel, where a sequential array of data is read in from | + | |
- | memory, incremented, and written back to main memory. The particular kernel in | + | |
- | this example performs two concurrent increments per clock cycle. | + | |
- | + | ||
- | {\footnotesize | + | |
- | \begin{verbatim} | + | |
- | void vector_increment_thread() | + | |
- | { | + | |
- | cpi_register_thread("vector_add", 1/*number of threads*/); | + | |
- | cpi_hand data_store = cpi_get_rams(2/*numRams*/, true/*scatter*/, 0, 0); | + | |
- | cpi_hand bind_channel = cpi_get_channel(cpi_fifo, 0); | + | |
- | cpi_hand done_channel = cpi_get_channel(cpi_fifo, 1); | + | |
- | cpi_bind(bind_channel, data_store); | + | |
- | cpi_tag tag = CPI_INVALID_TAG; | + | |
- | + | ||
- | /* Read memory */ | + | |
- | for(int i=0; i < 128; i+=8) | + | |
- | tag = cpi_nb_write_ram(data_store, i, i*8, 8, tag); | + | |
- | + | ||
- | /* Wait for computation to finish */ | + | |
- | while(!cpi_read_channel(done_channel)) {} | + | |
- | tag = CPI_INVALID_TAG; | + |
- | + | ||
- | /* Writeback to memory */ | + | |
- | for(int i=0; i < 128; i+=8) | + | |
- | tag = cpi_nb_read_ram(data_store, i, i*8, 8, tag); | + | |
- | cpi_wait(tag); | + | |
- | } | + | |
- | + | ||
- | module vector_kernel(CLK, RST_N); | + | |
- | input CLK, RST_N; | + | |
- | reg busy, writeback, done, dout_en, wen; | + |
- | reg [5:0] addr, waddr; | + | |
- | reg [31:0] din0, din1; | + | |
- | wire [31:0] dout0, dout1; | + | |
- | + | ||
- | CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/, | + | |
- | 0/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/) | + | |
- | arr0 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen), | + | |
- | .waddr(waddr), .addr(addr), .din(din0), .dout(dout0)); | + | |
- | + | ||
- | CORAM2#("vector_add", 0/*thread-id*/, 0 /*obj-id*/, | + | |
- | 1/*sub-id*/, 32/*data width*/, 32 /*depth*/, 5/*addr-width*/) | + | |
- | arr1 (.CLK(CLK), .RST_N(RST_N), .en(1'b1), .wen(wen), | + | |
- | .waddr(waddr), .addr(addr), .din(din1), .dout(dout1)); | + | |
- | + | ||
- | ChannelFIFO cfifo(.CLK(CLK), .RST_N(RST_N), | + | |
- | .dout_en(dout_en), .dout(0), .../*unused signals*/); | + | |
- | + | ||
- | always@(posedge CLK) begin | + | |
- | if(RST_N) begin | + | |
- | if(dout_rdy && !busy) begin | + | |
- | writeback <= 0; | + | |
- | busy <= 1; | + | |
- | addr <= 0; | + | |
- | waddr <= 0; | + | |
- | end | + | |
- | else if(busy) begin | + | |
- | addr <= addr + 1; | + | |
- | writeback <= (addr < 32); | + | |
- | busy <= (addr < 32); | + | |
- | end | + | |
- | else writeback <= 0; | + | |
- | + | ||
- | if(writeback) begin | + | |
- | wen <= 1'b1; | + | |
- | din0 <= dout0+1; | + | |
- | din1 <= dout1+1; | + | |
- | waddr <= waddr+1; | + | |
- | if(waddr == 31) dout_en <= 1; | + | |
- | end | + | |
- | else begin | + | |
- | wen <= 1'b0; | + | |
- | dout_en <= 1'b0; | + | |
- | end | + | |
- | end | + | |
- | else begin | + | |
- | writeback <= 0; | + | |
- | dout_en <= 0; | + | |
- | busy <= 0; | + | |
- | wen <= 0; | + | |
- | end | + | |
- | end | + | |
- | + | ||
- | endmodule | + | |
- | + | ||
- | \end{verbatim} | + | |
- | } | + | |
- | + | ||
- | In the first phase of the control thread program above, the thread sets up a | + | |
- | programmed transfer that reads in 128B of data from memory into 2 separate | + | |
- | embedded CoRAMs represented by a single co-handle. To present a wide 64-bit | + | |
- | word interface to the fabric, the co-handle is composed with the scatter-gather | + | |
- | argument set to true. The \smalltt{cpi\_bind} operation thereafter establishes | + | |
- | an implicit channel between the memory system and the core logic, which is the | + | |
- | ultimate consumer of the memory data. Within the core logic, two embedded | + |
- | CoRAMs and a single channel FIFO are instantiated as black-box modules. As | + | |
- | memory transactions stream through one-at-a-time, the core logic will receive | + | |
- | tokens through the channel FIFO, indicating that data is ready for access | + | |
- | within the CoRAMs. In the simple example above, the core logic waits until all | + | |
- | the tokens are received before performing the increment steps. During the | + |
- | compute phase, the core logic reads and writes 32 clock cycles worth of data | + | |
- | from the embedded CoRAMs. Upon completion, a token is written back from the | + | |
- | core logic to the control thread indicating that a writeback to memory is | + | |
- | pending. The control thread in the background wait-polls on the channel until | + | |
- | receiving the token and then performs the final set of memory control actions | + | |
- | to write data from the CoRAMs to memory. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Summary.} It is not difficult to imagine that many variants of control actions could be | + | |
- | added to the Cor-C architecture specification to support more sophisticated | + | |
- | patterns or optimizations (e.g., broadcast from one CoRAM to many, prefetch, | + | |
- | strided access, programmable patterns, etc.). In a commercial production | + | |
- | setting, control actions---like instructions in an ISA---must be carefully | + | |
- | defined and preserved to achieve the value of portability and compatibility. | + | |
- | Optimizing compilers could also play a significant role in static | + |
- | optimization of control thread programs. Analysis could be used, for example, | + | |
- | to identify non-conflicting control actions that are logically executed in | + | |
- | sequence but can actually be executed concurrently without affecting | + | |
- | correctness. The next section describes a compiler and proof-of-concept of the | + | |
- | Cor-C architecture specification. Chapter~\ref{sec:casestudy} will later | + | |
- | present concrete demonstrations of Cor-C-based applications. | + | |
- | + | ||
- | %Without delving into details, the Cor-C language essentially follows same basic | + | |
- | %lexical elements as the standard C language~\cite{krbook}. | + | |
- | %\subsection{Thread State and Memory} | + | |
- | %Control threads can declare state variables and bounded-sized arrays through | + | |
- | %conventional C-based syntax (e.g., \textit{int x = 0}). Under limited | + | |
- | %circumstances, structs may also be used to organize and group variables | + | |
- | %together. | + | |
- | %Global variables declared outside the scope of a function are \textbf{not | + | |
- | %permitted} in the Cor-C specification. Control threads are also not permitted | + | |
- | %to directly access the global main memory through dereferencing of pointers. | + | |
- | %This limitation of control threads can be addressed through specialized | + | |
- | %memory personalities, which will be later described in Section\ref{sec}. | + | |
- | + | ||
- | \section{The CoRAM Control Compiler (CORCC)} | + | |
- | + | ||
- | \pdf{control_options.pdf} | + | |
- | {\largewidth} | + | |
- | {Options for Synthesizing Control Threads.} | + | |
- | {Options for Synthesizing Control Threads.} | + | |
- | {fig:control_options} | + | |
- | + | ||
- | \pdf{corcc_example.pdf} | + | |
- | {\xcolumnwidth} | + | |
- | {CORCC Example.} | + | |
- | {CORCC Example.} | + | |
- | {fig:corcc_example} | + | |
- | + | ||
- | The CoRAM Control Compiler (CORCC) was developed in this thesis to explore the | + | |
- | various implementation options for control threads. | + | |
- | Figure~\ref{fig:control_options} shows several ways in which a control | + |
- | thread program can be mapped down into control logic in an FPGA with CoRAM | + | |
- | support: (1) directly compiling control thread programs into soft logic state | + | |
- | machines via high-level synthesis, (2) compiling control threads to | + | |
- | pre-implemented soft microprocessor cores (e.g., Xilinx | + | |
- | MicroBlaze~\cite{xilinx} or Altera Nios~\cite{altera}), or (3) compiling to a | + |
- | hard microprocessor serving as a dedicated microcontroller. | + | |
- | + | ||
- | CORCC supports direct synthesis of control threads into synthesizable RTL from | + | |
- | standard C code and can also be configured to model the cycle-time performance | + | |
- | of a simple microprocessor. The implementation of CORCC leverages the Low | + | |
- | Level Virtual Machine (LLVM) framework~\cite{llvm}, which is an open-source, | + | |
- | end-to-end compiler with pluggable extensions for custom passes and backends. | + | |
- | CORCC leverages the modularity of LLVM and its language-independent | + | |
- | intermediate representation (IR) to implement a simple form of high-level | + | |
- | synthesis with extensions for microprocessor performance modeling. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Implementation.} | + | |
- | The CORCC LLVM extension is implemented in 6000 lines of C++ as a series of LLVM | + |
- | passes. CORCC extends LLVM with special objects and data types that are | + | |
- | specific to the Cor-C architecture specification. These include CoRAM and | + | |
- | channel accessors, co-handles, and the memory/channel control actions. LLVM | + | |
- | includes front-ends for several popular languages such as C and C++ and can | + | |
- | automatically generate an intermediate representation (IR) in Static Single | + |
- | Assignment (SSA) form. In SSA form, each variable in a routine is assigned | + |
- | exactly once, which is useful for various optimizations and simplifying the | + | |
- | properties of variables. An important feature of the IR is the LLVM type | + | |
- | system, which provides high-level program-level information accessible at the | + | |
- | assembly level. The first stage of CORCC is automatically handled by LLVM, | + | |
- | which translates the high-level control thread program into IR organized into | + | |
- | basic blocks. The assembly employed by LLVM constitutes about 70 | + | |
- | instructions~\cite{llvm}, of which a subset of about 30 are supported in | + | |
- | CORCC\footnote{Use of unsupported instructions results in compile-time | + |
- | errors in CORCC.}. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Thread-to-Hardware Interface.} | + | |
- | The control threads of an application, which exist either in the form of soft | + | |
- | finite state machines or as microcontrollers, can be viewed as clients that | + |
- | issue memory requests to the underlying memory subsystem comprising embedded | + | |
- | CoRAMs, the network-on-chip, and the edge memory interfaces (as illustrated | + | |
- | earlier in Figure~\ref{fig:control_options}). CORCC assumes that in a soft | + | |
- | implementation of control threads, a well-defined request-response interface | + | |
- | exists between the underlying subsystem and the control threads implemented in | + | |
- | the fabric. The details of such interfaces are described further in | + | |
- | Chapter~\ref{sec:microarch}. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Co-handle Pass.} | + | |
- | The first stage of CORCC performs a sweep through the LLVM-generated IR and | + | |
- | identifies \smalltt{call} instructions that match the function signatures of | + | |
- | static control actions such as co-handle and channel accessors (see | + | |
- | Table~\ref{tab:accessors}) that are used to establish bindings to various | + | |
- | CoRAM-related objects. In LLVM, any function with a return value is assigned to | + |
- | a register identifier with a unique integer. Within CORCC, the static pass | + | |
- | creates an internal map between an identified co-handle and its corresponding | + | |
- | destination register. When processing a co-handle, the function arguments are | + | |
- | checked to be constant and valid values. During this pass, any dynamic control | + | |
- | actions are annotated and linked against the detected co-handles. The link step | + | |
- | performs a backtracing through registers in the IR to identify the specific | + | |
- | co-handle associated with a dynamic control action. | + | |
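The link step can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not CORCC's implementation: the IR below is a hypothetical, flattened stand-in for LLVM's data structures, in which every instruction defines at most one register and a dynamic control action names the register that carries its co-handle. The function \smalltt{resolve\_cohandle} is an illustrative name.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified IR (not LLVM's actual classes). */
enum op_kind { OP_GET_COHANDLE, OP_COPY, OP_DYNAMIC_ACTION, OP_OTHER };

struct instr {
    enum op_kind kind;
    int dest;  /* register defined by this instruction, or -1 */
    int src;   /* register operand carrying the co-handle, or -1 */
};

/* Find the single instruction that defines reg (SSA: at most one). */
static int find_def(const struct instr *ir, size_t n, int reg)
{
    for (size_t i = 0; i < n; i++)
        if (ir[i].dest == reg)
            return (int)i;
    return -1;
}

/* Backtrace from reg through copies until a static co-handle call is
   found. Returns the index of that call, or -1 if none is reachable. */
int resolve_cohandle(const struct instr *ir, size_t n, int reg)
{
    /* Bound the walk by n + 1 steps so malformed, cyclic input terminates. */
    for (size_t steps = 0; steps <= n && reg >= 0; steps++) {
        int def = find_def(ir, n, reg);
        if (def < 0)
            return -1;                 /* register is never defined */
        if (ir[def].kind == OP_GET_COHANDLE)
            return def;                /* found the static control action */
        if (ir[def].kind != OP_COPY)
            return -1;                 /* not traceable to a co-handle */
        reg = ir[def].src;             /* follow the copy one step back */
    }
    return -1;
}
```

A dynamic control action at index \smalltt{i} would be linked by calling \smalltt{resolve\_cohandle(ir, n, ir[i].src)}; a result of -1 corresponds to the case where the pass cannot associate the action with any co-handle.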
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Thread Synthesis.} | + | |
- | Once all co-handles have been identified, CORCC performs a synthesis step that | + | |
- | translates the LLVM instructions and the Cor-C dynamic control actions into | + | |
- | synthesizable Verilog. The basic approach taken by CORCC is to perform a direct | + | |
- | mapping of basic blocks into single-cycle states in a finite state machine. To | + | |
- | handle register state, CORCC instantiates a physical register for each assigned | + | |
- | variable in a program. To implement logic, all of the instructions in a | + | |
- | basic block are converted into combinational statements, where the | + | |
- | inputs to the logic are read from registers in a single clock cycle (and | + | |
- | in the same clock cycle, the output is written to the destination registers). | + | |
- | The SSA form of LLVM guarantees that no registers are read and written at the | + | |
- | same time within a single basic block. | + | |
- | + | ||
- | To handle dynamic control actions, special states are introduced at locations in the | + | |
- | LLVM IR where \smalltt{call} instructions are detected. For example, when a | + | |
- | \smalltt{cpi\_write\_ram} function call is detected, the parent basic block | + | |
- | will be split into two states, one containing the original instructions and the other | + |
- | for invocation of the control action. The predecessor basic block will always | + |
- | jump into the special state first, which handles the actual issue of the | + | |
- | control action to the memory subsystem through a request-response interface; | + | |
- | thereafter, the FSM jumps into the original basic block while returning the | + | |
- | value of the control action. Figure~\ref{fig:corcc_example} gives a complete | + | |
- | example of compiling the simple example from Chapter~\ref{sec:coram} into | + | |
- | synthesizable hardware. | + | |
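The state-splitting scheme described above can be modeled in software. The sketch below is only an illustration: CORCC emits Verilog, whereas this is a C model of a four-state machine, and the signal names \smalltt{req\_valid}, \smalltt{resp\_valid}, and \smalltt{resp\_data} are assumed names for a generic request-response interface rather than the actual interface of Chapter~\ref{sec:microarch}. The predecessor block enters the special issue state, which waits for the response and then hands the returned value to the original block.

```c
#include <stdbool.h>

/* S_PRED: predecessor block, S_ISSUE: special state for the control
   action, S_ORIG: original block after the split, S_DONE: finished. */
enum state { S_PRED, S_ISSUE, S_ORIG, S_DONE };

struct fsm {
    enum state st;
    bool req_valid;  /* request asserted toward the memory subsystem */
    int  result;     /* register holding the control action's return value */
    int  cycles;     /* clock cycles spent before reaching S_DONE */
};

void fsm_init(struct fsm *f)
{
    f->st = S_PRED;
    f->req_valid = false;
    f->result = 0;
    f->cycles = 0;
}

/* Advance the model by one clock cycle. */
void fsm_step(struct fsm *f, bool resp_valid, int resp_data)
{
    if (f->st == S_DONE)
        return;
    f->cycles++;
    switch (f->st) {
    case S_PRED:
        f->req_valid = true;    /* request is visible while in S_ISSUE */
        f->st = S_ISSUE;
        break;
    case S_ISSUE:
        if (resp_valid) {       /* stall here until the response arrives */
            f->result = resp_data;
            f->req_valid = false;
            f->st = S_ORIG;
        }
        break;
    case S_ORIG:
        f->st = S_DONE;         /* original block executes in one cycle */
        break;
    case S_DONE:
        break;
    }
}
```

In this model, every basic block costs one cycle and the issue state costs one cycle per cycle of response latency, which mirrors the single-cycle-per-state mapping described in the text.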
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Microprocessor Performance Modeling.} | + | |
- | To explore the design space for microcontroller-based control threads in our | + | |
- | evaluation in Chapter~\ref{sec:evaluation}, CORCC includes an additional | + | |
- | feature that approximates the performance characteristics of a simple in-order | + | |
- | microprocessor core. The core is modeled with a constant CPI value | + | |
- | (cycles per LLVM instruction) and is assumed to have specialized logic that | + | |
- | interfaces directly to the underlying memory subsystem described in | + | |
- | Chapter~\ref{sec:microarch}. The programmed CPI value sets the rate at which | + | |
- | control threads advance through the LLVM basic blocks in order to mimic the | + | |
- | performance characteristics of an idealized microprocessor. | + | |
- | Chapter~\ref{sec:evaluation} will later present simulation-driven results that | + | |
- | compare direct synthesis by CORCC to soft and hard microprocessor cores. | + | |
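The constant-CPI model amounts to simple arithmetic, sketched below under stated assumptions. The function names and the representation of a dynamic execution as a trace of basic-block indices are illustrative choices, not CORCC's internals, and the sketch deliberately ignores the latency of control actions in the memory subsystem.

```c
#include <stddef.h>
#include <stdint.h>

/* Cycles charged for one execution of a basic block: every LLVM
   instruction costs the same fixed number of cycles (the CPI). */
uint64_t block_cycles(uint32_t num_instrs, uint32_t cpi)
{
    return (uint64_t)num_instrs * cpi;
}

/* Total cycles for a dynamic trace. trace[i] is the index of the i-th
   executed basic block; instrs_per_block[b] is the static instruction
   count of block b. */
uint64_t trace_cycles(const uint32_t *instrs_per_block,
                      const size_t *trace, size_t trace_len,
                      uint32_t cpi)
{
    uint64_t total = 0;
    for (size_t i = 0; i < trace_len; i++)
        total += block_cycles(instrs_per_block[trace[i]], cpi);
    return total;
}
```

For example, a thread that executes blocks of 4, 10, 10, and 2 instructions at a CPI of 2 is charged $(4 + 10 + 10 + 2) \times 2 = 52$ cycles, whereas the directly synthesized FSM would spend one cycle per basic block on the same trace.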
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{CORCC Limitations.} | + | |
- | The CORCC compiler employs a relatively simple approach to high-level | + |
- | synthesis, which completely expands the basic blocks of an application | + | |
- | into synthesizable hardware. The simple approach taken here can have a | + | |
- | detrimental effect on performance and area, especially if LLVM produces large | + | |
- | basic blocks or allocates a large number of registers. A potential way to | + |
- | mitigate large critical paths within a basic block is to split basic | + |
- | blocks where necessary, which can either be supported automatically or | + | |
- | guided by the user. More advanced high-level synthesis techniques can also be | + | |
- | applied---e.g., constraining and scheduling the usage of resources. | + | |
- | As will be shown later in Chapter~\ref{sec:evaluation}, without any | + | |
- | optimizations, the FSMs generated by the CORCC compiler consume relatively | + | |
- | modest area while operating at nominal FPGA clock frequencies. | + | |
- | + | ||
- | \vspace{10pt} | + | |
- | \noindent \textbf{Cor-C vs. Parallel Languages.} | + | |
- | Our selection of the C language is not a fundamental requirement of the CoRAM | + | |
- | paradigm. An area that merits further research is the use of functional or | + | |
- | parallel languages to express higher levels of parallelism within control | + | |
- | threads. A particular consequence of using a sequential language such as C is | + |
- | the serialization of requests during dynamic execution. Consider the for loop | + | |
- | below, which generates a stream of requests to the memory subsystem: | + | |
- | + | ||
- | {\footnotesize | + | |
- | \begin{verbatim} | + | |
- | for(int i=0; i < 8192; i+= BLOCK_BYTES) { | + | |
- | tag = cpi_nb_write_ram(ramA, 0, 0, BLOCK_BYTES, tag); | + | |
- | } | + | |
- | \end{verbatim} | + | |
- | } | + | |
- | + | ||
- | In the example above, CORCC would not allow the multiple control actions | + | |
- | to execute in parallel due to serialization on the coalesced tag variable. | + | |
- | In such a case, parallel constructs such as \smalltt{forall} can explicitly | + |
- | declare that the loop body operations are independent. | + | |
- | + | ||
- | \section{Summary} | + | |
- | This chapter presented the Cor-C architecture specification and compiler. | + | |
- | Cor-C is a concrete instance of the CoRAM concept and provides an example | + |
- | of how the CoRAM concept is applied in a real-world environment. | + | |
- | The CoRAM Control Compiler (CORCC) is a proof-of-concept that implements the | + | |
- | Cor-C specification and is evaluated further in Chapter~\ref{sec:evaluation}. | + |