.. _Sec-StrStr:

Accelerating String Search
==========================

The structure of a hardware/software (HW/SW) system can evolve quite
dramatically to reflect changing requirements, or during design
exploration.  In this section, we consider several different
implementations of a simple string search
application~\cite{mpAlgo}. Each variation represents a step in the
iterative refinement process, intended to enhance performance or
enable a different software execution environment.

    \begin{figure}[!h]
      \centering
     \includegraphics[width=0.43\textwidth]{platform.pdf}
      \caption{\label{Fig:Platform0}Target platform for string search
	application}
    \end{figure}

Figure~\ref{Fig:Platform0} shows the target platform for our example.
The pertinent components of the host system are the multi-core CPU,
system memory, and PCI Express (PCIe) bus.  The software components of
our application will be run on the CPU in a Linux environment.
Connected to the host is a PCIe expansion card containing (among other
things) a high-performance FPGA chip and a large array of flash
memory.  The FPGA board was designed as a platform to accelerate ``big
data'' analytics by moving more processing power closer to the storage
device.

.. _Sec:StrStrInitial:

Initial Implementation
----------------------

    \begin{figure}[!h]
      \centering
     \includegraphics[width=0.43\textwidth]{data_accel_logical0.pdf}
      \caption{\label{Fig:StringSearch0}Logical components of the string
	search system}
    \end{figure}

The design process really begins with a pure software implementation
of the algorithm, but the first attempt we consider is the initial
inclusion of HW acceleration shown in Figure~\ref{Fig:StringSearch0}.
The search functionality is executed by software running in user-space
which communicates with the hardware accelerator through a device
driver running in the Linux kernel.  The hardware accelerator,
implemented in the FPGA fabric, executes searches over data stored in
the flash array as directed by the software.

The FPGA has direct access to the massive flash memory array, so if we
implement the search kernel in hardware, we can avoid bringing data
into the CPU cache (an important consideration if we intend to run
other programs simultaneously).  By exploiting the high parallelism of
the execution fabric as well as application aware caching of data, an
FPGA implementation can outperform the same search executed on the
CPU.

Multithreading the Software
---------------------------

The efficient use of flash memory requires a relatively sophisticated
management strategy.  Our first refinement is based on the observation
that there are four distinct tasks which the application software
executes (mostly) independently:


* Send search command to the hardware.
* Receive search results from the hardware.
* Send commands to the hardware to manage the flash arrays
* Receive responses from the flash management hardware

To exploit the task-level parallelism in our application, we can
assign one thread to each of the four enumerated tasks.  To further
improve efficiency, the two threads receiving data from the hardware
put themselves to sleep by calling \textbf{poll} and are woken up only
when a message has been received.  

    \begin{figure}[!h]
      \centering
     \includegraphics[width=0.43\textwidth]{data_accel_logical1.pdf}
      \caption{\label{Fig:StringSearch1}Using a mutex to coordinate
      user-level access to hardware accelerator}
    \end{figure}

With the introduction of multithreading, we will need a
synchronization mechanism to enforce coherent access to the hardware
resources. Because the tasks which need coordinating are all being
executed as user-space threads, the access control must be implemented
in software as well.  As shown in Figure~\ref{Fig:StringSearch1}, a
mutex is used to coordinate access to the shared hardware resource
between user-level processes.

.. _Sec:StrStrRefiningInterfaces:

Refining the Interfaces
-----------------------

    \begin{figure}[!h]
      \centering
      \includegraphics[width=0.43\textwidth]{data_accel_logical2.pdf}
      \caption{\label{Fig:StringSearch2}Movement of functionality from
	user to kernel space.  Software-based coordination between kernel
	and user processes are prohibitively expensive.}
    \end{figure}

Figure~\ref{Fig:StringSearch2} shows a further refinement to our
system in which we have reimplemented the Flash Management
functionality as a block-device driver.  Instead of directly operating
on physical addresses, the string search now takes a file descriptor
as input and uses a Linux system-call to retrieve the file block addresses
through the file system. This refinement permits other developers to
write applications which can take advantage of the accelerator without
any knowledge of the internal details of the underlying storage
device.  It also enables support for different file systems as we now
use a POSIX interface to generate physical block lists for the the
storage device hardware.  The problem with this refinement is that we
no longer have an efficient SW mechanism to synchronize the block
device driver running in kernel space with the application running in
user space.

    \begin{figure}[htb]
      \centering
      \includegraphics[width=0.43\textwidth]{data_accel_logical3.pdf}
      \caption{\label{Fig:StringSearch3}Correct interface design
	removes the need for coordination between user and kernel
	threads.}
    \end{figure}
 
To solve to this problem (shown in Figure~\ref{Fig:StringSearch3}), we
can remove the need for explicit SW coordination altogether by giving
each thread uncontested access to its own dedicated HW resources
mapped into disjoint address regions. (There will of course be
implicit synchronization through the file system.)

.. _Sec:StrStrDma:

Shared Access to Host Memory
----------------------------

In the previous implementations, all communication between hardware
and software takes place through memory mapped register IO.  Suppose that instead of
searching for single strings, we want to search for large numbers of
(potentially lengthy) strings stored in the flash array.  Attempting
to transfer these strings to the hardware accelerator using programmed
register transfers introduces a performance bottleneck.  In our final
refinement, the program will allocate memory on the host system,
populate it with the search strings, and pass a reference to this
memory to the hardware accelerator which can then read the search
strings directly from the host memory.

    \begin{figure}[htb]
      \centering
      \includegraphics[width=0.43\textwidth]{data_accel_logical4.pdf}
      \caption{\label{Fig:StringSearch4}Connectal support for DMA.}
    \end{figure}

Efficient high-bandwidth communication in this style requires the
ability to share allocated memory regions between hardware and
software processes without copying.  Normally, a programmer would
simply call application space \textbf{malloc}, but this does not
provide a buffer that can be shared with hardware or other software
processes.  As shown in Figure~\ref{Fig:StringSearch4}, a
special-purpose memory allocator has been implemented in Linux, using
dmabuf\cite{dmabuf} to provide reference counted sharing of memory
buffers across user processes and hardware.

To conclude, we consider how the HW/SW interface changed to
accommodate each step in the refinement process: The hardware
interface required by the design in Figure~\ref{Fig:StringSearch0} is
relatively simple.  Command/response queues in the hardware
accelerator are exposed using a register interface with accompanying
*empty*/*full* signals.  To support the use of *poll* by
the refinement in Figure~\ref{Fig:StringSearch1}, interrupt signals
must be added to the hardware interface and connected to the Linux
kernel.  Partitioning the address space as required by the refinement
in Figure~\ref{Fig:StringSearch3} necessitates a consistent remapping
of registers in both hardware and software.