 |
| |
Pasquale Corsonello received his Masters
degree in Electronics Engineering from the University of Naples,
Italy, in 1988. He joined the Institute of Research on Parallel
Computer System (IRSIP) of National Council of Research of Italy,
working on the design and modeling of electronic transducers for
high precision measurement. In 1992 he joined the Department of
Electronics, Computer Sciences and Systems of the University of
Calabria, where he was involved in application specific IC design.
Currently he is assistant professor at the Department of
Electronics Engineering and Applied Mathematics of the University
of Reggio Calabria. His main research interests are in arithmetic
circuit, reconfigurable computing, and VLSI design. Prof.
Corsonello is a member of the IEEE.
|
|
 |
Cellular automata (CA) models provide
algorithms to solve scientific problems in applications such as
physics, biology, geophysics, economics, and engineering
. The models are based on a simple but efficient
idea: describing the behavior of complex systems and phenomena in
terms of many interacting elementary components. In spite of their
usefulness in modeling complex systems, the performance
requirements to execute CA simulations can be largethey use a
highly parallel computational model, which makes conventional,
sequential machines poorly matched to these kinds of application

. As a consequence, researchers have two main
approaches to execute CA algorithms: using software running on
parallel general-purpose supercomputers or developing
custom-computing machines
.
Reconfigurable computing systems can provide a marked
performance advantage over general-purposes computers at a lower
cost. This article proposes a novel Reconfigurable computing
Machine for Cellular Automata processing (CAREM). This machine uses
an FPGA-based processor as a computational engine.
With FPGAs, you can quickly implement hardware functions for
each different algorithm. Owing to their uniform structure composed
of many parallel-processing elements, FPGAs appear very attractive
for computations where a large, regular-input data stream has to be
processed to produce an output data stream. CA algorithms fit this
computation model well: a large data stream corresponding to the
cells' states is repeatedly updated according to a uniform set of
rules. Although the CAREM processor is realized on a totally
reconfigurable device, it consists of a well-defined structure for
implementing different CA applications with low design effort.
Cellular Automata Processing
A cellular automaton is a discrete dynamical system comprising a
set of interacting, elementary cells. These cells are arranged in a
regular fashion over a mono- or multi-dimensional spatial lattice.
Each cell in the lattice can assume any one of a finite number of
states. At discrete time steps, the cells can modify their own
state: they determine how to change towards a new state by means of
a specified set of rules. These rules, generally common to the
cells, define the state transition function of the cellular
automaton.
The automaton's evolution retains a locality principle: for each
cell (i) the transition function (f) calculates the next state by
examining only its own state and the states of the neighboring
cells (Neigh(i)) at the current time step. The transition function
uses a specific pattern to define which cells are neighboring
cells. Figure 1 shows the typical structure of a
two-dimensional (2D) automaton. The cells are arranged on an m*n
lattice. In this example, the automaton uses a Moore neighborhood
composed of eight cells.
Figure 1: A two-dimensional cellular automaton
The cells' updating is synchronous and arrives in parallel
fashion into the lattice. As a consequence, at each time-step the
entire cellular automaton evolves to a new global state, which
arises from the state's changes in the single cells. Although the
transition function does not explicitly indicate a final state for
the automaton evolution, it could be necessary to define this
condition for certain types of automata.
According to the previously discussed CA model, you can
implement a computing system targeting CA applications using a
processing engine and two memory blocks storing two consecutive
automaton states, S(t) and S(t+1). This basic computing
architecture is shown in Figure 2. The processing engine
contains the processing elements (PEs), which implement the
automaton transition function. The system uses memory blocks A and
B alternatively as source and destination memory. At each step t,
the processing engine reads the current state S(t) from the source
memory and writes the new state S(t+1) into the destination memory.
After updating the entire automaton's state, the roles of the two
memory blocks switch and another iteration starts.
Figure 2: Basic CA processing
The CA model enables a totally parallel computationyou
could compose the Processing Engine using a PE for each cell.
However, technological and cost limitations makes this feasible
only for simple computations. Usually, the number of PEs
implemented in the Processing Engine is lower than the number of
cells. This implies that each PE has to be reused on several cells
during the same iteration.
CAREM System Overview
The proposed reconfigurable system is shown in Figure 3.
The system core is the CAREM board, which provides a standalone,
fast, computing engine. For data input and results visualization,
the CAREM board communicates with an Interface System. To simplify
the system, we have chosen to realize the Interface System using a
Host PC and a custom-designed I/O interface. However, you can use
other techniques for data exchange.
Figure 3: The CAREM system showing how the
processing engine and memories communicate with the host PC
The execution flow occurs during three main operational phases.
During the first phase, the entire initial state of the automaton
is transferred from main memory to local SRAM into the board. In
addition, the Host PC transfers some parameters required for the
computation, such as number of iterations, lattice size, and rule
parameters to the CAREM processor.
In the second phase, the CAREM processor calculates the
automaton evolution according to its specific transition function.
At each iteration, the CAREM processor reads the current automaton
state from source memory and writes the new state into destination
memory. The processor computes n-bits of data at each step,
concurrently updating just a limited number of cells. When updating
the entire automaton state, the roles of the two memory banks are
changed and the processor begins another iteration. According to
the specific algorithm the processor executes, CAREM will stop the
computation and set an appropriate flag bit in order to signal the
end of the computation. The system can generate a stop condition in
two ways: after a specified number of iterations, or using a check
function applied to the automaton data. After recognizing the stop
condition, the Host PC retrieves display data from local
memory.
The CAREM processor uses FPGA technology for high flexibility on
different automata. It's easy to scale the proposed computing
architecture based on desired performance and cost. By increasing
bus width and by using multiple FPGA devices and memories, you can
improve the grade of parallelism of the computation. Since the PEs
work in a totally parallel manner on the automaton data, processing
speed scales linearly with the number of PEs.
The CAREM Processor
A single Xilinx Virtex XCV600 FPGA device
implements the CAREM prototype. To provide a
capability to handle different algorithms, the prototype organizes
the reconfigurable processor as three different sub-modules
(Figure 4). These sub-modules implement the previously
described common CAneighborhood module, rules module, and
control unit. According to the particular CA algorithm the
processor is executing, you can define these modules to support the
requirements of a specific application.
Figure 4: The CAREM processor comprises three
sub-systems: neighborhood module, rules module, and control
unit
The neighborhood module constitutes an interface block between
local source memory and the processing elements inside the rules
module. This module supports fast access to a cell's state for
reading. The neighborhood module consists of parallel multiple rows
of shift registers. At each clock cycle, a new 32-bit word is read
from the source memory and written into the first register's
column. At the same time, all the columns shift to the right by one
position so that a new set of cells and corresponding neighbors are
available for processing. Although the row number is fixed and
equals to the memory word-width, the column number depends on the
neighborhood size of the particular CA algorithm the processor is
implementing. In the application examples in the following
sections, we use a Moore neighborhood
with size equal to three cells.
The rules module, the heart of the reconfigurable processor,
contains the processing elements that perform the transition
function of the cellular automaton. The FPGA's logic blocks define
these application-specific PEs. At each step, the rules module
receives as input the current 32-bit state word provided by the
neighborhood module and generates an updated 32-bit word. This data
word is immediately written into the destination memory. Besides
the updating function, the rules module can provide other
interesting features depending on the particular automaton the
processor is executing.
The control unit performs two main functions. The unit
implements an interface circuit for the exchange of parameters
between the Host PC and the CAREM processor, and it operates as a
management unit for local memory. A Finite State Machine (FSM)
provides both functions. The interface circuit consists of several
registers of different sizes. The execution registers (IER and OER)
store some basic I/O signals that the processor needs for execution
control.
The iterations register (IR) stores the number of CA iterations
that the processor has to perform before stopping the computation.
The count register (CR) stores the number of updated data words per
iteration. Finally, a group of four rules registers (RR1 through
RR4) permits the rules module to store some parameters. These
registers provide greater runtime flexibility for the computation
since the host system can use the registers to dynamically change
the automaton rules without reconfiguring the processor.
The memory unit provides complete management of local memory for
the CAREM processor. As discussed in the previous section, CA
processing requires reading the current automaton state from the
source memory and writing the updated state to the destination
memory. This implies that the processor has to generate addresses
and control signals at each step of the computation for both memory
banks. The processor generates address values sequentially, with
the maximum value of the address dependent on the parameter stored
in the CR register.
Application Examples
In order to demonstrate the effectiveness of the proposed system
in terms of performance and flexibility, we implemented two
significant algorithms on the CAREM machine: a thinning technique
used in image-processing tasks and a forest-fire
simulation. Both algorithms use a 2D CA model with a Moore
neighborhood, but they differ in cell size, transition function,
and execution flow. Owing to its versatile structure, we matched
the CAREM processor to these applications only by reconfiguring the
rules module. The processor exhibited a throughput of about 160
Mbit/sec. To speed-up the CA processing, these example applications
limited communication with the Host PC to the initial and final
data exchanges.
Thinning Algorithm
Binary thinning is a technique that reduces the thickness of
binary images. It is widely used in the pre-processing stages of
pattern-recognition applications such as optical pattern
recognition (OCR), fingerprint-image processing, and computer
vision. By means of an erosion process, this technique deletes
pixels from the boundary, but still maintains original connectivity
of the image. After a finite number of iterations, this
transformation produces another binary image corresponding to a
skeleton of the original image.
This example implements the thinning algorithm of Arcella et
al
. Each input pixel uses a set of eight matching
templates, applied in parallel to the neighboring pixels. We use 32
processing elements in the rules module, updating 32 1-bit cells at
each clock step. The algorithm requires a variable number of
iterations according to the complexity of the input image. The
algorithm terminates when the image does not change between two
consecutive iterations. The control unit will set the stop flag in
the execution register signaling the end of the computation.
Figure 5 illustrates the effect of thinning on a sample
image. Table 1 shows the performance results of our
computation on real working examples. The table compares results
from the CAREM processor to the PAPRICA processor
. In both cases the thinning algorithm executes on
a 256x256 lattice. The PAPRICA processor uses an extended array of
256 PEs
with an instruction cycle-time of 350ns. Those
PEs provide a set of elementary instructions, performing basic
graphic and logical operations on the input image pixels. On the
other hand, the CAREM processor uses PEs the processor specifically
identifies for the thinning algorithm. They exhibit a cycle time of
200ns. Table 1 shows that this technique provides a 65X
cell-update speed increase over the PAPRICA processor.
Figure 5: Thinning execution on a sample image
| System |
Lattice Size |
Execution
Time (Sec) |
Cell
Updates/Sec |
Speed Up |
| PAPRICA |
256*256 |
2.6 |
2.52*10 6 |
1 |
| CAREM |
256*256 |
0.041 |
1.64*10 8 |
65 |
Table 1: Thinning executions (100 iterations)
Forest-Fire Algorithm
The forest-fire algorithm
is another interesting and widely recognized
application of CA modelling for studying the behavior of complex
real-word phenomena. In the basic model, each cell in the automaton
represents a portion of land and the cell's state can assume any
one of four different values: tree, burnt tree, fire, and ground.
We thus need 2 bits to represent a cell's state. The automaton
state evolution follows a simple set of rules:
- If a tree cell catches fire from one of its neighborhood cells,
it burns and becomes a fire cell
- A fire cell becomes a burnt tree cell
- Burnt trees and ground cells do not change their own state,
they only stop the spreading of the fire.
You could also extend this basic forest-fire model to consider
other elements in the forest such as terrain conditions, fuel
types, and weather conditions.
To execute the forest-fire algorithm on the CAREM reconfigurable
machine, we reconfigured the original rules module of the thinning
application to implement the new transition function. In this case
we use 16 processing elements working concurrently. In order to
observe the evolution of the automaton state, the forest-fire
simulation also requires specifying the number of iterations
between two consecutive visualizationsthis value is loaded
into the iterations register IR.
As previously discussed, the rules-registers block inside the
control unit provides a more flexible implementation of the CA
rules. For the forest fire algorithm, the control unit permits
switching from the basic model to an extended model, which
introduces wind effect on fire propagation. To include wind
effects, we load the rules register RR1 with two wind parameters:
direction and speed. The RRI also stores a control bit w, which
sets the model the processor uses. Figure 6 shows two
visualization snapshots of the forest-fire simulation. For this
example, the w bit was set to zero, which means there is no wind.
No wind means uniform spreading of the fire. Conversely, by setting
the w bit to one, the propagation of the fire takes place in a
specified direction, as shown in Figure 7.
Figure 6: Uniform fire propagation
Figure 7: Wind effect on fire propagation
Table 2 shows the performance results for the forest-fire
computation. These results are compared to that obtained on a Meiko
CS2 parallel computer
. The Meiko system uses a configuration of eight
processors. Nevertheless, as can be noted, the CAREM processor
showed a cell-update speed increase of 24 over the Meiko
processor.
| System |
Lattice Size |
Execution
Time (Sec) |
Cell
Updates/Sec |
Speed Up |
| Meiko CS2-8 |
224*224 |
14.91 |
3.37*10 6 |
1 |
| CAREM |
128*256 |
0.41 |
8.19*10 7 |
24 |
Table 2: Forest-fire executions (1000
iterations)
The proposed CAREM machine provides an inexpensive platform for
easily implementing several CA applications. The scalable
architecture of the proposed processor is modular, which makes it
very easy to reconfigure the processor for different automata. The
implementation of two significantly different algorithms has
demonstrated the flexibility of the CAREM processor. Owing to the
efficient parallel processing provided by CAREM machine, the
execution of these algorithms has shown much higher performance
than that obtained on software implementations running on
general-purpose supercomputers and specialized processors.