Newsletter

Custom Reconfigurable Computing Machine for High Performance Cellular Automata Processing





TechOnline

 
ABOUT THE AUTHOR

Pasquale Corsonello received his Masters degree in Electronics Engineering from the University of Naples, Italy, in 1988. He joined the Institute of Research on Parallel Computer System (IRSIP) of National Council of Research of Italy, working on the design and modeling of electronic transducers for high precision measurement. In 1992 he joined the Department of Electronics, Computer Sciences and Systems of the University of Calabria, where he was involved in application specific IC design. Currently he is assistant professor at the Department of Electronics Engineering and Applied Mathematics of the University of Reggio Calabria. His main research interests are in arithmetic circuit, reconfigurable computing, and VLSI design. Prof. Corsonello is a member of the IEEE.
 
Cellular automata (CA) models provide algorithms to solve scientific problems in applications such as physics, biology, geophysics, economics, and engineering. The models are based on a simple but efficient idea: describing the behavior of complex systems and phenomena in terms of many interacting elementary components. In spite of their usefulness in modeling complex systems, the performance requirements to execute CA simulations can be large—they use a highly parallel computational model, which makes conventional, sequential machines poorly matched to these kinds of application. As a consequence, researchers have two main approaches to execute CA algorithms: using software running on parallel general-purpose supercomputers or developing custom-computing machines.

Reconfigurable computing systems can provide a marked performance advantage over general-purposes computers at a lower cost. This article proposes a novel Reconfigurable computing Machine for Cellular Automata processing (CAREM). This machine uses an FPGA-based processor as a computational engine.

With FPGAs, you can quickly implement hardware functions for each different algorithm. Owing to their uniform structure composed of many parallel-processing elements, FPGAs appear very attractive for computations where a large, regular-input data stream has to be processed to produce an output data stream. CA algorithms fit this computation model well: a large data stream corresponding to the cells' states is repeatedly updated according to a uniform set of rules. Although the CAREM processor is realized on a totally reconfigurable device, it consists of a well-defined structure for implementing different CA applications with low design effort.

Cellular Automata Processing
A cellular automaton is a discrete dynamical system comprising a set of interacting, elementary cells. These cells are arranged in a regular fashion over a mono- or multi-dimensional spatial lattice. Each cell in the lattice can assume any one of a finite number of states. At discrete time steps, the cells can modify their own state: they determine how to change towards a new state by means of a specified set of rules. These rules, generally common to the cells, define the state transition function of the cellular automaton.

The automaton's evolution retains a locality principle: for each cell (i) the transition function (f) calculates the next state by examining only its own state and the states of the neighboring cells (Neigh(i)) at the current time step. The transition function uses a specific pattern to define which cells are neighboring cells. Figure 1 shows the typical structure of a two-dimensional (2D) automaton. The cells are arranged on an m*n lattice. In this example, the automaton uses a Moore neighborhood composed of eight cells.

Figure 1:  A two-dimensional cellular automaton

The cells' updating is synchronous and arrives in parallel fashion into the lattice. As a consequence, at each time-step the entire cellular automaton evolves to a new global state, which arises from the state's changes in the single cells. Although the transition function does not explicitly indicate a final state for the automaton evolution, it could be necessary to define this condition for certain types of automata.

According to the previously discussed CA model, you can implement a computing system targeting CA applications using a processing engine and two memory blocks storing two consecutive automaton states, S(t) and S(t+1). This basic computing architecture is shown in Figure 2. The processing engine contains the processing elements (PEs), which implement the automaton transition function. The system uses memory blocks A and B alternatively as source and destination memory. At each step t, the processing engine reads the current state S(t) from the source memory and writes the new state S(t+1) into the destination memory. After updating the entire automaton's state, the roles of the two memory blocks switch and another iteration starts.

Figure 2:  Basic CA processing

The CA model enables a totally parallel computation—you could compose the Processing Engine using a PE for each cell. However, technological and cost limitations makes this feasible only for simple computations. Usually, the number of PEs implemented in the Processing Engine is lower than the number of cells. This implies that each PE has to be reused on several cells during the same iteration.


CAREM System Overview
The proposed reconfigurable system is shown in Figure 3. The system core is the CAREM board, which provides a standalone, fast, computing engine. For data input and results visualization, the CAREM board communicates with an Interface System. To simplify the system, we have chosen to realize the Interface System using a Host PC and a custom-designed I/O interface. However, you can use other techniques for data exchange.

Figure 3:  The CAREM system showing how the processing engine and memories communicate with the host PC

The execution flow occurs during three main operational phases. During the first phase, the entire initial state of the automaton is transferred from main memory to local SRAM into the board. In addition, the Host PC transfers some parameters required for the computation, such as number of iterations, lattice size, and rule parameters to the CAREM processor.

In the second phase, the CAREM processor calculates the automaton evolution according to its specific transition function. At each iteration, the CAREM processor reads the current automaton state from source memory and writes the new state into destination memory. The processor computes n-bits of data at each step, concurrently updating just a limited number of cells. When updating the entire automaton state, the roles of the two memory banks are changed and the processor begins another iteration. According to the specific algorithm the processor executes, CAREM will stop the computation and set an appropriate flag bit in order to signal the end of the computation. The system can generate a stop condition in two ways: after a specified number of iterations, or using a check function applied to the automaton data. After recognizing the stop condition, the Host PC retrieves display data from local memory.

The CAREM processor uses FPGA technology for high flexibility on different automata. It's easy to scale the proposed computing architecture based on desired performance and cost. By increasing bus width and by using multiple FPGA devices and memories, you can improve the grade of parallelism of the computation. Since the PEs work in a totally parallel manner on the automaton data, processing speed scales linearly with the number of PEs.

The CAREM Processor
A single Xilinx Virtex XCV600 FPGA device implements the CAREM prototype. To provide a capability to handle different algorithms, the prototype organizes the reconfigurable processor as three different sub-modules (Figure 4). These sub-modules implement the previously described common CA—neighborhood module, rules module, and control unit. According to the particular CA algorithm the processor is executing, you can define these modules to support the requirements of a specific application.

Figure 4:  The CAREM processor comprises three sub-systems: neighborhood module, rules module, and control unit

The neighborhood module constitutes an interface block between local source memory and the processing elements inside the rules module. This module supports fast access to a cell's state for reading. The neighborhood module consists of parallel multiple rows of shift registers. At each clock cycle, a new 32-bit word is read from the source memory and written into the first register's column. At the same time, all the columns shift to the right by one position so that a new set of cells and corresponding neighbors are available for processing. Although the row number is fixed and equals to the memory word-width, the column number depends on the neighborhood size of the particular CA algorithm the processor is implementing. In the application examples in the following sections, we use a Moore neighborhood with size equal to three cells.

The rules module, the heart of the reconfigurable processor, contains the processing elements that perform the transition function of the cellular automaton. The FPGA's logic blocks define these application-specific PEs. At each step, the rules module receives as input the current 32-bit state word provided by the neighborhood module and generates an updated 32-bit word. This data word is immediately written into the destination memory. Besides the updating function, the rules module can provide other interesting features depending on the particular automaton the processor is executing.

The control unit performs two main functions. The unit implements an interface circuit for the exchange of parameters between the Host PC and the CAREM processor, and it operates as a management unit for local memory. A Finite State Machine (FSM) provides both functions. The interface circuit consists of several registers of different sizes. The execution registers (IER and OER) store some basic I/O signals that the processor needs for execution control.

The iterations register (IR) stores the number of CA iterations that the processor has to perform before stopping the computation. The count register (CR) stores the number of updated data words per iteration. Finally, a group of four rules registers (RR1 through RR4) permits the rules module to store some parameters. These registers provide greater runtime flexibility for the computation since the host system can use the registers to dynamically change the automaton rules without reconfiguring the processor.

The memory unit provides complete management of local memory for the CAREM processor. As discussed in the previous section, CA processing requires reading the current automaton state from the source memory and writing the updated state to the destination memory. This implies that the processor has to generate addresses and control signals at each step of the computation for both memory banks. The processor generates address values sequentially, with the maximum value of the address dependent on the parameter stored in the CR register.

Application Examples
In order to demonstrate the effectiveness of the proposed system in terms of performance and flexibility, we implemented two significant algorithms on the CAREM machine: a thinning technique used in image-processing tasks and a forest-fire simulation. Both algorithms use a 2D CA model with a Moore neighborhood, but they differ in cell size, transition function, and execution flow. Owing to its versatile structure, we matched the CAREM processor to these applications only by reconfiguring the rules module. The processor exhibited a throughput of about 160 Mbit/sec. To speed-up the CA processing, these example applications limited communication with the Host PC to the initial and final data exchanges.

Thinning Algorithm
Binary thinning is a technique that reduces the thickness of binary images. It is widely used in the pre-processing stages of pattern-recognition applications such as optical pattern recognition (OCR), fingerprint-image processing, and computer vision. By means of an erosion process, this technique deletes pixels from the boundary, but still maintains original connectivity of the image. After a finite number of iterations, this transformation produces another binary image corresponding to a skeleton of the original image.

This example implements the thinning algorithm of Arcella et al. Each input pixel uses a set of eight matching templates, applied in parallel to the neighboring pixels. We use 32 processing elements in the rules module, updating 32 1-bit cells at each clock step. The algorithm requires a variable number of iterations according to the complexity of the input image. The algorithm terminates when the image does not change between two consecutive iterations. The control unit will set the stop flag in the execution register signaling the end of the computation.

Figure 5 illustrates the effect of thinning on a sample image. Table 1 shows the performance results of our computation on real working examples. The table compares results from the CAREM processor to the PAPRICA processor. In both cases the thinning algorithm executes on a 256x256 lattice. The PAPRICA processor uses an extended array of 256 PEs with an instruction cycle-time of 350ns. Those PEs provide a set of elementary instructions, performing basic graphic and logical operations on the input image pixels. On the other hand, the CAREM processor uses PEs the processor specifically identifies for the thinning algorithm. They exhibit a cycle time of 200ns. Table 1 shows that this technique provides a 65X cell-update speed increase over the PAPRICA processor.

Figure 5:  Thinning execution on a sample image


System Lattice Size Execution
Time (Sec)
Cell
Updates/Sec
Speed Up
PAPRICA 256*256 2.6 2.52*106 1
CAREM 256*256 0.041 1.64*108 65

Table 1:  Thinning executions (100 iterations)

Forest-Fire Algorithm
The forest-fire algorithm is another interesting and widely recognized application of CA modelling for studying the behavior of complex real-word phenomena. In the basic model, each cell in the automaton represents a portion of land and the cell's state can assume any one of four different values: tree, burnt tree, fire, and ground. We thus need 2 bits to represent a cell's state. The automaton state evolution follows a simple set of rules:

  1. If a tree cell catches fire from one of its neighborhood cells, it burns and becomes a fire cell
  2. A fire cell becomes a burnt tree cell
  3. Burnt trees and ground cells do not change their own state, they only stop the spreading of the fire.

You could also extend this basic forest-fire model to consider other elements in the forest such as terrain conditions, fuel types, and weather conditions.

To execute the forest-fire algorithm on the CAREM reconfigurable machine, we reconfigured the original rules module of the thinning application to implement the new transition function. In this case we use 16 processing elements working concurrently. In order to observe the evolution of the automaton state, the forest-fire simulation also requires specifying the number of iterations between two consecutive visualizations—this value is loaded into the iterations register IR.

As previously discussed, the rules-registers block inside the control unit provides a more flexible implementation of the CA rules. For the forest fire algorithm, the control unit permits switching from the basic model to an extended model, which introduces wind effect on fire propagation. To include wind effects, we load the rules register RR1 with two wind parameters: direction and speed. The RRI also stores a control bit w, which sets the model the processor uses. Figure 6 shows two visualization snapshots of the forest-fire simulation. For this example, the w bit was set to zero, which means there is no wind. No wind means uniform spreading of the fire. Conversely, by setting the w bit to one, the propagation of the fire takes place in a specified direction, as shown in Figure 7.

Figure 6:  Uniform fire propagation

Figure 7:  Wind effect on fire propagation

Table 2 shows the performance results for the forest-fire computation. These results are compared to that obtained on a Meiko CS2 parallel computer. The Meiko system uses a configuration of eight processors. Nevertheless, as can be noted, the CAREM processor showed a cell-update speed increase of 24 over the Meiko processor.


System Lattice Size Execution
Time (Sec)
Cell
Updates/Sec
Speed Up
Meiko CS2-8 224*224 14.91 3.37*106 1
CAREM 128*256 0.41 8.19*107 24

Table 2:  Forest-fire executions (1000 iterations)

The proposed CAREM machine provides an inexpensive platform for easily implementing several CA applications. The scalable architecture of the proposed processor is modular, which makes it very easy to reconfigure the processor for different automata. The implementation of two significantly different algorithms has demonstrated the flexibility of the CAREM processor. Owing to the efficient parallel processing provided by CAREM machine, the execution of these algorithms has shown much higher performance than that obtained on software implementations running on general-purpose supercomputers and specialized processors.



 






Related Content

WEBINAR
1. Introducing Spartan-3A FPGAs - Powering the Next Wave of High-Volume Applications!

WEBINAR
2. Virtex-5 FPGAs and PlanAhead Deliver Maximum Performance

WEBINAR
3. Accelerate IPTV Headend Design Using Available IP and Reference Designs Net Seminar

WEBINAR
4. Power Saving Technologies in the World's First 65nm FPGA

 


 Featured Jobs
SEL seeking Business Development Manager in Pullman, WA

SEL seeking Integration / Automation Engineer in Charlotte, NC

ESRI seeking Business Manager - Support Services in Redlands, CA

Amcor PET Packaging seeking Facilities Engineer in Philadelphia, PA

Mentor Graphics seeking Embedded SW Tele-Sales in San Jose, CA

More jobs on EETimesCareers
 Sponsor
 CAREER CENTER
Ready to take that job and shove it?
SEARCH JOBS:

 SPONSOR

 RECENT JOB POSTINGS
For more great jobs, career related news, features and services, please visit EETimes' Career Center.