SAXPY – Linear Algebra

This example consists of a host program and a Mitrion-C program that together perform the BLAS Level 1 routine SAXPY (Single-precision Alpha X Plus Y) using FPGA hardware acceleration. The host program, written in ANSI C, generates two random vectors and sends them to the Mitrion-C program running on the FPGA (or simulator), which performs the arithmetic. Other BLAS Level 1 routines can be substituted for SAXPY by replacing the function do_saxpy.

Click here to download the source files (ZIP, 40 KB)

The host program also computes SAXPY entirely on the host processor and compares that result with the output of the FPGA version.
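The host-side check described above can be sketched in C as follows. This is a minimal illustration, not the code from the downloadable host program: the function names saxpy_reference and count_mismatches are hypothetical, and the relative tolerance is an assumption, used here in case the FPGA and the host CPU round intermediate results differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference: out[i] = alpha * x[i] + y[i]. */
static void saxpy_reference(size_t n, float alpha,
                            const float *x, const float *y, float *out)
{
    size_t i;
    assert(x != NULL && y != NULL && out != NULL);
    for (i = 0; i < n; i++)
        out[i] = alpha * x[i] + y[i];
}

/* Absolute value without pulling in the math library. */
static float abs_f(float v)
{
    return v < 0.0f ? -v : v;
}

/* Count the elements whose host and FPGA results differ by more than
   rel_tol, measured relative to the expected value (or to 1.0 for
   small values). The tolerance is an assumption for this sketch. */
static size_t count_mismatches(size_t n, const float *expected,
                               const float *actual, float rel_tol)
{
    size_t i, bad = 0;
    for (i = 0; i < n; i++) {
        float scale = abs_f(expected[i]);
        if (scale < 1.0f)
            scale = 1.0f;
        if (abs_f(expected[i] - actual[i]) > rel_tol * scale)
            bad++;
    }
    return bad;
}
```

A host program would call saxpy_reference on the same random vectors it sends to the FPGA, then pass both result arrays to count_mismatches and report any nonzero count.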

Note: In this example the term vector refers to the mathematical concept of a matrix with only one column or row. This differs from the Mitrion-C notion of a vector, which is a collection that causes loops over it to be implemented by unrolling. The Mitrion-C program must be recompiled, and a new binary generated, for each vector length.

The saxpy.mitc program listing for SGI RASC RC100:


Mitrion-C 1.5;
/* SAXPY -- Single Precision calculation of Alpha * X + Y
   For every pair of elements in vectors X and Y, scale the X
   element by alpha and add the Y element. */

/* The size of External Memory on the RC100 is defined by
   RASC as two banks of SRAM, each containing 1048576
   (0x100000) 128-bit words of memory. The host program
   must use the same size. */

const MEM_SRAM_NWORDS = 0x100000;
const MEM_SRAM_NBITS = 128;
type MEM_SRAM = typedef mem bits:MEM_SRAM_NBITS[MEM_SRAM_NWORDS];
const REG_NBITS = 64;
type REG = typedef float:53.11;

/* Half of the words of memory are the X vector, the other
   half are the Y vector */
const VECTOR_LEN = (MEM_SRAM_NWORDS / 2);

const READS = (VECTOR_LEN / 4);
/* Number of single-precision values per 128-bit word of
   external memory */
const NFLOATS_PER_WORD = 4;

type float_t = typedef float:24.8; /* IEEE single precision */



(MEM_SRAM, MEM_SRAM) main( MEM_SRAM mem_a_00, MEM_SRAM mem_b_00,
                           REG reg_alpha)
{

    /* Each 128-bit word of external SRAM contains four single-precision
       floats. Loop through memory yielding a list of floats grouped as
       vectors of 4 floats each. Word i holds part of Y; word i + READS
       holds the matching part of X. */

    (vectorY_lv, vectorX_lv, mem_a_03) = foreach ( i in <0 .. READS-1> )
    {
        float_t[NFLOATS_PER_WORD] vectorY_v;
        float_t[NFLOATS_PER_WORD] vectorX_v;
        (vectorY_v, mem_a_01) = _memread(mem_a_00, i);
        (vectorX_v, mem_a_02) = _memread(mem_a_01, i + READS);
    } (vectorY_v, vectorX_v, mem_a_02) ;

    float_t alpha = reg_alpha ;

    /* Perform the SAXPY operation (y + alpha * x) on the vector elements
       as they were grouped in the previous loop. */

    (result_lv) = foreach ( vectorY_v, vectorX_v in vectorY_lv, vectorX_lv)
    {
        float_t[NFLOATS_PER_WORD] result_v = foreach (x, y in vectorX_v, vectorY_v)
        {
            float_t result = y + alpha * x;
        } result;
    } result_v ;

    /* Like the inputs, the results are represented as length 4 vectors,
       and are already the appropriate size to be written to external
       memory. */

    mem_b_02 = foreach ( result_v in result_lv by i )
    {
        mem_b_01 = _memwrite( mem_b_00, i, (bits:128) result_v );
    } mem_b_01;

} (mem_a_03, mem_b_02);
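
The memory layout the listing relies on can be easier to follow in a small software model. The C sketch below is an illustration only: word128, pack_input and saxpy_model are hypothetical names, and the order of the four floats within a 128-bit word is an assumption. It mirrors the indexing in the listing, where Y is read from words 0 to READS-1 of the first memory bank and X from words READS to 2*READS-1, and one result word is written per pair of input words.

```c
#include <assert.h>
#include <stddef.h>

#define NFLOATS_PER_WORD 4  /* four 32-bit floats per 128-bit word */

/* One 128-bit SRAM word modeled as four floats.
   The lane order within the word is an assumption. */
typedef struct {
    float lane[NFLOATS_PER_WORD];
} word128;

/* Pack Y into words [0, reads) and X into words [reads, 2 * reads),
   matching the _memread indices i and i + READS in the listing. */
static void pack_input(size_t vector_len, const float *x, const float *y,
                       word128 *mem_a)
{
    size_t i, k, reads;
    assert(vector_len % NFLOATS_PER_WORD == 0);
    reads = vector_len / NFLOATS_PER_WORD;
    for (i = 0; i < reads; i++) {
        for (k = 0; k < NFLOATS_PER_WORD; k++) {
            mem_a[i].lane[k] = y[i * NFLOATS_PER_WORD + k];
            mem_a[i + reads].lane[k] = x[i * NFLOATS_PER_WORD + k];
        }
    }
}

/* Software model of the three loops in the listing: read one Y word
   and one X word, combine the four lanes, write one result word. */
static void saxpy_model(size_t reads, float alpha,
                        const word128 *mem_a, word128 *mem_b)
{
    size_t i, k;
    for (i = 0; i < reads; i++) {
        const word128 *yw = &mem_a[i];
        const word128 *xw = &mem_a[i + reads];
        /* The four-lane loop corresponds to the Mitrion-C vector. */
        for (k = 0; k < NFLOATS_PER_WORD; k++)
            mem_b[i].lane[k] = yw->lane[k] + alpha * xw->lane[k];
    }
}
```

In this model the inner four-lane loop plays the role of the Mitrion-C vector, which the note above says is implemented by unrolling, while the outer loop walks the list of words one at a time.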