Part A - Introduction

Linear Algebra in Science

Describe vector and matrix usage in scientific applications
Describe the BLAS library for vector and matrix operations

The vector and matrix tools of linear algegbra are suited to represent sets of equations in a compact form.  One vector represents the set of solution variables while another vector represents the set of condition variables.  The sizes of these vectors depend on the number of variables in each set.  The sizes of the vectors and matrices relating them depends on the degree of detail in the problem description.

This chapter extends the two-dimensional description of the preceding chapter to the linear algebraic representation of a set of linear equations.  The solution and condition vectors have n components and the transformation matrix has n rows and n columns.  This chapter introduces the BLAS standard for linear algebraic computations on vectors and matrices and describes how to access the library functions written under this standard in the GNU scientific library.

Dimensionality and Ordering Schemes

Dimensionality

The number of degrees of freedom in a problem defines the dimension of the problem.  A vector's dimension is the number of components that identifies the vector uniquely in the problem space.  For example, if we describe the temperature of a body by the temperature at 35 points on the body, the temperature vector has 35 components and is of dimension 35.

A matrix's dimension is the number of its rows and its columns.  For example, if we describe the temperature of a body by the temperature at 35 different points and the heat flow into the body at those same points, then the conduction matrix, which relates the body's temperature to the heat flow, is a 35 by 35 matrix.

The data structure for storing the components of a vector and the coefficients of a matrix is an array.  We store vector components in singly subscripted arrays and matrix coefficients in doubly subscripted arrays.

Operating on vectors using matrices is a straightforward coding exercise.  We access each component of a vector using a single iteration and each coefficient of a matrix using a nested iteration.  The iteration ranges correspond to the dimensions of the vectors and matrices.

Ordering Schemes

The C/C++ family of languages stores doubly subscripted arrays in memory row by row.  This ordering scheme is called row major order.  The Fortran language stores doubly subscripted arrays column by column.  This ordering scheme is called column major order.  In passing doubly subscripting arrays to library functions across the C/C++/Fortran divide, we pay close attention to the order assumed by the library functions.

N Dimensions

The rules for vector and matrix operations in an n-dimensional problem space are identical to the rules for a 2-dimensional space described in the preceding chapter.  We simply replace the 2 components of a vector with n components and the 2 by 2 coefficients of a matrix with n by n coefficients.

If the calculations for each component of a vector are independent of the other calculations, as they often are, the the order of the iterations involved in performing operations on vectors and matrices does not matter and these operations are parallelizable.

Vector Operations

Adding one n-dimensional vector to another n-dimensional vector produces a third n-dimensional vector.  Each component of the resultant vector is equal to the sum of the corresponding components of the contributing vectors:

 ``` r = p + q = [px0 ... pxi ... pxn-1]T + [qx1 ... qxi ... qxn-1]T = [px0 + qx1 ... pxi + qxi ... pxn-1 + qxn-1]T ```

The subscript xi identifies the component associated with the i+1-th dimension of the problem space.

Coding a vector addition requires a simple iteration:

 ``` for (int i = 0; i < n; i++)  r[i] = p[i] + q[i]; ```

Subtraction

Subtracting an n-dimensional vector from another n-dimensional vector produces a third n-dimensional vector.  Each component of the resultant vector is equal to the difference between the corresponding components of the contributing vectors:

 ``` r = p - q = [px0 ... pxi ... pxn-1]T - [qx0 ... qxi ... qxn-1]T = [px0 - qx0 ... pxi - qxi ... pxn-1 - qxn-1]T ```

Coding a vector subtraction requires a simple iteration:

 ``` for (int i = 0; i < n; i++)  r[i] = p[i] - q[i]; ```

Multiplication by a Scalar

Multiplying an n-dimensional vector by a scalar produces an n-dimensional vector without changing its direction.  Each component of the resultant vector is equal to the product of the scalar and the corresponding component of the contributing vector:

 ``` r = s p = s [px0 ... pxi ... pxn-1]T = [s px0 ... s pxi ... s pxn-1]T ```

Coding a scalar multiplication requires a simple iteration:

 ``` for (int i = 0; i < n; i++)  r[i] = s * p[i];```

Dot Product of Two Vectors

The dot product of two n-dimensional vectors is the sum of the products of their corresponding components:

 ``` s = p∙q = pTq = px0qx1 + ... + pxiqxi + ... + pxnqxn-1 ```

Coding a dot product requires a simple iteration:

 ``` double s = 0; for (int i = 0; i < n; i++)  s += p[i] * q[i];```

The magnitude of an n-dimensional vector is the square root of the sum of the products of the corresponding components of the contributing vectors:

 ` ||p|| = √[(px0)2 + ... + (pxi)2 + ... + (pxn-1)2] `

This is a simple extension of Pythagoras' theorem to n-dimensions.

Coding a magnitude calculation requires a simple iteration and a square root evaluation:

 ``` double s = 0; for (int i = 0; i < n; i++)  s += p[i] * q[i]; double magnitude = std::sqrt(s);```

Matrix Operations

The rules for operating on an n by n matrix follow directly from the rules for 2 by 2 matrices.

Adding two matrices of identical size creates a third matrix of the same size.  The number of rows in each matrix must be identical.  The number of columns in each matrix must also be identical.  Coefficient [i,j] in the resultant matrix is the sum of the corresponding coefficients [i,j] in the contributing matrices:

 ``` C = A + B C0,0 = A0,0 + B0,0 Ci,j = Ai,j + Bi,j Cn-1,n-1 = An-1,n-1 + Bn-1,n-1 ```

Coding a matrix addition requires a doubly nested iteration:

 ``` for (int i = 0; i < n; i++) for (int j = 0; j < n; j++) C[i][j] = A[i][j] + B[i][j]; ```

Note that the number of subscripts in an array (called the array dimension in programming) differs from the number of dimension(s) of the problem space; that is, the dimension(s) of the data stored in the array.

Scalar-Matrix Multiplication

Multiplying a matrix by a scalar creates a matrix of the same size.  Coefficient [i,j] in the resulting matrix is the product of the scalar and the corresponding coefficient [i,j] in the matrix:

 ``` C = s A C0,0 = s * A0,0 Ci,j = s * Ai,j Cn-1,n-1 = s * An-1,n-1 ```

Coding a scalar-matrix multiplication requires a doubly nested iteration:

 ``` for (int i = 0; i < n; i++) for (int j = 0; j < n; j++)  C[i][j] = s * A[i][j];```

Matrix-Vector Multiplication

Multiplying a matrix by a vector creates a vector of size equal to the number of rows in the matrix.  Component [i] in the resulting vector is the dot product of the corresponding row (i) in the left matrix and the vector on the right; that is, the component is the sum of the products of coefficient [i, j] in the left matrix and component [j] in the right vector, where n is the number of rows or columns:

 ``` b = M a b0 = M0,0 * a0 + ... + M0,k * ak + ... + M0,n-1 * an-1 bi = Mi,0 * a0 + ... + Mi,k * ak + ... + Mi,n-1 * an-1 bn-1 = Mn-1,0 * a0 + ... + Mn-1,k * ak + ... + Mn-1,n-1 * an-1 ```

Coding a matrix-vector multiplication requires a doubly nested iteration:

 ``` for (int i = 0; i < n; i++) { sum = 0; for (int j = 0; j < n; j++)  sum += M[i][j] * a[j]; b[i] = sum; }```

Matrix-Matrix Multiplication

Multiplying a matrix by a matrix creates a matrix of size equal to the number of rows in the left matrix and the number of columns in the right matrix.  Coefficient [i,j] in the resulting matrix is the dot product of the corresponding row (i) in the left matrix and the corresponding column (j) in the right matrix; that is, the coefficient is the sum of the products of coefficient [i, k] in the left matrix and coefficient [k, j] in the right matrix, where k = 0, ..., n - 1 and n is the number of rows or columns:

 ``` C = A B C0,0 = A0,0 * B0,0 + ... + A0,k * Bk,0 + ... + A0,n-1 * Bn-1,0 C0,j = A0,0 * B0,j + ... + A0,k * Bk,j + ... + A0,n-1 * Bn-1,j Ci,j = Ai,0 * B0,j + ... + Ai,k * Bk,j + ... + Ai,n-1 * Bn-1,j Cn-1,n-1 = An-1,0 * B0,n-1 + ... + An-1,k * Bk,n-1 + ... + An-1,n-1 * Bn-1,n-1 ```

Coding a matrix-matrix multiplication requires a triply nested iteration:

 ``` for (int i = 0; i < n; i++) { for (int j = 0; j < n; j++) { sum = 0; for (int k = 0; k < n; k++) sum += A[i][k] * B[k][j];  C[i][j] = sum; } }```

Big-O Orders

The Big-O orders of matrix-matrix addition, matrix-matrix subtraction, and scalar-matrix and matrix-vector multiplication are O(n2).

The Big-O order of matrix-matrix multiplication is O(n3).  This is one operation where parallel programming can speedup execution time considerably for large matrices.  For n = 1000, the number of operations is in the order of 1,000,000,000.

Set of Linear Equations

Stiffness Problem

Consider the following statement of a solution to a stiffness problem.  The force at six different points in a bar of composite material is related to the displacement at each of those points through stiffness coefficients.  The set of linear equations that describes the stiffness relation is:

 ``` f0 = 1.0 * d0 + 0.5 * d1 f1 = 0.5 * d0 + 1.0 * d1 + 0.5 * d2 f2 = 0.5 * d1 + 1.0 * d2 + 0.5 * d3 f3 = 0.5 * d2 + 1.0 * d3 + 0.5 * d4 f4 = 0.5 * d3 + 1.0 * d4 + 0.5 * d5 f5 = 0.5 * d4 + 1.0 * d5```

We can expand each equation to include all displacement components:

 ``` f0 = 1.0 * d0 + 0.5 * d1 + 0.0 * d2 + 0.0 * d3 + 0.0 * d4 + 0.0 * d5 f1 = 0.5 * d0 + 1.0 * d1 + 0.5 * d2 + 0.0 * d3 + 0.0 * d4 + 0.0 * d5 f2 = 0.0 * d0 + 0.5 * d1 + 1.0 * d2 + 0.5 * d3 + 0.0 * d4 + 0.0 * d5 f3 = 0.0 * d0 + 0.0 * d1 + 0.5 * d2 + 1.0 * d3 + 0.5 * d4 + 0.0 * d5 f4 = 0.0 * d0 + 0.0 * d1 + 0.0 * d2 + 0.5 * d3 + 1.0 * d4 + 0.5 * d5 f5 = 0.0 * d0 + 0.0 * d1 + 0.0 * d2 + 0.0 * d2 + 0.5 * d4 + 1.0 * d5```

Vector-Matrix-Vector Representation

We represent these equations using vectors and a coefficient matrix as follows:

 ``` | f0 | | 1.0 0.5 0.0 0.0 0.0 0.0 | | d0 | | f1 | | 0.5 1.0 0.5 0.0 0.0 0.0 | | d1 | | f2 | = | 0.0 0.5 1.0 0.5 0.0 0.0 | | d2 | | f3 | | 0.0 0.0 0.5 1.0 0.5 0.0 | | d3 | | f4 | | 0.0 0.0 0.0 0.5 1.0 0.5 | | d4 | | f5 | | 0.0 0.0 0.0 0.0 0.5 1.0 | | d5 | f = S d ```

For the temperature distribution on the right side, the heat flow solution is the result of a matrix-vector multiplication as shown on the left side:

 ``` | 20 | | 1.0 0.5 0.0 0.0 0.0 0.0 | | 10 | | 30 | | 0.5 1.0 0.5 0.0 0.0 0.0 | | 20 | | 25 | = | 0.0 0.5 1.0 0.5 0.0 0.0 | | 10 | | 25 | | 0.0 0.0 0.5 1.0 0.5 0.0 | | 10 | | 40 | | 0.0 0.0 0.0 0.5 1.0 0.5 | | 20 | | 40 | | 0.0 0.0 0.0 0.0 0.5 1.0 | | 30 |```

The BLAS Standard

The Basic Linear Algebra Subprograms (BLAS) standard defines the specifications for publishing API libraries that perform operations on vectors and matrices.  The BLAS standard was established in 1979 and the BLAS libraries have been optimized to perform operations on vectors and matrices efficiently.

The standard identifies three distinct levels of functionality with respect to these operations:

1. vector operations of the general form
 ` y = α x + y`
where x and y are vectors and α is a scalar

2. matrix-vector operations of the general form
 ` y = α A x + β y`
where x and y are vectors, A is a matrix and α and β are scalars

3. matrix-matrix operations of the general form
 ` C = α A B + β C`
where A, B and C are matrices and α and β are scalars

Implementations

BLAS implementations ship in two distinct forms: for specific systems and for specific processors.  Those that have been optimized for specific operating systems ship with those systems.  Those that have been optimized for specific processors are available from the hardware manufacturers.

The BLAS implementations include:

Naming Convention

The original BLAS standard was developed using the Fortran language.  Because Fortran identifiers were limited to 6 characters through Fortran-77 and Fortran did not accommodate overloading, the naming convention for the BLAS procedures is cryptic.  Procedure names are combinations of sets of letters that uniquely identify the precision, the operation, and the type of vector/matrix involved.

BLAS identifies precision by the following single letters:

• S - single
• D - double
• C - single complex
• Z - double complex

BLAS identifies operations by the following strings:

• DOT - dot or scalar product
• AXPY - vector sum in the form αx + y
• MV - matrix-vector product A x
• SV - matrix-vector solve inv(A) x, where inv denotes inverse
• MM - matrix-matrix product A B
• SM - matrix-matrix solve inv(A) B

BLAS identifies types of matrices by the following letter pairs:

• GE - general
• GB - general banded
• SY - symmetric
• SB - symmetric banded
• SP - symmetric packed
• HE - hermitian
• HB - hermitian banded
• HP - hermitian packed
• TR - triangular
• TB - triangular banded
• TP - triangular packed

Fortran implementations use upper case, while C implementations use lower case.  For example, SGEMM and sgemm stand for single-precision general matrix-matrix multiply in Fortran and C respectively.  C implementations add the prefix cblas_.

GSL Example

The GSL implementation uses the prefix gsl_ and prefixes its macros with GSL_

The header files for GSL are in the gsl system directory.  To include a specific header file, we preface its name with gsl/

 ` #include   `

C++ compilers mangle function identifiers by default.  To suppress identifier mangling, we wrap the header file that is being included in a C++ application in a C linkage directive:

 ``` extern "C" { #include   }```

Level 3 cblas

The following program performs a matrix multiplication on two square matrices of user-specified size by calling the cblas_sgemm function from the GSL cblas library.  Before the call, this program allocates matrix memory in linearized form and fills the operand matrices with random-valued coefficients.  This program uses the chrono library to determine the time for each function call in micro-seconds:

 ``` // Matrix Multiply using BLAS // matMul_BLAS.cpp #include #include #include #include extern "C" { #include } using namespace std::chrono; const int WIDTH = 5; void init(float* a, int n) { const float randf = 1.0f / (float) RAND_MAX; for (int i = 0; i < n; i++) a[i] = std::rand() * randf; } void reportTime(const char* msg, steady_clock::duration span) { auto ms = duration_cast(span); std::cout << msg << " - took - " << ms.count() << " millisecs" << std::endl; } int main(int argc, char** argv) { if (argc/2 != 1) { std::cerr << argv << ": invalid number of arguments\n";  std::cerr << "Usage: " << argv << " size_of_matrices [output]\n";  return 1; } int k, n = std::atoi(argv); float* a = new float[n * n]; float* b = new float[n * n]; float* c = new float[n * n]; // initialization std::srand(std::time(nullptr)); init(a, n * n); init(b, n * n); // compute matrix product auto ts = steady_clock::now(); cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0, a, n, b, n, 0.0, c, n); auto te = steady_clock::now(); reportTime("matrix-matrix operation", te - ts); // output results if user supplied 2 arguments if (argc == 3) { k = 0; std::cout << std::setprecision(6) << std::fixed; for (int i = 0; i < n; i++) { std::cout << std::setw(5) << i+1 << ':'; for (int j = 0; j < n; j++) { std::cout << std::setw(10) << c[k++] << ' '; if (j % WIDTH == WIDTH - 1) std::cout << std::endl << std::setw(5) << i+1 << ':'; } std::cout << std::endl; } std::cout << std::endl; } delete [] a; delete [] b; delete [] c; }```

The CblasRowMajor macro identifies the matrices as stored in row-major order.  The CblasNoTrans macro directs the function not to transpose the matrix before performing the operation.  The level 3 scalar multipliers, α and β, are assigned values 1.0 and 0.0 respectively.

The cblas library functions are available in object form and have not been compiled for profiling.  Hence, gprof will have no access to statistics for any calls to these functions.  For timing information, we need to wrap the function calls ourselves.

chrono

C++11 added the chrono library to provide a flexible cross-platform, time-tracking utility that improves in precision over time.  The identifiers are defined in the std::chrono namespace.  The duration template models time spans.  The steady_clock class provides access to the current time_point and is specifically designed to calculate time intervals.  Its now() class function returns the current time.

To access the chrono library, you need version 4.7.0 of gcc or greater.  To access version 5.2.0 on matrix, add the following alias to your .bash_profile file:

 ` alias g++5.2.0="/usr/local/gcc/5.2.0/bin/g++ -std=c++14"`

The std=c++14 option uses the C++14 language standard (needed for chrono and nullptr).

Then, to compile and link the program use the following command:

 ` g++5.2.0 -Wall matMult_BLAS.cpp -omatMult -lgslcblas`

The -Wall option turns on all warnings.  The -lgslcblas option links in the object code from the gsl/cblas library.

Sample Results

The results for the command

 ` matMult_BLAS 5 x`

are

 ``` 1: 1.114390 1.481488 1.727969 1.607353 1.602096 1: 2: 1.107667 1.501759 2.036034 1.649657 1.832008  2: 3: 0.808281 1.429400 2.074915 1.836768 2.000948 3: 4: 0.720611 1.191038 1.421607 1.377677 1.756079 4: 5: 0.457838 0.796544 1.211210 0.861556 1.111302 5:```

Exercises

• Consider the vectors shown on the left.  Calculate the results of transforming these vectors using the coefficient matrix on the right:  ``` x = [1 2 1 3 2]T y = [1 2 2 1 3]T ``` ``` 1 0 1 0 1 0 1 0 1 0 M = 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1```
The answers are here.

• Calculate the product of the two matrices shown below:  ``` 1 0 1 0 1 0 1 0 1 0 A = 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1``` ``` 1 1 1 0 1 0 1 1 1 0 B = 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1```
The answer is here.

• Review the BLAS Home Page at www.netlib.org/blas
• Skim the documentation for the C interface to BLAS under Section 2 of Annex B at www.netlib.org/blas/blast-forum/cinterface.pdf.  This is a handy reference for working with the cblas versions of the BLAS library.
• Complete Workshop 2 on BLAS

 Designed by Chris Szalwinski Copying From This Site  