The vector and matrix tools of linear algegbra are suited
to represent sets of equations in a compact form. One vector
represents the set of solution variables while another vector represents the
set of condition variables. The sizes of these vectors depend on the
number of variables in each set. The sizes of the
vectors and matrices relating them depends on the degree of detail
in the problem description.
This chapter extends the twodimensional description of the preceding chapter
to the linear algebraic representation of a set of linear equations.
The solution and condition vectors have n components and the transformation
matrix has n rows and n columns. This chapter introduces the BLAS standard
for linear algebraic computations on vectors and matrices and describes how to
access the library functions written under this standard in the GNU scientific
library.
Dimensionality and Ordering Schemes
Dimensionality
The number of degrees of freedom in a problem defines the dimension of the problem.
A vector's dimension is the number of components that identifies the vector
uniquely in the problem space. For example, if we describe the temperature
of a body by the temperature at 35 points on the body, the temperature
vector has 35 components and is of dimension 35.
A matrix's dimension is the number of its rows and its columns. For example,
if we describe the temperature of a body by the temperature at 35 different points
and the heat flow into the body at those same points, then the conduction matrix,
which relates the body's temperature to the heat flow, is a 35 by 35 matrix.
The data structure for storing the components of a vector and the coefficients
of a matrix is an array. We store vector components in singly subscripted
arrays and matrix coefficients in doubly subscripted arrays.
Operating on vectors using matrices is a straightforward coding exercise.
We access each component of a vector using a single iteration and each
coefficient of a matrix using a nested iteration. The iteration
ranges correspond to the dimensions of the vectors and matrices.
Ordering Schemes
The C/C++ family of languages stores doubly subscripted arrays in
memory row by row. This ordering scheme is called row major order.
The Fortran language stores doubly subscripted arrays column by column.
This ordering scheme is called column major order. In passing doubly
subscripting arrays to library functions across the C/C++/Fortran divide, we
pay close attention to the order assumed by the library functions.
N Dimensions
The rules for vector and matrix operations in an ndimensional problem space
are identical to the rules for a 2dimensional space described in the preceding
chapter. We simply replace the 2 components of a vector with n components
and the 2 by 2 coefficients of a matrix with n by n coefficients.
If the calculations for each component of a vector are independent of the
other calculations, as they often are, the the order of the iterations involved
in performing operations on vectors and matrices does not matter and these
operations are parallelizable.
Vector Operations
Addition
Adding one ndimensional vector to another ndimensional vector
produces a third ndimensional vector. Each component of the
resultant vector is equal to the sum of the corresponding components of
the contributing vectors:
r = p + q
= [p_{x0} ... p_{xi} ... p_{xn1}]^{T} + [q_{x1} ... q_{xi} ... q_{xn1}]^{T}
= [p_{x0} + q_{x1} ... p_{xi} + q_{xi} ... p_{xn1} + q_{xn1}]^{T}

The subscript x_{i} identifies the component
associated with the i+1th dimension of the problem space.
Coding a vector addition requires a simple iteration:
for (int i = 0; i < n; i++)
r[i] = p[i] + q[i];

Subtraction
Subtracting an ndimensional vector from another ndimensional vector
produces a third ndimensional vector. Each component of the
resultant vector is equal to the difference between the corresponding
components of the contributing vectors:
r = p  q
= [p_{x0} ... p_{xi} ... p_{xn1}]^{T}  [q_{x0} ... q_{xi} ... q_{xn1}]^{T}
= [p_{x0}  q_{x0} ... p_{xi}  q_{xi} ... p_{xn1}  q_{xn1}]^{T}

Coding a vector subtraction requires a simple iteration:
for (int i = 0; i < n; i++)
r[i] = p[i]  q[i];

Multiplication by a Scalar
Multiplying an ndimensional vector by a scalar produces an
ndimensional vector without changing its direction. Each component
of the resultant vector is equal to the product of the scalar and the
corresponding component of the contributing vector:
r = s p
= s [p_{x0} ... p_{xi} ... p_{xn1}]^{T}
= [s p_{x0} ... s p_{xi} ... s p_{xn1}]^{T}

Coding a scalar multiplication requires a simple iteration:
for (int i = 0; i < n; i++)
r[i] = s * p[i];

Dot Product of Two Vectors
The dot product of two ndimensional vectors is the
sum of the products of their corresponding components:
s = p∙q = p^{T}q
= p_{x0}q_{x1} + ... + p_{xi}q_{xi} + ... + p_{xn}q_{xn1}

Coding a dot product requires a simple iteration:
double s = 0;
for (int i = 0; i < n; i++)
s += p[i] * q[i];

The magnitude of an ndimensional vector is the square root of the sum
of the products of the corresponding components of the contributing
vectors:
p = √[(p_{x0})^{2} + ... + (p_{xi})^{2} + ... + (p_{xn1})^{2}]

This is a simple extension of Pythagoras' theorem to ndimensions.
Coding a magnitude calculation requires a simple iteration and a square root evaluation:
double s = 0;
for (int i = 0; i < n; i++)
s += p[i] * q[i];
double magnitude = std::sqrt(s);

Matrix Operations
The rules for operating on an n by n matrix follow directly from the
rules for 2 by 2 matrices.
MatrixMatrix Addition
Adding two matrices of identical size creates a third matrix of the same size.
The number of rows in each matrix must be identical. The number of columns in
each matrix must also be identical. Coefficient [i,j] in the resultant matrix
is the sum of the corresponding coefficients [i,j] in the contributing matrices:
C = A + B
C_{0,0 } = A_{0,0 } + B_{0,0 }
C_{i,j } = A_{i,j } + B_{i,j }
C_{n1,n1} = A_{n1,n1} + B_{n1,n1}

Coding a matrix addition requires a doubly nested iteration:
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
C[i][j] = A[i][j] + B[i][j];

Note that the number of subscripts in an array (called the array dimension
in programming) differs from the number of dimension(s) of the problem space;
that is, the dimension(s) of the data stored in the array.
ScalarMatrix Multiplication
Multiplying a matrix by a scalar creates a matrix of the same size.
Coefficient [i,j] in the resulting matrix is the product of the scalar and the
corresponding coefficient [i,j] in the matrix:
C = s A
C_{0,0 } = s * A_{0,0 }
C_{i,j } = s * A_{i,j }
C_{n1,n1} = s * A_{n1,n1}

Coding a scalarmatrix multiplication requires a doubly nested iteration:
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
C[i][j] = s * A[i][j];

MatrixVector Multiplication
Multiplying a matrix by a vector creates a vector of size equal to the number
of rows in the matrix. Component [i] in the resulting vector is the dot
product of the corresponding row (i) in the left matrix and the vector on the
right; that is, the component is the sum of the products of coefficient [i, j]
in the left matrix and component [j] in the right vector, where n is the number
of rows or columns:
b = M a
b_{0 } = M_{0,0 } * a_{0} + ... + M_{0,k } * a_{k} + ... + M_{0,n1 } * a_{n1}
b_{i } = M_{i,0 } * a_{0} + ... + M_{i,k } * a_{k} + ... + M_{i,n1 } * a_{n1}
b_{n1} = M_{n1,0} * a_{0} + ... + M_{n1,k} * a_{k} + ... + M_{n1,n1} * a_{n1}

Coding a matrixvector multiplication requires a doubly nested iteration:
for (int i = 0; i < n; i++) {
sum = 0;
for (int j = 0; j < n; j++)
sum += M[i][j] * a[j];
b[i] = sum;
}

MatrixMatrix Multiplication
Multiplying a matrix by a matrix creates a matrix of size equal to the number
of rows in the left matrix and the number of columns in the right matrix.
Coefficient [i,j] in the resulting matrix is the dot product of the corresponding
row (i) in the left matrix and the corresponding column (j) in the right matrix;
that is, the coefficient is the sum of the products of coefficient [i, k] in the left matrix and
coefficient [k, j] in the right matrix, where k = 0, ..., n  1 and n is
the number of rows or columns:
C = A B
C_{0,0 } = A_{0,0 } * B_{0,0 } + ... + A_{0,k } * B_{k,0 } + ... + A_{0,n1 } * B_{n1,0}
C_{0,j } = A_{0,0 } * B_{0,j } + ... + A_{0,k } * B_{k,j } + ... + A_{0,n1 } * B_{n1,j}
C_{i,j } = A_{i,0 } * B_{0,j } + ... + A_{i,k } * B_{k,j } + ... + A_{i,n1 } * B_{n1,j}
C_{n1,n1} = A_{n1,0} * B_{0,n1} + ... + A_{n1,k} * B_{k,n1} + ... + A_{n1,n1} * B_{n1,n1}

Coding a matrixmatrix multiplication requires a triply nested iteration:
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
sum = 0;
for (int k = 0; k < n; k++)
sum += A[i][k] * B[k][j];
C[i][j] = sum;
}
}

BigO Orders
The BigO orders of matrixmatrix addition, matrixmatrix subtraction, and
scalarmatrix and matrixvector multiplication are O(n^{2}).
The BigO order of matrixmatrix multiplication is O(n^{3}).
This is one operation where parallel programming can speedup execution time
considerably for large matrices. For n = 1000, the number of operations
is in the order of 1,000,000,000.
Set of Linear Equations
Stiffness Problem
Consider the following statement of a solution to a stiffness problem.
The force at six different points in a bar of composite material is related
to the displacement at each of those points through stiffness coefficients.
The set of linear equations that describes the stiffness relation is:
f_{0} = 1.0 * d_{0} + 0.5 * d_{1}
f_{1} = 0.5 * d_{0} + 1.0 * d_{1} + 0.5 * d_{2}
f_{2} = 0.5 * d_{1} + 1.0 * d_{2} + 0.5 * d_{3}
f_{3} = 0.5 * d_{2} + 1.0 * d_{3} + 0.5 * d_{4}
f_{4} = 0.5 * d_{3} + 1.0 * d_{4} + 0.5 * d_{5}
f_{5} = 0.5 * d_{4} + 1.0 * d_{5}

We can expand each equation to include all displacement components:
f_{0} = 1.0 * d_{0} + 0.5 * d_{1} + 0.0 * d_{2} + 0.0 * d_{3} + 0.0 * d_{4} + 0.0 * d_{5}
f_{1} = 0.5 * d_{0} + 1.0 * d_{1} + 0.5 * d_{2} + 0.0 * d_{3} + 0.0 * d_{4} + 0.0 * d_{5}
f_{2} = 0.0 * d_{0} + 0.5 * d_{1} + 1.0 * d_{2} + 0.5 * d_{3} + 0.0 * d_{4} + 0.0 * d_{5}
f_{3} = 0.0 * d_{0} + 0.0 * d_{1} + 0.5 * d_{2} + 1.0 * d_{3} + 0.5 * d_{4} + 0.0 * d_{5}
f_{4} = 0.0 * d_{0} + 0.0 * d_{1} + 0.0 * d_{2} + 0.5 * d_{3} + 1.0 * d_{4} + 0.5 * d_{5}
f_{5} = 0.0 * d_{0} + 0.0 * d_{1} + 0.0 * d_{2} + 0.0 * d_{2} + 0.5 * d_{4} + 1.0 * d_{5}

VectorMatrixVector Representation
We represent these equations using vectors and a coefficient matrix as follows:
 f_{0}   1.0 0.5 0.0 0.0 0.0 0.0   d_{0} 
 f_{1}   0.5 1.0 0.5 0.0 0.0 0.0   d_{1} 
 f_{2}  =  0.0 0.5 1.0 0.5 0.0 0.0   d_{2} 
 f_{3}   0.0 0.0 0.5 1.0 0.5 0.0   d_{3} 
 f_{4}   0.0 0.0 0.0 0.5 1.0 0.5   d_{4} 
 f_{5}   0.0 0.0 0.0 0.0 0.5 1.0   d_{5} 
f_{ } = S d_{ }

For the temperature distribution on the right side, the heat flow solution
is the result of a matrixvector multiplication as shown on the left side:
 20   1.0 0.5 0.0 0.0 0.0 0.0   10 
 30   0.5 1.0 0.5 0.0 0.0 0.0   20 
 25  =  0.0 0.5 1.0 0.5 0.0 0.0   10 
 25   0.0 0.0 0.5 1.0 0.5 0.0   10 
 40   0.0 0.0 0.0 0.5 1.0 0.5   20 
 40   0.0 0.0 0.0 0.0 0.5 1.0   30 

The BLAS Standard
The Basic Linear Algebra Subprograms (BLAS) standard defines the
specifications for publishing API libraries that perform operations
on vectors and matrices. The BLAS standard was established in 1979
and the BLAS libraries have been optimized to perform operations on vectors and matrices
efficiently.
The standard identifies three distinct levels of functionality with respect
to these operations:
 vector operations of the general form
where x and y
are vectors and α is a scalar
 matrixvector operations of the general form
where x and y
are vectors, A is a matrix and α
and β are scalars
 matrixmatrix operations of the general form
where A, B
and C are matrices and α
and β are scalars
Implementations
BLAS implementations ship in two distinct forms: for specific systems
and for specific processors. Those that have been optimized for
specific operating systems ship with those systems.
Those that have been optimized for specific processors are available
from the hardware manufacturers.
The BLAS implementations include:
Naming Convention
The original BLAS standard was developed using the Fortran language.
Because Fortran identifiers were limited to 6 characters through
Fortran77 and Fortran did not accommodate overloading, the naming
convention for the BLAS procedures is cryptic. Procedure
names are combinations of sets of letters that uniquely identify the
precision, the operation, and the type of vector/matrix involved.
BLAS identifies precision by the following single letters:
 S  single
 D  double
 C  single complex
 Z  double complex
BLAS identifies operations by the following strings:
 DOT  dot or scalar product
 AXPY  vector sum in the form αx + y
 MV  matrixvector product A x
 SV  matrixvector solve inv(A) x, where inv denotes inverse
 MM  matrixmatrix product A B
 SM  matrixmatrix solve inv(A) B
BLAS identifies types of matrices by the following letter pairs:
 GE  general
 GB  general banded
 SY  symmetric
 SB  symmetric banded
 SP  symmetric packed
 HE  hermitian
 HB  hermitian banded
 HP  hermitian packed
 TR  triangular
 TB  triangular banded
 TP  triangular packed
Fortran implementations use upper case, while C implementations
use lower case. For example, SGEMM and
sgemm stand for singleprecision general
matrixmatrix multiply in Fortran and C respectively.
C implementations add the prefix
cblas_.
GSL Example
The GSL implementation uses the prefix gsl_ and
prefixes its macros with GSL_.
The header files for GSL are in the gsl system
directory. To include a specific header file, we preface its name
with gsl/
#include <gsl/gsl_cblas.h>

C++ compilers mangle function identifiers by default. To suppress
identifier mangling, we wrap the header file that is being
included in a C++ application in a C linkage directive:
extern "C" {
#include <gsl/gsl_cblas.h>
}

Level 3 cblas
The following program performs a matrix multiplication
on two square matrices of userspecified size by calling
the cblas_sgemm function from the
GSL cblas library. Before the
call, this program allocates matrix memory in linearized form
and fills the operand matrices with randomvalued coefficients.
This program uses the chrono
library to determine the time for each function call in microseconds:
// Matrix Multiply using BLAS
// matMul_BLAS.cpp
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <chrono>
extern "C" {
#include <gsl/gsl_cblas.h>
}
using namespace std::chrono;
const int WIDTH = 5;
void init(float* a, int n) {
const float randf = 1.0f / (float) RAND_MAX;
for (int i = 0; i < n; i++)
a[i] = std::rand() * randf;
}
void reportTime(const char* msg, steady_clock::duration span) {
auto ms = duration_cast<milliseconds>(span);
std::cout << msg << "  took  " <<
ms.count() << " millisecs" << std::endl;
}
int main(int argc, char** argv) {
if (argc/2 != 1) {
std::cerr << argv[0] << ": invalid number of arguments\n";
std::cerr << "Usage: " << argv[0] << " size_of_matrices [output]\n";
return 1;
}
int k, n = std::atoi(argv[1]);
float* a = new float[n * n];
float* b = new float[n * n];
float* c = new float[n * n];
// initialization
std::srand(std::time(nullptr));
init(a, n * n);
init(b, n * n);
// compute matrix product
auto ts = steady_clock::now();
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n, 1.0,
a, n, b, n, 0.0, c, n);
auto te = steady_clock::now();
reportTime("matrixmatrix operation", te  ts);
// output results if user supplied 2 arguments
if (argc == 3) {
k = 0;
std::cout << std::setprecision(6) << std::fixed;
for (int i = 0; i < n; i++) {
std::cout << std::setw(5) << i+1 << ':';
for (int j = 0; j < n; j++) {
std::cout << std::setw(10) << c[k++] << ' ';
if (j % WIDTH == WIDTH  1)
std::cout << std::endl << std::setw(5) << i+1 << ':';
}
std::cout << std::endl;
}
std::cout << std::endl;
}
delete [] a;
delete [] b;
delete [] c;
}

The CblasRowMajor macro identifies the matrices
as stored in rowmajor order. The CblasNoTrans
macro directs the function not to transpose the matrix before performing
the operation. The level 3 scalar multipliers, α
and β, are assigned values 1.0 and 0.0
respectively.
The cblas library functions are available in object
form and have not been compiled for profiling. Hence, gprof
will have no access to statistics for any calls to these functions.
For timing information, we need to wrap the function calls ourselves.
chrono
C++11 added the chrono library to provide a flexible
crossplatform, timetracking utility that improves in precision over
time. The identifiers are defined in the std::chrono
namespace. The duration template models time spans.
The steady_clock class provides access to the current
time_point and is specifically designed to calculate time intervals. Its
now() class function returns the current time.
Compilation and Linking
To access the chrono library, you need
version 4.7.0 of gcc or greater.
To access version 5.2.0 on matrix, add the following alias to
your .bash_profile file:
alias g++5.2.0="/usr/local/gcc/5.2.0/bin/g++ std=c++14"

The std=c++14 option uses the C++14 language
standard (needed for chrono and nullptr).
Then, to compile and link the program use the following command:
g++5.2.0 Wall matMult_BLAS.cpp omatMult lgslcblas

The Wall option turns on all warnings.
The lgslcblas option links in the object code
from the gsl/cblas library.
Sample Results
The results for the command
are
1: 1.114390 1.481488 1.727969 1.607353 1.602096
1:
2: 1.107667 1.501759 2.036034 1.649657 1.832008
2:
3: 0.808281 1.429400 2.074915 1.836768 2.000948
3:
4: 0.720611 1.191038 1.421607 1.377677 1.756079
4:
5: 0.457838 0.796544 1.211210 0.861556 1.111302
5:

Exercises

Consider the vectors shown on the left.
Calculate the results of transforming these vectors using the coefficient
matrix on the right:
x = [1 2 1 3 2]^{T}
y = [1 2 2 1 3]^{T}

1 0 1 0 1
0 1 0 1 0
M = 0 0 1 1 1
0 0 0 1 0
1 0 0 0 1

The answers are here.

Calculate the product of the two matrices shown below:
1 0 1 0 1
0 1 0 1 0
A = 0 0 1 1 1
0 0 0 1 0
1 0 0 0 1

1 1 1 0 1
0 1 1 1 0
B = 1 0 1 1 1
0 1 0 1 0
1 0 1 0 1

The answer is here.
 Review the BLAS Home Page at www.netlib.org/blas
 Skim the documentation for the C interface to BLAS under Section 2
of Annex B at www.netlib.org/blas/blastforum/cinterface.pdf.
This is a handy reference for working with the cblas versions of the
BLAS library.
 Complete Workshop 2 on BLAS
