There are neural network applications where the resources, for example, compute power, memory space, battery power etc., are limited. In such instances, we may want to simplify a trained network to better deal with available resources.
While simplifying, we obviously would like to maintain the network accuracy as much as possible. Such a network simplification is termed compression of deep neural networks.
In this three-part blog posts, I am going to share with you some simple examples of neural network compression to help you better understand compressing deep neural nets.
In Part 1, I will begin with a technique, known as singular value decomposition (SVD), which is the basis of compression methods. Part 2 will show compression of feedforward neural networks and Part 3 will demonstrate compression of convolution layers in convolutional neural networks (CNNs).
Singular Value Decomposition (SVD)
Let’s begin our discussion on SVD by recalling that the rank of a matrix is the number of independent columns or rows. For example, the matrix
has rank 2 as its both columns are independents, i.e. one column/row is not a scaled version of the other column. On the other hand, the following matrix is a matrix of rank 1 as the second column/row is same as the first column multiplied by 2.
Given a matrix, sometimes you may want to obtain its approximation. Use of such approximations in common in numerous tasks such as dimensionality reduction, information retrieval and text mining, and in image processing. The singular value decomposition (SVD) is one such method for obtaining low rank approximations of a given matrix. Another popular matrix decomposition method is non-negative matrix factorization (NMF). You can read about it in my earlier blog post. In SVD, a real-valued matrix A of size m x n is factorized as
where U is an orthogonal matrix of size m x m of left singular vectors and V is an orthogonal matrix of size n x n of right singular vectors. The matrix D is a diagonal matrix of size m x n of singular values. A low rank approximation to matrix A of rank r is obtained by using only a subset of singular values and the corresponding left and right singular vectors as given by the following expression. In other words, the approximation is obtained by the weighted sum of rank one matrices.
Let me give you an example by decomposing a matrix and obtaining its different approximations.
This code creates a 4×3 matrix A and decomposes it to obtain U and V matrices along with an array d of singular values as shown below. Note that the array of d singular values represents the diagonal elements of matrix D of the SVD equation given earlier. In this case, there are three non-zero singular values; this means that the rank of matrix A is three.
[[-0.82 -0.28 0.21 -0.46]
[-0.16 0.2 -0.93 -0.26]
[-0.24 -0.62 -0.28 0.69]
[-0.5 0.7 0.1 0.5 ]]
[10.51 5.06 2.63]
[[-0.56 -0.77 -0.32]
[-0.83 0.54 0.16]
[ 0.05 0.36 -0.93]]
Given U, V, and d, we can get back our original matrix A without any error by performing the matrix multiplication UDV^t.
[[6. 6. 2.]
[0. 1. 3.]
[4. 0. 1.]
[0. 6. 2.]]
The reconstructed matrix is identical to the original matrix. Suppose we reconstruct matrix A using only the first two singular values. We do that by replacing ‘len(d)’ above with ‘len(d)-1’ and get the following approximated matrix A.
[[ 5.97 5.8 2.51]
[ 0.12 1.87 0.72]
[ 4.04 0.26 0.31]
[-0.01 5.91 2.25]]
Since this approximation is obtained by summing two rank one matrices, you can easily verify the rank of this matrix as 2. You can also calculate the approximation error by comparing the original and the reconstructed matrix element by element , squaring the difference, and adding them, as shown below. This measure of comparing two matrices is known as Frobenius norm.
Approximation Error = 6.904768216213785
If you look at the approximation error, you will find that it equals square of 2.63, the singular value not used in reconstructing the approximation. You can also approximate matrix A by using only one singular value. We do this by replacing ‘len(d)’ with ‘len(d)-2’ and get the following approximation.
[[4.79 6.57 2.75]
[0.96 1.32 0.55]
[1.43 1.95 0.82]
[2.92 4.01 1.67]]
You can verify that the error in this case would be the sum of squares of the second and third singular values not used in creating approximation.
The singular value decomposition described above creates an approximate matrix whose size remains the same. In some cases, the matrix A to be decomposed is treated as a data matrix with each row representing an observation and each column representing an observed feature. In such a scenario, the objective is not only to approximate but also to achieve a reduction in the size of the approximated matrix, i.e. the resulting matrix has same number of rows (observations) but a fewer number of columns. Thus, the matrix approximation is aimed at achieving dimensionality reduction similar to PCA. The use of SVD in this context is known as truncated SVD. To obtain a reduced approximation matrix, we use only top k < r singular values and the first k columns of the U matrix and calculate the reduced representation as
is the sub-matrix formed by the first k columns of U and
is the sub-matrix of D formed by first k rows and columns.
An example of calculating the truncated SVD is shown below.
The Scikit-learn library also provides an implementation of truncated SVD as shown below.
Before concluding this post, let me show you two examples of SVD applications. The first example is of image compression wherein I have performed singular value decomposition on an image and reconstructed it using 32, 16, and 8 top singular values. The results of this reconstruction along with the original image are shown below. As you can see, the first two approximations provide a good compression without satisfying much of the image quality.
The second example is of truncated SVD application. In this case, the truncated SVD has been used to reduce the Iris data matrix to a two-dimensional space, thus doing dimensionality reduction.
The dimensionality reduction shows that Iris data can be well separated in the transformed two-dimensional space.
With the above introduction to SVD, we are ready to move on to compressing neural networks and convolutional neural networks in the coming blogs.