Saturday, September 24, 2011

Blocking techique

I've tried to implement blocking algorithm for matrices multiplication in la4j and found interesting things.



There is code of both algorithm:

(Simple one)

1: public Matrix multiply(Matrix matrix) throws MatrixException {
 2: 
 3:   int thisRows = this.rows;
 4:   int thisColumns = this.columns;
 5:   int thatRows = matrix.rows();
 6:   int thatColumns = matrix.columns();
 7: 
 8:   if (thisColumns != thatRows) {
 9:    throw new MatrixException(
10:      "can not multiply this matrix: wrong dimmentions");
11:   }
12: 
13:   DenseMatrix result = new DenseMatrix(thisRows, thatColumns);
14:   double resultArr[][] = result.toArray();
15: 
16:   double thatColumn[] = new double[thatRows];
17: 
18:   for (int j = 0; j < thatColumns; j++) {
19:    for (int k = 0; k < thisColumns; k++) {
20:     thatColumn[k] = matrix.get(k, j);
21:    }
22: 
23:    for (int i = 0; i < thisRows; i++) {
24:     double thisRow[] = self[i];
25:     double summand = 0;
26:     for (int k = 0; k < thisColumns; k++) {
27:      summand += thisRow[k] * thatColumn[k];
28:     }
29:     resultArr[i][j] = summand;
30:    }
31:   }
32: 
33:   return result;
34:  }

(Blocking one)

1: public Matrix multiplyBlocks(Matrix matrix) throws MatrixException {
 2:   int thisRows = this.rows;
 3:   int thisColumns = this.columns;
 4:   int thatRows = matrix.rows();
 5:   int thatColumns = matrix.columns();
 6: 
 7:   if (thisColumns != thatRows) {
 8:    throw new MatrixException(
 9:      "can not multiply this matrix: wrong dimmentions");
10:   }
11: 
12:   DenseMatrix result = new DenseMatrix(thisRows, thatColumns);
13: 
14:   double resultArr[][] = result.toArray();
15:   
16:   int block_size = 64;
17: 
18:   for (int i = 0; i<thisRows; i += block_size) {
19:    for (int k = 0; k<thisColumns; k += block_size) {
20:     for (int j = 0; j<thatColumns; j += block_size) {
21:      for(int u = 0; u<block_size; ++u) {
22:       for(int w = 0; w<block_size; ++w) {
23:        for(int v = 0; v<block_size; ++v) {
24:         resultArr[i+u][j+v] += self[i+u][k+w] * matrix.get(k+w, j+v);
25:        }
26:       }
27:      }
28:     }
29:    }
30:   }
31: 
32:   return result;
33:  }


There is results for (1024x1024 matrices):

  • Server compiler:
    • Simple: 2,75 s
    • Blocking: 2,4 s
  • Client compiler:
    • Simple: 3,91 s
    • Blocking:  18,43 s

It's very strange! And now I have just one question: What i should do? May be I need to implement something like profile for la4j, which will chose the better algorithm for any VM. Or I need to forgot about 10-15% speed up and continue use simple algorithm.  

No comments:

Post a Comment