I've tried to implement blocking algorithm for matrices multiplication in la4j and found interesting things.
There is code of both algorithm:
There is results for (1024x1024 matrices):
It's very strange! And now I have just one question: What i should do? May be I need to implement something like profile for la4j, which will chose the better algorithm for any VM. Or I need to forgot about 10-15% speed up and continue use simple algorithm.
There is code of both algorithm:
(Simple one)
1: public Matrix multiply(Matrix matrix) throws MatrixException { 2: 3: int thisRows = this.rows; 4: int thisColumns = this.columns; 5: int thatRows = matrix.rows(); 6: int thatColumns = matrix.columns(); 7: 8: if (thisColumns != thatRows) { 9: throw new MatrixException( 10: "can not multiply this matrix: wrong dimmentions"); 11: } 12: 13: DenseMatrix result = new DenseMatrix(thisRows, thatColumns); 14: double resultArr[][] = result.toArray(); 15: 16: double thatColumn[] = new double[thatRows]; 17: 18: for (int j = 0; j < thatColumns; j++) { 19: for (int k = 0; k < thisColumns; k++) { 20: thatColumn[k] = matrix.get(k, j); 21: } 22: 23: for (int i = 0; i < thisRows; i++) { 24: double thisRow[] = self[i]; 25: double summand = 0; 26: for (int k = 0; k < thisColumns; k++) { 27: summand += thisRow[k] * thatColumn[k]; 28: } 29: resultArr[i][j] = summand; 30: } 31: } 32: 33: return result; 34: }
(Blocking one)1: public Matrix multiplyBlocks(Matrix matrix) throws MatrixException { 2: int thisRows = this.rows; 3: int thisColumns = this.columns; 4: int thatRows = matrix.rows(); 5: int thatColumns = matrix.columns(); 6: 7: if (thisColumns != thatRows) { 8: throw new MatrixException( 9: "can not multiply this matrix: wrong dimmentions"); 10: } 11: 12: DenseMatrix result = new DenseMatrix(thisRows, thatColumns); 13: 14: double resultArr[][] = result.toArray(); 15: 16: int block_size = 64; 17: 18: for (int i = 0; i<thisRows; i += block_size) { 19: for (int k = 0; k<thisColumns; k += block_size) { 20: for (int j = 0; j<thatColumns; j += block_size) { 21: for(int u = 0; u<block_size; ++u) { 22: for(int w = 0; w<block_size; ++w) { 23: for(int v = 0; v<block_size; ++v) { 24: resultArr[i+u][j+v] += self[i+u][k+w] * matrix.get(k+w, j+v); 25: } 26: } 27: } 28: } 29: } 30: } 31: 32: return result; 33: }
There is results for (1024x1024 matrices):
- Server compiler:
- Simple: 2,75 s
- Blocking: 2,4 s
- Client compiler:
- Simple: 3,91 s
- Blocking: 18,43 s
It's very strange! And now I have just one question: What i should do? May be I need to implement something like profile for la4j, which will chose the better algorithm for any VM. Or I need to forgot about 10-15% speed up and continue use simple algorithm.
No comments:
Post a Comment