I've tried to implement blocking algorithm for matrices multiplication in la4j and found interesting things.

There is code of both algorithm:

There is results for (1024x1024 matrices):

It's very strange! And now I have just one question: What i should do? May be I need to implement something like profile for la4j, which will chose the better algorithm for any VM. Or I need to forgot about 10-15% speed up and continue use simple algorithm.

There is code of both algorithm:

(Simple one)

1:publicMatrix multiply(Matrix matrix)throwsMatrixException{2: 3:intthisRows =this.rows;4:intthisColumns =this.columns;5:intthatRows = matrix.rows();6:intthatColumns = matrix.columns();7: 8:if(thisColumns != thatRows){9:thrownewMatrixException(10: "can not multiply this matrix: wrong dimmentions");11:}12: 13: DenseMatrix result =newDenseMatrix(thisRows,thatColumns);14:doubleresultArr[][]= result.toArray();15: 16:doublethatColumn[]=newdouble[thatRows];17: 18:for(intj = 0;j < thatColumns;j++){19:for(intk = 0;k < thisColumns;k++){20: thatColumn[k]= matrix.get(k,j);21:}22: 23:for(inti = 0;i < thisRows;i++){24:doublethisRow[]= self[i];25:doublesummand = 0;26:for(intk = 0;k < thisColumns;k++){27: summand += thisRow[k]* thatColumn[k];28:}29: resultArr[i][j]= summand;30:}31:}32: 33:returnresult;34:}

**
**

(Blocking one)1:publicMatrix multiplyBlocks(Matrix matrix)throwsMatrixException{2:intthisRows =this.rows;3:intthisColumns =this.columns;4:intthatRows = matrix.rows();5:intthatColumns = matrix.columns();6: 7:if(thisColumns != thatRows){8:thrownewMatrixException(9: "can not multiply this matrix: wrong dimmentions");10:}11: 12: DenseMatrix result =newDenseMatrix(thisRows,thatColumns);13: 14:doubleresultArr[][]= result.toArray();15: 16:intblock_size = 64;17: 18:for(inti = 0;i<thisRows;i += block_size){19:for(intk = 0;k<thisColumns;k += block_size){20:for(intj = 0;j<thatColumns;j += block_size){21:for(intu = 0;u<block_size;++u){22:for(intw = 0;w<block_size;++w){23:for(intv = 0;v<block_size;++v){24: resultArr[i+u][j+v]+= self[i+u][k+w]* matrix.get(k+w,j+v);25:}26:}27:}28:}29:}30:}31: 32:returnresult;33:}

There is results for (1024x1024 matrices):

- Server compiler:
- Simple: 2,75 s
- Blocking:
**2,4**s - Client compiler:
- Simple: 3,91 s
- Blocking:
**18,43 s**

It's very strange! And now I have just one question: What i should do? May be I need to implement something like profile for la4j, which will chose the better algorithm for any VM. Or I need to forgot about 10-15% speed up and continue use simple algorithm.

## No comments:

## Post a Comment