I tried to implement a blocked algorithm for matrix multiplication in la4j and found something interesting.

Here is the code for both algorithms:
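For illustration only, the two approaches usually look roughly like the sketch below. This is not la4j's actual code: the class name `MatMulSketch`, the method names, the use of plain square `double[][]` arrays, and the tile size of 64 are all assumptions made for the example.

```java
final class MatMulSketch {
    // Hypothetical tile size; the best value depends on cache sizes and the JVM.
    static final int BLOCK = 64;

    /** Simple triple loop: returns a * b for square n x n matrices. */
    static double[][] multiplySimple(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    /** Blocked (tiled) variant: processes BLOCK x BLOCK tiles to improve cache reuse. */
    static double[][] multiplyBlocked(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int ii = 0; ii < n; ii += BLOCK) {
            int iMax = Math.min(ii + BLOCK, n);
            for (int kk = 0; kk < n; kk += BLOCK) {
                int kMax = Math.min(kk + BLOCK, n);
                for (int jj = 0; jj < n; jj += BLOCK) {
                    int jMax = Math.min(jj + BLOCK, n);
                    // Multiply one tile of a by one tile of b, accumulating into c.
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jMax; j++) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

Both methods compute the same product; the blocked one only changes the order in which elements are visited, so any speed difference comes from memory access patterns and from how a given JIT compiler optimizes the loops.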

Here are the results (for 1024x1024 matrices):

It's very strange! Now I have just one question: what should I do? Maybe I need to implement something like a profile for la4j that chooses the better algorithm for each VM. Or should I forget about the 10-15% speedup and continue to use the simple algorithm?
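To make the "profile" idea more concrete, here is a minimal sketch of what I have in mind: time both candidates once on the running VM and keep the faster one. The class `AlgorithmChooser`, its methods, and the warm-up and run counts are hypothetical and not part of la4j. Hand-rolled timing like this is noisy on a JVM, so a real version would probably need a proper harness such as JMH.

```java
import java.util.function.BinaryOperator;

final class AlgorithmChooser {
    /** Average nanoseconds per call of op on (a, b), measured after a warm-up phase. */
    static <T> long timeNanos(BinaryOperator<T> op, T a, T b, int warmup, int runs) {
        // Warm-up calls give the JIT compiler a chance to compile the hot loops.
        for (int i = 0; i < warmup; i++) {
            op.apply(a, b);
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            op.apply(a, b);
        }
        return (System.nanoTime() - start) / runs;
    }

    /** Returns whichever of the two candidates ran faster on the sample inputs. */
    static <T> BinaryOperator<T> pickFaster(BinaryOperator<T> first,
                                            BinaryOperator<T> second,
                                            T a, T b) {
        // Assumed counts: 5 warm-up calls and 10 timed calls per candidate.
        long t1 = timeNanos(first, a, b, 5, 10);
        long t2 = timeNanos(second, a, b, 5, 10);
        return t1 <= t2 ? first : second;
    }
}
```

The chosen operator could then be cached for the lifetime of the process, so the calibration cost is paid only once per VM.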

