The unrolled version of the multiply code gives about a 1.1x speedup on C1 (the client compiler).
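For illustration, here is a minimal sketch of what unrolling the inner loop by hand can look like. It is not the code that was measured: the class name `UnrolledMultiply`, the i-j-k loop order, the unroll factor of 4, and the use of square `double[][]` matrices are all assumptions made for the example.

```java
public class UnrolledMultiply {
    // Multiplies two square n x n matrices (assumed layout: double[row][col]).
    // The innermost k loop is unrolled by a factor of 4 (an assumed factor).
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            double[] ai = a[i];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                int k = 0;
                // Unrolled body: four multiply-adds per iteration.
                for (; k + 3 < n; k += 4) {
                    sum += ai[k] * b[k][j];
                    sum += ai[k + 1] * b[k + 1][j];
                    sum += ai[k + 2] * b[k + 2][j];
                    sum += ai[k + 3] * b[k + 3][j];
                }
                // Remainder loop for sizes that are not a multiple of 4.
                for (; k < n; k++) {
                    sum += ai[k] * b[k][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }
}
```

The products are added to `sum` in the same order as in the plain loop, so the result should match the non-unrolled version exactly; the only difference is fewer loop-condition checks per multiply-add.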
On Atom this code runs very slowly (~10 sec) for 1000 x 1000 matrices. On my laptop (Nehalem) it runs faster (~3 sec) for the same size.
Next I have to implement a blocking (tiling) technique for the multiplication.
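As a rough sketch of what blocking could look like (not a finished or measured implementation): the matrices are processed in square tiles so that each tile stays in cache while it is reused. The class name `BlockedMultiply`, the `blockSize` parameter, and the loop order are assumptions for the example; a good block size depends on the cache of the target CPU and would need to be found by measurement.

```java
public class BlockedMultiply {
    // Multiplies two square n x n matrices tile by tile.
    // blockSize is a hypothetical tuning parameter (tile edge length).
    static double[][] multiply(double[][] a, double[][] b, int blockSize) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int ii = 0; ii < n; ii += blockSize) {
            int iMax = Math.min(ii + blockSize, n);
            for (int kk = 0; kk < n; kk += blockSize) {
                int kMax = Math.min(kk + blockSize, n);
                for (int jj = 0; jj < n; jj += blockSize) {
                    int jMax = Math.min(jj + blockSize, n);
                    // Multiply the (ii, kk) tile of a by the (kk, jj) tile of b
                    // and accumulate into the (ii, jj) tile of c.
                    for (int i = ii; i < iMax; i++) {
                        double[] ci = c[i];
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            double[] bk = b[k];
                            for (int j = jj; j < jMax; j++) {
                                ci[j] += aik * bk[j];
                            }
                        }
                    }
                }
            }
        }
        return c;
    }
}
```

The `Math.min` bounds let the sketch handle sizes that are not a multiple of `blockSize`. A call such as `BlockedMultiply.multiply(a, b, 64)` would be one starting point to try for the 1000 x 1000 case, with the value 64 being only a guess.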