Many people have been asking them for this sort of content, and it is happening. Couldn't be more excited. Also note that it's AMD, but not officially AMD: it's being published in the open on an individual's GitHub.
Whenever I see code like this, I start to think that GPUs are uniquely unsuited for matrix multiplication.
You're pretending that each streaming multiprocessor can handle independent threads, when in reality you're feeding something that only exists once or twice per SM. It's like "independently" controlling one out of 32 cars on a 32-lane highway where the cars aren't allowed to switch lanes and the controls of one car are replicated to all the others, when in reality everyone is sitting on the same bus.
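To make that concrete, here's a toy kernel (my own sketch, not anything from the article) where the lanes of a warp disagree on a branch, so the warp has to run both paths back to back:

```cuda
__global__ void divergent(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Lanes 0-15 and 16-31 of each 32-thread warp disagree on this
    // branch, so the warp executes both paths serially, with half
    // the lanes masked off each time.
    if (threadIdx.x % 32 < 16) {
        out[i] = sinf((float)i);  // lanes 0-15 active, 16-31 idle
    } else {
        out[i] = cosf((float)i);  // lanes 16-31 active, 0-15 idle
    }
}
```

Each lane pretends to be independent, but while its path isn't executing it's just sitting on the bus.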
I'm not sure I follow. Matrix multiplication isn't inherently "branchy" in a way we'd expect to cause inefficient execution under SIMT (i.e., branch divergence).
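As an illustration (a hypothetical naive kernel, not the article's code), the inner loop of a GEMM has no data-dependent branches at all; every lane of a warp runs the same instruction on different data, which is exactly what SIMT hardware is built for:

```cuda
// Naive C = A * B for N x N row-major matrices, one thread per output
// element. Illustrative only; real kernels tile through shared memory.
__global__ void gemm_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {        // the only branch: a bounds check
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)  // identical trip count in every lane,
            acc += A[row * N + k] * B[k * N + col];  // so no divergence
        C[row * N + col] = acc;
    }
}
```

The only conditional is the edge-of-matrix bounds check, and for warps entirely inside the matrix it never diverges.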
Glad to see more articles out there using AMD hardware acceleration, especially for matrix math. More diversity in this space is welcome.