Saturday, June 18, 2016

iOS Metal compute pipeline slower than CPU implementation for search task


I ran a simple experiment: a naive character-search algorithm scanning 1,000,000 rows of 50 characters each (a 50-million-character map), implemented on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).

The CPU implementation uses a simple linear loop; the Metal implementation gives each kernel invocation one row to process (source code linked below).
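To make the setup concrete, here is a minimal Swift sketch of what such a naive CPU row search might look like. It is only an illustration under assumed names (`database`, `rowLength`, `needle`); the actual implementation is in the gists linked below.

    import Foundation

    // Hypothetical naive CPU search: scan every row of a flat character buffer
    // for the first occurrence of a needle. Names and data layout are assumptions,
    // not taken from the linked gist.
    func searchRows(database: [UInt8], rowLength: Int, needle: [UInt8]) -> Int? {
        let rowCount = database.count / rowLength
        for row in 0..<rowCount {
            let rowStart = row * rowLength
            guard rowLength >= needle.count else { continue }
            // Slide the needle across the row, comparing byte by byte.
            for offset in 0...(rowLength - needle.count) {
                var match = true
                for i in 0..<needle.count where database[rowStart + offset + i] != needle[i] {
                    match = false
                    break
                }
                if match { return row }   // index of the first matching row
            }
        }
        return nil
    }

    // Example: 1,000,000 rows of 50 characters, all 'a', with a needle that won't match.
    let rowLength = 50
    let rows = 1_000_000
    let database = [UInt8](repeating: UInt8(ascii: "a"), count: rows * rowLength)
    let needle = Array("xyz".utf8)
    print(searchRows(database: database, rowLength: rowLength, needle: needle) ?? "not found")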

To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (using 1 core), and 3-4 times slower if I employ 2 cores (each searching half of the database)! I experimented with different threads-per-threadgroup values (16, 32, 64, 128, 512) yet still get very similar results.
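For context, the threads-per-threadgroup value is supplied on the compute command encoder at dispatch time. The following Swift sketch shows how that configuration might look with one thread per row; the identifiers (`encoder`, `rowCount`, `threadsPerGroup`) are placeholders, and the real host code is in the gist linked below.

    import Metal

    // Hypothetical dispatch configuration: one thread per row, grouped into
    // threadgroups of `threadsPerGroup` threads. Buffer and pipeline setup omitted.
    func dispatchSearch(encoder: MTLComputeCommandEncoder, rowCount: Int, threadsPerGroup: Int) {
        let threadsPerThreadgroup = MTLSize(width: threadsPerGroup, height: 1, depth: 1)
        let threadgroupCount = MTLSize(width: (rowCount + threadsPerGroup - 1) / threadsPerGroup,
                                       height: 1,
                                       depth: 1)
        // Varying threadsPerGroup (16, 32, 64, 128, 512) changes only these two values.
        encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadsPerThreadgroup)
    }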

iPhone 6:

CPU 1 core:  approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)

I can see the Metal shader spending more than 90% of its time accessing memory (see the capture below).

What can be done to optimise it?

Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory-access internals and trade-offs specific to the Metal framework.

METAL IMPLEMENTATION DETAILS:

Host code gist: https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d

Kernel (shader) code: https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126

GPU frame capture profiling results:

[GPU frame capture screenshot]

