Image processing on the GPU


Hello everyone,
I would like to share a couple of tests I did on image processing with PyCUDA.
The reason I decided to use a Python wrapper is quite simple: it's a bit easier for prototyping.
The test was converting serial code that transforms a color image into a black and white image.
To be able to use PyCUDA I had to use numpy, which gave me quite a bit of trouble understanding all the conversions that were happening under the hood; for example, my 3D matrix [col][row][rgb] was getting converted into a 1D array, and so on.
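
For anyone hitting the same confusion, here is a minimal sketch of how a 3D image array maps onto the flat 1D buffer a kernel sees. It assumes a row-major (height, width, channels) layout, which is the NumPy default; the tiny array and the helper name `flat_index` are made up for illustration, and the axis order in the original script may differ.

```python
import numpy as np

# Hypothetical tiny image: 2 rows, 3 columns, 3 channels (R, G, B).
height, width, channels = 2, 3, 3
image = np.arange(height * width * channels, dtype=np.uint8).reshape(height, width, channels)

# Row-major (C order) flattening: this is the plain 1D buffer a kernel indexes into.
flat = image.ravel()

def flat_index(row, col, channel):
    """Offset of image[row, col, channel] inside the flattened buffer."""
    return (row * width + col) * channels + channel

row, col, channel = 1, 2, 0
idx = flat_index(row, col, channel)
print(idx, flat[idx], image[row, col, channel])
```

Inside a kernel the same formula is what turns a thread's (row, col) position into an offset for the red, green and blue values of its pixel.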

After some headaches I got it up and running, and I am here to show you the result and some benchmarks.
This is the resulting image:

As you can see, nothing special, pretty simple stuff to do.
My serial code on an i7 4900MQ took between 5 and 6 seconds for a ~1K image, but let's see a full log.
My GPU is a GTX 780M.


-----> Opening path : C:/Users/Marco/Desktop/jessTest.jpg
-----> Saving path : C:/Users/Marco/Desktop/jessTestLuminosity.jpg
Image size : (1440, 900)
get and convert Image data : 0.04000210762023926
Processing data : 4.67326807975769
Save image time : 0.03500199317932129
total Execution time : 4.748272180557251

-----> Saving path : C:/Users/Marco/Desktop/jessTestLuminosityCuda.jpg
Image size : (1440, 900)
get and convert Image data to gpu ready : 0.042001962661743164
allocate mem to gpu: 0.006000041961669922
Kernel execution time : 0.04200291633605957
Get data from gpu and convert : 0.010999917984008789
Save image time : 0.03200197219848633
total Execution time : 0.13300681114196777

As you can see, even though the GPU is not really being used for many operations, it still gives a nice ~36x speedup on total execution time (4.75 s vs 0.13 s in the logs above)! Most of the time was spent converting and moving data. If I were applying many filters to the same picture, the gap between CPU and GPU performance would be way bigger.
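
To make the breakdown explicit, here is a small sketch that only does arithmetic on the timings from the two logs above. The dictionary keys are shorthand labels chosen for this example, not names from the original script.

```python
# Timings in seconds, copied from the two logs above.
cpu_total = 4.748272180557251
cpu_processing = 4.67326807975769

gpu_steps = {
    "convert": 0.042001962661743164,   # get and convert image data to GPU ready
    "allocate": 0.006000041961669922,  # allocate mem to GPU
    "kernel": 0.04200291633605957,     # kernel execution
    "readback": 0.010999917984008789,  # get data from GPU and convert
    "save": 0.03200197219848633,       # save image
}
gpu_total = 0.13300681114196777

# End-to-end speedup, including loading, transfers and saving.
total_speedup = cpu_total / gpu_total
# Speedup of the processing step alone (CPU loop vs GPU kernel).
kernel_speedup = cpu_processing / gpu_steps["kernel"]
# Fraction of the GPU run that is actual kernel work.
kernel_share = gpu_steps["kernel"] / gpu_total

print(f"total speedup:  {total_speedup:.1f}x")
print(f"kernel speedup: {kernel_speedup:.1f}x")
print(f"kernel share of GPU run: {kernel_share:.0%}")
```

The kernel accounts for roughly a third of the GPU run, so the rest is conversion, transfer and file I/O. Comparing only the processing step, the gap is about three times larger than the end-to-end figure.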

If you guys are interested in more benchmarks, data (images up to 9k!) and source code, check my blog post:

That's it folks!
Let me know your thoughts about this!



Pretty cool stuff. Thanks for the example.


Very nice blog Marco, I just spent the last half hour reading through your posts. Thank you for sharing.



These are very good and impressive results! Thanks for sharing.


Thanks a lot guys!
This was just a simple test; I will try to make more meaningful tests.
I might do that in C++ though, since I would like to end up doing something in Maya and CUDA.


[QUOTE=giordi;23679]Thanks a lot guys!
This was just a simple test; I will try to make more meaningful tests.
I might do that in C++ though, since I would like to end up doing something in Maya and CUDA.[/QUOTE]

Hey, nice to have someone else tinkering with parallel computing! :smiley: I actually spent a large part of last semester playing with CUDA.
I mean, you don't exactly need to be a genius to make the connection and see the prospect (…even I could :wink: ):
Large models with lots of vertices -> large sets of simultaneous operations -> deformer with parallel computation -> speed boost beyond compare.

I was planning on doing a blog post about my experiences, but since I don't have a blog, I'll probably just use this thread…
DISCLAIMER: I am by no means an expert on GPU parallelization or C++; I am just writing up my experiences here.

Conclusion (I'll start with this for people who, like me, are lazy readers):
While it was seriously fun to toy around with, it is really hard to get a serious speed benefit out of parallel computing on the GPU unless the problem you are trying to solve is really “custom built” for that and the plugin is custom built for the GPU. I basically found that OpenMP is often a far better and “stupidly” easy to set up solution in the typical “I have something and now want to make it parallel” case. (Of course it doesn't compare to GPU parallelization, but it's good at parallelizing average tasks “on the fly”.)
I will basically only look at GPU parallelization again when
A: I have some massive data to crunch which doesn't need any user interaction while running
B: Shared data architectures arrive (they probably have the power to make GPU parallelization really shine…)

And that doesn't seem to be only my experience. The developers of the Fabric Engine do the same. We had a workshop with one of the developers, and he said they are basically only wrapping CPU parallelization until shared memory architectures arrive, because the memory copy bottleneck makes it slower than without the GPU.
Apart from any parallelization, when you see how well KL compiled tools perform, it's a miracle what code optimization à la LLVM can do.
(LLVM or not, writing high performance C++ code is probably the one thing that sets TDs apart from “real” programmers…)

My experience:
I wrote a Verlet cloth solver as a Maya C++ plugin that lets you choose to run either serial, OpenMP parallelized, or on the GPU using CUDA.
The fastest of these is by far the OpenMP version. CUDA doesn't make any sense at all, performance wise.
The reasons for that are the following:
1. The Verlet algorithm isn't a particularly genius pick for GPU computation. Not all of the computation spreads over the vertices: as soon as you integrate constraints you have to iterate over edges, and you need to iterate over them serially (even on the GPU inside a kernel). Experience teaches you…
2. While I tried to optimize the memory copying as much as possible, dividing it into static and dynamic data, with static solver data only copied to the GPU once on the initial frame, it nevertheless proved to be a serious speed killer. You will almost never end up in a situation with no dynamic data that needs to be uploaded on each step of the computation. Just think about keyframe animation of parameters… A buddy here from the academy did a sand solver for Max and he gets great speed benefits, but as far as I know he copies the data once on the initial frame and doesn't allow any animation at all.
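
To illustrate point 1, here is a minimal NumPy sketch of the two halves of a Verlet cloth step. It is not the plugin's code; the function names, the three-particle chain and the iteration count are assumptions made up for this example. The integration treats every vertex independently, while the constraint pass walks the edges one after another because edges that share a vertex depend on each other's result.

```python
import numpy as np

def verlet_step(pos, prev, accel, dt):
    """Position Verlet: each vertex is independent, so this part parallelizes trivially."""
    new_pos = 2.0 * pos - prev + accel * dt * dt
    return new_pos, pos

def satisfy_constraints(pos, edges, rest_lengths, iterations=20):
    """Distance constraints, relaxed edge by edge (sequential, Gauss-Seidel style)."""
    pos = pos.copy()
    for _ in range(iterations):
        for (a, b), rest in zip(edges, rest_lengths):
            delta = pos[b] - pos[a]
            dist = np.linalg.norm(delta)
            if dist == 0.0:
                continue
            # Move both end points half of the error along the edge.
            correction = 0.5 * (dist - rest) / dist * delta
            pos[a] += correction
            pos[b] -= correction
    return pos

# Hypothetical chain of three particles, stretched to twice its rest length.
positions = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
edges = [(0, 1), (1, 2)]
rest_lengths = [1.0, 1.0]

relaxed = satisfy_constraints(positions, edges, rest_lengths)
edge_lengths = [float(np.linalg.norm(relaxed[b] - relaxed[a])) for a, b in edges]

# One integration step under gravity, starting at rest.
gravity = np.array([0.0, -9.81, 0.0])
stepped, previous = verlet_step(positions, positions.copy(), gravity, dt=0.1)
```

The edge (0, 1) update changes vertex 1, which the edge (1, 2) update then reads, and that dependency is what forces the inner loop to run in order.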

I also did the same test that you did with the image operation, only in C++, and CUDA was actually on par, if anything slightly slower (only tested on a 2k image though…).
It would be interesting if you also ran your tests with the operation running serially within a kernel, just to make sure the difference is really a result of parallelization (and you are not comparing a solution programmed in Python with wrapped algorithms, for example).
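
The CPU side of that concern can be checked without a GPU. Below is a rough sketch that runs the same grayscale arithmetic once as a per-pixel Python loop and once through NumPy's compiled code, so any difference comes from interpreter overhead and not from parallelization. The weights are an assumed luminosity formula (the ones used in the original script are not stated), and the function names and the random 64x64 test image are made up for this example.

```python
import time
import numpy as np

# Assumed luminosity weights for R, G, B; the weights used in the post are not stated.
WEIGHTS = np.array([0.21, 0.72, 0.07])

def grayscale_python_loop(image):
    """Per-pixel loop executed by the Python interpreter."""
    height, width, _ = image.shape
    out = np.empty((height, width))
    for row in range(height):
        for col in range(width):
            r, g, b = image[row, col]
            out[row, col] = WEIGHTS[0] * r + WEIGHTS[1] * g + WEIGHTS[2] * b
    return out

def grayscale_vectorized(image):
    """Same arithmetic, but the loop runs inside NumPy's compiled code."""
    return image @ WEIGHTS

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(np.float64)

start = time.perf_counter()
loop_result = grayscale_python_loop(image)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vectorized_result = grayscale_vectorized(image)
vectorized_seconds = time.perf_counter() - start

print(f"python loop: {loop_seconds:.6f} s, vectorized: {vectorized_seconds:.6f} s")
```

If the vectorized CPU version already closes most of the gap to the kernel time, the headline speedup says more about Python loops than about the GPU.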

The second thing I did last semester was playing with the Python C API, understanding how to wrap my C++ in Python (for speed critical parts) and how Python works under the hood.
While I was at it, I also wrapped some kernels as a test, so that you could add up lists on the GPU or do the Houdini sine deform example on the GPU.
I mean, I do these things like a TD, not like a real programmer, but here is what I discovered (…I always discover things which seem so logical afterwards…):
Wrapping CUDA for Python not only gives you the memory copy issue, which takes time for infrastructural work, but also the work of extracting the data out of Python's own structures.
For example, adding two lists on the GPU:
1. You need to create two arrays out of the given PyObject*. Whenever I tried to parallelize that (OpenMP), it failed. I think the Python API is not thread-safe for this task.
2. So that part is basically serial, and its cost grows linearly with the amount of data… but on the other hand, there's no point in parallelizing on the GPU if the amount of data is not massive.
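
The extraction cost is also visible from the Python side. As a rough sketch (using NumPy as a stand-in for the hand-written C API code, with list sizes picked arbitrarily for this example), converting two Python lists into contiguous float32 buffers has to visit every element before any upload or addition can start:

```python
import time
import numpy as np

# Two plain Python lists, as a caller of a wrapped "add lists" function might pass them.
values_a = [float(i) for i in range(100_000)]
values_b = [float(2 * i) for i in range(100_000)]

# Step 1: unpack the Python float objects into contiguous float32 buffers.
start = time.perf_counter()
array_a = np.asarray(values_a, dtype=np.float32)
array_b = np.asarray(values_b, dtype=np.float32)
convert_seconds = time.perf_counter() - start

# Step 2: the actual work, here done on the CPU in place of a GPU kernel.
start = time.perf_counter()
total = array_a + array_b
add_seconds = time.perf_counter() - start

print(f"list -> array: {convert_seconds:.6f} s, add: {add_seconds:.6f} s")
```

Keeping the data in array form from the start avoids paying for step 1 on every call.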

Phew, long post…
But anyhow, if you are interested in programming, I think GPU parallelization is a particularly fascinating area.
And if you do it right and pick the right problem to solve, you might be able to get a great performance gain.
But to use a quote from the KL workshop: “Parallelization is almost never efficient when added on top of a program later on”… this might be doubly true for parallelization on the GPU.


I had a little free time recently, so I finally had a go at building on Marco’s script and adding a Halide variant for comparison.

Thanks again to Marco for sharing his original tests.




Interesting. Are the Halide Python bindings available as a binary distribution?



I don’t think so. There are binary distributions of Halide for Linux/OSX/Windows, but they don’t appear to include the Python bindings.

Compiling Halide and the Python bindings from the GitHub repository was pretty straightforward; the README covered everything needed.




Hm. I just tried. And as with every single CMake project I have tried so far… it failed miserably. Did you compile on Windows? I started with a whopping 3 succeeded projects and 235 failed ones, heh. The CMake generation did not report any errors.



Sorry, I’ve not tried compiling it on Windows, I’m strictly Linux (Fedora 17) these days.

I scanned through the Halide mailing list archives and it looks like the Windows build has been unstable for the last six months or so. At least the Halide devs are aware of the issue and plan to address it.




Thanks. Thought so. Will monitor it then, as I currently lack the time to start digging myself, sadly.