MCeval/progress.txt at master · djbard/MCeval · GitHub

This is where I will keep track of the timing at the various stages.

---------------------
5-min implementation
---------------------
Evaluates a Gaussian on 80 threads.
cudaMalloc: 200ms
kernel: 4us
This is obviously a very inefficient use of resources.
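
A minimal sketch of how timings like the ones above could be taken. The kernel name gauss and the helper ms_since are made up for illustration, not taken from the MCeval code. One thing worth keeping in mind: the first CUDA runtime call in a process usually also pays for context creation, so a large first cudaMalloc time is not necessarily the cost of the allocation itself. The sketch times a second allocation separately to show the difference.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread evaluates a unit Gaussian at one point.
__global__ void gauss(const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = expf(-0.5f * x[i] * x[i]);
}

// Host wall-clock milliseconds elapsed since t0.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main() {
    const int n = 80;  // 80 threads, as in the log
    float *x = nullptr, *y = nullptr;

    // First CUDA call: its wall time usually includes context creation.
    auto t0 = std::chrono::steady_clock::now();
    cudaMalloc(&x, n * sizeof(float));
    double first_malloc_ms = ms_since(t0);

    // Second allocation: closer to the cost of cudaMalloc itself.
    t0 = std::chrono::steady_clock::now();
    cudaMalloc(&y, n * sizeof(float));
    double second_malloc_ms = ms_since(t0);

    cudaMemset(x, 0, n * sizeof(float));

    // Kernel time measured with CUDA events on the GPU timeline.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    gauss<<<1, n>>>(x, y, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float kernel_ms = 0.0f;
    cudaEventElapsedTime(&kernel_ms, start, stop);

    printf("first cudaMalloc:  %.3f ms\n", first_malloc_ms);
    printf("second cudaMalloc: %.3f ms\n", second_malloc_ms);
    printf("kernel:            %.3f ms\n", kernel_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```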


---------------------
Next up: calculate chi2 of test gaussian params, compared to reference distribution.
---------------------
Each thread evaluates the chi2 of (gaussian - data point), summed over 30*30 data points (note: all data is nonsense).
cudaMalloc: 130ms
kernel: 1.3ms
still, obviously way inefficient.
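
A sketch of what a one-thread-per-walker chi2 kernel might look like. Everything specific here is an assumption for illustration: the GaussParams layout (a circular 2D Gaussian), the names gauss2d and chi2_kernel, the unweighted chi2 (unit variance per pixel), and the use of managed memory to keep the host code short.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NPIX = 30;  // 30*30 data points, as in the log

// Hypothetical parameter layout for one circular 2D Gaussian.
struct GaussParams { float amp, x0, y0, sigma; };

__host__ __device__ inline float gauss2d(const GaussParams& p, int ix, int iy) {
    float dx = ix - p.x0, dy = iy - p.y0;
    return p.amp * expf(-(dx * dx + dy * dy) / (2.0f * p.sigma * p.sigma));
}

// One thread per walker: sum squared residuals over the whole image.
__global__ void chi2_kernel(const GaussParams* walkers, const float* data,
                            float* chi2, int nwalkers) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nwalkers) return;
    GaussParams p = walkers[w];
    float sum = 0.0f;
    for (int iy = 0; iy < NPIX; ++iy) {
        for (int ix = 0; ix < NPIX; ++ix) {
            float r = gauss2d(p, ix, iy) - data[iy * NPIX + ix];
            sum += r * r;  // unweighted: assumes unit variance per pixel
        }
    }
    chi2[w] = sum;
}

int main() {
    const int nwalkers = 80;
    GaussParams* walkers = nullptr;
    float *data = nullptr, *chi2 = nullptr;
    cudaMallocManaged(&walkers, nwalkers * sizeof(GaussParams));
    cudaMallocManaged(&data, NPIX * NPIX * sizeof(float));
    cudaMallocManaged(&chi2, nwalkers * sizeof(float));

    // Reference image from made-up "true" parameters.
    GaussParams truth = {1.0f, 15.0f, 15.0f, 3.0f};
    for (int iy = 0; iy < NPIX; ++iy)
        for (int ix = 0; ix < NPIX; ++ix)
            data[iy * NPIX + ix] = gauss2d(truth, ix, iy);

    // Walker w uses a progressively wider Gaussian than the reference.
    for (int w = 0; w < nwalkers; ++w) {
        walkers[w] = truth;
        walkers[w].sigma += 0.05f * w;
    }

    chi2_kernel<<<1, nwalkers>>>(walkers, data, chi2, nwalkers);
    cudaDeviceSynchronize();
    printf("chi2[0] = %g, chi2[%d] = %g\n", chi2[0], nwalkers - 1, chi2[nwalkers - 1]);

    cudaFree(walkers);
    cudaFree(data);
    cudaFree(chi2);
    return 0;
}
```

Walker 0 matches the reference parameters, so its chi2 should come out close to zero, which makes for a quick sanity check on the kernel.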


---------------------
Now, let's do 6 gaussians. The aim here is to fit multiple gaussians to one galaxy shape, so this will be a sum of 6 gaussians with params for each specified by the walkers.
---------------------
Dumbly copying 6 sets of gaussian param vectors over to the GPU.
This fails with 'too many resources requested for launch'. I expect I've exceeded the available registers. If I comment out 2 of the gaussians, it runs.
Ah! I need to use fewer than 1024 threads/block. The problem is that the register budget is per block, so registers per thread times threads per block has to fit within it. Now using 512 threads/block. This might get tricky if I want to use shared memory later on.
cudaMalloc: 116ms
memcpy: 20us
kernel: 2.1ms
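
Rather than finding the thread limit by trial and error, the runtime can be asked how many registers a kernel uses and what the largest block size is that it can be launched with. A hedged sketch follows; the six-component kernel chi2_six and the GaussParams layout are stand-ins invented for this example, not the actual MCeval kernel, so the numbers it prints will differ from the real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NPIX = 30;
constexpr int NGAUSS = 6;  // six Gaussians per walker, as in the log

// Hypothetical parameter layout for one circular 2D Gaussian.
struct GaussParams { float amp, x0, y0, sigma; };

// One thread per walker; the model is a sum of NGAUSS Gaussians.
__global__ void chi2_six(const GaussParams* walkers, const float* data,
                         float* chi2, int nwalkers) {
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= nwalkers) return;
    const GaussParams* p = walkers + w * NGAUSS;
    float sum = 0.0f;
    for (int iy = 0; iy < NPIX; ++iy) {
        for (int ix = 0; ix < NPIX; ++ix) {
            float model = 0.0f;
            for (int g = 0; g < NGAUSS; ++g) {
                float dx = ix - p[g].x0, dy = iy - p[g].y0;
                model += p[g].amp *
                         expf(-(dx * dx + dy * dy) / (2.0f * p[g].sigma * p[g].sigma));
            }
            float r = model - data[iy * NPIX + ix];
            sum += r * r;
        }
    }
    chi2[w] = sum;
}

int main() {
    // Ask the runtime what this kernel needs and what the device offers.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, chi2_six);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("registers per thread:        %d\n", attr.numRegs);
    printf("max threads/block (kernel):  %d\n", attr.maxThreadsPerBlock);
    printf("registers per block (device): %d\n", prop.regsPerBlock);

    const int nwalkers = 1024;
    GaussParams* walkers = nullptr;
    float *data = nullptr, *chi2 = nullptr;
    cudaMallocManaged(&walkers, nwalkers * NGAUSS * sizeof(GaussParams));
    cudaMallocManaged(&data, NPIX * NPIX * sizeof(float));
    cudaMallocManaged(&chi2, nwalkers * sizeof(float));
    for (int i = 0; i < NPIX * NPIX; ++i) data[i] = 0.0f;  // nonsense data
    for (int i = 0; i < nwalkers * NGAUSS; ++i)
        walkers[i] = {1.0f, 15.0f, 15.0f, 3.0f};

    // Clamp the requested block size to what the kernel can actually launch with.
    int threads = 1024 < attr.maxThreadsPerBlock ? 1024 : attr.maxThreadsPerBlock;
    int blocks = (nwalkers + threads - 1) / threads;
    chi2_six<<<blocks, threads>>>(walkers, data, chi2, nwalkers);
    cudaError_t err = cudaGetLastError();  // reports launch failures
    printf("launch with %d threads/block: %s\n", threads, cudaGetErrorString(err));
    cudaDeviceSynchronize();

    cudaFree(walkers);
    cudaFree(data);
    cudaFree(chi2);
    return 0;
}
```

Checking cudaGetLastError() right after the launch is what surfaces errors like 'too many resources requested for launch'; without it the kernel just silently does not run.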


--------------------
does this scale? Trying 1000 calls to the function, assuming each walker step is independent. Basically, I'm running the thing 1000 times independently.
--------------------
Takes ~3.5 sec. If I comment out the kernel, it takes ~1 sec, and the same if I also remove the cudaMemcpy. If I also comment out the cudaMalloc, it takes only 0.5 sec (that is, just the host memory malloc, 1000 times).
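
Reading the numbers above: the kernel accounts for roughly 3.5 - 1 = 2.5 sec over 1000 calls (about 2.5 ms per call, in line with the 2.1 ms measured earlier), the memcpy is negligible, and the cudaMalloc accounts for roughly 1 - 0.5 = 0.5 sec. One possible follow-up, which the log does not report trying, is to allocate the device buffers once and reuse them across calls. A minimal sketch, with step_kernel as a made-up stand-in for the real evaluation:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the per-step evaluation kernel.
__global__ void step_kernel(const float* params, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = params[i] * params[i];
}

int main() {
    const int n = 512, ncalls = 1000;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_params(n, 1.0f), h_out(n, 0.0f);

    // Allocate device buffers once, outside the loop.
    float *d_params = nullptr, *d_out = nullptr;
    cudaMalloc(&d_params, bytes);
    cudaMalloc(&d_out, bytes);

    for (int c = 0; c < ncalls; ++c) {
        h_params[0] = static_cast<float>(c);  // pretend the walkers moved
        cudaMemcpy(d_params, h_params.data(), bytes, cudaMemcpyHostToDevice);
        step_kernel<<<(n + 255) / 256, 256>>>(d_params, d_out, n);
        // Blocking copy on the default stream: also waits for the kernel.
        cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    }
    printf("out[0] after last call: %f\n", h_out[0]);

    cudaFree(d_params);
    cudaFree(d_out);
    return 0;
}
```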


-------------------
Scaling with # galaxies. Instead of a single 30*30 pixel vector, I'll try replicating it to mimic doing a bunch of galaxies in parallel.
-------------------
1 gal:      3.4s  (3.4 s/gal)
10 gals:    3.2s  (0.32 s/gal)
100 gals:   5.5s  (0.055 s/gal)
1000 gals:  29.1s (0.029 s/gal)
10000 gals: 262s  (0.026 s/gal)
Note that I'm still producing nonsense evaluations. I haven't checked very carefully that the output is producing reasonable numbers, but I think it's scaling as I'd expect.
That may be the next step, to try to make reasonable outputs from reasonable inputs.

-----------------
Replacing expf with __expf is pretty nice!
----------------
1 gal:      2.4s  (2.4 s/gal)
10 gals:    2.6s  (0.26 s/gal)
100 gals:   4.0s  (0.04 s/gal)
1000 gals:  19.5s (0.0195 s/gal)
10000 gals: 175s  (0.0175 s/gal)
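
The speedup comes at a price: __expf is a reduced-accuracy hardware intrinsic, so it is worth checking how far it drifts from expf over the range of exponents the Gaussians actually see. A small sketch of such a check follows; the kernel name exp_both and the input range [-20, 0] are assumptions chosen for illustration. (Compiling with nvcc's -use_fast_math has a similar effect globally, since it maps expf to the intrinsic.)

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Evaluate both the standard and the fast intrinsic exponential per element.
__global__ void exp_both(const float* x, float* accurate, float* fast, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = expf(x[i]);
        fast[i] = __expf(x[i]);
    }
}

int main() {
    const int n = 1024;
    float *x = nullptr, *accurate = nullptr, *fast = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&accurate, n * sizeof(float));
    cudaMallocManaged(&fast, n * sizeof(float));

    // Gaussian exponents are <= 0; sample an assumed range of [-20, 0].
    for (int i = 0; i < n; ++i) x[i] = -20.0f * i / (n - 1);

    exp_both<<<(n + 255) / 256, 256>>>(x, accurate, fast, n);
    cudaDeviceSynchronize();

    // Largest relative difference between the two over the sampled range.
    float max_rel = 0.0f;
    for (int i = 0; i < n; ++i) {
        float rel = fabsf(fast[i] - accurate[i]) / accurate[i];
        if (rel > max_rel) max_rel = rel;
    }
    printf("max relative difference, __expf vs expf: %g\n", max_rel);

    cudaFree(x);
    cudaFree(accurate);
    cudaFree(fast);
    return 0;
}
```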