On macOS 10.12, for thea

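First I checked: CUDA only supports NVIDIA graphics cards, so I had to give that up. I switched to the gpuarray backend instead. It has no release yet, only development versions.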

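According to the official site, you first need to install cmake, cython, nose and a few other tools and Python libraries. I had already installed cmake, and since I use Anaconda, the Python libraries were all there too. Really convenient.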

Now let's start installing:

# As it turns out later, this is actually a big pitfall!
git clone https://github.com/Theano/libgpuarray.git
cd libgpuarray

mkdir Build
cd Build

cmake .. -DCMAKE_BUILD_TYPE=Release
make
make install
cd ..

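This step looks simple enough.
make install creates a gpuarray/ directory under /usr/local/include/, which holds header files needed by the builds below. It also creates libgpuarray.dylib (the dynamic library) and libgpuarray-static.a (its static counterpart) under libgpuarray/lib. Both turn out to matter later.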
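Before moving on, it can save time to confirm that the files mentioned above really exist. The following is a minimal sketch, not part of the official instructions: the helper name missing_paths is my own, and the relative lib/ paths assume the script is run from the libgpuarray checkout.

```python
import os


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Paths taken from the text above; the lib/ entries are relative to the
# libgpuarray checkout (assumption: this is run from that directory).
expected = [
    "/usr/local/include/gpuarray",
    os.path.join("lib", "libgpuarray.dylib"),
    os.path.join("lib", "libgpuarray-static.a"),
]

for path in missing_paths(expected):
    print("missing:", path)
```

If nothing is printed, all three paths were found.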

Next, install pygpu. Note that you may first need to change the two variables include_dirs and library_dirs in setup.py as follows:

include_dirs = ["/usr/local/include", np.get_include()]
library_dirs = ["lib"]

Otherwise the build may complain that it cannot find the header files or the dynamic library. Then run:

python setup.py build
python setup.py install

With that, pygpu is installed.
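To see where the package actually landed (the install directory becomes relevant further down), something like the sketch below may help. It only uses the standard importlib.util.find_spec; the helper name module_origin is made up for illustration. Note that it locates the package without importing it, so it does not prove that the compiled extension loads.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be loaded from, or None if not installed."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


for name in ("pygpu", "theano"):
    print(name, "->", module_origin(name))
```

A None here means Python cannot see the package at all, which usually points at the wrong interpreter or environment.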

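The next step is to test whether the GPU works properly.
Create the following check1.py file. What it does is simple: it computes exp of every element of a random array of length vlen.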

### check1.py
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')

Test CPU performance:

THEANO_FLAGS=device=cpu python check1.py

Result:

[Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
Looping 1000 times took 3.219283 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the cpu
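Since the device is selected purely through the THEANO_FLAGS environment variable, the same script can also be launched from Python with different devices. This is a rough sketch under my own naming (theano_env and run_check are hypothetical helpers). It overwrites any THEANO_FLAGS already set rather than merging with it.

```python
import os
import subprocess
import sys


def theano_env(device, base_env=None):
    """Copy an environment and set THEANO_FLAGS=device=<device> in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = "device=" + device
    return env


def run_check(device, script="check1.py"):
    """Run the script once under the given device and return its exit code."""
    return subprocess.call([sys.executable, script], env=theano_env(device))
```

Calling run_check("cpu") is then equivalent to the command line above.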

Test GPU performance:

THEANO_FLAGS=device=opencl0:1 python check1.py

This produces the following error:

Wrong major API version for gpuarray:-9997 Make sure Theano and libgpuarray/pygpu are in sync.

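It looks like a version mismatch. After a lot of googling I found the cause: what I had just installed from GitHub was the latest gpuarray, while my Theano was 0.8.2, which was probably no longer the latest. So I had to upgrade Theano: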

pip install --upgrade --no-deps git+git://github.com/Theano/Theano.git

With Theano upgraded, I ran the command above again and still hit a problem:

clang++ -dynamiclib -g -O3 -fno-math-errno -Wno-unused-label -Wno-unused-variable -Wno-write-strings -march=haswell -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m64 -fPIC -undefined dynamic_lookup -I/Users/flybywind/anaconda3/lib/python3.5/site-packages/pygpu-0.2.1-py3.5-macosx-10.6-x86_64.egg/pygpu -I/Users/flybywind/anaconda3/lib/python3.5/site-packages/numpy/core/include -I/Users/flybywind/anaconda3/lib/python3.5/site-packages/numpy/core/include -I/Users/flybywind/anaconda3/include/python3.5m -I/Users/flybywind/anaconda3/lib/python3.5/site-packages/theano/gof -L/Users/flybywind/anaconda3/lib -fvisibility=hidden -o /Users/flybywind/.theano/compiledir_Darwin-16.0.0-x86_64-i386-64bit-i386-3.5.2-64/tmppv5z0wy8/mb4366a5a742592cc8864699a71f9f43c.so /Users/flybywind/.theano/compiledir_Darwin-16.0.0-x86_64-i386-64bit-i386-3.5.2-64/tmppv5z0wy8/mod.cpp -lgpuarray
/Users/flybywind/.theano/compiledir_Darwin-16.0.0-x86_64-i386-64bit-i386-3.5.2-64/tmppv5z0wy8/mod.cpp:4:10: fatal error: 'gpuarray/array.h' file not found

This error is similar to the earlier one. I couldn't be bothered to track down where the -I flags are set, so I simply copied /usr/local/include/gpuarray into /Users/flybywind/anaconda3/lib/python3.5/site-packages/pygpu-0.2.1-py3.5-macosx-10.6-x86_64.egg/pygpu.

Then I ran it again, and it failed again, this time with:
ld: library not found for -lgpuarray
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Same trick again: I copied the dynamic library from the libgpuarray/lib directory mentioned above into /Users/flybywind/anaconda3/lib.
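For anyone who would rather see which directories the compiler really searched, the -I and -L entries can be read straight off the logged command line. The sketch below is only an illustration: search_dirs and dirs_containing are names I made up, and the sample command uses placeholder paths, not the ones from my log.

```python
import os
import shlex


def search_dirs(command, flag):
    """Return the directories passed with `flag` ("-I" or "-L") in a compiler command."""
    dirs = []
    args = shlex.split(command)
    for i, arg in enumerate(args):
        if arg == flag and i + 1 < len(args):
            dirs.append(args[i + 1])        # form: -I /some/dir
        elif arg.startswith(flag) and len(arg) > len(flag):
            dirs.append(arg[len(flag):])    # form: -I/some/dir
    return dirs


def dirs_containing(dirs, relative_path):
    """Return the subset of `dirs` in which `relative_path` exists."""
    return [d for d in dirs if os.path.exists(os.path.join(d, relative_path))]


# Placeholder command for illustration; paste the real one from the log instead.
cmd = "clang++ -I/a/include -I /b/include -L/a/lib -o out.so mod.cpp -lgpuarray"
print(dirs_containing(search_dirs(cmd, "-I"), "gpuarray/array.h"))
print(dirs_containing(search_dirs(cmd, "-L"), "libgpuarray.dylib"))
```

An empty list for either line means none of the searched directories holds the file, which is exactly the situation behind the two errors above.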

I tried again, and it finally worked:

Mapped name None to device opencl0:1: Iris
PCI Bus ID: (unsupported for device opencl0:1)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.042960 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the gpu
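To put the two timings side by side, here is a small worked calculation using the numbers from the logs above (device=cpu versus device=opencl0:1). The helper names are mine, and a single run like this is only a rough indication.

```python
def speedup(baseline_seconds, seconds):
    """How many times faster `seconds` is compared with `baseline_seconds`."""
    return baseline_seconds / seconds


def per_call_ms(total_seconds, iters):
    """Average time of one call in milliseconds."""
    return total_seconds / iters * 1000.0


cpu_total = 3.219283  # device=cpu, from the log above
gpu_total = 1.042960  # device=opencl0:1, from the log above
iters = 1000          # loop count in check1.py

print("speedup: %.2fx" % speedup(cpu_total, gpu_total))
print("cpu: %.3f ms/call, gpu: %.3f ms/call"
      % (per_call_ms(cpu_total, iters), per_call_ms(gpu_total, iters)))
```

That works out to roughly a threefold speedup. The GPU graph also includes a HostFromGpu step, so the GPU figure already contains the cost of copying the result back.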

OK, two questions remain:

What exactly does opencl0:1 mean?

Device specifiers are composed of the type string and the device id like so:
"cuda0"
"opencl0:1"
For opencl the device id is the platform number, a colon (:) and the device number. There is no widespread and/or easy way to list available platforms and devices. You can experiment with the values; unavailable ones will just raise an error, and there are no gaps in the valid numbers.

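In other words, opencl denotes the type, just like cuda. For opencl, though, you also have to specify the platform number and the device number, separated by ":". The numbers are contiguous, so you can simply try both starting from 0 [source]. The platform is usually 0, so I tried 0:0 first and found that something was off: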

Mapped name None to device opencl0:0: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
PCI Bus ID: (unsupported for device opencl0:0)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.459022 seconds
Result is [ 1.23178029  1.61879325  1.52278078 ...,  2.20771813  2.29967737
  1.62323272]
Used the gpu

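This is interesting! The time did get shorter, the graph shows GpuElemwise, and the check at the end of the script also reports the GPU, yet the device shown is the CPU. It looks like some kind of hybrid... Presumably OpenCL lists the CPU itself as a device, so the "GPU" code path here is really running on the CPU through OpenCL.
So I tried 0:1, and this time it was finally right: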

Mapped name None to device opencl0:1: Iris
PCI Bus ID: (unsupported for device opencl0:1)
[GpuElemwise{exp,no_inplace}(<GpuArrayType<None>(float32, (False,))>), HostFromGpu(gpuarray)(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 1.042960 seconds
Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
  1.62323284]
Used the gpu

Now the device shown is correct too, and the time is shorter still!
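The trial and error above can be written down as a loop. The sketch below is deliberately generic: opencl_specifiers just builds the strings, and probe calls whatever init function you hand it. Both names are mine. With pygpu installed, init could be something like lambda s: pygpu.init(s).devname, but treat that call and attribute as an assumption to check against your pygpu version.

```python
def opencl_specifiers(platforms, devices):
    """Yield "opencl<p>:<d>" for every platform/device pair in the given ranges."""
    for p in range(platforms):
        for d in range(devices):
            yield "opencl%d:%d" % (p, d)


def probe(specs, init):
    """Call init(spec) for each specifier; map spec to the result or to the error text."""
    found = {}
    for spec in specs:
        try:
            found[spec] = init(spec)
        except Exception as exc:  # unavailable devices are expected to raise
            found[spec] = "error: %s" % exc
    return found


# The candidates tried above: platform 0, devices 0 and 1.
print(list(opencl_specifiers(1, 2)))
```

Because the valid numbers have no gaps, the first error for a given platform means there are no further devices on it.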

What does PCI Bus ID: (unsupported for device opencl0:1) mean?
With cuda, the latest gpuarray can display the PCI bus ID:

Mapped name None to device cuda: GeForce 840M;

PCI Bus ID: 0000:0A:00.0

With opencl, though, that is simply what you get. So, the end! I have finally solved all the problems.