Azure GPU VM Configuration
- GPU Info
  - Configuration: 24 vCPUs, 448 GB memory, and 4 NVIDIA Tesla V100 cards
  - Price: 85442.18 HKD per month, or 117.04 HKD per hour
- Windows System
  - Processor: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz 2.59 GHz (2 processors)
  - Installed memory (RAM): 448 GB
  - System type: 64-bit Operating System, x64-based processor
  - Pen and Touch: No Pen or Touch Input is available for this Display
  - Windows Product ID: 00430-00000-00000-AA040
  - Windows edition: Windows Server 2019 Datacenter
- Azure Page
  - Location: East US
  - Subscription ID: 712cbc67-a904-48cf-83e6-05c0629ce9f3
  - Operating system: Windows (Windows Server 2019 Datacenter)
  - Size: Standard NC24rs_v3 (24 vcpus, 448 GiB memory)
  - Public IP address: 13.92.214.100
  - Virtual network/subnet: GPU-vnet/default
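As a quick sanity check on the quoted price, the monthly figure is roughly the hourly rate multiplied by a full billing month (a sketch only; the 730-hour month is my assumption, not an Azure-quoted number):

```python
# Back-of-the-envelope check of the NC24rs_v3 pricing quoted above.
hourly_hkd = 117.04
hours_per_month = 730  # assumed ~730-hour billing month
print(hourly_hkd * hours_per_month)  # ~85439 HKD, close to the quoted 85442.18 HKD/month
```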
Install Azure VM GPU Drivers
- Install the Drivers
 - Allow the IE browser on Windows Server 2019 Datacenter to download Chrome:
   Tools button (top right) -> Internet options -> Security -> Internet -> Custom level -> Downloads: (1) File download -> Enable (2) Font download -> Enable
 - Download and install the NVIDIA Tesla (CUDA) drivers: the Windows Server 2019 driver (.exe file)
   - Installed at: C:\NVIDIA\DisplayDriver\451.82\Win10_64\International
   - NVIDIA Graphics Driver (Version 451.82): click Next all the way through
Configure FaceNet Environment
Download and Install Required Software
- Download and install Git
  - Installed at: C:\Program Files\Git
- Download and install Anaconda
  - Installed at: C:\ProgramData\Anaconda3   # installing Anaconda also provides python and pip
  - conda: C:\ProgramData\Anaconda3\Scripts\conda.exe   # where conda
  - python: C:\ProgramData\Anaconda3\python.exe         # where python
  - pip: C:\ProgramData\Anaconda3\Scripts\pip.exe       # where pip
- Download the extraction tool 7-Zip
  - Installed at: C:\Program Files\7-Zip\
- Download the code editor VSCode
  - Installed at: C:\Users\gputest\AppData\Local\Programs\Microsoft VS Code\Code.exe
Download the Project
- Download the facenet code
  $cd /c/Users/gputest/Desktop
  $git clone https://github.com/davidsandberg/facenet
  $cd facenet
  $conda create -n facenet python=3.6 && conda activate facenet
  $pip install -r requirements.txt
- Download the model 20180402-114759 (LFW accuracy: 0.9965, trained on VGGFace2)
  Create a models folder in the project root and put the model under it: models\20180402-114759\20180402-114759
- Download the LFW dataset
  Create a datasets folder in the project root and extract lfw into it: datasets\lfw
Configure Packages
- Add envs\facenet\Lib\site-packages\facenet
  1. Create a facenet folder under envs\facenet\Lib\site-packages
  2. Copy all files under facenet-master\src into that facenet folder
  3. Start python in the conda environment and run import facenet; if no error is raised, it worked
- Edit Program
  Desktop\facenet\src\align\align_dataset_mtcnn.py
    import facenet.facenet as facenet # import facenet
  Desktop\facenet\contributed\predict.py
    import facenet.facenet as facenet # import facenet
- Add envs\facenet\Lib\site-packages\align
  1. Copy the facenet-master\src\align folder into envs\facenet\Lib\site-packages
  2. Start python in the conda environment and run import align; if no error is raised, it worked
- Change module versions
  $pip install numpy==1.16.2
  $pip install scipy==1.2.1
- To observe program efficiency, it is best to add elapsed-time reporting to every script you run (a reusable variant is sketched after the snippet below):
import time

if __name__ == '__main__':
    start_time = time.time()
    print('Program.py start at: ' + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    main(parse_arguments(sys.argv[1:]))
    print('Program.py end at: ' + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    print('Program.py all time: {0} seconds = {1} minutes = {2} hrs'.format(
        (time.time() - start_time),
        (time.time() - start_time) / 60,
        (time.time() - start_time) / 3600))
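The same pattern can be factored into a small reusable context manager (a sketch only; the timed helper is my own, not part of the facenet code):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name):
    # Print start/end timestamps and the total elapsed time around any block of code.
    start = time.time()
    print('{0} start at: {1}'.format(name, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
    try:
        yield
    finally:
        elapsed = time.time() - start
        print('{0} end at: {1}'.format(name, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        print('{0} all time: {1} seconds = {2} minutes = {3} hrs'.format(
            name, elapsed, elapsed / 60, elapsed / 3600))

# Usage:
# with timed('Program.py'):
#     main(parse_arguments(sys.argv[1:]))
```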
Test Whether It Runs Correctly in CPU Mode
# (1) Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32
# (2) Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (3) Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (4) Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl
Test Whether It Runs Correctly in GPU Mode
Test commands
# (1) Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 # if the GPU is powerful enough
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 --gpu_memory_fraction 0.5 # if the GPU is less powerful
# (2) Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (3) Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (4) Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl # if the GPU is powerful enough
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl --gpu_memory_fraction 0.5 # if the GPU is less powerful
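For reference, --gpu_memory_fraction is wired into TensorFlow's per-process memory cap; the sketch below shows the mechanism in isolation (simplified from what scripts such as align_dataset_mtcnn.py do, not their full code):

```python
import tensorflow as tf

# --gpu_memory_fraction 0.5 ends up as a per-process GPU memory cap of 50%.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,
                                        log_device_placement=False))
```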
Install Environment Dependencies
Configure the facenet environment: Visual Studio 2017 + python 3.6.12 + tensorflow-gpu 1.7.0 + CUDA 9.0 + cuDNN 7.0.5 + facenet site-packages
- Visual Studio 2017
  # Installation options - technically only the VS C/C++ compiler (cl.exe) is required
  - Universal Windows Platform development (including its sub-option "C++ Universal Windows Platform tools")
  - Desktop development with C++
  # Installation path
  - Installed at: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community
- CUDA Toolkit 9.0
  - Base installer | Patch1 | Patch2 | Patch3 | Patch4
  - Installed at: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
  - Right-click the desktop: if the menu shows "NVIDIA Control Panel" or "NVIDIA Display", the machine has an NVIDIA GPU
- cuDNN 7.0.5
  # Copy the bin, include, and lib folders from the cuDNN package into the CUDA installation path: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
  # Different tensorflow versions require different CUDA and cuDNN versions (they must match, otherwise you will get errors)
  $nvcc -V   # check whether CUDA installed successfully
  # nvcc: NVIDIA (R) Cuda compiler driver
  # Copyright (c) 2005-2017 NVIDIA Corporation
  # Built on Fri_Sep__1_21:08:32_Central_Daylight_Time_2017
  # Cuda compilation tools, release 9.0, V9.0.176
Add Environment Variables
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\libnvvp
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\extras\CUPTI\libx64
C:\Program Files\NVIDIA Corporation\NVSMI   (path of nvidia-smi.exe)
C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common
# Test and watch GPU utilization: Task Manager (dedicated GPU)
$nvidia-smi.exe -h
$nvidia-smi.exe -l 1   # refresh the information once per second
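If you would rather log GPU usage than watch the console, the sketch below polls nvidia-smi's query interface from Python (it assumes nvidia-smi is on the PATH as configured above; the script itself is my own, not part of facenet):

```python
import subprocess
import time

# Poll GPU utilization and memory once per second, like `nvidia-smi -l 1`,
# but in a CSV form that is easy to redirect into a log file. Stop with Ctrl+C.
while True:
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used,memory.total',
        '--format=csv,noheader'])
    print(out.decode().strip())
    time.sleep(1)
```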
Add Environment Packages
- Install tensorflow-gpu 1.7.0
  $conda activate facenet
  $pip list                            # tensorflow 1.7.0
  $pip uninstall -y tensorflow
  $pip install tensorflow-gpu==1.7.0   # tensorflow-gpu 1.7.0
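After installing tensorflow-gpu 1.7.0, a quick check (using the TF 1.x API) that TensorFlow can actually see the GPUs:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists the CPU plus one entry per visible GPU; on this VM the GPU entries
# should name the Tesla V100 cards.
print([d.name for d in device_lib.list_local_devices()])

# True only if a CUDA-enabled GPU is usable by this TensorFlow build.
print(tf.test.is_gpu_available())
```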
Modify the Programs (GPU Acceleration)
src/align/align_dataset_mtcnn.py
# Change the sess configuration -> use the GPU for data preprocessing (GPU-accelerated computation)
# Do not forget to replace this file in the virtual environment as well:
# C:\ProgramData\Anaconda3\envs\facenet\Lib\site-packages\facenet\src\align\align_dataset_mtcnn.py
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
# becomes:
config = tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False, allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

src/train_tripletloss.py
# Change the sess configuration -> GPU-accelerated computation (only needed if you train on the GPU; CPU users can skip this and train directly)
# Line 191: sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
config = tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False, allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

src/classifier.py
# Change the sess configuration -> GPU-accelerated computation (only needed if you train on the GPU; CPU users can skip this and train directly)
# with tf.Graph().as_default():
#     with tf.Session() as sess:
with tf.Graph().as_default():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                    # do not grab all GPU memory; allocate on demand
    config.gpu_options.per_process_gpu_memory_fraction = 1.0  # cap the GPU memory fraction
    with tf.Session(config=config) as sess:

contributed/predict.py
# Change the sess configuration -> GPU-accelerated computation
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # do not grab all GPU memory; allocate on demand
config.gpu_options.per_process_gpu_memory_fraction = 1.0  # cap the GPU memory fraction
sess = tf.Session(config=config)
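The same session setup is repeated in all four files above; if you prefer, it can be factored into a small helper along these lines (a sketch only; the create_gpu_session name is mine, not part of facenet):

```python
import tensorflow as tf

def create_gpu_session(memory_fraction=1.0):
    # Place ops on the GPU when possible, let GPU memory grow on demand,
    # and optionally cap the per-process memory fraction.
    config = tf.ConfigProto(log_device_placement=False, allow_soft_placement=True)
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return tf.Session(config=config)

# Usage: sess = create_gpu_session(memory_fraction=0.5)
```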
Package and Import the Environment
- Export with pip
  $pip freeze > facenet.txt
- Export with conda
  $conda env export > facenet.yaml
- Import the environment
  $conda create -n facenet python=3.6
  $conda activate facenet
  $pip install -r facenet.txt
  $conda env create -f facenet.yaml
  # alternatively, just copy the site-packages folder across
conda Operations
# Create a new environment
$conda create -n facenet python=3.6
# Remove an environment
$conda remove -n facenet --all
# List all virtual environments
$conda info --env
# Show the conda executable paths
$where conda
Experiment Results
- Test dataset: lfw
- Test commands:
# 1. Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 # enough GPU memory
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 --gpu_memory_fraction 0.5 # not enough GPU memory
# 2. Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# 3. Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# 4. Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl # enough GPU memory
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl --gpu_memory_fraction 0.5 # not enough GPU memory
Azure GPU VM (4 GPU)
- GPU Info
Configuration: 24 vCPUs, 448 GB memory, and 4 NVIDIA Tesla V100 cards
Processor: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz 2.59 GHz (2 processors)
Installed memory (RAM): 448 GB
System type: 64-bit Operating System, x64-based processor
Windows edition: Windows Server 2019 Datacenter
- GPU Performance
# Align
- GPU time
  Total number of images: 13233 | Number of successfully aligned images: 13233
  Program.py all time: 636.5618450641632 seconds = 10.60936408440272 minutes = 0.17682273474004534 hrs
  So processing efficiency is: 0.0481 second/image
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 952MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 1860MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 7% | 0% | 0% | 0%
- CPU time
  Program.py all time: 1133.1881606578827 seconds = 18.886469344298046 minutes = 0.3147744890716341 hrs
  So processing efficiency is: 0.0856 second/image
- CPU usage
  Utilization 31%, Speed 2.59 GHz, Processes 138, Threads 1620
  Memory: 2% (10/448), In use/Available (9.9 GB/438 GB), Committed 10/448 GB, Cached 11.7 GB, Paged pool 1.2 GB
# Train
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Loading feature extraction model 20180402-114759 -> Calculating features for images -> Training classifier -> Saved classifier model
  Program.py all time: 1013.1615738868713 seconds = 16.886026231447854 minutes = 0.28143377052413093 hrs
  So processing efficiency is: 0.0765 second/image
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 2248MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 3156MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 0% | 0% | 0% | 0%
- CPU time
  Program.py all time: 1531.372586965561 seconds = 25.522876449426015 minutes = 0.42538127415710025 hrs
  So processing efficiency is: 0.1157 second/image
- CPU usage
  Utilization 87%, Speed 2.59 GHz, Processes 139, Threads 1685
  Memory: 2% (11/448), In use/Available (10.7 GB/437 GB), Committed 11/448 GB, Cached 12.0 GB, Paged pool 1.3 GB
# Classify
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 2248MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 3156MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 0% | 0% | 0% | 0%
- CPU usage
  Utilization 85%, Speed 2.59 GHz, Processes 114, Threads 1396
  Memory: 2% (8/448), In use/Available (7.9 GB/440 GB), Committed 9/448 GB, Cached 1.2 GB, Paged pool 112 MB
# Predict
- GPU time
  Program.py all time: 35.54524612426758 seconds = 0.5924207687377929 minutes = 0.009873679478963216 hrs
  So processing efficiency is: 35.5 second/image
- CPU time
  Program.py all time: 29.006401300430298 seconds = 0.4834400216738383 minutes
  So processing efficiency is: 29.0 second/image
Windows Server
- GPU Info  
Configuration: 2 NVIDIA GeForce RTX 2080 Ti
- GPU Performance
# Align
- Time
  Total number of images: 13233 | Number of successfully aligned images: 13233
  ProgramName All Time: 938.7497174739838 seconds = 15.645828624566397 minutes = 0.2607638104094399 hrs
  So processing efficiency is: 0.0709 second/image
- GPU usage: $nvidia-smi.exe -l 1
  GPU Memory Usage: 1183MiB | 456MiB <11264MiB>
  GPU Utilization: 8% | 0%
# Train
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Loading feature extraction model 20180402-114759 -> Calculating features for images -> Training classifier -> Saved classifier model
  Program All Time: 1790.332600593567 seconds = 29.838876676559448 minutes = 0.497314611275908 hrs
  GPU processing efficiency is: 0.1353 second/image
- GPU usage: $nvidia-smi.exe -l 1
  GPU Memory Usage: 2582MiB | 463MiB <11264MiB>
  GPU Utilization: 13% | 0%
- CPU time
  Program All Time: 1733.3322100639343 seconds = 28.88887016773224 minutes = 0.481481169462204 hrs
  CPU processing efficiency is: 0.130 second/image
- CPU usage
  Utilization 87%, Speed 2.59 GHz, Processes 139, Threads 1685
  Memory: 2% (11/448), In use/Available (10.7 GB/437 GB), Committed 11/448 GB, Cached 12.0 GB, Paged pool 1.3 GB
# Classify
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Program All Time: 36935 seconds = 615 minutes = 10 hrs
  GPU processing efficiency is: 2.791 second/image
# Predict
- GPU time
  Program.py all time: 34.344 seconds
  So processing efficiency is: 34.3 second/image
CPU vs GPU Performance Comparison
# Azure VM CPU (4 GPU) CPU Performance:
    image processing efficiency (before training) is:   0.0856  second/image
    face image training efficiency is:                  0.1157  second/image
    face image prediction efficiency is:                29.0    second/image
# Azure VM GPU (4 GPU) GPU Performance:
    image processing efficiency (before training) is:   0.0481  second/image
    face image training efficiency is:                  0.0765  second/image
    face image prediction efficiency is:                35.5    second/image
# Windows Server GPU (2 GPU) GPU Performance:
    image processing efficiency (before training) is:   0.0709  second/image
    face image training efficiency is:                  0.1353  second/image
    face image prediction efficiency is:                34.3    second/image
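The second/image figures above are simply total runtime divided by the number of processed images; for example, for the Azure GPU align run:

```python
# Align on the Azure GPU VM: 636.56 s over the 13233 LFW images.
total_seconds = 636.5618450641632
num_images = 13233
print(round(total_seconds / num_images, 4))  # 0.0481 second/image
```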
GPU Log
Azure GPU VM: console output when using the GPU
cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
gpu_device.cc:1344] Found device 0 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0001:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 1 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0002:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 2 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0003:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 3 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0004:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2, 3
gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
gpu_device.cc:917]      0 1 2 3
gpu_device.cc:930] 0:   N N N N
gpu_device.cc:930] 1:   N N N N
gpu_device.cc:930] 2:   N N N N
gpu_device.cc:930] 3:   N N N N
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:0 with 16280 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:1 with 16280 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0002:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:2 with 16280 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0003:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:3 with 16280 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-16GB, pci bus id: 0004:00:00.0, compute capability: 7.0)
Azure GPU VM: GPU usage when idle
# $nvidia-smi.exe -l 1
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Fri Apr 16 16:22:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.54                 Driver Version: 385.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  TCC  | 00000002:00:00.0 Off |                    0 |
| N/A   29C    P0    22W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  TCC  | 00000003:00:00.0 Off |                    0 |
| N/A   28C    P0    23W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  TCC  | 00000004:00:00.0 Off |                    0 |
| N/A   29C    P0    24W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Windows Server: console output when using the GPU
Configuration: 28 vCPUs (Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz), 128 GB memory, and 2 NVIDIA GeForce RTX 2080 Ti
cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
gpu_device.cc:1344] Found device 0 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:02:00.0
    totalMemory: 11.00GiB freeMemory: 9.90GiB
gpu_device.cc:1344] Found device 1 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:03:00.0
    totalMemory: 11.00GiB freeMemory: 9.90GiB
gpu_device.cc:1423] Adding visible gpu devices: 0, 1
gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
gpu_device.cc:917]      0 1
gpu_device.cc:930] 0:   N N
gpu_device.cc:930] 1:   N N
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:0 with 11264 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:1 with 11264 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)