Azure GPU VM Configuration
- GPU Info
  - Configuration: 24 vCPUs, 448 GB memory, and 4 NVIDIA Tesla V100 cards
  - Price: 85442.18 HKD per month, or 117.04 HKD per hour
- Windows System
  - Processor: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz 2.59 GHz (2 processors)
  - Installed memory (RAM): 448 GB
  - System type: 64-bit Operating System, x64-based processor
  - Pen and Touch: No Pen or Touch Input is available for this Display
  - Windows Product ID: 00430-00000-00000-AA040
  - Windows edition: Windows Server 2019 Datacenter
- Azure Page
  - Location: East US
  - Subscription ID: 712cbc67-a904-48cf-83e6-05c0629ce9f3
  - Operating system: Windows (Windows Server 2019 Datacenter)
  - Size: Standard NC24rs_v3 (24 vcpus, 448 GiB memory)
  - Public IP address: 13.92.214.100
  - Virtual network/subnet: GPU-vnet/default
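As a quick sanity check on the quoted price, the monthly figure is roughly the hourly rate multiplied by a full billing month (a sketch only; the 730-hour month is my assumption, not an Azure-quoted number):

```python
# Back-of-the-envelope check of the NC24rs_v3 pricing quoted above.
hourly_hkd = 117.04
hours_per_month = 730  # assumed ~730-hour billing month
print(hourly_hkd * hours_per_month)  # ~85439 HKD, close to the quoted 85442.18 HKD/month
```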
Install Azure VM GPU Drivers
- Install the Drivers
 - Allow the IE browser on Windows Server 2019 Datacenter to download Chrome:
   Tools button (top right) -> Internet options -> Security -> Internet -> Custom level -> Downloads: (1) File download -> Enable (2) Font download -> Enable
 - Download and install the NVIDIA Tesla (CUDA) drivers: the Windows Server 2019 driver (.exe file)
   - Installed at: C:\NVIDIA\DisplayDriver\451.82\Win10_64\International
   - NVIDIA Graphics Driver (Version 451.82): click Next all the way through
Configure FaceNet Environment
Download and Install Required Software
- Download and install Git
  - Installed at: C:\Program Files\Git
- Download and install Anaconda
  - Installed at: C:\ProgramData\Anaconda3   # installing Anaconda also provides python and pip
  - conda: C:\ProgramData\Anaconda3\Scripts\conda.exe   # where conda
  - python: C:\ProgramData\Anaconda3\python.exe         # where python
  - pip: C:\ProgramData\Anaconda3\Scripts\pip.exe       # where pip
- Download the extraction tool 7-Zip
  - Installed at: C:\Program Files\7-Zip\
- Download the code editor VSCode
  - Installed at: C:\Users\gputest\AppData\Local\Programs\Microsoft VS Code\Code.exe
Download the Project
- Download the facenet code
  $cd /c/Users/gputest/Desktop
  $git clone https://github.com/davidsandberg/facenet
  $cd facenet
  $conda create -n facenet python=3.6 && conda activate facenet
  $pip install -r requirements.txt
- Download the model 20180402-114759 (LFW accuracy: 0.9965, trained on VGGFace2)
  Create a models folder in the project root and put the model under it: models\20180402-114759\20180402-114759
- Download the LFW dataset
  Create a datasets folder in the project root and extract lfw into it: datasets\lfw
Configure Packages
- Add envs\facenet\Lib\site-packages\facenet
  1. Create a facenet folder under envs\facenet\Lib\site-packages
  2. Copy all files under facenet-master\src into that facenet folder
  3. Start python in the conda environment and run import facenet; if no error is raised, it worked
- Edit Program
  Desktop\facenet\src\align\align_dataset_mtcnn.py
    import facenet.facenet as facenet # import facenet
  Desktop\facenet\contributed\predict.py
    import facenet.facenet as facenet # import facenet
- Add envs\facenet\Lib\site-packages\align
  1. Copy the facenet-master\src\align folder into envs\facenet\Lib\site-packages
  2. Start python in the conda environment and run import align; if no error is raised, it worked
- Change module versions
  $pip install numpy==1.16.2
  $pip install scipy==1.2.1
- To observe program efficiency, it is best to add elapsed-time reporting to every script you run (a reusable variant is sketched after the snippet below):
import time

if __name__ == '__main__':
    start_time = time.time()
    print('Program.py start at: ' + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    main(parse_arguments(sys.argv[1:]))
    print('Program.py end at: ' + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
    print('Program.py all time: {0} seconds = {1} minutes = {2} hrs'.format(
        (time.time() - start_time),
        (time.time() - start_time) / 60,
        (time.time() - start_time) / 3600))
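The same pattern can be factored into a small reusable context manager (a sketch only; the timed helper is my own, not part of the facenet code):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(name):
    # Print start/end timestamps and the total elapsed time around any block of code.
    start = time.time()
    print('{0} start at: {1}'.format(name, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
    try:
        yield
    finally:
        elapsed = time.time() - start
        print('{0} end at: {1}'.format(name, time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())))
        print('{0} all time: {1} seconds = {2} minutes = {3} hrs'.format(
            name, elapsed, elapsed / 60, elapsed / 3600))

# Usage:
# with timed('Program.py'):
#     main(parse_arguments(sys.argv[1:]))
```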
Test Whether It Runs Correctly in CPU Mode
# (1) Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32
# (2) Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (3) Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (4) Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl
Test Whether It Runs Correctly in GPU Mode
Test commands
# (1) Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 # if the GPU is powerful enough
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 --gpu_memory_fraction 0.5 # if the GPU is less powerful
# (2) Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (3) Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# (4) Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl # if the GPU is powerful enough
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl --gpu_memory_fraction 0.5 # if the GPU is less powerful
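For reference, --gpu_memory_fraction is wired into TensorFlow's per-process memory cap; the sketch below shows the mechanism in isolation (simplified from what scripts such as align_dataset_mtcnn.py do, not their full code):

```python
import tensorflow as tf

# --gpu_memory_fraction 0.5 ends up as a per-process GPU memory cap of 50%.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options,
                                        log_device_placement=False))
```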
Install Environment Dependencies
Configure the facenet environment: Visual Studio 2017 + python 3.6.12 + tensorflow-gpu 1.7.0 + CUDA 9.0 + cuDNN 7.0.5 + facenet site-packages
- Visual Studio 2017
  # Installation options - technically only the VS C/C++ compiler (cl.exe) is required
  - Universal Windows Platform development (including its sub-option "C++ Universal Windows Platform tools")
  - Desktop development with C++
  # Installation path
  - Installed at: C:\Program Files (x86)\Microsoft Visual Studio\2017\Community
- CUDA Toolkit 9.0
  - Base installer | Patch1 | Patch2 | Patch3 | Patch4
  - Installed at: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
  - Right-click the desktop: if the menu shows "NVIDIA Control Panel" or "NVIDIA Display", the machine has an NVIDIA GPU
- cuDNN 7.0.5
  # Copy the bin, include, and lib folders from the cuDNN package into the CUDA installation path: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
  # Different tensorflow versions require different CUDA and cuDNN versions (they must match, otherwise you will get errors)
  $nvcc -V   # check whether CUDA installed successfully
  # nvcc: NVIDIA (R) Cuda compiler driver
  # Copyright (c) 2005-2017 NVIDIA Corporation
  # Built on Fri_Sep__1_21:08:32_Central_Daylight_Time_2017
  # Cuda compilation tools, release 9.0, V9.0.176
Add Environment Variables
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\libnvvp
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\extras\CUPTI\libx64
C:\Program Files\NVIDIA Corporation\NVSMI   (path of nvidia-smi.exe)
C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common
# Test and watch GPU utilization: Task Manager (dedicated GPU)
$nvidia-smi.exe -h
$nvidia-smi.exe -l 1   # refresh the information once per second
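If you would rather log GPU usage than watch the console, the sketch below polls nvidia-smi's query interface from Python (it assumes nvidia-smi is on the PATH as configured above; the script itself is my own, not part of facenet):

```python
import subprocess
import time

# Poll GPU utilization and memory once per second, like `nvidia-smi -l 1`,
# but in a CSV form that is easy to redirect into a log file. Stop with Ctrl+C.
while True:
    out = subprocess.check_output([
        'nvidia-smi',
        '--query-gpu=index,utilization.gpu,memory.used,memory.total',
        '--format=csv,noheader'])
    print(out.decode().strip())
    time.sleep(1)
```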
Add Environment Packages
- Install tensorflow-gpu 1.7.0
  $conda activate facenet
  $pip list                            # tensorflow 1.7.0
  $pip uninstall -y tensorflow
  $pip install tensorflow-gpu==1.7.0   # tensorflow-gpu 1.7.0
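After installing tensorflow-gpu 1.7.0, a quick check (using the TF 1.x API) that TensorFlow can actually see the GPUs:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Lists the CPU plus one entry per visible GPU; on this VM the GPU entries
# should name the Tesla V100 cards.
print([d.name for d in device_lib.list_local_devices()])

# True only if a CUDA-enabled GPU is usable by this TensorFlow build.
print(tf.test.is_gpu_available())
```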
Modify the Programs (GPU Acceleration)
src/align/align_dataset_mtcnn.py
# Change the sess configuration -> use the GPU for data preprocessing (GPU-accelerated computation)
# Do not forget to replace this file in the virtual environment as well:
# C:\ProgramData\Anaconda3\envs\facenet\Lib\site-packages\facenet\src\align\align_dataset_mtcnn.py
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
# becomes:
config = tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False, allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

src/train_tripletloss.py
# Change the sess configuration -> GPU-accelerated computation (only needed if you train on the GPU; CPU users can skip this and train directly)
# Line 191: sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
config = tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False, allow_soft_placement=True)
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

src/classifier.py
# Change the sess configuration -> GPU-accelerated computation (only needed if you train on the GPU; CPU users can skip this and train directly)
# with tf.Graph().as_default():
#     with tf.Session() as sess:
with tf.Graph().as_default():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True                    # do not grab all GPU memory; allocate on demand
    config.gpu_options.per_process_gpu_memory_fraction = 1.0  # cap the GPU memory fraction
    with tf.Session(config=config) as sess:

contributed/predict.py
# Change the sess configuration -> GPU-accelerated computation
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options, log_device_placement=False))
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # do not grab all GPU memory; allocate on demand
config.gpu_options.per_process_gpu_memory_fraction = 1.0  # cap the GPU memory fraction
sess = tf.Session(config=config)
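The same session setup is repeated in all four files above; if you prefer, it can be factored into a small helper along these lines (a sketch only; the create_gpu_session name is mine, not part of facenet):

```python
import tensorflow as tf

def create_gpu_session(memory_fraction=1.0):
    # Place ops on the GPU when possible, let GPU memory grow on demand,
    # and optionally cap the per-process memory fraction.
    config = tf.ConfigProto(log_device_placement=False, allow_soft_placement=True)
    config.gpu_options.allow_growth = True
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return tf.Session(config=config)

# Usage: sess = create_gpu_session(memory_fraction=0.5)
```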
Package and Import the Environment
- Export with pip
  $pip freeze > facenet.txt
- Export with conda
  $conda env export > facenet.yaml
- Import the environment
  $conda create -n facenet python=3.6
  $conda activate facenet
  $pip install -r facenet.txt
  $conda env create -f facenet.yaml
  # alternatively, just copy the site-packages folder across
conda Operations
# Create a new environment
$conda create -n facenet python=3.6
# Remove an environment
$conda remove -n facenet --all
# List all virtual environments
$conda info --env
# Show the conda executable paths
$where conda
Experiment Results
- Test dataset: lfw
- Test commands:
# 1. Align
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 # enough GPU memory
$python src/align/align_dataset_mtcnn.py datasets/lfw datasets/lfw_160 --image_size 160 --margin 32 --gpu_memory_fraction 0.5 # not enough GPU memory
# 2. Train
$python src/classifier.py TRAIN datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# 3. Classify
$python src/classifier.py CLASSIFY datasets/lfw_160 models/20180402-114759/20180402-114759.pb models/lfw.pkl
# 4. Predict
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl # enough GPU memory
$python contributed/predict.py datasets/lfw/Aaron_Eckhart/Aaron_Eckhart_0001.jpg models/20180402-114759 models/lfw.pkl --gpu_memory_fraction 0.5 # not enough GPU memory
Azure GPU VM (4 GPU)
- GPU Info
Configuration: 24 vCPUs, 448 GB memory, and 4 NVIDIA Tesla V100 cards
Processor: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60 GHz 2.59 GHz (2 processors)
Installed memory (RAM): 448 GB
System type: 64-bit Operating System, x64-based processor
Windows edition: Windows Server 2019 Datacenter
- GPU Performance
# Align
- GPU time
  Total number of images: 13233 | Number of successfully aligned images: 13233
  Program.py all time: 636.5618450641632 seconds = 10.60936408440272 minutes = 0.17682273474004534 hrs
  So processing efficiency is: 0.0481 second/image
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 952MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 1860MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 7% | 0% | 0% | 0%
- CPU time
  Program.py all time: 1133.1881606578827 seconds = 18.886469344298046 minutes = 0.3147744890716341 hrs
  So processing efficiency is: 0.0856 second/image
- CPU usage
  Utilization 31%, Speed 2.59 GHz, Processes 138, Threads 1620
  Memory: 2% (10/448), In use/Available (9.9 GB/438 GB), Committed 10/448 GB, Cached 11.7 GB, Paged pool 1.2 GB
# Train
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Loading feature extraction model 20180402-114759 -> Calculating features for images -> Training classifier -> Saved classifier model
  Program.py all time: 1013.1615738868713 seconds = 16.886026231447854 minutes = 0.28143377052413093 hrs
  So processing efficiency is: 0.0765 second/image
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 2248MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 3156MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 0% | 0% | 0% | 0%
- CPU time
  Program.py all time: 1531.372586965561 seconds = 25.522876449426015 minutes = 0.42538127415710025 hrs
  So processing efficiency is: 0.1157 second/image
- CPU usage
  Utilization 87%, Speed 2.59 GHz, Processes 139, Threads 1685
  Memory: 2% (11/448), In use/Available (10.7 GB/437 GB), Committed 11/448 GB, Cached 12.0 GB, Paged pool 1.3 GB
# Classify
- GPU usage: $nvidia-smi.exe -l 1
  CPU Memory Usage: 2248MiB | 480MiB | 480MiB | 480MiB
  GPU Memory Usage: 3156MiB | 1388MiB | 1388MiB | 1388MiB <16280MiB>
  GPU Utilization: 0% | 0% | 0% | 0%
- CPU usage
  Utilization 85%, Speed 2.59 GHz, Processes 114, Threads 1396
  Memory: 2% (8/448), In use/Available (7.9 GB/440 GB), Committed 9/448 GB, Cached 1.2 GB, Paged pool 112 MB
# Predict
- GPU time
  Program.py all time: 35.54524612426758 seconds = 0.5924207687377929 minutes = 0.009873679478963216 hrs
  So processing efficiency is: 35.5 second/image
- CPU time
  Program.py all time: 29.006401300430298 seconds = 0.4834400216738383 minutes
  So processing efficiency is: 29.0 second/image
Windows Server
- GPU Info  
Configuration: 2 NVIDIA GeForce RTX 2080 Ti
- GPU Performance
# Align
- Time
  Total number of images: 13233 | Number of successfully aligned images: 13233
  ProgramName All Time: 938.7497174739838 seconds = 15.645828624566397 minutes = 0.2607638104094399 hrs
  So processing efficiency is: 0.0709 second/image
- GPU usage: $nvidia-smi.exe -l 1
  GPU Memory Usage: 1183MiB | 456MiB <11264MiB>
  GPU Utilization: 8% | 0%
# Train
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Loading feature extraction model 20180402-114759 -> Calculating features for images -> Training classifier -> Saved classifier model
  Program All Time: 1790.332600593567 seconds = 29.838876676559448 minutes = 0.497314611275908 hrs
  GPU processing efficiency is: 0.1353 second/image
- GPU usage: $nvidia-smi.exe -l 1
  GPU Memory Usage: 2582MiB | 463MiB <11264MiB>
  GPU Utilization: 13% | 0%
- CPU time
  Program All Time: 1733.3322100639343 seconds = 28.88887016773224 minutes = 0.481481169462204 hrs
  CPU processing efficiency is: 0.130 second/image
- CPU usage
  Utilization 87%, Speed 2.59 GHz, Processes 139, Threads 1685
  Memory: 2% (11/448), In use/Available (10.7 GB/437 GB), Committed 11/448 GB, Cached 12.0 GB, Paged pool 1.3 GB
# Classify
- GPU time
  Number of classes: 5749 | Number of images: 13233
  Program All Time: 36935 seconds = 615 minutes = 10 hrs
  GPU processing efficiency is: 2.791 second/image
# Predict
- GPU time
  Program.py all time: 34.344 seconds
  So processing efficiency is: 34.3 second/image
CPU vs GPU Performance Comparison
# Azure VM CPU (4 GPU) CPU Performance:
    image processing efficiency (before training) is:   0.0856  second/image
    face image training efficiency is:                  0.1157  second/image
    face image prediction efficiency is:                29.0    second/image
# Azure VM GPU (4 GPU) GPU Performance:
    image processing efficiency (before training) is:   0.0481  second/image
    face image training efficiency is:                  0.0765  second/image
    face image prediction efficiency is:                35.5    second/image
# Windows Server GPU (2 GPU) GPU Performance:
    image processing efficiency (before training) is:   0.0709  second/image
    face image training efficiency is:                  0.1353  second/image
    face image prediction efficiency is:                34.3    second/image
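The second/image figures above are simply total runtime divided by the number of processed images; for example, for the Azure GPU align run:

```python
# Align on the Azure GPU VM: 636.56 s over the 13233 LFW images.
total_seconds = 636.5618450641632
num_images = 13233
print(round(total_seconds / num_images, 4))  # 0.0481 second/image
```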
GPU Log
Azure GPU VM: console output when using the GPU
cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
gpu_device.cc:1344] Found device 0 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0001:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 1 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0002:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 2 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0003:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1344] Found device 3 with properties:
    name: Tesla V100-PCIE-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.38
    pciBusID: 0004:00:00.0
    totalMemory: 15.90GiB freeMemory: 14.60GiB
gpu_device.cc:1423] Adding visible gpu devices: 0, 1, 2, 3
gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
gpu_device.cc:917]      0 1 2 3
gpu_device.cc:930] 0:   N N N N
gpu_device.cc:930] 1:   N N N N
gpu_device.cc:930] 2:   N N N N
gpu_device.cc:930] 3:   N N N N
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:0 with 16280 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:1 with 16280 MB memory) -> physical GPU (device: 1, name: Tesla V100-PCIE-16GB, pci bus id: 0002:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:2 with 16280 MB memory) -> physical GPU (device: 2, name: Tesla V100-PCIE-16GB, pci bus id: 0003:00:00.0, compute capability: 7.0)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:3 with 16280 MB memory) -> physical GPU (device: 3, name: Tesla V100-PCIE-16GB, pci bus id: 0004:00:00.0, compute capability: 7.0)
Azure GPU VM: GPU usage when idle
# $nvidia-smi.exe -l 1
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Fri Apr 16 16:22:42 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.54                 Driver Version: 385.54                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  TCC  | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P0    24W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  TCC  | 00000002:00:00.0 Off |                    0 |
| N/A   29C    P0    22W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  TCC  | 00000003:00:00.0 Off |                    0 |
| N/A   28C    P0    23W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  TCC  | 00000004:00:00.0 Off |                    0 |
| N/A   29C    P0    24W / 250W |      1MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Windows Server: console output when using the GPU
Configuration: 28 vCPUs (Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00 GHz), 128 GB memory, and 2 NVIDIA GeForce RTX 2080 Ti
cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
gpu_device.cc:1344] Found device 0 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:02:00.0
    totalMemory: 11.00GiB freeMemory: 9.90GiB
gpu_device.cc:1344] Found device 1 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:03:00.0
    totalMemory: 11.00GiB freeMemory: 9.90GiB
gpu_device.cc:1423] Adding visible gpu devices: 0, 1
gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
gpu_device.cc:917]      0 1
gpu_device.cc:930] 0:   N N
gpu_device.cc:930] 1:   N N
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:0 with 11264 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5)
gpu_device.cc:1041] Created TensorFlow device (task:0/device:GPU:1 with 11264 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5)