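Getting started with PyTorch (Object Detection)

I-it's not like I've started to hate TensorFlow or anything!!!

It's not like I need to build an ONNX model to feed into NNgen and figured something with official support would be nice, or anything.

It's not like I was reading a tutorial and went "wait, you're sending me off to TF Hub?!",
or thought "oh! With PyTorch this is really easy to understand, and I can customize all sorts of things later!"
Honestly, it was just a whim.

So, since the official PyTorch tutorial is about segmentation, I tweaked it a little
and ran it as object detection instead.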

TorchVision Object Detection Finetuning Tutorial - PyTorch

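I ran and checked everything on Google Colab. (I can publish the notebook if anyone needs it.)
The official tutorial also has a Colab version, so looking at that should be enough.
(I recommend choosing GPU as the accelerator type.)

Preparation

Apparently this is required, so install it.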

%%shell

# Install pycocotools
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
python setup.py build_ext install

Download the custom dataset.

%%shell

# download the Penn-Fudan dataset
wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
# extract it in the current folder
unzip PennFudanPed.zip

Two guys out for a walk.

from PIL import Image
Image.open('PennFudanPed/PNGImages/FudanPed00001.png')

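Creating the dataset

The original code was for segmentation, so I commented out the parts that aren't needed,
for example the target["masks"] output.

Personally, I thought I could comment out loading the PedMasks data as well,
but it is used to compute the bounding boxes (boxes), so I left it as is.
In other words, if you already have the boxes information, you don't need to read the PedMasks files.

There seems to be only one label this time, so the code will need changes when there are several.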

import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image

class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)

        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        # masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        # target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)
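
To get a feel for what the mask-to-box loop inside `__getitem__` is doing, here is a minimal sketch on a made-up mask. The helper name `boxes_from_instance_mask` and the toy mask are my own assumptions for illustration; only the `np.unique` / `np.where` logic mirrors the dataset class above.

```python
import numpy as np

def boxes_from_instance_mask(mask):
    """Return one [xmin, ymin, xmax, ymax] box per instance id in the mask."""
    obj_ids = np.unique(mask)
    # id 0 is the background, so drop it
    obj_ids = obj_ids[obj_ids != 0]
    boxes = []
    for obj_id in obj_ids:
        # np.where returns (row indices, column indices) = (ys, xs)
        ys, xs = np.where(mask == obj_id)
        boxes.append([int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())])
    return boxes

# toy 6x8 mask with two "pedestrians" (values 1 and 2)
toy_mask = np.zeros((6, 8), dtype=np.uint8)
toy_mask[1:4, 2:5] = 1  # rows 1..3, columns 2..4
toy_mask[2:6, 6:8] = 2  # rows 2..5, columns 6..7

toy_boxes = boxes_from_instance_mask(toy_mask)
print(toy_boxes)
```

If your own dataset already stores boxes in an annotation file, this whole step (and the PedMasks loading) can be swapped for reading those boxes directly.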

The data that comes out looks like this.

dataset = PennFudanDataset('PennFudanPed/')
dataset[0]

(<PIL.Image.Image image mode=RGB size=559x536 at 0x7F59B993C6A0>,
 {'area': tensor([35358., 36225.]), 'boxes': tensor([[159., 181., 301., 430.],
          [419., 170., 534., 485.]]), 'image_id': tensor([0]), 'iscrowd': tensor([0, 0]), 'labels': tensor([1, 1])})
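
As a quick sanity check, the 'area' values can be reproduced by hand from the 'boxes' values in that output. This is just a sketch with plain Python lists (the helper name `box_area` is mine), using the same height-times-width formula as the dataset class.

```python
# boxes copied from the dataset[0] output, as [xmin, ymin, xmax, ymax]
boxes = [
    [159.0, 181.0, 301.0, 430.0],
    [419.0, 170.0, 534.0, 485.0],
]

def box_area(box):
    xmin, ymin, xmax, ymax = box
    # same formula as the dataset class: height * width
    return (ymax - ymin) * (xmax - xmin)

areas = [box_area(b) for b in boxes]
print(areas)  # [35358.0, 36225.0]
```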

Model definition

The segmentation tutorial used Mask R-CNN, but since this is object detection I switched to Faster R-CNN.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
# from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor
 
def get_instance_segmentation_model(num_classes):
    # (function name kept from the segmentation tutorial; it now builds a detection model)
    # load a model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

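The above is "1 - Finetuning from a pretrained model", but guess what!
The official tutorial also covers "2 - Modifying the model to add a different backbone", so if you need that, have a look and it should make sense. (How great is that?)

Training and evaluation

Data preparation

Helper APIs, I guess!? (I haven't checked the details yet, so more on that later.)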

%%shell

# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../

Apparently this defines helper functions for data augmentation and conversion.

from engine import train_one_epoch, evaluate
import utils
import transforms as T

def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
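
One thing worth noticing: for detection, a horizontal flip has to move the boxes along with the pixels, which is why the tutorial uses its own `transforms.py` rather than the plain torchvision transforms. Below is a rough sketch of the box side of that flip in plain Python. The function name `hflip_boxes` is hypothetical, and the image width of 559 is taken from the dataset[0] output earlier.

```python
def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, xmax, ymax] boxes around the vertical center line."""
    flipped = []
    for xmin, ymin, xmax, ymax in boxes:
        # the old right edge becomes the new left edge, and vice versa
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax])
    return flipped

boxes = [
    [159.0, 181.0, 301.0, 430.0],
    [419.0, 170.0, 534.0, 485.0],
]
flipped = hflip_boxes(boxes, image_width=559)
print(flipped)
```

Flipping twice gives back the original boxes, which is a handy check if you ever write your own augmentation.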

Apparently this creates the DataLoaders that get passed to train and evaluate.

# use our dataset and defined transformations
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)
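
About that `collate_fn`: each image has a different number of boxes, so the samples can't simply be stacked into one tensor. As far as I can tell, `utils.collate_fn` just regroups the batch into a tuple of images and a tuple of targets. The sketch below is my own reimplementation under that assumption (the name `collate_as_tuples` and the string stand-ins for images are made up).

```python
def collate_as_tuples(batch):
    # batch is a list of (image, target) pairs;
    # zip(*batch) regroups it into (all images), (all targets)
    return tuple(zip(*batch))

# stand-in samples: strings instead of image tensors,
# targets with different numbers of boxes
batch = [
    ("img0", {"boxes": [[159, 181, 301, 430], [419, 170, 534, 485]]}),
    ("img1", {"boxes": [[10, 20, 30, 40]]}),
]

images, targets = collate_as_tuples(batch)
print(images)                                  # ('img0', 'img1')
print([len(t["boxes"]) for t in targets])      # [2, 1]
```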

Instantiate the model and set up the optimizer.

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and person
num_classes = 2

# get the model using our helper function
model = get_instance_segmentation_model(num_classes)
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
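
To see what "decreases the learning rate by 10x every 3 epochs" means in numbers, here is a small sketch that computes the schedule by hand. It uses the closed-form expression rather than calling `StepLR` itself, so treat it as an approximation of what the scheduler produces.

```python
base_lr = 0.005
step_size = 3
gamma = 0.1
num_epochs = 10

# closed form of a step schedule: lr = base_lr * gamma ** (epoch // step_size)
schedule = [base_lr * gamma ** (epoch // step_size) for epoch in range(num_epochs)]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

With these settings, the last epoch (epoch 9) ends up at lr = 0.000005.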

Finally, training.

# let's train it for 10 epochs
num_epochs = 10

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)
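
For reference, the number of iterations per epoch follows from the split and the batch sizes. The sketch below assumes the Penn-Fudan dataset has 170 images in total (that number is not printed anywhere in this post, so it is an assumption here); the 50-image test split and the batch sizes come from the DataLoader code.

```python
import math

num_images = 170        # assumption: total images in PennFudanPed
num_test = 50           # indices[-50:] goes to the test set
train_batch_size = 2
test_batch_size = 1

num_train = num_images - num_test
train_iters = math.ceil(num_train / train_batch_size)
test_iters = math.ceil(num_test / test_batch_size)

print(num_train, train_iters, test_iters)  # 120 60 50
```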

The end of the output looked like this.

Epoch: [9]  [ 0/60]  eta: 0:01:07  lr: 0.000005  loss: 0.0580 (0.0580)  loss_classifier: 0.0373 (0.0373)  loss_box_reg: 0.0113 (0.0113)  loss_objectness: 0.0001 (0.0001)  loss_rpn_box_reg: 0.0093 (0.0093)  time: 1.1315  data: 0.4001  max mem: 3573
Epoch: [9]  [10/60]  eta: 0:00:35  lr: 0.000005  loss: 0.0449 (0.0446)  loss_classifier: 0.0234 (0.0247)  loss_box_reg: 0.0105 (0.0124)  loss_objectness: 0.0001 (0.0002)  loss_rpn_box_reg: 0.0093 (0.0072)  time: 0.7051  data: 0.0396  max mem: 3573
Epoch: [9]  [20/60]  eta: 0:00:28  lr: 0.000005  loss: 0.0382 (0.0474)  loss_classifier: 0.0233 (0.0266)  loss_box_reg: 0.0087 (0.0124)  loss_objectness: 0.0001 (0.0010)  loss_rpn_box_reg: 0.0079 (0.0075)  time: 0.6812  data: 0.0043  max mem: 3573
Epoch: [9]  [30/60]  eta: 0:00:20  lr: 0.000005  loss: 0.0452 (0.0480)  loss_classifier: 0.0235 (0.0266)  loss_box_reg: 0.0089 (0.0124)  loss_objectness: 0.0002 (0.0011)  loss_rpn_box_reg: 0.0080 (0.0079)  time: 0.6868  data: 0.0052  max mem: 3573
Epoch: [9]  [40/60]  eta: 0:00:13  lr: 0.000005  loss: 0.0456 (0.0480)  loss_classifier: 0.0261 (0.0271)  loss_box_reg: 0.0101 (0.0125)  loss_objectness: 0.0002 (0.0011)  loss_rpn_box_reg: 0.0073 (0.0074)  time: 0.6860  data: 0.0054  max mem: 3573
Epoch: [9]  [50/60]  eta: 0:00:06  lr: 0.000005  loss: 0.0410 (0.0469)  loss_classifier: 0.0253 (0.0267)  loss_box_reg: 0.0101 (0.0119)  loss_objectness: 0.0001 (0.0010)  loss_rpn_box_reg: 0.0052 (0.0072)  time: 0.7023  data: 0.0055  max mem: 3573
Epoch: [9]  [59/60]  eta: 0:00:00  lr: 0.000005  loss: 0.0391 (0.0457)  loss_classifier: 0.0226 (0.0260)  loss_box_reg: 0.0085 (0.0118)  loss_objectness: 0.0001 (0.0008)  loss_rpn_box_reg: 0.0056 (0.0070)  time: 0.6736  data: 0.0054  max mem: 3573
Epoch: [9] Total time: 0:00:41 (0.6898 s / it)
creating index...
index created!
Test:  [ 0/50]  eta: 0:00:13  model_time: 0.1189 (0.1189)  evaluator_time: 0.0014 (0.0014)  time: 0.2754  data: 0.1539  max mem: 3573
Test:  [49/50]  eta: 0:00:00  model_time: 0.1331 (0.1296)  evaluator_time: 0.0009 (0.0014)  time: 0.1370  data: 0.0027  max mem: 3573
Test: Total time: 0:00:06 (0.1398 s / it)
Averaged stats: model_time: 0.1331 (0.1296)  evaluator_time: 0.0009 (0.0014)
Accumulating evaluation results...
DONE (t=0.01s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.815
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.992
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.950
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.600
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.825
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.376
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.854
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.854
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.762
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.860
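
The "IoU" in those metrics is intersection over union between a predicted box and a ground-truth box, and "IoU=0.50:0.95" averages over thresholds from 0.50 to 0.95. As I understand it, the -1.000 rows mean there were no ground-truth objects of that size in the test set. Here is a minimal IoU sketch in plain Python (the function name `box_iou` and the example boxes are mine, not what pycocotools calls internally).

```python
def box_iou(box_a, box_b):
    """IoU of two [xmin, ymin, xmax, ymax] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # width and height of the overlap, clamped at zero when the boxes don't touch
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

ground_truth = [0.0, 0.0, 10.0, 10.0]
iou_identical = box_iou(ground_truth, ground_truth)
iou_shifted = box_iou(ground_truth, [5.0, 0.0, 15.0, 10.0])   # half-overlapping
iou_disjoint = box_iou(ground_truth, [20.0, 20.0, 30.0, 30.0])
print(iou_identical, iou_shifted, iou_disjoint)
```

A box shifted by half its width only reaches an IoU of about 0.33, so it would not count as a match even at the loosest 0.50 threshold.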

Inference

I ran it on one of the test samples.
So you need to call model.eval() before predicting, I see.

# pick one image from the test set
img, _ = dataset_test[0]
# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

The data that comes out as the inference result (prediction) looks like this.

[{'boxes': tensor([[ 61.1901,  38.2282, 195.3746, 323.0859],
          [276.6442,  24.0823, 290.8779,  74.9430]], device='cuda:0'),
  'labels': tensor([1, 1], device='cuda:0'),
  'scores': tensor([0.9997, 0.0974], device='cuda:0')}]
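
The second box only has a score of about 0.10, so in practice you would probably filter predictions by score before using them. Here is a rough sketch of that, using plain lists copied from the output above instead of tensors. The helper name `filter_by_score` and the 0.5 threshold are arbitrary choices of mine.

```python
# values copied from the prediction output, as plain Python lists
prediction = [{
    "boxes": [[61.1901, 38.2282, 195.3746, 323.0859],
              [276.6442, 24.0823, 290.8779, 74.9430]],
    "labels": [1, 1],
    "scores": [0.9997, 0.0974],
}]

def filter_by_score(pred, min_score):
    """Keep only the detections whose score is at least min_score."""
    keep = [i for i, score in enumerate(pred["scores"]) if score >= min_score]
    return {key: [values[i] for i in keep] for key, values in pred.items()}

confident = filter_by_score(prediction[0], min_score=0.5)
print(confident["boxes"])   # only the 0.9997 box is left
```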

Viewed as an image, it looked like this.

from PIL import ImageDraw

im = Image.fromarray(img.mul(255).permute(1, 2, 0).byte().numpy())

draw = ImageDraw.Draw(im)
# convert the tensor to a plain [xmin, ymin, xmax, ymax] list for Pillow
draw.rectangle(prediction[0]['boxes'][0].cpu().tolist())

im

(Image: the test image with the first predicted box drawn on it.)

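Thoughts

It was easy to run, and I now have a picture of what I'd need to change to do what I actually want.
I suspect I only feel like I understand it, so I plan to spend some more time with PyTorch.

This post is day 18 of the "Pytorch Advent Calendar 2019".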