Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872.

torch.cuda.set_device(2)

The data

We are going to need the coco_vocab in its original order to use the DETR pretrained models.

coco_source = download_coco()
img2bbox = {}
for ann_file in [coco_source['val_ann'], coco_source['train_ann']]:
    images, lbl_bbox = get_annotations(ann_file)
    img2bbox = merge(img2bbox, dict(zip(images, lbl_bbox)))
vocab = L(coco_vocab) + '#na#'

We have to change bb_pad so it does nothing if there are no bounding boxes and classes. This is needed when we are decoding.

bb_pad[source]

bb_pad(samples, pad_idx=0)

Function that collects samples of labelled bboxes and adds padding with pad_idx.
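
Roughly, the patched version keeps fastai's original padding logic and simply returns the samples untouched when they carry no bboxes/classes (for example when decoding a single prediction). The sketch below assumes the usual from fastai.vision.all import * and is not the exact implementation:

# Sketch of the patched bb_pad: same padding as fastai's original, plus an early return
# when there is nothing to pad (assumed guard, used when decoding predictions).
def bb_pad(samples, pad_idx=0):
    "Function that collects samples of labelled bboxes and adds padding with pad_idx."
    if not samples or len(samples[0]) < 3: return samples  # no bboxes/classes: do nothing
    samples = [(s[0], *clip_remove_empty(*s[1:])) for s in samples]
    max_len = max([len(s[2]) for s in samples])
    def _f(img, bbox, lbl):
        bbox = torch.cat([bbox, bbox.new_zeros(max_len-bbox.shape[0], 4)])
        lbl  = torch.cat([lbl,  lbl.new_zeros(max_len-lbl.shape[0]) + pad_idx])
        return img, bbox, lbl
    return [_f(*s) for s in samples]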

BBoxBlock.dls_kwargs = {'before_batch': partial(bb_pad, pad_idx=len(vocab)-1)}

We create a ParentSplitter which can split files into training and validation sets based on their parent folder. This is how the COCO dataset is structured.

ParentSplitter[source]

ParentSplitter(train_name='train', valid_name='valid')

Split items by their parent folder name (train_name and valid_name).
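
A minimal sketch of such a splitter, assuming the usual fastai convention that a splitter returns two lists of indices:

# Sketch only: an item goes to the training set when its parent folder is train_name
# and to the validation set when it is valid_name.
def ParentSplitter(train_name='train', valid_name='valid'):
    "Split items by their parent folder name."
    def _inner(items):
        train_idx = [i for i, o in enumerate(items) if Path(o).parent.name == train_name]
        valid_idx = [i for i, o in enumerate(items) if Path(o).parent.name == valid_name]
        return L(train_idx), L(valid_idx)
    return _inner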

coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock(vocab=list(vocab), add_na=False)), 
                 get_items=compose(get_image_files, partial(filter, lambda x:x.name in img2bbox), L),
                 splitter=ParentSplitter(train_name='train2017', valid_name='val2017'),
                 get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]], 
                 item_tfms=Resize(800),
                 batch_tfms=None,
                 n_inp=1)
dls = coco.dataloaders(coco_source['base'], bs=16, num_workers=0)
b = dls.one_batch()
dls.show_batch(b, figsize=(10, 10))

Box utils

Bounding boxes come in two flavors:

  • xyxy where the four coordinates are the top left corner and the bottom right corner: [left, top, right, bottom]
  • cxcywh where the first two elements are the xy coordinates of the center and the third and fourth elements are the width and height of the box.

box_cxcywh_to_xyxy and box_xyxy_to_cxcywh allow you to convert from one format to the other.

box_cxcywh_to_xyxy[source]

box_cxcywh_to_xyxy(x)

x = torch.rand(4, 7, 4)
y = box_cxcywh_to_xyxy(x)
test_eq(y.shape, x.shape)

box_xyxy_to_cxcywh[source]

box_xyxy_to_cxcywh(x)
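
Assuming the two functions are exact inverses of each other, a quick round-trip check:

x = torch.rand(4, 7, 4)
test_close(box_xyxy_to_cxcywh(box_cxcywh_to_xyxy(x)), x)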

Also, fastai uses normalized coordinates from -1 to 1 (with the origin at the center of the image), but the DETR model outputs coordinates normalized from 0 to 1, with the origin at the top left corner.

To handle all these combinations we use three classes (I did not need the fourth one):

  • TensorBBox: xyxy bboxes, centered, scaled from -1 to 1 (the fastai default)
  • TensorBBoxWH: cxcywh bboxes with the origin at the top left corner and scale from 0 to 1
  • TensorBBoxTL: xyxy bboxes with the origin at the top left corner and scale from 0 to 1
#     "Basic type for a tensor of bounding boxes in an image"
#     @classmethod
#     def create(cls, x, img_size=None)->None: return cls(tensor(x).view(-1, 4).float(), img_size=img_size)

class TensorBBoxWH[source]

TensorBBoxWH(x, **kwargs) :: TensorPoint

Basic type for points in an image

class TensorBBoxTL[source]

TensorBBoxTL(x, **kwargs) :: TensorPoint

Basic type for points in an image

ToWH[source]

ToWH(enc=None, dec=None, split_idx=None, order=None)

Delegates (__call__,decode,setup) to (encodes,decodes,setups) if split_idx matches

bbox = b[1]
bboxwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=bbox.img_size)
test_eq_type(ToWH(bbox), bboxwh)

ToXYXY[source]

ToXYXY(enc=None, dec=None, split_idx=None, order=None)

Delegates (__call__,decode,setup) to (encodes,decodes,setups) if split_idx matches

test_close(bbox, ToXYXY(bboxwh))
test_eq(type(bbox), type(ToXYXY(bboxwh)))

class ToTL[source]

ToTL(enc=None, dec=None, split_idx=None, order=None) :: Transform

Delegates (__call__,decode,setup) to (encodes,decodes,setups) if split_idx matches

bwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=b[1].img_size)

test_eq(ToXYXY(b[1]), b[1])
test_ne(ToWH(b[1]), b[1])
test_eq_type(ToWH(b[1]), bwh)
test_close(ToXYXY(bwh), b[1])

box_area[source]

box_area(boxes)

all_op[source]

all_op(cmp)

Compares all the elements of a and b using cmp.

The generalized_box_iou function is a copy of the original DETR implementation. It can handle batched boxes and two types of comparisons: element-wise or pairwise.

generalized_box_iou[source]

generalized_box_iou(boxes1, boxes2, pairwise=False)

Generalized IoU from https://giou.stanford.edu/. The boxes should be in [x0, y0, x1, y1] format. Returns a [N, M] pairwise matrix, where N = len(boxes1) and M = len(boxes2). This implementation expects bs as the first dim.

b1 = box_cxcywh_to_xyxy(torch.rand(2, 5, 4))
b2 = box_cxcywh_to_xyxy(torch.rand(2, 3, 4))
giou = generalized_box_iou(b1, b2, pairwise=True)
test_eq(giou.shape, torch.Size((b1.shape[0], b1.shape[1], b2.shape[1])))
x = tensor([[[-.5, -.5, 0, 0], [-.25, -.25, .25, .25], [0, 0, .5, .5], [.25, .25, .75, .75]]])
y = tensor([[[-.5, -.5, .5, .5]]])-0.5
# the (negative) GIoU of the first box against the others should increase as the overlap decreases
l_iou = -generalized_box_iou(x*0.5+0.5, x*0.5+0.5, pairwise=True)
test(l_iou[0,0,:-1], l_iou[0,0, 1:], all_op(lt))
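
Element-wise comparison (pairwise=False, the default) is what the loss uses on already-matched boxes. Assuming it returns one GIoU value per matched pair, the check below should hold:

b3 = box_cxcywh_to_xyxy(torch.rand(2, 5, 4))
giou_ew = generalized_box_iou(b1, b3)  # same shapes, compared element-wise
test_eq(giou_ew.shape, torch.Size((b1.shape[0], b1.shape[1])))  # assumed output: (bs, N)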

The Loss

This is a reimplementation of the original code; here we put together the matcher and the criterion into the DETRLoss Module.

class DETRLoss[source]

DETRLoss(classw=1, boxw=1, giouw=1, n_queries=100, th=0.7, eos_coef=0.1, n_classes=92) :: Module

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes::

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call :meth:to, etc.

:ivar training: Boolean represents whether this module is in training or evaluation mode. :vartype training: bool
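
Schematically (following the DETR paper), the Hungarian matcher first pairs each ground-truth object with one of the n_queries predictions, and the criterion then sums three weighted terms: a cross-entropy classification loss (with eos_coef down-weighting the background class), an L1 loss on the boxes and a GIoU loss. The helper below only illustrates that weighting on already-matched pairs; it is not the actual implementation:

import torch.nn.functional as F

def detr_loss_terms(pred_logits, pred_boxes, tgt_classes, tgt_boxes, classw=1, boxw=1, giouw=1):
    "Illustration only: combine the three DETR loss terms on already-matched predictions."
    class_loss = F.cross_entropy(pred_logits, tgt_classes)    # eos_coef would down-weight '#na#'
    box_loss   = F.l1_loss(pred_boxes, tgt_boxes)
    # generalized_box_iou expects bs as the first dim, hence the added batch axis
    giou_loss  = (1 - generalized_box_iou(pred_boxes[None], tgt_boxes[None])).mean()
    return classw*class_loss + boxw*box_loss + giouw*giou_loss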

# a random prediction should incur a larger loss than one built from the ground truth (good_loss)
rout = [torch.rand(bs, 100, 4), torch.rand(bs, 100, c)]
rout[0] = box_cxcywh_to_xyxy(rout[0])
random_loss = loss(rout+[None], *b[1:])
test(good_loss, random_loss, lt)

Model

We do not reimplement the model; we only wrap it in a Module that sets up the model and casts the outputs.

class DETR[source]

DETR(pretrained=True, n_classes=92, aux_loss=False) :: Module

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing to nest them in a tree structure. You can assign the submodules as regular attributes::

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call :meth:to, etc.

:ivar training: Boolean represents whether this module is in training or evaluation mode. :vartype training: bool
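
For reference, a sketch of what such a wrapper could look like. Assumptions: the pretrained model is fetched from the facebookresearch/detr torch hub (whose detr_resnet50 entry point counts classes without the background slot), the outputs are reordered to (pred_boxes, pred_logits, aux), and the real wrapper presumably also handles n_classes other than 92 and casts the boxes to the tensor types above:

class DETR(Module):
    "Wrap the original DETR model and return [pred_boxes, pred_logits, aux_outputs]."
    def __init__(self, pretrained=True, n_classes=92, aux_loss=False):
        # n_classes includes the background/'#na#' slot; the hub entry point does not.
        # The pretrained hub weights assume 91 classes, so the real implementation
        # presumably replaces the classification head when n_classes differs.
        self.model = torch.hub.load('facebookresearch/detr', 'detr_resnet50',
                                    pretrained=pretrained, num_classes=n_classes-1)
        self.model.aux_loss = aux_loss
    def forward(self, x):
        out = self.model(x)
        aux = out.get('aux_outputs') if self.training and self.model.aux_loss else None
        return [out['pred_boxes'], out['pred_logits'], aux]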

Test model

dls.show_batch(b)
model = DETR(pretrained=False).eval();
model.eval()
img = b[0]
bs = b[0].shape[0]
out_val = model(img)
test_eq(len(out_val), 3)
test_eq(out_val[0].shape, torch.Size((bs, 100, 4)))
test_eq(out_val[1].shape, torch.Size((bs, 100, 92)))
model = DETR(pretrained=True).eval();
model.eval()
img = b[0]
out_val = model(to_cpu(img))
loss = DETRLoss()
out_val_act = loss.decodes(loss.activation(out_val))
bb = (img, out_val_act[0], out_val_act[1])
dls.show_batch(bb)
model = DETR(pretrained=True, aux_loss=True)
model.train()
out = model(to_cpu(img))

Detection Metrics

The CocoEval Callback gathers annotations and predictions during validation and computes the detection metrics using the pycocotools package.

class CocoEval[source]

CocoEval() :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events
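
The evaluation itself is standard pycocotools usage. The helper below sketches the core computation such a callback would run at the end of validation; the bookkeeping that builds the ground truth and detection dicts from the batches is omitted:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def coco_bbox_metrics(gt, dt):
    "gt: ground truth in COCO dict format; dt: detections with image_id, category_id, bbox, score."
    coco_gt = COCO(); coco_gt.dataset = gt; coco_gt.createIndex()
    coco_dt = coco_gt.loadRes(dt)
    coco_eval = COCOeval(coco_gt, coco_dt, iouType='bbox')
    coco_eval.evaluate(); coco_eval.accumulate(); coco_eval.summarize()
    return coco_eval.stats  # AP, AP50, AP75, AP_small/medium/large, AR1, AR10, AR100, AR_small/medium/large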

Learn

sorted_detr_trainable_params[source]

sorted_detr_trainable_params(m)

model = DETR(pretrained=True, n_classes=len(dls.vocab), aux_loss=True)
loss = DETRLoss(th=0.0, classw=1, boxw=5, giouw=2).cuda()
ce = CocoEval()

learn = Learner(dls, model, loss, splitter=sorted_detr_trainable_params,
                cbs=[ce],
                metrics=ce.metrics,
                opt_func=partial(Adam, decouple_wd=True))
learn.coco_eval = ce
#with learn.added_cbs(RunNBatches(no_valid=False)):
val_output = learn.validate()
pd.DataFrame([val_output], columns=L('valid_loss')+learn.metrics.attrgot('name'))
valid_loss AP AP50 AP75 AP_small AP_medium AP_large AR1 AR10 AR100 AR_small AR_medium AR_large
0 12.220279 0.349411 0.430659 0.374847 0.210336 0.350109 0.40202 0.27847 0.412292 0.422944 0.259426 0.418572 0.47313
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
url = 'https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fbarkpost.com%2Fwp-content%2Fuploads%2F2014%2F06%2FDOG-2-superJumbo.jpg&f=1&nofb=1'
with learn.removed_cbs(learn.coco_eval):
    img = PILImage.create(requests.get(url, stream=True).raw)
    out = learn.predict(img, with_input=True)
    ctx = show_image(out[0])
    out[1][1].show(ctx);

The prediction is not perfect. I believe this is because we are resizing our images to 800x800 while the DETR model was trained with a different preprocessing method. Now we put everything together into learner and dataloader constructors and train the model to see if it gets better.

CocoDataLoaders

To create a CocoDataLoaders we need to define the getter functions. Since pickle cannot serialize nested or anonymous functions, we use classes to define these getters. I did not find a better way to do this.
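
For instance, such a getter presumably looks roughly like the callable class below, which mirrors the lambda o: img2bbox[o.name][0] used in the DataBlock above:

class GetBboxAnnotation:
    "Picklable replacement for lambda o: img2bbox[o.name][0]."
    def __init__(self, img2bbox): self.img2bbox = img2bbox
    def __call__(self, o): return self.img2bbox[o.name][0]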

class GetAnnotatedImageFiles[source]

GetAnnotatedImageFiles(img2bbox)

class GetBboxAnnotation[source]

GetBboxAnnotation(img2bbox)

class GetClassAnnotation[source]

GetClassAnnotation(img2bbox)

class CocoDataLoaders[source]

CocoDataLoaders(*loaders, path='.', device=None) :: DataLoaders

Basic wrapper around several DataLoaders.

coco_source = untar_data(URLs.COCO_TINY)
dls = CocoDataLoaders.from_path(coco_source, item_tfms=None)
dls.show_batch()

DETR Learner

We put everything together into a Learner.

detr_learner[source]

detr_learner(dls, pretrained=True, bs=16)
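
Presumably this repeats the manual setup from the Learn section above; a rough sketch (how bs is used is not shown here):

def detr_learner(dls, pretrained=True, bs=16):
    "Rough sketch based on the manual Learner construction above; bs is kept only for signature parity."
    model = DETR(pretrained=pretrained, n_classes=len(dls.vocab), aux_loss=True)
    loss  = DETRLoss(th=0.0, classw=1, boxw=5, giouw=2)  # the manual setup above also called .cuda() on it
    ce    = CocoEval()
    learn = Learner(dls, model, loss, splitter=sorted_detr_trainable_params,
                    cbs=[ce], metrics=ce.metrics, opt_func=partial(Adam, decouple_wd=True))
    learn.coco_eval = ce
    return learn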

path = download_coco(force_download=False)
dls = CocoDataLoaders.from_sources(path, vocab=coco_vocab, num_workers=0)
#dls = CocoDataLoaders.from_path(coco_source, vocab=coco_vocab, num_workers=0)
learn = detr_learner(dls)

We fit using a learning rate of 1e-5 for all parameter groups instead of [1e-5, 1e-4, 1e-4] because we start with pretrained weights.

learn.fit(1, lr=[1e-5, 1e-5, 1e-5])
epoch train_loss valid_loss AP AP50 AP75 AP_small AP_medium AP_large AR1 AR10 AR100 AR_small AR_medium AR_large time
0 5.871920 7.529383 0.484681 0.586819 0.512862 0.318397 0.467615 0.566265 0.361112 0.556489 0.572340 0.394823 0.555191 0.639448 2:07:50
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
img = PILImage.create(requests.get(url, stream=True).raw)
with learn.removed_cbs(learn.coco_eval):
    out = learn.predict(img, with_input=True)
ctx = show_image(out[0])
out[1][1].show(ctx);
learn_to_save = Learner(learn.dls, learn.model, learn.loss_func)
learn_to_save.export('DETR_800x800px.pkl')
learn_inf = load_learner('DETR_800x800px.pkl')
img2 = PILImage.create(requests.get('https://api.time.com/wp-content/uploads/2015/02/cats.jpg?quality=85&w=1024&h=512&crop=1', stream=True).raw)
img2
out = learn_inf.predict(img2, with_input=True)
ctx = show_image(out[0])
out[1][1][1] = L(out[1][1][1])
lbbox = out[1][1].map(Self[learn_inf.loss_func.scores[0]>0.8])
lbbox[0] = TensorBBox(lbbox[0])
lbbox.show(ctx);