Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872.
torch.cuda.set_device(2)
We are going to need the coco_vocab in its original order to use the DETR pretrained models.
coco_source = download_coco()
img2bbox = {}
for ann_file in [coco_source['val_ann'], coco_source['train_ann']]:
images, lbl_bbox = get_annotations(ann_file)
img2bbox = merge(img2bbox, dict(zip(images, lbl_bbox)))
vocab = L(coco_vocab) + '#na#'
We have to change bb_pad so that it does nothing when there are no bounding boxes and classes; this is needed when we are decoding.
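One possible way to express that guard, sketched here under the assumption that fastai's bb_pad(samples, pad_idx=0) is the function being wrapped (the exported version may differ in detail):
from fastai.vision.all import bb_pad as _bb_pad_orig

def bb_pad(samples, pad_idx=0):
    "No-op when a sample carries no boxes/labels (e.g. while decoding), otherwise fastai's padding."
    if not all(len(s) >= 3 and len(s[1]) for s in samples): return samples
    return _bb_pad_orig(samples, pad_idx=pad_idx)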
BBoxBlock.dls_kwargs = {'before_batch': partial(bb_pad, pad_idx=len(vocab)-1)}
We create a ParentSplitter, which splits files into training and validation sets based on their parent folder; this is how the COCO dataset is structured.
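A minimal sketch of such a splitter (the notebook exports its own version; this is just the idea):
def ParentSplitter(train_name='train', valid_name='valid'):
    "Split items into (train, valid) index lists by the name of their immediate parent folder."
    def _inner(items):
        train = [i for i, o in enumerate(items) if Path(o).parent.name == train_name]
        valid = [i for i, o in enumerate(items) if Path(o).parent.name == valid_name]
        return train, valid
    return _inner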
coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock(vocab=list(vocab), add_na=False)),
get_items=compose(get_image_files, partial(filter, lambda x:x.name in img2bbox), L),
splitter=ParentSplitter(train_name='train2017', valid_name='val2017'),
get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]],
item_tfms=Resize(800),
batch_tfms=None,
n_inp=1)
dls = coco.dataloaders(coco_source['base'], bs=16, num_workers=0)
b = dls.one_batch()
dls.show_batch(b, figsize=(10, 10))
Bounding boxes come in two flavors:
- xyxy, where the four coordinates are the top-left and bottom-right corners: [left, top, right, bottom]
- cxcywh, where the first two elements are the xy coordinates of the center and the third and fourth elements are the width and height of the box.
box_cxcywh_to_xyxy and box_xyxy_to_cxcywh allow you to convert from one format to the other.
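For reference, the two converters are tiny; they follow the helpers in util/box_ops.py of the DETR repository and look roughly like this:
def box_cxcywh_to_xyxy(x):
    "From (center_x, center_y, width, height) to (left, top, right, bottom)."
    x_c, y_c, w, h = x.unbind(-1)
    return torch.stack([x_c - 0.5*w, y_c - 0.5*h, x_c + 0.5*w, y_c + 0.5*h], dim=-1)

def box_xyxy_to_cxcywh(x):
    "From (left, top, right, bottom) to (center_x, center_y, width, height)."
    x0, y0, x1, y1 = x.unbind(-1)
    return torch.stack([(x0 + x1)/2, (y0 + y1)/2, x1 - x0, y1 - y0], dim=-1)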
x = torch.rand(4, 7, 4)
y = box_cxcywh_to_xyxy(x)
test_eq(y.shape, x.shape)
Also, FastAI uses normalized coordinates from -1 to 1 (with its origin at the center of the image), but the DETR model output uses coordinates normalized from 0 to 1, with origin at the top left corner.
To handle all these options we use three classes (I did not need the fourth combination):
- TensorBBox: xyxy boxes centered at the origin, scaled from -1 to 1
- TensorBBoxWH: cxcywh boxes with origin in the top-left corner, scaled from 0 to 1
- TensorBBoxTL: xyxy boxes with origin in the top-left corner, scaled from 0 to 1
# "Basic type for a tensor of bounding boxes in an image"
# @classmethod
# def create(cls, x, img_size=None)->None: return cls(tensor(x).view(-1, 4).float(), img_size=img_size)
bbox = b[1]
bboxwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=bbox.img_size)
test_eq_type(ToWH(bbox), bboxwh)
test_close(bbox, ToXYXY(bboxwh))
test_eq(type(bbox), type(ToXYXY(bboxwh)))
bwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=b[1].img_size)
test_eq(ToXYXY(b[1]), b[1])
test_ne(ToWH(b[1]), b[1])
test_eq_type(ToWH(b[1]), bwh)
test_close(ToXYXY(bwh), b[1])
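The tests above pin down the behaviour of the two converters: ToWH takes fastai's -1 to 1 xyxy boxes to DETR's 0 to 1 cxcywh boxes, and ToXYXY goes back (and is the identity on a box that is already a TensorBBox). A rough sketch of that mapping, not the exported code:
def to_wh_sketch(b: TensorBBox) -> TensorBBoxWH:
    # rescale -1..1 to 0..1, then switch from corner format to center/size format
    return TensorBBoxWH(box_xyxy_to_cxcywh(b * 0.5 + 0.5), img_size=b.img_size)

def to_xyxy_sketch(b: TensorBBoxWH) -> TensorBBox:
    # switch back to corner format, then rescale 0..1 to -1..1
    return TensorBBox(box_cxcywh_to_xyxy(b) * 2 - 1, img_size=b.img_size)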
The generalized_box_iou function is a copy of the original DETR implementation. This implementation can handle batched boxes and two types of comparisons: element-wise or pairwise.
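As a reminder of what the function computes, here is the generalized IoU of a single pair of xyxy boxes, written out as a sketch (the batched, pairwise version used below is the one exported): IoU minus the fraction of the smallest enclosing box that is covered by neither box.
def giou_pair(a, b):
    "Generalized IoU between two boxes a, b given as [left, top, right, bottom] tensors."
    lt, rb = torch.max(a[:2], b[:2]), torch.min(a[2:], b[2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[0] * wh[1]
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    iou = inter / union
    lt_c, rb_c = torch.min(a[:2], b[:2]), torch.max(a[2:], b[2:])
    area_c = (rb_c - lt_c).prod()
    return iou - (area_c - union) / area_c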
b1 = box_cxcywh_to_xyxy(torch.rand(2, 5, 4))
b2 = box_cxcywh_to_xyxy(torch.rand(2, 3, 4))
giou = generalized_box_iou(b1, b2, pairwise=True)
test_eq(giou.shape, torch.Size((b1.shape[0], b1.shape[1], b2.shape[1])))
x = tensor([[[-.5, -.5, 0, 0], [-.25, -.25, .25, .25],[0, 0, .5, .5], [.25, .25, .75, .75]]])
y = tensor([[[-.5, -.5, .5, .5]]])-0.5
l_iou = -generalized_box_iou(x*0.5+0.5, x*0.5+0.5, pairwise=True)
test(l_iou[0,0,:-1], l_iou[0,0, 1:], all_op(lt))
This is a reimplementation of the original code. Here we put together the matcher and the criterion into the DETRLoss Module.
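The heart of the criterion is the bipartite matching between the 100 predicted queries and the ground-truth boxes of an image. A condensed, single-image sketch of that step, assuming boxes already in xyxy format and using the same cost weights as the original (classw for the classification term, boxw for the L1 term, giouw for the GIoU term):
from scipy.optimize import linear_sum_assignment

def match_queries(pred_boxes, pred_logits, tgt_boxes, tgt_labels, classw=1, boxw=5, giouw=2):
    "Return (query_idx, target_idx) pairs that minimise the combined matching cost."
    prob = pred_logits.softmax(-1)                               # (n_queries, n_classes)
    cost_class = -prob[:, tgt_labels]                            # (n_queries, n_targets)
    cost_box = torch.cdist(pred_boxes, tgt_boxes, p=1)           # L1 distance between box vectors
    cost_giou = -generalized_box_iou(pred_boxes[None], tgt_boxes[None], pairwise=True)[0]
    C = classw*cost_class + boxw*cost_box + giouw*cost_giou
    return linear_sum_assignment(C.detach().cpu().numpy())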
rout = [torch.rand(bs, 100, 4), torch.rand(bs, 100, c)]
rout[0] = box_cxcywh_to_xyxy(rout[0])
random_loss = loss(rout + [None], *b[1:])
test(good_loss, random_loss, lt)
We do not reimplement the model; we only wrap it in a Module that sets it up and casts the outputs.
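Roughly what such a wrapper could look like, as a sketch (the notebook's DETR class also handles n_classes and aux_loss; here we assume the pretrained network is fetched from torch.hub):
class DETRWrapper(Module):
    "Hypothetical minimal wrapper: load the hub model and cast its dict output to a tuple."
    def __init__(self, pretrained=True):
        self.model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=pretrained)
    def forward(self, x):
        out = self.model(x)
        # the real wrapper also casts these to the notebook's tensor types / box format
        return out['pred_boxes'], out['pred_logits'], out.get('aux_outputs')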
dls.show_batch(b)
model = DETR(pretrained=False).eval();
model.eval()
img = b[0]
bs = b[0].shape[0]
out_val = model(img)
test_eq(len(out_val), 3)
test_eq(out_val[0].shape, torch.Size((bs, 100, 4)))
test_eq(out_val[1].shape, torch.Size((bs, 100, 92)))
model = DETR(pretrained=True).eval();
model.eval()
img = b[0]
out_val = model(to_cpu(img))
loss = DETRLoss()
out_val_act = loss.decodes(loss.activation(out_val))
bb = (img, out_val_act[0], out_val_act[1])
dls.show_batch(bb)
model = DETR(pretrained=True, aux_loss=True)
model.train()
out = model(to_cpu(img))
The CocoEval Callback gathers annotations and predictions during validation and computes the detection metrics using the pycocotools package.
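The scoring itself is delegated entirely to pycocotools; the callback essentially accumulates detections in COCO format during validation and then runs something like this (a sketch of the idea, not the exported class):
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def coco_metrics(ann_file, detections):
    "detections: list of dicts with image_id, category_id, bbox ([x, y, w, h] in pixels) and score."
    coco_gt = COCO(ann_file)
    coco_dt = coco_gt.loadRes(detections)
    ev = COCOeval(coco_gt, coco_dt, iouType='bbox')
    ev.evaluate(); ev.accumulate(); ev.summarize()
    return ev.stats   # the AP/AR summary the callback exposes as metrics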
model = DETR(pretrained=True, n_classes=len(dls.vocab), aux_loss=True)
loss = DETRLoss(th=0.0, classw=1, boxw=5, giouw=2).cuda()
ce = CocoEval()
learn = Learner(dls, model, loss, splitter=sorted_detr_trainable_params,
cbs=[ce],
metrics=ce.metrics,
opt_func=partial(Adam, decouple_wd=True))
learn.coco_eval = ce
#with learn.added_cbs(RunNBatches(no_valid=False)):
val_output = learn.validate()
pd.DataFrame([val_output], columns=L('valid_loss')+learn.metrics.attrgot('name'))
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
url = 'https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fbarkpost.com%2Fwp-content%2Fuploads%2F2014%2F06%2FDOG-2-superJumbo.jpg&f=1&nofb=1'
with learn.removed_cbs(learn.coco_eval):
    img = PILImage.create(requests.get(url, stream=True).raw)
    out = learn.predict(img, with_input=True)
    ctx = show_image(out[0])
    out[1][1].show(ctx);
The prediction is not perfect. I believe this is because we are resizing our images to 800x800 while the DETR model was trained with a different preprocessing method. Now we put everything together into Learner and DataLoaders constructors and train the model to see if it gets better.
To create a CocoDataLoaders we need to define the getter functions. Since pickle cannot serialize nested or anonymous functions, we use classes to define these getters. I did not find a better way to do this.
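For example, a getter for the boxes can be a small class with __call__ instead of a lambda; an instance of it pickles fine (names here are illustrative, not the exported ones):
class BBoxGetter:
    "Picklable replacement for `lambda o: img2bbox[o.name][idx]`."
    def __init__(self, img2bbox, idx): self.img2bbox, self.idx = img2bbox, idx
    def __call__(self, o): return self.img2bbox[o.name][self.idx]

# get_y=[BBoxGetter(img2bbox, 0), BBoxGetter(img2bbox, 1)] would replace the lambdas used earlier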
coco_source = untar_data(URLs.COCO_TINY)
dls = CocoDataLoaders.from_path(coco_source, item_tfms=None)
dls.show_batch()
We put everything together into a Learner.
path = download_coco(force_download=False)
dls = CocoDataLoaders.from_sources(path, vocab=coco_vocab, num_workers=0)
#dls = CocoDataLoaders.from_path(coco_source, vocab=coco_vocab, num_workers=0)
learn = detr_learner(dls)
We fit with a learning rate of 1e-5 for every parameter group instead of [1e-5, 1e-4, 1e-4] because we start with pretrained weights.
learn.fit(1, lr=[1e-5, 1e-5, 1e-5])
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
img = PILImage.create(requests.get(url, stream=True).raw)
with learn.removed_cbs(learn.coco_eval):
    out = learn.predict(img, with_input=True)
    ctx = show_image(out[0])
    out[1][1].show(ctx);
learn_to_save = Learner(learn.dls, learn.model, learn.loss_func)
learn_to_save.export('DETR_800x800px.pkl')
learn_inf = load_learner('DETR_800x800px.pkl')
img2 = PILImage.create(requests.get('https://api.time.com/wp-content/uploads/2015/02/cats.jpg?quality=85&w=1024&h=512&crop=1', stream=True).raw)
img2
out = learn_inf.predict(img2, with_input=True)
ctx = show_image(out[0])
out[1][1][1] = L(out[1][1][1])
lbbox = out[1][1].map(Self[learn_inf.loss_func.scores[0]>0.8])
lbbox[0] = TensorBBox(lbbox[0])
lbbox.show(ctx);