Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872.
torch.cuda.set_device(2)
We are going to need the coco_vocab in its original order to use the DETR pretrained models.
coco_source = download_coco()
img2bbox = {}
for ann_file in [coco_source['val_ann'], coco_source['train_ann']]:
images, lbl_bbox = get_annotations(ann_file)
img2bbox = merge(img2bbox, dict(zip(images, lbl_bbox)))
vocab = L(coco_vocab) + '#na#'
We have to change bb_pad so that it does nothing when there are no bounding boxes and classes; this is needed when we are decoding.
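One possible way to express that guard, sketched here under the assumption that fastai's bb_pad(samples, pad_idx=0) is the function being wrapped (the exported version may differ in detail):
from fastai.vision.all import bb_pad as _bb_pad_orig

def bb_pad(samples, pad_idx=0):
    "No-op when a sample carries no boxes/labels (e.g. while decoding), otherwise fastai's padding."
    if not all(len(s) >= 3 and len(s[1]) for s in samples): return samples
    return _bb_pad_orig(samples, pad_idx=pad_idx)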
BBoxBlock.dls_kwargs = {'before_batch': partial(bb_pad, pad_idx=len(vocab)-1)}
We create a ParentSplitter, which splits files into training and validation sets based on their parent folder; this is how the COCO dataset is structured.
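A minimal sketch of such a splitter (the notebook exports its own version; this is just the idea):
def ParentSplitter(train_name='train', valid_name='valid'):
    "Split items into (train, valid) index lists by the name of their immediate parent folder."
    def _inner(items):
        train = [i for i, o in enumerate(items) if Path(o).parent.name == train_name]
        valid = [i for i, o in enumerate(items) if Path(o).parent.name == valid_name]
        return train, valid
    return _inner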
coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock(vocab=list(vocab), add_na=False)),
get_items=compose(get_image_files, partial(filter, lambda x:x.name in img2bbox), L),
splitter=ParentSplitter(train_name='train2017', valid_name='val2017'),
get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]],
item_tfms=Resize(800),
batch_tfms=None,
n_inp=1)
dls = coco.dataloaders(coco_source['base'], bs=16, num_workers=0)
b = dls.one_batch()
dls.show_batch(b, figsize=(10, 10))
Bounding boxes come in two flavors:
- xyxy, where the four coordinates are the top-left and bottom-right corners: [left, top, right, bottom]
- cxcywh, where the first two elements are the xy coordinates of the center and the third and fourth elements are the width and height of the box.
box_cxcywh_to_xyxy and box_xyxy_to_cxcywh allow you to convert from one format to the other.
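For reference, the two converters are tiny; they follow the helpers in util/box_ops.py of the DETR repository and look roughly like this:
def box_cxcywh_to_xyxy(x):
    "From (center_x, center_y, width, height) to (left, top, right, bottom)."
    x_c, y_c, w, h = x.unbind(-1)
    return torch.stack([x_c - 0.5*w, y_c - 0.5*h, x_c + 0.5*w, y_c + 0.5*h], dim=-1)

def box_xyxy_to_cxcywh(x):
    "From (left, top, right, bottom) to (center_x, center_y, width, height)."
    x0, y0, x1, y1 = x.unbind(-1)
    return torch.stack([(x0 + x1)/2, (y0 + y1)/2, x1 - x0, y1 - y0], dim=-1)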
x = torch.rand(4, 7, 4)
y = box_cxcywh_to_xyxy(x)
test_eq(y.shape, x.shape)
Also, FastAI uses normalized coordinates from -1 to 1 (with its origin at the center of the image), but the DETR model output uses coordinates normalized from 0 to 1, with origin at the top left corner.
To handle all these options we use three classes (I did not need the fourth combination):
- TensorBBox: xyxy boxes centered at the origin, scaled from -1 to 1
- TensorBBoxWH: cxcywh boxes with origin in the top-left corner, scaled from 0 to 1
- TensorBBoxTL: xyxy boxes with origin in the top-left corner, scaled from 0 to 1
# "Basic type for a tensor of bounding boxes in an image"
# @classmethod
# def create(cls, x, img_size=None)->None: return cls(tensor(x).view(-1, 4).float(), img_size=img_size)
bbox = b[1]
bboxwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=bbox.img_size)
test_eq_type(ToWH(bbox), bboxwh)
test_close(bbox, ToXYXY(bboxwh))
test_eq(type(bbox), type(ToXYXY(bboxwh)))
bwh = TensorBBoxWH(box_xyxy_to_cxcywh(b[1]*0.5+0.5), img_size=b[1].img_size)
test_eq(ToXYXY(b[1]), b[1])
test_ne(ToWH(b[1]), b[1])
test_eq_type(ToWH(b[1]), bwh)
test_close(ToXYXY(bwh), b[1])
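The tests above pin down the behaviour of the two converters: ToWH takes fastai's -1 to 1 xyxy boxes to DETR's 0 to 1 cxcywh boxes, and ToXYXY goes back (and is the identity on a box that is already a TensorBBox). A rough sketch of that mapping, not the exported code:
def to_wh_sketch(b: TensorBBox) -> TensorBBoxWH:
    # rescale -1..1 to 0..1, then switch from corner format to center/size format
    return TensorBBoxWH(box_xyxy_to_cxcywh(b * 0.5 + 0.5), img_size=b.img_size)

def to_xyxy_sketch(b: TensorBBoxWH) -> TensorBBox:
    # switch back to corner format, then rescale 0..1 to -1..1
    return TensorBBox(box_cxcywh_to_xyxy(b) * 2 - 1, img_size=b.img_size)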
The generalized_box_iou function is a copy of the original DETR implementation. This implementation can handle batched boxes and two types of comparisons: element-wise or pairwise.
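As a reminder of what the function computes, here is the generalized IoU of a single pair of xyxy boxes, written out as a sketch (the batched, pairwise version used below is the one exported): IoU minus the fraction of the smallest enclosing box that is covered by neither box.
def giou_pair(a, b):
    "Generalized IoU between two boxes a, b given as [left, top, right, bottom] tensors."
    lt, rb = torch.max(a[:2], b[:2]), torch.min(a[2:], b[2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[0] * wh[1]
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    iou = inter / union
    lt_c, rb_c = torch.min(a[:2], b[:2]), torch.max(a[2:], b[2:])
    area_c = (rb_c - lt_c).prod()
    return iou - (area_c - union) / area_c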
b1 = box_cxcywh_to_xyxy(torch.rand(2, 5, 4))
b2 = box_cxcywh_to_xyxy(torch.rand(2, 3, 4))
giou = generalized_box_iou(b1, b2, pairwise=True)
test_eq(giou.shape, torch.Size((b1.shape[0], b1.shape[1], b2.shape[1])))
x = tensor([[[-.5, -.5, 0, 0], [-.25, -.25, .25, .25],[0, 0, .5, .5], [.25, .25, .75, .75]]])
y = tensor([[[-.5, -.5, .5, .5]]])-0.5
l_iou = -generalized_box_iou(x*0.5+0.5, x*0.5+0.5, pairwise=True)
test(l_iou[0,0,:-1], l_iou[0,0, 1:], all_op(lt))
This is a reimplementation of the original code. Here we put together the matcher and the criterion into the DETRLoss Module.
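The heart of the criterion is the bipartite matching between the 100 predicted queries and the ground-truth boxes of an image. A condensed, single-image sketch of that step, assuming boxes already in xyxy format and using the same cost weights as the original (classw for the classification term, boxw for the L1 term, giouw for the GIoU term):
from scipy.optimize import linear_sum_assignment

def match_queries(pred_boxes, pred_logits, tgt_boxes, tgt_labels, classw=1, boxw=5, giouw=2):
    "Return (query_idx, target_idx) pairs that minimise the combined matching cost."
    prob = pred_logits.softmax(-1)                               # (n_queries, n_classes)
    cost_class = -prob[:, tgt_labels]                            # (n_queries, n_targets)
    cost_box = torch.cdist(pred_boxes, tgt_boxes, p=1)           # L1 distance between box vectors
    cost_giou = -generalized_box_iou(pred_boxes[None], tgt_boxes[None], pairwise=True)[0]
    C = classw*cost_class + boxw*cost_box + giouw*cost_giou
    return linear_sum_assignment(C.detach().cpu().numpy())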
rout = [torch.rand(bs, 100, 4), torch.rand(bs, 100, c)]
rout[0] = box_cxcywh_to_xyxy(rout[0])
random_loss = loss(rout + [None], *b[1:])
test(good_loss, random_loss, lt)
We do not reimplement the model; we only wrap it in a Module that sets it up and casts the outputs.
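Roughly what such a wrapper could look like, as a sketch (the notebook's DETR class also handles n_classes and aux_loss; here we assume the pretrained network is fetched from torch.hub):
class DETRWrapper(Module):
    "Hypothetical minimal wrapper: load the hub model and cast its dict output to a tuple."
    def __init__(self, pretrained=True):
        self.model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=pretrained)
    def forward(self, x):
        out = self.model(x)
        # the real wrapper also casts these to the notebook's tensor types / box format
        return out['pred_boxes'], out['pred_logits'], out.get('aux_outputs')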
dls.show_batch(b)
model = DETR(pretrained=False).eval();
model.eval()
img = b[0]
bs = b[0].shape[0]
out_val = model(img)
test_eq(len(out_val), 3)
test_eq(out_val[0].shape, torch.Size((bs, 100, 4)))
test_eq(out_val[1].shape, torch.Size((bs, 100, 92)))
model = DETR(pretrained=True).eval();
model.eval()
img = b[0]
out_val = model(to_cpu(img))
loss = DETRLoss()
out_val_act = loss.decodes(loss.activation(out_val))
bb = (img, out_val_act[0], out_val_act[1])
dls.show_batch(bb)
model = DETR(pretrained=True, aux_loss=True)
model.train()
out = model(to_cpu(img))
The CocoEval Callback gathers annotations and predictions during validation and computes the detection metrics using the pycocotools package.
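The scoring itself is delegated entirely to pycocotools; the callback essentially accumulates detections in COCO format during validation and then runs something like this (a sketch of the idea, not the exported class):
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def coco_metrics(ann_file, detections):
    "detections: list of dicts with image_id, category_id, bbox ([x, y, w, h] in pixels) and score."
    coco_gt = COCO(ann_file)
    coco_dt = coco_gt.loadRes(detections)
    ev = COCOeval(coco_gt, coco_dt, iouType='bbox')
    ev.evaluate(); ev.accumulate(); ev.summarize()
    return ev.stats   # the AP/AR summary the callback exposes as metrics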
model = DETR(pretrained=True, n_classes=len(dls.vocab), aux_loss=True)
loss = DETRLoss(th=0.0, classw=1, boxw=5, giouw=2).cuda()
ce = CocoEval()
learn = Learner(dls, model, loss, splitter=sorted_detr_trainable_params,
cbs=[ce],
metrics=ce.metrics,
opt_func=partial(Adam, decouple_wd=True))
learn.coco_eval = ce
#with learn.added_cbs(RunNBatches(no_valid=False)):
val_output = learn.validate()
pd.DataFrame([val_output], columns=L('valid_loss')+learn.metrics.attrgot('name'))
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
url = 'https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fbarkpost.com%2Fwp-content%2Fuploads%2F2014%2F06%2FDOG-2-superJumbo.jpg&f=1&nofb=1'
with learn.removed_cbs(learn.coco_eval):
    img = PILImage.create(requests.get(url, stream=True).raw)
    out = learn.predict(img, with_input=True)
    ctx = show_image(out[0])
    out[1][1].show(ctx);
The prediction is not perfect. I believe this is because we are resizing our images to 800x800 while the DETR model was trained with a different preprocessing method. Now we put everything together into Learner and DataLoaders constructors and train the model to see if it gets better.
To create a CocoDataLoaders we need to define the getter functions. Since pickle cannot serialize nested or anonymous functions, we use classes to define these getters. I did not find a better way to do this.
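For example, a getter for the boxes can be a small class with __call__ instead of a lambda; an instance of it pickles fine (names here are illustrative, not the exported ones):
class BBoxGetter:
    "Picklable replacement for `lambda o: img2bbox[o.name][idx]`."
    def __init__(self, img2bbox, idx): self.img2bbox, self.idx = img2bbox, idx
    def __call__(self, o): return self.img2bbox[o.name][self.idx]

# get_y=[BBoxGetter(img2bbox, 0), BBoxGetter(img2bbox, 1)] would replace the lambdas used earlier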
coco_source = untar_data(URLs.COCO_TINY)
dls = CocoDataLoaders.from_path(coco_source, item_tfms=None)
dls.show_batch()
We put everything together into a Learner.
path = download_coco(force_download=False)
dls = CocoDataLoaders.from_sources(path, vocab=coco_vocab, num_workers=0)
#dls = CocoDataLoaders.from_path(coco_source, vocab=coco_vocab, num_workers=0)
learn = detr_learner(dls)
We fit with a learning rate of 1e-5 for every parameter group instead of [1e-5, 1e-4, 1e-4] because we start with pretrained weights.
learn.fit(1, lr=[1e-5, 1e-5, 1e-5])
with learn.removed_cbs(learn.coco_eval):
    learn.show_results(max_n=8, figsize=(10,10))
img = PILImage.create(requests.get(url, stream=True).raw)
with learn.removed_cbs(learn.coco_eval):
    out = learn.predict(img, with_input=True)
    ctx = show_image(out[0])
    out[1][1].show(ctx);
learn_to_save = Learner(learn.dls, learn.model, learn.loss_func)
learn_to_save.export('DETR_800x800px.pkl')
learn_inf = load_learner('DETR_800x800px.pkl')
img2 = PILImage.create(requests.get('https://api.time.com/wp-content/uploads/2015/02/cats.jpg?quality=85&w=1024&h=512&crop=1', stream=True).raw)
img2
out = learn_inf.predict(img2, with_input=True)
ctx = show_image(out[0])
out[1][1][1] = L(out[1][1][1])
lbbox = out[1][1].map(Self[learn_inf.loss_func.scores[0]>0.8])
lbbox[0] = TensorBBox(lbbox[0])
lbbox.show(ctx);