COCO Semantic Graph Data
Both the VSUA paper and SGAE paper make use of the same semantic (scene) graph data for the COCO captioning images. This post explains the files and the data they contain, whilst making clear how flexibly this data can be used. Data download and extraction instructions can be found here.
What Data Do We Get?
The data is designed to be supplementary to the BUTD data, so it builds on top of how that data is structured. In BUTD, each image is given a .npy file containing ResNet features of shape [X, 2048], where X refers to the number of objects detected in the image.
Below is an example of the type of .npy file provided in the SGAE/VSUA data:
import numpy as np
x = np.load('coco_img_sg/100000.npy', allow_pickle=True, encoding='latin1')
print(x)
########
array({'rela_matrix':
array([[ 0., 9., 408.],
[ 0., 14., 425.],
[ 0., 26., 425.],
[ 1., 14., 408.],
[ 1., 26., 425.],
[ 2., 22., 410.],
[ 3., 13., 429.],
[ 3., 14., 408.],
[ 5., 7., 410.],
[ 5., 26., 408.],
[ 7., 14., 408.],
[ 7., 26., 408.],
[ 13., 14., 463.],
[ 14., 26., 408.],
[ 18., 20., 408.]]),
'obj_attr':
array([[ 0., 48., 36., 126., 341., 314., 327.],
[ 1., 48., 36., 49., 341., 314., 327.],
[ 2., 262., 247., 187., 313., 309., 372.],
[ 3., 262., 187., 234., 313., 372., 315.],
[ 4., 227., 218., 96., 309., 305., 345.],
[ 5., 213., 110., 40., 313., 350., 336.],
[ 6., 98., 114., 38., 313., 332., 315.],
[ 7., 213., 110., 83., 313., 336., 350.],
[ 8., 227., 218., 255., 305., 325., 309.],
[ 9., 48., 36., 126., 314., 341., 327.],
[ 10., 287., 152., 73., 313., 380., 401.],
[ 11., 218., 287., 152., 315., 307., 309.],
[ 12., 217., 227., 278., 340., 311., 344.],
[ 13., 262., 187., 276., 313., 309., 372.],
[ 14., 262., 48., 36., 313., 341., 327.],
[ 15., 213., 74., 174., 313., 376., 311.],
[ 16., 213., 174., 74., 313., 376., 317.],
[ 17., 213., 95., 242., 313., 376., 315.],
[ 18., 73., 287., 21., 313., 340., 309.],
[ 19., 213., 174., 242., 313., 317., 336.],
[ 20., 287., 73., 21., 313., 323., 380.],
[ 21., 213., 74., 95., 313., 376., 317.],
[ 22., 262., 98., 114., 313., 323., 380.],
[ 23., 213., 174., 74., 313., 376., 317.],
[ 24., 245., 211., 34., 313., 323., 377.],
[ 25., 187., 234., 227., 313., 311., 309.],
[ 26., 110., 79., 28., 313., 327., 307.],
[ 27., 228., 217., 218., 340., 345., 344.],
[ 28., 213., 95., 242., 313., 324., 309.],
[ 29., 227., 218., 96., 309., 345., 305.],
[ 30., 213., 242., 220., 309., 313., 350.]])}, dtype=object)
In these files we get two bits of data for each image (in a single file): the relationships between the objects ('rela_matrix') and the attributes assigned to each of the objects in the image ('obj_attr').
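Note that np.load returns this dictionary wrapped in a 0-dimensional object array, so it needs unwrapping before you can index it by key. A minimal sketch, continuing from the example file above:
import numpy as np

x = np.load('coco_img_sg/100000.npy', allow_pickle=True, encoding='latin1')
sg = x[()]                        # unwrap the 0-d object array into the plain dict
rela_matrix = sg['rela_matrix']   # shape [Y, 3]
obj_attr = sg['obj_attr']         # shape [X, 7]
print(rela_matrix.shape, obj_attr.shape)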
Understanding rela_matrix
This portion of the data provides the semantic (scene) graph for the image. It has the shape [Y, 3], where Y is the number of relationships detected in the image. The format is [index_obj1, index_obj2, rela_id]. The indexes refer to the indexes of the objects in the original BUTD data file for the same image (the files are named using the COCO image ID, so matching them is relatively straightforward). The rela_id, however, maps to the coco_pred_sg_rela.npy file provided with the data. This file contains two dictionaries that map labels to numeric IDs (and vice versa), and the rela_id in the rela_matrix refers to an ID in this dictionary. In the example above, the 0th object has a relationship with the 14th object that is characterised by the ID 425, which, when we look it up in coco_pred_sg_rela.npy, we can see is “under”.
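As a rough sketch, the lookup might go as follows. The 'i2w' key name (index-to-word) is an assumption on my part; inspect the keys of the loaded dictionary in your copy of the data to see what the release actually uses.
# the dictionary file is also wrapped in a 0-d object array
rela_dict = np.load('coco_pred_sg_rela.npy', allow_pickle=True, encoding='latin1')[()]
print(rela_dict.keys())      # check the actual key names in your download

i2w = rela_dict['i2w']       # assumed name for the index-to-word mapping
print(i2w[425])              # should print 'under' for the example above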
Understanding obj_attr
From above, we know that each image has X objects and Y relationships between those objects. However, the data also includes additional information about those X objects, specifically six attributes that further describe each object.
The obj_attr entry of the VSUA/SGAE image data contains an [X, 7] array where [:, 0] refers to the index of the object in the original BUTD data file for the same image (which explains why the 0th column simply counts upwards). The remaining six elements are all attribute IDs, which can be decoded by looking them up in the coco_pred_sg_rela.npy dictionary. Going back to the example above, the 0th object has attributes [48., 36., 126., 341., 314., 327.], which map to [carpet, floor, grind, pink, brown, beige].
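Putting that together, a short sketch for decoding the attributes of a single object (again assuming the 'i2w'-style index-to-word mapping from above):
obj_attr = sg['obj_attr']                        # [X, 7] array loaded earlier
obj_row = obj_attr[0]                            # the 0th object in the example image
butd_index = int(obj_row[0])                     # index into the BUTD feature file
attr_words = [i2w[int(a)] for a in obj_row[1:]]
print(butd_index, attr_words)                    # e.g. 0 ['carpet', 'floor', 'grind', 'pink', 'brown', 'beige']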
Incorporating this data into deep learning models
Unlike the BUTD data, this data does not come with precomputed deep model features (e.g. ResNet). Although we can use the BUTD features as object nodes, there are no features for the attributes or the relationships between objects. These therefore need to be created, either with an nn.Embedding table or in some other way. How you achieve this is up to you!
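For instance, a minimal PyTorch sketch could learn one embedding vector per relationship/attribute ID. The vocabulary size and embedding dimension below are placeholders; the real vocabulary size is the number of entries in the coco_pred_sg_rela.npy dictionary.
import torch
import torch.nn as nn

VOCAB_SIZE = 500   # placeholder: number of labels in the coco_pred_sg_rela.npy dictionary
EMBED_DIM = 300    # placeholder embedding size

# one learnable vector per relationship/attribute label
label_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)

# e.g. embed the relationship IDs from the example image's rela_matrix
rela_ids = torch.as_tensor(sg['rela_matrix'][:, 2], dtype=torch.long)  # [Y]
rela_feats = label_embedding(rela_ids)                                 # [Y, EMBED_DIM]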
Graph Structure
Finally, onto graph structure! The way the data is provided means that it is very open and flexible for different graph structures. You can use all the data, or just some of it. You can incorporate relationships as edge features or decide to introduce them as a separate node type. There is no right way to do it, and not necessarily any wrong way of doing it either.
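As one illustration (not the specific approach of either paper), you could treat objects as nodes carrying their BUTD features and relationships as labelled edges. A rough sketch, where the BUTD file path is purely illustrative and depends on how your features were extracted:
rela_matrix = sg['rela_matrix']                        # [Y, 3] from the file above
edge_index = rela_matrix[:, :2].astype(np.int64).T     # [2, Y] pairs of object indexes
edge_label = rela_matrix[:, 2].astype(np.int64)        # [Y] relationship IDs

# node features would come from the matching BUTD file (shape [X, 2048]);
# uncomment and adjust the path to wherever your BUTD features live
# node_feats = np.load('path/to/butd_features/100000.npy')

print(edge_index.shape, edge_label.shape)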