Fork of the Meshed Memory Transformer for Image Captioning

Overview of the Meshed Memory Transformer

The Meshed Memory Transformer (\(\mathcal{M}^2\)) proposed by Cornia et al. builds on the increasingly popular Transformer architecture. Unlike approaches that impose a predefined structure on extracted image features (a spatial graph, a semantic graph, etc.), it uses stacks of self-attention layers over the set of all image regions. The standard Transformer keys and values are extended by concatenating learnable persistent memory vectors, which allow the architecture to encode a priori knowledge, such as the fact that “eggs” and “toast” together make up the concept “breakfast”. The decoder is likewise a stack of self-attention layers, and each decoder layer is connected to every encoder layer through a gated cross-attention mechanism, giving rise to the “meshed” structure of the paper’s title. The output of the decoder block is used to generate the final caption.
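To make these two ideas concrete, below is a minimal PyTorch sketch of memory-augmented attention and of the gated, layer-wise (“meshed”) combination of cross-attention results. All hyper-parameter names (d_model, n_memory, n_enc_layers) and initialisations are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedAttention(nn.Module):
    """Self-attention over image regions whose keys and values are
    extended with learnable persistent-memory slots."""
    def __init__(self, d_model=512, n_heads=8, n_memory=40):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Persistent memory vectors, concatenated to keys and values only.
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model))
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model))

    def forward(self, x):  # x: (batch, n_regions, d_model)
        b = x.size(0)
        q = self.q_proj(x)
        k = torch.cat([self.k_proj(x), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.mem_v.expand(b, -1, -1)], dim=1)
        # Split into heads: (batch, seq, d_model) -> (batch, heads, seq, d_head)
        split = lambda t: t.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (att @ v).transpose(1, 2).reshape(b, -1, self.n_heads * self.d_head)
        return self.out_proj(out)

class MeshedGate(nn.Module):
    """Sigmoid-gated sum of the cross-attention results computed against
    each encoder layer, so every decoder layer sees all encoder layers."""
    def __init__(self, d_model=512, n_enc_layers=3):
        super().__init__()
        self.gates = nn.ModuleList(
            nn.Linear(2 * d_model, d_model) for _ in range(n_enc_layers))

    def forward(self, y, cross_outs):
        # y: decoder states (b, seq, d); cross_outs: one cross-attention
        # result per encoder layer, each (b, seq, d).
        out = torch.zeros_like(y)
        for gate, c in zip(self.gates, cross_outs):
            alpha = torch.sigmoid(gate(torch.cat([y, c], dim=-1)))
            out = out + alpha * c
        return out
```

Note that the memory slots extend only the keys and values; queries still come exclusively from the image regions, so the memories act as extra items to attend to rather than extra inputs.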

Bug Fixes

A bug in the beam search implementation has been fixed.

HPC Support

Added HPC support for the QMUL cluster. See the hpc/ directory for a set of scripts that may be of use.

Adding the SPICE Evaluation Metric

See this post for a guide on adding the SPICE evaluation metric.
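For orientation, the sketch below shows one way SPICE could be wired in alongside the repo's other metrics, assuming a coco-caption-style Spice scorer (a class exposing compute_score(gts, res)) has been placed under the evaluation package; the module path and helper name are hypothetical.

```python
# Hypothetical integration sketch: the import path assumes the
# coco-caption Spice wrapper has been copied to evaluation/spice/.
from evaluation.spice.spice import Spice  # hypothetical location

def compute_spice(gts, res):
    """gts and res map each image id to a list of caption strings
    (one candidate caption per id in res)."""
    scorer = Spice()
    # coco-caption convention: returns the corpus-level score and a
    # per-image breakdown.
    score, scores = scorer.compute_score(gts, res)
    return score, scores
```

Note that the coco-caption SPICE wrapper shells out to a Java jar, so a Java runtime must be available on the machine running evaluation.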