Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Bar-Ilan University1 Allen Institute for AI2 NVIDIA Research3

Teaser figure: three failure modes of attribute binding, each panel comparing SynGen (ours) with Stable Diffusion. (a) Semantic leak in prompt: "a pink sunflower and a yellow flamingo". (b) Semantic leak out of prompt: "a checkered bowl in a cluttered room". (c) Attribute neglect: "a horned lion and a spotted monkey".

Abstract

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image. As one notable example, the query "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a pink flamingo and a yellow sunflower. To remedy this issue, we propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
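The loss described in the abstract can be written schematically as follows. This is a sketch under the assumption that each token's cross-attention map is normalized into a spatial distribution A_w; the exact distance and pair construction used in the paper may differ:

```latex
% P: syntactic (modifier, entity-noun) pairs extracted from the prompt;
% U: pairs of words that are NOT syntactically bound to each other;
% d: a symmetric distance between attention distributions (e.g. symmetric KL).
\mathcal{L} \;=\; \sum_{(m,\,n)\in P} d\!\left(A_m, A_n\right)
\;-\; \sum_{(m,\,n')\in U} d\!\left(A_m, A_{n'}\right)
```

Minimizing this loss pulls each modifier's attention map toward the map of the noun it modifies (first term) while pushing it away from the maps of unrelated words (second term).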

Challenges

SynGen

The workflow figure of SynGen
The SynGen workflow and architecture. (a) The text prompt is analyzed to extract entity-nouns and their modifiers. (b) SynGen adds intermediate steps to the diffusion denoising process. In these steps, we update the latent representation to minimize a loss over the cross-attention maps of entity-nouns and their modifiers.
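A minimal sketch (not the official SynGen code) of the loss in step (b). Here cross-attention maps are treated as spatial probability distributions, and a symmetric-KL distance pulls each modifier's map toward its entity-noun's map while pushing it away from maps of unrelated words. The distance choice and the explicit positive/negative pair lists are illustrative assumptions:

```python
import numpy as np

def to_dist(attn):
    """Flatten an HxW cross-attention map into a probability distribution."""
    p = attn.ravel().astype(np.float64)
    return p / p.sum()

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def syngen_loss(attn_maps, pos_pairs, neg_pairs):
    """attn_maps: {word: HxW array}.
    pos_pairs: syntactically bound (modifier, noun) pairs, e.g. ('pink', 'sunflower').
    neg_pairs: word pairs that should NOT share attention, e.g. ('pink', 'flamingo').
    """
    dists = {w: to_dist(a) for w, a in attn_maps.items()}
    # Bound pairs should overlap: minimize their distance.
    pos = sum(sym_kl(dists[a], dists[b]) for a, b in pos_pairs)
    # Unbound pairs should separate: maximize their distance (negative sign).
    neg = sum(sym_kl(dists[a], dists[b]) for a, b in neg_pairs)
    return pos - neg
```

At inference time, the gradient of this loss with respect to the latent would drive the update in step (b); retraining the diffusion model is not needed.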
Evolution of cross-attention maps
Evolution of cross-attention maps and the latent representation along denoising steps, for the prompt "a red crown and a golden strawberry". At first, the attention maps of all modifiers and entity-nouns are intertwined, regardless of the expected binding. During denoising, the attention maps gradually become separated, adhering to the syntactic bindings. The vertical line indicates that the intervention stops after 25 steps, but the attention maps remain separated.
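The intervention schedule in the figure can be sketched as a loop that applies the latent update only during the early denoising steps (here the first 25 of 50), then continues with standard denoising. `syngen_update` and `denoise_step` are hypothetical placeholders, not the actual SynGen API:

```python
def run_denoising(latent, num_steps=50, intervention_steps=25):
    """Denoise `latent`, applying the SynGen-style update only early on."""
    interventions = 0
    for t in range(num_steps):
        if t < intervention_steps:
            # Gradient step on the cross-attention loss (placeholder).
            latent = syngen_update(latent)
            interventions += 1
        # Regular diffusion denoising step (placeholder).
        latent = denoise_step(latent, t)
    return latent, interventions

def syngen_update(latent):
    # Placeholder: would minimize the attention-map loss w.r.t. the latent.
    return latent

def denoise_step(latent, t):
    # Placeholder: one scheduler step of the diffusion model.
    return latent
```

As the figure shows, the separation induced during the first 25 steps persists after the intervention stops.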

BibTeX

@article{rassin2024linguistic,
  title={Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment},
  author={Rassin, Royi and Hirsch, Eran and Glickman, Daniel and Ravfogel, Shauli and Goldberg, Yoav and Chechik, Gal},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}