Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Bar-Ilan University1 Allen Institute for AI2 NVIDIA Research3

Teaser figure: three failure modes of attribute binding, each panel comparing SynGen (ours) with Stable Diffusion. (a) Semantic leak in prompt: "a pink sunflower and a yellow flamingo". (b) Semantic leak out of prompt: "a checkered bowl in a cluttered room". (c) Attribute neglect: "a horned lion and a spotted monkey".

Abstract

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between the linguistic binding of entities and modifiers in the prompt and the visual binding of the corresponding elements in the generated image. As one notable example, the query "a pink sunflower and a yellow flamingo" may incorrectly produce an image of a pink flamingo and a yellow sunflower. To remedy this issue, we propose SynGen, an approach that first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining the model. Human evaluation on three datasets, including one new and challenging set, demonstrates significant improvements of SynGen compared with current state-of-the-art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
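The loss described in the abstract can be written schematically as follows. This is a sketch under the assumption that each token's cross-attention map is normalized into a spatial distribution A_w; the exact distance and pair construction used in the paper may differ:

```latex
% P: syntactic (modifier, entity-noun) pairs extracted from the prompt;
% U: pairs of words that are NOT syntactically bound to each other;
% d: a symmetric distance between attention distributions (e.g. symmetric KL).
\mathcal{L} \;=\; \sum_{(m,\,n)\in P} d\!\left(A_m, A_n\right)
\;-\; \sum_{(m,\,n')\in U} d\!\left(A_m, A_{n'}\right)
```

Minimizing this loss pulls each modifier's attention map toward the map of the noun it modifies (first term) while pushing it away from the maps of unrelated words (second term).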

Challenges

SynGen

The workflow figure of SynGen
The SynGen workflow and architecture. (a) The text prompt is analyzed to extract entity-nouns and their modifiers. (b) SynGen adds intermediate steps to the diffusion denoising process. In these steps, we update the latent representation to minimize a loss over the cross-attention maps of entity-nouns and their modifiers.
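A minimal sketch (not the official SynGen code) of the loss in step (b). Here cross-attention maps are treated as spatial probability distributions, and a symmetric-KL distance pulls each modifier's map toward its entity-noun's map while pushing it away from maps of unrelated words. The distance choice and the explicit positive/negative pair lists are illustrative assumptions:

```python
import numpy as np

def to_dist(attn):
    """Flatten an HxW cross-attention map into a probability distribution."""
    p = attn.ravel().astype(np.float64)
    return p / p.sum()

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def syngen_loss(attn_maps, pos_pairs, neg_pairs):
    """attn_maps: {word: HxW array}.
    pos_pairs: syntactically bound (modifier, noun) pairs, e.g. ('pink', 'sunflower').
    neg_pairs: word pairs that should NOT share attention, e.g. ('pink', 'flamingo').
    """
    dists = {w: to_dist(a) for w, a in attn_maps.items()}
    # Bound pairs should overlap: minimize their distance.
    pos = sum(sym_kl(dists[a], dists[b]) for a, b in pos_pairs)
    # Unbound pairs should separate: maximize their distance (negative sign).
    neg = sum(sym_kl(dists[a], dists[b]) for a, b in neg_pairs)
    return pos - neg
```

At inference time, the gradient of this loss with respect to the latent would drive the update in step (b); retraining the diffusion model is not needed.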
Evolution of cross-attention maps
Evolution of cross-attention maps and the latent representation along denoising steps, for the prompt "a red crown and a golden strawberry". At first, the attention maps of all modifiers and entity-nouns are intertwined, regardless of the expected binding. During denoising, the attention maps gradually become separated, adhering to the syntactic bindings. The vertical line indicates that the intervention stops after 25 steps, but the attention maps remain separated.
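The intervention schedule in the figure can be sketched as a loop that applies the latent update only during the early denoising steps (here the first 25 of 50), then continues with standard denoising. `syngen_update` and `denoise_step` are hypothetical placeholders, not the actual SynGen API:

```python
def run_denoising(latent, num_steps=50, intervention_steps=25):
    """Denoise `latent`, applying the SynGen-style update only early on."""
    interventions = 0
    for t in range(num_steps):
        if t < intervention_steps:
            # Gradient step on the cross-attention loss (placeholder).
            latent = syngen_update(latent)
            interventions += 1
        # Regular diffusion denoising step (placeholder).
        latent = denoise_step(latent, t)
    return latent, interventions

def syngen_update(latent):
    # Placeholder: would minimize the attention-map loss w.r.t. the latent.
    return latent

def denoise_step(latent, t):
    # Placeholder: one scheduler step of the diffusion model.
    return latent
```

As the figure shows, the separation induced during the first 25 steps persists after the intervention stops.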

BibTeX

@article{rassin2024linguistic,
  title={Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment},
  author={Rassin, Royi and Hirsch, Eran and Glickman, Daniel and Ravfogel, Shauli and Goldberg, Yoav and Chechik, Gal},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}