US12561850B2 · Active · Utility · PatentIndex 62

Image generation with legible scene text

Assignee: ADOBE INC · Priority: Aug 14, 2023 · Filed: Aug 14, 2023 · Granted: Feb 24, 2026
Est. expiry: Aug 14, 2043 (~17.1 yrs left) · nominal 20-yr term from priority
Inventors: JINDAL NIPUN · GETLIN BRENT · BRDICZKA OLIVER
Classifications: G06T 11/60 · G06V 30/10 · G06F 40/103 · G06F 40/126 · G06T 3/4053 · G06V 10/82 · G06T 11/00
PatentIndex Score: 62 · Cited by: 1 · References: 12 · Claims: 17

Abstract

Systems and methods for generating images with legible scene text are described. Embodiments are configured to obtain a prompt describing a scene, where the prompt includes scene text indicating text that is intended to be shown in a generated image; encode, using a prompt encoder, the prompt to generate a prompt embedding; encode, using a character-level encoder, the scene text to generate a character-level embedding; and generate, using an image generation network, an image that includes the scene text based on the prompt embedding and the character-level embedding.
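The data flow in the abstract can be illustrated with a minimal sketch. Everything here is a toy stand-in and an assumption, not the patented implementation: the quote-based extraction, the word-level "prompt encoder", and the code-point "character-level encoder" are hypothetical placeholders for learned components, and no image generation network is included. The sketch only shows that the full prompt and the scene text are encoded separately and then passed on together as conditioning.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Conditioning:
    """Hypothetical container for the two conditioning signals."""
    prompt_tokens: list[str]  # stand-in for the prompt embedding
    char_ids: list[int]       # stand-in for the character-level embedding


def extract_scene_text(prompt: str) -> str:
    """Assumption: scene text is the first double-quoted span in the prompt."""
    match = re.search(r'"([^"]+)"', prompt)
    return match.group(1) if match else ""


def encode_prompt(prompt: str) -> list[str]:
    """Toy word-level encoder: lowercase word tokens of the whole prompt."""
    return re.findall(r"[a-z0-9']+", prompt.lower())


def encode_characters(scene_text: str) -> list[int]:
    """Toy character-level encoder: one Unicode code point per character."""
    return [ord(ch) for ch in scene_text]


def build_conditioning(prompt: str) -> Conditioning:
    """Encode the prompt and its scene text separately, then bundle them."""
    scene_text = extract_scene_text(prompt)
    return Conditioning(encode_prompt(prompt), encode_characters(scene_text))


cond = build_conditioning('A neon sign that says "OPEN" above a diner door')
print(cond.prompt_tokens)
print(cond.char_ids)
```

The point of the separate character-level path is that each character of the scene text keeps its own identity, whereas a word-level or subword tokenizer may merge "OPEN" into a single unit that carries no spelling information.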

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
1. A method comprising:
    obtaining a prompt describing a scene, wherein the prompt includes scene text and a scene element;
    extracting scene text from the prompt using a text decomposer;
    encoding, using a prompt encoder, the prompt to generate a prompt embedding representing the scene element;
    encoding, using a character-level encoder, the scene text to generate a character-level embedding representing the scene text; and
    generating, using an image generation network, an image based on the prompt embedding and the character-level embedding, wherein the image depicts the scene including the scene text and the scene element.
     
     
2. The method of claim 1, wherein extracting the scene text comprises:
    prompting the text decomposer to identify the scene text in the prompt; and
    wrapping the identified scene text in one or more non-character tokens to obtain a formatted scene text; and
    encoding the formatted scene text using the character-level encoder.
     
     
3. The method of claim 2, wherein extracting the scene text further comprises identifying a typographic property of the scene text based on the prompt, wherein the character-level embedding is based on the typographic property.
     
     
4. The method of claim 2, wherein extracting the scene text further comprises identifying a position of the scene text based on the prompt, wherein the character-level embedding is based on the position.
     
     
5. The method of claim 2, wherein extracting the scene text further comprises identifying an additional scene text, and wherein the method further comprises generating an additional character-level embedding based on the additional scene text, wherein the image is generated based on the additional character-level embedding.
     
     
6. The method of claim 1, wherein the generation further comprises:
    providing the prompt embedding and the character-level embedding as a conditioning input to the image generation network.
     
     
7. The method of claim 1, wherein the generation further comprises generating a base image using the image generation network and enhancing a resolution of the base image using a super-resolution network based on the prompt embedding and the character-level embedding to obtain the image.
     
     
8. The method of claim 1, wherein the generation further comprises:
    obtaining a noise image; and
    performing an iterative reverse diffusion process on the noise image using the image generation network to obtain the image.
     
     
9. A method comprising:
    obtaining training data including a prompt describing a scene, wherein the prompt includes scene text and a scene element;
    extracting scene text from the prompt using a text decomposer;
    training, using the training data, an image generation network to generate an image that includes the scene text based on a prompt embedding of the prompt and a character-level embedding of the scene text, wherein the image depicts the scene including the scene text and the scene element;
    encoding, using a prompt encoder, the prompt to generate the prompt embedding representing the scene element; and
    encoding, using a character-level encoder, the scene text to generate the character-level embedding representing the scene text.
     
     
10. The method of claim 9, wherein the training further comprises generating a training output using the image generation network, performing character recognition on the training output to obtain predicted scene text, and computing a loss function based on the scene text and the predicted scene text, wherein the image generation network is trained based on the loss function.
     
     
11. The method of claim 9, wherein obtaining the training data further comprises:
    obtaining a set of images, performing character recognition on the set of images, and
    filtering the set of images based on the character recognition to obtain a set of training images, wherein the image generation network is trained based on the set of training images.
     
     
12. The method of claim 11, further comprising:
    identifying a text area for a training image in the set of training images; and
    cropping the training image based on the text area.
     
     
13. An apparatus comprising:
    at least one processor;
    at least one memory including instructions executable by the processor; and
    the apparatus further comprising:
    a text decomposer comprising parameters stored in the at least one memory and configured to extract scene text from a prompt, wherein the prompt describes a scene and includes scene text and a scene element;
    an image generation network comprising parameters stored in the at least one memory, wherein the image generation network is trained to generate an image based on a prompt embedding of the prompt and a character-level embedding of the scene text, wherein the image depicts the scene including the scene text and the scene element;
    a prompt encoder configured to encode the prompt to generate the prompt embedding representing the scene element; and
    a character-level encoder configured to encode the scene text to generate the character-level embedding.
     
     
14. The apparatus of claim 13, wherein the text decomposer is further configured to:
    wrap the identified scene text in one or more non-character tokens to obtain a formatted scene text.
     
     
15. The apparatus of claim 13, further comprising:
    an optical character recognition (OCR) component configured to extract text from training images.
     
     
16. The apparatus of claim 13, further comprising:
    a super-resolution network configured to enhance a resolution of a base image.
     
     
17. The apparatus of claim 13, wherein:
    the image generation network comprises a diffusion model.
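
Illustrative note (not part of the claims): the text-handling steps recited in claims 2, 10, and 11 can be sketched in a few lines. All specifics below are assumptions made for illustration only: the `<st>` and `</st>` marker strings, the normalized edit distance used as a mismatch score, and the minimum-length filter are hypothetical choices that the claims do not specify. OCR output is passed in as plain strings, so no real character recognition library is called, and an edit distance is not differentiable, so a trainable system would need a different loss formulation.

```python
from __future__ import annotations


def wrap_scene_text(scene_text: str,
                    open_tok: str = "<st>",
                    close_tok: str = "</st>") -> str:
    """Claim 2 idea: wrap scene text in non-character marker tokens.

    The marker strings are hypothetical placeholders.
    """
    return f"{open_tok}{scene_text}{close_tok}"


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def text_mismatch(target: str, predicted: str) -> float:
    """Claim 10 idea: score intended scene text against OCR-predicted text.

    Assumption: normalized edit distance in [0, 1]; 0.0 means an exact match.
    """
    longest = max(len(target), len(predicted))
    return edit_distance(target, predicted) / longest if longest else 0.0


def filter_by_ocr(ocr_results: dict[str, str], min_chars: int = 3) -> list[str]:
    """Claim 11 idea: keep images whose recognized text is long enough.

    Assumption: the minimum-length criterion is an invented example filter.
    """
    return [name for name, text in ocr_results.items()
            if len(text.strip()) >= min_chars]


print(wrap_scene_text("OPEN"))
print(text_mismatch("OPEN", "OPFN"))
print(filter_by_ocr({"a.png": "SALE 50%", "b.png": "", "c.png": "ok"}))
```

In this sketch a single wrong character in a four-character word yields a mismatch of 0.25, which gives a rough sense of how a character-recognition-based signal can penalize near-miss spellings that a purely visual comparison might overlook.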

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.