US12561850B2 · Active · Utility · PatentIndex 62
Image generation with legible scene text
Est. expiry: Aug 14, 2043 (~17.1 yrs left) · nominal 20-yr term from priority
G06T 11/60 · G06V 30/10 · G06F 40/103 · G06F 40/126 · G06T 3/4053 · G06V 10/82 · G06T 11/00
PatentIndex Score: 62 · Cited by: 1 · References: 12 · Claims: 17
Abstract
Systems and methods for generating images with legible scene text are described. Embodiments are configured to obtain a prompt describing a scene, where the prompt includes scene text indicating text that is intended to be shown in a generated image; encode, using a prompt encoder, the prompt to generate a prompt embedding; encode, using a character-level encoder, the scene text to generate a character-level embedding; and generate, using an image generation network, an image that includes the scene text based on the prompt embedding and the character-level embedding.
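The data flow in the abstract (extract the scene text, keep the full prompt for the prompt encoder, and prepare a separate character-level input) can be illustrated with a small Python sketch. Everything named here is hypothetical: the `<text>` / `</text>` markers, the quoted-span heuristic standing in for the text decomposer, and the code-point-based ids standing in for a learned character-level encoder are assumptions for illustration, not the patented implementation.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical marker tokens; the source only refers to "non-character tokens".
OPEN_TOKEN = "<text>"
CLOSE_TOKEN = "</text>"


@dataclass
class Decomposed:
    prompt: str             # full prompt, destined for the prompt encoder
    scene_texts: List[str]  # raw scene text spans found in the prompt
    formatted: List[str]    # scene text wrapped in marker tokens


def decompose(prompt: str) -> Decomposed:
    """Stand-in text decomposer: treats double-quoted spans as scene text.

    Assumption: a real decomposer could be a learned model; a regex is
    used here only to keep the sketch self-contained.
    """
    scene_texts = re.findall(r'"([^"]+)"', prompt)
    formatted = [f"{OPEN_TOKEN}{t}{CLOSE_TOKEN}" for t in scene_texts]
    return Decomposed(prompt, scene_texts, formatted)


def char_level_ids(formatted: str) -> List[int]:
    """Toy character-level tokenization of one formatted scene text.

    Ids 0 and 1 are reserved for the opening and closing markers; every
    character maps to its Unicode code point plus 2.
    """
    inner = formatted[len(OPEN_TOKEN):-len(CLOSE_TOKEN)]
    return [0] + [ord(c) + 2 for c in inner] + [1]


def build_conditioning(prompt: str) -> Dict[str, object]:
    """Bundle the two inputs a downstream image generator would consume."""
    d = decompose(prompt)
    return {
        "prompt": d.prompt,
        "char_ids": [char_level_ids(f) for f in d.formatted],
    }


if __name__ == "__main__":
    print(build_conditioning('A neon sign that says "OPEN" above a diner'))
```

In a real system both inputs would pass through learned encoders, and the resulting embeddings would condition the image generation network; the sketch stops at tokenization because the source does not specify the encoder architectures.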
Claims
(Exact text as granted; not AI-modified.)
What is claimed is:
1. A method comprising:
obtaining a prompt describing a scene, wherein the prompt includes scene text and a scene element;
extracting scene text from the prompt using a text decomposer;
encoding, using a prompt encoder, the prompt to generate a prompt embedding representing the scene element;
encoding, using a character-level encoder, the scene text to generate a character-level embedding representing the scene text; and
generating, using an image generation network, an image based on the prompt embedding and the character-level embedding, wherein the image depicts the scene including the scene text and the scene element.
2. The method of claim 1, wherein extracting the scene text comprises:
prompting the text decomposer to identify the scene text in the prompt; and
wrapping the identified scene text in one or more non-character tokens to obtain a formatted scene text; and
encoding the formatted scene text using the character-level encoder.
3. The method of claim 2, wherein extracting the scene text further comprises identifying a typographic property of the scene text based on the prompt, wherein the character-level embedding is based on the typographic property.
4. The method of claim 2, wherein extracting the scene text further comprises identifying a position of the scene text based on the prompt, wherein the character-level embedding is based on the position.
5. The method of claim 2, wherein extracting the scene text further comprises identifying an additional scene text, and wherein the method further comprises generating an additional character-level embedding based on the additional scene text, wherein the image is generated based on the additional character-level embedding.
6. The method of claim 1, wherein the generation further comprises:
providing the prompt embedding and the character-level embedding as a conditioning input to the image generation network.
7. The method of claim 1, wherein the generation further comprises generating a base image using the image generation network and enhancing a resolution of the base image using a super-resolution network based on the prompt embedding and the character-level embedding to obtain the image.
8. The method of claim 1, wherein the generation further comprises:
obtaining a noise image; and
performing an iterative reverse diffusion process on the noise image using the image generation network to obtain the image.
9. A method comprising:
obtaining training data including a prompt describing a scene, wherein the prompt includes scene text and a scene element;
extracting scene text from the prompt using a text decomposer;
training, using the training data, an image generation network to generate an image that includes the scene text based on a prompt embedding of the prompt and a character-level embedding of the scene text, wherein the image depicts the scene including the scene text and the scene element;
encoding, using a prompt encoder, the prompt to generate the prompt embedding representing the scene element; and
encoding, using a character-level encoder, the scene text to generate the character-level embedding representing the scene text.
10. The method of claim 9, wherein the training further comprises generating a training output using the image generation network, performing character recognition on the training output to obtain predicted scene text, and computing a loss function based on the scene text and the predicted scene text, wherein the image generation network is trained based on the loss function.
11. The method of claim 9, wherein obtaining the training data further comprises:
obtaining a set of images, performing character recognition on the set of images, and filtering the set of images based on the character recognition to obtain a set of training images, wherein the image generation network is trained based on the set of training images.
12. The method of claim 11, further comprising:
identifying a text area for a training image in the set of training images; and
cropping the training image based on the text area.
13. An apparatus comprising:
at least one processor;
at least one memory including instructions executable by the processor; and
the apparatus further comprising:
a text decomposer comprising parameters stored in the at least one memory and configured to extract scene text from a prompt, wherein the prompt describes a scene and includes scene text and a scene element;
an image generation network comprising parameters stored in the at least one memory, wherein the image generation network is trained to generate an image based on a prompt embedding of the prompt and a character-level embedding of the scene text, wherein the image depicts the scene including the scene text and the scene element;
a prompt encoder configured to encode the prompt to generate the prompt embedding representing the scene element; and
a character-level encoder configured to encode the scene text to generate the character-level embedding.
14. The apparatus of claim 13, wherein the text decomposer is further configured to:
wrap the identified scene text in one or more non-character tokens to obtain a formatted scene text.
15. The apparatus of claim 13, further comprising:
an optical character recognition (OCR) component configured to extract text from training images.
16. The apparatus of claim 13, further comprising:
a super-resolution network configured to enhance a resolution of a base image.
17. The apparatus of claim 13, wherein:
the image generation network comprises a diffusion model.
Cited by (0)
No later patents cite this yet.
References (0)
No backward citations on record.