Rethinking Prompt Design for Inference-time Scaling
in Text-to-visual Generation

TL;DR: Scaling visuals alone plateaus - revise prompts alongside the visuals to unlock the best results.

Subin Kim¹, Sangwoo Mo², Mamshad Nayeem Rizve³, Yiran Xu³, Difan Liu³, Jinwoo Shin¹, Tobias Hinz⁴

¹ KAIST   ² POSTECH   ³ Adobe   ⁴ Meta

Scaling Behaviors

Scaling visuals together with prompts redesigned for those scaled visuals breaks the plateau.
Simply scaling visuals with a fixed prompt quickly hits a performance ceiling - outputs keep missing parts of the prompt even as compute grows. By redesigning the prompt in response to the scaled visuals, we break through this plateau: generations improve steadily and adhere to the prompt much more closely, on both seen and unseen rewards, as compute scales.


Compute scales, but prompts must too - that’s how we get the shoelace-free shoe.
No matter how much we scale compute, a fixed prompt still fails to generate a shoe with no laces. By redesigning the prompt to explicitly address the visual patterns still missing at the new scale and to emphasize how "no laces" should be realized, we overcome this limitation and produce faithful generations.
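The contrast above can be sketched as a toy loop. This is a minimal illustration, not the authors' implementation: `toy_generate`, `toy_score`, and `toy_redesign` are hypothetical stand-ins for a generator, a reward model, and a prompt-redesign step, and a "sample" here is just the set of prompt requirements it satisfies.

```python
from typing import Callable, FrozenSet, List, Tuple

Sample = FrozenSet[str]  # toy "image": the set of prompt requirements it satisfies
REQUIRED = ("shoe", "no laces")
# Hypothetical clarifying clause that the toy generator responds to.
FIX = "The shoe has a smooth slip-on upper with no eyelets and no laces."

def toy_generate(prompt: str, seed: int) -> Sample:
    """Stand-in generator: it only realizes 'no laces' when the prompt spells it out."""
    satisfied = {"shoe"}
    if FIX in prompt:
        satisfied.add("no laces")
    return frozenset(satisfied)

def toy_score(sample: Sample) -> float:
    """Stand-in reward: fraction of required elements present in the sample."""
    return sum(req in sample for req in REQUIRED) / len(REQUIRED)

def toy_redesign(prompt: str, samples: List[Sample]) -> str:
    """Stand-in redesign: address a requirement that no sample in the round met."""
    never_met = [req for req in REQUIRED if not any(req in s for s in samples)]
    if "no laces" in never_met and FIX not in prompt:
        return prompt + " " + FIX
    return prompt

def best_of_n(prompt: str, generate: Callable, score: Callable, n: int) -> Sample:
    """Fixed-prompt scaling: draw n samples and keep the highest-scoring one."""
    return max((generate(prompt, seed) for seed in range(n)), key=score)

def scale_with_redesign(prompt: str, generate: Callable, score: Callable,
                        redesign: Callable, n_per_round: int,
                        rounds: int) -> Tuple[Sample, str]:
    """Spend the same sample budget in rounds, revising the prompt between rounds."""
    best = None
    seed = 0
    for _ in range(rounds):
        samples = [generate(prompt, seed + i) for i in range(n_per_round)]
        seed += n_per_round
        round_best = max(samples, key=score)
        if best is None or score(round_best) > score(best):
            best = round_best
        prompt = redesign(prompt, samples)  # react to what the visuals still miss
    return best, prompt

prompt = "A shoe with no laces."
print(toy_score(best_of_n(prompt, toy_generate, toy_score, n=8)))  # 0.5
best, final_prompt = scale_with_redesign(
    prompt, toy_generate, toy_score, toy_redesign, n_per_round=4, rounds=2)
print(toy_score(best))  # 1.0
```

Both calls draw eight samples in total. In this toy setup the fixed prompt stays at the same score however many samples are drawn, while the second round of the redesign loop satisfies the missing requirement.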



Scaling in Text-to-image Generation

Effect of Prompt Redesign on Flux.1-dev

"A mother teaches her two children; the one without a hat looks more frustrated."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]

"A boy looks at an aquarium with no fish."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]

"A pencil holder with more pens than pencils."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]

"A bookshelf with no books, only picture frames."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]

"In a bright bedroom, there are no yellow pillows on the bed."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]

"In a room, all the chairs are occupied except one."

[Images: Flux.1-dev | Best-of-N (Baseline) | PRIS (Ours)]


Redesigned Prompts > Standard Prompt Expansion

We compare our prompt redesign - which analyzes the scaled visuals and updates the prompt accordingly - against standard prompt expansion, which simply lengthens the original prompt. The * symbol marks results that use standard prompt expansion, where scaling starts from the expanded prompt.

"The balls on the table have a greater variety of colors than the ones on the floor."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]

"Two excited elephants to the right of a lost giraffe."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]

"A monkey with a backpack is jumping from one smaller tree to another larger tree."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]

"A farm with a barn that does not shelter any sheep."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]

"A bed without the usual cat sleeping on it."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]

"The two lay in bed, the long-haired one asleep, the short-haired one still awake."

[Images: Flux.1-dev* | Best-of-N* (Baseline) | PRIS* (Ours)]


PRIS + T2I Inference-Time Scaling Methods: Superior Results at the Same NFE

Our prompt redesign (PRIS) complements other text-to-image (T2I) inference-time scaling methods (e.g., SMC, RBF) that expand the visual search space with a fixed prompt. Combined with them, it further improves generation quality, in both prompt adherence and aesthetics, under the same budget for the number of function evaluations (NFE).
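To make "the same NFE budget" concrete, here is a rough accounting sketch. It assumes each sample costs a fixed number of denoising steps with one function evaluation per step, and it ignores extra calls for guidance or verifiers; the budget of 2000 and the 50 steps per sample are illustrative values, and both helper functions are hypothetical.

```python
from typing import List

def samples_within_budget(nfe_budget: int, steps_per_sample: int) -> int:
    """How many full samples fit in the budget (assumes 1 function evaluation per step)."""
    if nfe_budget < 0 or steps_per_sample <= 0:
        raise ValueError("budget must be >= 0 and steps_per_sample must be > 0")
    return nfe_budget // steps_per_sample

def split_across_rounds(total_samples: int, rounds: int) -> List[int]:
    """Divide the samples as evenly as possible over rounds; earlier rounds get the remainder."""
    if rounds <= 0:
        raise ValueError("rounds must be > 0")
    base, extra = divmod(total_samples, rounds)
    return [base + (1 if i < extra else 0) for i in range(rounds)]

total = samples_within_budget(nfe_budget=2000, steps_per_sample=50)
print(total)                                 # 40
print(split_across_rounds(total, rounds=3))  # [14, 13, 13]
```

Under this accounting, a method that revises the prompt between rounds and a fixed-prompt method draw the same number of samples; only the prompt each round is sampled from differs.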

"A woman in a wheelchair is taller than the boy next to her."

[Images: Best-of-N | SMC | SMC + PRIS]

"A child not building a sandcastle at the beach."

[Images: Best-of-N | SMC | SMC + PRIS]

"A kitchen with a larger quantity of milk than juice."

[Images: Best-of-N | SMC | SMC + PRIS]

"A tissue pack shows two cartoon characters: one in a red dress on the left, one without on the right."

[Images: Best-of-N | SMC | SMC + PRIS]

"Four cupcakes with sprinkles on a plate with two forks."

[Images: Best-of-N | SMC | SMC + PRIS]

"In an early morning park, a man in a grey and white tracksuit is not running."

[Images: Best-of-N | SMC | SMC + PRIS]

"A teddy dog and a Persian cat watch a burning table, with the teddy dog at a farther distance."

[Images: Best-of-N | RBF | RBF + PRIS]

"Four roses in a clear glass vase, all of which are red, and all of which are not open."

[Images: Best-of-N | RBF | RBF + PRIS]

"A clock with no hands to tell the time."

[Images: Best-of-N | RBF | RBF + PRIS]

"A shoe rack without any red pairs of shoes on it."

[Images: Best-of-N | RBF | RBF + PRIS]

"There is a large fish aquarium in the center of the luxurious living room, but there are no fish in it."

[Images: Best-of-N | RBF | RBF + PRIS]

"Two frogs on a lotus leaf in a pond, and the one who is drinking is in front of the one who is not."

[Images: Best-of-N | RBF | RBF + PRIS]


Common Failure Patterns Matter More Than Per-Sample Edits

We compare PRIS with ReflectionFlow, which uses a trained reflection model to iteratively edit each generated sample. PRIS achieves consistently better alignment by refining the prompt itself based on failure patterns shared across samples, rather than performing per-sample post-hoc corrections. This advantage holds even under a significantly lower compute budget: ReflectionFlow uses 3840 NFEs, while PRIS requires only 2000 NFEs (about 52% of ReflectionFlow's budget).
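The idea of acting on shared failure patterns, rather than on each sample separately, can be sketched as follows. This is an illustrative assumption about how such patterns might be aggregated, not the authors' procedure: `common_failures` is a hypothetical helper, the per-sample checks are made-up verifier outputs, and the 0.5 threshold is arbitrary.

```python
from collections import Counter
from typing import Dict, List

def common_failures(checks: List[Dict[str, bool]], min_fraction: float = 0.5) -> List[str]:
    """Return requirements that fail in at least min_fraction of the samples,
    ordered from most to least frequently failed."""
    if not checks:
        return []
    failures = Counter(
        req for sample in checks for req, ok in sample.items() if not ok
    )
    cutoff = min_fraction * len(checks)
    frequent = [req for req, count in failures.items() if count >= cutoff]
    return sorted(frequent, key=lambda req: (-failures[req], req))

# Made-up verifier results for four samples of
# "A pencil holder with more pens than pencils."
checks = [
    {"pencil holder": True,  "more pens than pencils": False},
    {"pencil holder": True,  "more pens than pencils": False},
    {"pencil holder": False, "more pens than pencils": False},
    {"pencil holder": True,  "more pens than pencils": True},
]
print(common_failures(checks))  # ['more pens than pencils']
```

In this example the one-off miss ("pencil holder", failed in one of four samples) is left alone, while the requirement that most samples fail is the one a prompt revision would target.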

"A pencil holder with more pens than pencils."

[Images: ReflectionFlow | PRIS]

"Some balls on the table have a greater variety of colors than those on the floor."

[Images: ReflectionFlow | PRIS]

"A dog in a blue jumper sits next to a Christmas tree decorated with nine stars."

[Images: ReflectionFlow | PRIS]

"Eight yellow rubber ducks lined up on the edge of a bathtub."

[Images: ReflectionFlow | PRIS]

"A kitchen with every cupboard bare."

[Images: ReflectionFlow | PRIS]


Scaling in Text-to-video Generation

Prompt Redesign on Wan2.1-1.3B: Better Prompt Adherence

"Garden, pan-left."

Best-of-N
PRIS (Ours)

"The glass car window changed into a wooden car window."

Best-of-N
PRIS (Ours)


Prompt Redesign on Wan2.1-14B: Better Prompt Adherence

"A person is working on a project, and then suddenly starts cooking dinner."

Best-of-N
PRIS (Ours)

"A car changes from black to white."

Best-of-N
PRIS (Ours)

"A person is turning on the desk lamp."

Best-of-N
PRIS (Ours)

"A person is breaking a chocolate bar into pieces."

Best-of-N
PRIS (Ours)

"The moon changes from silver to yellow."

Best-of-N
PRIS (Ours)


PRIS + T2V Inference-Time Scaling Methods: Superior Results at the Same NFE

Our prompt redesign (PRIS) complements other text-to-video (T2V) inference-time scaling methods (e.g., EvoSearch), boosting prompt adherence and overall generation quality under the same NFE budget.

"A butterfly's wing changes from yellow to white."

EvoSearch
EvoSearch + PRIS

"A person is opening the window."

EvoSearch
EvoSearch + PRIS