Scaling visuals while redesigning the prompt to match the scaled visuals → breaks the plateau.
Simply scaling visuals with a fixed prompt quickly hits a performance ceiling: outputs keep missing
parts of the prompt even as compute grows. By redesigning the prompt in response to the scaled
visuals, we break through this plateau, achieving steadily improving generations and much higher
prompt adherence on both seen and unseen rewards as compute scales.
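To make the idea concrete, the loop can be sketched roughly as follows. This is a minimal toy illustration, not the actual PRIS implementation: the names `shared_failures`, `redesign_prompt`, `scale_with_redesign`, and `toy_generate` are hypothetical, a "sample" is modeled as the set of prompt requirements it satisfies, and the rewrite rule simply appends an explicit instruction for each requirement that most samples miss. A real system would presumably use a generator, a verifier, and a language model in these roles.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical stand-in: a toy "image" is the set of requirements it satisfies.
Sample = Set[str]


def shared_failures(samples: Sequence[Sample], requirements: Sequence[str],
                    threshold: float = 0.5) -> List[str]:
    """Return requirements missed by more than `threshold` of the samples."""
    if not samples:
        return []
    misses = Counter(r for s in samples for r in requirements if r not in s)
    return [r for r in requirements if misses[r] / len(samples) > threshold]


def redesign_prompt(prompt: str, failures: Sequence[str]) -> str:
    """Toy rewrite rule (an assumption): add explicit emphasis per failure."""
    for failure in failures:
        clause = f"Make sure that: {failure}."
        if clause not in prompt:
            prompt = f"{prompt} {clause}"
    return prompt


def scale_with_redesign(prompt: str, requirements: Sequence[str],
                        generate: Callable[[str], Sample], rounds: int,
                        samples_per_round: int
                        ) -> Tuple[str, List[Tuple[str, List[str]]]]:
    """Alternate between sampling visuals and redesigning the prompt."""
    history = []
    for _ in range(rounds):
        # Scale visuals: draw several samples from the current prompt.
        samples = [generate(prompt) for _ in range(samples_per_round)]
        # Analyze the scaled visuals for failures shared across samples.
        failures = shared_failures(samples, requirements)
        history.append((prompt, failures))
        if not failures:
            break
        # Redesign the prompt to address those shared failures.
        prompt = redesign_prompt(prompt, failures)
    return prompt, history


def toy_generate(prompt: str) -> Sample:
    """Toy generator: ignores "no laces" unless the prompt emphasizes it."""
    satisfied = {"a shoe"}
    if "Make sure that: no laces." in prompt:
        satisfied.add("no laces")
    return satisfied


final_prompt, history = scale_with_redesign(
    "A shoe with no laces.", ["a shoe", "no laces"],
    toy_generate, rounds=3, samples_per_round=4)
print(final_prompt)
```

With the toy generator, sampling more from the original prompt never satisfies "no laces", whereas one round of redesign does; this mirrors the fixed-prompt plateau described above only in spirit.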
Compute scales, but prompts must too: that’s how we get the shoelace-free shoe.
No matter how far we scale compute, a fixed prompt still fails to generate a shoe with no laces.
By redesigning the prompt to explicitly address the visual patterns missing at the new scale and
to emphasize how "no laces" should be realized, we overcome this limitation and produce faithful
generations.
"A mother teaches her two children; the one without a hat looks more frustrated."
"A boy looks at an aquarium with no fish."
"A pencil holder with more pens than pencils."
"A bookshelf with no books, only picture frames."
"In a bright bedroom, there are no yellow pillows on the bed."
"In a room, all the chairs are occupied except one."
We compare our prompt redesign, which analyzes the scaled visuals and updates the prompt accordingly, against standard prompt expansion, which simply lengthens the original prompt. The * symbol denotes results obtained with standard prompt expansion, where scaling begins from the expanded prompt.
"The balls on the table have a greater variety of colors than the ones on the floor."
"Two excited elephants to the right of a lost giraffe."
"A monkey with a backpack is jumping from one smaller tree to another larger tree."
"A farm with a barn that does not shelter any sheep."
"A bed without the usual cat sleeping on it."
"The two lay in bed, the long-haired one asleep, the short-haired one still awake."
Our prompt redesign (PRIS) complements other inference-time scaling methods (e.g., SMC, RBF) that expand the visual search space with a fixed prompt, further improving generation quality, including both prompt adherence and aesthetics, under the same NFE budget.
"A woman in a wheelchair is taller than the boy next to her."
"A child not building a sandcastle at the beach."
"A kitchen with a larger quantity of milk than juice."
"A tissue pack shows two cartoon characters: one in a red dress on the left, one without on the right."
"Four cupcakes with sprinkles on a plate with two forks."
"In an early morning park, a man in a grey and white tracksuit is not running."
"A teddy dog and a Persian cat watch a burning table, with the teddy dog at a farther distance."
"Four roses in a clear glass vase, all of which are red, and all of which are not open."
"A clock with no hands to tell the time."
"A shoe rack without any red pairs of shoes on it."
"There is a large fish aquarium in the center of the luxurious living room, but there are no fish in it."
"Two frogs on a lotus leaf in a pond, and the one who is drinking is in front of the one who is not."
We compare PRIS with ReflectionFlow, which uses a trained reflection model to iteratively edit each generated sample. PRIS achieves consistently better alignment by refining the prompt itself based on shared failure patterns across samples, rather than performing per-sample post-hoc corrections. This advantage holds even under a significantly lower compute budget: ReflectionFlow uses 3840 NFEs, while PRIS requires only 2000 NFEs.
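As a quick check on the compute comparison, the snippet below works out the relative budget from the two NFE counts stated above. Only those two numbers come from the text; the variable names are illustrative.

```python
# NFE budgets as reported in the comparison above.
reflectionflow_nfes = 3840
pris_nfes = 2000

# Fraction of the ReflectionFlow budget that PRIS uses, and the saving.
ratio = pris_nfes / reflectionflow_nfes
savings = 1 - ratio

print(f"PRIS uses {ratio:.1%} of the ReflectionFlow budget "
      f"({savings:.1%} fewer NFEs)")
# prints: PRIS uses 52.1% of the ReflectionFlow budget (47.9% fewer NFEs)
```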
"A pencil holder with more pens than pencils."
"Some balls on the table have a greater variety of colors than those on the floor."
"A dog in a blue jumper sits next to a Christmas tree decorated with nine stars."
"Eight yellow rubber ducks lined up on the edge of a bathtub."
"A kitchen with every cupboard bare."
"Garden, pan-left."
"The glass car window changed into a wooden car window."
"A person is working on a project, and then suddenly starts cooking dinner."
"A car changes from black to white."
"A person is turning on the desk lamp."
"A person is breaking a chocolate bar into pieces."
"The moon changes from silver to yellow."
Our prompt redesign (PRIS) complements other text-to-video (T2V) inference-time scaling methods (e.g., EvoSearch), boosting prompt adherence and overall generation quality under the same NFE budget.
"A butterfly’s wing changes from yellow to white."
"A person is opening the window."