Error Self-Correcting Large Language Models for Precision Image Synthesis
My Contributions:
- Contributed to developing the error self-correcting code (Python)
- Ran image synthesis and error correction experiments (GPT models, Python)
Using advanced graphics packages requires domain expertise and experience. We demonstrate a text-to-graphics pipeline based on GPT models that is suitable for novice users. Current text-to-image methods fail on visual tasks that require precision or a procedural specification; DALL-E 2 and Stable Diffusion cannot accomplish precise design, engineering, or physical-simulation tasks. We perform such tasks correctly by turning them into programming tasks: automatically generating code that uses graphics libraries, then running the code to render images and animations. Because code generation models often produce errors on complex programs, we perform local error correction. Rather than subjectively evaluating results on a set of prompts, we generate a new multi-task benchmark of challenge tasks. We demonstrate the applicability of our approach to precise and procedural rendering, animation, and physical simulation using diverse programming languages and graphics environments.
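As a concrete illustration of the pipeline, the sketch below generates a graphics script from a text prompt, executes it, and feeds any error trace back to the model for local correction. This is a minimal sketch and not the paper's exact implementation: it assumes the OpenAI Python client, and the model name, prompt wording, and retry budget are illustrative.

```python
# Minimal sketch of a text-to-graphics loop with error self-correction.
# Assumes the OpenAI Python client; prompts and model name are illustrative.
import os
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_code(prompt: str) -> str:
    """Ask the model for a standalone graphics script (e.g. matplotlib)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Write a complete Python script that {prompt}. "
                              "Reply with code only."}],
    )
    # For brevity, stripping of markdown code fences is omitted here.
    return resp.choices[0].message.content


def run_script(code: str) -> str | None:
    """Execute the generated script; return the error trace on failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True)
    os.unlink(path)
    return None if result.returncode == 0 else result.stderr


def synthesize(prompt: str, max_rounds: int = 3) -> str:
    """Generate, run, and locally repair code until it executes cleanly."""
    code = generate_code(prompt)
    for _ in range(max_rounds):
        trace = run_script(code)
        if trace is None:
            return code  # script ran and rendered successfully
        # Feed the error trace back so the model can repair the program.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": "This script failed:\n" + code +
                                  "\n\nError trace:\n" + trace +
                                  "\n\nReturn the corrected script only."}],
        )
        code = resp.choices[0].message.content
    return code
```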
Precise graphics manipulation using GPT-4
In this work, we explored the limitations of text-to-image models by creating a new multi-task benchmark with three difficulty levels. We identified tasks on which Stable Diffusion and DALL-E 2 fail and provided an alternative in program synthesis. Since program synthesis may generate programs with errors, we performed automatic error correction by feeding the error trace back into program synthesis. Next, we evaluated the ability of program synthesis to modify working programs according to human intent: of a hundred modifications to the Cornell box scene, 91% render and 66% align with human intent. Finally, we demonstrated the advantages of program synthesis with error correction on text-to-graphics tasks that require precision, involve procedural rendering, or simulate physics. Text-to-graphics outputs can also be fed into text-to-image models for conditional image generation, and text-to-graphics makes it possible to render virtual worlds and to learn by interacting with an environment.
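The modification experiment can be pictured as a single edit request layered on the same pipeline. The sketch below is hypothetical: the prompt wording and model name are assumptions, and in practice the edited script would still pass through the error-correction loop above before rendering.

```python
# Hypothetical sketch of intent-driven program modification, in the spirit
# of the Cornell box experiment; prompt wording and model are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def modify_program(code: str, request: str) -> str:
    """Ask the model to edit a working graphics script per a human request."""
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative model name
        messages=[{
            "role": "user",
            "content": ("Here is a working graphics script:\n" + code +
                        "\n\nModify it so that: " + request +
                        "\nReturn the complete modified script only."),
        }],
    )
    return resp.choices[0].message.content


# Usage (illustrative): change one scene property, keep the rest fixed.
# new_code = modify_program(cornell_box_code, "make the left wall blue")
```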
Precision image synthesis with error correction