Recent advancements in text-to-image (T2I) models like Stable Diffusion and Imagen have significantly improved the generation of high-resolution images from text descriptions. However, these models still face challenges such as artifacts, misalignments, and low aesthetic quality. For example, an image generated from the prompt “A panda riding a motorcycle” might display two pandas with distorted features and other undesired elements.
Inspired by the success of reinforcement learning from human feedback (RLHF) for large language models, we explored whether learning from human feedback (LHF) could similarly enhance T2I models. For large language models, human feedback ranges from simple preference ratings to detailed written responses. In T2I, however, feedback has largely been limited to preference ratings, because richer feedback, such as directly correcting a problematic image, is much harder for annotators to provide.
Rich Human Feedback for T2I
We designed a process to collect rich human feedback that is specific and easy to obtain, demonstrating its feasibility and benefits. Our contributions are threefold:
- RichHF-18K Dataset: We curated and released a dataset of rich human feedback on 18,000 images generated by Stable Diffusion variants.
- RAHF Model: We trained a multimodal transformer model, Rich Automatic Human Feedback (RAHF), to predict human feedback, such as implausibility scores, artifact locations, and text misalignments.
- Model Improvements: We showed that the predicted feedback can improve image generation, with benefits extending even to models such as Muse that differ from those used to create the training data.
Dataset Collection
We selected 17,000 images from the Pick-a-Pic training dataset, ensuring a diverse range of categories and types. These were split into 16,000 training samples and 1,000 validation samples, with an additional 1,000 images drawn from the Pick-a-Pic test set serving as our test split. Annotators examined each image, marked the locations of problems, and rated the image on several dimensions using a 5-point Likert scale.
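To make the annotation format concrete, below is an illustrative sketch of what a single annotated example might contain; the field names and types are our own placeholders, not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RichHFExample:
    """One annotated example (illustrative schema; field names are
    placeholders, not the released file format)."""
    image_path: str                              # generated image from Pick-a-Pic
    prompt: str                                  # text prompt used to generate it
    artifact_points: List[Tuple[int, int]]       # marked locations of artifacts/implausibility
    misalignment_points: List[Tuple[int, int]]   # marked locations of text-image mismatch
    misaligned_words: List[str]                  # prompt words not reflected in the image
    # 5-point Likert ratings keyed by dimension, e.g. {"plausibility": 4, "alignment": 5}
    scores: dict = field(default_factory=dict)
```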
Rich Feedback Prediction
Our RAHF architecture builds on a multimodal vision-language transformer: self-attention propagates text information to the image tokens, which drive the heatmap and score predictions. The model uses one head per output type: heatmap, score, and misalignment sequence. By augmenting the prompt with a task-specific string, a single model effectively learns to produce each feedback type, as sketched below.
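As a rough illustration of this multi-head design, the following sketch shows how one fused transformer can feed a heatmap head, a score head, and a sequence head. The encoders, dimensions, and head layouts are placeholders, not the paper's exact architecture, which builds on a pretrained vision-language backbone.

```python
import torch
import torch.nn as nn

class RAHFSketch(nn.Module):
    """Minimal sketch of a RAHF-style multi-head predictor (illustrative only)."""
    def __init__(self, dim=768, vocab_size=32000):
        super().__init__()
        # Self-attention over concatenated tokens propagates text info to image tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)
        self.heatmap_head = nn.Linear(dim, 1)            # per-patch logit -> spatial heatmap
        self.score_head = nn.Linear(dim, 4)              # e.g. plausibility/alignment/aesthetics/overall
        self.misalign_head = nn.Linear(dim, vocab_size)  # token-level misalignment sequence

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (B, P, dim) patch embeddings; text_tokens: (B, T, dim)
        fused = self.fuse(torch.cat([image_tokens, text_tokens], dim=1))
        num_patches = image_tokens.size(1)
        img, txt = fused[:, :num_patches], fused[:, num_patches:]
        heatmap = self.heatmap_head(img).squeeze(-1)     # (B, P); reshape to (H', W') downstream
        scores = self.score_head(img.mean(dim=1))        # pooled image features -> scores
        misalign_logits = self.misalign_head(txt)        # (B, T, vocab)
        return heatmap, scores, misalign_logits
```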
Model Performance
Case studies of common failure modes, such as distorted human hands, showed that RAHF localizes artifacts accurately. Quantitative analysis demonstrated that RAHF outperforms baseline models, such as a fine-tuned ResNet-50, on most metrics for implausibility-heatmap prediction.
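Heatmap predictions of this kind are typically compared with pixel-wise and saliency-style metrics; the sketch below shows two common ones, mean squared error and Pearson correlation. The exact metric set behind the reported comparison is not spelled out here, so treat these as representative examples.

```python
import numpy as np

def heatmap_mse(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean squared error between predicted and ground-truth heatmaps."""
    return float(np.mean((pred - gt) ** 2))

def heatmap_cc(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Pearson correlation coefficient, a standard saliency-evaluation metric."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    g = (gt - gt.mean()) / (gt.std() + eps)
    return float(np.mean(p * g))
```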
Application of Feedback
The predicted feedback can be used to fine-tune and steer generative models. For instance, filtering Muse outputs with RAHF-predicted scores improved image quality, and using RAHF scores for classifier guidance in the Latent Diffusion model enhanced specific aspects of image generation.
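Score-based filtering amounts to best-of-n selection: sample several candidates and keep the one the reward model rates highest. A minimal sketch, assuming hypothetical `generate_fn` and `score_fn` stand-ins for a Muse sampler and the RAHF score predictor:

```python
def best_of_n(prompt, generate_fn, score_fn, n=8):
    """Sample n candidate images and keep the highest-scoring one.

    generate_fn(prompt) -> image and score_fn(image, prompt) -> float are
    hypothetical stand-ins for a Muse sampler and a RAHF quality score.
    """
    candidates = [generate_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda img: score_fn(img, prompt))
```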
Region Inpainting
We used the predicted heatmaps to mask problematic regions and applied Muse inpainting to regenerate them. This process produced images with fewer artifacts and better alignment with the text prompts.
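A minimal sketch of turning a predicted heatmap into an inpainting mask follows, with illustrative threshold and dilation settings; `muse_inpaint` in the usage comment is a hypothetical stand-in for the actual inpainting call.

```python
import numpy as np
from scipy import ndimage

def heatmap_to_mask(heatmap: np.ndarray, threshold: float = 0.5,
                    dilate_iters: int = 2) -> np.ndarray:
    """Threshold a predicted artifact heatmap into a binary inpainting mask.

    The threshold (relative to the heatmap's peak) and the dilation count are
    illustrative choices, not values from the paper.
    """
    mask = heatmap >= threshold * heatmap.max()
    # Dilate so the mask covers the whole problematic region, not just its peak.
    return ndimage.binary_dilation(mask, iterations=dilate_iters)

# Hypothetical usage:
#   mask = heatmap_to_mask(rahf_heatmap)
#   fixed_image = muse_inpaint(image, mask, prompt)
```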