Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

1 POSTECH   2 RLWRLD
CVPR 2026

TL;DR: SOLACE post-trains text-to-image generators using the model's own self-confidence as a reward, with no external reward models needed. It improves compositionality, text rendering, and text-image alignment, and complements external rewards to mitigate reward hacking.

SOLACE teaser: qualitative comparison showing SD3.5 vs SD3.5 + SOLACE

Figure 1. Qualitative examples of SOLACE applied to SD3.5-Medium on the Pick-a-Pic dataset. SOLACE post-training consistently improves image quality, compositionality, and text rendering without any external reward model.

Abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training text-to-image generative models is a promising path toward better alignment with human preferences, improved factuality, and stronger aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. SOLACE converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, SOLACE delivers consistent gains over the baseline in compositional generation, text rendering, and text-image alignment. We also find that integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.

Key Contributions

  • We present SOLACE, a post-training framework that uses self-confidence as a reward for text-to-image generators, requiring no external reward models, annotators, or datasets.
  • We define the self-confidence reward by re-noising the model's own outputs and rewarding accurate recovery of the injected noise — a principled, training-aligned signal.
  • Across standard benchmarks and a comprehensive user study, SOLACE yields consistent gains in compositionality, text rendering, and text-image alignment.
  • SOLACE complements external-reward pipelines: applying SOLACE on top of externally post-trained models improves non-target capabilities while mitigating reward hacking.

Method Overview

Given a text prompt \(c\), we generate \(G\) different latents using the flow-matching model. Without decoding, we re-noise the latents using \(K\) noise probes across timesteps \(t \in \mathcal{T} \subset [0,1]\). For each generated latent \(z_0^{(i)}\), we formulate the model's self-confidence as its ability to accurately denoise the re-noised latent. This self-confidence serves as an intrinsic reward scalar, which we use to post-train the model via Flow-GRPO.
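Within Flow-GRPO, each of the \(G\) per-prompt rewards is compared against its own group rather than a global baseline. A minimal sketch of this group-relative normalization (the function name and the `1e-8` stabilizer are illustrative, not from the paper):

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: score each of the G samples drawn for the
    same prompt relative to its own group's mean and spread (sketch)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)
```

In Flow-GRPO these advantages weight the policy-gradient update; the update itself is omitted here.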

SOLACE method overview

Figure 2. Overview of SOLACE. Given a text prompt, we generate G latents, re-noise them with K shared noise probes, and measure the model's ability to recover the injected noise. This self-confidence score is used as the intrinsic reward for GRPO post-training.

The self-confidence reward is computed as:

\[ R_{\text{SOLACE}}(z_0^{(i)}, c) = \frac{1}{\sum_{t \in \mathcal{T}} w(t)} \sum_{t \in \mathcal{T}} w(t) \cdot S_{i,t} \]

where

\[ S_{i,t} = -\log\left(\text{MSE}_{i,t} + \delta\right), \quad \text{MSE}_{i,t} = \frac{1}{K}\sum_{m=1}^{K} \left\| \hat{\epsilon}_\theta(z_t^{(i,m)}, t, c) - \epsilon^{(m)} \right\|^2 \]

The negative log transform amplifies small errors into large rewards while stabilizing the dynamic range. The reward is computed entirely in latent space, avoiding the need for decoding and keeping the signal model-native.
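As a rough illustration, the reward above can be sketched in NumPy. All names are ours; `eps_pred(z_t, t)` stands in for the model's conditional noise prediction \(\hat{\epsilon}_\theta(z_t^{(i,m)}, t, c)\), and the re-noising rule assumes rectified-flow interpolation \(z_t = (1-t)\,z_0 + t\,\epsilon\), which may differ in detail from the paper's schedule:

```python
import numpy as np

def solace_reward(z0, eps_pred, timesteps, weights, K=8, delta=1e-4, seed=0):
    """Self-confidence reward for one generated latent z0 (sketch).

    For each probe timestep t, inject K noise probes, ask the model to
    recover them, and score recovery as S = -log(MSE + delta); return
    the w(t)-weighted average over timesteps, as in R_SOLACE.
    """
    rng = np.random.default_rng(seed)
    scores = []
    for t in timesteps:
        eps = rng.standard_normal((K,) + z0.shape)    # K noise probes
        z_t = (1.0 - t) * z0[None] + t * eps          # re-noise the latent
        err = eps_pred(z_t, t) - eps
        # Mean over the K probes of the squared norm, matching MSE_{i,t}.
        mse = (err.reshape(K, -1) ** 2).sum(axis=1).mean()
        scores.append(-np.log(mse + delta))           # S_{i,t}
    w = np.asarray(weights, dtype=float)
    return float((w * np.asarray(scores)).sum() / w.sum())
```

A predictor that recovers the injected noise exactly saturates the reward at \(-\log\delta\), while a poor predictor scores lower, which is the behavior the negative log transform is meant to produce.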

Key Design Choices

  • Training on selective timesteps: We optimize only a suffix of the training schedule (\(|\mathcal{T}_{\text{train}}| = \lceil \rho |\mathcal{T}| \rceil\) with \(\rho = 0.6\)) to prevent collapse from reward hacking on early timesteps.
  • Self-confidence without CFG: Although images are sampled with classifier-free guidance, self-confidence is computed without CFG to evaluate the base conditional policy rather than a guided proxy.
  • Online self-confidence: Using the model being trained (\(\pi_\theta\)) for self-confidence computation, rather than a fixed reference model, yields stronger improvements as the model's confidence becomes more reliable through training.
  • Antithetic noise pairing: We enforce exact mean zero within each noise probe set (\(\epsilon^{(m+K/2)} = -\epsilon^{(m)}\)) for variance reduction.
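Two of these choices are easy to make concrete. A minimal sketch under the stated assumptions (function names are ours, not from the paper):

```python
import math
import numpy as np

def train_timestep_suffix(timesteps, rho=0.6):
    """Selective timesteps: keep only the last ceil(rho * |T|) entries
    of the training schedule for optimization (sketch)."""
    n = math.ceil(rho * len(timesteps))
    return timesteps[len(timesteps) - n:]

def antithetic_probes(shape, K, rng):
    """Antithetic pairing: K noise probes in mirrored pairs,
    eps^(m + K/2) = -eps^(m), so the probe set has mean zero by
    construction. K must be even."""
    half = rng.standard_normal((K // 2,) + shape)
    return np.concatenate([half, -half], axis=0)
```

For a 10-step schedule with \(\rho = 0.6\), only the last 6 timesteps receive gradient updates; the antithetic construction guarantees the probes cancel in pairs, reducing reward variance at no extra sampling cost.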

Quantitative Results

We evaluate SOLACE on compositional generation (GenEval), visual text rendering (OCR), image quality (CLIP-Score, Aesthetic Score), and human preference alignment (PickScore, HPSv2, ImageReward, UnifiedReward).

                                Task-specific    Image Quality    Human Preference
Model                           GenEval   OCR    CLIP    Aesth.   Pick    HPSv2   ImgRwd   UniRwd
SD-XL                           0.55      0.14   0.287   5.60     22.42   0.280   0.76     2.93
SD3.5-L                         0.71      0.68   0.289   5.50     22.91   0.288   0.96     3.25
SD3.5-M (2.5B)                  0.65      0.61   0.282   5.36     22.34   0.279   0.84     3.08
  + SOLACE (Ours)               0.71      0.67   0.288   5.39     22.41   0.278   0.87     3.11
SD3.5-M + FlowGRPO              0.67      0.68   0.278   5.90     23.50   0.314   1.26     3.37
  + FlowGRPO + SOLACE (Ours)    0.77      0.70   0.287   5.63     22.73   0.286   1.07     3.26

Table 1. Quantitative results of SOLACE post-training. SOLACE yields consistent gains across task-specific, image quality, and human preference metrics. Applying SOLACE on top of FlowGRPO further improves compositionality and text rendering while mitigating reward hacking.

User Study

We conducted a user study with approximately 1,800 responses from 20 participants on prompts from PartiPrompts and HPSv2, evaluating visual realism/appeal and text-image alignment.

User study results

Figure 3. User study results. SOLACE post-training consistently outperforms the baseline SD3.5-M in both visual realism/appeal and text-image alignment across PartiPrompts and HPSv2 prompt sets.

Qualitative Results

SOLACE improves compositionality (correct object counts, spatial relations, attribute binding), text rendering accuracy, and overall visual quality without any external reward supervision.

Qualitative comparison

Figure 4. Qualitative results of SOLACE on DrawBench, GenEval, and OCR benchmarks. SOLACE shows consistent improvements over the baseline SD3.5 in compositionality, text rendering, and visual appeal.

SOLACE complements external rewards

Figure 5. Effect of applying SOLACE after FlowGRPO post-training with PickScore as the external reward. SOLACE complements external rewards, recovering compositionality lost during reward-targeted optimization while maintaining visual appeal.

Ablation Studies

We validate the key design choices of SOLACE through ablation studies on the number of noise probes, CFG usage during self-confidence computation, and online vs. offline self-confidence.

Configuration           GenEval   OCR    CLIP    Aesth.   Pick    HPSv2   ImgRwd   UniRwd
Number of Noise Probes K
  K = 4                 0.66      0.67   0.287   5.37     22.34   0.273   0.81     3.08
  K = 8 (Ours)          0.71      0.67   0.288   5.39     22.41   0.278   0.87     3.11
  K = 16                0.67      0.67   0.288   5.42     22.34   0.278   0.86     3.09
CFG for Self-Confidence
  With CFG              0.68      0.59   0.287   5.38     22.39   0.278   0.85     3.10
  Without CFG (Ours)    0.71      0.67   0.288   5.39     22.41   0.278   0.87     3.11
Online vs. Offline Self-Confidence
  Offline               0.69      0.61   0.285   5.36     22.36   0.274   0.82     3.07
  Online (Ours)         0.71      0.67   0.288   5.39     22.41   0.278   0.87     3.11

Table 2. Ablation study results validating the design choices of SOLACE.

BibTeX

@inproceedings{kim2026solace,
  title={Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards},
  author={Kim, Seungwook and Cho, Minsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}