Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

1 POSTECH   2 RLWRLD
CVPR 2026
Paper (coming soon) arXiv (coming soon) Code

TL;DR: SPARC post-trains text-to-image generators using the model's own self-confidence as a reward — no external reward models needed. It improves compositionality, text rendering, and text-image alignment, and complements external rewards to mitigate reward hacking.

SPARC teaser: qualitative comparison showing SD3.5 vs SD3.5 + SPARC

Figure 1. Qualitative examples of SPARC applied to SD3.5-Medium on the Pick-a-Pic dataset. SPARC post-training consistently improves image quality, compositionality, and text rendering without any external reward model.

Abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training text-to-image generative models is a promising path toward better alignment with human preferences, improved factuality, and higher aesthetic quality. We introduce SPARC (Self-Probing Adaptive Reward by Confidence), a post-training framework that replaces external reward supervision with an internal self-confidence signal, obtained by evaluating how accurately the model recovers injected noise under self-denoising probes. SPARC converts this intrinsic signal into scalar rewards, enabling fully unsupervised optimization without additional datasets, annotators, or reward models. Empirically, by reinforcing high-confidence generations, SPARC delivers consistent gains in compositional generation, text rendering, and text-image alignment over the baseline. We also find that integrating SPARC with external rewards results in a complementary improvement, with alleviated reward hacking.

Key Contributions

  • We present SPARC, a post-training framework that uses self-confidence as a reward for text-to-image generators, requiring no external reward models, annotators, or datasets.
  • We define the self-confidence reward by re-noising the model's own outputs and rewarding accurate recovery of the injected noise — a principled, training-aligned signal.
  • Across standard benchmarks and a comprehensive user study, SPARC yields consistent gains in compositionality, text rendering, and text-image alignment.
  • SPARC complements external-reward pipelines: applying SPARC on top of externally post-trained models improves non-target capabilities while mitigating reward hacking.

Method Overview

Given a text prompt \(c\), we generate \(G\) different latents using the flow-matching model. Without decoding, we re-noise the latents using \(K\) noise probes across timesteps \(t \in \mathcal{T} \subset [0,1]\). For each generated latent \(z_0^{(i)}\), we formulate the model's self-confidence as its ability to accurately denoise the re-noised latent. This self-confidence serves as an intrinsic reward scalar, which we use to post-train the model via Flow-GRPO.

SPARC method overview

Figure 2. Overview of SPARC. Given a text prompt, we generate G latents, re-noise them with K shared noise probes, and measure the model's ability to recover the injected noise. This self-confidence score is used as the intrinsic reward for GRPO post-training.
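The group-relative credit assignment at the heart of a GRPO-style update can be sketched as follows. This is a minimal illustration, not the full Flow-GRPO algorithm (importance ratios, clipping, and KL regularization are omitted), and `group_advantages` is a hypothetical helper name:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sample's reward against the
    group of G samples generated for the same prompt.

    Samples whose self-confidence reward exceeds the group mean receive a
    positive advantage and are reinforced; below-mean samples are suppressed.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because advantages are centered within each prompt's group, only the relative ranking of the G generations matters, which makes the raw scale of the self-confidence reward less critical.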

The self-confidence reward is computed as:

\[ R_{\text{SPARC}}(z_0^{(i)}, c) = \frac{1}{\sum_{t \in \mathcal{T}} w(t)} \sum_{t \in \mathcal{T}} w(t) \cdot S_{i,t} \]

where

\[ S_{i,t} = -\log\left(\text{MSE}_{i,t} + \delta\right), \quad \text{MSE}_{i,t} = \frac{1}{K}\sum_{m=1}^{K} \left\| \hat{\epsilon}_\theta(z_t^{(i,m)}, t, c) - \epsilon^{(m)} \right\|^2 \]

The negative log transform amplifies small errors into large rewards while stabilizing the dynamic range. The reward is computed entirely in latent space, avoiding the need for decoding and keeping the signal model-native.
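The reward above can be sketched end to end for a single latent. The sketch below assumes the standard rectified-flow interpolation \(z_t = (1-t)\,z_0 + t\,\epsilon\) for re-noising, and `predict_eps(z_t, t)` is a hypothetical stand-in for the model's conditional noise prediction (evaluated without CFG, per the paper); uniform weights stand in for \(w(t)\), which would be schedule-dependent in practice:

```python
import numpy as np

def sparc_reward(z0, predict_eps, timesteps, weights, K=8, delta=1e-4, seed=0):
    """Sketch of the SPARC self-confidence reward for one generated latent z0."""
    rng = np.random.default_rng(seed)
    scores = []
    for t in timesteps:
        # K noise probes with antithetic pairing: eps[m + K//2] = -eps[m]
        half = rng.standard_normal((K // 2,) + z0.shape)
        eps = np.concatenate([half, -half], axis=0)
        # re-noise the clean latent along the flow interpolation path (assumption)
        z_t = (1.0 - t) * z0[None] + t * eps
        # MSE_{i,t}: per-probe squared error, averaged over the K probes
        err = predict_eps(z_t, t) - eps
        mse = np.mean(np.sum(err ** 2, axis=tuple(range(1, err.ndim))))
        # negative log amplifies small errors into large rewards
        scores.append(-np.log(mse + delta))
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * np.asarray(scores)) / np.sum(w))
```

A predictor that recovers the injected noise exactly drives the MSE to zero, so the reward saturates at \(-\log\delta\); a predictor that ignores the probes scores far lower, which is the ranking signal GRPO needs.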

Key Design Choices

  • Training on selective timesteps: We optimize only a suffix of the training schedule (\(|\mathcal{T}_{\text{train}}| = \lceil \rho |\mathcal{T}| \rceil\) with \(\rho = 0.6\)) to prevent collapse from reward hacking on early timesteps.
  • Self-confidence without CFG: Although images are sampled with classifier-free guidance, self-confidence is computed without CFG to evaluate the base conditional policy rather than a guided proxy.
  • Online self-confidence: Using the model being trained (\(\pi_\theta\)) for self-confidence computation, rather than a fixed reference model, yields stronger improvements as the model's confidence becomes more reliable through training.
  • Antithetic noise pairing: We enforce exact mean zero within each noise probe set (\(\epsilon^{(m+K/2)} = -\epsilon^{(m)}\)) for variance reduction.
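Two of these choices are easy to make concrete. The helpers below are illustrative sketches under the paper's stated settings (\(\rho = 0.6\), even \(K\)); the function names are not from the paper, and the schedule is assumed to be ordered from early to late timesteps:

```python
import numpy as np

def antithetic_probes(shape, K, rng):
    """Draw K noise probes in antithetic pairs (eps[m + K//2] = -eps[m]),
    giving an exact zero mean within each probe set for variance reduction."""
    assert K % 2 == 0, "antithetic pairing requires an even K"
    half = rng.standard_normal((K // 2,) + shape)
    return np.concatenate([half, -half], axis=0)

def train_suffix(schedule, rho=0.6):
    """Keep only the last ceil(rho * |T|) timesteps for optimization,
    skipping the early timesteps that are prone to reward hacking."""
    n = int(np.ceil(rho * len(schedule)))
    return schedule[-n:]
```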

Quantitative Results

We evaluate SPARC on compositional generation (GenEval), visual text rendering (OCR), image quality (CLIP-Score, Aesthetic Score), and human preference alignment (PickScore, HPSv2, ImageReward, UnifiedReward).

                              Task-specific    Image Quality   Human Preference
Model                         GenEval   OCR    CLIP    Aesth.  Pick    HPSv2   ImgRwd  UniRwd
SD-XL                         0.55      0.14   0.287   5.60    22.42   0.280   0.76    2.93
SD3.5-L                       0.71      0.68   0.289   5.50    22.91   0.288   0.96    3.25
SD3.5-M (2.5B)                0.65      0.61   0.282   5.36    22.34   0.279   0.84    3.08
 + SPARC (Ours)               0.71      0.67   0.288   5.39    22.41   0.278   0.87    3.11
SD3.5-M + FlowGRPO            0.67      0.68   0.278   5.90    23.50   0.314   1.26    3.37
 + FlowGRPO + SPARC (Ours)    0.77      0.70   0.287   5.63    22.73   0.286   1.07    3.26

Table 1. Quantitative results of SPARC post-training. SPARC yields consistent gains across task-specific, image quality, and human preference metrics. Applying SPARC on top of FlowGRPO further improves compositionality and text rendering while mitigating reward hacking.

User Study

We conducted a comprehensive user study with ~1,800 responses from 20 participants on prompts from PartiPrompts and HPSv2, evaluating visual realism/appeal and text-image alignment.

User study results

Figure 3. User study results. SPARC post-training consistently outperforms the baseline SD3.5-M in both visual realism/appeal and text-image alignment across PartiPrompts and HPSv2 prompt sets.

Qualitative Results

SPARC improves compositionality (correct object counts, spatial relations, attribute binding), text rendering accuracy, and overall visual quality without any external reward supervision.

Qualitative comparison

Figure 4. Qualitative results of SPARC on DrawBench, GenEval, and OCR benchmarks. SPARC shows consistent improvements over the baseline SD3.5 in compositionality, text rendering, and visual appeal.

SPARC complements external rewards

Figure 5. Effect of applying SPARC after FlowGRPO post-training with PickScore as the external reward. SPARC complements external rewards, recovering compositionality lost during reward-targeted optimization while maintaining visual appeal.

Ablation Studies

We validate the key design choices of SPARC through ablation studies on the number of noise probes, CFG usage during self-confidence computation, and online vs. offline self-confidence.

Configuration                 GenEval   OCR    CLIP    Aesth.  Pick    HPSv2   ImgRwd  UniRwd

Number of Noise Probes K
K = 4                         0.66      0.67   0.287   5.37    22.34   0.273   0.81    3.08
K = 8 (Ours)                  0.71      0.67   0.288   5.39    22.41   0.278   0.87    3.11
K = 16                        0.67      0.67   0.288   5.42    22.34   0.278   0.86    3.09

CFG for Self-Confidence
With CFG                      0.68      0.59   0.287   5.38    22.39   0.278   0.85    3.10
Without CFG (Ours)            0.71      0.67   0.288   5.39    22.41   0.278   0.87    3.11

Online vs. Offline Self-Confidence
Offline                       0.69      0.61   0.285   5.36    22.36   0.274   0.82    3.07
Online (Ours)                 0.71      0.67   0.288   5.39    22.41   0.278   0.87    3.11

Table 2. Ablation study results validating the design choices of SPARC.

BibTeX

@inproceedings{kim2026sparc,
  title={Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards},
  author={Kim, Seungwook and Cho, Minsu},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}