#text-to-image News & Analysis

79 articles tagged with #text-to-image. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Retrieval-Augmented Anatomical Guidance for Text-to-CT Generation

Researchers propose a retrieval-augmented approach for generating CT scans from radiology reports that combines semantic control with anatomical consistency by retrieving structurally similar clinical cases and using their annotations as guidance. The method improves image fidelity and clinical consistency compared to text-only baselines while enabling spatial controllability without requiring ground-truth annotations at inference time.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

ZIPP: Zero-shot Image Personalization from Personas

Researchers introduce ZIPP, a zero-shot image personalization system that conditions text-to-image diffusion models on natural-language personas derived from user behavior rather than requiring fine-tuning or interaction history. The method uses an LLM to rewrite prompts from persona perspectives and achieves 13-20% performance gains while reducing demographic bias compared to existing personalization approaches.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Researchers demonstrate that safety behaviors in generative AI models can be represented as portable latent directions that transfer across different architectures without requiring unsafe training data on target models. This framework enables cross-model safety steering for text-to-image and text-to-video generation, suggesting safety is a shared property rather than model-specific.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

Researchers introduce OctoT2I, an agentic text-to-image framework that autonomously routes tasks across multiple T2I models without human annotation. The system uses a self-evolving mechanism to discover each model's capabilities and achieves 90.3% faster inference with 56.6% better energy efficiency compared to existing methods while maintaining competitive quality scores.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Researchers identify prototypicality bias as a systematic flaw in automated text-to-image evaluation metrics, where the metrics prefer visually plausible but semantically incorrect images over accurate ones. The study introduces PROTOBIAS, a diagnostic benchmark revealing that widely used metrics fail to prioritize semantic faithfulness to prompts, and proposes PROTOSCORE as a mitigation approach.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Researchers introduce DIDR (Diff-Instruct with Diffused Reward), a reinforcement learning framework that improves one-step text-to-image generation by aligning reward optimization with diffusion dynamics. The method addresses a fundamental mismatch in existing approaches where optimizing for image-space rewards often degrades overall image fidelity, demonstrating superior results compared to current SDXL baselines.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

HyperTransport: Amortized Conditioning of T2I Generative Models

HyperTransport is a new hypernetwork framework that dramatically accelerates activation steering for text-to-image models by amortizing optimization costs across multiple concepts. Rather than optimizing intervention parameters for each new concept (which takes minutes), the system learns to map CLIP embeddings directly to steering parameters in a single forward pass, achieving a 3600-7000x speedup while matching per-concept baselines on unseen concepts.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Flow-OPD: On-Policy Distillation for Flow Matching Models

Researchers introduce Flow-OPD, a post-training framework that applies on-policy distillation to Flow Matching text-to-image models, addressing reward sparsity and gradient interference problems. Built on Stable Diffusion 3.5 Medium, the method achieves significant performance gains—GenEval scores improve from 63 to 92 and OCR accuracy from 59 to 94—while maintaining image quality and surpassing individual teacher models.

🧠 Stable Diffusion

AI · Bullish · arXiv – CS AI · May 11 · 7/10

CASCADE: Context-Aware Relaxation for Speculative Image Decoding

Researchers have developed CASCADE, a novel speculative decoding technique that accelerates autoregressive image generation by up to 3.6x by identifying and exploiting redundancies in neural network representations. The method addresses a critical bottleneck in image synthesis by reducing draft token rejection rates without requiring model retraining, advancing the efficiency of text-to-image AI systems.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

OrchJail: Jailbreaking Tool-Calling Text-to-Image Agents by Orchestration-Guided Fuzzing

Researchers have developed OrchJail, a fuzzing framework that discovers vulnerabilities in tool-calling text-to-image AI agents by exploiting how multiple benign steps combine into unsafe outputs. Unlike traditional prompt-injection attacks, OrchJail targets the orchestration layer where agents chain tools together, achieving higher attack success rates while evading existing defenses.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

Researchers introduce SCOPE, a framework that addresses the challenge of maintaining semantic commitments throughout the text-to-image generation process by using structured specifications and conditional skill orchestration. The framework achieves significantly higher performance on complex image generation tasks, with a new benchmark (Gen-Arena) and evaluation metric (EGIP) designed to measure commitment-level intent realization.

AI · Neutral · arXiv – CS AI · May 4 · 7/10

When Do Diffusion Models Learn to Generate Multiple Objects?

Researchers have identified fundamental limitations in how text-to-image diffusion models handle multi-object generation, finding that scene complexity rather than data imbalance is the primary culprit. Through a controlled framework called MOSAIC, they demonstrate that counting objects is particularly difficult in low-data regimes and that compositional generalization collapses when training combinations are systematically excluded.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

Researchers introduce Flow Map Reward Guidance (FMRG), a novel training-free method for guiding generative models toward user-specified objectives using optimal control theory. The approach achieves comparable or superior results to existing baselines while requiring only 3 neural function evaluations, representing a 10x+ speedup over prior methods.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models

DiffSketcher is a novel AI algorithm that generates vector sketches from text prompts by leveraging pre-trained text-to-image diffusion models. The method optimizes Bézier curves using an extended Score Distillation Sampling loss and introduces a stroke initialization strategy based on attention maps, achieving superior results in sketch quality and controllability.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators

New research reveals that despite visual improvements, modern text-to-image models from 2022-2025 perform worse as synthetic training data generators for AI classifiers. The study found that newer models collapse to narrow, aesthetic-focused distributions that lack the diversity needed for effective machine learning training.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

Researchers propose LESA, a new framework that accelerates Diffusion Transformers (DiTs) by up to 6.25x using learnable predictors and Kolmogorov-Arnold Networks. The method achieves significant speedups while maintaining or improving generation quality in text-to-image and text-to-video synthesis tasks.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Order Is Not Layout: Order-to-Space Bias in Image Generation

Researchers have identified Order-to-Space Bias (OTS) in modern image generation models: the order in which entities are mentioned in a text prompt spuriously determines their spatial layout and role assignments. The study introduces OTS-Bench to measure this bias and demonstrates that targeted fine-tuning and early-stage interventions can reduce the problem while maintaining generation quality.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 3

TikZilla: Scaling Text-to-TikZ with High-Quality Data and Reinforcement Learning

Researchers have developed TikZilla, a new AI model that generates high-quality scientific figures from text descriptions using TikZ code. The model uses a dataset four times larger than previous versions and combines supervised learning with reinforcement learning to achieve performance matching GPT-5 while using much smaller model sizes.

AI · Bearish · arXiv – CS AI · Mar 4 · 7/10 · 3

Semantic-level Backdoor Attack against Text-to-Image Diffusion Models

Researchers have developed SemBD, a new semantic-level backdoor attack against text-to-image diffusion models that achieves 100% success rate while evading current defenses. The attack uses continuous semantic regions as triggers rather than fixed textual patterns, making it significantly harder to detect and defend against.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

Fine-Tuning Diffusion Models via Intermediate Distribution Shaping

Researchers present P-GRAFT, a new method for fine-tuning diffusion models by shaping distributions at intermediate noise levels, showing improved performance on text-to-image generation tasks. The framework achieved an 8.81% relative improvement over the base Stable Diffusion v2 model on popular benchmarks.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 4

Conditioned Activation Transport for T2I Safety Steering

Researchers introduce Conditioned Activation Transport (CAT), a new framework to prevent text-to-image AI models from generating unsafe content while preserving image quality for legitimate prompts. The method uses a geometry-based conditioning mechanism and nonlinear transport maps, validated on Z-Image and Infinity architectures with significantly reduced attack success rates.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 4

Navigating with Annealing Guidance Scale in Diffusion Space

Researchers propose a new annealing guidance scheduler that dynamically adjusts guidance scales in diffusion models during image generation, improving both image quality and text prompt alignment. The method enhances text-to-image generation performance without requiring additional memory or computational resources.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 5

HierarchicalPrune: Position-Aware Compression for Large-Scale Diffusion Models

Researchers developed HierarchicalPrune, a compression framework that reduces large-scale text-to-image diffusion models' memory footprint by 77.5-80.4% and latency by 27.9-38.0% while maintaining image quality. The technique enables billion-parameter AI models to run efficiently on resource-constrained devices through hierarchical pruning and knowledge distillation.

AI · Bearish · arXiv – CS AI · Feb 27 · 7/10 · 4

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Researchers reveal a critical evaluation bias in text-to-image diffusion models where human preference models favor high guidance scales, leading to inflated performance scores despite poor image quality. The study introduces a new evaluation framework and demonstrates that simply increasing CFG scales can compete with most advanced guidance methods.
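For readers unfamiliar with the "CFG scale" mentioned in the summary, the following is a minimal sketch of the standard classifier-free guidance combination step, not the paper's evaluation framework. The function name `cfg_combine` and the toy vectors are illustrative assumptions; in a real diffusion pipeline the two inputs would be a denoiser's unconditional and prompt-conditioned noise predictions.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard classifier-free guidance: start from the unconditional
    prediction and move `scale` times along the direction that points
    toward the conditional prediction."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy values standing in for the two noise predictions (hypothetical data).
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]

# scale = 0 ignores the prompt, scale = 1 uses the conditional prediction
# unchanged, and larger scales extrapolate beyond it.
for w in (0.0, 1.0, 2.0):
    print(w, cfg_combine(uncond, cond, w))
```

Raising `scale` pushes every sample further along the prompt direction, which is the single knob the summarized study reports preference models tend to reward.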
