Detecting GPT-Image-2 Generated Text-Rich Images: A Multi-Domain Benchmark and Robustness Analysis
As multimodal image generation models grow increasingly capable of producing realistic textual content and structured visual layouts, detecting AI-generated text-rich images has become a critical challenge for maintaining digital trust and content authenticity. Existing benchmarks focus primarily on object-centric images and lack coverage of scenes in which text semantics and layout organization are essential. This work introduces a multi-domain benchmark for GPT-Image-2 generated text-rich images, comprising 8,602 images across six representative scenarios: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. We evaluate five representative AI-generated image detectors under a zero-shot setting and analyze their overall, category-specific, and post-processing robustness. Results show that detector performance is highly domain-dependent, that the strongest traditional detector is extremely sensitive to JPEG compression, and that multimodal vision-language models show limited effectiveness on structured formats. The benchmark aims to advance text- and layout-aware detection technologies, and the dataset has been made publicly available.
Background and Context
The proliferation of advanced multimodal image generation models, particularly GPT-Image-2, has fundamentally altered the landscape of digital content authenticity. These models can synthesize realistic textual content alongside complex, structured visual layouts. This presents a critical challenge for maintaining digital trust, as text-rich images often contain privacy-sensitive data, transactional records, or decision-critical information. Unlike natural scene images, these text-heavy visuals require precise semantic coherence and logical layout organization, making them a focal point for content verification systems. The ability of generative models to produce receipts, UI screenshots, and academic posters that are difficult to distinguish from authentic ones necessitates a reevaluation of existing detection methodologies.
Existing benchmarks for AI-generated image detection have predominantly focused on object-centric natural images, such as landscapes or portraits. This narrow scope has created a significant gap in evaluating detectors' performance on text-rich scenarios. The semantic and structural complexity of text-heavy images introduces unique artifacts and patterns that differ markedly from those found in general photography. Consequently, current evaluation frameworks fail to capture the specific vulnerabilities associated with detecting synthetic text and layout structures. This oversight has left a void in understanding how well current technologies can distinguish between human-made and AI-generated documents, financial records, and interface designs.
To address this deficiency, this study introduces a comprehensive multi-domain benchmark specifically designed for GPT-Image-2 generated text-rich images. The dataset comprises 8,602 carefully curated and annotated images spanning six representative categories: commercial posters, infographics, academic posters, receipts, tables, and UI screenshots. By covering such a diverse range of scenarios, the benchmark ensures a holistic assessment of detector capabilities across varying degrees of textual density and layout complexity. This initiative aims to provide a standardized platform for evaluating the robustness and generalization ability of AI detection systems in real-world applications where text and structure are paramount.
Deep Analysis
The evaluation of detection capabilities was conducted under a zero-shot setting, testing five representative AI-generated image detectors against unseen data from the benchmark categories. This approach rigorously assesses the generalization power of existing models, simulating real-world conditions where detectors encounter novel domains without prior fine-tuning. The selected detectors, which rely on statistical features, frequency domain analysis, and deep learning-based feature extraction, were subjected to a series of tests to measure their overall accuracy, category-specific performance, and resilience to post-processing attacks. The primary objective was to identify specific failure modes and technical bottlenecks in current detection paradigms when applied to text-rich content.
Experimental results revealed a pronounced domain dependency in detector performance. Models that exhibited high accuracy in one category, such as UI screenshots, often failed to generalize to others, such as complex infographics or academic posters. This inconsistency suggests that current detection features may be overly reliant on specific visual patterns or artifacts that are not universal across different types of text-rich images. The lack of cross-domain robustness indicates that existing detectors are not capturing the fundamental generative traces common to all AI-synthesized text layouts, but rather overfitting to superficial characteristics of specific image types.
Furthermore, the analysis highlighted severe robustness issues, particularly concerning image compression. The strongest traditional detector demonstrated extreme sensitivity to JPEG compression, with performance degrading significantly even under mild compression levels. This vulnerability implies that the detection signals identified by current models are either too weak or easily disrupted by common image processing techniques. In practical scenarios, where images are frequently compressed for storage or transmission, this sensitivity renders many existing detectors ineffective. The findings underscore the fragility of current detection mechanisms when faced with standard post-processing operations commonly applied to digital images.
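The category-specific and compression analyses described above come down to grouping per-image detector decisions and comparing accuracy between groups. The following is a minimal sketch of that bookkeeping, not the study's evaluation code: the `Prediction` record, its field names, the category labels, the JPEG quality value, and the toy records are all illustrative assumptions and do not reproduce any result from the benchmark.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable, Optional


@dataclass(frozen=True)
class Prediction:
    """One detector decision on one image (hypothetical record layout)."""
    category: str                 # e.g. "receipt" or "table" (assumed labels)
    jpeg_quality: Optional[int]   # None means the image was not recompressed
    is_generated: bool            # ground truth
    flagged: bool                 # detector said "AI-generated"


def accuracy_by(records: Iterable[Prediction],
                key: Callable[[Prediction], Hashable]) -> Dict[Hashable, float]:
    """Return accuracy per group, where `key` maps a record to its group."""
    hits: Dict[Hashable, int] = defaultdict(int)
    totals: Dict[Hashable, int] = defaultdict(int)
    for r in records:
        group = key(r)
        totals[group] += 1
        hits[group] += int(r.flagged == r.is_generated)
    return {group: hits[group] / totals[group] for group in totals}


def accuracy_drop(records: Iterable[Prediction], quality: int) -> float:
    """Accuracy on uncompressed images minus accuracy at one JPEG quality."""
    by_quality = accuracy_by(list(records), key=lambda r: r.jpeg_quality)
    return by_quality[None] - by_quality[quality]


# Made-up toy records, for illustration only.
toy = [
    Prediction("receipt", None, True, True),
    Prediction("receipt", None, False, False),
    Prediction("table", None, True, False),
    Prediction("table", None, False, False),
    Prediction("receipt", 90, True, False),
    Prediction("receipt", 90, False, False),
    Prediction("table", 90, True, False),
    Prediction("table", 90, False, False),
]

per_category = accuracy_by(toy, key=lambda r: r.category)
drop_at_q90 = accuracy_drop(toy, quality=90)
```

Reporting one number per category and one drop per compression level, rather than a single pooled accuracy, is what makes domain dependency and compression sensitivity visible in the first place.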
The study also explored the potential of multimodal vision-language models (VLMs) for this task. While VLMs possess inherent advantages in understanding textual semantics, their effectiveness in detecting AI-generated structured formats was limited. Despite their advanced language understanding capabilities, these models struggled to leverage semantic information for robust detection in complex layouts such as tables and dense text regions. This result challenges the assumption that integrating language models directly into detection pipelines will automatically yield superior performance for text-rich image verification, suggesting that structural and layout-aware features remain underutilized.
Industry Impact
The implications of these findings are profound for both the open-source research community and industrial applications. For researchers, the release of the 8,602-image multi-domain benchmark provides a critical resource for developing and comparing next-generation detection algorithms. By establishing a standardized evaluation platform, the benchmark facilitates fair and reproducible comparisons, accelerating the iteration of detection technologies. It highlights the urgent need for new methodologies that can effectively capture and utilize text and layout features, moving beyond the limitations of current object-centric detection frameworks.
In the industrial sector, the ability to reliably detect AI-generated text-rich images is essential for preventing fraud, protecting user privacy, and maintaining content integrity. Sectors such as finance, e-commerce, and digital media are increasingly vulnerable to sophisticated forgeries involving synthetic receipts, invoices, and interface designs. The demonstrated vulnerability of current detectors to JPEG compression and domain shifts poses a significant risk to these industries. Companies must recognize that relying on existing detection tools may lead to false negatives, allowing malicious actors to exploit the gaps in current verification systems.
The study's identification of specific weaknesses, such as the sensitivity to compression and the lack of cross-domain generalization, provides clear directions for industrial optimization. Developers of content verification systems must prioritize the development of detectors that are robust to common image processing operations and capable of generalizing across diverse text-rich categories. This may involve integrating more sophisticated feature extraction techniques that focus on the interplay between textual semantics and visual layout structures. The findings serve as a call to action for the industry to invest in more resilient and specialized detection solutions.
Outlook
Looking forward, the development of detection technologies must shift towards a more holistic approach that integrates text semantics with layout structure. The current failure of both traditional detectors and multimodal VLMs to effectively handle structured formats indicates a need for novel architectures that can explicitly model the relationships between text elements and their spatial arrangement. Future research should focus on creating detectors that are inherently aware of typographic inconsistencies, alignment errors, and logical flow anomalies that are characteristic of AI-generated text-rich images.
Additionally, there is a pressing need to enhance the robustness of detection models against post-processing attacks. Techniques that can maintain detection performance under various compression levels and image transformations will be crucial for practical deployment. This may involve training detectors on augmented data that includes diverse compression artifacts and noise patterns, thereby improving their resilience to real-world variations. The goal is to create detection systems that are not only accurate but also stable and reliable in dynamic digital environments.
The open-sourcing of the benchmark dataset marks a significant step towards addressing these challenges. By providing a rich and diverse set of examples, it enables the community to experiment with new ideas and validate their effectiveness. As multimodal generation models continue to evolve, the benchmark will serve as a vital tool for tracking progress and identifying emerging threats. The ultimate aim is to establish a new standard for content authenticity verification that can keep pace with the rapid advancements in generative AI, ensuring the integrity of digital information in an increasingly complex landscape.
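The idea of training on data augmented with compression artifacts, mentioned above, could look roughly like the sketch below. It is one possible approach rather than a method from the study: it assumes the Pillow library is available, and the function names, the application probability, and the quality range are hypothetical choices made for illustration.

```python
import io
import random

from PIL import Image


def jpeg_recompress(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip an image through in-memory JPEG encoding at a given quality."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    out = Image.open(buf)
    out.load()  # decode now, while the buffer is still alive
    return out


def random_jpeg_augment(img: Image.Image,
                        rng: random.Random,
                        p: float = 0.5,
                        quality_range: tuple = (30, 95)) -> Image.Image:
    """With probability p, recompress at a quality drawn from quality_range.

    The default p and quality_range are assumed values, not tuned settings.
    """
    if rng.random() >= p:
        return img  # leave this sample untouched
    quality = rng.randint(*quality_range)
    return jpeg_recompress(img, quality)


# Usage on a synthetic solid-color image (placeholder for a training sample).
rng = random.Random(0)
sample = Image.new("RGB", (64, 64), color=(200, 30, 30))
augmented = random_jpeg_augment(sample, rng, p=1.0)
```

Drawing a fresh quality per sample exposes a detector to a spread of compression strengths during training, so it is less likely to depend on signals that only survive in uncompressed images.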
The integration of advanced linguistic analysis with computer vision techniques holds promise for overcoming current limitations. By leveraging the strengths of both modalities, future detectors may achieve a deeper understanding of the generative process, enabling more accurate and robust identification of synthetic content. This interdisciplinary approach will be key to building trust in digital media and safeguarding against the misuse of AI-generated text-rich images. The journey towards reliable detection is ongoing, but this benchmark provides a solid foundation for the next generation of verification technologies.