Zhaokai Wang1*‡, Mingxin Liu1*, Zirun Zhu1*, Ziqian Fan2*, Yiguo He1*, Mohan Zhang3, Leyao Gu1, Xiangyu Zhao1, Ning Liao1, Shaofeng Zhang4, Xuanhe Zhou1, Zhihang Zhong1, Junchi Yan1, Xue Yang1†
1Shanghai Jiao Tong University 2South China University of Technology 3Xiamen University 4University of Science and Technology of China
*Equal Contribution ‡Project Lead †Corresponding Author
DisciplineGen-1M is a million-scale multidisciplinary dataset for text-to-image (T2I) generation and image editing. It addresses a critical gap in existing image generation models: they can produce visually appealing natural images, but they remain unreliable on knowledge-intensive diagrams whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations.
- 🎯 Dual Task Support: Supports both text-to-image generation and image editing
- 🧠 Discipline-Informed Reasoning: Introduces a discipline-informed reasoning-generation model
- 📊 1.2M Samples: Large-scale dataset with comprehensive coverage
- 🔬 10+ Disciplines: Mathematics, Physics, Chemistry, Biology, Geography, Computer Science, Economics, History, Music, Sports, and more
- ✨ Structured Annotations: Includes captions, editing instructions, structured annotations, and paired images with controllable semantic differences
The dataset was constructed using a scalable framework combining four complementary methods:
- Vector-Graphics Rendering (SVG/TikZ): Structured rendering from vector graphics formats
- OCR-Based Editing: Optical character recognition for creating editing pairs
- Large-Scale T2I Filtering: Filtering text-to-image data at scale
- Specialized Programmatic Synthesis: Curated programmatic generation of disciplinary content
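To make the idea of rendering from vector graphics more concrete, here is a minimal sketch of how a program could emit a source/target SVG pair that differs in exactly one controlled attribute, together with a matching editing instruction. It is an illustration only, not the project's pipeline: the function names (`render_circle_svg`, `make_edit_pair`, `circle_radius`), the dictionary keys, and the circle example are all assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_circle_svg(radius: int, label: str, size: int = 200) -> str:
    """Return an SVG string showing a labelled circle centred on the canvas."""
    center = size // 2
    return (
        f'<svg xmlns="{SVG_NS}" width="{size}" height="{size}">'
        f'<circle cx="{center}" cy="{center}" r="{radius}" '
        f'fill="none" stroke="black"/>'
        f'<text x="{center}" y="{center}" text-anchor="middle">{label}</text>'
        f"</svg>"
    )


def make_edit_pair(radius_before: int, radius_after: int) -> dict:
    """Build a source/target pair that differs only in the circle radius.

    The keys below are hypothetical, chosen for this sketch.
    """
    return {
        "source_svg": render_circle_svg(radius_before, f"r = {radius_before}"),
        "target_svg": render_circle_svg(radius_after, f"r = {radius_after}"),
        "instruction": (
            f"Change the radius of the circle from {radius_before} "
            f"to {radius_after} and update the label."
        ),
    }


def circle_radius(svg_text: str) -> int:
    """Parse an SVG string and read back the radius of its first circle."""
    root = ET.fromstring(svg_text)
    circle = root.find(f"{{{SVG_NS}}}circle")
    return int(circle.get("r"))


pair = make_edit_pair(40, 60)
print(pair["instruction"])
```

Because both images come from the same parameterised template, the semantic difference between them is known exactly, which is what lets the instruction be generated rather than written by hand.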
These pipelines produce:
- Captions
- Editing instructions
- Structured annotations
- Paired images with controllable semantic differences
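One way to picture how these outputs fit together is a single record type that covers both tasks. The sketch below is hypothetical: the class name `Sample`, every field name, the task labels `"t2i"` and `"edit"`, and the file paths are assumptions for illustration, and the released data may be organised differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Sample:
    """Hypothetical record layout; not the dataset's actual schema."""

    task: str                                # "t2i" or "edit" (assumed labels)
    discipline: str                          # e.g. "Physics"
    caption: str                             # text describing the target image
    image_path: str                          # target image
    instruction: Optional[str] = None        # editing instruction, edit task only
    source_image_path: Optional[str] = None  # image before the edit, edit task only
    annotations: dict = field(default_factory=dict)  # structured annotations

    def validate(self) -> None:
        """Raise ValueError if the record is inconsistent with its task."""
        if self.task not in ("t2i", "edit"):
            raise ValueError(f"unknown task: {self.task!r}")
        if self.task == "edit" and not (self.instruction and self.source_image_path):
            raise ValueError("edit samples need an instruction and a source image")


def to_json(sample: Sample) -> str:
    """Serialise a validated record to one JSON line."""
    sample.validate()
    return json.dumps(asdict(sample), ensure_ascii=False)


example = Sample(
    task="edit",
    discipline="Physics",
    caption="A block on an inclined plane with a labelled friction force.",
    image_path="images/target.png",          # placeholder path
    instruction="Increase the incline angle from 30 to 45 degrees.",
    source_image_path="images/source.png",   # placeholder path
    annotations={"objects": ["block", "incline"]},
)
print(to_json(example))
```

Keeping the editing fields optional lets text-to-image and editing samples share one format, at the cost of needing an explicit consistency check such as `validate`.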
The dataset features:
- Long and information-dense prompts
- Diverse subject coverage across fine-grained subdomains
- Multiple image categories
- Varied resolutions and aspect ratios
Our approach demonstrates:
- Substantial improvements over open-source baselines on discipline-related benchmarks (GenExam, GRADE)
- Broader transfer capability on general reasoning-informed benchmarks (WISE, RISE)
- Evidence that large-scale structured academic visual data is key for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation
If you find this work useful in your research, please cite:
@misc{wang2026disciplinegen1mlargescaledatasetmultidisciplinary,
  title={DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing},
  author={Zhaokai Wang and Mingxin Liu and Zirun Zhu and Ziqian Fan and Yiguo He and Mohan Zhang and Leyao Gu and Xiangyu Zhao and Ning Liao and Shaofeng Zhang and Xuanhe Zhou and Zhihang Zhong and Junchi Yan and Xue Yang},
  year={2026},
  eprint={2607.02290},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2607.02290},
}

