My group and collaborators have developed many popular benchmarks over the years (e.g., MMLU, MATH, APPS). Really excited about our latest benchmark, OMEGA Ω:

🔍 Can LLMs really think outside the box in math?

OMEGA is a new benchmark probing 3 axes of generalization:
1️⃣ Exploratory
2️⃣ Compositional
3️⃣ Transformative

It exposes the limitations of today's frontier AI and RL training along these dimensions of generalization.

Inspired by Boden's typology of creativity, OMEGA advances beyond prior benchmarks with a programmatically generated dataset that combines precise control with rich diversity. Spanning a wide range of mathematical domains, it is explicitly designed to evaluate distinct axes of generalization and creative reasoning. By isolating and quantifying fine-grained failure modes, OMEGA provides a foundation for advancing LLMs toward genuine mathematical creativity, beyond mechanical proficiency.

Huge thanks to my postdoc @YiyouSun @UCBerkeley for leading the project, and to amazing collaborators @nouhadziri @HannaHajishirzi @allen_ai and our other co-authors!