Exploring AI Performance in UI/UX Design and Coding: Insights from a Global User Survey
In recent months, I conducted a comprehensive, crowdsourced assessment of various AI models to understand their capabilities in UI/UX design and coding tasks. Drawing insights from nearly 5,000 users across the globe, I aimed to benchmark how different models perform when generating websites, interactive applications, 3D models, and data visualizations.
This effort was entirely open-source and freely accessible — I am not earning any revenue from this project. Instead, I hope these findings serve as a valuable resource for developers, designers, and AI enthusiasts interested in leveraging these tools effectively.
Development of the Benchmark
To facilitate this analysis, I built a platform called Design Arena. It enables users to generate and compare outputs from multiple AI models in a single interface. Participants can produce various digital assets—such as websites, games, 3D models, and data visualizations—and vote on which results are most effective and user-friendly.
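The article doesn't specify how Design Arena turns individual votes into a ranking, but arena-style leaderboards commonly aggregate head-to-head votes with an Elo-style rating update. A minimal sketch of that idea (the model names and the `k` factor here are illustrative, not the platform's actual parameters):

```python
# Hypothetical Elo-style aggregation of pairwise votes, as commonly
# used by arena-style leaderboards. This is an illustrative sketch,
# not Design Arena's actual scoring method.

def elo_update(ratings, winner, loser, k=32):
    """Update two models' ratings after a single head-to-head vote."""
    ra, rb = ratings[winner], ratings[loser]
    # Expected win probability for `winner` under the Elo model
    expected = 1 / (1 + 10 ** ((rb - ra) / 400))
    ratings[winner] = ra + k * (1 - expected)
    ratings[loser] = rb - k * (1 - expected)
    return ratings

# Example: three votes among hypothetical model entries
ratings = {"claude-opus": 1000.0, "deepseek": 1000.0, "grok-3": 1000.0}
votes = [("claude-opus", "deepseek"),   # each pair is (winner, loser)
         ("deepseek", "grok-3"),
         ("claude-opus", "grok-3")]
for winner, loser in votes:
    elo_update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

With enough votes, this converges toward a stable ordering even though each participant only ever compares two outputs at a time, which is why pairwise voting scales well for crowdsourced benchmarks.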
Key Findings
- Leading Models in Coding and UI/UX Design
The survey revealed that models like Claude and DeepSeek consistently outperform their peers. Users favored Claude Opus for its superior interface generation and coding assistance, often ranking it highest overall. DeepSeek models excelled in website development thanks to their ability to create visually compelling designs. Interestingly, Grok emerged as a noteworthy contender despite less visibility online, owing to its rapid output, though it still falls short of Claude in some respects.
- The Hidden Gem: Grok 3
While not as widely discussed as Claude or GPT, Grok 3 deserves recognition. Its speed and performance place it comfortably within the top tier, making it a reliable option for quick iterations and tasks where turnaround time is critical.
- Variability of Gemini 2.5-Pro
The Gemini 2.5-Pro model exhibits mixed results. Some users report positive experiences in generating UI/UX components, but others have encountered poorly designed applications. Its coding abilities, particularly for business logic, are generally effective. Overall, its inconsistent performance suggests it may be better suited for specific use cases than as a universal solution.
- Middle of the Pack: GPT and Llama
OpenAI’s GPT models sit in the middle of the performance spectrum, providing decent outputs but lacking the consistency of the top contenders. Meta’s Llama models lag further behind, highlighting the gap between them and the current leaders.

