Exploring AI Performance in UI/UX Design and Coding: Insights from a Global Survey
In recent months, I embarked on a research project to evaluate how well various AI models create user interfaces, design user experiences, and write code. To that end, I launched a crowdsourced benchmarking platform, Design Arena, where users generate websites, games, 3D models, and data visualizations with different AI models and then compare the outputs head to head.
Throughout this initiative, nearly 5,000 users have contributed over 4,000 votes, providing valuable data on AI performance across multiple domains. Below are some key findings from that data. The project runs on open-source models and free generation tools, with no commercial gain involved.
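For readers curious how pairwise votes can turn into a leaderboard: the post doesn't describe Design Arena's actual ranking method, but a standard approach for crowdsourced head-to-head comparisons is an Elo-style rating. The sketch below is illustrative only; the K-factor, starting rating, and model names are assumptions, not the platform's real implementation.

```python
from collections import defaultdict

# Illustrative sketch only: Design Arena's actual aggregation method is not
# described in this post. K and the starting rating are assumed values.
K = 32  # update step size per vote (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(votes, initial=1000.0):
    """votes: iterable of (winner, loser) pairs from head-to-head comparisons."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1.0 - exp_win)  # winner gains more for an upset
        ratings[loser]  -= K * (1.0 - exp_win)  # loser pays symmetrically
    return dict(ratings)

# Hypothetical votes, purely for demonstration:
votes = [("claude-opus", "llama"), ("deepseek", "gpt"), ("claude-opus", "gemini-2.5-pro")]
leaderboard = sorted(update_ratings(votes).items(), key=lambda kv: -kv[1])
for model, rating in leaderboard:
    print(f"{model}: {rating:.0f}")
```

One appealing property of an Elo-style update for this setting is that an upset win against a higher-rated model moves ratings more than an expected win, which extracts signal even from sparse crowdsourced votes.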
Key Findings from the AI Benchmarking Study
1. Leading Models for Coding and Design: Claude and DeepSeek
The current leaderboard highlights Claude (specifically Claude Opus) as a top performer in both coding and design, consistently earning high user-satisfaction scores. Following closely are the DeepSeek models, particularly version 0, which dominate website creation thanks to their versatility. Interestingly, the DeepSeek models tend to be slower, making Claude the preferred choice when an interface needs to be implemented quickly.
2. The Underrated Power of Grok 3
Despite limited mainstream recognition (possibly owing to its association with Elon Musk), Grok 3 stands out as a highly capable model. It ranks in the top five in our testing and is remarkably fast, often delivering high-quality results quicker than its peers.
3. Gemini 2.5-Pro: A Mixed Bag
The Gemini 2.5-Pro model is inconsistent: some users report excellent UI/UX results, while others get poorly built applications. Generating business-logic code remains a strong point, but overall its performance varies significantly with the task.
4. The Middle of the Pack: OpenAI and Meta
OpenAI’s GPT models show moderate capabilities, performing adequately but not excelling in these specific tasks. Meanwhile, Meta’s Llama models appear to lag behind other competitors, highlighting the ongoing challenges and fierce competition within the AI space. This may also explain Meta’s recent heavy investments to attract top AI talent.
Final Thoughts
Despite the rapid advancements in AI, current models still perform unevenly across UI/UX design and coding tasks, and no single model excels at everything.

