LLMs under the hood of Hostinger Horizons: Balancing performance, speed, and cost
The large language model (LLM) race is accelerating, with new architectures, fine-tunes, and specialized systems arriving before the last ones have even settled. In such a fast-moving field, selecting the right model takes intent, speed, and constant re-evaluation.
Rather than committing to a single provider or architecture, we systematically benchmark models across a wide range of real-world tasks and domain-specific scenarios. By continuously integrating and testing the latest LLMs, we ensure that Hostinger Horizons, your all-in-one, no-code AI partner, is always powered by top tech to deliver the strongest performance, reliability, and value. Here’s what our latest assessments and experiences reveal.
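For illustration, a stripped-down version of such a comparison loop might look like the sketch below. The score and benchmark functions here are hypothetical stand-ins, not Horizons’ internal tooling – the point is simply that every model faces the same scenarios and gets an averaged quality score.

```python
from statistics import mean

def score(model: str, scenario: str) -> float:
    """Run one scenario against one model and return a 0-1 quality score."""
    raise NotImplementedError("call your model provider and grade the output here")

def benchmark(models: list[str], scenarios: list[str]) -> dict[str, float]:
    """Average each model's quality score across all scenarios."""
    return {m: mean(score(m, s) for s in scenarios) for m in models}
```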
Who leads the race?
Out of the dozens of major LLMs currently competing on the market – each with its own strengths and weaknesses – we always run several in combination and stay up to date with the latest developments and releases. One such example was Google’s launch of Gemini 3 in mid-November last year. It generated quite a buzz, and our internal research confirmed that Gemini 3 is indeed worth the hype.
Today, Gemini 3 powers parts of Hostinger Horizons, delivering more precise, higher-quality code than Gemini 2.5. It also fixes errors more reliably, with our autofix success jumping from 50% to 80%. Though some coding-oriented benchmarks still put Gemini 3 behind GPT-5 mini, GPT-5.1, and now also GPT-5.2, in our experience, Google’s newest model truly delivers.
Expert comment
“Gemini 3 is quite capable, especially with more nuanced tasks. For example, while testing it, we were able to generate an intricate finance website with just one prompt. While accurate and powerful, Gemini 3 is rather slow. That is why we don’t use it for simpler changes where a faster model can deliver a similar solution.”
Gemini 3 is one of the LLMs powering Hostinger Horizons. It handles coding tasks and is paired with our communication agent – a new feature that lets the AI ask clarifying questions whenever a prompt is unclear or vague. The communication agent helps Horizons understand what the user wants, which leads to more accurate code generation, an improved final result, and a smoother overall experience. Importantly, these clarifying messages are free – AI credits are only required for code changes.
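To make the idea concrete, here is a minimal sketch of what a clarification step like this could look like. The llm and handle_prompt functions and the two-way CLEAR/VAGUE verdict are our own simplifications for illustration, not Horizons’ actual implementation:

```python
from dataclasses import dataclass

@dataclass
class AgentReply:
    kind: str  # "clarification" (free) or "code_change" (spends AI credits)
    text: str

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to whichever model is on duty."""
    raise NotImplementedError("wire up a model provider here")

def handle_prompt(user_prompt: str) -> AgentReply:
    """Ask a clarifying question if the prompt is vague; otherwise generate code."""
    verdict = llm(
        "Reply with exactly CLEAR or VAGUE. Is this website-change request "
        f"specific enough to act on?\n{user_prompt}"
    )
    if verdict.strip().upper().startswith("VAGUE"):
        question = llm(f"Ask one short clarifying question about: {user_prompt}")
        return AgentReply("clarification", question)  # free for the user
    code = llm(f"Generate the code changes for: {user_prompt}")
    return AgentReply("code_change", code)  # this is where credits are spent
```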
The newcomer: Opus 4.5
Just days after Google released Gemini 3, Anthropic launched Claude Opus 4.5. In our internal quality score for landing page generation, this newcomer ranks among the top-performing models – right up there with the latest GPT models, as well as Gemini 3.
However, Opus 4.5 uses more tokens to achieve the same result as the older Claude Sonnet 4.5.
“For initial prompts, we’re still mainly using Sonnet 4.5, which has proven reliable for most generation tasks. But we’re investigating Opus 4.5 as an alternative. It follows directions very well, doesn’t make errors, and produces beautiful websites. Technically, it is a very powerful model,” said Dainius Kavoliūnas, Head of Hostinger Horizons.
The real capabilities of Opus 4.5 shine when you push the model to its limits – for example, by asking it to generate a comprehensive planning app with advanced color palettes, numerous buttons, gradients, and animations in one shot. Many benchmark scores back this up, showing that Opus 4.5 outperforms Sonnet 4.5 in areas such as novel problem-solving and advanced reasoning. On SWE-bench Verified, a benchmark used to assess model performance on coding tasks, Opus 4.5 slightly edges out the recent GPT-5.2 Thinking (80.9% vs. 80%) and beats Gemini 3 (76.2%) by a wider margin.
Finding the balance
By mixing and matching various AI models, we’ve reduced the total response time of Hostinger Horizons by 25%. The background error check that runs after coding now takes only 12 seconds, down from 40 seconds a month ago.
“In the end, it all comes down to using the right model for the right task and in the right context. So far, we have found that Sonnet 4.5 takes the lead in the initial prompting stage, and Gemini 3 is optimal for subsequent fixes and adjustments, with other models invoked depending on the situation. There’s obviously no single formula, and top scores on benchmarks don’t guarantee the best results when LLMs are used in real-life products. Therefore, we constantly work on testing, improving, and finding the right balance to bring the best experience to our clients,” said Kavoliūnas.
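Distilled into code, the routing idea described above could look something like this rough sketch. The task categories and model identifier strings are illustrative assumptions drawn from this article, and a production router would also weigh latency, token cost, and the specifics of each request:

```python
from enum import Enum, auto

class Task(Enum):
    INITIAL_GENERATION = auto()  # first prompt: build the site from scratch
    FIX_OR_ADJUSTMENT = auto()   # follow-up edits and error fixes
    COMPLEX_ONE_SHOT = auto()    # intricate builds requested in a single prompt

# Illustrative mapping based on what worked for us so far.
MODEL_FOR_TASK = {
    Task.INITIAL_GENERATION: "claude-sonnet-4.5",
    Task.FIX_OR_ADJUSTMENT: "gemini-3",
    Task.COMPLEX_ONE_SHOT: "claude-opus-4.5",
}

def pick_model(task: Task) -> str:
    """Return the model best suited to the task at hand."""
    return MODEL_FOR_TASK[task]
```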
Whether current leaders will maintain their positions or be displaced by competitors remains to be seen. But one thing is certain: we’re intent on staying ahead by continuously testing, comparing, and optimizing. Our goal remains the same: making website creation and management as simple as possible.