A direct apples-to-apples rerun of the Qwen3.6-27B base evaluation against Jackrong's Qwopus3.6-27B-v1-preview reasoning fine-tune. Same 16 prompts. Same hardware. Same harness.
| Item | Value |
|---|---|
| Model | Jackrong/Qwopus3.6-27B-v1-preview-GGUF — Q4_K_M (16 GB) |
| Base | Qwen/Qwen3.6-27B (evaluated separately in Round 1) |
| Training data | ~12K curated examples: Claude-Distillation, GLM-5.1-Reasoning, Kimi-K2.5-Reasoning, Qwen3.5-reasoning |
| Runtime | llama.cpp cuda-12.8, --flash-attn on, --jinja |
| Context | 65,536 tokens, q8_0 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all layers offloaded |
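The runtime config in the table maps onto a llama-server invocation roughly like this (the model filename is illustrative; flag spellings are as in recent llama.cpp builds — adjust for your version):

```shell
# Single-slot serving: full GPU offload, 64K context, q8_0 KV cache,
# flash attention on, --jinja for the model's chat template.
llama-server \
  -m Qwopus3.6-27B-v1-preview-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn on \
  --jinja \
  --parallel 1
```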

| Metric | Qwen3.6-27B base (Q5) | Qwopus3.6 preview (Q4) |
|---|---|---|
| avg tok/s | 55.3 | 62.3 |
| min / max tok/s | 51.3 / 56.0 | 61.8 / 62.7 |
| VRAM used | 24.5 GB | ~20 GB |
| Completion tokens (16 runs) | 93,899 | 87,394 |
| Total gen time | 28 min | 23.4 min |
The speed gap is mostly bandwidth: Q4_K_M streams ~16 GB of weights per token vs Q5_K_XL's ~19 GB, which more than covers the ~13% measured delta (attention and KV-cache work don't shrink with the quant, so the full ~19% bandwidth ratio never fully materializes). On identical quants the base and the fine-tune should land within 2% of each other on this hardware. What's surprising is the variance collapse: Qwopus held 62 tok/s within a 1% window across all 16 runs, while the base swung ~10% across its range.
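A quick sanity check of the bandwidth argument, using the file sizes from the table above (the ~19 GB figure for the base Q5_K_XL weight file is an approximation):

```python
# Bandwidth-bound decode model: each generated token streams the full
# weight file, so tok/s should scale inversely with quant file size.
q4_gb = 16.0   # Qwopus Q4_K_M weight file (from the table)
q5_gb = 19.0   # base Q5_K_XL weight file (approximate)

predicted_speedup = q5_gb / q4_gb    # upper bound from weight bandwidth alone
measured_speedup = 62.3 / 55.3       # average tok/s ratio from the runs above

print(f"predicted {predicted_speedup:.2f}x, measured {measured_speedup:.2f}x")
```

The prediction overshoots the measurement slightly, as expected: KV-cache reads and attention compute are identical for both models, so only the weight-streaming portion of each token shrinks with the quant.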
In Round 1, 3 of 5 agentic prompts (code_debug, structured_extraction, tool_use_json) burned their entire token budget inside `<think>` and emitted empty content. Qwopus handled 4 of the same 5 prompts cleanly with thinking on:

| Task | Round 1 (Qwen base) | Qwopus result |
|---|---|---|
| multi_step_planning | Pass — 3,802 tok w/ think | Pass — 3,158 tok w/ think (shorter) |
| tool_use_json | Empty (needed nothink rerun) | Pass — 1,174 tok w/ think |
| code_debug | Empty (needed nothink rerun) | Pass — 1,628 tok w/ think |
| structured_extraction | Empty (needed nothink rerun) | Empty — starved at 1,500 tok of reasoning, nothink rerun required |
| self_critique | Pass — 2,837 tok w/ think | Pass — 1,277 tok w/ think |
The fine-tune generates substantially shorter reasoning traces — 3,158 vs 3,802 on multi-step, 1,277 vs 2,837 on self-critique. That tighter budgeting is what unblocks the three tasks that failed on base. Structured JSON extraction still needs nothink (or ≥ 6K budget with think) because the reasoning is genuinely long.
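For the structured-extraction case, the two workarounds sketch out like this against a llama.cpp server (field names assume the OpenAI-compatible `/v1/chat/completions` route that `--jinja` enables; `chat_template_kwargs` / `enable_thinking` is the Qwen-family chat-template switch, and the served-model name is hypothetical):

```python
import json

# Option A: disable thinking entirely via the chat template.
nothink_payload = {
    "model": "qwopus3.6-27b",  # hypothetical served-model name
    "messages": [{"role": "user", "content": "Extract the fields as JSON: ..."}],
    "chat_template_kwargs": {"enable_thinking": False},
    "max_tokens": 2048,
}

# Option B: keep thinking on, but raise the budget past the ~1,500-token
# reasoning trace that starved the structured_extraction run above.
think_payload = {**nothink_payload, "max_tokens": 6144}
del think_payload["chat_template_kwargs"]

print(json.dumps(nothink_payload, indent=2))
```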
The code_debug fix caught the planted bugs (`=` vs `==`, bounds logic, `nums[k]` vs `nums[k-1]`) and produced a corrected version, matching the base run. All 5 UI outputs validated: each starts with `<!DOCTYPE html>`, ends with `</html>`, with no truncation.

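That validation is mechanical enough to script; a minimal sketch of the checks as described:

```python
def validate_html(text: str) -> bool:
    """Checks doctype prefix and closing tag; a document cut off
    mid-generation fails the endswith check."""
    body = text.strip()
    return body.startswith("<!DOCTYPE html>") and body.endswith("</html>")

sample = "<!DOCTYPE html>\n<html><body>ok</body></html>"
truncated = "<!DOCTYPE html>\n<html><body>cut off mid-"
print(validate_html(sample), validate_html(truncated))
```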
| Prompt | Qwen base | Qwopus |
|---|---|---|
| saas_landing | 35.8 KB · 9.9 k tok | 36.7 KB · 10.0 k tok |
| analytics_dashboard | 40.8 KB · 12.7 k tok | 37.4 KB · 13.2 k tok |
| designer_portfolio | 20.9 KB · 5.4 k tok | 23.1 KB · 7.4 k tok |
| pricing_page | 29.2 KB · 7.8 k tok | 24.3 KB · 8.1 k tok |
| mobile_app_marketing | 32.4 KB · 9.2 k tok | 29.3 KB · 8.0 k tok |
Tighter spread (23-37 KB vs 21-41 KB). Qwopus uses more tokens per KB of HTML — more whitespace/structure per output rather than bigger pages. Both models handle the brief consistently: Inter + JetBrains Mono on the SaaS page, actual SVG charts on the dashboard (not placeholder rects), magnetic CTA on the portfolio, conic-gradient rotating border on the pricing recommended tier, CSS-only iPhone mockup with 4-7-8 breathing animation on the Stillwater page.
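The tokens-per-KB claim checks out from the table numbers with a quick back-of-envelope (rows copied from the UI table above as `(KB, k-tokens)` pairs):

```python
base = [(35.8, 9.9), (40.8, 12.7), (20.9, 5.4), (29.2, 7.8), (32.4, 9.2)]
qwopus = [(36.7, 10.0), (37.4, 13.2), (23.1, 7.4), (24.3, 8.1), (29.3, 8.0)]

def tok_per_kb(rows):
    # total tokens (thousands -> units) / total KB of emitted HTML
    return sum(t for _, t in rows) * 1000 / sum(kb for kb, _ in rows)

print(f"base: {tok_per_kb(base):.0f} tok/KB, qwopus: {tok_per_kb(qwopus):.0f} tok/KB")
```

Aggregated across the five prompts, Qwopus spends roughly 10% more tokens per KB of HTML than base, consistent with more whitespace and structure per output rather than bigger pages.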
| Prompt | Qwen base | Qwopus |
|---|---|---|
| particle_attractor | 13.1 KB · 4.6 k tok | 11.1 KB · 4.2 k tok |
| webgl_shader (Mandelbulb) | 15.2 KB · 4.9 k tok (shader-bug fix required) | 11.5 KB · 4.4 k tok |
| three_scene (crystals) | 19.9 KB · 6.5 k tok | 17.9 KB · 6.4 k tok |
| physics_sandbox | 21.2 KB · 7.3 k tok | 15.1 KB · 4.4 k tok |
| audio_reactive | 17.8 KB · 6.4 k tok | 12.0 KB · 3.0 k tok |
Qwopus produces tighter canvas output across the board (average 13.5 KB vs 17.6 KB for base). Whether that lands as a working demo varies; this is exactly the kind of prompt where an early-preview fine-tune can regress on edge cases, so open each demo in a browser before shipping. The Round 1 Mandelbulb needed a GLSL type-promotion patch to run; Qwopus's version compiled clean on first inspection, but a thorough check still means loading every demo.
Qwopus is far less prone to thinking starvation at the same max_tokens: 4 of 5 agentic prompts produce useful content with thinking on, vs 2 of 5 on base at matching budgets.

Looking at the rendered UI outputs side-by-side in a browser, the base Qwen3.6-27B's designs feel slightly more polished than Qwopus's on the typical brief. The base lands closer to "near-perfect" on the standard pages (SaaS landing, pricing tier, dashboard), where Qwopus ships clean, functional work that's a half-step behind on the finest details. This gap is exactly the kind of thing that closes as training data scales, and is consistent with a preview trained on ~12K examples.
What's more interesting is that Qwopus occasionally goes further creatively than the base on open-ended prompts. The clearest example is the audio-reactive visualizer: Qwopus produced a structurally unique interpretation compared to the base model's version — different rendering approach, different visual language, different micro-interactions. Slightly less polish in some corners, notably more originality in the whole. That trade — a shade less refinement in exchange for more distinct creative swings — is a reasonable profile for an early reasoning-focused fine-tune.
Qwopus3.6-27B-v1-preview is a clean upgrade over the Qwen3.6-27B base for single-stream reasoning + UI-generation workloads, especially on a consumer 5090 where Q4 fits with 12 GB of headroom.
The headline speedup is largely a quant effect, but the reasoning-trace discipline is a real fine-tune win. If you were running Qwen3.6-27B base at Q5 and losing agentic responses to thinking starvation, swapping to Qwopus3.6 preview at Q4 gets you ~1.13× throughput AND meaningfully fewer empty-content failures at the same max_tokens.
For production design work, the base remains slightly more polished on standard briefs while Qwopus trades that small polish gap for more creative variance — both are shippable today; pick the one that matches your workload. For structured JSON tasks, still disable thinking or bump max_tokens to 6 K.
The real number to watch is the full-scale run. v1-preview was trained on ~12 K curated examples; the in-progress full fine-tune (compute pending — we're collaborating on that right now) is sized orders of magnitude larger with a cleaner data pipeline. If the preview already ships tighter reasoning, matches base on HTML design at a smaller quant, and shows more creative range on open-ended prompts, the full model is where I'd expect the polish gap to close and the creativity advantage to solidify.
Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwen3.6-27B base eval.