Qwopus3.6-27B v1-preview — Q4_K_M evaluation

by Kyle Hessling · @KyleHessling1 on X · fine-tune by Jackrong

Early preview — not the final Qwopus 3.6 model. This evaluation is against v1-preview, a small ~12 K-example training pass. I'm currently working with Jackrong to secure more compute for a full fine-tune run — an orders-of-magnitude larger training set, a cleaner data pipeline, different base sampling. Treat the numbers here as a directional signal on the fine-tune approach, not on the final model.

A direct apples-to-apples rerun of the Qwen3.6-27B base evaluation against Jackrong's Qwopus3.6-27B-v1-preview reasoning fine-tune. Same 16 prompts. Same hardware. Same harness.

TL;DR

Setup

| Item | Value |
|---|---|
| Model | Jackrong/Qwopus3.6-27B-v1-preview-GGUF — Q4_K_M (16 GB) |
| Base | Qwen/Qwen3.6-27B (evaluated separately in Round 1) |
| Training data | ~12 K curated examples: Claude-Distillation, GLM-5.1-Reasoning, Kimi-K2.5-Reasoning, Qwen3.5-reasoning |
| Runtime | llama.cpp cuda-12.8, --flash-attn on, --jinja |
| Context | 65,536 tokens, q8_0 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all layers offloaded |
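
Assuming a recent llama.cpp build, the setup above corresponds to a launch along these lines (a sketch: the model filename is an assumption, and flag spellings drift between llama.cpp versions — verify against `llama-server --help` on yours):

```shell
# Serving config from the table above: full offload, 64K context,
# q8_0 KV cache, flash attention, embedded Jinja chat template, one slot.
llama-server \
  -m Qwopus3.6-27B-v1-preview-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --flash-attn \
  --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --parallel 1
```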

Throughput

| Metric | Qwen3.6-27B base (Q5) | Qwopus3.6 preview (Q4) |
|---|---|---|
| avg tok/s | 55.3 | 62.3 |
| min / max | 51.3 / 56.0 | 61.8 / 62.7 |
| VRAM used | 24.5 GB | ~20 GB |
| Completion tokens (16 runs) | 93,899 | 87,394 |
| Total gen time | 28 min | 23.4 min |

The speed gap is mostly bandwidth: Q4_K_M moves ~16 GB of weights per token vs Q5_K_XL's ~19 GB. That tracks the ~12% delta cleanly. On identical quant the base and the fine-tune should be within 2% of each other on this hardware. What's surprising is the variance collapse — Qwopus held 62 tok/s within a 1% window across all 16 runs, while the base flexed 10% across its range.

Agentic reasoning

Thinking starvation — better, not gone

In Round 1, 3 of 5 agentic prompts (code_debug, structured_extraction, tool_use_json) burned their entire token budget inside <think> and emitted empty content. Qwopus handled 4 of the same 5 prompts cleanly with thinking on:

| Task | Round 1 (Qwen base) | Qwopus result |
|---|---|---|
| multi_step_planning | Pass — 3,802 tok w/ think | Pass — 3,158 tok w/ think (shorter) |
| tool_use_json | Empty (needed nothink rerun) | Pass — 1,174 tok w/ think |
| code_debug | Empty (needed nothink rerun) | Pass — 1,628 tok w/ think |
| structured_extraction | Empty (needed nothink rerun) | Empty — starved at 1,500 tok of reasoning, nothink rerun required |
| self_critique | Pass — 2,837 tok w/ think | Pass — 1,277 tok w/ think |

The fine-tune generates substantially shorter reasoning traces — 3,158 vs 3,802 tokens on multi-step planning, 1,277 vs 2,837 on self-critique. That tighter budgeting is what unblocks two of the three tasks that failed on base. Structured JSON extraction still needs nothink (or a ≥ 6 K budget with thinking on) because its reasoning is genuinely long.
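
The rerun logic can be sketched as a simple check (a sketch assuming an OpenAI-compatible chat-completion response from llama.cpp's server — not the actual harness code):

```python
def needs_nothink_rerun(response: dict) -> bool:
    """True when a run starved: the model hit the token limit while all
    output stayed in the reasoning channel, leaving visible content empty.
    Assumes an OpenAI-compatible chat-completion response dict."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    return content == "" and choice.get("finish_reason") == "length"

# Usage: on True, rerun the same prompt with thinking disabled
# (or a substantially larger max_tokens budget).
```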

Quality notes

Front-end design (5 prompts)

All 5 outputs validated: start with <!DOCTYPE html>, end with </html>, no truncation.
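
That validation amounts to a cheap truncation check — a sketch in the spirit of the harness, not its actual code:

```python
def looks_complete(html: str) -> bool:
    # An output passes if it opens with a doctype and closes with </html>;
    # anything cut off mid-tag by the token limit fails the tail check.
    s = html.strip()
    return s.lower().startswith("<!doctype html>") and s.endswith("</html>")
```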

| Prompt | Qwen base | Qwopus |
|---|---|---|
| saas_landing | 35.8 KB · 9.9 k tok | 36.7 KB · 10.0 k tok |
| analytics_dashboard | 40.8 KB · 12.7 k tok | 37.4 KB · 13.2 k tok |
| designer_portfolio | 20.9 KB · 5.4 k tok | 23.1 KB · 7.4 k tok |
| pricing_page | 29.2 KB · 7.8 k tok | 24.3 KB · 8.1 k tok |
| mobile_app_marketing | 32.4 KB · 9.2 k tok | 29.3 KB · 8.0 k tok |

Tighter spread (23-37 KB vs 21-41 KB). Qwopus uses more tokens per KB of HTML — more whitespace/structure per output rather than bigger pages. Both models handle the brief consistently: Inter + JetBrains Mono on the SaaS page, actual SVG charts on the dashboard (not placeholder rects), magnetic CTA on the portfolio, conic-gradient rotating border on the pricing recommended tier, CSS-only iPhone mockup with 4-7-8 breathing animation on the Stillwater page.

Canvas / WebGL (6 prompts)

| Prompt | Qwen base | Qwopus |
|---|---|---|
| particle_attractor | 13.1 KB · 4.6 k tok | 11.1 KB · 4.2 k tok |
| webgl_shader (Mandelbulb) | 15.2 KB · 4.9 k tok (shader-bug fix required) | 11.5 KB · 4.4 k tok |
| three_scene (crystals) | 19.9 KB · 6.5 k tok | 17.9 KB · 6.4 k tok |
| physics_sandbox | 21.2 KB · 7.3 k tok | 15.1 KB · 4.4 k tok |
| audio_reactive | 17.8 KB · 6.4 k tok | 12.0 KB · 3.0 k tok |

Qwopus produces tighter canvas output across the board (average 13.5 KB vs the base's 17.6 KB). Whether that lands as a working demo varies — this is exactly the kind of prompt where an early-preview fine-tune can regress on edge cases, so open each demo in the browser before shipping. The Round 1 Mandelbulb needed a GLSL type-promotion patch to run; Qwopus's version compiled clean on first inspection, but inspection alone is no substitute for actually running each one.

What the fine-tune buys you

Caveats

Subjective design quality

Looking at the rendered UI outputs side-by-side in a browser, the base Qwen3.6-27B's designs feel slightly more polished than Qwopus's on the typical brief — the base lands closer to "near-perfect" on the standard pages (SaaS landing, pricing tier, dashboard) where Qwopus ships clean, functional work that's a half-step behind on the finest details. This gap is exactly the kind of thing that closes as training data scales, and is consistent with a preview trained on ~12 K examples.

What's more interesting is that Qwopus occasionally goes further creatively than the base on open-ended prompts. The clearest example is the audio-reactive visualizer: Qwopus produced a structurally unique interpretation compared to the base model's version — different rendering approach, different visual language, different micro-interactions. Slightly less polish in some corners, notably more originality in the whole. That trade — a shade less refinement in exchange for more distinct creative swings — is a reasonable profile for an early reasoning-focused fine-tune.

Verdict

Qwopus3.6-27B-v1-preview is a clean upgrade over the Qwen3.6-27B base for single-stream reasoning + UI-generation workloads, especially on a consumer 5090 where Q4 fits with 12 GB of headroom.

The headline speedup is largely a quant effect, but the reasoning-trace discipline is a real fine-tune win. If you were running Qwen3.6-27B base at Q5 and losing agentic responses to thinking starvation, swapping to Qwopus3.6 preview at Q4 gets you 1.12× throughput AND meaningfully fewer empty-content failures at the same max_tokens.

For production design work, the base remains slightly more polished on standard briefs while Qwopus trades that small polish gap for more creative variance — both are shippable today; pick the one that matches your workload. For structured JSON tasks, still disable thinking or bump max_tokens to 6 K.
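
For the structured-JSON case, both mitigations are one-line changes to the request. A sketch, with two assumptions called out: the endpoint is OpenAI-compatible, and `/no_think` is the Qwen-family soft switch for disabling thinking — verify that this fine-tune's chat template honors it.

```python
def build_request(prompt: str, nothink: bool = False,
                  max_tokens: int = 2048) -> dict:
    """Chat payload for an OpenAI-compatible endpoint. Either disable
    thinking via the template's "/no_think" soft switch (assumption: the
    template honors it), or raise max_tokens so the genuinely long
    extraction reasoning finishes before the limit."""
    content = f"{prompt} /no_think" if nothink else prompt
    return {"messages": [{"role": "user", "content": content}],
            "max_tokens": max_tokens}

# e.g. build_request(p, nothink=True) or build_request(p, max_tokens=6144)
```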

The real number to watch is the full-scale run. v1-preview was trained on ~12 K curated examples; the in-progress full fine-tune (compute pending — we're collaborating on that right now) is sized orders of magnitude larger with a cleaner data pipeline. If the preview already ships tighter reasoning, matches base on HTML design at a smaller quant, and shows more creative range on open-ended prompts, the full model is where I'd expect the polish gap to close and the creativity advantage to solidify.

Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwen3.6-27B base eval.