A direct apples-to-apples rerun of the Qwen3.6-27B base evaluation against Jackrong's Qwopus3.6-27B-v1-preview reasoning fine-tune. Same 16 prompts. Same hardware. Same harness.
| Item | Value |
|---|---|
| Model | Jackrong/Qwopus3.6-27B-v1-preview-GGUF — Q4_K_M (16 GB) |
| Base | Qwen/Qwen3.6-27B (evaluated separately in Round 1) |
| Training data | ~12K curated examples: Claude-Distillation, GLM-5.1-Reasoning, Kimi-K2.5-Reasoning, Qwen3.5-reasoning |
| Runtime | llama.cpp cuda-12.8, --flash-attn on, --jinja |
| Context | 65,536 tokens, q8_0 K+V cache, single slot |
| Hardware | RTX 5090 (32 GB), all layers offloaded |
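The runtime config in the table maps onto a llama-server invocation roughly like this (the model filename is illustrative; flag spellings are as in recent llama.cpp builds — adjust for your version):

```shell
# Single-slot serving: full GPU offload, 64K context, q8_0 KV cache,
# flash attention on, --jinja for the model's chat template.
llama-server \
  -m Qwopus3.6-27B-v1-preview-Q4_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 65536 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --flash-attn on \
  --jinja \
  --parallel 1
```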

| Metric | Qwen3.6-27B base (Q5) | Qwopus3.6 preview (Q4) |
|---|---|---|
| avg tok/s | 55.3 | 62.3 |
| min / max tok/s | 51.3 / 56.0 | 61.8 / 62.7 |
| VRAM used | 24.5 GB | ~20 GB |
| Completion tokens (16 runs) | 93,899 | 87,394 |
| Total gen time | 28 min | 23.4 min |
The speed gap is mostly bandwidth: Q4_K_M streams ~16 GB of weights per token vs Q5_K_XL's ~19 GB, which more than covers the ~13% measured delta (attention and KV-cache work don't shrink with the quant, so the full ~19% bandwidth ratio never fully materializes). On identical quants the base and the fine-tune should land within 2% of each other on this hardware. What's surprising is the variance collapse: Qwopus held 62 tok/s within a 1% window across all 16 runs, while the base swung ~10% across its range.
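A quick sanity check of the bandwidth argument, using the file sizes from the table above (the ~19 GB figure for the base Q5_K_XL weight file is an approximation):

```python
# Bandwidth-bound decode model: each generated token streams the full
# weight file, so tok/s should scale inversely with quant file size.
q4_gb = 16.0   # Qwopus Q4_K_M weight file (from the table)
q5_gb = 19.0   # base Q5_K_XL weight file (approximate)

predicted_speedup = q5_gb / q4_gb    # upper bound from weight bandwidth alone
measured_speedup = 62.3 / 55.3       # average tok/s ratio from the runs above

print(f"predicted {predicted_speedup:.2f}x, measured {measured_speedup:.2f}x")
```

The prediction overshoots the measurement slightly, as expected: KV-cache reads and attention compute are identical for both models, so only the weight-streaming portion of each token shrinks with the quant.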
In Round 1, 3 of 5 agentic prompts (code_debug, structured_extraction, tool_use_json) burned their entire token budget inside `<think>` and emitted empty content. Qwopus handled 4 of the same 5 prompts cleanly with thinking on:

| Task | Round 1 (Qwen base) | Qwopus result |
|---|---|---|
| multi_step_planning | Pass — 3,802 tok w/ think | Pass — 3,158 tok w/ think (shorter) |
| tool_use_json | Empty (needed nothink rerun) | Pass — 1,174 tok w/ think |
| code_debug | Empty (needed nothink rerun) | Pass — 1,628 tok w/ think |
| structured_extraction | Empty (needed nothink rerun) | Empty — starved at 1,500 tok of reasoning, nothink rerun required |
| self_critique | Pass — 2,837 tok w/ think | Pass — 1,277 tok w/ think |
The fine-tune generates substantially shorter reasoning traces — 3,158 vs 3,802 on multi-step, 1,277 vs 2,837 on self-critique. That tighter budgeting is what unblocks the three tasks that failed on base. Structured JSON extraction still needs nothink (or ≥ 6K budget with think) because the reasoning is genuinely long.
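For the structured-extraction case, the two workarounds sketch out like this against a llama.cpp server (field names assume the OpenAI-compatible `/v1/chat/completions` route that `--jinja` enables; `chat_template_kwargs` / `enable_thinking` is the Qwen-family chat-template switch, and the served-model name is hypothetical):

```python
import json

# Option A: disable thinking entirely via the chat template.
nothink_payload = {
    "model": "qwopus3.6-27b",  # hypothetical served-model name
    "messages": [{"role": "user", "content": "Extract the fields as JSON: ..."}],
    "chat_template_kwargs": {"enable_thinking": False},
    "max_tokens": 2048,
}

# Option B: keep thinking on, but raise the budget past the ~1,500-token
# reasoning trace that starved the structured_extraction run above.
think_payload = {**nothink_payload, "max_tokens": 6144}
del think_payload["chat_template_kwargs"]

print(json.dumps(nothink_payload, indent=2))
```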
The code_debug fix caught the planted bugs (`=` vs `==`, bounds logic, `nums[k]` vs `nums[k-1]`) and produced a corrected version, matching the base run. All 5 UI outputs validated: each starts with `<!DOCTYPE html>`, ends with `</html>`, with no truncation.

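That validation is mechanical enough to script; a minimal sketch of the checks as described:

```python
def validate_html(text: str) -> bool:
    """Checks doctype prefix and closing tag; a document cut off
    mid-generation fails the endswith check."""
    body = text.strip()
    return body.startswith("<!DOCTYPE html>") and body.endswith("</html>")

sample = "<!DOCTYPE html>\n<html><body>ok</body></html>"
truncated = "<!DOCTYPE html>\n<html><body>cut off mid-"
print(validate_html(sample), validate_html(truncated))
```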
| Prompt | Qwen base | Qwopus |
|---|---|---|
| saas_landing | 35.8 KB · 9.9 k tok | 36.7 KB · 10.0 k tok |
| analytics_dashboard | 40.8 KB · 12.7 k tok | 37.4 KB · 13.2 k tok |
| designer_portfolio | 20.9 KB · 5.4 k tok | 23.1 KB · 7.4 k tok |
| pricing_page | 29.2 KB · 7.8 k tok | 24.3 KB · 8.1 k tok |
| mobile_app_marketing | 32.4 KB · 9.2 k tok | 29.3 KB · 8.0 k tok |
Tighter spread (23-37 KB vs 21-41 KB). Qwopus uses more tokens per KB of HTML — more whitespace/structure per output rather than bigger pages. Both models handle the brief consistently: Inter + JetBrains Mono on the SaaS page, actual SVG charts on the dashboard (not placeholder rects), magnetic CTA on the portfolio, conic-gradient rotating border on the pricing recommended tier, CSS-only iPhone mockup with 4-7-8 breathing animation on the Stillwater page.
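The tokens-per-KB claim checks out from the table numbers with a quick back-of-envelope (rows copied from the UI table above as `(KB, k-tokens)` pairs):

```python
base = [(35.8, 9.9), (40.8, 12.7), (20.9, 5.4), (29.2, 7.8), (32.4, 9.2)]
qwopus = [(36.7, 10.0), (37.4, 13.2), (23.1, 7.4), (24.3, 8.1), (29.3, 8.0)]

def tok_per_kb(rows):
    # total tokens (thousands -> units) / total KB of emitted HTML
    return sum(t for _, t in rows) * 1000 / sum(kb for kb, _ in rows)

print(f"base: {tok_per_kb(base):.0f} tok/KB, qwopus: {tok_per_kb(qwopus):.0f} tok/KB")
```

Aggregated across the five prompts, Qwopus spends roughly 10% more tokens per KB of HTML than base, consistent with more whitespace and structure per output rather than bigger pages.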
| Prompt | Qwen base | Qwopus |
|---|---|---|
| particle_attractor | 13.1 KB · 4.6 k tok | 11.1 KB · 4.2 k tok |
| webgl_shader (Mandelbulb) | 15.2 KB · 4.9 k tok (shader-bug fix required) | 11.5 KB · 4.4 k tok |
| three_scene (crystals) | 19.9 KB · 6.5 k tok | 17.9 KB · 6.4 k tok |
| physics_sandbox | 21.2 KB · 7.3 k tok | 15.1 KB · 4.4 k tok |
| audio_reactive | 17.8 KB · 6.4 k tok | 12.0 KB · 3.0 k tok |
Qwopus produces tighter canvas output across the board (average 13.5 KB vs 17.6 KB for base). Whether that lands as a working demo varies; this is exactly the kind of prompt where an early-preview fine-tune can regress on edge cases, so open each demo in a browser before shipping. The Round 1 Mandelbulb needed a GLSL type-promotion patch to run; Qwopus's version compiled clean on first inspection, but a thorough check still means loading every demo.
Qwopus is far less prone to thinking starvation at the same max_tokens: 4 of 5 agentic prompts produce useful content with thinking on, vs 2 of 5 on base at matching budgets.

Looking at the rendered UI outputs side-by-side in a browser, the base Qwen3.6-27B's designs feel slightly more polished than Qwopus's on the typical brief. The base lands closer to "near-perfect" on the standard pages (SaaS landing, pricing tier, dashboard), where Qwopus ships clean, functional work that's a half-step behind on the finest details. This gap is exactly the kind of thing that closes as training data scales, and is consistent with a preview trained on ~12K examples.
What's more interesting is that Qwopus occasionally goes further creatively than the base on open-ended prompts. The clearest example is the audio-reactive visualizer: Qwopus produced a structurally unique interpretation compared to the base model's version — different rendering approach, different visual language, different micro-interactions. Slightly less polish in some corners, notably more originality in the whole. That trade — a shade less refinement in exchange for more distinct creative swings — is a reasonable profile for an early reasoning-focused fine-tune.
Qwopus3.6-27B-v1-preview is a clean upgrade over the Qwen3.6-27B base for single-stream reasoning + UI-generation workloads, especially on a consumer 5090 where Q4 fits with 12 GB of headroom.
The headline speedup is largely a quant effect, but the reasoning-trace discipline is a real fine-tune win. If you were running Qwen3.6-27B base at Q5 and losing agentic responses to thinking starvation, swapping to Qwopus3.6 preview at Q4 gets you ~1.13× throughput AND meaningfully fewer empty-content failures at the same max_tokens.
For production design work, the base remains slightly more polished on standard briefs while Qwopus trades that small polish gap for more creative variance — both are shippable today; pick the one that matches your workload. For structured JSON tasks, still disable thinking or bump max_tokens to 6 K.
The real number to watch is the full-scale run. v1-preview was trained on ~12 K curated examples; the in-progress full fine-tune (compute pending — we're collaborating on that right now) is sized orders of magnitude larger with a cleaner data pipeline. If the preview already ships tighter reasoning, matches base on HTML design at a smaller quant, and shows more creative range on open-ended prompts, the full model is where I'd expect the polish gap to close and the creativity advantage to solidify.
Raw outputs and per-run metadata JSON preserved alongside each HTML file in this repo. Same harness and prompts as the Qwen3.6-27B base eval.