Agentic Coding Analysis: orientman-blog

Date: 2026-03-09
Scope: All work performed with OpenCode on the orientman-blog repository
Data sources: OpenCode session database (~/.local/share/opencode/opencode.db), git history, spec-kit specs (001--006), OpenSpec change archive


1. Project Overview

A WordPress-to-static-site blog migration built entirely with an AI coding agent (OpenCode + Claude) over 12 calendar days.

| Metric | Value |
|---|---|
| Total sessions | 230 |
| Productive sessions (code) | 101 (44%) |
| Exploration/planning sessions | 129 (56%) |
| Total messages exchanged | 6,633 |
| Git commits (non-merge) | 156 |
| Spec-kit specs (001--006) | 6 |
| Archived OpenSpec changes | 29 |
| Total tracked changes | 35 |
| Date range | Feb 26 -- Mar 9, 2026 |
| Source code (TS/TSX) | ~4,012 lines |
| Content files (MDX) | 63 posts |

2. Work Classification

All 29 archived OpenSpec changes, plus spec-kit work and additional fixes (38 items in total), classified by type and by whether human correction was required.

2.1 Content Migration & Transformation

Agent effectiveness: EXCELLENT

| Change | Corrections needed? |
|---|---|
| WordPress blog migration (001) | Minor -- gist embeds, code formatting in ~5 MDX files |
| LinkedIn content migration | None -- clean 3-post import |
| LibraryThing reviews migration | Minor -- one book title was in the wrong language |
| Book cover images | None -- bulk 25-post update |
| Word-wrap long prose lines | None -- mechanical reformatting |
| Fix typos/grammar across 19 posts | None |
| Normalize/enrich post tags | None |

7 changes, 2 needed minor corrections. Agent excels at bulk mechanical transforms across hundreds of files.

2.2 Small Feature Addition

Agent effectiveness: VERY GOOD

| Change | Corrections needed? |
|---|---|
| Add Giscus comments | None -- drop-in widget |
| Add social links | None |
| Add social share buttons | None |
| GoatCounter analytics | None -- simple script injection |
| Search link in topbar | None |
| Search post titles (Pagefind) | None |
| Goodreads reading list widget | None |
| Add Goodreads external links | Yes -- links pointed to book page, not review page |
| Add review cover images | None |
| Show star rating | None |
| Longer post summaries | None |
| Rich post excerpts | None |
| Add Gravatar avatar/favicon | Yes -- 4+ sessions; user finally suggested the simple solution |

13 changes, 2 needed correction. Agent is reliable for well-scoped, clearly specified integrations.

2.3 Bug Fixes

Agent effectiveness: GOOD at diagnosis, MIXED at fixing

| Change | Corrections needed? |
|---|---|
| Fix old blog URLs | None |
| Fix quote post nested blockquotes | Yes -- first fix partial; needed "remove also outer blockquote" |
| Fix comments undefined NaN | Yes -- required follow-up for datetime format |
| Dark mode search visibility | Yes -- "Still does not work - see screenshot" |
| Blog title click -> first page | Yes -- over-engineered; user said "Maybe just point to 1st page?" |

5 changes, 4 needed correction. Agent diagnoses well but often over-engineers fixes or makes partial fixes.

2.4 Infrastructure & Tooling

Agent effectiveness: GOOD

| Change | Corrections needed? |
|---|---|
| ESLint + Prettier setup | Yes -- applied too broadly to openspec dirs, needed revert |
| Upgrade Next.js v16 | None -- clean framework upgrade |
| Changelog setup | None |
| GitHub Pages deploy (002) | Yes -- CSS/paths broken on first deploy |

4 changes, 2 needed correction. Config scoping and deploy verification are weak spots.

2.5 Visual Design & Styling

Agent effectiveness: MIXED -- requires heavy iteration

| Change | Corrections needed? |
|---|---|
| Personal visual style | Yes -- multiple rounds of color/styling negotiation |
| Visual style improvements | Yes -- link order inconsistency, readability issues |
| Chateau visual style | Yes -- match-from-screenshot required 82 messages |
| CV update | Minor -- role formatting preferences |
| Remove tag mapping | None |
| Tags index page | None |
| Weighted tag cloud | Yes -- "tag text should be centered inside clouds" |
| URL-aware pagination | None |
| AI badge in header | Yes -- 4+ sessions to get exact styling right |

9 changes, 6 needed correction. Visual/aesthetic work is the agent's weakest area.


3. Failure Mode Taxonomy

Seven distinct failure patterns identified from user message analysis:

| # | Failure Mode | Count | Examples |
|---|---|---|---|
| 1 | Collateral damage | 3 | "revert mdx changes not related to comments datetime"; "Exclude openspec from eslint-prettier-setup and revert" |
| 2 | Over-engineering | 4 | Pagination fix -> user said "just point to 1st page?"; favicon -> user suggested Gravatar after 4 sessions |
| 3 | Partial fix | 3 | "remove also outer blockquote"; "Still does not work - see screenshot" |
| 4 | Data accuracy | 3 | Goodreads links to book not review; wrong book title language; HTML entities not decoded |
| 5 | Visual judgment | 6 | Header styling iterations; link ordering; tag centering; CRT scanline visibility |
| 6 | Self-verification | 3 | "There are some lint errors. Fix them"; CSS broken on deploy; dark mode not tested |
| 7 | Session multiplication | 4 | Favicon (4 sessions); worktree creation (7+ attempts); AI badge (multiple sessions) |

3.1 Collateral Damage

The agent modifies files outside the requested scope, most often during search-and-replace or linting operations. The fix is always a revert, which wastes a round-trip.
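
One mitigation is structural rather than behavioural: scope the tools themselves so out-of-scope files cannot be touched. A sketch using ESLint's flat-config `ignores` (directory names are illustrative, not the repository's actual config):

```typescript
// eslint.config.ts -- illustrative scoping sketch; directory names are assumptions.
// An explicit top-level `ignores` entry keeps the linter (and any agent that
// runs it) out of spec and build directories entirely.
import tseslint from "typescript-eslint";

export default tseslint.config(
  {
    ignores: ["openspec/**", ".specify/**", "specs/**", "out/**", ".next/**"],
  },
  ...tseslint.configs.recommended,
);
```

The same principle applies to Prettier (`.prettierignore`) and to bulk search-and-replace: an explicit allowlist of directories is cheaper than a revert.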

3.2 Over-engineering

The agent builds a complex, "complete" solution when a simple one exists. This is arguably the signature agent failure mode -- humans naturally reach for the simplest solution; agents reach for the most architecturally thorough one.

The favicon saga is the canonical example: four sessions of complex favicon generation approaches before the user said "Can't you just use Gravatar links like before?"

3.3 Partial Fix

The agent addresses the visible symptom but misses the root cause. Often requires a second round where the user points out the remaining issue. Common in CSS/styling fixes where multiple DOM elements contribute to the visual problem.

3.4 Data Accuracy

The agent gets URLs, titles, or factual content wrong. This is particularly dangerous in content migration because errors propagate to published content and may not be caught by automated checks.
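
The "HTML entities not decoded" case from the failure table shows how mechanically simple such guards are once someone thinks of them. A minimal decoding sketch (illustrative only; a real migration script would use a complete entity table, e.g. via the `he` package):

```typescript
// Decode entities that commonly survive a WordPress XML export.
// The table is deliberately partial -- a real script needs a full one.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#8217;": "\u2019", // right single quotation mark
  "&#8211;": "\u2013", // en dash
};

export function decodeEntities(text: string): string {
  // Unknown entities pass through unchanged rather than being guessed at.
  return text.replace(/&(?:#\d+|[a-zA-Z]+);/g, (m) => ENTITIES[m] ?? m);
}
```

A check like this belongs in the migration pipeline itself, so that errors never reach published MDX.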

3.5 Visual Judgment

The agent cannot evaluate whether something "looks right." Every aesthetic decision requires human review, and often 2-3 iterations. Sessions involving visual work average 2-3x more messages than functional work.

3.6 Self-verification Gap

The agent presents work without checking it first. Lint errors, CSS broken on the deployed site, and an untested dark-mode fix all reached the user before the agent noticed. A lint requirement added to AGENTS.md mid-project reduced but did not eliminate the pattern.

3.7 Session Multiplication

Some tasks require repeated restarts because the agent gets stuck or the environment (worktree, git) enters an unrecoverable state. Tool/environment integration is the weakest link in the agent workflow.


4. Correction Rate by Category

| Category | Total | Clean | Corrected | Clean Rate |
|---|---|---|---|---|
| Content migration | 7 | 5 | 2 | 71% |
| Small features | 13 | 11 | 2 | 85% |
| Bug fixes | 5 | 1 | 4 | 20% |
| Infrastructure | 4 | 2 | 2 | 50% |
| Visual/design | 9 | 3 | 6 | 33% |
| Total | 38 | 22 | 16 | 58% |

5. Effectiveness Spectrum

EXCELLENT ████████████████████████  Bulk content transforms, mechanical refactoring
VERY GOOD ██████████████████████    Drop-in integrations, well-scoped features
GOOD      ████████████████          Framework upgrades, CI setup, search/replace
MIXED     ██████████████            Bug diagnosis (good) -> fix (sometimes partial)
WEAK      ████████████              Config scoping, deploy verification
POOR      ██████████                Visual design, aesthetic judgment
POOR      ████████                  Tool/environment issues (worktree, git state)

6. Key Insights

6.1 Exploration Dominance

56% of sessions produced no code. The agent's role as a thinking partner -- exploring ideas, writing specs, planning changes -- was its most-used function. This is underappreciated: spec-driven development with an agent may deliver more value through structured thinking than through code output.

6.2 Small Feature Reliability

The 85% clean rate for well-scoped features is remarkable. For clearly specified, self-contained additions (drop-in widget, script injection, new UI component), the agent is nearly as reliable as a senior developer. The key predictor of success is specification clarity, not task complexity.

6.3 Bug Fix Paradox

Bug fixes have the worst clean rate (20%), despite debugging being a perceived AI strength. The issue is not diagnosis -- the agent consistently identified root causes correctly. The problem is in the fix: agents tend to over-engineer solutions or address symptoms rather than causes. Human course-correction was needed for 4 out of 5 bug fix changes.

6.4 Visual Iteration Cost

Visual/styling work required 2-3x more messages per change. The "Chateau visual style" session had 82 messages -- the most of any productive session. Aesthetic judgment cannot be delegated. The most efficient pattern was the user providing a screenshot and iterating on specifics, rather than describing the desired look in words.

6.5 Self-verification Gap

The agent did not reliably verify its own output before presenting it. Three times the user had to ask the agent to run lint or check deployed results. This was partially addressed by adding a lint requirement to AGENTS.md mid-project. A broader lesson: agents need explicit verification gates in their workflow, not just generation capabilities.
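
A verification gate can be as small as a script that refuses to report success until every check passes. A sketch (the command names are hypothetical; the project's actual npm scripts may differ):

```typescript
// verify.ts -- illustrative verification gate. The agent (or a wrapper around
// it) runs this before presenting work; any non-zero exit blocks the report.
import { spawnSync } from "node:child_process";

const CHECKS: Array<[string, string[]]> = [
  ["npm", ["run", "lint"]],  // hypothetical lint script
  ["npm", ["run", "build"]], // hypothetical build script
];

export function runGate(run: typeof spawnSync = spawnSync): boolean {
  for (const [cmd, args] of CHECKS) {
    const { status } = run(cmd, args, { stdio: "inherit" });
    if (status !== 0) {
      console.error(`verification failed: ${cmd} ${args.join(" ")}`);
      return false;
    }
  }
  return true;
}
```

The AGENTS.md lint requirement approximated this; a script-enforced gate is stricter because it cannot be forgotten.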

6.6 Over-engineering as Signature Failure

When the agent fails, it almost never fails by doing too little. It fails by doing too much -- building elaborate solutions when simple ones exist. The human's most common correction was simplification, not addition. This inverts the common assumption that AI coding assistants are "lazy" or produce minimal solutions.


7. Raw Data

7.1 Top 10 Sessions by Message Count

| Title | Messages | Adds | Dels | Files |
|---|---|---|---|---|
| Next.js static site from repo (no DB) | 346 | 1,673 | 0 | 8 |
| Add syntax highlighting for code blocks | 285 | 173 | 0 | 2 |
| Clarification workflow for spec verification | 260 | 1,099 | 1 | 7 |
| Strikethrough formatting not working | 192 | 1,978 | 758 | 526 |
| OpenSpec implementation workflow | 158 | 9,789 | 5,665 | 290 |
| Fix comments showing undefined NaN | 140 | 909 | 0 | 8 |
| Fix gist embeds in WordPress migration | 133 | 264 | 0 | 3 |
| OpenSpec implementation workflow | 131 | 11,457 | 2,095 | 110 |
| Implement OpenSpec change tasks | 106 | 830 | 68 | 38 |
| OpenSpec task generation | 104 | 4,269 | 1,050 | 537 |

7.2 Session Productivity Split

  • 101 sessions (44%) -- produced code changes
  • 129 sessions (56%) -- exploration, planning, spec writing only
  • Average messages per productive session: ~40
  • Average messages per exploration session: ~20

7.3 All 29 Archived OpenSpec Changes (Chronological)

  1. add-gravatar-avatar (2026-03-03)
  2. fix-old-blog-urls (2026-03-04)
  3. remove-tag-mapping (2026-03-04)
  4. tags-index-page (2026-03-04)
  5. weighted-tag-cloud (2026-03-04)
  6. chateau-visual-style (2026-03-05)
  7. eslint-prettier-setup (2026-03-05)
  8. fix-quote-post-nested-blockquotes (2026-03-05)
  9. longer-post-summaries (2026-03-05)
  10. related-posts (2026-03-05)
  11. rich-post-excerpts (2026-03-05)
  12. url-aware-pagination (2026-03-05)
  13. book-cover-images (2026-03-06)
  14. librarything-reviews-migration (2026-03-06)
  15. linkedin-content-migration (2026-03-06)
  16. show-star-rating (2026-03-06)
  17. upgrade-nextjs-v16 (2026-03-06)
  18. add-giscus-comments (2026-03-07)
  19. add-goodreads-external-links (2026-03-07)
  20. add-review-cover-images (2026-03-07)
  21. add-social-links (2026-03-07)
  22. add-social-share-buttons (2026-03-07)
  23. goodreads-reading-list (2026-03-07)
  24. search-link-in-topbar (2026-03-07)
  25. search-post-titles (2026-03-07)
  26. changelog (2026-03-08)
  27. cv-update (2026-03-08)
  28. personal-visual-style (2026-03-08)
  29. visual-style-improvements (2026-03-08)

7.4 All 6 Spec-kit Specs (Chronological)

  1. 001-wordpress-blog-migration (Feb 26 -- Mar 1) -- 11 artifacts, 55 tasks (53 checked)
  2. 002-gh-pages-deploy (Feb 28 -- Mar 1) -- 7 artifacts, 8 tasks (7 checked)
  3. 003-fix-gist-embeds (Mar 1 -- Mar 2) -- 6 artifacts, 10 tasks (0 checked)
  4. 004-fix-gfm-strikethrough (Mar 1 -- Mar 2) -- 6 artifacts, 10 tasks (0 checked)
  5. 005-syntax-highlighting (Mar 1 -- Mar 2) -- 8 artifacts, 30 tasks (0 checked)
  6. 006-fix-comments-undefined-nan (Mar 2) -- 7 artifacts, 13 tasks (0 checked)

8. Spec-kit Era (Feb 26 -- Mar 2)

Before the project adopted OpenSpec, the first 5 days used the spec-kit workflow -- a heavier specification system driven by .specify/ templates and /speckit.* slash commands.

8.1 Overview

Spec-kit produced a deep artifact pipeline for every change:

spec.md -> clarifications.md -> plan.md -> research.md -> data-model.md
-> contracts/ -> quickstart.md -> tasks.md -> checklists/

Each spec could generate up to 9 distinct artifact types plus subdirectories for contracts and checklists. A total of 45 files were produced across 6 specs in 5 days.

8.2 Spec Inventory

| # | Spec | Files | Tasks | Checked | Artifact types |
|---|---|---|---|---|---|
| 001 | WordPress blog migration | 11 | 55 | 53 | spec, plan, research, data-model, quickstart, tasks, contracts/2, checklists, migration-audit, data/XML |
| 002 | GitHub Pages deploy | 7 | 8 | 7 | spec, plan, research, data-model, quickstart, tasks, checklists |
| 003 | Fix gist embeds | 6 | 10 | 0 | spec, plan, research, data-model, tasks, checklists |
| 004 | Fix GFM strikethrough | 6 | 10 | 0 | spec, plan, research, quickstart, tasks, checklists |
| 005 | Syntax highlighting | 8 | 30 | 0 | spec, plan, research, data-model, quickstart, tasks, contracts/1, checklists |
| 006 | Fix comments undefined NaN | 7 | 13 | 0 | spec, plan, research, data-model, quickstart, tasks, checklists |

8.3 Observations

Task tracking broke after spec 002. Specs 001 and 002 had diligent task tracking -- 53/55 and 7/8 tasks checked off respectively. Specs 003--006 had zero tasks checked despite all work being completed (the features shipped). The ceremony of updating checkboxes in tasks.md was abandoned once the overhead exceeded its value.

Artifact volume did not track problem complexity. Spec 005 (syntax highlighting) generated 30 tasks across 7 phases including a contracts/ directory -- for what ultimately required installing rehype-pretty-code and configuring Shiki. Spec 004 (GFM strikethrough) produced 6 artifacts including a full plan.md and research.md -- for a fix that amounted to adding remark-gfm to the MDX pipeline.
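
For scale: in a typical Next.js + MDX setup both fixes reduce to a few lines of pipeline configuration. A sketch assuming `@next/mdx` (plugin options and theme are illustrative, not the repository's actual config):

```typescript
// next.config.ts -- illustrative sketch of the two one-line-scale fixes.
import createMDX from "@next/mdx";
import remarkGfm from "remark-gfm";
import rehypePrettyCode from "rehype-pretty-code";

const withMDX = createMDX({
  options: {
    remarkPlugins: [remarkGfm], // GFM strikethrough (the substance of spec 004)
    rehypePlugins: [[rehypePrettyCode, { theme: "github-dark" }]], // Shiki highlighting (spec 005)
  },
});

export default withMDX({
  pageExtensions: ["ts", "tsx", "md", "mdx"],
});
```

Set against this, 30 tasks and a contracts/ directory for spec 005 illustrates the mismatch between artifact weight and code weight.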

Early specs were higher fidelity. Spec 001 (WordPress migration) was the most elaborate: 11 files including a migration-audit.md, the raw WordPress XML export, and two contract documents defining component and route interfaces. This level of specification was justified -- the migration was genuinely complex, touching 63 posts across multiple content types. The same level of specification for a 1-line fix (spec 004) was not.

The agent generated the specifications enthusiastically. Spec-kit used /speckit.clarify, /speckit.plan, /speckit.research and similar commands. The agent never pushed back on generating artifacts -- it produced full research.md files and data-model.md documents even when the problem was trivially understood. This is a form of over-engineering (failure mode #2) applied to the workflow itself rather than to code.


9. Workflow Comparison: Spec-kit vs OpenSpec

The migration from spec-kit to OpenSpec (commit fb2c279, Mar 2) was the most significant process change in the project.

9.1 Side-by-side

| Dimension | Spec-kit (days 1--5) | OpenSpec (days 5--12) |
|---|---|---|
| Changes completed | 6 | 29 |
| Calendar days | 5 | 7 |
| Velocity (changes/day) | 1.2 | 4.1 |
| Artifacts per change | ~7.5 (45 files / 6 specs) | ~3 (proposal + design + tasks) |
| Total artifact files | 45 | ~87 |
| Artifact pipeline | spec -> clarify -> plan -> research -> data-model -> contracts -> quickstart -> tasks -> checklists | proposal -> design -> tasks |
| Task tracking fidelity | Abandoned after spec 002 | Consistently maintained |
| Max tasks in a single spec | 55 (001) | ~8--12 typical |
| Lightest change possible | 6 files minimum | 3 files |

9.2 Why Spec-kit Was Abandoned

Three factors drove the migration:

1. Over-specification overhead. Generating 6--11 artifact files for every change -- including bug fixes that needed one line of code -- created a spec-to-code ratio that was unsustainable. The agent was spending more sessions writing specifications than writing code.

2. Tracking fidelity collapsed. The checkbox-based task tracking in tasks.md was only maintained for the first two specs. Once the user and agent stopped updating checklists, the tracking artifacts became dead weight -- files that existed but conveyed no accurate status information.

3. Artifact types didn't match the work. data-model.md was generated for bug fixes that had no data model. contracts/ directories were created for features that had no API surface. The spec-kit pipeline was designed for greenfield service development, not for iterative blog feature work.

9.3 What OpenSpec Changed

OpenSpec addressed all three issues:

  • Fewer artifacts: 3 per change instead of ~8, with each one serving a distinct purpose (what -> how -> do).
  • Lighter tasks: Task lists averaged 8--12 items instead of 30--55, making them practical to track.
  • Flexible depth: Simple changes got thin proposals; complex ones got detailed designs. The artifact weight scaled with problem complexity instead of being fixed.

9.4 Impact on Agent Effectiveness

The velocity improvement (1.2 -> 4.1 changes/day) is striking but partially misleading -- OpenSpec changes were on average smaller than spec-kit specs. A more meaningful comparison: spec-kit's 6 specs represent roughly the same functional scope as OpenSpec's first 10 changes (the migration, deploy, and initial bug fixes).

The real impact was on feedback loop speed. Under spec-kit, the agent spent multiple sessions generating artifacts before implementation began. Under OpenSpec, the proposal-to-implementation cycle was typically completed within a single session. Faster feedback loops meant corrections were caught earlier and cost less to fix.

9.5 Meta-lesson: The Agent Over-engineered Its Own Process

The spec-kit workflow itself was an instance of the agent's signature failure mode -- over-engineering (Section 3.2). When asked to help design a specification workflow, the agent produced the most thorough, artifact-heavy system it could conceive. It took the human 5 days to recognise that the specification overhead exceeded its value and migrate to something lighter.

This mirrors the code-level pattern exactly: the agent builds the most architecturally complete solution, and the human's primary role is simplification. The difference is that a workflow affects every subsequent change, so the cost of over-engineering compounds. Migrating from spec-kit to OpenSpec was, in effect, the same corrective action as the user saying "Maybe just point to the 1st page?" -- but applied to the development process instead of a single feature.