Agentic Coding Analysis: orientman-blog

Date: 2026-03-09
Scope: All work performed with OpenCode on the orientman-blog repository
Data sources: OpenCode session database (~/.local/share/opencode/opencode.db), git history, spec-kit specs (001--006), OpenSpec change archive


1. Project Overview

A WordPress-to-static-site blog migration built entirely with an AI coding agent (OpenCode + Claude) over 12 calendar days.

| Metric | Value |
|---|---|
| Total sessions | 230 |
| Productive sessions (code) | 101 (44%) |
| Exploration/planning sessions | 129 (56%) |
| Total messages exchanged | 6,633 |
| Git commits (non-merge) | 156 |
| Spec-kit specs (001--006) | 6 |
| Archived OpenSpec changes | 29 |
| Total tracked changes | 35 |
| Date range | Feb 26 -- Mar 9, 2026 |
| Source code (TS/TSX) | ~4,012 lines |
| Content files (MDX) | 63 posts |

2. Work Classification

All 29 archived OpenSpec changes, plus spec-kit work and additional fixes (38 items in total), classified by type and by whether human correction was required.

2.1 Content Migration & Transformation

Agent effectiveness: EXCELLENT

| Change | Corrections needed? |
|---|---|
| WordPress blog migration (001) | Minor -- gist embeds, code formatting in ~5 MDX files |
| LinkedIn content migration | None -- clean 3-post import |
| LibraryThing reviews migration | Minor -- one book title was in the wrong language |
| Book cover images | None -- bulk 25-post update |
| Word-wrap long prose lines | None -- mechanical reformatting |
| Fix typos/grammar across 19 posts | None |
| Normalize/enrich post tags | None |

7 changes, 2 needed minor corrections. Agent excels at bulk mechanical transforms across hundreds of files.

2.2 Small Feature Addition

Agent effectiveness: VERY GOOD

| Change | Corrections needed? |
|---|---|
| Add Giscus comments | None -- drop-in widget |
| Add social links | None |
| Add social share buttons | None |
| GoatCounter analytics | None -- simple script injection |
| Search link in topbar | None |
| Search post titles (Pagefind) | None |
| Goodreads reading list widget | None |
| Add Goodreads external links | Yes -- links pointed to book page, not review page |
| Add review cover images | None |
| Show star rating | None |
| Longer post summaries | None |
| Rich post excerpts | None |
| Add Gravatar avatar/favicon | Yes -- 4+ sessions; user finally suggested the simple solution |

13 changes, 2 needed correction. Agent is reliable for well-scoped, clearly specified integrations.

2.3 Bug Fixes

Agent effectiveness: GOOD at diagnosis, MIXED at fixing

| Change | Corrections needed? |
|---|---|
| Fix old blog URLs | None |
| Fix quote post nested blockquotes | Yes -- first fix partial; needed "remove also outer blockquote" |
| Fix comments undefined NaN | Yes -- required follow-up for datetime format |
| Dark mode search visibility | Yes -- "Still does not work - see screenshot" |
| Blog title click -> first page | Yes -- over-engineered; user said "Maybe just point to 1st page?" |

5 changes, 4 needed correction. Agent diagnoses well but often over-engineers fixes or makes partial fixes.

2.4 Infrastructure & Tooling

Agent effectiveness: GOOD

| Change | Corrections needed? |
|---|---|
| ESLint + Prettier setup | Yes -- applied too broadly to openspec dirs, needed revert |
| Upgrade Next.js v16 | None -- clean framework upgrade |
| Changelog setup | None |
| GitHub Pages deploy (002) | Yes -- CSS/paths broken on first deploy |

4 changes, 2 needed correction. Config scoping and deploy verification are weak spots.

2.5 Visual Design & Styling

Agent effectiveness: MIXED -- requires heavy iteration

| Change | Corrections needed? |
|---|---|
| Personal visual style | Yes -- multiple rounds of color/styling negotiation |
| Visual style improvements | Yes -- link order inconsistency, readability issues |
| Chateau visual style | Yes -- match-from-screenshot required 82 messages |
| CV update | Minor -- role formatting preferences |
| Remove tag mapping | None |
| Tags index page | None |
| Weighted tag cloud | Yes -- "tag text should be centered inside clouds" |
| URL-aware pagination | None |
| AI badge in header | Yes -- 4+ sessions to get exact styling right |

9 changes, 6 needed correction. Visual/aesthetic work is the agent's weakest area.


3. Failure Mode Taxonomy

Seven distinct failure patterns identified from user message analysis:

| # | Failure Mode | Count | Examples |
|---|---|---|---|
| 1 | Collateral damage | 3 | "revert mdx changes not related to comments datetime"; "Exclude openspec from eslint-prettier-setup and revert" |
| 2 | Over-engineering | 4 | Pagination fix -> user said "just point to 1st page?"; favicon -> user suggested Gravatar after 4 sessions |
| 3 | Partial fix | 3 | "remove also outer blockquote"; "Still does not work - see screenshot" |
| 4 | Data accuracy | 3 | Goodreads links to book not review; wrong book title language; HTML entities not decoded |
| 5 | Visual judgment | 6 | Header styling iterations; link ordering; tag centering; CRT scanline visibility |
| 6 | Self-verification | 3 | "There are some lint errors. Fix them"; CSS broken on deploy; dark mode not tested |
| 7 | Session multiplication | 4 | Favicon (4 sessions); worktree creation (7+ attempts); AI badge (multiple sessions) |

3.1 Collateral Damage

The agent modifies files outside the requested scope, most often during search-and-replace or linting operations. The fix is always a revert, which wastes a round-trip.
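
One mitigation is structural rather than behavioural: scope the tools themselves so out-of-scope files cannot be touched. A sketch using ESLint's flat-config `ignores` (directory names are illustrative, not the repository's actual config):

```typescript
// eslint.config.ts -- illustrative scoping sketch; directory names are assumptions.
// An explicit top-level `ignores` entry keeps the linter (and any agent that
// runs it) out of spec and build directories entirely.
import tseslint from "typescript-eslint";

export default tseslint.config(
  {
    ignores: ["openspec/**", ".specify/**", "specs/**", "out/**", ".next/**"],
  },
  ...tseslint.configs.recommended,
);
```

The same principle applies to Prettier (`.prettierignore`) and to bulk search-and-replace: an explicit allowlist of directories is cheaper than a revert.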

3.2 Over-engineering

The agent builds a complex, "complete" solution when a simple one exists. This is arguably the signature agent failure mode -- humans naturally reach for the simplest solution; agents reach for the most architecturally thorough one.

The favicon saga is the canonical example: four sessions of complex favicon generation approaches before the user said "Can't you just use Gravatar links like before?"

3.3 Partial Fix

The agent addresses the visible symptom but misses the root cause. Often requires a second round where the user points out the remaining issue. Common in CSS/styling fixes where multiple DOM elements contribute to the visual problem.

3.4 Data Accuracy

The agent gets URLs, titles, or factual content wrong. This is particularly dangerous in content migration because errors propagate to published content and may not be caught by automated checks.
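
The "HTML entities not decoded" case from the failure table shows how mechanically simple such guards are once someone thinks of them. A minimal decoding sketch (illustrative only; a real migration script would use a complete entity table, e.g. via the `he` package):

```typescript
// Decode entities that commonly survive a WordPress XML export.
// The table is deliberately partial -- a real script needs a full one.
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#8217;": "\u2019", // right single quotation mark
  "&#8211;": "\u2013", // en dash
};

export function decodeEntities(text: string): string {
  // Unknown entities pass through unchanged rather than being guessed at.
  return text.replace(/&(?:#\d+|[a-zA-Z]+);/g, (m) => ENTITIES[m] ?? m);
}
```

A check like this belongs in the migration pipeline itself, so that errors never reach published MDX.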

3.5 Visual Judgment

The agent cannot evaluate whether something "looks right." Every aesthetic decision requires human review, and often 2-3 iterations. Sessions involving visual work average 2-3x more messages than functional work.

3.6 Self-verification Gap

The agent presents work without checking it first. Lint errors, CSS broken on the deployed site, and an untested dark-mode fix all reached the user before the agent noticed. A lint requirement added to AGENTS.md mid-project reduced but did not eliminate the pattern.

3.7 Session Multiplication

Some tasks require repeated restarts because the agent gets stuck or the environment (worktree, git) enters an unrecoverable state. Tool/environment integration is the weakest link in the agent workflow.


4. Correction Rate by Category

| Category | Total | Clean | Corrected | Clean Rate |
|---|---|---|---|---|
| Content migration | 7 | 5 | 2 | 71% |
| Small features | 13 | 11 | 2 | 85% |
| Bug fixes | 5 | 1 | 4 | 20% |
| Infrastructure | 4 | 2 | 2 | 50% |
| Visual/design | 9 | 3 | 6 | 33% |
| Total | 38 | 22 | 16 | 58% |

5. Effectiveness Spectrum

EXCELLENT ████████████████████████  Bulk content transforms, mechanical refactoring
VERY GOOD ██████████████████████    Drop-in integrations, well-scoped features
GOOD      ████████████████          Framework upgrades, CI setup, search/replace
MIXED     ██████████████            Bug diagnosis (good) -> fix (sometimes partial)
WEAK      ████████████              Config scoping, deploy verification
POOR      ██████████                Visual design, aesthetic judgment
POOR      ████████                  Tool/environment issues (worktree, git state)

6. Key Insights

6.1 Exploration Dominance

56% of sessions produced no code. The agent's role as a thinking partner -- exploring ideas, writing specs, planning changes -- was its most-used function. This is underappreciated: spec-driven development with an agent may deliver more value through structured thinking than through code output.

6.2 Small Feature Reliability

The 85% clean rate for well-scoped features is remarkable. For clearly specified, self-contained additions (drop-in widget, script injection, new UI component), the agent is nearly as reliable as a senior developer. The key predictor of success is specification clarity, not task complexity.

6.3 Bug Fix Paradox

Bug fixes have the worst clean rate (20%), despite debugging being a perceived AI strength. The issue is not diagnosis -- the agent consistently identified root causes correctly. The problem is in the fix: agents tend to over-engineer solutions or address symptoms rather than causes. Human course-correction was needed for 4 out of 5 bug fix changes.

6.4 Visual Iteration Cost

Visual/styling work required 2-3x more messages per change. The "Chateau visual style" session had 82 messages -- the most of any productive session. Aesthetic judgment cannot be delegated. The most efficient pattern was the user providing a screenshot and iterating on specifics, rather than describing the desired look in words.

6.5 Self-verification Gap

The agent did not reliably verify its own output before presenting it. Three times the user had to ask the agent to run lint or check deployed results. This was partially addressed by adding a lint requirement to AGENTS.md mid-project. A broader lesson: agents need explicit verification gates in their workflow, not just generation capabilities.
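
A verification gate can be as small as a script that refuses to report success until every check passes. A sketch (the command names are hypothetical; the project's actual npm scripts may differ):

```typescript
// verify.ts -- illustrative verification gate. The agent (or a wrapper around
// it) runs this before presenting work; any non-zero exit blocks the report.
import { spawnSync } from "node:child_process";

const CHECKS: Array<[string, string[]]> = [
  ["npm", ["run", "lint"]],  // hypothetical lint script
  ["npm", ["run", "build"]], // hypothetical build script
];

export function runGate(run: typeof spawnSync = spawnSync): boolean {
  for (const [cmd, args] of CHECKS) {
    const { status } = run(cmd, args, { stdio: "inherit" });
    if (status !== 0) {
      console.error(`verification failed: ${cmd} ${args.join(" ")}`);
      return false;
    }
  }
  return true;
}
```

The AGENTS.md lint requirement approximated this; a script-enforced gate is stricter because it cannot be forgotten.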

6.6 Over-engineering as Signature Failure

When the agent fails, it almost never fails by doing too little. It fails by doing too much -- building elaborate solutions when simple ones exist. The human's most common correction was simplification, not addition. This inverts the common assumption that AI coding assistants are "lazy" or produce minimal solutions.


7. Raw Data

7.1 Top 10 Sessions by Message Count

| Title | Messages | Adds | Dels | Files |
|---|---|---|---|---|
| Next.js static site from repo (no DB) | 346 | 1,673 | 0 | 8 |
| Add syntax highlighting for code blocks | 285 | 173 | 0 | 2 |
| Clarification workflow for spec verification | 260 | 1,099 | 1 | 7 |
| Strikethrough formatting not working | 192 | 1,978 | 758 | 526 |
| OpenSpec implementation workflow | 158 | 9,789 | 5,665 | 290 |
| Fix comments showing undefined NaN | 140 | 909 | 0 | 8 |
| Fix gist embeds in WordPress migration | 133 | 264 | 0 | 3 |
| OpenSpec implementation workflow | 131 | 11,457 | 2,095 | 110 |
| Implement OpenSpec change tasks | 106 | 830 | 68 | 38 |
| OpenSpec task generation | 104 | 4,269 | 1,050 | 537 |

7.2 Session Productivity Split

  • 101 sessions (44%) -- produced code changes
  • 129 sessions (56%) -- exploration, planning, spec writing only
  • Average messages per productive session: ~40
  • Average messages per exploration session: ~20

7.3 All 29 Archived OpenSpec Changes (Chronological)

  1. add-gravatar-avatar (2026-03-03)
  2. fix-old-blog-urls (2026-03-04)
  3. remove-tag-mapping (2026-03-04)
  4. tags-index-page (2026-03-04)
  5. weighted-tag-cloud (2026-03-04)
  6. chateau-visual-style (2026-03-05)
  7. eslint-prettier-setup (2026-03-05)
  8. fix-quote-post-nested-blockquotes (2026-03-05)
  9. longer-post-summaries (2026-03-05)
  10. related-posts (2026-03-05)
  11. rich-post-excerpts (2026-03-05)
  12. url-aware-pagination (2026-03-05)
  13. book-cover-images (2026-03-06)
  14. librarything-reviews-migration (2026-03-06)
  15. linkedin-content-migration (2026-03-06)
  16. show-star-rating (2026-03-06)
  17. upgrade-nextjs-v16 (2026-03-06)
  18. add-giscus-comments (2026-03-07)
  19. add-goodreads-external-links (2026-03-07)
  20. add-review-cover-images (2026-03-07)
  21. add-social-links (2026-03-07)
  22. add-social-share-buttons (2026-03-07)
  23. goodreads-reading-list (2026-03-07)
  24. search-link-in-topbar (2026-03-07)
  25. search-post-titles (2026-03-07)
  26. changelog (2026-03-08)
  27. cv-update (2026-03-08)
  28. personal-visual-style (2026-03-08)
  29. visual-style-improvements (2026-03-08)

7.4 All 6 Spec-kit Specs (Chronological)

  1. 001-wordpress-blog-migration (Feb 26 -- Mar 1) -- 11 artifacts, 55 tasks (53 checked)
  2. 002-gh-pages-deploy (Feb 28 -- Mar 1) -- 7 artifacts, 8 tasks (7 checked)
  3. 003-fix-gist-embeds (Mar 1 -- Mar 2) -- 6 artifacts, 10 tasks (0 checked)
  4. 004-fix-gfm-strikethrough (Mar 1 -- Mar 2) -- 6 artifacts, 10 tasks (0 checked)
  5. 005-syntax-highlighting (Mar 1 -- Mar 2) -- 8 artifacts, 30 tasks (0 checked)
  6. 006-fix-comments-undefined-nan (Mar 2) -- 7 artifacts, 13 tasks (0 checked)

8. Spec-kit Era (Feb 26 -- Mar 2)

Before the project adopted OpenSpec, the first 5 days used the spec-kit workflow -- a heavier specification system driven by .specify/ templates and /speckit.* slash commands.

8.1 Overview

Spec-kit produced a deep artifact pipeline for every change:

spec.md -> clarifications.md -> plan.md -> research.md -> data-model.md
-> contracts/ -> quickstart.md -> tasks.md -> checklists/

Each spec could generate up to 9 distinct artifact types plus subdirectories for contracts and checklists. A total of 45 files were produced across 6 specs in 5 days.

8.2 Spec Inventory

| # | Spec | Files | Tasks | Checked | Artifact types |
|---|---|---|---|---|---|
| 001 | WordPress blog migration | 11 | 55 | 53 | spec, plan, research, data-model, quickstart, tasks, contracts/2, checklists, migration-audit, data/XML |
| 002 | GitHub Pages deploy | 7 | 8 | 7 | spec, plan, research, data-model, quickstart, tasks, checklists |
| 003 | Fix gist embeds | 6 | 10 | 0 | spec, plan, research, data-model, tasks, checklists |
| 004 | Fix GFM strikethrough | 6 | 10 | 0 | spec, plan, research, quickstart, tasks, checklists |
| 005 | Syntax highlighting | 8 | 30 | 0 | spec, plan, research, data-model, quickstart, tasks, contracts/1, checklists |
| 006 | Fix comments undefined NaN | 7 | 13 | 0 | spec, plan, research, data-model, quickstart, tasks, checklists |

8.3 Observations

Task tracking broke after spec 002. Specs 001 and 002 had diligent task tracking -- 53/55 and 7/8 tasks checked off respectively. Specs 003--006 had zero tasks checked despite all work being completed (the features shipped). The ceremony of updating checkboxes in tasks.md was abandoned once the overhead exceeded its value.

Artifact volume did not track problem complexity. Spec 005 (syntax highlighting) generated 30 tasks across 7 phases including a contracts/ directory -- for what ultimately required installing rehype-pretty-code and configuring Shiki. Spec 004 (GFM strikethrough) produced 6 artifacts including a full plan.md and research.md -- for a fix that amounted to adding remark-gfm to the MDX pipeline.
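
For scale: in a typical Next.js + MDX setup both fixes reduce to a few lines of pipeline configuration. A sketch assuming `@next/mdx` (plugin options and theme are illustrative, not the repository's actual config):

```typescript
// next.config.ts -- illustrative sketch of the two one-line-scale fixes.
import createMDX from "@next/mdx";
import remarkGfm from "remark-gfm";
import rehypePrettyCode from "rehype-pretty-code";

const withMDX = createMDX({
  options: {
    remarkPlugins: [remarkGfm], // GFM strikethrough (the substance of spec 004)
    rehypePlugins: [[rehypePrettyCode, { theme: "github-dark" }]], // Shiki highlighting (spec 005)
  },
});

export default withMDX({
  pageExtensions: ["ts", "tsx", "md", "mdx"],
});
```

Set against this, 30 tasks and a contracts/ directory for spec 005 illustrates the mismatch between artifact weight and code weight.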

Early specs were higher fidelity. Spec 001 (WordPress migration) was the most elaborate: 11 files including a migration-audit.md, the raw WordPress XML export, and two contract documents defining component and route interfaces. This level of specification was justified -- the migration was genuinely complex, touching 63 posts across multiple content types. The same level of specification for a 1-line fix (spec 004) was not.

The agent generated the specifications enthusiastically. Spec-kit used /speckit.clarify, /speckit.plan, /speckit.research and similar commands. The agent never pushed back on generating artifacts -- it produced full research.md files and data-model.md documents even when the problem was trivially understood. This is a form of over-engineering (failure mode #2) applied to the workflow itself rather than to code.


9. Workflow Comparison: Spec-kit vs OpenSpec

The migration from spec-kit to OpenSpec (commit fb2c279, Mar 2) was the most significant process change in the project.

9.1 Side-by-side

| Dimension | Spec-kit (days 1--5) | OpenSpec (days 5--12) |
|---|---|---|
| Changes completed | 6 | 29 |
| Calendar days | 5 | 7 |
| Velocity (changes/day) | 1.2 | 4.1 |
| Artifacts per change | ~7.5 (45 files / 6 specs) | ~3 (proposal + design + tasks) |
| Total artifact files | 45 | ~87 |
| Artifact pipeline | spec -> clarify -> plan -> research -> data-model -> contracts -> quickstart -> tasks -> checklists | proposal -> design -> tasks |
| Task tracking fidelity | Abandoned after spec 002 | Consistently maintained |
| Max tasks in a single spec | 55 (001) | ~8--12 typical |
| Lightest change possible | 6 files minimum | 3 files |

9.2 Why Spec-kit Was Abandoned

Three factors drove the migration:

1. Over-specification overhead. Generating 6--11 artifact files for every change -- including bug fixes that needed one line of code -- created a spec-to-code ratio that was unsustainable. The agent was spending more sessions writing specifications than writing code.

2. Tracking fidelity collapsed. The checkbox-based task tracking in tasks.md was only maintained for the first two specs. Once the user and agent stopped updating checklists, the tracking artifacts became dead weight -- files that existed but conveyed no accurate status information.

3. Artifact types didn't match the work. data-model.md was generated for bug fixes that had no data model. contracts/ directories were created for features that had no API surface. The spec-kit pipeline was designed for greenfield service development, not for iterative blog feature work.

9.3 What OpenSpec Changed

OpenSpec addressed all three issues:

  • Fewer artifacts: 3 per change instead of ~8, with each one serving a distinct purpose (what -> how -> do).
  • Lighter tasks: Task lists averaged 8--12 items instead of 30--55, making them practical to track.
  • Flexible depth: Simple changes got thin proposals; complex ones got detailed designs. The artifact weight scaled with problem complexity instead of being fixed.

9.4 Impact on Agent Effectiveness

The velocity improvement (1.2 -> 4.1 changes/day) is striking but partially misleading -- OpenSpec changes were on average smaller than spec-kit specs. A more meaningful comparison: spec-kit's 6 specs represent roughly the same functional scope as OpenSpec's first 10 changes (the migration, deploy, and initial bug fixes).

The real impact was on feedback loop speed. Under spec-kit, the agent spent multiple sessions generating artifacts before implementation began. Under OpenSpec, the proposal-to-implementation cycle was typically completed within a single session. Faster feedback loops meant corrections were caught earlier and cost less to fix.

9.5 Meta-lesson: The Agent Over-engineered Its Own Process

The spec-kit workflow itself was an instance of the agent's signature failure mode -- over-engineering (Section 3.2). When asked to help design a specification workflow, the agent produced the most thorough, artifact-heavy system it could conceive. It took the human 5 days to recognise that the specification overhead exceeded its value and migrate to something lighter.

This mirrors the code-level pattern exactly: the agent builds the most architecturally complete solution, and the human's primary role is simplification. The difference is that a workflow affects every subsequent change, so the cost of over-engineering compounds. Migrating from spec-kit to OpenSpec was, in effect, the same corrective action as the user saying "Maybe just point to the 1st page?" -- but applied to the development process instead of a single feature.