@gurupanguji

HTML And Embed Cleanup Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Reduce issue #104 markup debt by expanding safe quote normalization and adding a dedicated safe embed normalizer for YouTube and Twitter/X, while explicitly skipping risky historical markup.

Architecture: Keep quote normalization and embed normalization as separate tools. Extend the quote normalizer to convert multiple safe quote blocks and simple raw HTML blockquotes. Add a new embed normalizer that handles YouTube locally and Twitter/X through a migration-time oEmbed helper. Drive both tools with tests first, use dry-run reporting before any rewrite, then apply only the safe batch and verify the results.

Tech Stack: Python 3, unittest, regex/string parsing, Jekyll markdown content, live HTTP fetch for X oEmbed during write-mode migration, existing repository validators

Task 1: Capture The Real Backlog In Tests And Fixtures

Files:

Modify: tests/test_normalize_wp_quotes.py
Create: tests/test_normalize_wp_embeds.py
Optional create: focused text fixtures under tests/fixtures/

Step 1: Add fixture coverage for multiple safe wp:quote blocks

Create at least one fixture or inline test case with two or more independent safe quote blocks in one post and assert they all convert cleanly.

Step 2: Add fixture coverage for simple raw HTML blockquotes

Cover at least:

<blockquote>plain text</blockquote>
multi-paragraph <blockquote><p>...</p><p>...</p></blockquote>
cite-bearing <blockquote>...<cite><a ...>...</a></cite></blockquote>

Step 3: Add fixture coverage for safe and skipped embed shapes

Cover:

YouTube wrapper
Twitter/X wrapper
unsupported provider wrapper
helper failure path for Twitter/X

Step 4: Run the focused test files and confirm the new cases fail first

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py tests/test_normalize_wp_embeds.py

Expected: FAIL for the newly added behaviors before implementation.

Task 2: Expand Quote Normalization Safely

Files:

Modify: scripts/normalize_wp_quotes.py
Modify: tests/test_normalize_wp_quotes.py

Step 1: Teach the normalizer to convert multiple independent safe quote blocks

Replace the current whole-post skip for multiple blocks with per-block conversion when each block is individually safe.
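A minimal sketch of per-block conversion, under stated assumptions: the marker shape (`<!-- wp:quote -->` ... `<!-- /wp:quote -->`) is the usual serialized WordPress quote block and may not match every post in the archive, and the names `convert_block` and `normalize_quotes` are hypothetical, not the existing script's API. The point is only that each match is judged on its own, so one unsafe block no longer skips the whole post.

```python
import re

# Assumed marker shape for a serialized WordPress quote block.
QUOTE_BLOCK = re.compile(r"<!-- wp:quote.*?-->(.*?)<!-- /wp:quote -->", re.DOTALL)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def convert_block(inner):
    """Return markdown quote lines, or None when the block is not safe."""
    if "<img" in inner or "<figure" in inner or inner.count("<blockquote") != 1:
        return None
    paragraphs = [p.strip() for p in PARAGRAPH.findall(inner)]
    # Deliberately conservative: any leftover tag inside a paragraph is a skip.
    if not paragraphs or any(not p or "<" in p for p in paragraphs):
        return None
    return "\n>\n".join("> " + p for p in paragraphs)


def normalize_quotes(text):
    """Convert each safe block on its own; leave unsafe blocks untouched."""
    converted = skipped = 0

    def replace(match):
        nonlocal converted, skipped
        markdown = convert_block(match.group(1))
        if markdown is None:
            skipped += 1
            return match.group(0)  # unsafe block stays byte-for-byte identical
        converted += 1
        return markdown

    return QUOTE_BLOCK.sub(replace, text), converted, skipped
```

Returning the counts alongside the text keeps the dry-run report and the write path on the same code path.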

Step 2: Add support for simple raw HTML blockquote cleanup

Implement a second parser path for safe raw HTML blockquotes that map mechanically to markdown quote lines plus optional Source: output.
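One way the raw HTML path could look, as a sketch only. The function name `blockquote_to_markdown`, the exact cite shape it accepts, and the `Source: [text](url)` rendering are assumptions for illustration; the plan only says the output is markdown quote lines plus an optional Source: line.

```python
import re

BLOCKQUOTE = re.compile(r"<blockquote>(.*?)</blockquote>", re.DOTALL)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)
# Assumed cite shape: a single link wrapped in <cite>.
CITE = re.compile(r'<cite>\s*<a href="([^"]+)">([^<]+)</a>\s*</cite>')


def blockquote_to_markdown(html):
    """Return markdown for one simple <blockquote>, or None to skip it."""
    match = BLOCKQUOTE.fullmatch(html.strip())
    if match is None:
        return None
    body = match.group(1)
    source = None
    cite = CITE.search(body)
    if cite:
        source = "Source: [{}]({})".format(cite.group(2).strip(), cite.group(1))
        body = body[:cite.start()] + body[cite.end():]
    found = PARAGRAPH.findall(body)
    if found and PARAGRAPH.sub("", body).strip():
        return None  # stray text outside <p> would be silently lost
    paragraphs = [" ".join(p.split()) for p in (found or [body])]
    # Any leftover tag (nested quote, image, odd cite) means "not simple".
    if any(not p or "<" in p for p in paragraphs):
        return None
    quoted = "\n>\n".join("> " + p for p in paragraphs)
    return quoted + ("\n\n" + source if source else "")
```

Anything the sketch cannot map mechanically returns None, so the caller can report a skip instead of guessing.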

Step 3: Keep strict skip rules for risky quote content

Preserve hard skips for:

nested quotes
image or figure content
malformed markers
empty paragraphs
unsupported cite shapes
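The skip list above could be expressed as a single classifier that returns a reason string, which then feeds the dry-run report. This is a sketch: the function name `quote_skip_reason`, the reason labels, and the notion of a "supported" cite (plain text or a single link) are assumptions, not the existing script's behavior.

```python
import re


def quote_skip_reason(inner):
    """Return a short reason when a quote body is risky, else None."""
    if inner.count("<blockquote") != inner.count("</blockquote>"):
        return "malformed markers"
    if inner.count("<blockquote") > 1:
        return "nested quote"
    if re.search(r"<(img|figure)\b", inner):
        return "image or figure content"
    if re.search(r"<p>\s*</p>", inner):
        return "empty paragraph"
    cites = re.findall(r"<cite>(.*?)</cite>", inner, re.DOTALL)
    # Assumed supported cite shapes: plain text, or exactly one link.
    supported = re.compile(r'\s*(<a href="[^"]+">[^<]+</a>|[^<]+)\s*')
    if len(cites) > 1 or any(not supported.fullmatch(c) for c in cites):
        return "unsupported cite shape"
    return None
```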
Step 4: Preserve dry-run reporting with clear skip reasons

Make sure the reporting still shows which files would convert and which files are skipped, now with the finer-grained quote support.
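A small sketch of how the dry-run report might be assembled from per-file results. The `(path, status, reason)` tuple shape, the status labels, and the output wording are hypothetical; the existing script's report format should win if it differs.

```python
from collections import Counter


def summarize(results):
    """Build a report from a list of (path, status, reason) tuples."""
    would_convert = [path for path, status, _ in results if status == "convert"]
    skips = Counter(reason for _, status, reason in results if status == "skip")
    lines = ["would convert: {}".format(len(would_convert))]
    lines.append("skipped: {}".format(sum(skips.values())))
    # Sorted reasons keep the report stable between runs.
    for reason, count in sorted(skips.items()):
        lines.append("  {}: {}".format(reason, count))
    return "\n".join(lines)
```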

Step 5: Run the quote normalizer tests

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py

Expected: PASS

Task 3: Build The Embed Normalizer

Files:

Create: scripts/normalize_wp_embeds.py
Modify: tests/test_normalize_wp_embeds.py (created in Task 1)

Step 1: Add a parser for WordPress embed wrappers

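Recognize:

the opening embed comment marker (<!-- wp:embed ... -->)
the closing embed comment marker (<!-- /wp:embed -->)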
wrapper body URL extraction
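A possible shape for the wrapper parser, as a sketch. It assumes the wrappers are serialized as `<!-- wp:embed ... -->` and `<!-- /wp:embed -->` comment pairs with the URL sitting in the body; older or provider-specific block names would need their own pattern. `find_embeds` is a hypothetical name.

```python
import re

# Assumed wrapper shape; verify against real posts before relying on it.
EMBED_BLOCK = re.compile(
    r"<!-- wp:embed\b.*?-->(?P<body>.*?)<!-- /wp:embed -->", re.DOTALL
)
URL = re.compile(r"https?://[^\s<\"']+")


def find_embeds(text):
    """Yield (start, end, url); url is None unless the body holds exactly one URL."""
    for match in EMBED_BLOCK.finditer(text):
        urls = URL.findall(match.group("body"))
        yield match.start(), match.end(), urls[0] if len(urls) == 1 else None
```

Yielding offsets rather than replaced text lets the caller decide per block whether to convert or skip.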
Step 2: Implement deterministic YouTube conversion

Support the archive’s common YouTube URL forms and render one stable iframe shape.
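A sketch of the deterministic YouTube path. The three URL forms handled here (watch, youtu.be, embed) are a guess at what "common" means for this archive, the 11-character ID check is a widely observed convention rather than a guarantee, and the iframe attributes are placeholders for whatever stable shape the spec settles on.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from this alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")
YOUTUBE_HOSTS = ("youtube.com", "m.youtube.com")


def youtube_video_id(url):
    """Extract the video ID from a supported URL form, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        candidate = parsed.path.strip("/")
    elif host in YOUTUBE_HOSTS and parsed.path == "/watch":
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    elif host in YOUTUBE_HOSTS and parsed.path.startswith("/embed/"):
        candidate = parsed.path[len("/embed/"):].strip("/")
    else:
        return None
    return candidate if VIDEO_ID.fullmatch(candidate) else None


def youtube_iframe(video_id):
    """Render one fixed iframe shape (attributes are illustrative)."""
    return (
        '<iframe width="560" height="315" '
        'src="https://www.youtube.com/embed/{}" '
        'title="YouTube video player" frameborder="0" allowfullscreen></iframe>'
    ).format(video_id)
```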

Step 3: Implement the Twitter/X helper boundary

Add a small helper that:

accepts a tweet URL
calls X oEmbed
returns embed HTML or a structured failure

Keep this boundary easy to mock in tests.
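The helper boundary might look like the sketch below. The endpoint is the public `https://publish.twitter.com/oembed` URL; everything else (the names `http_get` and `fetch_tweet_embed`, the dict-shaped result, the injected `fetch` parameter) is an assumption chosen to keep the network call easy to replace in tests.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def http_get(url, timeout=10):
    """Default fetcher; tests pass a fake instead of hitting the network."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_tweet_embed(tweet_url, fetch=http_get):
    """Return {"ok": True, "html": ...} or {"ok": False, "reason": ...}."""
    query = urlencode({"url": tweet_url})
    try:
        payload = json.loads(fetch(OEMBED_ENDPOINT + "?" + query))
    except (OSError, ValueError) as error:
        # URLError/HTTPError are OSError subclasses; bad JSON is a ValueError.
        return {"ok": False, "reason": "{}: {}".format(type(error).__name__, error)}
    html = payload.get("html") if isinstance(payload, dict) else None
    if not html:
        return {"ok": False, "reason": "response had no html field"}
    return {"ok": True, "html": html.strip()}
```

Because failures come back as data rather than exceptions, the caller can record a skip reason and leave the post untouched.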

Step 4: Skip unsupported providers explicitly

For every non-Twitter, non-YouTube provider, report the provider name and skip without rewriting the post.
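A sketch of provider classification by hostname. The host table and the choice to report the bare hostname as the "provider name" are assumptions; the real tool may prefer the provider slug from the wrapper attributes when one is present.

```python
from urllib.parse import urlparse

# Hypothetical host table; extend only when a provider gains real support.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "twitter.com": "twitter",
    "x.com": "twitter",
}


def classify_embed(url):
    """Return (provider, action) where action is "convert" or "skip"."""
    host = (urlparse(url).hostname or "").lower()
    for prefix in ("www.", "m.", "mobile."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    provider = SUPPORTED_HOSTS.get(host)
    if provider:
        return provider, "convert"
    # Unsupported providers are named in the report and never rewritten.
    return host or "unknown", "skip"
```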

Step 5: Run the embed normalizer tests

Run:

python3 -m unittest tests/test_normalize_wp_embeds.py

Expected: PASS

Task 4: Add CLI Behavior And Safe Write Semantics

Files:

Modify: scripts/normalize_wp_quotes.py
Modify: scripts/normalize_wp_embeds.py
Modify: related tests

Step 1: Keep dry-run as the default mode for both tools

Assert in tests that no file changes occur without explicit write mode.

Step 2: Add targeted-path support for debugging

Allow one file or a short file list to be processed for local spot checks.
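Steps 1 and 2 could share one argument parser, sketched here. The `--write` flag matches the commands used later in this plan; the positional `paths` argument and the help text are assumptions about how targeted runs might be exposed.

```python
import argparse


def build_parser():
    """Dry-run is the default: nothing is rewritten unless --write is given."""
    parser = argparse.ArgumentParser(
        description="Normalize WordPress markup in posts."
    )
    parser.add_argument(
        "paths",
        nargs="*",
        help="optional files to process (hypothetical); default is every post",
    )
    parser.add_argument(
        "--write",
        action="store_true",
        help="rewrite files in place; without it the tool only reports",
    )
    return parser
```

With this shape, `python3 scripts/normalize_wp_embeds.py _posts/example.md` would be a dry-run spot check of a single file (the path is a placeholder).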

Step 3: Make write mode atomic enough to avoid half-rewrites

If a post contains a mix of supported and unsupported embed blocks, decide at the post level whether partial conversion is acceptable. My recommendation is yes for independent safe blocks, but only if skipped blocks remain untouched and reporting is explicit.
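At the file level, one common way to avoid a half-written post is to write the full new text to a temporary file in the same directory and then swap it in with `os.replace`. The sketch below assumes that approach; `write_atomically` is a hypothetical helper, not part of the existing scripts.

```python
import os
import tempfile


def write_atomically(path, new_text):
    """Replace path with new_text in one step, or leave it untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory keeps the final rename on one filesystem.
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
            handle.write(new_text)
        os.replace(temp_path, path)
    except BaseException:
        # Clean up the temp file so a failed run leaves no debris.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
```

Note that `mkstemp` creates the file with restrictive permissions, so the replaced post may end up with a different mode than before; copying the original mode first is worth considering if that matters for the repository.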

Step 4: Add write-mode tests

Cover:

quote rewrite in place
YouTube embed rewrite in place
Twitter/X rewrite in place with mocked helper output
helper failure leaving the file untouched

Task 5: Audit The Archive Before Any Real Rewrite

Files:

Verify: _posts/*.md
Verify: normalization scripts

Step 1: Run both tools in dry-run mode across the repo

Run:

python3 scripts/normalize_wp_quotes.py
python3 scripts/normalize_wp_embeds.py

Expected:

clear would-convert counts
explicit skip counts and reasons
no file mutations

Step 2: Review the output buckets

Check that the backlog falls into the intended v1 buckets:

safe multi-block quote conversions
safe raw HTML blockquote conversions
YouTube conversions
Twitter/X conversions
skipped long-tail providers

Step 3: Spot-check representative candidates manually

Open at least:

one post with multiple safe quote blocks
one raw HTML blockquote candidate
one YouTube embed candidate
one Twitter/X embed candidate
one long-tail provider skip

Task 6: Apply The Safe Rewrite Batch

Files:

Modify: safe candidate posts in _posts/
Verify: both normalization scripts

Step 1: Run quote normalization with write mode

Run:

python3 scripts/normalize_wp_quotes.py --write

Step 2: Run embed normalization with write mode

Run:

python3 scripts/normalize_wp_embeds.py --write

Expected:

safe candidates are rewritten
helper failures and provider-tail posts remain untouched
reporting clearly separates converted and skipped files

Step 3: Inspect the resulting diff

Run:

git diff --stat
git diff -- _posts

Verify that the content changes stay mechanical and match the planned output shapes.

Task 7: Verify Repository Compatibility

Files:

Verify: tests/
Verify: scripts/validate_posts.py
Verify: HTML validation tooling

Step 1: Run focused unit tests

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py tests/test_normalize_wp_embeds.py

Expected: PASS

Step 2: Run existing repository validators

Run:

python3 scripts/validate_posts.py --today "$(date +%F)"

Expected: PASS

Step 3: Run any existing HTML markdown leak check used in CI

Expected: PASS

Step 4: Do manual spot checks in local preview

Inspect a few converted archive posts in a local preview (bundle exec jekyll serve) to confirm:

markdown quotes render correctly
YouTube embeds render correctly
Twitter/X embeds render correctly
skipped content remains untouched

Task 8: Prepare The Review Surface

Files:

Verify: plan/spec docs
Verify: normalization scripts
Verify: converted posts

Step 1: Summarize skip buckets still left for later

Call out:

image-bearing quotes
nested quotes
malformed quote structures
non-Twitter, non-YouTube embed providers
Twitter/X helper failures, if any

Step 2: Keep the PR reviewable

If the rewrite batch is too noisy, split the work into:

tooling and tests
archive content rewrite

Step 3: Document the next manual bucket

After this pass, the next issue should target only one leftover class at a time instead of reopening “clean up HTML” in the abstract.