@gurupanguji

HTML And Embed Cleanup Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Reduce the markup debt tracked in issue #104 by expanding safe quote normalization and adding a dedicated safe embed normalizer for YouTube and Twitter/X, while explicitly skipping risky historical markup.

Architecture: Keep quote normalization and embed normalization as separate tools. Extend the quote normalizer to convert multiple safe quote blocks and simple raw HTML blockquotes. Add a new embed normalizer that handles YouTube locally and Twitter/X through a migration-time oEmbed helper. Drive both tools with tests first, use dry-run reporting before any rewrite, then apply only the safe batch and verify the results.

Tech Stack: Python 3, unittest, regex/string parsing, Jekyll markdown content, live HTTP fetch for X oEmbed during write-mode migration, existing repository validators


Task 1: Capture The Real Backlog In Tests And Fixtures

Files:

Create at least one fixture or inline test case with two or more independent safe quote blocks in one post and assert they all convert cleanly.

Cover at least:

Cover:

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py tests/test_normalize_wp_embeds.py

Expected: FAIL for the newly added behaviors before implementation.

Task 2: Expand Quote Normalization Safely

Files:

Replace the current whole-post skip for multiple blocks with per-block conversion when each block is individually safe.

Implement a second parser path for safe raw HTML blockquotes that map mechanically to markdown quote lines plus optional Source: output.
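As a rough illustration of that second parser path, the sketch below converts one raw HTML blockquote into markdown quote lines and returns None when the block is not mechanically safe. The definition of "safe" used here (only plain `<p>` paragraphs plus an optional `<cite>`), the exact shape of the `Source:` line, and the names `SIMPLE_BLOCKQUOTE` and `blockquote_to_markdown` are all assumptions made for the example, not the repository's actual rules.

```python
import html
import re
from typing import Optional

# Assumed "safe" shape: a bare <blockquote> holding only plain <p> paragraphs
# and an optional trailing <cite>. Attributes or nested tags fail the match.
SIMPLE_BLOCKQUOTE = re.compile(
    r"<blockquote>\s*((?:<p>[^<]*</p>\s*)+)(?:<cite>([^<]*)</cite>\s*)?</blockquote>",
    re.IGNORECASE,
)
PARAGRAPH = re.compile(r"<p>([^<]*)</p>", re.IGNORECASE)


def blockquote_to_markdown(block: str) -> Optional[str]:
    """Return markdown for a simple blockquote, or None to signal a skip."""
    match = SIMPLE_BLOCKQUOTE.fullmatch(block.strip())
    if match is None:
        return None  # not mechanically convertible: caller leaves it untouched
    paragraphs = [html.unescape(p).strip() for p in PARAGRAPH.findall(match.group(1))]
    # Separate quoted paragraphs with a bare ">" line so they stay in one quote.
    markdown = "\n>\n".join(f"> {p}" for p in paragraphs)
    cite = match.group(2)
    if cite and cite.strip():
        # Hypothetical output shape for the optional source line.
        markdown += f"\n\nSource: {html.unescape(cite).strip()}"
    return markdown
```

Returning None rather than raising keeps the skip decision with the caller, which fits the per-block conversion described above.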

Preserve hard skips for:

Make sure the report still lists which files would convert and which are skipped, and that it reflects the new finer-grained quote support.

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py

Expected: PASS

Task 3: Build The Embed Normalizer

Files:

Recognize:

Support the archive’s common YouTube URL forms and render one stable iframe shape.
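One possible way to handle the YouTube side is sketched below: extract the video ID from a few URL forms and render a single iframe shape. The URL forms covered (watch, youtu.be, embed, and /v/ links), the 11-character ID check, the iframe attributes, and the names `youtube_video_id` and `youtube_iframe` are assumptions for illustration. The archive's real forms should come from the audit.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 characters from this alphabet.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")
YOUTUBE_HOSTS = {"youtube.com", "m.youtube.com", "youtube-nocookie.com"}


def youtube_video_id(url: str) -> Optional[str]:
    """Return the video ID for a recognized YouTube URL, else None."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith(("/embed/", "/v/")):
            candidate = parsed.path.split("/")[2]
    if candidate and VIDEO_ID.fullmatch(candidate):
        return candidate
    return None


def youtube_iframe(video_id: str) -> str:
    """Render one fixed iframe shape (hypothetical attributes)."""
    return (
        f'<iframe src="https://www.youtube.com/embed/{video_id}" '
        'title="YouTube video" loading="lazy" allowfullscreen></iframe>'
    )
```

Because every recognized form collapses to the same ID, rerunning the normalizer on already converted posts should produce identical output.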

Add a small helper that:

Keep this boundary easy to mock in tests.
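A minimal sketch of such a boundary, assuming the helper calls the public oEmbed endpoint at publish.twitter.com and reads the `html` field of the JSON response. The names `fetch_tweet_html` and `http_get_text` are hypothetical. The network call is passed in as a parameter so tests can replace it without patching.

```python
import json
from typing import Callable, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def http_get_text(url: str) -> str:
    """Default fetcher: one live HTTP GET, used only in write-mode migration."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def fetch_tweet_html(
    tweet_url: str, fetch: Callable[[str], str] = http_get_text
) -> Optional[str]:
    """Return the oEmbed HTML for a tweet URL, or None if it cannot be fetched."""
    query = urlencode({"url": tweet_url})
    try:
        payload = json.loads(fetch(f"{OEMBED_ENDPOINT}?{query}"))
    except (OSError, ValueError):
        # Network errors (URLError is an OSError) and bad JSON both mean "skip".
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("html") or None
```

In a test, passing `fetch=lambda url: '{"html": "<blockquote>...</blockquote>"}'` exercises the parsing path with no network access.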

For every non-Twitter, non-YouTube provider, report the provider name and skip without rewriting the post.
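The report-and-skip rule could be driven by a small classifier like the sketch below. The host table and the name `classify_embed` are assumptions. For unsupported providers it falls back to the bare hostname as the reported provider name.

```python
from typing import Tuple
from urllib.parse import urlparse

# Hypothetical host table; extend only when a provider gets real support.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "twitter.com": "twitter",
    "x.com": "twitter",
}


def classify_embed(url: str) -> Tuple[str, bool]:
    """Return (provider name, supported?) for an embed URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in SUPPORTED_HOSTS:
        return SUPPORTED_HOSTS[host], True
    # Unsupported: report the host so the dry run can group the leftovers.
    return host or "unknown", False
```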

Run:

python3 -m unittest tests/test_normalize_wp_embeds.py

Expected: PASS

Task 4: Add CLI Behavior And Safe Write Semantics

Files:

Assert in tests that no file changes occur without explicit write mode.
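A test along these lines might look like the sketch below. `normalize_file` is a hypothetical stand-in for the real entry point, defined inline so the example runs on its own. The real test would import the script's function instead.

```python
import tempfile
import unittest
from pathlib import Path


def normalize_file(path: Path, write: bool = False) -> bool:
    """Hypothetical stand-in: returns True when the file would change."""
    original = path.read_text(encoding="utf-8")
    updated = original.replace("<blockquote><p>hi</p></blockquote>", "> hi")
    if write and updated != original:
        path.write_text(updated, encoding="utf-8")
    return updated != original


class DryRunTest(unittest.TestCase):
    def test_dry_run_leaves_file_untouched(self):
        with tempfile.TemporaryDirectory() as tmp:
            post = Path(tmp) / "post.md"
            post.write_text("<blockquote><p>hi</p></blockquote>\n", encoding="utf-8")
            before = post.read_bytes()
            self.assertTrue(normalize_file(post))        # a change is reported
            self.assertEqual(post.read_bytes(), before)  # but nothing is written

    def test_write_mode_rewrites_file(self):
        with tempfile.TemporaryDirectory() as tmp:
            post = Path(tmp) / "post.md"
            post.write_text("<blockquote><p>hi</p></blockquote>\n", encoding="utf-8")
            self.assertTrue(normalize_file(post, write=True))
            self.assertEqual(post.read_text(encoding="utf-8"), "> hi\n")
```

Comparing raw bytes rather than decoded text also catches accidental newline or encoding changes.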

Allow one file or a short file list to be processed for local spot checks.
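One way to wire this up with argparse is sketched below. The `--write` flag matches the commands used later in this plan. The optional positional `paths` argument and the function names are assumptions about how the file list could be passed.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Normalize legacy embeds in posts.")
    # Assumption: with no paths given, the tool scans the whole archive.
    parser.add_argument("paths", nargs="*", help="optional post files for spot checks")
    # Dry run is the default; files change only when --write is passed.
    parser.add_argument("--write", action="store_true", help="rewrite files in place")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    return build_parser().parse_args(argv)
```

Making `--write` a `store_true` flag means a bare invocation can never modify files.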

If a post contains a mix of supported and unsupported embed blocks, decide at the post level whether partial conversion is acceptable. Recommendation: allow it for independent safe blocks, but only if skipped blocks remain untouched and the reporting states explicitly what was skipped.
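If partial conversion is allowed, the per-block loop could look like this sketch. It assumes blocks have already been split out of the post, and that a converter returns None for anything it will not touch. `convert_blocks` and the report shape are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Report = List[Tuple[int, str]]


def convert_blocks(
    blocks: List[str], convert: Callable[[str], Optional[str]]
) -> Tuple[List[str], Report]:
    """Convert each safe block; pass skipped blocks through unchanged."""
    output: List[str] = []
    report: Report = []
    for index, block in enumerate(blocks):
        converted = convert(block)
        if converted is None:
            output.append(block)  # skipped blocks stay exactly as they were
            report.append((index, "skipped"))
        else:
            output.append(converted)
            report.append((index, "converted"))
    return output, report
```

Each block gets its own report entry, so a post with mixed results appears in both the converted and skipped counts.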

Cover:

Task 5: Audit The Archive Before Any Real Rewrite

Files:

Run:

python3 scripts/normalize_wp_quotes.py
python3 scripts/normalize_wp_embeds.py

Expected:

Check that the backlog falls into the intended v1 buckets:

Open at least:

Task 6: Apply The Safe Rewrite Batch

Files:

Run:

python3 scripts/normalize_wp_quotes.py --write

Run:

python3 scripts/normalize_wp_embeds.py --write

Expected:

Run:

git diff --stat
git diff -- _posts

Verify that the content changes stay mechanical and match the planned output shapes.

Task 7: Verify Repository Compatibility

Files:

Run:

python3 -m unittest tests/test_normalize_wp_quotes.py tests/test_normalize_wp_embeds.py

Expected: PASS

Run:

python3 scripts/validate_posts.py --today "$(date +%F)"

Expected: PASS

Expected: PASS

Serve the site locally with bundle exec jekyll serve and inspect a few converted archive posts to confirm:

Task 8: Prepare The Review Surface

Files:

Call out:

If the rewrite batch is too noisy, split the work into:

After this pass, the next issue should target only one leftover class at a time instead of reopening “clean up HTML” in the abstract.