@gurupanguji

HTML And Embed Cleanup Design

Goal

Address issue #104 with an automation-first cleanup pass that normalizes dominant safe legacy markup shapes in _posts/ while skipping risky historical oddities for later manual review.

The first pass should handle:

multi-block safe wp:quote conversions
safe HTML blockquote-to-markdown conversions where the shape is mechanical
YouTube wp:embed wrappers
Twitter/X wp:embed wrappers through an oEmbed helper

The first pass should skip:

nested or image-bearing quote blocks
malformed quote structures
non-Twitter, non-YouTube embed providers
any Twitter/X embed URL whose helper fetch fails

Problem

The repository already landed a narrow wp:quote normalization pass for issue #100, but the archive still contains broader historical markup debt.

Current audit snapshot on March 27, 2026:

117 posts still contain wp:quote
142 posts still contain legacy wp:embed or wp:core-embed/*
280 posts still contain some form of raw <blockquote>
the current quote normalizer's skips are caused mostly by multiple quote blocks in one post, followed by unsupported inner markup
embed volume is dominated by YouTube and Twitter/X, followed by a long provider tail that is not worth automating in v1

This makes issue #104 a cleanup and debt-reduction task, not a single-feature parser fix.

Design

1. Keep quotes and embeds as separate normalization tools

Quote normalization and embed normalization should stay in separate scripts or separate modules with separate tests.

Reason:

quote blocks and embeds fail in different ways
YouTube conversion is deterministic, while Twitter/X conversion depends on live oEmbed fetches
keeping the tools separate makes the diffs easier to reason about and the failure reports easier to trust

One wrapper command can come later if it proves useful. It should not be the starting point.

2. Expand quote normalization from one safe block to many safe blocks

The current quote normalizer is intentionally narrow, and that scope is now too narrow for the remaining backlog. It should convert any number of independent safe wp:quote blocks within the same post when each block is individually safe.

Safe quote block rules for v1:

plain text paragraphs inside the quote
optional <br> within paragraph content, rendered as line breaks inside markdown quote lines
optional simple cite link that can become Source: [Title](url)
no nested wp:quote
no image, figure, or other rich inner markup

The script should still skip, not guess, when it sees:

nested quote blocks
empty paragraphs
malformed markers
image-bearing quote blocks
cite shapes that do not map cleanly to markdown
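
A minimal Python sketch of the safe-block rules above, assuming the usual serialized wrapper shape (a <!-- wp:quote --> comment pair around a single <blockquote> whose paragraphs are <p> elements). The function names, the skip-reason strings, and the choice to skip the whole post when any one block is unsafe are illustrative assumptions, not the behavior of the existing scripts/normalize_wp_quotes.py. Each <br> becomes a separate quote line here; whether a hard-break marker is also needed is left open.

```python
import re

# Assumed wrapper shape; optional JSON attributes after the block name are tolerated.
QUOTE_BLOCK = re.compile(r"<!-- wp:quote(?: \{[^}]*\})? -->(.*?)<!-- /wp:quote -->", re.DOTALL)
BLOCKQUOTE = re.compile(r"\s*<blockquote[^>]*>(.*)</blockquote>\s*", re.DOTALL)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)
CITE = re.compile(r'<cite><a href="([^"]+)">([^<]+)</a></cite>')
BR = re.compile(r"<br\s*/?>")


def quote_block_to_markdown(inner):
    """Return (markdown, None) for a safe block, or (None, skip_reason)."""
    if "<!-- wp:" in inner:
        return None, "nested block"
    match = BLOCKQUOTE.fullmatch(inner)
    if match is None:
        return None, "malformed markers"
    body = match.group(1)
    cite = CITE.search(body)
    if cite:
        body = body[: cite.start()] + body[cite.end():]
    # Anything left over once the paragraphs are removed is unsupported markup.
    if PARAGRAPH.sub("", body).strip():
        return None, "unsupported inner markup"
    paragraphs = PARAGRAPH.findall(body)
    if not paragraphs:
        return None, "empty quote"
    lines = []
    for paragraph in paragraphs:
        text = BR.sub("\n", paragraph).strip()
        if not text:
            return None, "empty paragraph"
        if "<" in text:  # images, figures, or any other inline tag
            return None, "unsupported inner markup"
        if lines:
            lines.append(">")  # blank quote line between paragraphs
        lines.extend(f"> {part.strip()}" for part in text.split("\n") if part.strip())
    markdown = "\n".join(lines)
    if cite:
        markdown += f"\n\nSource: [{cite.group(2)}]({cite.group(1)})"
    return markdown, None


def normalize_post(text):
    """Convert every wp:quote block, or return (None, reasons) to skip the post."""
    reasons = []

    def replace(match):
        markdown, reason = quote_block_to_markdown(match.group(1))
        if reason:
            reasons.append(reason)
            return match.group(0)
        return markdown

    converted = QUOTE_BLOCK.sub(replace, text)
    # A leftover marker means an unmatched open or close comment.
    if not reasons and re.search(r"<!-- /?wp:quote", converted):
        reasons.append("malformed markers")
    if reasons:
        return None, reasons
    return converted, []
```

The strict "any remaining tag means skip" check is deliberately conservative: it errs toward skipping posts rather than guessing at inline formatting.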

3. Add a second quote cleanup path for simple raw HTML blockquotes

Issue #104 is not limited to wp:quote. Some posts contain raw HTML blockquotes that still map cleanly to markdown.

The first pass should support only mechanical shapes such as:

<blockquote>text</blockquote>
<blockquote><p>text</p><p>text 2</p></blockquote>
<blockquote>text <cite><a href="url">title</a></cite></blockquote>

Converted output shape:

> text
>
> text 2

Source: [title](url)

This converter should skip:

blockquotes with nested blockquotes
blockquotes containing figures, images, iframes, scripts, or other rich HTML
cite structures that do not map cleanly to Source:
old hand-authored HTML where the parser cannot distinguish authored formatting from legacy wrapper noise

This path should be mechanical only. It should not rewrite authored commentary or try to modernize every old HTML post.
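
One way to keep this path mechanical is a strict tag allowlist. The sketch below uses Python's standard html.parser and assumes the caller passes exactly one <blockquote> fragment; the class name, function name, and skip-reason strings are hypothetical. Links outside a cite are treated as a skip here, which is a conservative assumption rather than a stated requirement.

```python
from html.parser import HTMLParser


class BlockquoteParser(HTMLParser):
    """Collects paragraphs and one cite link; keeps the first reason to skip."""

    ALLOWED = {"blockquote", "p", "br", "cite", "a"}

    def __init__(self):
        super().__init__()
        self.paragraphs = [""]
        self.depth = 0
        self.in_cite = False
        self.saw_cite = False
        self.href = None
        self.title = ""
        self.skip_reason = None

    def reject(self, reason):
        if self.skip_reason is None:
            self.skip_reason = reason

    def handle_starttag(self, tag, attrs):
        if tag not in self.ALLOWED:
            self.reject(f"unsupported tag <{tag}>")
        elif tag == "blockquote":
            self.depth += 1
            if self.depth > 1:
                self.reject("nested blockquote")
        elif tag == "p":
            self.paragraphs.append("")
        elif tag == "br":
            self.paragraphs[-1] += "\n"
        elif tag == "cite":
            self.in_cite = self.saw_cite = True
        elif self.in_cite:  # the only remaining allowed tag is <a>
            self.href = dict(attrs).get("href")
        else:
            self.reject("link outside cite")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.depth -= 1
        elif tag == "cite":
            self.in_cite = False

    def handle_data(self, data):
        if self.in_cite:
            self.title += data
        else:
            self.paragraphs[-1] += data


def blockquote_to_markdown(fragment):
    """Return (markdown, None) for a safe shape, or (None, skip_reason)."""
    parser = BlockquoteParser()
    parser.feed(fragment)
    parser.close()
    if parser.skip_reason:
        return None, parser.skip_reason
    title = parser.title.strip()
    if parser.saw_cite and not (parser.href and title):
        return None, "cite does not map cleanly to Source:"
    lines = []
    for paragraph in parser.paragraphs:
        parts = [part.strip() for part in paragraph.split("\n") if part.strip()]
        if not parts:
            continue  # whitespace between tags, not authored text
        if lines:
            lines.append(">")
        lines.extend(f"> {part}" for part in parts)
    if not lines:
        return None, "empty blockquote"
    markdown = "\n".join(lines)
    if parser.saw_cite:
        markdown += f"\n\nSource: [{title}]({parser.href})"
    return markdown, None
```

Because every tag outside the allowlist produces a skip, figures, images, iframes, and scripts never reach the conversion step.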

4. Add a dedicated embed normalizer for only YouTube and Twitter/X

The embed cleanup should support only the dominant safe providers first.

YouTube

Convert WordPress embed wrappers that contain a YouTube URL into native embed HTML using the canonical youtube.com/embed/<id> shape.

Responsibilities:

extract the video ID from supported YouTube URL forms
preserve query parameters only when they are meaningful in the embed URL (for example, a start time) and drop the rest
output one stable iframe shape used consistently across the archive
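
The responsibilities above might be sketched as follows. The supported URL forms, the 11-character ID check (an observed convention, not a guarantee), the mapping of a numeric t parameter to start, and the iframe attributes are all assumptions for illustration; the real stable iframe shape is a decision for the implementation.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 URL-safe characters.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def youtube_video_id(url):
    """Return the video ID for a supported URL form, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    candidate = None
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in {"youtube.com", "m.youtube.com"}:
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [None])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]
    if candidate and VIDEO_ID.fullmatch(candidate):
        return candidate
    return None


def youtube_iframe(url):
    """Render one fixed iframe shape, or None when the URL is unsupported."""
    video_id = youtube_video_id(url)
    if video_id is None:
        return None
    query = parse_qs(urlparse(url).query)
    # Only a plain number of seconds is carried over; forms like 1m30s are dropped.
    start = query.get("t", query.get("start", [""]))[0].rstrip("s")
    src = f"https://www.youtube.com/embed/{video_id}"
    if start.isdigit():
        src += f"?start={start}"
    return (
        f'<iframe width="560" height="315" src="{src}" '
        'title="YouTube video player" allowfullscreen></iframe>'
    )
```

Returning None for anything unrecognized lets the caller log a skip instead of emitting a guessed embed.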

Twitter/X

Convert WordPress Twitter/X wrappers by:

extracting the tweet URL from the wrapper
calling X oEmbed with that URL during the migration step
taking the returned html field and writing it into the post

This should be treated as a migration-time helper, not a runtime site dependency.

Important constraints:

the helper is network-dependent
the returned HTML is provider-owned and may vary over time
failures must skip cleanly without partial rewrites

If the helper fails for a given URL, the post should be skipped and logged.
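
A hedged sketch of the helper boundary, assuming the public oEmbed endpoint at publish.twitter.com/oembed (worth re-checking against current X documentation before relying on it). The function names and the injectable fetch parameter are hypothetical; the point is that the network call sits behind one argument that tests can replace, and that any failure returns the post text untouched.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify before running the migration.
OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def fetch_oembed(tweet_url, timeout=10):
    """Live network call: returns the decoded oEmbed JSON payload."""
    query = urllib.parse.urlencode({"url": tweet_url})
    with urllib.request.urlopen(f"{OEMBED_ENDPOINT}?{query}", timeout=timeout) as response:
        return json.load(response)


def convert_twitter_embed(post_text, wrapper, tweet_url, fetch=fetch_oembed):
    """Return (new_text, None) on success, or (post_text, reason) unchanged."""
    try:
        payload = fetch(tweet_url)
    except (OSError, ValueError) as error:  # network errors, timeouts, bad JSON
        return post_text, f"oembed fetch failed: {error}"
    html = payload.get("html") if isinstance(payload, dict) else None
    if not html or not html.strip():
        return post_text, "oembed response had no html"
    # The wrapper is replaced only after the fetch has fully succeeded.
    return post_text.replace(wrapper, html.strip()), None
```

In tests, a stub such as fetch=lambda url: {"html": "..."} stands in for the live call, so the suite does not depend on the provider being reachable.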

5. Skip the long tail of providers in v1

For non-Twitter, non-YouTube wp:embed wrappers, the first pass should skip and report them.

Examples from the current archive include:

site previews and wp-embed providers
Mastodon and ActivityPub variants
TikTok
provider-specific custom wrappers

Reason:

the long tail adds complexity fast
provider semantics vary
choosing between plain links and richer embeds is a product decision, not purely mechanical markup normalization

This cleanup pass should reduce the biggest buckets first without inventing output policy for every provider.
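
The provider split could be sketched as a small classifier over the wrapper's opening comment. This assumes the wrapper carries its target in a JSON url attribute (as in <!-- wp:embed {"url":"..."} --> or <!-- wp:core-embed/twitter {"url":"..."} -->); the host table and the function name are illustrative assumptions.

```python
import json
import re
from urllib.parse import urlparse

# Assumed opening-comment shape for both wp:embed and wp:core-embed/* blocks.
EMBED_OPEN = re.compile(r"<!-- wp:(?:embed|core-embed/[\w-]+) (\{.*?\}) -->")

# Only the two supported providers are listed; every other host is skipped.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "m.youtube.com": "youtube",
    "youtu.be": "youtube",
    "twitter.com": "twitter",
    "mobile.twitter.com": "twitter",
    "x.com": "twitter",
}


def classify_embed(opening_comment):
    """Return (provider, url, None) when supported, else (None, None, skip_reason)."""
    match = EMBED_OPEN.search(opening_comment)
    if match is None:
        return None, None, "malformed embed marker"
    try:
        url = json.loads(match.group(1)).get("url", "")
    except ValueError:
        return None, None, "unreadable block attributes"
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    provider = SUPPORTED_HOSTS.get(host)
    if provider is None:
        return None, None, f"unsupported provider: {host or 'unknown'}"
    return provider, url, None
```

Keeping the supported set as an explicit table means a skipped provider always shows up in the report by host name, which makes the long tail easy to size later.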

6. Use dry-run-first reporting as the review surface

Both cleanup tools should default to dry-run mode and emit:

would-convert files
skipped files
skip reasons
summary counts by category

That report is the main safety valve before any archive rewrite. Its output should be stable enough that reports taken before and after an implementation change are easy to compare.
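
A minimal sketch of the report and the dry-run default, using only the standard library. The Report class, its line format, and the --write flag name are assumptions; sorting every section keeps the output deterministic so two runs can be diffed.

```python
import argparse
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Report:
    would_convert: list = field(default_factory=list)
    skipped: dict = field(default_factory=dict)  # path -> skip reason

    def record(self, path, skip_reason=None):
        if skip_reason is None:
            self.would_convert.append(path)
        else:
            self.skipped[path] = skip_reason

    def render(self):
        lines = [f"would-convert {path}" for path in sorted(self.would_convert)]
        lines += [f"skipped {path}: {reason}" for path, reason in sorted(self.skipped.items())]
        lines.append(
            f"summary: {len(self.would_convert)} would convert, {len(self.skipped)} skipped"
        )
        counts = Counter(self.skipped.values())
        lines += [f"  {reason}: {count}" for reason, count in sorted(counts.items())]
        return "\n".join(lines)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Normalize legacy markup in _posts/")
    # Dry-run is the default; files are edited only when --write is passed.
    parser.add_argument("--write", action="store_true", help="edit files in place")
    return parser.parse_args(argv)
```

For example, recording one convertible post and two posts skipped for the same reason yields three per-file lines, a summary line, and one per-reason count.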

7. Keep the rewrite mechanical and non-editorial

This issue is still a markup normalization task.

That means:

no prose rewrites
no taxonomy changes
no style cleanup beyond the minimum spacing needed for valid readable markdown or embed HTML
no opportunistic modernization of unrelated WordPress blocks

The whole point is to reduce parser debt without erasing authored signal.

File Changes

New

docs/superpowers/specs/2026-03-27-html-and-embed-cleanup-design.md
a new embed normalization script
new test coverage for embed normalization

Modify

scripts/normalize_wp_quotes.py
quote normalizer tests
safe candidate posts in _posts/

Verification

Automated

Add test coverage for:

multiple safe quote blocks in one post
raw HTML blockquote conversion for safe shapes
skip behavior for nested or image-bearing quotes
YouTube URL extraction and iframe rendering
Twitter/X URL extraction and helper-boundary behavior
dry-run versus write-mode reporting
skip reporting for unsupported embed providers

Manual

After dry-run:

inspect representative quote-only, quote-plus-cite, YouTube, and Twitter/X candidates
inspect at least one skipped provider-tail post to confirm the skip is intentional

After write:

inspect rendered output in a few archive posts that mix quote blocks, embeds, and commentary

Repository Checks

Run repository validators after the cleanup batch so the rewrite does not break post validation or HTML checks.

Risks

Twitter/X output is provider-owned

The oEmbed helper is useful, but it means the exact returned HTML can change over time. That is acceptable for migration-time conversion, but the helper boundary must be explicit and easy to mock in tests.

Historical HTML varies more than it first appears

Raw <blockquote> content in older posts can blur the line between authored HTML and legacy wrapper debris. The parser should skip whenever that distinction is not clean.

Large archive rewrites can hide bad assumptions

Because the first write pass may touch many posts, the tooling should make classification visible before editing files in place.

Acceptance Criteria

Quote normalization supports multiple independent safe wp:quote blocks in a single post.
Safe raw HTML blockquote shapes can be converted mechanically to markdown blockquotes.
A dedicated embed normalizer converts supported YouTube wrappers to native iframe embeds.
Twitter/X wrappers can be converted through an oEmbed helper that writes provider-returned embed HTML into the post.
Non-Twitter, non-YouTube embed providers are skipped with explicit reasons.
Both tools default to dry-run and require explicit write mode to edit files.
Tests cover both successful conversions and skip paths.
Repository validation passes after the rewritten batch.