HTML And Embed Cleanup Design
Goal
Address issue #104 with an automation-first cleanup pass that normalizes the dominant, safe legacy markup shapes in _posts/ and leaves risky historical oddities for later manual review.
The first pass should handle:
- multi-block safe wp:quote conversions
- safe HTML blockquote-to-markdown conversions where the shape is mechanical
- YouTube wp:embed wrappers
- Twitter/X wp:embed wrappers through an oEmbed helper
The first pass should skip:
- nested or image-bearing quote blocks
- malformed quote structures
- non-Twitter, non-YouTube embed providers
- any Twitter/X embed URL whose helper fetch fails
Problem
The repository already landed a narrow wp:quote normalization pass for issue #100, but the archive still contains broader historical markup debt.
Current audit snapshot on March 27, 2026:
- 117 posts still contain wp:quote
- 142 posts still contain legacy wp:embed or wp:core-embed/*
- 280 posts still contain some form of raw <blockquote>
- the current quote normalizer's skips are caused mostly by multiple quote blocks in one post, followed by unsupported inner markup
- embed volume is dominated by YouTube and Twitter/X, followed by a long provider tail that is not worth automating in v1
This makes issue #104 a cleanup and debt-reduction task, not a single-feature parser fix.
Design
Quote normalization and embed normalization should stay in separate scripts or separate modules with separate tests.
Reason:
- quote blocks and embeds fail in different ways
- YouTube conversion is deterministic, while Twitter/X conversion depends on live oEmbed fetches
- keeping the tools separate makes the diffs easier to reason about and the failure reports easier to trust
One wrapper command can come later if it proves useful. It should not be the starting point.
2. Expand quote normalization from one safe block to many safe blocks
The current quote normalizer is narrow by design, and that is now too narrow for the remaining backlog. It should convert any number of independent safe wp:quote blocks within the same post when each block is individually safe.
Safe quote block rules for v1:
- plain text paragraphs inside the quote
- optional <br> within paragraph content, rendered as line breaks inside markdown quote lines
- optional simple cite link that can become Source: [Title](url)
- no nested wp:quote
- no image, figure, or other rich inner markup
The script should still skip, not guess, when it sees:
- nested quote blocks
- empty paragraphs
- malformed markers
- image-bearing quote blocks
- cite shapes that do not map cleanly to markdown
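The safe-block rules above can be sketched as a small classifier and converter. This is a minimal sketch, not the existing normalizer: the function names, the skip-reason strings, the exact wrapper shape (a single blockquote with plain <p> children and an optional <cite><a> link), the two-space hard break used for <br>, and the choice to skip the whole post when any block is unsafe are all assumptions made for illustration.

```python
import re

# Assumed wrapper shape: <!-- wp:quote [json] --> ... <!-- /wp:quote -->
QUOTE_BLOCK = re.compile(
    r"<!-- wp:quote(?: \{.*?\})? -->\s*(.*?)\s*<!-- /wp:quote -->", re.S
)
BLOCKQUOTE = re.compile(r"<blockquote[^>]*>(.*)</blockquote>", re.S)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.S)
CITE = re.compile(r'<cite><a href="([^"]+)">([^<]+)</a></cite>')
BR = re.compile(r"<br\s*/?>")
ANY_TAG = re.compile(r"<[^>]+>")


def convert_quote_inner(inner):
    """Return (markdown, None) for a safe block, or (None, skip_reason)."""
    if "wp:quote" in inner or inner.count("<blockquote") > 1:
        return None, "nested-quote"
    match = BLOCKQUOTE.fullmatch(inner.strip())
    if not match:
        return None, "malformed-markers"
    body = match.group(1)
    cite = CITE.search(body)
    if cite:
        body = body[:cite.start()] + body[cite.end():]
    elif "<cite" in body:
        return None, "unsupported-cite"
    paragraphs = PARAGRAPH.findall(body)
    # Anything left outside <p>...</p> (figures, images, lists) is not safe.
    if PARAGRAPH.sub("", body).strip() or not paragraphs:
        return None, "unsupported-inner-markup"
    quoted = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in BR.split(paragraph)]
        if not any(lines):
            return None, "empty-paragraph"
        if any(ANY_TAG.search(line) for line in lines):
            return None, "unsupported-inner-markup"
        # Two trailing spaces mark a hard line break (assumed rendering of <br>).
        quoted.append("> " + "  \n> ".join(lines))
    markdown = "\n>\n".join(quoted)
    if cite:
        markdown += f"\n\nSource: [{cite.group(2)}]({cite.group(1)})"
    return markdown, None


def normalize_post(text):
    """Convert every wp:quote block, or leave the post untouched if any is unsafe."""
    reasons = []

    def replace(match):
        markdown, reason = convert_quote_inner(match.group(1))
        if reason:
            reasons.append(reason)
            return match.group(0)
        return markdown

    converted = QUOTE_BLOCK.sub(replace, text)
    return (text, reasons) if reasons else (converted, [])
```

Returning the reasons alongside the untouched text keeps the skip path explicit: the caller can log why a post was left alone without ever seeing a partially rewritten body.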
3. Add a second quote cleanup path for simple raw HTML blockquotes
Issue #104 is not limited to wp:quote. Some posts contain raw HTML blockquotes that still map cleanly to markdown.
The first pass should support only mechanical shapes such as:
<blockquote>text</blockquote>
<blockquote><p>text</p><p>text 2</p></blockquote>
<blockquote>text <cite><a href="url">title</a></cite></blockquote>
Converted output shape:
> text
>
> text 2
Source: [title](url)
This converter should skip:
- blockquotes with nested blockquotes
- blockquotes containing figures, images, iframes, scripts, or other rich HTML
- cite structures that do not map cleanly to Source:
- old hand-authored HTML where the parser cannot distinguish authored formatting from legacy wrapper noise
This path should be mechanical only. It should not rewrite authored commentary or try to modernize every old HTML post.
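A minimal sketch of the mechanical path, assuming only the three shapes listed above are accepted and everything else returns None as a skip signal. The function name, the regexes, and the blank line placed before the Source: line are illustrative assumptions rather than the project's actual converter.

```python
import re

# Assumed cite shape: a trailing <cite><a href="...">title</a></cite>.
CITE_LINK = re.compile(r'\s*<cite><a href="([^"]+)">([^<]+)</a></cite>\s*$')
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.S)


def blockquote_to_markdown(inner):
    """Convert the inner HTML of a raw <blockquote>, or return None to skip."""
    source = None
    cite = CITE_LINK.search(inner)
    if cite:
        source = f"Source: [{cite.group(2)}]({cite.group(1)})"
        inner = inner[:cite.start()]
    if "<p>" in inner:
        paragraphs = PARAGRAPH.findall(inner)
        # Content outside the paragraphs means the shape is not mechanical.
        if PARAGRAPH.sub("", inner).strip():
            return None
    else:
        paragraphs = [inner]
    # Collapse internal whitespace so each paragraph becomes one quote line.
    paragraphs = [" ".join(p.split()) for p in paragraphs]
    # Any remaining tag (nested quote, image, unsupported cite) forces a skip.
    if not all(paragraphs) or any("<" in p for p in paragraphs):
        return None
    markdown = "\n>\n".join(f"> {p}" for p in paragraphs)
    if source:
        markdown += "\n\n" + source
    return markdown
```

Treating any leftover `<` as a skip is deliberately blunt: it rejects inline emphasis and links too, which errs on the side of manual review for hand-authored HTML.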
4. Add a dedicated embed normalizer for only YouTube and Twitter/X
The embed cleanup should support only the dominant safe providers first.
YouTube
Convert WordPress embed wrappers that contain a YouTube URL into native embed HTML using the canonical youtube.com/embed/<id> shape.
Responsibilities:
- extract the video ID from supported YouTube URL forms
- preserve useful query parameters only when they belong in the embed URL
- output one stable iframe shape used consistently across the archive
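The three responsibilities above can be sketched as follows. The set of accepted URL forms, the decision to keep only a numeric start offset, and the iframe attributes are assumptions chosen for illustration; the real archive-wide iframe shape is a project decision this sketch does not settle.

```python
import re
from urllib.parse import parse_qs, urlparse

VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")


def extract_video_id(url):
    """Return the video ID for a supported YouTube URL form, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in ("youtube.com", "m.youtube.com", "youtube-nocookie.com"):
        if parsed.path == "/watch":
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]
        else:
            return None
    else:
        return None
    return candidate if VIDEO_ID.fullmatch(candidate) else None


def start_seconds(url):
    """Keep only a plain numeric start offset (t=42, t=42s, start=42)."""
    query = parse_qs(urlparse(url).query)
    raw = (query.get("start") or query.get("t") or [""])[0]
    match = re.fullmatch(r"([0-9]+)s?", raw)
    # Forms such as t=1m30s are dropped here (an assumption of this sketch).
    return int(match.group(1)) if match else None


def render_iframe(video_id, start=None):
    """Render one stable iframe shape; the attributes here are assumed."""
    src = f"https://www.youtube.com/embed/{video_id}"
    if start:
        src += f"?start={start}"
    return (
        f'<iframe width="560" height="315" src="{src}" '
        'title="YouTube video player" frameborder="0" allowfullscreen></iframe>'
    )
```

A wrapper whose URL yields no ID would be reported as a skip rather than rewritten, which keeps this path deterministic.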
Twitter/X
Convert WordPress Twitter/X wrappers by:
- extracting the tweet URL from the wrapper
- calling the X oEmbed endpoint with that URL during the migration step
- taking the returned html field and writing it into the post
This should be treated as a migration-time helper, not a runtime site dependency.
Important constraints:
- the helper is network-dependent
- the returned HTML is provider-owned and may vary over time
- failures must skip cleanly without partial rewrites
If the helper fails for a given URL, the post should be skipped and logged.
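One way to keep the helper boundary explicit is to pass the fetch function in as a parameter, so tests can substitute a fake and no rewrite happens unless every fetch in the post succeeds. The sketch below assumes the publish.twitter.com/oembed endpoint, a simplified wrapper regex, and hypothetical function names; it is not the project's helper.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"  # assumed endpoint

# Simplified wrapper and URL shapes, assumed for this sketch.
WRAPPER = re.compile(
    r"<!-- wp:(?:core-embed/twitter|embed)\b.*?-->"
    r".*?<!-- /wp:(?:core-embed/twitter|embed) -->",
    re.S,
)
TWEET_URL = re.compile(r"https://(?:twitter|x)\.com/\w+/status/\d+")


def fetch_oembed_html(tweet_url, timeout=10):
    """Network-dependent helper: return the provider's html field."""
    query = urlencode({"url": tweet_url})
    with urlopen(f"{OEMBED_ENDPOINT}?{query}", timeout=timeout) as response:
        return json.load(response)["html"]


def convert_twitter_embeds(text, fetch=fetch_oembed_html):
    """Return (new_text, None), or (original_text, reason) if any fetch fails."""
    replacements = []
    for wrapper in WRAPPER.finditer(text):
        url = TWEET_URL.search(wrapper.group(0))
        if not url:
            continue  # not a Twitter/X wrapper; left for other tooling
        try:
            html = fetch(url.group(0))
        except Exception as error:  # any helper failure skips the whole post
            return text, f"oembed-failed: {url.group(0)} ({error})"
        replacements.append((wrapper.span(), html.strip()))
    # Apply from the end so earlier spans stay valid; nothing was written
    # before this point, so a failure above never leaves a partial rewrite.
    for (start, end), html in reversed(replacements):
        text = text[:start] + html + text[end:]
    return text, None
```

Collecting all replacements before applying any of them is what gives the all-or-nothing behavior per post; the injected fetch parameter is the seam a test would mock.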
5. Skip the long tail of providers in v1
For non-Twitter, non-YouTube wp:embed wrappers, the first pass should skip and report them.
Examples from the current archive include:
- site previews and wp-embed providers
- Mastodon and ActivityPub variants
- TikTok
- provider-specific custom wrappers
Reason:
- the long tail adds complexity fast
- provider semantics vary
- plain links or richer embeds are product choices, not purely mechanical markup normalization
This cleanup pass should reduce the biggest buckets first without inventing output policy for every provider.
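Skip-and-report for the provider tail can be as small as a hostname allowlist. The following sketch assumes a hypothetical classify_embed_url function and reason format; the host table contains only the two providers this design supports.

```python
from urllib.parse import urlparse

# Only the two supported providers; everything else is reported, not guessed.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "twitter.com": "twitter",
    "x.com": "twitter",
}


def classify_embed_url(url):
    """Return (provider, None) if supported, else (None, skip_reason)."""
    host = (urlparse(url).hostname or "").lower()
    for prefix in ("www.", "m.", "mobile."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    provider = SUPPORTED_HOSTS.get(host)
    if provider:
        return provider, None
    return None, f"unsupported-provider: {host or 'unknown'}"
```

Including the host in the skip reason lets the dry-run report show how large each tail provider actually is, which is useful input if a later pass decides to support one of them.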
6. Use dry-run-first reporting as the review surface
Both cleanup tools should default to dry-run mode and emit:
- would-convert files
- skipped files
- skip reasons
- summary counts by category
That report is the main safety valve before any archive rewrite. It should be easy to compare before and after implementation.
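A minimal sketch of the report surface, assuming a hypothetical Report class and a --write flag as the only way out of dry-run mode. The field names and output layout are illustrative, not an existing interface.

```python
import argparse
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Report:
    would_convert: list = field(default_factory=list)
    skipped: dict = field(default_factory=dict)  # path -> skip reason

    def record(self, path, reason=None):
        """Record a file as convertible, or as skipped with a reason."""
        if reason is None:
            self.would_convert.append(path)
        else:
            self.skipped[path] = reason

    def summary(self):
        """Counts by category: one entry per skip reason plus would-convert."""
        counts = Counter(self.skipped.values())
        counts["would-convert"] = len(self.would_convert)
        return dict(counts)

    def render(self):
        lines = [f"would-convert  {path}" for path in sorted(self.would_convert)]
        lines += [
            f"skipped        {path}  ({reason})"
            for path, reason in sorted(self.skipped.items())
        ]
        lines += [f"{name}: {count}" for name, count in sorted(self.summary().items())]
        return "\n".join(lines)


def build_parser():
    """Dry-run is the default; files are edited only when --write is passed."""
    parser = argparse.ArgumentParser(description="Legacy markup cleanup (sketch)")
    parser.add_argument("--write", action="store_true", help="edit files in place")
    return parser
```

Because the rendered output is sorted, two dry-run reports taken before and after an implementation change can be compared with a plain text diff.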
7. Keep the rewrite mechanical and non-editorial
This issue is still a markup normalization task.
That means:
- no prose rewrites
- no taxonomy changes
- no style cleanup beyond the minimum spacing needed for valid readable markdown or embed HTML
- no opportunistic modernization of unrelated WordPress blocks
The whole point is to reduce parser debt without erasing authored signal.
File Changes
New
Modify
Verification
Automated
Add test coverage for:
- multiple safe quote blocks in one post
- raw HTML blockquote conversion for safe shapes
- skip behavior for nested or image-bearing quotes
- YouTube URL extraction and iframe rendering
- Twitter/X URL extraction and helper-boundary behavior
- dry-run versus write-mode reporting
- skip reporting for unsupported embed providers
Manual
After dry-run:
- inspect representative quote-only, quote-plus-cite, YouTube, and Twitter/X candidates
- inspect at least one skipped provider-tail post to confirm the skip is intentional
After write:
- inspect rendered output in a few archive posts that mix quote blocks, embeds, and commentary
Repository Checks
Run repository validators after the cleanup batch so the rewrite does not break post validation or HTML checks.
Risks
The oEmbed helper is useful, but it means the exact returned HTML can change over time. That is acceptable for migration-time conversion, but the helper boundary must be explicit and easy to mock in tests.
Historical HTML varies more than it first appears
Raw <blockquote> content in older posts can blur the line between authored HTML and legacy wrapper debris. The parser should skip whenever that distinction is not clean.
Large archive rewrites can hide bad assumptions
Because the first write pass may touch many posts, the tooling should make classification visible before editing files in place.
Acceptance Criteria
- Quote normalization supports multiple independent safe wp:quote blocks in a single post.
- Safe raw HTML blockquote shapes can be converted mechanically to markdown blockquotes.
- A dedicated embed normalizer converts supported YouTube wrappers to native iframe embeds.
- Twitter/X wrappers can be converted through an oEmbed helper that writes provider-returned embed HTML into the post.
- Non-Twitter, non-YouTube embed providers are skipped with explicit reasons.
- Both tools default to dry-run and require explicit write mode to edit files.
- Tests cover both successful conversions and skip paths.
- Repository validation passes after the rewritten batch.