@gurupanguji

HTML And Embed Cleanup Design

Goal

Address issue #104 with an automation-first cleanup pass that normalizes the dominant, safe legacy markup shapes in _posts/ and skips risky historical oddities, leaving them for later manual review.

The first pass should handle:

The first pass should skip:

Problem

The repository already landed a narrow wp:quote normalization pass for issue #100, but the archive still contains broader historical markup debt.

Current audit snapshot on March 27, 2026:

This makes issue #104 a cleanup and debt-reduction task, not a single-feature parser fix.

Design

1. Keep quotes and embeds as separate normalization tools

Quote normalization and embed normalization should stay in separate scripts or separate modules with separate tests.

Reason:

One wrapper command can come later if it proves useful. It should not be the starting point.

2. Expand quote normalization from one safe block to many safe blocks

The current quote normalizer is intentionally narrow, and that scope is now too narrow for the remaining backlog. It should convert any number of independent wp:quote blocks within the same post, provided each block is individually safe.
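As a rough illustration of per-block handling, the sketch below walks every wp:quote block in a post and converts only the ones that pass a safety check. The function names and the safety predicate are hypothetical: is_safe_quote accepts only plain-text <p> children as a stand-in for the v1 rules in this section. It also assumes an unsafe block is left untouched while its safe siblings are converted; a stricter policy could skip the whole post instead.

```python
import re
from typing import Tuple

# Gutenberg serializes blocks as HTML comments; wp:quote uses this open/close pair.
QUOTE_BLOCK = re.compile(r"<!-- wp:quote[^>]*-->(.*?)<!-- /wp:quote -->", re.DOTALL)
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)
# Assumed stand-in for the real safety rules: a blockquote holding only plain-text <p> tags.
SIMPLE_BODY = re.compile(
    r"\s*<blockquote[^>]*>\s*(?:<p>[^<]*</p>\s*)+</blockquote>\s*"
)


def is_safe_quote(inner: str) -> bool:
    """Hypothetical safety rule: nothing but plain-text paragraphs inside the quote."""
    return SIMPLE_BODY.fullmatch(inner) is not None


def to_markdown_quote(inner: str) -> str:
    """Join paragraphs as a markdown blockquote with '>' separator lines."""
    paragraphs = [p.strip() for p in PARAGRAPH.findall(inner)]
    return "\n>\n".join(f"> {p}" for p in paragraphs)


def normalize_quotes(post: str) -> Tuple[str, int, int]:
    """Return (new_text, converted_count, skipped_count) for one post body."""
    counts = {"converted": 0, "skipped": 0}

    def replace(match: "re.Match") -> str:
        inner = match.group(1)
        if not is_safe_quote(inner):
            counts["skipped"] += 1
            return match.group(0)  # skip, do not guess: leave the block as-is
        counts["converted"] += 1
        return to_markdown_quote(inner)

    new_text = QUOTE_BLOCK.sub(replace, post)
    return new_text, counts["converted"], counts["skipped"]
```

Each block is judged on its own, so one odd block in a post does not prevent the mechanical ones from being counted and converted.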

Safe quote block rules for v1:

The script should still skip, not guess, when it sees:

3. Add a second quote cleanup path for simple raw HTML blockquotes

Issue #104 is not limited to wp:quote. Some posts contain raw HTML blockquotes that still map cleanly to markdown.

The first pass should support only mechanical shapes such as:

Converted output shape:

> text
>
> text 2

Source: [title](url)

This converter should skip:

This path should be mechanical only. It should not rewrite authored commentary or try to modernize every old HTML post.
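A minimal sketch of one such mechanical conversion follows. The input shape is an assumption made for illustration only: a bare <blockquote> holding plain-text <p> paragraphs and an optional trailing <cite><a href="url">title</a></cite>. The function name is hypothetical, and anything that deviates from that shape returns None so the caller can skip and report it.

```python
import re
from typing import Optional

# Assumed safe shape: plain <p> paragraphs, then an optional <cite> wrapping one link.
RAW_QUOTE = re.compile(
    r"<blockquote>\s*"
    r"(?P<paragraphs>(?:<p>[^<]*</p>\s*)+)"
    r"(?:<cite><a href=\"(?P<url>[^\"]+)\">(?P<title>[^<]+)</a></cite>\s*)?"
    r"</blockquote>"
)


def convert_raw_blockquote(html: str) -> Optional[str]:
    """Return markdown for the assumed safe shape, or None to signal a skip."""
    match = RAW_QUOTE.fullmatch(html.strip())
    if match is None:
        return None  # nested tags, attributes, or anything unexpected: skip
    paragraphs = re.findall(r"<p>([^<]*)</p>", match.group("paragraphs"))
    markdown = "\n>\n".join(f"> {p.strip()}" for p in paragraphs)
    if match.group("url"):
        markdown += f"\n\nSource: [{match.group('title')}]({match.group('url')})"
    return markdown
```

Because the pattern must match the whole input, inline markup inside a paragraph causes a skip rather than a partial rewrite.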

4. Add a dedicated embed normalizer for only YouTube and Twitter/X

The embed cleanup should support only the dominant safe providers first.

YouTube

Convert WordPress embed wrappers that contain a YouTube URL into native embed HTML using the canonical youtube.com/embed/<id> shape.
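The URL handling could look roughly like the sketch below. It covers only the URL-to-iframe step, not wrapper detection. The function names, the accepted host list, the 11-character ID check (a common convention, not a guarantee), and the iframe attributes are all assumptions for illustration.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Assumption: video IDs are 11 URL-safe characters, which matches common practice.
VIDEO_ID = re.compile(r"[A-Za-z0-9_-]{11}")
YOUTUBE_HOSTS = ("youtube.com", "m.youtube.com")


def youtube_video_id(url: str) -> Optional[str]:
    """Extract a video ID from watch, short, or embed URLs; None means skip."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/")
    elif host in YOUTUBE_HOSTS and parsed.path == "/watch":
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    elif host in YOUTUBE_HOSTS and parsed.path.startswith("/embed/"):
        candidate = parsed.path[len("/embed/"):]
    else:
        return None
    return candidate if VIDEO_ID.fullmatch(candidate) else None


def youtube_iframe(url: str) -> Optional[str]:
    """Build native embed HTML in the canonical youtube.com/embed/<id> shape."""
    video_id = youtube_video_id(url)
    if video_id is None:
        return None
    return (
        f'<iframe src="https://www.youtube.com/embed/{video_id}" '
        'title="YouTube video player" allowfullscreen></iframe>'
    )
```

Returning None for anything unrecognized keeps the skip-and-report decision with the caller.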

Responsibilities:

Twitter/X

Convert WordPress Twitter/X wrappers by:

The oEmbed helper should be treated as a migration-time tool, not a runtime site dependency.

Important constraints:

If the helper fails for a given URL, the post should be skipped and logged.
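One way to keep that helper boundary explicit is sketched below. It assumes the public oEmbed endpoint at publish.twitter.com/oembed and reads only the html field of the JSON response. The function names and the injectable fetch and log parameters are hypothetical, chosen so tests can replace the network call.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def fetch_json(url: str) -> dict:
    """Network boundary: the only function that talks to the provider."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def tweet_embed_html(
    tweet_url: str,
    fetch: Callable[[str], dict] = fetch_json,
    log: Callable[[str], None] = print,
) -> Optional[str]:
    """Return provider-supplied embed HTML, or None after logging a skip."""
    query = urllib.parse.urlencode({"url": tweet_url})
    try:
        payload = fetch(f"{OEMBED_ENDPOINT}?{query}")
    except (OSError, ValueError) as error:
        # URLError and timeouts are OSError; malformed JSON is ValueError.
        log(f"skip {tweet_url}: oEmbed request failed ({error})")
        return None
    html = payload.get("html")
    if not html:
        log(f"skip {tweet_url}: oEmbed response had no html field")
        return None
    return html
```

In tests, passing a fake fetch function exercises both the success and failure paths without any network access.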

5. Skip the long tail of providers in v1

For non-Twitter, non-YouTube wp:embed wrappers, the first pass should skip and report them.

Examples from the current archive include:

Reason:

This cleanup pass should reduce the biggest buckets first without inventing output policy for every provider.
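The skip-and-report behavior for unsupported providers might be sketched as a small classifier. The host table and the classify_embed name are illustrative assumptions; the real tool may detect providers from wrapper attributes instead of the URL host.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumed host table for the two v1 providers; everything else is skipped.
SUPPORTED_HOSTS = {
    "youtube.com": "youtube",
    "youtu.be": "youtube",
    "twitter.com": "twitter",
    "x.com": "twitter",
}


def classify_embed(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (provider, None) when supported, else (None, skip_reason)."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    provider = SUPPORTED_HOSTS.get(host)
    if provider is not None:
        return provider, None
    return None, f"unsupported provider: {host or 'unknown host'}"
```

Returning the reason as data, rather than printing it, lets the dry-run report group skipped posts by cause.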

6. Use dry-run-first reporting as the review surface

Both cleanup tools should default to dry-run mode and emit:

That report is the main safety valve before any archive rewrite. It should be easy to compare before and after implementation.
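A dry-run-first command line could be wired up along these lines. The --write flag name, the default _posts path, and the report fields are assumptions for illustration, not the tool's confirmed interface.

```python
import argparse
from collections import Counter
from typing import Iterable, List, Optional, Sequence, Tuple


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Dry-run is the default; files change only when --write is passed."""
    parser = argparse.ArgumentParser(description="Normalize legacy markup in posts.")
    parser.add_argument("paths", nargs="*", default=["_posts"],
                        help="files or directories to scan")
    parser.add_argument("--write", action="store_true",
                        help="edit files in place; without it nothing is written")
    return parser.parse_args(argv)


def format_report(results: Iterable[Tuple[str, Optional[str]]], write: bool) -> List[str]:
    """results holds (path, skip_reason); skip_reason is None for convertible posts."""
    results = list(results)
    reasons = Counter(reason for _, reason in results if reason)
    lines = [
        f"mode: {'write' if write else 'dry-run'}",
        f"convertible: {sum(1 for _, reason in results if reason is None)}",
        f"skipped: {sum(reasons.values())}",
    ]
    # Sorted output keeps the report stable, so two runs can be diffed.
    for reason, count in sorted(reasons.items()):
        lines.append(f"  {count} x {reason}")
    return lines
```

Keeping the report deterministic is what makes a before-and-after comparison practical.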

7. Keep the rewrite mechanical and non-editorial

This issue is still a markup normalization task.

That means:

The whole point is to reduce parser debt without erasing authored signal.

File Changes

New

Modify

Verification

Automated

Add test coverage for:

Manual

After dry-run:

After write:

Repository Checks

Run repository validators after the cleanup batch so the rewrite does not break post validation or HTML checks.

Risks

Twitter/X output is provider-owned

The oEmbed helper is useful, but because the provider owns the response, the exact returned HTML can change over time. That is acceptable for migration-time conversion, but the helper boundary must be explicit and easy to mock in tests.

Historical HTML varies more than it first appears

Raw <blockquote> content in older posts can blur the line between authored HTML and legacy wrapper debris. The parser should skip whenever that distinction is not clean.

Large archive rewrites can hide bad assumptions

Because the first write pass may touch many posts, the tooling should make classification visible before editing files in place.

Acceptance Criteria

  1. Quote normalization supports multiple independent safe wp:quote blocks in a single post.
  2. Safe raw HTML blockquote shapes can be converted mechanically to markdown blockquotes.
  3. A dedicated embed normalizer converts supported YouTube wrappers to native iframe embeds.
  4. Twitter/X wrappers can be converted through an oEmbed helper that writes provider-returned embed HTML into the post.
  5. Non-Twitter, non-YouTube embed providers are skipped with explicit reasons.
  6. Both tools default to dry-run and require explicit write mode to edit files.
  7. Tests cover both successful conversions and skip paths.
  8. Repository validation passes after the rewritten batch.