@gurupanguji

Portable Post-Body Contract Design

Goal

Define the canonical post-body contract for issue #122 so new authoring and archive cleanup both target one stable shape.

The repository should keep _posts/ as the canonical source of truth, including the real filename and YAML front matter, while making the body itself as portable and markdown-first as possible.

Problem

The current repository already has partial policy spread across multiple places:

scripts/normalize_post_media_links.py rewrites some standalone URLs
scripts/validate_posts.py enforces some body-shape rules
the archive still contains large amounts of WordPress-era HTML, including <img>, <figure>, <blockquote>, &nbsp;, styled wrappers, and provider-specific embed wrappers

Without one written contract, the repository risks growing two different standards:

one standard for newly authored posts
another standard for archive cleanup

That split would turn future cleanup into taste debates instead of mechanical normalization.

Decision

The canonical contract is:

_posts/ remains canonical, including filename and front matter
post bodies are markdown-first
HTML is allowed only where markdown cannot represent the desired content shape cleanly
supported standalone social and video URLs are valid author input and normalize into canonical embed markup before commit
every generated embed keeps a visible markdown source line directly under it so the body can later be de-embedded without losing the underlying URL

Canonical Body Shapes

1. Paragraphs and headings

Use plain markdown.

No raw HTML should be used for spacing, typography, or simple layout control.

That means the canonical contract rejects:

<p> for ordinary prose
<div> wrappers used only for layout
<span> used only for styling
<br> or &nbsp; used only to create vertical rhythm

These are portability and cleanup liabilities, not authored signal.

2. Blockquotes

Use markdown blockquotes only.

Canonical shape:

> quoted text
>
> second paragraph

The canonical contract does not allow raw <blockquote> for ordinary quote rendering and does not rely on special blockquote classes.
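As an illustration of the target shape, the following is a minimal sketch of how cleanup tooling might emit a multi-paragraph markdown blockquote from already-extracted plain-text paragraphs. The function name `to_markdown_blockquote` is hypothetical and is not part of the existing scripts; extracting the paragraphs from legacy HTML is out of scope here.

```python
def to_markdown_blockquote(paragraphs):
    """Render plain-text paragraphs as one markdown blockquote.

    Hypothetical helper: each paragraph becomes a "> " line and
    paragraphs are separated by a bare ">" line, as in the canonical shape.
    """
    quoted = []
    for paragraph in paragraphs:
        # Collapse internal whitespace so each paragraph stays on one line.
        text = " ".join(paragraph.split())
        if text:
            quoted.append("> " + text)
    return "\n>\n".join(quoted)


if __name__ == "__main__":
    print(to_markdown_blockquote(["quoted text", "second paragraph"]))
```

Keeping the separator as a bare `>` line means the two paragraphs stay inside a single blockquote rather than rendering as two adjacent quotes.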

3. Images

Use markdown image syntax for normal post images.

Canonical shape:

![Alt text](/assets/images/blog/filename.jpg)

Optional caption shape:

![Alt text](/assets/images/blog/filename.jpg)

This is the caption.

Rules:

no <img> for ordinary images
no <figure>
no <figcaption>
no image-specific HTML classes in canonical body content

This keeps the body portable and makes archive cleanup mechanical.

If a future need appears that markdown truly cannot express, it should be treated as a new policy decision rather than left implicit.
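To make the image rule concrete, here is a rough sketch of converting a single legacy `<img>` tag into the canonical markdown shape, using only the standard-library `html.parser`. The helper name `img_tag_to_markdown` is an assumption for illustration, and the sketch deliberately discards classes and other attributes; it does not escape special characters in alt text.

```python
from html.parser import HTMLParser


class _ImgCollector(HTMLParser):
    """Collects (alt, src) pairs for every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            self.images.append((attr_map.get("alt") or "", attr_map.get("src") or ""))


def img_tag_to_markdown(html):
    """Return markdown image syntax for a fragment holding exactly one <img>.

    Hypothetical helper: returns None when the fragment has no usable image,
    so the caller can leave ambiguous markup for manual review.
    """
    collector = _ImgCollector()
    collector.feed(html)
    collector.close()
    if len(collector.images) != 1 or not collector.images[0][1]:
        return None
    alt, src = collector.images[0]
    return f"![{alt}]({src})"
```

Returning `None` instead of guessing keeps the conversion mechanical: anything the helper cannot map one-to-one stays flagged for a human decision.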

4. Supported standalone embed inputs

These standalone URL shapes are valid author input when they appear on their own line:

YouTube
X or Twitter
Mastodon
Bluesky
Threads

These URLs should be treated as author-friendly input syntax, not as the final committed canonical body shape.
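A minimal sketch of how "standalone URL on its own line" could be detected and mapped to a provider follows. The host table, the Mastodon path heuristic, and the function name `classify_standalone_url` are all assumptions for illustration; the actual normalizer may recognize a different set of hosts and URL shapes.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical host table; not taken from the existing scripts.
PROVIDER_HOSTS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "x.com": "X",
    "twitter.com": "X",
    "bsky.app": "Bluesky",
    "threads.net": "Threads",
}

# Mastodon is federated, so this sketch assumes the common /@user/<numeric id> path.
MASTODON_PATH = re.compile(r"^/@[^/]+/\d+$")


def classify_standalone_url(line: str) -> Optional[str]:
    """Return a provider label if the line is a supported standalone URL."""
    text = line.strip()
    # Standalone means the URL is the only content on the line.
    if not re.fullmatch(r"https?://\S+", text):
        return None
    parsed = urlparse(text)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in PROVIDER_HOSTS:
        return PROVIDER_HOSTS[host]
    if MASTODON_PATH.match(parsed.path):
        return "Mastodon"
    return None
```

Lines that return `None` are either not standalone URLs or not supported, and this sketch leaves the decision about them to the caller.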

5. Supported embed output shape

Before commit, a supported standalone embed URL should be replaced by:

canonical embed HTML
a visible markdown source line directly under it

The original raw URL line should not remain in the committed body once normalization succeeds.

Canonical pattern:

<iframe ...></iframe>

*Source:* [YouTube](https://example.com)

The exact embed HTML varies by provider, but the body-level contract is stable:

embed HTML first
markdown source line immediately after
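The input-to-output replacement can be sketched for one provider as follows. This is an illustrative sketch, not the behavior of `scripts/normalize_post_media_links.py`: the function names are hypothetical, the iframe attributes are a guess at a reasonable minimum, and only the `youtu.be/<id>` and `youtube.com/watch?v=<id>` URL forms are handled.

```python
from urllib.parse import parse_qs, urlparse


def youtube_video_id(url):
    """Extract the video id from two common YouTube URL forms, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "www.youtube.com") and parsed.path == "/watch":
        ids = parse_qs(parsed.query).get("v")
        return ids[0] if ids else None
    return None


def normalize_youtube_line(line):
    """Replace a standalone YouTube URL with embed HTML plus a source line.

    Lines that are not recognized are returned unchanged.
    """
    url = line.strip()
    video_id = youtube_video_id(url)
    if video_id is None:
        return line
    embed = f'<iframe src="https://www.youtube.com/embed/{video_id}" allowfullscreen></iframe>'
    # Embed HTML first, then the visible markdown source line that keeps the original URL.
    return f"{embed}\n\n*Source:* [YouTube]({url})"
```

Because the source line carries the author's original URL, the embed can later be stripped and the post still retains a working link.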

6. Source-line label rule

Use fixed platform-name labels in v1.

Canonical labels:

YouTube
X
Mastodon
Bluesky
Threads

Canonical source shape:

*Source:* [YouTube](https://...)

Do not fetch post titles or remote metadata for the label in this issue. Deterministic platform labels are simpler and more stable.
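Since the labels are fixed, checking a source line is a pure string match. The sketch below shows one way a validator could test it; the regular expression and the name `is_canonical_source_line` are assumptions, not the existing validation code.

```python
import re

# The five v1 labels listed above.
CANONICAL_LABELS = ("YouTube", "X", "Mastodon", "Bluesky", "Threads")

# Assumed shape: *Source:* [Label](http(s) URL) with nothing else on the line.
SOURCE_LINE = re.compile(
    r"^\*Source:\* \[(?P<label>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)$"
)


def is_canonical_source_line(line):
    """True when the line matches the source shape and uses a fixed label."""
    match = SOURCE_LINE.match(line.strip())
    return bool(match) and match.group("label") in CANONICAL_LABELS
```

For example, a line labelled `Twitter` would fail this check even though the URL host is `twitter.com`, because the canonical label for that provider is `X`.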

7. Unsupported standalone URLs

Unsupported standalone bare URLs are tolerated for now and remain untouched by the contract in issue #122.

This issue does not decide whether unsupported standalone URLs should:

remain bare
auto-wrap into markdown links
be rejected by validation

That policy belongs to follow-up cleanup work, especially archive cleanup issue #124.

Representative current examples:

Allowed HTML

Issue #122 allows raw HTML only in these cases:

canonical provider embed HTML generated from supported standalone URLs
existing edge cases that cannot be represented in markdown cleanly, until a later issue decides them explicitly

Issue #122 does not allow raw HTML for:

ordinary images
ordinary blockquotes
captions
spacing
typography styling
layout wrappers

This means most existing <img>, <figure>, <figcaption>, <blockquote>, &nbsp;, and styled container markup is cleanup debt, not part of the new canonical standard.
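A rough way to size that cleanup debt is to scan post bodies for the disallowed markup. The sketch below is an assumption-laden illustration: the name `find_cleanup_debt` is hypothetical, it treats the `&nbsp;` entity as the spacing markup in question, and it deliberately skips `<blockquote>` because some provider embed markup uses that tag, so a plain tag match cannot tell an allowed embed from an ordinary quote.

```python
import re

# Tags and entities this sketch treats as always disallowed in canonical bodies.
DISALLOWED = re.compile(r"<(img|figure|figcaption)\b|&nbsp;", re.IGNORECASE)


def find_cleanup_debt(body):
    """Return (line_number, matched_token) pairs for disallowed markup."""
    hits = []
    for number, line in enumerate(body.splitlines(), start=1):
        for match in DISALLOWED.finditer(line):
            hits.append((number, match.group(0).lower()))
    return hits
```

Reporting line numbers rather than rewriting in place keeps this a measurement step; the actual rewrite belongs to the archive cleanup work.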

Archive Examples That Inform This Contract

HTML that should move to markdown

Ordinary image or figure markup that is likely portable as markdown:

Raw blockquote and spacing markup that should become markdown:

HTML that remains allowed under this contract

Supported embed HTML already present in the repo:

Responsibilities By Issue

Issue #122

Define the contract only:

what canonical post bodies are allowed to contain
what author input shapes are supported
what output shapes are canonical
what remains out of scope for now

Follow-up issues

issue #121: extend standalone URL normalization to Mastodon, Bluesky, and Threads
issue #126: run normalization before commit and re-check in validation
issue #124: backfill the archive toward this contract
issue #125: reduce hand-authored metadata burden for categories and tags

Implementation Consequences

This contract implies:

scripts/normalize_post_media_links.py should become the normalization surface for supported standalone embed URLs
scripts/validate_posts.py should validate against the canonical shapes defined here
archive cleanup should target this contract exactly instead of making ad hoc cleanup choices file by file
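One validation consequence is that every embed must be followed by its source line. The following sketch shows the idea under a simplifying assumption: embeds are single-line `<iframe>` elements, which will not hold for every provider. The function name `embeds_missing_source` is hypothetical and is not taken from `scripts/validate_posts.py`.

```python
def embeds_missing_source(body):
    """Return 1-based line numbers of iframe embeds lacking a source line.

    Assumes each embed is a single line starting with "<iframe"; the next
    non-blank line must start with the "*Source:* [" prefix.
    """
    lines = body.splitlines()
    missing = []
    for index, line in enumerate(lines):
        if not line.lstrip().startswith("<iframe"):
            continue
        # Find the first non-blank line after the embed, if any.
        following = next((l.strip() for l in lines[index + 1:] if l.strip()), "")
        if not following.startswith("*Source:* ["):
            missing.append(index + 1)
    return missing
```

A post passes this check only when the returned list is empty, which gives the validator a simple, mechanical pass or fail signal.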

Acceptance Criteria

The repository has one written source of truth for canonical post-body shapes.
The contract clearly distinguishes markdown-first content from the small HTML surface that remains allowed.
Supported standalone embed URLs have an explicit input shape and explicit canonical output shape.
The contract preserves a visible markdown source URL under generated embeds.
Unsupported standalone URLs are explicitly left for follow-up work instead of being decided accidentally during implementation.