Content Readiness
Can AI agents find, read, and understand your content? These checks measure how efficiently agents consume your pages — from discovery signals to token efficiency.
Discovery
- llms.txt
- llms.txt gives AI agents a structured summary of your site — what it does, what APIs exist, and where to find key content. Without it, agents must reverse-engineer your site from raw HTML. See the llmstxt.org spec; a minimal example appears after this list.
- Structured Data (JSON-LD)
- JSON-LD structured data tells agents exactly what entities exist on your page — products, organizations, articles — and their properties. This eliminates guesswork when agents extract information. See Google's structured data guide; a sample block appears after this list.
- Sitemap
- A sitemap lets AI crawlers discover all your pages without following every link. This is critical for agents that need to index or search your full site. See the Sitemap protocol; a short example appears after this list.
- Meta Descriptions & OpenGraph
- Meta descriptions and OpenGraph tags let agents summarize your page without reading the full HTML. Agents use these to decide whether a page is relevant before committing tokens to parse it. A head snippet showing these tags appears after this list.
- Heading Hierarchy
- Agents fold and navigate content by heading level. A clear h1 → h2 → h3 hierarchy lets them skip to relevant sections instead of reading the entire page sequentially.
- Canonical URL
- A canonical URL tells agents which version of a page is authoritative. Without it, agents may waste tokens on duplicate content or cite the wrong URL. The head snippet after this list includes a canonical link.
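For the llms.txt check, a minimal sketch following the llmstxt.org format (an H1 with the site name, a blockquote summary, then sections of links). The site name, URLs, and descriptions are placeholders, not a prescribed layout:

```markdown
# Example Store

> An online store for widgets, with a public REST API for orders and inventory.

## Docs

- [API reference](https://example.com/docs/api.md): endpoints, authentication, rate limits
- [Getting started](https://example.com/docs/start.md): account setup and first request

## Optional

- [Company background](https://example.com/about.md)
```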
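For the Structured Data check, a small JSON-LD block describing a product. The types come from schema.org; the values are placeholders:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "description": "Fictional product used to illustrate the markup.",
  "brand": { "@type": "Organization", "name": "Example Co" },
  "offers": { "@type": "Offer", "price": "19.99", "priceCurrency": "USD" }
}
</script>
```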
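For the Sitemap check, a two-URL sitemap in the standard protocol format; the URLs and dates are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/api</loc>
    <lastmod>2025-01-10</lastmod>
  </url>
</urlset>
```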
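For the Meta Descriptions & OpenGraph and Canonical URL checks, a head snippet with the tags agents typically read first; the titles, summaries, and URL are placeholders:

```html
<head>
  <title>Example Widget | Example Co</title>
  <meta name="description" content="A one-sentence summary an agent can use to judge relevance before parsing the page.">
  <meta property="og:title" content="Example Widget">
  <meta property="og:description" content="The same summary, exposed via OpenGraph.">
  <meta property="og:type" content="website">
  <link rel="canonical" href="https://example.com/widget">
</head>
```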
Readability
- Markdown Content Negotiation
- When a server responds to Accept: text/markdown with clean markdown, agents get your content at ~80% fewer tokens than raw HTML. This is the single biggest efficiency win for AI consumption. See Cloudflare's Markdown for Agents; a request/response sketch appears after this list.
- Token Efficiency
- This measures how much of your HTML is actual content vs. framework noise — CSS classes, nested divs, scripts. Low ratios mean agents burn most of their context window on markup instead of your content. A rough way to estimate the ratio appears after this list.
- Content Extraction Quality
- Agents extract your main content by looking for <main> or <article> elements. Without these semantic wrappers, they pull in navigation, footers, and sidebars — adding noise and wasting tokens. A page skeleton with these wrappers appears after this list.
- Semantic HTML
- Semantic elements like <main>, <nav>, <article>, and <section> give agents a structural map of your page. Generic <div> containers provide no hints about what content they hold.
- Page Token Footprint
- Every page an agent reads consumes context window tokens. Lighter pages leave more room for conversation history and multi-step workflows. Pages over 30k tokens can exhaust a smaller model's context window entirely.
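For the Markdown Content Negotiation check, a sketch of the exchange, assuming a server that honors Accept: text/markdown; the host, path, and body are placeholders:

```http
GET /docs/getting-started HTTP/1.1
Host: example.com
Accept: text/markdown

HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8

# Getting started
Install the CLI, create an API key, and make your first request.
```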
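For the Token Efficiency check, one rough way to approximate the content-to-markup ratio: compare visible text to total HTML size. This is a character-based proxy, not the checker's actual formula, and the URL is a placeholder:

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.in_skipped = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.in_skipped += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.in_skipped:
            self.in_skipped -= 1

    def handle_data(self, data):
        if not self.in_skipped:
            self.chunks.append(data)


def content_ratio(url: str) -> float:
    """Visible-text characters / total HTML characters (rough proxy for token efficiency)."""
    html = urlopen(url).read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join("".join(extractor.chunks).split())
    return len(text) / max(len(html), 1)


if __name__ == "__main__":
    print(f"content ratio: {content_ratio('https://example.com/'):.1%}")
```

A low ratio usually means the page ships far more markup and script than readable content.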
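For the Content Extraction Quality and Semantic HTML checks (and the heading hierarchy above), a page skeleton that gives agents clear extraction targets; the section names are placeholders:

```html
<body>
  <nav>…site navigation…</nav>
  <main>
    <article>
      <h1>Page title</h1>
      <section>
        <h2>First topic</h2>
        <p>…</p>
      </section>
      <section>
        <h2>Second topic</h2>
        <p>…</p>
      </section>
    </article>
  </main>
  <footer>…legal links, contact…</footer>
</body>
```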
Permissions
- AI Crawler Policy (robots.txt)
- robots.txt controls which crawlers can access your site. Blocking AI crawlers like GPTBot and ClaudeBot prevents your content from appearing in AI answers and agent workflows. A sample policy appears below.
- Content-Signal Header
- The Content-Signal header explicitly declares whether your content can be used for AI training, search, and input. Without it, agents must guess your permissions or apply conservative defaults. See the Content Signals spec; a sketch of the syntax appears below.
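For the AI Crawler Policy check, a minimal robots.txt that admits the AI crawlers named above while still blocking a private path; verify the current user-agent tokens against each vendor's documentation:

```
# AI crawlers mentioned in this check
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Everyone else
User-agent: *
Disallow: /admin/
```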
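For the Content-Signal check, a sketch of the value syntax, assuming the search / ai-input / ai-train signal names from the Content Signals spec; consult the spec for the exact spelling and placement:

```
Content-Signal: search=yes, ai-input=yes, ai-train=no
```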