Making Your React Website Accessible to LLM Agents

Intended Audience: Senior Full Stack Web Developers

Estimated Reading Time: 9 minutes

You've built a modern Single Page Application (SPA). It's fast, interactive, and provides a great user experience. But when a Large Language Model (LLM) agent such as ChatGPT tries to understand your site, it sees almost nothing useful.

I recently spent time testing how well AI agents could navigate my SPA storefront. What I discovered was both frustrating and instructive: the tools aren't ready for the web we've built, and the web isn't ready for the tools. But there's a pragmatic middle path that works today.

The Problem: Agents Are Glorified Curl

When you send an LLM agent to fetch a webpage, it doesn't spin up a browser. There's no JavaScript execution, no DOM manipulation, no client-side rendering. What the agent gets is whatever the server sends in that initial HTML response — essentially the output of curl.

For a modern SPA that renders a nearly-empty HTML shell and builds everything client-side, the agent sees something like:

<body>
  <div id="root"></div>
  <script src="/assets/app.js"></script>
</body>

That's it. Your entire product catalog, your blog posts, your carefully crafted content — invisible.
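You can check what an agent sees for yourself: fetch the raw HTML without executing any JavaScript, which is essentially what curl would show you. A minimal sketch using Node 18+ and its built-in fetch (the URL is a placeholder, not my site):

// check-shell.mjs -- print the raw HTML an agent's fetch tool receives.
// No JavaScript runs, so a SPA returns only its empty shell.
const res = await fetch("https://example.com/");
console.log(await res.text());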

Why I Skipped Server-Side Rendering

The obvious solution seems to be server-side rendering: generate the full HTML on the server so crawlers and agents can see it. But I decided against this for my React SPA.

The problem is that a static HTML snapshot of an interactive React application isn't very useful. Components that respond to clicks, forms that validate input, dynamic state that changes as users interact — none of that translates to a static render. You end up maintaining two rendering paths for content that doesn't really benefit from SSR anyway.

Instead, I took a different approach: pre-generated static HTML pages that exist alongside the SPA, serving different audiences through parallel paths.

The Hidden Information Problem

Here's something I didn't fully appreciate until testing with an agent: even perfectly valid static HTML hides essential information from these tools.

When an agent fetches a webpage, it doesn't receive the raw HTML. The tooling processes it first, typically extracting just the text content. This means anything stored in HTML attributes gets discarded.

Consider a simple navigation link:

<a href="/trinkets/my-product.html">My Product</a>

A human sees a clickable link. A search engine crawler sees both the link text and the destination URL. But an LLM agent, after its fetch tool processes the page, sees only:

My Product

The URL vanishes entirely. The href attribute — the most important part of a link — is gone. The same applies to image src attributes, data- attributes, alt text, and anything else not directly in a text node.
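You can reproduce this flattening in a browser console. A minimal sketch using the browser's DOMParser (agent fetch tools use their own extraction pipelines, but the effect on attributes is the same):

// Text extraction keeps only text nodes; every attribute disappears.
const html = '<a href="/trinkets/my-product.html">My Product</a>';
const doc = new DOMParser().parseFromString(html, "text/html");
console.log(doc.body.textContent); // "My Product" -- no trace of the href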

This meant my carefully structured navigation, my product links, my entire site hierarchy — all invisible. The agent couldn't follow links because it couldn't see them. I had to manually paste URLs into the conversation before the agent could fetch them.

The Hierarchy of Agent Limitations

Through testing, I catalogued several layers of limitation:

Attribute stripping: The href of every link, the src of every image, alt text, data- attributes — anything not in a text node disappears. The DOM structure that makes HTML meaningful is flattened to plain text.

No autonomous navigation: Even if an agent could see link URLs, permission systems often prevent following them automatically. The agent I tested required explicit user approval for each URL, even ones discovered in previously fetched pages.

Caching: Some agent fetch tools cache responses, meaning rapid iterations during development may return stale content without warning.

No JavaScript execution: This is fundamental and unlikely to change soon. If your content requires JS to render, agents won't see it.

The Pragmatic Solution: Expose Everything as Text

After enough frustration, I arrived at a simple principle: stop hiding information in attributes. If the agent only sees text nodes, put everything important in text nodes.

Instead of relying on href attributes alone, include URLs as visible text:

<li>
  My Product ($9.95) -
  <a href="/trinkets/my-product.html">/trinkets/my-product.html</a>
</li>

The URL appears twice: once in the href for browsers, once as link text for agents. Redundant? Yes. But now the agent can actually see where to go for more details.

I took this further: front-load everything important on the index page. Don't make the agent follow links to understand your site. Give it the complete picture in one request:

<body>
  <main id="static-content">
    <h1>My Store</h1>
    <p>We sell things.</p>

    <h2>Products</h2>
    <ol>
      <li>
        Widget ($19.95) -
        <a href="/products/widget.html">/products/widget.html</a>
      </li>
      <li>
        Gadget ($29.95) -
        <a href="/products/gadget.html">/products/gadget.html</a>
      </li>
    </ol>

    <h2>Blog</h2>
    <ol>
      <li>
        Our Latest Post - <a href="/blog/latest.html">/blog/latest.html</a>
      </li>
    </ol>

    <h2>Pages</h2>
    <ol>
      <li>About Us - <a href="/about.html">/about.html</a></li>
    </ol>
  </main>

  <div id="root"></div>
  <script>
    document.getElementById("static-content")?.remove();
  </script>
</body>

One fetch gives an agent everything: products, prices, URLs for details, blog posts, informational pages. No navigation required.

Two Parallel Paths

This approach serves two audiences through parallel paths in the same HTML document.

Path 1: JavaScript-enabled browsers (humans)

When a visitor loads the page in a normal browser, JavaScript executes. The SPA boots up in the #root div, and that small inline script removes the #static-content block. The visitor never sees the static content — they get the full interactive React application.

Path 2: No JavaScript (agents and crawlers)

When an agent or crawler fetches the page, JavaScript doesn't run. The #static-content remains, providing a complete text-based view of the site. The empty #root div is meaningless without JS to populate it.

The same principle applies to individual pages. Each static product page includes a small redirect script:

<script>
  location.replace("/#/trinkets/my-product");
</script>

If a human lands on /trinkets/my-product.html — perhaps from a search engine result — the script immediately redirects them to the equivalent SPA route at /#/trinkets/my-product. They get the full interactive experience. But an agent fetching that same URL gets the static HTML content, since the redirect never executes.

Automating Static Page Generation

Maintaining static pages by hand would be tedious and error-prone. Instead, I use a Node.js script that runs during the build process to generate everything automatically.

The script reads product data from TypeScript configuration files — the same source of truth the SPA uses at runtime. For each product, it generates a static HTML page with the title, price, description, and images. It also updates the main index.html with a complete listing of all products, blog posts, and pages.
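For context, the product data the script consumes looks roughly like this. This is an illustrative shape only, not my exact schema:

// products.ts -- a hypothetical product entry, simplified.
export interface Product {
  slug: string; // used for both the SPA route and the static .html path
  title: string;
  price: number;
  description: string;
}

export const products: Product[] = [
  { slug: "widget", title: "Widget", price: 19.95, description: "A fine widget." },
  { slug: "gadget", title: "Gadget", price: 29.95, description: "An even finer gadget." },
];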

Here's a simplified version of the core logic:

// Appended to every generated page: browsers jump to the matching SPA route,
// while agents (which never execute it) keep reading the static HTML.
const REDIRECT_SCRIPT = `<script>
location.replace('/#' + location.pathname.replace('.html', ''));
</script>`;

// Build a complete, self-contained HTML document for one product.
function generateProductPage(product) {
  return `<!DOCTYPE html>
<html lang="en">
<head>
    <title>${product.title}</title>
    <meta name="description" content="${product.description.slice(0, 160)}">
</head>
<body>
    <h1>${product.title}</h1>
    <p><strong>$${product.price.toFixed(2)}</strong></p>
    <p>${product.description}</p>
    <p><a href="/">Back to Home</a></p>
    ${REDIRECT_SCRIPT}
</body>
</html>`;
}

// Build the front-loaded listing that gets injected into index.html.
function generateStaticContent(products, posts, pages) {
  let html = "<h2>Products</h2><ol>";

  for (const p of products) {
    const url = `/trinkets/${p.slug}.html`;
    html += `<li>${p.title} ($${p.price.toFixed(2)}) - <a href="${url}">${url}</a></li>`;
  }
  }

  html += "</ol>";
  // ... similar loops for blog posts and pages
  return html;
}

The script injects the generated content into index.html at a marker comment, replacing whatever was there before. This keeps the static content in sync with the product catalog without manual intervention.
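The injection itself is a straightforward string replacement. A minimal sketch, assuming a pair of marker comments of your own choosing and a dist/index.html build output (both are assumptions, not fixed conventions):

import { readFileSync, writeFileSync } from "node:fs";

// Swap everything between the markers for freshly generated content.
function injectStaticContent(staticHtml, indexPath = "dist/index.html") {
  const index = readFileSync(indexPath, "utf8");
  const updated = index.replace(
    /<!-- STATIC-CONTENT-START -->[\s\S]*?<!-- STATIC-CONTENT-END -->/,
    `<!-- STATIC-CONTENT-START -->\n${staticHtml}\n<!-- STATIC-CONTENT-END -->`
  );
  writeFileSync(indexPath, updated);
}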

The build script also generates a sitemap.xml for search engines, using the same product and content data. One source of truth, multiple outputs.
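That step is equally mechanical. A rough sketch, assuming a baseUrl you configure per deployment, since sitemaps want fully qualified URLs:

// Build a minimal sitemap.xml from the same product data.
function generateSitemap(products, baseUrl) {
  const urls = [
    `${baseUrl}/`,
    ...products.map((p) => `${baseUrl}/trinkets/${p.slug}.html`),
  ];
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls.map((u) => `  <url><loc>${u}</loc></url>`),
    "</urlset>",
  ].join("\n");
}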

Relative URLs Are Fine

You might wonder whether to use absolute or relative URLs. I settled on relative paths for a simple reason: my site runs on multiple domains. Hardcoding one domain would break the others, and agents are unlikely to jump domains anyway. They'll stay on whatever domain the user pointed them to.

If an agent can't resolve /products/widget.html against the base URL it just fetched, it probably can't do much else useful either.

Skip the Ceremony

I also stopped using <noscript> tags for fallback content. They're a relic from when "does this browser support JavaScript?" was a meaningful question. Today, the answer is effectively always yes for human users. And agents read the raw HTML without executing anything, so your content is visible to them whether or not it sits inside a <noscript> wrapper.

The simpler pattern: serve real content unconditionally, then let JavaScript remove it. Progressive enhancement as it was always meant to work.

The State of Agent Accessibility

There's no established standard for making websites agent-accessible. No equivalent to ARIA roles or semantic HTML best practices. The llms.txt proposal exists but isn't widely adopted. Structured data helps search engines but doesn't address the attribute-stripping problem.

We're in the early days. The web accessibility movement went through a similar phase of "technically correct but practically useless" before tooling caught up. For now, the best agent accessibility is the most obvious thing: put everything in text nodes where nothing can strip it away.

It's not elegant. But it works today, with tools as they actually exist. And sometimes that's what matters.

Additional Challenges: Inconsistent Agent Behavior

Beyond the technical limitations of HTML processing, there's an even more fundamental problem: agents don't behave consistently.

During my testing, I encountered maddening inconsistencies with ChatGPT in particular. When asked to look at a specific URL, it would sometimes route the request through a search engine rather than fetching the page directly. Instead of retrieving https://example.com/my-page.html, it would search for keywords and return whatever the search engine surfaced — which might be tangentially related content, outdated cached versions, or something else entirely.

Other times, ChatGPT would successfully fetch a URL directly, returning the actual page content. But in subsequent conversations — or even later in the same conversation — it would insist that it couldn't possibly fetch URLs directly. It would claim that web browsing wasn't among its capabilities, or that it had been mistaken earlier when it appeared to do so.

This creates an impossible situation for site owners. You can't optimize for an agent that sometimes fetches a URL directly, sometimes reroutes the request through a search engine instead, and sometimes flatly denies that fetching pages is something it can do at all.

At least with the limitations I described earlier — attribute stripping, no JavaScript, caching — the behavior is consistent. You can work around consistent limitations. You can't work around an agent that gaslights you about whether it can visit your website.

Claude's tooling, while imperfect, at least behaves predictably. It fetches URLs when asked, returns text content (with the attribute-stripping caveats), and doesn't pretend it can't do what it just did. That predictability is surprisingly valuable when you're trying to make your site accessible to AI agents.

The takeaway: when designing for agent accessibility, assume the lowest common denominator. Put everything in plain text. Front-load information on your index page. Don't require multiple fetches to understand your site. And hope the tooling catches up eventually.