Heading Extraction in SPAs: The Hidden Challenge
You run an SEO audit on your React site. Your tool reports: "H1: About Us. H2: Learn More."
You check the live site. The actual H1 is "How We Help SaaS Companies Reduce Churn by 40%." The H2s are detailed feature sections.
Your audit tool lied to you. Not intentionally - it just doesn't see what users and LLMs actually see.
This is the heading extraction problem in single-page applications (SPAs). And it's why your content strategy might be based on completely wrong information.
Why Traditional Tools Fail
Most SEO tools request your HTML and parse it immediately. This works fine for server-rendered sites. For SPAs built with React, Vue, or Angular, it's useless.
What traditional tools see:
<div id="root"></div>
<script src="/static/js/main.js"></script>
That's it. No headings. No content. Just an empty div and a JavaScript file.
What actually renders for users:
<div id="root">
<h1>API Monitoring for DevOps Teams</h1>
<h2>Real-Time Alerts</h2>
<p>Get notified within 30 seconds...</p>
<h2>Historical Analytics</h2>
<p>Track performance trends...</p>
</div>
The content only exists after JavaScript executes. If your tool doesn't run JavaScript, it doesn't see your actual content structure.
Why this matters:
- Google does render JavaScript (with limitations)
- ChatGPT does process rendered content
- Perplexity does see your actual headings
- Your SEO tool does not
You're optimizing based on phantom content that no one actually sees.
How LLMs Read JavaScript-Rendered Content
When an LLM processes your SPA, here's what happens:
Step 1: Initial HTML parse
The LLM requests your URL and receives the initial HTML. For SPAs, this is mostly empty.
Step 2: JavaScript execution
If the pipeline behind it is sophisticated (like Google's rendering infrastructure or specialized tools), it runs your JavaScript in a headless browser. This typically takes 2-5 seconds, depending on your bundle size.
Step 3: DOM extraction
After JavaScript execution completes, the LLM extracts the rendered DOM - the actual HTML that users see.
Step 4: Content analysis
Only now does the LLM see your real headings, content, and structure. This is what it uses to answer queries.
The gap:
Traditional SEO tools stop after Step 1. LLMs complete all four steps. So when you audit your site, you're seeing Step 1 content. When LLMs cite your site (or don't), they're evaluating Step 3 content.
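You can observe the gap directly. A minimal sketch, assuming Node 18+ (for the built-in fetch) and Puppeteer; the URL is a placeholder:
const puppeteer = require('puppeteer');

(async () => {
  const url = 'https://yoursite.com/';

  // Step 1 only: the raw HTML most SEO tools parse
  const raw = await (await fetch(url)).text();
  const rawH1Count = (raw.match(/<h1[^>]*>/gi) || []).length;

  // Steps 1-3: the rendered DOM that LLM pipelines evaluate
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const renderedH1Count = await page.evaluate(
    () => document.querySelectorAll('h1').length
  );
  await browser.close();

  console.log({ rawH1Count, renderedH1Count }); // on an SPA, these usually differ
})();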
This gap explains why:
- Your well-structured React site gets no AI citations
- Your "optimized" headings don't match search queries
- Your content clarity score is mysteriously low
- Google says you have thin content when you know you don't
The Technical Challenge
Rendering JavaScript before extraction isn't simple. Here's why:
Challenge 1: Timing
JavaScript frameworks render content asynchronously. React might update the DOM 12 times before settling on final content. When do you extract?
Too early: You capture loading states ("Loading..." placeholders).
Too late: You waste time waiting for animations and secondary updates.
You need to wait for "network idle" - when all critical resources have loaded and the DOM has stabilized.
Challenge 2: Client-side routing
SPAs don't do full page loads. When you navigate from /features to /pricing, the URL changes but no new HTML document is requested. The content updates via JavaScript.
Traditional crawlers see one page. Users and LLMs see your entire site.
Challenge 3: Hydration
Next.js and similar frameworks use server-side rendering (SSR) plus client-side hydration. The initial HTML contains content, but JavaScript "hydrates" it with interactivity.
During hydration, content might shift. The pre-hydration H1 might differ from the post-hydration H1.
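You can catch hydration shifts by comparing the server-sent H1 with the post-render H1. A rough sketch, assuming a Puppeteer page object like the one in the test script later in this article (the regex only handles a simple, single H1):
// Compare the SSR payload's H1 with the hydrated DOM's H1
const response = await page.goto(url, { waitUntil: 'networkidle0' });
const serverHtml = await response.text(); // HTML exactly as the server sent it
const serverH1 = (serverHtml.match(/<h1[^>]*>(.*?)<\/h1>/is) || [])[1];
const clientH1 = await page.evaluate(
  () => document.querySelector('h1')?.textContent.trim()
);
if ((serverH1 || '').trim() !== clientH1) {
  console.warn('H1 changed during hydration:', { serverH1, clientH1 });
}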
Challenge 4: Conditional rendering
Many React apps show different content based on user state, viewport size, or feature flags. Which version do you extract?
Example: Your mobile heading structure might be completely different from desktop. If you extract mobile headings, you're not seeing what desktop users (and most bots) see.
How to Extract Headings Correctly
To see the same content structure that LLMs see, you need to:
Step 1: Use a headless browser
Tools like Puppeteer or Playwright actually execute JavaScript. They launch a real browser (headless Chromium by default), load your page, wait for rendering, then extract the DOM.
This is what Content LLM Analyzer does under the hood - it uses Puppeteer to fully render your page before extracting headings.
Step 2: Wait for network idle
Don't extract immediately. Wait until:
- No network requests for 500ms
- DOM mutations have stopped
- Critical resources have loaded
Puppeteer's networkidle0 wait condition fires once there have been no network connections for at least 500ms. This catches most SPAs correctly.
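networkidle0 covers the network half. To also confirm DOM mutations have stopped, you can add a quiet-window check after navigation; a sketch, assuming a Puppeteer page object (the 500ms window is an arbitrary choice):
// Resolve once the DOM has gone 500ms without a mutation
await page.evaluate(() => new Promise((resolve) => {
  let timer;
  const observer = new MutationObserver(() => {
    clearTimeout(timer);
    timer = setTimeout(done, 500);
  });
  function done() {
    observer.disconnect();
    resolve();
  }
  observer.observe(document.documentElement, {
    childList: true,
    subtree: true,
    characterData: true,
  });
  timer = setTimeout(done, 500);
}));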
Step 3: Extract from rendered DOM
Once the page is stable, extract headings by querying the actual DOM:
// This is what proper extraction looks like (simplified)
const headings = await page.evaluate(() => {
  const elements = document.querySelectorAll('h1, h2, h3, h4, h5, h6');
  return Array.from(elements).map(el => ({
    tag: el.tagName.toLowerCase(),  // 'h1', 'h2', ...
    level: Number(el.tagName[1]),   // numeric depth, 1-6
    text: el.textContent.trim(),
  }));
});
This gives you the actual heading structure that users and LLMs see.
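With numeric levels in the output (as in the snippet above), hierarchy validation is a single pass over the list; a sketch, with an illustrative function name:
// Flag any heading that skips a level relative to the previous one
function findSkippedLevels(headings) {
  const issues = [];
  let prev = 0;
  for (const h of headings) {
    if (prev > 0 && h.level > prev + 1) {
      issues.push(`Level skip: h${prev} followed by ${h.tag} ("${h.text}")`);
    }
    prev = h.level;
  }
  return issues;
}

console.log(findSkippedLevels(headings)); // an empty array means no skips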
Step 4: Verify viewport
Extract headings at desktop viewport size (1920x1080 is standard). Mobile-first designs might hide or reorder content at different sizes.
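In Puppeteer this is one line, set before navigating (assuming the page object from the extraction script):
// Emulate a desktop viewport so responsive layouts render desktop headings
await page.setViewport({ width: 1920, height: 1080 });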
Common Heading Problems in SPAs
Once you're extracting correctly, you'll likely find issues you didn't know existed:
Problem 1: Multiple H1s
React components often contain their own H1s. When you compose them, you accidentally create multiple H1s on a single page.
❌ What you see in components:
// Header.jsx
<h1>Site Title</h1>
// Hero.jsx
<h1>Page Title</h1>
❌ What renders: Two H1s on the same page - confusing for LLMs and bad for SEO.
✅ The fix: Make heading levels props:
// Hero.jsx
<h1>{props.pageTitle}</h1>
// Header.jsx (site title should be H2 or div)
<div className="site-title">Site Title</div>
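Taking "heading levels as props" literally, a small reusable component lets the page, not the component, decide the level; a minimal sketch with illustrative names:
// Heading.jsx - renders h1-h6 based on a level prop
export function Heading({ level = 2, children }) {
  const Tag = `h${Math.min(Math.max(level, 1), 6)}`; // clamp to valid tags
  return <Tag>{children}</Tag>;
}

// Hero.jsx - the page declares that it owns the H1
<Heading level={1}>{props.pageTitle}</Heading>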
Problem 2: Heading hierarchy breaks
Components render independently, breaking logical hierarchy.
❌ What renders:
<h1>API Monitoring</h1>
<h3>Real-Time Alerts</h3> <!-- H2 skipped -->
<h2>Pricing</h2> <!-- Back to H2? -->
LLMs expect strict hierarchy. Skipping levels signals poor structure.
✅ The fix: Audit rendered output, not component files. Reorder or change heading levels to maintain strict hierarchy (H1 → H2 → H3, never skip).
Problem 3: Loading states leak
You extract too early and capture placeholder text.
❌ What you extract:
<h1>Loading...</h1>
<h2>Please wait</h2>
This isn't your real content. It's what renders before data loads.
✅ The fix: Wait for network idle before extraction. Or check for specific rendered content (if H1 text is "Loading...", wait longer).
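That content check is straightforward with Puppeteer's waitForFunction; a sketch where the placeholder text and timeout are assumptions about your app:
// Wait until the H1 exists and is no longer the loading placeholder
await page.waitForFunction(() => {
  const h1 = document.querySelector('h1');
  return h1 && h1.textContent.trim() !== 'Loading...';
}, { timeout: 10000 });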
Problem 4: Conditional content not captured
Your SPA shows different content to logged-in users. SEO tools (and LLMs) see the logged-out version.
This is actually correct - you want to extract the public view. But it means your heading structure might differ from what power users see.
Document which headings are public vs. authenticated, and optimize the public view for LLMs.
Testing Your Heading Structure
Here's how to verify you're extracting correctly:
Test 1: Compare tool output to live site
Open your site in Chrome. Right-click → Inspect. Use the Elements panel to manually find all H1-H6 tags.
Compare this to what your SEO tool reports. If they differ, your tool isn't rendering JavaScript.
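To skip the manual hunt, paste this into the DevTools console; it lists every rendered heading in document order:
// Logs rendered headings, e.g. ["H1: API Monitoring", "H2: Real-Time Alerts"]
Array.from(document.querySelectorAll('h1, h2, h3, h4, h5, h6'))
  .map(el => `${el.tagName}: ${el.textContent.trim()}`)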
Test 2: Check Google's rendered view
Google Search Console → URL Inspection → View Crawled Page → HTML
This shows what Google actually sees. If it matches your live site but not your SEO tool, your tool is the problem.
Test 3: Run a headless browser yourself
Quick Puppeteer test:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://yoursite.com/page', {
    waitUntil: 'networkidle0'
  });
  const headings = await page.evaluate(() => {
    const elements = document.querySelectorAll('h1, h2, h3, h4, h5, h6');
    return Array.from(elements).map(el => ({
      tag: el.tagName,
      text: el.textContent.trim()
    }));
  });
  console.log(headings);
  await browser.close();
})();
This shows your actual rendered headings. Compare to your SEO audit.
Tools That Actually Work
If you're auditing SPAs, use tools that explicitly support JavaScript rendering:
Option 1: Content LLM Analyzer
Built specifically for this problem. Renders your page with Puppeteer, extracts rendered DOM, analyzes heading structure.
Shows you:
- Actual rendered headings (after JavaScript execution)
- Hierarchy validation (whether you skip levels)
- Content-to-heading ratio
- Heading parallelism check
Option 2: Screaming Frog (with JavaScript rendering enabled)
Configuration → Spider → Rendering → JavaScript
Note: Slower than normal crawls (it has to render each page), but accurate.
Option 3: Google's Rich Results Test
Not a full audit tool, but shows you exactly what Google renders. Good for spot-checking critical pages.
Option 4: Build your own
If you're technical, a simple Puppeteer script (like above) can extract headings across your site. Store results in a database, diff against previous runs to catch regressions.
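A minimal sketch of what that could look like; the URLs and output filename are placeholders, and a JSON file stands in for the database:
const puppeteer = require('puppeteer');
const fs = require('fs');

const urls = [
  'https://yoursite.com/',
  'https://yoursite.com/features',
  'https://yoursite.com/pricing',
];

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const results = {};
  for (const url of urls) {
    await page.goto(url, { waitUntil: 'networkidle0' });
    results[url] = await page.evaluate(() =>
      Array.from(document.querySelectorAll('h1, h2, h3, h4, h5, h6'))
        .map(el => `${el.tagName}: ${el.textContent.trim()}`)
    );
  }
  await browser.close();

  // Persist, then diff against the previous run to catch regressions
  fs.writeFileSync('headings.json', JSON.stringify(results, null, 2));
})();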
Why This Matters for AEO
LLMs cite content they can extract clearly. If your headings are wrong (because you're auditing pre-render HTML), you're optimizing ghost content.
Real impact:
A SaaS company using React thought their H1 was "Solutions." Their actual rendered H1 was a dynamic {product.name} that resolved to different products per URL. SEO tool showed one H1. LLMs saw 30+ different H1s across the site - total inconsistency.
They fixed it by:
- Running Puppeteer extraction to see actual headings
- Discovering the dynamic H1 issue
- Restructuring components to use consistent H1 templates
- Re-extracting to verify
Result: Content clarity score increased from 48 to 71. ChatGPT citations went from 0 to 14 over 3 months.
You can't fix heading structure if you can't see it. And if your site is an SPA, traditional tools won't show you what matters.
For more on optimizing JavaScript-rendered content, see "The Modern SEO Guide to JavaScript-Rendered Content". For understanding how LLMs extract and cite content, check out "The Complete Guide to How LLMs Read Your Website".
Start by extracting your actual rendered headings. Use a tool that runs JavaScript, or build a quick Puppeteer script. Compare the output to what your current SEO tool shows. If they differ - and they probably do - you've been optimizing blind.