Building a weekly SEO monitoring stack with AI (canonical, schema, sitemap, CWV)

June 4, 2026
7 min read
Topics: ai-automation, automation, technical-seo, monitoring, seo-automation

Three years ago I spent €450/month on Semrush. Most of what I used it for was monitoring — "has anything broken on my site this week?" The other 90% of the feature set was theatre.

Here's the four-piece monitoring stack I run now. Total cost: about €0 in tools, €2/month in AI API calls. Total time: 15 minutes a week to read alerts. Catches the things that actually matter and ignores the rest.

What you should monitor weekly

Four signals. Everything else is noise at this scale.

  1. Canonical URLs. If your canonicals drift (point somewhere wrong, change unexpectedly), indexing collapses. The localhost canonical disaster was a 4-day detection lag for me — too long.
  2. Schema validation. JSON-LD that worked last week and breaks this week (template change, content type change, escaping bug) silently kills rich results and AI citation eligibility.
  3. Sitemap drift. If your sitemap suddenly lists 30 URLs when it used to list 80, or vice versa — something's wrong. Either a real content change you wanted, or a bug.
  4. Core Web Vitals. LCP, INP, CLS. The actual user experience metrics Google rewards.

Skip: keyword position tracking (volatility makes this nearly useless at small scale), backlink monitoring (low-frequency change), traffic dashboards (you already have GA4).

Piece 1: Canonical URL monitor

This is the most important one because canonical bugs cause the worst damage and the slowest detection.

Approach: list 5–10 representative URLs in a config file. A daily cron job fetches each one, parses the canonical tag, and compares it to the expected value. On drift, it sends a Slack alert.

# canonical-monitor.py
import requests
import re
import json
import sys

EXPECTED = {
    "https://booplex.com/": "https://booplex.com/",
    "https://booplex.com/blog": "https://booplex.com/blog",
    "https://booplex.com/about": "https://booplex.com/about",
    "https://booplex.com/tools/canonical-checker": "https://booplex.com/tools/canonical-checker",
    "https://booplex.com/blog/never-done-learning-forever-tinkering": "https://booplex.com/blog/never-done-learning-forever-tinkering",
}

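def get_canonical(url):
    r = requests.get(url, timeout=15, allow_redirects=True)
    # The original listing is cut off mid-regex; everything from here down is a
    # reconstruction (assumption). It expects rel before href in the <link> tag.
    m = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]*href=["\']([^"\']+)["\']', r.text, re.I)
    return m.group(1) if m else None

drift = {}
for url, expected in EXPECTED.items():
    actual = get_canonical(url)
    if actual != expected:
        drift[url] = {"expected": expected, "actual": actual}

if drift:
    print(json.dumps(drift, indent=2))
    sys.exit(1)  # non-zero exit code triggers the Slack alert
print("ok")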

Cron at 06:00 daily. Pipe output to Slack via webhook if exit code is non-zero.

The same logic is exposed publicly at the canonical URL checker — if you don't want to run your own monitor, you can ad-hoc check any URL there.

Piece 2: Schema validation monitor

Run validation against the same set of URLs once a week. Two options:

Option A: Google's Rich Results Test. It has no public API as of writing, but you can scrape the test page. Brittle but free.

Option B — Local validation with a schema validator library. Pull the JSON-LD from each page, validate against the schema.org spec via a Node/Python validator. More work to set up, fully under your control.

I run option B because the dependencies aren't external. Here's the prompt I give Claude after the script extracts schemas:

You have access to schemas.json (extracted JSON-LD from 5 URLs).
Validate each schema against the Schema.org spec.

For each URL, report:
- Type detected
- Required fields present / missing
- Invalid field values (e.g., wrong date format)
- Deprecated types or fields

Write results to validation-report.json.
If any URL has an error, exit with code 1.
Otherwise exit with code 0.
No prose. JSON only.

Validation runs in ~30 seconds. If any schema fails, the cron job posts the offending URL to Slack.
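
The extraction step that feeds schemas.json isn't shown above. Here is a minimal sketch of it, assuming the JSON-LD lives in script tags of type application/ld+json and using only the standard library. JsonLdExtractor and extract_jsonld are hypothetical names, not the script this post runs.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def extract_jsonld(html):
    """Return (parsed_schemas, parse_errors) for one page's HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    schemas, errors = [], []
    for raw in parser.blocks:
        try:
            schemas.append(json.loads(raw))
        except json.JSONDecodeError as exc:
            # Unparseable JSON-LD (e.g. an escaping bug) is itself a finding.
            errors.append(str(exc))
    return schemas, errors
```

Running extract_jsonld over each monitored page and dumping the results keyed by URL would produce a file shaped like the schemas.json the prompt refers to. Parse errors are worth alerting on directly, before any model is involved.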

Piece 3: Sitemap drift checker

Daily fetch of sitemap.xml. Compare URL count + URL set to yesterday's snapshot. Flag any unexpected shrinkage or growth.

# sitemap-drift.py
import requests
import re
import json
import os
from datetime import date

SITEMAP = "https://booplex.com/sitemap.xml"
SNAPSHOTS = "./sitemap-snapshots/"

resp = requests.get(SITEMAP, timeout=20)
urls = sorted(set(re.findall(r"<loc>(.*?)</loc>", resp.text)))
today = date.today().isoformat()

os.makedirs(SNAPSHOTS, exist_ok=True)
today_file = f"{SNAPSHOTS}{today}.json"
with open(today_file, "w") as f:
    json.dump(urls, f, indent=2)

# Compare to yesterday
yesterday_file = sorted(os.listdir(SNAPSHOTS))[-2] if len(os.listdir(SNAPSHOTS)) > 1 else None
if yesterday_file:
    with open(f"{SNAPSHOTS}{yesterday_file}") as f:
        prev = set(json.load(f))
    curr = set(urls)
    added = curr - prev
    removed = prev - curr
    if removed or len(curr) < len(prev) - 5:
        print(json.dumps({"added": list(added), "removed": list(removed)}, indent=2))
        exit(1)
print("ok")

The trigger threshold ("removed any URL" or "shrunk by more than 5") is tunable. For Booplex's slow cadence, even one removed URL deserves a check.
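
One way to make that threshold explicit is to pull the comparison into a function with named limits. This is a sketch, not the script above: sitemap_drift, max_removed and max_added are hypothetical names, and the optional growth limit is an addition, since the script as written only alerts on removals.

```python
def sitemap_drift(prev_urls, curr_urls, max_removed=0, max_added=None):
    """Compare two sitemap snapshots and decide whether to alert.

    max_removed: how many removed URLs are tolerated (0 = alert on any removal).
    max_added:   how many new URLs are tolerated (None = never alert on growth).
    """
    prev, curr = set(prev_urls), set(curr_urls)
    added = sorted(curr - prev)
    removed = sorted(prev - curr)

    alert = len(removed) > max_removed
    if max_added is not None and len(added) > max_added:
        alert = True
    return {"alert": alert, "added": added, "removed": removed}
```

With the defaults this matches the "even one removed URL deserves a check" policy. A site that publishes often might raise max_removed and set max_added to catch a sitemap that suddenly balloons.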

Piece 4: Core Web Vitals monitor

Use Google's PageSpeed Insights API. Free, generous quota, returns lab LCP and CLS for any URL, plus field data (including INP) when the URL has enough Chrome UX Report traffic.

I run this weekly (not daily — CWV doesn't fluctuate that fast). Tracks 5 URLs against mobile + desktop, logs results to a CSV, alerts if any metric falls into the "poor" band (LCP > 4s, INP > 500ms, CLS > 0.25).

# cwv-monitor.py
import requests
import json
import csv
import os
from datetime import date

API_KEY = os.environ["PSI_KEY"]
URLS = [
    "https://booplex.com/",
    "https://booplex.com/blog",
    "https://booplex.com/about",
    "https://booplex.com/tools/canonical-checker",
    "https://booplex.com/blog/never-done-learning-forever-tinkering",
]
STRATEGIES = ["mobile", "desktop"]

results = []
for url in URLS:
    for strategy in STRATEGIES:
        # Pass the query via params so requests URL-encodes the page URL.
        data = requests.get(
            "https://www.googleapis.com/pagespeedonline/v5/runPagespeed",
            params={"url": url, "strategy": strategy, "key": API_KEY},
            timeout=60,
        ).json()
        metrics = data["lighthouseResult"]["audits"]
        lcp = metrics["largest-contentful-paint"]["numericValue"]
        # INP is a field metric, so the lab audits usually lack it and this stays None.
        inp = metrics.get("interaction-to-next-paint", {}).get("numericValue", None)
        cls = metrics["cumulative-layout-shift"]["numericValue"]
        results.append([date.today().isoformat(), url, strategy, lcp, inp, cls])

with open("cwv-log.csv", "a", newline="") as f:
    csv.writer(f).writerows(results)

problems = [r for r in results if r[3] > 4000 or (r[4] and r[4] > 500) or r[5] > 0.25]
if problems:
    print(json.dumps(problems, indent=2))
    exit(1)
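
The script only alerts on the "poor" band. If you also want the CSV to show when a metric slips from "good" into "needs improvement", a small classifier using Google's published thresholds could look like the sketch below. THRESHOLDS and classify are hypothetical names, and the metric keys are illustrative.

```python
# (good_upper_bound, poor_lower_bound) per metric, per Google's CWV thresholds.
THRESHOLDS = {
    "lcp_ms": (2500, 4000),
    "inp_ms": (200, 500),
    "cls": (0.1, 0.25),
}


def classify(metric, value):
    """Map a metric value to 'good', 'needs-improvement', 'poor' or 'unknown'."""
    if value is None:
        # e.g. INP missing because no field data was returned
        return "unknown"
    good, poor = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs-improvement"
    return "poor"
```

Logging the band next to the raw number makes a slow regression visible weeks before it crosses the alert line.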

Stitching them together

Each script lives in ~/seo-monitor/. crontab:

0 6 * * *  cd ~/seo-monitor && python canonical-monitor.py || ./alert.sh "Canonical drift"
0 7 * * *  cd ~/seo-monitor && python sitemap-drift.py || ./alert.sh "Sitemap drift"
0 7 * * 1  cd ~/seo-monitor && python schema-validate.py || ./alert.sh "Schema validation"
0 8 * * 1  cd ~/seo-monitor && python cwv-monitor.py || ./alert.sh "CWV regression"

alert.sh is a one-liner that posts to a Slack webhook. Total cron config: 4 lines. Total runtime per day: under 90 seconds.
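
If you would rather keep the alerting in Python than in a shell one-liner, here is a sketch of the same idea using only the standard library. It assumes a Slack incoming webhook, which accepts a JSON body with a "text" field. SLACK_WEBHOOK_URL, build_alert and send_alert are hypothetical names.

```python
import json
import os
import urllib.request


def build_alert(title, details, limit=3000):
    """Build the Slack payload, truncating long script output."""
    body = details.strip()
    if len(body) > limit:
        body = body[:limit] + "\n[truncated]"
    text = f"*{title}*"
    if body:
        text += "\n" + body
    return {"text": text}


def send_alert(title, details, webhook_url=None):
    """POST the alert to a Slack incoming webhook and return the HTTP status."""
    # SLACK_WEBHOOK_URL is an assumed environment variable name.
    url = webhook_url or os.environ["SLACK_WEBHOOK_URL"]
    request = urllib.request.Request(
        url,
        data=json.dumps(build_alert(title, details)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A wrapper would capture each monitor's stdout and call send_alert("Canonical drift", output) when the exit code is non-zero, so the alert carries the JSON diff rather than just a title.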

What this catches in practice

Real alerts from the past 6 months on Booplex:

  • Canonical drift (1 alert). The original localhost disaster. Caught on Day 2 by the next iteration of the monitor (the first version had a bug; ironic).
  • Schema validation fail (3 alerts). Each time a Lexical editor change shipped a new content type without matching schema. All caught within 24 hours of deploy.
  • Sitemap drift (1 alert). Accidentally noindex'd a category page — sitemap auto-excluded it. Reverted within an hour.
  • CWV regression (0 alerts). Static site, hasn't moved.

Five alerts in six months. Zero false positives. Each one prevented something that would have hurt indexing or rich results.

Where this falls short

Three real gaps:

1. Coverage by sampling. The 5–10 URLs you monitor catch most issues but miss site-wide problems if your sample doesn't include the affected page. Mitigation: rotate the sample monthly, or add a one-time "every URL" check quarterly.

2. False sense of completeness. Monitoring catches drift from known good state. It doesn't catch problems you've never noticed before. The localhost canonical bug was actually present for 11 days before this stack would have caught it — the stack monitors what the canonical is, not what it should be.

Mitigation: combine with periodic full audits.

3. No competitive signal. This stack monitors your site. It doesn't tell you if a competitor's redesign is eating your SERPs. For that, you still need rank tracking.

I gave up rank tracking at the small-site stage — but you might not.
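
For the first gap, rotating the sample monthly can be done deterministically, so every run within a month checks the same URLs and the set changes when the month does. This is a sketch under assumptions: monthly_sample is a hypothetical helper, the URL list would come from the latest sitemap snapshot, and pinned holds pages you always want checked.

```python
import hashlib
from datetime import date


def monthly_sample(urls, size=5, pinned=(), today=None):
    """Pick `size` URLs: pinned ones first, the rest rotated by calendar month."""
    today = today or date.today()
    salt = f"{today.year}-{today.month:02d}"

    def rank(url):
        # Hashing with the month as salt reshuffles the order each month
        # while keeping it stable for every run inside that month.
        return hashlib.sha256(f"{salt}:{url}".encode("utf-8")).hexdigest()

    pinned = list(pinned)
    pool = sorted((u for u in set(urls) if u not in pinned), key=rank)
    return pinned + pool[: max(0, size - len(pinned))]
```

The sampled pages no longer have hand-written expected values, so a monitor using this would need a rule instead, for example "the canonical must equal the URL itself".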

The €0 vs €450/month tradeoff

| What you get | This stack | Semrush Pro |
|---|---|---|
| Canonical monitoring | Yes | Yes (Site Audit) |
| Schema validation | Yes | Yes |
| Sitemap drift | Yes | Yes |
| CWV tracking | Yes | Yes |
| Rank tracking | No | Yes |
| Backlink monitoring | No | Yes |
| Competitor analysis | No | Yes |
| Content audits | No | Yes |
| Monthly cost | ~€2 in API | €450 |

What you're giving up: the SaaS UX, the breadth, and the comfort of a single dashboard. What you're saving: about €450/month.

If you're an agency tracking 50 client sites, the SaaS still wins. For a small site or a solo operator with a handful of sites, the custom stack wins.

The next iteration

What I'm planning to add over the next quarter:

  • Per-route 404 monitor (catch broken internal links faster than the link-audit cadence does)
  • llms.txt drift check (catch when a deploy strips out the file or breaks the format)
  • IndexNow ping verification (catch when IndexNow fails silently — happens more than you'd think)

Each adds 30 lines of code. None require a SaaS.
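
As an illustration of how small these can be, here is a sketch of the core of an llms.txt drift check. It is a guess at the shape, not the planned implementation: check_llms_txt is a hypothetical function, and it only covers "file missing, empty, or changed", not format validation. The caller is assumed to fetch the file and store the returned hash between runs.

```python
import hashlib


def check_llms_txt(status_code, body, previous_hash=None):
    """Return (problems, content_hash) for one fetch of /llms.txt."""
    problems = []
    if status_code != 200:
        # A deploy that strips the file typically shows up as a 404 here.
        problems.append(f"unexpected status {status_code}")
        return problems, previous_hash

    if not body.strip():
        problems.append("file is empty")

    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if previous_hash is not None and digest != previous_hash:
        problems.append("content changed since last snapshot")
    return problems, digest
```

A non-empty problems list would map to exit code 1, which fits the same cron and alert pattern as the four existing scripts.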

FAQ

How do I monitor SEO without Semrush?

Build a 4-piece stack: canonical drift, schema validation, sitemap drift, Core Web Vitals. Each is a 30-line script + a Slack webhook. Total cost near zero.

What's the minimum SEO monitoring for a small site?

Canonical drift + sitemap drift. Those two catch ~80% of indexing-killing bugs. The other two pieces (schema, CWV) are higher-value but slower-moving.

How often should I run canonical URL monitoring?

Daily. The cost of catching a canonical bug 12 hours after deploy is much lower than catching it 7 days after deploy.

Can I monitor SEO with just Google Search Console?

Partially. GSC catches indexing problems and CWV regressions, but with a 3–7 day lag. For faster detection, supplement with active monitoring like the stack above.

What does the AI part actually do in this stack?

Two places. First, Claude generates the schema validation reports (analyzing extracted JSON-LD against the spec). Second, when alerts fire, Claude can be prompted with the alert payload to triage — "is this a real problem or noise?" That triage saves time on the human review.

Is this overkill for a personal site?

If your site earns its keep (pulls traffic, builds a brand, drives inquiries), the monitoring is worth 30 minutes of setup. If your site is dormant, skip it.
