Smyths scraper — hourly run

A Linux container wakes every hour, scrapes 40 Smyths Toys product pages, diffs against the previous run, and commits the results to this repo. This page is rendered from those commits.

What this is

This is a small, single-purpose product watcher. Every hour it pulls the same 40 Smyths Toys product pages, parses each one into structured fields (name, price, availability, image), and compares against the previous run's snapshot.

If anything moved — a price drop, a stockout, a renamed product — that change shows up in the feed below within an hour of it happening on the live site. If a run fails, the failure itself becomes a row in the timeline, with the stderr tail attached.

There is no database and no server. The scraper writes JSON files into this repository; GitHub Pages serves the dashboard; the browser does the rendering. View source on this page — the fetch paths are the data contract.
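Because the JSON files are the whole contract, a 24-hour summary can be derived in the browser from the timeline rows alone. The sketch below is a minimal illustration, not the page's actual script: the row fields `startedAt` and `ok` are assumed names, and the real `data/timeline.json` schema may differ.

```typescript
// Hypothetical row shape; the real data/timeline.json schema may differ.
interface TimelineRow {
  startedAt: string; // ISO-8601 timestamp (assumed field name)
  ok: boolean; // true when the run succeeded (assumed field name)
}

interface DayStats {
  runs: number;
  successRate: number | null; // null when the window contains no runs
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Count the runs in the 24 hours before `now` and the share that succeeded.
function statsLast24h(rows: TimelineRow[], now: Date): DayStats {
  const end = now.getTime();
  const start = end - DAY_MS;
  const recent = rows.filter((row) => {
    const t = Date.parse(row.startedAt);
    return t > start && t <= end;
  });
  if (recent.length === 0) return { runs: 0, successRate: null };
  const okCount = recent.filter((row) => row.ok).length;
  return { runs: recent.length, successRate: okCount / recent.length };
}

// In the browser the rows would come from the committed file, for example:
//   const rows = await (await fetch("data/timeline.json")).json();
```

Returning `null` instead of `0` for an empty window lets the page show a placeholder rather than a misleading 0% rate.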

How it works

A cron tick fires every 15 minutes: once an hour it starts the Docker container, and on the ticks in between it ingests finished results. A Claude Code agent reviews health and repairs code on failure.
Container boots: Tor daemon up, Xvfb virtual display up, then a self-check verifies Tor control auth + NEWNYM works (refuses to start if not), then patchright launches a real headed Chrome.
One warm session against the Smyths homepage to clear Imperva's initial challenge and pick up cookies.
For each of the 40 product URLs: fetch in-page, extract JSON-LD + DOM fallbacks, normalise to a flat record.
If an exit hits an hCaptcha doodle, a free vision-LLM solver (Gemini → OpenRouter → Groq → Nvidia) reads the legend, locates the targets, and clicks in order with human-like jitter.
On consecutive failures, send SIGNAL NEWNYM to Tor to rotate the exit circuit, then retry.
Diff this run's records against the previous run's snapshot → write data/latest-changes.json.
Append one row to data/timeline.json, refresh data/run-summary.json, commit + push.
GitHub Pages picks up the commit; next page load (or the 5-minute auto-refresh) shows the new run.
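The extraction step in the list above (JSON-LD first, then normalise to a flat record) might look roughly like this. It is a sketch under assumptions: the record shape, the `extractProduct` name, and the presence of `sku` and `offers` in Smyths' JSON-LD are not confirmed by anything on this page, and the DOM fallbacks are left out.

```typescript
// Flat record shape is an assumption; the text above only names
// name, price, availability and image.
interface ProductRecord {
  sku: string | null;
  name: string | null;
  price: number | null;
  currency: string | null;
  inStock: boolean | null;
  image: string | null;
}

type Json = Record<string, unknown>;

const isObject = (v: unknown): v is Json =>
  typeof v === "object" && v !== null && !Array.isArray(v);

const asString = (v: unknown): string | null =>
  typeof v === "string" && v.length > 0 ? v : null;

// Return the first schema.org Product found in any JSON-LD <script> block.
function extractProduct(html: string): ProductRecord | null {
  const re =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let m: RegExpExecArray | null;
  while ((m = re.exec(html)) !== null) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(m[1]);
    } catch {
      continue; // malformed block: skip it, a DOM fallback would take over
    }
    const nodes: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
    for (const node of nodes) {
      if (!isObject(node) || node["@type"] !== "Product") continue;
      const rawOffer = Array.isArray(node["offers"]) ? node["offers"][0] : node["offers"];
      const offer: Json = isObject(rawOffer) ? rawOffer : {};
      const price = Number(offer["price"]);
      const availability = asString(offer["availability"]);
      const image = Array.isArray(node["image"]) ? node["image"][0] : node["image"];
      return {
        sku: asString(node["sku"]),
        name: asString(node["name"]),
        price: offer["price"] != null && Number.isFinite(price) ? price : null,
        currency: asString(offer["priceCurrency"]),
        // schema.org availability is a URL such as https://schema.org/InStock
        inStock: availability ? availability.split("/").pop() === "InStock" : null,
        image: asString(image),
      };
    }
  }
  return null;
}
```

Every field is nullable so that a partially filled record can still be diffed; a missing value then shows up as a change instead of crashing the run.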

Live timeline

Reverse-chronological. Newest run at top. Click any row to see captcha counts, Tor rotations, duration, and (on failures) the stderr tail.

Last detected changes

Field-level diff from the most recent run. Price moves are colour-coded by direction (down or up).

Compared against what? Each run is diffed against the immediately previous run's snapshot — the product records stored in this repo at data/runs/<previous-timestamp>.ndjson. Matching is by product SKU. The very first run has nothing to compare against, so every product counts as new; from the second run onward you see only what actually moved (price, in-stock, name, image, category).
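A minimal sketch of that comparison, assuming each NDJSON line is one product object keyed by a `sku` field and that the watched fields are named `price`, `inStock`, `name`, `image` and `category`. The function names and field names are illustrative, not taken from the scraper.

```typescript
interface Product {
  sku: string;
  [field: string]: unknown;
}

interface FieldChange {
  sku: string;
  field: string;
  before: unknown;
  after: unknown;
}

interface RunDiff {
  added: string[]; // SKUs with no match in the previous snapshot
  changed: FieldChange[];
}

// Assumed field names for the five watched attributes.
const WATCHED_FIELDS = ["price", "inStock", "name", "image", "category"];

// NDJSON: one JSON object per non-empty line.
function parseNdjson(text: string): Product[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Product);
}

// Match by SKU, then compare only the watched fields.
function diffRuns(previous: Product[], current: Product[]): RunDiff {
  const before = new Map<string, Product>();
  for (const p of previous) before.set(p.sku, p);

  const diff: RunDiff = { added: [], changed: [] };
  for (const product of current) {
    const old = before.get(product.sku);
    if (old === undefined) {
      diff.added.push(product.sku);
      continue;
    }
    for (const field of WATCHED_FIELDS) {
      if (old[field] !== product[field]) {
        diff.changed.push({
          sku: product.sku,
          field,
          before: old[field],
          after: product[field],
        });
      }
    }
  }
  return diff;
}
```

With an empty previous snapshot every SKU lands in `added`, which matches the first-run behaviour described above.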

Captchas the solver saw

When a Tor exit gets challenged, Smyths shows an hCaptcha “select in order” doodle puzzle. The free vision-LLM solver reads the legend, locates each target, and clicks in order. Below is what it actually faced this run — the coloured rings are where the solver clicked — with the verdict.

Issues we hit + how we solved them

Real problems from building + running this, newest first. The Tor-rotation bug below was found and fixed on this very deployment — the failed run on the timeline is it.

In the container, Tor exit rotation silently did nothing, so the first container run got stuck on a flagged exit and failed. → The Dockerfile had baked in a control-port password hash that didn't match the plaintext the app sends (tor --hash-password uses a random salt on each call). Every SIGNAL NEWNYM failed auth (515), so the scraper could never leave a bad exit. Three-part fix: the hash is now generated at build time from the real password, so it always matches; rotateExit() now parses Tor's reply and only reports success on a real 250 OK (it used to report success even on auth failure); and the container runs a boot-time self-check that refuses to start if rotation auth is broken.
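The reply-parsing part of that fix can be sketched as follows. This is not the real rotateExit(); the helper names are hypothetical, and the socket I/O to the control port is left out so that only the check itself is shown. It relies on the Tor control protocol answering each accepted command with a line starting `250 ` and a failed AUTHENTICATE with `515`.

```typescript
// Build the command sequence sent over the control port.
// The password is sent as a quoted string, so backslashes and quotes are escaped.
function buildRotateCommands(password: string): string {
  const quoted = password.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `AUTHENTICATE "${quoted}"\r\nSIGNAL NEWNYM\r\nQUIT\r\n`;
}

// Report success only when BOTH the AUTHENTICATE and the SIGNAL NEWNYM
// replies are "250 ..." lines. A 515 on the first line (bad auth) or a
// connection that closes before the second reply counts as failure.
function isRotationConfirmed(reply: string): boolean {
  const lines = reply.split("\r\n").filter((line) => line !== "");
  const [auth, signal] = lines;
  return (
    auth !== undefined &&
    signal !== undefined &&
    auth.startsWith("250 ") &&
    signal.startsWith("250 ")
  );
}
```

The same predicate can serve as the boot-time self-check: run one rotation at startup and exit non-zero when it returns false.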

Free vision providers (Gemini, Groq) returned 503/429 under load, so the captcha solver couldn't read the puzzle. → Providers are chained (Gemini native → Gemini → OpenRouter Qwen3-VL → Groq → Nvidia) and rotate on rate-limit/5xx. When all free tiers are saturated, the lane rotates the Tor exit instead; a fresh exit often isn't challenged at all. This is transient overload that the run recovers from, not a code bug.
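One way to express that fall-through, as a sketch only: the `VisionProvider` interface, the `status` property on thrown errors, and the `null` return that tells the caller to rotate the exit are all assumptions made for illustration.

```typescript
// Hypothetical provider interface; real client code will differ per vendor.
interface VisionProvider {
  name: string;
  solve(imageBase64: string): Promise<string>;
}

// Read an HTTP-like status from a thrown value, if it carries one.
function statusOf(err: unknown): number | null {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : null;
  }
  return null;
}

// Rate limits (429) and server errors (5xx) mean "try the next provider".
function isRetryable(err: unknown): boolean {
  const status = statusOf(err);
  return status !== null && (status === 429 || status >= 500);
}

// Try providers in order. Returns null when every provider is saturated,
// which is the caller's cue to rotate the Tor exit instead.
async function solveWithFallback(
  providers: VisionProvider[],
  imageBase64: string
): Promise<{ provider: string; answer: string } | null> {
  for (const provider of providers) {
    try {
      const answer = await provider.solve(imageBase64);
      return { provider: provider.name, answer };
    } catch (err) {
      if (!isRetryable(err)) throw err; // real bugs should surface, not be skipped
    }
  }
  return null;
}
```

Rethrowing non-retryable errors keeps a bad API key or a malformed request from being mistaken for transient overload.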

Vanilla Playwright + stealth was blocked instantly by Imperva. → Switched to patchright, a patched fork of Playwright that strips the CDP automation signals Imperva fingerprints.

Headless mode was detected even with stealth. → Run a real headed Chrome inside Xvfb (a virtual display) in the container. Nothing is drawn to a physical screen, but Chrome behaves as if it had one; this is what lets it run in a Linux container with no GUI.

A single exit IP got rate-limited within a handful of requests. → Local Tor instance; on consecutive failures the runner sends SIGNAL NEWNYM over the control port to rotate the exit circuit (free, instant).
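The "consecutive failures" trigger can be kept as a small counter that resets on any success. The sketch below is illustrative: the threshold value is not stated anywhere on this page, so it is a parameter, and `createStreakTracker` is a made-up name.

```typescript
// Tracks consecutive failures and says when the caller should rotate the exit.
function createStreakTracker(threshold: number) {
  let streak = 0;
  return {
    // Record one fetch outcome; returns true when a rotation is due.
    record(ok: boolean): boolean {
      if (ok) {
        streak = 0; // any success breaks the streak
        return false;
      }
      streak += 1;
      if (streak >= threshold) {
        streak = 0; // start counting afresh on the new circuit
        return true;
      }
      return false;
    },
  };
}
```

A runner would call `record()` after each product fetch and send SIGNAL NEWNYM whenever it returns true, then retry the failed URL.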

hCaptcha doodle puzzles appeared on some exits. → Free vision-LLM solver reads the legend, locates each target doodle as normalised 0–1000 coordinates, and clicks them in order with human-like timing jitter. See the gallery above for live examples + verdicts.
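Turning a normalised 0–1000 point into a click position is a linear mapping onto the screenshotted region. The sketch below assumes the coordinates are relative to the bounding box of the captured captcha area in CSS pixels; that framing, the function names, and the jitter radius are assumptions rather than details from the solver.

```typescript
interface Point {
  x: number;
  y: number;
}

interface Box {
  x: number; // left edge in CSS pixels
  y: number; // top edge in CSS pixels
  width: number;
  height: number;
}

// Map a 0..1000 normalised point onto the box, clamping stray values.
function toCssPoint(norm: Point, box: Box): Point {
  const clamp = (v: number) => Math.min(1000, Math.max(0, v));
  return {
    x: box.x + (clamp(norm.x) / 1000) * box.width,
    y: box.y + (clamp(norm.y) / 1000) * box.height,
  };
}

// Offset a point by up to radiusPx on each axis so repeated clicks never
// land on the exact same pixel. `rand` is injectable for deterministic tests.
function jitter(p: Point, radiusPx: number, rand: () => number = Math.random): Point {
  return {
    x: p.x + (rand() * 2 - 1) * radiusPx,
    y: p.y + (rand() * 2 - 1) * radiusPx,
  };
}
```

For example, the normalised point (500, 250) inside a 400×200 box whose top-left corner is at (100, 50) maps to (300, 100).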

Retina screenshots came back at 2× the page coords, so clicks landed at half-coordinates. → Pass scale: 'css' to page.screenshot() so the solver and the page share one coordinate system.

Transient apt-get failure during the Docker build (mirror hiccup). → The Dockerfile is idempotent; a plain rerun cleared it. No source change required.

Self-healing

Two layers: fast automatic recovery inside each run, and an agent loop that monitors and repairs the code itself.

In-run, automatic

Tor exit rotation — on consecutive failures or a flagged exit, SIGNAL NEWNYM picks a fresh circuit.
Vision-provider rotation — on rate-limit/5xx the solver falls through Gemini → OpenRouter → Groq → Nvidia.
Boot self-check — the container refuses to start if Tor control auth is broken (the exact bug that caused the first failed run).
Failure → timeline — a non-zero exit, zero successes, or all-blocked becomes a timeline row with the stderr tail + a classified cause.
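The failure-to-timeline rule in the last item could be sketched like this. The outcome fields, the cause labels, and the 20-line tail are placeholders chosen for the example; the scraper's real classification is not shown on this page.

```typescript
// Hypothetical summary of one finished run.
interface RunOutcome {
  exitCode: number;
  attempted: number; // product URLs tried
  succeeded: number; // product URLs parsed successfully
  blocked: number; // product URLs that ended on a block or challenge
  stderr: string;
}

type FailureCause = "non-zero-exit" | "all-blocked" | "zero-successes";

// Return a cause when the run counts as failed, or null when it is healthy.
// "all-blocked" is checked before "zero-successes" because it is the more
// specific of the two.
function classifyFailure(outcome: RunOutcome): FailureCause | null {
  if (outcome.exitCode !== 0) return "non-zero-exit";
  if (outcome.attempted > 0 && outcome.blocked === outcome.attempted) return "all-blocked";
  if (outcome.succeeded === 0) return "zero-successes";
  return null;
}

// Keep only the last few stderr lines for the timeline row.
function stderrTail(stderr: string, maxLines = 20): string {
  const lines = stderr.split("\n");
  if (lines.length > 0 && lines[lines.length - 1] === "") lines.pop();
  return lines.slice(-maxLines).join("\n");
}
```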

Monitor + agent repair

A cron tick runs every 15 minutes (zero LLM cost): it ingests finished runs, diffs them, updates this site, kicks the hourly scrape when due, kills stuck containers, and prunes old artifacts.
On top of that a Claude Code agent reviews health and, when a run fails for a new reason (e.g. the Tor-515 bug, or Smyths changing their markup), it diagnoses the cause from the logs, patches the scraper code, rebuilds the Docker image, and pushes the fix to the smyths-scraper repo — the next run picks it up.
The Tor-515 fix described above was done exactly this way: detected, diagnosed, patched, pushed. Every code change lands as a normal commit you can review.
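The scheduling decisions made by the 15-minute tick need no LLM and fit in a pure function. The sketch below is an assumption-heavy illustration: the state fields and action names are invented, and the 30-minute stuck threshold is a placeholder, since the real timeout is not stated here.

```typescript
// Hypothetical monitor state, with times in epoch milliseconds.
interface MonitorState {
  lastScrapeStartedAt: number | null; // null when no scrape has ever started
  runningSince: number | null; // start time of the live container, if any
}

type TickAction = "kill-stuck-container" | "start-scrape";

const HOUR_MS = 60 * 60 * 1000;
const STUCK_AFTER_MS = 30 * 60 * 1000; // placeholder threshold, not from the source

// Decide what one cron tick should do, given the current state.
function planTick(state: MonitorState, now: number): TickAction[] {
  const actions: TickAction[] = [];
  let running = state.runningSince !== null;

  // A container that has outlived the threshold is treated as stuck.
  if (state.runningSince !== null && now - state.runningSince > STUCK_AFTER_MS) {
    actions.push("kill-stuck-container");
    running = false;
  }

  // Start the hourly scrape when it is due and nothing is still running.
  const due =
    state.lastScrapeStartedAt === null || now - state.lastScrapeStartedAt >= HOUR_MS;
  if (due && !running) actions.push("start-scrape");

  return actions;
}
```

Keeping the decision separate from the side effects (docker calls, git pushes) makes the tick easy to test and cheap to run every 15 minutes.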

Recent failure / heal events

No failures recorded yet.

Repos & lineage

philipposk/smyths-scraper (private)

The scraper itself: patchright runner, Tor wiring, vision-LLM captcha solver, Dockerfile, hourly cron unit. Request access from Philippos.

philipposk/3.6x7.gr (public)

This site. Every hour the scraper commits new JSON files here; GitHub Pages serves them; this page renders them client-side.

1.6x7.gr (previous)

The one-shot precursor: a single backfill of 58,605 UK Smyths products. This project is the recurring-watcher follow-up, deliberately scoped to 40 URLs to prove robustness + diffing.

hydrascrape (downstream)

The captcha solver here is mirrored into the hydrascrape codebase on Tamas & Amine's side as a drop-in module.