salesai | 1 June 2026 | EN

Web enrichment with Playwright and browser-use — what we pull from a lead URL and how

Nortinia Sales AI enrichment uses Playwright + browser-use to pull data from lead URLs. This post covers what we extract, how we do it, why we never solve CAPTCHAs, and the March rotating-IP bug.

Web enrichment with Playwright + browser-use

The deepest and riskiest component of Nortinia Sales AI is enrichment-svc, a Python FastAPI service that uses Playwright and the browser-use library to pull data from lead URLs. This piece covers exactly what it does, why we built it this way, and what we broke in March.

Why Playwright and not requests + BeautifulSoup

80% of modern B2B sites are JavaScript-heavy. A plain HTTP GET returns an empty HTML shell in most cases — the actual content shows up only after hydration. We tried requests first — one in four sites was missing the hero copy, the About page was empty, and tech-stack detection failed because the tracking scripts never ran.

Playwright with headless Chromium executes the page's JavaScript, waits for network idle, and only then reads the rendered DOM. It is 3-8x slower than a plain GET, but the data is usable.
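The difference between the two approaches can be sketched in a few lines. This is an illustration under stated assumptions, not the enrichment-svc code: `visible_text` and `render_html` are hypothetical helper names, and the 30-second timeout is an assumed default. Playwright is imported inside the function so that the text helper runs even where Playwright is not installed.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-readable text of an HTML document."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def render_html(url: str, timeout_ms: int = 30_000) -> str:
    """Load the page in headless Chromium and return the hydrated DOM."""
    # Lazy import: only needed when a real browser render is requested.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has settled.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

On a client-rendered site, `visible_text` applied to the raw GET response tends to return little or nothing, while the same call on the output of `render_html(url)` returns the hydrated copy.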

What browser-use brings

browser-use is an LLM-assisted browser-control library. Classic Playwright scripts rely on brittle selectors (.hero-text, #about-section). They break weekly because every tenant redesigns something.

browser-use instead takes a high-level instruction ("extract an About section if present") and uses an LLM to find it on the page. In our experience this is 4x more resilient to layout changes, and onboarding a new tenant does not need a selector map.

Cost side-effect: every enrichment fires 2-4 LLM completions. Cost monitoring is mandatory.
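One way to make that monitoring concrete is a small per-enrichment meter that records every LLM completion and flags a run that exceeds a ceiling. This is a minimal sketch, not the actual monitoring stack: `EnrichmentCostMeter`, the labels, and the 0.05 USD ceiling are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichmentCostMeter:
    """Tracks LLM spend for a single company enrichment."""

    budget_usd: float = 0.05  # hypothetical per-company ceiling
    calls: list = field(default_factory=list)

    def record(self, label: str, cost_usd: float) -> None:
        """Register one LLM completion and what it cost."""
        self.calls.append((label, cost_usd))

    @property
    def total_usd(self) -> float:
        return sum(cost for _, cost in self.calls)

    def over_budget(self) -> bool:
        return self.total_usd > self.budget_usd


meter = EnrichmentCostMeter()
for step in ("find hero", "find about", "find pricing"):
    meter.record(step, 0.003)  # assumed cost per completion
```

Exporting `len(meter.calls)` and `meter.total_usd` as metrics per enrichment would make an unexpected jump in completions visible before it shows up on the invoice.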

What we pull

Landing hero — first-screen copy, primary CTA, language detect
Navigation tree — top-menu structure, used to discover deeper pages
About / Team — headcount signals (group photo, name listings)
Pricing / Plans — if public, the price tier and target segment
Tech stack hints — from HTTP headers (Server, X-Powered-By), HTML attributes (data-react, ng-app), CDN signatures (Cloudflare, Vercel, Akamai)
Contact — public email patterns, phone (public only, never dialled)
Recent news — Google News API on the company name, last 90 days
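
The tech stack hints row is the most mechanical item in the list, so it is the easiest to sketch. The function below is an illustration, not the production detector: `detect_stack_hints` and its label format are hypothetical, and the specific CDN markers (`cf-ray`, `x-vercel-id`, "akamai" in the Server header) are assumptions about what those providers typically send.

```python
def detect_stack_hints(headers: dict, html: str) -> set:
    """Return coarse tech-stack labels from response headers and raw HTML."""
    # Header names are case-insensitive, so normalise before matching.
    h = {k.lower(): v.lower() for k, v in headers.items()}
    page = html.lower()
    hints = set()

    server = h.get("server", "")
    if server:
        hints.add("server:" + server)
    if "x-powered-by" in h:
        hints.add("powered-by:" + h["x-powered-by"])

    # CDN signatures (assumed markers, see the note above).
    if "cloudflare" in server or "cf-ray" in h:
        hints.add("cdn:cloudflare")
    if "vercel" in server or "x-vercel-id" in h:
        hints.add("cdn:vercel")
    if "akamai" in server:
        hints.add("cdn:akamai")

    # Framework attributes left in the markup.
    if "data-react" in page:
        hints.add("framework:react")
    if "ng-app" in page:
        hints.add("framework:angular")
    return hints
```

Substring matching keeps the sketch short but will produce false positives, so the results are best treated as hints, as the list item says, rather than as a verified stack.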

CAPTCHA strategy: NONE

The most important design call: we do not solve CAPTCHAs. No 2captcha-style service, no in-house ML solver. If a site shows a CAPTCHA, that is a signal: do not fetch this now. The scraper backs off exponentially (24h, 48h, 96h) and falls back to public sources only (LinkedIn public profiles, Google News, public-domain WHOIS-like info).
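
The 24h, 48h, 96h schedule is a doubling backoff with a 24-hour base. A minimal sketch, assuming the counter is the number of consecutive CAPTCHA hits for a site; the function names are hypothetical, and what happens after the third hit is not modelled here.

```python
from datetime import datetime, timedelta, timezone

BASE_BACKOFF_HOURS = 24


def captcha_backoff_hours(captcha_hits: int) -> int:
    """Hours to wait after the n-th consecutive CAPTCHA (1-based)."""
    if captcha_hits < 1:
        raise ValueError("captcha_hits must be >= 1")
    # Doubles on every hit: 24, 48, 96, ...
    return BASE_BACKOFF_HOURS * 2 ** (captcha_hits - 1)


def next_attempt_at(seen_at: datetime, captcha_hits: int) -> datetime:
    """Earliest time the site may be fetched again."""
    return seen_at + timedelta(hours=captcha_backoff_hours(captcha_hits))


first_seen = datetime(2026, 3, 1, tzinfo=timezone.utc)
retry_after_second_hit = next_attempt_at(first_seen, captcha_hits=2)
```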

Same with robots.txt: we honour it. If a path is Disallow-ed, no scrape. This is not a regulatory obligation — it is a voluntary limit — and it has saved us from getting banned from services more than once.
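
The robots.txt check can be done with Python's standard library alone. The sketch below parses an already-downloaded robots.txt body; `allowed_by_robots` and the `enrichment-svc` user-agent string are hypothetical names, and how the file is fetched and cached is left out.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str,
                      user_agent: str = "enrichment-svc") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
can_read_about = allowed_by_robots(robots, "https://example.com/about")
can_read_private = allowed_by_robots(robots, "https://example.com/private/team")
```

Running the check per path, before the browser session opens the page, keeps a Disallow-ed URL from ever being requested.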

The March rotating-IP bug

In March 2026 we banged our heads against the wall for two weeks. Enrichment success rate dropped from 67% to 41%. We tried bumping Playwright timeouts, swapping the browser-use model, rewriting selector strategies — nothing helped.

The root cause was IP-rotation pool exhaustion. At our residential proxy provider, one specific region (Central Europe) ran out of IPs and the system silently fell back to datacenter IPs. Datacenter IPs were fingerprinted and blocked by the target sites.

The fix: a new provider, better pool monitoring, and an internal dashboard that shows success rate broken down by IP class. If the rate stays below 50% for 6 hours, an alarm fires. The problem has not recurred in six months.
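
The alarm rule can be sketched as two small functions: one that computes success rate per IP class, and one that checks whether the rate has been under the threshold for a full window. This is an illustration under assumptions, not the dashboard's code: the function names, the `(ip_class, ok)` event shape, and evaluating the rule on hourly buckets are all hypothetical.

```python
from collections import defaultdict


def success_rate_by_ip_class(events):
    """events: iterable of (ip_class, ok) pairs; returns {ip_class: rate}."""
    totals = defaultdict(lambda: [0, 0])  # ip_class -> [successes, attempts]
    for ip_class, ok in events:
        totals[ip_class][0] += int(ok)
        totals[ip_class][1] += 1
    return {cls: s / n for cls, (s, n) in totals.items()}


def should_alarm(hourly_rates, threshold=0.5, window_hours=6):
    """True if each of the last `window_hours` hourly rates is below threshold."""
    recent = hourly_rates[-window_hours:]
    return len(recent) == window_hours and all(r < threshold for r in recent)


rates = success_rate_by_ip_class([
    ("residential", True), ("residential", True), ("residential", False),
    ("datacenter", False), ("datacenter", False), ("datacenter", True),
    ("datacenter", False),
])
```

Splitting by IP class is what would have exposed the March incident quickly: the residential rate stays healthy while the datacenter rate collapses, which a single blended number hides.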

The lesson: scraping infrastructure is brittle enough to deserve its own observability layer. Application-level APM is not enough.

Cost breakdown

One average company enrichment:

Playwright session (1 site × 3-5 paths): 8-12 seconds CPU + 80 MB RAM
browser-use LLM calls (2-4× gpt-4.1-mini): 0.006-0.012 USD
Structured extract (1× gpt-4.1-mini): 0.008 USD
News API: 0.001 USD
Proxy bandwidth: 0.003 USD

Average total: 0.02 USD / company. For a tenant with 1,420 leads per week, that is ~120 USD enrichment cost per month.
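
The arithmetic behind those two numbers, as a worked check. The per-item costs come from the list above; the 52/12 weeks-per-month factor is an assumption, and the Playwright CPU and RAM line is left out because the breakdown gives it no USD price.

```python
# Per-company costs in USD, taken from the breakdown above.
llm_low, llm_high = 0.006, 0.012      # browser-use calls (2-4 completions)
structured_extract = 0.008
news_api = 0.001
proxy_bandwidth = 0.003

fixed = structured_extract + news_api + proxy_bandwidth
per_company_low = llm_low + fixed     # about 0.018
per_company_high = llm_high + fixed   # about 0.024

# The quoted 0.02 USD average sits inside that range.
per_company_avg = 0.02

leads_per_week = 1420
weeks_per_month = 52 / 12             # assumed conversion factor
monthly_usd = per_company_avg * leads_per_week * weeks_per_month

print(f"per company: {per_company_low:.3f}-{per_company_high:.3f} USD")
print(f"per month:   {monthly_usd:.0f} USD")
```

The result lands at roughly 123 USD per month, which matches the ~120 USD figure quoted above.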

What we do not do

We do not scrape login-walled data. We do not solve CAPTCHAs. We do not buy leaked databases. We do not pick proxy IPs to spoof a country. Enrichment uses public sources only, or it does not run at all.