salesai | 25 May 2026 | EN

Lead scoring with AI — 14 features, gradient-boosted model, LLM re-rank at the borderlines 2026

14 features, LightGBM and LLM re-rank at borderlines. Top 20% hit 41% conversion, drift detector triggers on a 5pp drop.

Lead scoring with AI — prioritisation in 2026

Lead scoring is not a new topic. People have been doing it for 15 years with Salesforce-style scoring rules: if the email domain is an education TLD, +10 points; if they hit the pricing page, +20. That approach had one core weakness: humans wrote the rules, and humans do not scale. The Nortinia Sales AI scoring module is ML-based, but we did not throw the heuristics out; we gave them a different role.

The 14 features

The model takes 14 features as input, each derived from the enrichment output or from historical touch data:

Industry (NACE code, when extractable)
Size (estimated headcount: 0-50 / 50-200 / 200-1000 / 1000+)
Recent funding (last 12 months)
Website tech stack (modern vs. legacy)
Email pattern (firstname.lastname vs. info@ — different maturity signal)
Pricing transparency (public price vs. "contact us")
Geographic reach (EU only, EMEA, global)
Product maturity proxy (changelog freshness, blog cadence)
Sales team size (LinkedIn search proxy)
Recent news sentiment (acquisition rumours, layoffs, expansion)
Tech-stack overlap with us (e.g. if they run a Nortinia Engine-compatible stack)
Engagement signal (any prior website visit)
Referral signal (introduced by another customer)
Vertical fit (similarity to the tenant-configured ICP)
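As a rough illustration, the 14 features could be carried as one typed record per lead. The field names, types, and encodings below are assumptions made for this sketch; the list above only names the signals, not how they are stored.

```python
from dataclasses import dataclass, fields
from typing import Optional


# Hypothetical field names and encodings for the 14 listed features.
@dataclass
class LeadFeatures:
    industry_nace: Optional[str]   # NACE code, None when not extractable
    size_bucket: str               # "0-50", "50-200", "200-1000", "1000+"
    recent_funding: bool           # funding in the last 12 months
    modern_tech_stack: bool        # modern vs. legacy website stack
    personal_email_pattern: bool   # firstname.lastname vs. info@
    public_pricing: bool           # public price vs. "contact us"
    geo_reach: str                 # "EU", "EMEA", "global"
    product_maturity: float        # changelog freshness / blog cadence proxy
    sales_team_size: int           # LinkedIn search proxy
    news_sentiment: float          # recent news sentiment
    stack_overlap: bool            # tech-stack overlap with us
    prior_visit: bool              # engagement signal
    referred: bool                 # referral signal
    vertical_fit: float            # similarity to the tenant-configured ICP


def to_row(lead: LeadFeatures) -> dict:
    """Flatten a lead into a plain dict, keeping None for a missing NACE code."""
    return {f.name: getattr(lead, f.name) for f in fields(lead)}
```

Keeping the missing NACE code as None rather than a placeholder category leaves the decision about missing values to the model layer.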

The model

LightGBM gradient-boosted trees. Trained on 8,000 historical labelled examples. Each example is one lead whose lifecycle has closed: it either converted or was genuinely lost. The label is binary: converted yes / no.
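The labelling rule can be sketched in a few lines: only closed leads enter the training set, and the label is 1 for converted, 0 for lost. The status strings and the dict layout are hypothetical, chosen only for this example.

```python
from typing import Iterable, List, Tuple

# Hypothetical status names; the article only says "converted" or "lost".
CLOSED_STATUSES = {"converted", "lost"}


def build_training_set(leads: Iterable[dict]) -> Tuple[List[dict], List[int]]:
    """Keep only leads whose lifecycle has closed and attach a binary label."""
    rows: List[dict] = []
    labels: List[int] = []
    for lead in leads:
        if lead["status"] not in CLOSED_STATUSES:
            continue  # still open: outcome unknown, so not a training example
        rows.append(lead["features"])
        labels.append(1 if lead["status"] == "converted" else 0)
    return rows, labels
```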

Why gradient boosting and not a deep net: the dataset is small (on the order of 10k examples), the features are mostly categorical or numerical, and, crucially, every decision is explainable. With SHAP values we can say about a given lead: "this scored 73 because size is +18 and tech-stack is +12, but no-pricing is -7". That is communicable to a sales rep.
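A minimal sketch of how per-feature contributions could be turned into the sentence a sales rep sees. It relies on the additive property of SHAP values (base value plus contributions equals the model output). The base value of 50 is an assumption picked so the quoted contributions add up to 73, and how raw SHAP output is rescaled to score points is not stated in the article.

```python
from typing import Dict


def explain_score(base: float, contributions: Dict[str, float], top_n: int = 3) -> str:
    """Summarise a score as base value plus its largest feature contributions."""
    score = base + sum(contributions.values())
    # Rank by absolute impact so strong negatives are shown too.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    parts = [f"{name} {value:+.0f}" for name, value in ranked[:top_n]]
    return f"score {score:.0f}: " + ", ".join(parts)


# Assumed base value of 50; contributions are the ones quoted in the text.
print(explain_score(50, {"size": 18, "tech-stack": 12, "no-pricing": -7}))
# prints "score 73: size +18, tech-stack +12, no-pricing -7"
```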

LLM re-rank

Model scores are 0-100. In the borderline zone (60-65) an LLM (gpt-4.1-mini) gets a structured prompt containing the lead features plus 2-3 similar past leads from the training set, and returns a suggested nudge up or down of no more than 5 points. This does not override the model; it only refines the band edge.
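The guard rails around the re-rank can be sketched as follows. The LLM call is stubbed out as a hypothetical `ask_llm` callable that returns a suggested nudge; treating the 60-65 band as inclusive at both ends and clamping the final score to 0-100 are assumptions of this sketch.

```python
from typing import Callable

BAND_LOW, BAND_HIGH = 60, 65  # borderline zone from the article
MAX_NUDGE = 5                 # the LLM may move a score by at most 5 points


def rerank(score: float, lead: dict, ask_llm: Callable[[dict], float]) -> float:
    """Apply a bounded LLM nudge, but only to scores in the borderline zone."""
    if not (BAND_LOW <= score <= BAND_HIGH):
        return score  # stable zone: the model score stands, no LLM call
    nudge = ask_llm(lead)  # hypothetical call: returns a suggested delta
    nudge = max(-MAX_NUDGE, min(MAX_NUDGE, nudge))  # enforce the cap
    return max(0.0, min(100.0, score + nudge))
```

Clamping in code rather than trusting the prompt means an out-of-range LLM answer cannot move a lead further than the agreed limit.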

Why only here and not everywhere: cost is not the constraint. An LLM call on every lead would be 1,420 leads/week × 0.01 USD ≈ 14 USD/week, which is not much. But the model already agrees with the human labeller 95% of the time in the stable zones, so the LLM only adds value at the band edge. Pareto principle.

Drift detector

The model is eligible for retraining monthly, but it only retrains if there is drift. The drift detector watches two things:

Input drift: the distribution of the 14 features compared to the training set. KL divergence, threshold 0.15.
Output accuracy: on leads whose lifecycle has closed, what is the model's accuracy? Threshold: a 5-percentage-point drop from baseline.
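The two checks above can be sketched in plain Python. Several details are assumptions, since the article gives only the metric and the thresholds: features are compared as binned histograms, the divergence is KL(live || training) in nats, a small epsilon smooths empty bins, and the 0.15 threshold is applied per feature.

```python
import math
from typing import Dict, Sequence

KL_THRESHOLD = 0.15      # input-drift threshold from the article
ACCURACY_DROP_PP = 5.0   # accuracy-drop threshold, in percentage points


def kl_divergence(p_counts: Sequence[float], q_counts: Sequence[float],
                  eps: float = 1e-9) -> float:
    """KL(p || q) in nats for two histograms over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    n = len(p_counts)
    p_total = sum(p_counts) + eps * n
    q_total = sum(q_counts) + eps * n
    kl = 0.0
    for p_c, q_c in zip(p_counts, q_counts):
        p = (p_c + eps) / p_total  # smoothed so empty bins do not divide by zero
        q = (q_c + eps) / q_total
        kl += p * math.log(p / q)
    return kl


def drift_detected(live_hists: Dict[str, Sequence[float]],
                   train_hists: Dict[str, Sequence[float]],
                   baseline_acc_pct: float, current_acc_pct: float) -> bool:
    """True if any feature drifts past the KL threshold or accuracy drops 5pp."""
    input_drift = any(
        kl_divergence(live_hists[name], train_hists[name]) > KL_THRESHOLD
        for name in train_hists
    )
    accuracy_drop = (baseline_acc_pct - current_acc_pct) >= ACCURACY_DROP_PP
    return input_drift or accuracy_drop
```

Accuracies are passed in percent so the 5-point comparison stays in the same unit as the threshold.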

If either triggers, retraining starts automatically. The new model runs in a shadow layer first (production score untouched, observation only) and is only promoted after 200 leads if metrics improve.
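The promotion gate can be sketched as a small pure function. The article says only that the shadow model is promoted after 200 leads if metrics improve; using accuracy on the same closed leads as the metric, and the `ShadowStats` record itself, are assumptions of this example.

```python
from dataclasses import dataclass

MIN_SHADOW_LEADS = 200  # observation window from the article


@dataclass
class ShadowStats:
    """Hypothetical counters collected while the new model runs in shadow."""
    closed_leads: int        # leads scored by both models whose outcome is known
    production_correct: int  # correct predictions by the production model
    shadow_correct: int      # correct predictions by the shadow model


def should_promote(stats: ShadowStats) -> bool:
    """Promote only after enough leads, and only if the shadow model did better."""
    if stats.closed_leads < MIN_SHADOW_LEADS:
        return False  # keep observing; production scores stay untouched
    return stats.shadow_correct > stats.production_correct
```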

Outcome

Numbers from one mid-tier tenant after six months:

Top 20% (A+B bucket): 41% conversion
Other 80%: 12% conversion
Random baseline (no scoring): 18% conversion

Scoring lifted top-bucket conversion 2.3x over the random baseline (41% vs. 18%) and made sales-rep time 4.2x more efficient, because reps focus on high-score leads.
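As a quick sanity check, the three conversion figures are mutually consistent: weighting the two buckets by their share reproduces the baseline, and the ratio gives the quoted lift. Only the numbers listed above are used; the 4.2x efficiency figure cannot be derived from them and is not recomputed here.

```python
top_share, top_conv = 0.20, 0.41    # top 20% bucket
rest_share, rest_conv = 0.80, 0.12  # remaining 80%
baseline = 0.18                     # random baseline, no scoring

# Blended conversion across both buckets should match the baseline.
blended = top_share * top_conv + rest_share * rest_conv
lift = top_conv / baseline

print(f"blended conversion: {blended:.1%}")  # prints "blended conversion: 17.8%"
print(f"top-bucket lift: {lift:.2f}x")       # prints "top-bucket lift: 2.28x"
```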

What we do not put into the model

Ethnic or gender signals — deliberately not a feature
Personal data (home address, phone) — only company-level features
Competitor-touch history ("already talked to a competitor") — too easy to misinterpret on the sales-rep side

This is not just a compliance question. The model is trustworthy when we understand why it decides — and the fewer morally loaded features it has, the better we understand it.

The expensive mistake: first six weeks

The first model version included a "country" feature. The labelled dataset was disproportionately Western European, and the model systematically marked Eastern European leads down. A Central European tenant complained three weeks in that we were "underestimating" their home market. They were right. We removed the feature, retrained, and since then no geography is disproportionately penalised. The lesson: feature selection is a moral decision, not only a technical one.