Methodology

This page states exactly how The State of Small-Business Websites was produced: how the sample was drawn, what was measured, how the crawl behaved, what the data does not claim, and how a site owner can ask to be excluded. Transparency is what makes a number citable, so nothing here is left implicit.

The sample

The 2026 edition measured 391 public small-business websites, drawn from a working list of independent businesses concentrated in Ontario and other Canadian provinces. The list is reproducible: it is built from a fixed seed and sampled deterministically, so re-running the pipeline on the same inputs selects the same sites. The sample is stratified across industries so that no single industry dominates the cuts, and only an industry with at least 20 measured sites is reported as its own row.
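
As an illustration of what "fixed seed, sampled deterministically" can look like in practice, here is a minimal Python sketch. It is not the committed pipeline: the function names, the `(host, industry)` input shape, the per-industry quota, and the seed value 2026 are all assumptions made for the example. Only the threshold of 20 measured sites per reported industry comes from the text above.

```python
import random
from collections import defaultdict

def stratified_sample(sites, per_industry, seed=2026):
    """Pick up to `per_industry` hosts from each industry, deterministically.

    `sites` is an iterable of (host, industry) pairs (hypothetical input shape).
    """
    by_industry = defaultdict(list)
    for host, industry in sites:
        by_industry[industry].append(host)

    rng = random.Random(seed)  # fixed seed: same inputs give the same picks
    sample = []
    for industry in sorted(by_industry):       # stable iteration order
        hosts = sorted(by_industry[industry])  # input order must not matter
        k = min(per_industry, len(hosts))
        sample.extend((host, industry) for host in rng.sample(hosts, k))
    return sample

def reportable_industries(measured, min_sites=20):
    """Industries with enough measured sites to be reported as their own row."""
    counts = defaultdict(int)
    for _host, industry in measured:
        counts[industry] += 1
    return sorted(name for name, n in counts.items() if n >= min_sites)
```

Sorting both the industries and the hosts before drawing is what makes the result independent of the order in which the working list happens to be stored; the seed alone would not guarantee that.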

The crawl attempted 420 sites and analyzed 391. The other 29 were excluded for one of four reasons: the homepage returned an error, the request timed out, the site blocked automated access, or the site's robots.txt disallowed our crawler (which we respect). Of those 29, 7 were excluded specifically because robots.txt told us not to crawl.

The seed is built from internal market-research lists. Those lists are used only as a source of public website addresses to measure. No business on them is contacted as part of this research, and no business name appears in the published data or in the stored crawl files.

What was measured per site

One public homepage was fetched per site, and these page-level signals were extracted from it. Every signal is a presence check or a count, never a judgment.

HTTPS: whether the homepage served over HTTPS after redirects.
Mobile viewport: whether a viewport meta tag was present, the minimum a page needs to adapt to a phone.
Title and meta description: whether each was present, and its length. The text itself was not stored, only its length.
Structured data: whether the page carried JSON-LD or microdata, and which schema.org types appeared.
OpenGraph: whether OpenGraph tags were present for social and messaging previews.
Page weight: the size of the homepage HTML in bytes, plus counts of script and stylesheet references as a cheap proxy for additional weight.
Platform: the website builder or framework, inferred from fingerprints in the markup (WordPress, Wix, Squarespace, Shopify, the GoDaddy builder, and others). A site with no recognized fingerprint is recorded as custom or unidentified.
Copyright-year recency: the most recent year shown in a copyright notice, used as a proxy for whether the site has been refreshed. Only the year was read, never the surrounding text.
Blog presence: whether the homepage linked to a blog, news, or articles section.
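
To make "a presence check or a count" concrete, the sketch below extracts several of the signals listed above from a homepage's HTML using only Python's standard-library `html.parser`. It is an illustrative approximation, not the study's extractor: the class name, the field names, and the exact matching rules (for example, treating any `itemscope` or `itemtype` attribute as microdata) are assumptions. Platform fingerprinting, schema.org type extraction, copyright-year reading, and blog-link detection are left out to keep it short.

```python
from html.parser import HTMLParser

class SignalParser(HTMLParser):
    """Collects presence flags, lengths, and counts; page text is not kept in the result."""

    def __init__(self):
        super().__init__()
        self.signals = {
            "has_viewport": False,
            "title_length": None,             # None means no <title> was found
            "meta_description_length": None,  # None means no meta description
            "has_json_ld": False,
            "has_microdata": False,
            "has_opengraph": False,
            "script_count": 0,                # external scripts only (those with src)
            "stylesheet_count": 0,
        }
        self._in_title = False
        self._title_parts = []

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        s = self.signals
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = a.get("name", "").lower()
            if name == "viewport":
                s["has_viewport"] = True
            elif name == "description":
                s["meta_description_length"] = len(a.get("content", "").strip())
            if a.get("property", "").lower().startswith("og:"):
                s["has_opengraph"] = True
        elif tag == "script":
            if a.get("type", "").lower() == "application/ld+json":
                s["has_json_ld"] = True
            elif "src" in a:
                s["script_count"] += 1
        elif tag == "link" and "stylesheet" in a.get("rel", "").lower().split():
            s["stylesheet_count"] += 1
        if "itemscope" in a or "itemtype" in a:
            s["has_microdata"] = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Store only the length, then discard the text.
            self.signals["title_length"] = len("".join(self._title_parts).strip())
            self._title_parts = []

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

def extract_signals(html):
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    signals = dict(parser.signals)
    signals["html_bytes"] = len(html.encode("utf-8"))  # size of the HTML document only
    return signals
```

Calling `extract_signals(html)` on a fetched homepage returns a flat dictionary of flags, lengths, and counts, which is the kind of record that can be keyed to a host without storing any page text.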

Definitions

No structured data means the homepage carried neither JSON-LD nor microdata, so a search engine or an AI assistant had no machine-readable statement of what the business is.

Stale copyright means a site that displayed a copyright year three or more years before the crawl year. The percentage is taken only over sites that showed a year at all, and that base is stated wherever the figure appears.
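
The stale-copyright definition reduces to a small calculation. The sketch below is one way to express it, with the function name and the input shape (one latest year per site, or `None` when no year was shown) assumed for illustration. It returns the base alongside the percentage, since the base is part of the figure.

```python
def stale_copyright_share(latest_years, crawl_year, threshold=3):
    """Return (percent_stale, base) over sites that showed a copyright year.

    `latest_years` holds the most recent year per site, or None if none was shown.
    A site is stale when its year is `threshold` or more years before `crawl_year`.
    """
    shown = [y for y in latest_years if y is not None]
    if not shown:
        return None, 0  # no base, so no percentage
    stale = sum(1 for y in shown if crawl_year - y >= threshold)
    return 100.0 * stale / len(shown), len(shown)
```

For a crawl year of 2026, a site showing 2023 counts as stale and a site showing 2024 does not; sites with no year are excluded from both the numerator and the base.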

Median page weight is the median size of the homepage HTML document, in kilobytes, across the sample.
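
A short sketch of that calculation follows. The text above does not say whether a kilobyte is 1,000 or 1,024 bytes, so the divisor is a parameter here and its default of 1,024 is an assumption, as is the function name.

```python
from statistics import median

def median_page_weight_kb(html_byte_sizes, bytes_per_kb=1024):
    """Median homepage HTML size in kilobytes (divisor is an assumption)."""
    return median(html_byte_sizes) / bytes_per_kb
```

With an even number of sites, `statistics.median` averages the two middle values, so the result need not match any single site's size.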

The ethics of the crawl

The crawl measured public pages the way a search engine does, and nothing more. Specifically:

Public pages only. One public homepage per site. No login walls, no private pages, no forms submitted.
Robots respected. The crawler read each site's robots.txt first and did not fetch any site that disallowed it. Seven sites were excluded on that basis.
Rate limited. At least two seconds between requests to the same domain, with bounded concurrency across domains, so no site saw meaningful load.
Identifiable. The crawler identified itself with the user agent AtlasForgeResearchBot/1.0 (+https://www.atlasforge.one/research/state-of-small-business-websites/methodology), which links back to this page, so any owner who saw it in their logs could find out who we are.
No personal data. No email address, phone number, contact name, or business name was collected or stored. The stored data is page-level signals keyed only to a website host. This is measurement, not outreach: no business was contacted as part of this research.
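
The rate-limiting rule above (at least two seconds between requests to the same domain) can be sketched as a small per-host limiter. This is an illustration under stated assumptions, not the crawler's actual code: the class name is hypothetical, the clock and sleep functions are injectable only so the behaviour can be checked without real waiting, and bounded concurrency across domains is not shown.

```python
import time
from urllib.parse import urlsplit

class DomainRateLimiter:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the previous request

    def wait(self, url):
        """Block until `url`'s host may be contacted; return the seconds waited."""
        host = urlsplit(url).hostname or ""
        last = self._last_request.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (self._clock() - last))
        if delay > 0:
            self._sleep(delay)
        self._last_request[host] = self._clock()
        return delay
```

A crawler would call `limiter.wait(url)` immediately before each fetch. Because the study fetched one homepage per site, the limiter matters mainly for the robots.txt request, the homepage request, and any redirects on the same host.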

Limits

This is an initial sample of 391 sites, not a census. The findings describe this sample and are not extrapolated to every small-business website. The sample is concentrated in Ontario and other Canadian provinces, so a figure may not hold in a different market.

The signals are homepage-level and structural. A missing meta description or a missing schema block is a real, measurable gap, but it is not a verdict on a business or its service. A site can be missing every signal here and still serve its customers well; the data measures the website, not the company.

Total page weight beyond the HTML document (images, scripts, fonts, third-party embeds) was not fully measured in this edition. A performance subsample using a scoring service is planned for a future edition and will be disclosed here when it runs.

Requesting exclusion

If you own a site and would prefer it not be included in a future edition, email hello@atlasforge.one with the domain and we will exclude it from the next crawl. The published data already names no business, so there is nothing about your specific site to remove from this edition; the request applies to future crawls. This is an inbound request you make to us; we do not reach out to ask.

You can also block AtlasForgeResearchBot in your robots.txt, and the crawler will skip your site, the same as it did for the 7 sites excluded that way this round.
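
For site owners who want to confirm that a robots.txt rule has the intended effect, the sketch below checks a two-line rule with Python's standard-library `urllib.robotparser`. The rule text is a standard robots.txt group naming the crawler's product token; the helper name `is_allowed` and the example host are assumptions, and the study's crawler may use a different parser than the one shown here.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt group that blocks only this crawler, site-wide.
ROBOTS_TXT = """\
User-agent: AtlasForgeResearchBot
Disallow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the rule above, `is_allowed(ROBOTS_TXT, "AtlasForgeResearchBot/1.0", "https://shop.example/")` is false, while other user agents remain unaffected because the group names only this crawler.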

Reproducibility

The pipeline (seed, crawl, aggregate) and the aggregate dataset are committed in the AtlasForge repository. The dataset carries the crawl date and the sample size, so every number in the report is auditable against the committed data. The report is dated 2026-06-10 and refreshed annually; each new edition will be published at its own dated URL and this one will be kept, so the time series itself becomes a record.