Data sourcing transparency
StatVault's Knox Oracle answers questions by combining several independent data sources. Every source falls under one of the four categories below. We disclose the category of every source we use so that our legal posture is auditable. We do not disclose specific upstream identities, both because the combination is proprietary (see proprietary information) and because several upstreams prefer that derivative use of their data stay low-key.
The four categories
1. Official feeds
Endpoints operated by the league or governing body itself. They are public, unauthenticated, and intended for derivative consumer use, and their terms of service place no restriction on programmatic access.
2. Open-license datasets
Datasets published under permissive open-source licenses (MIT, CC0, or equivalent) that explicitly authorize programmatic and commercial reuse, including in aggregated products like ours.
3. Supplemental aggregators
Public but undocumented endpoints that aggregator sites use to power their own consumer products. We rely on them only as supplemental sources, never as the sole basis for a value, and we redact their identities publicly because they prefer that derivative use stay low-key.
4. Licensed uploads
Data that StatVault operators have legally obtained under a paid subscription or commercial license (e.g. Stathead CSV exports) and uploaded through the founder-authenticated ingest endpoint. Every licensed upload is logged with the operator's license claim.
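The four categories, and the rule that a supplemental aggregator is never the sole basis for a value, can be sketched as a small check. This is an illustration only: the enum and the has_independent_basis function are hypothetical names chosen for this example and do not describe StatVault's actual code.

```python
from enum import Enum


class SourceCategory(Enum):
    """The four sourcing categories described above (names are illustrative)."""
    OFFICIAL_FEED = 1
    OPEN_LICENSE_DATASET = 2
    SUPPLEMENTAL_AGGREGATOR = 3
    LICENSED_UPLOAD = 4


def has_independent_basis(categories):
    """Return True if at least one contributing source is NOT a supplemental aggregator.

    `categories` is an iterable of SourceCategory values, one per source that
    contributed to a single derived value. An empty iterable returns False,
    since a value with no sources has no basis at all.
    """
    return any(c is not SourceCategory.SUPPLEMENTAL_AGGREGATOR for c in categories)
```

Under this sketch, a value backed by an official feed plus an aggregator passes the check, while a value backed only by aggregators does not.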
What we will not do
- We do not scrape websites whose terms of service prohibit it, even when the data is technically reachable. This explicitly includes the Sports Reference family (pro-football-reference, basketball-reference, baseball-reference, hockey-reference, FBref, and Stathead web HTML) and any other source that disallows programmatic access. Stathead data enters our system only through the licensed CSV-upload path (category 4).
- We do not use Natural Stat Trick or other personal-use-only datasets for commercial output without an explicit license agreement.
- We do not republish raw upstream payloads. Knox answers carry our derived values plus a verifiable receipt; consumers can verify the chain but cannot reconstruct the upstream feed.
How to verify our sourcing posture
The public Oracle health board at /api/v1/oracle/sources returns the full list of sources we currently consult, redacted to opaque ids and annotated with the category each falls into. Authorized auditors can unlock the full mapping with an operator key.
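As a rough illustration of how an auditor might summarize that listing, the sketch below groups opaque source ids by category. The response shape is an assumption: this page does not document the schema, so the "sources", "id", and "category" field names and the category labels in the sample are hypothetical.

```python
import json
from collections import defaultdict


def group_sources_by_category(payload_text):
    """Group opaque source ids by category.

    ASSUMPTION: the body returned by /api/v1/oracle/sources is JSON of the form
    {"sources": [{"id": "...", "category": "..."}, ...]}. The real schema may differ.
    """
    data = json.loads(payload_text)
    grouped = defaultdict(list)
    for entry in data["sources"]:
        grouped[entry["category"]].append(entry["id"])
    # Sort ids so the summary is stable across calls.
    return {category: sorted(ids) for category, ids in grouped.items()}


# Hypothetical sample payload; ids and category labels are made up for illustration.
sample_payload = json.dumps({
    "sources": [
        {"id": "src_c3", "category": "official_feed"},
        {"id": "src_a1", "category": "official_feed"},
        {"id": "src_b2", "category": "open_license_dataset"},
    ]
})

summary = group_sources_by_category(sample_payload)
```

With the sample above, summary maps "official_feed" to the two sorted ids and "open_license_dataset" to one id.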
Every Knox answer carries a deterministic receipt id (rc_*) tied to a Merkle-anchored evidence chain. Look up any receipt at /api/v1/knox-receipt/[id] to verify it.
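For readers unfamiliar with Merkle-anchored chains, the sketch below shows the general technique of checking a Merkle inclusion proof, plus a helper that builds the receipt lookup path given above. It is not a description of Knox's actual receipt format: the choice of SHA-256, the proof layout (a list of sibling hashes with a left/right marker), and all function names are assumptions made for illustration.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def receipt_path(receipt_id: str) -> str:
    """Build the lookup path for a receipt id, which must start with 'rc_'."""
    if not receipt_id.startswith("rc_"):
        raise ValueError("receipt ids are expected to start with 'rc_'")
    return f"/api/v1/knox-receipt/{receipt_id}"


def verify_inclusion(leaf: bytes, proof, root: bytes) -> bool:
    """Check that `leaf` is included under `root` (generic Merkle proof check).

    ASSUMPTION: `proof` is a list of (sibling_hash, side) pairs ordered from the
    leaf level upward, where side is "L" if the sibling sits to the left of the
    current node and "R" if it sits to the right.
    """
    node = sha256(leaf)
    for sibling, side in proof:
        if side == "L":
            node = sha256(sibling + node)
        elif side == "R":
            node = sha256(node + sibling)
        else:
            raise ValueError("side must be 'L' or 'R'")
    return node == root


# Build a tiny four-leaf tree by hand to exercise the check.
leaves = [b"evidence-0", b"evidence-1", b"evidence-2", b"evidence-3"]
h = [sha256(x) for x in leaves]
left, right = sha256(h[0] + h[1]), sha256(h[2] + h[3])
root = sha256(left + right)

# Proof for leaves[2]: its right-hand sibling h[3], then the left subtree hash.
proof_for_2 = [(h[3], "R"), (left, "L")]
ok = verify_inclusion(leaves[2], proof_for_2, root)
tampered = verify_inclusion(b"evidence-2-modified", proof_for_2, root)
```

The point of the design is that a consumer only needs the leaf, a short list of sibling hashes, and the anchored root to confirm inclusion; the rest of the evidence chain does not have to be revealed.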
Reporting a sourcing concern
If you operate an upstream and want us to stop calling your service, change how we credit it, or move it between categories, email legal@statvault.org with the URL of your terms-of-service page. We will confirm receipt within two business days and remove the source within five business days where good-faith action is appropriate, while preserving our audit log.
Last updated 2026-05-18. Boundary codified in our internal Creative Direction Record on the same date.