v0.4.1 · OPEN PROTOCOL · Universal Content Fingerprinting

Fingerprint anything deterministically.

UCFP is a protocol and runtime for producing stable, content-addressed identifiers across any modality: text, images, audio, video, code, embeddings, datasets, model weights. One algorithm. One ID space.

One call. Any input.

The SDK is a single function. Pass anything that can be serialized to bytes — a string, a Buffer, a stream, a file handle, an embedding vector — and get back a UCFP identifier.

IDs are stable across runtimes, platforms, and language bindings. The same input always returns the same fingerprint. Distinct inputs collide only with cryptographically negligible probability.

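// Fingerprint anything.
import { fingerprint } from "ucfp";

// `input` is any value that serializes to bytes: a string, a Buffer,
// a stream, a file handle, or an embedding vector.
const id = await fingerprint(input, {
  modality: "auto",    // "auto" detects; or set text | image | audio | bytes
  bits: 256,
  encoding: "multibase",
});

// → ucfp1·b3k4q7n2…wpx9
console.log(id.toString());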

A flat namespace for all content.

Every UCFP is comparable to every other UCFP — across modality, source, and time. Build dedup, provenance, and search on a single primitive.

/01

Deterministic

Same bytes in, same fingerprint out — for the next twenty years, on any machine.

/02

Modality-agnostic

Text, images, audio, video, embeddings, weights — all collapse to one ID space.

/03

Perceptual mode

Opt into Hamming-comparable IDs that survive resizing, re-encoding, and minor edits.
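
To make "Hamming-comparable" concrete, here is a minimal sketch of comparing two fixed-width fingerprints by counting differing bits. It assumes the IDs have already been decoded to raw bytes; `hammingDistance` is a hypothetical helper rather than part of the UCFP SDK, and the sample signatures are made up.

```typescript
// Hypothetical helper, not part of the UCFP SDK: counts differing bits
// between two equal-length fingerprints given as raw bytes.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("fingerprints must have the same length");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i]; // bits set where the two bytes disagree
    while (diff !== 0) {
      diff &= diff - 1; // clear the lowest set bit
      distance++;
    }
  }
  return distance;
}

// Two made-up 4-byte signatures that differ in exactly three bits.
const sigOriginal = Uint8Array.from([0xb2, 0x0f, 0xa5, 0x00]);
const sigReencoded = Uint8Array.from([0xb3, 0x0f, 0xa5, 0x03]);
console.log(hammingDistance(sigOriginal, sigReencoded)); // 3
```

A small distance relative to the bit width suggests the same underlying content; where to draw that line is an application-level choice.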

/04

Streaming

Constant memory. Fingerprint a 40 GB shard the same way you fingerprint a tweet.
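
Constant memory follows from hashing incrementally: only a small running state is kept, never the whole input. The sketch below illustrates that shape with 32-bit FNV-1a as a stand-in. It is not the UCFP algorithm, and `StreamingHasher` is a hypothetical name.

```typescript
// Stand-in for illustration only: 32-bit FNV-1a, NOT the UCFP algorithm.
// The state is a single number, regardless of how much input is fed in.
class StreamingHasher {
  private state = 0x811c9dc5; // FNV-1a 32-bit offset basis

  update(chunk: Uint8Array): this {
    for (let i = 0; i < chunk.length; i++) {
      this.state ^= chunk[i];
      this.state = Math.imul(this.state, 0x01000193) >>> 0; // multiply by FNV prime
    }
    return this;
  }

  digest(): string {
    return this.state.toString(16).padStart(8, "0");
  }
}

// Feeding the input in chunks gives the same digest as feeding it whole.
const wholeInput = Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8]);
const oneShot = new StreamingHasher().update(wholeInput).digest();
const chunked = new StreamingHasher()
  .update(wholeInput.subarray(0, 3))
  .update(wholeInput.subarray(3))
  .digest();
console.log(oneShot === chunked); // true
```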

/05

Content-addressed

Use UCFPs as keys in any KV, blob store, or graph. Provenance becomes a join.

/06

Bindings

TypeScript, Python, Go, Rust, Swift, JVM. WASM build for the browser.

/07

Open spec

UCFP-1 is a public, versioned protocol. No vendor, no rotation, no lock-in.

/08

Self-describing

Every ID carries its modality, version, and bit-width. Parse without context.

If it's bytes, it has a fingerprint.

plain text · markdown · json / yaml · source code · png / jpg / webp · svg · mp3 / wav / flac · mp4 / mov / webm · pdf · parquet / arrow · safetensors · onnx / gguf · embeddings (f32, f16, i8) · arbitrary bytes

Three steps. Zero ceremony.

/01 · INGEST

Hand it bytes.

Pipe a stream, pass a string, point at a file. The SDK detects the modality and routes to the right canonicalizer — NFKC for text, DCT for images, Wang landmarks for audio.
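
For text, NFKC canonicalization maps visually equivalent strings to the same code points before hashing. The sketch below shows only that normalization step, using the standard `String.prototype.normalize`. `canonicalizeText` is a hypothetical helper, and any further steps the real canonicalizer applies are not covered here.

```typescript
// Hypothetical helper showing only the NFKC step; the real canonicalizer
// may do more than this.
function canonicalizeText(text: string): string {
  return text.normalize("NFKC");
}

const withLigature = "\uFB01le"; // "file" spelled with the single-character "fi" ligature
const plainSpelling = "file";

console.log(withLigature === plainSpelling); // false
console.log(canonicalizeText(withLigature) === canonicalizeText(plainSpelling)); // true
```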

/02 · COLLAPSE

Collapse to a vector.

Each modality has a deterministic algorithm: MinHash for text, perceptual hashes for images, Shazam-style landmarks for audio. The result is a single, comparable signature.
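
As an illustration of the text path, here is a textbook MinHash sketch over word shingles. The parameters (3-word shingles, 64 seeded hashes, a 32-bit FNV-1a variant) are assumptions chosen for brevity, not UCFP's actual configuration, and `minhash` and `estimateJaccard` are hypothetical names.

```typescript
// Seeded 32-bit FNV-1a variant over UTF-16 code units (illustrative hash only).
function seededHash(s: string, seed: number): number {
  let h = Math.imul((0x811c9dc5 ^ seed) >>> 0, 0x9e3779b1) >>> 0; // mix the seed in
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Split text into overlapping k-word shingles.
function shingles(text: string, k = 3): string[] {
  const words = text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  const out: string[] = [];
  for (let i = 0; i + k <= words.length; i++) {
    out.push(words.slice(i, i + k).join(" "));
  }
  return out;
}

// MinHash signature: for each seed, keep the minimum hash over all shingles.
function minhash(text: string, numHashes = 64): number[] {
  const sh = shingles(text);
  const signature: number[] = [];
  for (let seed = 0; seed < numHashes; seed++) {
    let min = 0xffffffff;
    for (let j = 0; j < sh.length; j++) {
      const h = seededHash(sh[j], seed);
      if (h < min) min = h;
    }
    signature.push(min);
  }
  return signature;
}

// The fraction of matching slots estimates the Jaccard similarity of the shingle sets.
function estimateJaccard(a: number[], b: number[]): number {
  let same = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) same++;
  }
  return same / a.length;
}

const sigDog = minhash("the quick brown fox jumps over the lazy dog");
const sigCat = minhash("the quick brown fox jumps over the lazy cat");
console.log(estimateJaccard(sigDog, sigDog)); // 1
console.log(estimateJaccard(sigDog, sigCat)); // noisy estimate of the true value, 6/8 = 0.75
```

The two sentences share 6 of 8 distinct shingles, so the estimate should land near 0.75, with the error shrinking as the number of hashes grows.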

/03 · INDEX

Index, query, prove.

Store as a content-addressed key. Query by similarity, dedup by exact match, prove provenance by signature. Each of these operations is O(log n) in the number of indexed fingerprints.
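
A content-addressed index can be sketched as a map keyed by fingerprint, where exact-match dedup falls out of key equality. The example uses an in-memory `Map` and a toy 32-bit hash as a stand-in for a real UCFP. `ContentStore` and `toyFingerprint` are hypothetical names, not SDK APIs.

```typescript
// Toy stand-in for a fingerprint (32-bit FNV-1a); a real deployment would
// use the ID returned by the SDK instead.
function toyFingerprint(bytes: Uint8Array): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    h ^= bytes[i];
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical in-memory content-addressed store.
class ContentStore {
  private blobs = new Map<string, Uint8Array>();

  // Returns the key and whether identical content was already present.
  put(bytes: Uint8Array): { key: string; duplicate: boolean } {
    const key = toyFingerprint(bytes);
    const duplicate = this.blobs.has(key);
    if (!duplicate) this.blobs.set(key, bytes);
    return { key, duplicate };
  }

  get(key: string): Uint8Array | undefined {
    return this.blobs.get(key);
  }

  get size(): number {
    return this.blobs.size;
  }
}

const store = new ContentStore();
const firstUpload = store.put(Uint8Array.from([10, 20, 30]));
const secondUpload = store.put(Uint8Array.from([10, 20, 30])); // same bytes, uploaded again
console.log(firstUpload.key === secondUpload.key, secondUpload.duplicate, store.size); // true true 1
```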

Where teams ship UCFP today.

These aren't abstract — every shape below is a workload UCFP was designed for. Pick the one that matches your bottleneck.

TRAINING DATA

Dedup at corpus scale.

Strip exact and near-duplicate documents from a 12B-token pretraining set in a single MapReduce pass. UCFPs are stable across shards, so the join is just a hash compare — no re-tokenization, no per-modality glue.

· ~840 MB/s · single core · O(1) compare · streaming

CONTENT PROVENANCE

Watermark-free attribution.

Sign every artifact your pipeline produces — checkpoints, generated images, synthetic audio — with its UCFP. Trace any downstream output back to the exact bytes it came from, without modifying the content.

· 256-bit · multibase · self-describing · cross-modal

VECTOR INDEX

A flat namespace for embeddings.

Use UCFPs as the primary key in your vector store. Same fingerprint, same row — across f32, f16, and i8 quantizations of the same model. Eliminates an entire class of dedup bugs in retrieval.

· embedding-aware · quantization-stable · KV-friendly

CDN + EDGE

Cache by content, not by URL.

Compute a UCFP at the edge for any uploaded asset. Two uploads of the same image — different filenames, different MIME hints — collapse to one cache entry. Bandwidth bills go down; hit rates go up.

· <1 ms p99 · WASM build · no central authority

Drop in to what you already run.

PostgreSQL STORAGE
Redis CACHE
S3 / R2 BLOBS
DuckDB ANALYTICS
Parquet COLUMNAR
Arrow INTEROP
Qdrant VECTOR
Vectorize VECTOR
OpenAI EMBEDDINGS
Cohere EMBEDDINGS
HuggingFace MODELS
Triton INFERENCE

Open core. Honest cloud.

The protocol is licensed under Apache-2.0, permanently. The hosted API is metered per call, with a real free tier and no SaaS-shaped pricing tricks.

OPEN
$0 / forever

The full Apache-2.0 SDK. Self-host the runtime; own your fingerprints.

  • Every modality + algorithm
  • WASM, Rust, Python, TS, Go bindings
  • Self-hosted server (axum + redb)
  • No telemetry, no rate limits
  • Support via GitHub issues only

Clone the repo

ENTERPRISE
Talk to us

BYOC, private regions, signed SBOMs, and a designated compliance contact.

  • Unlimited throughput
  • On-prem or private VPC
  • SOC 2 + HIPAA-ready
  • SAML SSO + RBAC
  • Solutions engineering hours

Book a call

Engineers running it.

"We replaced four bespoke dedup pipelines — text, images, audio, parquet — with a single UCFP join. Pretraining throughput went up 2.4×."
Riya N. STAFF MLE · HARMONIC LABS
"Content-addressed by default means every artifact in our pipeline is reproducible by construction. We stopped writing provenance code six months ago."
Marko V. PLATFORM LEAD · NEARFIELD
"The fact that the protocol is open and the spec is short — under fifty pages — is the only reason we shipped it through legal in a single quarter."
Anya O. PRINCIPAL · CTRL/Z RESEARCH

No jargon. Just answers.

How is UCFP different from a normal cryptographic hash?
SHA-256 changes completely if you re-encode an image or normalize whitespace in a JSON document. UCFP is content-addressed in the perceptual sense: minor lossy transforms produce a Hamming-comparable ID, while tampering still flips enough bits to be detectable. You opt into either mode per call.
Is the spec really stable?
Yes. UCFP-1 is on draft 04 and has had no breaking changes in twelve months. Every signature carries its schema version inline, so old fingerprints stay parseable forever — and we will refuse to ship a v2 that breaks v1 reads.
Can I run this without sending data to your servers?
The Apache-2.0 SDK is the entire runtime — server, indices, and all. The hosted Cloud tier is a convenience, not a dependency. Nothing in the protocol requires a network round-trip.
What does it cost in practice?
Open tier: the SDK is free forever. Cloud: $0.000004 per fingerprint after the 10M monthly free quota. A team indexing a million records per day lands around $80 / month. Enterprise is a flat annual contract.
How does it handle GDPR / right-to-be-forgotten?
A UCFP is a one-way fingerprint, not a reversible identifier. Once you delete the underlying record from your storage, the fingerprint stops resolving. We never see your raw bytes on the open or self-host tier.
Why should I trust the throughput numbers?
Every benchmark in the docs is reproducible from a single make bench command in the public repo. CI publishes a fresh number per commit. If your hardware comes back slower, file an issue — we treat regressions as bugs.
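
The cost figure in the pricing answer above ("What does it cost in practice?") can be checked with a few lines of arithmetic. The sketch assumes a 30-day month and that only fingerprints beyond the free quota are billed. `monthlyCloudCost` is a hypothetical helper, not an official calculator.

```typescript
const PRICE_PER_FINGERPRINT = 0.000004; // USD, from the pricing answer
const FREE_QUOTA_PER_MONTH = 10000000;  // 10M free fingerprints per month

// Assumption: billing applies only to fingerprints beyond the free quota.
function monthlyCloudCost(fingerprintsPerDay: number, daysInMonth = 30): number {
  const total = fingerprintsPerDay * daysInMonth;
  const billable = Math.max(0, total - FREE_QUOTA_PER_MONTH);
  return billable * PRICE_PER_FINGERPRINT;
}

// 1M records/day → 30M/month → 20M billable → about $80.
console.log(monthlyCloudCost(1000000).toFixed(2)); // "80.00"
```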

One ID space for every byte you'll ever ship.

Open protocol, hosted runtime, real benchmarks. Ten minutes to your first fingerprint, no account required.