UCFP — Universal Content Fingerprinting

One call. Any input.

The SDK is a single function. Pass anything that can be serialized to bytes — a string, a Buffer, a stream, a file handle, an embedding vector — and get back a UCFP identifier.

IDs are stable across runtimes, platforms, and language bindings. The same input always returns the same fingerprint. Different inputs return different fingerprints with cryptographic confidence.

// Fingerprint anything.
import { fingerprint } from "ucfp";

const id = await fingerprint(input, {
  modality: "auto", // text | image | audio | bytes
  bits: 256,
  encoding: "multibase",
});

// → ucfp1·b3k4q7n2…wpx9
console.log(id.toString());

Three calls. Zero ceremony.

/01 · INGEST →

Hand it bytes.

Pipe a stream, pass a string, point at a file. The SDK detects the modality and routes to the right canonicalizer — NFKC for text, DCT for images, Wang landmarks for audio.
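For a feel of what text canonicalization involves, here is a minimal sketch of NFKC normalization followed by whitespace collapsing. The `canonicalizeText` helper is hypothetical and is not the SDK's canonicalizer; the whitespace rule in particular is an assumption.

```typescript
// Hypothetical helper for illustration; not part of the UCFP SDK.
// NFKC folds compatibility characters (ligatures, full-width letters,
// no-break spaces) into their plain equivalents.
function canonicalizeText(input: string): string {
  return input
    .normalize("NFKC")
    .replace(/\s+/g, " ") // assumption: runs of whitespace collapse to one space
    .trim();
}

// The first string uses the "ﬁ" ligature and a no-break space.
const ligatureForm = canonicalizeText("ﬁle\u00A0 name");
const plainForm = canonicalizeText("file name");
console.log(ligatureForm === plainForm);
```

Two strings that look identical on screen but differ at the byte level end up as the same canonical text, which is what lets the later hashing step treat them as one input.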

/02 · COLLAPSE ≡

Collapse to a vector.

Each modality has a deterministic algorithm: MinHash for text, perceptual hashes for images, Shazam-style landmarks for audio. The result is a single, comparable signature.
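To make the text case concrete, the following is a rough MinHash sketch. The parameters are assumptions chosen for illustration (3-character shingles, 64 hash functions, seeded 32-bit FNV-1a); the source does not state which parameters UCFP uses, and all function names here are hypothetical.

```typescript
// Seeded 32-bit FNV-1a. XOR-ing the seed into the offset basis is a simple
// way to derive a family of hash functions for a sketch like this one.
function fnv1a(text: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Overlapping character n-grams. Duplicates are harmless because MinHash
// only keeps the minimum hash value per hash function.
function shingles(text: string, size = 3): string[] {
  const out: string[] = [];
  for (let i = 0; i + size <= text.length; i++) out.push(text.slice(i, i + size));
  return out;
}

// Signature: for each hash function, the smallest hash over all shingles.
function minhash(text: string, numHashes = 64): number[] {
  const sig: number[] = [];
  for (let k = 0; k < numHashes; k++) sig.push(0xffffffff);
  for (const s of shingles(text)) {
    for (let k = 0; k < numHashes; k++) {
      const h = fnv1a(s, k);
      if (h < sig[k]) sig[k] = h;
    }
  }
  return sig;
}

// The fraction of matching slots estimates the Jaccard similarity of the
// two shingle sets.
function estimateJaccard(a: number[], b: number[]): number {
  let same = 0;
  for (let k = 0; k < a.length; k++) if (a[k] === b[k]) same++;
  return same / a.length;
}

const near = estimateJaccard(
  minhash("the quick brown fox jumps over the lazy dog"),
  minhash("the quick brown fox jumps over the lazy cat"),
);
console.log(near);
```

Near-duplicate texts share most shingles, so most signature slots agree; unrelated texts agree on almost none.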

/03 · INDEX ∎

Index, query, prove.

Store as a content-addressed key. Query by similarity, dedup by exact match, prove provenance by signature. Each index operation is O(log n) in the number of stored fingerprints.

TRAINING DATA

Dedup at corpus scale.

Strip exact and near-duplicate documents from a 12B-token pretraining set in a single MapReduce pass. UCFPs are stable across shards, so the join is just a hash compare — no re-tokenization, no per-modality glue.
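The exact-match half of that join can be sketched in a few lines. This assumes each record already carries its fingerprint as a string in a hypothetical `ucfp` field; the record shape and the `dedupShards` name are illustrative, not the SDK's API.

```typescript
// Assumption: documents arrive with a precomputed fingerprint string in a
// hypothetical `ucfp` field. The SDK's real record shape is not specified.
interface FingerprintedDoc {
  ucfp: string;
  text: string;
}

// Merge shards, keeping the first document seen for each fingerprint.
function dedupShards(shards: FingerprintedDoc[][]): FingerprintedDoc[] {
  const seen = new Set<string>();
  const kept: FingerprintedDoc[] = [];
  for (const shard of shards) {
    for (const doc of shard) {
      if (seen.has(doc.ucfp)) continue; // exact match is a plain key compare
      seen.add(doc.ucfp);
      kept.push(doc);
    }
  }
  return kept;
}

const shardA = [{ ucfp: "fp-1", text: "hello" }, { ucfp: "fp-2", text: "world" }];
const shardB = [{ ucfp: "fp-2", text: "world" }, { ucfp: "fp-3", text: "again" }];
console.log(dedupShards([shardA, shardB]).map((d) => d.ucfp));
```

This covers exact duplicates only. Near-duplicate removal needs a similarity comparison on the signatures rather than key equality.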

· ~840 MB/s · single core · O(1) compare · streaming

CONTENT PROVENANCE

Watermark-free attribution.

Sign every artifact your pipeline produces — checkpoints, generated images, synthetic audio — with its UCFP. Trace any downstream output back to the exact bytes it came from, without modifying the content.

· 256-bit · multibase · self-describing · cross-modal

VECTOR INDEX

A flat namespace for embeddings.

Use UCFPs as the primary key in your vector store. Same fingerprint, same row — across f32, f16, and i8 quantizations of the same model. Eliminates an entire class of dedup bugs in retrieval.

· embedding-aware · quantization-stable · KV-friendly

CDN + EDGE

Cache by content, not by URL.

Compute a UCFP at the edge for any uploaded asset. Two uploads of the same image — different filenames, different MIME hints — collapse to one cache entry. Bandwidth bills go down; hit rates go up.

· <1 ms p99 · WASM build · no central authority

Engineers running it.

"We replaced four bespoke dedup pipelines — text, images, audio, parquet — with a single UCFP join. Pretraining throughput went up 2.4×."

Riya N. STAFF MLE · HARMONIC LABS

"Content-addressed by default means every artifact in our pipeline is reproducible by construction. We stopped writing provenance code six months ago."

Marko V. PLATFORM LEAD · NEARFIELD

"The fact that the protocol is open and the spec is short — under fifty pages — is the only reason we shipped it through legal in a single quarter."

Anya O. PRINCIPAL · CTRL/Z RESEARCH

No jargon. Just answers.

How is UCFP different from a normal cryptographic hash?

SHA-256 changes completely if you re-encode an image or normalize whitespace in a JSON document. UCFP is content-addressed in the perceptual sense: minor lossy transforms produce an ID that stays close to the original in Hamming distance, while tampering still flips enough bits to be detectable. You choose exact or perceptual mode per call.
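Comparing two perceptual IDs comes down to counting differing bits. The sketch below assumes fingerprints are available as equal-length byte arrays; the `isNearDuplicate` helper and its default threshold are placeholders, since the source does not say what distance counts as a match.

```typescript
// Hamming distance: the number of bit positions where two equal-length
// fingerprints differ.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("fingerprints must have equal length");
  }
  let bits = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i];
    while (x) {
      x &= x - 1; // clear the lowest set bit
      bits++;
    }
  }
  return bits;
}

// Assumption: the threshold is application-specific; 10 is only a placeholder.
function isNearDuplicate(a: Uint8Array, b: Uint8Array, maxBits = 10): boolean {
  return hammingDistance(a, b) <= maxBits;
}

const original = new Uint8Array([0xff, 0x00, 0xaa]);
const reencoded = new Uint8Array([0x0f, 0x01, 0xaa]);
console.log(hammingDistance(original, reencoded), isNearDuplicate(original, reencoded));
```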

Is the spec really stable?

Yes. UCFP-1 is on draft 04 and has had no breaking changes in twelve months. Every signature carries its schema version inline, so old fingerprints stay parseable forever — and we will refuse to ship a v2 that breaks v1 reads.

Can I run this without sending data to your servers?

The Apache-2.0 SDK is the entire runtime — server, indices, and all. The hosted Cloud tier is a convenience, not a dependency. Nothing in the protocol requires a network round-trip.

What does it cost in practice?

Open tier: the SDK is free forever. Cloud: $0.000004 per fingerprint after the 10M monthly free quota. A team indexing a million records per day lands around $80 / month. Enterprise is a flat annual contract.
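The $80 figure follows from the listed price and free quota. Here is the arithmetic as a small sketch, assuming a 30-day month; the function and constant names are illustrative only.

```typescript
const PRICE_PER_FINGERPRINT = 0.000004; // USD, Cloud price quoted above
const FREE_MONTHLY_QUOTA = 10_000_000; // fingerprints per month

// Assumption: a 30-day billing month unless stated otherwise.
function monthlyCloudCost(fingerprintsPerDay: number, daysInMonth = 30): number {
  const total = fingerprintsPerDay * daysInMonth;
  const billable = Math.max(0, total - FREE_MONTHLY_QUOTA);
  return billable * PRICE_PER_FINGERPRINT;
}

// 1M/day is 30M/month; 20M of that is billable: 20,000,000 × $0.000004.
console.log(monthlyCloudCost(1_000_000).toFixed(2)); // "80.00"
```

Anything at or below roughly 333,000 fingerprints per day stays inside the free quota under the same 30-day assumption.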

How does it handle GDPR / right-to-be-forgotten?

A UCFP is a one-way fingerprint, not a reversible identifier. Once you delete the underlying record from your storage, the fingerprint stops resolving. We never see your raw bytes on the open or self-host tier.

Why should I trust the throughput numbers?

Every benchmark in the docs is reproducible from a single make bench command in the public repo. CI publishes a fresh number per commit. If your hardware comes back slower, file an issue — we treat regressions as bugs.

Fingerprint anything, deterministically.

A flat namespace for all content.

Deterministic

Modality-agnostic

Perceptual mode

Streaming

Content-addressed

Bindings

Open spec

Self-describing

If it's bytes, it has a fingerprint.

Where teams ship UCFP today.

Drop in to what you already run.

Open core. Honest cloud.

One ID space for every byte you'll ever ship.
