UCFP is a protocol and runtime for producing stable, content-addressed identifiers across any modality: text, images, audio, video, code, embeddings, datasets, model weights. One algorithm. One ID space.
The SDK is a single function. Pass anything that can be serialized to bytes — a string, a Buffer, a stream, a file handle, an embedding vector — and get back a UCFP identifier.
IDs are stable across runtimes, platforms, and language bindings. The same input always returns the same fingerprint. Different inputs return different fingerprints with cryptographic confidence.
// Fingerprint anything.
import { fingerprint } from "ucfp";

const id = await fingerprint(input, {
  modality: "auto", // text | image | audio | bytes
  bits: 256,
  encoding: "multibase",
});

// → ucfp1·b3k4q7n2…wpx9
console.log(id.toString());
Every UCFP is comparable to every other UCFP — across modality, source, and time. Build dedup, provenance, and search on a single primitive.
Same bytes in, same fingerprint out — for the next twenty years, on any machine.
Text, images, audio, video, embeddings, weights — all collapse to one ID space.
Opt into Hamming-comparable IDs that survive resizing, re-encoding, and minor edits.
Constant memory. Fingerprint a 40 GB shard the same way you fingerprint a tweet.
Use UCFPs as keys in any KV, blob store, or graph. Provenance becomes a join.
TypeScript, Python, Go, Rust, Swift, JVM. WASM build for the browser.
UCFP-1 is a public, versioned protocol. No vendor, no rotation, no lock-in.
Every ID carries its modality, version, and bit-width. Parse without context.
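For the opt-in Hamming-comparable IDs, similarity comes down to counting the bits that differ between two fingerprints of the same width. Here is a minimal sketch, assuming the raw fingerprint bits are available as a `Uint8Array`. How the SDK exposes raw bytes is not described here, so `hammingDistance` and the sample values are illustrative only.

```typescript
// Count the bits that differ between two equal-length byte arrays.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  if (a.length !== b.length) {
    throw new Error("fingerprints must have the same bit width");
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i]; // set bits mark positions where the bytes differ
    while (diff !== 0) {
      diff &= diff - 1; // clear the lowest set bit
      distance++;
    }
  }
  return distance;
}

// Made-up 32-bit values standing in for an image and a resized copy.
const originalBits = Uint8Array.from([0b10110010, 0x4f, 0x00, 0xff]);
const resizedBits = Uint8Array.from([0b10110011, 0x4f, 0x01, 0xff]);

console.log(hammingDistance(originalBits, resizedBits)); // 2
```

A small distance relative to the bit width suggests a near-duplicate. The right threshold depends on the modality and is something to tune on your own data.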
Pipe a stream, pass a string, point at a file. The SDK detects the modality and routes to the right canonicalizer — NFKC for text, DCT for images, Wang landmarks for audio.
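To illustrate why text is canonicalized before hashing, here is a small sketch using the standard `String.prototype.normalize("NFKC")`. The `canonicalizeText` helper is a hypothetical name, and the actual UCFP text canonicalizer may apply further steps that are not described here.

```typescript
// Canonicalize text so that visually equivalent strings map to the same bytes.
function canonicalizeText(text: string): Uint8Array {
  return new TextEncoder().encode(text.normalize("NFKC"));
}

// Precomposed "é" and the "ﬁ" ligature vs. "e" + combining accent and plain "fi".
const composed = "caf\u00e9 \ufb01lter";
const decomposed = "cafe\u0301 filter";

const equalAsStrings = composed === decomposed;
const bytesA = canonicalizeText(composed);
const bytesB = canonicalizeText(decomposed);
const equalAfterNfkc =
  bytesA.length === bytesB.length && bytesA.every((byte, i) => byte === bytesB[i]);

console.log(equalAsStrings, equalAfterNfkc); // false true
```

Without this step, two strings that render identically would produce different bytes and therefore different exact-match fingerprints.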
Each modality has a deterministic algorithm: MinHash for text, perceptual hashes for images, Shazam-style landmarks for audio. The result is a single, comparable signature.
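As a rough illustration of the MinHash idea for text, the sketch below builds a signature from word shingles and estimates Jaccard similarity from the fraction of matching slots. It is a generic textbook-style version, not the UCFP-1 algorithm: the shingle size, the number of hash functions, and the FNV-1a-style `seededHash` are all assumptions chosen for brevity.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units, seeded and finished with a mixing step.
function seededHash(value: string, seed: number): number {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < value.length; i++) {
    h ^= value.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // force an unsigned 32-bit result
}

// Overlapping runs of `size` lowercase words.
function shingles(text: string, size = 3): string[] {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i + size <= words.length; i++) {
    out.push(words.slice(i, i + size).join(" "));
  }
  return out;
}

// For each of `numHashes` seeded hash functions, keep the minimum hash over all shingles.
function minhash(text: string, numHashes = 128): Uint32Array {
  const signature = new Uint32Array(numHashes).fill(0xffffffff);
  for (const shingle of shingles(text)) {
    for (let k = 0; k < numHashes; k++) {
      const h = seededHash(shingle, k);
      if (h < signature[k]) signature[k] = h;
    }
  }
  return signature;
}

// The fraction of matching slots estimates the Jaccard similarity of the shingle sets.
function estimateJaccard(a: Uint32Array, b: Uint32Array): number {
  let equal = 0;
  for (let k = 0; k < a.length; k++) {
    if (a[k] === b[k]) equal++;
  }
  return equal / a.length;
}

const docA = "the quick brown fox jumps over the lazy dog near the quiet river bank";
const docB = "the quick brown fox jumps over the lazy dog near the noisy river bank";
const docC = "content addressed identifiers make provenance tracking a simple database join";

const nearDuplicate = estimateJaccard(minhash(docA), minhash(docB));
const unrelated = estimateJaccard(minhash(docA), minhash(docC));
console.log({ nearDuplicate, unrelated });
```

Here `docA` and `docB` share 9 of 15 distinct shingles (true Jaccard 0.6), so the estimate should land near that value, while the unrelated pair should score close to zero.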
Store as a content-addressed key. Query by similarity, dedup by exact match, prove provenance by signature. Every operation is O(log n) at scale.

These aren't abstract — every shape below is a workload UCFP was designed for. Pick the one that matches your bottleneck.
Strip exact and near-duplicate documents from a 12B-token pretraining set in a single MapReduce pass. UCFPs are stable across shards, so the join is just a hash compare — no re-tokenization, no per-modality glue.
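The exact-match half of that join can be sketched in a few lines: group by fingerprint and keep the first document per key. This is a single-process illustration of the reduce step, not a MapReduce job, and `standInFingerprint` is a placeholder so the example runs without the SDK. It does not produce real UCFPs.

```typescript
interface Doc {
  shard: string;
  text: string;
}

// Keep the first item seen for each fingerprint and drop later exact duplicates.
function dedupByFingerprint<T>(items: T[], fingerprintOf: (item: T) => string): T[] {
  const seen = new Set<string>();
  const kept: T[] = [];
  for (const item of items) {
    const id = fingerprintOf(item);
    if (!seen.has(id)) {
      seen.add(id);
      kept.push(item);
    }
  }
  return kept;
}

// Placeholder key: NFKC-normalized text. A real pipeline would use precomputed UCFPs here.
const standInFingerprint = (doc: Doc): string => doc.text.normalize("NFKC");

const corpus: Doc[] = [
  { shard: "shard-0", text: "hello world" },
  { shard: "shard-1", text: "hello world" }, // exact duplicate from another shard
  { shard: "shard-1", text: "something else" },
];

const deduped = dedupByFingerprint(corpus, standInFingerprint);
console.log(deduped.map((doc) => `${doc.shard}: ${doc.text}`));
```

Because the key is computed independently per document, each shard can fingerprint its own data and only the keys need to meet at the join.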
Sign every artifact your pipeline produces — checkpoints, generated images, synthetic audio — with its UCFP. Trace any downstream output back to the exact bytes it came from, without modifying the content.
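One way to picture the trace-back is as a lineage table keyed by fingerprint, where each record lists the fingerprints of its direct inputs. The sketch below assumes that shape. `LineageRecord`, `traceAncestors`, and the `id:`-prefixed strings are hypothetical placeholders, not part of the UCFP protocol or SDK.

```typescript
// Hypothetical lineage record: an artifact's ID plus the IDs of its direct inputs.
interface LineageRecord {
  id: string;
  inputs: string[];
}

// Walk the lineage table and collect every upstream ID, direct or transitive.
function traceAncestors(id: string, records: Map<string, LineageRecord>): string[] {
  const visited = new Set<string>();
  const stack: string[] = [id];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    const record = records.get(current);
    for (const parent of record ? record.inputs : []) {
      if (!visited.has(parent)) {
        visited.add(parent);
        stack.push(parent);
      }
    }
  }
  return Array.from(visited).sort();
}

// Placeholder IDs standing in for real fingerprints.
const lineage = new Map<string, LineageRecord>();
for (const record of [
  { id: "id:checkpoint", inputs: ["id:dataset", "id:config"] },
  { id: "id:image", inputs: ["id:checkpoint", "id:prompt"] },
]) {
  lineage.set(record.id, record);
}

console.log(traceAncestors("id:image", lineage));
// [ 'id:checkpoint', 'id:config', 'id:dataset', 'id:prompt' ]
```

Since the IDs are derived from content, the records live alongside the artifacts and the artifacts themselves stay untouched.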
Use UCFPs as the primary key in your vector store. Same fingerprint, same row — across f32, f16, and i8 quantizations of the same model. Eliminates an entire class of dedup bugs in retrieval.
Compute a UCFP at the edge for any uploaded asset. Two uploads of the same image — different filenames, different MIME hints — collapse to one cache entry. Bandwidth bills go down; hit rates go up.
The protocol is permissively licensed under Apache-2.0, and will stay that way. The hosted API is metered per call, with a real free tier and no SaaS-shaped pricing tricks.
The full Apache-2.0 SDK. Self-host the runtime; own your fingerprints.
Hosted UCFP API on Cloudflare. Scale-to-zero, regional, signed responses.
BYOC, private regions, signed SBOMs, and a designated compliance contact.
"We replaced four bespoke dedup pipelines — text, images, audio, parquet — with a single UCFP join. Pretraining throughput went up 2.4×."
"Content-addressed by default means every artifact in our pipeline is reproducible by construction. We stopped writing provenance code six months ago."
"The fact that the protocol is open and the spec is short — under fifty pages — is the only reason we shipped it through legal in a single quarter."
Open protocol, hosted runtime, real benchmarks. Ten minutes to your first fingerprint, zero account required.