Examples
Three full recipes. Copy, adapt, ship.
1. Dedup a pretraining corpus
You have 50 M short documents. You want to drop near-duplicates before training a language model. Use MinHash + LSH for cheap bucket lookup.
    import itertools

    import requests

    API = "https://ucfp.dev/api/fingerprint"
    KEY = "ucfp_…"
    HEADERS = {"Authorization": f"Bearer {KEY}", "Content-Type": "application/json"}

    def fingerprint(text: str) -> dict:
        """Request an LSH fingerprint for one document."""
        r = requests.post(API, headers=HEADERS, timeout=30, json={
            "text": text,
            "params": {"algorithm": "lsh", "h": 128, "k": 5}
        })
        r.raise_for_status()
        return r.json()

    seen_buckets: set[tuple] = set()
    kept = []
    # corpus() is your own iterator yielding {"text": ...} dicts.
    for doc in itertools.islice(corpus(), 1_000_000):
        fp = fingerprint(doc["text"])
        bucket = tuple(fp["bands"][:3])  # rough bucket key
        if bucket in seen_buckets:
            continue  # near-duplicate of a document we already kept
        seen_buckets.add(bucket)
        kept.append(doc)

    print(f"kept {len(kept):,} of {1_000_000:,}")

Practical notes:
- Send in batches of 100–500 with `aiohttp` to amortise TLS. The server is happy with concurrency up to your per-minute budget.
- Persist the `record_id` per kept document so you can later compute exact Jaccard between any pair via `GET /v1/records/{tid}/{rid}`.
- For full Jaccard similarity (not just bucket-equality), pull the full fingerprint with `?include=fingerprint` and compute MinHash overlap client-side.
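The client-side MinHash overlap mentioned in the last note can be sketched as follows. This assumes the full fingerprint is a flat list of `h` integer hash values; that response shape is an assumption, not something these recipes confirm, so adapt the input to whatever `?include=fingerprint` actually returns. The helper name `estimate_jaccard` is hypothetical.

```python
from typing import Sequence


def estimate_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Estimate Jaccard similarity as the fraction of matching MinHash slots.

    Assumes both signatures were produced with the same parameters
    (same h, same k), so slot i in one is comparable to slot i in the other.
    """
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    matching = sum(a == b for a, b in zip(sig_a, sig_b))
    return matching / len(sig_a)


# Toy 4-slot signatures (real ones would have h = 128 slots).
print(estimate_jaccard([11, 42, 7, 93], [11, 42, 8, 93]))
```

The result is an estimate whose variance shrinks as `h` grows, so treat values close to your duplicate threshold with some caution.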
2. Image dedup at upload
Every time a user uploads an image, you want to (a) reject exact dupes; (b) flag near-dupes for moderator review.
    // `db` and `UploadDecision` are provided by your application.
    async function ingestUpload(file: File): Promise<UploadDecision> {
      const res = await fetch('https://ucfp.dev/api/fingerprint', {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${process.env.UCFP_KEY}`,
          'Content-Type': file.type
        },
        body: file
      });
      if (!res.ok) throw new Error(`fingerprint request failed: ${res.status}`);
      const fp = await res.json();

      // Assumes phash_hex is converted to the integer stored in the BIGINT
      // `phash` column (by your driver, or before binding).

      // 1. Exact match: same pHash ⇒ treated as the same image.
      const exact = await db.query(
        'SELECT id FROM uploads WHERE phash = ? LIMIT 1',
        [fp.phash_hex]
      );
      if (exact.length) return { decision: 'reject_exact', existing: exact[0].id };

      // 2. Near match: Hamming distance ≤ 8 on the 64-bit pHash.
      const near = await db.query(
        'SELECT id, BIT_COUNT(phash ^ ?) AS d FROM uploads HAVING d <= 8 ORDER BY d ASC LIMIT 5',
        [fp.phash_hex]
      );
      if (near.length) return { decision: 'review', similar: near };

      // 3. Novel: persist and accept.
      await db.execute(
        'INSERT INTO uploads (id, phash, ucfp_record) VALUES (?, ?, ?)',
        [crypto.randomUUID(), fp.phash_hex, fp.record_id]
      );
      return { decision: 'accept' };
    }

Practical notes:
- Use `algorithm=multi` for the strongest signal; if you need raw 64-bit pHash for an indexed Hamming column, request `algorithm=phash` separately.
- Storage: the 64-bit pHash fits in a `BIGINT` column. Build a Hamming-distance index by partitioning on the high 16 bits (BK-tree style) for sub-linear neighbour search.
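To make the storage note concrete, here is a minimal Python sketch of the three operations involved: Hamming distance between two hex pHashes, conversion to a signed 64-bit integer, and extraction of the high 16 bits as a partition key. It assumes `phash_hex` is a 16-character hex string and that the `BIGINT` column is signed; both are assumptions, and all helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def hamming64(a_hex: str, b_hex: str) -> int:
    """Number of differing bits between two 64-bit hashes given as hex."""
    return bin((int(a_hex, 16) ^ int(b_hex, 16)) & MASK64).count("1")


def to_signed_bigint(phash_hex: str) -> int:
    """Map an unsigned 64-bit hex value into the signed 64-bit range."""
    value = int(phash_hex, 16) & MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def high16_key(phash_hex: str) -> int:
    """High 16 bits of the hash, usable as a coarse partition key."""
    return (int(phash_hex, 16) & MASK64) >> 48


a = "ffffffffffffffff"
b = "fffffffffffffff0"
print(hamming64(a, b))       # the two hashes differ in the low 4 bits
print(to_signed_bigint(a))   # all bits set is -1 as a signed 64-bit value
print(hex(high16_key(b)))
```

An exact lookup on the high 16 bits alone misses near-duplicates whose differing bits fall inside that segment, so the partition key works best as a first-pass filter with a full Hamming check on the candidates.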
3. Audio matching against a catalogue
You want a Shazam-style "what song is this?" service over a 100k-track catalogue.
Pre-fingerprint the catalogue once:
    # Path shape is /v1/ingest/audio/{tid}/{rid}: here {tid} is 17 and
    # {rid} is the file name without the .flac extension.
    for track in catalogue/*.flac; do
      curl -sS -X POST \
        "https://ucfp.dev/v1/ingest/audio/17/$(basename "$track" .flac)?algorithm=wang&sample_rate=44100" \
        -H "Authorization: Bearer ucfp_…" \
        -H "Content-Type: audio/flac" \
        --data-binary "@$track" \
        > "/tmp/$(basename "$track" .flac).json"
    done

Match a clip:
    curl -sS -X POST \
      'https://ucfp.dev/api/query?modality=audio&algorithm=wang&top=5' \
      -H 'Authorization: Bearer ucfp_…' \
      -H 'Content-Type: audio/wav' \
      --data-binary @clip.wav

Response (sketch):
    {
      "matches": [
        { "record_id": "track_00471", "score": 0.94, "offset_ms": 12340 },
        { "record_id": "track_19022", "score": 0.41, "offset_ms": 0 }
      ]
    }

A score above ~0.7 is a confident match for Wang landmarks; below ~0.4 is noise.
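The thresholds above can be applied to a response like the sketch with a few lines of Python. The cut-offs are the approximate values quoted in the text. The "uncertain" label for scores between them is an assumption of this example, since the recipe does not say how to treat that range, and the helper names are hypothetical.

```python
from typing import Optional

CONFIDENT = 0.7  # approximate threshold quoted for Wang landmarks
NOISE = 0.4      # approximate noise floor quoted for Wang landmarks


def classify(score: float) -> str:
    """Bucket a match score; 'uncertain' is this sketch's own label."""
    if score > CONFIDENT:
        return "match"
    if score < NOISE:
        return "noise"
    return "uncertain"


def best_match(response: dict) -> Optional[dict]:
    """Return the top-scoring entry if it is a confident match, else None."""
    matches = response.get("matches", [])
    if not matches:
        return None
    top = max(matches, key=lambda m: m["score"])
    return top if classify(top["score"]) == "match" else None


response = {
    "matches": [
        {"record_id": "track_00471", "score": 0.94, "offset_ms": 12340},
        {"record_id": "track_19022", "score": 0.41, "offset_ms": 0},
    ]
}
print(best_match(response))
```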
Practical notes:
- Live recognition: use the streaming subroute `POST /v1/ingest/audio/{tid}/{rid}/stream` to keep a connection open and score against the catalogue as audio arrives.
- For pitch- or tempo-shifted versions (DJ mixes, sped-up uploads), use `algorithm=panako` instead of `wang`.
- For "songs that sound like this" rather than literal matches, use `algorithm=neural` with `model_id=clap-htsat`.
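One way to keep the algorithm choice in a single place is a small lookup that builds the query URL. This is a sketch under stated assumptions: the use-case labels are hypothetical, and passing `model_id` as a query parameter alongside `algorithm` is assumed rather than confirmed by these recipes. The endpoint and the `modality`, `algorithm`, and `top` parameters are taken from the "Match a clip" example.

```python
from urllib.parse import urlencode

QUERY_URL = "https://ucfp.dev/api/query"  # endpoint from the clip-matching example

# Keys are hypothetical use-case labels; algorithm names come from the notes above.
ALGORITHM_FOR = {
    "literal": {"algorithm": "wang"},
    "pitch_or_tempo_shifted": {"algorithm": "panako"},
    # Assumption: model_id travels as a query parameter next to algorithm.
    "sounds_like": {"algorithm": "neural", "model_id": "clap-htsat"},
}


def query_url(use_case: str, top: int = 5) -> str:
    """Build an audio query URL for one of the use cases above."""
    params = {"modality": "audio", **ALGORITHM_FOR[use_case], "top": top}
    return f"{QUERY_URL}?{urlencode(params)}"


print(query_url("literal"))
print(query_url("sounds_like"))
```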
More recipes
- See API reference: text, image, audio for every parameter.
- The JS SDK and Python SDK wrap these recipes for you.