June 12 — culling rules that find nothing, and a bucket that wasn't what it looked like

Now that every file's metadata is in a database, the next phase is generating candidates for deletion — never deleting anything, just surfacing "you probably don't want these" by rule. The photo library itself stays the source of truth; this phase only writes CSV lists and a summary. That read-only discipline turned out to matter, because several of the rules were wrong in instructive ways.

Built

A set of documented candidate rules, split into a high-confidence tier and a human-review tier: exact duplicates, orphaned short clips, screenshots, burst extras, blurry shots, and obvious junk. Each rule is a query that emits a CSV; nothing mutates the database except one approved write (below). I also built a sampler that pulls a reproducible random 40 rows per rule and renders them into HTML contact sheets, so I could actually look at what each rule was flagging instead of trusting the count.
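A minimal sketch of what a reproducible sampler of this kind could look like. It assumes each rule's CSV has already been read into a list of dicts with hypothetical `path` and `thumb` columns; the seed value, function names, and HTML layout are illustrative assumptions, not the actual code from this project.

```python
import html
import random


def sample_rows(rows, rule_name, k=40, seed=612):
    """Return up to k rows, the same ones on every run for a given rule."""
    # Sort first so the sample does not depend on the order the query returned.
    ordered = sorted(rows, key=lambda r: r["path"])
    if len(ordered) <= k:
        return ordered
    # String seeds are hashed deterministically, so each rule gets its own
    # stable random stream, independent of the other rules.
    rng = random.Random(f"{seed}:{rule_name}")
    return rng.sample(ordered, k)


def render_contact_sheet(rows, rule_name):
    """Render sampled rows as a bare-bones HTML page of thumbnails."""
    cells = []
    for row in rows:
        src = html.escape(row["thumb"], quote=True)   # hypothetical column
        label = html.escape(row["path"])              # hypothetical column
        cells.append(
            f'<figure><img src="{src}" width="160">'
            f"<figcaption>{label}</figcaption></figure>"
        )
    title = html.escape(rule_name)
    header = f"<!doctype html>\n<title>{title}</title>\n<h1>{title} ({len(rows)} sampled)</h1>\n"
    return header + "\n".join(cells)
```

Seeding per rule name means adding or removing one rule doesn't reshuffle the samples for the others, so a contact sheet reviewed yesterday shows the same 40 items today.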

Problems & fixes

The bucket that wasn't junk

The single biggest flagged bucket — over 5,000 items, hundreds of gigabytes — came through as "shared album." I only caught what it really was by rendering the thumbnails and looking: it's all my own footage from a wearable camera, tagged with a naming signature my rule had misread. It must never be a delete candidate. The "shared album" flag didn't mean what I assumed it meant; it had been set entirely by that one import.
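A cheap check that could catch this kind of thing earlier is asking how concentrated a flag is across imports. The sketch below assumes rows carry a hypothetical boolean `shared_album` column and a hypothetical `import_batch` column; neither name comes from the real schema.

```python
from collections import Counter


def flag_concentration(rows, flag="shared_album", source="import_batch"):
    """Return (top_source, share) among rows where the flag is set.

    A share near 1.0 means a single import set the flag almost everywhere,
    which suggests the flag describes that import rather than the content.
    """
    counts = Counter(r[source] for r in rows if r.get(flag))
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    top, n = counts.most_common(1)[0]
    return top, n / total
```

It doesn't replace looking at the thumbnails; it only says which buckets deserve a look first.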

That's the whole lesson of the day. The rules are confident and wrong all the time. The contact sheet is what keeps you honest.

Decisions

Learned

Still open

The perceptual near-dup pass, a texture-aware blur guard, smarter burst handling that scales the keep-count to the cluster size, and designing the decision schema before any tagging touches the database.
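For the burst item, one possible shape, sketched only to make the idea concrete: keep roughly the square root of the cluster size. The square-root rule, the `sharpness` score, and both function names are placeholder assumptions, not decisions.

```python
import math


def keep_count(cluster_size):
    """Number of frames to keep from a burst: ceil(sqrt(n)), at least 1."""
    if cluster_size <= 0:
        return 0
    # isqrt(n - 1) + 1 equals ceil(sqrt(n)) for n >= 1, with no float rounding.
    return math.isqrt(cluster_size - 1) + 1


def burst_extras(cluster, score_key="sharpness"):
    """Return the frames that would become review candidates, best frames kept."""
    ranked = sorted(cluster, key=lambda f: f[score_key], reverse=True)
    return ranked[keep_count(len(ranked)):]
```

Under this rule a 4-frame burst keeps 2 frames and a 100-frame burst keeps 10, so long bursts shed proportionally more.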