
June 9 — building a photo-metadata pipeline with no sudo and a flaky network

The plan for the whole project is simple to say and hard to do: I have on the order of 91,000 photos and videos accumulated over twenty years, and I want to cull them systematically instead of by guilt and vibes. The first step is boring and load-bearing — read every file's metadata into a database so I can ask real questions before I delete anything. This was day one of building that ingest tool.

Built

A single ingest script, grown over four iterations and tested on real directories at each step before moving on. It walks the library, pulls EXIF and file metadata, computes a content hash, computes perceptual hashes and a blur score for images, pairs up Live Photos, and writes it all to a SQLite database. The database lives on a local SSD with write-ahead logging, never on the network share — SQLite and network filesystems despise each other over small writes and locking.
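A minimal sketch of opening the database in write-ahead-logging mode, using only Python's standard `sqlite3` module. The function name `open_db` and the `synchronous=NORMAL` setting are assumptions for illustration, not details from the actual script.

```python
import sqlite3
from pathlib import Path


def open_db(db_path: Path) -> sqlite3.Connection:
    """Open (or create) a SQLite database on local disk and enable WAL."""
    conn = sqlite3.connect(db_path)
    # The PRAGMA returns the journal mode actually in effect, so check it
    # rather than assuming the switch succeeded.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"could not enable WAL, journal mode is {mode!r}")
    # Assumption (not from the post): relax fsync frequency, a common
    # pairing with WAL for bulk ingest workloads.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Checking the returned mode matters because SQLite reports the mode it ended up with instead of raising an error when WAL cannot be enabled.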

One design choice paid off immediately: I created the full table schema — every column and index — up front, so later iterations only backfill columns and never have to migrate. Resume is built in: skip a file if it's already processed and its size and mtime are unchanged, and never re-hash a file that hasn't moved.
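The schema-up-front and resume ideas can be sketched roughly as follows. The table name, column names, and the helper `needs_processing` are hypothetical; the real schema has more columns than shown here.

```python
import sqlite3

# Hypothetical schema: columns for later iterations exist from day one
# and simply stay NULL until a later pass backfills them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    path      TEXT PRIMARY KEY,
    size      INTEGER NOT NULL,
    mtime_ns  INTEGER NOT NULL,
    sha256    TEXT,   -- content hash
    phash     TEXT,   -- perceptual hash, filled in by a later iteration
    blur      REAL    -- blur score, filled in by a later iteration
);
CREATE INDEX IF NOT EXISTS idx_files_sha256 ON files(sha256);
"""


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def needs_processing(conn: sqlite3.Connection, path: str,
                     size: int, mtime_ns: int) -> bool:
    """True if the file is new, or its size or mtime differs from the stored row."""
    row = conn.execute(
        "SELECT size, mtime_ns FROM files WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row != (size, mtime_ns)
```

In a walk loop this would be called with the values from `os.stat(path)` (`st_size` and `st_mtime_ns`) before any expensive read of the file.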

Problems & fixes

The brief said the metadata tools were "already installed." They were not — and there's no passwordless sudo on the box, so I couldn't just install them. The fix was to vendor the standalone Perl distribution of the EXIF tool and call it directly. No root required, fully self-contained.
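Calling a vendored Perl tool directly might look like the sketch below. It assumes the tool is ExifTool (the post does not name it), that `perl` is on the PATH, and that the distribution was unpacked at the hypothetical path `vendor/exiftool/exiftool`. The `-json` and `-n` options are ExifTool's JSON output and numeric (no print conversion) flags.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical location of the unpacked standalone distribution.
VENDORED_TOOL = Path("vendor/exiftool/exiftool")


def build_command(files: list[str], tool: Path = VENDORED_TOOL) -> list[str]:
    """Run the script through perl explicitly; no install step, no root."""
    return ["perl", str(tool), "-json", "-n", *files]


def read_metadata(files: list[str], tool: Path = VENDORED_TOOL) -> list[dict]:
    """Return one metadata dict per readable file."""
    result = subprocess.run(
        build_command(files, tool),
        capture_output=True,
        text=True,
        check=False,  # a bad file in the batch should not discard the rest
    )
    return json.loads(result.stdout) if result.stdout.strip() else []
```

Passing many files per invocation amortizes the Perl startup cost, which tends to dominate when the tool is launched once per file.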

Then the big Python image libraries kept dying mid-download on a flaky TLS connection (DECRYPTION_FAILED_OR_BAD_RECORD_MAC, over and over). The fix was a small script that fetches each wheel with a resumable download, then installs from the local folder with no index. One gotcha worth writing down: the computer-vision wheels ship as abi3 (cp37-abi3), not cp312 — a selector keyed on the exact Python version silently misses them.
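The abi3 gotcha comes down to how wheel filenames encode compatibility. Here is a rough sketch of a selector that accepts both exact-version and abi3 wheels for CPython 3.x. The function name is hypothetical, the platform tag is deliberately ignored, and the filenames in the usage note are illustrative.

```python
def cpython_wheel_matches(filename: str, minor: int = 12) -> bool:
    """Check the python/abi tags of a wheel filename against CPython 3.<minor>.

    Wheel names end in -{python tag}-{abi tag}-{platform tag}.whl, and each
    tag may be a dot-separated set. The platform tag is not checked here.
    """
    if not filename.endswith(".whl"):
        return False
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        return False
    python_tags = parts[-3].split(".")
    abi_tags = parts[-2].split(".")
    exact = f"cp3{minor}"
    for py_tag in python_tags:
        # Exact build for this interpreter, e.g. cp312-cp312.
        if py_tag == exact and exact in abi_tags:
            return True
        # Stable-ABI build: cp37-abi3 works on 3.7 and every later 3.x.
        if ("abi3" in abi_tags and py_tag.startswith("cp3")
                and py_tag[3:].isdigit() and int(py_tag[3:]) <= minor):
            return True
    return False
```

With this, a name like `opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl` is accepted on Python 3.12, where a check for the literal string `cp312` would skip it. Once the wheels are in a local folder, `python -m pip install --no-index --find-links <folder> <package>` installs without touching the network.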

Performance — not where I'd have guessed

Two results reshaped my mental model:

I also made sure to read each image's bytes exactly once and feed them to both the hasher and the decoder — no second read of the same file over a slow mount — and to compute blur on a grayscale image resized to a fixed longest edge, so the scores are comparable across resolutions.
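A sketch of the read-once pattern and the fixed-longest-edge sizing, using only the standard library. SHA-256, the 512-pixel edge, and both function names are assumptions. The decoder is passed in as a callable so the sketch stays dependency-free; with Pillow installed, `Image.open` accepts the in-memory buffer directly.

```python
import hashlib
import io
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def hash_and_decode(path: str, decode: Callable[[io.BytesIO], T]) -> Tuple[str, T]:
    """Read the file once; hash the bytes and decode from the same buffer."""
    with open(path, "rb") as fh:
        data = fh.read()  # the only read that touches the slow mount
    digest = hashlib.sha256(data).hexdigest()  # assumed hash algorithm
    return digest, decode(io.BytesIO(data))


def fit_longest_edge(width: int, height: int, longest: int = 512) -> Tuple[int, int]:
    """Scale (width, height) so the longer side equals `longest`, keeping aspect."""
    scale = longest / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Normalizing the size before scoring matters because edge-based sharpness measures depend on pixel scale: the same photo at 12 MP and at 2 MP would otherwise get different scores.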

Learned

Still open

The full multi-day ingest run, a duration backfill for videos once a media probe tool is available, and a decision about whether to keep full-file hashing for every large video over the network or to sample the head and tail instead. A couple of modern formats (webp, avif) also aren't in the extension list yet, so they're being silently skipped, which I want to catch before it bites.
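For the head/tail option, one possible shape is sketched below. Nothing here is decided: the function name, the 1 MiB chunk size, and mixing the file size into the digest are all assumptions.

```python
import hashlib
import os


def sampled_hash(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file size plus the first and last `chunk` bytes.

    Files no larger than two chunks are hashed in full.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    # Include the size so files with equal head and tail but different
    # lengths still get different digests.
    h.update(size.to_bytes(8, "big"))
    with open(path, "rb") as fh:
        if size <= 2 * chunk:
            h.update(fh.read())
        else:
            h.update(fh.read(chunk))
            fh.seek(size - chunk)
            h.update(fh.read(chunk))
    return h.hexdigest()
```

The tradeoff is explicit: two files of the same size that differ only in the middle produce the same sampled hash, so it works as a cheap duplicate candidate filter but not as proof of identical content.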