I scraped 1.94M Airbnb photos for opium dens, pet cameos, and messy kitchens


TLDR

  • A Burla demo that processes 1.7M Airbnb photos with CLIP and Claude Haiku Vision, plus 50.7M reviews, on a dynamic cluster of 1.7K CPUs and 20 A100s to surface weird listings.

Key Takeaways

  • Pipeline: CLIP embeds 1.7M photos and scores them against text prompts; the top suspects are sent to Claude Haiku Vision for category confirmation (pets, drug-den vibes, bad TV mounts, chaotic kitchens).
  • Reviews use a 3-tier funnel: regex on all 50M, SBERT embedding cluster on top 200K, Claude Haiku reranking on top 12K.
  • Burla’s remote_parallel_map scaled to 1.7K CPU workers and 20 A100s on one dynamic cluster; no Docker or Kubernetes required.
  • Demand validation uses a bootstrap 95% CI on each listing’s 365-night calendar occupancy; a result is labeled accepted only when the groups’ interval bars don’t overlap.
  • Data source is Inside Airbnb’s public dump across 119 cities and 4 quarterly snapshots.
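
The demand-validation rule above can be illustrated with a minimal sketch. It is not the author's code: it assumes each listing contributes one occupancy fraction (booked nights out of 365), that the interval is a percentile bootstrap of the group mean, and that "accepted" means the two groups' 95% intervals do not overlap. The names occupancy, bootstrap_ci, intervals_overlap, and accepted are hypothetical.

```python
import random
from statistics import mean


def occupancy(booked_nights, calendar_nights=365):
    """Fraction of the calendar that is booked (assumed definition)."""
    return booked_nights / calendar_nights


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def accepted(group_a, group_b):
    """Accept a difference only if the two groups' 95% CIs do not overlap."""
    return not intervals_overlap(bootstrap_ci(group_a), bootstrap_ci(group_b))


# Hypothetical booked-night counts for two groups of listings.
flagged = [occupancy(n) for n in [290, 300, 310, 295, 305] * 10]
baseline = [occupancy(n) for n in [110, 120, 100, 115, 105] * 10]
print(accepted(flagged, baseline))
```

Non-overlapping intervals are a conservative test: two groups can differ significantly even when their 95% intervals overlap slightly, so this rule trades some sensitivity for fewer false positives.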

Hacker News Comment Review

  • Commenters largely treated this as content marketing for Burla’s managed cloud service, noting the prominent Burla branding and that the author works there.
  • Classification quality drew skepticism: “drug-den vibes” flags appeared to catch poorly lit small rooms rather than genuinely suspicious listings, suggesting CLIP prompt engineering was too coarse.
  • Several commenters noted that Inside Airbnb’s community guidelines explicitly request no scraping, which raises legal and ethical questions; others objected to the compute spent on novelty output.

Notable Comments

  • @devmor: classification logic is “insane leaps” – dark or obscured photos flagged as drug dens, mostly just poor photography.
  • @danhon: Inside Airbnb guidelines explicitly prohibit scraping; direct data access requires emailing data@insideairbnb.com.
