About
I love listening to old episodes of In Our Time.
There are enough old episodes that, when I want to learn about a topic, this show is where I turn first. But finding episodes from the archive is hard. Hence this very unofficial site.
Current status: let’s say v0.2? There’s barely any design yet. It’s likely there are still bugs.
How this site was built
The canonical source of everything here is the official In Our Time website by the BBC.
The pages are spidered, retrieved, and stored locally. The web scraping uses Requests-HTML, a standard Python tool.
(Yes, I know about Wikipedia’s hyperlinked list of In Our Time episodes, but I wanted to do something with more automation and a greater focus on findability.)
In Our Time has been broadcast for 25 years, so while there is some consistency to the way the show notes are structured, there is also a lot of variation.
So to extract the data, we use gpt-3.5-turbo by OpenAI, which is the large language model that powers ChatGPT. The HTML is minimally simplified, then GPT is prompted to extract:
- the episode description
- guests and their affiliations
- and the reading list.
GPT is asked to respond with valid JSON (a machine-readable data format), which it mostly does, although the output often needs fixing. (This technique was sufficiently novel when the site launched that the announcement post at Hacker News has 688 points.)
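To give a flavour of what "fixing" can mean, here is a minimal sketch of a tolerant parser. It is not the code behind this site: the function name parse_model_json is made up, and I am assuming the slips to handle are a Markdown fence or chatty preamble around the object, and trailing commas.

```python
import json
import re


def parse_model_json(reply: str) -> dict:
    """Parse a JSON object from a model reply, tolerating two common slips."""
    # Keep only the text between the first "{" and the last "}", which drops
    # any Markdown code fence or chatty preamble around the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    candidate = reply[start : end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Second attempt: remove trailing commas before "}" or "]".
        # This is crude: it would also alter such a sequence inside a string.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)
```

If the repaired text still isn't valid JSON, the second json.loads call raises, which is probably what you want: better to flag the episode for a retry than to store half-parsed data.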
Episode transcripts are not used for the main site.
GPT is also used to:
- give each episode a Dewey-Decimal classification (a library code) – the guess turns out to be pretty good
- calculate similar episodes by converting each show description to an embedding vector, and finding the nearest neighbours using cosine distance.
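The nearest-neighbour step can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the site's actual code: the function names and the toy three-dimensional vectors are invented, real embeddings are much longer, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest_episodes(
    target: str, embeddings: dict[str, list[float]], k: int = 3
) -> list[str]:
    """Return the k episode ids closest to `target`, nearest first."""
    others = [ep for ep in embeddings if ep != target]
    others.sort(key=lambda ep: cosine_distance(embeddings[target], embeddings[ep]))
    return others[:k]


# Toy vectors (made up) standing in for real description embeddings.
toy = {
    "plato": [0.9, 0.1, 0.0],
    "aristotle": [0.8, 0.2, 0.1],
    "black-holes": [0.0, 0.1, 0.9],
}
print(nearest_episodes("plato", toy, k=2))  # ['aristotle', 'black-holes']
```

Comparing every pair like this is fine for an archive of around a thousand episodes; a much larger collection would want a proper vector index.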
Converting the scraped pages into machine-readable JSON doesn’t require programmatic use of an AI… but it’s considerably more straightforward than writing lots of fiddly code to do the equivalent job. On the other hand, classifying episodes and finding related episodes are ideal uses of a large language model; both are surprisingly reliable.
Read the full prompts here. Want to know how to talk to an AI to extract structured data from a web page? This is how.
AI is also used to power the search. Episode descriptions and queries are converted to embeddings using BAAI’s bge-base-en-v1.5 model. The search results are ranked by distance in semantic (embedding) space. It’s simple and effective. The embeddings and vector database are hosted in a Cloudflare Worker. The search engine is the single dynamic part of braggoscope.com which is otherwise an entirely static site. You can read a technical write-up of this search feature (with diagrams!).
There is a reading list for many episodes. Books are checked against the Google Books API and a link is provided to the Google Books website when a matching title can be found. Approximately 88% of books are matched this way. Thanks to Tom Critchlow for early analysis in finding the best approach to validating book data.
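Here is a rough sketch of how such a check might look. The endpoint, the intitle:/inauthor: query keywords, and the items / volumeInfo / infoLink response fields are part of the public Google Books API; everything else (the function names, and the rule that a match means the normalised titles are identical) is my assumption for illustration, not necessarily how this site does it.

```python
from typing import Optional
from urllib.parse import urlencode

GOOGLE_BOOKS_ENDPOINT = "https://www.googleapis.com/books/v1/volumes"


def build_query_url(title: str, author: str) -> str:
    """Build a Google Books volumes search URL for one reading-list entry."""
    query = f"intitle:{title} inauthor:{author}"
    return f"{GOOGLE_BOOKS_ENDPOINT}?{urlencode({'q': query, 'maxResults': 5})}"


def normalise(text: str) -> str:
    """Lower-case the text and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def find_match(title: str, response: dict) -> Optional[str]:
    """Return the infoLink of the first result whose title matches, else None."""
    wanted = normalise(title)
    for item in response.get("items", []):
        info = item.get("volumeInfo", {})
        if normalise(info.get("title", "")) == wanted and "infoLink" in info:
            return info["infoLink"]
    return None
```

The response argument is the decoded JSON from fetching the URL. An exact-title rule like this is strict, so subtitles and edition notes would need gentler matching to get a decent hit rate.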
Interim data is stored in SQLite (shout out to Datasette for making this easy to explore while I’ve been developing).
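For the curious, interim storage of this kind might look something like the sketch below, using Python's built-in sqlite3 module. The table layout, column names and helper functions are hypothetical, chosen only to illustrate the idea of keeping the raw page and the extracted JSON side by side.

```python
import json
import sqlite3

# In-memory database for the example; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE episodes (
        url TEXT PRIMARY KEY,
        html TEXT,       -- raw scraped page
        extracted TEXT   -- JSON produced by the extraction step
    )
    """
)


def save_episode(conn: sqlite3.Connection, url: str, html: str, extracted: dict) -> None:
    """Insert or overwrite one episode's scraped page and extracted data."""
    conn.execute(
        "INSERT OR REPLACE INTO episodes (url, html, extracted) VALUES (?, ?, ?)",
        (url, html, json.dumps(extracted)),
    )
    conn.commit()


def load_extracted(conn: sqlite3.Connection, url: str):
    """Return the extracted data for a URL, or None if it isn't stored."""
    row = conn.execute(
        "SELECT extracted FROM episodes WHERE url = ?", (url,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

Keeping the raw HTML alongside the extracted JSON means the extraction step can be re-run with a better prompt without scraping everything again.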
Finally, static pages for the website are written out, and this public site is built using GitHub Pages.
There’s a deeper dive into what it feels like to code with AI on my blog here.
An Easter egg
Hello! Famously, Melvyn Bragg introduces each show with a statement like: Hello. Beauty is truth, truth beauty.
For fans, I’ve included these at the bottom of episode pages, where available. I use an automated process to download each episode (~29GB of audio to date), transcribe the MP3 using OpenAI’s Whisper model, and then extract the relevant text.
This is the only feature that makes use of episode transcripts. I will not otherwise display or share the transcripts.
Errors
- It is likely that this site will lag the official site in showing new episodes. This is because I haven’t built the automation yet.
- Because of the use of large language models and web scraping, there may be omissions - or even AI hallucinations! - in the data presented here.
Please let me know about any errors and I will endeavour to fix them.
Who made this?
I’m Matt Webb. Find out about me here. (Email address etc at that link too.) Some trivia: I was involved in setting up the In Our Time podcast, way back in 2004. It was the first podcast by the BBC, and the BBC was the first national broadcaster to do any podcasting at all. Still a fan.