Prompts

About

Want to see the prompts that are used to extract data from the episode notes? Here they are.

For reference, take a look at a sample episode page on the BBC website. This is the In Our Time episode page for “Jupiter”. You can see episode notes including the guests (you may need to tap Show more), and a reading list.

But Braggoscope needs structured data to build all its pages and to link them together. (Structured data means “data in columns”, like a spreadsheet, rather than “data in paragraphs”.)

How would you do this? Remember that the BBC archive has been built up over 25 years, and the notes have changed over that time in ways both obvious and subtle. We want consistency!

This is what AI is great at: it’s a tolerant, universal coupler, and we “program” it with plain-English instructions that we call prompts.

So behind the scenes we pull out the main body of the page and convert the HTML to Markdown, a kind of simplified format that preserves some styling like lists and italics. That’s the input.
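
As a minimal sketch of that conversion step (assuming the turndown npm package; the episodeNotesToMarkdown helper is illustrative, not necessarily what Braggoscope runs):

```
import TurndownService from "turndown";

const turndown = new TurndownService();

// htmlBody is the extracted main content of the episode page.
export function episodeNotesToMarkdown(htmlBody: string): string {
  return turndown.turndown(htmlBody);
}
```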

Then there are two different prompts that are used to extract and structure the input data. One is for the episode notes, and the other is for the reading list.

Structuring the episode notes

The description and guests are extracted from the episode notes. These are fairly straightforward, although sometimes the name of the affiliation can look like the name of another guest (e.g. “Professor Alice B, John Smith Fellow”).

The Dewey Decimal classification is a bit more tricky. It’s a library code that helps to classify the episode for the unique Directory page. The classification is based on the topic of the episode – and this is not in the source data. So the prompt asks the AI to guess.

The AI is pretty good at this due to its training data. There is one trick that I have found that hugely improves quality: the reason field in the output.

I call this ‘inline chain-of-thought’ after the original paper, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Jan 2022). It gives the AI a chance to generate tokens to narrow down the possibility space of the code and label that follow (the reasoning has to appear before the Dewey code itself is generated).
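
To make that ordering concrete, here is an invented example of what the dewey_decimal field might look like for the “Jupiter” episode (illustrative values, not real output):

```
{
    "reason": "The episode is about the planet Jupiter, so: science (500), astronomy (520), specific celestial bodies (523), planets (523.4).",
    "code": "523.4",
    "label": "Planets"
}
```

The field order mirrors the order of generation: reason first, then code and label.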

The system prompt is:

Extract the description and a list of guests from the supplied episode notes from a podcast.

1. Description

- Do not change the wording of the description at all, but remove line breaks.
- Give the entire description, as written, and no more.
- Do not include the list of guests in the description.

2. Guests

- There are 3 or (rarely) 2 guests. Each has a name and a description. The description is usually just the affiliation. Any remainder is the optional extended description.
- Name: include the title if that appears as part of the name, for example "Professor John Smith"
- Affiliation: the guest's position and institution from the description. Take this verbatim from the description, ensuring the first letter is capitalised. If there is no affiliation, return null.
- Extended description: the remainder of the description after the affiliation has been removed. If present, take this verbatim but formatted in sentence case. If there is none, return null.

3. Dewey Decimal

- Provide a Dewey Decimal Classification code, label, and reason for the classification.
- Reason: summarise your deduction process for the Dewey code, for example considering the topic and era of history by referencing lines in the episode description. Bias towards the main topic of the episode which is at the beginning of the description.
- Code: be as specific as possible with the code, aiming to give a second level code (e.g. "510") or even lower level (e.g. "510.1"). If you cannot be more specific than the first level (e.g. "500"), then use that.

Return valid JSON conforming to the following TypeScript type definition:

```
{
    "description": string,
    "guests": {"name": string, "affiliation": string | null, "extended_description": string | null}[]
    "dewey_decimal": {"reason": string, "code": string, "label": string},
}
```

This is used together with a user message that gives the episode notes to the AI:

Episode synopsis (Markdown):

{notes}

The AI then responds with the structured data.
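
As a sketch of that round trip (assuming the OpenAI Node SDK; the model name and the extractEpisodeData helper are illustrative, not necessarily what Braggoscope uses):

```
import OpenAI from "openai";

const client = new OpenAI();

// systemPrompt is the episode-notes prompt above; notes is the Markdown input.
export async function extractEpisodeData(systemPrompt: string, notes: string) {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini", // illustrative model choice
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: `Episode synopsis (Markdown):\n\n${notes}` },
    ],
  });

  // The reply is JSON conforming to the type definition above.
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```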

Structuring the reading list

The second prompt handles the reading list.

Confounding the data extraction is that the reading list is not always in the same format. Mostly it is “{title}, {author}” but sometimes the reverse. Sometimes there are dates. Sometimes it is mixed with hyperlinks, which we don’t want. Commas are all over the place. Trickier: book names can sometimes look like author names.

As output, we want confidently labeled fields. The books data will be enriched using the Google Books API later, so we’re not too worried about missing fields – but the better the data, the better the matching.
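
As a rough sketch of that later enrichment step (assuming the public Google Books volumes search endpoint; the lookupBook helper is illustrative):

```
// Search Google Books for a title/author pair and take the top match, if any.
export async function lookupBook(title: string, author: string) {
  const q = `intitle:${title} inauthor:${author}`;
  const url = `https://www.googleapis.com/books/v1/volumes?q=${encodeURIComponent(q)}`;
  const response = await fetch(url);
  const data = await response.json();
  return data.items?.[0] ?? null;
}
```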

Here’s the system prompt. You can see a reasoning field that works in the same way as the Dewey Decimal classification above.

Extract the reading list from the supplied episode notes from a podcast. Move text verbatim into the correct fields. You can re-order text, but do not re-phrase or invent any text.

Only include books and papers, not links or websites or other media. There may be sections other than the reading list. Ignore these.

title and author fields MUST NOT be null or empty. If there is not a title or author, do not include the entry in the list.

Usually the reading list has entries in the format "[title], [author]" but sometimes it is in a different format. For books, use the "reasoning" field to very briefly think out loud about how to break down each entry into the correct fields, e.g. "the author is second". Do not use the field to replicate any text from the title or author fields.

The author field will sometimes have multiple authors, or an editor denoted like "John Smith (ed.)". Always put the verbatim text in the author field, and do not try to split it up.

Return a valid JSON list of dictionaries, where the returned JSON object conforms to the following TypeScript type definition:

```
{
    "reasoning": string;
    "title": string;
    "author": string;
    "url": string | null;
    "publisher": string | null;
    "year": string | null;
}[]
```

Output only valid JSON. Do not output any delimiters or other text.

The user message includes the relevant part of the episode notes, similar to the episode notes prompt.
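
Even so, the returned list is worth a light sanity check before enrichment. As a minimal sketch (not Braggoscope’s actual post-processing), the rule that title and author must not be null or empty could be enforced like this:

```
type ReadingListEntry = {
  reasoning: string;
  title: string;
  author: string;
  url: string | null;
  publisher: string | null;
  year: string | null;
};

// Drop any entry where the model ignored the "title and author MUST NOT
// be null or empty" instruction, before the Google Books enrichment.
export function cleanReadingList(entries: ReadingListEntry[]): ReadingListEntry[] {
  return entries.filter((entry) => entry.title?.trim() && entry.author?.trim());
}
```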