Part 2: Run Elasticsearch

We have transcript text, so the next step is search. The research agent won't read every podcast episode for every question. It searches titles and subtitles first, looks at likely matches, and only then uses a model.

Run Elasticsearch locally in Docker:

docker run -it \
  --rm \
  --name elasticsearch \
  -m 4GB \
  -p 9200:9200 \
  -p 9300:9300 \
  -v elasticsearch-data:/usr/share/elasticsearch/data \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:9.2.0

Verify that it responds:

curl http://localhost:9200

Install the Python client in flow/:

uv add elasticsearch

Connect from the notebook:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
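The container can take a little while to start, so the first connection attempt from the notebook may fail. A minimal sketch of a wait loop follows. The helper name `wait_for_elasticsearch` and its `retries` and `delay` parameters are illustrative assumptions, not part of the client; it only expects a zero-argument callable that returns a bool, such as the client's `es.ping` method.

```python
import time


def wait_for_elasticsearch(ping, retries=30, delay=2.0):
    """Call ping() until it returns True or the retries run out.

    `ping` is any zero-argument callable returning a bool,
    for example `es.ping` on the client created above (assumed usage).
    Returns the number of attempts it took.
    """
    for attempt in range(1, retries + 1):
        if ping():
            return attempt
        time.sleep(delay)  # give the container more time to start
    raise TimeoutError(f"Elasticsearch did not respond after {retries} attempts")
```

In the notebook this would be called as `wait_for_elasticsearch(es.ping)` before creating the index.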

Create the podcasts index

Elasticsearch stores searchable data in indices, so we create one named podcasts with two searchable text fields:

  • title, the podcast episode title.
  • subtitles, the full timestamped transcript.

We use an English analyzer that lowercases text, removes stop words, and stems words. This lets queries such as "getting started in machine learning" match related forms in the transcripts.

Start with the stop word list:

stopwords = [
    "a","about","above","after","again","against","all","am","an","and","any",
    "are","aren","aren't","as","at","be","because","been","before","being",
    "below","between","both","but","by","can","can't","cannot","could",
    "couldn't","did","didn't","do","does","doesn't","doing","don't","down",
    "during","each","few","for","from","further","had","hadn't","has","hasn't",
    "have","haven't","having","he","he'd","he'll","he's","her","here","here's",
    "hers","herself","him","himself","his","how","how's","i","i'd","i'll",
    "i'm","i've","if","in","into","is","isn't","it","it's","its","itself",
    "let's","me","more","most","mustn't","my","myself","no","nor","not","of",
    "off","on","once","only","or","other","ought","our","ours","ourselves",
    "out","over","own","same","shan't","she","she'd","she'll","she's","should",
    "shouldn't","so","some","such","than","that","that's","the","their",
    "theirs","them","themselves","then","there","there's","these","they",
    "they'd","they'll","they're","they've","this","those","through","to",
    "too","under","until","up","very","was","wasn't","we","we'd","we'll",
    "we're","we've","were","weren't","what","what's","when","when's","where",
    "where's","which","while","who","who's","whom","why","why's","with",
    "won't","would","wouldn't","you","you'd","you'll","you're","you've",
    "your","yours","yourself","yourselves",
    "get"
]

The custom analyzer uses the list above and the built-in English stemmers:

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": stopwords},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_with_stop_and_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            },
            "subtitles": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            }
        }
    }
}

Create the index.

During development it's fine to delete and recreate it so the mapping stays consistent:

index_name = "podcasts"

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

es.indices.create(index=index_name, body=index_settings)
print(f"Index '{index_name}' created successfully")
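To see what the analyzer actually does to a piece of text, the index can be asked to analyze a sample string. The sketch below assumes the `podcasts` index and the `english_with_stop_and_stem` analyzer created above; the helper name `analyze_text` is hypothetical, and `es` is passed in as a parameter so the function does not depend on a global.

```python
def analyze_text(es, text, index="podcasts",
                 analyzer="english_with_stop_and_stem"):
    """Return the list of tokens the analyzer produces for `text`.

    `es` is assumed to be the Elasticsearch client created earlier.
    """
    response = es.indices.analyze(index=index, analyzer=analyzer, text=text)
    # Each entry in "tokens" describes one token; keep only the token text
    return [token["token"] for token in response["tokens"]]
```

Calling `analyze_text(es, "Getting started in machine learning")` in the notebook shows which words survive stop word removal and how they are stemmed.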

If you want an assistant to explain the Elasticsearch settings, use a prompt like this:

I do not understand what happens when we create this Elasticsearch index.
Explain the analysis filters, analyzer, mappings, and searchable fields.

Index one transcript

Index the transcript from "Part 1: Fetch one transcript" as a single Elasticsearch document.

We use the YouTube video ID as the document ID, so later we can ask whether a video was already processed:

doc = {
    "video_id": video_id,
    "title": "Reinventing a Career in Tech",
    "subtitles": subtitles
}

es.index(index="podcasts", id=video_id, document=doc)
print(f"Indexed video: {video_id}")

We store video_id inside the document's _source and also use it as the Elasticsearch document ID. That duplication is convenient: the document ID lets us look up a video directly, and the _source still identifies its video when it is passed around on its own.
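Because the document ID is the video ID, the "already processed" check mentioned above can be a single existence lookup. This is a minimal sketch, assuming the `podcasts` index from this part; the helper name `is_indexed` is hypothetical, and `es` is the client created earlier.

```python
def is_indexed(es, video_id, index="podcasts"):
    """Return True if a document with this ID already exists in the index."""
    # es.exists checks for the document without fetching its _source
    return bool(es.exists(index=index, id=video_id))
```

A processing loop could then skip work with `if is_indexed(es, video_id): continue` before fetching a transcript again.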

Search with snippets

Now create a search function that queries both title and subtitles.

A match in the title counts more than a match in the transcript because title^3 multiplies that field's score by 3:

def search_videos(query: str, size: int = 5):
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "subtitles"],
                "type": "best_fields",
                "analyzer": "english_with_stop_and_stem"
            }
        },
        "highlight": {
            "pre_tags": ["*"],
            "post_tags": ["*"],
            "fields": {
                "title": {"fragment_size": 150, "number_of_fragments": 1},
                "subtitles": {"fragment_size": 150, "number_of_fragments": 1}
            }
        }
    }

    response = es.search(index="podcasts", body=body)
    hits = response.body['hits']['hits']

Return only the snippet fields that the agent needs for the first pass:

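    results = []
    for hit in hits:
        # A hit may come back without highlights; fall back to an empty
        # dict instead of raising KeyError, and copy it so the hit is not mutated
        highlight = dict(hit.get('highlight', {}))
        highlight['video_id'] = hit['_id']
        results.append(highlight)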

    return results

Test the search function:

results = search_videos("machine learning")
results
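Each result is a dict with a video_id plus a list of highlighted fragments per matched field, which is hard to scan in raw form. The following sketch turns the results into readable text; the helper name `format_results` is an assumption for illustration, and it relies only on the result shape that search_videos returns.

```python
def format_results(results):
    """Render search results as plain text, one block per video."""
    lines = []
    for result in results:
        lines.append(f"video_id: {result['video_id']}")
        # A field is present only if the query matched in it
        for field in ("title", "subtitles"):
            for fragment in result.get(field, []):
                lines.append(f"  {field}: {fragment}")
    return "\n".join(lines)
```

In the notebook, `print(format_results(results))` prints the snippets grouped by video.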

This mirrors a normal search engine, where we first show the matching snippets. Later the agent can fetch a full transcript by video ID when a snippet looks promising.
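Fetching the full transcript for a promising snippet can be sketched as a lookup by document ID. The helper name `get_transcript` is hypothetical, `es` is the client created earlier, and the index name is the one used in this part. If no document with that ID exists, the client raises a NotFoundError, which this sketch does not catch.

```python
def get_transcript(es, video_id, index="podcasts"):
    """Return the stored document (_source) for one video ID."""
    response = es.get(index=index, id=video_id)
    # _source holds the fields we indexed: video_id, title, subtitles
    return response["_source"]
```

For example, `get_transcript(es, results[0]["video_id"])["subtitles"]` would return the transcript of the top search result.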

Next we discover all podcast episodes and process the full archive in Part 3: Discover podcast videos.
