Part 2: Run Elasticsearch

We now have transcript text. The next step is search. The research agent will not read every podcast episode for every question. It will search titles and subtitles first, look at likely matches, and only then use a model.

Run Elasticsearch locally in Docker:

docker run -it \
  --rm \
  --name elasticsearch \
  -m 4GB \
  -p 9200:9200 \
  -p 9300:9300 \
  -v elasticsearch-data:/usr/share/elasticsearch/data \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  docker.elastic.co/elasticsearch/elasticsearch:9.2.0

Verify that it responds:

curl http://localhost:9200

Install the Python client in flow/:

uv add elasticsearch

Connect from the notebook:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
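Before going further, it is worth failing fast if the container is not actually up. A minimal sketch, assuming the `es` client from above; `ensure_elasticsearch` is a hypothetical helper name, but `ping()` and `info()` are real client methods:

```python
def ensure_elasticsearch(es):
    """Raise early if the local Elasticsearch container is not reachable."""
    if not es.ping():
        raise RuntimeError("Elasticsearch is not reachable on http://localhost:9200")
    # Return the server version as a quick sanity check
    return es.info()["version"]["number"]
```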

Create the podcasts index

Elasticsearch stores searchable data in indices. Here the index is named podcasts, and it has two searchable text fields:

  • title, the podcast episode title.
  • subtitles, the full timestamped transcript.

The workshop uses an English analyzer that lowercases text, removes stop words, and stems words. This lets queries such as getting started in machine learning match related forms in the transcripts.

Start with the stop word list:

stopwords = [
    "a","about","above","after","again","against","all","am","an","and","any",
    "are","aren","aren't","as","at","be","because","been","before","being",
    "below","between","both","but","by","can","can't","cannot","could",
    "couldn't","did","didn't","do","does","doesn't","doing","don't","down",
    "during","each","few","for","from","further","had","hadn't","has","hasn't",
    "have","haven't","having","he","he'd","he'll","he's","her","here","here's",
    "hers","herself","him","himself","his","how","how's","i","i'd","i'll",
    "i'm","i've","if","in","into","is","isn't","it","it's","its","itself",
    "let's","me","more","most","mustn't","my","myself","no","nor","not","of",
    "off","on","once","only","or","other","ought","our","ours","ourselves",
    "out","over","own","same","shan't","she","she'd","she'll","she's","should",
    "shouldn't","so","some","such","than","that","that's","the","their",
    "theirs","them","themselves","then","there","there's","these","they",
    "they'd","they'll","they're","they've","this","those","through","to",
    "too","under","until","up","very","was","wasn't","we","we'd","we'll",
    "we're","we've","were","weren't","what","what's","when","when's","where",
    "where's","which","while","who","who's","whom","why","why's","with",
    "won't","would","wouldn't","you","you'd","you'll","you're","you've",
    "your","yours","yourself","yourselves",
    "get"
]
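A hand-maintained list like this is easy to get wrong: a stray duplicate is harmless to Elasticsearch but usually signals a copy-paste slip. A small self-contained checker (the helper name is illustrative):

```python
def find_duplicates(words):
    """Return the sorted set of entries that appear more than once."""
    seen, dupes = set(), set()
    for w in words:
        (dupes if w in seen else seen).add(w)
    return sorted(dupes)

find_duplicates(["can", "can", "cannot"])  # → ["can"]
```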

The custom analyzer uses the list above and the built-in English stemmers:

index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": stopwords},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_with_stop_and_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            },
            "subtitles": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            }
        }
    }
}

Create the index. During development it is fine to delete and recreate it so the mapping stays consistent:

index_name = "podcasts"

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)

es.indices.create(index=index_name, body=index_settings)
print(f"Index '{index_name}' created successfully")
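To see what the analyzer actually does to text, you can run a sample query through the index's `_analyze` endpoint. A sketch, assuming the `es` client and `index_name` from above; `analyze_tokens` is a hypothetical helper, but `es.indices.analyze` is the real client call:

```python
def analyze_tokens(es, index_name, text):
    """Return the tokens the custom analyzer produces for `text`."""
    resp = es.indices.analyze(
        index=index_name,
        analyzer="english_with_stop_and_stem",
        text=text,
    )
    return [t["token"] for t in resp["tokens"]]

# Running "getting started in machine learning" through this should show
# stop words such as "in" dropped and stemmed forms such as "start" and
# "learn" in the output.
```

Note the filter order matters: stop words are removed before stemming, so "getting" is stemmed to "get" rather than being caught by the "get" entry in the stop word list.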

If you want an assistant to explain the Elasticsearch settings, use a prompt like this:

I do not understand what happens when we create this Elasticsearch index.
Explain the analysis filters, analyzer, mappings, and searchable fields.

Index one transcript

Index the transcript from Part 1 (Fetch one transcript) as a single Elasticsearch document. The document ID is the YouTube video ID, so later we can check whether a video has already been processed:

doc = {
    "video_id": video_id,
    "title": "Reinventing a Career in Tech",
    "subtitles": subtitles
}

es.index(index="podcasts", id=video_id, document=doc)
print(f"Indexed video: {video_id}")

The document stores video_id inside _source and also uses it as the Elasticsearch document ID. That duplication is convenient because search results can return _id, while full document retrieval returns _source.
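Because the document ID is the video ID, the "already processed?" question becomes a single existence check. A minimal sketch, assuming the `es` client from above; `is_indexed` is a hypothetical helper name, but `es.exists` is the real client call:

```python
def is_indexed(es, video_id, index_name="podcasts"):
    """True if this video was already indexed, keyed by document ID."""
    return bool(es.exists(index=index_name, id=video_id))
```

Later, a bulk processing loop can call this to skip videos it has already transcribed and indexed.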

Search with snippets

Now create a search function. The query searches both title and subtitles. Matching the title counts more because title^3 boosts that field:

def search_videos(query: str, size: int = 5):
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "subtitles"],
                "type": "best_fields",
                "analyzer": "english_with_stop_and_stem"
            }
        },
        "highlight": {
            "pre_tags": ["*"],
            "post_tags": ["*"],
            "fields": {
                "title": {"fragment_size": 150, "number_of_fragments": 1},
                "subtitles": {"fragment_size": 150, "number_of_fragments": 1}
            }
        }
    }

    response = es.search(index="podcasts", body=body)
    hits = response.body['hits']['hits']

Return only the snippet fields that the agent needs for the first pass:

    results = []
    for hit in hits:
        highlight = hit['highlight']
        highlight['video_id'] = hit['_id']
        results.append(highlight)

    return results

Test the function:

results = search_videos("machine learning")
results
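One caveat when searching immediately after indexing: Elasticsearch is near-real-time, so a newly indexed document only becomes searchable after a refresh (roughly once per second by default). If the search above comes back empty in a fresh notebook run, force a refresh first. A sketch; `refresh_then_search` is a hypothetical helper wrapping the real `es.indices.refresh` call:

```python
def refresh_then_search(es, search_fn, query, index_name="podcasts"):
    """Make just-indexed documents visible, then run the search."""
    # Refresh flushes in-memory segments so new documents become searchable
    es.indices.refresh(index=index_name)
    return search_fn(query)
```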

This mirrors a normal search engine. First we show the matching snippets. Later the agent can fetch a full transcript by video ID when the snippet is worth checking.
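That second step, fetching the full transcript once a snippet looks promising, is a single `get` by document ID. A sketch, assuming the `es` client from above; `get_transcript` is a hypothetical helper name, but `es.get` is the real client call:

```python
def get_transcript(es, video_id, index_name="podcasts"):
    """Fetch the full stored transcript for a video the agent wants to read."""
    doc = es.get(index=index_name, id=video_id)
    return doc["_source"]["subtitles"]
```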

Next we discover all podcast episodes and process the full archive in Part 3: Discover podcast videos.
