Part 2: Run Elasticsearch
We have transcript text, so the next step is search. The research agent won't read every podcast episode for every question. It searches titles and subtitles first, looks at likely matches, and only then uses a model.
Run Elasticsearch locally in Docker:
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -v elasticsearch-data:/usr/share/elasticsearch/data \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:9.2.0
Verify that it responds:
curl http://localhost:9200
Install the Python client in flow/:
uv add elasticsearch
Connect from the notebook:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
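It can help to confirm from Python, not only from curl, that the client reaches the cluster. A minimal sketch, assuming the `es` client created above; the helper name `cluster_version` is hypothetical, and it relies on the client's `info()` call, which returns the same JSON payload as the curl check:

```python
def cluster_version(es) -> str:
    """Return the Elasticsearch version string reported by the cluster."""
    info = es.info()  # same payload as `curl http://localhost:9200`
    return info["version"]["number"]

# Usage in the notebook (needs the running container):
# print(cluster_version(es))
```

If the container is not running, the call raises a connection error instead of returning a version.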
Create the podcasts index
Elasticsearch stores searchable data in indices, so we create one named
podcasts with two searchable text fields:
- title, the podcast episode title.
- subtitles, the full timestamped transcript.
We use a custom English analyzer that lowercases text, removes stop
words, and stems words. This lets a query such as "getting started in machine
learning" match related word forms in the transcripts.
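To build intuition for the order of those steps, here is a toy version in plain Python. It is only a sketch: `TOY_STOPWORDS`, `toy_stem`, and `toy_analyze` are hypothetical names, the stop list is a tiny illustrative subset, and the suffix stripping is a crude stand-in for the Snowball stemmer that Elasticsearch uses, so real output can differ.

```python
import re

# Tiny illustrative stop list (hypothetical); the real list is much longer.
TOY_STOPWORDS = {"in", "the", "a", "get"}

def toy_stem(token: str) -> str:
    """Crude suffix stripping; NOT the Snowball stemmer Elasticsearch uses."""
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            token = token[: -len(suffix)]
            # Undo consonant doubling, e.g. "gett" -> "get"
            if len(token) > 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
                token = token[:-1]
            break
    return token

def toy_analyze(text: str) -> list[str]:
    tokens = re.findall(r"[A-Za-z']+", text)                # rough stand-in for the tokenizer
    tokens = [t.lower() for t in tokens]                    # lowercase
    tokens = [re.sub(r"'s$", "", t) for t in tokens]        # strip possessive 's
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]  # remove stop words
    return [toy_stem(t) for t in tokens]                    # stem what is left

print(toy_analyze("Getting started in machine learning"))
# ['get', 'start', 'machine', 'learn']
```

Because stop words are removed before stemming in this pipeline, the literal token "get" would be dropped, while "getting" passes the stop filter and is only afterwards reduced to "get".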
Start with the stop word list:
stopwords = [
    "a","about","above","after","again","against","all","am","an","and","any",
    "are","aren","aren't","as","at","be","because","been","before","being",
    "below","between","both","but","by","can","can't","cannot","could",
    "couldn't","did","didn't","do","does","doesn't","doing","don't","down",
    "during","each","few","for","from","further","had","hadn't","has","hasn't",
    "have","haven't","having","he","he'd","he'll","he's","her","here","here's",
    "hers","herself","him","himself","his","how","how's","i","i'd","i'll",
    "i'm","i've","if","in","into","is","isn't","it","it's","its","itself",
    "let's","me","more","most","mustn't","my","myself","no","nor","not","of",
    "off","on","once","only","or","other","ought","our","ours","ourselves",
    "out","over","own","same","shan't","she","she'd","she'll","she's","should",
    "shouldn't","so","some","such","than","that","that's","the","their",
    "theirs","them","themselves","then","there","there's","these","they",
    "they'd","they'll","they're","they've","this","those","through","to",
    "too","under","until","up","very","was","wasn't","we","we'd","we'll",
    "we're","we've","were","weren't","what","what's","when","when's","where",
    "where's","which","while","who","who's","whom","why","why's","with",
    "won't","would","wouldn't","you","you'd","you'll","you're","you've",
    "your","yours","yourself","yourselves",
    "get"
]
The custom analyzer uses the list above and the built-in English stemmers:
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": stopwords},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_with_stop_and_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            },
            "subtitles": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            }
        }
    }
}
Create the index.
During development it's fine to delete and recreate it so the mapping stays consistent:
index_name = "podcasts"
if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
es.indices.create(index=index_name, body=index_settings)
print(f"Index '{index_name}' created successfully")
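Once the index exists, one way to see what the analyzer actually does is to ask Elasticsearch for the tokens it produces. A minimal sketch, assuming the `es` client and the `podcasts` index from above; the helper name `show_tokens` is hypothetical, and it uses the client's `indices.analyze` call:

```python
def show_tokens(es, text: str, index_name: str = "podcasts") -> list[str]:
    """Return the tokens the custom analyzer produces for `text`."""
    response = es.indices.analyze(
        index=index_name,
        analyzer="english_with_stop_and_stem",
        text=text,
    )
    return [t["token"] for t in response["tokens"]]

# Usage in the notebook (needs the running container and the index):
# show_tokens(es, "getting started in machine learning")
```

Comparing the tokens for a query with the tokens for a transcript sentence shows which word forms will match.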
If you want an assistant to explain the Elasticsearch settings, use a prompt like this:
I do not understand what happens when we create this Elasticsearch index.
Explain the analysis filters, analyzer, mappings, and searchable fields.
Index one transcript
Index the transcript we fetched in "Part 1: Fetch one transcript" as a single Elasticsearch document.
We use the YouTube video ID as the document ID, so later we can ask whether a video was already processed:
doc = {
    "video_id": video_id,
    "title": "Reinventing a Career in Tech",
    "subtitles": subtitles
}
es.index(index="podcasts", id=video_id, document=doc)
print(f"Indexed video: {video_id}")
We store video_id inside the document's _source and also use it as the
Elasticsearch document ID. That duplication is convenient because search
results can return _id, while full document retrieval returns _source.
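Using the video ID as the document ID makes the "already processed?" check a single lookup. A minimal sketch, assuming the `es` client from above; the helper name `is_indexed` is hypothetical, and it relies on the client's `exists` call, whose response is truthy when the document is present:

```python
def is_indexed(es, video_id: str, index_name: str = "podcasts") -> bool:
    """Return True if a document with this video ID is already in the index."""
    return bool(es.exists(index=index_name, id=video_id))

# Usage in the notebook (needs the running container):
# if not is_indexed(es, video_id):
#     es.index(index="podcasts", id=video_id, document=doc)
```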
Search with snippets
Now create a search function that queries both title and subtitles.
Matching the title counts more because title^3 boosts that field:
def search_videos(query: str, size: int = 5):
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "subtitles"],
                "type": "best_fields",
                "analyzer": "english_with_stop_and_stem"
            }
        },
        "highlight": {
            "pre_tags": ["*"],
            "post_tags": ["*"],
            "fields": {
                "title": {"fragment_size": 150, "number_of_fragments": 1},
                "subtitles": {"fragment_size": 150, "number_of_fragments": 1}
            }
        }
    }
    response = es.search(index="podcasts", body=body)
    hits = response.body['hits']['hits']
Still inside the function, return only the snippet fields that the agent needs for the first pass:
    results = []
    for hit in hits:
        # A hit can come back without a "highlight" key, so fall back to an empty dict
        highlight = dict(hit.get('highlight', {}))
        highlight['video_id'] = hit['_id']
        results.append(highlight)
    return results
Test the search function:
results = search_videos("machine learning")
results
This mirrors a normal search engine, where we first show the matching snippets. Later the agent can fetch a full transcript by video ID when a snippet looks promising.
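Fetching the full transcript for a promising snippet could look like the sketch below. It assumes the `es` client and the document layout from above; the helper name `get_transcript` is hypothetical, and it uses the client's `get` call, which raises an error when no document with that ID exists:

```python
def get_transcript(es, video_id: str, index_name: str = "podcasts") -> str:
    """Return the full stored transcript for one video ID."""
    doc = es.get(index=index_name, id=video_id)
    return doc["_source"]["subtitles"]

# Usage in the notebook (needs the running container):
# results = search_videos("machine learning")
# transcript = get_transcript(es, results[0]["video_id"])
```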
Next we discover all podcast episodes and process the full archive in Part 3: Discover podcast videos.