Part 2: Run Elasticsearch
We now have transcript text. The next step is search. The research agent will not read every podcast episode for every question. It will search titles and subtitles first, look at likely matches, and only then use a model.
Run Elasticsearch locally in Docker:
docker run -it \
--rm \
--name elasticsearch \
-m 4GB \
-p 9200:9200 \
-p 9300:9300 \
-v elasticsearch-data:/usr/share/elasticsearch/data \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:9.2.0
Verify that it responds:
curl http://localhost:9200
Install the Python client in flow/:
uv add elasticsearch
Connect from the notebook:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
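Before indexing anything, it is worth confirming the client can actually reach the cluster. A minimal sketch, assuming the `es` client from above (`ensure_connected` is a hypothetical helper name, not part of the client library):

```python
# Minimal connectivity check. es.ping() swallows connection errors and
# returns False instead of raising, so it is safe to call when the
# container is not up yet.
def ensure_connected(es):
    if not es.ping():
        raise RuntimeError("Elasticsearch is not reachable at http://localhost:9200")
    # es.info() returns cluster metadata, including the server version
    return es.info()["version"]["number"]
```

If this raises, check that the Docker container from the previous step is still running.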
Create the podcasts index
Elasticsearch stores searchable data in indices. Here the index is named
podcasts, and it has two searchable text fields:
- title: the podcast episode title
- subtitles: the full timestamped transcript
The workshop uses an English analyzer that lowercases text, removes stop
words, and stems words. This lets queries such as getting started in machine
learning match related forms in the transcripts.
Start with the stop word list:
stopwords = [
"a","about","above","after","again","against","all","am","an","and","any",
"are","aren","aren't","as","at","be","because","been","before","being",
"below","between","both","but","by","can","can","can't","cannot","could",
"couldn't","did","didn't","do","does","doesn't","doing","don't","down",
"during","each","few","for","from","further","had","hadn't","has","hasn't",
"have","haven't","having","he","he'd","he'll","he's","her","here","here's",
"hers","herself","him","himself","his","how","how's","i","i'd","i'll",
"i'm","i've","if","in","into","is","isn't","it","it's","its","itself",
"let's","me","more","most","mustn't","my","myself","no","nor","not","of",
"off","on","once","only","or","other","ought","our","ours","ourselves",
"out","over","own","same","shan't","she","she'd","she'll","she's","should",
"shouldn't","so","some","such","than","that","that's","the","their",
"theirs","them","themselves","then","there","there's","these","they",
"they'd","they'll","they're","they've","this","those","through","to",
"too","under","until","up","very","was","wasn't","we","we'd","we'll",
"we're","we've","were","weren't","what","what's","when","when's","where",
"where's","which","while","who","who's","whom","why","why's","with",
"won't","would","wouldn't","you","you'd","you'll","you're","you've",
"your","yours","yourself","yourselves",
"get"
]
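To build intuition for what the stop filter does on its own, here is a rough pure-Python imitation on a sample query. This is only the stop-word step: the real analyzer also lowercases with proper Unicode rules and stems, which this sketch skips.

```python
# Rough imitation of the stop filter alone (no stemming): drop any token
# that appears in the stopword list. Uses a small subset of the list above.
stop_subset = {"in", "a", "the", "get"}

def drop_stopwords(text, stopwords):
    return [t for t in text.lower().split() if t not in stopwords]

tokens = drop_stopwords("Getting started in machine learning", stop_subset)
# → ['getting', 'started', 'machine', 'learning']
# "in" is removed; "getting" survives because stop filtering is exact-match.
# It is the stemmer, which runs after this filter, that reduces word forms.
```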
The custom analyzer uses the list above and the built-in English stemmers:
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {"type": "stop", "stopwords": stopwords},
                "english_stemmer": {"type": "stemmer", "language": "english"},
                "english_possessive_stemmer": {
                    "type": "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english_with_stop_and_stem": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "english_possessive_stemmer",
                        "english_stop",
                        "english_stemmer"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            },
            "subtitles": {
                "type": "text",
                "analyzer": "english_with_stop_and_stem",
                "search_analyzer": "english_with_stop_and_stem"
            }
        }
    }
}
Create the index. During development it is fine to delete and recreate it so the mapping stays consistent:
index_name = "podcasts"
if es.indices.exists(index=index_name):
es.indices.delete(index=index_name)
es.indices.create(index=index_name, body=index_settings)
print(f"Index '{index_name}' created successfully")
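Once the index exists, the `_analyze` API shows exactly which tokens the custom analyzer produces for a piece of text, which is useful for checking the stop-word and stemming behavior. A sketch, wrapped in a hypothetical helper (`analyze_tokens` is not a library function):

```python
# Ask Elasticsearch to run the custom analyzer over some text and
# return just the token strings it emits.
def analyze_tokens(es, index_name, text):
    response = es.indices.analyze(
        index=index_name,
        analyzer="english_with_stop_and_stem",
        text=text,
    )
    return [t["token"] for t in response["tokens"]]

# With the settings above, stop words drop out and remaining words are
# stemmed, so "started machine learning" should come back as short
# stemmed forms such as "start", "machin", "learn".
```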
If you want an assistant to explain the Elasticsearch settings, use a prompt like this:
I do not understand what happens when we create this Elasticsearch index.
Explain the analysis filters, analyzer, mappings, and searchable fields.
Index one transcript
Index the transcript from Part 1 (Fetch one transcript) as a single Elasticsearch document. The document ID is the YouTube video ID, so later we can ask whether a video was already processed:
doc = {
    "video_id": video_id,
    "title": "Reinventing a Career in Tech",
    "subtitles": subtitles
}
es.index(index="podcasts", id=video_id, document=doc)
print(f"Indexed video: {video_id}")
The document stores video_id inside _source and also uses it as the
Elasticsearch document ID. That duplication is convenient because search
results can return _id, while full document retrieval returns _source.
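Because the video ID doubles as the document ID, a single existence check tells us whether an episode was already processed. A sketch using the client's `exists` call (`needs_indexing` is a hypothetical helper name):

```python
# es.exists does a cheap HEAD request and is truthy when a document
# with that ID is already in the index.
def needs_indexing(es, index_name, video_id):
    return not es.exists(index=index_name, id=video_id)
```

In Part 3 this kind of check lets the pipeline skip episodes it has already transcribed and indexed.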
Search with snippets
Now create a search function. The query searches both title and
subtitles. Matching the title counts more because title^3 boosts that
field:
def search_videos(query: str, size: int = 5):
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title^3", "subtitles"],
                "type": "best_fields",
                "analyzer": "english_with_stop_and_stem"
            }
        },
        "highlight": {
            "pre_tags": ["*"],
            "post_tags": ["*"],
            "fields": {
                "title": {"fragment_size": 150, "number_of_fragments": 1},
                "subtitles": {"fragment_size": 150, "number_of_fragments": 1}
            }
        }
    }
    response = es.search(index="podcasts", body=body)
    hits = response.body['hits']['hits']
Return only the snippet fields that the agent needs for the first pass:
    results = []
    for hit in hits:
        highlight = hit['highlight']
        highlight['video_id'] = hit['_id']
        results.append(highlight)
    return results
Test the function:
results = search_videos("machine learning")
results
This mirrors a normal search engine. First we show the matching snippets. Later the agent can fetch a full transcript by video ID when the snippet is worth checking.
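That second pass, fetching the full transcript once a snippet looks promising, is a single `get` by document ID. A sketch, assuming the indexed document shape from above (`get_full_transcript` is a hypothetical helper name):

```python
# Retrieve the full stored document by its ID (the YouTube video ID)
# and return just the transcript text from _source.
def get_full_transcript(es, video_id):
    response = es.get(index="podcasts", id=video_id)
    return response["_source"]["subtitles"]
```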
Next we discover all podcast episodes and process the full archive in Part 3: Discover podcast videos.