Part 3: Discover podcast videos
One transcript proves the search path works. Now we need all DataTalks.Club
podcast episodes. The website is built from a GitHub repository, and the
podcast metadata is in _data/events.yaml.
Install the dependencies:
uv add requests pyyaml tqdm
Fetch the file at the pinned commit below; pinning keeps the workshop reproducible even if the live website changes later:
import requests
import yaml
commit_id = '187b7d056a36d5af6ac33e4c8096c52d13a078a7'
events_url = f'https://raw.githubusercontent.com/DataTalksClub/datatalksclub.github.io/{commit_id}/_data/events.yaml'
raw_yaml = requests.get(events_url).content
events_data = yaml.load(raw_yaml, yaml.CSafeLoader)
podcasts = [
    d for d in events_data
    if (d.get('type') == 'podcast') and (d.get('youtube'))
]
print(f"Found {len(podcasts)} podcasts")
Extract video IDs from the YouTube URLs. Skip videos that are known to cause ingestion problems:
videos = []

for podcast in podcasts:
    _, video_id = podcast['youtube'].split('watch?v=')
    if video_id in ['FRi0SUtxdMw', 's8kyzy8V5b8']:
        continue
    videos.append({
        'title': podcast['title'],
        'video_id': video_id
    })
print(f"Will process {len(videos)} videos")
Now each item has the shape the pipeline needs:
videos[0]
Each item carries exactly two fields, title and video_id; everything after this point ignores the original YAML structure.
Process all videos
The first version of the ingestion loop is plain Python. It checks whether a video is already indexed, fetches subtitles when needed, and writes the document to Elasticsearch:
from tqdm.auto import tqdm

for video in tqdm(videos):
    video_id = video['video_id']
    video_title = video['title']

    if es.exists(index='podcasts', id=video_id):
        print(f'already processed {video_id}')
        continue

    transcript = fetch_transcript(video_id)
    subtitles = make_subtitles(transcript)
Then, still inside the loop, index the document:
    doc = {
        "video_id": video_id,
        "title": video_title,
        "subtitles": subtitles
    }
    es.index(index="podcasts", id=video_id, document=doc)
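The loop reuses fetch_transcript and make_subtitles from the earlier parts. If you are running this section standalone, minimal stand-ins could look like this (a sketch only; the workshop's real make_subtitles may format the text differently):
from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id):
    # Plain version without a proxy; replaced below in "Proxy support".
    ytt_api = YouTubeTranscriptApi()
    return ytt_api.fetch(video_id)

def make_subtitles(transcript):
    # Sketch: join the fetched snippets into one searchable string.
    return ' '.join(snippet.text for snippet in transcript)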
Running the loop reveals the production problem: YouTube can block your IP after a small number of transcript requests, often partway through the podcast archive. That is exactly the failure mode the workshop needs to handle.
Proxy support
The workaround for blocked cloud or local IP addresses is a proxy. Use a residential proxy when possible: residential IPs look like ordinary home connections, while cloud-provider IPs are far more likely to be blocked.
Put proxy credentials in .env:
PROXY_BASE_URL=...
PROXY_USER=...
PROXY_PASSWORD=...
Make sure .env is ignored:
echo '.env' >> .gitignore
You can load the variables with dirdotenv:
echo 'alias dirdotenv="uvx dirdotenv"' >> ~/.bashrc
echo 'eval "$(dirdotenv hook bash)"' >> ~/.bashrc
source ~/.bashrc
You can also use python-dotenv inside Python:
uv add python-dotenv
Then load the file from the notebook or script:
from dotenv import load_dotenv
load_dotenv()
Configure the YouTube transcript client with a proxy:
import os
from youtube_transcript_api.proxies import GenericProxyConfig
proxy_user = os.environ['PROXY_USER']
proxy_password = os.environ['PROXY_PASSWORD']
proxy_base_url = os.environ['PROXY_BASE_URL']
proxy_url = f'http://{proxy_user}:{proxy_password}@{proxy_base_url}'
proxy = GenericProxyConfig(
    http_url=proxy_url,
    https_url=proxy_url,
)
Use the proxy when fetching transcripts:
def fetch_transcript(video_id):
    ytt_api = YouTubeTranscriptApi(proxy_config=proxy)
    transcript = ytt_api.fetch(video_id)
    return transcript
Gotcha: proxies reduce IP blocking, but they introduce more network failures. A request now goes through your machine, the proxy service, a residential proxy host, YouTube, and back. SSL errors, timeouts, and blocked proxy IPs are expected.
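To see what handling those failures by hand would take, here is a naive retry wrapper; fetch_transcript_with_retries and its delay values are illustrative, and Part 4 replaces this pattern with Temporal's built-in retries:
import time

def fetch_transcript_with_retries(video_id, max_attempts=5, delay_seconds=10):
    # Hand-rolled retries: catch everything, wait, try again.
    # No persistence and no visibility; exactly the gap Temporal fills.
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_transcript(video_id)
        except Exception as e:
            print(f'attempt {attempt} for {video_id} failed: {e}')
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)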
That gotcha is why the notebook loop is not enough: we need retries, observable progress, and a workflow that can resume after failures. That is where Temporal comes in, starting in Part 4: Temporal motivation.