Part 3: Discover podcast videos
One transcript proves the search path works, so now we need all
DataTalks.Club podcast episodes. The website is built from a GitHub
repository, and the podcast metadata lives in _data/events.yaml.
Install the discovery dependencies:
uv add requests pyyaml tqdm
Read the metadata from a specific commit so the workshop stays reproducible even if the live website changes later:
import requests
import yaml
# Pin the commit so the input data does not change between runs
commit_id = '187b7d056a36d5af6ac33e4c8096c52d13a078a7'
events_url = f'https://raw.githubusercontent.com/DataTalksClub/datatalksclub.github.io/{commit_id}/_data/events.yaml'
response = requests.get(events_url, timeout=30)
response.raise_for_status()  # fail loudly on a 404 or server error
raw_yaml = response.content
events_data = yaml.load(raw_yaml, yaml.CSafeLoader)
# Keep only podcast entries that have a YouTube link
podcasts = [
    d for d in events_data
    if d.get('type') == 'podcast' and d.get('youtube')
]
print(f"Found {len(podcasts)} podcasts")
Extract video IDs from the YouTube URLs.
Skip videos that are known to cause ingestion problems:
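videos = []
for podcast in podcasts:
    # Assumes every URL has the form .../watch?v=<video_id> with nothing after the ID
    _, video_id = podcast['youtube'].split('watch?v=')
    # Skip the videos known to cause ingestion problems
    if video_id in ['FRi0SUtxdMw', 's8kyzy8V5b8']:
        continue
    videos.append({
        'title': podcast['title'],
        'video_id': video_id
    })
print(f"Will process {len(videos)} videos")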
Now each item has the structure the pipeline needs:
videos[0]
The expected fields are title and video_id. Everything after this point
can ignore the original YAML structure.
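The split on watch?v= above works as long as every URL has exactly that shape. If an entry ever carries extra query parameters (for example a timestamp) or uses a short youtu.be link, the split would produce a wrong ID or raise an error. A minimal sketch of a more tolerant parser, assuming only the standard library; the helper name extract_video_id is hypothetical and not part of the workshop code:

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith('youtu.be'):
        # Short links carry the ID in the path: https://youtu.be/<id>
        return parsed.path.lstrip('/') or None
    # Standard links carry the ID in the "v" query parameter
    values = parse_qs(parsed.query).get('v')
    return values[0] if values else None


print(extract_video_id('https://www.youtube.com/watch?v=FRi0SUtxdMw&t=30s'))
```

Returning None instead of raising lets the caller decide whether to skip the entry or stop.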
Process all videos
The first version of the ingestion loop is plain Python.
It checks whether a video is already indexed, fetches subtitles when needed, and writes the document to Elasticsearch:
from tqdm.auto import tqdm
for video in tqdm(videos):
    video_id = video['video_id']
    video_title = video['title']
    # Skip videos that are already indexed so reruns only do the missing work
    if es.exists(index='podcasts', id=video_id):
        print(f'already processed {video_id}')
        continue
    transcript = fetch_transcript(video_id)
    subtitles = make_subtitles(transcript)
Then, still inside the loop, index the document:
    doc = {
        "video_id": video_id,
        "title": video_title,
        "subtitles": subtitles
    }
    # Use the video ID as the document ID so the exists() check above works
    es.index(index="podcasts", id=video_id, document=doc)
Run this loop and you hit the real problem. YouTube can block the IP after a small number of transcript requests, sometimes in the middle of the podcast archive. That's exactly the failure mode we need the workshop to handle.
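To see how far a run gets before the blocking starts, it can help to record failures instead of letting the first exception stop the loop. The following is only a sketch under assumptions: ingest_all and process_video are hypothetical names, and process_video stands in for the body of the loop above (check, fetch, index).

```python
def ingest_all(videos, process_video):
    """Run process_video on every video and collect failures instead of stopping."""
    failed = []
    for video in videos:
        try:
            process_video(video)
        except Exception as exc:  # broad on purpose: record the error and move on
            failed.append({'video_id': video['video_id'], 'error': repr(exc)})
    return failed


# Stand-in for the real loop body, used here only to demonstrate the behavior
def fake_process(video):
    if video['video_id'] == 'bad':
        raise RuntimeError('blocked')


failures = ingest_all([{'video_id': 'a'}, {'video_id': 'bad'}], fake_process)
print(failures)
```

This only reports what failed; it does not retry or get around the block.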
Proxy support
The workaround for blocked cloud or local IP addresses is a proxy. Use a residential proxy when possible. Residential IPs look like normal computers, while cloud provider IPs are more likely to be blocked.
Put proxy credentials in .env:
PROXY_BASE_URL=...
PROXY_USER=...
PROXY_PASSWORD=...
Make sure .env is ignored:
echo '.env' >> .gitignore
Load the variables with dirdotenv:
echo 'alias dirdotenv="uvx dirdotenv"' >> ~/.bashrc
echo 'eval "$(dirdotenv hook bash)"' >> ~/.bashrc
source ~/.bashrc
You can also use python-dotenv inside Python:
uv add python-dotenv
Then load the file from the notebook or script:
from dotenv import load_dotenv
load_dotenv()
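Whichever loading method you use, a quick check that all three variables are present saves a confusing KeyError later. A minimal sketch, assuming the variable names from the .env file above; the helper missing_proxy_vars is hypothetical:

```python
import os

REQUIRED_PROXY_VARS = ('PROXY_BASE_URL', 'PROXY_USER', 'PROXY_PASSWORD')


def missing_proxy_vars(env=None):
    """Return the names of required proxy variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_PROXY_VARS if not env.get(name)]


missing = missing_proxy_vars()
if missing:
    print(f'Missing proxy settings: {", ".join(missing)}')
```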
Configure the YouTube transcript client with a proxy:
import os
from youtube_transcript_api.proxies import GenericProxyConfig
proxy_user = os.environ['PROXY_USER']
proxy_password = os.environ['PROXY_PASSWORD']
proxy_base_url = os.environ['PROXY_BASE_URL']
proxy_url = f'http://{proxy_user}:{proxy_password}@{proxy_base_url}'
proxy = GenericProxyConfig(
    http_url=proxy_url,
    https_url=proxy_url,
)
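One caveat with building the URL by string formatting: if the username or password contains characters such as @, : or /, the resulting URL is ambiguous and the proxy login may fail. A minimal sketch of percent-encoding the credentials first, assuming the same http scheme as above; build_proxy_url is a hypothetical helper and the values in the example are made up:

```python
from urllib.parse import quote


def build_proxy_url(user, password, base_url):
    """Build a proxy URL with percent-encoded credentials."""
    # safe='' also encodes '/' so nothing in the credentials can alter the URL structure
    safe_user = quote(user, safe='')
    safe_password = quote(password, safe='')
    return f'http://{safe_user}:{safe_password}@{base_url}'


print(build_proxy_url('user', 'p@ss:word', 'proxy.example.com:8080'))
# http://user:p%40ss%3Aword@proxy.example.com:8080
```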
Use the proxy when fetching transcripts:
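from youtube_transcript_api import YouTubeTranscriptApi
def fetch_transcript(video_id):
    # Route every transcript request through the configured proxy
    ytt_api = YouTubeTranscriptApi(proxy_config=proxy)
    transcript = ytt_api.fetch(video_id)
    return transcript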
Proxies reduce IP blocking, but they introduce more network failures. A request now travels from your machine to the proxy service, then to a residential proxy host, then to YouTube, and back along the same path. Any of those hops can fail, so expect SSL errors, timeouts, and blocked proxy IPs.
Because of these extra failure points, we can't rely on the notebook loop alone. We need retries, observable progress, and a workflow that can continue after failures. That is where Temporal comes in, starting with Part 4: Temporal motivation.
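To make the first of those needs concrete, here is a rough plain-Python sketch of retries with exponential backoff. It is an illustration only, not the approach the workshop takes next: with_retries and its parameters are hypothetical, and it has no persistence or progress tracking, which is the gap a workflow engine fills.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and try again."""
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose for this illustration
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: surface the last error to the caller
    raise last_error
```

In the ingestion loop this could wrap the transcript call, for example with_retries(lambda: fetch_transcript(video_id)). If the notebook kernel dies, all progress and retry state is lost.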