Part 3: Discover podcast videos

One transcript proves the search path works. Now we need all DataTalks.Club podcast episodes. The website is built from a GitHub repository, and the podcast metadata is in _data/events.yaml.

Install the dependencies:

uv add requests pyyaml tqdm

Use the pinned commit below. Pinning the commit keeps the workshop reproducible even if the live website changes later:

import requests
import yaml

commit_id = '187b7d056a36d5af6ac33e4c8096c52d13a078a7'
events_url = f'https://raw.githubusercontent.com/DataTalksClub/datatalksclub.github.io/{commit_id}/_data/events.yaml'

raw_yaml = requests.get(events_url).content

# CSafeLoader needs the libyaml C extension; fall back to the
# pure-Python SafeLoader when it is not available
loader = getattr(yaml, 'CSafeLoader', yaml.SafeLoader)
events_data = yaml.load(raw_yaml, Loader=loader)

podcasts = [
    d for d in events_data
    if (d.get('type') == 'podcast') and (d.get('youtube'))
]

print(f"Found {len(podcasts)} podcasts")

Extract video IDs from the YouTube URLs. Skip videos that are known to cause ingestion problems:

videos = []

for podcast in podcasts:
    _, video_id = podcast['youtube'].split('watch?v=')

    if video_id in ['FRi0SUtxdMw', 's8kyzy8V5b8']:
        continue

    videos.append({
        'title': podcast['title'],
        'video_id': video_id
    })

print(f"Will process {len(videos)} videos")
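Splitting on 'watch?v=' works for these URLs, but it keeps any trailing query parameters (such as '&t=42s') in the extracted ID. If you ever hit such entries, parsing the query string is a more defensive option. A small sketch, not part of the workshop code:

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    # parse_qs splits the query string into a dict of lists,
    # so extra parameters don't leak into the video ID
    return parse_qs(urlparse(url).query)['v'][0]

print(extract_video_id('https://www.youtube.com/watch?v=abc123&t=42s'))
# abc123
```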

Now each item has the shape the pipeline needs:

videos[0]

The expected fields are title and video_id. Everything after this point can ignore the original YAML structure.

Process all videos

The first version of the ingestion loop is plain Python. It checks whether a video is already indexed, fetches subtitles when needed, and writes the document to Elasticsearch:

from tqdm.auto import tqdm

for video in tqdm(videos):
    video_id = video['video_id']
    video_title = video['title']

    if es.exists(index='podcasts', id=video_id):
        print(f'already processed {video_id}')
        continue

    transcript = fetch_transcript(video_id)
    subtitles = make_subtitles(transcript)

Then index the document:

    doc = {
        "video_id": video_id,
        "title": video_title,
        "subtitles": subtitles
    }

    es.index(index="podcasts", id=video_id, document=doc)

The loop reveals the production problem: YouTube can block your IP after a small number of transcript requests, often partway through the podcast archive. That is exactly the failure mode we need the workshop to handle.

Proxy support

The workaround for blocked cloud or local IP addresses is a proxy. Prefer a residential proxy when possible: residential IPs look like ordinary home connections, while datacenter and cloud-provider IPs are much more likely to be blocked.

Put proxy credentials in .env:

PROXY_BASE_URL=...
PROXY_USER=...
PROXY_PASSWORD=...

Make sure .env is ignored:

echo '.env' >> .gitignore

You can load the variables with dirdotenv:

echo 'alias dirdotenv="uvx dirdotenv"' >> ~/.bashrc
echo 'eval "$(dirdotenv hook bash)"' >> ~/.bashrc
source ~/.bashrc

You can also use python-dotenv inside Python:

uv add python-dotenv

Then load the file from the notebook or script:

from dotenv import load_dotenv

load_dotenv()
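It can help to fail fast if any of the three variables did not load, rather than getting a KeyError later when building the proxy URL. A small sketch; the helper name is ours:

```python
import os

def missing_proxy_vars(environ):
    # Return the names of any required proxy settings that are absent,
    # so a typo in .env surfaces early instead of as a KeyError later
    required = ['PROXY_BASE_URL', 'PROXY_USER', 'PROXY_PASSWORD']
    return [name for name in required if name not in environ]

# In the notebook:
#   assert not missing_proxy_vars(os.environ), missing_proxy_vars(os.environ)
```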

Configure the YouTube transcript client with a proxy:

import os
from youtube_transcript_api.proxies import GenericProxyConfig

proxy_user = os.environ['PROXY_USER']
proxy_password = os.environ['PROXY_PASSWORD']
proxy_base_url = os.environ['PROXY_BASE_URL']

proxy_url = f'http://{proxy_user}:{proxy_password}@{proxy_base_url}'

proxy = GenericProxyConfig(
    http_url=proxy_url,
    https_url=proxy_url,
)

Use the proxy when fetching transcripts:

def fetch_transcript(video_id):
    ytt_api = YouTubeTranscriptApi(proxy_config=proxy)
    transcript = ytt_api.fetch(video_id)
    return transcript

Gotcha: proxies reduce IP blocking, but they introduce more network failures. A request now goes through your machine, the proxy service, a residential proxy host, YouTube, and back. SSL errors, timeouts, and blocked proxy IPs are expected.
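A plain retry loop with exponential backoff can absorb some of these transient failures. Here is a minimal sketch; the helper name, attempt count, and delays are illustrative, not part of the workshop code:

```python
import time

def fetch_with_retries(fetch_fn, video_id, max_attempts=3, base_delay=2.0):
    # Retry transient failures (SSL errors, timeouts) with exponential
    # backoff: wait base_delay, then 2x, then 4x between attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_fn(video_id)
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f'attempt {attempt} failed ({exc}), retrying in {delay:.0f}s')
            time.sleep(delay)
```

This keeps the loop moving, but it still cannot resume a half-finished run or show which videos failed and why.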

That gotcha is why the notebook loop is not enough. We need retries, observable progress, and a workflow that can resume after failures, which is what Temporal provides in Part 4: Temporal motivation.
