
cppa_youtube_script_tracker — Service API

Module path: cppa_youtube_script_tracker.services

Description: YouTube channel metadata, video metadata, transcript state, and speaker links for C++ conference talks. Single place for all writes to cppa_youtube_script_tracker models. Speaker profiles live in cppa_user_tracker.YoutubeSpeaker.

Type notation: Model types refer to cppa_youtube_script_tracker.models unless noted. YoutubeSpeaker refers to cppa_user_tracker.models.YoutubeSpeaker.


Public API (generated)

| Function | Parameters | Return type | Summary |
| --- | --- | --- | --- |
| get_or_create_channel | channel_id: str, channel_title: str = '' | YouTubeChannel | Get or create a YouTubeChannel by channel_id (PK). |
| get_or_create_tag | tag_name: str | CppaTags | Get or create a CppaTags entry by tag_name. |
| get_or_create_video | video_id: str, channel: Optional[YouTubeChannel], metadata_dict: dict[str, Any] | tuple[YouTubeVideo, bool] | Get or create a YouTubeVideo by video_id (PK). Returns (video, created). |
| link_speaker_to_video | video: YouTubeVideo, speaker: Any | YouTubeVideoSpeaker | Link a YoutubeSpeaker to a YouTubeVideo (get-or-create). Returns YouTubeVideoSpeaker. |
| link_tag_to_video | video: YouTubeVideo, tag: CppaTags | YouTubeVideoTags | Link a CppaTags entry to a YouTubeVideo (get-or-create). Returns YouTubeVideoTags. |
| remove_speaker_links_by_name | video: YouTubeVideo, speaker_name: str | int | Remove all speaker links for a video where speaker.display_name matches speaker_name. |
| update_video_transcript | video: YouTubeVideo, transcript_path: str | YouTubeVideo | Mark video as having a transcript and store its path. Returns the updated video. |

metadata_dict accepted keys:

| Key | Type | Notes |
| --- | --- | --- |
| title | str | |
| description | str | |
| published_at | datetime or str | ISO string is parsed via parse_datetime |
| duration_seconds | int | |
| view_count | int \| None | |
| like_count | int \| None | |
| comment_count | int \| None | |
| search_term | str | Search term used to discover the video |
| scraped_at | datetime or str | ISO string is parsed via parse_datetime |

Tags are not part of metadata_dict; use get_or_create_tag and link_tag_to_video (in this module) to associate tags with a video after creating or fetching it.
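The write path described above can be sketched as follows. This is a minimal illustration, not the project's own code: `to_metadata_dict` and `persist_fetched_video` are hypothetical helpers, the input record is assumed to have the shape of the dicts returned by `fetch_videos`, and the services module is passed in as an argument so the sketch runs without Django.

```python
from typing import Any

# Keys accepted by metadata_dict, as listed in the table above.
ACCEPTED_METADATA_KEYS = (
    "title",
    "description",
    "published_at",
    "duration_seconds",
    "view_count",
    "like_count",
    "comment_count",
    "search_term",
    "scraped_at",
)


def to_metadata_dict(fetched: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: keep only the keys that metadata_dict accepts."""
    return {key: fetched[key] for key in ACCEPTED_METADATA_KEYS if key in fetched}


def persist_fetched_video(services: Any, fetched: dict[str, Any]):
    """Hypothetical helper: store one fetched record, then attach its tags.

    `services` is expected to expose the functions documented above, e.g.
    `from cppa_youtube_script_tracker import services`.
    """
    channel = services.get_or_create_channel(
        fetched["channel_id"], fetched.get("channel_title", "")
    )
    video, created = services.get_or_create_video(
        fetched["video_id"], channel, to_metadata_dict(fetched)
    )
    # Tags are linked separately because they are not part of metadata_dict.
    for tag_name in fetched.get("tags", []):
        tag = services.get_or_create_tag(tag_name)
        services.link_tag_to_video(video, tag)
    return video, created
```

Passing the module in keeps the helper easy to exercise with a stand-in object in tests; in application code the real services module would be supplied.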


YoutubeSpeaker (in cppa_user_tracker)

| Function | Parameter types | Return type | Description |
| --- | --- | --- | --- |
| get_or_create_youtube_speaker | external_id: str, display_name: str = "", identity: Identity \| None = None | tuple[YoutubeSpeaker, bool] | Get or create a speaker by external_id; updates display_name when provided. Raises ValueError if external_id is empty. |

Module path: cppa_user_tracker.services


Preprocessor

Module path: cppa_youtube_script_tracker.preprocessor

| Function | Parameter types | Return type | Description |
| --- | --- | --- | --- |
| preprocess_youtube_for_pinecone | failed_ids: list[str], final_sync_at: datetime \| None | tuple[list[dict], bool] | Build Pinecone sync documents for YouTube videos. Returns (docs, is_chunked=False). |

Each document dict has:

  • content: Title, speakers, channel, published date, description, and transcript text (if available).
  • metadata["doc_id"]: "youtube-{video_id}".
  • metadata["ids"]: DB primary key of the YouTubeVideo row (for retry tracking).
  • metadata["type"]: "youtube".
  • metadata["url"]: "https://www.youtube.com/watch?v={video_id}".
  • metadata["title"], metadata["author"] (comma-separated speaker names), metadata["channel"], metadata["timestamp"] (Unix timestamp), metadata["has_transcript"].
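To make the document shape concrete, here is a small sketch that assembles one such dict from plain values. `build_youtube_doc` is a hypothetical function, not part of the preprocessor. The layout of `content`, the use of `published_at` for `timestamp`, and deriving `has_transcript` from the transcript text are assumptions made for illustration only.

```python
from datetime import datetime, timezone
from typing import Any


def build_youtube_doc(
    video_id: str,
    pk: Any,
    title: str,
    speakers: list[str],
    channel: str,
    published_at: datetime,
    description: str = "",
    transcript: str = "",
) -> dict[str, Any]:
    """Hypothetical builder for one Pinecone sync document."""
    author = ", ".join(speakers)  # comma-separated speaker names
    parts = [
        title,
        f"Speakers: {author}" if author else "",
        f"Channel: {channel}",
        f"Published: {published_at.date().isoformat()}",
        description,
        transcript,  # included only when available
    ]
    return {
        # Assumed layout: one field per line, empty fields skipped.
        "content": "\n".join(part for part in parts if part),
        "metadata": {
            "doc_id": f"youtube-{video_id}",
            "ids": pk,
            "type": "youtube",
            "url": f"https://www.youtube.com/watch?v={video_id}",
            "title": title,
            "author": author,
            "channel": channel,
            # Assumption: the Unix timestamp is taken from published_at.
            "timestamp": int(published_at.timestamp()),
            "has_transcript": bool(transcript),
        },
    }


doc = build_youtube_doc(
    video_id="abc123",
    pk="abc123",
    title="Example talk",
    speakers=["A. Speaker", "B. Speaker"],
    channel="Example Conf",
    published_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```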

Workspace helpers

Module path: cppa_youtube_script_tracker.workspace

| Function | Return type | Description |
| --- | --- | --- |
| get_workspace_root() | Path | workspace/cppa_youtube_script_tracker/ |
| get_raw_dir() | Path | workspace/raw/cppa_youtube_script_tracker/ (permanent JSON archive) |
| get_raw_transcripts_dir() | Path | workspace/raw/cppa_youtube_script_tracker/transcripts/ (permanent VTT archive) |
| get_metadata_queue_dir() | Path | workspace/cppa_youtube_script_tracker/metadata/ (short-lived; moved after persist) |
| get_raw_metadata_path(video_id) | Path | Raw metadata JSON archive path for a video. |
| get_metadata_queue_path(video_id) | Path | Metadata queue JSON path for a video. |
| get_transcript_path(video_id, lang="en") | Path | VTT path for a video. |
| iter_metadata_queue_jsons() | Iterator[Path] | Yield all *.json files in the metadata queue directory. |
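The directory layout above can be mirrored with a few lines of pathlib. This is a rough sketch, not the module itself. Passing the workspace root as an argument, the `{video_id}.{lang}.vtt` file name, and placing transcripts under the raw transcripts directory are all assumptions, and the function names are hypothetical stand-ins for the real helpers.

```python
from pathlib import Path
from typing import Iterator

APP_NAME = "cppa_youtube_script_tracker"


def metadata_queue_dir(workspace: Path) -> Path:
    """Short-lived queue: workspace/cppa_youtube_script_tracker/metadata/."""
    return workspace / APP_NAME / "metadata"


def raw_transcripts_dir(workspace: Path) -> Path:
    """Permanent VTT archive: workspace/raw/cppa_youtube_script_tracker/transcripts/."""
    return workspace / "raw" / APP_NAME / "transcripts"


def transcript_path(workspace: Path, video_id: str, lang: str = "en") -> Path:
    """Assumed file name: {video_id}.{lang}.vtt inside the transcript archive."""
    return raw_transcripts_dir(workspace) / f"{video_id}.{lang}.vtt"


def iter_queue_jsons(workspace: Path) -> Iterator[Path]:
    """Yield every *.json file in the metadata queue, in sorted order."""
    queue = metadata_queue_dir(workspace)
    if queue.is_dir():
        yield from sorted(queue.glob("*.json"))
```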

Fetcher

Module path: cppa_youtube_script_tracker.fetcher

| Function | Parameter types | Return type | Description |
| --- | --- | --- | --- |
| fetch_videos | published_after: datetime, published_before: datetime, channel_title: str \| None = None, skip_video_ids: set[str] \| None = None, min_duration_seconds: int = 0 | list[dict] | Fetch video metadata from YouTube Data API v3 for the given time window. Returns normalised metadata dicts. |

Each returned dict contains the following keys:

| Key | Type | Notes |
| --- | --- | --- |
| video_id | str | YouTube video ID |
| title | str | |
| description | str | |
| channel_id | str | |
| channel_title | str | |
| published_at | str | ISO 8601 datetime string from API |
| duration_seconds | int | Parsed from ISO 8601 duration (e.g. PT1H2M10S) |
| view_count | int \| None | |
| like_count | int \| None | |
| comment_count | int \| None | |
| tags | list | |
| search_term | str | Query used to discover the video |
| scraped_at | str | ISO 8601 datetime when the API call was made |
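As an illustration of how `duration_seconds` can be derived from an ISO 8601 duration such as PT1H2M10S, here is one possible parser. It is a sketch under stated assumptions: `parse_iso8601_duration` is a hypothetical name, and only the day, hour, minute, and second designators are handled.

```python
import re

# Matches durations such as PT45S, PT1H2M10S, or P1DT2H.
_DURATION_RE = re.compile(
    r"P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?"
)


def parse_iso8601_duration(value: str) -> int:
    """Return the duration in whole seconds; raise ValueError if unparseable."""
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"not an ISO 8601 duration: {value!r}")
    parts = {name: int(text) if text else 0 for name, text in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(parse_iso8601_duration("PT1H2M10S"))  # 3730
```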

channel_title behaviour: If channel_title matches a key in the C_PLUS_PLUS_CHANNELS dict, the API call is filtered by that channel's ID. If channel_title is unrecognised, a keyword search by name is used. If channel_title is None, all known C++ channels are searched.
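The three cases above amount to a small dispatch. The sketch below shows one way to express it. `plan_searches` is a hypothetical helper, the channel names and IDs are placeholders, and using `channelId` and `q` (both real parameters of the YouTube Data API v3 search endpoint) is an assumption about how the fetcher builds its requests.

```python
from typing import Optional

# Placeholder stand-in for C_PLUS_PLUS_CHANNELS (channel title -> channel ID).
EXAMPLE_CHANNELS = {
    "Example Conf A": "UC_EXAMPLE_A",
    "Example Conf B": "UC_EXAMPLE_B",
}


def plan_searches(
    channel_title: Optional[str], known_channels: dict[str, str]
) -> list[dict[str, str]]:
    """Return one dict of search parameters per API query to run."""
    if channel_title is None:
        # No title given: search every known C++ channel by ID.
        return [{"channelId": channel_id} for channel_id in known_channels.values()]
    if channel_title in known_channels:
        # Recognised title: filter by that channel's ID.
        return [{"channelId": known_channels[channel_title]}]
    # Unrecognised title: fall back to a keyword search by name.
    return [{"q": channel_title}]
```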

Requires: YOUTUBE_API_KEY setting. Raises ValueError if missing. Raises ImportError if google-api-python-client is not installed.


Transcript downloader

Module path: cppa_youtube_script_tracker.transcript

| Function | Parameter types | Return type | Description |
| --- | --- | --- | --- |
| download_vtt | video_id: str, output_dir: Path, cookies_file: str \| None = None | Path \| None | Download English VTT subtitles for video_id into output_dir. Returns path to the .vtt file, or None if not found. |

Tries manual captions first, then auto-generated (writeautomaticsub). The output file is written as {video_id}.en.vtt; falls back to any {video_id}*.vtt file in output_dir if the expected name is not present.
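The manual-then-automatic strategy and the file name fallback could look roughly like this. It is a sketch, not the module's implementation: the function names are hypothetical, and the choice of option values is an assumption. The option keys themselves (`skip_download`, `writesubtitles`, `writeautomaticsub`, `subtitleslangs`, `subtitlesformat`, `outtmpl`, `cookiefile`) are standard yt-dlp `YoutubeDL` options.

```python
from pathlib import Path
from typing import Any, Optional


def build_ydl_opts(
    output_dir: Path, cookies_file: Optional[str] = None, auto: bool = False
) -> dict[str, Any]:
    """Options for one subtitle-only pass (manual captions, or auto-generated)."""
    opts: dict[str, Any] = {
        "skip_download": True,  # subtitles only, no media
        "writesubtitles": not auto,
        "writeautomaticsub": auto,
        "subtitleslangs": ["en"],
        "subtitlesformat": "vtt",
        "outtmpl": str(output_dir / "%(id)s.%(ext)s"),
        "quiet": True,
    }
    if cookies_file:
        opts["cookiefile"] = cookies_file
    return opts


def find_vtt(video_id: str, output_dir: Path) -> Optional[Path]:
    """Prefer {video_id}.en.vtt, else fall back to any {video_id}*.vtt file."""
    expected = output_dir / f"{video_id}.en.vtt"
    if expected.exists():
        return expected
    candidates = sorted(output_dir.glob(f"{video_id}*.vtt"))
    return candidates[0] if candidates else None


def download_vtt_sketch(
    video_id: str, output_dir: Path, cookies_file: Optional[str] = None
) -> Optional[Path]:
    """Try manual captions first, then auto-generated ones."""
    import yt_dlp  # raises ImportError when yt-dlp is not installed

    url = f"https://www.youtube.com/watch?v={video_id}"
    for auto in (False, True):
        try:
            with yt_dlp.YoutubeDL(build_ydl_opts(output_dir, cookies_file, auto)) as ydl:
                ydl.download([url])
        except yt_dlp.utils.DownloadError:
            continue  # this pass failed; try the next one
        found = find_vtt(video_id, output_dir)
        if found is not None:
            return found
    return None
```

Importing yt-dlp inside the function keeps the option builder and the file lookup usable, and testable, on machines where yt-dlp is not installed.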

Requires: yt-dlp. Raises ImportError if not installed.


Related docs