Multi dwl #1
Conversation
Summary of Changes

Hello @itskavin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly upgrades the Thinkific Downloader project by introducing parallel download capabilities with a modern, rich terminal user interface. It refactors the core download logic for improved reliability and performance, and implements an atomic resume and backup system to prevent data loss during interruptions. The project's development lifecycle is also bolstered with new CI/CD pipelines for automated quality assurance and security, and all user-facing documentation has been updated to guide users through the enhanced features and setup process.
Code Review
This pull request introduces a major refactoring to support parallel downloads, a more robust resume/backup system, and an improved user interface using rich. The changes are extensive and significantly improve the project's architecture by introducing a DownloadManager and a two-phase process of collecting tasks and then executing them. My review focuses on significant maintainability issues in downloader.py, such as the heavy reliance on global variables and overly complex functions, and suggests a move to a class-based structure. I've also included feedback on standardizing HTTP libraries, improving dependency management, and correcting documentation for consistency and clarity.
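The two-phase design can be sketched in miniature as follows (the task shape and worker function here are illustrative, not the PR's actual DownloadManager API):

```python
from concurrent.futures import ThreadPoolExecutor

# Phase 1: walk the course structure and collect lightweight task descriptions.
tasks = [{"url": f"https://example.com/{i}.mp4", "dest": f"video{i}.mp4"} for i in range(4)]

# Stand-in worker; the real one would stream the URL to disk with resume support.
def download(task):
    return task["dest"]

# Phase 2: execute all collected tasks in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(download, tasks))

print(results)  # → ['video0.mp4', 'video1.mp4', 'video2.mp4', 'video3.mp4']
```

Separating collection from execution keeps the parallelism concerns in one place and makes the task list easy to persist for resume support.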
```python
# Globals to mirror PHP behavior
ROOT_PROJECT_DIR = Path.cwd()
COURSE_CONTENTS: List[Dict[str, Any]] = []
SETTINGS: Optional[Settings] = None
BASE_HOST: Optional[str] = None
DOWNLOAD_MANAGER: Optional[DownloadManager] = None
DOWNLOAD_TASKS: List[Dict[str, Any]] = []  # Collect all download tasks for parallel execution
CONTENT_PROCESSOR: Optional[ContentProcessor] = None
```
The heavy use of global variables (SETTINGS, DOWNLOAD_MANAGER, DOWNLOAD_TASKS, etc.) makes the application state difficult to manage, test, and reason about. It creates tight coupling between functions and can lead to unpredictable behavior.
This should be refactored to an object-oriented approach. Encapsulating the state and logic within a class (e.g., CourseDownloader) would make the code more modular, maintainable, and robust.
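A minimal sketch of that refactor could look like this (the class name `CourseDownloader` is taken from the suggestion above; the attribute and method names are illustrative):

```python
from pathlib import Path
from typing import Any, Dict, List, Optional

class CourseDownloader:
    """Encapsulates the state currently held in module-level globals."""

    def __init__(self, root_dir: Optional[Path] = None):
        self.root_project_dir = root_dir or Path.cwd()
        self.course_contents: List[Dict[str, Any]] = []
        self.download_tasks: List[Dict[str, Any]] = []
        self.base_host: Optional[str] = None

    def add_download_task(self, url: str, dest: Path, kind: str) -> None:
        # Instance state replaces the DOWNLOAD_TASKS global.
        self.download_tasks.append({"url": url, "dest": dest, "type": kind})

# Two instances no longer share state, which makes unit testing straightforward.
dl = CourseDownloader(root_dir=Path("/tmp"))
dl.add_download_task("https://example.com/a.mp4", Path("a.mp4"), "video")
print(len(dl.download_tasks))  # → 1
```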
```python
def collect_chapter_tasks(content_ids: Iterable[Any], chapter_path: Path):
    """Collect download tasks for a specific chapter."""
    from .wistia_downloader import video_downloader_wistia, video_downloader_videoproxy
    global COURSE_CONTENTS, SETTINGS, DOWNLOAD_TASKS

    index = 1
    for content_id in content_ids:
        match = next((c for c in COURSE_CONTENTS if c['id'] == content_id), None)
        if not match:
            print(f"  ⚠️ No content found for id {content_id}")
            index += 1
            continue

        ctype = match.get('contentable_type') or match.get('default_lesson_type_label')
        print(f"  🔍 Found {ctype}: {match.get('name')}")

        # HTML Item (Notes) - Collect download tasks
        if ctype == 'HtmlItem':
            fname = filter_filename(f"{match['slug']}.html")
            dc = chapter_path / filter_filename(f"{index}. {match['name']} Text")
            dc.mkdir(exist_ok=True)

            if not (dc / fname).exists():
                j = api_get(f"/api/course_player/v2/html_items/{match['contentable']}")
                if j:
                    html_text = j.get('html_item', {}).get('html_text', '')
                    decoded = unicode_decode(html_text)

                    # Collect MP3 audio files
                    mp3_matches = MP3_PATTERN.findall(decoded)
                    if mp3_matches:
                        for audio_url in set(mp3_matches):
                            audio_name = filter_filename(Path(urlparse(audio_url).path).name)
                            add_download_task(audio_url, dc / audio_name, "audio")

                    # Save HTML content to file
                    fname = fname.replace(" ", "-")
                    (dc / fname).write_text(decoded, encoding='utf-8', errors='replace')

                    # Collect video download tasks
                    videoproxy_matches = VIDEOPROXY_PATTERN.findall(decoded)
                    if videoproxy_matches:
                        for video_url in set(videoproxy_matches):
                            collect_video_task_videoproxy(video_url, filter_filename(match['name']), dc)

                    wistia_matches = WISTIA_PATTERN.findall(decoded)
                    if wistia_matches:
                        for wistia_id in set(wistia_matches):
                            collect_video_task_wistia(wistia_id, filter_filename(match['name']), dc)

            index += 1
            continue

        # Multimedia (iframe) - Collect download tasks
        if match.get('default_lesson_type_label') == 'Multimedia':
            dc = chapter_path / filter_filename(f"{index}. {match['name']} Multimedia")
            dc.mkdir(exist_ok=True)

            j = api_get(f"/api/course_player/v2/iframes/{match['contentable']}")
            file_contents = ''
            if j:
                src_url = unicode_decode(j.get('iframe', {}).get('source_url') or '')
                if re.search(r"(\.md|\.html|/)$", src_url):
                    try:
                        file_contents = http_get(src_url)
                    except Exception:
                        file_contents = src_url
                else:
                    file_contents = src_url

                # Collect attached files
                if j.get('download_files'):
                    for download_file in j['download_files']:
                        download_file_name = filter_filename(download_file.get('label') or 'file')
                        download_file_url = download_file.get('download_url')
                        if download_file_url:
                            add_download_task(download_file_url, dc / download_file_name, "file")

            # Save HTML file
            fname = f"{match['name']}.html"
            fname = re.sub(r"[^A-Za-z0-9\_\-\. \?]", '', fname)
            fname = filter_filename(fname)
            (dc / fname).write_text(file_contents, encoding='utf-8', errors='replace')

            index += 1
            continue

        # Lesson (videos + html + attachments) - Collect download tasks
        if ctype == 'Lesson':
            dc = chapter_path / filter_filename(f"{index}. {match['name']} Lesson")
            dc.mkdir(exist_ok=True)
            vname = filter_filename(match['name'])

            j = api_get(f"/api/course_player/v2/lessons/{match['contentable']}")
            if j:
                # Collect video download tasks
                videos = j.get('videos') or []
                if videos:
                    for video in videos:
                        storage = video.get('storage_location')
                        identifier = video.get('identifier')
                        if storage == 'wistia' and identifier:
                            collect_video_task_wistia(identifier, vname, dc)
                        elif storage == 'videoproxy' and identifier:
                            collect_video_task_videoproxy(f"https://platform.thinkific.com/videoproxy/v1/play/{identifier}", vname, dc)
                        else:
                            direct = video.get('url')
                            if direct:
                                add_download_task(direct, dc / f"{vname}.mp4", "video")

                # Save lesson HTML content
                lesson_info = j.get('lesson', {})
                html_text = lesson_info.get('html_text') if isinstance(lesson_info, dict) else None
                if html_text and html_text.strip():
                    html_filename = f"{vname}.html"
                    (dc / html_filename).write_text(html_text, encoding='utf-8', errors='replace')

                # Collect attached files
                for dlf in j.get('download_files', []) or []:
                    download_file_name = filter_filename(dlf.get('label') or 'file')
                    download_file_url = dlf.get('download_url')
                    if download_file_url:
                        add_download_task(download_file_url, dc / download_file_name, "file")

            index += 1
            continue

        # PDF - Collect download tasks
        if ctype == 'Pdf':
            dc = chapter_path / filter_filename(f"{index}. {match['name']}")
            dc.mkdir(exist_ok=True)

            j = api_get(f"/api/course_player/v2/pdfs/{match['contentable']}")
            if j:
                pdf = j.get('pdf', {})
                pdf_url = pdf.get('url')
                if pdf_url:
                    fname = filter_filename(Path(urlparse(pdf_url).path).name)
                    add_download_task(pdf_url, dc / fname, "pdf")

            index += 1
            continue

        # Download (shared files) - Collect download tasks
        if ctype == 'Download':
            dc = chapter_path / filter_filename(f"{index}. {match['name']}")
            dc.mkdir(exist_ok=True)

            j = api_get(f"/api/course_player/v2/downloads/{match['contentable']}")
            if j:
                for dlf in j.get('download_files', []) or []:
                    label = filter_filename(dlf.get('label') or 'file')
                    url = dlf.get('download_url')
                    if url:
                        add_download_task(url, dc / label, "file")

            index += 1
            continue

        # Audio - Collect download tasks
        if ctype == 'Audio':
            dc = chapter_path / filter_filename(f"{index}. {match['name']}")
            dc.mkdir(exist_ok=True)

            j = api_get(f"/api/course_player/v2/audio/{match['contentable']}")
            if j:
                audio = j.get('audio', {})
                audio_url = audio.get('url')
                if audio_url:
                    fname = filter_filename(Path(urlparse(audio_url).path).name)
                    add_download_task(audio_url, dc / fname, "audio")

            index += 1
            continue

        # Presentation - Collect download tasks
        if ctype == 'Presentation':
            dc = chapter_path / filter_filename(f"{index}. {match['name']}")
            dc.mkdir(exist_ok=True)

            j = api_get(f"/api/course_player/v2/presentations/{match['contentable']}")
            if j:
                pres = j.get('presentation', {})
                pdf_url = pres.get('source_file_url')
                pdf_name = filter_filename(pres.get('source_file_name') or 'slides.pdf')
                if pdf_url:
                    add_download_task(pdf_url, dc / pdf_name, "presentation")

                # Handle presentation merging - collect slide assets
                merge_flag = SETTINGS.ffmpeg_presentation_merge if SETTINGS else False
                if merge_flag:
                    from shutil import which
                    if which('ffmpeg'):
                        items = j.get('presentation_items') or []
                        for it in items:
                            pos = it.get('position')
                            img_url = it.get('image_file_url')
                            aud_url = it.get('audio_file_url')
                            if img_url:
                                img_url = 'https:' + img_url if img_url.startswith('//') else img_url
                                img_name = filter_filename(f"{pos}{it.get('image_file_name', 'slide.png')}")
                                add_download_task(img_url, dc / img_name, "image")
                            if aud_url:
                                aud_url = 'https:' + aud_url if aud_url.startswith('//') else aud_url
                                aud_name = filter_filename(f"{pos}{it.get('audio_file_name', 'audio.m4a')}")
                                add_download_task(aud_url, dc / aud_name, "audio")

            index += 1
            continue

        # Quiz - Handle separately (complex logic)
        if ctype == 'Quiz':
            dc = chapter_path / filter_filename(f"{index}. {match['name']} Quiz")
            dc.mkdir(exist_ok=True)

            fname = filter_filename(f"{match['name']} Answers.html")
            qname = filter_filename(f"{match['name']} Questions.html")

            result = api_get(f"/api/course_player/v2/quizzes/{match['contentable']}")
            if result:
                file_contents_with_answers = "<h3 style='color: red;'>Answers of this Quiz are marked in RED </h3>"
                file_contents_with_questions = ""

                for qs in result.get("questions", []):
                    choice = 'A'
                    position = qs.get("position", 0) + 1
                    prompt = unicode_decode(qs.get("prompt", ""))
                    explanation = unicode_decode(qs.get("text_explanation", ""))

                    file_contents_with_answers += f"{position}) <strong>{prompt}</strong> Explanation: {explanation}<br><br>"

                    # Collect embedded video tasks
                    wistia_matches = WISTIA_PATTERN.findall(prompt)
                    if wistia_matches:
                        for wistia_match in set(wistia_matches):
                            collect_video_task_wistia(wistia_match, f"QA Video {position}", dc)

                    file_contents_with_questions += f"{position}) <strong>{prompt}</strong><br><br>"

                    for ch in result.get("choices", []):
                        if ch.get("question_id") == qs.get("id"):
                            try:
                                import base64
                                ans = base64.b64decode(ch.get("credited", "")).decode('utf-8', 'ignore')
                                ans = re.sub(r'\d', '', ans)
                            except Exception:
                                ans = ""

                            choice_text = unicode_decode(ch.get("text", ""))
                            if ans == "true":
                                file_contents_with_questions += f"{choice}) {choice_text}<br>"
                                file_contents_with_answers += f"<em style='color: red;'>{choice}) {choice_text}</em><br>"
                            else:
                                file_contents_with_questions += f"{choice}) {choice_text}<br>"
                                file_contents_with_answers += f"{choice}) {choice_text}<br>"

                            choice = chr(ord(choice) + 1)

                    file_contents_with_questions += "<br>"
                    file_contents_with_answers += "<br>"

                (dc / qname).write_text(file_contents_with_questions, encoding='utf-8', errors='replace')
                (dc / fname).write_text(file_contents_with_answers, encoding='utf-8', errors='replace')

            index += 1
            continue

        # Assignment/Survey placeholders
        if ctype in ['Assignment', 'Survey']:
            print(f"  ⚠️ {ctype} content type not yet implemented: {match['name']}")
            index += 1
            continue

        index += 1
```
The function collect_chapter_tasks is excessively long and complex, handling the logic for every content type. This violates the Single Responsibility Principle and makes the code very difficult to read, maintain, and test.
This function should be broken down into smaller, more focused functions, each responsible for a single content type (e.g., _collect_lesson_tasks, _collect_pdf_tasks, etc.). A dispatcher can then call the appropriate function based on the content type, which will greatly improve readability and maintainability.
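A dispatcher along those lines could look like this (the handler names come from the suggestion above; their bodies and return values here are illustrative stubs, not the real collection logic):

```python
# Each handler owns exactly one content type; only two stubs are shown.
def _collect_pdf_tasks(match, dc):
    return f"pdf task for {match['name']}"

def _collect_lesson_tasks(match, dc):
    return f"lesson tasks for {match['name']}"

CONTENT_HANDLERS = {
    'Pdf': _collect_pdf_tasks,
    'Lesson': _collect_lesson_tasks,
}

def collect_content_tasks(match, dc):
    """Dispatch to the handler for this content type, if one exists."""
    handler = CONTENT_HANDLERS.get(match.get('contentable_type'))
    if handler is None:
        return f"unsupported content type: {match.get('contentable_type')}"
    return handler(match, dc)

print(collect_content_tasks({'contentable_type': 'Pdf', 'name': 'Syllabus'}, None))
# → pdf task for Syllabus
```

Adding a new content type then means writing one small handler and registering it in the table, rather than growing a several-hundred-line function.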
```python
def test_resume_tracker_atomic_save(self):
    from thinkific_downloader.resume_tracker import ResumeTracker
    import tempfile
    with tempfile.TemporaryDirectory() as tmpdir:
        status_file = Path(tmpdir) / ".download_status.json"
        tracker = ResumeTracker(str(status_file))
        tracker.status_data["test"] = {"status": "completed"}
        tracker._save_status()
        assert status_file.exists()
        backup_file = status_file.with_suffix('.json.bak')
        assert backup_file.exists()
```
```python
install_requires=[
    "requests>=2.31.0",
    "rich>=13.0.0",
    "tqdm>=4.65.0",
    "urllib3>=2.0.0",
],
```
There appears to be an inconsistency between the dependencies listed in setup.py and requirements.txt. For example, requests has different version specifiers in each file. To ensure consistency and a single source of truth, it's best to define abstract dependencies in install_requires here. requirements.txt can then be used for development dependencies or for pinning exact versions for deployment. A common practice for development is to have requirements.txt simply contain -e . to install the package in editable mode along with its dependencies from setup.py.
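Concretely, the single-source-of-truth arrangement described above would leave the version ranges in `setup.py` and reduce `requirements.txt` to an editable install (file contents below are illustrative):

```
# requirements.txt — delegates to setup.py's install_requires
-e .
```

With this layout, `pip install -r requirements.txt` resolves the same dependency set as `pip install .`, so the two files cannot drift apart.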
```python
from rich.console import Console
from .config import Settings
from .file_utils import filter_filename
```
This pull request introduces major improvements to CI/CD automation, dependency management, environment configuration, and documentation for the Thinkific Downloader project. The changes add a comprehensive CI pipeline with multi-platform testing, code quality checks, Docker image verification, security scanning, and documentation validation. Automated dependency security checks are now scheduled weekly. The `.env.example` file is reorganized for clarity and feature grouping. Documentation and development guides are updated to reflect the new atomic resume/backup system and the migration from `file_validator.py` to `resume_tracker.py`.

Continuous Integration & Automation

- Added `.github/workflows/ci.yml` covering code formatting, linting, multi-platform testing, Docker build verification, security scans, documentation checks, and summary reporting.
- Added `.github/workflows/dependencies.yml` for automated dependency security checks, update suggestions, and informational pull requests.

Environment Configuration

- Reorganized `.env.example` for clarity, grouping settings by authentication, basic, enhanced, advanced, and optional features; added new options for validation, resume, rate limiting, and debug mode, with improved documentation.

Documentation & Developer Experience

- Updated `README.md` and `DEVELOPMENT.md` to highlight the new atomic resume/backup system, clarify quick start instructions, and reflect the migration from `file_validator.py` to `resume_tracker.py` in both the architecture and the test suite. [1] [2] [3] [4] [5] [6]
- Added documentation for `resume_tracker.py` to `DEVELOPMENT.md`.

These changes greatly improve project reliability, maintainability, and developer onboarding.