Seeking Alpha Scraper Suite

A comprehensive suite of tools for collecting, downloading, and extracting earnings call transcripts from Seeking Alpha.

Overview

This project provides a set of Python scripts to automate the process of:

Collecting article links from Seeking Alpha
Downloading article HTML content
Extracting earnings call transcripts from the HTML content
Storing data in a structured format (SQL Server database and/or local files)

Components

Link Collection

link-collector.py: Collects article links from Seeking Alpha author pages and stores them in a SQL Server database.
seekingalpha_scraper.py: More advanced scraper with link collection functionality and improved anti-detection features.

Content Downloading

content-downloader.py: Downloads HTML content for collected links and stores them as local files.
seekingalpha_scraper.py: Also provides content downloading capabilities with additional features.

Transcript Extraction

transcript-extractor.py: Extracts transcript content from downloaded HTML files and saves as JSON.
html_unified.py: Standalone utility for extracting transcript content using various parsing techniques.

Database Management

db-setup.py: Sets up and manages the SQL Server database tables required for the project.

All-in-One Solutions

unified.py: Combined workflow for link collection, content downloading, and transcript extraction.

Requirements

Python 3.7+
SQL Server database (for database storage mode)
Chrome browser (for web scraping)

Dependencies

undetected_chromedriver
BeautifulSoup4
pyodbc
python-dotenv
And other standard Python libraries

Setup

Create a .env file with your database credentials:

UID="your_username"
PID="your_password"
SERVER="your_server"
DATABASE="your_database"

Initialize the database:
```
python db-setup.py
```
Use the scraper components as needed (see Usage section)

Usage

Link Collection

# Using link-collector.py (database mode)
python link-collector.py --url "https://seekingalpha.com/author/your-author" --max-links 500 --headless

# Using seekingalpha_scraper.py (CSV mode)
python seekingalpha_scraper.py links --url "https://seekingalpha.com/author/your-author" --output "links.csv" --max-pages 10

Content Downloading

# Using content-downloader.py (database mode)
python content-downloader.py --output "sa_content" --batch-size 50 --one-time

# Using seekingalpha_scraper.py (CSV mode)
python seekingalpha_scraper.py articles --csv "links.csv" --output "sa_content"

Transcript Extraction

# Using transcript-extractor.py (database mode)
python transcript-extractor.py --html-dir "sa_content" --output "transcripts" --one-time

# Using html_unified.py (standalone mode)
python html_unified.py --input "sa_content" --output "transcripts"

Unified Workflow

# Using unified.py (combined workflow)
python unified.py --author "https://seekingalpha.com/author/your-author" --output "output_directory" --max-links 500

Advanced Features

Anti-Detection Measures: Uses undetected_chromedriver and various browser fingerprinting techniques to avoid detection.
Captcha Handling: Includes manual and automated captcha-handling capabilities.
Parallel Processing: Supports multi-threaded downloads and extractions for improved performance.
Adaptive Delays: Implements smart delays based on captcha encounters to avoid rate limiting.
Multiple Extraction Methods: Uses various techniques to extract transcript content even when page structure changes.

Project Structure

alpha/
├── .env                    # Environment variables
├── link-collector.py       # Link collection script (DB mode)
├── seekingalpha_scraper.py # Advanced scraper with multiple modes
├── content-downloader.py   # Content downloading script (DB mode)
├── transcript-extractor.py # Transcript extraction script (DB mode)
├── db-setup.py             # Database initialization script
├── unified.py              # All-in-one workflow script
├── html_unified.py         # Standalone HTML extraction utility
├── sa_content/             # Downloaded HTML content
└── logs/                   # Log files

Notes

Premium Seeking Alpha access is required to download full transcript content.
The scripts handle login sessions and maintain cookies between runs.
Be respectful of Seeking Alpha's servers by using reasonable delays between requests.
This tool is designed for research and personal use only.

Legal Disclaimer

This software is intended for personal research and educational purposes only. Users are responsible for ensuring their use of this software complies with Seeking Alpha's Terms of Service and all applicable laws. The authors of this software are not responsible for any misuse or violation of terms.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
captcha_20250226_011439.png		captcha_20250226_011439.png
content-downloader.py		content-downloader.py
content_downloader.log		content_downloader.log
csv_link_collector.log		csv_link_collector.log
db-setup.py		db-setup.py
html_unified.py		html_unified.py
link-collector.py		link-collector.py
link_collector.log		link_collector.log
links.csv		links.csv
play.py		play.py
seekingalpha_bypass.log		seekingalpha_bypass.log
seekingalpha_scraper.log		seekingalpha_scraper.log
seekingalpha_scraper.py		seekingalpha_scraper.py
simple_browser.log		simple_browser.log
stealth_browser.log		stealth_browser.log
transcript-extractor.py		transcript-extractor.py
transcript_extractor.log		transcript_extractor.log
unified.py		unified.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Seeking Alpha Scraper Suite

Overview

Components

Link Collection

Content Downloading

Transcript Extraction

Database Management

All-in-One Solutions

Requirements

Dependencies

Setup

Usage

Link Collection

Content Downloading

Transcript Extraction

Unified Workflow

Advanced Features

Project Structure

Notes

Legal Disclaimer

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Seeking Alpha Scraper Suite

Overview

Components

Link Collection

Content Downloading

Transcript Extraction

Database Management

All-in-One Solutions

Requirements

Dependencies

Setup

Usage

Link Collection

Content Downloading

Transcript Extraction

Unified Workflow

Advanced Features

Project Structure

Notes

Legal Disclaimer

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages