qcom-anilyada/release-evaluation-graph

GlobeNewswire LLM Signal Pipeline — Learning Project

Personal Learning Project: This is an educational project created in my free time to learn about LLM orchestration, agentic pipelines, and financial data analysis. All data used is publicly available.

Project Overview

This project is a hands-on learning exercise to understand:

  • LangGraph for building multi-step LLM agentic workflows
  • LLM orchestration — chaining extraction, enrichment, and reasoning steps
  • Prompt engineering for structured data extraction and signal generation
  • Data pipeline design with SQLite caching and JSON output
  • Financial data analysis using public market data

Disclaimer: This is purely for educational purposes. Not financial advice. All data is from public sources.


Learning Objectives

What I'm Learning

  1. LLM Concepts & Frameworks

    • LangGraph StateGraph for defining multi-node agentic pipelines
    • State management across graph nodes using TypedDict
    • Conditional edges and early exits in directed graphs
    • Subgraph composition (main pipeline + per-article strategy subgraph)
    • LLM prompt design for structured JSON extraction
    • Multi-step reasoning: extract → enrich → verdict
  2. Python Programming

    • Abstract base classes and strategy pattern
    • TypedDict for structured state definitions
    • SQLite caching to avoid redundant LLM calls and HTTP fetches
    • Environment variable management with python-dotenv
    • Modular project layout for maintainability
  3. Financial Data Analysis

    • Understanding press release events (equity offerings, partnerships, etc.)
    • Fetching public market data with yfinance
    • Computing financial ratios (dilution %, discount to close, deal % of market cap)
    • Anchoring market data to a pre-event timestamp (T-1) to avoid look-ahead bias
  4. Data Engineering

    • ETL pipeline design (scrape → parse → cache → enrich → classify → output)
    • SQLite caching strategies (article bodies + signal records)
    • JSON output design with nested structured fields
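The ratios listed under Financial Data Analysis come down to simple arithmetic on extracted deal facts and T-1 market data. The sketch below is illustrative only: the function name `offering_ratios`, its parameters, and the exact ratio definitions are assumptions, not the formulas used in this repository.

```python
from dataclasses import dataclass


@dataclass
class OfferingRatios:
    dilution_pct: float
    discount_to_close_pct: float
    deal_pct_of_market_cap: float


def offering_ratios(new_shares, shares_outstanding, offer_price, prev_close):
    """Compute illustrative offering ratios from T-1 market data.

    Assumed definitions (hypothetical, not taken from the project):
      dilution %        = new shares / shares outstanding before the deal
      discount to close = (prev close - offer price) / prev close
      deal % of mkt cap = (new shares * offer price) / (shares * prev close)
    """
    if shares_outstanding <= 0 or prev_close <= 0:
        raise ValueError("shares_outstanding and prev_close must be positive")
    market_cap = shares_outstanding * prev_close
    deal_size = new_shares * offer_price
    return OfferingRatios(
        dilution_pct=100.0 * new_shares / shares_outstanding,
        discount_to_close_pct=100.0 * (prev_close - offer_price) / prev_close,
        deal_pct_of_market_cap=100.0 * deal_size / market_cap,
    )
```

For example, 1,000,000 new shares at 4.00 against 10,000,000 shares outstanding and a previous close of 5.00 gives 10% dilution, a 20% discount to close, and a deal worth 8% of market cap.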

What This Project Does

Data Collection (Public Sources Only)

Processing Pipeline

  1. Build URL — construct a GlobeNewswire search URL from configured filters
  2. Scrape — paginate through search results, collect article stubs
  3. Fetch — retrieve full article bodies (SQLite-cached to avoid re-fetching)
  4. Extract — LLM reads each press release and outputs structured facts (ticker, event type, deal size, etc.)
  5. Enrich — yfinance fetches pre-event market data anchored to T-1 (prev close, 52-week range, avg volume, market cap)
  6. Verdict — LLM combines extracted facts and market context to produce a 7-point signal
  7. Output — single annotated JSON file with all articles and their signal records
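The caching in the Fetch step can be pictured as a get-or-fetch lookup keyed by URL. This is a minimal sketch using only the standard library; the table name `article_bodies`, the helper names, and the `fetch` callback are hypothetical and may not match `utils/article_cache.py`.

```python
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a cache database with one table keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article_bodies ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def get_body(conn: sqlite3.Connection, url: str, fetch: Callable[[str], str]) -> str:
    """Return the cached body for url, calling fetch(url) only on a miss."""
    row = conn.execute(
        "SELECT body FROM article_bodies WHERE url = ?", (url,)
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: no HTTP request needed
    body = fetch(url)
    conn.execute(
        "INSERT OR REPLACE INTO article_bodies (url, body) VALUES (?, ?)",
        (url, body),
    )
    conn.commit()
    return body
```

Passing the fetcher in as a callback keeps the cache logic testable without network access. The same pattern would apply to caching signal records keyed by article URL.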

Signal Scale

speculative_bearish → bearish → mildly_bearish → neutral → mildly_bullish → bullish → speculative_bullish
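One way to make the scale easy to validate and compare is to treat the listed order as an ordinal position. The sketch below assumes that reading of the scale; `SIGNAL_SCALE` and `signal_score` are hypothetical names, not identifiers from the project.

```python
SIGNAL_SCALE = (
    "speculative_bearish",
    "bearish",
    "mildly_bearish",
    "neutral",
    "mildly_bullish",
    "bullish",
    "speculative_bullish",
)


def signal_score(label: str) -> int:
    """Map a label to its position relative to neutral (-3 to +3).

    Raises ValueError for labels outside the scale, which is useful for
    rejecting malformed LLM output.
    """
    return SIGNAL_SCALE.index(label) - SIGNAL_SCALE.index("neutral")
```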

Technical Stack

Languages & Frameworks

  • Python 3.11 — core language
  • LangGraph — agentic pipeline orchestration
  • LangChain Core — LLM abstractions

APIs & Data Sources

  • GlobeNewswire — public press releases
  • Yahoo Finance (yfinance) — public market data
  • QGenie / OpenRouter — LLM providers (API key required, configured in .env)

Infrastructure

  • SQLite — local caching for article bodies and signal records
  • BeautifulSoup4 — HTML parsing
  • python-dotenv — environment variable management

Project Architecture

release_evaluation_langgraph/
│
├── CONFIG.py                           ← all user settings (edit this to change anything)
├── run.py                              ← entry point
├── graph.py                            ← main LangGraph StateGraph definition
├── state.py                            ← PipelineState and ArticleSignalState TypedDicts
│
├── nodes/                              ← one file per graph node
│   ├── build_url.py
│   ├── scrape_pages.py
│   ├── fetch_bodies.py
│   ├── save_articles.py
│   ├── run_signals.py
│   └── merge_output.py
│
├── signal_strategies/
│   ├── base.py                         ← BaseSignalStrategy abstract class
│   ├── __init__.py                     ← strategy registry
│   └── default/
│       ├── strategy.py                 ← DefaultSignalStrategy (LangGraph subgraph)
│       ├── STRATEGY.md                 ← taxonomy, schema, and signal logic docs
│       ├── extractor.py                ← Step 1: LLM extraction
│       ├── enricher.py                 ← Step 2: yfinance market data
│       └── verdict.py                  ← Step 3: LLM verdict
│
├── utils/                              ← infrastructure (not strategy-specific)
│   ├── url_builder.py
│   ├── article_scraper.py
│   ├── article_fetcher.py
│   ├── article_cache.py                ← SQLite cache for article bodies
│   ├── signal_cache.py                 ← SQLite cache for signal records
│   ├── llm_client.py                   ← unified LLM abstraction (QGenie / OpenRouter)
│   └── utils.py
│
├── filter_mapping.json                 ← maps human labels to GlobeNewswire URL codes
└── export/                             ← output JSON files (created at runtime, gitignored)

LangGraph Pipeline Flow

START
  │
  ▼
build_url_node         Resolves date range, builds GlobeNewswire search URL
  │
  ▼
scrape_pages_node      Paginates through results → article stubs
  │
  ├─[no stubs]──► END
  │
  ▼
fetch_bodies_node      Fetches full article bodies (SQLite-cached)
  │
  ▼
save_articles_node     Saves raw articles to export/
  │
  ▼
run_signals_node       Loops over articles → calls signal strategy per article
  │
  ▼
merge_output_node      Merges signal records into JSON, overwrites file
  │
  ▼
END

Each article runs through a signal strategy subgraph:

extract_node → [no ticker? → skip] → enrich_node → verdict_node → save_signal_node
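The control flow of that subgraph, including the early exit when no ticker is found, can be sketched in plain Python without LangGraph. In the project these steps are graph nodes joined by a conditional edge; here they are ordinary callables, and the state fields and the name `run_article` are assumptions made for illustration.

```python
from typing import Callable, Optional, TypedDict


class ArticleStateSketch(TypedDict, total=False):
    # Hypothetical fields; the real ArticleSignalState lives in state.py.
    body: str
    ticker: Optional[str]
    market: dict
    signal: str
    skipped: bool


Step = Callable[[dict], dict]


def run_article(state: dict, extract: Step, enrich: Step, verdict: Step,
                save: Callable[[dict], None]) -> dict:
    """Run extract -> enrich -> verdict -> save, skipping if no ticker."""
    # Each step returns a partial update that is merged into the state.
    state = {**state, **extract(state)}
    if not state.get("ticker"):
        # Early exit: nothing to enrich or score without a ticker.
        return {**state, "skipped": True}
    state = {**state, **enrich(state)}
    state = {**state, **verdict(state)}
    state = {**state, "skipped": False}
    save(state)
    return state
```

Merging partial updates into a state dictionary mirrors how graph nodes return only the keys they change.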

Getting Started

Prerequisites

  • Python 3.11+
  • Git
  • Virtual environment (recommended)
  • API key for QGenie or OpenRouter

Installation

  1. Clone the repository

    git clone <repo-url>
    cd release_evaluation_langgraph
  2. Create virtual environment

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install langgraph langchain-core requests beautifulsoup4 yfinance python-dotenv openai
    pip install qgenie          # if using QGenie
  4. Configure environment

    cp .env.example .env
    # Edit .env and fill in your API keys and model names
  5. Edit CONFIG.py

    RUN_MODE        = "backtest"                       # or "live"
    DATE_RANGE      = ("2026-04-01", "2026-04-20")    # for backtest
    LLM_PROVIDER    = "qgenie"                         # or "openrouter"
  6. Run the pipeline

    python run.py

Output

  • JSON file: export/articles_YYYY-MM-DD_HHMMSS_<mode>.json
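A filename in that pattern can be built from a timestamp and the run mode. This is a small sketch; the function name `export_path` and its parameters are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def export_path(mode: str, now: datetime, export_dir: str = "export") -> Path:
    """Build export/articles_YYYY-MM-DD_HHMMSS_<mode>.json for a given time."""
    stamp = now.strftime("%Y-%m-%d_%H%M%S")
    return Path(export_dir) / f"articles_{stamp}_{mode}.json"
```

Taking `now` as an argument, rather than calling `datetime.now()` inside, keeps the function deterministic and easy to test.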

Adding a New Signal Strategy

The main graph is strategy-agnostic. To add your own:

  1. Create signal_strategies/my_strategy/strategy.py implementing BaseSignalStrategy
  2. Create signal_strategies/my_strategy/STRATEGY.md documenting your logic
  3. Register it in signal_strategies/__init__.py
  4. Set SIGNAL_STRATEGY = "my_strategy" in CONFIG.py

No changes to graph.py, nodes/, or state.py are required.
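The strategy pattern and registry described in these steps could look roughly like the sketch below. The class name `BaseSignalStrategySketch`, the `run` method, and the `register` decorator are assumptions; the actual interface is defined in signal_strategies/base.py and may differ.

```python
from abc import ABC, abstractmethod


class BaseSignalStrategySketch(ABC):
    """Hypothetical stand-in for the project's BaseSignalStrategy."""

    @abstractmethod
    def run(self, article: dict) -> dict:
        """Return a signal record for one article."""


STRATEGIES: dict = {}


def register(name: str):
    """Class decorator that adds a strategy to the registry under name."""
    def decorator(cls):
        STRATEGIES[name] = cls
        return cls
    return decorator


@register("my_strategy")
class MyStrategy(BaseSignalStrategySketch):
    def run(self, article: dict) -> dict:
        # Placeholder logic: a real strategy would call an LLM here.
        return {"url": article.get("url"), "signal": "neutral"}


def get_strategy(name: str) -> BaseSignalStrategySketch:
    """Instantiate the strategy selected in configuration."""
    try:
        return STRATEGIES[name]()
    except KeyError:
        known = sorted(STRATEGIES)
        raise ValueError(f"Unknown strategy {name!r}; registered: {known}") from None
```

Because the caller only looks up a name and calls `run`, the rest of the pipeline does not need to know which strategy is active.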


Legal & Ethical Considerations

Data Sources

  • All data is publicly available
  • No proprietary or confidential information
  • Respects rate limits and public API terms

Usage

  • Educational purposes only
  • Not financial advice
  • No commercial use
  • Personal learning project

Compliance

  • No CCI (Confidential Company Information)
  • No PII (Personally Identifiable Information)
  • No API keys or credentials in code (all in .env, gitignored)
  • No internal company systems, endpoints, or infrastructure references

What I Learned

Technical Skills

  • Building multi-step LLM agentic pipelines with LangGraph
  • Structuring a strategy pattern for pluggable LLM logic
  • Implementing SQLite caching for both HTTP responses and LLM outputs
  • Avoiding look-ahead bias when enriching with market data
  • Designing JSON output schemas for downstream evaluation
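Avoiding look-ahead bias comes down to selecting only bars dated strictly before the event. The sketch below works on a plain list of daily bars rather than on yfinance output; the function name and the day-level granularity are assumptions, and intraday publication times are ignored.

```python
from bisect import bisect_left
from datetime import date


def prev_close_before(bars, event_date):
    """Return the last (date, close) bar strictly before event_date.

    bars must be sorted ascending by date. Returns None if no earlier
    bar exists, so the caller can skip enrichment instead of guessing.
    """
    dates = [d for d, _ in bars]
    i = bisect_left(dates, event_date)  # index of first bar on or after the event
    if i == 0:
        return None
    return bars[i - 1]
```

Using a strict "before" comparison means a release published on a trading day never sees that day's close, and a release published on a weekend falls back to the most recent trading day.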

Domain Knowledge

  • Understanding press release event categories (dilutive equity, partnerships, earnings, etc.)
  • How offering discounts and dilution percentages relate to short-term price movement
  • Pre-event vs. post-event data boundaries in signal generation

Best Practices

  • Separating infrastructure (utils/) from strategy-specific logic (signal_strategies/)
  • Keeping all configuration in one place (CONFIG.py)
  • Using environment variables for credentials — never in code
  • Documenting signal taxonomy and decision rules in STRATEGY.md

Future Enhancements (Learning Goals)

  • Step 4: Evaluator — fetch T+1/T+2/T+5 closes, score signal accuracy
  • Parallel article processing for faster runs
  • Additional signal strategies (earnings, M&A, partnerships)
  • Unit tests with pytest
  • REST API wrapper (FastAPI)
  • Visualization of signal distribution and accuracy

Important Notes

  1. Not Financial Advice — this project is for learning LLM concepts and data pipeline design
  2. Public Data Only — all data sources are publicly available
  3. Educational Purpose — built in free time for skill development
  4. No Guarantees — signals are experimental outputs of an LLM prompt, not validated predictions
  5. Personal Project — not affiliated with any organization

Status: Active Learning Project
Purpose: Educational — LLM orchestration, LangGraph, agentic pipelines, financial data analysis
