GoScrapy: Web Scraping Framework in Go

GoScrapy is a high-performance web scraping framework for Go, designed with the familiar architecture of Python's Scrapy. It provides a robust, developer-centric experience for building sophisticated data extraction systems, purposefully crafted for those making the leap from Python to the Go ecosystem.

Why GoScrapy?

While low-level scraping libraries are powerful, many teams require the high-level architectural framework established by Scrapy. GoScrapy brings this architectural discipline natively to Go, organizing your request callbacks, middlewares, and pipelines into a structured, manageable workflow.

Instead of manually orchestrating retries, cookie isolation, or database handoffs, GoScrapy provides the engine that powers your spiders. You focus purely on the extraction logic; the framework manages the high-throughput lifecycle and concurrency in the background.
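The per-target cookie isolation mentioned above can be pictured as one cookie jar per host, so sessions never bleed between sites. A minimal, self-contained sketch of the idea using only the standard library (newJarPerHost and For are hypothetical names for illustration, not GoScrapy's API):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
)

// jarPerHost hands out a separate cookie jar for each scraping target.
// Illustrative only; GoScrapy manages this isolation internally.
type jarPerHost struct {
	jars map[string]http.CookieJar
}

func newJarPerHost() *jarPerHost {
	return &jarPerHost{jars: make(map[string]http.CookieJar)}
}

// For returns the jar for a host, creating it on first use.
func (j *jarPerHost) For(host string) http.CookieJar {
	if jar, ok := j.jars[host]; ok {
		return jar
	}
	jar, _ := cookiejar.New(nil)
	j.jars[host] = jar
	return jar
}

func main() {
	jars := newJarPerHost()
	a := jars.For("books.toscrape.com")
	b := jars.For("httpbin.org")
	// Distinct hosts get distinct jars; repeat lookups hit the cache.
	fmt.Println(a != b, jars.For("httpbin.org") == b) // true true
}
```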

Features

  • 🚀 Blazing Fast — Built on Go's concurrency model for high-throughput parallel scraping
  • 🐍 Scrapy-inspired — Familiar architecture for anyone coming from Python's Scrapy
  • 🛠️ CLI Scaffolding — Generate project structure instantly with goscrapy startproject
  • 🔁 Smart Retry — Automatic retries with exponential back-off on failures
  • 🍪 Cookie Management — Maintains separate cookie sessions per scraping target
  • 🔍 CSS & XPath Selectors — Flexible HTML parsing with chainable selectors
  • 📦 Built-in Pipelines — Export scraped data to CSV, JSON, MongoDB, Google Sheets, and Firebase out of the box
  • 🧩 Built-in Middleware — Plug in robust middlewares like Azure TLS and advanced Dupefilters
  • 🔌 Extensible by Design — Almost every layer of the framework is built to be swapped or extended
  • 🎛️ Telemetry & Monitoring — Optional built-in telemetry hub for real-time stats

Examples

For practical examples and real-world use cases, check the _examples directory in the repository.

Architecture

GoScrapy's data flow is designed for clarity and concurrent execution:

flowchart LR
    %% Request Flow
    Spider -->|1. Request| Engine
    Engine -->|2. Schedule| Scheduler
    Scheduler -->|3. Pull Worker| WorkerQueue[(Worker Queue)]
    WorkerQueue -.->|4. Available Worker| Scheduler
    Scheduler -->|5. Pass Work| Worker
    Worker -->|6. Trigger| Executor
    Executor -->|7. Forward| Middlewares
    Middlewares -->|8. Request| HTTP_Client

    %% Response Flow
    HTTP_Client -.->|9. Response| Middlewares
    Middlewares -.->|10. Response| Executor
    Executor -.->|11. Callback| Spider

    %% Data Flow
    Spider ==>|12. Yield Record| Engine
    Engine ==>|13. Push Data| PipelineManager
    PipelineManager ==>|14. Export| Pipelines[(DB, CSV, File)]

    style Spider fill:#F5C4B3,stroke:#993C1D,stroke-width:1px,color:#711B0C
    style Engine fill:#B5D4F4,stroke:#185FA5,stroke-width:1px,color:#0C447C
    style Scheduler fill:#CECBF6,stroke:#534AB7,stroke-width:1px,color:#3C3489
    style WorkerQueue fill:#D3D1C7,stroke:#5F5E5A,stroke-width:1px,color:#444441
    style Worker fill:#9FE1CB,stroke:#0F6E56,stroke-width:1px,color:#085041
    style Executor fill:#FAC775,stroke:#854F0B,stroke-width:1px,color:#633806
    style Middlewares fill:#E5B8F3,stroke:#842B9E,stroke-width:1px,color:#4B1161
    style HTTP_Client fill:#C0DD97,stroke:#3B6D11,stroke-width:1px,color:#27500A
    style PipelineManager fill:#F4C0D1,stroke:#993556,stroke-width:1px,color:#72243E
    style Pipelines fill:#D3D1C7,stroke:#5F5E5A,stroke-width:1px,color:#444441
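The Scheduler / Worker Queue / Worker hand-off in the diagram is essentially Go's classic worker-pool pattern. A reduced, self-contained sketch of that hand-off (the crawl function is a hypothetical stand-in, not GoScrapy internals):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl fans urls out to a fixed pool of workers over channels and
// collects the results: scheduler -> worker queue -> worker, reduced
// to plain Go. Illustrative only.
func crawl(urls []string, workers int) []string {
	requests := make(chan string)
	results := make(chan string)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range requests {
				results <- "fetched " + url // stand-in for the HTTP round trip
			}
		}()
	}

	go func() {
		for _, url := range urls {
			requests <- url // the scheduler passing work to an idle worker
		}
		close(requests)
		wg.Wait()
		close(results)
	}()

	var out []string
	for r := range results {
		out = append(out, r)
	}
	sort.Strings(out) // worker completion order is nondeterministic
	return out
}

func main() {
	fmt.Println(crawl([]string{"/page/1", "/page/2"}, 3))
	// [fetched /page/1 fetched /page/2]
}
```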

Getting Started

Important

GoScrapy requires Go 1.22 or higher.

1. Install GoScrapy CLI

go install github.com/tech-engine/goscrapy/cmd/...@latest

Tip

This command installs both goscrapy and the shorter gos alias. You can use either command to run the scaffolding tool!

2. Verify Installation

gos -v
# or
goscrapy -v

3. Create a New Project

goscrapy startproject books_to_scrape

This will automatically initialize a new Go module and generate all necessary files. You will also be prompted to resolve dependencies with go mod tidy.

\tech-engine\go\go-test-scrapy> goscrapy startproject books_to_scrape

🚀 GoScrapy generating project files. Please wait!

📦 Initializing Go module: books_to_scrape...
✔️  books_to_scrape\base.go
✔️  books_to_scrape\constants.go
✔️  books_to_scrape\errors.go
✔️  books_to_scrape\job.go
✔️  main.go
✔️  books_to_scrape\record.go
✔️  books_to_scrape\spider.go

📦 Do you want to resolve dependencies now (go mod tidy)? [Y/n]: Y
📦 Resolving dependencies...

✨ Congrats, books_to_scrape created successfully.

Quick Look: Powerful Features

GoScrapy streamlines your workflow by allowing you to configure middlewares and export pipelines in a centralized settings.go file.

settings.go

Generated automatically by the CLI, this file is the single place where your middlewares and export pipelines are registered.

package myspider

import (
	"time"

	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
)

// Add Azure TLS client and Retry functionality seamlessly
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.AzureTLS(azureTLSOpts), // azureTLSOpts: your Azure TLS configuration, defined elsewhere in the project
	middlewares.Retry(), // 3 retries, 5s back-off
}

// Prepare CSV export pipeline
var export2CSV = csv.New[*Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

// Export to CSV instantly
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
}

base.go

The boilerplate engine setup is hidden away in base.go, which is generated by the CLI but still configurable if needed.

package myspider

import (
	"context"
	"github.com/tech-engine/goscrapy/pkg/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
}

func New(ctx context.Context) *Spider {
	// Initialize and configure everything in one go
	app := gos.NewApp[*Record]().Setup(MIDDLEWARES, PIPELINES)

	spider := &Spider{app}

	go func() {
		_ = app.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}

spider.go

Your spider.go (also scaffolded by the CLI) remains clean and focused entirely on parsing.

package myspider

import (
	"context"
	"encoding/json"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// StartRequest is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Create a new request. This request must not be reused.
	req := s.Request(ctx)
	req.Url("https://httpbin.org/get")
	s.Parse(req, s.parse)
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	s.Logger().Infof("status: %d", resp.StatusCode())

	var data Record
	if err := json.Unmarshal(resp.Bytes(), &data); err != nil {
		s.Logger().Errorf("failed to unmarshal record: %v", err)
		return
	}

	// Yield sends the data securely to your configured pipelines
	s.Yield(&data)
}

func (s *Spider) Close(ctx context.Context) {
}

Wiki

Please follow the official Wiki docs for complete details on creating custom pipelines, middlewares, and using the robust selector engine.

Status Note

GoScrapy is currently in active v0.x development. We are continually refining the Core API towards a stable v1.0 release. We welcome community use, feedback, and Pull Requests to help us shape the future of scraping in Go!

License

GoScrapy is offered under the Business Source License (BSL).

What does this mean for developers?
We want you to build amazing things with GoScrapy! You are completely free to use this framework in production, build your own commercial SaaS products that rely on it, and scrape data for your business without paying any licensing fees.

The BSL is simply in place to ensure the sustainability of the project. To protect the core framework, we ask that you respect a few common-sense boundaries: please avoid offering GoScrapy as a competitive, managed "Scraper-as-a-Service," repackaging the framework under a new name, or commercializing direct codebase ports into other languages (whether translated manually, by AI, or via any other tooling) as your own work.

By contributing to the GoScrapy project, you agree to the terms of the license.

Logging

GoScrapy includes a built-in logging system that defaults to INFO level. You can control the framework's output using the GOS_LOG_LEVEL environment variable:

  • DEBUG: Detailed execution trace.
  • INFO: Basic startup/shutdown info (Default).
  • WARN: Warnings and retry notifications.
  • ERROR: Fatal errors.
  • NONE: Completely disable framework logging.

You can also pass a custom implementation of the core.ILogger interface using the .WithLogger() method during application setup.

Roadmap

  • Cookie management
  • Built-in & custom middleware support
  • CSS & XPath selectors
  • Logging & custom logger support
  • Increasing E2E test coverage

Partners

Get in touch

Join our Discord Community

About

GoScrapy: Harnessing Go's performance for blazingly fast web scraping, inspired by Python's Scrapy framework.
