GoScrapy is a high-performance web scraping framework for Go, designed with the familiar architecture of Python's Scrapy. It provides a robust, developer-centric experience for building sophisticated data extraction systems, purposefully crafted for those making the leap from Python to the Go ecosystem.
While low-level scraping libraries are powerful, many teams require the high-level architectural framework established by Scrapy. GoScrapy brings this architectural discipline natively to Go, organizing your request callbacks, middlewares, and pipelines into a structured, manageable workflow.
Instead of manually orchestrating retries, cookie isolation, or database handoffs, GoScrapy provides the engine that powers your spiders. You focus purely on the extraction logic; the framework manages the high-throughput lifecycle and concurrency in the background.
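The cookie-isolation idea can be sketched with Go's standard library alone: give each scraping target its own `net/http/cookiejar`, so sessions never bleed across sites. This is an illustrative sketch of the concept, not GoScrapy's internal implementation (`newIsolatedClient` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/cookiejar"
)

// newIsolatedClient returns an http.Client with its own cookie jar,
// so cookies set by one scraping target never leak into another.
func newIsolatedClient() (*http.Client, error) {
	jar, err := cookiejar.New(nil)
	if err != nil {
		return nil, err
	}
	return &http.Client{Jar: jar}, nil
}

func main() {
	siteA, _ := newIsolatedClient()
	siteB, _ := newIsolatedClient()

	// Each client carries a distinct jar: separate sessions per target.
	fmt.Println(siteA.Jar != siteB.Jar) // prints true
}
```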
- **Blazing Fast** – Built on Go's concurrency model for high-throughput parallel scraping
- **Scrapy-inspired** – Familiar architecture for anyone coming from Python's Scrapy
- **CLI Scaffolding** – Generate project structure instantly with `goscrapy startproject`
- **Smart Retry** – Automatic retries with exponential back-off on failures
- **Cookie Management** – Maintains separate cookie sessions per scraping target
- **CSS & XPath Selectors** – Flexible HTML parsing with chainable selectors
- **Built-in Pipelines** – Export scraped data to CSV, JSON, MongoDB, Google Sheets, and Firebase out of the box
- **Built-in Middleware** – Plug in robust middlewares such as Azure TLS and advanced dupefilters
- **Extensible by Design** – Almost every layer of the framework can be swapped or extended
- **Telemetry & Monitoring** – Optional built-in telemetry hub for real-time stats
For practical examples and real-world use cases, see the `_examples` directory:

- **Google Maps Scraper** – A complete scraper for businesses on Google Maps.
- **Books to Scrape** – A standard scraping example for a book catalog.
- **TUI Stats Integration** – Shows how to use the built-in TUI for real-time monitoring.
- **Fingerprint Spoofing** – Advanced usage for bypassing bot detection.
GoScrapy's data flow is designed for clarity and concurrent execution:
```mermaid
flowchart LR
    %% Request Flow
    Spider -->|1. Request| Engine
    Engine -->|2. Schedule| Scheduler
    Scheduler -->|3. Pull Worker| WorkerQueue[(Worker Queue)]
    WorkerQueue -.->|4. Available Worker| Scheduler
    Scheduler -->|5. Pass Work| Worker
    Worker -->|6. Trigger| Executor
    Executor -->|7. Forward| Middlewares
    Middlewares -->|8. Request| HTTP_Client

    %% Response Flow
    HTTP_Client -.->|9. Response| Middlewares
    Middlewares -.->|10. Response| Executor
    Executor -.->|11. Callback| Spider

    %% Data Flow
    Spider ==>|12. Yield Record| Engine
    Engine ==>|13. Push Data| PipelineManager
    PipelineManager ==>|14. Export| Pipelines[(DB, CSV, File)]

    style Spider fill:#F5C4B3,stroke:#993C1D,stroke-width:1px,color:#711B0C
    style Engine fill:#B5D4F4,stroke:#185FA5,stroke-width:1px,color:#0C447C
    style Scheduler fill:#CECBF6,stroke:#534AB7,stroke-width:1px,color:#3C3489
    style WorkerQueue fill:#D3D1C7,stroke:#5F5E5A,stroke-width:1px,color:#444441
    style Worker fill:#9FE1CB,stroke:#0F6E56,stroke-width:1px,color:#085041
    style Executor fill:#FAC775,stroke:#854F0B,stroke-width:1px,color:#633806
    style Middlewares fill:#E5B8F3,stroke:#842B9E,stroke-width:1px,color:#4B1161
    style HTTP_Client fill:#C0DD97,stroke:#3B6D11,stroke-width:1px,color:#27500A
    style PipelineManager fill:#F4C0D1,stroke:#993556,stroke-width:1px,color:#72243E
    style Pipelines fill:#D3D1C7,stroke:#5F5E5A,stroke-width:1px,color:#444441
```
> [!IMPORTANT]
> GoScrapy requires Go 1.22 or higher.
To install the GoScrapy CLI:

```shell
go install github.com/tech-engine/goscrapy/cmd/...@latest
```

> [!TIP]
> This command installs both `goscrapy` and the shorter `gos` alias. You can use either command to run the scaffolding tool.

```shell
gos -v
# or
goscrapy -v
```

To scaffold a new project:

```shell
goscrapy startproject books_to_scrape
```

This automatically initializes a new Go module and generates all necessary files. You will also be prompted to resolve dependencies (`go mod tidy`) right away.
```shell
\tech-engine\go\go-test-scrapy> goscrapy startproject books_to_scrape

GoScrapy generating project files. Please wait!
Initializing Go module: books_to_scrape...

books_to_scrape\base.go
books_to_scrape\constants.go
books_to_scrape\errors.go
books_to_scrape\job.go
main.go
books_to_scrape\record.go
books_to_scrape\spider.go

Do you want to resolve dependencies now (go mod tidy)? [Y/n]: Y
Resolving dependencies...
Congrats, books_to_scrape created successfully.
```

GoScrapy streamlines your workflow by letting you configure middlewares and export pipelines in a centralized `settings.go` file, generated automatically by the CLI.
```go
package myspider

import (
	"github.com/tech-engine/goscrapy/pkg/builtin/middlewares"
	"github.com/tech-engine/goscrapy/pkg/builtin/pipelines/csv"
	"github.com/tech-engine/goscrapy/pkg/middlewaremanager"
	pm "github.com/tech-engine/goscrapy/pkg/pipeline_manager"
)

// Add the Azure TLS client and retry functionality.
// azureTLSOpts is assumed to be configured elsewhere in the package.
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.AzureTLS(azureTLSOpts),
	middlewares.Retry(), // 3 retries, 5s back-off
}

// Prepare the CSV export pipeline.
var export2CSV = csv.New[*Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
}
```

The boilerplate engine setup is hidden away in `base.go`, which is generated by the CLI but remains configurable if needed.
```go
package myspider

import (
	"context"

	"github.com/tech-engine/goscrapy/pkg/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
}

func New(ctx context.Context) *Spider {
	// Initialize and configure everything in one go.
	app := gos.NewApp[*Record]().Setup(MIDDLEWARES, PIPELINES)

	spider := &Spider{app}

	go func() {
		_ = app.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}
```

Your `spider.go` (also scaffolded by the CLI) stays clean and focused entirely on parsing.
```go
package myspider

import (
	"context"
	"encoding/json"

	"github.com/tech-engine/goscrapy/pkg/core"
)

// StartRequest is the entry point to the spider.
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	// Create a new request. A request must not be reused.
	req := s.Request(ctx)
	req.Url("https://httpbin.org/get")

	s.Parse(req, s.parse)
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
	s.Logger().Infof("status: %d", resp.StatusCode())

	var data Record
	if err := json.Unmarshal(resp.Bytes(), &data); err != nil {
		s.Logger().Errorf("failed to unmarshal record: %v", err)
		return
	}

	// Yield sends the record to your configured pipelines.
	s.Yield(&data)
}

func (s *Spider) Close(ctx context.Context) {
}
```

Please follow the official Wiki docs for complete details on creating custom pipelines, middlewares, and using the selector engine.
GoScrapy is currently in active v0.x development. We are continually refining the Core API towards a stable v1.0 release. We welcome community use, feedback, and Pull Requests to help us shape the future of scraping in Go!
GoScrapy is offered under the Business Source License (BSL).
What does this mean for developers?
We want you to build amazing things with GoScrapy! You are completely free to use this framework in production, build your own commercial SaaS products that rely on it, and scrape data for your business without paying any licensing fees.
The BSL is simply in place to ensure the sustainability of the project. To protect the core framework, we ask that you respect a few common-sense boundaries: please avoid offering GoScrapy as a competitive, managed "Scraper-as-a-Service," repackaging the framework under a new name, or commercializing direct codebase ports into other languages (whether translated manually, by AI, or via any other tooling) as your own work.
By contributing to the GoScrapy project, you agree to the terms of the license.
GoScrapy includes a built-in logging system that defaults to the `INFO` level. You can control the framework's output using the `GOS_LOG_LEVEL` environment variable:

- `DEBUG` – Detailed execution trace.
- `INFO` – Basic startup/shutdown info (default).
- `WARN` – Warnings and retry notifications.
- `ERROR` – Fatal errors.
- `NONE` – Completely disables framework logging.

You can also pass a custom implementation of the `core.ILogger` interface using the `.WithLogger()` method during application setup.
- Cookie management
- Builtin & custom middlewares support
- CSS & XPath selectors
- Logging & custom logger support
- Increasing E2E test coverage


