
GoScrapy: Wiki

Prerequisites

GoScrapy requires Go version 1.22 or higher.

GoScrapy CLI

GoScrapy provides the goscrapy CLI tool to help you scaffold a GoScrapy project.

Usage

  • Install
go install github.com/tech-engine/goscrapy@latest
  • Verify installation
gos -v
# or
goscrapy -v
  • Create a project
gos startproject scrapejsp

This will automatically initialize a new Go module and generate all the necessary files (see the project sketch after this list). You will also be prompted to resolve dependencies (go mod tidy) right away.

  • Create a custom pipeline
gos pipeline export_2_DB
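
Based on the files covered in the sections below, a freshly scaffolded project looks roughly like this (exact layout may vary by CLI version):

scrapejsp/
├── job.go       // input to the spider
├── record.go    // output of the spider
├── spider.go    // scraping logic
├── settings.go  // middlewares, pipelines and tuning constants
└── base.go      // spider setup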

Base Concepts

GoScrapy is built around the following three concepts.

  • Job: Describes an input to your spider.
  • Record: Represents an output produced by your spider.
  • Spider: Contains the main logic of your scraper.

Job (auto-generated)

A Job represents an input to a GoScrapy spider and must implement the core.IJob interface.

type IJob interface {
    Id() string
}

job.go

type Job struct {
    id string
    // add your own fields here
}

func (j *Job) Id() string {
    return j.id
}
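
Since id is unexported, a small constructor keeps job creation in one place. A minimal sketch (NewJob is a hypothetical helper, not generated by the CLI):

// NewJob builds a Job for the spider; add parameters for your own fields.
func NewJob(id string) *Job {
	return &Job{id: id}
}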

Record (auto-generated)

A Record represents an output produced by a spider (via Yield) and must implement the core.IOutput interface.

type IOutput interface {
    Record() *Record
    RecordKeys() []string
    RecordFlat() []any
    Job() IJob
}

record.go

type Record struct {
    J    *Job   `json:"-" csv:"-"`
}

func (r *Record) Record() *Record {
    return r
}

func (r *Record) RecordKeys() []string {
    // ...
    keys := make([]string, numFields)
    // ...
    return keys
}

func (r *Record) RecordFlat() []any {
    // ...
    return slice
}

func (r *Record) Job() core.IJob {
    return r.J
}
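
Your scraped fields live on this struct. Judging from the tags on J above, the export pipelines read the json/csv struct tags, so a Record carrying real data might look like this (the Title field is a hypothetical example):

type Record struct {
	J     *Job   `json:"-" csv:"-"`
	Title string `json:"title" csv:"title"` // hypothetical scraped field
}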

Spider (auto-generated)

Encapsulates the main logic of a GoScrapy spider. We embed gos.ICoreSpider (see base.go below) to make our spider work.

spider.go

// This is the entrypoint to the spider
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
  // obtain a fresh request via Request() for every outgoing request and never reuse it
  req := s.Request(ctx)

  var headers http.Header

  /* GET is the request method, method chaining possible
  req.Url("<URL_HERE>").
  Meta("MY_KEY1", "MY_VALUE").
  Meta("MY_KEY2", true).
  Header(headers)
  */
    
  /* POST
  req.Url(<URL_HERE>)
  req.Method("POST")
  req.Body(<BODY_HERE>)
  */
    
  // call the next parse method
  s.Parse(req, s.parse)
}

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
  // resp.Body()
  // resp.StatusCode()
  // resp.Header()
  // resp.Bytes()
  // resp.Meta("MY_KEY1")
	
  // yielding an output pushes it to the pipelines for processing; see record.go for the available fields
  var data Record

  err := json.Unmarshal(resp.Bytes(), &data)
  if err != nil {
    log.Panicln(err)
  }

  // s.Yield(&data)
}

// Close is called when the spider exits (see base.go)
func (s *Spider) Close(ctx context.Context) {
}
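
Putting the pieces together, a typical parse extracts data with the selector API (see Selectors below), yields records, and follows links by scheduling further requests. A sketch, assuming the hypothetical Title field from record.go above and an illustrative next-page selector:

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
  // extract all product titles on the page
  titles := resp.Css("article.product_pod h3 a").Text()

  for _, title := range titles {
    s.Yield(&Record{Title: title})
  }

  // follow the next page, if any, with a fresh request
  if next := resp.Css("li.next a").Attr("href"); len(next) > 0 {
    req := s.Request(ctx)
    req.Url(next[0])
    s.Parse(req, s.parse)
  }
}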

Settings (auto-generated)

In addition to the files discussed above, the CLI generates settings.go, where we import and register all the middlewares and pipelines we want to use in our project.

settings.go

// HTTP Transport settings

// Default: 10000
const MIDDLEWARE_HTTP_TIMEOUT_MS = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_CONN_PER_HOST = ""

// Default: 1000
const MIDDLEWARE_HTTP_MAX_IDLE_CONN_PER_HOST = ""

// Inbuilt Retry middleware settings

// Default: 3
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = ""

// Default: 500, 502, 503, 504, 522, 524, 408, 429
const MIDDLEWARE_HTTP_RETRY_CODES = ""

// Default: 1s
const MIDDLEWARE_HTTP_RETRY_BASE_DELAY = ""

// Default: 50000
const SCHEDULER_REQ_RES_POOL_SIZE = ""

// Default: num. of CPU * 30
const SCHEDULER_CONCURRENCY = ""

// Default: 50000
const SCHEDULER_WORK_QUEUE_SIZE = ""

// Pipeline Manager settings

// Default: 10000
const PIPELINEMANAGER_ITEMPOOL_SIZE = ""

// Default: 24
const PIPELINEMANAGER_ITEM_SIZE = ""

// Default: 5000
const PIPELINEMANAGER_OUTPUT_QUEUE_BUF_SIZE = ""

// Default: 150
const PIPELINEMANAGER_MAX_PROCESS_ITEM_CONCURRENCY = ""

// Middlewares here
var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

var export2CSV = csv.New[*Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

// Pipelines here
var PIPELINES = []pm.IPipeline[*Record]{
	export2CSV,
	// export2Json,
}
...
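
These settings are plain string constants; to change a default, edit the generated value in place. For example (a sketch using the constants above):

// raise the HTTP timeout to 30s and allow only 2 retries
const MIDDLEWARE_HTTP_TIMEOUT_MS = "30000"
const MIDDLEWARE_HTTP_RETRY_MAX_RETRIES = "2"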

Base (auto-generated)

Mainly used for setup: it wires the middlewares and pipelines from settings.go into the core spider and starts it.

base.go

package scrapejsp

import (
	"context"

	"github.com/tech-engine/goscrapy/cmd/gos"
)

type Spider struct {
	gos.ICoreSpider[*Record]
}

func New(ctx context.Context) *Spider {

	core := gos.New[*Record]().Setup(MIDDLEWARES, PIPELINES)

	spider := &Spider{
		core,
	}

	go func() {
		_ = core.Start(ctx)
		spider.Close(ctx)
	}()

	return spider
}

Examples

For practical examples and real-world use cases, check the _examples directory in the repository.

More examples coming...

Usage

main.go

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// start spider
	spider := books_to_scrape.New(ctx)

	// start the scraper with a job, currently nil is passed but you can pass your job here
	spider.StartRequest(ctx, nil)

	fmt.Println("🕷️  GoScrapy spider is running. Press Ctrl+C to stop.")

	// wait for completion
	if err := spider.Wait(true); err != nil && !errors.Is(err, context.Canceled) {
		fmt.Fprintf(os.Stderr, "❌ Engine finished with error: %v\n", err)
		os.Exit(1)
	}

	fmt.Println("✨ Engine finished successfully.")
}
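
Because the engine runs until its context is cancelled, it is common to derive the context from OS signals so that Ctrl+C shuts the spider down cleanly. A sketch using only the standard library (not part of the generated code):

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// in main(), replace context.WithCancel with a signal-aware context
ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()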

Customize the Default Client

Option                   Description                                                                            Default
WithProxies              Accepts multiple proxy URL strings.                                                    Proxies from the environment
WithTimeout              HTTP client timeout.                                                                   10 seconds
WithMaxIdleConns         Controls the max no. of idle (keep-alive) conns across all hosts. 0 means unlimited.   100
WithMaxIdleConnsPerHost  Same as WithMaxIdleConns, but per host.                                                100
WithMaxConnsPerHost      Limits the total no. of conns per host. 0 means unlimited.                             100
WithProxyFn              Accepts a custom proxy function for the transport.                                     Round robin

[spider.go]

func New(ctx context.Context) *Spider {
    // default client options
    // proxies := gos.WithProxies("proxy_url1", "proxy_url2", ...)

    // core := gos.New[*Record]().WithClient(
    // 	  gos.DefaultClient(proxies),
    // )

    // we can also pass in our own custom client
    // core := gos.New[*Record]().WithClient(myCustomHTTPClient)

    // ... rest of New as generated in base.go
}
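
For full control over proxy selection, WithProxyFn accepts a custom proxy function. A sketch, assuming the function signature matches net/http's Transport.Proxy:

// hypothetical: send every request through a local proxy
proxyFn := func(req *http.Request) (*url.URL, error) {
	return url.Parse("http://127.0.0.1:8080")
}

core := gos.New[*Record]().WithClient(
	gos.DefaultClient(gos.WithProxyFn(proxyFn)),
)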

Pipelines

Pipelines help in managing, transforming, and fine-tuning the scraped data.

Built-in Pipelines
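
As seen throughout this wiki, GoScrapy ships with CSV (csv.New) and JSON (json.New) export pipelines; both are shown in the examples below.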

Use Pipelines

We can add pipelines using coreSpider.PipelineManager.Add().

[settings.go]

// use export 2 csv pipeline
export2Csv := csv.New[*scrapejsp.Record](csv.Options{
	Filename: "itstimeitsnowornever.csv",
})

// use export 2 json pipeline
export2Json := json.New[*scrapejsp.Record](json.Options{
	Filename:  "itstimeitsnowornever.json",
	Immediate: true,
})

Pipeline Group

A Group allows us to execute multiple pipelines concurrently; all pipelines in a group behave as one single pipeline. This is useful when we want to export our data to multiple destinations: instead of exporting sequentially, we can bundle the exports together in a group.

Pipelines in a group shouldn't be used for data transformation, but for independent tasks such as exporting data to a database.

[settings.go]

func myCustomPipelineGroup() *pm.Group[*Record] {
  // create a group
  pipelineGroup := pm.NewGroup[*Record]()

  pipelineGroup.Add(export2CSV)
  // pipelineGroup.Add(export2Json)
  return pipelineGroup
}

// Pipelines here
// Executed in the order they appear.
var PIPELINES = []pm.IPipeline[*Record]{
  export2CSV,
  // export2Json,
  // myCustomPipelineGroup(), // use group as if it were a single pipeline
}

Middlewares

GoScrapy also supports built-in and custom middlewares for manipulating outgoing requests.

Built-in Middlewares

  • MultiCookieJar - used for maintaining different cookie sessions while scraping.
  • DupeFilter - filters duplicate requests
  • Retry - retries a request with exponential back-off upon failure or on HTTP status codes 500, 502, 503, 504, 522, 524, 408, 429 (options listed after this list)
  • AzureTLS - integrated AzureTLS as a middleware for fingerprint spoofing.
  • Stats - used to display basic scraping stats on scraper shutdown.

The Retry middleware accepts the following options:

Option      Description                                                                              Default
MaxRetries  Additional retries after a failure.                                                      3
Codes       HTTP codes that trigger a retry.                                                         500, 502, 503, 504, 522, 524, 408, 429
BaseDelay   Exponential back-off multiplier.                                                         1 second
Cb          Callback executed after every retry; if it returns false, further retries are skipped.   nil

Use Middlewares

We can add middlewares using gos.MiddlewareManager.Add().

[settings.go]

var MIDDLEWARES = []middlewaremanager.Middleware{
	middlewares.Retry(),
	middlewares.MultiCookieJar,
	middlewares.DupeFilter,
}

Custom Pipelines

GoScrapy supports custom pipelines. To create one, use the goscrapy CLI.

abc\go\go-test-scrapy\scrapejsp> gos pipeline export_2_DB

✔️  pipelines\export_2_DB.go

✨ Congrats, export_2_DB created successfully.

Custom middlewares

To create one, you can use the goscrapy CLI. Custom middlewares must have the following function signature.

func MultiCookieJar(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		// your custom middleware code here

		// hand the request off to the next middleware in the chain
		return next.RoundTrip(req)
	})
}
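
For instance, a minimal custom middleware that stamps a User-Agent header on every outgoing request could look like this (a sketch; the UserAgent middleware is hypothetical):

func UserAgent(next http.RoundTripper) http.RoundTripper {
	return core.MiddlewareFunc(func(req *http.Request) (*http.Response, error) {
		// set the header, then delegate to the rest of the chain
		req.Header.Set("User-Agent", "goscrapy-bot/1.0")
		return next.RoundTrip(req)
	})
}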

Selectors

GoScrapy supports CSS and XPath selectors.

[spider.go]

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {

        // CSS selector - select all product anchor tags and extract their href attribute values
        var productUrls []string
        productUrls = resp.Css("article.product_pod h3 a").Attr("href")

        // select all the text node values
        var productNames []string
        productNames = resp.Css("article.product_pod h3 a").Text()

        // Selector chaining is possible too
        productUrls = resp.Css("article.product_pod").Css("h3 a").Attr("href")

        // XPath selector
        productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]//h3//a").Attr("href")

        // chaining XPath and CSS is also possible
        productUrls = resp.Xpath("//article[contains(@class, 'product_pod')]").Css("h3 a").Attr("href")


        // Get all matching nodes
        var productUrlNodes []*html.Node
        productUrlNodes = resp.Css("article.product_pod h3 a").GetAll()

        // Get the first matching node
        var firstProductUrlNode *html.Node
        firstProductUrlNode = resp.Css("article.product_pod h3 a").Get()
}

Get in touch

Discord