Fetching and parsing HTML contents
Now that we have a full picture of the behavior we expect from our crawler, we can move down to the lower levels and focus on every little brick we need to build in order to reach our goal. In this case, TDD comes naturally by adopting a bottom-up approach to the design of the system: independent units are built and tested in isolation, then linked together, piece by piece, to form the bigger system.
The first component we're going to design and implement is the HTTP fetching object, basically a wrapper around an HTTP client that navigates through the HTML tree of each fetched page and extracts every link found.
Let's start with a breakdown of what we're going to need to implement our fetcher:
- HTTP client
- Retry mechanism
- HTML parser
- Link extractor
Parsing HTML documents
So we move on to writing some basic unit tests to define the behavior we expect from these parts, starting with the HTML parser. The core feature we want to implement is a function that ingests an HTML document and returns all the links it finds.
Note: after a brief search I found out that GoQuery by PuerkitoBio is the easiest and handiest library for parsing HTML contents, offering a jQuery-like DSL to navigate the entire tree. The alternative was navigating the tree by hand, probably with some regex: not worth the hassle given the purpose of the project.
fetcher/parser_test.go
package fetcher
import (
"bytes"
"net/url"
"reflect"
"testing"
)
func TestGoqueryParsePage(t *testing.T) {
parser := NewGoqueryParser()
firstLink, _ := url.Parse("https://example-page.com/sample-page/")
secondLink, _ := url.Parse("http://localhost:8787/sample-page/")
thirdLink, _ := url.Parse("http://localhost:8787/foo/bar")
expected := []*url.URL{firstLink, secondLink, thirdLink}
content := bytes.NewBufferString(
`<head>
<link rel="canonical" href="https://example-page.com/sample-page/" />
<link rel="canonical" href="http://localhost:8787/sample-page/" />
</head>
<body>
<a href="foo/bar"><img src="/baz.png"></a>
<img src="/stonk">
<a href="foo/bar">
</body>`,
)
res, err := parser.Parse("http://localhost:8787", content)
if err != nil {
t.Errorf("GoqueryParser#ParsePage failed: expected %v got %v", expected, err)
}
if !reflect.DeepEqual(res, expected) {
t.Errorf("GoqueryParser#ParsePage failed: expected %v got %v", expected, res)
}
}
Let's now move on with the implementation. We could very well define a parser interface exposing a single method Parse(string, io.Reader) ([]*url.URL, error). This definition of an interface foreseeing the implementation is closer to a classical OOP style (think about the usage of interfaces in Java): we define a contract to enforce a behavior. But that's not really the best, nor the only, usage of this language feature. More on this later.
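As a point of reference, here is a minimal sketch of what such an upfront declaration would look like; it's essentially the same interface we'll end up declaring later, next to its client:

package fetcher

import (
	"io"
	"net/url"
)

// Parser is the abstraction sketched above: a single Parse method that
// reads an HTML document and returns all the links it contains.
type Parser interface {
	Parse(baseURL string, r io.Reader) ([]*url.URL, error)
}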
A little dissertation on interfaces in Go
One of the strongest features of Go is that there's no need to explicitly declare that we want to implement an interface: we just need to implement the methods it defines and we're good. This makes it possible to build abstractions we foresee as useful, like in this case (classic OOP style), but also to carve out abstractions as needed after we've already worked a bit on the problem we're trying to solve:
Go is an attempt to combine the safety and performance of statically typed languages with the convenience and fun of dynamically typed interpretative languages.
Rob Pike
Let's say we're about to design an ImageWriter object that writes binary-formatted images to disk: we just need to implement the Write method of the io.Writer interface, without explicitly declaring that we're implementing it, and this way our ImageWriter object can be used anywhere an io.Writer is accepted. Conversely, let's say we have an object from a third-party library that exposes a ReadLine method: we can easily declare a ReadLiner interface containing only a ReadLine method and write a simple function ReadByLine(r ReadLiner) that accepts either the third-party object (or whatever object with a ReadLine method) or a newly defined object with its own ReadLine method.
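Here is a minimal, self-contained sketch of this idea; the names ReadLiner, ReadByLine and the toy sliceReader are purely illustrative:

package main

import (
	"fmt"
	"io"
)

// ReadLiner is declared by the consumer: any type with a matching
// ReadLine method satisfies it implicitly, third-party or not.
type ReadLiner interface {
	ReadLine() (string, error)
}

// ReadByLine works with any ReadLiner, printing lines until EOF.
func ReadByLine(r ReadLiner) error {
	for {
		line, err := r.ReadLine()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		fmt.Println(line)
	}
}

// sliceReader is a toy type: it never declares that it implements
// ReadLiner, it just happens to have the right method.
type sliceReader struct {
	lines []string
	pos   int
}

func (s *sliceReader) ReadLine() (string, error) {
	if s.pos >= len(s.lines) {
		return "", io.EOF
	}
	s.pos++
	return s.lines[s.pos-1], nil
}

func main() {
	_ = ReadByLine(&sliceReader{lines: []string{"foo", "bar"}})
}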
It's the principle of accept interfaces, return structs²: in other words, if a function signature accepts an interface, then callers have the option to pass in any concrete type, just as long as it implements that interface. The implication is that interfaces should be declared close to where they're used. This is really akin to a duck-typing behavior checked at compile time, and it's this feature that makes Go, in some respects, really similar to dynamic languages like Python or Ruby.
So, to start grasping the problem, we prefer to just start with a concrete implementation; we'll decide later where and when (and if) we're going to need a general abstraction.
fetcher/parser.go
// Package fetcher defines and implements the fetching and parsing utilities
// for remote resources
package fetcher
import (
"io"
"net/url"
"path/filepath"
"sync"
"github.com/PuerkitoBio/goquery"
)
// GoqueryParser is a parser implementation that uses
// `github.com/PuerkitoBio/goquery` as a backend library
type GoqueryParser struct {
excludedExts map[string]bool
seen *sync.Map
}
// NewGoqueryParser creates a new parser with goquery as backend
func NewGoqueryParser() GoqueryParser {
return GoqueryParser{
excludedExts: make(map[string]bool),
seen: new(sync.Map),
}
}
// ExcludeExtensions adds file extensions to the default exclusion pool
func (p *GoqueryParser) ExcludeExtensions(exts ...string) {
for _, ext := range exts {
p.excludedExts[ext] = true
}
}
Go does not natively support the set data structure; the idiomatic way to implement one is to use a map with bool as the value type.
GoqueryParser could very well have been an empty struct: as long as it implements the Parse method we're good. But the object is meant to be shared among multiple workers, and it's advisable to filter out repeated links right away. For a simple exclusion list we use a set (a map with bool values) representing the file extensions we don't want to extract from each HTML document we fetch. The seen map is a concurrent map implementation from the sync package, essentially an optimized version of a plain map guarded by a mutex.
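As a quick standalone reference, here is a tiny sketch of the two idioms just mentioned: a map[string]bool used as a set, and sync.Map used to remember already-seen keys. The parser below uses a slightly different dance with the stored value; this sketch relies on the loaded flag instead, and the keys are only illustrative.

package main

import (
	"fmt"
	"sync"
)

func main() {
	// A map with bool values acting as a set of excluded extensions.
	excluded := map[string]bool{".png": true, ".css": true}
	fmt.Println(excluded[".png"], excluded[".html"]) // true false

	// sync.Map as a concurrent "seen" set: LoadOrStore returns the
	// previous value (if any) and whether the key was already present.
	var seen sync.Map
	if _, loaded := seen.LoadOrStore("http://example.com/", true); !loaded {
		fmt.Println("first time we see this link")
	}
	if _, loaded := seen.LoadOrStore("http://example.com/", true); loaded {
		fmt.Println("duplicate, skip it")
	}
}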
Let's now implement the Parse method with the goquery library:
fetcher/parser.go
// Parse is the implementation of the `Parser` interface for the
// `GoqueryParser` struct: it reads the content of an `io.Reader` (i.e.
// any file-like streamable object) and extracts all anchor and canonical links.
// It returns a slice of `*url.URL` or any error that arises from the goquery
// call on the data read.
func (p GoqueryParser) Parse(baseURL string, reader io.Reader) ([]*url.URL, error) {
doc, err := goquery.NewDocumentFromReader(reader)
if err != nil {
return nil, err
}
links := p.extractLinks(doc, baseURL)
return links, nil
}
// extractLinks retrieves all anchor and canonical links inside a
// `goquery.Document` representing an HTML content.
// It returns a slice of `*url.URL` containing all the extracted links, or
// `nil` if the passed document is a `nil` pointer.
func (p *GoqueryParser) extractLinks(doc *goquery.Document, baseURL string) []*url.URL {
if doc == nil {
return nil
}
foundURLs := []*url.URL{}
doc.Find("a,link").FilterFunction(func(i int, element *goquery.Selection) bool {
hrefLink, hrefExists := element.Attr("href")
linkType, linkExists := element.Attr("rel")
anchorOk := hrefExists && !p.excludedExts[filepath.Ext(hrefLink)]
linkOk := linkExists && linkType == "canonical" && !p.excludedExts[filepath.Ext(linkType)]
return anchorOk || linkOk
}).Each(func(i int, element *goquery.Selection) {
res, _ := element.Attr("href")
if link, ok := resolveRelativeURL(baseURL, res); ok {
if present, _ := p.seen.LoadOrStore(link.String(), false); !present.(bool) {
foundURLs = append(foundURLs, link)
p.seen.Store(link.String(), true)
}
}
})
return foundURLs
}
// resolveRelativeURL correctly joins a base domain to a relative path
// to produce an absolute URL to fetch.
// It returns a `*url.URL` with the resolved path and a boolean representing
// the success or failure of the process.
func resolveRelativeURL(baseURL string, relative string) (*url.URL, bool) {
u, err := url.Parse(relative)
if err != nil {
return nil, false
}
if u.Hostname() != "" {
return u, true
}
base, err := url.Parse(baseURL)
if err != nil {
return nil, false
}
return base.ResolveReference(u), true
}
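To see how the resolution behaves, here is a small illustrative example of `ResolveReference` using the same base URL as the parser test above: a relative reference gets joined to the base domain, while an absolute URL is kept as is.

package main

import (
	"fmt"
	"net/url"
)

func main() {
	base, _ := url.Parse("http://localhost:8787")
	relative, _ := url.Parse("foo/bar")
	// Prints http://localhost:8787/foo/bar
	fmt.Println(base.ResolveReference(relative))

	absolute, _ := url.Parse("https://example-page.com/sample-page/")
	// Hostname is non-empty, so resolveRelativeURL would return it untouched.
	fmt.Println(absolute.Hostname(), absolute)
}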
Well, let's try running those tests; hopefully they'll give a positive outcome:
go test -v ./...
The go test command, just like go run and go build, takes care of downloading dependencies and updating the go.mod file automatically.
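For reference, once the dependencies of the fetcher package have been pulled in (goquery now, rehttp later in this chapter), the go.mod should look roughly like this; the module name matches the test output shown later, while the version numbers here are only illustrative:

module webcrawler

go 1.16

require (
	github.com/PuerkitoBio/goquery v1.6.1
	github.com/PuerkitoBio/rehttp v1.0.0
)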
Fetching HTML documents
A web crawler operates at the 7th layer of the OSI model, as simple as that: the main communication protocol used to fetch contents from websites is HTTP, so the core component of the fetcher package will be an HTTP client.
The next step is the definition of the fetching unit tests. What we expect here are two capabilities: fetching a single link while ignoring its content, and fetching a link while extracting all the links it contains.
We're going to need to mock a server; luckily, Go provides everything we need in the net/http/httptest package.
fetcher/fetcher_test.go
package fetcher
import (
"fmt"
"net/http"
"net/http/httptest"
"net/url"
"reflect"
"testing"
"time"
)
func serverMock() *httptest.Server {
handler := http.NewServeMux()
handler.HandleFunc("/foo/bar", resourceMock)
server := httptest.NewServer(handler)
return server
}
func resourceMock(w http.ResponseWriter, r *http.Request) {
_, _ = w.Write([]byte(
`<head>
<link rel="canonical" href="https://example.com/sample-page/" />
<link rel="canonical" href="/sample-page/" />
</head>
<body>
<a href="foo/bar"><img src="/baz.png"></a>
<img src="/stonk">
<a href="foo/bar">
</body>`,
))
}
func TestStdHttpFetcherFetch(t *testing.T) {
server := serverMock()
defer server.Close()
f := New("test-agent", nil, 10*time.Second)
target := fmt.Sprintf("%s/foo/bar", server.URL)
_, res, err := f.Fetch(target)
if err != nil {
t.Errorf("StdHttpFetcher#Fetch failed: %v", err)
}
if res.StatusCode != 200 {
t.Errorf("StdHttpFetcher#Fetch failed: %#v", res)
}
_, res, err = f.Fetch("testUrl")
if err == nil {
t.Errorf("StdHttpFetcher#Fetch failed: %v", err)
}
}
func TestStdHttpFetcherFetchLinks(t *testing.T) {
server := serverMock()
defer server.Close()
f := New("test-agent", NewGoqueryParser(), 10*time.Second)
target := fmt.Sprintf("%s/foo/bar", server.URL)
firstLink, _ := url.Parse("https://example.com/sample-page/")
secondLink, _ := url.Parse(server.URL + "/sample-page/")
thirdLink, _ := url.Parse(server.URL + "/foo/bar")
expected := []*url.URL{firstLink, secondLink, thirdLink}
_, res, err := f.FetchLinks(target)
if err != nil {
t.Errorf("StdHttpFetcher#FetchLinks failed: expected %v got %v", expected, err)
}
if !reflect.DeepEqual(res, expected) {
t.Errorf("StdHttpFetcher#FetchLinks failed: expected %v got %v", expected, res)
}
}
Ok, we just need to write some logic now to make those tests pass.
Our first move will be the definition of a standard HTTP client wrapper, encapsulating parsing capabilities through a parser interface dependency; its first feature will be the simple fetching of a link. We also want some kind of retry mechanism in case of failure, while maintaining a degree of politeness toward the target: a simple exponential backoff between calls is enough for now, and the rehttp library allows us to do it gracefully, once again courtesy of PuerkitoBio's work.
Note: we simply ignore any invalid TLS certificates; in a real production case this could represent a security issue, and we'd strongly prefer to at least trace when something like that happens.
fetcher/fetcher.go
// Package fetcher defines and implements the downloading and parsing utilities
// for remote resources
package fetcher
import (
"crypto/tls"
"fmt"
"io"
"net/http"
"net/url"
"time"
"github.com/PuerkitoBio/rehttp"
)
// Parser is an interface exposing a single method `Parse`, to be used on
// raw results of a fetch call
type Parser interface {
Parse(string, io.Reader) ([]*url.URL, error)
}
// stdHttpFetcher is a simple fetcher that uses the standard library
// `http.Client` as a backend for HTTP requests.
type stdHttpFetcher struct {
userAgent string
parser Parser
client *http.Client
}
// New creates a new fetcher, specifying a user agent to set on each call,
// a parser interface to parse HTML contents and a timeout.
// By default it retries when a temporary error occurs (most temporary
// errors are HTTP ones) for a limited number of times, applying an
// exponential backoff strategy with jitter.
func New(userAgent string, parser Parser, timeout time.Duration) *stdHttpFetcher {
transport := rehttp.NewTransport(
&http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
},
rehttp.RetryAll(rehttp.RetryMaxRetries(3), rehttp.RetryTemporaryErr()),
rehttp.ExpJitterDelay(time.Second, 10*time.Second),
)
client := &http.Client{Timeout: timeout, Transport: transport}
return &stdHttpFetcher{userAgent, parser, client}
}
// Fetch makes a single HTTP GET request toward a URL.
// It returns the elapsed wall time, an `*http.Response` and any error that
// occurred during the call.
func (f stdHttpFetcher) Fetch(url string) (time.Duration, *http.Response, error) {
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return time.Duration(0), nil, err
}
req.Header.Set("User-Agent", f.userAgent)
// We want to time the request
start := time.Now()
res, err := f.client.Do(req)
elapsed := time.Since(start)
if err != nil {
return elapsed, nil, err
}
return elapsed, res, nil
}
Now that we have a simple Fetch method, we can easily extend that behavior to fetch the page content and extract all the contained links, finally using that Parser interface we have inserted into the stdHttpFetcher.
Go's approach to abstractions through interfaces is immediately highlighted in the snippet above: we declared a Parser interface after the concrete implementation of GoqueryParser. As we previously discussed, this is somewhat confusing compared to ordinary OOP patterns of pre-declaring interfaces as contracts, enforcing a behavior for every concrete type (like in Java, for example). Languages that require explicit implementation of interfaces promote that style, where you try to predict the abstraction you're going to implement. Go instead allows interfaces to be implemented implicitly, by just implementing their methods. This unlocks a more flexible style, which promotes a better understanding of the problem before abstracting away details; thus in Go interfaces are generally declared closer to their clients. This way it's the client that dictates the abstraction it needs and not the other way around, making the entire development process more fluid.
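To make the point concrete, here is a minimal sketch built around a hypothetical stubParser type: any type satisfying the client-declared Parser interface can be handed to the fetcher, exactly like the goquery-backed one is in the tests above.

package fetcher

import (
	"io"
	"net/url"
	"time"
)

// stubParser is a hypothetical Parser used only for illustration: it
// ignores the document and always returns a single fixed link.
type stubParser struct{}

func (stubParser) Parse(baseURL string, r io.Reader) ([]*url.URL, error) {
	u, err := url.Parse(baseURL + "/fixed")
	if err != nil {
		return nil, err
	}
	return []*url.URL{u}, nil
}

// stubParser implements Parse with the right signature, so it satisfies the
// Parser interface implicitly and can be passed to New just like
// NewGoqueryParser() is.
var _ = New("test-agent", stubParser{}, 5*time.Second)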
Note: the Fetch and FetchLinks methods also return a time.Duration as their first value: the wall time of the call, telling us how long the HTTP request took. It'll prove useful later for calculating a politeness delay during the crawling process.
fetcher/fetcher.go
// parseStartURL parses a URL extracting the portion <scheme>://<host>:<port>.
// It returns a string with the base domain of the URL, or an empty string if
// the URL cannot be parsed.
func parseStartURL(u string) string {
parsed, err := url.Parse(u)
if err != nil {
return ""
}
return fmt.Sprintf("%s://%s", parsed.Scheme, parsed.Host)
}
// FetchLinks contacts and downloads raw data from a specified URL and parses
// the content, extracting all the links found.
// It returns the elapsed time, a slice of `*url.URL` and any error occurring
// during the call or the parsing of the results.
func (f stdHttpFetcher) FetchLinks(targetURL string) (time.Duration, []*url.URL, error) {
if f.parser == nil {
return time.Duration(0), nil, fmt.Errorf("fetching links from %s failed: no parser set", targetURL)
}
// Extract base domain from the url
baseDomain := parseStartURL(targetURL)
elapsed, resp, err := f.Fetch(targetURL)
if err != nil {
return elapsed, nil, fmt.Errorf("fetching links from %s failed: %w", targetURL, err)
}
defer resp.Body.Close()
if resp.StatusCode >= http.StatusBadRequest {
return elapsed, nil, fmt.Errorf("fetching links from %s failed: %s", targetURL, resp.Status)
}
links, err := f.parser.Parse(baseDomain, resp.Body)
if err != nil {
return elapsed, nil, fmt.Errorf("fetching links from %s failed: %w", targetURL, err)
}
return elapsed, links, nil
}
Running the simple unit tests we've written should result in a successful outcome:
go test -v ./...
=== RUN TestStdHttpFetcherFetch
--- PASS: TestStdHttpFetcherFetch (0.00s)
=== RUN TestStdHttpFetcherFetchLinks
--- PASS: TestStdHttpFetcherFetchLinks (0.00s)
=== RUN TestGoqueryParsePage
--- PASS: TestGoqueryParsePage (0.00s)
PASS
ok webcrawler/fetcher 0.006s
The project structure should now be the following:
tree
.
├── fetcher
│ ├── fetcher.go
│ ├── fetcher_test.go
│ ├── parser.go
│ └── parser_test.go
├── go.mod
└── go.sum
2. https://commandercoriander.net/blog/2018/03/30/go-interfaces/ ↩