Solving the Go Lang Web Crawler

I just posted my solution to the golang web crawler exercise on GitHub, but wanted to give a quick explanation of my approach.

Originally I had two channels, one for data to flow through and one to signal that processing was complete, but nesting these in a recursive function led to quite messy code. I then watched a great presentation by Rob Pike and started to rethink my solution in terms of gophers!

I moved to a solution that was far simpler:

  • A single function would load a page, process it and extract links (using the Fetcher interface the exercise provides, sketched below)
    • The result of the processing would be sent to a channel acting as a work queue
  • A controller-type function would take work off that queue and spin up a worker (goroutine) to process the data
  • A simple counter would track when work starts and ends
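For context, the exercise supplies a Fetcher interface (plus a fake, in-memory implementation of it), and that's what the fetch function leans on; it looks roughly like this:

    type Fetcher interface {
        // Fetch returns the body of URL and
        // a slice of URLs found on that page.
        Fetch(url string) (body string, urls []string, err error)
    }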

Thinking in terms of gophers vastly simplified my work. I then used a map to keep track of pages I've visited so I don't go back to them. The two important functions are:

func getPage(url string, depth int, r chan unprocessed) {
    // fetch the page and pull out its body and links
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }

    // always report back, even on error, so the controller's count stays accurate
    r <- unprocessed{depth - 1, urls}
}

As you can see, this simply performs a fetch on the page and sends the results to the channel; the object sent is a simple struct. The important point here is that this function always sends something, even if there is an error. It's up to the controller to be smart about what work to hand out; this is just a dumb fetch function.
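The definition of that struct isn't shown above, but from how it's used (next.depth and next.url) it's just the remaining depth plus the list of links pulled from the page, roughly:

    type unprocessed struct {
        depth int      // how many more levels we're allowed to crawl from here
        url   []string // links extracted from the fetched page
    }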

func Crawl(url string, depth int, fetcher Fetcher) {

    // set up a channel for results waiting to be processed
    up := make(chan unprocessed)

    // kick off processing and count how many pages are left to process
    visited := make(map[string]bool)
    visited[url] = true
    go getPage(url, depth, up)
    outstanding := 1

    for outstanding > 0 {

        // pop a visit from the channel
        next := <-up
        outstanding--

        // if we're too deep, skip it
        if next.depth <= 0 {
            continue
        }

        // loop over all the urls to visit from that page
        for _, link := range next.url {

            // check we haven't visited them before
            if visited[link] {
                continue
            }

            // all good to visit it
            outstanding++
            visited[link] = true
            go getPage(link, next.depth, up) // pass on the already-decremented depth
        }
    }
}

The second function starts to make sense alongside that dumb page fetch: it essentially sets up a loop with a counter that increments whenever a getPage is started and decrements whenever a set of results is processed. This lets the channel act as the work queue and moves all the clever work into the controlling loop.
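The tour's template wires everything up with a package-level fetcher variable (a fakeFetcher stocked with a few canned golang.org pages), and that's the variable getPage reads above. To run this outside the tour you can drop in any Fetcher; the stubFetcher below is a hypothetical stand-in for illustration, not part of my solution:

    // assumes the same file as the code above: package main, import "fmt"

    // stubFetcher maps a url to the links "found" on that page
    type stubFetcher map[string][]string

    func (f stubFetcher) Fetch(url string) (string, []string, error) {
        if links, ok := f[url]; ok {
            return "body of " + url, links, nil
        }
        return "", nil, fmt.Errorf("not found: %s", url)
    }

    // package-level fetcher, which is what getPage calls directly
    var fetcher = stubFetcher{
        "https://golang.org/":     {"https://golang.org/pkg/", "https://golang.org/cmd/"},
        "https://golang.org/pkg/": {"https://golang.org/"},
    }

    func main() {
        Crawl("https://golang.org/", 4, fetcher)
    }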

After I managed to get a working solution I looked around for what others had done; interestingly, my final solution looks a bit similar to this one, though I used a struct to pass data around rather than just a raw string slice, which results in slightly different handling of the depth check. Over time I guess I will see which is more idiomatic golang.
