Erlang

Web scraping with Elixir

by Oleg Tarasenko

Introduction

Businesses are investing in data. The big data analytics market is expected to grow to $103 billion (USD) within the next five years. It’s easy to see why, with everyone of us on average generating 1.7 megabytes of data per second.

As the amount of data we create grows, so too does our ability to inherent, interpret and understand it. Taking enormous datasets and generating very specific findings are leading to fantastic progress in all areas of human knowledge, including science, marketing and machine learning.

To do this correctly, we need to ask an important question, how do we prepare datasets that meet the needs of our specific research. When dealing with publicly available data on the web, the answer is web scraping.

What is web scraping?

Wikipedia defines web scraping as:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis. (See: https://en.wikipedia.org/wiki/Web_scraping).

Web scraping with Elixir

Web scraping can be done differently in different languages, for the purposes of this article we’re going to show you how to do it in Elixir, for obvious reasons ;). Specifically, we’re going to show you how to complete web scraping with an Elixir framework called Crawly.

Crawly is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. You can find out more about it on the documentation page.

Getting started

For the purposes of this demonstration, we will build a web scraper to extract the data from Erlang Solutions Blog. We will extract all the blog posts, so they can be analyzed later (let’s assume we’re planning to build some machine learning to identify the most useful articles and authors).

First of all, we will create a new Elixir project:

mix new esl_blog --sup

Now that the project is created, modify the deps` function of the mix.exs file, so it looks like this:

     #Run "mix help deps" to learn about dependencies.

    defp deps do
    [
      {:crawly, "~> 0.1"},
    ]
  end

Fetch the dependencies with: mix deps.get, and we’re ready to go! So let’s define our crawling rules with the help of… spiders.

The spider.

Screenshot-2019-08-07-at-08-01-13

Spiders are behaviours which you create an implementation, and which Crawly uses to extract information from a given website. The spider must implement the spider behaviour (it’s required to implement the parse_item/1, init/0`, `base_url/0 callbacks). This is the code for our first spider. Save it in to file called esl.ex under the lib/esl_blog/spiders directory of your project.

defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url() do
    "https://www.erlang-solutions.com"
  end

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(_response) do
    %Crawly.ParsedItem{:items => [], :requests => []}
  end
end

Here’s a more detailed run down of the above code:

base_url(): function which returns base_urls for the given Spider, used in order to filter out all irrelevant requests. In our case, we don’t want our crawler to follow links going to social media sites and other partner sites (which are not related to the crawled target).

init(): must return a KW list which contains start_urls list which Crawler will begin to crawl from. Subsequent requests will be generated from these initial urls.

parse_item(): function which will be called to handle response downloaded by Crawly. It must return the Crawly.ParsedItem structure.

Crawling the website

Now it’s time to run our first spider: Start the interactive elixir console on the current project, and run the following command:

Crawly.Engine.start_spider(Esl)

You will get the following (or similar) result:

22:37:15.064 [info] Starting the manager for Elixir.Esl

22:37:15.073 [debug] Starting requests storage worker for Elixir.Esl…

22:37:15.454 [debug] Started 4 workers for Elixir.Esl :ok iex(2)> 22:38:15.455 [info] Current crawl speed is: 0 items/min

22:38:15.455 [info] Stopping Esl, itemcount timeout achieved

So what just happened under the hood?

Crawly scheduled the Requests object returned by the init/0 function of the Spider. Upon receiving a response, Crawly used a callback function (parse_item/1) in order to process the given response. In our case, we have not defined any data to be returned by the parse_item/1 callback, so the Crawly worker processes (responsible for downloading requests) had nothing to do, and as a result, the spider is being closed after the timeout is reached. Now it’s time to harvest the real data!

Extracting the data

Now it’s time to implement the real data extraction part. Our parse_item/1 callback is expected to return Requests and Items. Let’s start with the Requests extraction:

Open the Elixir console and execute the following:

`{:ok, response} = Crawly.fetch("https://www.erlang-solutions.com/blog.html")` you will see something like:

{:ok,
 %HTTPoison.Response{
   body: "<!DOCTYPE html>\n\n<!--[if IE 9 ]>" <> ...,
   headers: [
     {"Date", "Sat, 08 Jun 2019 20:42:05 GMT"},
     {"Content-Type", "text/html"},
     {"Content-Length", "288809"},
     {"Last-Modified", "Fri, 07 Jun 2019 09:24:07 GMT"},
     {"Accept-Ranges", "bytes"}
   ],
   request: %HTTPoison.Request{
     body: "",
     headers: [],
     method: :get,
     options: [],
     params: %{},
     url: "https://www.erlang-solutions.com/blog.html"
   },
   request_url: "https://www.erlang-solutions.com/blog.html",
   status_code: 200
 }}

This is our request, as it’s fetched from the web. Now, let’s use Floki in order to extract the data from the request.

First of all, we’re interested in “Read more” links, as they would allow us to navigate to the actual blog pages. In order to understand the ‘Now’, let’s use Floki in order to extract URLs from the given page. In our case, we will extract Read more links.

Screenshot-2019-08-07-at-08-01-13

Using the firebug or any other web developer extension, find the relevant CSS selectors. In my case I have ended up with the following entries: iex(6)> response.body |> Floki.find(“a.more”) |> Floki.attribute(“href”) [“/blog/a-guide-to-tracing-in-elixir.html”, “/blog/blockchain-no-brainer-ownership-in-the-digital-era.html”, “/blog/introducing-telemetry.html”, “/blog/remembering-joe-a-quarter-of-a-century-of-inspiration-and-friendship.html”,

Next, we need to convert the links into requests. Crawly requests is a way to provide some sort of flexibility to a spider. For example, sometimes you have to modify HTTP parameters of the request before sending them to the target website. Crawly is fully asynchronous. Once the requests are scheduled, they are picked up by separate workers and are executed in parallel. This also means that other requests can keep going even if some request fails or an error happens while handling it. In our case, we don’t want to tweak HTTP headers, but it is possible to use the shortcut function: Crawly.Utils.request_from_url/1 Crawly expects absolute URLs in requests. It’s a responsibility of a spider to prepare correct URLs. Now we know how to extract links to actual blog posts, let’s also extract the data from the blog pages. Let’s fetch one of the pages, using the same command as before (but using the URL of the blog post): {:ok, response} = Crawly.fetch(“https://www.erlang-solutions.com/blog/a-guide-to-tracing-in-elixir.html”)

Now, let’s use Floki in order to extract title, text, author, and URL from the blog post: Extract title with:

Floki.find(response.body, "h1:first-child") |> Floki.text

Extract the author with:
  author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

Extract the text with:

Floki.find(response.body, "article.blog_post") |> Floki.text

Finally, the url does not need to be extracted, as it’s already a part of response! Now it’s time to wire everything together. Let’s update our spider code with all snippets from above:

defmodule Esl do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url() do
    "https://www.erlang-solutions.com"
  end

  @impl Crawly.Spider
  def init() do
    [
      start_urls: ["https://www.erlang-solutions.com/blog.html"]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Getting new urls to follow
    urls =
      response.body
      |> Floki.find("a.more")
      |> Floki.attribute("href")

    # Convert URLs into requests
    requests =
      Enum.map(urls, fn url ->
        url
        |> build_absolute_url(response.request_url)
        |> Crawly.Utils.request_from_url()
      end)

    # Extract item from a page, e.g.
    # https://www.erlang-solutions.com/blog/introducing-telemetry.html
    title =
      response.body
      |> Floki.find("article.blog_post h1:first-child")
      |> Floki.text()

    author =
      response.body
      |> Floki.find("article.blog_post p.subheading")
      |> Floki.text(deep: false, sep: "")
      |> String.trim_leading()
      |> String.trim_trailing()

    text = Floki.find(response.body, "article.blog_post") |> Floki.text()

    %Crawly.ParsedItem{
      :requests => requests,
      :items => [
        %{title: title, author: author, text: text, url: response.request_url}
      ]
    }
  end

  def build_absolute_url(url, request_url) do
    URI.merge(request_url, url) |> to_string()
  end
end

Running the spider again

If you run the spider, it will output the extracted data with the log:

22:47:06.744 [debug] Scraped "%{author: \"by Lukas Larsson\", text: \"Erlang 19.0 Garbage Collector2016-04-07" <> …

Which indicates that the data has successfully been extracted from all blog pages.

What else?

You’ve seen how to extract and store items from a website using Crawly, but this is just the basic example. Crawly provides a lot of powerful features for making scraping easy and efficient, such as:

  • Flexible request spoofing (for example user-agent rotation)
  • Item validation, using a pipeline approach
  • Filtering already seen requests and items
  • Filtering out all requests which are coming from other domains
  • Robots.txt enforcement
  • Concurrency control
  • HTTP API for controlling crawlers
  • Interactive console, which allows to create and debug spiders

Learn more

Want to know how the tools used in this blog can be used to create a machine learning program in Elixir? Don’t miss out on part two of this blog series, sign up to our newsletter and we’ll send the next blog straight to your inbox.

We thought you might also be interested in:

Which companies are using Elixir

CircleCI Continuous Integration for Elixir Projects

Elixir Language Development & Consultancy

Go back to the blog

×

Thank you for your message

We sent you a confirmation email to let you know we received it. One of our colleagues will get in touch shortly.
Have a nice day!