Tutorial

Scraping data with Elixir and Floki

This post was updated 27 Mar


In this tutorial, I want to show you how to get started with screen scraping data from a website using Elixir and Floki. The page I want to scrape contains a list of cities in the US with their latitude and longitude: https://www.latlong.net/category/cities-236-15.html

As you can see, the data on the page is contained in a table, which makes it pretty easy to deal with. However, there is more than one page, so I need to handle the pagination and move to the next page.

Step 1 - Set up the data model

I want to store the data in the database, so I generate a cities table with a Phoenix context.

mix phx.gen.context Cities City cities city state latitude:float longitude:float

I don’t want to deal with incomplete data at this point, so I require all fields to not allow NULL. Also, I want to have a unique index on city name and state.

# priv/repo/migrations/20211014084002_create_cities.exs
defmodule MyApp.Repo.Migrations.CreateCities do
  use Ecto.Migration

  def change do
    create table(:cities) do
      add :city, :string, null: false
      add :state, :string, null: false
      add :latitude, :float, null: false
      add :longitude, :float, null: false

      timestamps()
    end

    create unique_index(:cities, [:city, :state])
  end
end
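For the unique index to surface as a validation error instead of a database exception, the changeset also needs a unique_constraint. A sketch of what the generated changeset in lib/my_app/cities/city.ex might look like after that addition (the field list follows the generator command above):

```elixir
# lib/my_app/cities/city.ex (sketch — generated by phx.gen.context,
# with unique_constraint/2 added for the [:city, :state] index)
def changeset(city, attrs) do
  city
  |> cast(attrs, [:city, :state, :latitude, :longitude])
  |> validate_required([:city, :state, :latitude, :longitude])
  |> unique_constraint([:city, :state])
end
```

With this in place, inserting a duplicate city/state pair returns {:error, changeset} rather than raising.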

With these changes in place, I can run the migrations with:

mix ecto.migrate

Step 2 - Introduction to scraping with Floki

A standard Elixir Phoenix application today already comes with the Floki package, though only in the test environment, so that is what I will use. I will also use the relatively new HTTP client Finch. Make sure that Floki is available in all environments and add Finch:

# mix.exs
defp deps do
  [
    {:floki, ">= 0.30.0"},
    {:finch, "~> 0.8"}
  ]
end

Install them with

mix deps.get

Add Finch to the children in the supervision tree and give it a name:

# lib/my_app/application.ex
children = [
  {Finch, name: MyFinch}
]

With that in place, you need to restart the server if you had it running. The name MyFinch is arbitrary and could be whatever you want.

First test - Request with Finch

I want to have the scraper logic in a file called scraper.ex. To get started, the only code I need is the HTTP verb and the URL. Note that I use the name of the Finch child here, MyFinch.

# lib/my_app/scraper/scraper.ex
defmodule MyApp.Scraper do
  def run do
    {:ok, %Finch.Response{body: body}} =
      Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
      |> Finch.request(MyFinch)

    body
  end
end

I can test this out in IEx with:

iex -S mix

And call:

MyApp.Scraper.run()

If everything works and the page is up, I should see the HTML body as text.

However, the text body is not that easy to work with since it’s unstructured. This is where I need Floki.

Second test - Parse html with Floki

I have basically the same code as above, but now I’m parsing the document with Floki. I also use Floki to target the table rows in the document.

# lib/my_app/scraper/scraper.ex
def run do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)

  document
  |> Floki.find("table tr")
end

If IEx is already started, I can reload the module with the r command and run the same function again.

# iex
r MyApp.Scraper
MyApp.Scraper.run()

This should give me a list of rows, each in the form of a tuple with structured data. Moving forward, I can parse these with pattern matching. For example, I want to skip the table header, which you can see at the top.
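To illustrate, here is roughly the shape of a tuple Floki returns for one row, and how a pattern match can tell a data row from the header. The attributes and href are made up for the example; the real markup may differ.

```elixir
# Hypothetical shape of one scraped row — attributes and href are
# illustrative, not taken from the live page.
row =
  {"tr", [],
   [
     {"td", [], [{"a", [{"href", "/place/new-york"}], ["New York, NY, USA"]}]},
     {"td", [], ["40.712776"]},
     {"td", [], ["-74.005974"]}
   ]}

# The header row contains th elements instead, so matching on three
# td children skips it automatically.
result =
  case row do
    {"tr", _, [{"td", _, _}, {"td", _, _}, {"td", _, _}]} -> :data_row
    _ -> :skip
  end
```

A header tuple like {"tr", [], [{"th", [], ["Place Name"]}, ...]} falls through to :skip.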

With that proof of concept done, I can build the real scraper.

Step 3 - Building the screen scraper

Since I plan to loop through all the pages in the pagination list, I want the run-function to take an optional argument. It should either be the path provided from the next-link in the pagination, or fall back to a default path.

Besides that, I want to split up the logic in three (and a half) functions.

  1. Perform the request and parse the result. This takes a path-string and returns a Floki document.
  2. From the Floki document, parse the table rows and return an Elixir map.
  3. Save each row in the database.
  4. Parse the pagination and see if there is a next-link. If there is, call the run-function again with the new path.

# lib/my_app/scraper/scraper.ex
defmodule MyApp.Scraper do
  @base_url "https://www.latlong.net"
  @initial_path "/category/cities-236-15.html"

  def run(path \\ @initial_path) do
    document = perform_request_and_parse_result(path)

    find_and_parse_rows(document)
    |> save_rows()

    maybe_paginate(document)
  end
end

The first function called in the run-function is perform_request_and_parse_result/1. It makes a GET request with Finch, combining the base URL with the path.

# lib/my_app/scraper/scraper.ex
# lib/my_app/scraper/scraper.ex
defp perform_request_and_parse_result("" <> path) do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "#{@base_url}#{path}")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)
  document
end

Note that both the initial path and the paths from the pagination links start with a slash, so the base URL and path are concatenated directly to avoid a double slash.

The next function, find_and_parse_rows/1, finds all tr elements in the table. Floki returns a list of tuples, where the last element is in turn a list of tuples containing the td elements. When I parse the rows, I map over them, call parse_row/1, and use pattern matching to extract city, state, latitude and longitude. Each row becomes a map that will be used to create a record in the database.

Note that the fallback is to return an empty map. That is fine because it will fail validation when I try to insert it into the database.
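A minimal sketch of what those two functions might look like, assuming each data row holds a linked place name like "New York, NY, USA" followed by latitude and longitude cells; the exact patterns depend on the page's real markup:

```elixir
# lib/my_app/scraper/scraper.ex (sketch — patterns assume the row
# shape described above, not verified against the live page)
defp find_and_parse_rows(document) do
  document
  |> Floki.find("table tr")
  |> Enum.map(&parse_row/1)
end

defp parse_row(
       {"tr", _,
        [
          {"td", _, [{"a", _, [place]}]},
          {"td", _, [latitude]},
          {"td", _, [longitude]}
        ]}
     ) do
  # Place names look like "New York, NY, USA" — split off city and state.
  [city, state | _] = place |> String.split(",") |> Enum.map(&String.trim/1)

  %{
    city: city,
    state: state,
    latitude: String.to_float(latitude),
    longitude: String.to_float(longitude)
  }
end

# Header rows (th cells) and anything unexpected fall through to an
# empty map, which later fails changeset validation.
defp parse_row(_), do: %{}
```
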

# lib/my_app/scraper/scraper.ex
defp save_rows(rows) do
  rows
  |> Enum.each(&MyApp.Cities.create_city/1)
end

Below the table, at the bottom of the page, there is the pagination list. I could call all the pages concurrently, but I opt for scraping them one page at a time. I will find the a tag whose text starts with “next”.

I usually give a function with two possible outcomes a maybe-prefix. I want to recursively call the run-function with the new path as long as there is one; otherwise I want to exit with :ok.

# lib/my_app/scraper/scraper.ex
defp maybe_paginate(document) do
  document
  |> Floki.find(".pagination li a")
  |> Enum.find(fn row ->
    case row do
      {"a", [{"href", "/" <> _path}], ["next" <> _]} -> true
      _ -> false
    end
  end)
  |> case do
    nil ->
      :ok

    {_, [{_, "" <> path}], _} ->
      run(path)
  end
end

Now that all the code is in place, I can test out the functionality in IEx.

iex -S mix

And call the run-function without an argument:

# iex
MyApp.Scraper.run()

Since it scrapes the pages one by one, it takes a few seconds. But I can see in the console that it saves the data to the database, and it looks correct.

Note that it returns :ok when it has run out of pages to scrape.

Now, depending on the business case, I might want to do a daily import and set up an Oban worker. But I will leave that up to you.
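For reference, a minimal sketch of what such a worker might look like, assuming Oban is already installed and configured; the module name and queue are made up for the example:

```elixir
# Hypothetical Oban worker for a daily import — assumes Oban is
# added to deps and supervised; names are illustrative.
defmodule MyApp.Workers.ScrapeCities do
  use Oban.Worker, queue: :default

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # run/0 returns :ok once it has scraped the last page,
    # which marks the job as successful.
    MyApp.Scraper.run()
  end
end
```

Scheduling it once a day could then be done with the Oban.Plugins.Cron plugin in the Oban configuration.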