Tutorial
Scraping data with Elixir and Floki
In this tutorial I want to show you how to get started with screen scraping data from a website with Elixir and Floki. The page I want to scrape contains a list of cities in the US with their latitude and longitude: https://www.latlong.net/category/cities-236-15.html
As you can see, the data on the page is contained in a table, which makes it pretty easy to deal with. However, there is more than one page, so I also need to handle the pagination and move to the next page.
Step 1 - Set up the data model
I want to store the data in the database, so I generate a cities table with a Phoenix context.
mix phx.gen.context Cities City cities city state latitude:float longitude:float
I don't want to deal with incomplete data at this point, so I require all fields to not allow NULL. Also, I want a unique index on city name and state.
# priv/repo/migrations/20211014084002_create_cities.exs
defmodule Tutorial.Repo.Migrations.CreateCities do
  use Ecto.Migration

  def change do
    create table(:cities) do
      add :city, :string, null: false
      add :state, :string, null: false
      add :latitude, :float, null: false
      add :longitude, :float, null: false

      timestamps()
    end

    create unique_index(:cities, [:city, :state])
  end
end
With these changes in place, I can run the migrations with:
mix ecto.migrate
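The context generator also creates a City schema with a changeset that casts and requires all the fields. To surface the unique index as a changeset error instead of a raised database exception, I would add a unique_constraint myself. Here is a minimal sketch of what that schema could look like; the generated file may differ slightly:

# lib/tutorial/cities/city.ex (sketch of the generated schema, with unique_constraint added)
defmodule Tutorial.Cities.City do
  use Ecto.Schema
  import Ecto.Changeset

  schema "cities" do
    field :city, :string
    field :state, :string
    field :latitude, :float
    field :longitude, :float

    timestamps()
  end

  def changeset(city, attrs) do
    city
    |> cast(attrs, [:city, :state, :latitude, :longitude])
    |> validate_required([:city, :state, :latitude, :longitude])
    # Matches the unique index from the migration on [:city, :state].
    |> unique_constraint([:city, :state])
  end
end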
Step 2 - Introduction to scraping with Floki
A standard Elixir Phoenix application today already comes with the Floki package installed, so that is what I will use. I will also use the relatively new HTTP client Finch. Make sure that floki is available in all environments (i.e. not restricted with only: :test) and add finch:
# mix.exs
defp deps do
  [
    # ...other deps
    {:floki, ">= 0.30.0"},
    {:finch, "~> 0.8"}
  ]
end
Install them with:
mix deps.get
Add Finch to the children in the supervision tree and give it a name:
# lib/tutorial/application.ex
children = [
  # ...other children
  {Finch, name: MyFinch}
]
With that in place, you need to restart the server if you had it running. The MyFinch name is arbitrary and could be whatever you want.
First test - Request with Finch
I want to have the scraper logic in a file called scraper.ex. To get started, the only code I need is the HTTP verb and the URL. Note that I use the name of the Finch child here, MyFinch.
# lib/tutorial/scraper/scraper.ex
defmodule Tutorial.Scraper do
  def run do
    {:ok, %Finch.Response{body: body}} =
      Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
      |> Finch.request(MyFinch)

    body
  end
end
I can test this out in IEX with:
iex -S mix
And call:
Tutorial.Scraper.run()
If everything works and the page is up, I should see the HTML body as text.
However, the text body is not that easy to work with since it's unstructured. This is where I need Floki.
Second test - Parse the HTML body with Floki
I basically have the same code as above, but now I parse the document with Floki. I will also use Floki to target the table rows in the document.
# lib/tutorial/scraper/scraper.ex
def run do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)

  document
  |> Floki.find("table tr")
end
If IEx is already started, I can reload the module with the r command and run the same function again.
# iex
r Tutorial.Scraper
Tutorial.Scraper.run()
This should give me a list of rows, in the form of tuples with structured data. Moving forward, I can parse this with pattern matching. For example, I want to skip the table header row that appears at the top.
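To illustrate, Floki returns each row as a nested tuple of tag name, attributes and children. A minimal sketch of how pattern matching can separate header rows from data rows follows; the RowExample module is hypothetical, and the assumption that the header cells are th elements depends on the real markup:

# Each row looks roughly like {"tr", attrs, children}, where children is a
# list of cell tuples such as {"td", [], ["40.71..."]} (illustrative only).
defmodule RowExample do
  # Assumption: a header row starts with a <th> cell.
  def data_row?({"tr", _attrs, [{"th", _, _} | _]}), do: false
  def data_row?({"tr", _attrs, _children}), do: true
end

# Usage: rows |> Enum.filter(&RowExample.data_row?/1)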
With that proof of concept done, I can build the real scraper.
Step 3 - Building the screen scraper
Since I plan to loop through all the pages in the pagination list, I want the run-function to take an optional argument. It should either be the path provided by the next-link in the pagination, or fall back to a default path.
Besides that, I want to split up the logic into three (and a half) functions:
- Perform the request and parse the result. This takes a path string and returns a Floki document.
- From the Floki document, parse the table rows and return a list of Elixir maps.
- Save each row in the database.
- Parse the pagination and see if there is a next-link. If there is, call the run-function again with the new path.
# lib/tutorial/scraper/scraper.ex
defmodule Tutorial.Scraper do
  @base_url "https://www.latlong.net"
  @initial_path "/category/cities-236-15.html"

  def run(path \\ @initial_path) do
    document = perform_request_and_parse_result(path)

    find_and_parse_rows(document)
    |> save_rows()

    maybe_paginate(document)
  end
end
This function is called first in the run-function. I make a GET request with Finch and combine the base URL with the path.
# lib/tutorial/scraper/scraper.ex
defp perform_request_and_parse_result("" <> path) do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "#{@base_url}#{path}")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)

  document
end
The find_and_parse_rows/1 function finds all the tr elements in the table. It returns a list of tuples, where the last value is in turn a list of tuples containing the td elements. When I parse the rows, I map over them and call parse_row/1, which uses pattern matching to extract city, state, latitude and longitude. It returns a map that will be used to create the records in the database.
Note that the fallback is to return an empty map. That is fine because it will fail validations when I try to insert it in the database.
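Since the article only describes these functions, here is a minimal sketch of how find_and_parse_rows/1 and parse_row/1 could look. It assumes the first cell of each data row holds a link whose text is "City, State", followed by cells with the latitude and longitude; the exact patterns depend on the real markup:

# lib/tutorial/scraper/scraper.ex (sketch; cell patterns are assumptions)
defp find_and_parse_rows(document) do
  document
  |> Floki.find("table tr")
  |> Enum.map(&parse_row/1)
end

defp parse_row({"tr", _, [{"td", _, [{"a", _, [place]}]}, {"td", _, [lat]}, {"td", _, [lng]} | _]}) do
  [city, state] = String.split(place, ", ", parts: 2)

  %{
    city: city,
    state: state,
    latitude: String.to_float(lat),
    longitude: String.to_float(lng)
  }
end

# Header rows (and anything else that does not match) become an empty map,
# which later fails changeset validation instead of being inserted.
defp parse_row(_), do: %{}

After the rows are parsed, saving them is a one-liner per row: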
# lib/tutorial/scraper/scraper.ex
defp save_rows(rows) do
  rows
  |> Enum.each(&Tutorial.Cities.create_city/1)
end
Below the table, at the bottom of the page, there is the pagination list. I could concurrently call all the pages at the same time, but I opt for scraping them one page at a time. I will find the a-tag with the text "next".
I usually give a function that has two types of outcomes a maybe-prefix. I want to recursively call the run-function with the new path as long as there is one. Otherwise, I want to exit with an :ok.
# lib/tutorial/scraper/scraper.ex
defp maybe_paginate(document) do
  document
  |> Floki.find(".pagination li a")
  |> Enum.find(fn row ->
    case row do
      {"a", [{"href", "/" <> _path}], ["next" <> _]} -> true
      _ -> false
    end
  end)
  |> case do
    nil ->
      :ok

    {_, [{_, "" <> path}], _} ->
      run(path)
  end
end
Now that all the code is in place, I can test out the functionality in IEx.
iex -S mix
And call the run-function without an argument
# iex
Tutorial.Scraper.run()
Since it goes through and scrapes the pages one by one, it takes a few seconds. But I can see in the console that it saves the data in the database, and it looks correct.
Note that it returns :ok when it has run out of pages to scrape.
Now, depending on the business case, I might want to do a daily import and set up an Oban worker. But I will leave that up to you.
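If you do go down that route, a minimal sketch of a daily import with Oban's cron plugin could look like this. The worker module name, queue and schedule are assumptions, and it presumes Oban is added as a dependency and started in the supervision tree, which is outside the scope of this tutorial:

# config/config.exs (sketch)
config :tutorial, Oban,
  repo: Tutorial.Repo,
  queues: [scraper: 1],
  plugins: [
    # Run the import once a day at 06:00 UTC (assumed schedule).
    {Oban.Plugins.Cron, crontab: [{"0 6 * * *", Tutorial.Workers.ScrapeCities}]}
  ]

# lib/tutorial/workers/scrape_cities.ex (sketch)
defmodule Tutorial.Workers.ScrapeCities do
  use Oban.Worker, queue: :scraper

  @impl Oban.Worker
  def perform(_job) do
    # Reuse the scraper from this tutorial; it returns :ok when done.
    Tutorial.Scraper.run()
  end
end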