Tutorial
Scraping data with Elixir and Floki
In this tutorial I want to show you how to get started with screen scraping data from a website with Elixir and Floki. The page I want to scrape contains a list of US cities with their latitude and longitude: https://www.latlong.net/category/cities-236-15.html
As you can see, the data on the page is contained in a table, which makes it pretty easy to deal with. However, there is more than one page, so I need to handle the pagination and move to the next page.
Step 1 - Set up the data model
I want to store the data in the database, so I generate a cities table with a Phoenix Context.
mix phx.gen.context Cities City cities city state latitude:float longitude:float
I don't want to deal with incomplete data at this point, so I require all fields to be NOT NULL. Also, I want a unique index on city name and state.
```elixir
# priv/repo/migrations/20211014084002_create_cities.exs
defmodule Tutorial.Repo.Migrations.CreateCities do
  use Ecto.Migration

  def change do
    create table(:cities) do
      add :city, :string, null: false
      add :state, :string, null: false
      add :latitude, :float, null: false
      add :longitude, :float, null: false

      timestamps()
    end

    create unique_index(:cities, [:city, :state])
  end
end
```
With these changes in place, I can run the migrations with:
mix ecto.migrate
Step 2 - Introduction to scraping with Floki
A standard Elixir Phoenix application today already comes with the Floki package installed, so that is what I will use. I will also use the relatively new HTTP client Finch. Make sure that Floki is available in all environments (remove any `only: :test` option) and add Finch:
```elixir
# mix.exs
defp deps do
  [
    {:floki, ">= 0.30.0"},
    {:finch, "~> 0.8"}
  ]
end
```
Install them with:
mix deps.get
Add Finch to the children in the supervision tree and give it a name:
```elixir
# lib/tutorial/application.ex
children = [
  {Finch, name: MyFinch}
]
```
With that in place, you need to restart the server if you had it running. The MyFinch name is arbitrary and could be whatever you want.
First test - Request with Finch
I want to have the scraper logic in a file called scraper.ex. To get started, the only code I need is this: the HTTP verb and the URL. Note that I use the name of the Finch child here, `MyFinch`.
```elixir
# lib/tutorial/scraper/scraper.ex
defmodule Tutorial.Scraper do
  def run do
    {:ok, %Finch.Response{body: body}} =
      Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
      |> Finch.request(MyFinch)

    body
  end
end
```
I can test this out in IEx with:
iex -S mix
And call:
Tutorial.Scraper.run()
If everything works and the page is up, I should see the HTML body in text format.
However, the text body is not that easy to work with since it's unstructured. This is where I need Floki.
Second test - Parse the HTML body with Floki
I basically have the same code as above, but now I parse the document with Floki. I will also use Floki to target the table rows in the document.
```elixir
# lib/tutorial/scraper/scraper.ex
def run do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "https://www.latlong.net/category/cities-236-15.html")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)

  document
  |> Floki.find("table tr")
end
```
If IEx is already started, I can reload the module with the `r` command and run the same function again.
```elixir
# iex
r Tutorial.Scraper
Tutorial.Scraper.run()
```
This should give me a list of rows, each in the form of a tuple with structured data. Moving forward, I can parse these with pattern matching. For example, I want to skip the table header that you can see at the top.
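For illustration, Floki represents each parsed node as a `{tag, attributes, children}` tuple, so a data row has roughly this shape (the href and values below are made-up placeholders, not scraped data):

```elixir
# Illustrative shape only - the attribute and text values are placeholders.
# A header row typically has a different shape (e.g. "th" cells), so it will
# not match the same pattern as the data rows.
{"tr", [],
 [
   {"td", [], [{"a", [{"href", "/..."}], ["City, State, Country"]}]},
   {"td", [], ["30.266666"]},
   {"td", [], ["-97.733330"]}
 ]}
```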
With that proof of concept done, I can build the real scraper.
Step 3 - Building the screen scraper
Since I plan to loop through all the pages in the pagination list, I want the run-function to take an optional argument. It should either be the path provided from the next-link in the pagination, or fall back to a default path.
Besides that, I want to split up the logic in three (and a half) functions:
- Perform the request and parse the result. This takes a path-string and returns a Floki document.
- From the Floki document, parse the table rows and return an Elixir map
- Save each row in the database
- Parse the pagination and see if there is a next-link. If there is, call the run-function again with the new path.
```elixir
# lib/tutorial/scraper/scraper.ex
defmodule Tutorial.Scraper do
  @base_url "https://www.latlong.net"
  @initial_path "/category/cities-236-15.html"

  def run(path \\ @initial_path) do
    document = perform_request_and_parse_result(path)

    document
    |> find_and_parse_rows()
    |> save_rows()

    maybe_paginate(document)
  end
end
```
The function `perform_request_and_parse_result/1` is called first in the run-function. I make a GET-request with Finch and combine the base URL with the path.
```elixir
# lib/tutorial/scraper/scraper.ex
defp perform_request_and_parse_result("" <> path) do
  {:ok, %Finch.Response{body: body}} =
    Finch.build(:get, "#{@base_url}#{path}")
    |> Finch.request(MyFinch)

  {:ok, document} = Floki.parse_document(body)
  document
end
```
The function `find_and_parse_rows/1` finds all `tr`-elements in the table. It returns a list of tuples, where the last value of each tuple is in turn a list of tuples containing the `td`-elements. When I parse the rows, I map over them and call `parse_row/1`, which uses pattern matching to extract city, state, latitude and longitude. It returns a map that will be used to create the records in the database.
```elixir
# lib/tutorial/scraper/scraper.ex
defp find_and_parse_rows(document) do
  document
  |> Floki.find("table tr")
  |> Enum.map(&parse_row/1)
end

defp parse_row(
       {"tr", _,
        [
          {"td", _, [{"a", _, [city_state_country]}]},
          {"td", _, [latitude]},
          {"td", _, [longitude]}
        ]}
     ) do
  [city, state | _] =
    city_state_country
    |> String.split(",")
    |> Enum.map(&String.trim/1)

  %{city: city, state: state, latitude: latitude, longitude: longitude}
end

defp parse_row(_), do: %{}
```
Note that the fallback is to return an empty map. That is fine because it will fail validations when I try to insert it into the database.
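For reference, the generated changeset requires all fields, which is why an empty map fails. A minimal sketch of what it can look like; the `unique_constraint/3` line is my own addition, so that a duplicate city/state pair is rejected as a changeset error instead of raising on the database index:

```elixir
# lib/tutorial/cities/city.ex
def changeset(city, attrs) do
  city
  |> cast(attrs, [:city, :state, :latitude, :longitude])
  |> validate_required([:city, :state, :latitude, :longitude])
  # Added manually to match the unique index from the migration
  |> unique_constraint([:city, :state])
end
```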
```elixir
# lib/tutorial/scraper/scraper.ex
defp save_rows(rows) do
  rows
  |> Enum.each(&Tutorial.Cities.create_city/1)
end
```
Below the table, at the bottom of the page, there is the pagination list. I could concurrently call all the pages at the same time, but I opt for scraping them one page at a time. I will find the `a`-tag with the text "next".
I usually start a function that has two types of outcomes with a maybe-prefix, because I want to recursively call the run-function with the new path as long as there is one. Otherwise I want to exit with an `:ok`.
```elixir
# lib/tutorial/scraper/scraper.ex
defp maybe_paginate(document) do
  document
  |> Floki.find(".pagination li a")
  |> Enum.find(fn row ->
    case row do
      {"a", [{"href", "/" <> _path}], ["next" <> _]} -> true
      _ -> false
    end
  end)
  |> case do
    nil ->
      :ok

    {_, [{_, "" <> path}], _} ->
      run(path)
  end
end
```
Now that all the code is in place, I can test out the functionality in IEx.
iex -S mix
And call the run-function without an argument:
```elixir
# iex
Tutorial.Scraper.run()
```
Since it goes through and scrapes the pages one by one, it takes a few seconds. But I can see in the console that it saves the data to the database, and it looks correct.
Note that it returns `:ok` when it has run out of pages to scrape.
Now, depending on the business case, I might want to do a daily import and set up an Oban worker. But I will leave that up to you.
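As a starting point, a daily import could look something like this minimal sketch, assuming Oban is installed and configured (the worker module name and schedule are made up for illustration):

```elixir
# lib/tutorial/workers/scrape_cities_worker.ex (hypothetical module)
defmodule Tutorial.Workers.ScrapeCitiesWorker do
  use Oban.Worker, queue: :default

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # Scrape all pages; returns :ok when there are no more pages
    Tutorial.Scraper.run()
  end
end
```

Scheduling it daily could then be done with Oban's cron plugin, for example `{Oban.Plugins.Cron, crontab: [{"0 4 * * *", Tutorial.Workers.ScrapeCitiesWorker}]}` in the Oban configuration.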