Basic Web Scraping With Curl

This is a short write-up on how you can use curl to download multiple web pages and then use command line tools to extract the relevant text from the HTML files. If you are looking to scrape a large amount of data, or have specific data elements to extract, there are better ways of doing so with dedicated web scrapers (for example BeautifulSoup or ScraperWiki).
Part 1

Our first step is to find the links to all the webpages that contain the information we need. Here you will most likely encounter either sequential or random link structures.

Example of sequential links:
www.site.com/archive/2015/01.html
www.site.com/archive/2015/02.html
www.site.com/archive/2015/03.html

Example of random links:
www.site.com/archive/q101.html
www.site.com/archive/abc.html
www.site.com/archive/root1.html

In our example we will be working with a random link structure, but we will also go through how to handle sequential links.

Introduction to curl

Curl is a command line tool for transferring data using various protocols such as HTTP and FTP. If curl is not installed on your system you can install it using apt-get install curl.

The basic syntax to tell curl to download a single page is:

curl <url> -o file.extension

For example: curl www.site.com/page.html -o page.html

The -o option specifies the output file where the curled file will be saved. Without -o specified, curl will simply print the contents of the file on screen.
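
Curl also has a capital -O (--remote-name) option, which names the output file after the last part of the URL instead of a name you choose. For our example page the two are equivalent:

curl -O www.site.com/page.html

This saves the download as page.html in the current directory.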

Using curl to download multiple sequential links

For sequential links you can use curl to download the range of documents you want straight from the command line, using two kinds of bracket expressions.

You can download multiple URLs or parts of URLs by using braces:

www.{subdomain,subdomain,subdomain}.site.com

You can also download alphanumeric sequences by using square brackets:

www.site.com/archive/2015/[01-10].html

www.site.com/archive/2015/[a-h].html

You can also use multiple brackets in one command:

www.site.com/archive/20[10-15]/[01-10].html

When using brackets to curl multiple files we can dynamically build the file names by using '#' followed by a number in the -o option; curl replaces '#1', '#2', and so on with the current value of the first, second, etc. bracket or brace expression in the URL being downloaded.

curl www.site.com/archive/20[10-15]/[01-10].html -o '#1_#2.html'

This will save the files as 10_01.html, 10_02.html, and so on.
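
The same '#' naming also works with braces. As a quick sketch with three made-up subdomains (the URL is quoted so that the shell does not expand the braces before curl sees them):

curl 'http://www.{one,two,three}.site.com/index.html' -o '#1_index.html'

This saves the pages as one_index.html, two_index.html and three_index.html.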

Using curl to download multiple random links

With a random link structure it is difficult to construct a curl command with brackets that downloads multiple files. Instead, we can tell curl to download files from a list of links contained in a text file.

In our case study we are looking to download all the US Presidential speeches that talk about cybersecurity between 2014 and 2015. Our first step is to gather the links we want curl to download.

We can use the search function at the American Presidency Project to search for cybersecurity between 2014 and 2015, which should return a result that looks like this in our browser:

To extract the links from the search results we can use the text-based browser lynx. If you do not have lynx installed you can install it using apt-get install lynx.

First we have to save a local copy of the search results to work with (just right click in your browser and select Save As…); in our case we have saved the file as index.html. Using lynx we can then use the -dump option to strip all HTML tags from the source file and save the result to a text file called links.txt using the '>' operator.

lynx -dump index.html > links.txt
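
As an aside, many lynx builds also accept a -listonly switch that limits the dump to the numbered list of links, which would shorten the grep step below; the plain -dump output works fine for this walkthrough, so treat this as optional:

lynx -dump -listonly index.html > links.txt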

We have now removed all HTML tags from our index.html file; only the text present in the document remains, followed by a numbered list of links. In our case, the end of the text file should look like this:

We can see that the links relevant to us are numbers 39 to 53 (the actual search results). Since we are not interested in the other links or the preceding text, we will remove everything but those links using grep.

Grep is a command line tool for searching plain-text data sets for lines matching a regular expression. For example, grep 'http' index.html will return all the lines in our index.html file that contain a link.

Grep output

We can see that all our relevant links share the same structure up until '?pid=', so we can grep for that common prefix to return only the relevant links.

grep 'http://www.presidency.ucsb.edu/ws/index.php?' links.txt

However, for curl to be able to process the information as links we need to remove any extra characters, which in our case means that we need to remove the numbers and the spaces preceding the URL.

To accomplish this we can combine grep with the cut command and its -c option, which selects output by character position. In our case the URL starts at the seventh character: the list number, the full stop, and the padding spaces add up to six characters before it. However, -c7 on its own returns only that seventh character ('h'); since we want to keep the whole URL, we add a trailing '-' after the number so that cut keeps everything from the seventh character onwards.

grep 'http://www.presidency.ucsb.edu/ws/index.php?' links.txt | cut -c7-

The output should now look like this:

Grep with cut
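
If you want to check the character offset before running the full pipeline, you can test cut on a single made-up line first (the spacing below is an assumption; count the characters in your own links.txt):

echo '  39. http://www.example.com/page.html' | cut -c7-

This should print just the URL.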

We also want to save this output to a file so we can use it to curl all the relevant links. As we did with lynx, we use the '>' operator.

grep 'http://www.presidency.ucsb.edu/ws/index.php?' links.txt | cut -c7- > links-clean.txt
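
Before moving on, it is worth a quick sanity check that the file really contains one bare URL per line:

head links-clean.txt
wc -l links-clean.txt

The line count should match the number of search results (15 in our case, links 39 to 53).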

Now that we have a file with clean links, we need to instruct curl to read each line of the file and download each URL. We can do this with a simple bash script.

We will start by opening the text editor nano and creating a new script file called curl.sh.

nano curl.sh

In nano, we will type the following script:

#!/bin/bash
# Read links-clean.txt line by line and download each URL with curl.
file="links-clean.txt"
while read -r line
do
  # Use the last part of the URL (everything after the final '/') as the file name.
  outfile=$(echo "$line" | awk 'BEGIN { FS = "/" } ; {print $NF}')
  curl -o "$outfile.html" "$line"
  # Wait one minute between downloads so we do not overload the host.
  sleep 1m
done < "$file"

To exit nano press Ctrl+X and then press Y to save your changes. This script tells curl to read each line in links-clean.txt, download the document at that URL, save it to a file named after the last part of the URL, and then move on to the next line until all lines have been processed.
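
The awk part is the only non-obvious step: it splits the URL on '/' and keeps the last field, which becomes the file name. You can try it on a single URL (the pid below is made up):

echo 'http://www.presidency.ucsb.edu/ws/index.php?pid=12345' | awk 'BEGIN { FS = "/" } ; {print $NF}'

This prints index.php?pid=12345, so the corresponding download would be saved as index.php?pid=12345.html.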

Remember, computers can execute commands much faster than people, and in order not to overload the website host it is a good idea to add a sleep between each curl. In our example we have added a sleep time of one minute between each download.

To be able to execute the script we need to first give it appropriate file permissions using chmod.

chmod u+x curl.sh

We can then run the script.

./curl.sh

If everything has gone well we should now have a number of html files in our directory.

Curled files

Part 2

We now have all presidential speeches that mention cybersecurity from the period 2014 to 2015 in HTML format. However, we want to do some simple text analysis on the speeches themselves and would therefore like to have only the speech texts (not the HTML code or anything else) in text format. This means we need to find a way to convert the HTML to text and remove all content besides the speeches.

To convert the HTML files to text we can use lynx's dump option again, but rather than doing it manually for each file we can reuse the script we made for curling multiple files. First of all, we can see that our files have a .php-style ending rather than the .html ending that the -dump command needs to work.

For the sake of simplicity we will first use the rename command to strip everything but the numbers from our file names (note that these commands affect all matching files, so make sure you have put the downloaded HTML files in a separate folder of their own).

rename 's/[^0-9]*//g' index.php*

We can then add the .html extension to all files.

rename 's/(.*)/$1.html/' *
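
If you are wary of mass renames, the Perl-based rename found on many distributions accepts a -n (no-act) flag that only prints what would be renamed without touching anything, for example:

rename -n 's/(.*)/$1.html/' *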

The next step is to create the list of files for lynx to process. The easiest way is to just list the contents of this directory (since it does not contain anything but our HTML files) and save the output to a file.

ls > files.txt
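
Note that files.txt will end up listing itself, because the shell creates it before ls runs. The extra lynx pass over it later is harmless, but if you prefer a clean list you can restrict the glob to the HTML files:

ls *.html > files.txt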

Our files.txt now looks like this:

Files.txt

Then we repeat the process to create a bash script.

nano dump.sh

#!/bin/bash
# Read files.txt line by line and convert each HTML file to plain text with lynx.
file="files.txt"
while read -r line
do
  # Keep only the part after the final '/' (a no-op here, since files.txt contains bare file names).
  outfile=$(echo "$line" | awk 'BEGIN { FS = "/" } ; {print $NF}')
  lynx -dump "$line" > "$outfile.txt"
done < "$file"

As with our curl example, this script reads each line in files.txt, but instead of running curl it runs lynx -dump on each file and saves the output to a matching .txt file.

We then make the script executable and run it.

chmod u+x dump.sh

./dump.sh

We now have the speeches in text format with the HTML tags removed. However, we still have two problems. First, the beginning of each file contains a number of lines that are not relevant to us (the other text content on the web page that is not part of the speech).

Beginning

What we would like to do is remove all the lines of text before the speech begins. Luckily, all the files share the same structure and have 79 lines of irrelevant text before the speech text. We can therefore use sed to remove lines 1 to 79 in all text files in our directory. Sed (stream editor) is a powerful Linux tool that parses and transforms text. The following command tells sed to remove lines 1 to 79 (1,79d) in all text files (*.txt) and write the changes back to the files (-i).

sed -i '1,79d' *.txt
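
If you want to verify the cut-off point on a single file before editing everything in place, sed can also print a line range without modifying anything (the file name here is hypothetical):

sed -n '75,85p' 12345.txt

Everything from line 80 onwards should belong to the speech itself.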

The second problem is that lynx -dump puts all the links found in the HTML document as a list at the end of each text file.

End

We can see that there are two types of links: ones that begin with http:// and ones that begin with file://. Since we would like to remove all instances of both, we can again use sed, now with this syntax:

sed -i '/file:\|http/d' *.txt
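
A quick look at the end of one of the files (again a hypothetical name) confirms that the reference list is gone:

tail 12345.txt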

And that's it! We now have all the presidential speeches that mention cybersecurity from 2014 to 2015 as text files with most of the unnecessary text removed, and we can proceed with some text or word analysis of the corpus.