Using HtmlUnit to Scrape Webpages

Background

HtmlUnit describes itself as a “GUI-Less browser for Java programs”. It models HTML documents and provides an API for loading pages, filling out forms, clicking links, and so on, just as you would in your “normal” browser.
http://htmlunit.sourceforge.net/

Problem

We want to use a headless browser to scrape a webpage for every <a> element and verify that each one has a non-empty title attribute. This will serve as a simple accessibility test.

Environment

I am using OS X, Eclipse for Java, and JUnit, but everything I cover can be applied to whatever environment you develop in. My environment is the one set up in a previous post, http://timothycope.com/?p=274

Solution - Using HtmlUnit to Scrape Webpages

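We’ll need to add HtmlUnit’s .jar file, along with the dependency .jar files that ship with it, to the Eclipse project’s build path. Once that’s done, we can create a WebClient, which is HtmlUnit’s browser object, and use it to load the page and inspect its anchors.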

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AccessibilityTest
{
	public static void CheckSelections()
	{
		// Run test
		CheckAnchorTitle();
	}

	private static void CheckAnchorTitle()
	{
		// Create a WebClient using HtmlUnit (a headless browser)
		WebClient webClient = new WebClient();

		// Create a new StringBuilder() for the log
		StringBuilder sb = new StringBuilder();
		sb.append("Check Anchor Title - Results:");

		try
		{
			// Get the HtmlPage
			HtmlPage page = webClient.getPage(AutoMater.url_SiteToTest);

			// Extract every <a> instance
			final List<DomElement> anchorList = page.getElementsByTagName("a");

			// For each <a> in anchorList
			for( int i = 0; i < anchorList.size(); i++ )
			{
				// Get the current <a>
				HtmlAnchor anchor = (HtmlAnchor) anchorList.get(i);

				// getAttribute returns an empty string when title is absent or empty
				if ( anchor.getAttribute("title").isEmpty() )
				{
					// Record the anchor that is missing a title
					sb.append(System.getProperty("line.separator"));
					sb.append("Missing Title: " + anchor);
				}
			}
		}
		catch (Exception ex)
		{
			// Write exception to string builder
			sb.append(System.getProperty("line.separator"));
			sb.append("Exception: " + ex);
		}
		finally
		{
			// Close the web client (newer HtmlUnit releases use close() instead)
			webClient.closeAllWindows();

			// Add the test execution information to the end of the log
			sb.append(TestTimer.Result());

			// Write the results to the log
			System.out.print(sb.toString());
		}
	}
}
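
One gap worth noting in the check above: isEmpty() lets a whitespace-only title such as title=" " slip through. As a minimal sketch, the rule could be pulled out into a small helper that treats null, empty, and whitespace-only values as missing. The class name, method names, and sample data below are my own assumptions for illustration; they are not part of HtmlUnit or of the test class above. A side benefit is that the rule can be exercised without loading a page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of HtmlUnit
class AnchorTitleRule
{
	// True when a title value is null, empty, or only whitespace
	static boolean isTitleMissing(String title)
	{
		return title == null || title.trim().isEmpty();
	}

	// Given anchor descriptions mapped to their title values,
	// return the descriptions whose title is missing, in input order
	static List<String> findMissingTitles(Map<String, String> titlesByAnchor)
	{
		List<String> missing = new ArrayList<String>();
		for (Map.Entry<String, String> entry : titlesByAnchor.entrySet())
		{
			if (isTitleMissing(entry.getValue()))
			{
				missing.add(entry.getKey());
			}
		}
		return missing;
	}

	public static void main(String[] args)
	{
		// Made-up sample data standing in for anchors scraped from a page
		Map<String, String> sample = new LinkedHashMap<String, String>();
		sample.put("/home", "Home page");
		sample.put("/about", "");
		sample.put("/contact", "   ");

		// Prints: Missing Title: [/about, /contact]
		System.out.println("Missing Title: " + findMissingTitles(sample));
	}
}
```

Inside the loop in CheckAnchorTitle(), the condition would then read AnchorTitleRule.isTitleMissing(anchor.getAttribute("title")), with everything else left as it is.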