Learn How to Retrieve Data From Internet Web Sites in C#
I can't count the number of times that I've needed to retrieve data from an Internet site. There are several reasons for this. For starters, I developed an Internet filtering technology that analyzes images, and I needed thousands of images for testing and training. Pulling them from the Internet in an automated way was the only practical means of obtaining them. I've also written crawlers (or spiders) that go from site to site. Sometimes I need an automated way of pulling web-based information, such as stock prices. Suffice it to say that I've pulled data from Internet sites many times for many different reasons.
The .NET framework makes the task of requesting data from a URL simple. In days of old, I used raw sockets. The code had to connect to a remote server, create a well-formed HTTP request, send the request, retrieve the data while monitoring for end-of-file conditions, and save it to memory or a disk file: lots of code and lots of room for errors. This article, though, covers the WebClient class, the simplest class the .NET framework offers for downloading data from Internet URLs. In later articles I'll talk about the WebRequest and WebResponse objects, and down the road we'll tackle Sockets.
Pulling Data With The WebClient Class
Pulling data from a URL with the WebClient class couldn't be easier. There are two different ways I use it: to save the data to a disk file, and to put the data into an in-memory buffer or string. Before you begin, though, you'll need to add a using statement for System.Net as follows.
using System.Net;
The next thing to note is that URLs must begin with "http://" if they are to be retrieved via the HTTP protocol. I created a helper method that takes care of this detail; it appears below.
// This helper method prepends "http://" to a URL if it isn't
// already there.
void PrependHTTP(ref string strURL)
{
    if (strURL.Length < 7 ||
        strURL.Substring(0, 7).ToUpper() != "HTTP://")
    {
        strURL = "http://" + strURL;
    }
}
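Here's a quick sketch of how the helper might be used before a download. The strURL variable is just for illustration:

string strURL = "www.rickleinecker.com/Default.htm";
PrependHTTP(ref strURL);
// strURL is now "http://www.rickleinecker.com/Default.htm" and is
// safe to pass to the WebClient methods shown below.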
There are two methods that can be used to retrieve data. The first is the DownloadFile method which saves the retrieved data to a disk file. The second is the DownloadData method which places the retrieved data into a byte array. The following two examples show how to create a WebClient object and retrieve data.
Using the DownloadFile method:
WebClient wc = new WebClient();
wc.DownloadFile("http://www.rickleinecker.com/Default.htm", "DiskFile.htm");
Using the DownloadData method:
WebClient wc = new WebClient();
byte[] data = wc.DownloadData("http://www.rickleinecker.com/Default.htm");
If you want a string instead of a byte array, you can use the Encoding.ASCII.GetString method as follows. (Remember that you need a using statement for System.Text in order to use the Encoding.ASCII.GetString method.)
WebClient wc = new WebClient();
byte[] data = wc.DownloadData("http://www.rickleinecker.com/Default.htm");
string strData = Encoding.ASCII.GetString(data);
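As an aside, if you're using version 2.0 or later of the .NET framework, WebClient also offers a DownloadString method that combines the download and conversion into a single call. It decodes using the WebClient.Encoding property rather than forcing ASCII:

WebClient wc = new WebClient();
// DownloadString fetches the resource and decodes it to a string
// in one call.
string strData = wc.DownloadString("http://www.rickleinecker.com/Default.htm");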
There's a demonstration program that lets you specify a URL to download. You can choose to save it as a disk file, show it as a string, or display it as an image. You can see the application in action in the figure below.
You can also download the demonstration program.
Using Image Data
There are two ways to use image data once it is retrieved. The first retrieves the data with the WebClient.DownloadFile method and then uses a Bitmap object to load the image from the disk file. The following code shows how to do this.
WebClient wc = new WebClient();
wc.DownloadFile("http://www.rickleinecker.com/MyImage.gif", "MyImage.gif");
Bitmap objBitmap = new Bitmap("MyImage.gif");
The drawback of this approach is that you have to make sure the current directory has write permissions and that the disk is not full. It also leaves the image on the disk unless you delete it at some point.
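If you stick with the disk-based approach, you can remove the temporary file yourself. One caution, a detail of GDI+ rather than of WebClient: the Bitmap keeps its source file locked for its lifetime, so dispose of it before deleting. A minimal sketch:

Bitmap objBitmap = new Bitmap("MyImage.gif");
// ... use the image here ...
objBitmap.Dispose();        // releases GDI+'s lock on the file
File.Delete("MyImage.gif"); // requires a using statement for System.IO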
A better approach is to use the WebClient.DownloadData method to retrieve the data into a byte array, create a MemoryStream object around the downloaded byte array, and let the Bitmap object decode the image from the MemoryStream object. In this way everything is done in memory and you don't have to worry about an unwanted disk file. The following code shows how to do this. (Remember that you need a using statement for System.IO in order to use a MemoryStream object.)
WebClient wc = new WebClient();
byte[] data = wc.DownloadData("http://www.rickleinecker.com/MyImage.gif");
MemoryStream ms = new MemoryStream(data);
Bitmap objBitmap = new Bitmap(ms);
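To display the result, you can hand the Bitmap to whatever control your UI uses. The one-liner below is only a sketch; it assumes a Windows Forms PictureBox named pictureBox1, which is my own placeholder rather than part of the demonstration program.

// Hypothetical WinForms control; assign the decoded bitmap to it.
pictureBox1.Image = objBitmap;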
The demonstration program allows you to retrieve and display images as seen in the following figure.
Parsing HTML Data
You can use the Microsoft HTML parser on downloaded data, too. The first thing you need to do is add a reference to MSHTML in your project; it usually appears as "Microsoft HTML Object Library" in the list of COM references. The following figure shows the MSHTML entry in the list of available COM references.
Once the data is downloaded and converted to a string, the following code will set up a parser object.
WebClient wc = new WebClient();
byte[] data = wc.DownloadData(strURL);
string strHTML = Encoding.ASCII.GetString(data);
mshtml.HTMLDocumentClass ms = new mshtml.HTMLDocumentClass();
mshtml.IHTMLDocument2 objMyDoc = (mshtml.IHTMLDocument2)ms;
objMyDoc.write(strHTML);
You can then use the IHTMLDocument2 object to get lists of images, anchors, and other HTML collections that you're interested in. The following code continues the previous fragment and gets all the anchors in the document.
mshtml.IHTMLElementCollection ec =
    (mshtml.IHTMLElementCollection)objMyDoc.links;
for (int i = 0; i < ec.length; i++)
{
    string strLink;
    mshtml.HTMLAnchorElementClass objAnchor;
    try
    {
        objAnchor = (mshtml.HTMLAnchorElementClass)ec.item(i, 0);
        strLink = objAnchor.href;
    }
    catch
    {
        continue;
    }
    // Do something with strLink here, such as adding it to a list.
}
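The other collections work the same way. As a variation of my own (not from the demonstration code), the following sketch walks the document's images collection instead, casting each element to IHTMLImgElement to read its src attribute:

mshtml.IHTMLElementCollection images =
    (mshtml.IHTMLElementCollection)objMyDoc.images;
for (int i = 0; i < images.length; i++)
{
    try
    {
        // IHTMLImgElement exposes the src attribute of an <img> tag.
        mshtml.IHTMLImgElement objImg =
            (mshtml.IHTMLImgElement)images.item(i, 0);
        string strSrc = objImg.src;
        // Do something with strSrc, such as queueing it for download.
    }
    catch
    {
        continue;
    }
}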
I created a demonstration crawler. It's fairly simple, but it could be used as the basis for a real crawler that examines HTML pages for data. You can download the code.
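To give a feel for how these pieces combine, here's a minimal crawler sketch of my own. It is not the code from the downloadable demonstration, and it assumes the generic collections (Queue<T>, HashSet<T>) from later framework versions plus using statements for System.Net, System.Text, and System.Collections.Generic:

// Crawl breadth-first, starting from one seed URL and visiting at
// most 100 pages. ExtractLinks wraps the MSHTML parsing shown above.
static void Crawl(string strSeedURL)
{
    Queue<string> pending = new Queue<string>();
    HashSet<string> visited = new HashSet<string>();
    pending.Enqueue(strSeedURL);

    WebClient wc = new WebClient();
    while (pending.Count > 0 && visited.Count < 100)
    {
        string strURL = pending.Dequeue();
        if (!visited.Add(strURL))
            continue;                       // already seen this page

        string strHTML;
        try
        {
            strHTML = Encoding.ASCII.GetString(wc.DownloadData(strURL));
        }
        catch (WebException)
        {
            continue;                       // skip pages that fail to load
        }

        foreach (string strLink in ExtractLinks(strHTML))
            pending.Enqueue(strLink);
    }
}

// Returns the href of every anchor in strHTML, using the MSHTML
// parsing technique from the previous section.
static List<string> ExtractLinks(string strHTML)
{
    List<string> links = new List<string>();
    mshtml.IHTMLDocument2 objMyDoc =
        (mshtml.IHTMLDocument2)new mshtml.HTMLDocumentClass();
    objMyDoc.write(strHTML);
    mshtml.IHTMLElementCollection ec =
        (mshtml.IHTMLElementCollection)objMyDoc.links;
    for (int i = 0; i < ec.length; i++)
    {
        try
        {
            mshtml.HTMLAnchorElementClass objAnchor =
                (mshtml.HTMLAnchorElementClass)ec.item(i, 0);
            if (objAnchor.href != null)
                links.Add(objAnchor.href);
        }
        catch
        {
            continue;
        }
    }
    return links;
}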
Conclusion
As you can see, pulling Internet data is easy. And you can use the WebClient class to create some pretty advanced applications such as crawlers.