To extract links from a webpage in C#, you will need some string processing. The trick is to scan the raw HTML source of the page, including the HTML that PHP pages output, for a few key sequences.
This article will stay simple and do nothing more than extract links, whether they are absolute or relative. A more advanced link extractor could go on to resolve relative links against the original domain.
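For readers who do want that extra step, here is a minimal sketch of resolving a relative link against the page it came from. It relies on the standard System.Uri class; the names LinkResolver and ToAbsolute are made up for this illustration and are not part of the downloadable source.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: combines the URL of the page with a link found on it.
    // Links that are already absolute come back unchanged.
    public static string ToAbsolute(string pageUrl, string link)
    {
        var baseUri = new Uri(pageUrl);

        // The Uri(Uri, string) constructor resolves "link" relative to "baseUri".
        return new Uri(baseUri, link).AbsoluteUri;
    }
}
```

Calling ToAbsolute("http://example.com/dir/page.html", "about.html") gives "http://example.com/dir/about.html". A malformed URL will throw a UriFormatException, so real code would want to catch that.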
Our C# program will offer two options: extract links from a URL, or from code entered directly. For the URL option, we will make use of the download data article.
Either way, the bottom line is to get the page's HTML source as a string.
Using C# to extract links from pages is easy as long as you stay organized. Links in HTML are written like this:
<a href="[link here]">Some Text</a>
The .NET IndexOf method gives programmers an easy way to quickly scan a string for a specific sequence. In this case, you want to search for <a href=" since that marks the start of each link.
The end sequence will not be </a> since we do not care about the anchor text. The end sequence should be the closing quotation mark of the link. (Note: to write a quotation mark inside a C# string, use the escape sequence \".)
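As a rough sketch of the idea so far, the snippet below pulls out only the first link in a string. The names FirstLinkDemo and FindFirstLink are invented for this example, and the case-insensitive comparison is an addition of this sketch so that <A HREF=" is matched as well.

```csharp
using System;

public static class FirstLinkDemo
{
    // Returns the first link found in html, or null if there is none.
    public static string FindFirstLink(string html)
    {
        // The start sequence; \" is how a quotation mark is written in a C# string.
        const string startMarker = "<a href=\"";

        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;           // IndexOf returns -1 when nothing matches

        start += startMarker.Length;          // skip past the marker to the link itself

        int end = html.IndexOf('"', start);   // the closing quotation mark of the link
        if (end < 0) return null;             // malformed tag with no closing quote

        return html.Substring(start, end - start);
    }
}
```

Because it looks for that exact start sequence, this sketch will miss anchors that put another attribute before href or that use single quotes.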
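You will encounter a problem though. How do you extract all links if you don't know how many there are? Use a while loop that keeps searching until IndexOf returns -1, which means no further link was found. Also make sure to discard the parts of the raw code that the application has already scanned, so the same link is not found twice.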
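One way the loop could look is sketched below. This is not the downloadable source, only an illustration under the same assumptions as the text: links start with <a href=" and end at the next quotation mark. The names LinkExtractor and ExtractLinks are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LinkExtractor
{
    // Collects every link that follows the <a href=" sequence in html.
    public static List<string> ExtractLinks(string html)
    {
        const string startMarker = "<a href=\"";
        var links = new List<string>();
        string remaining = html;

        while (true)
        {
            int start = remaining.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;             // no more links: leave the loop

            start += startMarker.Length;

            int end = remaining.IndexOf('"', start);
            if (end < 0) break;               // unterminated link: stop scanning

            links.Add(remaining.Substring(start, end - start));

            // Toss out everything up to and including the closing quote,
            // so the next pass only sees code that has not been scanned yet.
            remaining = remaining.Substring(end + 1);
        }

        return links;
    }
}
```

Cutting the string down on every pass follows the approach described above. On very large pages, passing a moving start index to IndexOf would avoid the repeated copying, at the cost of slightly more bookkeeping.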
The downloadable source code at the bottom of this page contains the complete C# code for extracting links from a webpage...