Scraping, or Programmatically Accessing, a Secure Webpage

Monday, August 25, 2008 11:07 PM

There are many secure websites out there that provide useful information but do not have a public API to access their data. A prime example of this is the LinkedIn website. You might love to gather some info from LinkedIn, but their promise to deliver a public API has yet to come to fruition. The problem is that the pages with all the good data are secure, requiring the user to log in before accessing them. So what if we want to scrape the data from these pages programmatically? We need to authenticate to access them, and we can do that by reusing the authentication cookie that the site gives us when we log in with a browser.

Note: I've mentioned LinkedIn as an example of a secure site to programmatically access data from. It's actually a violation of LinkedIn's user agreement to scrape data from its site. The techniques here apply to any website that uses form-based authentication, built on ASP.NET or anything else.

Before we move on with this, I wanted to state a few assumptions with the approach I'll be showing here:

You already have browser-based access to the secure pages (meaning you have a user account).
You're OK to use your own authentication cookie to access the secure pages without violating some site agreement. Doing this sort of thing on a site that prohibits it can get you banned from the site. Remember, you'll be using something that can link these requests back to your own user account.

When you visit a webpage that requires some sort of form-based authentication, usually there is an authentication token stored in a cookie. This certainly is the case with any ASP.NET site using Forms Authentication and is the case with LinkedIn as well as about any other similar site out there. Even if it isn't a persistent cookie, and is only active for the session, there still is a cookie. This cookie is passed back to the website in the request header each time a page is accessed. We can sniff out the data for that cookie using Firebug (the awesomely-awesome Firefox addon), or using Fiddler for IE.

Using Firebug, we can access the secure page and take a look at the cookie value in the header. For my example, I'll be using a sample ASP.NET site using forms authentication, but this all works the same for non-ASP.NET sites too.

If you're using Fiddler, just access the page using IE and then you'll see the cookie data by going to the Headers section under the Session Inspector.

For an ASP.NET site using forms authentication, the authentication token name is set by the "name" attribute of the forms element in the authentication section of the web.config. By default that name is ".ASPXAUTH", but for a site you don't control you won't know whether it has been changed, and the site might not even be an ASP.NET site. That is OK. You can usually pick out the authentication token in the cookie data, or just use the entire cookie.
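As a rough sketch of picking the token out of the cookie data: the helper below takes the raw Cookie header value copied from Firebug or Fiddler and returns just the name=value pair for one cookie. The class and method names are hypothetical, and the cookie values in the usage note are made up.

```csharp
using System;

// Hypothetical helper, not part of the original post.
static class CookieData
{
    // Returns "name=value" for the named cookie found in a raw Cookie header
    // value such as "a=1; .ASPXAUTH=ABC123; b=2", or null when it is absent.
    public static string PickCookie(string rawCookieHeader, string name)
    {
        foreach (string part in rawCookieHeader.Split(';'))
        {
            string pair = part.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip empty or malformed pieces

            string cookieName = pair.Substring(0, eq);
            // Compare the names exactly, character for character
            if (string.Equals(cookieName, name, StringComparison.Ordinal))
                return pair;
        }
        return null;
    }
}
```

For example, CookieData.PickCookie("ASP.NET_SessionId=xyz; .ASPXAUTH=ABC123", ".ASPXAUTH") returns ".ASPXAUTH=ABC123", which can be sent as the cookie value on its own.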

Now, using that cookie, we can use the following code to access the secure webpage:

using System.Net;
using System.IO;
//...
 
 
//grab cookie authentication token from Firebug/Fiddler and add in here
string cookiedata = ".ASPXAUTH=FB8ADA49D4BFE4EF531A4539D0B74CCA6762F9CC6F62C8E...";

HttpWebRequest request = HttpWebRequest.Create("http://somesite.com/securepage.aspx") as HttpWebRequest;
//set the user agent so it looks like IE to not raise suspicion 
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
request.Method = "GET";
//set the cookie in the request header
request.Headers.Add("Cookie", cookiedata);

//get the response from the server; the using blocks dispose the response,
//stream and reader even if reading fails
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    string pagedata = reader.ReadToEnd();
    //now we can scrape the contents of the secure page as needed
    //since the page contents are now stored in our pagedata string
}

One thing to point out here: since we're passing the entire contents of the cookie, I'm just adding it as a whole to the request header instead of adding each cookie element to the request.CookieContainer.
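For comparison, adding each cookie element to a CookieContainer instead could look something like the sketch below, which splits the copied cookie string into individual cookies. The class name and the "/" path are assumptions of mine, and it assumes the cookie values contain no commas, which the Cookie class does not accept.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original post.
static class CookieJar
{
    // Splits a raw Cookie header value ("a=1; b=2") into individual cookies
    // and adds each one to a new CookieContainer scoped to the given site.
    public static CookieContainer FromHeader(string rawCookieHeader, Uri site)
    {
        CookieContainer container = new CookieContainer();
        foreach (string part in rawCookieHeader.Split(';'))
        {
            string pair = part.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0) continue; // skip empty or malformed pieces

            Cookie cookie = new Cookie(pair.Substring(0, eq), pair.Substring(eq + 1));
            cookie.Path = "/"; // assumption: send the cookie for every path on the site
            container.Add(site, cookie);
        }
        return container;
    }
}
```

With this in place, the request.Headers.Add("Cookie", cookiedata) line could be swapped for request.CookieContainer = CookieJar.FromHeader(cookiedata, request.RequestUri). A container also picks up any new cookies the server sets on the response, which the header approach ignores.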

If I wire that up in a form, we'll quickly see the secure page, since the site will see the authentication token in the cookie sent in the request header and will not redirect us to the login page.
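To check from code whether the cookie was accepted, one option is to turn off automatic redirects and look at the status code that comes back. This is only a sketch: the class and method names are hypothetical, and treating any redirect whose Location contains "login" as a bounce to the login page is an assumption that will not hold for every site.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original post.
static class LoginRedirectCheck
{
    // Builds a GET request that carries the cookie but does not follow
    // redirects, so a bounce to the login page stays visible as a 3xx status.
    public static HttpWebRequest CreateRequest(string url, string cookieData)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.AllowAutoRedirect = false;
        request.Headers.Add("Cookie", cookieData);
        return request;
    }

    // True when the status is a redirect (3xx) and the target looks like a
    // login page. Assumption: the login URL contains the word "login".
    public static bool LooksLikeLoginRedirect(HttpStatusCode status, string location)
    {
        int code = (int)status;
        bool isRedirect = code >= 300 && code < 400;
        return isRedirect
            && location != null
            && location.IndexOf("login", StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

After calling GetResponse() on a request built this way, pass response.StatusCode and response.Headers["Location"] to LooksLikeLoginRedirect. A true result suggests the cookie has expired or was not the authentication token.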

Not bad. We can now scrape any data we'd like from the secure page contents. Just to point out, we're not just limited to reading the data. We can also send form data back to the site as a POST on secure pages as well.
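A POST works the same way, with the cookie added to the request header as before. The sketch below is a minimal illustration: the class and method names are hypothetical, as are any field names you pass in. An ASP.NET Web Forms page will typically also expect its hidden fields (such as __VIEWSTATE) to be posted back, which this sketch does not handle.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of the original post.
static class SecurePost
{
    // URL-encodes name/value pairs into a form body such as "a=1&b=two%20words".
    public static string EncodeForm(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder body = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            if (body.Length > 0) body.Append('&');
            body.Append(Uri.EscapeDataString(field.Key));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value));
        }
        return body.ToString();
    }

    // Posts the form body to the URL with the browser's cookie attached
    // and returns the response page as a string.
    public static string Send(string url, string cookieData, string formBody)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(formBody);

        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = bytes.Length;
        request.Headers.Add("Cookie", cookieData);

        // write the form data into the request body
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(bytes, 0, bytes.Length);
        }

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

EncodeForm builds the body from whatever fields the target form uses, and Send takes the same cookiedata string as the GET example.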

Comments

Carlo Mendoza 8/26/2008 4:59 AM

This was a nice read. I am assuming this would be limited to your profile and your connections, correct? I certainly hope so; I'll just have to watch who's in my network, if that's the case.
Miron 8/26/2008 9:44 AM

Good stuff here.
Thanks
Ryan Farley 8/26/2008 3:22 PM

@Carlo, in the case of using this technique with LinkedIn, you are correct. You'll be accessing the site with your own authentication details, i.e. as yourself, so you'll have access to anything that your user has access to. As far as accessing other details, you'd be amazed at how much public data is accessible via LinkedIn through public profiles as well.

However, this post is meant more to focus on how to access a secure page from any site by passing along your own cookie data in the request, not just for LinkedIn (but it does work for that as well, just be careful to not get banned for a user agreement violation).

-Ryan
Paul 8/27/2008 4:31 AM

Nice, hard coded cookie in program! Please send me sites which generate one session key per user. I mean thats this methods is a bug.
Ryan Farley 8/27/2008 7:47 AM

Hi Paul, you'd be amazed at how many well known sites this method works with. It works with about any site I've tried it with. This method obviously isn't intended for commercial, public or distributable software since it contains your own authentication token.
Paul Boch 8/27/2008 8:23 AM

A service I've used before for getting secure data is Mozenda. They even had an easy-to-use REST API to get the information from them when I was done.
Ryan Farley 8/27/2008 10:41 PM

@Paul, mozenda looks cool. I'll take a look.
Average Joe 8/30/2008 8:54 PM

Why don't you try to actually help the world by creating a post about how to stop jerks from ripping off your site content, rather than politely creating a website about how to steal? I have a small local website and my cross-town rivals rip off my content every day, and it sucks and is not fair. I went through about a year to cleanse data and license it and they just rip it off. Why do you support theft??? This whole post is ridiculous.
Ryan Farley 8/30/2008 9:00 PM

@Average Joe, how do you take this post to be about theft? Or ripping off your site content? Scraping isn't just about "stealing" info from other sites. There are many legitimate uses for scraping as well. However, this post isn't really about scraping specifically. It is about taking your cookie contents and passing them back programmatically in the request to the website. Your cross-town rivals aren't likely using a technique like this to steal your content anyway. I'm sorry you see an article about a legitimate programming technique as condoning stealing. Did I not mention to be aware of the website's user agreements before doing this?
Dan 9/4/2008 9:08 AM

@Average Joe: Your anger is misguided. Stealing information from a website doesn't happen because of articles like this. This article is a tool for developers, like me, who want to use it for legitimate commercial uses. According to your logic you should also be angry at the person who invented the Print Screen button because that can also be used to rip off your website. Or how about the Print or Save As button in web browsers? Should those features all disappear because your site is getting ripped off? Of course not. So do yourself a favor and get your head on straight.
Thom 10/2/2008 9:27 PM

Ryan, love the site. I just stumbled across this article and have had many experiences with screen scraping. So much so that I've created a class that allows a username and password to be entered and then scrapes Hotmail, Yahoo or GMail through a front end interface. Needless to say, friends who work at establishments that don't allow email of this type love it as they can bounce through my site and receive and send email and attachments. Very popular. So popular in fact that I had to restrict users to only known associates as it was chewing up bandwidth at a ridiculous rate.

Basically it further automates your example above. Although I'm sure you have either done or envisioned the possibilities of expanding your example.

Again, love the blog. Between you and - http://weblogs.asp.net/pleloup/archive/2008/09/22/what-are-the-hidden-features-of-asp-net.aspx I've realized that I've got a lot of tips to catch up on with regards to .NET.

Keep up the good work!

P.S. That blog about setting browser specific asp.net server control properties. How did I not come across this earlier? Thanks for the great tip!

Thom
Ryan Farley 10/3/2008 9:43 AM

Thanks for the comment Thom. I thought the same thing when I came across those items. Always fun to find something that you never knew was there before.

-Ryan
Larry Connell 10/18/2008 7:17 AM

I can understand why Average Joe would be angry at something like this but anything that has a legitimate use can also be used for illegitimate purposes.
Rem 11/4/2008 3:59 AM

Nice post...
visit also asp.net example
Agro 12/22/2008 9:30 PM

good stuff .. do u have any links to indepth examples thanks
Rick Johnson 1/22/2009 5:18 PM

Is it possible to gather info from the above referenced site for marketing purposes? The site requires a password which I have to enter before accessing any info.
Rockballad 2/17/2009 12:54 AM

Hi! Your approach is simple. Simply is the best, I think! Thanks!
BTW, could you tell me if a webmaster can see if I use this way to access a page (on my account of course) ? I've tried, the old cookie is still able to reused to access a page. Does he recognize the access? Thanks again!
Ivo 3/13/2009 10:19 PM

Hi Ryan,
I like your example. Would it be possible to get the complete source code?
Thx
Ivo
Pat Kash 4/10/2009 8:51 AM

I like KISS (Keep It Short and Simple)approach and it is one of them. Great work.
erm 5/18/2009 9:11 AM

Fantastic, thank you. I owe you exactly one hug!!
Eric 8/6/2009 9:23 AM

There were a couple of comments earlier about Mozenda. I think Mozenda's fantastic - affordable, user-friendly, and scrapes all the information for you right into a database.
sexy corsets 10/29/2009 9:58 PM

View the source of the page and use the WebRequest class to do the posting. No need to drive IE. Just figure out what IE is sending to the server and replicate that. Using a tool like Fiddler will make it even easier.
Ajit Gupta 3/19/2010 7:35 AM

Hi Friends,

I am a .NET developer. I've created an application where I want to scrape secure pages of a website on HTTPS. I've used the above code, but I'm not able to scrape that page; every time I am getting the HTML of the login page.

Please advise what I should do to fulfill my requirement.

looking forward to hearing back from you soon.

Many Thanks
Ajit Kumar Gupta
bojanskr 5/27/2010 5:33 AM

This is cool... exactly what you need in certain situations... but is there a way to do this programmatically?

For example, I log in to a website (POST) and then read the response... it does not return the .ASPXAUTH cookie in the CookieContainer that I use to track the session.

Anyone has a clue?
Chris Sousa 1/12/2011 7:57 AM

Hey Ryan,

Great site! I really like your blog area.
Are you using an open source blog? If so, which are you using?
My site sucks, it's an embarrassment. I need to put a little time into it. Untouched since 2005. I believe I have code to contribute. I would also like to reference your site.

Thanks for taking the time to read and or reply.

Chris Sousa