Thursday, November 17, 2011

CAPTCHA Hax With TesserCap

This blog post was voted 8th in the Top 10 Web Hacking Techniques of 2011 poll.

With the goal of creating a tool that helps security professionals and developers test their CAPTCHA schemes, I researched over 200 high-traffic websites and several CAPTCHA service providers listed in Quantcast’s Top 1 Million website ranking.

During the same time frame, students at Stanford University conducted similar research (PDF). Both efforts concluded the obvious:

An alarming number of CAPTCHA schemes are vulnerable to automated attacks.

I looked around, tested and zeroed in on Tesseract-OCR as my OCR engine. To remove color complexities, spatial irregularities, and other types of random noise from CAPTCHAs, I decided to write my own image preprocessing engine. After a few months of research, coding and testing in my spare time, TesserCap was born and is ready for release now.
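To make the preprocessing idea concrete, here is a minimal sketch of the kind of steps such an engine might chain together: collapse color to grayscale, binarize with a cutoff, and clear isolated specks. This is an illustration only and not TesserCap's actual engine; the function names, the cutoff of 128, and the neighbor rule are all assumptions made for the example.

```python
# Illustrative sketch only; NOT TesserCap's actual preprocessing engine.
# An image is modeled as a list of rows, each row a list of (r, g, b) tuples.
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(image: List[List[Pixel]]) -> List[List[int]]:
    """Collapse RGB to one luminance channel (ITU-R BT.601 weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in image
    ]


def binarize(gray: List[List[int]], cutoff: int = 128) -> List[List[int]]:
    """Map dark pixels to 1 (ink) and light pixels to 0 (background)."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


def despeckle(binary: List[List[int]], min_neighbors: int = 1) -> List[List[int]]:
    """Clear ink pixels that have fewer than min_neighbors ink pixels around them."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            # Count ink pixels among the (up to) 8 surrounding pixels.
            neighbors = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned


def preprocess(image: List[List[Pixel]], cutoff: int = 128,
               min_neighbors: int = 1) -> List[List[int]]:
    """Run the three hypothetical stages in order."""
    return despeckle(binarize(to_grayscale(image), cutoff), min_neighbors)
```

The cleaned binary image is what would then be handed to the OCR engine. In practice the cutoff and noise rules need tuning per CAPTCHA type, which is why a configurable engine is useful.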

TesserCap is a GUI-based, point-and-shoot CAPTCHA analysis tool with the following features:
  1. A generic image preprocessing engine that can be configured as per the CAPTCHA type being analyzed.
  2. Tesseract-OCR as its OCR engine to retrieve text from preprocessed CAPTCHAs.
  3. Web proxy support
  4. Support for custom HTTP headers to retrieve CAPTCHAs from websites that require cookies or special HTTP headers in requests
  5. CAPTCHA statistical analysis support
  6. Character set selection for the OCR Engine
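As a rough illustration of features 3 and 4, the sketch below prepares a CAPTCHA image request that carries custom headers and is routed through a web proxy. It uses Python's standard `urllib.request` and is not how TesserCap (a .NET application) implements this; the function name `build_captcha_request`, the URL, the cookie, and the proxy address are placeholders assumed for the example.

```python
# Illustrative sketch only; TesserCap itself is a .NET application.
import urllib.request


def build_captcha_request(url, headers=None, proxy=None):
    """Return (opener, request) prepared to fetch a CAPTCHA image.

    headers: optional dict of extra HTTP headers (e.g. a session cookie).
    proxy:   optional proxy URL such as "http://127.0.0.1:8080".
    """
    request = urllib.request.Request(url)
    for name, value in (headers or {}).items():
        request.add_header(name, value)

    handlers = []
    if proxy:
        # Send both http and https traffic through the same proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    return opener, request


if __name__ == "__main__":
    # Placeholder values; nothing is sent until opener.open() is called.
    opener, request = build_captcha_request(
        "http://example.com/captcha.png",
        headers={"Cookie": "session=abc"},
        proxy="http://127.0.0.1:8080",
    )
    # image_bytes = opener.open(request, timeout=10).read()
```

Building the request separately from sending it makes it easy to inspect the headers before any traffic leaves the machine.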
An example of TesserCap's image preprocessing and a sample run against Wikipedia (Wikimedia’s Fancy CAPTCHA) is shown below:



Downloads

TesserCap and its user manual can be downloaded from one of the following locations:

Results

The two tables below summarize the CAPTCHA analysis performed using TesserCap for a few popular websites and some CAPTCHA service providers. All of these tests were performed using TesserCap’s image preprocessing module and Tesseract-OCR’s default training data.







Website            Accuracy*    Quantcast Rank
wikipedia          20-30%       7
ebay               20-30%       11
reddit.com         20-30%       68
CNBC               50+%         121
foodnetwork.com    80-90%       160
dailymail.co.uk    30+%         245
megaupload.com     80+%         1000
pastebin.com       70-80%       32,534
cavenue.com        80+%         149,645




CAPTCHA Provider      Accuracy*
captchas.net          40-50%
opencaptcha.com       20-30%
snaphost.com          60+%
captchacreator.com    10-20%
www.phpcaptcha.org    10-20%
webspamprotect.com    40+%
ReCaptcha             0%



*This accuracy may be further increased by training the Tesseract-OCR engine on the CAPTCHAs under test.

Wikipedia






OpenCaptcha Preprocessing






OpenCaptcha Sample Run




Reddit




eBay

11 comments:

Vamsi Chandra said...

Hi,

Very interesting article! We are faced with a similar situation. Could you advise us on what you think are the best CAPTCHA solutions available out there? Thanks!

Regards,
Vamsi
vamsic@ivycomptech.com

Gursev Kalra said...

Hi Vamsi,

I suggest you guys look at deploying Google's reCAPTCHA or Microsoft's Asirra on your website. reCAPTCHA has been tested quite comprehensively, and Google often updates the CAPTCHA generation algorithms. Asirra (http://research.microsoft.com/en-us/um/redmond/projects/asirra/) is a new initiative by Microsoft that uses images of animals (cats and dogs). These two are free, and there are several paid CAPTCHA providers out there that you can use in your application.

Anonymous said...

I am trying to use it, but when I enter a URL on the main tab, after a few seconds it just says the test completed and nothing changes or shows. Did I do something wrong?
Thanks

Gursev Kalra said...

@Anonymous, please check the logs. They may have some additional information if there is an error condition.

James said...

6/7/2012 11:42:12 PM
System.UriFormatException: Invalid URI: The URI is empty.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at TesserCap.Misc.ValidUrlFormat(String url)

6/7/2012 11:44:11 PM
System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at TesserCap.Misc.Retrieve200OKContent(String url, String proxyAddress, String proxyPort, Boolean followRedirect, String headers)

6/7/2012 11:44:16 PM
System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at TesserCap.Misc.Retrieve200OKContent(String url, String proxyAddress, String proxyPort, Boolean followRedirect, String headers)

6/7/2012 11:44:36 PM
System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at TesserCap.Misc.Retrieve200OKContent(String url, String proxyAddress, String proxyPort, Boolean followRedirect, String headers)

6/7/2012 11:44:48 PM
System.Net.WebException: The remote server returned an error: (404) Not Found.
at System.Net.HttpWebRequest.GetResponse()
at TesserCap.Misc.Retrieve200OKContent(String url, String proxyAddress, String proxyPort, Boolean followRedirect, String headers)

6/7/2012 11:52:13 PM
System.UriFormatException: Invalid URI: The format of the URI could not be determined.
at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
at TesserCap.Misc.ValidUrlFormat(String url)

6/7/2012 11:53:14 PM
System.ArgumentException: Parameter is not valid.
at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
at TesserCap.Misc.IsImage(Byte[] img)

6/7/2012 11:54:10 PM
System.ArgumentException: Parameter is not valid.
at System.Drawing.Image.FromStream(Stream stream, Boolean useEmbeddedColorManagement, Boolean validateImageData)
at TesserCap.Misc.IsImage(Byte[] img)

6/7/2012 11:59:56 PM
System.ArgumentException: Specified value has invalid HTTP Header characters.
Parameter name: name
at System.Net.WebHeaderCollection.CheckBadChars(String name, Boolean isHeaderValue)
at System.Net.WebHeaderCollection.Add(String name, String value)
at TesserCap.Misc.AddHeaders(HttpWebRequest q, String headers)

6/7/2012 11:59:56 PM
System.ArgumentException: Specified value has invalid HTTP Header characters.
Parameter name: name
at System.Net.WebHeaderCollection.CheckBadChars(String name, Boolean isHeaderValue)
at System.Net.WebHeaderCollection.Add(String name, String value)
at TesserCap.Misc.AddHeaders(HttpWebRequest q, String headers)

6/7/2012 11:59:56 PM
System.ArgumentException: Specified value has invalid HTTP Header characters.
Parameter name: name
at System.Net.WebHeaderCollection.CheckBadChars(String name, Boolean isHeaderValue)
at System.Net.WebHeaderCollection.Add(String name, String value)
at TesserCap.Misc.AddHeaders(HttpWebRequest q, String headers)

6/7/2012 11:59:56 PM
System.ArgumentException: Specified value has invalid HTTP Header characters.
Parameter name: name
at System.Net.WebHeaderCollection.CheckBadChars(String name, Boolean isHeaderValue)
at System.Net.WebHeaderCollection.Add(String name, String value)
at TesserCap.Misc.AddHeaders(HttpWebRequest q, String headers)


hmm?

Gursev Kalra said...

It appears that the .NET HTTP library considers the header values that you are supplying to TesserCap invalid. Does your application use custom HTTP headers?
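For readers hitting the same "invalid HTTP Header characters" error: it is raised when a header name contains characters that are not allowed, such as spaces. The sketch below shows one way to pre-check a block of headers before handing it to an HTTP client. It assumes one `Name: value` pair per line, which is a guess and not TesserCap's documented input format, and `parse_headers` is a hypothetical helper written for this example.

```python
# Illustrative sketch only; the "Name: value" per-line input format is an
# assumption, not TesserCap's documented header syntax.
import re

# Header names must be HTTP "token" characters (RFC 7230): no spaces,
# colons, or other separators.
_TOKEN = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")


def parse_headers(raw):
    """Split raw text into (valid, rejected).

    valid:    dict mapping header name to value
    rejected: list of lines with a missing colon or an illegal header name
    """
    valid = {}
    rejected = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values like URLs stay intact.
        name, separator, value = line.partition(":")
        if not separator or not _TOKEN.match(name):
            rejected.append(line)
            continue
        valid[name] = value.strip()
    return valid, rejected
```

Any line that lands in `rejected` is a likely cause of the exception and can be corrected before retrying.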

Jmptrsn said...

I am confused. Would it be possible to talk to you on Windows Live Messenger or something for a few minutes for some help?

Gursev Kalra said...

Can you host the CAPTCHAs somewhere and share the URL?

Jmptrsn said...

http://freepicupload.com/images/735workimage_tn.jpg they are just like that?

Can I just explain what I have been trying to do?
I am trying to make a bot/script for a game that shows a 4-digit CAPTCHA every 15 minutes.

Gursev Kalra said...

So check these sample settings. The results aren't accurate for the sample you sent, but you will get the idea of how to remove the noise.

http://freepicupload.com/images/387sample_settings.png

Jmptrsn said...

Very nice, how did you get that picture working?
I am also not sure about everything else. I have been reading your blog and other people's blogs, and I am just more confused about how to do this.