KNN for CAPTCHA Recognition

Source: Qiu Kang singasong

https://segmentfault.com/a/1190000006070219

Introduction

I previously developed a campus dating app in which one step was to confirm that a user is a student by logging in to their school's academic system. The basic idea was to use the user's account and password to scrape their information. However, many academic systems protect the login with a CAPTCHA. Back then, our server downloaded the CAPTCHA and forwarded it to the client, the user typed in the CAPTCHA along with their account and password, and the server then simulated a login to the academic system to check whether the credentials worked. CAPTCHAs ruined our plan for quick user authentication, but we had no alternative at the time. Recently, after reading some machine learning material, I realized that the very simple CAPTCHAs used by most schools could be cracked with KNN. So I organized my thoughts and rolled up my sleeves to get started!

Analysis

Our school's CAPTCHA looks like this: [image]

It is actually just rotated characters with some slight noise added. To recognize it, we reverse that process: first binarize to remove noise, then segment out the individual characters, and finally rotate each one to a standard orientation. From these processed images we select templates. Every time a new CAPTCHA comes in, we process it the same way, compare each character with the templates, and take the template with the smallest distance as the result (this is the idea of KNN, with K=1 in this article). The steps are explained one by one below.
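The nearest-template idea can be sketched in a few lines of plain Python. This is only an illustration, not the MATLAB code used later in the article: the names `pixel_distance`, `classify_1nn`, and `TEMPLATES` are hypothetical, and the 3x3 grids are made-up stand-ins for real character images.

```python
def pixel_distance(a, b):
    """Sum of absolute pixel differences between two equal-sized 0/1 grids."""
    return sum(abs(pa - pb) for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify_1nn(char_img, templates):
    """Return the label of the template closest to char_img (KNN with K=1).

    templates is a list of (label, grid) pairs; several pairs may share a label.
    """
    best_label, best_dist = None, None
    for label, grid in templates:
        dist = pixel_distance(char_img, grid)
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical 3x3 "characters", made up purely for illustration.
TEMPLATES = [
    ("I", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("L", [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
]

noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]  # an "L" with one pixel missing
print(classify_1nn(noisy_l, TEMPLATES))  # prints L
```

Because several templates may carry the same label, adding more templates for a character simply gives the search more chances to find a close match.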

Obtaining the CAPTCHA

First, we need a large number of CAPTCHAs, which we can achieve through web scraping. The code is as follows:

    # -*- coding: UTF-8 -*-
    # Python 2 script (urllib2 and cookielib do not exist in Python 3)
    import urllib2, cookielib

    def getchk(number):
        # Create cookie object
        cookie = cookielib.LWPCookieJar()
        cookieSupport = urllib2.HTTPCookieProcessor(cookie)
        opener = urllib2.build_opener(cookieSupport, urllib2.HTTPHandler)
        urllib2.install_opener(opener)
        # First connect to the academic system to obtain the cookie
        # Pretend to be a browser
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip,deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'
        }
        req0 = urllib2.Request(
            url='http://mis.teach.ustc.edu.cn',
            headers=headers  # Request header
        )
        # Catch HTTP errors
        try:
            result0 = urllib2.urlopen(req0)
        except urllib2.HTTPError, e:
            print e.code
        # Extract cookie as "name=value" pairs separated by "; "
        getcookie = []
        for item in cookie:
            getcookie.append(item.name + "=" + item.value)
        getcookie = "; ".join(getcookie)
        # Modify headers
        headers["Origin"] = "http://mis.teach.ustc.edu.cn"
        headers["Referer"] = "http://mis.teach.ustc.edu.cn/userinit.do"
        headers["Content-Type"] = "application/x-www-form-urlencoded"
        headers["Cookie"] = getcookie
        # Download the CAPTCHA image `number` times
        for i in range(number):
            req = urllib2.Request(
                url="http://mis.teach.ustc.edu.cn/randomImage.do?date='1469451446894'",
                headers=headers  # Request header
            )
            response = urllib2.urlopen(req)
            status = response.getcode()
            picData = response.read()
            if status == 200:
                localPic = open("./source/" + str(i) + ".jpg", "wb")
                localPic.write(picData)
                localPic.close()
            else:
                print "failed to get Check Code"

    if __name__ == '__main__':
        getchk(500)

We downloaded 500 CAPTCHAs to the source directory, as shown in the image: [image]

Binarization

The rich image processing functions in MATLAB can save us a lot of time. We traverse the source folder, perform binarization on each CAPTCHA image, and store the processed images in the bw directory. The code is as follows:

    mydir = './source/';
    bw = './bw/';
    % Make sure the directory path ends with a separator
    if mydir(end) ~= '/' && mydir(end) ~= '\'
        mydir = [mydir, '/'];
    end
    DIRS = dir([mydir, '*.jpg']); % File extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = rgb2gray(img); % Grayscale
            img = im2bw(img);    % Binarize to 0/1
            name = strcat(bw, DIRS(i).name)
            imwrite(img, name);
        end
    end

The processing result is shown in the image: [image]
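For readers without MATLAB, the two steps above (grayscale conversion, then a fixed threshold) can be sketched in plain Python. The luminance weights and the 0.5 threshold are what I understand rgb2gray and im2bw to use by default, so treat them as assumptions; `rgb_to_gray`, `binarize`, and the tiny sample image are hypothetical names and data for illustration only.

```python
def rgb_to_gray(pixel):
    """Weighted luminance of an (R, G, B) pixel with 0-255 channels."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b


def binarize(rgb_image, level=0.5):
    """Map each pixel to 1 (bright) or 0 (dark) using a fixed threshold.

    level is a fraction of the full 0-255 range.
    """
    threshold = level * 255
    return [[1 if rgb_to_gray(p) > threshold else 0 for p in row] for row in rgb_image]


# Hypothetical 2x3 image: dark strokes on a light, slightly uneven background.
image = [
    [(250, 250, 250), (20, 20, 60), (250, 250, 250)],
    [(200, 210, 190), (30, 10, 40), (240, 240, 240)],
]
print(binarize(image))  # prints [[1, 0, 1], [1, 0, 1]]
```

After this step the background is 1 and the character strokes are 0, which is why the later code inverts the image before looking for connected regions.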

Segmentation

    mydir = './bw/';
    letter = './letter/';
    % Make sure the directory path ends with a separator
    if mydir(end) ~= '/' && mydir(end) ~= '\'
        mydir = [mydir, '/'];
    end
    DIRS = dir([mydir, '*.jpg']); % File extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = im2bw(img); % Binarization
            img = 1 - img;    % Color inversion to make characters connected domains, convenient for removing noise
            for ii = 0:3
                region = [ii*20+1, 1, 19, 20]; % Divide a CAPTCHA into four 20*20 character images
                subimg = imcrop(img, region);
                imlabel = bwlabel(subimg);
                % imshow(imlabel);
                if max(max(imlabel)) > 1 % More than one region: there are noise points to remove
                    % max(max(imlabel))
                    % imshow(subimg);
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    area(maxindex) = 0;
                    secondindex = find(area == max(area));
                    imindex = ismember(imlabel, secondindex);
                    subimg(imindex == 1) = 0; % Remove the second largest connected domain; noise cannot be larger than the character, so the second largest is noise
                end
                % File name: original name without ".jpg", plus the character index
                name = strcat(letter, DIRS(i).name(1:length(DIRS(i).name)-4), '_', num2str(ii), '.jpg')
                imwrite(subimg, name);
            end
        end
    end

The processing result is shown in the image: [image]
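The noise-removal step rests on connected regions: the character is the biggest blob in its cell, and anything smaller is noise. Below is a plain-Python sketch of that idea using a flood fill over 8-connected neighbors. Note that it is a variant, not a port: it keeps only the largest region, whereas the MATLAB code above removes just the second largest. The names `label_components` and `keep_largest_component` and the sample cell are hypothetical.

```python
def label_components(grid):
    """Group foreground (1) pixels into 8-connected components.

    Returns a list of components, each a list of (row, col) coordinates.
    """
    height, width = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(height):
        for c in range(width):
            if grid[r][c] != 1 or (r, c) in seen:
                continue
            seen.add((r, c))
            stack, component = [(r, c)], []
            while stack:
                cr, cc = stack.pop()
                component.append((cr, cc))
                # Visit the 3x3 neighborhood around the current pixel
                for nr in range(cr - 1, cr + 2):
                    for nc in range(cc - 1, cc + 2):
                        inside = 0 <= nr < height and 0 <= nc < width
                        if inside and grid[nr][nc] == 1 and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
            components.append(component)
    return components


def keep_largest_component(grid):
    """Return a copy of grid in which only the largest component survives."""
    components = label_components(grid)
    cleaned = [[0] * len(row) for row in grid]
    if components:
        for r, c in max(components, key=len):
            cleaned[r][c] = 1
    return cleaned


# Hypothetical 4x5 character cell: a vertical stroke plus one isolated noise pixel.
cell = [
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
print(keep_largest_component(cell))
```

Keeping only the largest region also handles cells with several noise specks, at the cost of deleting parts of characters that are drawn as more than one stroke.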

Rotation

Next comes rotation. What should count as the standard orientation? Observation shows that the characters are never rotated by more than 60 degrees, so for each character we try every angle from -60 to +60 degrees and keep the rotation that minimizes the character's width. The code is as follows:

    mydir = './letter/';
    rotate = './rotate/';
    % Make sure the directory path ends with a separator
    if mydir(end) ~= '/' && mydir(end) ~= '\'
        mydir = [mydir, '/'];
    end
    DIRS = dir([mydir, '*.jpg']); % File extension
    n = length(DIRS);
    for i = 1:n
        if ~DIRS(i).isdir
            img = imread(strcat(mydir, DIRS(i).name));
            img = im2bw(img);
            minwidth = 20;
            imgrr = img; % Fallback in case no angle gives a width below 20
            for angle = -60:60
                imgr = imrotate(img, angle, 'bilinear', 'crop'); % crop to avoid image size changes
                imlabel = bwlabel(imgr);
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                imindex = ismember(imlabel, maxindex); % The largest connected domain is 1
                [y, x] = find(imindex == 1);
                width = max(x) - min(x) + 1; % Width of the character at this angle
                if width < minwidth
                    minwidth = width;
                    imgrr = imgr; % Narrowest rotation found so far
                end
            end
            name = strcat(rotate, DIRS(i).name)
            imwrite(imgrr, name);
        end
    end

The processing result is shown in the image; a total of 2000 character images are stored in the rotate folder: [image]
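The "narrowest width" criterion can be checked on a toy example. The sketch below works on the coordinates of foreground pixels instead of resampling an image, so it is a simplification of the MATLAB loop rather than a port of it, and its rotation sign convention (standard counter-clockwise rotation in a y-up plane) need not match imrotate's. `rotated_width`, `best_upright_angle`, and the tilted bar are hypothetical.

```python
import math


def rotated_width(points, angle_deg):
    """Horizontal extent of a set of (x, y) points after rotating by angle_deg."""
    theta = math.radians(angle_deg)
    xs = [x * math.cos(theta) - y * math.sin(theta) for x, y in points]
    return max(xs) - min(xs)


def best_upright_angle(points, max_angle=60):
    """Try every whole-degree angle in [-max_angle, max_angle] and return
    the one that gives the narrowest horizontal extent."""
    return min(range(-max_angle, max_angle + 1), key=lambda a: rotated_width(points, a))


# Hypothetical stroke: a straight bar tilted 30 degrees away from vertical.
tilt = math.radians(30)
bar = [(t * math.sin(tilt), t * math.cos(tilt)) for t in range(20)]
print(best_upright_angle(bar))  # prints 30
```

A thin bar is the ideal case for this criterion. Wide characters change width less sharply with angle, which is one reason a character can still end up in more than one form after this step.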

Template Selection

Now we select a set of templates from the rotate folder, covering every character. A character can have several templates, because even after the processing above we cannot guarantee that a character ends up in only one form, so it is worth selecting a few extra to ensure coverage. The selected template images are stored in the samples folder; the test code below reads a template's label from the first character of its file name. This step is very time-consuming and labor-intensive, so you can ask classmates for help, as shown in the image: [image]

Testing

For testing, we first apply the operations above to each test CAPTCHA, then compare every character with the selected templates and take the template with the smallest difference value as the recognized character. Each test image is named after its actual characters, so the code can count the errors. The code is as follows:

    % The image with the smallest difference value is the answer
    mydir = './test/';
    samples = './samples/';
    % Make sure both directory paths end with a separator
    if mydir(end) ~= '/' && mydir(end) ~= '\'
        mydir = [mydir, '/'];
    end
    if samples(end) ~= '/' && samples(end) ~= '\'
        samples = [samples, '/'];
    end
    DIRS = dir([mydir, '*.jpg']);    % Test CAPTCHA files
    DIRS1 = dir([samples, '*.jpg']); % Template files
    n = length(DIRS); % Total number of CAPTCHA images
    singleerror = 0;  % Number of misrecognized characters
    uniterror = 0;    % Number of CAPTCHAs with at least one error
    for i = 1:n
        if ~DIRS(i).isdir
            realcodes = DIRS(i).name(1:4); % The file name starts with the actual characters
            fprintf('Actual characters of CAPTCHA:%s\n', realcodes);
            img = imread(strcat(mydir, DIRS(i).name));
            img = rgb2gray(img);
            img = im2bw(img);
            img = 1 - img; % Color inversion to make characters connected domains
            subimgs = [];
            for ii = 0:3
                region = [ii*20+1, 1, 19, 20]; % Oddly, why can it only be evenly divided like this?
                subimg = imcrop(img, region);
                imlabel = bwlabel(subimg);
                if max(max(imlabel)) > 1 % Indicating there are noise points
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    area(maxindex) = 0;
                    secondindex = find(area == max(area));
                    imindex = ismember(imlabel, secondindex);
                    subimg(imindex == 1) = 0; % Remove the second largest connected domain
                end
                subimgs = [subimgs; subimg];
            end
            codes = [];
            for ii = 0:3
                region = [ii*20+1, 1, 19, 20];
                subimg = imcrop(img, region); % Cropped from img again; the denoised subimgs above are not reused here
                minwidth = 20;
                imgrr = subimg; % Fallback in case no angle gives a width below 20
                for angle = -60:60
                    imgr = imrotate(subimg, angle, 'bilinear', 'crop'); % crop to avoid image size changes
                    imlabel = bwlabel(imgr);
                    stats = regionprops(imlabel, 'Area');
                    area = cat(1, stats.Area);
                    maxindex = find(area == max(area));
                    imindex = ismember(imlabel, maxindex); % The largest connected domain is 1
                    [y, x] = find(imindex == 1);
                    width = max(x) - min(x) + 1;
                    if width < minwidth
                        minwidth = width;
                        imgrr = imgr;
                    end
                end
                % Nearest template: smallest sum of absolute pixel differences
                mindiffv = 1000000;
                for jj = 1:length(DIRS1)
                    imgsample = imread(strcat(samples, DIRS1(jj).name));
                    imgsample = im2bw(imgsample);
                    diffv = abs(imgsample - imgrr);
                    alldiffv = sum(sum(diffv));
                    if alldiffv < mindiffv
                        mindiffv = alldiffv;
                        code = DIRS1(jj).name;
                        code = code(1); % The template's label is the first character of its file name
                    end
                end
                codes = [codes, code];
            end
            fprintf('Test characters of CAPTCHA:%s\n', codes);
            num = codes - realcodes;
            num = length(find(num ~= 0));
            singleerror = singleerror + num;
            if num > 0
                uniterror = uniterror + 1;
            end
            fprintf('Number of errors:%d\n', num);
        end
    end
    fprintf('\n-----Results Statistics-----\n\n');
    fprintf('Total number of characters in test CAPTCHA:%d\n', n*4);
    fprintf('Number of character errors in test CAPTCHA:%d\n', singleerror);
    fprintf('Single character recognition accuracy:%.2f%%\n', (1 - singleerror/(n*4))*100);
    fprintf('Total number of test CAPTCHA images:%d\n', n);
    fprintf('Number of errors in test CAPTCHA images:%d\n', uniterror);
    fprintf('Probability of filling in the correct CAPTCHA:%.2f%%\n', (1 - uniterror/n)*100);

Results:

    Actual characters of CAPTCHA:2B4E
    Test characters of CAPTCHA:2B4F
    Number of errors:1
    Actual characters of CAPTCHA:4572
    Test characters of CAPTCHA:4572
    Number of errors:0
    Actual characters of CAPTCHA:52CY
    Test characters of CAPTCHA:52LY
    Number of errors:1
    Actual characters of CAPTCHA:83QG
    Test characters of CAPTCHA:85QG
    Number of errors:1
    Actual characters of CAPTCHA:9992
    Test characters of CAPTCHA:9992
    Number of errors:0
    Actual characters of CAPTCHA:A7Y7
    Test characters of CAPTCHA:A7Y7
    Number of errors:0
    Actual characters of CAPTCHA:D993
    Test characters of CAPTCHA:D995
    Number of errors:1
    Actual characters of CAPTCHA:F549
    Test characters of CAPTCHA:F5A9
    Number of errors:1
    Actual characters of CAPTCHA:FMC6
    Test characters of CAPTCHA:FMLF
    Number of errors:2
    Actual characters of CAPTCHA:R4N4
    Test characters of CAPTCHA:R4N4
    Number of errors:0

    -----Results Statistics-----

    Total number of characters in test CAPTCHA:40
    Number of character errors in test CAPTCHA:7
    Single character recognition accuracy:82.50%
    Total number of test CAPTCHA images:10
    Number of errors in test CAPTCHA images:6
    Probability of filling in the correct CAPTCHA:40.00%
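The statistics can be recomputed from the ten result pairs listed above, which is a handy sanity check when rerunning the test. The sketch below is plain Python rather than MATLAB; the function name `score` is hypothetical, and the pairs are copied from this first run's output.

```python
# (actual, recognized) pairs copied from the first test run's output.
RESULTS = [
    ("2B4E", "2B4F"), ("4572", "4572"), ("52CY", "52LY"), ("83QG", "85QG"),
    ("9992", "9992"), ("A7Y7", "A7Y7"), ("D993", "D995"), ("F549", "F5A9"),
    ("FMC6", "FMLF"), ("R4N4", "R4N4"),
]


def score(results):
    """Return (character errors, wrong CAPTCHAs, char accuracy %, CAPTCHA accuracy %)."""
    char_errors = 0
    wrong_captchas = 0
    total_chars = 0
    for actual, recognized in results:
        # Count positions where the recognized character differs
        errors = sum(1 for a, r in zip(actual, recognized) if a != r)
        char_errors += errors
        wrong_captchas += 1 if errors else 0
        total_chars += len(actual)
    char_acc = round((1 - char_errors / total_chars) * 100, 2)
    captcha_acc = round((1 - wrong_captchas / len(results)) * 100, 2)
    return char_errors, wrong_captchas, char_acc, captcha_acc


print(score(RESULTS))
```

It reproduces the figures in the listing: 7 character errors, 6 wrong CAPTCHAs, 82.5% per character, and 40% per CAPTCHA.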

The accuracy for individual characters is fairly high, but the accuracy for whole CAPTCHAs is still lacking. Looking at the results, the errors all involve easily confused characters, such as E and F, C and L, 5 and 3, and 4 and A. What we can do, therefore, is add more samples to the templates to reduce the confusion.

After adding a few dozen more samples, we tested again, and the results are as follows:

    Actual characters of CAPTCHA:2B4E
    Test characters of CAPTCHA:2B4F
    Number of errors:1
    Actual characters of CAPTCHA:4572
    Test characters of CAPTCHA:4572
    Number of errors:0
    Actual characters of CAPTCHA:52CY
    Test characters of CAPTCHA:52LY
    Number of errors:1
    Actual characters of CAPTCHA:83QG
    Test characters of CAPTCHA:83QG
    Number of errors:0
    Actual characters of CAPTCHA:9992
    Test characters of CAPTCHA:9992
    Number of errors:0
    Actual characters of CAPTCHA:A7Y7
    Test characters of CAPTCHA:A7Y7
    Number of errors:0
    Actual characters of CAPTCHA:D993
    Test characters of CAPTCHA:D993
    Number of errors:0
    Actual characters of CAPTCHA:F549
    Test characters of CAPTCHA:F5A9
    Number of errors:1
    Actual characters of CAPTCHA:FMC6
    Test characters of CAPTCHA:FMLF
    Number of errors:2
    Actual characters of CAPTCHA:R4N4
    Test characters of CAPTCHA:R4N4
    Number of errors:0

    -----Results Statistics-----

    Total number of characters in test CAPTCHA:40
    Number of character errors in test CAPTCHA:5
    Single character recognition accuracy:87.50%
    Total number of test CAPTCHA images:10
    Number of errors in test CAPTCHA images:4
    Probability of filling in the correct CAPTCHA:60.00%

Both the single-character recognition accuracy and the probability of getting the whole CAPTCHA right have improved. As the number of templates grows, the accuracy can be expected to keep improving.
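The gap between the two accuracy figures has a simple explanation: a CAPTCHA only counts as correct when all four characters are right. As a rough estimate, and assuming that errors on different characters are independent (my assumption, not something the test above establishes), a per-character accuracy p gives a whole-CAPTCHA accuracy of about p to the fourth power. The function name `whole_captcha_estimate` is hypothetical.

```python
def whole_captcha_estimate(char_accuracy, length=4):
    """Chance that all `length` characters are right, assuming independent errors."""
    return char_accuracy ** length


# Per-character accuracies measured in the two test runs
for p in (0.825, 0.875):
    print("%.3f -> %.3f" % (p, whole_captcha_estimate(p)))
# prints:
# 0.825 -> 0.463
# 0.875 -> 0.586
```

These estimates (about 46% and 59%) are in the same range as the measured 40% and 60%, bearing in mind that a test set of only 10 CAPTCHAs is very small. By the same estimate, a per-character accuracy of roughly 97% would be needed to get 90% of whole CAPTCHAs right.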

Conclusion

This method scales poorly and is only suitable for simple CAPTCHAs. For complex ones such as those on 12306, it is not applicable at all.

In summary, the road of learning is still very long, and I will gradually improve this method.
