Source: Qiu Kang singasong
https://segmentfault.com/a/1190000006070219
Introduction
I previously developed a campus dating app. One of its requirements was to confirm that a user is a student by logging in to their school's academic system with the account and password they provide, then scraping the result. However, many academic systems use CAPTCHAs. Back then, our server downloaded the CAPTCHA, forwarded it to the client, and asked the user to type it in along with their account and password; the server then used all three to simulate a login and check whether it succeeded. CAPTCHAs undoubtedly ruined our plan for quick user verification, but we had no alternative at the time. Recently, after reading some machine learning material, I realized that the very simple CAPTCHAs most schools use could be cracked with the KNN (k-nearest neighbors) method. So I organized my thoughts and rolled up my sleeves to get started!
Analysis
Our school's CAPTCHA is actually just four rotated characters with some slight noise added. To recognize it, we reverse that process: first binarize the image to remove noise, then segment the individual characters, and finally rotate each one to a standard orientation. From the processed images we select templates. Every time a new CAPTCHA comes in, we process it in the same way and compare each character with the templates, taking the template with the smallest distance as the result (this is the idea of KNN, with K=1 in this article). The steps are explained below.
Obtaining the CAPTCHA
First, we need a large number of CAPTCHAs, which we can achieve through web scraping. The code is as follows:
#-*- coding:UTF-8 -*-
import urllib,urllib2,cookielib,string,Image
def getchk(number):
    # Create cookie object
    cookie = cookielib.LWPCookieJar()
    cookieSupport = urllib2.HTTPCookieProcessor(cookie)
    opener = urllib2.build_opener(cookieSupport,urllib2.HTTPHandler)
    urllib2.install_opener(opener)
    # First connect to the academic system to obtain the cookie
    # Pretend to be a browser
    headers = {
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding':'gzip,deflate',
        'Accept-Language':'zh-CN,zh;q=0.8',
        'User-Agent':'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'
    }
    req0 = urllib2.Request(
        url ='http://mis.teach.ustc.edu.cn',
        headers = headers # Request header
    )
    # Catch HTTP errors
    try :
        result0 = urllib2.urlopen(req0)
    except urllib2.HTTPError,e:
        print e.code
    # Extract cookie
    getcookie = ['',]
    for item in cookie:
        getcookie.append(item.name)
        getcookie.append("=")
        getcookie.append(item.value)
    getcookie = "".join(getcookie)
    # Modify headers
    headers["Origin"] = "http://mis.teach.ustc.edu.cn"
    headers["Referer"] = "http://mis.teach.ustc.edu.cn/userinit.do"
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    headers["Cookie"] = getcookie
    for i in range(number):
        req = urllib2.Request(
            url ="http://mis.teach.ustc.edu.cn/randomImage.do?date='1469451446894'",
            headers = headers # Request header
        )
        response = urllib2.urlopen(req)
        status = response.getcode()
        picData = response.read()
        if status == 200:
            localPic = open("./source/"+str(i)+".jpg","wb")
            localPic.write(picData)
            localPic.close()
        else:
            print "failed to get Check Code "
if __name__ == '__main__':
    getchk(500)
We downloaded 500 CAPTCHAs to the source directory. As shown in the image:
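The listing above is Python 2 (urllib2, cookielib). On Python 3, the same flow of "visit once to get the cookie, then download" might look roughly like the sketch below, which uses only the standard library. The names build_opener_with_cookies and download_captchas and the home_url and captcha_url parameters are placeholders of mine, not part of the original code, and the sketch is untested against the real academic system.

```python
import http.cookiejar
import os
import urllib.request


def build_opener_with_cookies(user_agent="Mozilla/5.0"):
    """Return (opener, jar); the opener stores and resends cookies itself."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Pretend to be a browser, as the original headers do.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def download_captchas(opener, home_url, captcha_url, count, out_dir="./source"):
    """Visit home_url once for the session cookie, then save `count` images."""
    os.makedirs(out_dir, exist_ok=True)
    opener.open(home_url, timeout=10).close()
    saved = 0
    for i in range(count):
        # open() raises urllib.error.HTTPError on an error status,
        # so reaching read() means the request succeeded.
        with opener.open(captcha_url, timeout=10) as response:
            data = response.read()
        with open(os.path.join(out_dir, "%d.jpg" % i), "wb") as f:
            f.write(data)
        saved += 1
    return saved
```

Because the cookie processor resends the session cookie on its own, this sketch does not need the hand-built Cookie header of the original.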
Binarization
The rich image processing functions in MATLAB can save us a lot of time. We traverse the source folder, perform binarization on each CAPTCHA image, and store the processed images in the bw directory. The code is as follows:
mydir='./source/';
bw = './bw/';
if mydir(end)~='\'
    mydir=[mydir,'\'];
end
DIRS=dir([mydir,'*.jpg']); % File extension
n=length(DIRS);
for i=1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir,DIRS(i).name));
        img = rgb2gray(img);% Grayscale
        img = im2bw(img);% 0-1 binarization
        name = strcat(bw,DIRS(i).name)
        imwrite(img,name);
    end
end
The processing result is shown in the image:
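For readers without MATLAB, the grayscale-and-threshold step can be sketched in plain Python. This is only an approximation of what rgb2gray and im2bw do: it assumes the image is already loaded as rows of (r, g, b) tuples with 0-255 channels, uses the usual luminance weights, and thresholds at half the intensity range, which as far as I know is im2bw's default level of 0.5. The function names are my own.

```python
def rgb_to_gray(pixel):
    """Weighted luminance of one (r, g, b) pixel, on the same 0-255 scale."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b


def binarize(rgb_image, level=0.5):
    """Return rows of 0/1: 1 where the pixel is brighter than the threshold."""
    threshold = level * 255
    return [[1 if rgb_to_gray(p) > threshold else 0 for p in row]
            for row in rgb_image]


# Tiny example: light background pixels become 1, dark character pixels 0.
sample = [[(255, 255, 255), (0, 0, 0)],
          [(200, 200, 200), (30, 30, 30)]]
print(binarize(sample))
```

Note that after this step the characters are 0 and the background is 1, which is why the later steps invert the image before looking for connected regions.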
Segmentation
mydir='./bw/';
letter = './letter/';
if mydir(end)~='\'
    mydir=[mydir,'\'];
end
DIRS=dir([mydir,'*.jpg']); % File extension
n=length(DIRS);
for i=1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir,DIRS(i).name));
        img = im2bw(img);% Binarization
        img = 1-img;% Color inversion to make characters connected domains, convenient for removing noise
        for ii = 0:3
            region = [ii*20+1,1,19,20];% Divide a CAPTCHA into four 20*20 sized character images
            subimg = imcrop(img,region);
            imlabel = bwlabel(subimg);
            % imshow(imlabel);
            if max(max(imlabel))>1 % Indicating there are noise points, need to remove
                % max(max(imlabel))
                % imshow(subimg);
                stats = regionprops(imlabel,'Area');
                area = cat(1,stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel,secondindex);
                subimg(imindex==1)=0;% Remove the second largest connected domain, noise points cannot be larger than characters, so the second largest is noise
            end
            name = strcat(letter,DIRS(i).name(1:length(DIRS(i).name)-4),'_',num2str(ii),'.jpg')
            imwrite(subimg,name);
        end
    end
end
The processing result is shown in the image:
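The segmentation step combines two ideas: cutting the CAPTCHA into four fixed-width tiles, and treating the second-largest connected region in each tile as noise. The plain-Python sketch below illustrates both under some assumptions of mine: images are lists of 0/1 rows with the characters already inverted to 1, regions are 8-connected (which I believe is bwlabel's default for 2-D images), and the helper names are hypothetical.

```python
from collections import deque


def split_columns(image, count=4, width=20):
    """Cut the image into `count` side-by-side tiles, each `width` columns wide."""
    return [[row[k * width:(k + 1) * width] for row in image]
            for k in range(count)]


def label_components(image):
    """8-connected labeling. Returns (labels, sizes), sizes[label] = pixel count."""
    h, w = len(image), len(image[0])
    labels = [[0] * w for _ in range(h)]
    sizes = {}
    for y in range(h):
        for x in range(w):
            if not image[y][x] or labels[y][x]:
                continue
            label = len(sizes) + 1
            labels[y][x] = label
            queue = deque([(y, x)])
            size = 0
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                size += 1
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if image[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = label
                            queue.append((ny, nx))
            sizes[label] = size
    return labels, sizes


def remove_second_largest(image):
    """Erase the second-largest region, which the article treats as noise."""
    labels, sizes = label_components(image)
    if len(sizes) < 2:
        return [row[:] for row in image]
    noise = sorted(sizes, key=sizes.get, reverse=True)[1]
    return [[0 if labels[y][x] == noise else v for x, v in enumerate(row)]
            for y, row in enumerate(image)]
```

As in the MATLAB code, only one region is erased per tile, so a tile with several noise specks keeps all but the biggest speck.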
Rotation
Next comes rotation. What should the standard orientation be? On inspection, these characters are never rotated by more than 60 degrees, so for each character we try every angle from -60 to +60 degrees and keep the rotation that makes the character narrowest. The code is as follows:
% Input: segmented characters; output: rotated characters
% (folder paths inferred from the other steps)
mydir='./letter/';
rotate = './rotate/';
if mydir(end)~='\'
    mydir=[mydir,'\'];
end
DIRS=dir([mydir,'*.jpg']); % File extension
n=length(DIRS);
for i=1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir,DIRS(i).name));
        img = im2bw(img);
        minwidth = 20;
        for angle = -60:60
            imgr=imrotate(img,angle,'bilinear','crop');% crop to avoid image size changes
            imlabel = bwlabel(imgr);
            stats = regionprops(imlabel,'Area');
            area = cat(1,stats.Area);
            maxindex = find(area == max(area));
            imindex = ismember(imlabel,maxindex);% The largest connected domain is 1
            [y,x] = find(imindex==1);
            width = max(x)-min(x)+1;
            if width<minwidth
                minwidth = width;
                imgrr = imgr;
            end
        end
        name = strcat(rotate,DIRS(i).name)
        imwrite(imgrr,name);
    end
end
The processing result is shown in the image, with a total of 2000 character images stored in the rotate folder
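The "rotate until the character is narrowest" search can be sketched in plain Python as follows. It deliberately simplifies the MATLAB version: it uses nearest-neighbour sampling instead of bilinear interpolation, measures the width over all foreground pixels rather than only the largest connected region, and starts from the unrotated image instead of a fixed width of 20. Images are assumed to be lists of 0/1 rows, and the function names are my own.

```python
import math


def rotate_crop(image, degrees):
    """Rotate about the centre (positive = counterclockwise), same output size."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = y - cy, x - cx
            # Inverse mapping: find the source pixel that lands on (y, x).
            sx = int(math.floor(cx + dx * c - dy * s + 0.5))
            sy = int(math.floor(cy + dx * s + dy * c + 0.5))
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out


def foreground_width(image):
    """Number of columns spanned by the 1-pixels (0 if there are none)."""
    cols = [x for row in image for x, v in enumerate(row) if v]
    return max(cols) - min(cols) + 1 if cols else 0


def straighten(image, max_angle=60):
    """Return (image, angle, width) for the narrowest rotation found."""
    best = (image, 0, foreground_width(image))
    for angle in range(-max_angle, max_angle + 1):
        candidate = rotate_crop(image, angle)
        width = foreground_width(candidate)
        if 0 < width < best[2]:
            best = (candidate, angle, width)
    return best


# A diagonal stroke is 15 columns wide; a clockwise rotation makes it upright.
stroke = [[1 if (y == x and 3 <= y <= 17) else 0 for x in range(21)]
          for y in range(21)]
upright, angle, width = straighten(stroke)
print(foreground_width(stroke), "->", width, "at", angle, "degrees")
```

Minimizing width is a heuristic: it works for characters that are taller than they are wide, which is presumably why some characters still end up in more than one form and need several templates.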
Template Selection
Now we select a set of templates from the rotate folder, covering every character. A character can have several templates, because even after the processing above we cannot guarantee that it always ends up in a single form, so selecting a few extra for coverage is necessary. The selected templates are stored in the samples folder, each named so that the first character of the file name is the character it shows (the test code reads the label from there). This process is very time-consuming and labor-intensive, so you may want to ask classmates for help. The result is shown in the image:
Testing
For testing, we first apply the operations above to the test CAPTCHA, then compare each character with the selected templates and take the template with the smallest pixel difference as the recognized character. Each test image is named after the four characters it actually shows, so the results can be checked automatically. The code is as follows:
% The template with the smallest pixel difference is the answer
mydir='./test/';
samples = './samples/';
if mydir(end)~='\'
    mydir=[mydir,'\'];
end
if samples(end)~='\'
    samples=[samples,'\'];
end
DIRS=dir([mydir,'*.jpg']); % File extension
DIRS1=dir([samples,'*.jpg']); % File extension
n=length(DIRS);% Total number of CAPTCHA images
singleerror = 0;% Number of misrecognized characters
uniterror = 0;% Number of CAPTCHAs with at least one error
for i=1:n
    if ~DIRS(i).isdir
        realcodes = DIRS(i).name(1:4);% The file name starts with the actual characters
        fprintf('Actual characters of CAPTCHA:%s\n',realcodes);
        img = imread(strcat(mydir,DIRS(i).name));
        img = rgb2gray(img);
        img = im2bw(img);
        img = 1-img;% Color inversion to make characters connected domains
        subimgs = [];
        for ii = 0:3
            region = [ii*20+1,1,19,20];% A width of 19 yields 20 columns, since imcrop includes both end pixels
            subimg = imcrop(img,region);
            imlabel = bwlabel(subimg);
            if max(max(imlabel))>1 % Indicating there are noise points
                stats = regionprops(imlabel,'Area');
                area = cat(1,stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel,secondindex);
                subimg(imindex==1)=0;% Remove the second largest connected domain
            end
            subimgs = [subimgs;subimg];
        end
        % Note: subimgs is not used below; the next loop crops from img again
        codes = [];
        for ii = 0:3
            region = [ii*20+1,1,19,20];
            subimg = imcrop(img,region);
            minwidth = 20;
            for angle = -60:60
                imgr=imrotate(subimg,angle,'bilinear','crop');% crop to avoid image size changes
                imlabel = bwlabel(imgr);
                stats = regionprops(imlabel,'Area');
                area = cat(1,stats.Area);
                maxindex = find(area == max(area));
                imindex = ismember(imlabel,maxindex);% The largest connected domain is 1
                [y,x] = find(imindex==1);
                width = max(x)-min(x)+1;
                if width<minwidth
                    minwidth = width;
                    imgrr = imgr;
                end
            end
            mindiffv = 1000000;
            for jj = 1:length(DIRS1)
                imgsample = imread(strcat(samples,DIRS1(jj).name));
                imgsample = im2bw(imgsample);
                diffv = abs(imgsample-imgrr);
                alldiffv = sum(sum(diffv));
                if alldiffv<mindiffv
                    mindiffv = alldiffv;
                    code = DIRS1(jj).name;
                    code = code(1);% The first character of the template name is its label
                end
            end
            codes = [codes,code];
        end
        fprintf('Test characters of CAPTCHA:%s\n',codes);
        num = codes-realcodes;
        num = length(find(num~=0));
        singleerror = singleerror + num;
        if num>0
            uniterror = uniterror +1;
        end
        fprintf('Number of errors:%d\n',num);
    end
end
fprintf('\n-----Results Statistics-----\n\n');
fprintf('Total number of characters in test CAPTCHA:%d\n',n*4);
fprintf('Number of character errors in test CAPTCHA:%d\n',singleerror);
fprintf('Single character recognition accuracy:%.2f%%\n',(1-singleerror/(n*4))*100);
fprintf('Total number of test CAPTCHA images:%d\n',n);
fprintf('Number of errors in test CAPTCHA images:%d\n',uniterror);
fprintf('Probability of filling in the correct CAPTCHA:%.2f%%\n',(1-uniterror/n)*100);
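Stripped of the file handling, the comparison at the heart of the test script is a nearest-neighbour lookup with K=1. Here is a minimal Python sketch of that idea, assuming images are equal-sized lists of 0/1 rows and that templates arrive as (file name, image) pairs whose label is the first character of the file name, as in the MATLAB code. The helper names and the tiny 3x3 "E" and "F" shapes are made up for illustration.

```python
def pixel_difference(a, b):
    """Sum of absolute pixel differences between two equal-sized 0/1 images."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def classify(image, templates):
    """1-nearest-neighbour: return (label, difference) of the closest template."""
    best_label, best_diff = None, None
    for file_name, template in templates:
        diff = pixel_difference(image, template)
        if best_diff is None or diff < best_diff:
            best_label, best_diff = file_name[0], diff
    return best_label, best_diff


def count_errors(predicted, actual):
    """Number of positions where the recognized text differs from the truth."""
    return sum(1 for p, a in zip(predicted, actual) if p != a)


templates = [
    ("E_1.jpg", [[1, 1, 1], [1, 1, 0], [1, 1, 1]]),
    ("F_1.jpg", [[1, 1, 1], [1, 1, 0], [1, 0, 0]]),
]
query = [[1, 1, 1], [1, 0, 0], [1, 0, 0]]
print(classify(query, templates))
print(count_errors("2B4F", "2B4E"))
```

With K=1 every template acts as its own class prototype, so adding a template for a badly recognized shape changes the result immediately, without any retraining.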
Results:
Actual characters of CAPTCHA:2B4E
Test characters of CAPTCHA:2B4F
Number of errors:1
Actual characters of CAPTCHA:4572
Test characters of CAPTCHA:4572
Number of errors:0
Actual characters of CAPTCHA:52CY
Test characters of CAPTCHA:52LY
Number of errors:1
Actual characters of CAPTCHA:83QG
Test characters of CAPTCHA:85QG
Number of errors:1
Actual characters of CAPTCHA:9992
Test characters of CAPTCHA:9992
Number of errors:0
Actual characters of CAPTCHA:A7Y7
Test characters of CAPTCHA:A7Y7
Number of errors:0
Actual characters of CAPTCHA:D993
Test characters of CAPTCHA:D995
Number of errors:1
Actual characters of CAPTCHA:F549
Test characters of CAPTCHA:F5A9
Number of errors:1
Actual characters of CAPTCHA:FMC6
Test characters of CAPTCHA:FMLF
Number of errors:2
Actual characters of CAPTCHA:R4N4
Test characters of CAPTCHA:R4N4
Number of errors:0
-----Results Statistics-----
Total number of characters in test CAPTCHA:40
Number of character errors in test CAPTCHA:7
Single character recognition accuracy:82.50%
Total number of test CAPTCHA images:10
Number of errors in test CAPTCHA images:6
Probability of filling in the correct CAPTCHA:40.00%
The per-character accuracy is fairly high, but the accuracy for whole CAPTCHAs is still poor, since a single wrong character spoils the entire CAPTCHA. Looking at the results, the misrecognized characters are the easily confused ones, such as E and F, C and L, 5 and 3, and 4 and A. What we can do, therefore, is add more templates to reduce the confusion.
After adding a few dozen more templates, we tested again, with the following results:
Actual characters of CAPTCHA:2B4E
Test characters of CAPTCHA:2B4F
Number of errors:1
Actual characters of CAPTCHA:4572
Test characters of CAPTCHA:4572
Number of errors:0
Actual characters of CAPTCHA:52CY
Test characters of CAPTCHA:52LY
Number of errors:1
Actual characters of CAPTCHA:83QG
Test characters of CAPTCHA:83QG
Number of errors:0
Actual characters of CAPTCHA:9992
Test characters of CAPTCHA:9992
Number of errors:0
Actual characters of CAPTCHA:A7Y7
Test characters of CAPTCHA:A7Y7
Number of errors:0
Actual characters of CAPTCHA:D993
Test characters of CAPTCHA:D993
Number of errors:0
Actual characters of CAPTCHA:F549
Test characters of CAPTCHA:F5A9
Number of errors:1
Actual characters of CAPTCHA:FMC6
Test characters of CAPTCHA:FMLF
Number of errors:2
Actual characters of CAPTCHA:R4N4
Test characters of CAPTCHA:R4N4
Number of errors:0
-----Results Statistics-----
Total number of characters in test CAPTCHA:40
Number of character errors in test CAPTCHA:5
Single character recognition accuracy:87.50%
Total number of test CAPTCHA images:10
Number of errors in test CAPTCHA images:4
Probability of filling in the correct CAPTCHA:60.00%
Both the single-character accuracy and the probability of getting the whole CAPTCHA right have improved. We can expect accuracy to keep improving as more templates are added.
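The gap between the two figures is roughly what simple arithmetic predicts. If we assume (a simplification of mine, not something measured in this article) that the four characters fail independently, the chance of getting a whole CAPTCHA right is the per-character accuracy raised to the fourth power:

```python
def whole_captcha_accuracy(per_char_accuracy, length=4):
    """Chance that all `length` characters are right, if errors are independent."""
    return per_char_accuracy ** length


for per_char in (0.825, 0.875):
    print("%.1f%% per character -> %.1f%% per CAPTCHA"
          % (per_char * 100, whole_captcha_accuracy(per_char) * 100))
```

This gives about 46% and 59%, close to the 40% and 60% observed on the ten test images. It also shows how much per-character accuracy is needed: under the same assumption, reaching 90% on whole CAPTCHAs takes roughly 97.4% per character.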
Conclusion
This method generalizes poorly and is only suitable for simple CAPTCHAs. For complex ones such as those on 12306 (China's railway ticketing site), it is not applicable at all.
In summary, the road of learning is still very long, and I will gradually improve this method.