Improving kNN Matching Effect for Dating Websites (With Source Code)

This article follows up on the previous one, in which we briefly introduced the basic principles of the kNN algorithm. Using the context of dating website matching, we implemented data parsing from a text file using Python and created scatter plots with matplotlib for data analysis. In this article, we will build a complete and usable matching system based on this foundation (the data used in this example and the implementation code in Python are attached at the end).

3. Data Preparation: Normalization

In this example, the number of frequent flyer miles earned per year has a far greater effect on the distance calculation than the other two attributes (the percentage of time spent playing video games and the kilograms of ice cream consumed per week). This is solely because its numeric values are much larger. If the three attributes are considered equally important, frequent flyer miles should not dominate the result in this way.
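To make the scale problem concrete, here is a minimal sketch using two made-up samples (the numbers are illustrative assumptions, not rows from the data set). It computes the Euclidean distance between them and the share of the squared distance that each attribute contributes.

```python
import math

# Two hypothetical people: (flyer miles per year, % time gaming, ice cream per week)
person_a = (40000.0, 8.0, 1.0)
person_b = (32000.0, 2.0, 0.1)

# Squared difference contributed by each attribute
squared_diffs = [(a - b) ** 2 for a, b in zip(person_a, person_b)]
distance = math.sqrt(sum(squared_diffs))

# Share of the squared distance owed to each attribute
shares = [d / sum(squared_diffs) for d in squared_diffs]

print("distance: %.2f" % distance)
for name, share in zip(("miles", "gaming", "ice cream"), shares):
    print("%-10s %.6f" % (name, share))
```

With these values, almost all of the distance comes from the miles attribute, so the other two attributes barely influence which neighbors are considered nearest.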

When dealing with attributes that have different value ranges, normalization is the most commonly used method, such as compressing/extending the value range to [0, 1] or [-1, 1]. The following formula can convert any range of attribute values to [0, 1]:

newValue = (oldValue - Min) / (Max - Min)

Here Min and Max are the smallest and largest values of the attribute in the data set. Rescaling adds some computation and memory overhead, but it is needed so that every attribute contributes comparably to the distance. We therefore create a function called autoNorm that rescales each attribute to [0, 1]. The code is as follows:

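from numpy import zeros, shape, tile

def autoNorm(dataSet):
    # Column-wise minimum, maximum, and range of each attribute
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    normDataSet = zeros(shape(dataSet))
    m = dataSet.shape[0]
    # Subtract the minimum and divide by the range, row by row
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals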

In the Python command line, load the kNN.py module and execute the autoNorm function, as shown below.

The function could return normMat alone, but ranges and minVals are also returned because they are needed later to normalize new input before classification.
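The following sketch illustrates why the extra return values matter. It is a compact variant of the same min-max scaling written with NumPy broadcasting instead of tile; the name auto_norm and the sample rows are assumptions made up for this illustration.

```python
import numpy as np

def auto_norm(data_set):
    """Min-max scale each column to [0, 1]; also return the column ranges and minimums."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting applies the subtraction and division to every row
    return (data_set - min_vals) / ranges, ranges, min_vals

# Made-up training rows: (flyer miles, % time gaming, ice cream)
train = np.array([[40000.0, 8.0, 1.0],
                  [10000.0, 2.0, 0.5],
                  [70000.0, 12.0, 1.5]])
norm_mat, ranges, min_vals = auto_norm(train)

# A new sample must be scaled with the training minimums and ranges
new_sample = np.array([25000.0, 7.0, 0.75])
print((new_sample - min_vals) / ranges)
```

Note that a column whose values are all identical would have a range of zero; this sketch does not handle that case.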

4. Testing the Algorithm

In this article, we use 10% of the data set as the testing data set and 90% as the training data set (Note: The training and testing data sets should be randomly selected).
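One possible way to select the two sets at random is to shuffle the row indices and cut the shuffled list at the desired ratio. The helper name split_indices, the seed, and the row count of 1000 below are assumptions for illustration only.

```python
import random

def split_indices(n_samples, test_ratio, seed=0):
    """Return (test_idx, train_idx) after shuffling the row indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(n_samples * test_ratio)
    return indices[:n_test], indices[n_test:]

# Hypothetical data set of 1000 rows with 10% held out for testing
test_idx, train_idx = split_indices(1000, 0.10)
print(len(test_idx), len(train_idx))
```

If the rows in the data file are already in no particular order, simply taking the first 10% of the rows has the same effect.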

We create a function called classify0 as the kNN classifier, with the following code:

import operator
from numpy import tile

def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest neighbors
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Return the label with the most votes
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
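A quick way to sanity-check a kNN classifier is to run it on a toy data set whose answer is obvious. The sketch below is a compact re-implementation of the same majority-vote idea rather than the function above; the name knn_classify and the four sample points are made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_classify(in_x, data_set, labels, k):
    """Majority vote among the k training rows nearest to in_x (Euclidean distance)."""
    distances = np.sqrt(((data_set - in_x) ** 2).sum(axis=1))
    nearest = distances.argsort()[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated groups of points with known labels
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]

print(knn_classify(np.array([0.0, 0.2]), group, labels, 3))  # point near the "B" group
print(knn_classify(np.array([0.9, 0.9]), group, labels, 3))  # point near the "A" group
```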

To test the effectiveness of the classifier, we need to create a function called datingClassTest, with the following code:

def datingClassTest():
    hoRatio = 0.50  # hold-out fraction: 0.50 tests on half the data; use 0.10 to hold out 10%
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    # The first numTestVecs rows are the test set; the remaining rows are the training set
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :], datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d" % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)

This function uses previously defined functions. The file2matrix function will parse data from the text file, and autoNorm will normalize the data read into memory. Enter the following code in the Python command line, as shown below.

As shown in the figure, the error rate on the dating data set is 6.4%, which is quite a good result. We can adjust the variables hoRatio and k in datingClassTest to see how the error rate changes. With a classifier this accurate, one can enter the attributes of an unknown person and let the program predict which category that person belongs to (dislike, like somewhat, or like very much).
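The effect of the hold-out ratio and k can be explored with a small loop. The sketch below does not use the dating data: it generates synthetic two-class points (an assumption for illustration) and uses hypothetical pure-Python helpers knn_predict and error_rate, so the printed error rates say nothing about the real data set.

```python
import random
from collections import Counter

def knn_predict(x, train, k):
    """train is a list of (features, label); vote among the k nearest rows."""
    nearest = sorted(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(x, row[0])))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(data, test_ratio, k):
    """Hold out the first test_ratio of data and classify it against the rest."""
    n_test = int(len(data) * test_ratio)
    test, train = data[:n_test], data[n_test:]
    wrong = sum(knn_predict(x, train, k) != label for x, label in test)
    return wrong / float(n_test)

# Synthetic data: class 1 centered at (0.3, 0.3), class 2 centered at (0.7, 0.7)
rng = random.Random(42)
data = [((rng.gauss(c, 0.15), rng.gauss(c, 0.15)), label)
        for c, label in ((0.3, 1), (0.7, 2)) for _ in range(100)]
rng.shuffle(data)

for test_ratio in (0.1, 0.5):
    for k in (1, 3, 5, 7):
        print("hoRatio=%.1f k=%d error=%.3f" % (test_ratio, k, error_rate(data, test_ratio, k)))
```

Running the same kind of loop over datingClassTest, with hoRatio and k turned into parameters, would show which settings work best for the dating data.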

5. Using the Algorithm: Building a Complete Usable System

The implemented code is as follows.

from numpy import array

def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    # On Python 2, use raw_input instead of input
    percentTats = float(input("percentage of time spent playing video games? "))
    ffMiles = float(input("frequent flier miles earned per year? "))
    iceCream = float(input("liters of ice cream consumed per year? "))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the input with the training minimums and ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person: %s" % resultList[classifierResult - 1])

Enter the following in the Python command line as shown in the figure.

Thus, we have completed the application of kNN in improving the matching rate for dating websites.

The data set used in this article and the complete Python implementation code can be accessed via the Baidu Cloud link: http://pan.baidu.com/s/1boPnSBH
