I previously wrote two articles on address processing, detailed below:
Address Processing in Risk Control: Simple and Effective Regex Replacement
Address Processing in Risk Control: Simple, Accurate, and Effective Multi-label Clustering
Today we are writing the third article: using KNN to find similar addresses. Although the article is about addresses, the approach carries over to many other scenarios, so try to think beyond this one use case.
Additionally, many students don't study algorithms deeply enough: they find a demo, run one case, and think they have mastered it, when in fact these algorithms have real depth. If you don't fully follow today's article, you can review my previous article: KNN Algorithm Simple? I Spent 30,000 Words Not Making It Clear…
If you want to learn more deeply and systematically, you can also buy my course, which explains a similar case in great detail.
Let’s start with the main content. First, let’s look at the format of the address data. The previous articles also used this dataset. To practice with the data, please follow Little Wu Talks Risk Control and reply with 【Address Data】 to get the download link.
Today we won't do any regex processing; let's get straight to it.
# Load the necessary packages
import pandas as pd
import numpy as np
import os
# Read data - For data, please follow "Little Wu Talks Risk Control" and reply with 【Address Data】
os.chdir('/Users/wuzhengxiang/Documents/DataSets/Address Processing')
data = pd.read_csv('Company Address Data.csv')
print(data.head(20))
# Jieba word segmentation - test
import jieba
jieba.lcut('Little Wu is the most handsome in the universe') # ['Little Wu', 'is', 'the', 'most', 'handsome', 'in', 'the', 'universe']
# Perform word segmentation with Jieba; df becomes a Series of space-joined tokens
df = data['Company Address'].apply(lambda x: ' '.join(jieba.lcut(x)))
df.head()
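A practical aside before vectorizing: jieba sometimes splits street or building names into fragments. If that happens with your addresses, you can register them so they stay intact. A minimal sketch of the API; the entry below is illustrative, substitute names from your own data:
# Register a domain word so jieba keeps it as one token (illustrative entry)
jieba.add_word('Baiyang Road')
jieba.lcut('Panlong District Ciba Street Baiyang Road')
# For many entries, load them from a file instead ('user_dict.txt' is a placeholder path):
# jieba.load_userdict('user_dict.txt')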
# Vector transformation
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer_word = TfidfVectorizer(max_features=20000,
                                        token_pattern=r"(?u)\b\w+\b",
                                        min_df=1,
                                        analyzer='word',
                                        ngram_range=(1, 2)
                                        )
vectorizer_word = tfidf_vectorizer_word.fit(df)
tfidf_matrix = vectorizer_word.transform(df)
# Check the size of the vocabulary
print(len(vectorizer_word.vocabulary_)) # 109
# Check some words in the vocabulary
vectorizer_word.vocabulary_
# {'Yunnan Province': 51, 'Kunming City': 72, 'Panlong District': 85, 'Ciba': 93, 'Street': 95, 'Baiyang': 81, 'Road': 98, '233': 13, ...}
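If you want the tokens themselves rather than the {word: column index} mapping, recent scikit-learn versions expose them in column order (on older versions the method is get_feature_names()):
# The learned vocabulary as an array, in the column order of tfidf_matrix
print(vectorizer_word.get_feature_names_out()[:10])
print(tfidf_matrix.shape)   # (number of addresses, vocabulary size)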
With the above code, we obtained the vector representation of the text. Next is the KNN algorithm.
The radius_neighbors_graph constructs a restricted radius neighbor graph, meaning neighbors within a certain distance are included, while those beyond that distance are excluded. We can also use kneighbors_graph to find the same number of neighbors for each node.
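One note on the radius value used below. radius_neighbors_graph measures Euclidean distance by default, and TfidfVectorizer L2-normalizes each row, so for two addresses with cosine similarity s the Euclidean distance d satisfies d² = 2(1 − s). The radius is therefore really a similarity threshold. A quick check under those defaults:
# Translate the Euclidean radius into a minimum cosine similarity
# (valid because the tf-idf rows are unit-length by default)
radius = 0.99
min_similarity = 1 - radius ** 2 / 2
print(min_similarity)   # ~0.51: only pairs at least this similar get an edge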
from sklearn.neighbors import radius_neighbors_graph
from sklearn.neighbors import kneighbors_graph
# Construct the restricted-radius neighbor graph
A = radius_neighbors_graph(tfidf_matrix,
                           radius=0.99,
                           mode='connectivity',
                           include_self=False
                           )
A.toarray()
array([[0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.]])
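For comparison, the fixed-k alternative mentioned above looks like this; k=3 is an arbitrary choice for illustration. Note that "being among my 3 nearest neighbors" is not symmetric, so for grouping you may want to symmetrize the result (keep an edge if either direction has one).
# Alternative: connect every address to its k nearest neighbors
B = kneighbors_graph(tfidf_matrix,
                     n_neighbors=3,
                     mode='connectivity',
                     include_self=False
                     )
B.toarray()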
Next, we mine the groups by extracting the maximal connected subgraphs (connected components):
import networkx as nx
import matplotlib.pyplot as plt
# Convert the adjacency matrix into graph format
# (on networkx < 3.0 this function is called nx.from_numpy_matrix)
G = nx.from_numpy_array(A.toarray())
# Find all connected subgraphs
com = list(nx.connected_components(G))
# Organize the list of sets into tabular format
df_com = pd.DataFrame()
for i in range(0, len(com)):
    d = pd.DataFrame({'group_id': [i] * len(com[i]), 'user_id': list(com[i])})
    df_com = pd.concat([df_com, d])
# View the results
df_com
    group_id  user_id
0          0        0
1          0        1
2          0        2
3          0        3
4          0        4
5          0        5
0          1        6
1          1        7
2          1        8
3          1        9
4          1       10
0          2       11
1          2       12
2          2       13
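A scaling note before moving on: A.toarray() densifies the adjacency matrix, which is fine at this size but wasteful on large datasets. networkx can consume the sparse matrix directly. A minimal sketch, assuming a recent networkx (older versions call it nx.from_scipy_sparse_matrix):
# Build the graph straight from the sparse adjacency matrix, skipping toarray()
G_sparse = nx.from_scipy_sparse_array(A)
print(nx.number_connected_components(G_sparse))   # 3, the same groups as before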
We now have each sample's group assignment. To make it more intuitive, the next step is visualization; it is a bit involved, so study it carefully.
# Restore the group clustering data back to the original data
# (df_com's user_id is the row index of data, so align group_id by user_id)
group_id = df_com.sort_values('user_id')['group_id'].values
data['group_id'] = pd.Series(group_id)
pd.Series(group_id).value_counts().head(20)
# View a specific group
data[data['group_id'] == 0].sort_values(by='Company Name')
Company Name Company Address group_id
1 Kunming Kejiaya Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 2011 No. 1-PL 0
2 Kunming Wanyuelan Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 114 No. 1-PL 0
3 Kunming Jizixin Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 233 No. 1-PL 0
4 Kunming Xuntufa Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 310 No. 1-PL 0
5 Kunming Xinminfan Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 395 No. 1-PL 0
6 Kunming Luhongjin Department Store Yunnan Province Kunming City Panlong District Ciba Street Baiyang Road 355 No. 1-PL 0
# Extract the text of the group to visualize
group_index = data[data['group_id'] == 0].index
comment_group = data.loc[group_index, 'Company Address'].reset_index(drop=True)
labels = dict(comment_group)
labels
# Extract the tf-idf rows of the same group
tfidf_group = tfidf_matrix.toarray()[group_index]
# Recalculate the connectivity matrix
A = radius_neighbors_graph(tfidf_group,
                           radius=0.99,
                           mode='connectivity',
                           include_self=False
                           )
# Visualize the address data
import networkx as nx
import matplotlib.pyplot as plt
# Convert matrix format to graph format (nx.from_numpy_matrix on networkx < 3.0)
G = nx.from_numpy_array(A.toarray())
## Solve the issue of displaying Chinese characters in the graph
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['axes.unicode_minus'] = False
# Color settings
colors = ['r', 'c', 'y', 'c'] * 1000
#colors = ['#008B8B', 'r', 'b', 'orange', 'y', 'c', 'DeepPink', '#838B8B', 'purple', 'olive', '#A0CBE2', '#4EEE94'] * 500
colors = colors[0:len(G.nodes())]
# Layout options: spring_layout (used here) or kamada_kawai_layout
plt.figure(figsize=(4, 4), dpi=400)
nx.draw_networkx(G,
                 pos=nx.spring_layout(G),
                 node_color=colors,
                 labels=labels,
                 node_size=500,
                 font_size=5,
                 width=0.2,
                 alpha=1)
plt.axis('off')
plt.show()
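The layout comment above mentions kamada_kawai_layout as an alternative; swapping it in is a one-line change (it requires scipy installed) and often spreads tight cliques more evenly. The same plot under that assumption:
# Same plot with the Kamada-Kawai layout instead of spring_layout
plt.figure(figsize=(4, 4), dpi=400)
nx.draw_networkx(G,
                 pos=nx.kamada_kawai_layout(G),
                 node_color=colors,
                 labels=labels,
                 node_size=500,
                 font_size=5,
                 width=0.2,
                 alpha=1)
plt.axis('off')
plt.show()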
Visualization Result: data['group_id'] == 0
We can also look at the visualizations of the other groups.
Visualization Result: data['group_id'] == 1
Visualization Result: data['group_id'] == 2
That's basically it for this article. As you can see, for any given address we can find its neighboring addresses, which makes the approach well suited to group mining. The algorithm does have limitations, though: it doesn't scale cheaply, so the data volume can't be too large. The dataset used here is small; with more data, the mined graph becomes much richer. For example, the course walks through a similar group-mining case built on comment data.