Algorithm(s) to match names – Thoughts

Problem Statement:

As part of a lyrical analysis problem, we downloaded artist names from two different sources. Source A has 90,000+ records and Source B has 11,000+ records. With the goal of matching names from Source A with Source B, here are some of the approaches we took:

Pandas (pd.merge): Join both sources on their name fields.

import pandas as pd

source_a = pd.read_csv('data/source_data_a.csv')
source_b = pd.read_csv('data/source_data_b.csv')

# When both DataFrames have the same column name
data = pd.merge(source_a, source_b, how='left', on=['name'])

# When the DataFrames have different column names
data = pd.merge(source_a, source_b, how='left', left_on=['name_a'], right_on=['name_b'])

# When other (non-key) columns share a name in both DataFrames, pandas keeps them
# unique by adding suffixes (_x and _y by default).
# You may choose to use DataFrame.drop or DataFrame.rename afterwards, depending on your needs.

Because this is an exact match, it produced only around 300 matches, though with 100% accuracy. At least it reduced the number of records left to process in the next step.
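
To carry the unmatched records into the next step, the merge can also flag which rows found a partner. This is a minimal sketch, assuming small made-up DataFrames in place of the CSV files and a shared name column. The indicator=True option of pd.merge adds a _merge column to the result:

```python
import pandas as pd

# Made-up sample data standing in for the two CSV files (assumption)
source_a = pd.DataFrame({'name': ['Adele', 'The Beatles', 'Beyonce', 'Bob Dylan']})
source_b = pd.DataFrame({'name': ['Adele', 'Beatles, The', 'Bob Dylan'], 'id_b': [1, 2, 3]})

# indicator=True adds a '_merge' column; for a left join it is 'both' or 'left_only'
merged = pd.merge(source_a, source_b, how='left', on='name', indicator=True)

# Exact matches are done; the remaining names go on to the next matching step
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
unmatched = merged.loc[merged['_merge'] == 'left_only', ['name']]

print(len(matched), 'exact matches,', len(unmatched), 'names left to process')
```

Note that 'The Beatles' and 'Beatles, The' do not match here, which is the kind of case an exact join cannot catch.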

Python List Intersection: Another approach is to convert all the words in a name into a set and compare it with the word sets of the names in the other source.

def intersection(list_to_compare=None, list_set_to_compare_with=None):
    # Avoid mutable default arguments
    list_to_compare = list_to_compare or []
    list_set_to_compare_with = list_set_to_compare_with or []

    # For each sublist, keep only the elements that also occur in list_to_compare
    # (the result has the same length as list_set_to_compare_with)
    result_list = [[x for x in sublist if x in list_to_compare] for sublist in list_set_to_compare_with]

    # Get the longest list of matched elements (empty if there is nothing to compare with)
    result_list_max = max(result_list, key=len, default=[])

    # Get the indexes of all sublists whose matched elements equal result_list_max
    index_list = []
    if len(result_list_max) != 0:
        index_list = [index for index, row in enumerate(result_list) if row == result_list_max]

    return index_list, len(result_list_max)

The above process took a long time to run, and the results were not very encouraging for this problem.
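
Some of the run time may come from testing membership with `x in list`, which scans the list on every check. Below is a sketch of a set-based variation of the same idea. The function name best_word_overlap and the sample names are hypothetical. Lowercasing and stripping punctuation are assumptions that the original function does not make. It also ranks candidates by the number of shared words rather than by comparing the matched lists themselves:

```python
import re


def words(name):
    """Split a name into a set of lowercase words, ignoring punctuation."""
    return set(re.findall(r'\w+', name.lower()))


def best_word_overlap(name, candidates):
    """Return (indexes of the candidates sharing the most words with name, shared word count)."""
    name_words = words(name)
    # Number of words each candidate shares with name (set intersection)
    overlaps = [len(name_words & words(candidate)) for candidate in candidates]
    best = max(overlaps, default=0)
    if best == 0:
        return [], 0
    return [index for index, count in enumerate(overlaps) if count == best], best


candidates = ['Beatles, The', 'The Rolling Stones', 'Bob Dylan']
print(best_word_overlap('The Beatles', candidates))
```

Every name in Source A is still compared with every name in Source B, so this only lowers the cost of each comparison, not the number of comparisons.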

Fuzzy Match: We came across the FuzzyWuzzy library, which has a few good methods for matching a string against another string or a list of strings. process.extractOne() is particularly useful: it returns the best-matching string in a list together with a match score. Other techniques that can be used for name matching include regular expressions, phonetic algorithms, NLTK, etc.
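
The "best match plus score" idea can be sketched with Python's standard-library difflib. This is a stand-in for illustration, not FuzzyWuzzy's implementation. The helper best_match is hypothetical, and its 0 to 100 score is simply SequenceMatcher.ratio() scaled, which should not be assumed to equal FuzzyWuzzy's scores:

```python
from difflib import SequenceMatcher


def best_match(query, choices):
    """Return (best matching choice, score from 0 to 100), or None if choices is empty."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio()))
        for choice in choices
    ]
    # Highest score wins; on a tie the earliest choice is kept
    return max(scored, key=lambda pair: pair[1], default=None)


choices = ['Bob Dylan', 'Bob Marley', 'Bon Jovi']
print(best_match('Bob Dylon', choices))
```

With FuzzyWuzzy installed, process.extractOne(query, choices) would play the role of best_match here. A score threshold would still be needed to decide which matches are trustworthy enough to keep.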

Since the artist name matching had to be accurate for the problem we were solving, we didn’t proceed further with name-matching algorithms.
