As part of a lyrical analysis problem, we downloaded artist names from two different sources. Source A has 90,000+ records and Source B has 11,000+ records. With the goal of matching names from Source A against Source B, here are some of the approaches we took:
Pandas (pd.merge): Join the two sources by matching on their name fields.
import pandas as pd
source_a = pd.read_csv('data/source_data_a.csv')
source_b = pd.read_csv('data/source_data_b.csv')
# When both dataframes have the same column name
data = pd.merge(source_a, source_b, how='left', on=['name'])
# When the dataframes have different column names
data = pd.merge(source_a, source_b, how='left', left_on=['name_a'], right_on=['name_b'])
# Other columns that share a name in both dataframes are made unique automatically (suffixes _x and _y).
# You may choose to use DataFrame.drop or DataFrame.rename depending on your needs.
Because this is an exact match, it produced only around 300 matches, though with 100% accuracy. At least it helped reduce the number of records left to process.
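One way to separate the exact matches from the names still left to process is the indicator option of pd.merge. The snippet below is a minimal sketch on tiny made-up frames, not the real data; the artist names and the genre column are placeholders chosen for illustration.

```python
import pandas as pd

# Tiny stand-in frames (made-up values); the real data would come from the two CSV files.
source_a = pd.DataFrame({"name": ["Adele", "Beyonce", "Coldplay"]})
source_b = pd.DataFrame({"name": ["Adele", "Coldplay", "Drake"],
                         "genre": ["pop", "rock", "rap"]})

# indicator=True adds a `_merge` column: "both" for matched rows,
# "left_only" for rows of source_a that found no partner in source_b.
merged = pd.merge(source_a, source_b, how="left", on=["name"], indicator=True)

matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
unmatched = merged[merged["_merge"] == "left_only"][["name"]]

print(len(matched), "exact matches;", len(unmatched), "names left to process")
```

The unmatched frame is then the smaller input for whatever looser matching comes next.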
Python List Intersection: Another approach is to convert all the words in a name into a set and compare it against the words of each candidate name.
# The function name and signature are assumed; the original snippet showed only the body.
def find_best_matches(list_to_compare, list_set_to_compare_with):
    # Build a list (same length as list_set_to_compare_with) holding the matched words of each sub-list
    result_list = [list(filter(lambda x: x in list_to_compare, sublist)) for sublist in list_set_to_compare_with]
    # Get the longest sub-list of matched words
    result_list_max = max(result_list, key=len)
    # Collect the indexes of every sub-list equal to result_list_max
    index_list = []
    if len(result_list_max) != 0:
        index_list = [index for index, row in enumerate(result_list) if row == result_list_max]
    return index_list, len(result_list_max)
The above process took a lot of time to run, and the results were not very exciting for the problem at hand.
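For comparison, here is a small sketch of the same idea using actual Python sets, where membership tests are cheaper than scanning a list. It is a variation, not the code we ran: the function name best_word_overlap is hypothetical, lowercasing the words is an added assumption, and ties are decided by the number of shared words rather than by identical matched-word lists.

```python
def best_word_overlap(name, candidates):
    """Return (indexes of the candidates sharing the most words with `name`,
    number of shared words). Hypothetical helper for illustration."""
    name_words = set(name.lower().split())
    # Size of the word-set intersection between the name and each candidate
    overlaps = [len(name_words & set(c.lower().split())) for c in candidates]
    best = max(overlaps, default=0)
    if best == 0:
        return [], 0
    return [i for i, n in enumerate(overlaps) if n == best], best


# Made-up candidate names for illustration
candidates = ["The Rolling Stones", "Rolling Blackouts", "Stone Temple Pilots"]
index_list, shared = best_word_overlap("Rolling Stones", candidates)
print(index_list, shared)
```

Word overlap of this kind still cannot tell "Stone" from "Stones", which is one reason the results stay rough.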
Fuzzy Match: We came across the FuzzyWuzzy library, which has a few good methods for matching a string against another string or against a list. process.extractOne() returns the best-matching string from a list along with a match score, which seems very useful. Other techniques that can be used for name matching: regular expressions, phonetic algorithms, NLTK, etc.
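To give a feel for what a best-match-with-score lookup does, here is a rough stand-in built on the standard library's difflib rather than FuzzyWuzzy itself. The helper extract_one is hypothetical and only imitates the shape of the result (a string plus a 0 to 100 score); its scores should not be assumed to equal FuzzyWuzzy's.

```python
from difflib import SequenceMatcher


def extract_one(query, choices):
    """Return (best_choice, score) with score on a 0-100 scale,
    or None when `choices` is empty. Hypothetical stand-in helper."""
    best = None
    for choice in choices:
        # ratio() is a similarity in [0, 1]; scale it to 0-100
        ratio = SequenceMatcher(None, query.lower(), choice.lower()).ratio()
        score = round(100 * ratio)
        if best is None or score > best[1]:
            best = (choice, score)
    return best


# Made-up artist names for illustration
choices = ["Beyoncé", "Bon Jovi", "Bob Dylan"]
match, score = extract_one("Beyonce", choices)
print(match, score)

# With FuzzyWuzzy installed, the comparable call would be:
#   from fuzzywuzzy import process
#   process.extractOne("Beyonce", choices)
```

A score threshold would still be needed to decide which matches to trust, and picking it is a judgment call that depends on how costly a wrong match is.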
Since artist name matching needed to be accurate to solve our problem, we didn't proceed further with name-matching algorithms.