As part of our lyrical analysis, our goal was to collect song lyrics for 8000+ pre-collected song titles, along with the artist information. Here are the attempts we made on the way to a workable solution:
1st Attempt: While looking for sites to scrape lyrics from, we came across one site (referred to as AZ-1 from here onward) that gave good results for some of the songs when tested manually. Based on this outcome, we wrote logic to scrape lyrics from AZ-1. To test the logic, we ran it on the first 100 songs and found that some of the lyrics were missing on AZ-1. As the program is expected to collect lyrics on an ongoing basis, 100% accuracy is a must, so we decided to investigate better approaches.
2nd Attempt: Since a single source, i.e. AZ-1, could not give us 100% accuracy, we started looking for other sources by manually searching for the songs whose lyrics we had not found in our first attempt. In the process, we identified 6 more sites in the top 10 results of the search engine we used. We then wrote parsing logic for all 6 of them and ran it on the first 500 songs. It took a long time to execute, because the logic was built to try AZ-1 first, then AZ-2 if nothing was found, then AZ-3, and so on.
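The fallback chain described above can be sketched roughly as follows. This is only an illustration, not our production code: the names fetch_with_fallback, scrape_az1 and scrape_az2 are hypothetical, and the two scrapers are stand-ins that return canned values instead of making HTTP requests. It also shows why this approach is slow: in the worst case every site is tried for every song.

```python
from typing import Callable, Optional, Sequence

# A scraper takes (title, artist) and returns the lyrics,
# or None when the site has no match for the song.
Scraper = Callable[[str, str], Optional[str]]


def fetch_with_fallback(
    title: str, artist: str, scrapers: Sequence[Scraper]
) -> Optional[str]:
    """Try each site in order and return the first lyrics found, else None."""
    for scrape in scrapers:
        try:
            lyrics = scrape(title, artist)
        except Exception:
            # A failed request or parse error counts as "not found" here,
            # so the next site in the chain still gets a chance.
            continue
        if lyrics:
            return lyrics
    return None


# Hypothetical stand-ins for the real site parsers (AZ-1, AZ-2, ...).
def scrape_az1(title: str, artist: str) -> Optional[str]:
    return None  # pretend AZ-1 does not have this song


def scrape_az2(title: str, artist: str) -> Optional[str]:
    return f"lyrics of {title} by {artist}"


if __name__ == "__main__":
    print(fetch_with_fallback("Some Song", "Some Artist", [scrape_az1, scrape_az2]))
```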
3rd Attempt: Better performance is always a key factor to consider, which pushed us to think differently. We decided to automate the manual process: use a search engine to find the right link for each song, then apply the parsing logic written during the 1st and 2nd attempts for whichever site the link points to. We ran it for 8000+ songs and the accuracy was 99.5%. We added a few more sites to our list to handle the remaining 0.5%. Cool!
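A minimal sketch of this idea, under some assumptions: the search step is left out (the function simply receives the result URLs, however they were obtained), the domains az-1.example and az-2.example are placeholders for the real sites, and the names pick_parser and lyrics_from_search are made up for illustration. The point is that the domain of each search result selects the parser, so only sites that actually appear in the results are visited.

```python
from typing import Callable, Dict, Iterable, Optional
from urllib.parse import urlparse

# A parser takes a page URL and returns the lyrics, or None if parsing fails.
Parser = Callable[[str], Optional[str]]


def pick_parser(url: str, parsers: Dict[str, Parser]) -> Optional[Parser]:
    """Return the parser registered for the URL's domain, if there is one."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return parsers.get(host)


def lyrics_from_search(
    result_urls: Iterable[str], parsers: Dict[str, Parser]
) -> Optional[str]:
    """Walk the search results in rank order and parse the first known site."""
    for url in result_urls:
        parser = pick_parser(url, parsers)
        if parser is None:
            continue  # a site we have no parsing logic for
        lyrics = parser(url)
        if lyrics:
            return lyrics
    return None


if __name__ == "__main__":
    # Placeholder parsers; real ones would download and parse the page.
    parsers = {
        "az-1.example": lambda url: None,
        "az-2.example": lambda url: "la la la",
    }
    urls = [
        "https://unknown.example/some-song",
        "https://www.az-2.example/lyrics/some-song",
    ]
    print(lyrics_from_search(urls, parsers))
```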
The code is in production now, running on a weekly basis and getting lyrics for all the new songs that are added every week. Knowing that we may not always find lyrics on those 10 sites, we added a mechanism to notify stakeholders of any failed attempt to scrape lyrics, and made it easy to add more sites.
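One way such a setup could look is sketched below. Everything here is an assumption made for illustration: the register decorator, the PARSERS registry, the az-1.example domain and the run_batch function are hypothetical names, and notify is any callable you supply (an email sender, a chat webhook, and so on). Adding a site then means writing one function and decorating it, while failures are collected and reported once per run.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Song = Tuple[str, str]  # (title, artist)

# Registry of site parsers, keyed by domain (hypothetical layout).
PARSERS: Dict[str, Callable[[str], Optional[str]]] = {}


def register(domain: str):
    """Decorator that adds a parser for a new site to the registry."""
    def decorator(func):
        PARSERS[domain] = func
        return func
    return decorator


@register("az-1.example")  # placeholder domain
def parse_az1(url: str) -> Optional[str]:
    return None  # a real parser would download and parse the page


def run_batch(
    songs: Iterable[Song],
    fetch: Callable[[str, str], Optional[str]],
    notify: Callable[[List[Song]], None],
) -> Tuple[Dict[Song, str], List[Song]]:
    """Fetch lyrics for each song; report all failures in one notification."""
    found: Dict[Song, str] = {}
    failed: List[Song] = []
    for title, artist in songs:
        try:
            lyrics = fetch(title, artist)
        except Exception:
            lyrics = None  # treat errors the same as "no lyrics found"
        if lyrics:
            found[(title, artist)] = lyrics
        else:
            failed.append((title, artist))
    if failed:
        notify(failed)
    return found, failed


if __name__ == "__main__":
    songs = [("Song A", "Artist X"), ("Song B", "Artist Y")]
    fake_fetch = lambda t, a: "la la la" if t == "Song A" else None
    run_batch(songs, fake_fetch, lambda failed: print("No lyrics for:", failed))
```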
There is always a better solution, as long as we don't treat the last one as perfect! Could we consider Apache Nutch here? Probably a candidate for the next blog post, if we do.