After over a decade of writing software professionally… sometimes things get boring. I’m thankful that my job at SBLive is actually pretty interesting, but at some level my day to day work follows a pretty consistent rhythm. Not long ago though, a coworker hit me up about a tiny project that sounded pretty fun.
I have a very random, non work related, question for you lol. I downloaded my Spotify listening history and I’m searching for a Chinese song that I found while I was living there. I have no idea what the title of the song is, but I know it’s in Mandarin. How could I take my data set and search for either Mandarin or non-English language titles? 😂
My initial response was that I bet ChatGPT could answer the question for him. I tried to just add the file into the assistants playground and ask it do the heavy lifting… but the results weren’t great.
I poked a little bit more, but at the step where I was trying to figure out how to vectorize and search across it more easily, I realized that I was drastically overcomplicating things. A quick bit of Googling later, I ended up slapping together this quick little script:
import jsonfrom langdetect import detect
def load_and_map_music_history(file_path): with open(file_path, 'r') as file: music_data = json.load(file)
unique_tracks = {}
for entry in music_data: track_name = entry["master_metadata_track_name"] artist_name = entry["master_metadata_album_artist_name"] album_name = entry["master_metadata_album_album_name"] connection_country = entry["conn_country"]
language = "unknown" try: track_language = detect(track_name) artist_language = detect(artist_name) album_language = detect(album_name)
except: language = "unknown"
if track_name not in unique_tracks: if ('zh' in track_language or 'zh' in artist_language or 'zh' in album_language): unique_tracks[track_name] = { "artist_name": artist_name, "album_name": album_name }
return unique_tracks
track_info_dict = load_and_map_music_history('music_history.json')
for track, info in track_info_dict.items(): print(f"{track} by {info['artist_name']} from the album {info['album_name']}")
I didn’t try to optimize it at all, it’s the quickest and dirtiest solution that I could make happen. And it even works! Here’s the output:
Why Nobody Fights by Hua Chen Yu from the album 卡西莫多的禮物二愣子 by Jordan Chan from the album 抱一抱經典 (feat. 懂伯 & DJ Premier) by 大支 from the album 不聽還是覺得你最好 b
It was a nice little reminder that sometimes it’s okay to just hack something together and call it a day.
The Robots Probably Still Win
OpenAI has made some upgrades since I initially played around with this, and it’s worth pointing out that it definitely can solve the problem without needing to lift a finger. I’m not sure whether my script or OpenAI is more correct.