Working on the algorithm for Jamendo’s “similar albums”
I'm currently working on a big rewrite of the algorithms behind Jamendo's "Similar albums". The new algorithm will focus only on tags because there will be both "Similar albums" and "Related albums" (where I'll put recommended stuff). I was really slow this morning while thinking about it so I decided to blog the whole process in real time ;-)
Let's begin! The problem is, we have a list of weighted tags for each album and we want to compute for each album the list of album that have the most similar tags list.
Here is a sample tag/weight list for the Manu Cornet album:
jazz 19.8494 latino 18.4662 bossa 16.4924 contrebasse 16.4924 sensible 16.4924 world 16 basse 15.6844 hamonica 15.6844 batterie 15.6844 tamtam 15.6844 improvisation 15.6844 piano 8.48528 easylistening 7.34847 ambient 6.63325
Let's call this base album "A1" and the similar albums we're looking for "A2". We can consider that "20" is the maximum tag weight.
A quick list of requirements for the algorithm :
- Shouldn't be symmetric
- Should penalize A2s if they lack one or more of the most important tags of A1
- Should be quite fast, even if we're not going to run it on the fly (everything will be computed once a day)
Martin told me yesterday about the nice app "Grapher" which is bundled with OS X. I'm going to use it to write the formula.
My first idea was, for each tag of A1, to compute a number Z between -1 and 1 and add them all. Let (x,y) be the weight of the tag in A1 and A2 respectively. The most obvious algorithm would be :
The axis on the left is Y, on the right it's X and the elevation is of course Z. The window is (0,20),(0,20),(-20,20)
Easy improvement : multiply by x/20
Quite good but there is a problem over the X axis. If y=0, z should always be < 0. Another easy solution : add min(0,y-x)
New problem : x=20, y=10. In these conditions z shouldn't be nul. Let's multiply min(0,y-x) by (1-y/20).
Better, but not good enough yet. (1-y/20)^4 ?
Well, it looks quite good to me! Maybe we could tweak some curves a bit but let's go for this one for now.
Here are the results for the Manu Cornet album :
id weight_sum ----------------------- ------------------ 2336 89.7347740595245 608 84.9463957742962 2013 79.2737167470594 2239 78.7533019317343 1968 78.3494367998522 3552 78.2419321761695 2626 77.4643574361218 868 76.3107863971299 (query took 1.2 seconds)
I am not so happy with the results. Maybe there need to be another negative zone for tags that are in A2 but not in A1 after all.
Let's reduce the window size to (0,1),(0,1),(-1,1) and take those "20" factors out. Now much easier, here is what I've come up :
Looks good too but let's compare the results :
id weight_sum ------ ------------------ 608 4.49465968951514 2336 4.32499417058527 2013 3.74828022088347 3552 3.49127205825584 2626 3.44471851383391 2407 3.35824086666307 2239 3.3511421181151 1968 3.27891762132192 868 3.18629627915296 1013 3.18123969513814 1741 3.11171873935927 (query took 1.2 seconds)
Quite funny that both queries take the same time to process: the formula doesn't seem to be the bottleneck. Results are a bit better but I think there will still be some tweaking needed in the future. Maybe reduce the negative zones.
Anyway, now I'm going to implement this and the new data should be online in a few minutes. Don't forget to check out the "Similar albums" slider on your favourite albums pages!