| Author |
Message |
Bircoph Apprentice


Joined: 27 Jun 2008 Posts: 241
|
Posted: Tue Jun 12, 2012 9:36 pm Post subject: [SOLVED] Fuzzy audio search |
|
|
Hello,
I need to find similar audio files in a large collection (about 60,000 files for now). By similar I mean files that are not identical in a byte-by-byte comparison of the audio payload but sound the same to a human listener: tracks may differ in bitrate, vary slightly in length, carry somewhat different noise, or simply be encoded with different codecs or codec options. Titles and other tags are inadequate for this task, because for the same track they may be in different languages or locales, utterly broken by repeated incorrect re-encoding, or absent altogether.
I failed to find any suitable application for Linux: dupeGuru compares by tags and requires an exact match for payload comparison; other tools are much the same, with variations. Any tool must also have a CLI, because the computational part of this task is rather heavy and the workload is distributed across a local cluster.
For now I have decided to build a toolset myself from available tools. The audio data is presorted into groups by directory, and the probability of a cross-match between files from different groups is considered negligible. The hardest part is to formalize which data are similar enough and which are not. My approach is based on cascading filters. The search is still partially manual (I doubt it can be fully automated without AI technologies) and consists of the following steps:
1) app-misc/fdupes is used to find exact byte-by-byte duplicates; this also hints at directories that duplicate each other, though a small number of matches require manual review.
2) An MPlayer-based gawk/bash script creates a list of all audio files with the track length and bitrate of each file. MPlayer is used because it supports dozens of audio formats, and many of them are present in the data.
3) A gawk/bash script creates filtered lists for each data group: within each group, files whose lengths differ from other files by only a few seconds are placed into close-length subgroups.
4) For each group of candidate duplicates I build a spectrogram of each file using sox and compare the spectrograms using ImageMagick. To speed up the process, a reasonable slice is used instead of the full file. But here problems arose:
4a) Audio data needs to be resampled to the lowest sample rate in the set, which takes time even for slices. Otherwise the spectrograms differ significantly between low and high bitrates.
4b) It is rather hard to find an appropriate and reasonably fast comparison scheme. Of the metrics ImageMagick supports, MSE (mean squared error) between two images works best for my task. The sox images also need some conversion before comparison. ImageMagick performs too many unnecessary operations when used through its CLI tools (e.g. I don't need a diff image), so I'm now writing a small C application using the ImageMagick API; it appears to be at least 10 times faster.
5) As the last step, the candidates found and filtered will still need some manual review.
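Steps 3 and 4b can be sketched roughly as follows. This is only an illustration in Python, not the gawk/bash scripts or the C application described above: the function names, the 3-second tolerance and the representation of an image as a flat list of grayscale pixel values are all assumptions made for the example. Note that the grouping chains neighbours, so a subgroup may span more than the tolerance from its shortest to its longest track.

```python
def close_length_groups(tracks, tolerance=3.0):
    """Group (path, length_seconds) pairs into close-length subgroups.

    Tracks are sorted by length; a new subgroup starts whenever the gap to
    the previous track exceeds `tolerance` (an assumed value). Subgroups
    with a single file are dropped, since they have nothing to compare with.
    """
    groups, current = [], []
    for path, length in sorted(tracks, key=lambda t: t[1]):
        if current and length - current[-1][1] > tolerance:
            if len(current) > 1:
                groups.append([p for p, _ in current])
            current = []
        current.append((path, length))
    if len(current) > 1:
        groups.append([p for p, _ in current])
    return groups


def mse(pixels_a, pixels_b):
    """Mean squared error between two equally sized grayscale images,
    each given as a flat sequence of pixel values (hypothetical format)."""
    if len(pixels_a) != len(pixels_b) or not pixels_a:
        raise ValueError("images must be non-empty and of equal size")
    return sum((a - b) ** 2 for a, b in zip(pixels_a, pixels_b)) / len(pixels_a)


if __name__ == "__main__":
    tracks = [("a.mp3", 180.0), ("b.ogg", 181.5), ("c.flac", 240.0),
              ("d.mp3", 183.0), ("e.wav", 300.0)]
    print(close_length_groups(tracks))
    print(mse([0, 0, 0, 0], [0, 2, 0, 2]))
```

Only files that end up in the same subgroup would be passed on to the much more expensive spectrogram comparison; a lower MSE means more similar images, and the cut-off below which two files count as candidates has to be tuned on real data.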
If someone can point me to a Linux application that does this kind of fuzzy comparison, I'd be grateful. I don't want to reinvent the wheel, but I can't find one right here and now. If there are any suggestions on how to speed up the toolchain above, I'm listening too. _________________ Per aspera ad astra!
Last edited by Bircoph on Wed Jun 13, 2012 9:41 am; edited 1 time in total |
|
|
 |
djdunn l33t


Joined: 26 Dec 2004 Posts: 617 Location: Under the moon and all the stars in the sky.
|
Posted: Tue Jun 12, 2012 11:47 pm Post subject: |
|
|
Something like MusicBrainz? They do acoustic fingerprinting. _________________ Now, with penguins, (cuddly such), "contented" means it has either just gotten laid, or it's stuffed on herring. Take it from me, I'm an expert on penguins, those are really the only two options.
--Linus Torvalds |
|
|
 |
Bircoph Apprentice


Joined: 27 Jun 2008 Posts: 241
|
Posted: Wed Jun 13, 2012 9:40 am Post subject: |
|
|
Thanks,
while I don't want to submit my DB online, I found that media-libs/chromaprint does exactly what I need. _________________ Per aspera ad astra! |
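For readers wondering how two acoustic fingerprints can be compared locally, here is a minimal sketch in Python. It assumes the raw fingerprints are already available as lists of 32-bit integers (chromaprint's fpcalc tool can print them in raw form) and scores similarity by the fraction of differing bits. The function name is hypothetical, and the sketch ignores any time offset between the two recordings, which a real comparison would probably need to handle.

```python
def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two raw fingerprints.

    Each fingerprint is a sequence of 32-bit integers. Only the overlapping
    prefix is compared, so tracks of slightly different length still work.
    0.0 means identical over the overlap, 1.0 means every bit differs.
    """
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        raise ValueError("empty fingerprint")
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1")  # popcount of the XOR
        for a, b in zip(fp_a, fp_b)
    )
    return differing / (32 * n)


if __name__ == "__main__":
    print(bit_error_rate([0b1111, 0], [0b1110, 0]))
```

Which error rate still counts as "the same track" is an assumption to be tuned on known duplicates; no particular threshold is implied here.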
|
|
 |