Using Gawk and Wget to Resolve URL Shorteners
Jean’s post today points to a key problem in examining user activities on Twitter and elsewhere: people are increasingly using bit.ly and other URL shorteners. This means that a) the same target URL might appear in any number of different shortened versions, and b) a quick look at a list of URLs is no longer enough to pick out only those which point to a specific site (for example, YouTube videos).
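To make point b) concrete, here is a minimal sketch that tallies the hosts in a list of URLs, one per line. The function name count_hosts and the sample URLs are made up for illustration, and the awk program sticks to plain syntax that gawk should accept as well.

```shell
# Count how often each host appears in a list of URLs read from stdin.
# Hypothetical helper for illustration only.
count_hosts() {
    awk -F/ '
        # With "/" as the field separator, field 3 of "http://host/path" is the host.
        /^https?:\/\// { hosts[$3]++ }
        END { for (h in hosts) print hosts[h], h }
    ' | sort -k1,1nr -k2,2    # highest count first, then by host name
}

# Example run with invented URLs:
printf '%s\n' \
    'http://bit.ly/aaa111' \
    'http://bit.ly/bbb222' \
    'http://short.example/ccc333' \
    'http://www.youtube.com/watch?v=EXAMPLE' | count_hosts
```

On this sample, bit.ly comes out on top with a count of 2, while www.youtube.com is counted only once. Any YouTube videos hiding behind the shortened links are credited to the shortener rather than to YouTube, which is exactly why a host-based filter stops working.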
For our purposes, that’s a significant problem – we might want to find out, for example, which were the most popular videos shared during the election campaign, the most popular articles on abc.net.au, and so on. So, we need to resolve those shortened URLs back to their original state. This could be done through the APIs of the various shortening services, of course, but with literally hundreds of different shorteners now available, that would probably require specific unshortening scripts for each service – far too much work. So, what can we do?