Monday, February 22, 2010

Simple Filter to Extract Links from a Pidgin Log

I often trade political links via the Pidgin IM client with my friend Jeremiah. Last week, he had the idea that we should coauthor a blog about these links. To that end, I decided to harvest all of the links from my Pidgin logs. This one-liner does that:

grep http ~/.purple/logs/aim/yourimid/friendsimid/* | grep -v -E "content-type|funpic\.hu|funnyjunk" | sed -e "s/^.*href=\"//" -e "s/\">.*//" | grep -v "font color"

The first grep finds every line containing "http", which catches anything that looks like a link. The second drops lines matching content-type along with any sites you don't care about. You can extend that list by adding more "|sitename" clauses to the regex. The sed command scours off the HTML that Pidgin wraps around each link, and the last grep kills off some oddball lines that made it through the filters.
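If you expect to keep tweaking the exclusion list, it may be easier to pull it out into a variable. What follows is only a sketch of the same pipeline, one stage per line; the function name extract_links and the EXCLUDE variable are my own illustrative names, not anything Pidgin provides.

```shell
# Hypothetical wrapper around the pipeline above. EXCLUDE and
# extract_links are illustrative names, not part of Pidgin.

# Alternation of patterns to drop; append "|sitename" to exclude more.
EXCLUDE='content-type|funpic\.hu|funnyjunk'

extract_links() {
    # "$@": one or more Pidgin HTML log files
    grep http "$@" |                             # keep lines that contain "http"
        grep -v -E "$EXCLUDE" |                  # drop excluded patterns and sites
        sed -e 's/^.*href="//' -e 's/">.*//' |   # strip the markup around the URL
        grep -v 'font color'                     # drop oddball lines that still hold font markup
}
```

It would be called the same way as the one-liner, for example: extract_links ~/.purple/logs/aim/yourimid/friendsimid/*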

I'm sure this could all be made more efficient, but it did the job, and unless you have an enormous quantity of logs to search, it's efficient enough.
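For what it's worth, here is one way the pipeline might be trimmed. It is a sketch, not a benchmarked improvement, and it assumes a grep that supports the -o and -h options (GNU and BSD grep both do). The function name extract_links_fast is made up for illustration. Because -o prints each match on its own line, this version also picks up every link on a line, whereas the greedy sed expression in the original keeps only the last one.

```shell
# Hypothetical alternative; assumes grep supports -o and -h (GNU/BSD).
extract_links_fast() {
    # "$@": one or more Pidgin HTML log files
    grep -h -o -E 'href="https?://[^"]*"' "$@" |   # print only the href="..." matches, no filenames
        grep -v -E 'funpic\.hu|funnyjunk' |        # drop sites you don't care about
        sed -e 's/^href="//' -e 's/"$//'           # peel off href=" and the closing quote
}
```

Since only href attributes are matched, the content-type and "font color" filters are no longer needed in this variant.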
