Cleaning Up Blog Network Data with Gawk

Having done a fair amount of work with Twitter data over the past couple of months, I’m keen to get back to the other substantive strand of the first year of our ARC Discovery project on mapping online public communication in Australia: examining patterns of interaction within and across the Australian blogosphere.

This post will start off that process by exploring some of the methodological issues, and asking for some help on refining our methods from Gawk nerds along the way. What we’re building on with the blog mapping is our previous work with our fantastic colleagues from Sociomantic Labs in Berlin, who are also doing the data gathering for this new slice of research. We’ve outlined the basic approach of our blog mapping in some detail elsewhere already (also see the Publications section of this blog), but here’s a very quick summary:

1. Identify a large list of Australian(-based) blogs.
2. Follow their RSS feeds, and capture in full any new posts which appear.
3. Extract only the post text and embedded links from these posts (discard headers, footers, sidebars, blogrolls and other extraneous content).
4. Process text and/or links to generate keyword co-occurrence and/or hyperlink networks.

We’re still working through some of this process, so here and in my next post I’ll focus only on some aspects of this work.

First, some limitations: currently, we’re tracking some 8,500 Australian blogs (also including some mainstream sites which happen to have RSS feeds, by the way – these will be separated from ‘genuine’ blogs at a later stage). So that list is far from comprehensive yet, and we’ll be snowballing out from here to see what other blogs we might discover by following the links from our currently known network. We’re also still working on improving our post text extraction processes – a notoriously difficult problem if you want to do it reliably, automatically, and without slowing the overall process to a crawl. As we move towards tracking 10,000 blogs, for example, there’s no way we can still manually create page templates for each blog that tell the text extractor which part of the page is the blog post itself; similarly, we can’t build in too many clever page parsing algorithms unless we’re prepared to take a significant performance hit on the server that captures the blog posts.

For the moment, though, what I’m already keen to do is to further develop our methodology for dealing with the data which this process generates – in the first place, the network data. Part of what Sociomantic delivers to us is a comma-separated (or tab-separated) list of the blog posts which were captured, including various other details – such as blog name and URL, post URL, title, and timestamp, and (most importantly) a list of the links which were embedded in the post. (For now, though, while we’re still improving our post extraction techniques, this is simply a list of all links on the blog page, rather than of the links in the post text only.) So, I’m interested in processing this list into a link network file that I can visualise using Gephi.

But to do so requires some extra work with Gawk. Many of the links we’re finding will necessarily point to specific URLs (of the form http://domain.com/hierarchy/page.html), but visualising the network at this level of specificity is likely to be an exercise in futility, as it treats each blog post (with its own URL) as a separate entity, rather than operating at the level of the blog itself. While the most linked-to individual posts in the network might also be of interest at a later stage, for now I’m more interested in the networks of interlinkage at the level of blogs: three links from blog A to different pages on the same blog B should be counted as three links from A to B, rather than as three links from A to different entities altogether.

In other words, we need to condense the destination URL to its most meaningful part – dropping out any reference to the specific page that the link points to. As a first approximation this could be done by truncating the destination URL simply to the domain name itself: http://domain.com/hierarchy/page.html becomes http://domain.com/, for example. But this runs into trouble where multiple different blogs reside on the same server – in our current population of blogs, this is true for http://blogs.crikey.com.au/pollytics/ and http://blogs.crikey.com.au/contentmakers/, to pick just one example. Conflating both of them under the http://blogs.crikey.com.au/ banner is problematic, since these two blogs could have quite a different readership and might be part of different networks of interlinkage within the blogosphere.
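To make the problem concrete, here is a minimal sketch of that naive domain-only truncation, wrapped in a shell pipeline so that it can be run as is. The two post paths are invented for illustration, and only portable awk features are used, so it should behave the same under Gawk:

```shell
# Hypothetical post URLs on the two Crikey blogs mentioned above
truncated=$(printf '%s\n' \
    'http://blogs.crikey.com.au/pollytics/2010/07/some-post' \
    'http://blogs.crikey.com.au/contentmakers/2010/07/another-post' |
  awk '{
    # splitting on "/" gives parts[1] = "http:", parts[2] = "", parts[3] = host name
    split($0, parts, "/")
    print "http://" parts[3] "/"
  }')
echo "$truncated"
```

Both lines come out as http://blogs.crikey.com.au/, so the two blogs would end up as a single node in the network.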

To address this issue, what I’ve come up with is a Gawk script that will take each destination link and match it against the table of known blogs we’re following – starting with the most specific URLs (e.g. http://blogs.crikey.com.au/pollytics/) and proceeding towards less specific ones (e.g. http://blogs.crikey.com.au/) if no earlier match was found. Only if the least specific lookup fails as well will we proceed to a further stage of brute-force truncation that reduces the destination link to the domain name only.

I’m including the script below, in the hope that, while our dataset and data structure are necessarily unique to this research project, the script might still be useful for researchers facing similar issues in their own data. I’m also hoping for some feedback from researchers and developers who are working on similar projects: I’m not much of a programmer, and currently, the lookup process that this script performs can be quite slow for large quantities of data – in our case, each destination URL has to be matched against some 8,500 known blogs, which (for the over nine million links that my current dataset contains) takes quite some time! So, I’d be grateful for any advice which could speed up the Gawk script.
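One direction that might be worth exploring, offered here as an untested sketch rather than a drop-in replacement, is to store the known blogs as keys of an associative array and to test progressively shorter prefixes of each destination URL against it. Each test is then a single array lookup, so the cost per link would depend on the depth of the URL rather than on the number of known blogs. The function name lookup(), the array name known[] and the sample URLs are all made up for this illustration:

```shell
matches=$(awk '
# Return the longest known blog URL that is a path prefix of url, or "" if none.
function lookup(url,    candidate) {
    candidate = url
    do {
        if (candidate in known) return candidate
        # otherwise chop the last "/segment" off and try the shorter prefix
    } while (sub("/[^/]*$", "", candidate))
    return ""
}
BEGIN {
    # stand-in for the list loaded from bloglist.tsv (already standardised)
    known["http://blogs.crikey.com.au/pollytics"] = 1
    known["http://blogs.crikey.com.au"] = 1

    # hypothetical destination URLs
    tests[1] = "http://blogs.crikey.com.au/pollytics/2010/07/some-post"
    tests[2] = "http://blogs.crikey.com.au/contentmakers/about"
    tests[3] = "http://example.org/page.html"

    for (i = 1; i <= 3; i++) {
        hit = lookup(tests[i])
        print (hit == "" ? "unknown" : hit)
    }
}')
echo "$matches"
```

In this toy setup the first URL resolves to the pollytics blog, the second falls back to the less specific http://blogs.crikey.com.au entry, and the third is reported as unknown. Because prefixes are only ever cut at a slash, a known blog at http://domain.com would also not be matched by a link to http://domain.com.au.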

Here’s the script, then. It expects the data to be in the format that we receive from Sociomantic, of course, but it should be easy to change this to match similar formats; also, I’m currently using tab-separated values files (and the destination URLs are separated by semicolons), but those details are similarly easy to change. There’s also a command-line parameter called filter which is used to filter out blogs that match specific criteria which are listed in a separate column in our source file, but again, this could be removed easily if you’re adapting this script to your own purposes.
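As a rough illustration of that input layout, the following sketch reads one tab-separated record and splits its semicolon-separated link field into individual source and destination pairs. The two-column record is a deliberately simplified, hypothetical stand-in for the real twenty-column format:

```shell
# One made-up record: blog URL, a tab, then the links separated by semicolons
pairs=$(printf 'blog.example\thttp://a.example/x;http://b.example/y\n' |
  awk -F '\t' '{
    n = split($2, links, ";")          # n is the number of links found
    for (i = 1; i <= n; i++)           # a numeric loop keeps the original link order
        print $1 " -> " links[i]
  }')
echo "$pairs"
```

Adapting this to another delimiter should only require changing the -F value and the third argument of split().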

Most importantly, in addition to the CSV/TSV file which it processes, the script expects a second TSV file called bloglist.tsv to exist in the current directory; this file contains the list of known blogs and is loaded into the alphabetically sorted knownblog[] array during script initialisation by the BEGIN clause. Here, as well as when dealing with source and destination URLs in the link data, URLs are standardised to facilitate more reliable URL matching (and unclutter the visualisation later on): any occurrences of ‘www.’ are removed, all trailing slashes at the end of link URLs are removed, and all URLs are uniformly made to start with ‘http://’.
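The standardisation step can be sketched as a small function. Note that this version is an assumption-laden variant, not the script’s own code: the function name normalise() is invented, the patterns are anchored to the start of the URL, and any run of trailing slashes is removed. The example hosts are hypothetical apart from snurb.info:

```shell
normalised=$(printf '%s\n' \
    'http://www.snurb.info/' \
    'www.snurb.info' \
    'http://www1.example.com/blog/' |
  awk '
  function normalise(url) {
      sub("^http://", "", url)   # drop the scheme if it is present
      sub("^www[.]", "", url)    # drop a leading "www." (literal dot only)
      sub("/+$", "", url)        # drop trailing slashes
      return "http://" url       # every URL now starts with http://
  }
  { print normalise($0) }')
echo "$normalised"
```

The first two inputs end up as the same string, which is the point of the exercise, while the ‘www1’ host is left alone.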

Importantly, then, during the matching process the knownblog[] array is processed in descending order (from http://z… to http://a…), in order to make sure that more specific URLs are found first: alphabetic sorting of an array including http://domain.com/ and http://domain.com/blog/ will always place the more generic http://domain.com/ first, so matching from back to front reverses that order. If a match is found, the matching process is terminated without working through the rest of the knownblog[] array, so if http://domain.com/blog/ is a match, we never get to http://domain.com/ even though it would be a match as well.
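The effect of the ordering can be tried out on a toy list. In this sketch the shell’s sort -r stands in for the script’s sorted array walked from back to front, and the URLs are placeholders:

```shell
first_match=$(printf '%s\n' \
    'http://domain.com' \
    'http://domain.com/blog' \
    'http://another.org' |
  LC_ALL=C sort -r |
  awk -v url='http://domain.com/blog/2010/post.html' '
    # the list arrives in descending order, so the first prefix match
    # is also the most specific one; stop as soon as it is found
    index(url, $0) == 1 { print; exit }')
echo "$first_match"
```

Here http://domain.com/blog is reported, and the more generic http://domain.com is never reached.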

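UPDATE: I forgot to escape the dot in ‘www.’, which meant that ‘www.’ also matched things like ‘www1’ (in regular expressions, ‘.’ matches any one character…). Because the patterns in this script are written as quoted strings rather than as /regex/ constants, Gawk processes the backslash in the usual escape sequence for a literal dot, ‘\.’, as a string escape first and hands the regex engine a plain ‘.’ – so the simplest way to match a literal dot here is ‘[.]’ (a doubled backslash, ‘\\.’, would also work). Fixed now!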
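The difference between the two patterns is easy to check on a made-up host name:

```shell
# "www." as a regular expression: the dot matches any character, including the 1
loose=$(echo 'http://www1.example.com' | awk '{ sub("www.", ""); print }')
# "www[.]": the bracket expression only matches a literal dot
strict=$(echo 'http://www1.example.com' | awk '{ sub("www[.]", ""); print }')
echo "$loose"
echo "$strict"
```

The loose pattern mangles the URL into http://.example.com, while the bracketed version leaves it untouched.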

# linkprocess.awk - Create CSV with link network information
#
# this script takes a TSV archive of blog posts as produced by Sociomantic Labs
# the script also requires bloglist.tsv - a list of all currently tracked blogs
#
# expected data format:
# blog_categories, blog_feed_url, blog_gender, blog_id, blog_location, blog_notes, blog_num_posts, blog_tags, blog_title, blog_type, blog_url, post_date, post_file, post_folder, post_id, post_links, post_retrieval_date, post_tags, post_title, post_url
#
# script expects command-line argument filter={searchcriteria} _before_ the input CSV filename
# enclose the search term in quotation marks if it contains any special characters
# this is matched against the blog_tags column (usually to skip blogs marked 'remove')
#
# e.g.: gawk -F "\t" -f linkprocess.awk filter=remove posts.tsv >network.csv
#
# output format:
# source - source URL, reduced to blog's base URL (e.g. snurb.info),
# destination - original target URL, in full (e.g. snurb.info/node/1374),
# shortened - original target URL, reduced to blog's base URL (if known) or site's domain name (if not known)
# knownblog - 0 (unknown site) | 1 (known blog)
# date - post date, as given in the post_date column
#
# all URLs in the output are standardised: 'http://', 'www.' and a trailing slash are removed to reduce clutter
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	# load look-up list of known sites from bloglist.tsv into knownblog[] array
	# standardise blog URLs: remove any 'www.', add 'http://'

	i = 1
	getline < "bloglist.tsv"
	while(getline < "bloglist.tsv") {
		sub("http://", "", $3)
		sub("www[.]", "", $3)
		sub("/$", "", $3)
		knownblog[i] = "http://" $3
		i++
	}

	# list of known blogs is sorted and later processed in reverse order so that more specific URLs are found first

	blognum = asort(knownblog)

	print "source,destination,shortened,knownblog,date"
	getline
}

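# process only posts whose blog_tags column ($8) does not match the filter term
tolower($8) !~ filter {

	# standardise the source blog URL ($11): strip 'http://', 'www.' and a trailing slash
	sub("http://", "", $11)
	sub("www[.]", "", $11)
	sub("/$", "", $11)

	# post_links ($16) holds the destination URLs, separated by semicolons
	if(split($16, destinationurl, ";")) {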

		for(d in destinationurl) {
			sub("www[.]", "", destinationurl[d])
			sub("/$", "", destinationurl[d])
			orig[d] = gensub("http://", "", "g", destinationurl[d])
			found = 0
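			for(j = blognum; j >= 1; j--) {
				# match only at the start of the URL, so that a known blog's URL
				# embedded elsewhere in a link (e.g. in a query string) is not counted
				if(index(destinationurl[d], knownblog[j]) == 1) {
					destinationurl[d] = gensub("http://", "", "g", knownblog[j])
					found = 1
					break
				}
			}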

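			# no known blog matched: fall back to the bare domain name
			# (for 'http://domain.com/page', splitting on '/' puts the domain in domain[3])
			if(!found) {
				split(destinationurl[d], domain, "/")
				if(domain[3]) {
					destinationurl[d] = domain[3]
				} else {
					destinationurl[d] = ""
				}
			}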

			print $11 "," orig[d] "," destinationurl[d] "," found "," $12

		}
	}

}

Finally, then, the script produces a new CSV file which contains the following columns:

source: URL of the linking blog
destination: original destination URL of the link
shortened: destination URL of the link, reduced to the matched blog URL if a known blog was identified, or to the basic domain of the destination link if not
knownblog: a boolean flag that is set to 0 (destination site unknown) or 1 (destination site is a known blog from bloglist.tsv)
date: date of the linking post, taken from the post_date column of the source file

From here, it’s just a few short steps to producing a simple link network file that Gephi will be able to process. The simplest approach would be to use Gawk to extract source and shortened URLs into a new CSV file, like this:

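gawk -F , "{ print $1 \",\" $3 }" links.csv >linknetwork.csv

This simply takes columns 1 (source) and 3 (shortened) from the script output and dumps them into a second CSV file which can then be loaded into Gephi. (The double-quoted program shown here suits the Windows command prompt; in a Unix shell, enclose the program in single quotes instead, so that the shell doesn’t try to expand $1 and $3 itself.)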

If our interest is in visualising only the networks of links between the blogs we already know, ignoring any unknown sites, this could be modified as follows:

gawk -F , "$4 == 1 { print $1 \",\" $3 }" links.csv >linknetwork.csv

This filters out any links for which the ‘knownblog’ flag is set to 0 – that is, links whose destination did not match any of the known blogs listed in bloglist.tsv. Alternatively, using $4 == 0 would select only those destination URLs which are unknown at this point – not sure if that would make much sense for visualisation purposes, but what it does create is a list of potential additions to the list of known blogs, to further grow the sample population…
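If repeated links between the same pair of blogs should be collapsed into a single weighted edge, something along the following lines might do. This is only a sketch on invented sample rows in the script’s output format; the array name weight[] is hypothetical, and the final sort is there purely to make the output order predictable:

```shell
edges=$(printf '%s\n' \
    'source,destination,shortened,knownblog,date' \
    'a.example,b.example/post1,b.example,1,2010-07-01' \
    'a.example,b.example/post2,b.example,1,2010-07-02' \
    'a.example,c.example/page,c.example,0,2010-07-02' \
    'b.example,a.example/about,a.example,1,2010-07-03' |
  awk -F , '
    # skip the header row, keep links to known blogs, count each source/target pair
    NR > 1 && $4 == 1 { weight[$1 "," $3]++ }
    END { for (edge in weight) print edge "," weight[edge] }' |
  LC_ALL=C sort)
echo "$edges"
```

The two links from a.example to b.example come out as one line with a count of 2, and the link to the unknown site c.example is dropped.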

OK – so much for a first look under the hood, then. Again, I’d very much appreciate any feedback, especially on how to further improve the performance of the main script. In a later post today, I’ll show some of the early visualisation work-in-progress using these data…

Feature image from GNU Operating Systems.

Published by Snurb
