Easy Webpage Scraping with Python

To produce the tube station usage mashup I obtained the data from the TfL website. Unfortunately the data is not in an immediately usable format – rather than there being a CSV file to download, or a large HTML table, the data is presented as a separate webpage for each station and each year.

Luckily, Python makes it easy to get the data as a CSV file – although you do need to know a little Regex too, to extract the data you want. To construct the regular expressions needed, I used an excellent online tool, RegExr.

Once you have your regular expressions ready, you just use Python’s Urllib, RE and CSV libraries, and some loops, to download the webpages, get the data, and write it into a CSV file.

Here’s the script I used – note I’m using the back-slash character at the end of some lines below to indicate line continuation:

import urllib2, re, csv

stationnums = {2003:4, 2004:4, 2005:4, 2006:4,
2007:4, 2008:6}

addressPre = "http://www.tfl.gov.uk/tfl/corporate/
modesoftransport/tube/performance/entriesandexits.asp"

indRE = '.*?salign=right>([0-9]{1,9}?)</td>.*?'
totalRE = '.*?smillions)s=s([0-9.]{1,9}?)</strong>.*?'
nameRE = '.*?selected>(.*?)</option>'

resFile = open('results.csv', 'w')
resWriter = csv.writer(resFile, quoting=csv.QUOTE_MINIMAL)

for i in range(2003, 2009):
	for j in range(1, stationnums[i]+1):
		address = addressPre + "?id=" + str(j)
		+ "&agekey=" + str(i)
		html = urllib2.urlopen(address).read()
		indRes = re.findall(indRE, html)
		totalRes = re.findall(totalRE, html)
		nameRes = re.findall(nameRE, html)
		if len(nameRes) > 0:
			resWriter.writerow([i, j,
			nameRes[0], totalRes[0]]
			+ [e for e in indRes])
resFile.close()

Change the stationnums values for each year to 304 (except 2008, to 306) to get all the data.

Tube Stations in London – Visualisation

I was inspired by seeing this map and associated article on the New York Times website, linked from Going Underground, to create a similar mashup/visualisation of entry/exit volumes from the 300-odd tube stations in London. On their website, Transport for London provide the metrics for entries/exits from the stations, between 2003 and 2008, broken up into rush-hour, regular and weekend travel.

Each circle’s area is directly proportional to the flow numbers for that station (click on the circle to see the numbers.) The circles are rescaled between the first metric (total flows) and the rest, so direct comparison of metrics is possible except between the first and others, Blue circles represent an increase in flow and red a decrease.

If the mass of circles are obscuring each other, zoom in!

You can try it out here.

Some technical notes:

The background map is a custom render of OpenStreetMap data, with the tube lines highlighted in their traditional colour – it doesn’t always look quite “right” when you zoom in, due to the way the lines are tagged in my own copy of the OpenStreetMap database. The stations are even harder to disambiguate, so I’m using a free source from Wikimedia Commons, this means they don’t always line up.

Because your browser gets a copy of all the flow data when you load the page (yes I’ve heard of AJAX) it does run a little slowly in Internet Explorer, particularly the slider bars – these allow you to “drag” through the range of metrics or years.

Manchester Map Mashup

I’ve created a mashup of lots of maps of Manchester as a proof-of-concept of how easy it is to mashup using OpenLayers. It’s not particularly pretty but does involve lots of maps.

See it here.

The layers are:

  • OpenStreetMap
  • Ordnance Survey Street View
  • Ordnance Survey 1:25000 First Series (1959)
  • Ordnance Survey New Popular Edition (1948)
  • Marr Map of Housing Conditions (1904)
  • Swire Map of Manchester (1824)

The first four maps are all hosted on OpenStreetMap servers.
The Swire map also contains an inset, dated 1650!

GISRUK Navigation Challenge

This is the map GISRUK 2010 attendees are using to get from UCL, where the conference is, to the River Thames, where the boat await for the evening cruise. On the way, some of them are doing the challenge, which is to take the optimum route to visit any 6 of the 12 control points – a blue plaque at each one to prove their visit. The map was made using the OpenOrienteeringMap map builder.

[Download PDF]

(Note: The start point was actually from just east of “B” rather than the triangle.)

I haven’t yet computed the best route, I think it’s probably BAJFED or maybe BMCKED. There is no “trick” best route, as the points were fairly fixed by the locations of the blue plaques. But the solution is apparently not immediately obvious to the human eye.

Accuracy vs Completeness: OSM vs Meridian 2

[Updated x2] Yesterday’s Ordnance Survey OpenData launch has provided the OpenStreetMap community with a potentially rich set of data to use to complete the map of Great Britain. OpenStreetMap’s accuracy and detail is generally excellent, however a problem which is (very arguably) more important than either accuracy or detail, in a map is that some parts of the country are substantially incomplete.

It’s not that the data quality is poor, it’s that someone with a GPS (or a satellite photo) has never been to that part of the country to gather the data in the first place. There are still significant parts of Scotland and Northern England which have many missing roads. The NPE (out-of-copyright) maps have been useful in starting to fill out these sections, but there’s always going to be a roads missing from a 60-year-old (or older) map.

So, the OS datasets could be very useful. Perhaps the most interesting of the datasets is Meridian 2, it is a vector dataset covering the whole country. One thing that needs to be watched out for though is that Meridian (which is a “complete” dataset of the country) is relatively inaccurate Pixellation or resolution isn’t a problem, it being vector based – but data is quite simplified.

I’ve built a mashup which allows direct comparision of the Meridian and OSM data for Great Britain. I’ve added in most of the available layer files that come with the Meridian package that has been released as part of the OS OpenData initative. The only two areal ones I’ve added are for woodland areas and lakes – everything is linear. I’ve added in labels for the roads and rivers, but no boundaries or point features, at this stage.

You can access the mashup here. (N.B. Not tested in IE so will probably break horribly in it.) Zooming in reveals the relative coarseness of the Meridian data – although crucially it is “substantially” complete for all but the smallest of roads, for the whole of the UK – not just for the major cities where the OSM contributors mostly live!

Completeness
In the pictures below, the “solid”, thinner roads are Meridian and the fatter roads with “borders” are OSM.

Spot the missing roads in Meridian around Leytonstone in East London [Update 1 - Some sections of motorway are missing from my rendering but are present in the data - it is possible this problem extends to smaller roads too so take these screenshots with a pinch of salt]:

…but go further out of London, and it doesn’t look so good for OSM:

Interestingly, the Park Estate in central Nottingham is missing entirely from Meridian:

The Park Estate is a private estate and the roads are not maintained by the council – this might have something to do with it. I’ll be running around the Park Estate next weekend.

Accuracy
[Update 2 - Meridian is not intended to be used at scales larger than 1:50000, as per its documentation, so I shouldn't really be comparing it with OSM which generally is based on data recorded at larger scales. So, bear in mind these screenshots are all larger than 1:50000 scale.] It’s difficult to authoritatively judge the relative accuracies of the two datasets without getting out on the streets or looking at aerial imagery – but you can infer a basic measure of accuracy by looking at how roads “wiggle” – or, in the case of the Mayfair squares below, how Meridian converges the square to a point:

Detail
A little unfair to compare the two here, as Meridian 2 was always meant to be a medium-scale dataset, whereas OSM can be all things to all people!

The tiles that make up the imagery are generated on demand (and cached for subsequent use) so may run slowly. You’ll need to zoom in quite a long way before all the features get added to the map. Use the slider on the top left to fade between the OSM and Meridian layers.

The images are derived from Ordnance Survey data © Crown copyright and database right 2010 and OpenStreetMap data which is CC-By-SA OSM and contributors.

Forests of Great Britain

From the OS OpenData’s Meridian 2 dataset, which was released under a free licence today, here is the extent of forest cover across Great Britain – there is a dedicated polygon shapefile within the distribution showing just this:

As a general rule for orienteering, areas with good forest cover have the best orienteering maps. Scotland and Wales beat England hands down for cover, although Surrey’s doing not too badly at all. The rule doesn’t always hold though – there’s a big patch near Cambridge – more so, it appears than the Lake District, but the latter is considerably finer for orienteering in.

The map contains Ordnance Survey data © Crown copyright and database right 2010.

OS OpenData is here

It’s the first of April – but it’s not an April Fool – lots of Ordnance Survey medium-scale data has been released today, under a licence compatible with Creative Commons’ Attribution, i.e. you can do what you like with it as long as you attribute and don’t misrepresent the data source.

The best mirror for the data I’ve found is at MySociety – the OS’s own servers have been apparently overloaded since the release went live.

The first use I’ve made of the data is taking the “CodePoint Open” set of postcodes and locations, I’m now using this data as the postcode lookup for OpenOrienteeringMap. If you type in a UK postcode there, it should now take you to exactly the right place. Before, the lookup was using NPEmap data, which was pretty good in general, but someone did spot some glaring errors when they were using it, coincidently, yesterday.

The attribution statement, by the way, can be seen by mousing over the “Jump to Postcode” text.