In a comment on the last post, Tim Sherratt observed that there seemed to be fewer links between Series than there should be. I did some digging in the data and discovered that links in the Archives' data are uni-directional. In other words, when Series A lists Series B as a related Series, Series B does not automatically reciprocate. The same is true for succession and control relationships: Series data lists subsequent Series links, but not preceding Series (which are subsequent Series relationships in reverse). Controlling links are listed, but not controlled by relationships.

To represent these links I first had to rewrite the parsing code so that when it finds a link, it records the link in two Series - at both ends of the link - rather than one. Thinking about directionality, I decided that succession links could all be represented in the same way, regardless of direction: since the grid layout shows chronological ordering, that relationship is already clear (succession relationships are blue, above). Related Series could also be represented symmetrically - if Series A is related to Series B, surely B is also related to A (related links are yellow, above). Control relationships, however, are highly directional, so I introduced a new link type to represent the controlled by relationship. In the image above the controlled by links are purple, and lead from a large series to a number of smaller ones.
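As a rough illustration of recording a link at both ends, here is a minimal sketch in plain Java. The class, enum and method names are hypothetical and are not taken from the actual parsing code; the sketch only assumes that each link found in the data has a listing Series, a type and a target Series ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; a sketch of two-ended link recording.
class LinkRecorder {
    // CONTROLLED_BY is the new type, the reverse of CONTROLLING.
    enum LinkType { SUCCESSION, RELATED, CONTROLLING, CONTROLLED_BY }

    // One end of a link: its type and the ID of the Series at the other end.
    static class Link {
        final LinkType type;
        final String otherId;
        Link(LinkType type, String otherId) {
            this.type = type;
            this.otherId = otherId;
        }
    }

    // Series ID -> links recorded for that Series.
    final Map<String, List<Link>> links = new HashMap<>();

    // Record a link that the data lists on `fromId`, at both ends.
    void record(String fromId, LinkType type, String toId) {
        add(fromId, type, toId);
        // Succession and related links are stored symmetrically;
        // only a control link gets a distinct reverse type.
        LinkType reverse = (type == LinkType.CONTROLLING) ? LinkType.CONTROLLED_BY : type;
        add(toId, reverse, fromId);
    }

    private void add(String id, LinkType type, String otherId) {
        links.computeIfAbsent(id, k -> new ArrayList<>()).add(new Link(type, otherId));
    }
}
```

Each call to `record` adds two entries, one per end, which is consistent with the link count doubling.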
This tweak has a number of important results. Not surprisingly, the number of links increases - it doubles, in fact - providing more impetus to explore the context around a focused series. Also, the addition of the controlled by relationship makes small controlling Series far more findable because they are often linked from large Series, as in the image above.
Update 20th August - updated these sketches to fix a memory allocation problem
After a hiatus over summer and the start of the academic year, I finally have some more progress to report. Using the packed square visualisations as a base, I've been adding more data elements from the Series dataset, and working towards visualising relationships between Series and the Agencies that generate their content. This has taken longer than planned due to more data-plumbing issues, which I'll come to later.
The Archives' Series data records two kinds of links: Series-Agency and Series-Series. The latest sketches make a start in visualising both of these. Here colour (or more accurately hue) is derived from the first listed Recording or Controlling Agency. As the CRS Manual explains, the Recording Agency generates the records; while the Controlling Agency is the "agency currently responsible for some or all of the functions or legislation documented in records." In either case, given that there are some 9000 Agencies involved here, how do we visualise this link? For the moment I'm doing it in the simplest possible way: low Agency numbers have low hue values (red), while high Agency numbers have high hue values (blue to purple). There are a number of problems with this - notably that it's impossible to tell the difference between, for example, CA 11 (Treasury 1901-1976) and CA 12 (the PM's Department 1911-1971) - which is a very significant difference. These two images show the difference between visualising Recording Agency (top) and Controlling Agency (bottom).
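The linear mapping, and why neighbouring Agencies become indistinguishable, can be sketched as follows. The maximum Agency number (9000) and the upper hue limit (0.8 of the colour wheel, roughly purple) are assumptions for illustration, as are the class and method names.

```java
// A sketch of a linear Agency-number-to-hue mapping; constants are assumed.
class AgencyHue {
    static final int MAX_AGENCY = 9000;  // assumed upper bound on Agency numbers
    static final float MAX_HUE = 0.8f;   // assumed: stop at purple, before hue wraps to red

    // Map an Agency number (the 11 in "CA 11") linearly onto a hue in [0, MAX_HUE].
    static float hueFor(int agencyNumber) {
        int clamped = Math.max(0, Math.min(agencyNumber, MAX_AGENCY));
        return MAX_HUE * clamped / MAX_AGENCY;
    }

    public static void main(String[] args) {
        float treasury = hueFor(11);  // CA 11
        float pm = hueFor(12);        // CA 12
        // The two hues differ by only MAX_HUE / MAX_AGENCY, far too little to see.
        System.out.println(treasury + " vs " + pm + ", difference " + (pm - treasury));
    }
}
```

With these assumed constants, adjacent Agency numbers differ by less than one ten-thousandth of the hue range.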
Series data also records links to other Series, which come in three flavours: Succession (between previous and subsequent series), Controlling (where one Series acts as an index or register for another) and Related (for other relationships). In this dataset (57.5k Series) there are some 7.5k succession links, 6.2k controlling series links, and 25k related series links. My initial attempt to render all of these (by just drawing a line between linked Series) resulted in a giant, unreadable cloud. A simpler and more legible approach is to only draw links for one Series at a time.
In the latest interactive sketch, a single Series' links are drawn as coloured lines: controlling links are red, succession links are blue, and related Series links are yellow. Clicking a Series selects it and draws its links, rendering linked series in colour while dimming the rest to grey (clicking the Series again unselects it and returns to Technicolor mode). This begins to show the potential for a visual interface to the collection, I think. Here's the applet - note that it's fairly screen and memory-hungry. Feedback welcome, as always.

There are a few changes behind the scenes here as well. As outlined earlier, XML has been a mixed blessing: easy to use and human-readable, but the file sizes are large, and the DOM parsing method used in Processing is memory-hungry and slow. For these sketches I've switched to JSON, a simple, lightweight data format with its own Java library. So far, JSON is working nicely; its file sizes are around half those of the equivalent XML files, parsing is much faster, and the parsing code is simpler and neater. This thread has lots of useful info on implementing JSON in Processing.
HashMaps are the other new toy here. I'd never quite found a use for them until now, but because they easily connect an object (in this case a Series) with an index string (in this case a Series ID), they are essential here for building Series-Series links. I simply store each Series' links as a list of ID strings, then to draw the link, feed each ID into a HashMap to access the whole Series object. Thanks to @blprnt and @toxi for reminding me why I needed HashMaps!
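The lookup described above might look something like this in plain Java. The `Series` class and its fields are simplified stand-ins rather than the sketch's real data model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in classes; names are illustrative assumptions.
class SeriesIndex {
    static class Series {
        final String id;
        final List<String> linkIds = new ArrayList<>();  // links stored as ID strings
        Series(String id) { this.id = id; }
    }

    // Series ID -> Series object.
    final Map<String, Series> byId = new HashMap<>();

    void add(Series s) {
        byId.put(s.id, s);
    }

    // Resolve a Series' link IDs into Series objects,
    // skipping any ID that is not present in the dataset.
    List<Series> linked(Series s) {
        List<Series> result = new ArrayList<>();
        for (String id : s.linkIds) {
            Series target = byId.get(id);
            if (target != null) {
                result.add(target);
            }
        }
        return result;
    }
}
```

Storing IDs rather than object references means links can be parsed before the Series they point to has been loaded; the HashMap resolves them at draw time.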
Next: digging deeper into the complexities of Agency-Series relations.
Up to this point the grid visualisations have taken a very simple approach to space: dividing it up equally among the data points, and then using hue and brightness to show attributes such as shelf metres and items. This has the advantage of simplicity, but it has a major disadvantage too: it's attempting to represent size (shelf metres or number of items) using other means. Why not just use size for size? Read on for the blow-by-blow account, or skip straight to the end result: the latest interactive sketch.
Before Christmas I had a first stab at this problem. The approach was basic, as usual. Maintaining the chronological ordering of the series, I drew each series as a square with area proportional to number of items. The packing procedure was simply: starting where the previous series is, step through the grid until we find a big enough space to draw the current series. The result looked like this:
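A minimal sketch of this forward-only packing, assuming the grid is a fixed array of cells scanned in row-major order and that each square's side is a whole number of cells. The class name, the `sideFor` helper and its scale parameter are illustrative assumptions.

```java
// A sketch of forward-only square packing on a cell grid; names are assumed.
class ForwardPacker {
    final int cols, rows;
    final boolean[][] used;  // used[row][col] is true once a square covers that cell
    int cursor = 0;          // row-major index where the previous square was placed

    ForwardPacker(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.used = new boolean[rows][cols];
    }

    // Side length in cells so that area is proportional to `items`
    // (cellsPerItem is an assumed scale factor); never smaller than one cell.
    static int sideFor(int items, double cellsPerItem) {
        return Math.max(1, (int) Math.ceil(Math.sqrt(items * cellsPerItem)));
    }

    // Step forward from the previous position until the square fits.
    // Returns {col, row}, or null if the grid runs out.
    int[] place(int side) {
        for (int i = cursor; i < cols * rows; i++) {
            int col = i % cols, row = i / cols;
            if (fits(col, row, side)) {
                for (int r = row; r < row + side; r++)
                    for (int c = col; c < col + side; c++)
                        used[r][c] = true;
                cursor = i;  // the next search never looks behind this point
                return new int[] { col, row };
            }
        }
        return null;
    }

    boolean fits(int col, int row, int side) {
        if (col + side > cols || row + side > rows) return false;
        for (int r = row; r < row + side; r++)
            for (int c = col; c < col + side; c++)
                if (used[r][c]) return false;
        return true;
    }
}
```

Because `cursor` only moves forwards, any free cells skipped while looking for room for a large square are never revisited.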
After weeks of regular grids, this was a sight to see. The distribution of the sizes of series (overall and through time) is instantly apparent. This ultra-simple packing method is far from perfect, though, as you can see from all the black gaps. Because it tiles one series at a time, in strict sequence, and only searches forwards through the grid, gaps appear whenever a large square comes up as the search scrolls along to find a free space.
The main restriction here is the chronological ordering of the series. I need to maintain that ordering, but at the same time I need to be able to pack the squares more efficiently, which means changing the order. Luckily there's a loophole: as the first histogram showed, many series share the same start date. So we can change the sequence of those same-year series, without disrupting the overall order. We can pack them starting with the biggest squares and pack in the smaller ones around them. The latest sketches use this method, which can be described in pseudocode:
- Make a list of series with a given start year
- Working from biggest to smallest, pack each series into the grid, from a given start point: restart the search from the start point each time.
- Keep track of the latest point in the grid that this group occupies. For the following year, start from this point.
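The steps above can be sketched in plain Java as follows. This is one interpretation, not the sketch's actual code: the grid is assumed to be a fixed array of cells scanned in row-major order, and "the latest point this group occupies" is taken to mean the furthest grid index at which any square in the group was placed. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A sketch of year-grouped, biggest-first packing; names and details are assumed.
class YearPacker {
    static class Series {
        final String id;
        final int startYear;
        final int side;          // square side in grid cells
        int col = -1, row = -1;  // set once the square has been placed
        Series(String id, int startYear, int side) {
            this.id = id;
            this.startYear = startYear;
            this.side = side;
        }
    }

    final int cols, rows;
    final boolean[][] used;

    YearPacker(int cols, int rows) {
        this.cols = cols;
        this.rows = rows;
        this.used = new boolean[rows][cols];
    }

    void packAll(List<Series> all) {
        // Step 1: group series by start year; TreeMap keeps the years in order.
        Map<Integer, List<Series>> byYear = new TreeMap<>();
        for (Series s : all) {
            byYear.computeIfAbsent(s.startYear, y -> new ArrayList<>()).add(s);
        }
        int start = 0;
        for (List<Series> group : byYear.values()) {
            // Step 2: biggest first, restarting the search from `start` each time.
            group.sort(Comparator.comparingInt((Series s) -> s.side).reversed());
            int latest = start;
            for (Series s : group) {
                latest = Math.max(latest, placeFrom(start, s));
            }
            // Step 3: the following year starts from the latest point reached.
            start = latest;
        }
    }

    // Scan forward from `start`; returns the grid index used, or -1 if nothing fits.
    int placeFrom(int start, Series s) {
        for (int i = start; i < cols * rows; i++) {
            int col = i % cols, row = i / cols;
            if (fits(col, row, s.side)) {
                for (int r = row; r < row + s.side; r++)
                    for (int c = col; c < col + s.side; c++)
                        used[r][c] = true;
                s.col = col;
                s.row = row;
                return i;
            }
        }
        return -1;
    }

    boolean fits(int col, int row, int side) {
        if (col + side > cols || row + side > rows) return false;
        for (int r = row; r < row + side; r++)
            for (int c = col; c < col + side; c++)
                if (used[r][c]) return false;
        return true;
    }
}
```

Restarting from the same point for every series in a year is what lets smaller squares fill in around the larger ones, while the per-year start point preserves the overall chronological order.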
In this image square area is mapped to shelf metres; as in the earlier sketch hue is derived from the series prefix (roughly A = red, Z = blue). One artefact is apparent here - those lines of squares graded by size occur when nothing gets in the way of the packing process. As a byproduct of this, the biggest squares in those sequences often mark the start of a new year in the grid.
The latest sketches integrate both shelf metres and described items, and finally add interaction to this visualisation. To combine metres and items the squares are drawn as above, with area proportional to shelf metres; then overlaid with a second grey square, whose size is inversely proportional to the number of items in the series. The result is that series with many items are full of colour, and series with few items have large "hollows" and narrow coloured borders.
Again, there are relations between series here that are instantly apparent. It's easy to see those series that have lots of shelf metres but relatively few items, as well as even medium-sized series with many items. I couldn't find A1 in the earlier grids (though Tim Sherratt from the Archives could); it is much more prominent here. Tim also pointed out that B2455, one of the big series of WWI service records, didn't jump out of the grids: it's very prominent here. As well, that cluster of post-War migration series spotted in the items grid reappears here. Promising signs for the usefulness of this visualisation.
All this is best demonstrated in the interactive version, which like the previous grids adds a caption overlay and some year labels on the vertical axis. Browse around and see what you can find - feedback very welcome.

