Category Archives: Epistemologies

Event: DP @ British Crime Historians Symposium, Liverpool, September 2014

We’re delighted that our quest to take over the entire known universe of the history of crime continues with a panel session at this year’s British Crime Historians Symposium:

The Digital Panopticon: New perspectives on criminal justice records and the practice of transportation

Robert Shoemaker, ‘Identifying the criminal: The state and record keeping in the eighteenth and nineteenth centuries’
Richard Ward, ‘Seeing things differently: Visualising data on crime and punishment’
Lucy Williams, ‘Bound for Botany Bay? Assessing the differences between Old Bailey penal sentences and their implementation’


Transportation Under the Macroscope

Computers are brilliant microscopes. They make it easy to find needles in haystacks. Want to find references to the famous lawyer William Garrow amongst the millions of words in the printed reports of trials held at the Old Bailey, for instance? A keyword search produces the results in less than a second. Without computers it would take months. Likewise, as I explained in a recent post, through the techniques of data visualisation computers can be used to spot (what would otherwise be largely imperceptible) errors within the massive datasets that we are drawing upon in the Digital Panopticon project.

But computers are also fantastic macroscopes: today’s powerful digital technologies allow us to stand back from our sources and view them in their entirety. We can see the big picture, presenting complex and large-scale patterns in simple but effective ways. Microscopes allow us to see the infinitely small. Telescopes reveal the infinitely great. Macroscopes, meanwhile, peer into the infinitely complex, allowing us to explore combinations, relationships and interactions between multiple elements.

By visualising the information recorded in the British Convict Transportation Registers, I’ve recently put penal transportation to Australia in the eighteenth and nineteenth centuries under the macroscope. This has produced some interesting insights into the relationship between Australian penal colonies, terms of transportation and how these changed and interacted over time.

The British Convict Transportation Registers database provides information on more than 123,000 offenders who were transported to Australia between 1787 and 1867. It’s a fantastic resource, and it will be at the heart of the Digital Panopticon project’s efforts to chart the criminal lives of London convicts sent to Australia. In charting these lives, we need to address some overarching starting questions. How many London convicts were actually transported to Australia for their crimes? Which parts of Australia were they sent to? How many years abroad did they face according to their sentences? Did this change over time, and what was the relationship between these different elements? Visualisations can help us to explore these questions across the long term and a large scale.

The total number of London convicts transported to Australia fluctuated greatly over the late eighteenth and nineteenth centuries, as Graph 1 below demonstrates. Relatively few convicts were transported in the years 1793–1804, when the Revolutionary War monopolised Britain’s shipping resources. With the end of the Napoleonic Wars in 1815 there were, however, large and rapid increases in the numbers of London convicts sent to Australia, reaching a massive peak in the 1830s. Thereafter, numbers gradually fell until the eventual abandonment of penal transportation in the 1860s. Interestingly, this pattern reflects a wider inverse relationship between the numbers of convicts transported and the years in which Britain was engaged in war throughout the eighteenth and nineteenth centuries.

What Graph 1 doesn’t reveal is that the places in Australia to which convicts were sent changed over time, because the individual penal colonies operated at different times. As Graph 2 below shows, New South Wales was the first penal colony in Australia, and was later used alongside the penal colony of Van Diemen’s Land between the late 1820s and 1840, when transportation to Australia was at its peak. Following this, Van Diemen’s Land was used almost exclusively until the 1850s, when Western Australia became the sole destination for transportation in Australia.

If the locations of penal transportation to Australia changed over time, then so too did the lengths of the terms to which offenders were sentenced. Between 1787 and the virtual abandonment of New South Wales as a penal colony in the late 1830s, as Graph 3 highlights, offenders were sentenced almost without exception to a term of 7 years, 14 years or life. Between 1840 and 1850, when Van Diemen’s Land was used exclusively, terms became more varied, with greater use of 10 and 15 year sentences. And especially after 1853, when Western Australia became the sole destination for transportees, an even greater variety of terms was put to use. This more nuanced tariff in transportation sentences was likely introduced to make transportation more acceptable to penal reformers, who increasingly viewed the practice with concern.

These changes in penal colony and terms of transportation were intimately linked, and the interaction between the two is clearly captured in Graph 4. The colonies operated at different times, and the law which underpinned them and the terms of transportation which could be imposed also changed accordingly. In short, the convicts who found themselves on the shores of New South Wales were primarily one of two kinds: either those sentenced to 7 years’ transportation, or those sentenced to a whole life abroad. By contrast, London convicts landing some 2,000 miles away on the shores of Western Australia on the eve of transportation’s demise in the 1850s would each have had subtly different terms to serve out.
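The kind of colony-by-term comparison described above boils down to a cross-tabulation. The following is only a minimal sketch in Python: the handful of records is invented for illustration and does not come from the Transportation Registers, and the function name `cross_tabulate` is an assumption of mine rather than anything used by the project.

```python
from collections import Counter

# Invented sample rows of (colony, term); NOT real register data.
records = [
    ("New South Wales", "7 years"),
    ("New South Wales", "Life"),
    ("New South Wales", "7 years"),
    ("Van Diemen's Land", "10 years"),
    ("Western Australia", "6 years"),
    ("Western Australia", "15 years"),
]

def cross_tabulate(rows):
    """Count how often each term of transportation occurs within each colony."""
    table = {}
    for colony, term in rows:
        table.setdefault(colony, Counter())[term] += 1
    return table

table = cross_tabulate(records)
for colony, terms in table.items():
    print(colony, dict(terms))
```

A table of this shape is what a tool such as Tableau would then draw as a stacked bar or heat map, with one axis for colony and one for term.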

Through the macroscope of computer-generated visualisations, we can see these complex patterns and interactions in their entirety, spanning the breadth of Australia and the length of a century, taking in the lives of tens of thousands of individuals along the way.


Men as Wives: Visualising Errors in the Old Bailey Proceedings Data

In a recent post I talked about some of the ways in which data visualisations have helped me to see patterns in the information recorded in the Old Bailey Proceedings on things such as crimes, verdicts, punishments and the ages of defendants, patterns that might otherwise have been missed if using traditional methods of representing data such as tables. Here I just want to give a brief update on my analysis of the Proceedings, particularly the recording of defendant occupations and social status in the Proceedings in the eighteenth and nineteenth centuries. Again, visualisations have been extremely useful, especially in identifying errors in the data.

As with the recording of defendant ages, it might well be the case that information on the occupation/social status of those tried at the Old Bailey in the eighteenth and nineteenth centuries could be useful to us on the Digital Panopticon project in tracing offenders across different sets of records. Just as an age or a birth date might allow us to establish whether the “John Smith” tried at the Old Bailey and the “John Smith” transported to Australia was indeed the same person, likewise information on occupation or social status can help us to prove/disprove such name matches across records. But as with ages it depends on how extensively, and in what manner, such information on occupation/social status is recorded in our sources. And to this end, as with information on defendant age, the techniques of data visualisation can be useful.

Searches of the Proceedings for defendant occupation/social status can be carried out using the “custom search” page of the Old Bailey Proceedings Online.

However, whereas with defendant ages I was able to use the “statistics search” function of the Old Bailey Proceedings Online to generate numbers for analysis, this wasn’t possible in the case of defendant occupation/social status. In the process of digitising the original trial reports, defendant occupation was indeed tagged as a distinct category of information, and thus it can be searched for systematically in the “custom search” page of the Old Bailey Proceedings Online. But this can’t be used to quantitatively analyse the recording of defendant occupations in the Proceedings. In order to do this I needed to look at the website’s underlying data file of defendant information.

This is a large file which includes numerous fields of tagged information relating to all the defendants tried at the Old Bailey and reported in the Proceedings. Since much of this information is in the form of text rather than numbers, software such as Excel isn’t very useful in analysing the data. Instead I turned to Tableau Public, a free, web-based tool that is powerful but still easy to use. There are numerous other data visualisation tools available which are ideal for novices. All need to be used with caution, but used carefully they can be invaluable. (I’m going to talk in more detail about the actual process of using tools such as Tableau to undertake crime history in my next post, so watch this space.)

By running our file on Old Bailey defendant information through Tableau I’ve been able to create some fairly simple but nonetheless useful visualisations. For the data on defendant occupation and social status this has revealed two things in particular.

Pie chart demonstrating frequency of recording defendant occupation

First of all, it has highlighted how little information we actually have on the occupational and social status of Old Bailey defendants from the seventeenth to the twentieth centuries. Across the entire publication history of the Proceedings between 1674 and 1913, occupation or social status is recorded for only 11% of all the defendants put on trial. In the years 1755 to 1834, occupation/social status is recorded for 15% of defendants, but between 1834 and 1906 virtually no defendants’ occupations were recorded. On the whole, therefore, we have occupation information for only a small proportion of defendants, and virtually none for the part of our specific period (c. 1787-1875) that falls after 1834.
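As a rough illustration of how a recording rate per period can be computed from a defendant file, here is a small Python sketch. The rows are invented, the period labels are my own (the third starts at 1835 simply to avoid overlapping with the second), and the real data file has many more fields than the two assumed here.

```python
from collections import defaultdict

# Invented rows of (trial_year, occupation_label or None); not real data.
defendants = [
    (1720, None),
    (1760, "servant"),
    (1790, None),
    (1800, "labourer"),
    (1850, None),
    (1900, None),
]

# Illustrative period boundaries loosely following the text.
PERIODS = [
    ("1674-1754", 1674, 1754),
    ("1755-1834", 1755, 1834),
    ("1835-1913", 1835, 1913),
]

def recording_rate(rows):
    """Return {period label: share of defendants with an occupation recorded}."""
    totals = defaultdict(int)
    recorded = defaultdict(int)
    for year, occupation in rows:
        for label, start, end in PERIODS:
            if start <= year <= end:
                totals[label] += 1
                if occupation:
                    recorded[label] += 1
                break
    return {label: recorded[label] / totals[label] for label in totals}

rates = recording_rate(defendants)
```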

The sheer variety of occupations recorded in the Proceedings was also made clear by visualising the data. The bubble chart below, for example, gives an indication of this, and of the relative frequency with which different categories are recorded. One of the problems is that the same occupations were recorded in the Proceedings in slightly different ways (“servant” and “servants”, for example) or with variant spellings (such as “taylor” and “tailor”). If we wanted to use occupation or social status labels to verify name matches across sets of records, this suggests that we would need to use sophisticated forms of keyword searching.
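One simple way to cope with variants like these is to normalise labels before counting or comparing them. The Python sketch below is deliberately crude and rests on assumptions: the variant-spelling table contains only the one example given above, and the plural-stripping rule would mangle some real labels, so a real pipeline would need a curated list.

```python
from collections import Counter

# Hypothetical variant-spelling table; only the example from the text is included.
SPELLING_VARIANTS = {"taylor": "tailor"}

def normalise_occupation(label):
    """Lower-case a label, strip a simple plural 's', and map known variants."""
    cleaned = label.strip().lower()
    # Crude singularisation: "servants" -> "servant". Labels ending in "ss"
    # (e.g. "seamstress") and very short labels are left alone.
    if cleaned.endswith("s") and not cleaned.endswith("ss") and len(cleaned) > 3:
        cleaned = cleaned[:-1]
    return SPELLING_VARIANTS.get(cleaned, cleaned)

labels = ["servant", "Servants", "taylor", "tailor", "Taylor "]
counts = Counter(normalise_occupation(label) for label in labels)
```

With the five sample labels above, the counter collapses them into two categories, "servant" and "tailor".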

Bubble chart showing categories of defendant occupation

Bubble chart of defendant occupations by gender

But visualisations have been especially useful in highlighting some of the errors in the recording of occupations within the Old Bailey Proceedings data. One of the things that I wanted to find out was how occupation labels varied according to the gender of the defendants tried at the Old Bailey. In order to do this I used Tableau to create the following bubble chart of the most common forms of recorded occupations/social status for male and female defendants in the years when we have significant amounts of information on this. One of the things that really struck me in this bubble chart was the number of men whose occupation label is recorded in our Proceedings dataset as “wife”. This clearly seemed to be an error in the data, but I wanted to know what the source of the problem was, so I went back to the original data file and filtered it for male defendants with the occupation/social status label of “wife”. I then looked at the trial reports in the Old Bailey Proceedings for these cases.
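The filtering step described here (male defendants labelled “wife”) can be sketched in a few lines of Python. The records, field names and the list of implausible labels below are all assumptions made for illustration; the real data file is structured differently and was filtered in Tableau.

```python
# Invented defendant records; the field names are hypothetical.
defendants = [
    {"id": "d1", "gender": "male", "occupation": "wife"},
    {"id": "d2", "gender": "female", "occupation": "wife"},
    {"id": "d3", "gender": "male", "occupation": "labourer"},
    {"id": "d4", "gender": "male", "occupation": "widow"},
]

# Labels we would not expect to see for a given gender (illustrative only).
IMPLAUSIBLE = {"male": {"wife", "widow", "spinster"}}

def flag_anomalies(rows):
    """Return the ids of defendants whose label is implausible for their gender."""
    return [
        row["id"]
        for row in rows
        if row["occupation"] in IMPLAUSIBLE.get(row["gender"], set())
    ]

suspect_ids = flag_anomalies(defendants)
```

The flagged ids are only candidates for checking against the original trial reports, not confirmed errors.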

Trial report in the Old Bailey Proceedings in which the husband of a female defendant has been tagged with the social status of “wife”

It turns out that many of these cases were due to errors in the digitisation process which resulted from the unusual nature of the trial reports themselves. The cases were actually ones (such as this example below) in which a female defendant had been named in the trial report as the wife of her husband, and thus the automated tagging process used to digitise the Proceedings had recorded both the husband and the wife as defendants and assigned them both the role of “wife”. This practice in the Proceedings of naming the female defendant as the wife of her husband largely disappeared in the nineteenth century, and therefore most of these errors in the data file tend to come from the eighteenth century. By identifying these kinds of anomalies, visualisations therefore allow us to find errors in the data. Such errors can then be rectified. This leaves us with a much “cleaner” dataset, thereby increasing the chances of successful record linkage.

Historians of crime (particularly the history of crime in Britain) have been quick to exploit the plethora of digitised criminal justice (and associated) records that are now available online. We all make use of resources such as the Old Bailey Proceedings Online, Eighteenth-Century Collections Online and digitised newspapers. But whilst we have been quick to take advantage of the benefits offered by these digitised records – such as keyword searching to find needles in haystacks – we have been less ready to understand the full effects of the digitisation process for how we study our sources and the information that we extract from them. By using data visualisations we can better understand the implications of digitisation, including the ways in which the actual process of turning a paper record into a digital format might result in errors (relatively rare, it should be said, in the case of the Old Bailey Proceedings Online) in the information we compile.


Adventures with Data Linkage

The British Convict Transportation Registers is a database detailing the journeys of over 123,000 people transported to Australia in the 18th and 19th centuries. Compiled from British Home Office records, it contains information such as the name of each person being transported, the date they departed, and their final destination.

The early stages of the Digital Panopticon have allowed us to perform some preliminary data linkage between these registers and people sentenced to transportation in the Old Bailey Proceedings. We’ve made the links primarily by name, with a degree of tolerance for spelling. We found that many names actually matched exactly, suggesting that perhaps names were in some cases directly copied from one record to another. A further 7% of names could be matched via an algorithm known as Soundex, which attempts to identify names which sound similar when spoken, but might be (accidentally) spelt differently. A remaining handful were matched by virtue of having a small Levenshtein Distance. Levenshtein is a simple metric by which the variance between two text strings is quantified. Including matches with a very small Levenshtein Distance, where perhaps only a single letter is different or omitted, helps take account of minor clerical errors.
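To make the three tiers of matching concrete, here is a minimal Python sketch of exact, Soundex and Levenshtein comparison. It uses the standard American Soundex rules and a textbook dynamic-programming edit distance; the function names, the order of the checks and the `max_distance=1` threshold are my assumptions, not the project's actual implementation or settings.

```python
# Standard American Soundex letter codes.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # vowels reset this, so a repeated code counts again
    return (encoded + "000")[:4]

def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    previous_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current_row = [i]
        for j, cb in enumerate(b, start=1):
            substitution = previous_row[j - 1] + (ca != cb)
            insertion = current_row[j - 1] + 1
            deletion = previous_row[j] + 1
            current_row.append(min(substitution, insertion, deletion))
        previous_row = current_row
    return previous_row[-1]

def match_type(name_a, name_b, max_distance=1):
    """Classify a pair of surnames as 'exact', 'soundex', 'levenshtein' or None."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    if a == b:
        return "exact"
    if soundex(a) == soundex(b):
        return "soundex"
    if levenshtein(a, b) <= max_distance:
        return "levenshtein"
    return None
```

For example, `match_type("Smith", "Smyth")` falls into the Soundex tier, while `match_type("Garrow", "Barrow")` is caught only by edit distance, because Soundex always keeps the first letter. That second case also shows why a small Levenshtein Distance can produce false positives between genuinely different surnames.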

Percentages of names matched between the British Transportation Records and Old Bailey Proceedings, under various conditions.

Results of attempted name matching between the British Transportation Records and Old Bailey Proceedings.

In total, about 70% of the people sentenced to transportation in the Proceedings appear in the transportation records. We can be quite confident of about half of these, because in some cases the date of conviction is actually given in the transportation record. If the date and name match, it becomes very likely that we’re dealing with the same individual. For transportation records where a conviction date is not given, we have to examine five or six years’ worth of Old Bailey records to make sure we don’t miss a possible match. This greatly increases the possibility of a false positive, so we can be less sure about these links.
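The difference between the two situations can be shown with a short Python sketch. Everything here is assumed for illustration: the field names, the invented records, exact name matching (rather than the fuzzy matching described earlier) and a six-year look-back window measured in calendar years.

```python
from datetime import date

# Invented Old Bailey trials; names and dates are not real records.
trials = [
    {"name": "John Smith", "trial_date": date(1810, 4, 2)},
    {"name": "John Smith", "trial_date": date(1820, 1, 12)},
    {"name": "John Smith", "trial_date": date(1823, 5, 1)},
    {"name": "Mary Jones", "trial_date": date(1823, 5, 1)},
]

def candidate_trials(record, trials, window_years=6):
    """Trials that could correspond to one transportation record."""
    name = record["name"].lower()
    same_name = [t for t in trials if t["name"].lower() == name]
    conviction = record.get("conviction_date")
    if conviction is not None:
        # Conviction date known: require the trial date to match it.
        return [t for t in same_name if t["trial_date"] == conviction]
    # No conviction date: accept any trial in the years before departure.
    departure = record["departure_date"]
    return [
        t for t in same_name
        if t["trial_date"] <= departure
        and departure.year - t["trial_date"].year <= window_years
    ]

dated = {"name": "John Smith", "conviction_date": date(1823, 5, 1),
         "departure_date": date(1824, 3, 1)}
undated = {"name": "John Smith", "conviction_date": None,
           "departure_date": date(1824, 3, 1)}
```

With the sample data, the dated record yields a single candidate, while the undated one yields two, which is exactly the ambiguity that makes the undated links less trustworthy.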

One interesting trend is that the number of exact links decreases significantly in cases where the conviction date is not given. A greater proportion of these links had to be made with Soundex or Levenshtein Distance. This suggests that the links made without a conviction date are less reliable, as we might expect. Therefore, for the time being we will discard these.

With our most reliable links in hand, we can begin looking for patterns between the details of conviction and transportation. One of the most interesting pieces of information contained in the transportation records is the destination of convict ships. An obvious question is whether convicts were directed to particular destinations based upon their offence, gender or age. One might imagine colonies having a need for people with particular skills or attributes at particular times, and the system might have attempted to address these needs. Luckily, occupation is indeed sporadically recorded in the Old Bailey Proceedings.

In fact, the data shows that the overwhelming factor in deciding where a convict was sent was the particular year when they left England. Transportation was almost exclusively to New South Wales before 1831, and overwhelmingly to Van Diemen’s Land after 1838. There is a brief period from 1832 to 1835 where roughly equal numbers of convicts are sent to both destinations. However, even during that period, there doesn’t appear to be any correlation between the characteristics of a convict and their destination. Neither gender nor age, crime nor occupation seems to have made any difference. Once a person was in the transportation system, their final destination appears to have depended on timing alone. There was no easily identifiable tendency to send people with particular attributes to particular destinations.

Sankey diagram, showing proportions of different age groups transported to different destinations, including where the destination is unknown because a link between records could not be made.

Sankey diagram, showing proportions of different age groups transported to different destinations between 1832 and 1835, including where the destination is unknown because a link between records could not be made.

If we cannot find a pattern in where people were sent, perhaps we can find a pattern in how long it took them to be sent there. For every convict there is a period of time between when they were convicted and when they actually set sail aboard a ship. The interval between conviction and transportation is hugely variable. A few people were transported in little over a month. Some people, as we have noted, spent six years waiting to be transported.
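Measuring this interval is straightforward once a conviction has been linked to a departure. The Python sketch below summarises the waiting time per year of conviction; the linked pairs are invented, and grouping by conviction year (rather than departure year) is my assumption.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Invented (conviction_date, departure_date) pairs for linked records.
links = [
    (date(1821, 1, 1), date(1821, 1, 31)),
    (date(1821, 1, 1), date(1821, 3, 2)),
    (date(1822, 6, 1), date(1822, 6, 11)),
]

def interval_summary(pairs):
    """Return {conviction year: (min, max, mean) days waited before sailing}."""
    by_year = defaultdict(list)
    for convicted, departed in pairs:
        by_year[convicted.year].append((departed - convicted).days)
    return {
        year: (min(days), max(days), mean(days))
        for year, days in sorted(by_year.items())
    }

summary = interval_summary(links)
```

Plotting the three values in each tuple against the year gives the minimum, maximum and average lines of a graph like the one described here.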


Line graph showing the minimum, maximum and average intervals between conviction and transportation between 1787 and 1852.

The data shows that again, time was a very important factor. Transportation almost halted between 1835 and 1844, as did sentences of transportation. In contrast, the system seems to have been at peak efficiency between about 1814 and 1834, but even then there are a few outliers (represented by the green line) who still had to wait a very long time to be transported.

Detail of a scatterplot variation showing every interval between Proceedings conviction and BTR transportation, represented by horizontal bars running from conviction date to transportation date. Females are blue, males are orange.

If we look at the data in more detail, we can see that a great many of those sentenced to transportation, at least early in the period, are simply waiting for the next boat to depart. Convicts sentenced at multiple sessions are stored up until, presumably, there are enough to justify a voyage. Nevertheless, there are people who seem to miss multiple voyages; people convicted at the same session as those who depart on the next boat who are, for whatever reason, left behind. Can we detect any common characteristics among these people?

It is not at all easy to find a pattern, but there may be one: male prisoners below the age of 15 appear to be kept for longer, on average, than those who are older. It’s worth noting that the minimum and maximum intervals show no such trend; there are still people under fifteen who are transported very quickly, and people over fifteen who are held for a very long time. But in terms of the average, there is a definite increase which starts abruptly at the age of fifteen and then accelerates as prisoners get younger. In fact, on average, male prisoners under fifteen are kept for twice as long as those over fifteen.

Age plotted against minimum, maximum and average days between conviction and transportation, for males sentenced at the Old Bailey 1787-1852.

This is a finding which we can begin to investigate and verify. Certainly, the pattern is not repeated for female prisoners, whose average transportation time remains remarkably consistent regardless of age. As the project gathers more data and continues its initial investigations, we hope to be able to explore this possible trend in more detail.

This is the very first linking exercise we have done, and there is undoubtedly scope to refine the process. Every dataset we add will help us to evaluate our findings more thoroughly and ask more detailed questions. The next step may be to try to link the Old Bailey and Transportation Registers to the Convict Database, which contains information such as height and prisoner health. These may well be important factors in determining the treatment of prisoners and providing further clues as to the nature of a journey through the eighteenth-century criminal justice system.


Visualising Life-Grids and Narrating the Lives of Convicts

One of the great opportunities presented by the Digital Panopticon project (and one of the most exciting in my opinion) is in uncovering more about the processes of crime and punishment by placing thousands of offenders, and their offences, back within the context of their own lives.

Tracing offenders through the records has been a preoccupation of several groups of historians and criminologists (for example Barry Godfrey, Heather Shore, Pam Cox, David Cox, Helen Johnston, Zoe Alker, Joanne Turner, and Stephen Farrall) in the last decade. On account of the laborious nature of record linkage, those studies which have focussed on tracing groups of offenders through civil as well as criminal datasets have been able to examine a few hundred offenders at a time. Those pioneering this methodology have taken the collected information and sorted it into ‘lifegrids’ which chart life events and changes for each individual. Lifegrids might typically include details of birth, marriage and death, family evolution, employment and residential addresses, and offending and punishment history. Of course, the depth and breadth of documents and information available on different groups of offenders, or on individual offenders, dictates how much material can be recorded in each lifegrid.
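For readers who think in data structures, a lifegrid can be pictured as a chronologically ordered list of categorised events attached to one person. The Python sketch below is only one possible shape: the class names, fields and categories are my assumptions, and the example person and dates are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LifeEvent:
    year: int
    category: str                   # e.g. "birth", "offence", "custody", "death"
    detail: str = ""
    end_year: Optional[int] = None  # for events with a duration, such as custody

@dataclass
class LifeGrid:
    name: str
    events: List[LifeEvent] = field(default_factory=list)

    def add(self, event):
        """Add an event and keep the grid in chronological order."""
        self.events.append(event)
        self.events.sort(key=lambda e: e.year)

    def by_category(self, category):
        return [e for e in self.events if e.category == category]

# Invented example; not a real offender.
grid = LifeGrid("Example Offender")
grid.add(LifeEvent(1870, "offence", "assault"))
grid.add(LifeEvent(1850, "birth"))
grid.add(LifeEvent(1870, "custody", "local prison", end_year=1872))
```

Keeping civil events (birth, marriage, census entries) and penal events in the same ordered list is what lets offending be read against the rest of a life.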

Other than life-grid format, there are a number of ways that this information can be presented and communicated. Even the simplest visualisations are able to show the role that offending had in any one person’s life. This might be through indicating what proportion of an individual’s life was spent in custody, or how many offences were recorded against them at what stage of their life. It is possible to chart how someone’s offending accelerated and decelerated. From an institutional perspective it is possible to indicate how an individual’s weight and health changed over time, or how their behaviour and privileges impacted upon their experience of punishment. The myriad of ways in which this fascinating and complex data can be presented has some exciting potential for how others see, interrogate, and engage with this fantastically rich data.
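Two of the simple measures mentioned above, the share of a life spent in custody and the number of offences at each age, can be derived with very little code. This Python sketch works in whole years, assumes custody spells do not overlap, and uses invented figures; the function names are hypothetical.

```python
from collections import Counter

def custody_share(birth_year, death_year, custody_spells):
    """Fraction of a lifespan spent in custody, given (start, end) year pairs."""
    lifespan = death_year - birth_year
    if lifespan <= 0:
        raise ValueError("death_year must be later than birth_year")
    years_inside = sum(end - start for start, end in custody_spells)
    return years_inside / lifespan

def offences_by_age(birth_year, offence_years):
    """Map age at offence to the number of offences recorded at that age."""
    counts = Counter(year - birth_year for year in offence_years)
    return dict(sorted(counts.items()))

# Invented figures for illustration only.
share = custody_share(1850, 1900, [(1870, 1872), (1880, 1887)])
ages = offences_by_age(1850, [1866, 1870, 1870, 1880])
```

The first value could feed a single proportional bar per person; the second is the raw material for a chart of how offending accelerated and decelerated over a life.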

To begin to explore these possibilities, we have been working with an example offender: Patrick Madden (one of a number of offenders included in Johnston, Godfrey and Cox’s ESRC funded research on ‘The costs of imprisonment’).

Born and raised in Sheffield, Patrick began offending around the age of sixteen. Although often motivated by property, Patrick’s offences were primarily violent in nature. Madden had 15 offences recorded against him over an almost thirty-year period. Each of these was committed either in Sheffield or in other nearby northern towns such as Wakefield and Doncaster. It was in these locations that he was incarcerated, except for one occasion when he served seven years of penal servitude in London and the south of England. It does not appear that Patrick ever married or had children, nor that he managed to establish a life for himself that did not involve repeat offending for long before dying at the age of 52.

Patrick Madden’s lifegrid, of course, contains much more information than this brief overview might suggest. Patrick’s civil and penal records allow us to know about many elements of his life, right down to his familial relationships and sexual preferences. However, even if we take the most ‘bare bones’ approach to Patrick’s life narrative, it is possible to start creating some interesting visualisations based on his experiences and offending history.

Yet the size and scale of the research being undertaken by the Digital Panopticon means that we are faced not just with presenting Patrick Madden’s life, but instead the lives of all of the ‘Patricks’ that went through the Old Bailey between the late 18th and early 20th centuries. This poses two distinct challenges which we will face in presenting the mass of information traditionally held in lifegrids. The first is that the range of records being linked together for each offender is unprecedented. Some records are well known to our researchers and relatively straightforward to visualise, such as criminal registers that allow us to examine date, place and type of offence. Others, such as the changing picture of family life that might evolve from three successive census entries, or the seemingly random personal or professional information that can be carried in a newspaper report, are far more difficult to quantify and visualise. This first problem will become clearer and hopefully less significant as more records are collected and linked. It should be fairly straightforward to identify the information which can be presented easily, and to adapt that which cannot. The second challenge we must meet is that of potentially presenting to other researchers and the public tens of thousands of individual life and offending histories. What we need to work on is finding a way of presenting a range of different information about our offenders both individually and in aggregate, so that it is possible for users to access information about an individual they are interested in, but also to see how such an individual compares and contrasts with others in the study – something which enables researchers to identify how typical an individual’s experience was.

BG offered some initial ideas of how we might best achieve this when we met in Oxford. By creating ‘strand’ visualisations which present a mass of offenders by a few ‘key values’ – for example the year of their first recorded offence, nature of offence, or length of offending career – and then allowing users to further restrict which strands are shown to them by other values – for example sex and location – it would be possible to access information about a single individual, whilst getting a sense of how they match up to their contemporaries.
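The filtering logic behind such a ‘strand’ view might look something like the following Python sketch. It is purely illustrative: the records are invented, the field names are hypothetical, and comparing an individual against the peer median is just one assumed way of showing how typical they were.

```python
from statistics import median

# Invented offenders; every field name and value here is hypothetical.
offenders = [
    {"id": "o1", "sex": "male", "location": "London",
     "first_offence_year": 1820, "career_years": 12},
    {"id": "o2", "sex": "male", "location": "London",
     "first_offence_year": 1822, "career_years": 4},
    {"id": "o3", "sex": "female", "location": "London",
     "first_offence_year": 1821, "career_years": 2},
    {"id": "o4", "sex": "male", "location": "London",
     "first_offence_year": 1825, "career_years": 8},
]

def select_strands(rows, **criteria):
    """Keep only the offenders whose fields match every given value."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]

def compare_to_peers(individual, peers, key):
    """Place one offender's value alongside the median of the selected strands."""
    values = [p[key] for p in peers]
    return {"individual": individual[key], "peer_median": median(values)}

peers = select_strands(offenders, sex="male", location="London")
comparison = compare_to_peers(offenders[0], peers, "career_years")
```

Here the first offender's twelve-year career sits above the median of eight for the selected male London strands, which is the kind of individual-versus-aggregate reading the strand idea is meant to support.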

We hope that this will prove an excellent starting point as we work to develop future visualisations and methods of presentation which will allow the Digital Panopticon team, fellow researchers, and members of the public to explore, understand, and get the most from the fantastic wealth of data at our fingertips.


Visualising Data Workshop Report: part 2

The second half of the workshop was devoted to work in progress and plans for the Digital Panopticon – I’ll say less about these than those in part 1 because longer versions should be appearing (or have already appeared) here on the blog!

Barry Godfrey briefly introduced the project and the challenges of visualisation of our data.

We’re looking at systematic changes in punishment over a long period of time (late 18th to early 20th century); but we’re also looking at individuals over their lifetimes and at many thousands of individuals.
It’s not just about temporality: we’re also deeply concerned with spatiality – not simply the long distance movement of transportation but movement within Britain.
Another theme of the project is ethical – the responsibilities of revealing so much information about people: how much does this extend to visualisation too?
Finally, there are many potential audiences for DP data visualisation – in addition to researchers and academics, students, teachers, genealogists and other non-traditional users of criminal data. How to cater for so many different people and their needs?

Jamie McLaughlin demonstrated some of our early explorations in record linkage and data visualisation, including a number of Sankey diagrams to show connections between two datasets (Old Bailey Proceedings and British Convict Transportation Register). In particular, he’s been comparing the outcomes for defendants sentenced to transportation and those who were sentenced to death which was subsequently commuted to transportation. Another topic of interest is the people sentenced to be transported who don’t subsequently turn up in the transportation records: what happened to them? Can we find them again elsewhere?

Richard Ward focused on visualising (again, extensively using Tableau Public) a single dataset, the Proceedings, covering much of the ground on questions of age from his recent blog post here. (I learned along the way that the proper demographic term for the tendency to round ages is age heaping.) He also introduced the topic of occupations/status labels – which are problematic in the Proceedings for a number of reasons – and hopefully this will be covered in his next blog post. [slides]

Barry and Lucy Williams rounded off the session by looking at the challenges involved in visualising life grids. Barry’s previous research on 600 prisoners used a wealth of different sources including licenses, medical sources, and other prison records, as well as civil data, and tried to build up as complete a picture as possible of each prisoner’s whole life: this was summarised in life grids. We looked at interesting options for visualising the life of a single prisoner – but how to multiply up to thousands of them? [blog post]

The following discussion introduced a number of suggestions and possible ideas and resources to follow up. Certain themes, however, resurfaced throughout the day as key issues:

The importance of seeing data visualisations as part of a process with changing needs and purposes over the course of the project, and for different people. Part of the challenge is that we want to cater not just for the specific research agendas of the project team members but also for a range of other researchers.

The twin challenges of scaling up and the very long period of time we’re covering; but also the sheer variety of different types of source and data that we’re dealing with. The Proceedings are a very different kind of record from the (mostly) highly structured tabular data of Founders and Survivors, and from the English imprisonment records we’ll be working with.

It was all in all a great day! We were bowled over by the wealth of ideas from our three external speakers and the additional input of everyone who attended for the day, not least Andrew Prescott: thanks to everyone who came for making it such an enjoyable and stimulating event. And I’d add a final thank you to Deb Oxley for organising the event and being a splendid host.


Visualising Data Workshop Report: part 1

The first half of the workshop consisted of speakers we invited to introduce the ways in which they have used visualisation in research, and look at how these could be useful to the Digital Panopticon and researchers attending the event. I’ve included as many links to relevant resources as I could find. (See also the Storify of the event.)

Professor Min Chen of the Oxford e-Research Centre got the day off to a great start. He treated us to a dizzying array of examples of different kinds of visualisations, emphasising the importance of who visualisations are being created for. He surveyed the long history of data visualisation and outlined four levels of visualisation:

disseminative (‘this is’) – presentational aids for dissemination
operational (‘what?’) – enable intuitive and speedy observation of captured data
analytical (‘why?’) – investigative, can be used to examine complex relationships
inventive (‘how?’) – aid improving existing models, methods etc

He also got us to think about ‘modes’ of visualisation, the different perspectives/needs of analysts, presenters and viewers. Question asked: ‘what would be a visual language for the Digital Panopticon?’ – taking into account the different kinds of data we’re working with.

These were just some of the examples!

Poem Viewer from the Imagery Lenses for Visualizing Text Corpora project (Oxford and Utah collaboration) – designed to support close reading by visualising the sounds of poetry.
Temporal Visualization of Boundary-based Geo-information Using Radial Projection – visualising movement of 200 glaciers over 10 years (recorded in satellite images). This was highly challenging: line graphs were too messy, maps not very helpful; a solution was found in radial visualizations.
Visualizing facial dynamics – humans are very good at expression recognition, but computers are terrible; project investigating methods to do this
Use of glyphs (simple stylised icons) rather than text labels in complex workflow diagrams, and to enable display of multiple measurements simultaneously.
Idea of parallel coordinates for visualising multi-dimensional data. (Lots of interest in this!)
How to visualise time without animation? – summarising into a single picture can help to see patterns.

Next, William Allen of the Oxford Migration Observatory talked about ‘Doing the Best with Data: critical realism and visualisation’. The Observatory’s goals are to communicate social science research beyond academia; migration is complex and doing this accessibly is challenging, so they make extensive use of visual techniques.

Visualisations are appealing, as they appear to offer comprehensive and independent windows, but actually achieving this means approaching visualisation as an iterative and critical process. Allen described using a critical realist approach as a lens for evaluation and for critically testing given categories. Rather than ‘what works?’ it’s better to ask ‘what about this visualisation works, for whom, in which contexts, and for what purposes?’

The media monitoring project was set up to monitor and analyse systematically what the press actually say about migration, over a period of time. Analysis of how the press portrays migrant groups uses corpus linguistic methods (43 million words for 2010-12!). Allen showed us a number of visualisations using the tool Tableau Public (which some members of the DP team have also been using).

Allen spoke of the ‘frontiers of visualisation’

political: how data/research are used by range of actors, decisions made through research
technical: the software and built-in assumptions/settings
virtual: interactivity, challenges of opening analysis up to public stakeholders

Questions and problems arising from the Observatory’s work: how do we visualise large datasets and patterns in them? Every decision comes with assumptions about what works. He also emphasised the danger that visualisation software can be a black box – eg, misleading on scale.

Additional resource: The Observatory website has a terrific page of data and resources with ‘ready-made charts and maps on migration in the UK as well as a description of key data sources and their limitations’, and a create your own chart facility. Go and play!

Our third speaker, Arthur Downing (Oxford), gave a presentation on Network Analysis and Visualisation for historians.

A network is a particular set of connections between agents: network analysis is analysis of the patterns of these connections (‘nodes’ and ‘links’). It differs from standard social science methodology (which tends to chop up objects by categories like race and gender and then look at averages), in that network analysis starts with connections between objects/actors and then looks at their attributes. This is important because there can be different patterns of connections within superficially similar scenarios.

Some fascinating case studies he introduced:

RV Gould on networks and mobilization in the Paris Commune 1871
Hillman on Elites before the English civil war – showed the importance of merchants even though their numbers were small, as they linked many other groups together.
Adamic and Glance on the US political blogosphere in 2004

Downing’s own work on 19th-century Friendly Societies – a network analysis of proposers and seconders showed that top 20% of recruiters were responsible for 80% of members. But using ‘eigenvector centrality’ (which takes into account degree of node and degree of nodes connected to each node), also showed that some people were important even though they weren’t large recruiters.
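To make the idea of eigenvector centrality a little more concrete, here is a minimal sketch in Python. It is not Downing's method or data: the names and ties are invented for illustration, and the calculation uses a simple power iteration, which is only one of several ways to approximate the measure. Each person's score is repeatedly updated from the scores of the people they are tied to, so someone with few ties can still score highly if those ties are to well-connected people.

```python
from collections import Counter
from math import sqrt


def eigenvector_centrality(edges, iterations=500, tol=1e-12):
    """Approximate eigenvector centrality on an undirected network by power iteration."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        # Add each node's own score to the sum of its neighbours' scores.
        # Including the node's own score leaves the final ranking unchanged
        # but stops the iteration oscillating on some network shapes.
        updated = {
            node: scores[node] + sum(scores[other] for other in neighbours[node])
            for node in neighbours
        }
        norm = sqrt(sum(value * value for value in updated.values()))
        updated = {node: value / norm for node, value in updated.items()}
        change = max(abs(updated[node] - scores[node]) for node in updated)
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical proposer/seconder ties, invented for this example.
edges = [
    ("Ann", "Bob"), ("Ann", "Cat"), ("Ann", "Dan"),
    ("Bob", "Cat"), ("Bob", "Dan"), ("Cat", "Dan"),      # a tightly knit core
    ("Eve", "Ann"), ("Eve", "Bob"),                      # Eve: only two ties, both into the core
    ("Fay", "Dan"),
    ("Fay", "new1"), ("Fay", "new2"), ("Fay", "new3"),   # Fay: a bigger recruiter of newcomers
]

degree = Counter(name for edge in edges for name in edge)
scores = eigenvector_centrality(edges)
for name in ("Eve", "Fay"):
    print(name, "ties:", degree[name], "centrality:", round(scores[name], 3))
```

In this toy network Fay has more ties than Eve, but Eve comes out with the higher centrality score because both of her ties lead into the well-connected core.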

Network analysis for maps can show more complex patterns than standard maps:

Spread of Freemasons in the US – on a conventional map this just looks like a ‘frontier’ movement, but when mapped as a network, a different picture emerges with more complex directions of flows
Social networks between Australian lodges – most migration is short distance and internal, though migration from England and Wales is very important

Pitfalls and problems:

identifying the boundaries of networks can be difficult
sampling is hard to justify as any missing ties can skew interpretation
longitudinal analysis is difficult – network analysis by definition is a snapshot in time; but we may want to know how long a tie persists. One answer is to break the network down into phases and look at different periods

Conceptually this is very different to standard statistics: ‘analysis of an endogenous system where endogeneity is what is interesting’, but potentially a great method for social history since it’s all about exploring complexity.

In subsequent discussion, concerns about ideological assumptions going into visualisations and how to communicate them to the user – but a reminder that this is a problem with traditional charts and tables too, with no simple answer.

We were deeply grateful to all three speakers for providing us with so much food for thought, and so many ideas to follow up!

[Part 2 of the report to follow shortly…]


Seeing things differently: Visualizing patterns of data from the Old Bailey Proceedings

An edition of the Old Bailey Proceedings

The Old Bailey Proceedings are a rich historical resource, almost unimaginably so. They constitute the largest body of texts detailing the lives of non-elite people ever published. Words alone can’t quite do justice to the magnitude of the Proceedings – 197,745 accounts of trials covering 239 years (1674-1913); some 127 million words of text (at an average reading rate of 250 words per minute, this would take eight hours’ solid reading every single day for nearly three years to get through!); details of some 253,382 defendants, including name, gender, age and occupation, as well as details of 223,246 verdicts passed by the juries and 169,243 punishments sentenced by the judges.

The Proceedings clearly contain a huge amount of information, but they don’t record everything – like any historical source, they are selective in what they document. The amount of information that was recorded in the Proceedings on crimes, verdicts, punishments, defendants and so on also varied over time. And whilst the digitization of the Proceedings by The Old Bailey Online has revolutionised the way in which we search and use this rich historical resource, this also has its limits. The marking-up of the text of the Proceedings (assigning tags to particular pieces of information in the text – such as name or crime – so that this information can be systematically searched) makes it possible to undertake sophisticated statistical analysis. Crimes, verdicts, punishments, defendant age and defendant gender can all be counted at the click of a mouse. Nevertheless, marking-up inevitably involves choices (about what information to tag and the level of detail that is tagged), and those choices limit the ways in which the Proceedings can be studied using computers.

Statistical searches of the Proceedings can be carried out through The Old Bailey Online

The question that we might ask, then, is what are the limitations of the Proceedings as a source of data on such things as punishments, defendant age and gender? Taking the Proceedings in their entirety, what are the limits in terms of the information that was recorded in the original trial reports? How frequently, for example, was the age of the defendant recorded? And what are the limits in terms of what we can actually search for systematically using digital technologies? Can we, for instance, systematically determine the lengths of imprisonment which offenders were sentenced to?

These are crucial questions for us because the Digital Panopticon will rely so heavily on the Proceedings as a source: in our effort to trace the life histories of offenders who were sentenced to transportation or imprisonment at the Old Bailey between 1787 and 1875, the Proceedings will obviously be a vital source of information. After identifying those who were sentenced to transportation or imprisonment recorded in the Proceedings we will then try to trace such individuals both before and after their conviction by linking the Proceedings with other sets of records.

In trying to better understand the limitations of the Proceedings as a source of data for the Digital Panopticon project, I have recently been making use of data visualization (‘dataviz’) – using computers to create visual representations of numbers. This includes the traditional graphs and pie charts that we are all familiar with, and which I will be talking about here. But it also includes more complex forms of visualization which I will be looking at in future posts (watch this space!).

Since the Proceedings contain such a vast amount of information, manual counting and tables are inadequate for making sense of the data. Turning the raw numbers into a visual form makes it much easier to see overall patterns in the data. Here I give just a brief example of how dataviz has helped me to see the Proceedings differently, to appreciate the limits of this immense historical resource, and to think about how information from the Proceedings can be used most effectively in the Digital Panopticon project.


A data visualization of the length of trial reports in the Proceedings over time, created by William J. Turkel as part of the Datamining with Criminal Intent project (created using Mathematica 8)

One of the key things we want to know on the Digital Panopticon is how useful age data might be in helping us to link offenders recorded in the Proceedings with individuals documented in other sets of records (such as the convict transportation registers or census records). In the first instance, links will be made through name searches of the different types of records. But how can we be sure that the John Smith recorded in the Proceedings is the same individual as the John Smith recorded in the prison parole registers, for example? Age data might help us here. If John Smith is recorded as being 24 years old in the Proceedings at the time of his sentence to two years’ imprisonment at the Old Bailey, and the John Smith recorded in the parole registers is stated to be 26 years old, then we can be much more confident that this is indeed the same person. By the same token, if the John Smith recorded in the parole registers is said to be 60 years old, this would suggest not.
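As a rough illustration of this kind of check, here is a minimal Python sketch. It is only a sketch under stated assumptions, not the project's linkage method: the function name, the record years and the two-year tolerance are all hypothetical, chosen to mirror the John Smith example. The tolerance is there because recorded ages can be imprecise.

```python
def ages_compatible(age_a, year_a, age_b, year_b, tolerance=2):
    """Return True if two recorded ages could plausibly belong to the same person.

    age_a/year_a and age_b/year_b are the recorded age and the year of each record.
    tolerance is a hypothetical allowance (in years) for imprecise recording.
    """
    expected_age_b = age_a + (year_b - year_a)
    return abs(age_b - expected_age_b) <= tolerance


# Hypothetical years: sentenced in 1850 aged 24, parole register dated 1852.
print(ages_compatible(24, 1850, 26, 1852))  # a plausible match
print(ages_compatible(24, 1850, 60, 1852))  # not a plausible match
```

A check like this can only rule candidates in or out; it cannot confirm a link on its own, since two different John Smiths may well be the same age.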

Ages could then be extremely useful, but it depends on how extensively, and how accurately, age data is recorded in the Proceedings (and our other sets of records). By visualizing the results of quantitative searches of the Proceedings we can get a clear sense of this, far more so than through the use of text-heavy tables which can be hard to “read” for patterns. A statistical search using The Old Bailey Online reveals that 171,168 defendants are recorded in the Proceedings in the years 1755-1870. Age is recorded for 101,364 (59.2%) of them. So for the entire period of our study, we have age data for a little under three-fifths of all the defendants at the Old Bailey.

Further digging into the data and visualization of the findings reveals some of the deeper patterns in the age data. In the first instance, the recording of ages only began in the year 1790 for defendants found guilty, and from the 1860s for those found not guilty, as shown in the graph below. In the 1790s, we have age data for 65% of guilty defendants, increasing to 90% and above thereafter. By contrast, age data for the not guilty is missing until at least the 1850s, and is not recorded in earnest until the 1860s.


Visualization demonstrating the extent of age recording over time and by verdict
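For readers who want to try a similar count themselves, the following Python sketch shows one way the share of defendants with a recorded age could be tallied by decade and verdict before charting it. The record layout (`year`, `verdict`, `age`) and the four sample records are invented for illustration; they are not drawn from the Proceedings or from The Old Bailey Online's export format.

```python
from collections import defaultdict


def age_coverage(records):
    """Return the share of records with a recorded age, keyed by (decade, verdict)."""
    totals = defaultdict(int)
    with_age = defaultdict(int)
    for record in records:
        decade = record["year"] // 10 * 10
        key = (decade, record["verdict"])
        totals[key] += 1
        if record["age"] is not None:
            with_age[key] += 1
    return {key: with_age[key] / totals[key] for key in totals}


# Hypothetical sample records, for illustration only.
sample = [
    {"year": 1792, "verdict": "guilty", "age": 24},
    {"year": 1795, "verdict": "guilty", "age": None},
    {"year": 1798, "verdict": "not guilty", "age": None},
    {"year": 1864, "verdict": "not guilty", "age": 31},
]

coverage = age_coverage(sample)
for (decade, verdict), share in sorted(coverage.items()):
    print(f"{decade}s {verdict}: {share:.0%}")
```

The resulting shares can then be plotted as a line or bar per verdict, which is the kind of view the graph above provides.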

This gives a sense of how extensively ages are recorded in the Proceedings over time, and according to which categories of offenders. By visualizing the patterns of recorded ages we can also get a feel for how ages were actually recorded. The graph below, for instance, suggests that there was a tendency to revise the defendant’s recorded age up or down slightly to match a round figure. The numbers of defendants whose ages are recorded as 30, 40, 50 and (to a lesser extent) 60 are all significantly above the number we might expect according to the moving average (in other words, when the yellow bar goes above the green line in the graph). By contrast, ages just either side of these figures (such as 29, 31, 39, 41 and 51) are systematically below the average (when the yellow bar is below the green line). There may also have been a tendency for those in their early twenties to have their recorded ages revised down to 18 or 19, since these two ages are also well above the expected number. In short, many more defendants were recorded as being 30 rather than 31, or 40 rather than 41, and the scale of the difference suggests that this resulted from a deliberate policy of revising the defendant’s age up or down to match the nearest round figure.


Visualization demonstrating the “bunching” of recorded ages at 30, 40, 50 and 60
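The comparison between each age's count and a moving average can be sketched in a few lines of Python. This is an illustration under assumptions rather than a reconstruction of the graph: the five-year centred window, the 1.5 threshold and the toy counts are all hypothetical, and I have no way of knowing from the graph alone how its moving average was calculated.

```python
from collections import Counter


def heaped_ages(ages, window=5, threshold=1.5):
    """Return ages whose count exceeds threshold x the centred moving average."""
    counts = Counter(ages)
    half = window // 2
    flagged = []
    # Only test ages that have a full window of neighbouring ages on both sides.
    for age in range(min(counts) + half, max(counts) - half + 1):
        nearby = [counts.get(a, 0) for a in range(age - half, age + half + 1)]
        moving_average = sum(nearby) / window
        if moving_average > 0 and counts.get(age, 0) > threshold * moving_average:
            flagged.append(age)
    return flagged


# Hypothetical counts of defendants per recorded age, for illustration only.
toy_counts = {26: 12, 27: 11, 28: 10, 29: 6, 30: 25, 31: 5, 32: 9, 33: 9, 34: 8}
toy_ages = [age for age, n in toy_counts.items() for _ in range(n)]

heaped = heaped_ages(toy_ages)
print(heaped)
```

With these toy numbers only age 30 is flagged, while 29 and 31 fall well below their local averages, which is the same signature of rounding that the graph shows.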

Together this suggests that age data in the Proceedings will be of much use to us in the Digital Panopticon, particularly for the defendants found guilty and subsequently sentenced to transportation or imprisonment, for whom we have extensive age data from 1790 onwards. In the case of our not guilty control group, however, we have no age data available in the Proceedings to work with before the 1860s, so we will be reliant on other categories of information to link the not guilty defendants across datasets. And the seeming tendency for recorded ages to be rounded up or down suggests that when we use age data to link individuals across datasets it would be more effective to work within age ranges rather than trying to compare specific numbers.

From these early explorations it seems clear that visualization will be invaluable in helping us to identify the overall patterns in the data of the Proceedings. The first step in this is identifying some of the limitations in terms of the information recorded in the Proceedings. Traditional forms of visualization are useful to this end. But there are also potential benefits in going beyond this, by using more complex forms of visualization to uncover deeper patterns in the data – patterns that would be difficult to detect through simple graphs or charts. This is what I will be turning to next.


Event: Visualising Data Workshop, Oxford, April 2014

We are delighted to be able to announce our first project workshop on Visualising Data, part of our Epistemologies research theme. We anticipate that the workshop will be of interest to many people (not just from large projects!) interested in the potential benefits and pitfalls of visualising large historical datasets.

Along the way, we’ll be reflecting on one of our key research questions:

What can visualisation techniques tell us about the overall shape/distinctive patterns in the data, and what does this reveal about the various processes by which the data were created, and their constraints/limitations?

We’re in the process of exploring data visualisation techniques that will enable us to analyse the datasets both individually and collectively, and members of the project team will talk and invite discussion about both the academic and technical challenges this presents. But we also have three excellent external speakers to provide perspectives from a range of fields and projects: Rob Procter (Warwick), Min Chen (Oxford) and William Allen (Oxford).

It’s an afternoon workshop which we hope will enable as many UK-based people as possible to make a one-day trip of it.

Download the Visualising Data Flyer for full programme details.

Workshop Information

When: 2pm-6pm, Monday 14 April 2014
Where: Wharton Room, All Souls College, High St, Oxford, UK.
Twitter: #dpdataviz

How to attend: Email Sharon Howard (sharon.howard@sheffield.ac.uk) to register. Places are very limited, so contact asap!


Thinking about Dates and Data

Our headline dates (1780-1925) are far from being the whole story when it comes to thinking about data collection and record linkage. One of our stated objectives in our original application elaborates:

to chart the fortunes of all Londoners convicted at the Old Bailey between the departure of the First Fleet to Australia (1787) through to the death of the last transported Londoner in Australia in the early 1920s

But in order to do this, we need to look at data from significantly earlier than 1787, or even 1780. Our interest in convicts doesn’t start at the moment of the Old Bailey trial that sent them on their journeys to Australia. For 18th-century offenders, we don’t have census or civil registration records that we can use, so our focus will be on attempting to trace earliest contacts with the criminal justice system. But if we go too far back, we’ll spend a lot of time and computing resources processing data we don’t need, which will increase problems with noise and false positives (especially when we’re looking for needles in haystacks of unstructured data like newspaper or sessions papers).

Still, it seemed worth checking a simpler question initially. We knew some of the convicts transported in 1787 would have been held in the hulks for several years, as authorities sought a replacement for the American colonies (those pesky Revolutionaries). How long exactly? We wanted to pin down a more precise date than 1780.


The First Fleet entering Port Jackson, January 26, 1788 (State Library of New South Wales)

The Old Bailey Online isn’t a very useful source for this question, however convenient it might be (a few moments with the stats search tells me, for example, that 1258 people were sentenced to transportation between 1781 and 1786), because sentences given after trials don’t necessarily reflect actual outcomes: not everyone who was sentenced to transportation was actually transported; and not everyone who was transported had been given that sentence in court (a significant proportion of death sentences were subsequently commuted to transportation). In addition, between the collapse of transportation to the American colonies and the establishment of Australia as the primary recipient of transported convicts, there were experiments with transportation to other colonies.

I needed different sources, based on the actual transportation records, so it was a chance for me to start learning about the transportation and Australian datasets I’m not familiar with. In fact, there is plenty of source material: many of the transportation records routinely included information about the convicts’ trials – offence, court, and date convicted. Moreover, a number of projects have already produced readily usable and accessible datasets based on these sources.

I started with the State Library of Queensland British Convict Transportation Registers database (BCTR), created from Home Office registers (TNA HO11, for those who’re interested). We’ve already indexed this data in Connected Histories. The CH version wasn’t designed for this kind of data analysis, however, and to run individual searches would have been a long slow job, so I downloaded the full dataset and played with it (using OpenRefine) until I got the information I wanted. The earliest trial in there, it seemed, was that of John Martin, in July 1782.

The second relevant and easily accessible dataset was the First Fleet database (FF-DB), which is also available to download. This is a smaller dataset, containing the 780 or so convicts transported on the First Fleet, of whom 327 had been sentenced at the Old Bailey. Unlike the BCTR, it’s been compiled from a number of different primary and secondary sources. In FF-DB, the earliest Old Bailey trials were from 1781. The earliest trial of all was that of Samuel Woodham and John Ruglass, at the sessions of 30 May 1781.

Why hadn’t I found these in BCTR? Because, it transpired on reading the entries, in each case their journey to Australia was actually their second convict voyage. They’d escaped from their first convict destination and had been convicted of returning from transportation around 1784-5. BCTR only gave the date of the second conviction that actually put them on the ships to Australia, whereas FF-DB records both. Most of the 14 FF-DB convicts from 1782 trials had also returned from transportation (several had been involved in the Mercury mutiny) and been re-sentenced at a later date.

Don’t ya just love the way a ‘simple’ historical question is never so simple after all?

A different question I decided to ask the data: setting aside 1781-2 outliers, what was the more normal interval between conviction and departure for Australia for the Old Bailey First Fleeters? The following table is taken from the FF data (without taking the “re”-transported into account): 213 (65%) were originally tried in 1784 or earlier. Those who’d spent less than 3 years in the hulks could presumably consider themselves the lucky ones.

Year of conviction	Number of convictions
1781	4
1782	14
1783	48
1784	147
1785	37
1786	49
1787	28
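The share quoted above can be checked directly from the table. This short Python snippet simply re-adds the table's own figures; the variable names are mine.

```python
# Year of conviction -> number of convictions, copied from the table above.
convictions_by_year = {
    1781: 4, 1782: 14, 1783: 48, 1784: 147, 1785: 37, 1786: 49, 1787: 28,
}

total = sum(convictions_by_year.values())
tried_by_1784 = sum(n for year, n in convictions_by_year.items() if year <= 1784)
share = tried_by_1784 / total

print(f"{tried_by_1784} of {total} ({share:.0%})")  # prints "213 of 327 (65%)"
```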

Now I needed to investigate the age range of the First Fleet convicts, which would help me to work out the likely earliest dates of contact with the justice system. Both the transportation and Old Bailey Online data contain at least some information about ages, although 18th-century information on this is often imprecise and not always accurate. I wasn’t too worried about this, since they didn’t need to be exact for this purpose.

[Chart: recorded ages of the Old Bailey First Fleet convicts]

What are the recorded ages of the First Fleet convicts in FF-DB? There is age information for 309 out of the OB sample of 327 (bearing in mind these are recorded as ages at the time of departure, so they’d have generally been a few years younger at the time of trial). I think it will hardly come as a major surprise to 18th-century crime historians that the majority (64%) were between 20 and 30 years old, and the vast majority (95%) were over 15 and under 40.

That age data could be skewed in various ways, though: it’s conceivable that those selecting prisoners for the First Fleet tended to choose younger people who’d be more likely to survive the passage, and be stronger workers at the other end; on the other hand, we might reasonably speculate that very young offenders would be less likely to be transported.

Age data is available for only about 3% of Old Bailey Online defendants between 1740 and 1780 (contrasting sharply with the later 19th-century Proceedings – which in itself tells us a lot about changes in record-keeping generally and surveillance of the criminal elements in society in particular). We have no idea how representative that 3% was so I’m wary of taking any hard numbers from it. (And again, I can imagine that very young offenders might be slightly less likely to appear at the Old Bailey than at lower courts.) But it does show a reasonably similar profile to FF-DB, with very, very few defendants under 15, though rather more between 40 and 50 – which might (if we could really trust it) back up my notion that the First Fleet convicts tended to be selected from younger prisoners.

Using the age of 45 (in 1787) as an upper limit would give a birth year c. 1742 – let’s round that down to 1740 for convenience. So, if they were unlikely to appear in criminal justice records much before the age of 15, that takes us to 1755. That too will not be quite the final word: we’ll probably do manual searches in earlier records for the handful of First Fleeters aged over 45, and for individuals who appear to have exceptionally rich stories. But in terms of data collection for automated searching/processing, that is likely to be close to our “real” starting date.
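The back-of-envelope reasoning here is easy to write down as a small Python function, which also makes it simple to see how the starting date shifts if the assumptions change. The function and parameter names are hypothetical; the inputs are the ones used in the paragraph above.

```python
def earliest_search_year(departure_year, max_age, min_age_of_contact, round_to=10):
    """Estimate the earliest year worth searching for a cohort's records.

    Works back from the oldest likely convict's birth year, rounds that down
    to a convenient boundary, then adds the youngest likely age of first
    contact with the criminal justice system.
    """
    birth_year = departure_year - max_age                # 1787 - 45 = 1742
    rounded_birth = birth_year // round_to * round_to    # 1742 -> 1740
    return rounded_birth + min_age_of_contact            # 1740 + 15 = 1755


start_year = earliest_search_year(1787, max_age=45, min_age_of_contact=15)
print(start_year)  # prints 1755
```

Raising the upper age limit to 50, for example, would push the same calculation back to 1745 (assuming the same rounding), which gives a feel for how much extra data a more generous cut-off would bring in.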
