The Challenge of Visualising 100,000 Convict Lives

The Digital Panopticon project is linking together a wide variety of criminal justice, genealogical, and biometric records to trace thousands of convict lives from birth to death. Each story will start with a birth date anywhere from the mid-eighteenth century to the mid-nineteenth century, and will include a variety of events including convictions for minor offences, one or more Old Bailey trials and punishments, possible subsequent convictions, marriage, children, census records, and death. We are calling these life archives, though many will only present fragments of lives, depending on the amount of evidence available. One such fragment we have already assembled is that of John Davis, born in about 1817, convicted of stealing some clothes and other items from a dwelling house in 1836, incarcerated for a month on the hulks, and transported on the ship Moffatt to New South Wales, where he arrived several months later.

Life Archive for John Davis

How do we summarise 100,000 stories like this? How can we find common patterns among all the individual narratives? The project is exploring a variety of visualisation techniques in order to summarise this evidence without, as much as possible, obscuring the complexity of the individual stories. We have already used visualisations to assess levels of missing evidence and detect errors in the Old Bailey Proceedings (“Men as Wives: Visualising Errors in the Old Bailey Proceedings Data” and “Seeing Things Differently: Visualising Patterns of Data from the Old Bailey Proceedings”), and to identify patterns in individual datasets (“Transportation Under the Macroscope” and “Open Data and the Digital Panopticon”). But how do we use visualisations to document relations between datasets?

There is a bewildering array of visualisation formats available, as this Google Images screenshot indicates. Which one should we choose?

Which visualisation?!

The choice obviously depends on the nature of the information to be displayed. Our most successful record linkage so far is between the records of sentences (from the Old Bailey Proceedings) and the records of punishments experienced (primarily execution, transportation, and imprisonment).  You may be surprised to read that there was a considerable discrepancy between the punishments judges dictated to convicts in the Old Bailey courtroom and the actual punishments they received.  Following their sentences, many convicts received reduced punishments as a result of pardons, other decisions taken by penal officials, and ill health or death.

Most useful to us for representing these patterns are Sankey diagrams, which depict flows in many to many relationships. Individual lines trace individual journeys, but where the same paths are followed by many people they are brought together as thicker lines, the thickness of the line denoting the volume of the flow.

Old Bailey sentences vs actual penal outcomes, 1790-99

For example, this diagram traces the convicts’ experiences in the 1790s, focusing on the two main sentences of that decade, death and transportation. We can see from this that only a proportion (28%) of those sentenced to death were actually executed, with many others being transported (following a conditional pardon), and a few experiencing other outcomes such as going into service in the army or navy (during the French wars) or death. Similarly, only around two-thirds of those sentenced to transportation were actually transported, with the remainder ending up in the hulks (and then presumably discharged after a period), or experiencing a small number of other outcomes.
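
The flows in a diagram of this kind reduce to counts of (sentence, outcome) pairs, and percentages like those quoted above are shares within one sentence. As a rough sketch in Python, not the project's own code: the function names `flow_counts` and `outcome_shares` are hypothetical, and the records are invented for illustration rather than taken from the 1790s data.

```python
from collections import Counter


def flow_counts(records):
    """Count how many convicts followed each (sentence, outcome) path."""
    return Counter(records)


def outcome_shares(records, sentence):
    """Share of each outcome among convicts who received one sentence."""
    flows = flow_counts(records)
    total = sum(n for (s, _), n in flows.items() if s == sentence)
    if total == 0:
        return {}
    return {o: n / total for (s, o), n in flows.items() if s == sentence}


# Invented illustrative records, not figures from the project.
records = [
    ("death", "executed"),
    ("death", "transported"),
    ("death", "transported"),
    ("death", "army or navy"),
    ("transportation", "transported"),
    ("transportation", "transported"),
    ("transportation", "hulks"),
]

shares = outcome_shares(records, "death")
```

Each entry in `flow_counts(records)` corresponds to one band in the diagram, with the count setting its thickness.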

The advantage of presenting the information in this way, as opposed to, for example, a table, is that it is readily understandable without obscuring the variety of the possible outcomes. Moreover, the patterns which stand out pose questions for further research: how and why did so many potential transportees manage to evade this punishment, and what determined which punishments they actually received? These are issues we are currently investigating.

But what happens when the variables become more complex, and the number of stages prisoners might go through multiplies? This is the problem we are working on now. As noted, our multiple datasets include information about a variety of different types of events in convict lives. Sankey diagrams should be able to help, as they can show multiple paths through several stages, which is what we want to do with convict lives. Each life history can be a line in a Sankey diagram, which, when thousands of lives are included, would reveal general patterns. But how do we manage the large number of events, taking place at different times? A problem here is that we want to introduce a time element to the variables (the actual dates of events), which makes it too complicated for a normal Sankey diagram.

There is no off-the-peg solution to this problem.  But here is a crude mock up using Excel of what we hope to achieve.  Eventually we will develop visualisations like this using D3, a JavaScript library for producing data visualisations.

Twenty-four convict lives from birth to punishment

This is based on twenty-four convict lives where we currently have eight or more records, including their birth, previous conviction (if any), Old Bailey conviction, and punishment (periods of incarceration in the hulks or a prison and subsequent release, or transportation, or execution).

It is hard to draw conclusions from the rather inelegant presentation, but you can start to see some interesting patterns.  A flat line means little time elapsed, while a steep line connotes a longer period.  We can see how many convicts had previous convictions, and how these often occurred years before the Old Bailey conviction which led to the punishment displayed.  In terms of punishment, we can see significant changes over time in the nineteenth century: crudely a shift from incarceration in the hulks followed by transportation; to prisons followed by transportation; to prisons leading to a prison licence.  What will happen when we replicate this format with tens of thousands of cases?  Will patterns become clearer, or will it just be a mess?

Convict lives by age at which events occurred

In fact, this visualisation is in some respects already too complicated to interpret easily. If we remove the date variable and just use the age at which events occurred, it simplifies things. Here different patterns emerge: the wide age range of previous convictions (many first convictions took place at a young age); the wide age range of those convicted at the Old Bailey; the relatively short time gaps between conviction and commitment on the hulks, and (usually) between incarceration on the hulks and transportation; the longer times spent in prison before transportation or licence; and the older ages of those sentenced to prison.
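
Replacing dates with ages is a simple transformation of each life archive. The sketch below assumes, purely for illustration, that each event is stored as a (label, year) pair and that only a year of birth is known; `life_by_age` is a hypothetical name, and the second example uses invented dates.

```python
def life_by_age(birth_year, events):
    """Convert (label, year) events into (label, age) pairs, earliest first."""
    ordered = sorted(events, key=lambda event: event[1])
    return [(label, year - birth_year) for label, year in ordered]


# John Davis: born about 1817, convicted at the Old Bailey in 1836.
davis = life_by_age(1817, [("Old Bailey conviction", 1836)])

# A hypothetical convict with invented dates, to show the ordering.
example = life_by_age(
    1800,
    [
        ("transportation", 1825),
        ("previous conviction", 1818),
        ("Old Bailey conviction", 1824),
    ],
)
```

Because many birth dates are approximate ("born in about 1817"), ages computed this way can be no more precise than the birth year they rest on.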

Obviously, this is work in progress, and we have a lot more work to do to create accessible and fine-tuned visualisations providing these types of information, while including thousands more cases. We hope that what we come up with will be of use not only to this project, but also to researchers in other fields who want to create visual representations of vast amounts of complex data in accessible formats.

Open Data and the Digital Panopticon

Of all historical periods and subjects, crime and justice in eighteenth- and nineteenth-century London is the most extensively digitised. Through the digitisation of countless numbers of court records, transportation registers, prison archives, trial reports, criminal biographies, last dying speeches and newspapers (amongst many other things), we can access a wealth of information about crime, policing and punishment in the metropolis, and about the fates of the offenders tried there, all at the click of a mouse.

To our great benefit, much of this data is openly available, a product of the dogged efforts of public bodies, academics, data developers, volunteers and enthusiasts; often (but certainly not always) supported by public funding. In the process it has opened up seemingly boundless possibilities for research.

Indeed, without several of these open datasets the Digital Panopticon could not be realised. In our efforts to trace the life courses and subsequent offending histories of London convicts transported to Australia or imprisoned in Britain in the late eighteenth and nineteenth centuries, we will be reliant on a number of open datasets such as the British Convict Transportation Registers and Female Prison Licences.

It seems timely, therefore, on Open Data Day, to celebrate these fantastic, freely-accessible resources, and to highlight just a couple of ways in which they will be useful to us on the Digital Panopticon. Taking place on 21 February 2015, Open Data Day will involve a series of events and gatherings which seek to develop support for, and to encourage, the adoption of open data policies by the world’s local, regional and national governments.

I have talked in a previous post about the ways in which visualisations of the openly-available British Convict Transportation Registers database can be used to put transportation under the ‘macroscope’ – to chart the complex patterns and interactions of penal transportation in their entirety, spanning the breadth of Australia and the length of a century, taking in the lives of tens of thousands of individuals along the way.

In this post I briefly want to highlight another open dataset which will be at the heart of the project – the prison licence records of females incarcerated in British jails in the nineteenth century, held by the National Archives (under the catalogue reference PCOM 4), the metadata for which is openly available on the Archive’s online catalogue.

The licences almost without exception record the age of the offender on conviction, a potentially useful piece of information for us on the Digital Panopticon in terms of record linkage. But, as with our other datasets, we want to know how accurately ages were recorded, and once again visualising the data from the female licences suggests some interesting things for us to think about.

Not least, it again reveals the tendency towards age heaping in the recording of ages at round numbers such as 20, 30 and 40, suggesting that recorded ages were regularly rounded up or down rather than representing the true age of the offender. If ages were recorded accurately, we would expect to see a smooth distribution of recorded ages. As seen in the graph below, however, this was far from the case in the recording of female prisoner ages in the nineteenth century, with spikes at the ages of 20, 30, 40 and 50, and dips at the ages 29, 31, 39, 41.
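
One simple way to put a number on a spike like this is to compare the count at a round age with the counts at the ages on either side. This is an ad hoc measure chosen for illustration, not a method the project reports using; `heaping_ratio` is a hypothetical name and the ages are invented.

```python
from collections import Counter


def heaping_ratio(ages, round_age):
    """Count at round_age divided by the mean count at the two adjacent ages."""
    counts = Counter(ages)
    neighbours = (counts[round_age - 1] + counts[round_age + 1]) / 2
    if neighbours == 0:
        return None  # no neighbouring observations to compare against
    return counts[round_age] / neighbours


# Invented ages for illustration only.
ages = [28, 29, 30, 30, 30, 30, 31, 32, 32, 33]
ratio = heaping_ratio(ages, 30)
```

A ratio close to 1 would suggest a smooth distribution around that age, while a ratio well above 1 points to rounding.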

Age on Conviction as Recorded in the PCOM4 Female Prison Licences

Does this mean, therefore, that we should disregard recorded ages as entirely inaccurate? Not necessarily – as the graph below demonstrates, when we compare the distribution of ages across different sets of records, it suggests that recorded ages were perhaps broadly reflective of age patterns. The distribution of offender ages is typically younger in the Old Bailey Proceedings (OBP) and in the Convict Indents (CIN – the records of those transported to Australia) compared to that of females imprisoned in Britain (PCOM4) – certainly what we would expect, given the nature of criminal justice policy at the time.

Ages of Female Offenders as Recorded across each Dataset

These are just a couple of ways in which the Digital Panopticon will be drawing upon the wealth of open data available to criminal justice historians. We are indebted to the hard work of all those who have contributed to the creation and dissemination of this embarrassment of riches which, in combination with the powerful digital technologies now at our fingertips, is opening up a whole new realm of research opportunities.

 

BCHS4 presentation: Visualising Digital Panopticon Data

Abstract:

The Digital Panopticon will assemble a larger collection of datasets than any other crime history project to date (including, amongst many others, the Old Bailey Proceedings, convict transportation registers and prison records), covering hundreds of thousands of individuals. To effectively bring together this information to reconstruct the lives of offenders, we need to develop a detailed understanding of our datasets – of what information is and isn’t recorded on offenders, and how this varied both over time and across different sets of records. Traditional methods of data analysis and representation such as manual counting and tables are inadequate to this end. This paper instead highlights the power of digital technologies in identifying previously unrecognised (and otherwise unrecognisable) patterns. The techniques of data visualisation in particular have been invaluable in uncovering how extensively, and in what manner, information on offender age, occupation and crime location was recorded within our sources. By using digital technologies to step back from our datasets, and see them in their entirety, we can develop a much fuller and more systematic understanding of the sources we are working with.

Slides:

Seeing things Differently: Visualising Data on Crime and Punishment

Transportation Under the Macroscope

Computers are brilliant microscopes. They make it easy to find needles in haystacks. Want to find references to the famous lawyer William Garrow amongst the millions of words in the printed reports of trials held at the Old Bailey, for instance? A keyword search produces the results in less than a second. Without computers it would take months. Likewise, as I explained in a recent post, through the techniques of data visualisation computers can be used to spot (what would otherwise be largely imperceptible) errors within the massive datasets that we are drawing upon in the Digital Panopticon project.

But computers are also fantastic macroscopes: today’s powerful digital technologies allow us to stand back from our sources and view them in their entirety. We can see the big picture, presenting complex and large-scale patterns in simple but effective ways. Microscopes allow us to see the infinitely small. Telescopes reveal the infinitely great. Macroscopes, meanwhile, peer into the infinitely complex, allowing us to explore combinations, relationships and interactions between multiple elements.

By visualising the information recorded in the British Convict Transportation Registers, I’ve recently put penal transportation to Australia in the eighteenth and nineteenth centuries under the macroscope. This has produced some interesting insights into the relationship between Australian penal colonies, terms of transportation and how these changed and interacted over time.

The British Convict Transportation Registers database provides information on more than 123,000 offenders who were transported to Australia between 1787 and 1867. It’s a fantastic resource, and it will be at the heart of the Digital Panopticon project’s efforts to chart the criminal lives of London convicts sent to Australia. In charting these lives, we need to address some overarching starting questions. How many London convicts were actually transported to Australia for their crimes? Which parts of Australia were they sent to? How many years abroad did they face according to their sentences? Did this change over time, and what was the relationship between these different elements? Visualisations can help us to explore these questions across the long term and a large scale.

The total number of London convicts transported to Australia fluctuated greatly over the late eighteenth and nineteenth centuries, as Graph 1 below demonstrates. Relatively few convicts were transported in the years 1793–1804 when the Revolutionary War monopolised Britain’s shipping resources. With the end of the Napoleonic War in 1815 there were however large and rapid increases in the numbers of London convicts sent to Australia, reaching a massive peak in the 1830s. Thereafter, numbers gradually fell until the eventual abandonment of penal transportation in the 1860s. Interestingly, this pattern reflects a wider inverse relationship between the numbers of convicts transported and the years in which Britain was engaged in war throughout the eighteenth and nineteenth centuries.

Graph 1

What Graph 1 doesn’t reveal is that the places in Australia to which convicts were sent changed over time. The individual penal colonies to which London convicts were sent operated at different times. As Graph 2 below shows, New South Wales was the first penal colony in Australia, and was later used alongside the penal colony of Van Diemen’s Land between the late 1820s and 1840, when transportation to Australia was at its peak. Following this, Van Diemen’s Land was used almost exclusively, until the 1850s, when Western Australia became the sole transportation location in Australia.

Graph 2

If the locations of penal transportation to Australia changed over time then so too did the lengths of time which offenders were sentenced to serve abroad. Between 1787 and the virtual abandonment of New South Wales as a penal colony in the late 1830s, as Graph 3 highlights, offenders were sentenced almost without exception to a term of 7 years, 14 years or life. Between 1840 and 1850, when Van Diemen’s Land was used almost exclusively, terms became more varied, with greater use of 10 and 15 year sentences. And especially after 1853, when Western Australia became the sole destination for transportees, an even greater variety of terms was put to use. This more nuanced tariff in transportation sentences was likely introduced to make transportation more acceptable to penal reformers who increasingly viewed the practice with concern.

Graph 3

These changes in penal colony and terms of transportation were intimately linked, and the interaction between the two is clearly captured in Graph 4. The colonies operated at different times, and the law which underpinned them and the terms of transportation which could be imposed changed accordingly. In short, the convicts who found themselves on the shores of New South Wales were primarily one of two kinds: either those sentenced to 7 years’ transportation, or those sentenced to a whole life abroad. By contrast, London convicts landing some 2,000 miles away on the shores of Western Australia on the eve of transportation’s demise in the 1850s would have had a much wider range of terms to serve out.

Graph 4

Through the macroscope of computer-generated visualisations, we can see these complex patterns and interactions in their entirety, spanning the breadth of Australia and the length of a century, taking in the lives of tens of thousands of individuals along the way.

Men as Wives: Visualising Errors in the Old Bailey Proceedings Data

In a recent post I talked about some of the ways in which data visualisations have helped me to see patterns in the information recorded in the Old Bailey Proceedings on things such as crimes, verdicts, punishments and the ages of defendants, patterns that might otherwise have been missed if using traditional methods of representing data such as tables. Here I just want to give a brief update on my analysis of the Proceedings, particularly the recording of defendant occupations and social status in the Proceedings in the eighteenth and nineteenth centuries. Again, visualisations have been extremely useful, especially in identifying errors in the data.

As with the recording of defendant ages, it might well be the case that information on the occupation/social status of those tried at the Old Bailey in the eighteenth and nineteenth centuries could be useful to us on the Digital Panopticon project in tracing offenders across different sets of records. Just as an age or a birth date might allow us to establish whether the “John Smith” tried at the Old Bailey and the “John Smith” transported to Australia was indeed the same person, likewise information on occupation or social status can help us to prove/disprove such name matches across records. But as with ages it depends on how extensively, and in what manner, such information on occupation/social status is recorded in our sources. And to this end, as with information on defendant age, the techniques of data visualisation can be useful.

Searches of the Proceedings for defendant occupation/social status can be carried out using the “custom search” page of the Old Bailey Proceedings Online.

However, whereas with defendant ages I was able to use the “statistics search” function of the Old Bailey Proceedings Online to generate numbers for analysis, this wasn’t possible in the case of defendant occupation/social status. In the process of digitising the original trial reports, defendant occupation was indeed tagged as a distinct category of information, and thus it can be searched for systematically in the “custom search” page of the Old Bailey Proceedings Online. But this can’t be used to quantitatively analyse the recording of defendant occupations in the Proceedings. In order to do this I needed to look at the website’s underlying data file of defendant information.

This is a large file which includes numerous fields of tagged information relating to all the defendants tried at the Old Bailey and reported in the Proceedings. Since much of this information is in the form of text rather than numbers, software such as Excel isn’t very useful in analysing the data. Instead I turned to Tableau Public, a free, web-based tool that is powerful but still easy to use. There are numerous other data visualisation tools available which are ideal for novices. All need to be used with caution, but used carefully they can be invaluable. (I’m going to talk in more detail about the actual process of using tools such as Tableau to undertake crime history in my next post, so watch this space.)

By running our file on Old Bailey defendant information through Tableau I’ve been able to create some fairly simple but nonetheless useful visualisations. For the data on defendant occupation and social status this has revealed two things in particular.

Pie chart demonstrating frequency of recording defendant occupation

First of all, it has highlighted how little information we actually have on the occupational and social status of Old Bailey defendants from the seventeenth to the twentieth centuries. Across the entire publication history of the Proceedings between 1674 and 1913, occupation or social status is recorded for only 11% of all the defendants put on trial. In the years 1755 to 1834, occupation/social status is recorded for 15% of defendants, but between 1834 and 1906 virtually no defendants’ occupations were recorded. On the whole, therefore, we have occupation information for only a small proportion of defendants, and virtually none after 1834, which covers much of our specific period (c. 1787-1875).

The sheer variety of occupations recorded in the Proceedings was also made clear by visualising the data. The bubble chart below, for example, gives an indication of this, and of the relative frequency with which different categories are recorded. One of the problems is that the same occupations were recorded in the Proceedings in slightly different ways (“servant” and “servants”, for example) or with variant spelling (such as “taylor” and “tailor”). If we wanted to use occupation or social status labels to verify name matches across sets of records, this suggests that we would need to use sophisticated forms of keyword searching.
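
To give a sense of what reconciling such labels involves, here is a deliberately crude sketch. The `VARIANTS` table and the `normalise_occupation` function are hypothetical; a real variant list would have to be built from the data itself, and the plural rule below would mishandle many words.

```python
# Hypothetical variant table; a real one would be compiled from the data.
VARIANTS = {"taylor": "tailor"}


def normalise_occupation(label):
    """Lower-case, trim, strip a plural 's', then map known spelling variants."""
    text = label.strip().lower()
    # Crude plural handling: drop a final 's' unless the word ends in 'ss'.
    if text.endswith("s") and not text.endswith("ss"):
        text = text[:-1]
    return VARIANTS.get(text, text)


labels = ["servant", "Servants", "taylor", "tailor"]
normalised = sorted({normalise_occupation(label) for label in labels})
```

Here the four raw labels collapse to two categories, which is the kind of reduction needed before occupation could help confirm or reject a name match.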

Bubble chart showing categories of defendant occupation

Bubble chart of defendant occupations by gender

But visualisations have been especially useful in highlighting some of the errors in the recording of occupations within the Old Bailey Proceedings data. One of the things that I wanted to find out was how occupation labels varied according to the gender of the defendants tried at the Old Bailey. In order to do this I used Tableau to create a bubble chart of the most common forms of recorded occupations/social status for male and female defendants in the years when we have significant amounts of information on this. One of the things that really struck me in this bubble chart was the number of men whose occupation label is recorded in our Proceedings dataset as “wife”. This clearly seemed to be an error in the data, but I wanted to know what the source of the problem was, so I went back to the original data file and filtered it for male defendants with the occupation/social status label of “wife”. I then looked at the trial reports in the Old Bailey Proceedings for these cases.

Trial report in the Old Bailey Proceedings in which the husband of a female defendant has been tagged with the social status of “wife”

It turns out that many of these cases were due to errors in the digitisation process which resulted from the unusual nature of the trial reports themselves. The cases were actually ones (such as the example shown here) in which a female defendant had been named in the trial report as the wife of her husband, and thus the automated tagging process used to digitise the Proceedings had recorded both the husband and the wife as defendants and assigned them both the role of “wife”. This practice in the Proceedings of naming the female defendant as the wife of her husband largely disappeared in the nineteenth century, and therefore most of these errors in the data file tend to come from the eighteenth century. By identifying these kinds of anomalies, visualisations allow us to find errors in the data. Such errors can then be rectified. This leaves us with a much “cleaner” dataset, thereby increasing the chances of successful record linkage.

Historians of crime (particularly the history of crime in Britain) have been quick to exploit the plethora of digitised criminal justice (and associated) records that are now available online. We all make use of resources such as the Old Bailey Proceedings Online, Eighteenth-Century Collections Online and digitised newspapers. But whilst we have been quick to take advantage of the benefits offered by these digitised records – such as keyword searching to find needles in haystacks – we have been less ready to understand the full effects of the digitisation process for how we study our sources and the information that we extract from them. By using data visualisations we can better understand the implications of digitisation, including the ways in which the actual process of turning a paper record into a digital format might result in errors (relatively rare, it should be said, in the case of the Old Bailey Proceedings Online) in the information we compile.

Adventures with Data Linkage

The British Convict Transportation Registers is a database detailing the journeys of over 123,000 people transported to Australia in the 18th and 19th centuries. Compiled from British Home Office records, it contains information such as the name of each person being transported, the date they departed, and their final destination.

The early stages of the Digital Panopticon have allowed us to perform some preliminary data linkage between these registers and people sentenced to transportation in the Old Bailey Proceedings. We’ve made the links primarily by name, with a degree of tolerance for spelling. We found that many names actually matched exactly, suggesting that perhaps names were in some cases directly copied from one record to another. A further 7% of names could be matched via an algorithm known as Soundex, which attempts to identify names which sound similar when spoken, but might be (accidentally) spelt differently. A remaining handful were matched by virtue of having a small Levenshtein Distance. Levenshtein Distance is a simple metric that counts the minimum number of single-character insertions, deletions or substitutions needed to turn one text string into another. Including matches with a very small Levenshtein Distance, where perhaps only a single letter is different or omitted, helps take account of minor clerical errors.
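
For readers unfamiliar with the two algorithms, the sketch below implements standard American Soundex and Levenshtein distance and applies them in the order described above. It is an illustration rather than the project's linkage code: the `match_kind` function and its `max_distance=1` threshold are assumptions.

```python
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_kind(a, b, max_distance=1):
    """Classify a name pair as 'exact', 'soundex', 'levenshtein' or None."""
    x, y = a.strip().lower(), b.strip().lower()
    if x == y:
        return "exact"
    if soundex(x) == soundex(y):
        return "soundex"
    if levenshtein(x, y) <= max_distance:
        return "levenshtein"
    return None
```

For example, `match_kind("Davis", "Davies")` is a Soundex match, while `match_kind("Knight", "Night")` fails Soundex (the first letters differ) but is caught by Levenshtein distance, which is why the two measures complement each other.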

Results of attempted name matching between the British Transportation Records and Old Bailey Proceedings.

In total, about 70% of the people sentenced to transportation in the Proceedings appear in the transportation records. We can be quite confident of about half of these, because in some cases the date of conviction is actually given in the transportation record. If the date and name match, it becomes very likely that we’re dealing with the same individual. For transportation records where a conviction date is not given, we have to examine five or six years’ worth of Old Bailey records to make sure we don’t miss a possible match. This greatly increases the possibility of a false positive, so we can be less sure about these links.
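
The effect of a missing conviction date on the number of candidate links can be shown with a small sketch. The record layout (dictionaries with `name`, `date`, `convicted` and `departed` keys), the `candidate_trials` function, and the six-year window measured back from the departure date are all assumptions made for illustration, and the records are invented. Names are compared exactly here to keep the example short.

```python
from datetime import date


def candidate_trials(register_entry, trials, window_years=6):
    """Old Bailey trials that could correspond to one register entry."""
    name = register_entry["name"].lower()
    convicted = register_entry.get("convicted")
    departed = register_entry["departed"]
    matches = []
    for trial in trials:
        if trial["name"].lower() != name:
            continue
        if convicted is not None:
            # A recorded conviction date must agree with the trial date.
            if trial["date"] == convicted:
                matches.append(trial)
        elif 0 <= (departed - trial["date"]).days <= window_years * 366:
            # No conviction date: accept any trial within the window.
            matches.append(trial)
    return matches


# Invented records for illustration only.
trials = [
    {"name": "John Smith", "date": date(1820, 1, 1)},
    {"name": "John Smith", "date": date(1830, 5, 1)},
    {"name": "John Smith", "date": date(1833, 9, 1)},
]
dated = {"name": "John Smith", "convicted": date(1833, 9, 1),
         "departed": date(1834, 2, 1)}
undated = {"name": "John Smith", "convicted": None,
           "departed": date(1834, 2, 1)}
```

With the conviction date, `candidate_trials(dated, trials)` returns a single trial; without it, `candidate_trials(undated, trials)` returns two, and nothing in the name alone tells us which is right.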

One interesting trend is that the number of exact links decreases significantly in cases where the conviction date is not given. A greater proportion of these links had to be made with Soundex or Levenshtein Distance. This suggests that the links made without a conviction date are less reliable, as we might expect. Therefore, for the time being we will discard these.

With our most reliable links in hand, we can begin looking for patterns between the details of conviction and transportation. One of the most interesting pieces of information contained in the transportation records is the destination of convict ships. An obvious question is whether convicts were directed to particular destinations based upon their offence, gender or age. One might imagine colonies having a need for people with particular skills or attributes at particular times, and the system might have attempted to address these needs. Luckily, occupation is indeed sporadically recorded in the Old Bailey Proceedings.

In fact, the data shows that the overwhelming factor in deciding where a convict was sent was the particular year when they left England. Transportation was almost exclusively to New South Wales before 1831, and overwhelmingly to Van Diemen’s Land after 1838. There is a brief period from 1832 to 1835 where roughly equal numbers of convicts are sent to both destinations. However, even during that period, there doesn’t appear to be any correlation between the characteristics of a convict and their destination. Neither gender nor age, nor crime or occupation, seems to have made any difference. Once a person was in the transportation system, their final destination appears to have been arbitrary with respect to these characteristics: there was no easily identifiable tendency to send people with particular attributes to particular destinations.

Sankey diagram, showing proportions of different age groups transported to different destinations between 1832 and 1835, including where the destination is unknown because a link between records could not be made.

If we cannot find a pattern in where people were sent, perhaps we can find a pattern in how long it took them to be sent there. For every convict there is a period of time between when they were convicted and when they actually set sail aboard a ship. The interval between conviction and transportation is hugely variable. A few people were transported in little over a month. Some people, as we have noted, spent six years waiting to be transported.

Line graph showing the minimum, maximum and average intervals between conviction and transportation between 1787 and 1852.

The data shows that again, time was a very important factor. Transportation almost halted between 1835 and 1844, as did sentences of transportation. In contrast, the system seems to have been at peak efficiency between about 1814 and 1834, but even then there are a few outliers (represented by the green line) who still had to wait a very long time to be transported.
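
As a rough illustration of the calculation behind a graph like this, the sketch below groups linked records by year and reports the minimum, maximum and average wait in days. It assumes each link is a pair of dates and that records are grouped by year of conviction; the post does not say which year was used, and the function name and sample dates are invented for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def interval_summary(links):
    """Summarise days between conviction and departure, per conviction year.

    `links` is an iterable of (conviction_date, departure_date) pairs.
    """
    by_year = defaultdict(list)
    for convicted, departed in links:
        by_year[convicted.year].append((departed - convicted).days)
    return {
        year: {"min": min(days), "max": max(days), "mean": mean(days)}
        for year, days in sorted(by_year.items())
    }


# Invented sample data, not taken from the project's records.
sample_links = [
    (date(1820, 1, 10), date(1820, 3, 10)),
    (date(1820, 6, 1), date(1821, 6, 1)),
    (date(1836, 2, 1), date(1836, 3, 2)),
]
for year, stats in interval_summary(sample_links).items():
    print(year, stats)
```

Keeping the minimum and maximum alongside the mean matters here, because the outliers who waited years are exactly the cases the average hides.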

Detail of a scatterplot variation showing every interval between Proceedings conviction and BTR transportation, represented by horizontal bars running from conviction date to transportation date. Females are blue, males are orange.

If we look at the data in more detail, we can see that a great many of those sentenced to transportation, at least early in the period, are simply waiting for the next boat to depart. Convicts sentenced at multiple sessions are stored up until, presumably, there are enough to justify a voyage. Nevertheless, there are people who seem to miss multiple voyages; people convicted at the same session as those who depart on the next boat who are, for whatever reason, left behind. Can we detect any common characteristics among these people?

It is not at all easy to find a pattern, but there may be one: Male prisoners below the age of 15 appear to be kept for longer, on average, than those who are older. It’s worth noting that the minimum and maximum intervals show no such trend; there are still people under fifteen who are transported very quickly, and people over fifteen who are held for a very long time. But in terms of the average, there is a definite increase which starts abruptly at the age of fifteen and then accelerates as prisoners get younger. In fact, on average, male prisoners under fifteen are kept for twice as long as those over fifteen.

Age plotted against minimum, maximum and average days between conviction and transportation, for males sentenced at the Old Bailey 1787-1852.

This is a finding which we can begin to investigate and verify. Certainly, the pattern is not repeated for female prisoners, whose average transportation time remains remarkably consistent regardless of age. As the project gathers more data and continues its initial investigations, we hope to be able to explore this possible trend in more detail.

This is the very first linking exercise we have done, and there is undoubtedly scope to refine the process. Every dataset we add will help us to evaluate our findings more thoroughly and ask more detailed questions. The next step may be to try to link the Old Bailey and Transportation Registers to the Convict Database, which contains information such as height and prisoner health. These may well be important factors in determining the treatment of prisoners and providing further clues as to the nature of a journey through the eighteenth- and nineteenth-century criminal justice system.

Visualising Life-Grids and Narrating the Lives of Convicts

One of the great opportunities presented by the Digital Panopticon project (and one of the most exciting in my opinion) is in uncovering more about the processes of crime and punishment by placing thousands of offenders, and their offences, back within the context of their own lives.

Tracing offenders through the records has been a preoccupation of several groups of historians and criminologists (for example Barry Godfrey, Heather Shore, Pam Cox, David Cox, Helen Johnston, Zoe Alker, Joanne Turner, and Stephen Farrall) in the last decade. On account of the laborious nature of record linkage, those studies which have focussed on tracing groups of offenders through civil as well as criminal datasets have only been able to examine a few hundred offenders at a time. Those pioneering this methodology have taken the collected information and sorted it into ‘lifegrids’ which chart life events and changes for each individual. Lifegrids might typically include details of birth, marriage and death, family evolution, employment and residential addresses, and offending and punishment history. Of course, the depth and breadth of documents and information available on different groups of offenders, or on individual offenders, dictate how much material can be recorded in each lifegrid.

Other than the lifegrid format, there are a number of ways that this information can be presented and communicated. Even the simplest visualisations are able to show the role that offending had in any one person’s life. This might be through indicating what proportion of an individual’s life was spent in custody, or how many offences were recorded against them at what stage of their life. It is possible to chart how someone’s offending accelerated and decelerated. From an institutional perspective it is possible to indicate how an individual’s weight and health changed over time, or how their behaviour and privileges impacted upon their experience of punishment. The many ways in which this complex data can be presented offer exciting potential for how others see, interrogate, and engage with it.

To begin to explore these possibilities, we have been working with an example offender: Patrick Madden (one of a number of offenders included in Johnston, Godfrey and Cox’s ESRC funded research on ‘The costs of imprisonment’).

P Madden

Born and raised in Sheffield, Patrick began offending around the age of sixteen. Although often motivated by property, Patrick’s offences were primarily violent in nature. Madden had 15 offences recorded against him over an almost thirty-year period. Each of these was committed either in Sheffield or in other nearby northern towns such as Wakefield and Doncaster. It was in these locations that he was incarcerated, except for one occasion when he served seven years of penal servitude in London and the south of England. It does not appear that Patrick ever married or had children, nor that he managed to establish a life for himself that did not involve repeat offending for long before dying at the age of 52.

Patrick Madden’s lifegrid, of course, contains much more information than this brief overview might suggest. Patrick’s civil and penal records allow us to know about many elements of Patrick’s life right down to his familial relationships and sexual preferences. However, even if we take the most ‘bare bones’ approach to Patrick’s life narrative, it is possible to start creating some interesting visualisations based on his experiences and offending history.

DataHero charts: Patrick Madden’s years of imprisonment over his life course; type of offending over his life course; weight over period of imprisonment; penal class over time of imprisonment.

Yet the size and scale of the research being undertaken by the Digital Panopticon means that we are faced not just with presenting Patrick Madden’s life, but instead the lives of all of the ‘Patricks’ that went through the Old Bailey between the late 18th and early 20th centuries. This poses two distinct challenges which we will face in presenting the mass of information traditionally held in lifegrids. The first is that the range of records being linked together for each offender is unprecedented. Some records are well known to our researchers and relatively straightforward to visualise, such as criminal registers that allow us to examine date, place and type of offence. Others, such as the changing picture of family life that might evolve from three successive census entries, or the seemingly random personal or professional information that can be carried in a newspaper report, are far more difficult to quantify and visualise. This first problem will become clearer and hopefully less significant as more records are collected and linked. It should be fairly straightforward to identify the information which can be presented easily, and to adapt that which cannot. The second challenge we must meet is that of potentially presenting tens of thousands of individual life and offending histories to other researchers and the public. What we need to work on is finding a way of presenting a range of different information about our offenders both individually and in aggregate, so that it is possible for users to access information about an individual they are interested in, but also to see how such an individual compares and contrasts with others in the study – something which enables researchers to identify how typical an individual’s experience was.

BG offered some initial ideas of how we might best achieve this when we met in Oxford. By creating ‘strand’ visualisations which present a mass of offenders by a few ‘key values’ – for example the year of their first recorded offence, nature of offence, or length of offending career – and then allowing users to further restrict which strands are shown to them by other values – for example sex and location – it would be possible to access information about a single individual, whilst getting a sense of how they match up to their contemporaries.

BG visualisation

We hope that this will prove an excellent starting point as we work to develop future visualisations and methods of presentation which will allow the Digital Panopticon team, fellow researchers, and members of the public to explore, understand, and get the most from the fantastic wealth of data at our fingertips.

 

Visualising Data Workshop Report: part 2

The second half of the workshop was devoted to work in progress and plans for the Digital Panopticon – I’ll say less about these than those in part 1 because longer versions should be appearing (or have already appeared) here on the blog!

Barry Godfrey briefly introduced the project and the challenges of visualisation of our data.

  • We’re looking at systematic changes in punishment over a long period of time (late 18th to early 20th century); but we’re also looking at individuals over their lifetimes and at many thousands of individuals.
  • It’s not just about temporality: we’re also deeply concerned with spatiality – not simply the long distance movement of transportation but movement within Britain.
  • Another theme of the project is ethical – the responsibilities of revealing so much information about people: how much does this extend to visualisation too?
  • Finally, there are many potential audiences for DP data visualisation – in addition to researchers and academics, students, teachers, genealogists and other non-traditional users of criminal data. How to cater for so many different people and their needs?

Jamie McLaughlin demonstrated some of our early explorations in record linkage and data visualisation, including a number of Sankey diagrams to show connections between two datasets (Old Bailey Proceedings and British Convict Transportation Register). In particular, he’s been comparing the outcomes for defendants sentenced to transportation and those who were sentenced to death which was subsequently commuted to transportation. Another topic of interest is the people sentenced to be transported who don’t subsequently turn up in the transportation records: what happened to them? Can we find them again elsewhere?

Richard Ward focused on visualising (again, extensively using Tableau Public) a single dataset, the Proceedings, covering much of the ground on questions of age in his recent blog post here. (I learned along the way that the proper demographic term for the tendency to round ages is age heaping.) He also introduced the topic of occupations/status labels – which are problematic in the Proceedings for a number of reasons – and hopefully this will be covered in his next blog post. [slides]

Barry and Lucy Williams rounded off the session by looking at the challenges involved in visualising life grids. Barry’s previous research on 600 prisoners used a wealth of different sources including licenses, medical sources, and other prison records, as well as civil data, and tried to build up as complete a picture as possible of each prisoner’s whole life: this was summarised in life grids. We looked at interesting options for visualising the life of a single prisoner – but how to multiply up to thousands of them? [blog post]

The following discussion introduced a number of suggestions and possible ideas and resources to follow up. Certain themes, however, resurfaced throughout the day as key issues:

The importance of seeing data visualisations as part of a process with changing needs and purposes over the course of the project, and for different people. Part of the challenge is that we want to cater not just for the specific research agendas of the project team members but also for a range of other researchers.

The twin challenges of scaling up and the very long period of time we’re covering; but also the sheer variety of different types of source and data that we’re dealing with. The Proceedings are a very different kind of record from the (mostly) highly structured tabular data of Founders and Survivors, and from the English imprisonment records we’ll be working with.

It was all in all a great day! We were bowled over by the wealth of ideas from our three external speakers and the additional input of everyone who attended for the day, not least Andrew Prescott: thanks to everyone who came for making it such an enjoyable and stimulating event. And I’d add a final thank you to Deb Oxley for organising the event and being a splendid host.

Visualising Data Workshop Report: part 1

The first half of the workshop consisted of speakers we invited to introduce the ways in which they have used visualisation in research, and look at how these could be useful to the Digital Panopticon and researchers attending the event. I’ve included as many links to relevant resources as I could find. (See also the Storify of the event.)

Professor Min Chen of the Oxford e-Research Centre got the day off to a great start. He treated us to a dizzying array of examples of different kinds of visualisations, emphasising the importance of who visualisations are being created for. He surveyed the long history of data visualisation and outlined four levels of visualisation:

  1. disseminative (‘this is’) – presentational aids for dissemination
  2. operational (‘what?’) – enable intuitive and speedy observation of captured data
  3. analytical (‘why?’) – investigative, can be used to examine complex relationships
  4. inventive (‘how?’) – aid improving existing models, methods etc

He also got us to think about ‘modes’ of visualisation, the different perspectives/needs of analysts, presenters and viewers. Question asked: ‘what would be a visual language for the Digital Panopticon?’ – taking into account the different kinds of data we’re working with.

These were just some of the examples!

  • Poem Viewer from the Imagery Lenses for Visualizing Text Corpora project (Oxford and Utah collaboration) – designed to support close reading by visualising the sounds of poetry.
  • Temporal Visualization of Boundary-based Geo-information Using Radial Projection – visualising movement of 200 glaciers over 10 years (recorded in satellite images). This was highly challenging: line graphs were too messy, maps not very helpful; a solution was found in radial visualizations.
  • Visualizing facial dynamics – humans are very good at expression recognition, but computers are terrible; project investigating methods to do this
  • Use of glyphs (simple stylised icons) rather than text labels in complex workflow diagrams, and to enable display of multiple measurements simultaneously.
  • Idea of parallel coordinates for visualising multi-dimensional data. (Lots of interest in this!)
  • How to visualise time without animation? – summarising into a single picture can help to see patterns.

Next, William Allen of the Oxford Migration Observatory talked about ‘Doing the Best with Data: critical realism and visualisation’. The Observatory’s goals are to communicate social science research beyond academia; migration is complex and doing this accessibly is challenging, so they make extensive use of visual techniques.

Visualisations are appealing, as they appear to offer comprehensive and independent windows, but actually achieving this requires approaching visualisation as an iterative and critical process. Use of a critical realism approach as a lens for evaluation, critical testing of given categories. Rather than ‘what works?’ it’s better to ask ‘what about this visualisation works, for whom in which contexts, for what purposes?’

The media monitoring project was set up to monitor and analyse systematically what the press actually say about migration, over a period of time. Analysis of how press portrays migrant groups uses corpus linguistic methods (43 million words for 2010-12!). Allen showed us a number of visualisations using the tool Tableau Public (which some members of DP team have also been using).

Allen spoke of the ‘frontiers of visualisation’:

  • political: how data/research are used by range of actors, decisions made through research
  • technical: the software and built-in assumptions/settings
  • virtual: interactivity, challenges of opening analysis up to public stakeholders

Questions and problems arising from the Observatory’s work: how do we visualise large datasets and patterns in them? Every decision comes with assumptions about what works. Also emphasised the danger that visualisation software can be a black box – eg, misleading on scale.

Additional resource: The Observatory website has a terrific page of data and resources with ‘ready-made charts and maps on migration in the UK as well as a description of key data sources and their limitations’, and a create your own chart facility. Go and play!

Our third speaker, Arthur Downing (Oxford), gave a presentation on Network Analysis and Visualisation for historians.

A network is a particular set of connections between agents: network analysis is analysis of the patterns of these connections (‘nodes’ and ‘links’). It differs from standard social science methodology (which tends to chop up objects by categories like race and gender and then look at averages), in that network analysis starts with connections between objects/actors and then looks at their attributes. This is important because there can be different patterns of connections within superficially similar scenarios.

Some fascinating case studies he introduced:

Downing’s own work on 19th-century Friendly Societies – a network analysis of proposers and seconders showed that the top 20% of recruiters were responsible for 80% of members. But using ‘eigenvector centrality’ (which takes into account the degree of a node and the degree of the nodes connected to it) also showed that some people were important even though they weren’t large recruiters.
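
To make the idea concrete, here is a minimal sketch of eigenvector centrality computed by power iteration on a small, entirely invented recruitment network. It is not Downing's data or method: the names, the edges and the function are hypothetical. Each pass adds a node's own score to its neighbours' scores, a standard trick that keeps the iteration from oscillating on graphs such as stars.

```python
def eigenvector_centrality(edges, max_iter=1000, tol=1e-9):
    """Power iteration on an undirected graph given as (a, b) pairs.

    Scores are rescaled so that the highest-scoring node is 1.0.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    score = {node: 1.0 for node in neighbours}
    for _ in range(max_iter):
        # Own score plus neighbours' scores: iterating with (A + I)
        # gives the same ranking as A but converges more reliably.
        new = {n: score[n] + sum(score[m] for m in neighbours[n])
               for n in neighbours}
        top = max(new.values())
        new = {n: v / top for n, v in new.items()}
        if max(abs(new[n] - score[n]) for n in new) < tol:
            return new
        score = new
    return score


# Invented network: Ann and Bob each recruit five members; Cat is tied
# only to Ann and Bob; Dan is tied only to two peripheral members.
edges = (
    [("Ann", f"a{i}") for i in range(1, 6)]
    + [("Bob", f"b{i}") for i in range(1, 6)]
    + [("Cat", "Ann"), ("Cat", "Bob"), ("Dan", "a1"), ("Dan", "x1")]
)
centrality = eigenvector_centrality(edges)
for name in ("Ann", "Cat", "Dan"):
    print(name, round(centrality[name], 3))
```

Cat and Dan both have exactly two ties, but Cat's are to the two largest recruiters, so Cat scores higher: the same kind of effect as people who mattered without being large recruiters themselves.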

Network analysis for maps can show more complex patterns than standard maps:

  • Spread of Freemasons in the US – on a conventional map this just looks like a ‘frontier’ movement, but when mapped as a network, a  different picture emerges with more complex directions of flows
  • Social networks between Australian lodges – most migration is short distance and internal, though migration from England and Wales is very important

Pitfalls and problems:

  • identifying the boundaries of networks can be difficult
  • sampling is hard to justify as any missing ties can skew interpretation
  • longitudinal analysis is difficult – network analysis by definition is a snapshot in time; but we may want to know how long a tie persists. One answer is to break the period down into phases and look at each separately

Conceptually this is very different to standard statistics: ‘analysis of an endogenous system where endogeneity is what is interesting’, but potentially a great method for social history since it’s all about exploring complexity.

In subsequent discussion, concerns about ideological assumptions going into visualisations and how to communicate them to the user – but a reminder that this is a problem with traditional charts and tables too, with no simple answer.

We were deeply grateful to all three speakers for providing us with so much food for thought, and so many ideas to follow up!

[Part 2 of the report to follow shortly…]

Seeing things differently: Visualizing patterns of data from the Old Bailey Proceedings

An OBP

An edition of the Old Bailey Proceedings

The Old Bailey Proceedings are a rich historical resource, almost unimaginably so. They constitute the largest body of texts detailing the lives of non-elite people ever published. Words alone can’t quite do justice to the magnitude of the Proceedings – 197,745 accounts of trials covering 239 years (1674-1913); some 127 million words of text (at an average reading rate of 250 words per minute, this would take eight hours’ solid reading every single day for nearly three years to get through!); details of some 253,382 defendants, including name, gender, age and occupation, as well as details of 223,246 verdicts passed by the juries and 169,243 punishments sentenced by the judges.

The Proceedings clearly contain a huge amount of information, but they don’t record everything – like any historical source, they are selective in what they document. The amount of information that was recorded in the Proceedings on crimes, verdicts, punishments, defendants and so on also varied over time. And whilst the digitization of the Proceedings by The Old Bailey Online has revolutionised the way in which we search and use this rich historical resource, this also has its limits. The marking-up of the text of the Proceedings (assigning tags to particular pieces of information in the text – such as name or crime – so that this information can be systematically searched) makes it possible to undertake sophisticated statistical analysis. Crimes, verdicts, punishments, defendant age and defendant gender can all be counted at the click of a mouse. Nevertheless, marking-up inevitably involves choices (about what information to tag and the level of detail that is tagged), and those choices limit the ways in which the Proceedings can be studied using computers.

Statistical searches of the Proceedings can be carried out through The Old Bailey Online

The question that we might ask, then, is what are the limitations of the Proceedings as a source of data on such things as punishments, defendant age and gender? Taking the Proceedings in their entirety, what are the limits in terms of the information that was recorded in the original trial reports? How frequently, for example, was the age of the defendant recorded? And what are the limits in terms of what we can actually search for systematically using digital technologies? Can we, for instance, systematically determine the lengths of imprisonment which offenders were sentenced to?

These are crucial questions for us because the Digital Panopticon will rely so heavily on the Proceedings as a source: in our effort to trace the life histories of offenders who were sentenced to transportation or imprisonment at the Old Bailey between 1787 and 1875, the Proceedings will obviously be a vital source of information. After identifying those who were sentenced to transportation or imprisonment recorded in the Proceedings we will then try to trace such individuals both before and after their conviction by linking the Proceedings with other sets of records.

In trying to better understand the limitations of the Proceedings as a source of data for the Digital Panopticon project, I have recently been making use of data visualization (‘dataviz’) – using computers to create visual representations of numbers. This includes the traditional graphs and pie charts that we are all familiar with, and which I will be talking about here. But it also includes more complex forms of visualization which I will be looking at in future posts (watch this space!).

Since the Proceedings contain such a vast amount of information, manual counting and tables are inadequate for making sense of the data. Turning the raw numbers into a visual form makes it much easier to see overall patterns in the data. Here I give just a brief example of how dataviz has helped me to see the Proceedings differently, to appreciate the limits of this immense historical resource, and to think about how information from the Proceedings can be used most effectively in the Digital Panopticon project.

A data visualization of the length of trial reports in the Proceedings over time, created by  William J. Turkel as part of the Datamining with Criminal Intent project (created using Mathematica 8)

One of the key things we want to know on the Digital Panopticon is how useful age data might be in helping us to link offenders recorded in the Proceedings with individuals documented in other sets of records (such as the convict transportation registers or census records). In the first instance, links will be made through name searches of the different types of records. But how can we be sure that the John Smith recorded in the Proceedings is the same individual as the John Smith recorded in the prison parole registers, for example? Age data might help us here. If John Smith is recorded as being 24 years old in the Proceedings at the time of his sentence to two years’ imprisonment at the Old Bailey, and the John Smith recorded in the parole registers is stated to be 26 years old, then we can be more confident that this is indeed the same person. By the same token, if the John Smith recorded in the parole registers is said to be 60 years old, this would suggest not.

Ages could then be extremely useful, but it depends on how extensively, and how accurately, age data is recorded in the Proceedings (and our other sets of records). By visualizing the results of quantitative searches of the Proceedings we can get a clear sense of this, far more so than through the use of text-heavy tables which can be hard to “read” for patterns. A statistical search using The Old Bailey Online reveals that 171,168 defendants are recorded in the Proceedings in the years 1755-1870. Of these, age is recorded for 101,364 (59.2%). So for the entire period of our study, we have age data for nearly three-fifths of all the defendants at the Old Bailey.

Further digging into the data and visualisation of the findings reveals some of the deeper patterns in the age data. In the first instance, the recording of ages only began in the year 1790 for defendants found guilty, and from the 1860s for those found not guilty, as shown in the graph below. In the 1790s, we have age data for 65% of guilty defendants, increasing to 90% and above thereafter. By contrast, age data for the not guilty is missing until at least the 1850s, and is not recorded in earnest until the 1860s.

Visualization demonstrating the extent of age recording over time and by verdict

This gives a sense of how extensively ages are recorded in the Proceedings over time, and according to which categories of offenders. By visualizing the patterns of recorded ages we can also get a feel for how ages were actually recorded. The graph below, for instance, suggests that there was a tendency to revise the defendant’s recorded age up or down slightly to match a round figure. The numbers of defendants whose ages are recorded as 30, 40, 50 and (to a lesser extent) 60 are all significantly above the number we might expect according to the moving average (in other words, when the yellow bar goes above the green line in the graph). By contrast, ages just either side of these figures (such as 29, 31, 39, 41 and 51) are systematically below the average (when the yellow bar is below the green line). It may well also have been the tendency for those in their early twenties to have their recorded ages revised down to 18 or 19, since these two ages are also well above the expected number. In short, many more defendants were recorded as being 30 rather than 31, or 40 rather than 41, and the scale of the difference suggests that this resulted from a deliberate policy of revising the defendant’s age up or down to match the nearest round figure.

Visualization demonstrating the “bunching” of recorded ages at 30, 40, 50 and 60
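
The comparison between each age's count and a moving average can be sketched in a few lines. This is only an illustration under stated assumptions: the window width and the ratio used to flag an age are invented parameters (the post does not give the ones behind the graph), and the sample counts are made up.

```python
from collections import Counter


def heaped_ages(ages, window=5, ratio=1.2):
    """Return ages whose count is at least `ratio` times the centred
    moving average of counts over `window` consecutive ages."""
    counts = Counter(ages)
    half = window // 2
    flagged = []
    # Skip the edges, where a full window is not available.
    for age in range(min(counts) + half, max(counts) - half + 1):
        average = sum(counts[a] for a in range(age - half, age + half + 1)) / window
        if average and counts[age] / average >= ratio:
            flagged.append(age)
    return flagged


# Invented counts: 30 is over-represented, 29 and 31 under-represented.
sample_counts = {26: 10, 27: 10, 28: 10, 29: 5, 30: 20,
                 31: 5, 32: 10, 33: 10, 34: 10}
sample_ages = [age for age, n in sample_counts.items() for _ in range(n)]
print(heaped_ages(sample_ages))  # [30]
```

Note that the heaped age inflates the moving average of its neighbours, so a wider window or a different baseline might be preferable on real data.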

Together this suggests that age data in the Proceedings will be of much use to us in the Digital Panopticon, particularly for the defendants found guilty and subsequently sentenced to transportation or imprisonment. In this instance we have extensive amounts of age data from 1790 onwards. In the case of our not guilty control group, however, we have no age data available in the Proceedings to work with before the 1860s. In this instance we will be reliant on other categories of information to link the not guilty defendants across datasets. And in light of the seeming tendency for recorded ages to be rounded up or down, this suggests that when we use age data to link individuals across datasets it would be more effective to work within age ranges rather than trying to compare specific numbers.
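
Working within age ranges might look something like the sketch below, which asks whether two recorded ages are compatible once the years between the two records are allowed for. The function name, the example years and the two-year tolerance are assumptions chosen for illustration, not values used by the project.

```python
def ages_compatible(age_a, year_a, age_b, year_b, tolerance=2):
    """Return True if two recorded ages could belong to the same person.

    The age in record A is projected forward (or back) to the year of
    record B, and the two are treated as compatible when they differ by
    no more than `tolerance` years, to allow for rounded ages.
    """
    expected_b = age_a + (year_b - year_a)
    return abs(age_b - expected_b) <= tolerance


# Hypothetical records: aged 24 at trial in 1850, seen again in 1852.
print(ages_compatible(24, 1850, 26, 1852))  # True
print(ages_compatible(24, 1850, 60, 1852))  # False
```

A wider tolerance catches more rounded ages but also admits more false matches, so the value would need to be tuned against links that are already known to be correct.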

From these early explorations it seems clear that visualization will be invaluable in helping us to identify the overall patterns in the data of the Proceedings. The first step in this is identifying some of the limitations in terms of the information recorded in the Proceedings. Traditional forms of visualization are useful to this end. But there are also potential benefits in going beyond this, by using more complex forms of visualization to uncover deeper patterns in the data – patterns that would be difficult to detect through simple graphs or charts. This is what I will be turning to next.