Tutorial: Calculating disparities in stop and search
I recently published a report on how stop and search is used in London. Although the report contained lots of information, it was probably inevitable that people would focus on the figures for disparities in stop and search rates between ethnic groups. This has led several people to ask how I calculated these disparities, so I’ll set that out in this tutorial.
Prepare the data
First, we load the tidyverse
R package, which contains various functions for handling data and the scales
package for formatting numbers.
library("scales")
library("tidyverse")
To calculate search rates we need data on searches and data on population. The holy grail of population data would be an accurate representation of the number of people available to be searched in a particular place, since a person in a public place is vastly more likely to be searched in relative terms than someone sitting watching TV in their kitchen. Unfortunately, this data doesn’t exist, so we have to rely on residential population data. As I discuss in the report, this is not perfect, but it is probably acceptable if we calculate search rates for London as a whole, rather than particular neighbourhoods.
Search data is available from data.police.uk but is unfortunately stuck behind a web form, meaning it is difficult to download directly within an R script. I’ve instead downloaded the stop-and-search data for the Metropolitan Police for July to September 2020. This is slightly different to the data I used for the report, which also included stops by the City of London Police and British Transport Police in London, but almost all searches in London are done by the Met, so this is unlikely to affect the results much 1.
# you can download a copy of the data file I downloaded from data.police.uk
# from https://github.com/mpjashby/lesscrime.info/raw/master/public/post/met_stop_and_search_data_july_to_september_2020.zip
# the search data comes in a ZIP archive, so we will unzip to a temporary
# directory and then read it from there
unzip(
zipfile = "met_stop_and_search_data_july_to_september_2020.zip",
exdir = str_glue("{tempdir()}/stop-search-data/")
)
# now read all the CSV files from the temporary directory and combine them into
# a single data frame
searches <- str_glue("{tempdir()}/stop-search-data/") %>%
dir(pattern = "csv$", full.names = TRUE, recursive = TRUE) %>%
map(read_csv) %>%
bind_rows() %>%
# convert the column names to be easier to work with in R code
janitor::clean_names()
# select only the columns we need
searches <- select(searches, type, gender, age_range, self_defined_ethnicity)
This gives us a data frame where each row represents a stop and search of one person or vehicle. The first few rows:
type | gender | age_range | self_defined_ethnicity |
---|---|---|---|
Person and Vehicle search | Male | 25-34 | Black/African/Caribbean/Black British - Any other Black/African/Caribbean background |
Person search | Male | 10-17 | White - English/Welsh/Scottish/Northern Irish/British |
Person search | Male | 18-24 | White - English/Welsh/Scottish/Northern Irish/British |
Person search | Male | 10-17 | White - Any other White background |
Person search | Male | NA | Other ethnic group - Not stated |
Person search | Male | 25-34 | Other ethnic group - Not stated |
Population estimates for London for 2020 are available on the Mayor of London’s website. These are only estimates, but are probably the best figures available because the latest census data are now nearly a decade out of date. We won’t get a more-accurate measure of London population until data from the 2021 Census is published in 2022 or 2023.
# download the data to a temporary file
# if this file no longer exists, check
# https://data.london.gov.uk/dataset/ethnic-group-population-projections for the
# current URL
download.file(
url = "https://data.london.gov.uk/download/ethnic-group-population-projections/a9598ef0-808c-4f96-9eac-8bb314bd92cd/Ethnic%20group%20projections%20%282016-based%20central%20trend%29.xlsx",
destfile = str_glue("{tempdir()}/population_data.xlsx")
)
We now load the population data and format it so that it is easier to work with.
# load two sheets from the same Excel file, one for female population and one
# for male
people <- map_dfr(
c("Population - Females", "Population - Males"),
~ readxl::read_excel(path = str_glue("{tempdir()}/population_data.xlsx"),
sheet = .)
)
# filter out rows of data we do not need
people <- filter(
people,
# filter out rows that have totals for all ages combined, since we only want
# data for each age group separately
age != "All ages",
# filter out rows that have values for specific boroughs, since we only want
# data for London as a whole
borough == "Greater London",
# filter out rows that have totals for all ethnic groups combined, since we
# only want data for each group separately
!ethnic_group %in% c("All persons", "BAME")
)
# reformat the data into the format that we need
people <- mutate(
people,
# convert age to a numeric value so we can use between()
age = as.numeric(age),
# categorise ages into age ranges that match the stop and search data
age_range = case_when(
between(age, 0, 9) ~ "under 10",
between(age, 10, 17) ~ "10-17",
between(age, 18, 24) ~ "18-24",
between(age, 25, 34) ~ "25-34",
age > 34 ~ "over 34",
TRUE ~ NA_character_
),
# change sex to lower case
sex = str_to_lower(sex)
) %>%
# change the name of the variable containing the population count for each
# group
rename(people = `2020`)
# select only the columns we need
people <- select(people, sex, age_range, ethnic_group, people)
This gives us a data frame where each row represents the number of residents in London in each category of age, sex and ethnicity. Again, the first few rows:
sex | age_range | ethnic_group | people |
---|---|---|---|
female | under 10 | White British | 19,429 |
female | under 10 | White British | 18,616 |
female | under 10 | White British | 17,883 |
female | under 10 | White British | 17,235 |
female | under 10 | White British | 16,890 |
female | under 10 | White British | 16,741 |
Count searches and people
To calculate search rates, we need to calculate the number of searches involving people in each group. Before we can do this, we need to:
- rename some columns and the values of the ‘sex’ variable so they match those in the population data,
- aggregate ethnic groups into the categories ‘Asian’, ‘Black’, ‘Mixed’, ‘Other’ and ‘White’, since the number of searches in some of the smaller ethnic groups is too low to draw conclusions from,
- remove searches of unattended vehicles (or attended vehicles where no people were searched), since we don’t have age, sex and ethnicity data for those,
- remove searches for which one or more of age, sex and ethnicity are missing, or where sex is listed as ‘other’ since there is no population data for this group.
search_counts <- searches %>%
# rename columns so they match the population data
rename(sex = gender, ethnic_group = self_defined_ethnicity) %>%
# change sex to lower case
mutate(sex = str_to_lower(sex)) %>%
# aggregate ethnic groups by taking the first word ('White', 'Black' etc) from
# each group name
mutate(ethnic_group = str_extract(ethnic_group, "^\\w+")) %>%
filter(
# filter out vehicle-only searches
type %in% c("Person search", "Person and Vehicle search"),
# filter out rows with missing, age, sex or ethnicity
!is.na(age_range), !is.na(ethnic_group), sex %in% c("female", "male")
) %>%
# count searches in each category
count(sex, age_range, ethnic_group, name = "searches")
We must also count the number of people in each population group, since in the raw data there is a separate row for each individual year of age. We also aggregate ethnicity categories to match those in the search data.
people_counts <- people %>%
mutate(
#
ethnic_group = case_when(
ethnic_group %in% c("Bangladeshi", "Indian", "Pakistani", "Other Asian") ~
"Asian",
ethnic_group %in% c("Black African", "Black Caribbean", "Other Black") ~
"Black",
ethnic_group %in% c("White & Asian", "White & Black African",
"White & Black Caribbean", "Other Mixed") ~
"Mixed",
ethnic_group %in% c("White British", "White Irish", "Other White") ~
"White",
TRUE ~ "Other"
)
) %>%
# count searches in each category
count(sex, age_range, ethnic_group, wt = people, name = "people")
Now that the two datasets are in the same format, we can merge them to create a single dataset to work with.
counts <- full_join(
search_counts,
people_counts,
by = c("sex", "age_range", "ethnic_group")
) %>%
# population groups in which no one was searched will have missing values for
# the search data, so we convert these missing values to zeros
replace_na(list(searches = 0))
The first few rows of data are:
sex | age_range | ethnic_group | searches | people |
---|---|---|---|---|
female | 10-17 | Asian | 16 | 87,696 |
female | 10-17 | Black | 77 | 81,226 |
female | 10-17 | Mixed | 31 | 41,925 |
female | 10-17 | Other | 113 | 23,377 |
female | 10-17 | White | 300 | 180,885 |
female | 18-24 | Asian | 134 | 64,680 |
Calculating search rates and disparity
Now we have data we can work with, we can calculate search rates per 1,000 people in a particular population group and for the population as a whole. Note, however, that some demographic groups have very small numbers of searches so the calculated rates are likely to be unstable. To deal with this, we will only calculate rates for groups for which there were at least 100 searches between July and September 2020. We will also exclude the ‘Other’ ethnicity category, because it covers such a broad range of different people that it is difficult to interpret the results.
overall_rate <- sum(counts$searches) / (sum(counts$people) / 1000)
group_rates <- counts %>%
filter(ethnic_group != "Other", searches >= 100) %>%
mutate(search_rate = searches / (people / 1000))
We could show these values as a table, but there would probably be too much information to easily digest. Instead we can show the different search rates as a chart (remember some values will be missing because there were fewer than 100 searches of people in that group).
ggplot(
group_rates,
aes(
# set the order of the age groups so that the follow the natural age order
x = fct_relevel(age_range, "10-17", "18-24", "25-34", "over 34"),
y = search_rate,
fill = sex,
# format the rate for printing
label = number(search_rate, accuracy = 0.1)
)
) +
geom_col() +
geom_text(size = 3.5, vjust = 0, nudge_y = 2) +
facet_grid(cols = vars(ethnic_group), rows = vars(sex)) +
labs(
title = "Stop-and-search rates for different deomgraphic groups",
x = NULL,
y = "search rate per 1,000 people"
) +
theme_minimal() +
theme(
legend.position = "none",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
plot.title.position = "plot"
)
To calculate disparity ratios, we simply divide the search rate for each group by the overall search rate of 6.20 searches per 1,000 people.
group_disparity <- mutate(group_rates, disparity = search_rate / overall_rate)
sex | age_range | ethnic_group | searches | people | search_rate | disparity |
---|---|---|---|---|---|---|
male | 18-24 | Black | 5,713.0 | 52,159 | 109.5 | 17.7 |
male | 18-24 | Asian | 3,805.0 | 81,313 | 46.8 | 7.5 |
male | 25-34 | Black | 3,058.0 | 75,695 | 40.4 | 6.5 |
male | 10-17 | Black | 2,624.0 | 83,789 | 31.3 | 5.1 |
male | 18-24 | Mixed | 822.0 | 29,774 | 27.6 | 4.5 |
male | 18-24 | White | 5,355.0 | 199,372 | 26.9 | 4.3 |
male | 10-17 | White | 2,737.0 | 189,647 | 14.4 | 2.3 |
male | 25-34 | Asian | 2,221.0 | 174,204 | 12.7 | 2.1 |
male | 10-17 | Asian | 987.0 | 93,309 | 10.6 | 1.7 |
male | 25-34 | Mixed | 450.0 | 43,712 | 10.3 | 1.7 |
male | 10-17 | Mixed | 427.0 | 44,383 | 9.6 | 1.6 |
male | 25-34 | White | 4,461.0 | 504,165 | 8.8 | 1.4 |
male | over 34 | Black | 2,085.0 | 258,994 | 8.1 | 1.3 |
female | 18-24 | Black | 252.0 | 47,977 | 5.3 | 0.8 |
male | over 34 | Mixed | 253.0 | 61,582 | 4.1 | 0.7 |
female | 18-24 | White | 724.0 | 209,483 | 3.5 | 0.6 |
male | over 34 | White | 4,579.0 | 1,435,342 | 3.2 | 0.5 |
male | over 34 | Asian | 1,124.0 | 413,450 | 2.7 | 0.4 |
female | 18-24 | Asian | 134.0 | 64,680 | 2.1 | 0.3 |
female | 25-34 | Black | 180.0 | 88,000 | 2.0 | 0.3 |
female | 10-17 | White | 300.0 | 180,885 | 1.7 | 0.3 |
female | 25-34 | White | 666.0 | 487,537 | 1.4 | 0.2 |
female | over 34 | White | 859.0 | 1,427,660 | 0.6 | 0.1 |
female | over 34 | Black | 186.0 | 334,834 | 0.6 | 0.1 |
These disparity ratios are slightly different from those in last week’s report, because the data used here are only for the Met, not the data for all three London police forces used in the report. Nevertheless they are similar overall, and in particular the ordering of the groups from highest to lowest is almost identical.
Calculating age- and sex-standardised disparity ratios
For some purposes, comparing specific age/sex/ethnicity groups to the population as a whole may not be of interest. For example, we might be interested in comparing people in non-white ethnic groups to white people of the same age and sex. To do this, we have to arrange our data frame into a different format so we can divide the search rate for each group by the corresponding rate for white people.
standardised_disparity <- group_disparity %>%
# remove columns we don't need, which makes pivot_wider() work more easily
select(-searches, -people, -disparity) %>%
# make our data 'wider', so we can divide one column by another
# for an explanation of long vs wide data, see
# https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/
pivot_wider(names_from = ethnic_group, values_from = search_rate) %>%
# divide each search rate by the corresponding search rate for white people
mutate(
across(where(is.numeric), ~ . / White, .names = "{.col}_disparity")
) %>%
# convert the data back to 'long' format
pivot_longer(where(is.numeric), values_to = "value") %>%
# further convert the data to a semi-wide format, so that each row represents
# one demographic group, as in the original data
separate(name, into = c("ethnic_group", "cat"), fill = "right") %>%
replace_na(list(cat = "search_rate")) %>%
pivot_wider(names_from = "cat")
We can visualise these standardised disparities.
ggplot(
standardised_disparity,
aes(
x = ethnic_group,
# show age groups in natural order
y = fct_rev(fct_relevel(age_range, "10-17", "18-24", "25-34", "over 34")),
# set text colour so it is visible over the background fill
colour = disparity > diff(range(disparity, na.rm = TRUE)) / 2,
fill = replace_na(disparity, 0),
# format number for printing
label = replace_na(number(disparity, accuracy = 0.01), "–")
)
) +
geom_raster() +
geom_text(na.rm = TRUE) +
scale_x_discrete(position = "top") +
scale_colour_manual(values = c(`TRUE` = "grey90", `FALSE` = "grey10")) +
scale_fill_gradient(low = "white", high = "darkblue") +
facet_grid(cols = vars(sex)) +
labs(
title = "Standardised stop-and-search disparity ratios",
subtitle = "Rate of search rates for each group relative to white people of the same age and sex",
x = NULL,
y = NULL
) +
theme_minimal() +
theme(
axis.text = element_text(size = 10),
legend.position = "none",
panel.grid = element_blank(),
plot.title.position = "plot",
strip.placement = "outside",
strip.text = element_text(face = "bold", size = 10)
)
Comparing every group to every other group
I’ve had a request to calculate disparity ratios for every group in comparison to every other group. I’m not sure this is necessarily valuable, but I’m including the following code in case it is useful to anyone else. The results are saved in an Excel workbook.
simplified_data <- group_rates %>%
mutate(
# convert categories so they make sense in a sentence
sex = case_when(
age_range %in% c("under 10", "10-17") & sex == "female" ~ "girls",
age_range %in% c("under 10", "10-17") ~ "boys",
sex == "female" ~ "women",
TRUE ~ "men"
),
# convert categories into a description
group = str_glue("{ethnic_group} {sex} aged {age_range}")
) %>%
# remove all the columns we don't need
select(group, search_rate) %>%
# convert the data frame to a named vector
deframe()
every_comparison <- simplified_data %>%
# multiply each value by every other value
map_dfr(~ . / simplified_data) %>%
# add the original group names back
mutate(group = names(simplified_data)) %>%
# move the original group names to be the first column
select(group, everything())
# to create an Excel workbook with multiple sheets, we need to create a list
# of data frames, with each element in the list corresponding to a worksheet
list(
# the first worksheet contains a table of disparity ratios
"comparisons_table" = rename(
# put a description of the table in the first row of the first column
every_comparison,
`Group below is X times more likely to be searched than group to right` =
group
),
# the second worksheet contains those ratios expressed in sentences, since
# tables of the type in the first workbook can sometimes be confusing
"comparisons_descriptions" = every_comparison %>%
# convert the data to 'long' format
pivot_longer(., -group, names_to = "group2", values_to = "disparity") %>%
# remove rows that compare groups against themselves
filter(group != group2) %>%
# write a statement comparing each pair of groups
mutate(
disp1 = ifelse(disparity > 1, disparity, 1/disparity),
disp2 = ifelse(disparity > 1, "more", "less"),
desc = str_glue("{group} are {number(disp1, accuracy = 0.1)} times ",
"{disp2} likely to be searched than {group2}")
) %>%
# remove all variables except the descriptive text, and remove the column
# name so the first row is blank
select(` ` = desc)
) %>%
openxlsx::write.xlsx("disparity_ratios.xlsx")
Conclusion
Disparity ratios are a common way of looking at how stop and search affects different groups in society. They are by no means perfect, since different people (and groups) might experience the same treatment in different ways for all sorts of reasons. Disparity ratios also do not explain why different groups are searched at different rates. I discuss these issues in more detail in the report, which you can download to read for free.
Adding in data from British Transport Police would mean we needed to remove searches conducted by BTP outside London, because they cover the whole of Great Britain. This has to be done using a spatial join, which would make this tutorial quite a lot longer.↩︎