Reproducible data ingestion in practice

How does a reproducible data ingestion look like in practice?

If your analytical work has used a certain type of data at least two times in the past, it is likely that the data will be needed again. This is a good time to make the data ingestion automatic and reproducible. We demonstrate a simple data ingestion from Eurostat. Daniel is part of the rOpenGov collective that maintains the eurostat package for reproducible access to the European statistical agencies data warehouse.

library(eurostat)
nuts3_population_raw <- get_eurostat ("demo_r_pjangrp3") 
nuts3_population_metadata <- get_eurostat_toc() %>%
  filter ( code == "demo_r_pjangrp3" ) %>%
  distinct_all () 

last_update <- nuts3_population_metadata$`last update of data`
first_data <- nuts3_population_metadata$`data start`

The metadata entries to this statistics suggest that the latest data is from Thursday 14 May, 2020, so it is high time to refresh your data tables and update your charts. Sadly, the metadata suggest that on this occasion the table went through structural changes, so you may get a different spreadsheet table when you downloaded compared to the last occasion. You may think that this is another reason to automate the ingestion into a well-defined, tidy table format.

titlecodelast update of datalast table structure changedata startdata end
Population on 1 January by age group, sex and NUTS 3 regiondemo_r_pjangrp314.05.202014.05.202020142019

Looking at the head of this raw data, you realize that this is not table, but a little database! It contains population divisions by sex and age groups. However, your indicator is the total population of each NUTS3 regions, so let’s filter out only the total population for both sexes.

sexunitagegeotimevalues
TNRY_LT5NO072018-01-0125108
MNRY10-14DED442014-01-014620
FNRY45-49FRB2014-01-0188590

Looking at the head of this raw data, you realize that this is not table, but a little database! It contains population divisions by sex and age groups. However, your indicator is the total population of each NUTS3 regions, so let’s filter out only the total population for both sexes.

nuts3_population <- nuts3_population_raw %>%
  filter ( sex == "T",             # for both sexes
           age == "TOTAL") %>%     # for all age groups
  mutate ( year = as.integer(substr(as.character(time), 1,4)), 
           indicator = paste0("population_sex-", sex, "_unit-", unit, 
                               "_age-", age)) %>%
  select ( -all_of(c("sex", "unit", "age"))) 

The following table is already tidy — each observation corresponds with a small region of Europe. The filtering information as metadata is included in the indicator’s name.

geotimevaluesyearindicator
HU2222015-01-012539972015population_sex-T_unit-NR_age-TOTAL
NO2014-01-0151079702014population_sex-T_unit-NR_age-TOTAL
ES7042016-01-011120872016population_sex-T_unit-NR_age-TOTAL
FRK122015-01-011462192015population_sex-T_unit-NR_age-TOTAL
DE32018-01-0136134952018population_sex-T_unit-NR_age-TOTAL
NL2252019-01-014006852019population_sex-T_unit-NR_age-TOTAL

The population size is rather homogeneous — no wonder, the small regions of Europe are exactly defined to be roughly of equal size and easily comparable.

You are wondering how you could match this data with your own records, or data acquired from Google about number of restaurants or searchers for concerts?

geotimevaluesyearindicator
IE0422017-01-014512932017population_sex-T_unit-NR_age-TOTAL
SI0442014-01-011128482014population_sex-T_unit-NR_age-TOTAL
DE2252015-01-01779272015population_sex-T_unit-NR_age-TOTAL
PL6212016-01-015230962016population_sex-T_unit-NR_age-TOTAL
TRB132018-01-012733542018population_sex-T_unit-NR_age-TOTAL
FRK242018-01-0112613732018population_sex-T_unit-NR_age-TOTAL

There is no easy solution for this problem. Within the EU in the last 20 years there were several thousand internal boundary changes — states, provinces, regions are rather free to change their internal administration boundaries. Regions are constantly changing their names or their spelling, too. Sometimes they use a standard English translation of their region, sometimes only the national language variants. In Greece or Bulgaria they do not even use the Latin alphabet!

Daniel and Istvan created and released an open-source code library, regions to deal with similar issues. It allows the changes between boundary definition between 1999-2021, or the connection of the globally used geographical typology adopted by Google with Eurostat’s EU-only definitions.

For example, you can already foresee that there are future changes (to be implemented from the year 2021) that will affect two of these regions.

set.seed(21)
validate_for_2021 <- nuts3_population  %>%
  select ( -all_of(c("time", "indicator")) ) %>%
  sample_n(8) %>%
  regions::validate_nuts_regions(., nuts_year = 2021 ) 
geovaluesyeartypologyvalid_2021
DE2481156812018nuts_level_3TRUE
BG4222281412019nuts_level_3TRUE
UKN131397742014NAFALSE
FRH032765432014nuts_level_2TRUE
DEB320317802016nuts_level_2TRUE
ITG253350972014NAFALSE
DE71A2328482016nuts_level_3TRUE
FI1D5686772014nuts_level_3TRUE

There is trouble ahead, but we have a table that is tidy and works for the time being. Let’s save it and document it.

eurostat::get_bibentry("demo_r_pjangrp3")
## @Misc{demo_r_pjangrp3_14-05-2020,
##   title = {Population on 1 January by age group, sex and NUTS 3 region [demo_r_pjangrp3]},
##   url = {https://ec.europa.eu/eurostat/web/products-datasets/-/demo_r_pjangrp3},
##   language = {en},
##   year = {14.05.2020},
##   publisher = {Eurostat},
##   author = {{Eurostat}},
##   urldate = {2020-07-25},
## }

What is reproducibility in this case?

Replicability means that somebody else in your team can get exactly the same data as in this post.

  • Recording exactly what have your analysts downloaded, including source, date, filtering settings

Reproducibility means that not only your team, but an external auditor or a peer reviewer can easily validate that your data is sound. All our solutions are opens source, and use the most popular statistical programming language, R. Anybody with a basic knowledge of R, or probably even SQL (we use an extension of R that uses SQL-like syntax) can read if everything goes allright. Your team can even modify the code.

  • Creating a documentation that you updates your reports data citations

Confirmability means that somebody who is not using . All our solutions are opens source, and use the most popular statistical programming language, R. R is supported by Microsofts BI products, SPSS and other statistical software, too. We are not a software company, there is no vendor lock-in, and our work can be confirmed with other statistical software.

  • Data acquired via our data gathering apps is automatically documented, follows tidy formats and valid abbreviations, and can be confirmed in a wide range of statistical and spreadsheet applications, including SPSS, Excel, OpenOffice, Google Spreadsheets.
  • Bringing the data to a tidy format which is easy to join with other data in a spreadsheet application, or stored in a relational database.
  • Validating the data structure, in this case, with special attention to regional boundaries and regional names.

Auditability means that external auditors can validate that the data your team uses from our app is correct.

  • We are following methodological guidelines and definitions set by international bodies such as Eurostat’s ESSNet statistical network or IFRS.
  • Our critical software code is not only open source, but goes through various levels of statistical peer-review. We are sending for scientific validation and peer-review our methodology, too.
  • Being open source allows us to make any corrections, updates, should any problem or standard change affect the data.

References

Antal, Daniel. 2020. Regions: Processing Regional Statistics. https://regions.danielantal.eu/.

Eurostat. n.d. Eurostat. Accessed July 25, 2020. https://ec.europa.eu/eurostat/web/products-datasets/-/demo_r_pjangrp3.

Lahti, Leo, Janne Huovari, Markus Kainu, and Przemyslaw Biecek. 2020. Eurostat: Tools for Eurostat Open Data. https://CRAN.R-project.org/package=eurostat.

Hadley Wickham. 2014. Tidy Data.Journal of Statistical Software vol 59, no 10.https://www.jstatsoft.org/v059/i10.

Related