Distributional analysis of road charging scenarios

Department of Transport and Main Roads (TMR)

2019 - 2020

Modeller

The Distributional Analysis of Road Charges (DARC) tool consists of a set of databases and Python scripts (generally running in a Jupyter notebook) that allow a range of road charging scenarios to be analysed and presented. The process is innovative in that it combines detailed observed data on vehicle registrations with synthesised households (developed from Census microdata) and modelled travel behaviour (using the TPACS model and Household Travel Survey data).

The process starts with a complete (anonymised) extract of vehicle registration data, which is filtered to remove spurious data and any commercial registrations. The location of each household's registered vehicle is then geocoded to allow it to be joined with other data sets. The filtered and geocoded registration data allows a number of charging issues to be considered, such as vehicle registration costs and luxury car taxes. To analyse the distributional effects, however, the registration data needs to be combined with demographic variables.

The ABS Census gives a wide range of demographic data at an aggregate level, but the distributional analysis is facilitated by synthesising demographic data at the level of individual households. In order to maintain the confidentiality of the Census, the ABS does not provide household data at the level of detail available for the vehicle registration data. It does, however, publish a small sample (5%) of full household data, with each household's location specified at such a coarse level (geographic areas containing at least 250,000 persons) that identification of individual records would be impossible. This Census microdata gives a pool of sample households in each area that are assumed to be representative of the population. By combining this data with full population totals of key variables within a more detailed geographical area, it is possible to produce a full synthetic set of household records. This is done using a process known as Population Synthesis.

Population Synthesis can be thought of as a process whereby a small sample of households is factored up (with an individual weight for each sample household) so that, in aggregate, the weighted sample matches a set of specified totals. For example, the synthesis may be constrained to match the number of households with 1, 2, 3, 4, 5+ persons and 0, 1, 2, 3, 4+ vehicles. A naïve approach would simply duplicate each household 20 times for a 5% sample, but the totals of the synthesised households would then most likely not match the published aggregate data. Perhaps there would be too many households with 3 cars and not enough with 0 cars. The weights could then be modified so that microdata records with 3 cars are included less often and those with 0 cars more often. Various techniques can be used to adjust the weights, but the basic approach is to adjust them iteratively until the aggregates match as closely as possible while minimising the variation in weights.

DARC uses an open source population synthesiser called SynthPop, developed as part of the Urban Data Science Toolkit by UrbanSim. The synthesis is constrained by three household variables (number of persons, number of vehicles and household income quintile) and two person-level variables (broad occupation and student type). It should be noted that although the population synthesis is constrained using only these variables, the synthesised households contain the full set of variables included in the microdata. Analysis using these other variables rests on the implicit assumption that they will be reasonably represented in the expanded set, even though no specific adjustments are made to ensure this.
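The reweighting idea can be illustrated with a minimal sketch of iterative proportional fitting (raking), one common technique for this kind of adjustment. It is not a description of SynthPop's internals. The sample households, category labels, target totals and function names below are all hypothetical, and only two household-level constraints are used.

```python
from collections import defaultdict

# Hypothetical microdata sample: (household id, persons category, vehicles category).
sample = [
    ("h1", "1", "0"),
    ("h2", "1", "1"),
    ("h3", "2", "1"),
    ("h4", "2", "2+"),
    ("h5", "3+", "1"),
    ("h6", "3+", "2+"),
]

# Hypothetical published totals for one small area (both margins sum to 120).
targets = {
    "persons": {"1": 30.0, "2": 50.0, "3+": 40.0},
    "vehicles": {"0": 10.0, "1": 65.0, "2+": 45.0},
}


def rake(sample, targets, start_weight=20.0, iterations=100):
    """Scale household weights until the weighted sample matches each margin."""
    # Naive starting point: every record in a 5% sample stands for 20 households.
    weights = {hh: start_weight for hh, _, _ in sample}
    category = {
        "persons": {hh: p for hh, p, _ in sample},
        "vehicles": {hh: v for hh, _, v in sample},
    }
    for _ in range(iterations):
        for var, target in targets.items():
            # Current weighted total in each category of this variable.
            totals = defaultdict(float)
            for hh, w in weights.items():
                totals[category[var][hh]] += w
            # Scale every household so the category totals hit the targets.
            for hh in weights:
                cat = category[var][hh]
                weights[hh] *= target[cat] / totals[cat]
    return weights


def margin(weights, sample, index):
    """Weighted total per category (index 1 = persons, index 2 = vehicles)."""
    totals = defaultdict(float)
    for record in sample:
        totals[record[index]] += weights[record[0]]
    return dict(totals)


weights = rake(sample, targets)
print(margin(weights, sample, 1))
print(margin(weights, sample, 2))
```

Each pass rescales the weights against one constraint at a time, so records in over-represented categories are weighted down and those in under-represented categories are weighted up. A production synthesiser also has to handle person-level constraints and convert fractional weights into whole households, which this sketch omits.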

The population synthesiser produces a full list of synthetic persons and households in each Census Statistical Area Level 1 (SA1) across Queensland. The next step is to combine these synthetic households with the actual households observed in the registration data. Since the registration data provides little detail on the household other than the number of registered vehicles and the household's location, those are the variables used to join the two data sets. Within each SA1, each synthesised household is allocated to an actual household with a matching number of vehicles. Mismatches in total records between the two data sets are managed in a consistent way, with some registration data randomly duplicated or deleted to ensure that the aggregate totals match the full information from the Census.
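The join described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than DARC's actual code: the field names (`sa1`, `vehicles`, `syn_id`, `reg_id`), the function name and the specific duplicate-or-drop rule are all hypothetical. Synthetic households with no vehicles have no counterpart in the registration data, so the sketch leaves them unmatched.

```python
import random
from collections import defaultdict


def match_households(synthetic, registered, seed=0):
    """Pair synthetic households with registered households that share the
    same SA1 and vehicle count, duplicating or dropping registered records
    at random when the two counts differ."""
    rng = random.Random(seed)

    # Group registered households by (SA1, number of vehicles).
    pools = defaultdict(list)
    for reg in registered:
        pools[(reg["sa1"], reg["vehicles"])].append(reg["reg_id"])

    # Group synthetic households the same way.
    wanted = defaultdict(list)
    for syn in synthetic:
        wanted[(syn["sa1"], syn["vehicles"])].append(syn["syn_id"])

    matches = {}
    for key, syn_ids in wanted.items():
        pool = list(pools.get(key, []))
        if not pool:
            # Nothing to match against (for example zero-vehicle households).
            for syn_id in syn_ids:
                matches[syn_id] = None
            continue
        rng.shuffle(pool)
        # Too few registered records: duplicate randomly chosen ones.
        while len(pool) < len(syn_ids):
            pool.append(rng.choice(pool))
        # Too many registered records: zip() drops the surplus.
        for syn_id, reg_id in zip(syn_ids, pool):
            matches[syn_id] = reg_id
    return matches


synthetic = [
    {"syn_id": "s1", "sa1": "A", "vehicles": 1},
    {"syn_id": "s2", "sa1": "A", "vehicles": 1},
    {"syn_id": "s3", "sa1": "A", "vehicles": 2},
    {"syn_id": "s4", "sa1": "A", "vehicles": 0},
]
registered = [
    {"reg_id": "r1", "sa1": "A", "vehicles": 1},
    {"reg_id": "r2", "sa1": "A", "vehicles": 2},
    {"reg_id": "r3", "sa1": "A", "vehicles": 2},
]
matches = match_households(synthetic, registered)
```

Here the single one-vehicle registration is reused for both one-vehicle synthetic households, and one of the two two-vehicle registrations is dropped, so the totals follow the Census-based synthesis.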

At this point the database includes a full set of records across Queensland, with information on key demographic variables and detailed vehicle registration data, including the make, model, and year of manufacture of each vehicle. The vehicle information is then expanded to include estimates of fuel efficiency, using the Green Vehicle Guide (for all vehicles manufactured from 2004 onwards) and the Fuel Consumption Guide (1986-2003). However, none of this data gives any information on how much each vehicle is actually driven, which is a key requirement for calculating the fuel excise. Although some data is available from various Household Travel Surveys, these only cover selected urban areas. Completing the next step for the whole of Queensland requires a travel demand model.

TransPosition had already developed an Agent Cloud Simulation model (TPACS) for Queensland, so it has been used to provide estimates of person-kilometres travelled by car for a range of person types at each location in Queensland. This is possible because TPACS considers all travel as point-to-point, unlike traditional models that aggregate to Traffic Analysis Zones (TAZ). The person-based travel estimates from TPACS can be joined with the person records from the population synthesis at each location to produce estimates of total travel in each synthetic household.
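As a rough sketch of this join, suppose the model output has been reduced to a lookup of daily car person-kilometres by location and person type. The lookup structure, the keys, the person types and all the numbers below are hypothetical and are not TPACS's actual output format.

```python
# Hypothetical model output: car person-km per day by (location, person type).
car_km_per_day = {
    ("loc_001", "worker"): 32.0,
    ("loc_001", "student"): 6.5,
    ("loc_002", "worker"): 48.0,
}

# Hypothetical synthetic person records from the population synthesis.
persons = [
    {"household": "hh1", "location": "loc_001", "person_type": "worker"},
    {"household": "hh1", "location": "loc_001", "person_type": "student"},
    {"household": "hh2", "location": "loc_002", "person_type": "worker"},
]


def household_car_km(persons, rates):
    """Sum the modelled car person-km of every person in each household."""
    totals = {}
    for person in persons:
        rate = rates[(person["location"], person["person_type"])]
        totals[person["household"]] = totals.get(person["household"], 0.0) + rate
    return totals


household_km = household_car_km(persons, car_km_per_day)
```

The result is a household-level travel total in person-kilometres. Any conversion to vehicle-kilometres (for example, allowing for passengers) is outside this sketch.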

There is one last complication: to prepare estimates of fuel use, the travel estimates need to be allocated to individual vehicles, not just to persons or households. Assuming that all travel is shared evenly between the vehicles in a household is unrealistic, as many households will have a dominant vehicle that is used more often. Neither TPACS nor the Census provides guidance on which vehicle is used for each type of trip. The best source for this information is the Household Travel Survey data, which records the individual trips made by the sampled households on a particular day. To assign the TPACS demand estimates to each vehicle, DARC uses estimates of the proportion of travel done by each vehicle in the household, with the vehicles ranked by their estimated value.
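A minimal sketch of this allocation step is given below. The share table is made up for illustration (in DARC the proportions are estimated from Household Travel Survey data), and the sketch assumes that the highest-valued vehicle is ranked first. The function and field names are hypothetical.

```python
# Hypothetical shares of household travel by value rank, keyed by the number
# of vehicles in the household. These numbers are illustrative only.
rank_shares = {
    1: [1.0],
    2: [0.65, 0.35],
    3: [0.50, 0.30, 0.20],
}


def allocate_travel(household_km, vehicles, shares=rank_shares):
    """Split a household's travel across its vehicles by value rank.

    Assumption: vehicles are ranked from highest to lowest estimated value.
    """
    ranked = sorted(vehicles, key=lambda v: v["value"], reverse=True)
    split = shares[len(ranked)]
    return {v["vehicle_id"]: household_km * s for v, s in zip(ranked, split)}


vehicles = [
    {"vehicle_id": "old_hatch", "value": 8000},
    {"vehicle_id": "new_suv", "value": 30000},
]
vehicle_km = allocate_travel(20000.0, vehicles)
```

Because each row of the share table sums to one, the per-vehicle figures add back up to the household total.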

Thus the final database contains full information on the vehicles, household demographics, fuel efficiency, and kilometres travelled by each vehicle, for every observed/synthesised household in Queensland. This data is then used to calculate the various transport charges that are to be considered.
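As an example of such a calculation, the sketch below estimates a fuel excise charge per vehicle from its kilometres travelled and fuel efficiency, then averages the charge by household income quintile. The record layout, the function names and the excise rate are placeholders chosen for illustration; the rate is not the actual excise rate.

```python
from collections import defaultdict


def fuel_excise(vkt_km, litres_per_100km, excise_per_litre):
    """Excise paid on the fuel burnt over a given distance."""
    litres = vkt_km * litres_per_100km / 100.0
    return litres * excise_per_litre


def mean_charge_by_quintile(vehicle_records, excise_per_litre):
    """Average household excise within each income quintile."""
    # Sum the charge over all vehicles in each household.
    per_household = defaultdict(float)
    quintile_of = {}
    for rec in vehicle_records:
        per_household[rec["household"]] += fuel_excise(
            rec["vkt_km"], rec["litres_per_100km"], excise_per_litre
        )
        quintile_of[rec["household"]] = rec["income_quintile"]

    # Average the household charges within each quintile.
    grouped = defaultdict(list)
    for household, charge in per_household.items():
        grouped[quintile_of[household]].append(charge)
    return {q: sum(c) / len(c) for q, c in grouped.items()}


# Hypothetical vehicle-level records from the final database.
records = [
    {"household": "hh1", "income_quintile": 1, "vkt_km": 12000.0, "litres_per_100km": 8.0},
    {"household": "hh2", "income_quintile": 1, "vkt_km": 6000.0, "litres_per_100km": 10.0},
    {"household": "hh3", "income_quintile": 5, "vkt_km": 15000.0, "litres_per_100km": 6.0},
    {"household": "hh3", "income_quintile": 5, "vkt_km": 5000.0, "litres_per_100km": 12.0},
]

PLACEHOLDER_RATE = 0.40  # dollars per litre, illustrative only
by_quintile = mean_charge_by_quintile(records, PLACEHOLDER_RATE)
```

Comparing the averages across quintiles, or dividing them by household income, gives a simple view of how a charge falls on different groups.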

Primary contact: Peter Davidson

Project team: Morgan Weston, Tom McCarthy