Synthesizing Data on Agents and Their Associations: A Simulation and Graph Theoretic Approach
Data on the entire population is almost never publicly available. Moreover, there is an alarming trend of discontinuing the full census in many countries (Belgium, Switzerland, etc.). In this context, population synthesis techniques have been developed for policy analysis and forecasting. Currently, synthesis is mostly treated as a fitting problem, for instance with Iterative Proportional Fitting (IPF) and Combinatorial Optimization based techniques. The key shortcomings of fitting-based procedures include: a) synthesis of only one weighting scheme, although many solutions can exist; b) loss of heterogeneity that may not have been captured in the microdata, because the population is cloned rather than truly synthesized; c) over-reliance on the accuracy of the data to determine the cloning weights; d) poor scalability and convergence as the number of attributes of the synthesized agents increases. To overcome these shortcomings, we propose a Markov Chain Monte Carlo (MCMC) simulation based approach. Partial views of the joint distribution of the agents' attributes, which are available from various data sources, can be used to simulate draws from the original distribution. The problem of associating different types of agents (persons and households) is then treated as a maximum weight problem on a bipartite graph. The real population from the Swiss census is used to compare the performance of simulation based synthesis with the standard IPF. The standardized root mean square error (SRMSE) statistics indicate that even the worst case simulation based synthesis (SRMSE=0.35) outperformed the best case IPF synthesis (SRMSE=0.64).
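The two steps described in the abstract can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes that the "partial views" are full conditional distributions of each attribute given the other, and it uses a Gibbs sampler, which is one common MCMC scheme. The association step is read here as a maximum-weight assignment of persons to households and is solved by brute force, which is only practical for tiny examples. All attribute names, probabilities, weights, and function names (`draw`, `gibbs_synthesize`, `best_assignment`) are hypothetical.

```python
import random
from itertools import permutations

# Hypothetical conditionals for two person attributes. The numbers are
# made up for illustration and chosen to be mutually consistent.
P_AGE_GIVEN_INCOME = {
    "low": {"young": 0.7, "old": 0.3},
    "high": {"young": 0.2, "old": 0.8},
}
P_INCOME_GIVEN_AGE = {
    "young": {"low": 0.8, "high": 0.2},
    "old": {"low": 0.3, "high": 0.7},
}


def draw(dist, rng):
    """Draw one category from a {category: probability} mapping."""
    categories = list(dist)
    weights = [dist[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def gibbs_synthesize(n_agents, burn_in=200, thin=5, seed=0):
    """Simulate (age, income) agents by alternating draws from the conditionals."""
    rng = random.Random(seed)
    age, income = "young", "low"  # arbitrary starting state
    agents = []
    step = 0
    while len(agents) < n_agents:
        # One Gibbs sweep: update each attribute given the current other one.
        age = draw(P_AGE_GIVEN_INCOME[income], rng)
        income = draw(P_INCOME_GIVEN_AGE[age], rng)
        step += 1
        # Discard the burn-in and keep every `thin`-th state afterwards.
        if step > burn_in and step % thin == 0:
            agents.append((age, income))
    return agents


def best_assignment(weights):
    """Return the person-to-household assignment with maximum total weight.

    weights[i][j] is a hypothetical compatibility score for placing
    person i in household j. The matrix must be square. Every permutation
    is enumerated, so this is for illustration only.
    """
    n = len(weights)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(weights[i][perm[i]] for i in range(n)),
    )
    return list(best)


if __name__ == "__main__":
    population = gibbs_synthesize(1000)
    print(population[:5])
    scores = [[3, 1, 2], [2, 4, 6], [5, 2, 1]]
    print(best_assignment(scores))
```

For realistic population sizes the brute-force search would be replaced by a polynomial-time assignment algorithm such as the Hungarian method.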
Record created on 2014-01-20, modified on 2017-02-16