Use Auxiliary Information to Improve Statistical Inference

Auxiliary information includes any variables that are collected but are not the main interest of the analysis, e.g., a proxy for an expensive variable, instrumental variables, and negative controls. This information is often abundant in third-party data sources, such as remote sensing images, electronic health records, census data, and the baseline survey of a cohort study. However, researchers often ignore this large body of relevant information, missing the opportunity to improve the quality of a study at almost no cost. My interest is in developing tools that help users harness the full potential of this easy-to-obtain and often free information.

Currently, I am developing methods that leverage auxiliary information to adjust for unmeasured confounding, with a focus on continuous exposures and time series data.

Adjusting for unmeasured confounding bias

Jie Hu, Eric Tchetgen Tchetgen, Francesca Dominici
“Leveraging Auxiliary Information to Adjust for Unmeasured Confounding in Time Series Study Designs”
[Nature Reviews Methods Primers]
[Exercise code in slides 11 and 17]

Hu, J.K., Tchetgen Tchetgen, E.J.
“Causal Inference with Time Series Data and Unmeasured Confounding”
Causal Inference for Time Series Data Workshop @ 39th Conference on Uncertainty in Artificial Intelligence (2023)

Hu, J. K., Zorzetto, D., & Dominici, F.
“A Bayesian Nonparametric Method to Adjust for Unmeasured Confounding with Negative Controls”
[Code]

Enhancing inference precision

for case-cohort studies

Jie Hu, Norman E. Breslow, Gary Chan, David Couper
“Estimating Disease Hazard Differences from Case-Cohort Studies”
European Journal of Epidemiology, Jun, 1-14 (2021).
This article includes methods and software for improving inference precision by leveraging auxiliary variables. [Slides]

for case-control studies
Norman Breslow and Jie Hu.
“Survival Analysis of Case-Control Data: A Sample Survey Approach”
Handbook of Statistical Methods for Case-Control Studies, Chapman and Hall/CRC
Please email me if you don’t have access.

for general two-phase sampling studies
Jie Hu
“A Z-estimation system for two-phase sampling with applications to additive hazards models and epidemiologic studies”
PhD Diss., University of Washington ResearchWorks Archive (2014).
Chapters 4, 5, 6 include methods and results for improving inference and prediction precision in semiparametric models by leveraging auxiliary variables.
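
To give a rough sense of why auxiliary variables can improve precision, here is a minimal simulation sketch. It is not the method from any of the works listed above: it uses a textbook regression (difference) estimator of a mean under simple random two-phase sampling, and the data-generating model, sample sizes, and function names (`simulate_once`, `compare`) are all assumptions made for illustration.

```python
import random
import statistics


def simulate_once(rng, n=1000, phase2_frac=0.1):
    """One hypothetical two-phase study; returns (naive, adjusted) mean estimates."""
    # Phase 1: a cheap auxiliary variable x is observed for the whole cohort.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The expensive variable y is correlated with x (assumed linear model, true mean 2.0).
    y = [2.0 + 1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]

    # Phase 2: y is measured only on a simple random subsample.
    m = int(n * phase2_frac)
    idx = rng.sample(range(n), m)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]

    # Estimator A: phase-2 sample mean, ignoring the auxiliary variable.
    ybar_s = statistics.fmean(ys)

    # Estimator B: regression estimator. Shift the phase-2 mean by the fitted
    # slope times the gap between the full-cohort and phase-2 means of x.
    xbar_s = statistics.fmean(xs)
    sxy = sum((a - xbar_s) * (b - ybar_s) for a, b in zip(xs, ys))
    sxx = sum((a - xbar_s) ** 2 for a in xs)
    slope = sxy / sxx
    adjusted = ybar_s + slope * (statistics.fmean(x) - xbar_s)
    return ybar_s, adjusted


def compare(reps=300, seed=0):
    """Monte Carlo means and variances of both estimators over repeated studies."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(reps)]
    naive = [r[0] for r in results]
    adjusted = [r[1] for r in results]
    return {
        "naive_mean": statistics.fmean(naive),
        "adjusted_mean": statistics.fmean(adjusted),
        "naive_var": statistics.pvariance(naive),
        "adjusted_var": statistics.pvariance(adjusted),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

Under these assumptions both estimators center near the true mean, while the adjusted estimator varies less across repeated studies because it borrows information from the auxiliary variable observed on the full cohort. The size of the gain depends on how strongly the auxiliary variable is correlated with the variable of interest.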

Software

Jie Hu
“Fit Additive Hazards Models for Survival Analysis”
CRAN - R Package addhazard (2020)
[GitHub] [user’s manual]

Tutorials

Analysis of a National Wilms Tumor Study dataset
Dataset nwts2ph is in R Package addhazard

Analysis of Breast Cancer dataset
Dataset hosted by Department of Mathematics, University of Oslo

Analysis of an Atherosclerosis Risk in Communities Study (ARIC) dataset
hosted by European Journal of Epidemiology, Jun, 1-14 (2021).
[scientific questions]

Improving sampling designs

Hu, J, Jerkins, J, Goebel, N.
Routing Method for Mobile Monitoring Platforms: a scalable sampling method that dispatches a fleet of vehicles to collect environmental data without bias (U.S. Application Serial No. 17/332789)
This patent proposes using nearby air quality monitoring stations to determine the sampling times and weights for measuring hyperlocal air quality in each neighborhood with mobile sensing platforms.

Hu, J & Ladoni, M. (2021)
Location Selection for Treatment Sampling: A Field Study Design Tool to Optimize Treatment Assignment and Soil Sampling Locations for Model Calibration (U.S. Patent No. 10,963,606)
This patent proposes using auxiliary variables that are correlated with the core study variables in a biogeochemical model to determine sampling locations for evaluating and calibrating the model.