Data Science at HCP

For HCP, “data science” or “big data” refers to data involving populations. We have very large amounts of information collected on patients during routine contacts with the medical system. In addition to collecting, storing, and analyzing large, diverse, and often national population-based data sets individually, HCP also links multiple data sources, resulting in clinically rich, detailed, and accurate information on subsets of patients.

Our approach to population-based big data science is based on seven principles:

1. To involve clinical and scientific leaders in key health care disciplines
2. To involve statistical leaders who develop and use solutions to foundational statistical problems in big data based on real world applications
3. To develop, retain, and retrain as appropriate a cohort of statistical programmers with expertise in the analysis of big data
4. To make the new analytic approaches we have developed available through public-use mechanisms, facilitating reproducibility of scientific findings
5. To invest in and maintain an IT environment that efficiently and reliably stores and analyzes big data
6. To maintain patient confidentiality and privacy where appropriate and develop safeguards at the personnel, technology, and software levels to ensure confidentiality
7. To use multiple and comprehensive approaches to undertake timely and accurate analyses, improving patient care

Data and Data Science at HCP

HCP maintains one of the most comprehensive data archives of population-based health care information for research purposes held within Harvard Medical School and Harvard University. We currently house: administrative billing claims at the national, state, and private payer levels; international, national, and regional survey data; clinical data at the procedure level in clinical registries; and linked billing and electronic health record data. We link and will continue to link patients’ experiences over time across multiple settings.  

HCP Data Science Expertise

In addition to our substantial clinical and social science faculty, HCP also has five full-time PhD-trained statisticians/biostatisticians. Their research portfolios cover the strategic methodological areas necessary for drawing scientifically valid conclusions from large, complex, routinely collected databases. Among these, three dimensions are of particular relevance for big data: causal inference, predictive inference, and multilevel inference. These areas are critical for discovery: running fast algorithms for data visualization and information retrieval is important, but inferring cause and effect, identifying what treatments work for whom, and determining new areas of investigation require careful, specialized expertise.

Our statistical faculty is at the forefront of many of these larger methodological endeavors. Since 2012, Sharon-Lise Normand has served on the National Research Council’s Committee on Applied and Theoretical Statistics, where the focus has been on fostering methods for massive data and on the reproducibility of scientific results from such data. She is also on the University-wide committee to study and expand “big data” at Harvard. Sherri Rose serves on a high-dimensional data subgroup of an international initiative to strengthen analytical thinking with observational data. Alan Zaslavsky has also been working with the NRC and the Committee on National Statistics on priority issues for surveying the national population; to this end, he has been instrumental in working with the Centers for Medicare and Medicaid Services and incorporating patients’ surveys into the evaluation of health plans.

Currently, HCP houses 62 biostatisticians, statisticians, programmers, analysts, data managers, research assistants, and postdoctoral fellows. Their combined experience spans a number of different operating platforms and software packages, giving the team the breadth to analyze and use data at every level.