Building Resilient Public Health Genomic Systems with Data Lakes and Automation


Kelsey Florek, PhD, MPH
Senior Genomics and Data Scientist
Wisconsin State Laboratory of Hygiene
February 27, 2025

Slides live at:
www.k-florek.net/talks

How do you shift bioinformatics research software into a reliable public health product?

Communicable Disease Division

  • Infectious disease surveillance and diagnostic testing
  • Serves the state of Wisconsin and acts as a regional center for HAI/AR, influenza, and bioinformatics
  • Over 60 laboratory staff, including a six-member multidisciplinary bioinformatics team

Gathering Requirements for a Bioinformatics System

a combination of functional and regulatory requirements

Bioinformatics Software/Workflows

  • Developed by a researcher or student
  • Software updates slow or non-existent
  • Limited documentation
  • Limited scope or applicability
  • Suboptimal resource usage

Testing Laboratory Requirements

  • Validation/Verification
  • Consistency: algorithms and databases
  • Audit trails: data & system
  • Integration with LIS
  • QA/QC/QI
  • Data security

Establishing a system for a changing landscape

2017 - 2019
Pre-COVID-19
  • cost effective workflow runs
  • short workflow runtimes
  • dynamic workflow scaling
2019 - 2022
COVID-19
  • database capacity
    • dashboards
    • data queries
  • multiple user framework
  • accessibility for external partners
2022 - Today
Post-COVID-19
  • regulatory compliance for diagnostic testing
  • automation
  • integrations and analytics

Establishing a system for a changing landscape

  • Workflow Design
  • Infrastructure Design (Data Lake)
  • Automation and Access

Workflow Design

migrating from massive Python scripts to a Workflow Management Language

Workflow Management Languages

Nextflow

  • Channels/Processes
  • Workflow Logs/Process Logs/Trace File/Execution Report
  • Task Caching
  • Job isolation with Containers
  • Compatibility with range of HPC and Cloud platforms

WDL (Cromwell)

  • Task
  • Workflow Logs/Call Logs
  • Call Caching/Checkpoint Files
  • Job isolation with Containers
  • Compatibility with range of HPC and Cloud platforms

Snakemake

  • Rule
  • Rule Logs/Reports
  • Between-Workflow Caching
  • Job isolation with Containers
  • Limited compatibility with Cloud platforms
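
The caching features listed for each engine above share one idea: a task is re-run only when something that could change its result has changed. The sketch below illustrates that idea in plain Python. It is not how Nextflow, Cromwell, or Snakemake implement caching; `task_key` and `TaskCache` are hypothetical names introduced for illustration, and the cache lives in memory rather than on disk.

```python
import hashlib
from typing import Callable


def task_key(command: str, container: str, inputs: dict[str, bytes]) -> str:
    """Hash everything that should influence a task's result."""
    digest = hashlib.sha256()
    for part in (command, container):
        digest.update(part.encode())
        digest.update(b"\0")  # separator so adjacent fields cannot run together
    for name in sorted(inputs):  # sorted so dict ordering does not change the key
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(hashlib.sha256(inputs[name]).digest())
    return digest.hexdigest()


class TaskCache:
    """In-memory stand-in for a workflow engine's task cache (assumption)."""

    def __init__(self) -> None:
        self._results: dict[str, bytes] = {}
        self.executions = 0  # counts real (non-cached) runs

    def run(
        self,
        command: str,
        container: str,
        inputs: dict[str, bytes],
        task: Callable[[dict[str, bytes]], bytes],
    ) -> bytes:
        key = task_key(command, container, inputs)
        if key not in self._results:
            # Cache miss: the command, container, or an input changed.
            self.executions += 1
            self._results[key] = task(inputs)
        return self._results[key]
```

Because the container tag is part of the key, updating a tool version invalidates the cached result, which is the behavior a validated laboratory workflow would likely want.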

Addressing Regulatory Requirements (A CAP Perspective)

Workflows/Software -> Workflow Management Language

  • Software/Workflow Version
  • Source code versioning system
  • Unit Tests, Pos/Neg Tests, Integration Tests
  • Records of monitoring software updates
  • Data Provenance tracking through the procedure
  • Controls, Metrics, QC
  • Input Files/Output Files/Databases
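
Several of the items above (workflow version, data provenance, input files, output files, and databases) can be captured in a single record written at the end of each run. The following is a minimal sketch under stated assumptions, not the system described in this talk: the field names, `provenance_record`, and `write_record` are hypothetical, and a workflow engine's own trace and report files would normally supply much of this.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large sequence files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(
    workflow: str,
    version: str,
    inputs: list[Path],
    outputs: list[Path],
    databases: dict[str, str],
) -> dict:
    """Build an audit record for one run (field names are illustrative)."""
    return {
        "workflow": workflow,
        "workflow_version": version,
        "completed_utc": datetime.now(timezone.utc).isoformat(),
        "inputs": {p.name: file_sha256(p) for p in inputs},
        "outputs": {p.name: file_sha256(p) for p in outputs},
        "databases": dict(databases),  # database name -> version string
    }


def write_record(record: dict, destination: Path) -> None:
    """Store the record as sorted, indented JSON so that diffs stay readable."""
    destination.write_text(json.dumps(record, indent=2, sort_keys=True))
```

Storing checksums rather than file paths alone lets an auditor confirm that a reported result came from exactly the inputs and database versions on record.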

Infrastructure Design (Data Lake)

cost-effective and scalable capacity

AWS Genomics Workflow Infrastructure

Managing a growing data infrastructure

Public Data Dashboard

Query Data in the Data Lake
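
As a rough illustration of the kind of question a queryable data lake can answer, the sketch below counts samples per lineage since a given date. Everything in it is an assumption made for illustration: the `sequencing_results` table, its columns, and the toy rows are hypothetical, and an in-memory SQLite database stands in for whatever query engine sits on top of the data lake.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with made-up rows (not real surveillance data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequencing_results ("
        "sample_id TEXT PRIMARY KEY, lineage TEXT, collection_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO sequencing_results VALUES (?, ?, ?)",
        [
            ("S1", "A", "2024-01-02"),
            ("S2", "A", "2024-01-09"),
            ("S3", "B", "2024-01-10"),
            ("S4", "B", "2023-12-01"),
        ],
    )
    conn.commit()
    return conn


def lineage_counts(conn: sqlite3.Connection, since: str) -> list[tuple[str, int]]:
    """Count samples per lineage collected on or after `since` (ISO 8601 date)."""
    query = """
        SELECT lineage, COUNT(*) AS n
        FROM sequencing_results
        WHERE collection_date >= ?
        GROUP BY lineage
        ORDER BY n DESC, lineage ASC
    """
    # ISO dates stored as text compare correctly as strings.
    return conn.execute(query, (since,)).fetchall()
```

A dashboard would typically run aggregate queries like this on a schedule, rather than exposing sample-level rows.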

Addressing Regulatory Requirements (A CAP Perspective)

Infrastructure

  • Encrypted Electronic Data Transfer
  • Data Transfer Integrity Check
  • Data Backups
  • Storage practices and retention times
  • System Authentication/Activity Logs
  • Change Management Process (users or infrastructure)
  • Continuity of Operations Plan
  • Disaster Recovery Testing Records
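
The data transfer integrity check listed above can be sketched as a checksum manifest built at the source and verified at the destination. This is a minimal illustration under stated assumptions, not the implementation used in this system: `build_manifest` and `verify_transfer` are hypothetical helpers, and managed transfer services often provide their own integrity checks.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks to keep memory use flat for large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_dir: Path) -> dict[str, str]:
    """Map each file name in the source directory to its SHA-256 checksum."""
    return {
        path.name: file_sha256(path)
        for path in sorted(source_dir.iterdir())
        if path.is_file()
    }


def verify_transfer(manifest: dict[str, str], dest_dir: Path) -> dict[str, str]:
    """Report 'ok', 'mismatch', or 'missing' for every file in the manifest."""
    status = {}
    for name, expected in manifest.items():
        target = dest_dir / name
        if not target.is_file():
            status[name] = "missing"
        elif file_sha256(target) != expected:
            status[name] = "mismatch"
        else:
            status[name] = "ok"
    return status
```

Keeping the returned status alongside the manifest gives a dated record that each transfer was checked, which also supports the audit-trail items in the same list.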

Automation and Access

enhancing usability and supporting public health impact

What is the role of a bioinformatician?

Support Staff / Scientist / Engineer

  • Script/Software Development
  • Workflow Development
  • Database Management
  • Infrastructure Management
  • System Support Contact
  • Data Trend Analysis
  • Dashboard Development
  • Pathogen Subject Matter Expert
  • Genomics Subject Matter Expert
  • Research Collaborations / Publications

Automation at WSLH

COVID-19 Genomics UK (COG-UK) CLIMB-COVID

Easy Genomics Partnership

Easy Genomics - Minimum Viable Product

  • Simplify the process of launching and monitoring workflows
  • Provide the ability for users to upload sequence data through the web browser
  • Allow users to download analysis results through the web browser

The Next Steps:
leveraging genomic laboratory capacity for public health impact

  • Connecting laboratory and health department data for near real-time detection of outbreaks.
  • Enhanced Detection System for Healthcare-Associated Transmission (EDS-HAT)
    • "Combines whole-genome sequencing (WGS) surveillance and machine learning (ML) of the electronic health record (EHR) to identify undetected outbreaks and the responsible transmission routes, respectively."
    • Sundermann et al. Clin Infect Dis. 2021

The Takeaway

  • Workflow Management Languages are key to high-quality, auditable, and reproducible data workflows.
  • Cloud-based approaches can support a cost-effective infrastructure capable of scaling to meet demand.
  • A cloud-based infrastructure can simplify the process of adding complex capabilities.
  • A genomic data system requires a flexible, agile-like approach in which features are routinely identified and continuously added.

Acknowledgments

Abigail Shockey, PhD
Christopher Jossart, MPH
Dustin Lyfoung, MS
Thomas Blader
Eva Gunawan, MS

Special Thanks to

  • Chie Nakagawa
  • Chris De Vere
  • Evan Davey
  • Zig Tan
  • Marvin Umali
  • Simon Dib
  • William Holt
  • Andrew Purcell
  • Nick Rivas