The Value of Open-Source

Leveraging free, code-first tools to iterate toward advanced analytics

Alex Zajichek

Research Data Scientist, Cleveland Clinic

February 27, 2025

A Little Background


Who is QHS?

  • Department of 120+ biostatisticians, data scientists, programmers, and others who collaborate on and provide quantitative support for research activities at Cleveland Clinic


  • From clinical trials and study design to precision medicine, population health, AI in medicine, and more, across many disease areas


  • My area focuses on clinical prediction modeling and observational statistical analysis, primarily using EHR and/or registry data

What is AI, Anyway?

Two Sides of a Coin


AI as Tools

  • Pre-built products like ChatGPT, Gemini, etc.
  • Used (purchased) and adapted to fit our tasks
  • Perceived as productivity tools
  • Dominates the conversation

AI as Data Science

  • Data infrastructure, analytical thinking, statistical reasoning, tools for facilitation
  • How we use data to help inform decision making
  • The building blocks of AI itself

Blurred Lines



  • We tend to focus on the first and bypass the second, conflating the two views


  • Leads to ambiguity and confusion

Figure: The confusing nature of AI-related fields

The Horse Before the Cart

Current State

  • Gartner (2018): 87% of businesses have low analytics maturity [1]

Figure: Gartner Analytics Maturity Model

Back to Basics

  • Are AI tools really the solution?
    • “AI for the sake of AI is a losing proposition” [2]. Be intentional!
  • Are you capturing what’s important? (Data infrastructure)
  • How are you using your data? (Reporting/analytical thinking)
  • What are your limitations? (Tools, skills, time, etc.)

The Case for Open-Source

What Is Open-Source?

Background

  • In essence, free and open software (Wikipedia)
  • Think of it as community-built
  • Like a workshop for your raw materials

Benefits in Data Science

  • Low-cost iteration and experimentation (anyone can do it)
  • Code-first approach gives flexibility and control (art + science)
  • Use it to facilitate an analytical approach
    • Building blocks to AI


The R Programming Language

Background

  • A free and open-source functional programming language
  • Developed in the 1990s for statistical computing, but has long since expanded to much broader usage
  • Commonly used in the RStudio IDE
  • Advanced through packages developed by the community (Package list)
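As a rough illustration of the "functional" flavor mentioned above, here is a minimal sketch in base R with no extra packages. It uses the built-in `trees` dataset; the `standardize` helper and the object names are hypothetical, chosen only for this example.

```r
# Built-in dataset shipped with R: 31 trees with Girth, Height, Volume
my_data <- trees

# Functions are ordinary values, so they can be passed to other functions
col_means <- sapply(my_data, mean)
col_sds <- sapply(my_data, sd)

# Hypothetical helper: center and scale a numeric vector
standardize <- function(x) (x - mean(x)) / sd(x)

# Apply the helper to every column and keep the result as a data frame
scaled <- as.data.frame(lapply(my_data, standardize))

# Collect the column summaries in one table
summary_table <- data.frame(mean = col_means, sd = col_sds)
print(round(summary_table, 2))
```

Swapping `mean` for `median`, or `standardize` for another function, changes the analysis without changing its structure, which is much of the appeal of a code-first approach.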

Example

library(plotly) # Load package
my_data <- trees # Assign object (dataset)
plot_ly(
  data = my_data,
  x = ~Height,
  y = ~Girth,
  size = ~Volume,
  color = ~Volume,
  text = ~paste0("Height: ", Height, "<br>Girth: ", Girth, "<br>Volume: ", Volume),
  height = 300,
  width = 500
)
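Low-cost iteration can be sketched with the same `trees` dataset: fit one simple model, try an alternative, and compare. This is only an illustration using base R; the two model formulas, the `rmse` helper, and the choice of in-sample error as the yardstick are assumptions made for the example, not a recommended workflow.

```r
# Same built-in dataset as the plot above
my_data <- trees

# First attempt: volume as a linear function of girth and height
fit_linear <- lm(Volume ~ Girth + Height, data = my_data)

# Second attempt: the same predictors on the log scale
fit_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = my_data)

# Hypothetical helper: root mean squared error
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Compare both fits on the original Volume scale (in-sample only)
rmse_linear <- rmse(my_data$Volume, fitted(fit_linear))
rmse_log <- rmse(my_data$Volume, exp(fitted(fit_log)))

print(c(linear = rmse_linear, log_scale = rmse_log))
```

Each iteration costs a few lines of code and no license fees. A real analysis would judge the models on held-out data rather than in-sample error.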

Map example

# Load packages
library(tidyverse)
library(tidycensus)
library(mapgl)

# Import WI tracts
wi_tracts <- 
  arcgislayers::arc_read(
    url = "https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_ACS2023/Tracts_Blocks/MapServer/4", 
    where = "STATE = '55'"
  )

# Extract median income by tract
dat <- 
  get_acs(
    geography = "tract",
    variables = "B19013_001", # Median household income
    state = "WI",
    year = 2022,
    progress_bar = FALSE
  ) |>
  
  # Join to get boundaries
  inner_join(
    y = wi_tracts |> select(GEOID, geometry),
    by = "GEOID"
  ) |>
  
  # Make an information column
  mutate(
    Info = paste0(str_remove(NAME, ";.+$"), "<br>Median Income ($): ", round(estimate))
  ) |>
  
  # Convert to spatial data frame
  sf::st_as_sf()

# Make the map
maplibre() |>
  
  # Focus the mapping area
  fit_bounds(dat) |>
  
  # Fill with the data values
  add_fill_layer(
    id = "mc_acs",
    source = dat,
    fill_outline_color = "black",
    fill_color = 
      interpolate(
        column = "estimate",
        values = range(dat$estimate, na.rm = TRUE),
        stops = c("#f2d37c", "#08519c"),
        na_color = "gray"
      ),
    fill_opacity = 0.50,
    popup = "Info"
  ) |>
  add_legend(
    legend_title = "Median income ($)",
    values = range(dat$estimate, na.rm = TRUE),
    colors = c("#f2d37c", "#08519c")
  )