The Value of Open-Source

Leveraging free, code-first tools to iterate toward advanced analytics

Alex Zajichek

Research Data Scientist, Cleveland Clinic

February 27, 2025

A Little Background

Who is QHS?

Department of 120+ biostatisticians, data scientists, programmers, etc. that collaborate on and supply quantitative support to research activities at Cleveland Clinic

From clinical trials and study design to precision medicine, population health, AI in medicine, and more, across many disease areas

My area focuses on clinical prediction modeling and observational statistical analysis, primarily using EHR and/or registry data

www.lerner.ccf.org/quantitative-health

What is AI, Anyway?

Two Sides of A Coin

The way I view it, there are two sides of the coin that are generally not differentiated
Go out to the marketplace and see what’s out there to help you accomplish something
I put “purchase” there intentionally
A productivity tool, like any other sort of tool. We don’t particularly think about how it works.
The other side is about what actually goes into building those AI tools themselves
Data infra: How you collect, store data
Thinking: How you report and utilize information and frame problems analytically to be answered with data
Reasoning: How you deduce and infer and take action based on that information
Tools: Tools that are used to facilitate these processes
Boils down to how data can help make informed decisions
Ultimately, these are building blocks that create advanced AI/data science tools on the left

AI as Tools

Pre-built products like ChatGPT, Gemini, etc.
Use (purchase) to conform to our tasks
Perceived as productivity tools
Dominates the conversation

AI as Data Science

Data infrastructure, analytical thinking, statistical reasoning, tools for facilitation
How we use data to help inform decision making
The building blocks of AI itself

Blurred Lines

What I’ve noticed in presentations, etc. is these aren’t really distinguished in the conversation. Focus on tools, talk about needing better data strategy which devolves into how we can use the existing tool for our purpose (specifically ChatGPT).
So many conflicting definitions of AI, and how it relates to data science. If you search Google you get so many differing pictures of it (no clear answer)
I intentionally chose this picture because it’s just as confusing as stating the definition of AI.
There’s so much varying overlap that it doesn’t actually tell you anything.
I’m glad regression analysis is actually in there where it is. Very old technique. Going into my first ML class (after having plenty of statistics training) excited to “allow machines to learn and improve without being explicitly programmed”. Regression was the first thing I learned about in my machine learning course. Regression is a core part of the ML field.
From that point forward I start noticing so many redunancies between these fields, and realized it’s all basically the same thing in kind, just to varying degree
So it all points to: building from a solid data strategy and infrastructure will lead toward these more advanced capabilities.

Tend to focus on the first, bypassing the second; conflating views

Leads to ambiguity and confusion

Figure: The confusing nature of AI-related fields

The Horse Before the Cart

So can we take a step back and strategically think about what we can do?
Before spending tons of money on black box tools in hopes they work…
Where does the value lie?
This is now a somewhat old statistic, but even if we assume improvements, still a large portion of business at early stage in their data journey
Low hanging fruit going back to basics.
This is not to say AI tools don’t have their place. But we should be thinking strategically about what our needs really are vs. jumping on the AI hype train and spending tons of money because we think we need to.
You know your business. Where the value is. A whole world of stuff out there to be utilized…
Think about what sort of fundamental skills we can enhance and what tools we can integrate to facilitate fundamental analytical thinking.
Having solid foundational analytics strategy can then let you better evaluate what and when an AI productivity tool will be useful, rather than just hoping it solves your problem
Start thinking about these things and the decisions you need to make. Is AI really the solution?
This is NOT to say to not use AI tools (on the productivity side) if they have clear and obvious benefit to you

Current State

Gartner (2018): 87% of business at low analytics maturity [1]

Figure: Gartner Analytics Maturity Model

Back to Basics

Are AI tools really the solution?
- “AI for the sake of AI is a losing proposition” [2]. Be intentional!
Are you capturing what’s important? (Data infrastructure)
How are you using your data? (Reporting/analytical thinking)
What are your limitations? (Tools, skills, time, etc.)

The Case for Open-Source

What Is Open-Source?

It is software that is first and foremost free, but also open, so you know exactly how it works
Although open source software can and is built by companies, I think of it as community developed. Namely that the users of the software are building it.
So my favorite way to think about what open source is is like a prestine workshop with all the tools you can imagine available. Now it is up to you to build what you want with them to serve your needs. I specifically include “directions” to emphasize that not only you can use the tools, but a plethora of resources to guide you and understand how you do and can use them.
When I talk about open source, I’m implying code-first. In my opinion it is the best way. Allows you to build exactly what you need. Pre-made, point and click tools always have their limitations and restrictions. Not to mention the cost…
Data science in general is as much an art as science. You need to create things in a way that is useful.
Gives you the groundwork and foundation to facilitate the way you want to solve problems. Use it how you want, instead of trying to conform your problems to how pre-existing tools work.
Syntax to communicate your analytical ideas
Given that it’s free, no matter who you are you can start. Just install the software. So it gives empowerment and flexibility for an individual to start using it in their own work, maybe locally, and see how it helps them solve potentially small problems. Build this foundation and iterate towards more widespread use.

Background

In essence, free and open software (Wikipedia)
Think of as community built
Like a workshop for your raw materials

Benefits in Data Science

Low-cost iteration and experimentation (anyone can do it)
Code-first approach gives flexibility and control (art + science)
Use it to facilitate analytical approach
- Building blocks to AI

Today’s emphasis:: R, Quarto, Shiny.

The R Programming Language

Background

A free and open-source functional programming language
Developed in the 1990’s for statistical computing, but has long since expanded to much broader usage
Commonly used in the RStudio IDE
Advanced through packages developed by the community (Package list)

Example

library(plotly) # Load package
my_data <- trees # Assign object (dataset)
plot_ly(
  data = my_data,
  x = ~Height,
  y = ~Girth,
  size = ~Volume,
  color = ~Volume,
  text = ~paste0("Height: ", Height, "<br>Girth: ", Girth, "<br>Volume: ", Volume), height = 300, width = 500
)

Map example

# Load packages
library(tidyverse)
library(tidycensus)
library(mapgl)

# Import WI tracts
wi_tracts <- 
  arcgislayers::arc_read(
    url = "https://tigerweb.geo.census.gov/arcgis/rest/services/Generalized_ACS2023/Tracts_Blocks/MapServer/4", 
    where = "STATE = '55'"
  )

# Extract median income by tract
dat <- 
  get_acs(
    geography = "tract",
    variables = "B19013_001", # Median income,
    state = "WI",
    year = 2022,
    progress_bar = FALSE
  ) |>
  
  # Join to get boundaries
  inner_join(
    y = wi_tracts |> select(GEOID, geometry),
    by = "GEOID"
  ) |>
  
  # Make an information column
  mutate(
    Info = paste0(str_remove(NAME, ";.+$"), "<br>Median Income ($): ", round(estimate))
  ) |>
  
  # Convert to spatial data frame
  sf::st_as_sf()

# Make the make
maplibre() |>
  
  # Focus the mapping area
  fit_bounds(dat) |>
  
  # Fill with the data values
  add_fill_layer(
    id = "mc_acs",
    source = dat,
    fill_outline_color = "black",
    fill_color = 
      interpolate(
        column = "estimate",
        values = range(dat$estimate, na.rm = TRUE),
        stops = c("#f2d37c", "#08519c"),
        na_color = "gray"
      ),
    fill_opacity = 0.50,
    popup = "Info"
  ) |>
  add_legend(
    legend_title = "Median income ($)",
    values = range(dat$estimate, na.rm = TRUE),
    colors = c("#f2d37c", "#08519c")
  )

Quarto for Reproducible Documents

Background

Quarto is an open source technical publishing system
Vehicle for dissemination
Build custom analytical documents in programmatic way, promoting automation and reproducibility
Integrates well with R, Python, and many other tools
- This presentation (and website that it’s contained in) are built in Quarto (see source code)

YAML (header/metadata)

---
title: "The Value of Open-Source"
subtitle: "Leveraging free, code-first tools to iterate toward advanced analytics"
institute: "Research Data Scientist, Cleveland Clinic"
author: "Alex Zajichek"
date: "2025-02-27"
date-format: long
format:
  revealjs: 
    theme: [serif, custom.scss]
    footer: "<em>AI Innovations at Work Conference 2025</em>"
    slide-number: true
    incremental: true
---

Markdown (body)

### Background

- [Quarto](https://quarto.org/) is a open source technical publishing system
- Build custom analytical documents in programmatic way
- Vehicle for dissemination, promoting automation and reproducibility

Shiny for Web Applications

Background

Shiny is an R package for building custom, interactive web applications
- You can do it in Python too!
Allows users to interact with data, analytics, models, etc. however you see fit

Example: Deploying an LLM-powered Shiny app

Example (in R)

library(shiny) # Load package

# The user interface
ui <- 
  fluidPage(
    title = "MyApp", 
    sidebarLayout(
      sidebarPanel(
        selectInput(
          inputId = "color",
          label = "Choose Color",
          choices = c("red", "blue", "green")
        )
      ),
      mainPanel(
        plotOutput("my_plot")
      )
    )
  )

# How the inputs turn to outputs
server <- 
  function(input, output) {
    
    output$my_plot <-
      renderPlot({
        with(trees, plot(Height, Girth, col = input$color))
      })
    
  }

# Run the app
shinyApp(ui, server)

Deployment and Integration

Start with free; iterate and pay as needed
- Not all or nothing!
Repeat: use to facilitate analytical strategy

Options to get started

Posit Connect Cloud: Deploy Shiny apps, Quarto docs, and more to the web for free from GitHub
- Now in beta with private deployment options (for low cost)
shinyapps.io: Deploy Shiny apps to the web; free basic plan
- Pay for more features like security, domains, and authentication
Quarto Pub: Freely publish Quarto output to the web for public access
Shiny Server: Open source (free) server to deploy Shiny apps and other content
- Install on premises or the cloud
- Intriguing option for orgs new to data science

How can I learn?

What we saw here was a very narrow glimpse into what these tools can do. So how can you learn more yourself?
First, just install them–they are free!
Take the code from these slides even and run it yourself
Again, the workshop analogy. You probably need to be curious and want to learn new skills, but in my opinion it’s the best way to build a foundation.
Start task based: try reproducing some very small little task, like creating a graph programmatically that you’d normally do in Excel
You start to figure things out as you go, which allows you to iterate over time and continuing building and expanding the capabilities
Start with free stuff, and only incur costs as needed for solutions you are building and need
Use AI tools to help you learn!
Prompt: “Make me an app template to dynamically explore customers on a map with various filters”

Free resources
Initial strategy
Use AI tools!

R for Data Science by Hadley Wickham
An Introduction to Statistical Learning for statistics and data science foundations
A free R course on QuantFish
Endless YouTube resources for R, Quarto, Shiny…and everything else
Community boards like Posit Community, Stack Overflow, etc.
Visit Posit PBC’s website and learning resources
- Major developer/maintainer of many tools we talked about today

First, go install them and mess around

Make a (small) tangible goal, figure it out, then iterate

Focus on automation, reporting and reproducibility to start (low hanging fruit)
- Build the foundation for analytics maturity

ChatGPT

A companion for any and all questions along the way

Shiny Assistant

An LLM interface to specifically help you develop Shiny apps
Quickly get a template started for your concept
Live demo: https://gallery.shinyapps.io/assistant/

Tools in Action: riskcalc.org

What is it?

Background

riskcalc.org is a free-to-use repository of risk calculators for individualized medical decision making (~10-15K users/month)
Embedded with published predictive models
Each calculator is a Shiny application
Majority are regression models

The Backend

Key points

Can build working infrastructure for free (to start)

Incur cost for service as needed

AWS versus on premises

Acknowledgements

Co-developers

Alex Milinovich
Xinge (Kathy) Ji
Blaine Martyn-Dow
Other contributors

In Memory

Michael W. Kattan, PhD
Department Chair (2004-2024)
Quantitative Health Sciences
Cleveland Clinic

Creator of riskcalc.org

Conclusion

AI tools are really just implementations of statistical models

Seek to use AI tools to enhance your analytical strategy (not be it)

Open source tools help do this because they are free, flexible, and customizable

Build towards analytics maturity while staying grounded in the problems you are trying to solve

Questions?

Link to slides: www.zajichekstats.com/presentations/the-value-of-open-source

References

https://www.gartner.com/en/newsroom/press-releases/2018-12-06-gartner-data-shows-87-percent-of-organizations-have-low-bi-and-analytics-maturity
https://www.kearney.com/service/digital-analytics/ai-assessment-aia-2024-the-drive-for-greater-maturity-scale-and-impact

The Value of Open-Source

A Little Background

Who is QHS?

What is AI, Anyway?

Two Sides of A Coin

AI as Tools

AI as Data Science

Blurred Lines

The Horse Before the Cart

Current State

Back to Basics

The Case for Open-Source

What Is Open-Source?

Background

Benefits in Data Science

The R Programming Language

Background

Example

Map example

Quarto for Reproducible Documents

Background

Shiny for Web Applications

Background

Example (in R)

Deployment and Integration

How can we share it?

Options to get started

How can I learn?

ChatGPT

Shiny Assistant

Tools in Action: riskcalc.org

What is it?

Background

The Backend

Key points

Acknowledgements

Co-developers

In Memory

Conclusion

Questions?

References