What connection exists between public health and data? Public health experts are interested in epidemiology, the study of patterns of disease in populations. In public health, analyzing data is crucial for studying the spread of disease, especially for diseases like COVID-19. When a disease like COVID-19 spreads rapidly, looking at data is a good way to visualize its rate of growth and its presence across the globe.

We can use digital tools to chart COVID-19 trends, as well as to find the incidence of cases in different locations. These tools are handy because they generate information about the virus very quickly. Public health officials can then use that information to determine which populations are at risk and what kind of public safety measures should be implemented.

One such tool is Google Colab, an online notebook that runs code written in Python, a programming language. We can import COVID-19 data into Google Colab not only to make it readable, but also to customize it and create charts and graphs. Public health experts and data scientists do this on a larger scale to interpret huge sets of data for diseases like COVID-19. In this activity, we will do the same on a smaller scale! You will also learn to use other tools, such as Plotly, to chart COVID-19 trends. Given a CSV file of data, how can you (yes, you!) make predictions for the spread of the virus? How can you find which areas have more cases?

The learning objectives for this activity include:

  • Learn to navigate Google Colab, an online notebook to execute code
  • Import necessary packages to enable tools
  • Read data into a pandas DataFrame and wrangle it with pandas’ built-in functions
  • Use Plotly to generate a bar chart of COVID-19 data

Getting Started

We’ll be working in Google Colab, a browser-based development environment that can run Python code. Think of it like a Google Doc, but for Data Science! You can create reports that run code, display visuals, and annotate what’s happening with Markdown or plain text!

To get started, search for “Google Colab” and log into your Google account. Hit “File” then “New Notebook” to generate a new notebook in another tab. Give it a descriptive name and we’ll get started!

Enabling Our Tools

We’ll need to import some working tools. In the first code chunk that appears, type the following and press “Shift+Enter”:

import pandas as pd
import plotly.express as px
*The “as” that follows our package name gives that package an alias. “pd” is easier and quicker to type out than “pandas”, and the same goes for “px” and “plotly.express”.*
  • “Shift+Enter” is a shortcut for running this chunk. You can also click the play icon found at the left corner of each code box.

Reading in the Data

The data we use is provided by the Johns Hopkins CSSE GitHub repository.

Copy and paste the following into a new code box, which you can create by clicking “+ code” at the top left of the page.

link = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/04-22-2020.csv'
It’s good practice to keep the “imports” and the data links in separate code chunks. In the future, you may find yourself importing 10 or more packages and reading in several data files!

Read the data into a pandas DataFrame with “pd.read_csv()”. This function fetches the CSV file and returns its contents as a table, which we store in a variable. I called it “world” to represent the global data.

world = pd.read_csv(link)

Since “world” is a pandas DataFrame, we can call any DataFrame method on it to manipulate the data in any way, shape or form! Take a glance at the data by observing the first 5 (or “n”) rows and their column labels with the DataFrame’s head() method.

world.head()
Alternatively, you can type “world” by itself to display the whole table (pandas abbreviates long tables, so you will see only the first and last few of the 3000+ rows), “world.head(10)” for the top 10 rows, “world.tail()” for the bottom 5, etc.
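If you would like to try these calls without downloading anything, here is a minimal sketch on a tiny stand-in table. The variable names “csv_text” and “sample” and all of the numbers are made up for illustration; only the column names “Country_Region” and “Confirmed” come from the activity.

```python
import io
import pandas as pd

# Hypothetical stand-in data (made-up numbers), written as CSV text so that
# pd.read_csv() can parse it the same way it parses the real file.
csv_text = """Country_Region,Confirmed
US,100
US,40
Italy,80
Spain,70
France,60
Germany,50
"""

# io.StringIO lets read_csv treat a string as if it were a file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)     # (number of rows, number of columns)
print(sample.head(3))   # the first 3 rows
print(sample.tail(2))   # the last 2 rows
```

Swapping “sample” for “world” gives the same kind of output on the real data.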

Up to this point, this is how the notebook should look:

Congrats! You have converted the mess of the comma-separated values into a structured form. Now, before we get into plotting these numbers, let’s introduce how we might select the columns we want to see and how to apply some basic math operations to them.

Data Wrangling

To select the first (or any other) column of the table, we can type either:

world['Country_Region']

#or

world.Country_Region
Either form should return a numbered list of the values found within that column only.
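As a quick check that the two spellings really do the same thing, here is a small sketch on a stand-in table. The table “sample” and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in table with made-up numbers
sample = pd.DataFrame({
    "Country_Region": ["US", "US", "Italy", "Spain"],
    "Confirmed": [100, 40, 80, 70],
})

by_brackets = sample["Country_Region"]   # bracket notation
by_attribute = sample.Country_Region     # dot notation

print(by_brackets)
print(by_brackets.equals(by_attribute))  # True when both hold the same values

# Bracket notation also accepts a list, which selects several columns at once
two_columns = sample[["Country_Region", "Confirmed"]]
print(two_columns)
```

Dot notation only works when the column name is a valid Python name (no spaces, for example), so bracket notation is the safer habit.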


To select only the rows with a certain value in a column, such as “US” in “Country_Region”, we use the comparison operator “==”, which returns “True” wherever two values are the same. We then grab the rows that return “True” by indexing.

world[world.Country_Region == "US"]

This line of code first compares every value in the “Country_Region” column with the given value “US”. The inner expression, world.Country_Region == "US", evaluates to a column of True/False values, one per row. Placing that column inside world[...] returns only the rows marked True.
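To see the two steps separately, here is a minimal sketch on a stand-in table. The names “sample”, “mask” and “us_rows” and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in table with made-up numbers
sample = pd.DataFrame({
    "Country_Region": ["US", "US", "Italy", "Spain"],
    "Confirmed": [100, 40, 80, 70],
})

# Step 1: the comparison gives one True/False value per row
mask = sample.Country_Region == "US"
print(mask)

# Step 2: indexing with that column keeps only the rows marked True
us_rows = sample[mask]
print(us_rows)

# The filtered rows are still a table, so we can keep working with them
print(us_rows.Confirmed.sum())
```

Writing sample[sample.Country_Region == "US"] on one line does both steps at once.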

Prepping the Data for Bar Chart

Everything mentioned so far will help us create a bar chart of the total confirmed cases per country. We’ll achieve this by using pandas’ groupby() to group rows by a particular column and apply an aggregating operation to each group.

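Build the code up one step at a time, starting by grouping the rows by the distinct country names in the “Country_Region” column. In each row, “...” stands for the code from the steps before it: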

Description                            Code
Group country names                    world.groupby('Country_Region')
Select only the “Confirmed” column     “...”.Confirmed
Find the sum of confirmed cases        “...”.sum()
Reset the index for readability        “...”.reset_index()
Assign to a variable                   grouped_countries = “...”.reset_index()
Print the first 5 rows                 grouped_countries.head()

Expected code and output should be:
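As a rough sketch of how the steps chain together, here is the same sequence run on a small stand-in table rather than the real data. The table “sample” and its numbers are made up for illustration; with the real data you would start from “world” instead.

```python
import pandas as pd

# Hypothetical stand-in table with made-up numbers
sample = pd.DataFrame({
    "Country_Region": ["US", "US", "Italy", "Spain"],
    "Confirmed": [100, 40, 80, 70],
})

grouped_countries = (
    sample.groupby("Country_Region")  # one group per distinct country name
    .Confirmed                        # keep only the "Confirmed" column
    .sum()                            # add up the values within each group
    .reset_index()                    # turn the country names back into a column
)

print(grouped_countries.head())
```

The two “US” rows collapse into a single row whose “Confirmed” value is their sum, and the groups come out in alphabetical order by country name.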


To reorder the rows, we can use the DataFrame’s “sort_values()” method.

Description                            Code
Sort in ascending order (default)      grouped_countries.sort_values('Confirmed')
Sort in descending order               grouped_countries.sort_values('Confirmed', ascending=False)
Store in variable                      sorted_countries = ...

Expected code and output:
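Here is a small sketch of both sort orders on a stand-in table. The numbers are made up for illustration, and the variable “ascending_countries” is a name introduced just for this example.

```python
import pandas as pd

# Hypothetical stand-in for the grouped table, with made-up numbers
grouped_countries = pd.DataFrame({
    "Country_Region": ["Italy", "Spain", "US"],
    "Confirmed": [80, 70, 140],
})

# Ascending is the default: smallest counts first
ascending_countries = grouped_countries.sort_values("Confirmed")
print(ascending_countries)

# Descending: largest counts first
sorted_countries = grouped_countries.sort_values("Confirmed", ascending=False)
print(sorted_countries)

# sort_values() returns a new table; the original keeps its order
print(grouped_countries)
```

Because the original table is left unchanged, storing the result in a variable such as “sorted_countries” is what lets us reuse the sorted version later.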

NOW WE CAN PLOT!

Using Plotly’s “bar()” we’ll pass in the sorted dataframe we’ve just constructed. What’s a bar chart? It is one way to visualize data when you want to compare the counts of several subjects in a certain category. A taller bar indicates a higher count for the respective subject.

fig = px.bar(sorted_countries.head(),
             x='Country_Region',
             y='Confirmed',
             color='Confirmed')
fig.show()
We pass in the first five rows of the sorted dataframe (the five countries with the most confirmed cases), set the x-axis to the subjects and the y-axis to the category we are interested in, and use “color” to shade each bar according to its “Confirmed” value.

Wrapping up...

  • You were able to organize COVID-19 data by country
  • You were able to create a graph that shows the number of confirmed cases
  • You were able to execute code to convert data into readable form

Through this activity, we were able to import data, convert it into a readable form, and modify it as desired to generate a bar chart of the number of confirmed cases per country. Digital tools like these let us visualize the spread of the virus in just a few steps on a computer! Data science can enable anyone, including you, to learn more about the spread of COVID-19 from anywhere! It is a key part of understanding how a disease spreads and at what rate cases occur. It is becoming increasingly important for people not only in public health but also in healthcare to use such data analyses to improve patient care and ensure public safety.