What connection exists between public health and data? Public health experts are interested in epidemiology, which is the study of patterns of disease in populations. In the public health field, analyzing data is crucial when studying the spread of disease, especially for diseases like COVID-19. When viruses like the one that causes COVID-19 spread rapidly, looking at data is a good way to visualize their rate of growth and presence across the globe.
We can use digital tools to construct COVID-19 patterns, as well as find the incidence of cases in different locations. Having these tools is handy because they generate information about the virus very quickly. Public health officials can then use such information to determine which populations are at risk, and what kind of public safety measures should be implemented.
One such tool is Google Colab, an online notebook that executes code written in Python, a programming language. We can import COVID-19 data into Google Colab to not only make it readable, but also customize it to create charts and graphs. Public health experts and data scientists do this on a larger scale to interpret huge sets of data for diseases like COVID-19. Through this activity, we will be doing this on a smaller scale! You will also learn to use other tools, such as Plotly, to chart COVID-19 trends. Given a CSV file of data, how can you (yes, you!) make predictions for the spread of the virus? How can you find which areas have more cases?
The learning objectives for this activity include:
- Learn to navigate Google Colab, an online notebook to execute code
- Import necessary packages to enable tools
- Read data into a pandas DataFrame and wrangle it with pandas’ built-in functions
- Use Plotly to generate a bar chart of COVID-19 data
Getting Started
We’ll be working in a browser-based development environment called Google Colab that can run Python code. Think of it like a Google Doc, but for Data Science! You can create reports that run code, display visuals, and annotate what’s happening with Markdown or simple text!
To get started, search for “Google Colab” and log into your Google account. Hit “File” then “New Notebook” to generate a new notebook in another tab. Give it a descriptive name and we’ll get started!
Enabling Our Tools
We’ll need to import some working tools. In the first code chunk that appears, type the following and press “Shift+Enter”:
import pandas as pd
import plotly.express as px
- “Shift+Enter” is a shortcut for running this chunk. You can also click the play icon at the left corner of each code box.
The “as” that follows each package name gives that package an alias. “pd” is quicker to type than “pandas”, and the same goes for “px” in place of “plotly.express”.
Reading in the Data
The data we use is provided by the Johns Hopkins CSSE GitHub repository.
Copy and paste the following into a new code box, which you can create by clicking “+ Code” at the top left of the page.
link = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_daily_reports/04-22-2020.csv'
Read the data into a pandas DataFrame with “pd.read_csv()”. A DataFrame is a table of rows and labeled columns, and storing it in a variable lets us call pandas functions on it later. I called it “world” to represent the global data.
world = pd.read_csv(link)
Since “world” is now a pandas DataFrame, we can call any DataFrame function on it to manipulate the data. Take a glance at the data by viewing the first 5 rows (or the first “n” rows with head(n)) and their column labels with DataFrame.head().
world.head()
Up to this point, this is how the notebook should look:
Congrats! You have converted the mess of the comma-separated values into a structured form. Now, before we get into plotting these numbers, let’s introduce how we might select the columns we want to see and how to apply some basic math operations to them.
Data Wrangling
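To select the first or any other column of the table, we can use either of the following forms (the second works only when the column name is a valid Python name with no spaces):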
world['Country_Region']
#or
world.Country_Region
To select only the rows with a certain value in a column, such as “US” in “Country_Region”, we use the comparison operator “==”, which returns “True” wherever two values are the same and “False” otherwise. Indexing the table with that result keeps only the rows marked “True.”
world[world.Country_Region == "US"]
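To see what the comparison produces on its own, here is a minimal sketch on a tiny made-up table. The numbers are invented for illustration and do not come from the Johns Hopkins file, and the names `toy`, `is_us`, and `us_rows` are placeholders.

```python
import pandas as pd

# Tiny made-up table; the values are illustrative only, not real case counts.
toy = pd.DataFrame({
    "Country_Region": ["US", "Italy", "US", "Spain"],
    "Confirmed": [10, 20, 30, 40],
})

# The comparison yields one True/False value per row (a boolean mask).
is_us = toy["Country_Region"] == "US"
print(is_us.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the rows marked True.
us_rows = toy[is_us]
print(us_rows)
```

Splitting the mask into its own variable is optional; writing `toy[toy.Country_Region == "US"]` in one line does the same thing.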
Prepping the Data for Bar Chart
Everything mentioned so far will help us create a bar chart of total confirmed cases per country. We’ll achieve this by using pandas’ groupby() to group rows by a particular column and apply an aggregating operation to each group.
Start by grouping the rows by the distinct country names in the “Country_Region” column, then extend the statement one step at a time (“...” stands for the code from the previous step):
Description | Code |
---|---|
Group rows by country name | world.groupby('Country_Region') |
Select only the “Confirmed” column | “...”.Confirmed |
Find the sum of confirmed cases | “...”.sum() |
Reset the index for readability | “...”.reset_index() |
Assign to a variable | grouped_countries = “...”.reset_index() |
Print the first 5 rows | grouped_countries.head() |
Expected code and output should be:
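Assembled into one statement, the steps in the table look roughly like the sketch below. It runs on a small made-up table rather than the downloaded file, so the printed numbers will differ from yours; on the real data you would start from `world` and store the result in `grouped_countries`. The names `toy` and `grouped_toy` are placeholders.

```python
import pandas as pd

# Made-up rows standing in for the real file; a country can appear more than once.
toy = pd.DataFrame({
    "Country_Region": ["US", "Italy", "US", "Spain"],
    "Confirmed": [10, 20, 30, 40],
})

grouped_toy = (
    toy.groupby("Country_Region")  # one group per distinct country name
       .Confirmed                  # keep only the "Confirmed" column
       .sum()                      # add up the values within each group
       .reset_index()              # turn the country names back into a regular column
)
print(grouped_toy.head())
```

The two “US” rows collapse into a single row holding their combined total, which is what makes one bar per country possible later on.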
To reorder the rows, we can use the DataFrame’s “sort_values()” function.
Description | Code |
---|---|
Sort in ascending order (default) | grouped_countries.sort_values('Confirmed') |
Sort in descending order | grouped_countries.sort_values('Confirmed', ascending=False) |
Store in a variable | sorted_countries = ... |
Expected code and output:
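As a rough illustration of the descending sort, here is a sketch on made-up totals (not real case counts). The names `grouped_toy` and `sorted_toy` are placeholders for `grouped_countries` and `sorted_countries`.

```python
import pandas as pd

# Made-up totals standing in for grouped_countries.
grouped_toy = pd.DataFrame({
    "Country_Region": ["Italy", "Spain", "US"],
    "Confirmed": [20, 50, 40],
})

# ascending=False puts the largest total first. sort_values returns a new
# DataFrame and leaves the original unchanged, so store the result to keep it.
sorted_toy = grouped_toy.sort_values("Confirmed", ascending=False)
print(sorted_toy)
```

Note that each row keeps its original index label after sorting; only the row order changes.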
NOW WE CAN PLOT!
Using Plotly’s “bar()”, we’ll pass in the sorted dataframe we’ve just constructed. What’s a bar chart? It is one way to visualize data when you want to compare several subjects and their counts for a certain category. A taller bar indicates a higher count for the respective subject. Because head() keeps only the first 5 rows, a descending sort means the chart shows the five countries with the most confirmed cases.
fig = px.bar(sorted_countries.head(),
x='Country_Region',
y='Confirmed',
color= 'Confirmed')
fig.show()
Wrapping up...
- You were able to organize COVID-19 data by country
- You were able to create graphs that show the number of confirmed cases
- You were able to execute code to convert data into readable form
Through this activity, we were able to import data, convert it into a readable form, and modify it as desired to generate a bar chart of the number of virus cases per country. We can see that such digital tools make it possible to visualize the spread of the virus in a couple of steps on a computer! Data science can enable anyone, including you, to learn more about the spread of COVID-19 from anywhere! Data science is a key part of understanding how a disease spreads and at what rate cases occur. It is becoming increasingly important for people not only in public health, but also in healthcare, to use such data projections to improve patient care and ensure public safety.