Coronavirus in the United States
Today we are faced with a virus that is spreading rapidly and affecting people across the world. While governments are taking action and implementing safety procedures, the virus continues to spread. This raises the question of whether these actions are preventing the spread or simply stalling it. The data I have collected will let me see when the outbreak is predicted to end and whether the safety measures being taken are actually slowing the spread of the virus. Using linear regression, I will be able to map out the trajectory of the virus. I will also be able to use a time series to mark when certain safety precautions were taken and what effect those actions had on the outbreak.
Problem Statement
In this study I want to see whether the precautions being taken in the United States are really slowing the rapid pace of the virus. The virus most severely affects people who are 70 years of age or older. Some of the issues came from the lag between recognition of the virus and the actions governments took when first facing it. It is unclear when this outbreak will end, and this paper will try to project what the next month will look like for the United States.
Background
The coronavirus only recently came to light. The virus originated in the city of Wuhan, China, and has since spread across the world. With modern technology and transportation systems, the virus has been able to spread quickly to more than 70 countries. On December 31, 2019, Chinese authorities alerted the World Health Organization to the outbreak.
The data I have gathered comes with a disclaimer. Tests were not widely available early in the outbreak, yet test results have driven the conversation and the data surrounding the virus. Patients who were never tested could have had the virus without being properly tracked, which is one way my model could be working from incomplete data. In the United States, testing was, in a word, delayed. The scale of the problem in China was not fully appreciated until tests were available, and the world did not see the true extent of the issue until test results started coming back and the number of confirmed cases spiked.
The data I collect has to be well labeled and up to date. I looked for a few different parameters; the only variables I was interested in were location, date, cases, and deaths. With these variables I can feed my model and make sure that I am answering my problem statement.
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Correlation statistics
import scipy.stats
# Model evaluation utilities
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
Solution
My data was clean and already split into training and testing datasets. The data had to be manipulated a little to be as useful as possible. I transformed the date fields into usable integers so that I would not have to deal with string values going into my model. I also dropped the Province/State column, since I do not need the data to be that location-specific and it will not matter for the model. Then I removed the null values from the training dataset, because I require all fields to be filled in. The test dataset I kept as is, because it holds the data that I am transforming my training set to mimic.
To confirm that cases and deaths were correlated and worth pursuing, I graphed them: I took my training dataset and plotted the global confirmed cases and the global confirmed deaths.
testData = pd.read_csv('test.csv')
trainData = pd.read_csv('train.csv')
display(trainData.tail())
| | Id | Province/State | Country/Region | Lat | Long | Date | ConfirmedCases | Fatalities |
|---|---|---|---|---|---|---|---|---|
| 17887 | 26378 | NaN | Zambia | -15.4167 | 28.2833 | 2020-03-20 | 2.0 | 0.0 |
| 17888 | 26379 | NaN | Zambia | -15.4167 | 28.2833 | 2020-03-21 | 2.0 | 0.0 |
| 17889 | 26380 | NaN | Zambia | -15.4167 | 28.2833 | 2020-03-22 | 3.0 | 0.0 |
| 17890 | 26381 | NaN | Zambia | -15.4167 | 28.2833 | 2020-03-23 | 3.0 | 0.0 |
| 17891 | 26382 | NaN | Zambia | -15.4167 | 28.2833 | 2020-03-24 | 3.0 | 0.0 |
display(testData.head())
| | ForecastId | Province/State | Country/Region | Lat | Long | Date |
|---|---|---|---|---|---|---|
| 0 | 1 | NaN | Afghanistan | 33.0 | 65.0 | 2020-03-12 |
| 1 | 2 | NaN | Afghanistan | 33.0 | 65.0 | 2020-03-13 |
| 2 | 3 | NaN | Afghanistan | 33.0 | 65.0 | 2020-03-14 |
| 3 | 4 | NaN | Afghanistan | 33.0 | 65.0 | 2020-03-15 |
| 4 | 5 | NaN | Afghanistan | 33.0 | 65.0 | 2020-03-16 |
To make some observations I need to format the data so that it is usable, with integer values for graphing. The date field will be transformed into an integer by removing all symbols and casting the type to int. I will also remove the NaNs from the training set, as those provide no value, and at the same time I will drop the Province/State column because I want to focus on the country as a whole rather than individual states.
For the test data I want to keep the same primary key, so I will use the date field and simply perform the same transformation I applied to the training dataset.
# Format date
trainData["Date"] = trainData["Date"].apply(lambda x: x.replace("-",""))
trainData["Date"] = trainData["Date"].astype(int)
testData["Date"] = testData["Date"].apply(lambda x: x.replace("-",""))
testData["Date"] = testData["Date"].astype(int)
# Drop the Province/State column, then remove remaining rows with missing values
trainData = trainData.drop(['Province/State'],axis=1)
trainData = trainData.dropna()
trainData.isnull().sum()
Id 0
Country/Region 0
Lat 0
Long 0
Date 0
ConfirmedCases 0
Fatalities 0
dtype: int64
testData.isnull().sum()
ForecastId 0
Province/State 6622
Country/Region 0
Lat 0
Long 0
Date 0
dtype: int64
trainData.nunique()
Id 17892
Country/Region 163
Lat 272
Long 276
Date 63
ConfirmedCases 1023
Fatalities 204
dtype: int64
testData.nunique()
ForecastId 12212
Province/State 128
Country/Region 163
Lat 272
Long 276
Date 43
dtype: int64
# Aggregate global confirmed cases and fatalities by date, then join them into one table
confirmed_total_date = trainData.groupby(['Date']).agg({'ConfirmedCases':['sum']})
fatalities_total_date = trainData.groupby(['Date']).agg({'Fatalities':['sum']})
total_date = confirmed_total_date.join(fatalities_total_date)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
total_date.plot(ax=ax1)
ax1.set_title("Global confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
fatalities_total_date.plot(ax=ax2, color='orange')
ax2.set_title("Global deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)
[Figure: global confirmed cases (left) and global deceased cases (right) over time]
Observations
The one concern that is apparent here is that the graph initially starts out as a flat, unaffected line. This could be because, in the wake of a new virus, the healthcare industry was unsure how to classify it, which would result in false readings and overlooked cases. When I focus my efforts on the United States, I will start the data when the virus was recognized as an issue. Another issue that could be affecting the graph is the availability of tests for COVID-19: looking at the confirmed cases, we can see that the graph grew exponentially around the same time that tests were distributed to the affected areas. This could affect the model.
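As a minimal sketch of that idea (a hypothetical filter shown for illustration only; it is not reused later), the United States rows could be trimmed to start at the first reported case:
# Hypothetical sketch: keep only US rows from the first date with a confirmed case,
# so the flat stretch before the virus was recognized is dropped.
us = trainData[trainData['Country/Region'] == 'US']
first_case_date = us.loc[us['ConfirmedCases'] > 0, 'Date'].min()
us_since_first_case = us[us['Date'] >= first_case_date]
print(first_case_date, len(us_since_first_case))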
matplotlib.style.use('ggplot')
plt.scatter(confirmed_total_date, fatalities_total_date)
plt.show()

I used the scipy library to measure the correlation between the confirmed cases and the confirmed deaths. The Spearman correlation gave a score of 1, and the Kendall tau correlation gave a score of .99. These scores show a correlation between the two variables: as the confirmed cases go up, so do the deaths. But, as I mentioned above, this could also reflect when tests were being administered, since that is when they became available in bulk, or it could reflect the lack of response when the virus was first encountered.
scipy.stats.spearmanr(confirmed_total_date, fatalities_total_date)[0]
1.0
scipy.stats.kendalltau(confirmed_total_date, fatalities_total_date)[0]
0.9999999999999999
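As a quick cross-check (an extra step beyond the original analysis), pandas can compute the same rank correlations directly on the joined table:
# Cross-check the scipy scores with pandas' built-in rank correlations.
print(total_date.corr(method='spearman'))
print(total_date.corr(method='kendall'))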
United States Analysis
I want to isolate the United States data in the datasets to get a cleaner look. By the time the coronavirus became visible in the United States, tests were available to identify the virus. I anticipate that this analysis will differ from the rest of the world (the graph above), since it will not include China. China's influence in the data showed a delay in progress against the virus when it comes to testing and identification, portrayed by the spikes in the data.
For the United States I expect the data to start out slow and then grow exponentially, as more tests became available near the beginning of the outbreak. China was unaware of the virus at first and allowed it to spread. The United States, looking in from the outside, should have been able to catch the virus early and take measures to prevent widespread disease.
unitedStates =trainData[trainData['Country/Region']=='US']
unitedStates.describe()
| | Id | Lat | Long | Date | ConfirmedCases | Fatalities |
|---|---|---|---|---|---|---|
| count | 3654.000000 | 3654.000000 | 3654.000000 | 3.654000e+03 | 3654.000000 | 3654.000000 |
| mean | 22584.500000 | 37.771567 | -84.323891 | 2.020024e+07 | 60.047072 | 0.813355 |
| std | 1557.201509 | 8.018508 | 46.996733 | 6.621358e+01 | 681.297097 | 7.030237 |
| min | 19903.000000 | 13.444300 | -157.498300 | 2.020012e+07 | 0.000000 | 0.000000 |
| 25% | 21236.250000 | 34.969700 | -99.784000 | 2.020021e+07 | 0.000000 | 0.000000 |
| 50% | 22584.500000 | 38.978600 | -87.944200 | 2.020022e+07 | 0.000000 | 0.000000 |
| 75% | 23932.750000 | 42.230200 | -76.802100 | 2.020031e+07 | 0.000000 | 0.000000 |
| max | 25266.000000 | 61.370700 | 144.793700 | 2.020032e+07 | 25681.000000 | 210.000000 |
# Aggregate US confirmed cases and fatalities by date, then plot them
confirmed_total_date_US = unitedStates.groupby(['Date']).agg({'ConfirmedCases':['sum']})
fatalities_total_date_US = unitedStates.groupby(['Date']).agg({'Fatalities':['sum']})
total_date_US = confirmed_total_date_US.join(fatalities_total_date_US)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
total_date_US.plot(ax=ax1)
ax1.set_title("US confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
fatalities_total_date_US.plot(ax=ax2, color='orange')
ax2.set_title("US deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)
[Figure: US confirmed cases (left) and US deceased cases (right) over time]
Analysis
Over the same timeline we see a stretch during which the virus was not yet being reported in the United States; once it was reported, the cases exploded.
The United States was aware of the virus from its origin in China. On January 21, 2020, the United States confirmed its first case of the virus. About a month later, on February 26, 2020, the United States had its first suspected case of local transmission, and just three days after that it reported its first COVID-19-related death.
Looking at the graph above, we notice a spike in the data around early March, when the graph starts to climb. This is because on March 3, 2020, the CDC lifted restrictions on virus testing, meaning that people in close contact with someone diagnosed with COVID-19, or people with severe symptoms, could get tested.
Much like what had happened in China, the availability of tests to the general public made it clear that the virus had already spread across the population. Once testing reached the United States and people began to take the tests, the true volume of the virus came to light. Through the month of March the number of cases grew at an alarming rate because the availability of tests gave us a clearer picture.
I expect this number to keep growing, because tests are not yet accessible to everyone. Once there are enough tests to test everyone who is concerned they may have the virus, I expect the data to reflect steady growth in cases and deaths.
matplotlib.style.use('ggplot')
plt.scatter(confirmed_total_date_US, fatalities_total_date_US)
plt.show()

scipy.stats.spearmanr(confirmed_total_date_US, fatalities_total_date_US)[0]
0.8073991628275585
scipy.stats.kendalltau(confirmed_total_date_US, fatalities_total_date_US)[0]
0.7657375167529012
Correlation
Using the United States data, the correlation between confirmed deaths and confirmed cases scores .80 on Spearman and .76 on Kendall tau. Even though this is not as strong as the global correlation, there is still a correlation between the number of confirmed cases and fatalities in the United States. Even though the raw numbers differ, they follow a similar pattern: as the number of cases grows, so does the number of deaths.
Random Forest Classification Model
The random forest classifier is an ensemble, tree-based learning algorithm. It repeatedly selects random subsets of the training data, fits a decision tree to each, and aggregates the votes from the different trees to predict the final class of a test object.
The algorithm is relatively stable and tends not to produce significantly different results when new data is introduced, since averaging over many trees reduces the variance of any single tree. It can handle missing values and unscaled data points, but for this project I removed those concerns up front, as that should lead to a better result.
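As a small illustration of the voting step (a toy example on made-up data, not the project data), each fitted tree casts a vote and the forest reports the majority class:
# Toy illustration on made-up data: each tree in the forest votes, and the
# forest's prediction is the majority class across those votes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
votes = [tree.predict([[4.5, 4.5]])[0] for tree in toy_forest.estimators_]
print(votes)                             # individual tree votes
print(toy_forest.predict([[4.5, 4.5]]))  # majority class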
from sklearn.ensemble import RandomForestClassifier
Tree_model = RandomForestClassifier(max_depth=200, random_state=0)
total_date_US
| Date | ConfirmedCases (sum) | Fatalities (sum) |
|---|---|---|
| 20200122 | 0.0 | 0.0 |
| 20200123 | 0.0 | 0.0 |
| 20200124 | 0.0 | 0.0 |
| 20200125 | 0.0 | 0.0 |
| 20200126 | 0.0 | 0.0 |
| ... | ... | ... |
| 20200320 | 18967.0 | 241.0 |
| 20200321 | 25347.0 | 302.0 |
| 20200322 | 33083.0 | 413.0 |
| 20200323 | 43442.0 | 547.0 |
| 20200324 | 53490.0 | 698.0 |
63 rows × 2 columns
train_unitedStates=unitedStates.drop(['Country/Region'],axis=1)
trainData_US=train_unitedStates[['Lat','Long','Date']]
trainData_US
| | Lat | Long | Date |
|---|---|---|---|
| 13482 | 32.3182 | -86.9023 | 20200122 |
| 13483 | 32.3182 | -86.9023 | 20200123 |
| 13484 | 32.3182 | -86.9023 | 20200124 |
| 13485 | 32.3182 | -86.9023 | 20200125 |
| 13486 | 32.3182 | -86.9023 | 20200126 |
| ... | ... | ... | ... |
| 17131 | 42.7560 | -107.3025 | 20200320 |
| 17132 | 42.7560 | -107.3025 | 20200321 |
| 17133 | 42.7560 | -107.3025 | 20200322 |
| 17134 | 42.7560 | -107.3025 | 20200323 |
| 17135 | 42.7560 | -107.3025 | 20200324 |
3654 rows × 3 columns
train_unitedStates.to_csv('unitedstatesdata.csv')
test_unitedStates =testData[testData['Country/Region']=='US']
test_unitedStates=test_unitedStates.drop(['Country/Region'],axis=1)
testData_US = test_unitedStates[['Lat','Long','Date']]
testData_US
| | Lat | Long | Date |
|---|---|---|---|
| 9202 | 32.3182 | -86.9023 | 20200312 |
| 9203 | 32.3182 | -86.9023 | 20200313 |
| 9204 | 32.3182 | -86.9023 | 20200314 |
| 9205 | 32.3182 | -86.9023 | 20200315 |
| 9206 | 32.3182 | -86.9023 | 20200316 |
| ... | ... | ... | ... |
| 11691 | 42.7560 | -107.3025 | 20200419 |
| 11692 | 42.7560 | -107.3025 | 20200420 |
| 11693 | 42.7560 | -107.3025 | 20200421 |
| 11694 | 42.7560 | -107.3025 | 20200422 |
| 11695 | 42.7560 | -107.3025 | 20200423 |
2494 rows × 3 columns
testData_US.tail()
| | Lat | Long | Date |
|---|---|---|---|
| 11691 | 42.756 | -107.3025 | 20200419 |
| 11692 | 42.756 | -107.3025 | 20200420 |
| 11693 | 42.756 | -107.3025 | 20200421 |
| 11694 | 42.756 | -107.3025 | 20200422 |
| 11695 | 42.756 | -107.3025 | 20200423 |
train_unitedStates_Confirmed=train_unitedStates[['ConfirmedCases']]
train_unitedStates_Confirmed.describe()
| | ConfirmedCases |
|---|---|
| count | 3654.000000 |
| mean | 60.047072 |
| std | 681.297097 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.000000 |
| max | 25681.000000 |
train_unitedStates_Fatalities=train_unitedStates[['Fatalities']]
train_unitedStates_Fatalities.describe()
| | Fatalities |
|---|---|
| count | 3654.000000 |
| mean | 0.813355 |
| std | 7.030237 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.000000 |
| max | 210.000000 |
x = trainData_US
y1 = train_unitedStates_Confirmed
y2 = train_unitedStates_Fatalities
x_test = testData_US
Tree_model.fit(x,y1)
predConfirmed = Tree_model.predict(x_test)
predConfirmed = pd.DataFrame(predConfirmed)
predConfirmed.columns = ["ConfirmedCases_prediction"]
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
"""Entry point for launching an IPython kernel.
predConfirmed
| | ConfirmedCases_prediction |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 6.0 |
| 3 | 12.0 |
| 4 | 12.0 |
| ... | ... |
| 2489 | 29.0 |
| 2490 | 29.0 |
| 2491 | 29.0 |
| 2492 | 29.0 |
| 2493 | 29.0 |
2494 rows × 1 columns
Tree_model.fit(x,y2)
predFatalities = Tree_model.predict(x_test)
predFatalities = pd.DataFrame(predFatalities)
predFatalities.columns = ["Fatalities_prediction"]
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:1: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
"""Entry point for launching an IPython kernel.
predFatalities
| | Fatalities_prediction |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| ... | ... |
| 2489 | 0.0 |
| 2490 | 0.0 |
| 2491 | 0.0 |
| 2492 | 0.0 |
| 2493 | 0.0 |
2494 rows × 1 columns
predictions = predConfirmed.join(predFatalities)
predictions
| | ConfirmedCases_prediction | Fatalities_prediction |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 6.0 | 0.0 |
| 3 | 12.0 | 0.0 |
| 4 | 12.0 | 0.0 |
| ... | ... | ... |
| 2489 | 29.0 | 0.0 |
| 2490 | 29.0 | 0.0 |
| 2491 | 29.0 | 0.0 |
| 2492 | 29.0 | 0.0 |
| 2493 | 29.0 | 0.0 |
2494 rows × 2 columns
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
predictions.plot(ax=ax1)
ax1.set_title("US confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
predFatalities.plot(ax=ax2, color='orange')
ax2.set_title("US deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)
[Figure: predicted US confirmed cases (left) and predicted US deceased cases (right) over the forecast period]
predictions.describe()
| | ConfirmedCases_prediction | Fatalities_prediction |
|---|---|---|
| count | 2494.000000 | 2494.000000 |
| mean | 690.180834 | 8.576183 |
| std | 2943.759642 | 27.336496 |
| min | 0.000000 | 0.000000 |
| 25% | 30.000000 | 0.000000 |
| 50% | 105.000000 | 1.000000 |
| 75% | 368.000000 | 6.000000 |
| max | 25681.000000 | 210.000000 |
matplotlib.style.use('ggplot')
plt.scatter(predConfirmed, predFatalities)
plt.show()

scipy.stats.spearmanr(predConfirmed, predFatalities)[0]
0.8056651838655811
scipy.stats.kendalltau(predConfirmed, predFatalities)[0]
0.6666551009677553
# Total predicted cases and deaths over the forecast window
confirmedCasesPrint = predictions['ConfirmedCases_prediction'].sum()
fatalitiesPrint = predictions['Fatalities_prediction'].sum()
print('By 4/23/2020 the below predictions were made: ')
print('{} more people in the US will contract the disease.'.format(confirmedCasesPrint))
print('As a result I predict that {} people will expire.'.format(fatalitiesPrint))
By 4/23/2020 the below predictions were made:
1721311.0 more people in the US will contract the disease.
As a result I predict that 21389.0 people will expire.
Conclusion
Using this model I was able to make the following predictions: by 4/23/2020, 1,721,311 more people in the US will contract the disease, and as a result I predict that 21,389 people will expire. Between these predicted series the correlation score was .80 for Spearman and .66 for Kendall tau. From these scores I see a pattern suggesting the numbers are plausible. Given the timeline, with its skyrocketing case counts, and the correlation of the predicted values, I feel reasonably confident in the prediction.
Disclaimer
This project and its data have been directly affected by the distribution of tests across the globe. In the data and graphs we can see spikes at the points where tests became available in each location. In the United States we can see when tests became available and when supplies began to deplete. I suspect that those spikes affected my outcome and the model that I trained on that data. I do suspect that the correlation between the two variables is accurate, even if the absolute numbers may be off.
YouTube video link
https://youtu.be/Vg55Ry7Q7VU
Link to paper
https://github.com/SudoZyphr/covid19/blob/master/Covid-19.pdf
Resources:
- https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/summary.html
- https://ourworldindata.org/coronavirus
- https://www.who.int/emergencies/diseases/novel-coronavirus-2019
- https://www.worldometers.info/coronavirus/
- https://informationisbeautiful.net/visualizations/covid-19-coronavirus-infographic-datapack/
- https://www.ecdc.europa.eu/en/publications-data/rapid-risk-assessment-novel-coronavirus-disease-2019-covid-19-pandemic-increased
- https://www.barrons.com/articles/latest-coronavirus-data-show-disease-continues-to-spread-even-in-the-u-s-51584224660
- https://www.theguardian.com/world/2020/mar/13/coronavirus-pandemic-visualising-the-global-crisis
- https://ourworldindata.org/coronavirus-source-data
- https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
- https://www.sciencedaily.com/releases/2020/03/200317175442.htm
- https://www.worldometers.info/coronavirus/coronavirus-age-sex-demographics/