Coronavirus in the United States

Today we are faced with a virus that is spreading rapidly and affecting people across the world. While the world's governments are taking action and implementing safety procedures, the virus continues to spread. This raises the question of whether these actions are preventing the spread or merely stalling it. The data being collected will let me estimate when the outbreak is predicted to end and whether the safety measures being taken are actually slowing the spread of the virus. Using linear regression, I will be able to map out the trajectory of the virus. I will also be able to use a time series to map out when certain safety precautions were taken and what effect those actions had on the outbreak.

Problem Statement

In this study I want to see whether the precautions being taken in the United States are really slowing the rapid pace of the virus. The virus disproportionately affects people who are 70 years of age or older. Some of the issues stem from the lag between virus recognition and action taken by governments when first facing the virus. It is unclear when this outbreak will end, and this paper will try to project what the next month will look like for the United States.

Background

The coronavirus only recently came to light. The virus originated in the city of Wuhan, China, and has since spread across the world. With modern technology and transportation systems, the virus has been able to spread quickly across 70 different countries. Around December 31st of last year, Chinese authorities alerted the World Health Organization of the outbreak.

The data that I have gathered comes with a disclaimer. Tests were not initially available, yet tests have driven the conversation and the data surrounding the virus. Patients who were not tested could have had the virus without being properly tracked. This is where my model could be inaccurate, since the underlying counts may not be correct. In the United States the tests have been, in a word, delayed. The scale of the problem in China was not really appreciated until tests became available. The world did not see the true extent of the issue until test results started coming back and the number of confirmed cases spiked.

The data I collect has to be well labeled and up to date. I looked for a few different parameters; the only variables I was interested in were location, date, cases, and deaths. With these variables I can feed my model and make sure that I am answering my problem statement.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix
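
Before doing anything else, it is worth confirming that the files actually contain the fields listed above. The helper below is a hypothetical sanity check, not part of the original pipeline; the column sets are my assumption based on the dataset previews shown later.

# Hypothetical sanity check (assumed column sets): verify the expected
# fields exist before any preprocessing touches the data.
def check_columns(path, required):
    cols = set(pd.read_csv(path, nrows=0).columns)  # read the header row only
    missing = required - cols
    if missing:
        raise ValueError('{} is missing columns: {}'.format(path, sorted(missing)))

check_columns('train.csv', {'Country/Region', 'Date', 'ConfirmedCases', 'Fatalities'})
check_columns('test.csv', {'Country/Region', 'Date'})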

Solution

My data was clean and split into training and testing datasets. The data had to be manipulated a little to be as accurate as possible. I transformed the date fields into usable integers so that I would not have to feed string values into my model. I also dropped the Province/State column, since the model does not need data at that level of locality. I then removed the null values from the training dataset, because I require all fields to be filled in. The test dataset I kept as is, since it holds the data that I was transforming my training set to mimic.

To check the correlation between the variables and make sure that cases and deaths were worth pursuing, I graphed them. I took my training dataset and plotted the global confirmed cases and the global confirmed deaths.

testData = pd.read_csv('test.csv')
trainData = pd.read_csv('train.csv')

display(trainData.tail())
Id Province/State Country/Region Lat Long Date ConfirmedCases Fatalities
17887 26378 NaN Zambia -15.4167 28.2833 2020-03-20 2.0 0.0
17888 26379 NaN Zambia -15.4167 28.2833 2020-03-21 2.0 0.0
17889 26380 NaN Zambia -15.4167 28.2833 2020-03-22 3.0 0.0
17890 26381 NaN Zambia -15.4167 28.2833 2020-03-23 3.0 0.0
17891 26382 NaN Zambia -15.4167 28.2833 2020-03-24 3.0 0.0
display(testData.head())
ForecastId Province/State Country/Region Lat Long Date
0 1 NaN Afghanistan 33.0 65.0 2020-03-12
1 2 NaN Afghanistan 33.0 65.0 2020-03-13
2 3 NaN Afghanistan 33.0 65.0 2020-03-14
3 4 NaN Afghanistan 33.0 65.0 2020-03-15
4 5 NaN Afghanistan 33.0 65.0 2020-03-16

To make some observations, I need to format the data into usable integers for graphing. The date field will be transformed into an integer by removing all symbols and casting the type to int; for example, "2020-03-24" becomes 20200324. I will also remove the NaNs from the training set, as they provide no value. At the same time I will drop the Province/State column, since I want to focus on the country as a whole and not on individual states.

For the test data, I want to make sure that I have the same primary key, so I will use the date field and simply perform the same transformation as I did on the training dataset.

# Format date: strip the dashes and cast to int, e.g. "2020-03-24" -> 20200324
trainData["Date"] = trainData["Date"].apply(lambda x: x.replace("-", ""))
trainData["Date"] = trainData["Date"].astype(int)

testData["Date"] = testData["Date"].apply(lambda x: x.replace("-", ""))
testData["Date"] = testData["Date"].astype(int)

# Drop the Province/State column, then remove any remaining NaNs
trainData = trainData.drop(['Province/State'], axis=1)
trainData = trainData.dropna()
trainData.isnull().sum()

Id                0
Country/Region    0
Lat               0
Long              0
Date              0
ConfirmedCases    0
Fatalities        0
dtype: int64
testData.isnull().sum()
ForecastId           0
Province/State    6622
Country/Region       0
Lat                  0
Long                 0
Date                 0
dtype: int64
trainData.nunique()
Id                17892
Country/Region      163
Lat                 272
Long                276
Date                 63
ConfirmedCases     1023
Fatalities          204
dtype: int64
testData.nunique()
ForecastId        12212
Province/State      128
Country/Region      163
Lat                 272
Long                276
Date                 43
dtype: int64
confirmed_total_date = trainData.groupby(['Date']).agg({'ConfirmedCases':['sum']})
fatalities_total_date = trainData.groupby(['Date']).agg({'Fatalities':['sum']})
total_date = confirmed_total_date.join(fatalities_total_date)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
total_date.plot(ax=ax1)
ax1.set_title("Global confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
fatalities_total_date.plot(ax=ax2, color='orange')
ax2.set_title("Global deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)

[Figure: global confirmed cases (left) and global deceased cases (right) by date]

Observations

The main visible concern here is that the graph initially starts out as a flat, unaffected line. This could be because, in the wake of the new virus, the healthcare industry was unsure of the true classification of the virus, which would result in false readings and overlooked cases. When I focus my efforts on the United States, I will start the data at the point where the virus was recognized as an issue. Another issue that could be affecting the graph is the availability of tests for COVID-19. Looking at the confirmed cases, we can see that the graph grew exponentially around the same time the tests were distributed to the affected areas. This could affect the model.

matplotlib.style.use('ggplot')
plt.scatter(confirmed_total_date, fatalities_total_date)
plt.show()

[Figure: scatter plot of global confirmed cases against global fatalities]

I used the scipy library to measure the correlation between confirmed cases and confirmed deaths. The Spearman correlation gave a score of 1, and the Kendall tau correlation gave a score of .99. These scores show a correlation between the two variables: as confirmed cases go up, so do deaths. But, as I mentioned above, this could also reflect when tests were administered, since that is when they become available in bulk, or it could reflect the lack of response when the virus first appeared.

scipy.stats.spearmanr(confirmed_total_date, fatalities_total_date)[0]
1.0
scipy.stats.kendalltau(confirmed_total_date, fatalities_total_date)[0]
0.9999999999999999

United States Analysis

I want to isolate the United States data in the datasets to get a cleaner look. Once the coronavirus became apparent in the United States, tests were available to identify the virus. I anticipate that this analysis will differ from the rest of the world (the graph above), since this graph will not include China. China's influence on the data showed a delay in progress against the virus when it came to testing and identification, portrayed by the spikes in the data.

For the United States I expect the data to start out slow and then grow exponentially, since more tests were available near the beginning of the outbreak. China was unaware of the virus at first, which allowed it to spread. The United States, looking in from the outside, should have been able to catch the virus early and take measures early on to prevent widespread disease.

unitedStates = trainData[trainData['Country/Region']=='US']
unitedStates.describe()
Id Lat Long Date ConfirmedCases Fatalities
count 3654.000000 3654.000000 3654.000000 3.654000e+03 3654.000000 3654.000000
mean 22584.500000 37.771567 -84.323891 2.020024e+07 60.047072 0.813355
std 1557.201509 8.018508 46.996733 6.621358e+01 681.297097 7.030237
min 19903.000000 13.444300 -157.498300 2.020012e+07 0.000000 0.000000
25% 21236.250000 34.969700 -99.784000 2.020021e+07 0.000000 0.000000
50% 22584.500000 38.978600 -87.944200 2.020022e+07 0.000000 0.000000
75% 23932.750000 42.230200 -76.802100 2.020031e+07 0.000000 0.000000
max 25266.000000 61.370700 144.793700 2.020032e+07 25681.000000 210.000000
confirmed_total_date_US = unitedStates.groupby(['Date']).agg({'ConfirmedCases':['sum']})
fatalities_total_date_US = unitedStates.groupby(['Date']).agg({'Fatalities':['sum']})
total_date_US = confirmed_total_date_US.join(fatalities_total_date_US)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
total_date_US.plot(ax=ax1)
ax1.set_title("Global confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
fatalities_total_date_US.plot(ax=ax2, color='orange')
ax2.set_title("Global deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)

[Figure: US confirmed cases (left) and US deceased cases (right) by date]

Analysis

Looking at the same timeline, we see that there was a period when the virus was not being reported in the United States. But once the virus was reported, we can see that the cases exploded.

The United States was aware of the virus from when it originated in China. On Jan 21, 2020, the United States had its first case of the virus. A month later, on Feb 26, 2020, the United States had its first suspected local transmission. A short three days after that, the United States reported its first COVID-19 related death.

Looking at the graph above, we notice a spike in the data around early March, where the graph starts to grow. This is because on March 3, 2020, the CDC lifted restrictions on virus testing, meaning that people in close contact with someone diagnosed with COVID-19, or people with severe symptoms, could get tested.

Much like what had happened in China, the availability of tests to the general public made it clear that the virus had already spread across the population. Once testing arrived in the United States and people began to take the tests, the true volume of the virus came to light. Through the month of March the number of cases grew at an alarming rate, because the availability of the tests gave us a clearer picture.

I expect this number to keep growing, because tests are still not accessible to everyone. Once there are enough tests to cover everyone concerned about having the virus, I expect the data to reflect a steadier growth in cases and deaths.

matplotlib.style.use('ggplot')
plt.scatter(confirmed_total_date_US, fatalities_total_date_US)
plt.show()

[Figure: scatter plot of US confirmed cases against US fatalities]

scipy.stats.spearmanr(confirmed_total_date_US, fatalities_total_date_US)[0]
0.8073991628275585
scipy.stats.kendalltau(confirmed_total_date_US, fatalities_total_date_US)[0]
0.7657375167529012

Correlation

Using the United States data, the correlation between confirmed deaths and confirmed cases scored .80 for Spearman and .76 for Kendall tau. Even though this is not as strong as the global correlation, there is still a correlation between the confirmed number of cases and fatalities in the United States. Even though the raw numbers differ, they follow a similar pattern: if the number of cases grows, so will the number of deaths.
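
For intuition, Spearman's rho is just Pearson's correlation computed on the ranks of the values, and Kendall's tau counts concordant versus discordant pairs, so both measure monotonic association rather than strictly linear association. Here is a minimal sketch with toy numbers (illustrative only, not taken from the dataset):

# Toy cumulative counts: both series increase monotonically, so the rank
# orderings agree even though the growth itself is nonlinear.
toy_cases = np.array([1, 5, 20, 90, 400])
toy_deaths = np.array([0, 1, 2, 5, 12])

print(scipy.stats.spearmanr(toy_cases, toy_deaths)[0])   # 1.0
print(scipy.stats.kendalltau(toy_cases, toy_deaths)[0])  # 1.0

# Spearman by hand: Pearson correlation of the rank-transformed data.
print(np.corrcoef(scipy.stats.rankdata(toy_cases),
                  scipy.stats.rankdata(toy_deaths))[0, 1])  # also 1.0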

Random Forest Classification Model

The random forest classifier is an ensemble, tree-based learning algorithm. The method randomly selects subsets of the training set, fits a decision tree to each subset, and aggregates the votes from the different decision trees. This allows the model to predict the final class of a test object.

The algorithm is stable and will not produce significantly different results when new data is introduced. Because it averages over many trees, it is also far less prone to overfitting than a single decision tree. The algorithm can handle missing values and unscaled data points, but for this project I removed those concerns anyway, as it should lead to a better result.
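
To make the voting idea concrete, a fitted forest exposes its individual trees through the estimators_ attribute, so the ensemble prediction can be compared against a simple per-tree majority vote. The sketch below uses toy arrays of my own invention; note that scikit-learn actually averages the per-tree class probabilities, which on cleanly separable data like this agrees with a plain majority vote.

from sklearn.ensemble import RandomForestClassifier

# Toy data: two features, binary labels, purely for illustration.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Each tree was trained on its own bootstrap sample of the training set.
votes = np.array([tree.predict(X_toy) for tree in forest.estimators_])

# Majority vote across the 25 trees, per sample; expected to match the
# ensemble's own prediction on this toy data.
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(np.array_equal(majority, forest.predict(X_toy)))  # True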

from sklearn.ensemble import RandomForestClassifier
Tree_model = RandomForestClassifier(max_depth=200, random_state=0)
total_date_US
ConfirmedCases Fatalities
sum sum
Date
20200122 0.0 0.0
20200123 0.0 0.0
20200124 0.0 0.0
20200125 0.0 0.0
20200126 0.0 0.0
... ... ...
20200320 18967.0 241.0
20200321 25347.0 302.0
20200322 33083.0 413.0
20200323 43442.0 547.0
20200324 53490.0 698.0

63 rows × 2 columns

train_unitedStates = unitedStates.drop(['Country/Region'], axis=1)
trainData_US = train_unitedStates[['Lat','Long','Date']]
trainData_US
Lat Long Date
13482 32.3182 -86.9023 20200122
13483 32.3182 -86.9023 20200123
13484 32.3182 -86.9023 20200124
13485 32.3182 -86.9023 20200125
13486 32.3182 -86.9023 20200126
... ... ... ...
17131 42.7560 -107.3025 20200320
17132 42.7560 -107.3025 20200321
17133 42.7560 -107.3025 20200322
17134 42.7560 -107.3025 20200323
17135 42.7560 -107.3025 20200324

3654 rows × 3 columns

train_unitedStates.to_csv('unitedstatesdata.csv')
test_unitedStates = testData[testData['Country/Region']=='US']
test_unitedStates = test_unitedStates.drop(['Country/Region'], axis=1)
testData_US = test_unitedStates[['Lat','Long','Date']]
testData_US
Lat Long Date
9202 32.3182 -86.9023 20200312
9203 32.3182 -86.9023 20200313
9204 32.3182 -86.9023 20200314
9205 32.3182 -86.9023 20200315
9206 32.3182 -86.9023 20200316
... ... ... ...
11691 42.7560 -107.3025 20200419
11692 42.7560 -107.3025 20200420
11693 42.7560 -107.3025 20200421
11694 42.7560 -107.3025 20200422
11695 42.7560 -107.3025 20200423

2494 rows × 3 columns

testData_US.tail()
Lat Long Date
11691 42.756 -107.3025 20200419
11692 42.756 -107.3025 20200420
11693 42.756 -107.3025 20200421
11694 42.756 -107.3025 20200422
11695 42.756 -107.3025 20200423
train_unitedStates_Confirmed=train_unitedStates[['ConfirmedCases']]
train_unitedStates_Confirmed.describe()
ConfirmedCases
count 3654.000000
mean 60.047072
std 681.297097
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 25681.000000
train_unitedStates_Fatalities=train_unitedStates[['Fatalities']]
train_unitedStates_Fatalities.describe()
Fatalities
count 3654.000000
mean 0.813355
std 7.030237
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 210.000000
x = trainData_US
y1 = train_unitedStates_Confirmed
y2 = train_unitedStates_Fatalities
x_test = testData_US
Tree_model.fit(x, y1.values.ravel())  # ravel() flattens y to 1d, avoiding a DataConversionWarning
predConfirmed = Tree_model.predict(x_test)
predConfirmed = pd.DataFrame(predConfirmed)
predConfirmed.columns = ["ConfirmedCases_prediction"]
predConfirmed
ConfirmedCases_prediction
0 0.0
1 0.0
2 6.0
3 12.0
4 12.0
... ...
2489 29.0
2490 29.0
2491 29.0
2492 29.0
2493 29.0

2494 rows × 1 columns

Tree_model.fit(x, y2.values.ravel())  # again flatten y to a 1d array
predFatalities = Tree_model.predict(x_test)
predFatalities = pd.DataFrame(predFatalities)
predFatalities.columns = ["Fatalities_prediction"]
predFatalities
Fatalities_prediction
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
... ...
2489 0.0
2490 0.0
2491 0.0
2492 0.0
2493 0.0

2494 rows × 1 columns

predictions = predConfirmed.join(predFatalities)
predictions
ConfirmedCases_prediction Fatalities_prediction
0 0.0 0.0
1 0.0 0.0
2 6.0 0.0
3 12.0 0.0
4 12.0 0.0
... ... ...
2489 29.0 0.0
2490 29.0 0.0
2491 29.0 0.0
2492 29.0 0.0
2493 29.0 0.0

2494 rows × 2 columns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(17,7))
predictions.plot(ax=ax1)
ax1.set_title("US confirmed cases", size=13)
ax1.set_ylabel("Number of cases", size=13)
ax1.set_xlabel("Date", size=13)
predFatalities.plot(ax=ax2, color='orange')
ax2.set_title("US deceased cases", size=13)
ax2.set_ylabel("Number of cases", size=13)
ax2.set_xlabel("Date", size=13)

[Figure: predicted US cases (left) and predicted US fatalities (right)]

predictions.describe()
ConfirmedCases_prediction Fatalities_prediction
count 2494.000000 2494.000000
mean 690.180834 8.576183
std 2943.759642 27.336496
min 0.000000 0.000000
25% 30.000000 0.000000
50% 105.000000 1.000000
75% 368.000000 6.000000
max 25681.000000 210.000000
matplotlib.style.use('ggplot')
plt.scatter(predConfirmed, predFatalities)
plt.show()

[Figure: scatter plot of predicted confirmed cases against predicted fatalities]

scipy.stats.spearmanr(predConfirmed, predFatalities)[0]
0.8056651838655811
scipy.stats.kendalltau(predConfirmed, predFatalities)[0]
0.6666551009677553
confirmedCasesPrint = predictions.sum().iloc[0]
fatalitiesPrint = predictions.sum().iloc[1]
print('By 4/23/2020 the below predictions were made: ')
print('{} more people in the US will contract the disease.'.format(confirmedCasesPrint))
print('As a result I predict that {} people will expire.'.format(fatalitiesPrint))
By 4/23/2020 the below predictions were made:
1721311.0 more people in the US will contract the disease.
As a result I predict that 21389.0 people will expire.

Conclusion

Using this model, I was able to predict the following: by 4/23/2020, 1,721,311 more people in the US will contract the disease, and as a result I predict that 21,389 people will expire. Between these predictions the correlation score was .80 for Spearman and .66 for Kendall tau. These scores follow the same pattern as the observed data, which suggests the numbers are plausible. Given the timeline, the skyrocketing counts, and the correlation of the predicted numbers, I feel confident in the prediction.

Disclaimer

This project and its data have been directly affected by the distribution of tests across the globe. In the data and graphs we can see the spikes where tests became available by location. In the United States we can see when tests were available and when supplies began to deplete. I suspect that those spikes affected my outcome and the model I trained on that data. I also suspect that the correlation between the two variables is accurate even if the raw numbers may be off.

https://youtu.be/Vg55Ry7Q7VU

https://github.com/SudoZyphr/covid19/blob/master/Covid-19.pdf

Resources:

  1. https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/summary.html
  2. https://ourworldindata.org/coronavirus
  3. https://www.who.int/emergencies/diseases/novel-coronavirus-2019
  4. https://www.worldometers.info/coronavirus/
  5. https://informationisbeautiful.net/visualizations/covid-19-coronavirus-infographic-datapack/
  6. https://www.ecdc.europa.eu/en/publications-data/rapid-risk-assessment-novel-coronavirus-disease-2019-covid-19-pandemic-increased
  7. https://www.barrons.com/articles/latest-coronavirus-data-show-disease-continues-to-spread-even-in-the-u-s-51584224660
  8. https://www.theguardian.com/world/2020/mar/13/coronavirus-pandemic-visualising-the-global-crisis
  9. https://ourworldindata.org/coronavirus-source-data
  10. https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset
  11. https://www.sciencedaily.com/releases/2020/03/200317175442.htm
  12. https://www.worldometers.info/coronavirus/coronavirus-age-sex-demographics/