EDA for top 10 worst affected Indian states from SARS-CoV-2

Exploratory data analysis (EDA) is what every data science enthusiast likes to do whenever they come across an interesting dataset. The same thing happened to me when I got hold of Covid-19 data for India.
I decided to analyze the dataset and find the top 10 states with the highest number of confirmed cases across India.
I have used a line chart to depict the data and Apache Superset to visualize my conclusions.

Source: https://www.kaggle.com/sudalairajkumar/covid19-in-india?select=covid_19_india.csv

LINE CHART:
A line chart is a type of time-series chart: it displays data over time and is best used to discover trends and patterns. It offers a straightforward way to convey changes over time.
Now let's create a line chart on the covid_19_india dataset from scratch in Superset.


STEP 1: Selecting a Line Chart.
Start by logging in to your Superset account and uploading the CSV file for analysis.
NOTE: Make sure that your dataset has a well-defined date/time column.
Once you are logged in, select the "Upload a CSV" option under the "Sources" section.
It looks like this:

Now fill in the dataset details such as "Table Name" and "CSV File", and select the "Database" where you wish to save your file as a table.

Now comes the most important step when dealing with time-series charts: we have to parse the date/time column so that it is stored as dates rather than plain text, which the time-based chart options depend on.

As you can see in the picture above, the "Date" column, which is the date/time feature in our dataset, is set to be parsed as dates; we then click Save.
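For readers who like to see the same idea in code, here is a minimal pandas sketch of what parsing the date column amounts to. It is only an analogy to the Superset upload form, not what Superset runs internally, and the sample rows (including the placeholder state names) are made up for illustration; only the column names come from the dataset described above.

```python
import io

import pandas as pd

# Made-up sample rows for illustration; only the column names follow the article.
SAMPLE_CSV = """Date,State/UnionTerritory,Confirmed
2020-03-01,State A,3
2020-03-02,State A,3
2020-03-02,State B,1
"""

# parse_dates converts the "Date" strings into datetime64 values on load,
# which is roughly what marking the column as a date in the upload form does.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=["Date"])

# The Date column should now be a datetime type instead of plain text.
print(df.dtypes)
```

Without this step the column would stay as text, and grouping by month or filtering by a time range would not be possible.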
Once our data is saved as a table, we proceed with creating the line chart.
Select Charts under the "New" option at the rightmost corner of the screen.

Now choose your saved data source under the "Create New Chart" section, select "Line Chart" as your visualization type, and press the "CREATE NEW CHART" button.

STEP 2: Defining the Chart's data

Okay, now that you have selected a visualization, let's tell Superset which data to use for your new line chart. At this point, your screen should look something like this:

Let's start with the left side of the screen. You'll notice that the Data tab is selected by default, and the Datasource & Chart Type panel shows the selections you just made:

In the Time section under the Data tab, you will find Time Column set to "Date", since we parsed it at the start of this process. Time Grain and Time Range can be set as needed.
For example, to make the line chart's granularity monthly, choose the "month" option in the Time Grain field. To show only data from the last quarter, choose "Last quarter" in the Time Range field, or simply select "No filter" to include all dates.
Here’s what that looks like:

The last section, Query, lets us specify exactly what data to include in the line chart.
Let's start with the Metrics field. This field allows us to select one or more metrics to include in the chart. In this example, we're going to show monthly confirmed Covid-19 cases in India, so in the Metrics field we select Confirmed.
An additional panel will appear asking what type of aggregate to use. We'll stick with the SUM option and click Save.
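To make the combination of a monthly Time Grain and a SUM metric more concrete, here is a rough pandas equivalent. This is a sketch under my own assumptions about what the chart computes (bucket by calendar month, then add up the values in each bucket); the numbers are made up for illustration.

```python
import pandas as pd

# Made-up daily rows for illustration only.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02"]
        ),
        "Confirmed": [10, 20, 30, 40],
    }
)

# Time Grain = month: bucket each date into its calendar month.
# Metric = SUM(Confirmed): add up the values inside each bucket.
monthly = df.groupby(df["Date"].dt.to_period("M"))["Confirmed"].sum()

print(monthly)
```

Each point on the line chart then corresponds to one row of this monthly result.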

Okay, let's now see what our line chart looks like by clicking the Run button above the Data tab.

Now, to analyze our data state-wise, we use the Group by field below the Metrics field and select "State/UnionTerritory".
Our graph will now look like this:
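Adding a Group by column turns the single line into one line per state. A minimal pandas sketch of that reshaping, assuming the same month-plus-SUM aggregation as before and using made-up rows with placeholder state names:

```python
import pandas as pd

# Made-up rows for illustration only; state names are placeholders.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2020-03-30", "2020-03-31", "2020-04-01", "2020-04-01"]
        ),
        "State/UnionTerritory": ["State A", "State A", "State A", "State B"],
        "Confirmed": [10, 20, 30, 5],
    }
)

# Group by month AND state, sum the metric, then spread the states into
# columns so that each column is one line (series) on the chart.
per_state = (
    df.groupby([df["Date"].dt.to_period("M"), "State/UnionTerritory"])["Confirmed"]
    .sum()
    .unstack("State/UnionTerritory")
)

print(per_state)
```

Months in which a state has no rows show up as missing values, which is why a line can start later than the others.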

To get specific time-based data, just hover your cursor over the line chart to view a breakdown of information for that day.

Feel free to play around with the Data tab settings to see how your data can be visualized differently. Maybe change the time range? Time granularity? Metrics? Visualization type?
Just make a change and then select Run Query whenever you want to see your new visualization!
But to stick to our problem statement, extracting the top 10 worst-hit states, we need to set "Series limit" to 10 and then proceed to customization.
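Here is a small pandas sketch of what a series limit does, assuming it keeps the groups with the largest total of the chosen metric over the selected time range. The helper name `top_series` is my own, the data is made up, and I use a limit of 2 just to keep the example short.

```python
import pandas as pd

# Made-up rows for illustration only; state names are placeholders.
df = pd.DataFrame(
    {
        "State/UnionTerritory": ["A", "A", "B", "C", "C", "D"],
        "Confirmed": [5, 7, 3, 20, 1, 9],
    }
)


def top_series(frame: pd.DataFrame, limit: int) -> list:
    """Return the `limit` states with the largest summed Confirmed value."""
    totals = frame.groupby("State/UnionTerritory")["Confirmed"].sum()
    return totals.nlargest(limit).index.tolist()


# Keep only the rows that belong to the top states (limit=10 in the article).
top = top_series(df, limit=2)
filtered = df[df["State/UnionTerritory"].isin(top)]

print(top)
```

With the real dataset, calling the helper with `limit=10` would give the list of states that remain on the chart.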
Now, let’s have a look at how Superset can help you to customize your chart.

STEP 3: Customizing your chart.

Customization of the chart is done under the Customize tab. This lets us experiment with the chart's appearance and define how our data is displayed.
In the Chart Options panel, you can make changes to the appearance of your chart, such as selecting a different color scheme or toggling the legend, markers, and tooltips.

The X and Y Axis panels enable you to customize how each axis is presented within the line chart. In the example below, we added a label and a log scale to the chart.
This makes our line chart look like this:
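A quick aside on why a log scale helps when states differ by orders of magnitude. The sketch below uses made-up values and plain Python; it only illustrates the idea that a log axis places values by their logarithm.

```python
import math

# Made-up values spanning several orders of magnitude.
values = [10, 1_000, 100_000]

# On a log axis, each value is positioned by its base-10 logarithm,
# so every 100x jump takes up the same vertical distance.
log_positions = [math.log10(v) for v in values]

print(log_positions)
```

On a linear axis the smallest series would be squashed against zero, while on the log axis all three remain readable.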

Conclusion:

The top 10 most affected states are Andhra Pradesh, Delhi, Karnataka, Kerala, Maharashtra, Odisha, Tamil Nadu, Telangana, Uttar Pradesh, and West Bengal.

Throughout my investigation and analysis of the dataset, I found the above-mentioned states to be the worst affected by the coronavirus in India. Apache Superset helped me a lot in reaching these findings. Its simple user interface and wide range of charts made my analysis more insightful and easier to understand.