Key tools/skills:
- Dataset Lomba 04 Sept 2020 - DQLab as a data source
- PostgreSQL as a data manipulation tool
- Metabase as a data visualization tool
- Data Analysis
- Data Visualization
Project : https://github.com/nairkivm/store-sales-data-analysis
Introduction
Previously, I created an ETL data pipeline for an e-commerce store. Recently, I discovered another intriguing e-commerce dataset that I wanted to analyze. This personal project aims to enhance my data analytics and visualization skills while gaining more insights into e-commerce data. In this blog, I’ll walk you through how I familiarized myself with the data by tackling SQL exercises, preparing the data, building an analysis report, and presenting the findings via a dashboard (with some explanations).
Getting exercise with some SQL problems
This dataset was originally published by the learning platform DQLab as part of a data challenge in 2020. It is also featured in their SQL learning course, which involves solving real-world case examples. The course is divided into three chapters based on difficulty and complexity. Some of the problems include:
- Understanding the Tables: Analyzing row numbers, column lengths, null values, distinct values, etc.
- Top Buyers: Identifying the top buyers of all time.
- String Manipulation: Extracting user emails.
- Greatest Mean Value Buyers: Finding buyers with the highest average purchase value.
- Big Orders: Identifying large orders in a specific month.
- Dropshipper Identification: Detecting dropshippers.
- Average Payment Duration: Calculating the average payment duration.
You can explore these questions by enrolling in their course (no paid endorsement, just appreciation). I have also modified some of the questions and solved them here.
Data Preparation
I loaded the data into my local PostgreSQL database and made some modifications. I added rating
column to orders
table.
-- Add 'rating' column
alter table orders
add column rating int;
-- Fill `rating` column with random ratings value in range 0-5
-- and add some right-skewness (with factor of 0.4) to the data distribution
update orders
set rating = floor((random() ^ 0.4) * 6)::int;
-- Move some '3-rate' values to the `0-rate` to make the data more 'realistic'
update orders
set rating = 0
where rating = 3 and random() < 0.5;
I also added latitude
, longitude
, and age
columns to users
table.
-- Add 'latitude' & 'longitude' columns
alter table users
add column latitude float,
add column longitude float;
-- Because 'latitude' & 'longitude' are coupled to represent a single meaningful location,
-- we have to add them simultaneously (in this case, using json data type)
with random_locations as (
select
user_id,
case
-- Jakarta
when random() < 0.1 then format('{"latitude": "%s", "longitude": "%s"}',
-6.2 + (random() * ( -6.1 - (-6.2) )), 106.7 + (random() * ( 106.9 - 106.7 )) )::json
-- Bogor
when random() < 0.2 then format('{"latitude": "%s", "longitude": "%s"}',
-6.6 + (random() * ( -6.5 - (-6.6) )), 106.8 + (random() * ( 106.9 - 106.8 )) )::json
-- Depok
when random() < 0.3 then format('{"latitude": "%s", "longitude": "%s"}',
-6.4 + (random() * ( -6.3 - (-6.4) )), 106.7 + (random() * ( 106.8 - 106.7 )) )::json
-- Tangerang
when random() < 0.4 then format('{"latitude": "%s", "longitude": "%s"}',
-6.2 + (random() * ( -6.1 - (-6.2) )), 106.6 + (random() * ( 106.7 - 106.6 )) )::json
-- Bekasi
when random() < 0.5 then format('{"latitude": "%s", "longitude": "%s"}',
-6.3 + (random() * ( -6.2 - (-6.3) )), 106.9 + (random() * ( 107.0 - 106.9 )) )::json
-- South Tangerang (Tangerang Selatan)
when random() < 0.6 then format('{"latitude": "%s", "longitude": "%s"}',
-6.3 + (random() * ( -6.2 - (-6.3) )), 106.7 + (random() * ( 106.8 - 106.7 )) )::json
-- Karawang
when random() < 0.7 then format('{"latitude": "%s", "longitude": "%s"}',
-6.4 + (random() * ( -6.3 - (-6.4) )), 107.2 + (random() * ( 107.3 - 107.2 )) )::json
-- Cikarang
when random() < 0.8 then format('{"latitude": "%s", "longitude": "%s"}',
-6.3 + (random() * ( -6.2 - (-6.3) )), 107.1 + (random() * ( 107.2 - 107.1 )) )::json
-- Serang
when random() < 0.9 then format('{"latitude": "%s", "longitude": "%s"}',
-6.2 + (random() * ( -6.1 - (-6.2) )), 106.1 + (random() * ( 106.2 - 106.1 )) )::json
-- Cilegon
else format('{"latitude": "%s", "longitude": "%s"}',
-6.0 + (random() * ( -5.9 - (-6.0) )), 106.0 + (random() * ( 106.1 - 106.0 )) )::json
end::json as location
from users
)
-- Fill the columns
update users u
set latitude = cast(r.location ->> 'latitude' as float),
longitude = cast(r.location ->> 'longitude' as float)
from random_locations r
where u.user_id = r.user_id
;
-- Add 'age' column
alter table users
add column age int;
-- Fill the column with random age between 18 and 63
update users
set age = floor(18 + (random() * ( 63 - 18 )));
-- Add 'skewness' to the data (right-skew for age <= 32)
update users
set age = floor(18 + (random()^0.6 * ( 63 - 18 )))
where age <= 32;
-- Add 'skewness' to the data (left-skew for age >= 49)
update users
set age = floor(18 + (random()^1.2 * ( 63 - 18 )))
where age >= 49;
Building Analysis Report
I created the analysis report based on the Metabase example dashboard as a reference. The dashboard is divided into four pages:
- Overall Business Metrics: This page displays general metrics such as total revenues, active users, and order counts.
- Portfolio Performance: Here, we delve into factors affecting revenue, including top-selling products or categories and customer satisfaction.
- Demographics: This section provides insights into our customers, including who they are and where they are located.
- Retention Rate: The final page concludes with retention rate metrics, highlighting customer loyalty and its impact on revenue stability.
In Metabase, a dashboard consists of several “questions,” which are essentially SQL queries designed to “answer the problem.” These questions can be visualized using various options such as metrics, trends, tables, pivot tables, maps, and graphs. Each metric mentioned above requires a corresponding query. Fortunately, Metabase’s GUI simplifies this task. You can find the detailed question scripts in this repository.
Analysis Report
What’s the situation with this store’s business?
Overall, the store’s performance is very promising. In May 2020, the store generated a total revenue of 29 billion rupiah, marking a 46% increase from April 2020. To achieve this, they allocated 234 million rupiah for discounts, which is about 8% of the generated revenue. The number of orders and active customers also increased by 25% (9,327 orders) and 28% (8,972 users) from the previous month, respectively. The trends in orders and revenue, both overall and by each channel, have been increasing over time, except for a dip in January 2020, likely due to the end-of-year holidays. By the second month of the second quarter, they had achieved 49% of their quarterly revenue target.
What factors contribute to this store’s revenue?
The top three best-selling products of all time are women’s accessories, men’s wear, and women’s wear. However, the top three best-selling product categories are personal care, fresh food, and men’s wear. The trends indicate that personal care sales significantly outperform other categories. From the customer perspective, the store has received a good rating, with a score of 3.47 out of 5.
Who and where are the customers?
The customers are primarily located in the Jabodetabek area and its neighboring cities, with the majority coming from Tangerang and Tangerang Selatan. The average customer is aged between 40-45 years. There is no significant difference in the preferred categories and channels among different age groups.
How loyal are the customers?
As of May 2020, 25% of the store’s monthly active users are repeat customers. The rate of new users is decreasing over time, meaning most active users are returning customers. From the cohort analysis and further examination,
Month | New User | Contribution to May 2020 | Users Returned at May 2020 |
---|---|---|---|
Jan 2019 | 117 | 49.57% | 58 |
Feb 2019 | 347 | 52.16% | 181 |
Mar 2019 | 633 | 54.50% | 345 |
Apr 2019 | 898 | 52.56% | 472 |
May 2019 | 1250 | 52.40% | 655 |
Jun 2019 | 1489 | 53.93% | 803 |
Jul 2019 | 1832 | 53.60% | 982 |
Aug 2019 | 1952 | 53.69% | 1048 |
Sep 2019 | 2018 | 54.36% | 1097 |
Oct 2019 | 1967 | 54.65% | 1075 |
Nov 2019 | 1850 | 52.27% | 967 |
Dec 2019 | 1567 | 53.86% | 844 |
Jan 2020 | 542 | 52.21% | 283 |
Feb 2020 | 480 | 49.58% | 238 |
Mar 2020 | 385 | 47.53% | 183 |
Apr 2020 | 268 | 36.18% | 97 |
we can see that users in May 2020 are predominantly new users from September, October, and August 2019. It appears that customers tend to return and make purchases again after more than one month.
Posted at 2024-10-05