Ever wonder how Chicagoans and visitors zip around the city? At Cyclistic, we were curious too! That’s why we delved into a year-long exploration, diving deep into our ride data. Our mission? To understand the unique ways our riders experience the Windy City.
Guess what we found? Fascinating trends and patterns that are now steering our next big moves. Armed with these insights, we’re fine-tuning our services and gearing up our marketing game to give you an even smoother ride around Chicago. Because for us, it’s not just about getting from point A to B; it’s about enhancing your journey every pedal of the way.
Ready to be a part of Cyclistic’s exciting future in urban transportation? Hop on, and let’s ride the end together!
The Marketing Analytics team has established a clear directive:
To realize this ambition, the team recognizes the need for a deeper understanding of several facets:
In addition to the above, there’s a marked interest from the Marketing Analytics team in dissecting Cyclistic’s historical bike trip data to pinpoint pivotal trends.
While the primary focus remains on converting casual riders into annual members, considering the significant presence of casual riders, I suggest a dual-strategy approach. Not only should we direct efforts towards conversion, but we must also amplify our engagement strategies targeting the casual rider segment. By resonating with the preferences and requirements of this substantial segment, we may uncover greater opportunities and ensure elevated user satisfaction across the board.
My analysis indicates that Cyclistic’s annual members are mainly workday commuters who prefer shorter rides. At the same time, casual riders opt for relaxed weekend explorations. As summer is the prime biking season, I recommend implementing strategies tailored to these insights, such as offering a limited Spring 2024 Pass, exclusive digital perks, and targeted SEO campaigns. By leveraging these insights, Cyclistic can convert casual riders into committed members. Check out the report for a more detailed analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## âś” dplyr 1.1.3 âś” readr 2.1.4
## âś” forcats 1.0.0 âś” stringr 1.5.0
## âś” ggplot2 3.4.3 âś” tibble 3.2.1
## âś” lubridate 1.9.3 âś” tidyr 1.3.0
## âś” purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
## â„ą Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(ggplot2)
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(dplyr)
library(hms)
##
## Attaching package: 'hms'
##
## The following object is masked from 'package:lubridate':
##
## hms
I’ve sourced 12 months of trip data for Cyclistic from July 2022 to
June 2023. Each month’s dataset resides in its own .csv
file. This first-party data, which is directly gathered and utilized by
Cyclistic, is the backbone of my analysis. For this capstone, the data
is licensed from “divvybikes.com,” an established bike-sharing service
in Chicago. It’s worth noting that all riders’ personal information has
been thoroughly anonymized for privacy reasons.
jun23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202306-divvy-tripdata.csv")
## Rows: 719618 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
may23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202305-divvy-tripdata.csv")
## Rows: 604827 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202304-divvy-tripdata.csv")
## Rows: 426590 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202303-divvy-tripdata.csv")
## Rows: 258678 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202302-divvy-tripdata.csv")
## Rows: 190445 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
jan23_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202301-divvy-tripdata.csv")
## Rows: 190301 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202212-divvy-tripdata.csv")
## Rows: 181806 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202211-divvy-tripdata.csv")
## Rows: 337735 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202210-divvy-tripdata.csv")
## Rows: 558685 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202209-divvy-publictripdata.csv")
## Rows: 701339 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202208-divvy-tripdata.csv")
## Rows: 785932 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul22_df <- read_csv("C:/Users/irmin/OneDrive/Desktop/Case_Study_R/202207-divvy-tripdata.csv")
## Rows: 823488 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## â„ą Use `spec()` to retrieve the full column specification for this data.
## â„ą Specify the column types or set `show_col_types = FALSE` to quiet this message.
Reliable: The dataset is both complete and accurate, encapsulating every bike ride in Chicago during our analysis period.
Original: The data has been sourced from the Google Data Analytics Capstone: Complete a Case Study.
Comprehensive: It provides a holistic view of each ride, detailing the start/end times, station names and IDs, membership types, and more.
Current: The dataset is recent, covering up to the end of June 2023.
Cited: We’ve adhered to data licensing terms, sourcing our information under the appropriate Data License Agreement.
In this section, I will verify if all the datasets have the same column structure. This ensures that they’re consistent and compatible for my further analysis.
all_same_columns <- all(
identical(colnames(jun23_df), colnames(may23_df)),
identical(colnames(jun23_df), colnames(apr23_df)),
identical(colnames(jun23_df), colnames(mar23_df)),
identical(colnames(jun23_df), colnames(feb23_df)),
identical(colnames(jun23_df), colnames(jan23_df)),
identical(colnames(jun23_df), colnames(dec22_df)),
identical(colnames(jun23_df), colnames(nov22_df)),
identical(colnames(jun23_df), colnames(oct22_df)),
identical(colnames(jun23_df), colnames(sep22_df)),
identical(colnames(jun23_df), colnames(aug22_df)),
identical(colnames(jun23_df), colnames(jul22_df))
)
if (all_same_columns) {
cat("All datasets have the same columns.\n")
} else {
cat("Not all datasets have the same columns.\n")
}
## All datasets have the same columns.
cyclistic_df <- rbind(jun23_df, may23_df, apr23_df, mar23_df, feb23_df, jan23_df, dec22_df, nov22_df, oct22_df, sep22_df, aug22_df, jul22_df)
trimmed_cyclistic_df <- select(cyclistic_df, rideable_type, started_at, ended_at, member_casual)
cyclistic_new_df <- trimmed_cyclistic_df
cyclistic_new_df$ride_length <- difftime(cyclistic_new_df$ended_at, cyclistic_new_df$started_at, units = "mins")
cyclistic_new_df$ride_length <- round(cyclistic_new_df$ride_length, digits = 1)
cyclistic_new_df$date <- as.Date(cyclistic_new_df$started_at)
cyclistic_new_df$day_of_week <- format(as.Date(cyclistic_new_df$date),"%A")
cyclistic_new_df$month <- format(as.Date(cyclistic_new_df$date), "%m")
cyclistic_new_df$day <- format(as.Date(cyclistic_new_df$date), "%d")
cyclistic_new_df$year <- format(as.Date(cyclistic_new_df$date), "%Y")
cyclistic_new_df$time <- as_hms(cyclistic_new_df$started_at)
cyclistic_new_df$hour <- hour(cyclistic_new_df$time)
head(cyclistic_new_df)
## # A tibble: 6 Ă— 12
## rideable_type started_at ended_at member_casual
## <chr> <dttm> <dttm> <chr>
## 1 electric_bike 2023-06-05 13:34:12 2023-06-05 14:31:56 member
## 2 electric_bike 2023-06-05 01:30:22 2023-06-05 01:33:06 member
## 3 electric_bike 2023-06-20 18:15:49 2023-06-20 18:32:05 member
## 4 electric_bike 2023-06-19 14:56:00 2023-06-19 15:00:35 member
## 5 electric_bike 2023-06-19 15:03:34 2023-06-19 15:07:16 member
## 6 electric_bike 2023-06-09 21:30:25 2023-06-09 21:49:52 member
## # â„ą 8 more variables: ride_length <drtn>, date <date>, day_of_week <chr>,
## # month <chr>, day <chr>, year <chr>, time <time>, hour <int>
cyclistic_new_df <- na.omit(cyclistic_new_df)
cyclistic_new_df <- distinct(cyclistic_new_df)
cyclistic_new_df <- cyclistic_new_df[!(cyclistic_new_df$ride_length <=0),]
overall_stats <- cyclistic_new_df %>%
summarise(
total_rides = n(),
avg_ride_length = mean(ride_length, na.rm = TRUE),
busiest_time = hms::as_hms(as.numeric(names(which.max(table(hour)))) * 3600),
busiest_weekday = names(which.max(table(day_of_week))),
busiest_month = month.name[as.numeric(names(which.max(table(month))))],
most_popular_bike = names(which.max(table(rideable_type))),
most_rides = ifelse(sum(member_casual == "member") > sum(member_casual == "casual"), "Members", "Casual Riders")
)
overall_stats
## # A tibble: 1 Ă— 7
## total_rides avg_ride_length busiest_time busiest_weekday busiest_month
## <int> <drtn> <time> <chr> <chr>
## 1 5768331 18.37589 mins 17:00 Saturday July
## # â„ą 2 more variables: most_popular_bike <chr>, most_rides <chr>
stats_by_type <- cyclistic_new_df %>%
group_by(member_casual) %>%
summarise(
total_rides = n(),
avg_ride_length = mean(ride_length, na.rm = TRUE),
busiest_time = names(which.max(table(hour))),
busiest_weekday = names(which.max(table(day_of_week))),
busiest_month = names(which.max(table(month))),
most_popular_bike = names(which.max(table(rideable_type)))
)
stats_by_type
## # A tibble: 2 Ă— 7
## member_casual total_rides avg_ride_length busiest_time busiest_weekday
## <chr> <int> <drtn> <chr> <chr>
## 1 casual 2240782 27.76864 mins 17 Saturday
## 2 member 3527549 12.40939 mins 17 Wednesday
## # â„ą 2 more variables: busiest_month <chr>, most_popular_bike <chr>
avg_ride_by_weekday <- cyclistic_new_df %>%
group_by(member_casual, day_of_week) %>%
summarise(average_ride_length = mean(ride_length, na.rm = TRUE), .groups = 'drop') %>%
spread(key = member_casual, value = average_ride_length) %>%
arrange(desc(casual), desc(member))
avg_ride_by_weekday
## # A tibble: 7 Ă— 3
## day_of_week casual member
## <chr> <drtn> <drtn>
## 1 Sunday 32.82831 mins 13.69855 mins
## 2 Saturday 32.39272 mins 13.98907 mins
## 3 Monday 27.21524 mins 11.79224 mins
## 4 Friday 26.95822 mins 12.33629 mins
## 5 Tuesday 24.24343 mins 11.86197 mins
## 6 Thursday 23.29346 mins 11.91361 mins
## 7 Wednesday 23.28850 mins 11.84987 mins
most_rides <- cyclistic_new_df %>%
count(member_casual) %>%
arrange(desc(n))
most_rides
## # A tibble: 2 Ă— 2
## member_casual n
## <chr> <int>
## 1 member 3527549
## 2 casual 2240782
stats_by_type <- stats_by_type %>%
arrange(desc(member_casual)) %>%
mutate(
percent = total_rides / sum(total_rides) * 100,
label = sprintf("%s\n%.1f%%", scales::comma(total_rides), percent),
lab.y = cumsum(total_rides) - 0.5 * total_rides
)
stats_by_type %>%
ggplot(aes(x = "", y = total_rides, fill = member_casual)) +
geom_bar(width = 1, stat = "identity") +
geom_text(aes(label = label, y = lab.y), size = 4) +
coord_polar(theta = "y") +
labs(title = "Total Rides By Customer Type", fill = "Customer Type", caption = "Data represents rides from Jan to Dec 2023.") +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)
stats_by_type %>%
ggplot(aes(x = member_casual, y = avg_ride_length, fill = member_casual)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = sprintf("%.2f mins", avg_ride_length)), vjust = -0.25, size = 4) +
labs(title = "Average Ride Length By Customer Type",
y = "Ride Length (minutes)",
x = "Customer Type") +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(face = "bold", size = 12))
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
avg_ride_by_weekday <- cyclistic_new_df %>%
group_by(member_casual, day_of_week) %>%
summarize(average_ride_length = mean(ride_length, na.rm = TRUE)) %>%
arrange(member_casual, day_of_week)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
avg_ride_by_weekday %>%
ggplot(aes(x = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")),
y = average_ride_length, fill = member_casual)) +
geom_col(position = "dodge", width = 0.8) +
labs(title = "Average Ride Length by Weekday",
y = "Average Ride Length (minutes)",
x = "") +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_light() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
## Don't know how to automatically pick scale for object of type <difftime>.
## Defaulting to continuous.
stats_by_time_function <- function(df) {
df %>%
group_by(hour, member_casual) %>%
summarize(number_of_trips = n()) %>%
arrange(hour, member_casual)
}
stats_by_time <- stats_by_time_function(cyclistic_new_df)
## `summarise()` has grouped output by 'hour'. You can override using the
## `.groups` argument.
stats_by_time %>%
ggplot(aes(x = hour, y = number_of_trips, fill = member_casual)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Bike Demand by Hour",
y = "Number of Rides",
x = "Hour of Day") +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
stats_by_weekday_function <- function(df) {
df %>%
group_by(day_of_week, member_casual) %>%
summarize(number_of_rides = n(), .groups = "drop") %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
arrange(day_of_week)
}
plot_by_weekday <- function(data) {
data %>%
ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Bike Rides by Day of the Week",
y = "Number of Rides",
x = "Day of the Week") +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
}
weekday_data <- stats_by_weekday_function(cyclistic_new_df)
plot_by_weekday(weekday_data)
stats_by_month_function <- function(df) {
df %>%
group_by(month, member_casual) %>%
summarize(number_of_rides = n(), .groups = "drop") %>%
arrange(match(month, month.name), member_casual)
}
stats_by_month_data <- stats_by_month_function(cyclistic_new_df)
stats_by_month_data %>%
ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
geom_col(position = "dodge", width = 0.7) +
labs(title = "Bike Rides by Month of the Year",
y = "Number of Rides",
x = "Month") +
scale_y_continuous(labels = scales::comma) +
scale_fill_manual(values = c("casual" = "#B3DDF2", "member" = "#3F81B6")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
legend.title = element_blank())
Amplify brand awareness with targeted campaigns at high-frequency stations during peak hours (5 p.m.) and summer months.
Weekend Wonders: Collaborate with renowned influencers to spotlight Chicago’s prime locations best explored by bike.
Healthy Commutes: Promote the health benefits of daily bike commuting.
Transformational Tales: Share impactful stories of individuals whose lives transformed positively through regular cycling.
Unveil time-limited membership deals targeting casual riders on emails, social media, and partner apps.
Capitalize on summer’s allure with offers such as “Summer Special 2024” or September’s “Ride into Autumn 2024” to captivate casual riders.
Launch a “Savings Calculator” online, empowering casual riders to visualize potential savings with memberships.
Search Intent: Boost organic reach with an SEO campaign focusing on queries about cycling in Chicago.
Engaging Queries: Address questions like “Best bike spots in Chicago” or “Chicago bike rental costs” to build a memorable brand presence.