• Cyclistic Share Bike Capstone R Project

    How Does a Bike-Share Navigate Speedy Success?


    Ask Questions:

    1.       How do annual members and casual riders use Cyclistic bikes differently?

    2.       Why would casual riders buy Cyclistic annual memberships?

    3.       How can Cyclistic use digital media to influence casual riders to become members? 

    As a junior data analyst, I have been assigned to answer the first question: How do annual members and casual riders use Cyclistic bikes differently? 

    Key stakeholders:

    1.       Lily Moreno: The director of marketing and my manager.

    2.       Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program. 

    Prepare:

    1.       Download the previous 12 months of Cyclistic trip data from the website called “divvy-tripdata”. The datasets are appropriate and organized. These data are reliable, original, comprehensive, current and cited. The data has been made available by motivate international Inc. under this license. By analyzing these data, I would be to find the connection between casual and member riders to identify the trends. 

    2.       Sorting and filtering the data: There are some NULL cells in the tables and needs to be marked as N/A which will not be used in the data analysis.  Also, I removed the blank cells by using filtering.

     Process:

    R-Studio: 

    GitHub code

    # Install all the packages
    
    install.packages("tidyverse")
    install.packages("lubridate")
    install.packages("skimr")
    
    library(tidyverse)
    library(lubridate)
    library(ggplot2)
    library(dplyr)
    library(skimr)

    Sep21 <- read.csv("202109-divvy-tripdata.csv") 
    Oct21 <- read.csv("202110-divvy-tripdata.csv") 
    Nov21 <- read.csv("202111-divvy-tripdata.csv") 
    Dec21 <- read.csv("202112-divvy-tripdata.csv") 
    Jan22 <- read.csv("202201-divvy-tripdata.csv") 
    Feb22 <- read.csv("202202-divvy-tripdata.csv") 
    Mar22 <- read.csv("202203-divvy-tripdata.csv") 
    Apr22 <- read.csv("202204-divvy-tripdata.csv") 
    May22 <- read.csv("202205-divvy-tripdata.csv") 
    Jun22 <- read.csv("202206-divvy-tripdata.csv") 
    Jul22 <- read.csv("202207-divvy-tripdata.csv") 
    Aug22 <- read.csv("202208-divvy-tripdata.csv")

    # Combine all files into one file called "totaltripdata"
    
    totaltripdata <- rbind(Sep21,Oct21,Nov21,Dec21,Jan22,Feb22,Mar22,Apr22,May22,Jun22,Jul22,Aug22)
    nrow(totaltripdata) # number of rows
    [1] 5883043
    
    ncol(totaltripdata) # number of columns
    [1] 13
    
    head(totaltripdata) # check the first 6 rows of the data frame
    tail(totaltripdata) # check the last 6 rows of the data frame
    summary(totaltripdata) # statistical summary of the data 
    colnames(totaltripdata) # list of column names
    
    [1] "ride_id"            "rideable_type"      "started_at"         "ended_at"           "start_station_name"
    [6] "start_station_id"   "end_station_name"   "end_station_id"     "start_lat"          "start_lng"         
    [11] "end_lat"            "end_lng"            "member_casual"  

    sum(is.na(totaltripdata)) # check how many rows are NA
    [1] 11454
    
    totaltripdata <- na.omit(totaltripdata) # Drop all NA rows
    
    summary(totaltripdata) # Check the summary of data
    
    dim(totaltripdata) # dimension of the dataset
    
    colnames(totaltripdata) # check column name of the dataset
    
    str(totaltripdata) # Check the column data type 
    
    summary(totaltripdata) # Check the summary of the dataset
    
    skim(totaltripdata) # Check the summary of the data, see if any data missing

    #totaltripdata$date <- as.Date(totaltripdata$started_at)
    totaltripdata$month <- format(as.Date(totaltripdata$date), "%m")
    totaltripdata$day <- format(as.Date(totaltripdata$date), "%d")
    totaltripdata$year <- format(as.Date(totaltripdata$date), "%y")
    totaltripdata$day_of_week <- format(as.Date(totaltripdata$date), "%w")
    
    # Calculate the time difference between ended_at and started_at, add a new column called ride_length
    totaltripdata$ride_length <- difftime(totaltripdata$ended_at, totaltripdata$started_at)
    
    is.factor(totaltripdata$ride_length)
    [1] FALSE
    
    totaltripdata$ride_length <- as.numeric(as.character(totaltripdata$ride_length))
    
    is.numeric(totaltripdata$ride_length)
    [1] TRUE
    
    skim(totaltripdata$ride_length)
    
    # Set up a new dataset called "totaltripdata_2" including the ride_length greater than 0
    totaltripdata_2 <- totaltripdata[!(totaltripdata$ride_length<0),] 
    
    sskim(totaltripdata_2)

    # Aggregate the totaltripdata_2
    aggregate(totaltripdata_2$ride_length ~ totaltripdata_2$member_casual, FUN = mean)
    
      totaltripdata_2$member_casual totaltripdata_2$ride_length
    1                        casual                   1758.0228
    2                        member                    771.3398
    
    aggregate(totaltripdata_2$ride_length ~ totaltripdata_2$member_casual, FUN = median)
    
      totaltripdata_2$member_casual totaltripdata_2$ride_length
    1                        casual                         835
    2                        member                         538
    
    aggregate(totaltripdata_2$ride_length ~ totaltripdata_2$member_casual, FUN = max)
     
      totaltripdata_2$member_casual totaltripdata_2$ride_length
    1                        casual                     2442301
    2                        member                       89998
    
    aggregate(totaltripdata_2$ride_length ~ totaltripdata_2$member_casual, FUN = min)
    
      totaltripdata_2$member_casual totaltripdata_2$ride_length
    1                        casual                           0
    2                        member                           0

    # Order the days of the week, set Sunday as the first day of the week 
    totaltripdata_2$day_of_week <- ordered(totaltripdata_2$day_of_week, levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))

    # Plot the numnber_of_rides vs. day of week
    totaltripdata_2 %>% 
      mutate(weekday = wday(started_at, label = TRUE)) %>% 
      group_by(member_casual, weekday) %>% 
      summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>% 
      arrange(member_casual, weekday) %>% 
      ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) + geom_col(position = "dodge")


    # Plot the average duration of rides vs. day of week
    totaltripdata_2 %>% 
      mutate(weekday = wday(started_at, label = TRUE)) %>% 
      group_by(member_casual, weekday) %>% 
      summarise(number_of_rides = n(), average_duration = mean(ride_length)) %>%
      arrange(member_casual, weekday)  %>% 
      ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) + geom_col(position = 'dodge')



    # Export the analyzed file into the work drive
    write.csv(totaltripdata_2, "data.csv")

    Power BI Dashboard: 





    Conclusion: 

        1. During weekdays, member riders tend to ride bikes more often than casual riders because they are mainly using bikes for the home-work commutes. Casual riders ride bikes more than member riders during weekends because they have more available time to ride for leisure. 

        2. Casual riders ride bikes much longer than casual riders because they have a relatively steady start point and destination point (e.g. home and workplace). 



















  • 0 comments:

    Post a Comment

    Search This Blog

    Powered by Blogger.

    Blog Archive

    LOCATION

    Katy, Greater Houston, Texas

    EMAIL

    LeeYao.us@gmail.com

    TELEPHONE

    NONE

    MOBILE

    +1 347-749-3598