This post walks through my exploratory data analysis (EDA) project on U.S. governors. The project looks for insights such as which governor served the shortest term and which political affiliations have dominated. Beyond those results, it was also a valuable opportunity to build my skills.


The Data

Sourced from Kaggle, the dataset contains information about U.S. governors, excluding territories and specific states from the Thirteen Colonies with a presidential office. It is a CSV file with eight columns:

                            StateFull
                            StateAbbrev
                            GovernorNumber
                            GovernorName
                            TookOffice
                            LeftOffice
                            PartyAffiliation
                            PartyAbbrev
                        
This publicly available dataset, owned by Brandon Conrady, was last updated on May 12, 2021, and operates under the CC0: Public Domain license.


Some Questions & Answers

  1. Show governors that share the same seat number.
     The lower the seat number, the more governors have held it.

  2. What is the earliest inaugural date?
     January 10, 1769 is the earliest inaugural date.

  3. How many distinct party affiliations are there?
     There are 34 distinct affiliations across all states. Interestingly, Rhode Island has the most, with 11.

  4. Which state has had the most governors so far?
     South Carolina has had the most, with 91 governors.
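
As a rough illustration of how the questions above can be answered with dplyr, here is a minimal sketch. The rows are made up for the example (they are not taken from the real dataset), and the object names `governors`, `seat_counts`, `parties_by_state`, and `most_governors` are my own placeholders; only the column names follow the tidied dataset.

```r
library(dplyr)

# Made-up rows for illustration only; not taken from the real dataset.
governors <- data.frame(
  state_full_name     = c("State A", "State A", "State A", "State B", "State B"),
  governor_seat_order = c(1L, 2L, 3L, 1L, 2L),
  governor_full_name  = c("Gov A1", "Gov A2", "Gov A3", "Gov B1", "Gov B2"),
  took_office         = as.Date(c("1801-01-05", "1805-01-07", "1809-01-02",
                                  "1790-03-01", "1794-03-03")),
  party_affiliation   = c("Party X", "Party Y", "Party Z", "Party X", "Party X"),
  stringsAsFactors    = FALSE
)

# Q1: how many governors share each seat number?
seat_counts <- governors %>%
  count(governor_seat_order, name = "n_governors")

# Q2: earliest inaugural date
earliest <- min(governors$took_office, na.rm = TRUE)

# Q3: distinct affiliations overall, then per state (largest first)
n_parties <- n_distinct(governors$party_affiliation)
parties_by_state <- governors %>%
  group_by(state_full_name) %>%
  summarise(n_parties = n_distinct(party_affiliation)) %>%
  arrange(desc(n_parties))

# Q4: state with the most governors
most_governors <- governors %>%
  count(state_full_name, name = "n_governors") %>%
  slice_max(n_governors, n = 1)
```

With the real data, the same pattern would be applied to the full 2,587-row data frame instead of the toy one.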


Tidying the Data

Several tidying steps improve the dataset: converting the date columns from character to Date, standardizing the column names, and correcting inaccurate entries, such as a date error in November. This involves replacing the problematic dates, keeping the formatting consistent, and handling values that fail during the as.Date conversion.
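
The naming step might look something like the sketch below. The old-to-new mapping is an assumption on my part, inferred by lining up the eight raw column names with the eight tidied names in their listed order; `raw_df` and `tidy_df` are placeholder names and the single row is made up.

```r
library(dplyr)

# One made-up row using the raw column names from the CSV
raw_df <- data.frame(
  StateFull        = "State A",
  StateAbbrev      = "SA",
  GovernorNumber   = 1L,
  GovernorName     = "Gov A1",
  TookOffice       = "January 5, 1801",
  LeftOffice       = "January 7, 1805",
  PartyAffiliation = "Party X",
  PartyAbbrev      = "X",
  stringsAsFactors = FALSE
)

# Assumed mapping (new_name = old_name), matched by column order
tidy_df <- raw_df %>%
  rename(
    state_full_name     = StateFull,
    state_abbrev        = StateAbbrev,
    governor_seat_order = GovernorNumber,
    governor_full_name  = GovernorName,
    took_office         = TookOffice,
    left_office         = LeftOffice,
    party_affiliation   = PartyAffiliation,
    party_abbrev        = PartyAbbrev
  )

names(tidy_df)
```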

Dataset:

                            ## # A tibble: 2,587 x 3
                            ##    governor_full_name     took_office       left_office      
                            ##    <chr>                  <chr>             <chr>
                            ##  1 William Wyatt Bibb     November 9, 1819  July 10, 1820    
                            ##  2 Thomas Bibb            July 10, 1820     November 9, 1821 
                            ##  3 Israel Pickens         November 9, 1821  November 25, 1825
                            ##  4 John Murphy            November 25, 1825 November 25, 1829
                            ##  5 Gabriel Moore          November 25, 1829 March 3, 1831    
                            ##  6 Samuel Moore           March 3, 1831     November 26, 1831
                            ##  7 John Gayle             November 26, 1831 November 21, 1835
                            ##  8 Clement Comer Clay     November 21, 1835 July 17, 1837    
                            ##  9 Hugh McVay             July 17, 1837     November 21, 1837
                            ## 10 Arthur Pendleton Bagby November 21, 1837 November 22, 1841
                            ## # ... with 2,577 more rows
                        

Code:

                            library(dplyr)

                            #................ convert & save .............................
                            # Parse strings such as "November 9, 1819" into Date values;
                            # anything that does not match the format becomes NA.
                            StateGov_df <- StateGov_df %>%
                              mutate(took_office = as.Date(took_office,
                                                           format = "%B %d, %Y"),
                                     left_office = as.Date(left_office,
                                                           format = "%B %d, %Y"))
                            #................ view ........
                            StateGov_df %>% 
                              select(governor_full_name,
                                     took_office,
                                     left_office) %>% 
                              head(n = 3)
                        

Result:

                            ## # A tibble: 3 x 3
                            ##   governor_full_name took_office left_office
                            ##   <chr>              <date>      <date>
                            ## 1 William Wyatt Bibb 1819-11-09  1820-07-10 
                            ## 2 Thomas Bibb        1820-07-10  1821-11-09 
                            ## 3 Israel Pickens     1821-11-09  1825-11-25
                        

Revisiting the Imported Dataframe

While reviewing the imported data frame, I ran into problems with duplicate governor seat numbers that led to NAs. Despite the warnings and some awkward syntax, I kept refining the process so the results stay accurate and the fixes have no unintended side effects.


Tackling NAs

The as.Date conversion turned 21 values into NAs, which prompted a closer look. I handle them systematically: replace the problematic values, convert again, and save the result, with accuracy and completeness as the priority.
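
One way to surface such values is to parse into a temporary column and keep the rows where the original string is present but the parsed date is NA. This is only a sketch: the rows and the two failure patterns are made up (the post does not say what the 21 values looked like), `raw_dates` and `failed` are placeholder names, and `%B` assumes an English locale.

```r
library(dplyr)

# Made-up rows: one valid date string and two that do not match the format
raw_dates <- data.frame(
  governor_full_name = c("Gov A1", "Gov A2", "Gov A3"),
  took_office        = c("January 5, 1801", "9 November 1805", "1809"),
  stringsAsFactors   = FALSE
)

# Keep rows where a string exists but as.Date could not parse it
failed <- raw_dates %>%
  mutate(parsed = as.Date(took_office, format = "%B %d, %Y")) %>%
  filter(!is.na(took_office), is.na(parsed))

failed
```

Parsing into a separate column keeps the original strings available, so each failure can be inspected before anything is overwritten.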

Replace NAs

The final step replaces the problematic dates in the took_office and left_office columns. I cross-referenced each one with NGA.org and checked the results for errors. The refined dataset has no remaining NAs, as the per-column NA counts below show, which marks the completion of the EDA project.

                            ## # A tibble: 1 x 8
                            ##   state_full_name state_abbrev governor_seat_order governor_full_na~ took_office
                            ##             <int>        <int>               <int>             <int>       <int>
                            ## 1               0            0                   0                 0           0
                            ## # ... with 3 more variables: left_office <int>, party_affiliation <int>,
                            ## #   party_abbrev <int>
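
The replace-and-check step described above could be sketched as follows. Everything here is illustrative: the rows and the corrected value are made up (in the project the corrections came from NGA.org), and `gov`, `fixed`, and `na_counts` are placeholder names. The last pipeline produces a one-row NA count per column, the same shape as the table above.

```r
library(dplyr)

# Made-up rows: the second took_office value does not match "%B %d, %Y"
gov <- data.frame(
  governor_full_name = c("Gov A1", "Gov A2"),
  took_office        = c("January 5, 1801", "1805"),
  stringsAsFactors   = FALSE
)

fixed <- gov %>%
  mutate(
    # Hypothetical correction for one known-bad value
    took_office = if_else(governor_full_name == "Gov A2" & took_office == "1805",
                          "January 7, 1805",
                          took_office),
    # Convert only after the bad string has been replaced
    took_office = as.Date(took_office, format = "%B %d, %Y")
  )

# Count the NAs in every column; all zeros means nothing was lost
na_counts <- fixed %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```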