R Subsetting Vectors and Dataframes
Simple Guide to Subsetting
R offers many different ways to do what we need to do. There’s the data.table way, the Tidyverse way, the Base R way, and any number of other approaches. Coming from SAS, I was most comfortable with the Tidyverse way of doing things. However, I’ve found that the underlying Base R approach is sometimes a bit easier.
Subsetting
I find the easiest way to subset is using simple logic and the square brackets []. At first, the syntax is a bit tricky but after a little bit of practice, I can build new sets of data with ease.
Subsetting a Vector
I have a vector with NA records and I want to get rid of them. This is rather simple: is.na() flags the NA records, I negate (!) that selection, and I subset the vector with the result.
vector_clean <- vector_dirty[!is.na(vector_dirty)]
Simple!
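To see this on something concrete, here is a minimal sketch using a made-up vector (the values in vector_dirty are my own, purely for illustration). It shows the logical vector that is.na() produces and what the negated version keeps.

```r
# Hypothetical example data: three real values and two NA records
vector_dirty <- c(12, NA, 7, NA, 30)

# is.na() returns TRUE wherever a value is missing
na_flags <- is.na(vector_dirty)

# Negating with ! flips the flags, so TRUE now marks the values to keep
vector_clean <- vector_dirty[!na_flags]

print(na_flags)
print(vector_clean)
```

The result keeps the original order of the remaining values; only the NA positions are dropped.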
Subsetting a Dataframe
This is where the syntax can get a bit trickier and uglier. However, with some practice, this gets pretty simple. Since we have two dimensions (rows and columns), we have to separate the two parts of our subset with a comma: [do row work, do column work]. In many cases, I want all the columns and a subset of rows, and this is where I get confused a lot: the column side is left empty, but the comma still has to be there.
Here’s the basic syntax, pulling from the dataframe dirty_df, removing every row where dirty_vector is NA, and returning all columns.
clean_df <- dirty_df[!is.na(dirty_df$dirty_vector), ]
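The [rows, columns] pattern is easier to see on a tiny table. The sketch below builds a hypothetical dirty_df (the columns id, group and dirty_vector and their values are invented for illustration) and uses each side of the comma in turn.

```r
# Hypothetical example data; column names and values are made up
dirty_df <- data.frame(
  id           = 1:4,
  group        = c("a", "b", "a", "b"),
  dirty_vector = c(10, NA, 30, NA)
)

# Row work only: keep rows where dirty_vector is not NA, all columns
rows_only <- dirty_df[!is.na(dirty_df$dirty_vector), ]

# Column work only: keep all rows, pick and reorder two columns
cols_only <- dirty_df[, c("dirty_vector", "id")]

# Both at once: filtered rows and selected columns
both <- dirty_df[!is.na(dirty_df$dirty_vector), c("id", "dirty_vector")]

print(rows_only)
print(cols_only)
print(both)
```

Leaving one side of the comma empty means "everything" for that dimension.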
Example with Real Data
Okay, keeping this very simple, I’m going to use the Lahman baseball dataset. Install the package Lahman and load its Teams table as a dataframe:
install.packages("Lahman")
teams <- Lahman::Teams
This is a huge dataset going back to the 1870s. I’m only interested in really good teams since 1960. First, I am going to filter for teams from 1960 and later.
teams <- teams[teams$yearID >= 1960, ]
Now, I want to only look at teams that won at least 95 games.
teams <- teams[teams$W >= 95, ]
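One thing worth knowing about this style of filter: if the column being compared contains an NA, the comparison itself is NA, and Base R returns a row of NAs for it rather than dropping it. I have not checked here whether yearID or W contain any NA values in Teams, so treat the following as a sketch on made-up data. Wrapping the condition in which() is one common way to drop those rows.

```r
# Hypothetical data: one team has a missing win total
toy <- data.frame(
  name = c("A", "B", "C"),
  W    = c(101, NA, 88)
)

# The NA comparison produces an all-NA row in the result
with_na_row <- toy[toy$W >= 95, ]

# which() returns only the positions where the condition is TRUE
without_na_row <- toy[which(toy$W >= 95), ]

print(with_na_row)
print(without_na_row)
```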
This is great, but I have way too much information. I’m only interested in the team, the year, their wins and losses, and whether they won the World Series. For this, I can use the column selectors. Every column in the Lahman data has a name, so picking these out and changing their order is simple.
teams <- teams[, c("name", "yearID", "W", "L", "WSWin")]
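A related detail when selecting columns: asking for a single column by name returns a plain vector rather than a dataframe, unless drop = FALSE is added. A small sketch on made-up data (toy and its values are hypothetical):

```r
# Hypothetical two-row table
toy <- data.frame(name = c("A", "B"), W = c(101, 97))

# A single column comes back as a vector by default
as_vector <- toy[, "name"]

# drop = FALSE keeps the one-column dataframe structure
as_df <- toy[, "name", drop = FALSE]

print(is.data.frame(as_vector))
print(is.data.frame(as_df))
```

Selecting two or more columns, as in the Lahman example, always returns a dataframe, so this only matters when narrowing down to one.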
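The great thing is that we can do all of this work in a single step, starting again from the full Teams table and combining the two row conditions with &.
teams <- Lahman::Teams
teams <- teams[teams$yearID >= 1960 & teams$W >= 95, c("name", "yearID", "W", "L", "WSWin")]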
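As a quick sanity check that the single-step form matches the step-by-step version, here is a sketch on a tiny made-up table that borrows the Teams column names. The rows are invented, not real Lahman records, and toy_teams, cols, stepwise and one_step are names I chose for illustration.

```r
# Hypothetical stand-in for Teams with invented rows
toy_teams <- data.frame(
  yearID = c(1955, 1961, 1975, 1998),
  name   = c("Team A", "Team B", "Team C", "Team D"),
  W      = c(98, 109, 86, 114),
  L      = c(55, 53, 76, 48),
  WSWin  = c("Y", "Y", "N", "Y")
)
cols <- c("name", "yearID", "W", "L", "WSWin")

# Step by step: filter on year, then on wins, then pick columns
stepwise <- toy_teams[toy_teams$yearID >= 1960, ]
stepwise <- stepwise[stepwise$W >= 95, ]
stepwise <- stepwise[, cols]

# Single step: both row conditions joined with &, columns after the comma
one_step <- toy_teams[toy_teams$yearID >= 1960 & toy_teams$W >= 95, cols]

print(one_step)
print(identical(stepwise, one_step))
```

The & operator works element by element, so a row is kept only when both conditions are TRUE for that row.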