13 Data Simulation

Packages needed:

* tidyr
* rgdal
* ggplot2

Welcome to Chapter 13! You’ve been analyzing the Tauntaun dataset throughout this primer, beginning with the two .csv files that were given to you way back in Chapter 5 (Projects). In previous chapters, we spent some time cleaning and merging the datasets, and then summarized the data and created a markdown annual report. We also analyzed the data using our sex-age-kill function and other basic statistical approaches. In this chapter, we’ll spill the beans and show you how two original csv datasets were created in R. The main purpose of this chapter is to show you how to simulate a dataset, and the key functions involved.

The first csv stores the Tauntaun harvest data, which we created by simulating Tauntaun population dynamics and harvest through time. This approach involves creating and working with arrays, using a few probability distributions for simulating data, and then using several of R’s data manipulation functions to create the final csv file. We’ve incorporated some typos along the way so that you could learn some functions that come in handy for proofing data. The second csv file stores hunter information, which if you remember, gave us practice merging two datasets.

WARNING: The “long chapter” warning is in full effect! This chapter introduces new material and makes extensive use of previous material as a review . . . make sure to give yourself time and take some breaks.

13.1 Population Model Overview

To create the Tauntaun harvest data, we need to simulate the Tauntaun population through time. We’ll set up a 10 year population simulation, where we will track the number of male and female Tauntauns separately in each time step (each year).

Tauntaun life history has been well studied by biologists on the icy planet of Hoth, and it’s reasonable to assume that life history characteristics are the same in Vermont. Individuals can begin breeding at the age of 1, and produce approximately 2 offspring per year. Individuals that are 0 years old are called “young” because they cannot breed. One-year-old breeders are called “recruits” because these individuals have been recruited into the breeding population. Older breeders (age 2+) are called “adults” because they are adult (experienced) breeders.

In a population model, nailing down the timing of events is absolutely critical. Here’s an overview of the Tauntaun timing of events . . .

Tauntaun life cycle.

Figure 13.1: Tauntaun life cycle.

  1. The census (top) occurs on the birthday, which occurs before the harvest. All Tauntauns are born on the exact same day - August 1st. The census is a count of 0, 1, 2, 3, 4, 5 year males and females. Because the census includes 0-year-olds, this is referred to as a “post-breeding census”. All individuals counted at the census remain alive until the harvest.
  2. The harvest occurs in the fall, and the number of animals harvested depends on a harvest rate (harvest probability).
  3. After the harvest, there is a period of natural mortality. We’ll assume this natural mortality does not affect animals during the short harvest season.
  4. We’ve come full circle. Tauntauns that survive the harvest and the natural mortality period are able to give advance to the birthday again. On the birthday, the animals that were 0 years old are now 1, and they instantly give birth to ~ 2 offspring. The animals that were 1 years old are now 2, and they instantly give birth to ~2 offspring, etc. These “fresh” individuals are now counted for the year 2 census.

13.1.1 Simulation Inputs

We developed a Tauntaun population in a spreadsheet called SimHarvest_2020.xlsx. This file should be located in your simulations directory, and our goal now is to convert the spreadsheet model into an R script.

You may not remember this, but the main inputs to the model are given in two sections of the spreadsheet. In Chapter 5, you read in the Excel file and saved these inputs as a list. Then you saved the inputs as sim_inputs.Rdata. This should be located in your simulations directory. Let’s check that now, and read in the file with the load function:

# check that the file exists
file.exists("simulations/sim_inputs.Rdata")
## [1] TRUE
# load in the file
load("simulations/sim_inputs.Rdata")

# check the structure of the object sim_inputs
str(sim_inputs)
## List of 2
##  $ pop.seed  :'data.frame':  2 obs. of  7 variables:
##   ..$ Sex: chr [1:2] "females" "males"
##   ..$ 0  : num [1:2] 3000 1200
##   ..$ 1  : num [1:2] 1500 900
##   ..$ 2  : num [1:2] 500 300
##   ..$ 3  : num [1:2] 300 180
##   ..$ 4  : num [1:2] 100 60
##   ..$ 5  : num [1:2] 50 30
##  $ pop.params:'data.frame':  5 obs. of  7 variables:
##   ..$ Rate: chr [1:5] "survival.f" "survival.m" "birth" "harvest.f" ...
##   ..$ 0   : num [1:5] 0.400000000000000022204 0.400000000000000022204 0 0.000000000000000000001 0.000000000000000000001
##   ..$ 1   : num [1:5] 0.7 0.7 2 0.2 0.3
##   ..$ 2   : num [1:5] 0.8 0.8 2.5 0.2 0.3
##   ..$ 3   : num [1:5] 0.9 0.9 2.5 0.2 0.3
##   ..$ 4   : num [1:5] 0.9 0.9 2.5 0.2 0.3
##   ..$ 5   : num [1:5] 0 0 2.5 0.2 0.3

Does this ring a bell? We created sim_inputs.RData in Chapter 5, but you’re brain may be a bit cluttered right now! The inputs for our population model are stored in a list (sim_inputs) consisting of 2 objects (pop.seed, pop.params, each of which is a data.frame. Remember, because the information is stored in a list, we need to first tell R what section of the list we are referring to with double brackets [[ ]], and from there you can work with the enclosed object.

With sim_inputs loaded into R’s memory, we can now translate the Tauntaun life history characteristics into a population model. We’ll be running a 10 year simulation that has a 1 year time step.

13.2 The “pop” Array

Our model will hold numbers of individuals, tracked for each age and sex, at different stages of the life cycle. We’ll start with the census, and work through the life cycle in chronological order.

Here are the stages we’ll be tracking:

  • census - number of individuals at the time of the population census,
  • harvest - the number of individuals harvested,
  • escapees - the number of individuals that escaped the harvest,
  • survivors - the number of individuals that survive the natural mortality period.

We’ll need to keep track of all of these individuals for each year, and by age and sex. The best way to store this information is in an object that is an array, which we introduced in Chapter 4. Arrays can be thought of as multi-dimension vectors. They hold a single data type (in our case, numbers). Arrays have rows, columns, and can have multiple pages of rows and columns, much like an Excel workbook. The trick is how to give instructions to “snake” through the rows, columns, and pages in the correct order.

You can create an array with the array function. Let’s use the args function to look at this function’s arguments.

# look at the arguments for the array function
args(array)
## function (data = NA, dim = length(data), dimnames = NULL) 
## NULL

From the output, we see that the first value in the argument list is the data. We can start by entering NA’s in the array, which indicates Not Available, and then replace them as we build the model. The second argument is called dim, which is where you specify the dimensions of your array. The default is length(data), but this won’t work for us. How many dimensions do we need? We’re tracking 10 years, 6 ages (ages 0-6), 2 sexes (m for male, and f for female), and 4 stages in the life cycle (census, harvest, escapees, and survivors). Remember that the order of the dimensions is critical for determining how the resulting array looks, as well as how you feed the array and extract from it. The order gives the direction in which to “snake” through the array to do calculations. So an array with dimensions as 10, 6, 2, 4 is organized differently than an array with dimensions 6, 2, 10, 4, even if they hold the same number of entries. Russell Crowe can help us remember that the first two numbers set the rows and columns of our array, but the last two entries show additional dimensions.

The last argument is called dimnames and it allows you to name each of the dimensions. We strongly encourage this practice. If you remember from Chapter 4, dimnames is stored as an attribute of the object, and the names are stored in a list. When creating an array, the order of the names needs to match the order of the dimensions perfectly. That is, the first name in the dimnames list indicates the row names, the second name in the dimnames list indicates the column names, and so on.

# create arrays for holding numbers
pop <- array(data = NA,
            dim = c(10, 6, 2, 4),
            dimnames = list(yr = 1:10, 
                            age = 0:5,
                            sex = c('f','m'),
                            stage = c('census','harvest', 'escapees', 'survivors')))

Now that we’ve created an array (an object called pop), let’s have a look at it.

# look at the array
pop
## , , sex = f, stage = census
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, stage = census
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, stage = harvest
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, stage = harvest
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, stage = escapees
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, stage = escapees
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, stage = survivors
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, stage = survivors
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA

That’s quite an array! We won’t need to look at the entire array in future steps, but it helps to see the whole thing when you first create one to verify that it has the dimensions you expect. Notice that there are a series of two-dimensional grids provided, with rows as years and columns as age. Each two dimensional grid is identified by both sex (“m” or “f”) and stage of the life cycle (“census”, “harvest”, “escapees”, “survivors”). Right now our array is filled with NA’s. We’ll have this filled in with real numbers in no time.

Instead of looking at the entire array, the str function can be used to see what this array is all about:

# verify it's structure
str(pop)
##  logi [1:10, 1:6, 1:2, 1:4] NA NA NA NA NA NA ...
##  - attr(*, "dimnames")=List of 4
##   ..$ yr   : chr [1:10] "1" "2" "3" "4" ...
##   ..$ age  : chr [1:6] "0" "1" "2" "3" ...
##   ..$ sex  : chr [1:2] "f" "m"
##   ..$ stage: chr [1:4] "census" "harvest" "escapees" "survivors"

The first line of this output tells us what kind of data are in the array (num), and the section [1:10, 1:6, 1:2, 1:4] provides the dimensions. The next line of output, attr(*, “dimnames”)=List of 4 tells us that the names of each dimension are stored as an attribute for the object, pop. An object can have multiple attributes; the attribute called dimnames is a list that contains four objects (which are all character vectors). The four vectors within it are named yr, age, sex, and stage, and the $ that precedes the name suggests that these values are contained within the dimnames list. The output tells you the actual dimension names (such as “f” and “m”) are stored as a vector of characters (chr). You can add more attributes to the list if you want, as we did in Chapter 4.

If you want to extract specific attributes for the object, pop, use the attr function.

# look at the dimnames attribute for the object called pop
attr(x = pop, which = "dimnames")
## $yr
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
## 
## $age
## [1] "0" "1" "2" "3" "4" "5"
## 
## $sex
## [1] "f" "m"
## 
## $stage
## [1] "census"    "harvest"   "escapees"  "survivors"

If you’d like to see all of the attributes for the object, pop, use the attributes function. You’ll see that our array has another attribute called dim, which stores the array’s dimensions only.

# look at the attributes for the object called pop
attributes(pop)
## $dim
## [1] 10  6  2  4
## 
## $dimnames
## $dimnames$yr
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
## 
## $dimnames$age
## [1] "0" "1" "2" "3" "4" "5"
## 
## $dimnames$sex
## [1] "f" "m"
## 
## $dimnames$stage
## [1] "census"    "harvest"   "escapees"  "survivors"

Hopefully it is sinking in that you must know how the array is arranged in order to work with it. We’ll leave our population array for now, and move onto the topic of rates. Then, we’ll use the rate data to help fill in our array called pop.

13.3 The “rates” Array

The Tauntaun population model requires that you specify vital rates, such as harvest rate, birth rate, and natural survival rate. We’ll track each of these rates by age, sex, and year, so we’ll need another array for holding these rates. One more vital rate is needed for our population model – the sex ratio of the offspring – but for this exercise they will be a constant 0.5 (50% of the offspring will be female, and 50% will be male), so we won’t need to add it to the array.

Let’s start by creating an object called rates with the array function. We’ll be tracking the three rates for each year, age and sex class. This time, our dimensions will be 10 years, 6 age classes, 2 sexes, and 3 rates, so we’ll set the dim argument to c(10, 6, 2, 3). The names of each dimension will be stored in a list, given in the exact same order as our dimensions: yr will run from 1:10, age will run between 0 and 5, sex will be “f” and “m”, and rate will be identified by “harvest”, “birth”, and “survival”. Remember, names in R are stored as characters, so ages will be stored as “0”, “1”, “2”, etc.

# create an object called rates that is an array
rates <- array(data = NA, 
               dim = c(10, 6, 2, 3),
               dimnames = list(yr = 1:10,
                               age = 0:5,
                               sex = c('f','m'),
                               rate = c('harvest', 'birth','survival')))

# look at the structure of the object called rates
str(rates)
##  logi [1:10, 1:6, 1:2, 1:3] NA NA NA NA NA NA ...
##  - attr(*, "dimnames")=List of 4
##   ..$ yr  : chr [1:10] "1" "2" "3" "4" ...
##   ..$ age : chr [1:6] "0" "1" "2" "3" ...
##   ..$ sex : chr [1:2] "f" "m"
##   ..$ rate: chr [1:3] "harvest" "birth" "survival"
# look at the rates array
rates
## , , sex = f, rate = harvest
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, rate = harvest
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, rate = birth
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, rate = birth
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, rate = survival
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = m, rate = survival
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA

Another impressive array! Make sure to look at its length (360). We are going to fill in this array with numbers next. We’ll then use these rates to project the Tauntaun population through time.

13.3.1 Filling the Rates Array

The vital rates are currently stored in a dataframe called pop.params, which is part of the sim.inputs list. Let’s have a look.

sim_inputs$pop.params
##         Rate                       0   1   2   3   4   5
## 1 survival.f 0.400000000000000022204 0.7 0.8 0.9 0.9 0.0
## 2 survival.m 0.400000000000000022204 0.7 0.8 0.9 0.9 0.0
## 3      birth 0.000000000000000000000 2.0 2.5 2.5 2.5 2.5
## 4  harvest.f 0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
## 5  harvest.m 0.000000000000000000001 0.3 0.3 0.3 0.3 0.3

This is a dataframe that stores vital rates by Tauntaun age class (0-5). The name of the rate is given in the first column. The first two rows give the natural survival rate of females and males, respectively, by age class. In row 3, the per capita birth rate is stored. This is the total number of offspring produced per female per year. Notice that Tauntauns >= 1 year old can breed. The final two rows give the harvest rate of females and males by age class.

Now it’s just a matter of moving these rates into their correct position in the rates array. Because the information above is stored as a dataframe, we’ll need to extract the row we need, and convert it to a matrix with the as.matrix function. Remember that an array (like a vector and matrix) can store a single datatype (mode or type) within it. Dataframes are lists, so they cannot be inserted into an array directly.

Let’s start with the female survival rates, which is stored in row 1 of our sim_inputs$pop.params object.

# extract the female survival rates by index
data <-  as.matrix(sim_inputs$pop.params[1, 2:7])

# look at the data; it is a 1 row by 6 column matrix
data
##     0   1   2   3   4 5
## 1 0.4 0.7 0.8 0.9 0.9 0

These are our age-specific female natural survival rates. If we assume that they are the same for all 10 years of the simulation, we can simply turn this into a 10 row by 6 column matrix.

(data <- matrix(data = data, nrow = 10, ncol = 6, byrow = TRUE))
##       [,1] [,2] [,3] [,4] [,5] [,6]
##  [1,]  0.4  0.7  0.8  0.9  0.9    0
##  [2,]  0.4  0.7  0.8  0.9  0.9    0
##  [3,]  0.4  0.7  0.8  0.9  0.9    0
##  [4,]  0.4  0.7  0.8  0.9  0.9    0
##  [5,]  0.4  0.7  0.8  0.9  0.9    0
##  [6,]  0.4  0.7  0.8  0.9  0.9    0
##  [7,]  0.4  0.7  0.8  0.9  0.9    0
##  [8,]  0.4  0.7  0.8  0.9  0.9    0
##  [9,]  0.4  0.7  0.8  0.9  0.9    0
## [10,]  0.4  0.7  0.8  0.9  0.9    0

What happened here? We created a matrix with 10 rows and 6 columns, and passed into it our data matrix. This matrix consisted of only 6 values (the female survival rates by age), but R recycled these values by row.

This matrix can now be stuffed into its correct spot in our rates array. Let’s first remind ourselves of the structure of the rates array so that we know are indexing system:

# look at the structure of the rates array
str(rates)
##  logi [1:10, 1:6, 1:2, 1:3] NA NA NA NA NA NA ...
##  - attr(*, "dimnames")=List of 4
##   ..$ yr  : chr [1:10] "1" "2" "3" "4" ...
##   ..$ age : chr [1:6] "0" "1" "2" "3" ...
##   ..$ sex : chr [1:2] "f" "m"
##   ..$ rate: chr [1:3] "harvest" "birth" "survival"

Oh, yeah, right - the ordering is yr, age, sex, rate. Our matrix already contains all years and ages in the correct order, so we only need to specify the sex (“f”) and rate (“survival”) to correctly fill in the data:

# insert the female harvest rate into the rates array.
rates[yr = , age = , sex = "f", rate = "survival"] <- data

# look at the female survival rate portion of the rates array
rates[yr = , age = , sex = "f", rate = "survival"]
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0

We take the code rates[yr = , age = , sex = “f”, rate = “survival”] to read “dip into the rates array, all years, all ages to extract the survival rate of females”. Note that our code above including the names of each section of the array:

rates[yr = , age = , sex = "f", rate = "survival"]

You don’t need to add the section name, however. As with functions, as long as you know the order of the array, you can extract what is needed without section names, like this:

# extract the female survival rates; no section names here
rates[, , "f", "survival"]
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0

We’ll keep the names as it produces more readable code, but we mention this because many examples in the R helpfiles do not include names.

Try typing rates[ in your console or script, and then press the tab key. You should see some helpers appear that allow you to select a particular element in the array by name, such as “f” or “survival”. These are not the named sections of the array, such as yr, age, sex, or stage; rather they are actually the dimnames (names of the dimensions in the array itself). The tab may come in handy because it reminds you of the structure of the array object.

As an aside, it appears that R won’t throw an error if you use incorrect section names. For example, this code works:

# the rates array with incorrect use of section names
rates[mo = , shemp = , curly = "m", larry = "survival"] <- data

# look at the rates array with bad names
rates[larry = , mo = , curly = "m", shemp = "survival"]
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0

However, for any extraction, you must specify the named elements (dimnames) correctly. For example, this code throws an error because there is no array element called “female”.

# extract the female survival rates; no names here
rates[, , "female", "survival"]
Error in rates[, , "female", "survival"] : subscript out of bounds

You can also extract by position index. For example, these two lines of code return the same result.

# extract female survival rates for all years, age class "0"
rates[yr = , age = "0", sex = "f", rate = "survival"]
##   1   2   3   4   5   6   7   8   9  10 
## 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
# try it with indexing instead indexing
rates[yr = , age = 1, sex = 1, rate = "survival"]
##   1   2   3   4   5   6   7   8   9  10 
## 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4

If you do extract by index, keep the str function close at hand so you know how the different elements are indexed. Note here that age “0” is the first element of the ages (0:5). This can be confusing!

OK, let’s continue filling the rates array. Let’s follow suit to add in the male natural survival rates, which as stored in row 2 of the sim_inputs$pop.params object.

# extract the male survival rates
data <-  as.matrix(sim_inputs$pop.params[2, 2:7])

# expand the data to a 10 row by 6 column matrix
data <- matrix(data = data, nrow = 10, ncol = 6, byrow = TRUE)

# fill in the correct portion of the array
rates[yr = , age = , sex = "m", rate = "survival"] <- data

# look at the male survival rate portion of the rates array
rates[yr = , age = , sex = "m", rate = "survival"]
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0
# compare with original data stored in sim.inputs$pop.params
sim_inputs$pop.params[2, 2:7]
##     0   1   2   3   4 5
## 2 0.4 0.7 0.8 0.9 0.9 0

You may be wondering what the number 2 is doing to the left of this output. When we subset the pop.params dataframe with sim_inputs$pop.params[2, 2:7], we specificially asked for row 2, columns 2:7 to be returned. The #2 is the remnant row name from this extraction.

Next, let’s fill in the female birth rates. Recall that these rates provide the number of offspring produced per female Tauntaun by age. The rates are currently stored in row 3 of the sim_inputs$pop.params object.

# extract the female birth rates
data <-  as.matrix(sim_inputs$pop.params[3, 2:7])

# expand the data to a 10 row by 6 column matrix
data <- matrix(data = data, nrow = 10, ncol = 6, byrow = T)

# fill in the correct portion of the array
rates[yr = , age = , sex = "f", rate = "birth"] <- data

# look at the female birth rate portion of the rates array
rates[yr = , age = , sex = "f", rate = "birth"]
##     age
## yr   0 1   2   3   4   5
##   1  0 2 2.5 2.5 2.5 2.5
##   2  0 2 2.5 2.5 2.5 2.5
##   3  0 2 2.5 2.5 2.5 2.5
##   4  0 2 2.5 2.5 2.5 2.5
##   5  0 2 2.5 2.5 2.5 2.5
##   6  0 2 2.5 2.5 2.5 2.5
##   7  0 2 2.5 2.5 2.5 2.5
##   8  0 2 2.5 2.5 2.5 2.5
##   9  0 2 2.5 2.5 2.5 2.5
##   10 0 2 2.5 2.5 2.5 2.5
# compare with original data stored in sim.inputs$pop.params
sim_inputs$pop.params[3, 2:7]
##   0 1   2   3   4   5
## 3 0 2 2.5 2.5 2.5 2.5

Finally, let’s fill in the harvest rates. Recall that these rates provide the probability that a Tauntaun will be harvested by age class. The rates are currently stored in rows 4 and 5 of the sim_inputs$pop.params object. let’s remind ourselves what this looks like:

sim_inputs$pop.params
##         Rate                       0   1   2   3   4   5
## 1 survival.f 0.400000000000000022204 0.7 0.8 0.9 0.9 0.0
## 2 survival.m 0.400000000000000022204 0.7 0.8 0.9 0.9 0.0
## 3      birth 0.000000000000000000000 2.0 2.5 2.5 2.5 2.5
## 4  harvest.f 0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
## 5  harvest.m 0.000000000000000000001 0.3 0.3 0.3 0.3 0.3

Next, let’s fill in the female harvest rates. The rates are currently stored in row 4 of the sim_inputs$pop.params object.

# extract the female harvest rates
data <-  as.matrix(sim_inputs$pop.params[4, 2:7])

# expand the data to a 10 row by 6 column matrix
data <- matrix(data = data, nrow = 10, ncol = 6, byrow = TRUE)

# fill in the correct portion of the array
rates[yr = , age = , sex = "f", rate = "harvest"] <- data

# look at the male survival rate portion of the rates array
rates[yr = , age = , sex = "f", rate = "harvest"]
##     age
## yr                         0   1   2   3   4   5
##   1  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   2  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   3  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   4  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   5  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   6  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   7  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   8  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   9  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   10 0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
# compare with original data stored in sim.inputs$pop.params
sim_inputs$pop.params[4, 2:7]
##                         0   1   2   3   4   5
## 4 0.000000000000000000001 0.2 0.2 0.2 0.2 0.2

Finally, let’s fill in the male harvest rates. The rates are currently stored in row 5 of the sim_inputs$pop.params object.

Try this on your own! The main section to fill is rates[yr = , age = , sex = “m”, rate = “harvest”], and we are referencing row 5, columns 2:7 for the data. Your results should look like this:

##     age
## yr                         0   1   2   3   4   5
##   1  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   2  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   3  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   4  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   5  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   6  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   7  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   8  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   9  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   10 0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##                         0   1   2   3   4   5
## 5 0.000000000000000000001 0.3 0.3 0.3 0.3 0.3

How did you do? Let’s have a look at our rates array once more in full.

# look at the rates array
rates
## , , sex = f, rate = harvest
## 
##     age
## yr                         0   1   2   3   4   5
##   1  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   2  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   3  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   4  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   5  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   6  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   7  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   8  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   9  0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
##   10 0.000000000000000000001 0.2 0.2 0.2 0.2 0.2
## 
## , , sex = m, rate = harvest
## 
##     age
## yr                         0   1   2   3   4   5
##   1  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   2  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   3  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   4  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   5  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   6  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   7  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   8  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   9  0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
##   10 0.000000000000000000001 0.3 0.3 0.3 0.3 0.3
## 
## , , sex = f, rate = birth
## 
##     age
## yr   0 1   2   3   4   5
##   1  0 2 2.5 2.5 2.5 2.5
##   2  0 2 2.5 2.5 2.5 2.5
##   3  0 2 2.5 2.5 2.5 2.5
##   4  0 2 2.5 2.5 2.5 2.5
##   5  0 2 2.5 2.5 2.5 2.5
##   6  0 2 2.5 2.5 2.5 2.5
##   7  0 2 2.5 2.5 2.5 2.5
##   8  0 2 2.5 2.5 2.5 2.5
##   9  0 2 2.5 2.5 2.5 2.5
##   10 0 2 2.5 2.5 2.5 2.5
## 
## , , sex = m, rate = birth
## 
##     age
## yr    0  1  2  3  4  5
##   1  NA NA NA NA NA NA
##   2  NA NA NA NA NA NA
##   3  NA NA NA NA NA NA
##   4  NA NA NA NA NA NA
##   5  NA NA NA NA NA NA
##   6  NA NA NA NA NA NA
##   7  NA NA NA NA NA NA
##   8  NA NA NA NA NA NA
##   9  NA NA NA NA NA NA
##   10 NA NA NA NA NA NA
## 
## , , sex = f, rate = survival
## 
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0
## 
## , , sex = m, rate = survival
## 
##     age
## yr     0   1   2   3   4 5
##   1  0.4 0.7 0.8 0.9 0.9 0
##   2  0.4 0.7 0.8 0.9 0.9 0
##   3  0.4 0.7 0.8 0.9 0.9 0
##   4  0.4 0.7 0.8 0.9 0.9 0
##   5  0.4 0.7 0.8 0.9 0.9 0
##   6  0.4 0.7 0.8 0.9 0.9 0
##   7  0.4 0.7 0.8 0.9 0.9 0
##   8  0.4 0.7 0.8 0.9 0.9 0
##   9  0.4 0.7 0.8 0.9 0.9 0
##   10 0.4 0.7 0.8 0.9 0.9 0

Don’t worry about the male birth rate section containing NA – male Tauntauns don’t give birth. We’ll be using this array for our Tauntaun population simulation soon. But first, let’s have a quick break:

Time for a break.

Figure 13.2: Time for a break.

13.4 Stepping through the Life Cycle

Now that all of our rates are calculated, we’re ready to use them to project the population size through time, filling in our pop array. Let’s remind ourselves of the timing of events in the Tauntaun life cycle.

Tauntaun life cycle.

Figure 13.3: Tauntaun life cycle.

Don’t forget the life cycle is an annual cycle, but we must start somewhere. Our starting point will be called the “census”, which is when we count individuals in the population. This census will occur on the birth day, all Tauntauns are born on this day. So a 0 year old is a freshly born Tauntaun, a 1 year is a Tauntaun who is celebrating their first birthday, and so on. This structure is known as a post-breeding census because we count all individuals immediately after they are born. We assume that there are no deaths between the birthday and the harvest, which occurs in the fall. When a Tauntaun biologist analyzes the harvest data, which contains information about dead Tauntans, their analyses reveal something about the living Tauntaun population size immediately before the harvest began. After the harvest is a period of natural survival (one minus the natural survival rate is the natural mortality rate). Any Tauntauns that escape the harvest and escape natural mortality are able to reproduce on their birthdays, and we’ve come full cycle.

The model should be pretty straight forward, and has the following sequence:

  1. Census animals - this is the count of 0-5 year-old animals in each time step. We need to provide this number for the very first year, and thereafter the numbers are calculated.
  2. Harvest animals - we will harvest animals according to their harvest probability.
  3. Natural survival - after the harvest, animals will survive to breeding according to their natural survival rate.
  4. An we’re back to the birthday and census. Those females that survive the harvest and survive the natural mortality period then give birth. The number of offspring per breeding female is determined by the birth.rate. The numbers of 0-year-old males and 0-year-old females is established by the sex.ratio, which we will assume is 50-50. Immediately after the birthday, the population can be counted in year 2.
  5. Do you feel a loop coming on?

13.4.1 The Population Seed

The population model starts with what is called a “population seed”. This is the number of individuals at the start of our population model for all age and sex classes. You can’t create individuals out of thin air!

The population seed is stored in sim_inputs$pop.seed:

 sim_inputs$pop.seed
##       Sex    0    1   2   3   4  5
## 1 females 3000 1500 500 300 100 50
## 2   males 1200  900 300 180  60 30

This is another dataframe, with the starting female population size by ageclass stored in row 1, and the starting male population size stored in row 2. We’ll need to extract this content and add it to the pop array. Before we do that, it is useful to remind ourselves what this array is all about, and how it is structured:

str(pop)
##  logi [1:10, 1:6, 1:2, 1:4] NA NA NA NA NA NA ...
##  - attr(*, "dimnames")=List of 4
##   ..$ yr   : chr [1:10] "1" "2" "3" "4" ...
##   ..$ age  : chr [1:6] "0" "1" "2" "3" ...
##   ..$ sex  : chr [1:2] "f" "m"
##   ..$ stage: chr [1:4] "census" "harvest" "escapees" "survivors"

OK, knowing the structure, let’s now add in the population seed:

# fill in the female population seed
female.seed <- as.matrix(sim_inputs$pop.seed[1, 2:7])
pop[yr = 1, age = , sex = "f", stage = "census"] <-  female.seed

# fill in the male population seed
male.seed <- as.matrix(sim_inputs$pop.seed[2, 2:7])
pop[yr = 1, age = , sex = "m", stage = "census"] <- male.seed 

# have a look
pop[yr = 1, age = , sex = , stage = "census"]
##    sex
## age    f    m
##   0 3000 1200
##   1 1500  900
##   2  500  300
##   3  300  180
##   4  100   60
##   5   50   30

With our population seeded, we are now ready to begin. From here on out, all of our Tauntaun numbers through time will be calculated.

Before we actually write our simulation loop code, it will be useful to have an understanding of how the Tauntaun numbers will be calculated for each stage in one life cycle iteration. We’ll track the females only in the rest of this section so that you see the 1st year of the life cycle in its entirety. After that, we’ll start from the beginning and code the entire Tauntaun simulation as a loop.

13.4.2 The Harvest Season

To begin, to calculate the number of female Tauntauns harvested in year 1, we can multiply the census number of individuals of a specific age and sex class by their respective harvest probabilities. This is easy to accomplish….you just need to use the correct indexing for both the pop array and the rates array.

# look at the female seed population in year 1: 
# these represent the females that can potentially be harvested; 
pop[yr = 1, age = , sex = "f", stage = "census"] 
##    0    1    2    3    4    5 
## 3000 1500  500  300  100   50
# look at the female harvest rates for year 1:  
rates[yr = 1, age = , sex = "f", rate = "harvest"]
##                       0                       1                       2                       3                       4                       5 
## 0.000000000000000000001 0.200000000000000011102 0.200000000000000011102 0.200000000000000011102 0.200000000000000011102 0.200000000000000011102

Next, we’ll perform element-by-element multiplication to get the number of harvested animals. We’ll round these values to 0 with the round function (rounding to 0 digits, the default). This is done only so that we don’t harvest a fraction of an individual.

# fill in the year 1 female harvest
pop[yr = 1, age = , sex = "f", rate = "harvest"] <- 
          round(pop[yr = 1, age = , sex = "f", stage = "census"] *
          rates[yr = 1, age = , sex = "f", rate = "harvest"])

# look at the number of harvested females
pop[yr = 1, age = , sex = "f", rate = "harvest"] 
##   0   1   2   3   4   5 
##   0 300 100  60  20  10

You should be able to verify that the number harvested is the number of individuals times their harvest rate for each class.

However, our model will include stochasticity - that is, we would like to introduce elements of randomness to our model. We do so here not because we really want a stochastic Tauntaun population model (we really don’t need this complexity), but because it provides us an opportunity to show you, the reader, how to use if-else coding structures, describe random number seeds, and introduce some basic probability functions in R.

13.4.2.1 The Binomial Distribution

Instead of simply multiplying the census by the harvest rate (with accompanying rounding) to compute our harvested individuals, we will instead make use of the the binomial distribution. Let’s quickly introduce the binomial distribution now. Of course, our first stop will be the distribution’s help page:

help(topic = 'Binomial')

Scan the helpfile, and you should see that there are 4 functions related to the binomial distribution: dbinom, pbinom, qbinom and rbinom. We will use two of these in our Tauntaun simulation (qbinom and rbinom).

In a nutshell, the shape of a binomial distribution is controlled by two parameters: the probability of success and the number of trials. For example, if we set the probability of success to 0.5 (and take this to mean the probability of harvest) and the number of trials to 10 (and take this to mean 10 Tauntauns), the binomial distribution looks like this:

plot(x = 0:10,
     y = dbinom(0:10, 10, prob = 0.5, log = F),
     main = "Binomial Distribution: n = 10, p = 0.5",
     xlab = "Harvested Tauntauns out of 10 Total", 
     ylab = "Probability", col = "Red", 
     type = "h")

This is a binomial distribution for p (the probability of success, which R calls “prob”) = 0.5 and n (the number of trials is 10, which R calls “size”). On the x-axis, we have the number of harvested Tauntauns, and the y-axis is probability. From this graph, we can see that the probability of harvesting, say, 4 Tauntans out of 10, given a harvest rate of 0.5 is 0.205. If we were to sample from this distribution, can you see that it would be less likely to harvest, say, 0 or 10 Tauntauns (which are near the tails of the distribution) than to harvest 5 Tauntauns (which is towards the middle of the distribution)? The most likely number of harvested animals is 0.5 * 10, which is 5 Tauntauns. You can get the most likely value with the qbinom function, and setting the argument called p to 0.5 (in case you were wondering, this will return the middle of the cumulative binomial distribution):

qbinom(p = 0.5, size = 10, prob = 0.5)
## [1] 5

Change the parameters of the binomial function, and you get a different binomial distribution. For example, here is a binomial distribution where p (the probability of success, which again we take to mean the probability of harvest) is 0.2:

plot(x = 0:10,
     y = dbinom(0:10, size = 10, prob = 0.2, log = F),
     main = "Binomial Distribution: n = 10, p = 0.2",
     xlab = "Harvested Tauntauns out of 10 Total", 
     ylab = "Probability", col = "Red", 
     type = "h")

Here, with the probability of harvest set at 0.2, and 10 Tauntauns available to be harvested, it is much more likely that we will observe 0, 1, 2, 3, or 4 Tauntauns compared to 7, 8, 9, or 10 Tauntauns. Again, we can use the qbinom function to return the most likely value:

qbinom(p = 0.5, size = 10, prob = 0.2)
## [1] 2

Suppose we would like to make the harvest stochastic. How exactly do you sample from this distribution? In Excel, you may have used the “BINOM.INV” function and set the argument named Alpha to a random number if you want a stochastic draw:

The BINOM.INV function in Excel.

Figure 13.4: The BINOM.INV function in Excel.

In R, we use the rbinom function, where the letter r is a tip that we will be drawing random values from a distribution. Let’s look at this function’s arguments:

# look at the arguments of `rbinom`
args(rbinom)
## function (n, size, prob) 
## NULL

Here, n is the number of samples to draw, and size and prob define which binomial distribution is of interest. Now, let’s draw 15 random values from a binomial distribution whose trials (size) = 10 and probability of success (prob) = 0.2.

rbinom(n = 15, size = 10, prob = 0.2)
##  [1] 3 2 3 1 1 1 3 4 3 1 1 1 3 1 1

Note that each result is an integer…you can’t get 1.5 harvested animals out of 10 trials.

13.4.2.2 Random Number Seeds

Let’s run the exact same code again two more times, and compare results:

# draw 15 random values from a given binomial distribution
rbinom(n = 15, size = 10, prob = 0.5)
##  [1] 4 8 5 5 6 6 5 4 3 6 4 5 3 5 3
# repeat this sampling once more
rbinom(n = 15, size = 10, prob = 0.5)
##  [1] 5 7 4 9 5 7 6 7 3 6 7 6 6 4 5

What happened here? Well, when you ran rbinom with the exact same arguments, you very likely got two different results. That’s the heart of the function . . . it is a random draw. Most software programs, including R, have random number generators that aren’t truly random. Instead, they are called “pseudo random numbers”, which are random numbers that are obtained by using an algorithm (a set of rules). Check out this site for more background information, which has this to say:

A pseudorandom number generator (PRNG), also known as a deterministic random bit generator (DRBG), is an algorithm for generating a sequence of numbers that approximates the properties of random numbers. The sequence is not truly random in that it is completely determined by a relatively small set of initial values, called the PRNG’s state, which includes a truly random seed. Although sequences that are closer to truly random can be generated using hardware random number generators, pseudorandom numbers are important in practice for their speed in number generation and their reproducibility.

R has many functions that allow you to draw random values from a variety of probability distributions (the binomial distribution is just one of them). Because it’s nice to be able to reproduce your simulations exactly, R has a set.seed function. If you set the random number seed before you run the function, you should be able to get the exact same random sample. Take a look:

# now set the seed, choose any number, say 42
set.seed(42)
# and run it the first time
rbinom(n = 10, size = 10, prob = 0.5)
##  [1] 7 7 4 7 6 5 6 3 6 6
# set the seed before you call the function
set.seed(42)
# run the code again
(samples <- rbinom(n = 10, size = 10, prob = 0.5))
##  [1] 7 7 4 7 6 5 6 3 6 6

Same result! The bottom line is that if you need to recreate your simulation exactly, you must use the set.seed function.

Back to the female Tauntaun harvest. Earlier, we calculated our harvested population size by multiplying the year 1 female census by the year 1 female harvest rate:

# fill in the year 1 female harvest
pop[yr = 1, age = , sex = "f", rate = "harvest"] <- 
          round(pop[yr = 1, age = , sex = "f", stage = "census"] *
          rates[yr = 1, age = , sex = "f", rate = "harvest"])

# look at the number of harvested females
pop[yr = 1, age = , sex = "f", rate = "harvest"] 
##   0   1   2   3   4   5 
##   0 300 100  60  20  10

Now let’s use the binomial distribution functions instead. First, let’s create a new object called stoch.harvest. This object will have a logical datatype whose value will be TRUE if we wish the harvest to be stochastic and FALSE if not.

# create a new object and set its value to FALSE for now
stoch.harvest <- FALSE

If stoch.harvest is FALSE, we will use qbinom to get the number of harvested females in year 1, like this:

qbinom(p = 0.5,
       size = pop[yr = 1, age = , sex = "f", stage = "census"], 
       prob = rates[yr = 1,age = , sex = "f", rate = "harvest"])
##   0   1   2   3   4   5 
##   0 300 100  60  20  10

Alternatively, if stoch.harvest is TRUE, we will use rbinom to get the number of harvested females in year 1. We only want 1 random draw for each age class. But we’ll take advantage of the fact that R can run rbinom in a vectorized manner if we feed it a vector of six 1’s (representing a single draw from each age class), a vector of trials (representing the number of females in each age class, stored in pop[1,,“f”,“census”]), and a vector of probabilities (representing the harvest probability for each age class, stored in rates[1,,“f”,“harvest”]):

# now use rbinom to make a single draw from a binomial distribution for each age class
rbinom(n = rep(1, times = 6),
       size = pop[yr = 1, age = , sex = "f", stage = "census"], 
       prob = rates[yr = 1,age = , sex = "f", rate = "harvest"])
## [1]   0 298  96  52  27   6

13.4.2.3 If-Else Coding

We’re at a point in our population model code where we need to implement the harvest, but we need to execute this in one of two ways, depending on whether the harvest is stochastic or not. We need an if-then function to help us through this. If you’ve used Excel, you’ve seen the IF function, which has three arguments:

The IF function in Excel.

Figure 13.5: The IF function in Excel.

In the first argument you enter a logical test (in our case, is stoch.harvest == TRUE). In the second argument, you enter what Excel should return if the logical test is true. In the third argument, you enter what Excel should return if the logical test is false.

R has a similar structure for if-then-else coding:

  • Begin your code with the word if.
  • Then enter your logical test within a set of parentheses.
  • Then, tell R what to do if the logical test is true within a set of braces { }.
  • Then enter the word else.
  • Then tell R what to do in a second set of braces.

In words, it will look something like this:

if(stoch.harvest == TRUE) {
   use rbinom() to generate the number harvested
  }
  else {
   use qbinom() to generate the number harvested
  }

Let’s try it to get our total female harvest in year 1:

if (stoch.harvest == TRUE) {
   pop[yr = 1, age = , sex = "f", stage = "harvest"] <- rbinom(n = rep(1, times = 6),
       size = pop[yr = 1, age = , sex = "f", stage = "census"], 
       prob = rates[yr = 1,age = , sex = "f", rate = "harvest"])
   } else 
   {
  pop[yr = 1, age = , sex = "f", stage = "harvest"] <- qbinom(p = 0.5,
       size = pop[yr = 1, age = , sex = "f", stage = "census"], 
       prob = rates[yr = 1,age = , sex = "f",rate = "harvest"])
    } # end if 

# look at the number harvested in year 1
 pop[year = 1, age = , sex = "f", stage = "harvest"] 
##   0   1   2   3   4   5 
##   0 300 100  60  20  10

Your numbers should match ours because stoch.harvest is currently set to FALSE. Notice that it is helpful to align your braces and add comments so that you can find sections within this code easily. Also note that RStudio allows you to “collapse” the code by pressing the little down-arrow to the right of the code line number. This collapses the IF function code so that you can hide portions of it if your code is getting too “thick”. Try it!

13.4.3 Computing Non-Harvested Totals

After the harvest is over, our surviving females should be counted. We’ve named these “escapees” and we’ll store them in the pop array under the “escapees” section.

# review the structure of the pop array
str(pop)
##  num [1:10, 1:6, 1:2, 1:4] 3000 NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "dimnames")=List of 4
##   ..$ yr   : chr [1:10] "1" "2" "3" "4" ...
##   ..$ age  : chr [1:6] "0" "1" "2" "3" ...
##   ..$ sex  : chr [1:2] "f" "m"
##   ..$ stage: chr [1:4] "census" "harvest" "escapees" "survivors"

We can calculate the number of escapees as the number in the census minus the number harvested.

# fill in the female escapees by subtraction
(pop[1,,"f","escapees"]  <- pop[yr = 1, , "f", "census"] - pop[yr = 1, , "f", "harvest"])
##    0    1    2    3    4    5 
## 3000 1200  400  240   80   40

These are the lucky female Tauntauns that escaped the harvest in year 1.

13.4.4 The Natural Survival Period

Now that the harvest is over, let’s continue to follow our “seed” population through the first time step of our model. The next phase deals with natural survival. If you recall, after the harvest is over, some Tauntauns will die of natural causes. As Hans Solo discovered, even Tauntauns can freeze to death! This portion of the model is not stochastic, and we could simply multiply the escapees times their natural survival rate to give us the number that survive this period. However, let’s put the qbinom function back to work:

  pop[yr = 1, age = , sex = "f", stage = "survivors"] <- qbinom(p = 0.5,
       size = pop[yr = 1, age = , sex = "f", stage = "escapees"], 
       prob = rates[yr = 1,age = , sex = "f",rate = "survival"])

# look at the number of survivors in year 1
 pop[year = 1, age = , sex = "f", stage = "survivors"] 
##    0    1    2    3    4    5 
## 1200  840  320  216   72    0

The 0 entry for 5 years olds is correct: Tauntauns cannot survive to their 6th birthday. However, all of the other age classes will advance 1 level on their birthday,

13.4.5 Birth Rates and Census

When individuals advance 1 age class on their birthday, the following birth rates will apply to females:

rates[1, , "f", "birth"]
##   0   1   2   3   4   5 
## 0.0 2.0 2.5 2.5 2.5 2.5

These are the per capita birth rates for individuals that just had their birthday. A “fresh” 3-year-old female Tauntan will give birth to 2.5 offspring each year, on average.

However, we can’t simply do element by element multiplication here to get the total births produced by age class. Why not? Because we need to offset the population vector to advance individuals (i.e., get rid of the current 5 year olds), and we further need to lop off the 0-year old birth rate. Thus, to compute total births, we must consider these offsets. We can do so with the help of the head and tail functions, as shown below:

# get the first five entries from the pop array
breeders <- head(pop[year = 1, age = , sex = "f", stage = "survivors"], n = 5)

# get the last five entries from the rates array
birth.rates <- tail(rates[1, , "f", "birth"], n = 5)

# compute the total births by age class by vector multiplication 
births <- round(breeders * birth.rates)

These are the total births produced by each age class. If we assume a 50-50 sex ratio, we can use the qbinom to allocate these offspring by sex, as follows:

# compute the male offspring with the qbinom function
(male.offspring <- qbinom(p = 0.5, size = births, prob = 0.5))
##    0    1    2    3    4 
## 1200 1050  400  270   90
# compute the female offspring as the total births minus male.offspring
(female.offspring <- births - male.offspring)
##    0    1    2    3    4 
## 1200 1050  400  270   90

That brings us through an entire year for our seed cohort.

13.4.6 Year 2 Census

We’ve actually done the calculations for year 2 census . . . you just might not realize it yet. What would our population look like in year 2?

  • Well, since the birthday is on the census, we can assume that all of the young produced by our population seed females would be counted as 0-year-olds in year 2.
  • What about the number of 1-year-olds? The number of one year olds in year 2 would be the number of 0-year-olds in year 1 that escaped harvest and survived through the natural mortality period. These numbers are currently stored in ‘survivors’ portion of our pop array.
  • What about the number of 2-year-olds? The number of 2-year-olds in year 2 depends on the number of 1-year-olds from the seed that escaped the harvest and survived to the year 2 census. These numbers are currently stored in ‘survivors’ portion of our pop array.
  • Same goes for the 3 and 4-year-olds.
  • The age 5 individuals are assumed to die and are not counted.

13.5 Looping in R

It would be very inefficient to enter calculations for year 2 because we’d have to enter them again for years 3, 4, . . . 10 of the simulation. What this needs is a loop . . . a coding structure that allows a certain set of instructions to be repeated over and over again until we tell it when to stop.

R offers a handful of loop options. If we wanted the loop to iterate until a certain criterion is met, we would use a while loop. However, in our population model, we know we want 10 years for a population simulation, so we can define exactly how many iterations we need the loop to pass through to populate the pop array. To do this, we use a for loop.

Let’s remind ourselves about R’s looping structure, and see a for loop in action.

# look at the built-in vector called letters
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
# allocate an empty vector that will store results of the loop
output <- vector(length = length(letters), mode = "character")

# loop through each element of numbers, and add the number 1 to it

for (i in 1:length(letters)) {
  output[i] <- paste(letters[i], "_out")
}

# look at the object, numbers
output
##  [1] "a _out" "b _out" "c _out" "d _out" "e _out" "f _out" "g _out" "h _out" "i _out" "j _out" "k _out" "l _out" "m _out" "n _out" "o _out" "p _out"
## [17] "q _out" "r _out" "s _out" "t _out" "u _out" "v _out" "w _out" "x _out" "y _out" "z _out"

The for loop has the following code structure:

  • the word “for” starts the loop
  • this is followed by a set of parentheses, inside of which there is an identifier that identifies which loop you are on. Here, we named our identifier i (for iteration). We could have named it something more descriptive if we wished. The code, i in 1:10, means we’ll let i take on the values 1 through 10. When the loop starts, a new variable called i will be created, and it will have a value of 1.
  • this is followed by a set of braces { }, inside of which is code that specifies what to do. Here, we’ll take the vector, letters, and we’ll extract the ith element from it and paste "_out" to the end of it. For example, if we are on loop # 4 (i = 4), we’ll extract the fourth element from letters (d) and paste "_out" to it, creating the output “d_out” as the fourth element of the results vector.

In the next section, we’ll use a loop to create our final Tauntaun population model.

13.6 Final Tauntaun Population Model

Are you ready to add a loop and finalize our Tauntaun population model? We’ll run the entire code in one code “chunk” because R can’t break the loop (unless you are in a function and are using a debugger). We’ve been through each and every step of this model…the main difference is that we are going to put it all together in a loop.

At the beginning of our code, we set stoch.harvest to be TRUE or FALSE. We also reset our pop array, and select a random number seed so that we can reproduce our simulation completely.

The loop code starts with for (i in 1:10) {. This sets up a for loop for each year in our simulation. In the very first block of code, if(i == 1) evaluates to true, so the population seed for males and females will be entered into the pop array, and the else will be ignored. Then we move to step 2, which is the harvest. Here, we have an if code that carries out the harvest depending on whether the harvest is stochastic or not. Then we move to step 3, which simply counts the escapees. In step 4, we use the qbinom function to calculate the number of individuals who survive the natural mortality period. This ends our loop for i = 1, and i will then be set to 2. Since i is no longer 1, the else portion of the code at the top of the loop is executed. In this code, all individuals advance to the next age class, and moments later, we count the surviving females that make up the breeding population (1-year-olds on up). These breeders reproduce according to the birth rate, and we then allocate the offspring as males and females. These represent the 0-year-olds in the this time step, and the loop continues. Here we go . . .

Note: make sure to copy the entire script here and paste it in one fell-swoop. The most common error is a missing curly brace. If you are getting an error, check for matching braces. Don’t forget that you can click just to the left of an opening brace or bracket, and RStudio should highlight the matching bracket or brace.

# reset the pop array for a clean start
pop[,,,] <- 0

# set the stochastic settings
stoch.harvest <- FALSE

# set a random number seed for reproducibility
set.seed(4242)

# begin loop
for (i in 1:10) {  
      
 # Step 1. Population Census
       
  if (i == 1) {
      
      # Initialize the population seed for year 1
      pop[1, age = , sex = 'f', stage = 'census'] <- female.seed
      pop[1, age = , sex = 'm', stage = 'census'] <- male.seed
        
      } else {
      
      # advance individuals in previous year to the next age class. 
      pop[yr = i, age = 2:6, , stage = 'census'] <- pop[(i - 1), age = 1:5, , 'survivors']
      
      # create 0 year olds
      breeders <- pop[yr = i, age = 2:6, "f", stage = 'census']
      birth.rates <- rates[yr = i, age = 2:6, "f", "birth"]
      births <- round(breeders * birth.rates)
      
      # allocate offspring by sex
      male.offspring <- qbinom(p = 0.5, size = births, prob = 0.5)
      female.offspring <- births - male.offspring
      
      # add to pop array
      pop[yr = i, age = 1 , sex = 'f', stage = 'census'] <- sum(female.offspring)
      pop[yr = i, age = 1, sex = 'm', stage = 'census'] <-  sum(male.offspring)
  
  } # end of not i = 1
      
 # Step 2. Harvest -----------------------------------
  
  if (stoch.harvest == TRUE) {
        
      # female harvest
      pop[yr, age = , sex = "f", stage = "harvest"] <- rbinom(n = rep(1, times = 6),
         size = pop[yr = i, age = , sex = "f", stage = "census"], 
         prob = rates[yr = i, age = , sex = "f", rate = "harvest"])
        
      # male harvest
      pop[yr = i, age = , sex = "m", stage = "harvest"] <- rbinom(n = rep(1, times = 6),
         size = pop[yr = i, age = , sex = "m", stage = "census"], 
         prob = rates[yr = i, age = , sex = "m", rate = "harvest"])
     
   } else {
       
      # female harvest
      pop[yr = i, age = , sex = "f", stage = "harvest"] <- qbinom(p = 0.5,
         size = pop[yr = i, age = , sex = "f", stage = "census"], 
         prob = rates[yr =  i, age = , sex = "f", rate = "harvest"])
        
      # male harvest
      pop[yr = i, age = , sex = "m", stage = "harvest"] <- qbinom(p = 0.5,
         size = pop[yr = i, age = , sex = "m", stage = "census"], 
         prob = rates[yr = i, age = , sex = "m", rate = "harvest"])
  
    } # end if 
      
 # Step 3.  Escapees ---------------------------------
  
    pop[yr = i, , "f","escapees"]  <- pop[yr = i, , "f", "census"] - pop[yr = i, , "f", "harvest"]
    pop[yr = i, , "m","escapees"]  <- pop[yr = i, , "m", "census"] - pop[yr = i, , "m", "harvest"]
      
  # Step 4. Survivors -----------------

    # female survivors  
    pop[yr = i, age = , sex = "f", stage = "survivors"] <- qbinom(p = 0.5,
           size = pop[yr = i, age = , sex = "f", stage = "escapees"], 
           prob = rates[yr = i, age = , sex = "f",rate = "survival"])
      
    # male survivors
    pop[yr = i, age = , sex = "m", stage = "survivors"] <- qbinom(p = 0.5,
           size = pop[yr = i, age = , sex = "m", stage = "escapees"], 
           prob = rates[yr = i, age = , sex = "m",rate = "survival"])

  } # end of year loop

And now we can look at the results of the 10-year simulation, focusing on the population census:

# look at the final population census results
pop[yr = , age = , sex = , stage = "census"]
## , , sex = f
## 
##     age
## yr      0    1   2   3   4   5
##   1  3000 1500 500 300 100  50
##   2  3010 1200 840 320 216  72
##   3  3201 1204 672 538 231 156
##   4  3356 1280 674 431 388 167
##   5  3516 1342 717 431 311 280
##   6  3589 1406 752 459 311 224
##   7  3717 1436 788 482 330 224
##   8  3856 1487 804 505 348 238
##   9  3996 1542 833 515 364 251
##   10 4138 1598 864 534 371 262
## 
## , , sex = m
## 
##     age
## yr      0    1   2   3   4   5
##   1  1200  900 300 180  60  30
##   2  3010  480 441 168 114  38
##   3  3200 1204 235 247 106  72
##   4  3355 1280 590 132 156  67
##   5  3516 1342 627 331  83  98
##   6  3589 1406 657 351 209  52
##   7  3715 1436 689 368 222 132
##   8  3855 1486 704 386 232 140
##   9  3996 1542 728 395 243 146
##   10 4136 1598 755 408 249 153

13.7 Converting Arrays to Dataframes

It’s useful to turn our results into a data frame for graphing. Plus, don’t forget that our ultimate goal is to create the harvest.csv file that we downloaded from the web. We have the basic dataset, but now we need to add additional information to it, and add some mistakes (typos) to it as well.

First, let’s convert the array to a dataframe with the spread function in the package, tidyr. This package is part of the tidyerse, an “opinionated collection of R packages designed for data science.” This function, and others in tidyr, allow a user to reshape the data in one R object into a different shape. Let’s load the package and then scan the helpfile.

# load tidyr
library(tidyr)

# look at the spread helpfile
help(topic = 'spread', package = 'tidyr')

Three of arguments to the spread function are required:

args(spread)
## function (data, key, value, fill = NA, convert = FALSE, drop = TRUE, 
##     sep = NULL) 
## NULL

For the first argument, you pass in some data. The second argument, key is how you’d like to pivot the array. We would like to collapse the array into a dataframe, such that each entry in the arrays “stage” element is returned with a frequency by year, age, and sex. Let’s try it:

data <- tidyr::spread(as.data.frame.table(pop), key = stage, value = Freq)

# look at the structure of the returned data
str(data)
## 'data.frame':    120 obs. of  7 variables:
##  $ yr       : Factor w/ 10 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ age      : Factor w/ 6 levels "0","1","2","3",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ sex      : Factor w/ 2 levels "f","m": 1 2 1 2 1 2 1 2 1 2 ...
##  $ census   : num  3000 1200 1500 900 500 300 300 180 100 60 ...
##  $ harvest  : num  0 0 300 270 100 90 60 54 20 18 ...
##  $ escapees : num  3000 1200 1200 630 400 210 240 126 80 42 ...
##  $ survivors: num  1200 480 840 441 320 168 216 114 72 38 ...
# look at the first 10 records
head(data, n = 10)
##    yr age sex census harvest escapees survivors
## 1   1   0   f   3000       0     3000      1200
## 2   1   0   m   1200       0     1200       480
## 3   1   1   f   1500     300     1200       840
## 4   1   1   m    900     270      630       441
## 5   1   2   f    500     100      400       320
## 6   1   2   m    300      90      210       168
## 7   1   3   f    300      60      240       216
## 8   1   3   m    180      54      126       114
## 9   1   4   f    100      20       80        72
## 10  1   4   m     60      18       42        38

See if you can match these values up with the pop array results, and confirm the data are correct. Notice that the column, yr is now a factor. We’ll want to convert these to integers for plotting:

data$yr <- as.integer(data$yr)

Now that we have a dataframe, we can easily plot our Tauntaun population size through time:

library(ggplot2)

# fix this
ggplot(data, aes(x = yr, y = census, group = age)) + 
  geom_line(aes(linetype = age)) +
  theme_minimal()

What’s the benefit of simulating the full population cycle for Tauntauns? Well, in previous chapters, you created a summary of the Tauntaun harvest with markdown, and created the SAK function to estimate population size and look at trends. By simulating a population, you can directly compare your estimated results (from the SAK function) with the true, simulated population. Simulation also allows you to use your population model as if it were an experiment; you can learn a lot by tweaking inputs and exploring how the dynamics change. In this exercise, you can add “noise” to the data by setting the harvest stochasticity settings to TRUE.

Now, there are many, many different ways we could have coded this model, and you will undoubtedly find your own coding style.

Our next steps will be to take the reported animals and masterfully create two csv files that provide information on each harvested animal, along with a hunter dataset. Before we do that, let’s have another break!

Time for another break!

Figure 13.6: Time for another break!

13.8 Creating the Tauntaun Harvest Data CSV File

The Tauntaun biologist will not have access to the true population data. The only “glimpse” they have is the actual, reported harvest - information about a pile of dead Tauntauns. In this section, we’ll shape the harvest data into the csv file that you worked with in Chapters 6-8. This includes adding additional information about each harvested animal (such as fur color), and the town in which the animal was harvested. To add the town, we’ll dig into the spatial shapefile that contains the towns of Vermont (stored in your directory called towns), and then extract the shapefile’s attribute table. We’ll also add some mistakes to the data (because you almost never will have a perfectly proofed dataset before you analyze the data).

We’ll also simulate a hunter dataset (another csv file) that contains information about the person who harvested each reported animal (Han Solo is not on this list . . . a poacher!).

If you’ve had a break, you might need a reminder of our simulation outputs, now stored as a dataframe. Let’s have a look:

# look at the data's structure
str(data)
## 'data.frame':    120 obs. of  7 variables:
##  $ yr       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ age      : Factor w/ 6 levels "0","1","2","3",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ sex      : Factor w/ 2 levels "f","m": 1 2 1 2 1 2 1 2 1 2 ...
##  $ census   : num  3000 1200 1500 900 500 300 300 180 100 60 ...
##  $ harvest  : num  0 0 300 270 100 90 60 54 20 18 ...
##  $ escapees : num  3000 1200 1200 630 400 210 240 126 80 42 ...
##  $ survivors: num  1200 480 840 441 320 168 216 114 72 38 ...
# look at the first five records
head(data, n = 5)
##   yr age sex census harvest escapees survivors
## 1  1   0   f   3000       0     3000      1200
## 2  1   0   m   1200       0     1200       480
## 3  1   1   f   1500     300     1200       840
## 4  1   1   m    900     270      630       441
## 5  1   2   f    500     100      400       320

13.8.1 Trimming Columns

The harvest data are stored in the column called “harvest”. We don’t need other columns such as “escapees” or “survivors”. Let’s trim these columns out now:

data <- data[, c("yr", "age", "sex", "harvest")]

# peek at the trimmed data
head(data)
##   yr age sex harvest
## 1  1   0   f       0
## 2  1   0   m       0
## 3  1   1   f     300
## 4  1   1   m     270
## 5  1   2   f     100
## 6  1   2   m      90

13.8.2 Expanding the Dataframe Rows

This looks more like a harvest dataset. However, the harvest data normally won’t be recorded in this summarized format. Usually, each new harvested Tauntaun is added to the dataset, one record at a time. So we need to develop a record (row) for each harvested individual.

The number of individuals for each year is identified in the column ‘harvest’; we need a function that goes down the harvest vector row-by-row, replicating each row the number of times listed in the ‘harvest’ column.

For instance, our third record shows that 300 f individuals were harvested in year 1 that were 1 years old. We need to expand our grid so that each of the 300 individuals is stored on a unique row in the dataset.

The rep function does this (plus it can also be used to repeat elements of a vector). What we’ll do is first look at the row names of our data frame, and then we’ll repeat each row name by the number of animals harvested.

When used to repeat rows of a data frame, rep illustrates that the row name in a dataframe can actually reference the contents of the row.

# get the row  names and store as a vector called rows
rows  <- row.names(data)

# create a new vector called new.rows
# this repeats the elements within rows by the number of time stored in the harvest column 
new.rows  <- rep(rows, times = data$harvest)

# now put it all together
data <- data[rep(row.names(data), data$harvest), 1:3]

# look at the first five records
head(data)
##     yr age sex
## 3    1   1   f
## 3.1  1   1   f
## 3.2  1   1   f
## 3.3  1   1   f
## 3.4  1   1   f
## 3.5  1   1   f

Notice that after we run rep, the row names are automatically changed so they remain unique (e.g., 3, 3.1 ,3.2, 3.3, etc). What is shown are some of the expanded Tauntauns from row 3 of our original data. In row 3, the original data noted that there were 300 harvested animals, you’re looking at the first 6 of these individuals. What happened to rows 1 and 2 in the original dataset? Well, since no Tauntauns of age class 0 were harvested, this information is omitted.

We will reassign integer-looking row names for visual clarity (but remember that R stores all row and column names as characters). First, we use the nrow function to count the number of rows in the table, and the colon operator to establish an integer sequence from 1 to the total number of rows. Then, we use the row.names function to assign row names with this sequence.

# reassign the row names using the row.names function.
row.names(data) <- 1:nrow(data)

Now, we’ll create a new column called ‘individul’ which takes on the new row name. This is a typo that will need to be fixed in chapter 6. You’ll notice that the row names are characters, and although there is nothing wrong with this, we’ll use the as.numeric function to create unique numbers for each harvest record.

# create new column called "individul" (a typo)
data['individul'] <- as.numeric(row.names(data))

# peek at the records again
head(data)
##   yr age sex individul
## 1  1   1   f         1
## 2  1   1   f         2
## 3  1   1   f         3
## 4  1   1   f         4
## 5  1   1   f         5
## 6  1   1   f         6

Converting the unique row names to integer-like characters is for more than just visual clarity. We’ve created a new column that uniquely identifies each individual, so that this column acts like a “primary key” for those of you familiar with databases.

Now, our Tauntaun harvest dataset is starting to look more like the way a Tauntaun biologist would normally see it. In the next section, we’ll add more columns of information for each harvest record.

13.8.3 Adding New Columns To A Data Frame

The first new column we’ll add to our data is a column called species, and we’ll populate it with the word “tauntaun”.

# A new column can also be added by naming it and supplying a single value for all rows
data['species']  <- 'tauntaun'

Next, let’s introduce a few typos into this column so that we can practice repairing them in Chapter 6. Here, we randomly add a couple of misspelled species names into an otherwise uniform field.

This will let us use the sample function. The sample function has four arguments; here we’ll use three. The first argument, x, defines the samples you can draw from. The argument size is the number of samples you want. The argument, replace indicates whether you want to sample with replacement or not. Notice the default is FALSE, so if you don’t specify this argument, you will sample without replacement (which means you can’t have duplicates in your sample).

# look at the arguments for the `sample` function
args(sample)
## function (x, size, replace = FALSE, prob = NULL) 
## NULL
# sample 14 rows between 1 and nrow without replacement; store as indices1
(indices1  <- sample(1:nrow(data), size = 14, replace = FALSE))
##  [1]  6404 11057 13290 12093  4667 11052 10602  8791 10660  3972  1597  4755  2903 13296
# sample of 20 rows between 1 and nrows without replacement; store as indices2
(indices2  <- sample(1:nrow(data), size = 20, replace = FALSE))
##  [1] 12969 11333  8616  5630 11289 10036  8235 12256  5531  8988 11630 11010  2476  1331 13034  2277  7569 12352 12546  7340

What exactly was returned? Well, the object named indices1 has a length of 14. It is 14 randomly sampled row numbers from our harvest dataset. the object named indices2 has a length of 20. It is 20 randomly sampled rows from our harvest dataset. We’ll make use of these two objects next:

# introduce a typo in the spelling of tauntaun for the randomly selected rows
data[indices1, 'species'] <- "tantaun"
data[indices2, 'species'] <- "tauntaan"

Here, we replaced the value stored in data[indices1, ‘species’] with a typo, “tantaun”. The rows are provided by the indices1 object, and the column is the species column.

We’ll continue to add more attributes to the harvest dataset, including the “id” number of the hunter. We’ll create a new column called hunter.id. So, hunter.id of 1 points to hunter number 1 in the hunter dataset (which we’ll create in a few minutes). Here, we’ll create a new column in data called hunter.id, and assign a value to each record by sampling between 1 and hunters. In this case, we’ll set the replace argument to TRUE so that hunters can harvest more than one Tauntaun.

# assign the total number of known tauntaun hunters
hunters <- 700

# add a column called hunter.id to the data
data['hunter.id'] <- sample(1:hunters, size = nrow(data),  replace = TRUE)

# look at the heading of the data
head(data)
##   yr age sex individul  species hunter.id
## 1  1   1   f         1 tauntaun       285
## 2  1   1   f         2 tauntaun        56
## 3  1   1   f         3 tauntaun       549
## 4  1   1   f         4 tauntaun       296
## 5  1   1   f         5 tauntaun       521
## 6  1   1   f         6 tauntaun       355

Next, we’ll add the date of harvest to our dataset. We’ll assume the legal harvest occurs in the months of October, November, and December. As you know, these have month numbers of 10, 11, and 12. We’ll use the sample function to randomly assign each Tauntaun harvest to one of these months. This time, we’ll include the argument called prob so that we can sample according to the monthly probabilities.

# add a field named 'month'
data['month'] <- sample(c(10, 11, 12), nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2) )

# look at the first six records
head(data)
##   yr age sex individul  species hunter.id month
## 1  1   1   f         1 tauntaun       285    10
## 2  1   1   f         2 tauntaun        56    11
## 3  1   1   f         3 tauntaun       549    12
## 4  1   1   f         4 tauntaun       296    12
## 5  1   1   f         5 tauntaun       521    11
## 6  1   1   f         6 tauntaun       355    10

13.8.4 The “ifelse” Function

To insert day of the month is a little trickier, since not all months have the same number of days. Fortunately, both October and December have 31 days, leaving only November with 30. For simplicity we first populate the ‘day’ column with NAs. Then, we can use an ifelse function to work through the day column and replace the NA with a randomly sampled day between 1 and 30 (for November) or with a randomly sampled day between 1 and 31 (for October and December).

This is the first time we’ve used the ifelse function, so let’s take a moment to discuss it. If you check out the helpfile, you’ll see that “ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.” This works a lot like Excel’s IF function. The first argument must be a logical test (something where the answer is TRUE or FALSE). The second argument specifies what to do if the answer is TRUE. The third argument specifies what to do if the answer evaluates to FALSE.

In the code below, we first set up a column called ‘day’ and populate it with NA’s. Then we use the ifelse function. The logical test is month == “November” (really, month == 11). If the month column indeed is November, then we carry out the second argument, which is collect random samples between 1 and 30 for each record where month == “November”. If the month column is not November, we just return the day column.

In this code, we’ll use another new function, with. This useful function helps you clean up your code to make it a bit more readable. The with function starts with the word with, followed by an open parenthesis, followed by an argument which is the object to work on (in our case, data). Then you add any additional code that involves the object. Finally, you close the with function with a closing parenthesis. When you specify “data” as the object to work on, you no longer need to use the dollar sign referencing because R already knows the object you’re working on . . . you’re free to write month instead of data$month, which some coders think is easier to read.

# make and name the column
data['day'] <- NA

#use the ifelse function to draw random samples between 1:30 if the month is November
data$day  <- with(data, ifelse(month == 11,
                               yes = sample(1:30, size = sum(month == 11), replace = TRUE),
                               no = day))
#look at the data
head(data)
##   yr age sex individul  species hunter.id month day
## 1  1   1   f         1 tauntaun       285    10  NA
## 2  1   1   f         2 tauntaun        56    11  13
## 3  1   1   f         3 tauntaun       549    12  NA
## 4  1   1   f         4 tauntaun       296    12  NA
## 5  1   1   f         5 tauntaun       521    11  21
## 6  1   1   f         6 tauntaun       355    10  NA

That takes care of November. What about day of harvest during October or December? We can use another ifelse function; this time our logical test will evaluate if the month is 10 or 12:

# draw random samples between 1:31 for month == 10 or month == 12
data$day  <- with(data, ifelse(test = month == 10 | month == 12,
                               yes = sample(1:31, size = sum(month == 10 | month == 12), 
                                            replace = TRUE),
                               no = day))

You may be wondering what the difference is between the ifelse function and the if notation for carrying out R code. In short, ifelse is an actual function, whereas if (test) { do something } else { do something else } is not. The if approach is generally used to control the flow of code (if something is true, jump to this bit of code, otherwise jump to that bit of code).

Now we have a familiar looking date, but it is spread across the three columns of month,day, and year. However, our year column doesn’t look much like a year that we know of (unless we are harvesting Tauntauns at the beginning of the common era). Let’s change the year column now by adding a constant to it:

data$yr <- data$yr + 1900

Now then. To create a date for Chapter 6 readers, we need to combine the different fields using paste. This function combines a group of distinct values or objects into a single character vector much like the ‘concatenate’ function in Excel. This is a very useful function. Don’t forget Russell Crow, here. We want to apply this function to all rows, and specify the column names in our paste function.

# create a date character string with the paste function
data['date'] <- paste(data[,'month'],data[,'day'],data[,'yr'], sep = "/")

# look at the structure of data
str(data)
## 'data.frame':    13351 obs. of  9 variables:
##  $ yr       : num  1901 1901 1901 1901 1901 ...
##  $ age      : Factor w/ 6 levels "0","1","2","3",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex      : Factor w/ 2 levels "f","m": 1 1 1 1 1 1 1 1 1 1 ...
##  $ individul: num  1 2 3 4 5 6 7 8 9 10 ...
##  $ species  : chr  "tauntaun" "tauntaun" "tauntaun" "tauntaun" ...
##  $ hunter.id: int  285 56 549 296 521 355 674 297 7 388 ...
##  $ month    : num  10 11 12 12 11 10 11 10 11 12 ...
##  $ day      : int  8 13 26 11 21 18 4 4 18 14 ...
##  $ date     : chr  "10/8/1901" "11/13/1901" "12/26/1901" "12/11/1901" ...

That looks more like a typical date entry.

13.8.5 Adding Towns from Shapefile Attributed Table

Next, we will add in the town in which each record of the harvest table originates. We worked with a town shapefile in Chapter 7, which is stored in the towns directory. As you know, shapefiles consist of multiple files that have the same filename, but have different extensions. Shapefiles consist of many files sharing the same core filename and different suffixes (i.e. file extensions). For example, the town shapefile consists of the following files:

list.files("towns")
## [1] "towns.zip"                     "VT_Data_-_Town_Boundaries.cpg" "VT_Data_-_Town_Boundaries.dbf" "VT_Data_-_Town_Boundaries.prj"
## [5] "VT_Data_-_Town_Boundaries.shp" "VT_Data_-_Town_Boundaries.shx" "VT_Data_-_Town_Boundaries.xml"

The package rgdal can be used to work with such files. We can use the readOGR function to read this shapefile into R. The attribute table is stored in the “data” slot.

# load the package rgdal
library(rgdal)

# use the st_read function to read in the shapefile
towns <- readOGR(dsn = "towns", layer = "VT_Data_-_Town_Boundaries")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\tdonovan\University of Vermont\Spreadsheet Project - General\R Training\Fledglings\towns", layer: "VT_Data_-_Town_Boundaries"
## with 255 features
## It has 8 fields
# get the structure
str(towns, max.level = 1)
## Formal class 'SpatialPolygonsDataFrame' [package "sp"] with 5 slots

As you can see, this is an object of class ‘SpatialPolygonsDataFrame’ from the package, “sp”. This is an S4 object that is deeply structured (we’re only looking at the top level). The information in S4 objects are stored in “slots”, which can be accessed with the @ symbol. Let’s plot this shapefile so you can see what’s what:

# plot the towns
plot(towns)

And now, let’s look at the attribute table, which is stored in the data slot.

# look at the first 10 records in the attribute table
head(towns@data, n = 10)
##   OBJECTID FIPS6  TOWNNAME TOWNNAMEMC CNTY  TOWNGEOID ShapeSTAre ShapeSTLen
## 0        1  9030    CANAAN     Canaan    9 5000911800   86133926   63085.37
## 1        2 11040  FRANKLIN   Franklin   11 5001127100  105502442   42582.92
## 2        3 11015 BERKSHIRE  Berkshire   11 5001105425  108509286   41941.76
## 3        4 11050  HIGHGATE   Highgate   11 5001133025  155573562   57367.84
## 4        5 11060  RICHFORD   Richford   11 5001159125  111897347   42398.62
## 5        6 13005   ALBURGH    Alburgh   13 5001300860  123041223   55791.54
## 6        7  9080    NORTON     Norton    9 5000952750  100482148   43902.68
## 7        8  9005   AVERILL    Averill    9 5000902125   98589185   39794.77
## 8        9 19050   HOLLAND    Holland   19 5001933775  100189636   40818.36
## 9       10 19060       JAY        Jay   19 5001936325   87962933   37738.29

See the column named TOWNNAME? This single column in the attributes table contains the towns in Vermont. Let’s use the sample function once more to assign a town of harvest to each and every Tauntaun in our dataset.

# convert a single field in the data frame to a character vector of town names
towns  <-  towns@data$TOWNNAME

# create a town field and populate it with random value from the town name vector
data$town <- sample(towns, nrow(data), replace = TRUE)

# look at the first 10 records of data
head(data, n = 10)
##      yr age sex individul  species hunter.id month day       date            town
## 1  1901   1   f         1 tauntaun       285    10   8  10/8/1901        STANNARD
## 2  1901   1   f         2 tauntaun        56    11  13 11/13/1901       BALTIMORE
## 3  1901   1   f         3 tauntaun       549    12  26 12/26/1901         HANCOCK
## 4  1901   1   f         4 tauntaun       296    12  11 12/11/1901 ST. ALBANS TOWN
## 5  1901   1   f         5 tauntaun       521    11  21 11/21/1901        WHEELOCK
## 6  1901   1   f         6 tauntaun       355    10  18 10/18/1901         PEACHAM
## 7  1901   1   f         7 tauntaun       674    11   4  11/4/1901     AVERYS GORE
## 8  1901   1   f         8 tauntaun       297    10   4  10/4/1901         SUDBURY
## 9  1901   1   f         9 tauntaun         7    11  18 11/18/1901     MOUNT HOLLY
## 10 1901   1   f        10 tauntaun       388    12  14 12/14/1901      BROOKFIELD

Now we know who harvested each Tauntaun, on what day, and in which town. Notice that R thinks that “town” is a factor with 255 levels. That’s fine.

13.8.6 The Normal Distribution

Let’s continue to add more variables to the harvest dataset, focusing on adding different kinds of datatypes. Let’s start by adding some numeric (decimal) data types. Each Tauntaun will have a length; males will be longer than females. We’ll use another probability distribution, the normal distribution to assign lengths. No doubt, you’ve encountered this distribution before. The shape of the normal distribution is controlled by two parameters, the mean and the standard deviation. We can create a normal distribution with thednorm function. These arguments are called mean and sd. For example, let’s look at a normal distribution whose mean is 50 and whose standard deviation is 10:

# use `dnorm` to return the probability of observing x, given the mean and standard deviation

probs <- dnorm(x = 1:100, mean = 50, sd = 10)

# plot x along the axis, and probs along the y axis.  
plot(x = 1:100, y =  probs, 
     type = "l", 
     xlab = "X", 
     ylab = "probability", 
     main = "Normal Distribution")

Notice how this distribution is centered on 50? If we draw random samples from this distribution, you can see that it is much more likely to draw samples near 50 than samples in the tails of the distribution.

We can use the rnorm function to draw samples from a normal distribution. Like rbinom and we can specify how many samples to draw with the n argument. Note that this function has a default mean of 0 and a default sd of 1 (a Z score), so we must set these arguments or R will use the defaults.

In the code below, we use different means for males and females,so we need to subset data to include only those rows where sex == ‘m’ or sex ==‘f’. To do that subset, we use rnorm and draw the number of samples within the subset from a normal distribution with a specified mean and standard deviation. We used the ifelse function once again, following the same logic as we did when assigning the day of harvest. You should be able to work through the arguments.

# look at the arguments for rnorm
args(rnorm)
## function (n, mean = 0, sd = 1) 
## NULL
# add a column called length
data$length <- NA

# add lengths of harvested animals (males) 
data$length <-   with(data,
                      ifelse(sex == "m",
                             round(rnorm(n = nrow(data['sex' == 'm']),
                                         mean = 400,
                                         sd = 40),2),
                             length))

# add lengths of harvested animals (females)
data$length <-   with(data,
                      ifelse(sex == "f",
                             round(rnorm(n = nrow(data['sex' == 'f']),
                                         mean = 300,
                                         sd = 20),2),
                             length))

# look the first few records
head(data,3)
##     yr age sex individul  species hunter.id month day       date      town length
## 1 1901   1   f         1 tauntaun       285    10   8  10/8/1901  STANNARD 277.10
## 2 1901   1   f         2 tauntaun        56    11  13 11/13/1901 BALTIMORE 311.86
## 3 1901   1   f         3 tauntaun       549    12  26 12/26/1901   HANCOCK 274.84

Next, we’ll add a new column called weight, which is the weight of harvested animals in some Jedi metric (integer data). Here, we’ll assume that weight is correlated with an animals length.
The fastest way to create correlated weight data is to establish a linear relationship between length and weight, and add some noise to this relationship. Let’s run some code, and then discuss it.

# create column called weight, which is the length * 2.5
data$weight  <- data$length * 2.5

# add noise to the weight column by adding a random deviate generated by rnorm
data$weight  <- data$weight + rnorm(n = nrow(data), mean = 0, sd = 100)

# look at the data
head(data)
##     yr age sex individul  species hunter.id month day       date            town length   weight
## 1 1901   1   f         1 tauntaun       285    10   8  10/8/1901        STANNARD 277.10 699.3044
## 2 1901   1   f         2 tauntaun        56    11  13 11/13/1901       BALTIMORE 311.86 718.1022
## 3 1901   1   f         3 tauntaun       549    12  26 12/26/1901         HANCOCK 274.84 550.6420
## 4 1901   1   f         4 tauntaun       296    12  11 12/11/1901 ST. ALBANS TOWN 266.70 699.5951
## 5 1901   1   f         5 tauntaun       521    11  21 11/21/1901        WHEELOCK 271.84 613.3350
## 6 1901   1   f         6 tauntaun       355    10  18 10/18/1901         PEACHAM 301.03 851.4671

Next, let’s double check the correlation between length and weight – this is done so that the fledgling cor function in the chapter on statistical analysis:

# check the correlation between weight and length
cor(data$weight, data$length)
## [1] 0.8287147
# visualize the data
plot(x = data$length, 
     y = data$weight, 
     main = "Scatterplot Example", 
     xlab = "Length ", 
     ylab = "Weight ", 
     pch = 19) 

# check the correlation for males and females separately
males  <- subset(data, sex == 'm')
cor(males$weight, males$length)
## [1] 0.7093369
females  <- subset(data, sex == 'm')
cor(females$weight, females$length)
## [1] 0.7093369

Here, we’ve multiplied each Tauntaun’s length by 2.5 to give us each animal’s weight. Then, we used the rnorm function again, this time drawing random values from a distribution whose mean is 0 and whose standard deviation is 100. These deviates were then added to the vector of weights to give us our final weights. The standard deviation argument controls how much “noise” we added to the data. We then subset the males and females, and used the cor function to double-check how correlated height and weight are for males and females. All of this was done so that a fledgling reader can lean how to use the cor function in the chapter on statistical analysis.

Now, let’s add in some outlier lengths so the Tauntaun biologist can practice finding outliers. We will create a random vector of rows that will be used for indexing, and we’ll store the index so we can extract the current lengths, multiply them by 10, and reinsert the new length in the right place.

# draw 6 random samples of row numbers from the data; these rows will contain outliers
bad.len <- sample(1:nrow(data), size = 6)

# multiply the original length by 10, and replace the outliers into the original dataset
data[bad.len,'length'] <- data[bad.len,'length'] * 10

Heh, heh, heh.

13.8.7 Adding Factor-like Data

That takes care of a few columns with continuous data. Now let’s add some nominal data types (names, which are stored as characters). We’ll add the following columns: method of harvest and color of the Tauntaun’s fur. Again we’ll use the sample function to draw samples according to a probability vector . . . remember that your probabilities must sum to 1! Then we’ll introduce NA’s and typos.

# add method of harvest (nominal data)
data['method'] <- sample(c('muzzle','bow', 'lightsaber'),
                         size = nrow(data),
                         prob = c(0.25,0.25,0.50),
                         replace = TRUE)

# delete a few method records to spike the data with realistic errors
data[sample(1:nrow(data), size = 512), 'method'] <- NA

# add fur color (nominal data)
data['color'] <- sample(c('gray','white'),
                        size = nrow(data),
                        replace = TRUE,
                        prob = c(0.68,0.32))

# add some bad fur colors
bad.fur.indices <- sample(1:nrow(data), size = 1306)
data[bad.fur.indices,'color'] <- 'grey'

Finally, we’ll add in one column that contains ordinal (or ordered) data. Let’s assume that Tauntauns have fur length that can be defined as “short”, “medium”, or “long”. Note that this is ordered data, but we don’t know how short is “short” or how short is “medium”. We only know that short < medium < long. This lets us work with ordered factors in Chapter 6.

# add fur length (ordinal, or ordered, data)
data['fur'] <- sample(c('short','medium','long'),
                      size = nrow(data), replace = TRUE)

Nice! We now have all of the rows and variables (columns) of harvest data for our chapter 6 user.

13.8.8 Sorting (Ordering) Data Frames

Before we create a csv file from the data, let’s clean it up in terms of sorting the data and columns.

Sorting is a simple task that can be frustratingly tricky until you’ve seen it once or twice. Let’s sort the table in order of increasing individual ID, which is in the column ‘individual’. You’d think that sort would be the function of choice, but it’s not. The order function is used instead to allow us the option for sorting by multiple columns if we wish. With order, you simply enter each additional sort vector as an argument of the function separated by commas). Save sort for sorting vectors.

# sort the pop data by individul (a typo)
data <- data[order(data[,'individul']),]

# preview the table
head(data,10)
##      yr age sex individul  species hunter.id month day       date            town length   weight     method color    fur
## 1  1901   1   f         1 tauntaun       285    10   8  10/8/1901        STANNARD 277.10 699.3044 lightsaber white   long
## 2  1901   1   f         2 tauntaun        56    11  13 11/13/1901       BALTIMORE 311.86 718.1022     muzzle  gray   long
## 3  1901   1   f         3 tauntaun       549    12  26 12/26/1901         HANCOCK 274.84 550.6420        bow white   long
## 4  1901   1   f         4 tauntaun       296    12  11 12/11/1901 ST. ALBANS TOWN 266.70 699.5951     muzzle  gray   long
## 5  1901   1   f         5 tauntaun       521    11  21 11/21/1901        WHEELOCK 271.84 613.3350        bow  gray   long
## 6  1901   1   f         6 tauntaun       355    10  18 10/18/1901         PEACHAM 301.03 851.4671 lightsaber  grey  short
## 7  1901   1   f         7 tauntaun       674    11   4  11/4/1901     AVERYS GORE 256.77 720.2733 lightsaber white  short
## 8  1901   1   f         8 tauntaun       297    10   4  10/4/1901         SUDBURY 306.25 579.1136        bow  gray   long
## 9  1901   1   f         9 tauntaun         7    11  18 11/18/1901     MOUNT HOLLY 317.58 842.1904 lightsaber white medium
## 10 1901   1   f        10 tauntaun       388    12  14 12/14/1901      BROOKFIELD 310.31 867.4912 lightsaber white   long

Looks good. But the columns might not be organized in a way that matches the Tauntaun harvest data collection form. To reorganize the columns within our dataset, just specify the columns in the order you’d like. Here, we’ll mix them up in the order that you, the Tauntaun biologist, will receive the dataset.

data <- data[,c("hunter.id", "age", "sex", "individul", "species", "date", "town", "length",  "weight","method", "color", "fur")]

Notice that we’ve left a few columns out (such as month, day, and year), which means they’ll be dropped.

13.8.9 Saving the Harvest CSV file

Finally, let’s save the complete simulated harvest recordset to the working directory as a comma separated values file with the write.csv function.

write.csv(data, file = 'datasets/TauntaunHarvest.csv', row.names = FALSE)

# confirm the file exists
file.exists("datasets/TauntaunHarvest.csv")
## [1] TRUE

13.8.10 The Hunter CSV file

In addition to the harvest dataset, we need a table that stores data pertaining to active hunters. Specifically we want to store their gender (in a character field as either ‘m’ or ‘f’), whether they are a state resident (in a logical field, either TRUE or FALSE), and we will assign each one a unique license number, which we will call their hunter ID. Again we will just use random values for each hunter, assigning them with sample according to predefined frequencies that we want to see in the resulting vectors. We could call sample for each vector, but to save a step this time we will nest multiple calls to sample inside the call to data.frame.

# create a dataset of hunter information.
hunter <- data.frame(hunter.id = 1:hunters,
                     sex.hunter = sample(c('m','f'),
                                         size = hunters,
                                         replace = TRUE,
                                         prob = c(0.85, 0.15)),
                     resident = sample(c(TRUE,FALSE),
                                     size = hunters,
                                     replace = TRUE,
                                     prob = c(0.8, 0.2)))

# look at the data
head(hunter)
##   hunter.id sex.hunter resident
## 1         1          m     TRUE
## 2         2          m    FALSE
## 3         3          m     TRUE
## 4         4          m     TRUE
## 5         5          m    FALSE
## 6         6          f     TRUE
# Save the CSV file
write.csv(hunter,file = 'datasets/hunter.csv', row.names = FALSE)

# confirm the file exists
file.exists("datasets/hunter.csv")
## [1] TRUE

That wraps up this chapter. Hopefully you now understand how we created the two csv files that you inherited as the new Tauntaun biologist, which include mistakes.

This has been a deep, deep chapter. We’ve learned a lot about arrays, rates, statistical distributions, loops, and more! If you have been thinking to yourself “I think I can code this a bit differently than what’s shown”, you’ve definitely caught the R bug. As we said in Chapter 1, you won’t learn R in 3 days. But, hopefully you have learned some tools that will help make your journey a bit easier from this point forward.

Most importantly, if you were able to follow along with this chapter, you’ve fledged!

Time for another break!

Figure 13.7: Time for another break!