My First Live Coding Interview

Yesterday I interviewed for a position maintaining and creating ShinyApps. To call that a JOB is crazy to me. I love developing reactive web applications, and the fact that you can get paid to do that is still mind-blowing. I’m realizing that having fun at work is actually a possibility!

That said, the data scientist position usually includes a live coding portion. I went into it trying to treat my first one as practice, but every second I didn’t spend typing spanned an eternity. It was horrifying… but thinking about how to solve these questions was also kind of really fun?

I’m fairly certain I won’t get the job. But I’m also certain it was an experience to learn and grow from. The interview was so intense that it was pretty easy to recall the questions almost verbatim. I wanted to explore the questions again on my own with no pressure. And I’d love input on how to answer these more elegantly!

The R Way

Before we begin, I’ve updated this post to include asides from the wonderful world of #rstats Twitter. If you have any suggestions on tidying the code, feel free to contact me or submit a PR to my blog repo!

Question 1

Create a for loop for n iterations where every third iteration prints “buzz” and every fifth iteration prints “fizz”. Every combination prints “buzz-fizz”. Print the iterator for all other values.

n = 30

for (i in 1:n) {
  if (i %% 15 == 0) {
    print(paste(i, "buzz-fizz"))
  } else if (i %% 3 == 0) {
    print(paste(i, "buzz"))
  } else if (i %% 5 == 0) {
    print(paste(i, "fizz"))
  } else {
    # print the bare iterator only when no other branch matched
    print(i)
  }
}
## [1] 1
## [1] 2
## [1] "3 buzz"
## [1] 4
## [1] "5 fizz"
## [1] "6 buzz"
## [1] 7
## [1] 8
## [1] "9 buzz"
## [1] "10 fizz"
## [1] 11
## [1] "12 buzz"
## [1] 13
## [1] 14
## [1] "15 buzz-fizz"
## [1] 16
## [1] 17
## [1] "18 buzz"
## [1] 19
## [1] "20 fizz"
## [1] "21 buzz"
## [1] 22
## [1] 23
## [1] "24 buzz"
## [1] "25 fizz"
## [1] 26
## [1] "27 buzz"
## [1] 28
## [1] 29
## [1] "30 buzz-fizz"

My first attempt at answering the question revealed a gap in my mental model. I first attempted to construct the loop using an if statement with the conditions in the same order as the question: (i %% 3 == 0), then (i %% 5 == 0) and lastly (i %% 15 == 0). I was operating under the idea that every condition inside a loop gets checked on each pass, whatever the order. However, these conditions are inside an if/else chain, not the loop itself, so of course order matters: only the first condition that is TRUE runs. By putting (i %% 15 == 0) first you ensure the numbers divisible by both 3 and 5 are assigned buzz-fizz before the buzz or fizz checks can claim them.
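To make the ordering point concrete, here is a minimal sketch in base R. The two helper functions (`label_wrong_order` and `label_right_order`) are hypothetical names made up for this illustration; they return a label instead of printing so the two orderings are easy to compare.

```r
# Conditions in the order the question lists them: the %% 15 branch is
# unreachable, because any multiple of 15 already matches %% 3.
label_wrong_order <- function(i) {
  if (i %% 3 == 0) {
    "buzz"
  } else if (i %% 5 == 0) {
    "fizz"
  } else if (i %% 15 == 0) {
    "buzz-fizz"
  } else {
    as.character(i)
  }
}

# Most specific condition first: multiples of 15 are caught before the
# more general %% 3 and %% 5 checks get a chance to match.
label_right_order <- function(i) {
  if (i %% 15 == 0) {
    "buzz-fizz"
  } else if (i %% 3 == 0) {
    "buzz"
  } else if (i %% 5 == 0) {
    "fizz"
  } else {
    as.character(i)
  }
}

wrong <- label_wrong_order(15)
right <- label_right_order(15)
print(c(wrong = wrong, right = right))
```

With the first ordering, 15 comes back as a plain buzz; with the second it gets the combined label.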

The R Way

R’s strength is in dealing with vectors, so leverage that in the approach! Jon Harmon suggested a better approach for this problem.

n = 15
dplyr::tibble(
  iteration = seq_len(n),
  output = dplyr::case_when(
    iteration %% 15 == 0 ~ "buzz-fizz",
    iteration %% 3 == 0 ~ "buzz",
    iteration %% 5 == 0 ~ "fizz",
    TRUE ~ as.character(iteration)
  )
)
## # A tibble: 15 x 2
##    iteration output   
##        <int> <chr>    
##  1         1 1        
##  2         2 2        
##  3         3 buzz     
##  4         4 4        
##  5         5 fizz     
##  6         6 buzz     
##  7         7 7        
##  8         8 8        
##  9         9 buzz     
## 10        10 fizz     
## 11        11 11       
## 12        12 buzz     
## 13        13 13       
## 14        14 14       
## 15        15 buzz-fizz

In fact, this same question is the first example within the dplyr::case_when documentation!
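The same vectorized idea can also be sketched in base R with logical indexing. This is my own variation, not part of Jon's suggestion. Note that the order flips compared with the if/else chain: later assignments overwrite earlier ones, so the most specific condition goes last.

```r
i <- seq_len(15)

# Start with every iteration as its own label
out <- as.character(i)

# Overwrite in order of increasing specificity; the last write wins
out[i %% 3 == 0] <- "buzz"
out[i %% 5 == 0] <- "fizz"
out[i %% 15 == 0] <- "buzz-fizz"

print(out)
```

The resulting vector should match the output column of the tibble above.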

Question 2

Summarise the diamonds data set

summary(ggplot2::diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

In an attempt to over-complicate this question and to flex my tidyverse skills, I was quick to type diamonds %>% summarise(mean =....) but the interviewer asked “Are you going to write the name of every column?” I panicked. I skipped this question and only later remembered the summary function. (Clearly, base R functions are currently in the dark recesses of my mind. Use it or lose it…)
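For what it is worth, there are ways to compute custom statistics without typing every column name. Below is a minimal base R sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins for diamonds); dplyr's across() helper is the tidyverse route to the same idea.

```r
# Hypothetical stand-in for the diamonds data: two numeric columns, one factor
toy <- data.frame(
  carat = c(0.23, 0.31, 1.01, 0.70),
  cut   = factor(c("Ideal", "Good", "Premium", "Ideal")),
  price = c(326, 335, 4500, 2757)
)

# Keep only the numeric columns, whatever they happen to be called
numeric_cols <- toy[sapply(toy, is.numeric)]

# One column of results per numeric variable, one row per statistic
stats <- sapply(numeric_cols, function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col))
})

print(stats)
```

The column selection is done by type, so adding or renaming numeric columns would not require touching the code.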

Find the maximum diamond price

ggplot2::diamonds %>%
  filter(price == max(price))  # keep the full row(s) at the maximum price
## # A tibble: 1 x 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  2.29 Premium I     VS2      60.8    60 18823   8.5  8.47  5.16

I was quick to type max(diamonds$price) and smugly said ‘Done!’ The interviewer responded, okay but I wanted to know everything else about that diamond. This meant I needed to print the whole row. I’m not sure the function I’m using is the most efficient, but I like it?
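Two base R alternatives, sketched on a small made-up data frame (`toy` and its values are hypothetical, with a deliberate tie at the top price). The difference between them is how ties are handled, which may or may not matter to an interviewer.

```r
# Hypothetical stand-in for diamonds, with two rows sharing the maximum price
toy <- data.frame(
  carat = c(0.23, 2.29, 0.31, 2.00),
  price = c(326, 4500, 335, 4500)
)

# which.max() returns the index of the FIRST maximum, so exactly one row
first_top <- toy[which.max(toy$price), ]

# Comparing against max() keeps every tied row, like the filter() approach
all_top <- toy[toy$price == max(toy$price), ]

print(first_top)
print(all_top)
```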

Calculate the mean, median, standard deviation of the price for each diamond cut

ggplot2::diamonds %>%
  group_by(cut) %>%
  summarise(mean = mean(price),
            med = median(price),
            std = sd(price))
## # A tibble: 5 x 4
##   cut        mean   med   std
##   <ord>     <dbl> <dbl> <dbl>
## 1 Fair      4359. 3282  3560.
## 2 Good      3929. 3050. 3682.
## 3 Very Good 3982. 2648  3936.
## 4 Premium   4584. 3185  4349.
## 5 Ideal     3458. 1810  3808.

Finally a question I felt comfortable answering! Focusing on TidyBlocks for the past couple of months gave me plenty of practice with this kind of grouped summary.

Question 3

Using the mtcars data set, create a linear model to see the effect of disp on mpg and explain the output of the model

m.1 <- lm(mpg ~ disp, data = mtcars)
summary(m.1)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

Honestly, I could write this simple code from memory, but what I said as an explanation is an embarrassing blur. I think I can only attribute floundering over the output of a linear model with a single predictor to nerves.

I’m taking the time here to break the output of the model summary down line for line because every aspiring data scientist should be so comfortable with the lm output that even nerves shouldn’t matter.

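  • The call is an R feature that echoes the function and the arguments it was run with
  • The residuals are the differences between the actual and model-predicted values of mpg, the response in mpg ~ disp
  • The coefficients are the intercept and slope that minimize the sum of the squared errors
    • The intercept is the predicted mpg when disp equals zero. Since disp never equals zero, there’s no intrinsic meaning to the intercept
    • The negative sign of disp means that as disp increases, predicted mpg decreases, by about 0.041 mpg per additional cubic inch of displacement
  • Residual standard error is the estimated standard deviation of the residuals (the square root of the residual variance), so predictions are typically off by about 3.25 mpg
  • Multiple R squared is a measurement of how well the model fits your data
    • An R squared of 0.72 means disp explains about 72% of the variance in mpg, which seems pretty good for a single predictor
  • Adjusted R squared takes the number of variables you add to the model into account, since adding variables will inevitably produce a better fit. Because we only have one predictor this number is only very slightly different from our R squared.
  • The F-statistic is a global test of whether at least one coefficient (other than the intercept) is non-zero. With a single predictor it is the square of the t value for disp and shares its p-value.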
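To get comfortable with where these numbers come from, the pieces of the summary can be recomputed by hand. The sketch below uses only base R and the built-in mtcars data; the variable names are my own.

```r
fit <- lm(mpg ~ disp, data = mtcars)

res <- residuals(fit)            # actual mpg minus predicted mpg
df_resid <- df.residual(fit)     # observations minus estimated coefficients

rss <- sum(res^2)                                # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares

# Residual standard error: square root of the residual variance
rse <- sqrt(rss / df_resid)

# Multiple R squared: share of the variance in mpg explained by the model
r2 <- 1 - rss / tss

# Adjusted R squared: penalized for the number of predictors (p = 1 here)
n <- nrow(mtcars)
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With a single predictor, the F-statistic is the squared t value of the slope
t_disp <- coef(summary(fit))["disp", "t value"]
f_stat <- summary(fit)$fstatistic[["value"]]

print(round(c(rse = rse, r2 = r2, adj_r2 = adj_r2,
              t_squared = t_disp^2, f_stat = f_stat), 4))
```

Each value can be compared against the corresponding line of summary(m.1) above.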

Question 4

Create a function that separates a list into two lists, one of unique values and the second containing the duplicates

set.seed(42)
my_list <- list(round(runif(100, min=0, max=100)), 1)

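# Split the vector stored in the first element of `input` into
# (1) values that occur more than once and (2) values that occur exactly once
separated <- function(input) {
  x <- input[[1]]
  dup <- unique(x[duplicated(x)])  # each repeated value, listed once
  unq <- x[!x %in% dup]            # values that never repeat
  return(list(dup, unq))
}

separated(my_list)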
## [[1]]
##  [1] 46 94 91 74 39 83 64 97  4 26 51 68 98 69 14 72  1 38  0 78 56  9 21
## [24] 93 33 52 62
## 
## [[2]]
##  [1] 29 13 66 71 12 47 90 99 95  8 45 84 81 61 44 43 96 89 35 40 75 17 76
## [24] 57 85 19 27 24 22 48 20 58 16 36 65 23 31 67 73

To get there, I made a dummy data set to play with: a list with 6 numbers, only one of which is a duplicate. This helped to highlight the workflow: (1) find the duplicated values in the list, then (2) keep the remaining values by dropping anything that appears among the duplicates.

test <- list(c(1,2,3,4,5,3))

# find duplicates
test[[1]][duplicated(test[[1]])]
## [1] 3
# I thought of another case -
# if we have multiple duplicates (three 3s)
# we need to wrap this expression in unique()
test2 <- list(c(1,2,3,4,5,3,3))
unique(test2[[1]][duplicated(test2[[1]])])
## [1] 3
# remove duplicates from unique values
test[[1]][!test[[1]] %in% test[[1]][duplicated(test[[1]])]]
## [1] 1 2 4 5

Obtaining the data from inside a list, especially nested lists, is a skill I know I need to build. This answer does not look elegant to me but it gets the job done? I’m going to play with “better”, cleaner solutions.
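One possibly cleaner take is sketched below. It assumes the input is a plain vector rather than a list, and `split_repeats` is a hypothetical name. The trick is calling duplicated() in both directions, which flags every occurrence of a repeated value (not just the second and later ones), and returning a named list so the two pieces are hard to mix up.

```r
split_repeats <- function(x) {
  # TRUE for every element whose value appears more than once in x
  is_repeated <- duplicated(x) | duplicated(x, fromLast = TRUE)

  list(
    unique     = x[!is_repeated],        # values that appear exactly once
    duplicates = unique(x[is_repeated])  # each repeated value, listed once
  )
}

result <- split_repeats(c(1, 2, 3, 4, 5, 3, 3))
print(result)
```

For the list used earlier, the call would be split_repeats(my_list[[1]]).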

Summary

I left the interview feeling exhausted and deflated. I found myself asking: if I can’t answer these questions, what am I doing trying to become a data scientist? But now that I’ve spent a day reflecting, I can see the interview was an incredible learning experience. It pinpointed concrete areas where I can grow and I honestly had fun thinking about these problems. I’m not sure I’ll ever perform smoothly under pressure, but at the very least I now have a function to separate duplicates from unique values!

Maya Gans
Data Scientist

Maya’s work as a Master’s student was focused in quantitative biology. She loves coding and is extremely passionate about data science and data visualization.
