Data Blog

NYC Taxi Data (Part I)

Clay Gibson

In mid-2014, Chris Whong FOIL'ed (requested under New York's Freedom of Information Law) data from the Taxi & Limousine Commission of NYC, resulting in over 40GB of fare and trip information collected from the millions of 2013 cab rides. That data is available for download here.

This was my first exploration of big data in R. Using the ff package, I loaded "trip_fare_1.csv" and "trip_data_1.csv" into R and combined the two data sources, which gave me over 14 million cab trips from the month of January. Deciding that 14 million was still a bit more unwieldy than I really wanted, I took a random sample of 100,000 trips.
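The combine-then-sample step can be sketched in base R. This is only an illustration under stated assumptions: the two small data frames below are synthetic stand-ins for the real files, the join keys ("medallion" and "pickup_datetime") are a guess at how the two sources line up, and nothing here uses the ff package.

```r
set.seed(42)  # fixed seed so the random sample is reproducible

# Synthetic stand-ins for trip_fare_1.csv and trip_data_1.csv (hypothetical data)
n <- 1000
keys <- data.frame(
  medallion = sprintf("M%04d", seq_len(n)),
  pickup_datetime = format(
    as.POSIXct("2013-01-01 00:00:00", tz = "UTC") + seq_len(n) * 60,
    "%Y-%m-%d %H:%M:%S"
  ),
  stringsAsFactors = FALSE
)
fare <- cbind(keys, tip_amount = round(runif(n, 0, 5), 2))
trip <- cbind(keys, trip_distance = round(runif(n, 0.5, 10), 1))

# Join the two sources on the shared key columns (assumed join keys)
total_full <- merge(fare, trip, by = c("medallion", "pickup_datetime"))

# Simple random sample of rows, drawn without replacement
sample_size <- 100
total <- total_full[sample(nrow(total_full), sample_size), ]
```

With the real data the same idea applies, just with 14 million rows going in and 100,000 coming out.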

For each of those trips, I had the following variables:

> colnames(total)
 [1] "X"                 "medallion"         "hack_license"     
 [4] "vendor_id"         "pickup_datetime"   "payment_type"     
 [7] "fare_amount"       "surcharge"         "mta_tax"          
[10] "tip_amount"        "tolls_amount"      "total_amount"     
[13] "dropoff_datetime"  "passenger_count"   "trip_time_in_secs"
[16] "trip_distance"     "pickup_longitude"  "pickup_latitude"  
[19] "dropoff_longitude" "dropoff_latitude"

Growing up in the city, I had the vague idea that cabbies preferred that you tip with cash because they could underreport their earnings and get that additional income tax-free. To test that hypothesis, I looked at the effect of "payment_type" on "tip_amount." There are five payment types:

  • "CRD" -- card, debit or credit
  • "CSH" -- cash
  • "DIS" -- disputed fare 
  • "NOC" -- no charge
  • "UNK" -- unknown

In my sample, the majority of payments were made by card, closely followed by cash. Disputed fares, no charges, and unknown payment types were rare. The breakdown for my sample is below:

# CG Estimate   
   CRD     CSH     DIS     NOC     UNK 
0.52600 0.47060 0.00070 0.00223 0.00047 

Despite the difference between a sample of 100,000 and 173 million total transactions, my sample wasn't far off the total breakdown (taken from Muhammed Ahmad's post here): 

# Total % from 173 million transactions   
   CRD     CSH     DIS     NOC     UNK 
0.53894 0.45681 0.00074 0.00232 0.00119
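To see how a breakdown like this is computed and how close the two are, here is a small sketch. The toy "payment_type" vector is made up for illustration; the two share vectors are copied from the tables above.

```r
# Toy payment_type column (hypothetical data) to show how shares are computed
payment_type <- c(rep("CRD", 526), rep("CSH", 471), "DIS", "NOC", "NOC")
toy_share <- prop.table(table(payment_type))

# Shares reported above: my sample of 100,000 vs. all 173 million transactions
sample_share <- c(CRD = 0.52600, CSH = 0.47060, DIS = 0.00070,
                  NOC = 0.00223, UNK = 0.00047)
total_share  <- c(CRD = 0.53894, CSH = 0.45681, DIS = 0.00074,
                  NOC = 0.00232, UNK = 0.00119)

# Gap between sample and total, in percentage points
diff_points <- round(100 * (sample_share - total_share), 3)
largest_gap <- max(abs(diff_points))
largest_gap_type <- names(which.max(abs(diff_points)))
```

The largest gap is in the cash share, at under 1.4 percentage points.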

The tips for disputed fares and no charge fares were, as suspected, rarely above $0. The interesting part was the comparison between reported cash tips and card tips, summarized in this graph:

The mean reported tip for a card payment was $2.42. For a cash payment, it was $0.00. With a sample of 100,000, that's easily enough data to argue that there is a real difference between cash and card tips (p < 2.2e-16) with a large effect size. What does that mean? Well, maybe people tend not to tip when paying with cash. Or perhaps cabbies underreport their cash tip earnings.
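A comparison like this can be sketched with a two-sample test. The post does not say which test produced the p-value, so this assumes a Welch t-test (the default for t.test in R) and uses made-up tip vectors: card tips with a mean near $2.42 and cash tips that are almost all $0. Cohen's d is one common effect-size measure, chosen here as an assumption.

```r
set.seed(1)

# Hypothetical tips: card tips average about $2.42, cash tips are mostly $0
card_tips <- round(rgamma(5000, shape = 4, rate = 4 / 2.42), 2)
cash_tips <- c(rep(0, 4490), round(runif(10, 1, 5), 2))

# Welch two-sample t-test (unequal variances is the default in t.test)
result <- t.test(card_tips, cash_tips)
p_value <- result$p.value

# Cohen's d using the pooled standard deviation
n1 <- length(card_tips)
n2 <- length(cash_tips)
pooled_sd <- sqrt(((n1 - 1) * var(card_tips) + (n2 - 1) * var(cash_tips)) /
                  (n1 + n2 - 2))
cohens_d <- (mean(card_tips) - mean(cash_tips)) / pooled_sd
```

R prints any p-value below about 2.2e-16 as "< 2.2e-16", which is why that figure shows up so often in output from large samples.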

If we assume people tip a bit less for a cash transaction, say $2.00, this graph indicates that in those 100,000 transactions, there's around $90,000 in unreported income. If that's representative and scales up to the 173 million total transactions during the 2013 calendar year, that would mean NYC cabbies earned over $155 million of untaxed income. 
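The back-of-the-envelope arithmetic behind those numbers, written out. The $2.00 cash tip is the assumption stated above, and the cash share comes from my sample breakdown.

```r
n_sample         <- 100000   # trips in the sample
cash_share       <- 0.47060  # share of sample trips paid in cash
assumed_cash_tip <- 2.00     # assumed true average cash tip, in dollars

# Unreported tips in the sample: about 47,060 cash trips at $2.00 each
cash_trips        <- n_sample * cash_share
unreported_sample <- cash_trips * assumed_cash_tip   # roughly $94,000

# Rounding down to $90,000 per 100,000 trips gives $0.90 per trip overall
per_trip_rounded <- 90000 / n_sample

# Scale to all 173 million transactions in 2013
n_total          <- 173e6
unreported_total <- per_trip_rounded * n_total       # about $155.7 million
```

Using the unrounded sample figure instead of $90,000 would push the yearly estimate a little higher, to around $163 million.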

Since reported cash tips are mostly $0, they're not very helpful when trying to use other variables to predict the true tip amount (I want to know how my tipping compares to other people's). So, I restricted my analysis to those who had paid with card. (Thankfully) there's at least a moderate correlation between the total fare before tip and tip amount (r-squared of .55). But when you graph the card tips based on the total fare before tip, you start to see an interesting pattern:
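Here is a rough sketch of that step: keep only card payments, compute the total fare before tip, and get the r-squared from a simple linear fit. The data frame is synthetic, and taking "total fare before tip" to be total_amount minus tip_amount is an assumption about how that quantity was defined.

```r
set.seed(7)

# Synthetic trips (hypothetical data) with the column names from the dataset
n <- 2000
total <- data.frame(
  payment_type = sample(c("CRD", "CSH"), n, replace = TRUE),
  fare_before_tip = round(runif(n, 3, 60), 2)
)
total$tip_amount <- ifelse(
  total$payment_type == "CRD",
  pmax(0, round(0.2 * total$fare_before_tip + rnorm(n, sd = 2), 2)),
  0
)
total$total_amount <- total$fare_before_tip + total$tip_amount

# Restrict to card payments and recover the fare before tip
card <- total[total$payment_type == "CRD", ]
card$before_tip <- card$total_amount - card$tip_amount

# r-squared from a simple linear regression of tip on fare before tip
fit <- lm(tip_amount ~ before_tip, data = card)
r_squared <- summary(fit)$r.squared
```

For a regression with a single predictor, this r-squared is the same as the squared Pearson correlation from cor().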

Those lines that seem to be coming out of the origin -- they're pretty much exactly the 20%, 25%, 30% lines. That's pretty good news for whoever thought to institute these: they're working. 
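One way to check for lines like these without a graph is to convert each tip to a percentage of the fare before tip and count how many land on 20%, 25%, or 30%. The sketch below uses made-up data (700 tips at exactly one of those three percentages, 300 free-form dollar amounts) and a tolerance of half a percentage point, both of which are assumptions.

```r
set.seed(3)

# Hypothetical card trips: first 700 tip a fixed percentage, last 300 tip freely
presets <- c(20, 25, 30)
fare_before_tip <- round(runif(1000, 5, 50), 2)
chosen_pct <- sample(presets, 700, replace = TRUE)
tip_amount <- c(
  round(fare_before_tip[1:700] * chosen_pct / 100, 2),
  round(runif(300, 0, 10), 2)
)

# Tip as a percentage of the fare before tip
tip_pct <- 100 * tip_amount / fare_before_tip

# Nearest of the three percentages for each trip, and whether the tip is on it
gaps <- abs(outer(tip_pct, presets, "-"))
nearest <- presets[apply(gaps, 1, which.min)]
on_line <- abs(tip_pct - nearest) < 0.5

# Share of trips sitting on one of the three lines, and counts per line
share_on_line <- mean(on_line)
line_counts <- table(nearest[on_line])
```

On real data, a large share_on_line would be the numeric version of the lines in the scatterplot.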

Data is available for download here. R code is available here.