In mid-2014, Chris Whong used a FOIL (Freedom of Information Law) request to obtain data from the NYC Taxi & Limousine Commission, resulting in over 40GB of fare and trip information collected from the millions of 2013 cab rides. That data is available for download here.
This was my first exploration of big data in R. Using the ff package, I loaded "trip_fare_1.csv" and "trip_data_1.csv" into R and combined the two sources, which together cover over 14 million cab trips from the month of January. Deciding that 14 million was still a bit more unwieldy than I really wanted, I took a random sample of 100,000 trips.
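The combine-and-sample step can be sketched roughly as follows. This is a minimal illustration on small made-up data frames in base R, not the ff objects used in the actual analysis (with ff, the files would presumably be read with `read.csv.ffdf` first). The join keys and all toy values are assumptions.

```r
# Toy stand-ins for trip_data_1.csv and trip_fare_1.csv (all values made up).
set.seed(42)
n <- 1000
keys <- data.frame(
  medallion       = sprintf("M%04d", seq_len(n)),
  hack_license    = sprintf("H%04d", seq_len(n)),
  pickup_datetime = rep("2013-01-01 00:00:00", n),
  stringsAsFactors = FALSE
)
trip_data <- cbind(keys, trip_distance = round(runif(n, 0.5, 10), 2))
trip_fare <- cbind(keys, fare_amount = round(runif(n, 3, 40), 2))

# Join the two sources on the columns they share (assumed keys).
total_all <- merge(trip_data, trip_fare,
                   by = c("medallion", "hack_license", "pickup_datetime"))

# Draw a simple random sample of rows without replacement.
sample_size <- 100  # the post used 100,000
total <- total_all[sample(nrow(total_all), sample_size), ]
```

Setting a seed before sampling keeps the sample reproducible from one run to the next.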
For each of those trips, I had the following variables:
> colnames(total)
"X" "medallion" "hack_license"
"vendor_id" "pickup_datetime" "payment_type"
"fare_amount" "surcharge" "mta_tax"
"tip_amount" "tolls_amount" "total_amount"
"dropoff_datetime" "passenger_count" "trip_time_in_secs"
"trip_distance" "pickup_longitude" "pickup_latitude"
"dropoff_longitude" "dropoff_latitude"
Growing up in the city, I had the vague idea that cabbies preferred that you tip with cash because they could underreport their earnings and get that additional income tax-free. To test that hypothesis, I looked at the effect of "payment_type" on "tip_amount". There are five payment types:
- "CRD" -- card, debit or credit
- "CSH" -- cash
- "DIS" -- disputed fare
- "NOC" -- no charge
- "UNK" -- unknown
In my sample, the majority of payments were made by card, closely followed by cash. Disputed fares, no charges, and unknown payment sources were rare. The breakdowns for my sample are below:
# CG Estimate
    CRD     CSH     DIS     NOC     UNK
0.52600 0.47060 0.00070 0.00223 0.00047
Despite the difference between a sample of 100,000 and 173 million total transactions, my sample wasn't far off the total breakdown (taken from Muhammed Ahmad's post here):
# Total % from 173 million transactions
    CRD     CSH     DIS     NOC     UNK
0.53894 0.45681 0.00074 0.00232 0.00119
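Shares like the ones above can be computed in one line with `prop.table(table(...))`. The sketch below uses simulated payment types drawn to roughly match my sample's shares. The data frame is a made-up stand-in, not the real sample.

```r
# Simulated payment types (assumption: drawn with the sample shares as weights).
set.seed(1)
payment_type <- sample(c("CRD", "CSH", "DIS", "NOC", "UNK"),
                       size = 10000, replace = TRUE,
                       prob = c(0.52600, 0.47060, 0.00070, 0.00223, 0.00047))
total <- data.frame(payment_type = payment_type)

# Count trips per payment type, then convert counts to proportions.
shares <- prop.table(table(total$payment_type))
print(round(shares, 5))
```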
The tips for disputed fares and no charge fares were, as suspected, rarely above $0. The interesting part was the comparison between reported cash tips and card tips, summarized in this graph:
The mean reported tip for a card payment was $2.42. For a cash tip, it was $0.00. With a sample of 100,000, that's easily enough data to argue that there is a real difference between cash and card tips (p < 2.2e-16) with a large effect size. What does that mean? Well, on one hand, maybe people tend not to tip when paying with cash. Or perhaps cabbies underreport their cash tip earnings.
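For anyone who wants to run a similar comparison, here is a minimal sketch on simulated tips. The choice of a Welch two-sample t-test and of Cohen's d as the effect size are assumptions for illustration, and the simulated numbers are made up.

```r
# Simulated tips (assumption): card tips centred near $2.42, cash tips almost all $0.
set.seed(2013)
n <- 2000
card <- data.frame(payment_type = "CRD",
                   tip_amount = pmax(0, rnorm(n, mean = 2.42, sd = 1.5)))
cash <- data.frame(payment_type = "CSH",
                   tip_amount = ifelse(runif(n) < 0.01, 1, 0))
tips <- rbind(card, cash)

# Mean reported tip per payment type.
means <- tapply(tips$tip_amount, tips$payment_type, mean)

# Welch two-sample t-test (R's default for t.test with two groups).
test <- t.test(tip_amount ~ payment_type, data = tips)

# Cohen's d with a pooled standard deviation (groups are the same size here).
sds <- tapply(tips$tip_amount, tips$payment_type, sd)
cohens_d <- (means[["CRD"]] - means[["CSH"]]) / sqrt(mean(sds^2))

print(means)
print(test$p.value)
print(cohens_d)
```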
If we assume people tip a bit less for a cash transaction, say $2.00, then the roughly 47,000 cash trips among those 100,000 transactions represent around $90,000 in unreported income. If that's representative and scales up to the 173 million total transactions during the 2013 calendar year, that would mean NYC cabbies earned over $155 million of untaxed income.
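The back-of-the-envelope arithmetic, spelled out. The cash shares are the ones reported earlier in this post. The $2.00 average cash tip is an assumption, not a measured value.

```r
assumed_cash_tip <- 2.00        # assumption: average true cash tip in dollars

# Sample: 100,000 trips, 47.06% of them paid in cash.
n_sample          <- 100000
cash_share_sample <- 0.47060
unreported_sample <- n_sample * cash_share_sample * assumed_cash_tip

# Full year: 173 million trips, 45.681% of them paid in cash.
n_total           <- 173e6
cash_share_total  <- 0.45681
unreported_total  <- n_total * cash_share_total * assumed_cash_tip

print(unreported_sample)  # about $94,000
print(unreported_total)   # about $158 million
```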
Since reported cash tips are mostly $0, they're not very helpful when trying to use other variables to predict the true tip amount (I want to know how my tipping compares to other people's). So, I restricted my analysis to those who had paid with card. Thankfully, there's at least a moderate correlation between the total fare before tip and the tip amount (r-squared of .55). But when you graph the card tips against the total fare before tip, you start to see an interesting pattern:
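An r-squared like this can be pulled from a simple linear fit. The sketch below uses simulated card trips, so the value it prints will not be .55. The way the pre-tip fare is derived (total_amount minus tip_amount) is my reading of "total fare before tip" and should be treated as an assumption.

```r
# Simulated card trips (assumption): tips scatter around 20% of the pre-tip fare.
set.seed(7)
n <- 500
pre_tip <- round(runif(n, 4, 60), 2)
tip     <- pmax(round(0.2 * pre_tip + rnorm(n, sd = 2), 2), 0)
card    <- data.frame(total_amount = pre_tip + tip, tip_amount = tip)

# Total fare before tip, derived from the columns in the data.
card$fare_before_tip <- card$total_amount - card$tip_amount

# Fit tip ~ fare and read off the r-squared.
fit <- lm(tip_amount ~ fare_before_tip, data = card)
r_squared <- summary(fit)$r.squared
print(r_squared)
```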
Those lines that seem to be coming out of the origin -- they're pretty much exactly the 20%, 25%, and 30% lines. That's pretty good news for whoever thought to institute those preset tip percentages: they're working.
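One way to check how many tips sit on those lines is to compute each tip as a percentage of the pre-tip fare and see how close it falls to 20, 25, or 30. The six trips below are made up, and the 0.5 percentage point tolerance is an arbitrary assumption.

```r
# Made-up card trips: pre-tip fare and tip in dollars.
fare_before_tip <- c(10, 20, 8, 40, 15, 12)
tip_amount      <- c(2.00, 5.00, 2.40, 8.00, 1.00, 3.00)

# Tip as a percentage of the pre-tip fare.
tip_pct <- 100 * tip_amount / fare_before_tip

# Nearest of the three percentages for each trip.
presets <- c(20, 25, 30)
nearest <- sapply(tip_pct, function(p) presets[which.min(abs(p - presets))])

# A tip counts as "on a line" if it is within the tolerance of that percentage.
tolerance <- 0.5
on_preset <- abs(tip_pct - nearest) < tolerance

print(table(nearest[on_preset]))  # trips per percentage line
print(mean(on_preset))            # share of trips on any of the three lines
```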