The problem was to find robots on a penny-auction-style online auction site. The data consists of a list of bid events (auction id, user id, time, IP, country) and a table with bidder IDs, hashed contact and payment information, and whether the bidder is a robot or a human.
The most interesting part of this problem was finding features which describe bidding behavior. But before jumping to features I need to describe data cleanup. The dataset was missing the country field for some records, so I decided to “restore” it based on the IP field. Since the IP was obfuscated I couldn’t use any IP/country databases. Instead I built a map from the 2-, 3- and 4-octet prefixes of IP addresses to countries and filled the missing column with the most probable country (with avg. 0.70 probability).
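The prefix-to-country lookup can be sketched roughly like this — a minimal version of the idea, with hypothetical function and field names, assuming IPs are dotted-quad strings and the country column is `None` when missing:

```python
from collections import Counter, defaultdict

def build_prefix_maps(records):
    """Count countries seen behind each 2-, 3- and 4-octet IP prefix.
    `records` is an iterable of (ip, country) pairs; country may be None."""
    maps = {n: defaultdict(Counter) for n in (2, 3, 4)}
    for ip, country in records:
        if country is None:
            continue
        octets = ip.split(".")
        for n in (2, 3, 4):
            maps[n][".".join(octets[:n])][country] += 1
    return maps

def most_probable_country(ip, maps):
    """Fill a missing country with the most frequent one behind the
    longest matching prefix; returns (country, probability)."""
    octets = ip.split(".")
    for n in (4, 3, 2):  # prefer the most specific prefix
        counts = maps[n].get(".".join(octets[:n]))
        if counts:
            country, hits = counts.most_common(1)[0]
            return country, hits / sum(counts.values())
    return None, 0.0
```

The returned probability (share of the winning country among all bids behind that prefix) is what the ~0.70 average above refers to.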
I was not sure how to interpret the big gaps within auctions. Were they due to obfuscation, were they times when the auction wasn’t operating (during the night, for example), or was it another data error, with several datasets with non-unique auction ids merged together? So I ran some stats on each period:
=== Period 1 ===
Min: 9631916842105263, Max: 9645558894736842
Min diff.: 52631578, Max diff.: 210526316
=== Period 2 ===
Min: 9695580000000000, Max: 9709222052631578
Min diff.: 52631578, Max diff.: 526315790
=== Period 3 ===
Min: 9759243157894736, Max: 9772885210526315
Min diff.: 52631578, Max diff.: 210526315

Diff between first and second period is 50021105263158
Diff between second and third period is 50021105263158
Then it became clear that 52631578 is the minimum unit of time measure (a second or a millisecond). And the periods are the result of mixed datasets, since the difference between them is suspiciously exact and no auction runs across all three periods. So I removed those gaps and created new auctions from each period (the auction count went from 15k to 19k).
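The gap-removal step amounts to the following: take the time-sorted bids of one auction id and cut it into separate auctions wherever consecutive bids are further apart than some threshold (the ~5.0e13 inter-period gap above is a natural choice). A minimal sketch, with hypothetical names:

```python
def split_auction_by_gaps(times, gap_threshold):
    """Split a sorted list of bid timestamps into segments wherever the
    gap between consecutive bids exceeds gap_threshold; each segment
    then becomes its own auction."""
    segments, current = [], [times[0]]
    for prev, cur in zip(times, times[1:]):
        if cur - prev > gap_threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    return segments
```

Applying this to every original auction id is what grows the auction count from 15k to 19k.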
Address and payment fields are 36 characters long, which is unusual for GUID fields. I took a closer look and noticed that each address and payment consists of a repeating 32-character string and a unique 6-character string. So I split each of them into two new fields (although this didn’t lead to any good features).
After this I started digging into user behavior, looking for any patterns. I plotted users’ bids by country in the hope of seeing robots coming from some subset of countries (green – users’ bids, red – robots’ bids):
If robots fake their IP/country they might not take the timezone into account and bid when humans usually don’t (during the night, for example). So I plotted human/robot bid times per country:
Again, nothing conclusive there – some countries do have a normal day/night pattern, but in the most popular countries, like India, people were bidding non-stop.
Although this chart did reveal that robots had no interest in the auto parts category.
At this point I started generating features based on various bidder statistics: ratio of devices per auction, number of simultaneous auctions, average number of different IPs per set of consecutive bids, average number of bids per auction, average number of bids per referral URL, average price, how frequently the IP changed, and so on.
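Features of this kind are straightforward per-bidder aggregations over the bids table. A minimal sketch with pandas, assuming (hypothetically) columns named `bidder_id`, `auction`, `device`, `ip` and `url`:

```python
import pandas as pd

def bidder_features(bids):
    """Aggregate per-bidder behavioral counts and ratios from a bids
    DataFrame with columns: bidder_id, auction, device, ip, url."""
    g = bids.groupby("bidder_id")
    feats = pd.DataFrame({
        "n_bids": g.size(),
        "n_auctions": g["auction"].nunique(),
        "n_ips": g["ip"].nunique(),
        "n_devices": g["device"].nunique(),
        "n_urls": g["url"].nunique(),
    })
    # ratio-style features of the kind described above
    feats["bids_per_auction"] = feats["n_bids"] / feats["n_auctions"]
    feats["ips_per_auction"] = feats["n_ips"] / feats["n_auctions"]
    return feats
```

The time-ordered features (consecutive-bid IP runs, IP-change frequency) need a sort by time first, but follow the same groupby pattern.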
This already gave me a 0.91 score with a simple ensemble of Random Forest and Logistic Regression. I got another small boost from computing the fraction of fraudulent bids per country and per 2- and 3-octet IP prefix. But none of those features described bidder behavior. I’d expect bots to follow two strategies: robots who bid early in an auction to drive up the price, and robots who bid as close to the end of the auction as possible (to spend as few bids as possible and still win the item).
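The ensemble itself can be as simple as averaging the two models’ predicted probabilities — since AUC scores the ranking, not calibrated probabilities, a weighted mean is a reasonable blend. A sketch (hyperparameters and the 50/50 weight are illustrative, not the author’s exact settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def ensemble_proba(X_train, y_train, X_test, w=0.5):
    """Blend robot probabilities from a Random Forest and a Logistic
    Regression by a weighted average."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    lr = LogisticRegression(max_iter=1000)
    rf.fit(X_train, y_train)
    lr.fit(X_train, y_train)
    # column 1 of predict_proba is P(class == 1), i.e. "robot"
    return w * rf.predict_proba(X_test)[:, 1] + (1 - w) * lr.predict_proba(X_test)[:, 1]
```

The blend weight is the kind of thing to tune on the 5-fold CV mentioned at the end of the post.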
I noticed that auctions still had some gaps, but with less strict edges, and some of them were shifted, so they might reflect night/day changes across timezones. To confirm this I plotted a time histogram for both humans and robots:
Humans’ bids definitely show some periodicity, while robots’ periods are less obvious. I concluded there were three days of data for each auction.
But there were both robots and humans who bid aggressively. I tried to calculate “time to response”, i.e. the time between the previous user’s last bid and the current user’s bid. The idea was that robots should have a smaller response time since they’re machines. But that turned out not to be the case – humans were using software to place bids and were as fast as robots.
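One way to read “time to response” is: for each bid, the delay since the most recent bid by a *different* user in the same auction (a consecutive bid by the same user is not a response to anyone). A minimal sketch under that interpretation:

```python
def time_to_response(bids):
    """bids: time-sorted list of (user, t) pairs for one auction.
    Returns, per bid, the delay since the previous bid if it was made
    by a different user, else None (first bid, or self-follow-up)."""
    out, prev_user, prev_t = [], None, None
    for user, t in bids:
        if prev_user is not None and prev_user != user:
            out.append(t - prev_t)
        else:
            out.append(None)
        prev_user, prev_t = user, t
    return out
```

Per-bidder statistics (min, mean) of these delays are then the candidate features.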
Another important observation was that none of the auctions ended with a dense burst of bids. It looks like users exhausted their budget for the product and decided either to go and “Buy it now” the product or just bail. I decided to confirm this theory by plotting the number of unique bidders over the last n bids; n=50 appeared to be the most descriptive.
So it looks like robots start bidding more aggressively when there are 4-5 unique bidders left in an auction. Humans follow a very similar pattern, but shifted towards 1-2 uniques.
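The “uniques left” signal can be computed with one backward pass over an auction’s bid sequence — for each bid, how many distinct bidders still place bids from that point to the end. A sketch, assuming this reading of the metric:

```python
def uniques_remaining(bidder_ids):
    """For each bid in a time-ordered auction, the number of distinct
    bidders seen at or after that bid ("uniques left")."""
    remaining, seen = [], set()
    for b in reversed(bidder_ids):
        seen.add(b)
        remaining.append(len(seen))
    return remaining[::-1]
```

A bidder’s distribution over these values then captures whether they ramp up at 4-5 uniques left (robot-like) or at 1-2 (human-like).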
Knowing other bidders’ strategies seemed valuable from a robot’s standpoint – a robot may prefer to avoid aggressive bidders. This is where my next theory came from: robots may follow other users or prefer auctions with certain users. To identify that, I put users into buckets based on their co-occurrence in auctions. Although this did not lead to any useful features, I discovered (by building a graph from a disjoint-set data structure based on users’ co-occurrence in auctions) that there are two sets of users who never appeared in the same auction together. This confirmed my theory of mixed datasets.
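The disjoint-set (union-find) construction works like this: union all bidders that share an auction, then count the distinct roots — two surviving components is exactly the “two sets of users who never meet” observation. A minimal sketch:

```python
class DisjointSet:
    """Union-find over bidder ids, with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def co_occurrence_components(auctions):
    """auctions: dict auction_id -> list of bidder ids.
    Returns the number of groups of users connected by shared auctions."""
    ds = DisjointSet()
    for bidders in auctions.values():
        for b in bidders:
            ds.union(bidders[0], b)  # everyone in an auction is linked
    return len({ds.find(b) for bs in auctions.values() for b in bs})
```

A result of 2 on the full dataset is what points to two merged, non-overlapping datasets.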
Another approach to determining the robots’ strategy was to identify similar bid sequences across all auctions. I called the set of bids from a particular user within an auction a track, and wanted to find the Jaccard similarity coefficient between all pairs of tracks. Computing it directly is not feasible, so I implemented the min-hashing algorithm, and since we were interested only in similar pairs it made sense to apply locality-sensitive hashing as well. Unfortunately those optimizations were not enough to compute it within several hours (the point where my patience ran out), and I didn’t have time to optimize it further.
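For reference, the MinHash + LSH combination looks roughly like this (a sketch, not the author’s implementation; each track is assumed to be reduced to a set of hashable tokens, e.g. discretized bid times):

```python
import random

def minhash_signature(items, n_hashes=64, seed=0):
    """MinHash signature of a set: the minimum salted hash of its
    elements, for each of n_hashes salts. The fraction of matching
    signature positions estimates the Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(n_hashes)]
    return [min(hash((salt, it)) for it in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(signatures, bands=16):
    """Group track ids whose signatures agree on at least one band;
    only pairs sharing a bucket need an exact similarity check."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for track_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(track_id)
    return buckets
```

Even with LSH pruning, candidate verification over millions of tracks is the part that blows the time budget, as described above.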
Another thing I tried was to predict not the bidder class but the bid class, and then compute the average probability for each bidder over their bids. It didn’t work out well, and I also ran into memory allocation problems since the dataset is pretty big.
For those who are curious, I was running my predictions on an Amazon r3.2xlarge instance (since I needed a lot of RAM), using scikit-learn, Python and R (RStudio in the cloud). Since this is a classification problem scored by AUC, the algorithm should generate probabilities for each class. I tried LogisticRegression, RandomForestClassifier, AdaBoostClassifier and GradientBoostingClassifier, and the combination of the first two worked best for me in 5-fold CV. Please find my model on github.