Kaggle competition aftermath

After taking Andrew Ng’s class and reading a couple of books about machine learning, I finally found time to do something with it. This was my first Kaggle competition and it was really fun!

The problem was to identify robots in a penny-auction-style online marketplace. The data consists of a list of bid events (auction id, user id, time, IP, country) and a table with bidder ids, hashed contact and payment information, and a label saying whether the bidder is a robot or a human.

The most interesting part of this problem was finding features that describe bidding behavior. But before jumping to features I need to describe the data cleanup. The dataset was missing the country field for some records, so I decided to “restore” it based on the IP field. Since IPs were obfuscated I couldn’t use any IP-to-country databases. Instead, I built a map from the 2-, 3- and 4-octet prefixes of IP addresses to countries and filled the missing column with the most probable country (with an average probability of about 0.70).
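As a sketch of that restoration step (my own reconstruction – the function and field names are illustrative, not the actual competition code), one can build prefix-to-country counters and fall back from the longest matching prefix:

```python
from collections import Counter, defaultdict

def build_prefix_maps(rows):
    """rows: iterable of (ip, country) pairs from bids with a known country.
    Counts countries seen under each 2-, 3- and 4-octet IP prefix."""
    maps = [defaultdict(Counter) for _ in range(3)]
    for ip, country in rows:
        octets = ip.split('.')
        for n, m in zip((2, 3, 4), maps):
            m['.'.join(octets[:n])][country] += 1
    return maps

def restore_country(ip, maps):
    """Pick the most frequent country under the longest matching prefix."""
    octets = ip.split('.')
    for n, m in zip((4, 3, 2), reversed(maps)):
        prefix = '.'.join(octets[:n])
        if prefix in m:
            return m[prefix].most_common(1)[0][0]
    return None
```

Preferring the 4-octet match before falling back to 3 and 2 octets is what keeps the average accuracy reasonably high: the longer the shared prefix, the more likely two obfuscated IPs sit in the same network.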

Time was obfuscated too, and it became even more confusing after plotting bid times per auction: [figure: auctions_to_bids_gap_small] (blue – human bids, red – robot bids)

I was not sure how to interpret those big gaps within auctions. Were they due to obfuscation? Were they periods when the auction wasn’t operating (during the night, for example)? Or was it another data error, with several datasets with non-unique auction ids merged together? So I ran some stats on each period:

=== Period 1 ===
Min: 9631916842105263, Max: 9645558894736842, Min diff.: 52631578, Max diff.: 210526316
=== Period 2 ===
Min: 9695580000000000, Max: 9709222052631578, Min diff.: 52631578, Max diff.: 526315790
=== Period 3 ===
Min: 9759243157894736, Max: 9772885210526315, Min diff.: 52631578, Max diff.: 210526315

Diff between first and second period is 50021105263158
Diff between second and third period is 50021105263158

Then it became clear that 52631578 is the minimum unit of time measurement (a second or a millisecond), and that the periods are the result of mixed datasets: the difference between them is too exact, and no auction runs across all three periods. So I removed those gaps and created new auctions from each period (the auction count went from 15k to 19k).
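A minimal sketch of that split (my own reconstruction): any time jump far above the observed max bid-to-bid diff (~5.3e8) but far below the ~5.0e13 inter-period gap marks a period boundary.

```python
def assign_periods(times, gap_threshold=10**12):
    """times: bid timestamps sorted ascending. Any jump larger than
    gap_threshold starts a new period; returns a period index per bid.
    Suffixing each auction id with this index splits the mixed
    datasets into separate auctions."""
    periods, current = [], 0
    for prev, cur in zip([times[0]] + list(times), times):
        if cur - prev > gap_threshold:
            current += 1
        periods.append(current)
    return periods
```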

Address and payment fields are 36 characters long, which is unusual for GUID fields. I took a closer look and noticed that each address and payment consists of a repeating 32-character string and a unique 6-character string. So I split them into two new fields each (although this didn’t lead to any good features).

After this I started digging into user behavior, looking for patterns. I plotted user bids per country, hoping to see robots coming from some subset of countries: [figure: country_distribution_using_bids] (green – user bids, red – robot bids)

If robots fake their IP/country, they might not take the timezone into account and bid when humans usually don’t (during the night, for example). So I plotted human/robot bid times per country:

[figures: bid times per country – Liberia, India]

Again, nothing conclusive there – some countries do show a normal day/night pattern, but in the most popular countries, like India, people were bidding non-stop.

The robots’ intent was obviously to buy goods as cheaply as possible, but humans were greedy as well, so the price-per-product chart got us nothing either: [figure: prices]

It did reveal, though, that robots had no interest in the auto parts category.

At this point I started generating features based on various bidder statistics: ratio of devices per auction, number of simultaneous auctions, average number of different IPs per set of consecutive bids, average number of bids per auction, average number of bids per referral URL, average price, how frequently the IP changed, and so on.
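Most of these reduce to per-bidder aggregates over the bids table. A sketch with pandas (the column names are illustrative, not the competition’s exact schema):

```python
import pandas as pd

def bidder_features(bids):
    """Per-bidder behavioral aggregates from the raw bids table."""
    g = bids.groupby('bidder_id')
    feats = pd.DataFrame({
        'n_bids': g.size(),
        'n_auctions': g['auction'].nunique(),
        'n_ips': g['ip'].nunique(),
        'n_devices': g['device'].nunique(),
        'n_urls': g['url'].nunique(),
    })
    # Ratios normalize raw counts by activity level.
    feats['bids_per_auction'] = feats['n_bids'] / feats['n_auctions']
    feats['ips_per_auction'] = feats['n_ips'] / feats['n_auctions']
    return feats
```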

This alone gave me a 0.91 score with a simple ensemble of Random Forest and Logistic Regression. Another small boost came from computing the fraction of fraudulent bids per country and per 2- and 3-octet IP prefix. But none of those features described bidder behavior. I’d expect bots to follow two strategies: bid early in an auction to drive up the price, or bid as close to the end of the auction as possible (to spend as few bids as possible and still win the item).
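The ensemble itself was conceptually just a blend of the two models’ predicted robot probabilities. A sketch (the 50/50 weight is my assumption – in practice it would be tuned on CV):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def ensemble_proba(X_train, y_train, X_test, w=0.5):
    """Blend RF and LR robot probabilities; since the competition is
    scored by AUC, only the ranking of these scores matters."""
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    lr = LogisticRegression(max_iter=1000)
    rf.fit(X_train, y_train)
    lr.fit(X_train, y_train)
    return (w * rf.predict_proba(X_test)[:, 1]
            + (1 - w) * lr.predict_proba(X_test)[:, 1])
```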

I plotted bid times within auctions to look for patterns: [figure: auctions_to_bids] (green – unknown bids, blue – humans, red – robots)

I noted that auctions still had some gaps, but with fuzzier edges, and some of them were shifted, so they might reflect night/day changes across timezones. To confirm this I plotted a time histogram for both humans and robots: [figure: bids_to_time_bucket]

Human bids definitely show some periodicity, while the robots’ periods are less obvious. I concluded there were three days of data for each auction.

Plotting several auctions as bid time per bidder id revealed that some users bid very aggressively while others don’t. Here is auction cfw9q: [figure: auction_cfw9q_1]

But both robots and humans bid aggressively. I tried to calculate “time to response”, i.e. the time between the previous user’s last bid and the current user’s bid. The idea was that robots should have a smaller response time, since they’re machines. It turned out not to be the case – humans were apparently using software to place bids and were as fast as the robots.
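The computation itself is simple – a sketch (my own reconstruction) over one auction’s time-ordered bid stream:

```python
def response_times(auction_bids):
    """auction_bids: list of (time, bidder_id) sorted by time within one
    auction. Returns, per bidder, the gaps between their bids and the
    immediately preceding bid made by someone else."""
    out = {}
    for (t_prev, b_prev), (t, b) in zip(auction_bids, auction_bids[1:]):
        if b != b_prev:  # only count reactions to other users' bids
            out.setdefault(b, []).append(t - t_prev)
    return out
```

From these per-bidder lists one can then take the mean or minimum as a feature.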

Another important observation was that none of the auctions ended with a dense burst of bids. It looks like users exhausted their budget for the product and decided to either go “Buy it now” or just bail. I decided to confirm this theory by plotting the number of unique bidders over the last n bids; n=50 turned out to be the most descriptive: [figure: bids_to_unique_50]

So it looks like robots start bidding more aggressively when there are 4-5 unique bidders left in an auction. Humans follow a very similar pattern, but shifted towards 1-2 uniques.

Knowing other bidders’ strategies seemed valuable from a robot’s standpoint – a robot may prefer to avoid aggressive bidders. This is where my next theory came from: robots may follow other users, or prefer auctions with particular users. To check this I put users into buckets based on their co-occurrence in auctions. Although this did not lead to any useful features, by building a graph from a disjoint-set data structure over user co-occurrence I discovered that there are two sets of users who never appeared in the same auction together. This confirmed my theory about mixed datasets. [figure: 2_robots_graph-min_opt]
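A sketch of that disjoint-set check, assuming a mapping from auction id to the list of bidders seen in it (the function names are mine):

```python
class DSU:
    """Minimal disjoint-set (union-find) with path halving."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cooccurrence_components(auction_to_bidders):
    """Union all bidders that ever met in the same auction; the resulting
    components are groups of users that never co-occur with outsiders."""
    dsu = DSU()
    for bidders in auction_to_bidders.values():
        dsu.find(bidders[0])  # register singleton bidders too
        for b in bidders[1:]:
            dsu.union(bidders[0], b)
    groups = {}
    for b in list(dsu.parent):
        groups.setdefault(dsu.find(b), set()).add(b)
    return list(groups.values())
```

Two components coming out of this, as described above, is exactly the signature of two datasets having been concatenated.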

Another approach to determining robot strategy was to identify similar bid sequences across all auctions. I called the set of bids from a particular user within an auction a track, and wanted to compute the Jaccard similarity coefficient between all pairs of tracks. Computing it directly is not feasible, so I implemented the min-hashing algorithm, and since we were interested only in similar pairs, it made sense to apply locality-sensitive hashing as well. Unfortunately those optimizations were not enough to finish the computation within several hours (the point where my patience ran out), and I didn’t have time to optimize further.
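For reference, the min-hashing part can be sketched like this (my own reconstruction; what goes into each track’s set – bucketed times, IPs, etc. – is an implementation choice):

```python
import random

def make_hashers(n, seed=0):
    """n cheap hash functions: Python's hash XOR-ed with random masks."""
    rnd = random.Random(seed)
    return [lambda x, m=rnd.getrandbits(64): hash(x) ^ m for _ in range(n)]

def minhash_signature(track, hashers):
    """track: a set of bid attributes; one minimum per hash function."""
    return [min(h(x) for x in track) for h in hashers]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With signatures in hand, LSH amounts to splitting each signature into bands and bucketing tracks by band values, so only tracks sharing at least one bucket become candidate pairs.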

Another thing I tried was predicting the class of each bid rather than each bidder, then computing the average probability for each bidder over their bids. It didn’t work out well, and I ran into memory allocation problems, since the dataset is pretty big.

For those who are curious, I ran my predictions on an Amazon r3.2xlarge instance (since I needed a lot of RAM), using scikit-learn, Python and R (RStudio in the cloud). Since this is a classification problem scored by AUC, the algorithm should output probabilities for each class. I tried LogisticRegression, RandomForestClassifier, AdaBoostClassifier and GradientBoostingClassifier, and a combination of the first two worked best for me in 5-fold CV. You can find my model on GitHub.


What’s Bitcoin good for, anyway?

The idea of using Bitcoin instead of “normal” monetary currencies has been getting a lot of attention lately, and a lot of people seem to be wondering why anyone would want to use Bitcoin in the first place. This is, of course, a question worth pondering, but I think it has been sufficiently addressed. Entrepreneur and software engineer Marc Andreessen has persuasively argued in favor of Bitcoin, highlighting several advantages it has over conventional currency.

Consider for a moment one of the characteristics that makes Bitcoin a revolutionary currency: it can be divided into very small amounts. While a U.S. dollar can be divided into no more than one hundred parts – two decimal places – a Bitcoin can be partitioned down to eight decimal places. This is very exciting and opens up a whole new world of possibilities, and Andreessen points out that the ability to send very small amounts of money (“micropayments”) can be harnessed for beneficial purposes.

For instance, Bitcoin could be instrumental in filtering out spam: if posting a comment to a given site required the commenter to pay a very small fee in Bitcoin, spammers would almost certainly be hesitant to post. Since the effectiveness of spam relies on volume, having to pay per comment would basically defeat the whole purpose of spam. Individuals who genuinely have something to say, on the other hand, would hardly be affected, since the fee would be extremely small (say, thousandths of a penny).


Escape unescaped

JavaScript has earned a reputation as a tricky – and, well, the most popular – language nowadays. In day-to-day programming life, though, software engineers try to avoid hacks, shortcuts and tricks. This makes code simple, easy to maintain and more reliable. But there is a flip side of the coin – security, which requires a deeper understanding of how the language and the browser work. The world has seen plenty of XSS and other JavaScript-based attacks, even on such reliable websites as Facebook and Google – to list just a few recent targets. So when I came across this nifty JavaScript escaping challenge, I immediately decided to test my skills.

Spoiler alert: I’m going to walk through each challenge and explain my solution. So if you want to test your knowledge of JavaScript first – stop reading, and come back only if you’re desperate.

If you’re reading this line, it means you’re either stuck somewhere in the middle of the quiz or just want to learn something new about JavaScript. Welcome.

Each challenge consists of a JavaScript function which accepts a single string parameter. Your job is to craft a string that forces the function to execute alert(1).

Challenge 0

function escape(s) {
  // Warmup.
  return '<script>console.log("' + s + '");</script>';
}

This is a pretty straightforward XSS injection: whatever we pass ends up in the page verbatim, so we can close the string and the script tag early – which is completely different from what the author intended.
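For example (this is one possible payload, not the only solution), closing console.log’s string and the script tag lets us open our own script; the trailing // comments out the leftover junk:

```javascript
function escape(s) {
  return '<script>console.log("' + s + '");</script>';
}

// One possible payload for the warmup challenge:
const payload = '");</script><script>alert(1)//';
console.log(escape(payload));
// The browser sees two scripts; the second one runs alert(1),
// and the trailing // comments out the leftover '");'.
```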


CSS-only: Load images on demand

The brain is a muscle, and like all muscles, it needs regular exercise to stay sharp.

That’s why I decided to take a very old (but efficient) web optimization technique and implement it in a new, crazy way. You’ve most likely heard about loading images (and other resources) on demand – it’s not only common sense, but also a good way to speed up the initial page load. Usually it’s implemented via JavaScript, but today I’ll share how you can achieve the same effect using pure CSS.


Another scriptless clickjacking vector

Recently, one of my colleagues showed me a trick Google does on their search results page. If you search for something, the results initially contain HTML with the anchors we’d expect:

<a class="l" onmousedown="return rwt(this,'','','','1','AFQjCNGGfyJjOyiWYPB3FW-h7Pt6A5uwlA','4k2v33QNU7tijpC6ZLriyQ','0CDIQFjAA','','',event)" 
href="http://en.wikipedia.org/wiki/Cross-site_scripting"><em>Cross-site scripting</em> - Wikipedia, the free encyclopedia</a>

But did you notice the onmousedown event handler? Let’s see what it does – right-click the link and examine its HTML again:

<a class="l" onmousedown="return rwt(this,'','','','1','AFQjCNGGfyJjOyiWYPB3FW-h7Pt6A5uwlA','4k2v33QNU7tijpC6ZLriyQ','0CDIQFjAA','','',event)" 
href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;ved=0CDIQFjAA&amp;url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FCross-site_scripting&amp;ei=6oZYUcKUA4WSiALF4YHYDQ&amp;usg=AFQjCNGGfyJjOyiWYPB3FW-h7Pt6A5uwlA&amp;sig2=4k2v33QNU7tijpC6ZLriyQ"><em>Cross-site scripting</em> - Wikipedia, the free encyclopedia</a>

Now it points to a different location! Also note that before we clicked, the status bar at the bottom showed a link to Wikipedia instead of Google’s /url page…

It’s a clickjacking attack, isn’t it? Of course, Google is a good boy and will eventually navigate us to Wikipedia, but first it will log our navigation through the /url page.
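The trick itself is tiny. A minimal reconstruction (illustrative only – Google’s actual rwt function does more): keep the real destination in href so the status bar looks honest, and swap it for a logging redirector the moment the mouse button goes down:

```javascript
// Build the /url-style redirect that carries the real destination
// as a query parameter ('/url?url=' is an assumed prefix here).
function trackedHref(realUrl, trackerPrefix) {
  return trackerPrefix + encodeURIComponent(realUrl);
}

// In the page, the swap would be wired up like this:
// link.addEventListener('mousedown', function () {
//   link.href = trackedHref(link.href, '/url?url=');
// });
```

By the time the click completes and the browser reads href, it already points at the tracker, which logs the visit and then forwards to the real page.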

Due to their high practical impact, clickjacking attacks have attracted a lot of attention from the security community. If you recall, I’ve written several blog posts dedicated to this topic, and here are some more links for you:

A Solution for the Automated Detection of Clickjacking Attacks

Scriptless Attacks – Stealing the Pie Without Touching the Sill

Clickjacking in LinkedIn

Self-Exfiltration: The Dangers of Browser-Enforced Information Flow Control

UI Redressing Mayhem: Identification Attacks and UI Redressing in Chrome

Hyperlink Spoofing and the Modern Web

Following the developments and published work mentioned above, a plethora of more or less feasible defense techniques have been proposed. All these attempts have a clear goal: stopping clickjacking attacks. In general, if an attacker manages to execute JavaScript on the target domain, he can control the whole Web page the victim is viewing. Therefore, a recommended mitigation strategy is to deactivate or limit JavaScript execution, employing tools such as NoScript, Content Security Policy (CSP), or HTML5 sandboxed iframes.

In this blog post I’d like to evaluate in practice whether restricting scripting content is sufficient to mitigate the attack. For the rest of the article I assume that an attacker has access to the DOM and/or CSS of the victim page, but scripting is completely disabled.
