You mostly know it yourself: even if you could not look out of the window, eyeballing the route on the screen is enough to tell quickly, intuitively, and with little uncertainty which road you are on right now and how you got there.
Even when the GPS jitters, jumps around, or is a bit off to the side, this is not enough to confuse a human. The GPS measurement error and sampling rate, however, combined with dense urban roads are enough to confuse a computer. …
I extracted Pull Request stats for 40 popular open source GitHub projects to see how likely it is for a PR to ever be merged. In this post you will find out which projects are the best use of your time as a contributor. Spoiler: some big, mature projects do better in this ranking than you would think!
(See my original post introducing my minimal implementation of Macrobase Diff)
After fetching the dataset I transformed it a bit. Most notably, I collapsed fields containing JSON inside cells, as this is not something Macrobase Diff can handle as of now. If you are curious how the dataset was treated before analysis, see this gist.
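To illustrate what collapsing a JSON cell can look like, here is a minimal sketch using only the Python standard library. It is not the code from the gist. It assumes that a cell holds a JSON list of objects with a `name` key, and the helper name `collapse_json_cell` and the sample row are made up for this example.

```python
import json


def collapse_json_cell(cell, key="name"):
    """Turn a JSON-encoded list of objects into a sorted list of plain strings.

    Assumption: the cell looks like '[{"id": 1, "name": "Action"}, ...]'.
    Anything that does not parse into a list yields an empty list.
    """
    if not cell:
        return []
    try:
        items = json.loads(cell)
    except (TypeError, ValueError):
        # Not valid JSON: treat the cell as empty
        return []
    if not isinstance(items, list):
        return []
    return sorted(
        str(item[key])
        for item in items
        if isinstance(item, dict) and key in item
    )


# Hypothetical row, shaped like the assumption above
row = {
    "title": "Example Movie",
    "genres": '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
}

# Flatten the JSON cell into a single pipe-separated string
row["genres"] = "|".join(collapse_json_cell(row["genres"]))
```

Joining the names into one string keeps the column categorical. Whether a single joined value or one column per name works better depends on how the attribute is meant to be explained.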
Let's see what kind of data we have about the IMDB movies:
'budget', 'genres', 'homepage', 'id', 'keywords', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count'
There are two types of columns that Macrobase Diff can use:
In my last post (https://pzakrzewski.com/posts/macrobase/) I introduced Macrobase, a tool and a methodology for prioritising attention in data analysis. I said that I would try to reimplement part of the Macrobase pipeline to understand it better, and this is what today’s post is about.
Of the many pipelines implemented in Macrobase, one got its own separate publication: https://cs.stanford.edu/~matei/papers/2019/vldb_macrobase_diff.pdf
The Diff Operator belongs to the Explain part of the methodology: tools that are supposed to provide possible clues as to what makes detected outliers/anomalies in the data special, or different from the in-group.
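To make "different from the in-group" a bit more concrete, here is a minimal sketch that computes a relative risk ratio for each value of one categorical attribute. Risk ratio is one of the metrics described in the Macrobase papers, but this is only an illustration and not Macrobase's implementation. The function name `risk_ratios` and the toy rows are assumptions made up for this example.

```python
from collections import Counter


def risk_ratios(outliers, inliers, attribute):
    """Relative risk of being an outlier for each value of one attribute.

    For a value v:
      exposed   = share of outliers among rows where attribute == v
      unexposed = share of outliers among rows where attribute != v
      ratio     = exposed / unexposed
    """
    out_counts = Counter(row[attribute] for row in outliers)
    in_counts = Counter(row[attribute] for row in inliers)
    total_out, total_in = len(outliers), len(inliers)

    ratios = {}
    for value, a_out in out_counts.items():
        a_in = in_counts.get(value, 0)
        b_out = total_out - a_out  # outliers without this value
        b_in = total_in - a_in     # inliers without this value
        exposed = a_out / (a_out + a_in)
        unexposed = b_out / (b_out + b_in) if (b_out + b_in) else 0.0
        # If no outlier lacks the value, the ratio is unbounded
        ratios[value] = exposed / unexposed if unexposed else float("inf")
    return ratios


# Toy data, invented for illustration
outliers = [
    {"original_language": "en"},
    {"original_language": "en"},
    {"original_language": "fr"},
]
inliers = [
    {"original_language": "en"},
    {"original_language": "fr"},
    {"original_language": "fr"},
    {"original_language": "fr"},
]

ratios = risk_ratios(outliers, inliers, "original_language")
```

A ratio well above 1 suggests the value is over-represented among the outliers, and a ratio below 1 suggests the opposite. A real explanation step would also need a minimum support threshold, so that rare values do not dominate the ranking.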
The diff requires the data to…
Some time ago I played around with Macrobase, as I have been interested in anomaly detection for analysing monitoring metrics. It is an academic project that produced a tool and a methodology. As is often the case, the tool itself is more of a proof of concept, useful for exploring the methodology. So I will focus on the method and the problems it can address.
This obviously depends heavily on the domain, but as a gross generalisation you would like to find the unexpected. …
This blog post was written in collaboration with Ezrah Ligthart Schenk
For quite some time it seemed as if REST APIs with JSON were the only game in town, with other choices relegated to legacy or niche projects. This is no longer true in 2018: protocol buffers/gRPC and GraphQL have entered the mainstream and are frequently considered for new projects. While there might be a great many reasons why one goes for one tech or another, in this blog post I will focus on raw performance and try to dispel some myths and misconceptions about where gRPC performance comes from.