Notes about a conference on practical big data
Over two days in late October, I had the privilege of attending a conference about data science in Budapest, Hungary. The audience of the conference was evenly split between data scientists, business analysts, project managers and software engineers. Most of the people were employees of local software development companies Prezi or LogMeIn, or employees of the major sponsors (Ericsson, Bosch, MSCI, SAP, etc.). Strangely, half of the speakers were from the US (and Alistair Croll is from Montreal). I was accompanied by two data scientists from Shopify, with whom I had many insightful conversations during the breaks. There was also a UX conference called Amuse at the same time and we had access to talks from both, but we stayed in the Crunch room for the whole two days.
Following is a list of my personal notes and impressions for each of the talks. The Twitter stream under the #crunchbp hashtag captures a large part of the audience reactions to the talks. Most talks were filmed and are available on Ustream, but some will be missing. The number of stars roughly indicates the order in which I think you should watch the videos if you’re not going to watch them all, ranked by quality and general interest.
Thursday October 29th
The Web was initially thought of as digital print. Today, technology takes the form of digital agents. If mobile phones continue in the trend of becoming prosthetic brains, they must learn to tell us when they have useful information. Interruptions are the new “interface”. Augmented reality doesn’t get in the way (think more of the feeling of knowing there’s someone behind you than of putting things right into your visual field).
We as humans are bad at finding causation and computers are even worse: humans because they like a good story, and computers because they easily get stuck in a local optimum and can’t innovate (aren’t all currently “smart” devices just always stating the obvious?). For an example of such mindless optimization, consider the appearance of the PPM device to track radio broadcast audiences automatically. Compared to the older listening diaries, the device could now track music heard in public areas. This changed the resulting ratings, and radio programming was adapted in a way that caused the frequency of the most played songs to increase significantly. Instead of playing more of what people liked, they basically started playing more of what was playing.
Analytics systems are in a position to cause discrimination unintentionally, in the form of digital redlining. (It’s sad that the video is not available; I would really like to hear that part again. Maybe we can talk to Alistair later in Montreal.)
More recently, there has been a new blog article that roughly follows the subject of the talk.
Stephen Brobst and Scott Gnau (Twitter) from Hortonworks ⭐
Big data is marked by the transition from an infrastructure built around Transactions (CRM, ERP systems) to one built around Interactions (social networks, Web services, analytics).
The role of a Data Scientist is not the same as that of an “old” BI Analyst. The Data Scientist has to come up with new questions first, whereas a traditional BI Analyst was only concerned with finding answers.
Data centers are boring; today’s companies want their smart people to think about science. Netflix, which uses AWS extensively for its operations, was given as an example.
Data provenance is an upcoming operational challenge and it should be done as early as possible in the data capture process.
Oh, and they want us to use the terms data reservoir and data swamp. sigh
The first step of an A/B testing experiment is to come up with a question. There are two opposing strategies to solve the problem: the “scientific” way is to set up a small number of specific hypotheses and read all the available literature. The other strategy, the “crime boss” strategy, is to try lots of little things quickly and do no reading at all. While the first method is strictly ad hoc, the second becomes an operational process and lends itself to frequent A/B testing.
A/B tests require large samples. When the effects are weak (as is most often the case for conversion rates) and you can’t reach those sample sizes, all your tests will come out negative for lack of statistical power. If that’s the case, there is nothing wrong with going for the “scientific” approach and doing the reading instead of guessing.
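To get a feel for how large “large” is, here is a minimal sketch of the usual normal-approximation sample size formula for comparing two conversion rates. It is my own illustration, not something shown in the talk; the function name and the `baseline`, `lift`, `alpha` and `power` parameters are names I picked, and `lift` is an absolute difference in conversion rate.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.8):
    """Approximate visitors needed in EACH variant for a two-sided
    two-proportion z-test to detect an absolute `lift` over `baseline`."""
    p1 = baseline
    p2 = baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Going from a 5% to a 6% conversion rate, then to a much weaker effect.
    print(sample_size_per_variant(0.05, 0.01))
    print(sample_size_per_variant(0.05, 0.002))
```

The required sample grows with the inverse square of the effect, which is why weak conversion-rate effects are so expensive to detect.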
If you still go for A/B tests and wait for your appropriate sample size, don’t peek while the test is running! This is like cookie baking 😄 Otherwise, you’re (very badly) doing sequential testing.
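To convince yourself that peeking hurts, here is a small simulation sketch of my own (not from the talk). It runs A/A tests, where both variants share the same true conversion rate, and compares how often a fixed-sample z-test cries “significant” at any of the peeks versus only at the planned end. All names and default values are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def z_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two groups of n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variance yet, nothing to test
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(conv_a - conv_b) / n / se
    return z > NormalDist().inv_cdf(1 - alpha / 2)


def false_positive_rates(experiments=400, n=2000, peeks=20, rate=0.1, seed=1):
    """Return (rate when stopping at any significant peek, rate at the end only)."""
    rng = random.Random(seed)
    step = n // peeks
    peeking_hits = final_hits = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        ever_significant = False
        for i in range(1, n + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A "peek": run the fixed-sample test on the data seen so far.
            if i % step == 0 and z_significant(conv_a, conv_b, i):
                ever_significant = True
        peeking_hits += ever_significant
        final_hits += z_significant(conv_a, conv_b, n)
    return peeking_hits / experiments, final_hits / experiments


if __name__ == "__main__":
    with_peeking, without_peeking = false_positive_rates()
    print(f"false positives with peeking: {with_peeking:.1%}")
    print(f"false positives at the end only: {without_peeking:.1%}")
```

Since there is no real difference between the variants, every “significant” result is a false positive; the peeking figure should come out well above the nominal 5%.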
There are sequential sampling methods (see Optimizely, VWO SmartStats) of a kind that has been popular in clinical research since WWII, but there are fudge factors in the algorithms. Miller proposes his own Ruin Test, based on random walks and theory from the gambler’s ruin paradox. This method checks for excessive “drift” in order to end the test early and requires a more modest sample size.
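I did not note the exact rules of the Ruin Test, so the following is only a sketch of what a random-walk stopping rule of this kind can look like. The `2 * sqrt(n)` threshold is an assumption borrowed from a published simple sequential test design and is not confirmed by these notes; the function name and the 'T'/'C' encoding are mine.

```python
import math


def sequential_test(conversions, n):
    """Walk through conversions in arrival order.

    `conversions` is an iterable of 'T' or 'C', naming the arm (treatment or
    control) that produced each successive conversion. `n` is the maximum
    number of conversions to observe, chosen before the test starts.
    """
    threshold = 2 * math.sqrt(n)  # assumed "excessive drift" boundary
    treatment = control = 0
    for arm in conversions:
        if arm == "T":
            treatment += 1
        else:
            control += 1
        # The difference behaves like a random walk if both arms are equal.
        if treatment - control >= threshold:
            return "treatment wins", treatment + control
        if treatment + control >= n:
            return "no winner", treatment + control
    return "still running", treatment + control


if __name__ == "__main__":
    print(sequential_test("T" * 100, n=400))
    print(sequential_test("TC" * 200, n=400))
```

The appeal is that the stopping boundary is fixed in advance, so checking the counters after every conversion is part of the design rather than peeking.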
The 5 steps of building a data culture:
- Bad decisions → Getting started
- Under-utilization → Evangelization and internal marketing (“a good product doesn’t speak for itself”)
- Guidance needed → The realization stage?
- Scale yourself → Training
- Simple mistakes → Simplification (use humans to do just the hard parts)
Code reviews are rated for actually improving the product (and are appropriately named “r+”). The pressure is on engineers to prove that the change is bringing an improvement, and they then request help and guidance from the analytics team to achieve that. Part of that knowledge sharing is done through checklists (before launching a test, during it and after it has ended, not much different from airplane flight checks).
“Help” does not mean “on-call”. Do rotations.
Presented the architecture for building an analytics pipeline with open-source tools, with Google Cloud Services and Amazon Web Services. Sadly, the huge projector screen started blinking like crazy and had to be shut down, which was quite disruptive.
Yali explains to us how Snowplow works and how it extensively uses schema versioning to match the data coming into the system and to provide a contract to its consumers.
There are three main parts of the data pipeline:
The first part is the Validation part. The main output of this stage is the raw events data. This data is immutable. The events that could not pass validation are kept in an alternative storage, where it may be possible to reinterpret or mutate them in order to reintegrate them in the validated set.
The second part uses the validated event data and performs Dimension Widening. This step is where data is expanded in order to include all the interesting contextual information about this event. For example, geolocation may be used to expand from an IP address to a Country, or a user agent string may be used to determine the platform. The result of this part is also immutable and represents the “golden set” of the data.
The third part performs Data Modeling. This is where fancy processing can be done. The result often overwrites, updates or generally mutates existing data in order to provide more precise information. There is a challenge in inspiring confidence in the data while you keep changing it. Business people hate it when yesterday’s numbers change. Involve people.
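The three stages can be sketched in a few lines of plain Python. This is my own toy illustration of the idea and not Snowplow’s actual code or API: the schema registry, the field names, the IP prefix table and the function names are all hypothetical.

```python
from types import MappingProxyType

# Hypothetical schema registry: (event name, schema version) -> required fields.
SCHEMAS = {
    ("page_view", 1): {"user_id", "ip", "user_agent"},
    ("page_view", 2): {"user_id", "ip", "user_agent", "url"},
}

# Toy lookup table standing in for a real geolocation database.
IP_PREFIX_TO_COUNTRY = {"84.": "HU", "24.": "CA"}


def validate(events):
    """Stage 1: split raw events into read-only valid events and rejects."""
    good, bad = [], []
    for event in events:
        required = SCHEMAS.get((event.get("name"), event.get("schema_version")))
        if required is not None and required.issubset(event.keys()):
            good.append(MappingProxyType(dict(event)))  # frozen copy
        else:
            bad.append(event)  # kept aside for later repair and replay
    return good, bad


def widen(event):
    """Stage 2: return a NEW read-only event with derived context added."""
    widened = dict(event)
    widened["country"] = next(
        (country for prefix, country in IP_PREFIX_TO_COUNTRY.items()
         if event["ip"].startswith(prefix)),
        "unknown",
    )
    widened["platform"] = "mobile" if "Mobile" in event["user_agent"] else "desktop"
    return MappingProxyType(widened)


def model(widened_events):
    """Stage 3: a derived, recomputable aggregate (page views per country)."""
    views = {}
    for event in widened_events:
        views[event["country"]] = views.get(event["country"], 0) + 1
    return views


if __name__ == "__main__":
    raw = [
        {"name": "page_view", "schema_version": 1, "user_id": 1,
         "ip": "84.1.2.3", "user_agent": "Mozilla Mobile"},
        {"name": "page_view", "schema_version": 2, "user_id": 2,
         "ip": "24.0.0.1", "user_agent": "Mozilla"},  # missing "url": rejected
    ]
    good, bad = validate(raw)
    print(model(widen(e) for e in good), len(bad))
```

Only the last stage produces something that is expected to change when you rerun it, which is where the confidence problem with business people comes from.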
This talk was a bit different because it was presenting data instead of just being academic. At the same time, this was probably the most “academic” talk in terms of culture. The subject was how to hire (and subsequently retain) Data Scientists.
People can’t code. 46% of applicants claim they know Python, 35% have ever really written some Python, 26% can actually give a correct answer to a problem, but just a tiny 12% do this in an elegant way (no crazy copy-pasting). Come on people! In the same vein, many people can define regression, but most cannot apply it to a problem.
Friday October 30th
The guy wrote Lucene “as an experiment” and released it as open source. It went well so when he read a paper from Google he also wrote Hadoop.
We live in an era of hardware abundance. This is a good time for the Hadoop ecosystem. No single component is critical in that ecosystem; even Spark is replacing MR. The name Hadoop will probably stick for longer and become the name of the whole ecosystem, which is becoming a world-wide technological revolution. Let’s remember again that this is the name of his kid’s toy.
All this is nice and sunny, but we’ve barely touched the dark part: Hadoop is still immature (in terms of software quality) and there is a shortage of talent (the community is young). The integration problem Hadoop tries to solve is terribly complex, contrasting with the silos it is replacing (just for this reason, the guy also wrote Avro and most of the security layers). In times where having lots of data is becoming a social liability, we don’t want our industry to be seen as an evil industry and we don’t want to be the bad guys. To address this, we need to improve accountability, just like in almost any other industry.
Event streams are the representation that some things happened. Their content is immutable. Martin is showing us how to build the middle parts of a data pipeline and uses the draw an owl metaphor (Shopify folks rejoice) with Kafka and Samza. There are three kinds of events:
- Time series ticks (from sensors data, for example)
- Database change events (more on this later)
How to enrich events (for example PageViewEvent → PageViewWithProfileEvent)? The stream processor (hidden in that little arrow) may hit a DB to get the user profile data, but it will absolutely kill your poor operational DB because the aggregate bandwidth capacity of the event stream is roughly 2 orders of magnitude greater than the throughput of the DB. Caching will help you very little because you will still have spikes when a stream processor catches up on a stream. So don’t do that.
Make your operational systems publish database change events to your event streams. Move a private DB inside the stream processor (LevelDB or RocksDB seem pertinent choices). Perform a “join” between the streams by memoizing the incoming database change data into the internal DB. You parallelize by having Kafka streams partitioned by a meaningful key (here for example, the profile ID would be a good partitioning key). If a stream worker crashes, start a new one and just replay the streams. This is where stream compaction is a nice optimization and allows rebuilding the state in an efficient and robust way.
Most of these systems degenerate into some kind of stream/batch hybrid that some people have named kappa architectures. If you need to reprocess some data, start a new (versioned) stream pipeline beside the old one and when everything’s good, kill the old version.
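Here is how I picture that join, as a toy sketch in plain Python rather than actual Kafka or Samza code. A dict stands in for the embedded LevelDB/RocksDB store, and the event shapes, `NUM_PARTITIONS`, `partition_for`, `EnrichmentWorker` and `compact` are all hypothetical names of mine.

```python
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(profile_id):
    """Stable partitioning: a profile's changes and its page views
    always land on the same worker."""
    return crc32(str(profile_id).encode()) % NUM_PARTITIONS


class EnrichmentWorker:
    """One stream worker holding private state for its partition."""

    def __init__(self):
        self.profiles = {}  # in-memory stand-in for the embedded DB

    def on_profile_change(self, change):
        # Memoize the latest profile; a None profile means "deleted".
        if change["profile"] is None:
            self.profiles.pop(change["profile_id"], None)
        else:
            self.profiles[change["profile_id"]] = change["profile"]

    def on_page_view(self, view):
        # The "join": a local lookup instead of a query to the operational DB.
        return {**view, "profile": self.profiles.get(view["profile_id"])}


def compact(changes):
    """Keep only the latest change per key, as log compaction would."""
    latest = {}
    for change in changes:
        latest[change["profile_id"]] = change
    return list(latest.values())


if __name__ == "__main__":
    change_log = [
        {"profile_id": 1, "profile": {"name": "Ann"}},
        {"profile_id": 2, "profile": {"name": "Bob"}},
        {"profile_id": 1, "profile": {"name": "Anna"}},
    ]
    worker = EnrichmentWorker()  # e.g. a fresh worker after a crash
    for change in compact(change_log):
        worker.on_profile_change(change)
    print(worker.on_page_view({"profile_id": 1, "url": "/pricing"}))
```

Replaying the compacted log rebuilds the same state as replaying the full log, only with fewer events to read.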
A/B testing for conversion rates is not far from stock trading.
David Weisman ⭐⭐
Gave his talk with slides made in LaTeX with Beamer, using sexist examples taken from statistics articles from the sixties. He explained how to do manually what scikit-learn does for you, something Olivier Grisel explained much better at Montreal Python 53.
This talk mostly sounded like a pitch for a few of her books. The content of the talk was way too abstract, even if the subject is one of the most interesting. Seriously the most disappointing talk of the conference because I had built up high expectations.
Seriously, I might revisit the slides because maybe I just wasn’t in very good shape mentally at that time.
Elena Verna from SurveyMonkey [Video] ⭐⭐⭐⭐
Build 3 or 4 KPIs for the whole company.
A KPI must be a predictor. If it measures revenue and it starts to go down, you’re already screwed. Base them on your customer lifecycle, for example.
KPIs have drivers. For example, the number of surveys drives the number of respondents which in turn drives the acquisition (this is the KPI).
Push KPIs, especially to execs, via email. Do it.
Track growth in absolute value (avoid rates, because the denominator changes over time with the reference period). Provide trailing week or month. Consider seasonal effects and day of week.
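As a small worked example of the trailing-week idea (my own sketch, with made-up function names and data), summing over 7-day windows means every window contains each weekday exactly once, so the day-of-week effect cancels out when two windows are compared:

```python
def trailing_totals(daily_counts, window=7):
    """Sum of the last `window` days, for each day that has a full window."""
    return [
        sum(daily_counts[i - window + 1 : i + 1])
        for i in range(window - 1, len(daily_counts))
    ]


def absolute_growth(daily_counts, window=7):
    """Absolute change between each trailing window and the one before it."""
    totals = trailing_totals(daily_counts, window)
    return [totals[i] - totals[i - window] for i in range(window, len(totals))]


if __name__ == "__main__":
    # Two weeks of made-up daily signups: 10 a day, then 12 a day.
    signups = [10] * 7 + [12] * 7
    print(trailing_totals(signups))
    print(absolute_growth(signups))
```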
Personalization is about UX and accounts for cultural variation. Don’t be afraid to use external data (blogs, for example) to complete your dataset.
Uses word2vec with abstract data! They use it to build playlist profiles where songs are “words” and playlists are “documents”.
Don’t be afraid to vectorize your data. It’s easy to manipulate, lightweight to transfer or store, and safe in terms of privacy. You can use Euclidean or Cosine distance to measure similarity, and Nearest Neighbors to find clusters. But we know that already, don’t we?
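For the record, here is a dependency-free sketch of those two distances and a brute-force nearest-neighbor lookup. The function names and the tiny two-dimensional “song” vectors are made up for illustration; real embeddings would have many more dimensions and you would want a proper library for the search.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norms


def nearest(query, vectors, k=2, distance=cosine_distance):
    """Names of the k vectors closest to `query` (brute force)."""
    return sorted(vectors, key=lambda name: distance(query, vectors[name]))[:k]


if __name__ == "__main__":
    songs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(nearest([1.0, 0.0], songs, k=2))
    print(nearest([0.0, 0.9], songs, k=1, distance=euclidean))
```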
Data-driven strategy is actually about $$$ (in fact, 5% more than without). Analytics have a 13:1 ROI (Nucleus 2014).
Data Scientists require impact in their role and guidance (which often means training in the business domain).
Nice wrap-up, we’re all exhausted.