A new suite of open source technologies is driving innovative approaches to processing big data. Here at HomeAway we track bleeding-edge technologies and gauge their value in meeting our business and technical needs. In the space of A/B testing, a custom open source technology stack is now solving problems that would have been difficult or nearly impossible to tackle otherwise.

A/B testing is a powerful tool that leverages data analytics to maximize conversion across all our products. From a technical point of view, it is a daunting task: vast amounts of data must be refined, handled appropriately, and recorded in a form flexible enough to support conclusions drawn from many different views. The desired traits of an A/B testing system are as follows:

  • Velocity in setting up and retrieving test results
  • Accuracy and consistency in statistical analysis
  • Reusability of analytical data for multiple tests
  • Customizability of test readouts on multiple variables

Time is of the essence in an agile environment; without an automated A/B test readout system, the workflow is crippled by a bottleneck of waiting for data to be analyzed. Velocity increases tremendously when feedback is instantaneous. Furthermore, automating statistical calculations in a standardized format reduces exposure to human error, leading to greater trust in each outcome. Reusability is also a large concern, since much of the analytical data is shared among multiple tests. Lastly, each A/B test is unique, with an array of parameters that must be set correctly to get an accurate picture of the data.

An innovation that meets these critical needs is the use of preprocessing and real-time indexing to provide faster feedback than post facto analysis of Apache Hadoop data can offer. Two open source projects that specialize in handling real-time streams of data are central to this solution. Apache Kafka is a distributed streaming platform that functions as a message broker. Apache Samza, a distributed stream processing framework, handles the other half of the job by processing the streams of data that Kafka provides. The final datastore in this architecture is Splunk, which specializes in searching, monitoring, and analyzing big data. We plan to eventually move this last leg of the data flow to an open source datastore; for now, Splunk serves our needs.
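To make the first hop concrete, here is a minimal sketch of how a service might publish a UI event to Kafka using the standard Java producer client. The broker address, topic name, and payload fields are hypothetical, not taken from our production configuration.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal publish step: forward a browser UI event, serialized as JSON,
// to a Kafka topic. All names and fields here are illustrative.
public class UiEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // A lightweight JSON payload describing the UI event.
        String payload = "{\"testId\":\"search-sort-v2\",\"variant\":\"B\","
                + "\"event\":\"click\",\"timestamp\":1467072000000}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key by test id so all events for one test land on the same
            // partition, preserving per-test ordering for the stream processor.
            producer.send(new ProducerRecord<>("ui-events", "search-sort-v2", payload));
        }
    }
}
```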

Architecture design

The web browser produces a lightweight JSON payload describing each UI event. This payload, which includes all event-specific data, is sent to a logging service that in turn publishes it to a Kafka topic. A Samza job subscribed to the pertinent Kafka topic consumes and processes the events; its output is log files containing grouped and processed event information, which Splunk is set to index. At the end of the flow, all UI events have been aggregated and summarized, and can be queried from Splunk. The goal of live feedback has been achieved, and the data is accessible with a simple REST API call from a web application that manages the statistical calculations and any other needs of a test readout.
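The processing step could be expressed as a Samza task along these lines. This is a sketch, not our production job: it assumes a JSON serde is configured on the input stream so messages arrive as maps, the topic and field names are hypothetical, a real job would window its aggregates rather than emit running counts, and it emits summaries to an output stream whereas our job writes grouped results to log files for Splunk to index.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Counts UI events per test variant as they stream in from Kafka
// and emits the running totals to an output topic for indexing.
public class UiEventAggregatorTask implements StreamTask {
    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "ab-test-aggregates");

    private final Map<String, Integer> countsByVariant = new HashMap<>();

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Assumes a JSON serde deserialized the payload into a map.
        @SuppressWarnings("unchecked")
        Map<String, Object> event = (Map<String, Object>) envelope.getMessage();

        String key = event.get("testId") + ":" + event.get("variant");
        int count = countsByVariant.merge(key, 1, Integer::sum);

        // Emit the updated summary for this test/variant pair.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, key + "=" + count));
    }
}
```

Because the heavy lifting happens in this step, the readout application's Splunk queries run over small, pre-aggregated records rather than the raw event firehose.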

This entire innovation of real-time A/B test results was designed and implemented over a matter of weeks during a summer internship. Such agile efforts exemplify the direction here at HomeAway: quickly adopting the latest technologies in an ever-evolving landscape.