Data Platform at Treebo
If we have data, let’s look at data. If all we have are opinions, let’s go with mine — Jim Barksdale, former CEO of Netscape
A few months back I joined Treebo with a clear mandate: setting up a data platform which, in the longer run, fosters a data-driven culture in the organisation. We, as a company, have been quite data-centric, but a lot of this data has existed in different silos. Hence, there was a need to build a central data repository that becomes the single source of truth for all metrics across the board, with self-serve enablement as the primary requisite.
The asks were for:
- A platform for realtime capture of user interactions on the website/app which enables smarter analytics, engagement insights & new product design; thereby opening up avenues for better customer experiences and, in turn, monetisation for the company.
- A data lake which enables data scientists to run experiments, try out different models & build algorithms. It should also cater to the SQL-empowered folks for advanced slice & dice.
- A warehouse which consolidates all the business verticals & exposes a unified view on the health of the company.
- A visualisation framework which enables canned reporting, sharing, alerting & dashboarding in a self-serve manner.
In this post, I'll highlight the considerations (and musings) that went into the design and where we currently stand with it.
For starters: most of Treebo's infra is on AWS.
Contemplation & PoCs!
Below are the options we had at each stage of the data lifecycle: capture/ingestion, streaming, transformation, lake, analytical storage, orchestration & end-user querying/reporting/dashboarding.
- There are quite a few open-source plugins available for Postgres which stream the WAL (write-ahead log) in real time, Debezium, Pg_kafka & bottledwater being some of the most popular implementations. For this, the DBs would have to be moved from RDS to EC2 and managed by us, which is not an option at this point in time.
- As most of our applications are backed by RDS Postgres, the fetching options are limited. Extracts can happen only by directly querying the respective application tables.
→ These extracts could be custom handwritten copy scripts that watermark the last-fetch timestamp. They would have to be supported & managed via metadata tables.
→ The extracts could be MapReduce jobs (maybe Sqoop), wherein incremental data is pushed to HDFS/EMRFS. MapReduce is inherently faster on bigger data sets; on the other hand, incremental fetches of our data would be slower, given the smaller delta.
→ Apache Spark etc. are other options, but not really needed when the sources are mostly structured.
- Another option was to make all the applications push events to a post-back URL for real-time ingestion; these events would then be written to a streaming layer. This option, however, would require significant engineering effort from the application teams.
- The transformations can be performed on EMRFS, and the cleaned/munged data would then be stored in a different S3 bucket from the input ones.
- For the above, Hive/Pig/Spark/custom scripting could be used.
- The Hadoop cluster can be AWS-managed (EMR) or in-house managed (too painful to even contemplate, at this point).
- The data lake could be built on EMR HDFS. It'd require bigger persistent EBS volumes attached to each Hadoop node.
- The other alternative is to use S3 natively. Amazon provides tight integration with S3, in the form of EMRFS, providing great elasticity.
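The watermark-based extract scripts mentioned earlier can be sketched as follows. This is a minimal illustration using SQLite as a stand-in for the RDS Postgres sources; the metadata table `etl_watermarks` and the `updated_at` column are hypothetical names, not our actual schema:

```python
import sqlite3

def incremental_extract(conn, table):
    """Fetch rows updated since the stored watermark, then advance it."""
    cur = conn.cursor()
    # Look up the last-fetch timestamp for this table in the metadata table.
    row = cur.execute(
        "SELECT last_fetched_at FROM etl_watermarks WHERE table_name = ?",
        (table,),
    ).fetchone()
    last = row[0] if row else "1970-01-01T00:00:00"

    # Pull only the delta: rows touched after the watermark.
    rows = cur.execute(
        f"SELECT id, payload, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last,),
    ).fetchall()

    if rows:
        # Advance the watermark to the newest row we just fetched.
        cur.execute(
            "INSERT OR REPLACE INTO etl_watermarks (table_name, last_fetched_at) "
            "VALUES (?, ?)",
            (table, rows[-1][2]),
        )
        conn.commit()
    return rows
```

In production the same idea runs as Pentaho/shell jobs against Postgres, with the watermark table living alongside the other ETL metadata.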
Pipeline to DW:
- As our infra is on AWS, Redshift becomes the de facto choice for powering org-wide analytics. I didn't want to digress from our cloud provider for Azure/BigQuery. Also, with Redshift Spectrum rolling out, data lake interactions would become very easy. In my previous organisation, we had set up the DW on Redshift and it really worked out well for us.
- Dimensional modeling had to be done encapsulating all of Treebo. For each business vertical, KPIs had to be identified & the bus matrix of the warehouse had to be built and signed off by stakeholders prior to actual implementation.
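As a toy example of the kind of model the bus matrix drives, here's a miniature star schema with a conformed hotel dimension and a revenue-per-city KPI query. All table names, columns and figures here are hypothetical, not Treebo's actual model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One fact table per business process, joined to conformed dimensions.
CREATE TABLE dim_hotel (hotel_key INTEGER PRIMARY KEY, hotel_name TEXT, city TEXT);
CREATE TABLE dim_date  (date_key  INTEGER PRIMARY KEY, full_date TEXT, month TEXT);
CREATE TABLE fact_booking (
    hotel_key   INTEGER REFERENCES dim_hotel(hotel_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    room_nights INTEGER,
    revenue     REAL
);
INSERT INTO dim_hotel VALUES (1, 'Treebo A', 'Bangalore'), (2, 'Treebo B', 'Pune');
INSERT INTO dim_date  VALUES (20170501, '2017-05-01', '2017-05');
INSERT INTO fact_booking VALUES (1, 20170501, 2, 3000.0), (2, 20170501, 1, 1200.0);
""")

# A KPI the bus matrix standardises across verticals: revenue per city.
kpi = conn.execute("""
    SELECT h.city, SUM(f.revenue)
    FROM fact_booking f JOIN dim_hotel h USING (hotel_key)
    GROUP BY h.city ORDER BY h.city
""").fetchall()
print(kpi)  # [('Bangalore', 3000.0), ('Pune', 1200.0)]
```

The same structure, scaled out across verticals and with the date/hotel dimensions shared, is what the sign-off on the bus matrix locks down before implementation.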
There were a few awesome frameworks for workflow orchestration & scheduling management (like Oozie in the Hadoop ecosystem):
- Airflow, open-sourced by Airbnb, which models workflows as DAGs of tasks.
- Azkaban, open-sourced by LinkedIn.
- Pinball, open sourced by Pinterest.
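To illustrate the DAG-of-tasks model these schedulers share, here's a toy in-process runner using Python's stdlib `graphlib`. The task names are hypothetical, and a real Airflow DAG would declare operators, schedules and retries instead of just printing:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
pipeline = {
    "extract_bookings":    set(),
    "extract_clickstream": set(),
    "load_staging":        {"extract_bookings", "extract_clickstream"},
    "build_facts":         {"load_staging"},
    "refresh_dashboards":  {"build_facts"},
}

def run(dag):
    """Execute tasks in dependency order (upstream before downstream)."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        print(f"running {task}")  # a scheduler would invoke the real job here
    return order

order = run(pipeline)
```

The value of Airflow & co. over this toy is everything around the topological sort: scheduling, retries, backfills, alerting and a UI over task state.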
- Open-source solutions such as Redash/Superset/Metabase, or AWS QuickSight, could be used.
Tech Stack Assortment!
After weeks of rigorous PoCs with the different choices, we zeroed in on the following technologies & frameworks:
- For structured data fetches, we chose Pentaho; for APIs, Python; and shell scripting to tie them all together.
- For clickstream, we already had Segment set up to capture user interactions. We wanted an API at scale without worrying about hosting, load balancing & other overheads. AWS API Gateway provided a jumpstart there, and hence was chosen to be exposed as a webhook for incoming events.
- AWS Kinesis Streams to stream the clickstream data. We didn't go the Kafka way because, at our scale, a provisioned service made more sense in terms of total cost of ownership (TCO).
- AWS Firehose to persist events to S3 (our data lake) in Hive-compatible date partitions.
- AWS Lambda to enrich, parse, transform & apply business logic to the incoming stream. For our use case, running this code on EC2 or EMR wasn't the best fit.
- MySQL RDS for checkpointing, audits & failure recovery.
- Redshift as the warehouse.
- AWS Athena for working with the Data Lake.
- Metabase as the self-serve platform for reporting & dashboarding, with MySQL-based metadata management.
- Airflow for the workflow orchestration.
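As a rough sketch of the Lambda enrichment step, here's what a Kinesis-triggered handler could look like. The payload fields (`event_type`, `user_agent`) and the enrichment rules are illustrative assumptions, not our production logic; Kinesis delivers each record's data base64-encoded, which is what the decode step handles:

```python
import base64
import json
from datetime import datetime, timezone

def handler(event, context=None):
    """Decode Kinesis records, enrich each payload, return the batch."""
    enriched = []
    for record in event["Records"]:
        # Kinesis record data arrives base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Hypothetical enrichments: processing timestamp & a device flag.
        payload["processed_at"] = datetime.now(timezone.utc).isoformat()
        payload["is_mobile"] = "Mobile" in payload.get("user_agent", "")
        enriched.append(payload)
    return enriched
```

In the real pipeline, the enriched output would be written onward (e.g. to Firehose/S3) rather than returned, with failures checkpointed in the MySQL audit tables.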
Unveiling the Data Platform!
We have implemented the architecture differently for micro-batching & streaming workloads, built out on AWS given the scalability it offers with lower DevOps effort.
For realtime streaming, the AWS components are pay-as-you-go, enabling us to scale out intuitively. Interactive funnels, user paths & engagement metrics are designed on top of Redshift, with data stored in a 'smart partitioned' fashion: one table per event per day.
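To make the two storage conventions concrete (Hive-style date partitions in the S3 lake, and one Redshift table per event per day), here's an illustrative naming helper. The exact prefixes and table names are assumptions for the sketch, not our production scheme:

```python
from datetime import date

def s3_partition_prefix(event_date: date) -> str:
    """Hive-style partition path, so Athena/Spectrum can prune by date."""
    return f"year={event_date:%Y}/month={event_date:%m}/day={event_date:%d}/"

def redshift_table(event_type: str, event_date: date) -> str:
    """'Smart partitioning': one table per event per day."""
    return f"evt_{event_type}_{event_date:%Y%m%d}"

print(s3_partition_prefix(date(2017, 5, 1)))          # year=2017/month=05/day=01/
print(redshift_table("page_view", date(2017, 5, 1)))  # evt_page_view_20170501
```

Keeping each event-day in its own table keeps funnel and path queries scanning only the slices they need, at the cost of some query-generation plumbing.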
We’re continuously exploring ways to tune/improve the individual components’ performance to best suit our use cases. As can be seen, the platform components are loosely coupled so as to leave sufficient scope for technology evolution. Maybe we’ll replace some of them with open source in the times to come. But as of now, this is what powers the analytics at Treebo!
In the next post, on data platform use cases, we’ve covered our thoughts around the BI platform at Treebo.
Image courtesy: Dilbert, by Scott Adams