Sprout Social is, at its core, a data-driven company. Sprout processes billions of messages from multiple social networks every day. Because of this, Sprout engineers face a unique challenge: how to store and update multiple versions of the same message (retweets, comments and so on) that come into our platform at very high volume.

Since we store multiple versions of messages, Sprout engineers are tasked with “recreating the world” several times a day—an essential process that requires iterating through the entire data set to consolidate every part of a social message into one “source of truth.”

Take, for example, keeping track of a single Twitter post’s likes, comments and retweets. Historically, we have relied on self-managed Hadoop clusters to maintain and work through such large amounts of data. Each Hadoop cluster was responsible for a different part of the Sprout platform, a practice the Sprout engineering team has relied on to manage big data projects at scale.
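
To make that consolidation step concrete, here is a minimal sketch, not our production code, of folding several partial versions of a message into one record. The field names and the plain-Java representation are hypothetical and only illustrate the idea of a "last write wins" merge.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: fold partial versions of one social message
// (each carrying only the fields that changed) into a single consolidated record.
public class MessageConsolidator {

    public static Map<String, Object> consolidate(List<Map<String, Object>> versions) {
        Map<String, Object> sourceOfTruth = new HashMap<>();
        for (Map<String, Object> version : versions) {
            sourceOfTruth.putAll(version); // later versions overwrite earlier fields
        }
        return sourceOfTruth;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> versions = List.of(
            Map.<String, Object>of("message_id", "123", "text", "Hello world", "like_count", 10),
            Map.<String, Object>of("message_id", "123", "like_count", 25, "retweet_count", 4),
            Map.<String, Object>of("message_id", "123", "comment_count", 2)
        );
        // Prints one record combining the latest value of every field.
        System.out.println(consolidate(versions));
    }
}
```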

Keys to Sprout’s big data approach

Our Hadoop ecosystem depended on Apache HBase, a scalable and distributed NoSQL database. What makes HBase crucial to our approach to processing big data is its ability not only to run quick range scans over entire datasets, but also to perform fast, random, single-record lookups.
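
As an illustration of those two access patterns, here is a minimal sketch using the HBase 2.x Java client. The table name, row-key scheme and column layout are hypothetical, not our actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageReads {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) {

            // Fast, random, single-record lookup by row key.
            Result single = table.get(new Get(Bytes.toBytes("twitter#1234567890")));
            if (!single.isEmpty()) {
                System.out.println("Found row: " + Bytes.toString(single.getRow()));
            }

            // Range scan over a contiguous slice of the keyspace, e.g. every
            // message whose row key starts with "twitter#".
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("twitter#"))
                .withStopRow(Bytes.toBytes("twitter$")); // '$' sorts just after '#'
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```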

HBase also allows us to bulk load data and perform random updates, so we can more easily handle messages arriving out of order or with partial updates, along with the other challenges that come with social media data. However, self-managed Hadoop clusters burden our Infrastructure engineers with high operational costs, including manually managing disaster recovery, cluster expansion and node management.
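
For out-of-order and partial updates specifically, HBase cell timestamps can do much of the work. The sketch below (again with a hypothetical table and column family) writes only the fields a given update carries and stamps them with the update's own timestamp, so a late-arriving older version cannot overwrite a newer one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageUpdates {
    private static final byte[] FAMILY = Bytes.toBytes("m");

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("messages"))) {

            // Timestamp reported by the social network for this version of the message.
            long updateTimestamp = 1672531200000L;

            // A partial update: only the columns present in this version are written;
            // everything else already stored for the row is left untouched.
            Put put = new Put(Bytes.toBytes("twitter#1234567890"));
            put.addColumn(FAMILY, Bytes.toBytes("retweet_count"), updateTimestamp, Bytes.toBytes(42L));
            put.addColumn(FAMILY, Bytes.toBytes("like_count"), updateTimestamp, Bytes.toBytes(1200L));
            table.put(put);

            // Because each cell carries the update's timestamp, replaying this write,
            // or applying it after a newer version has landed, leaves the newest values in place.
        }
    }
}
```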

To reduce the time spent managing these systems, which hold hundreds of terabytes of data, Sprout’s Infrastructure and Development teams came together to find a better solution than running self-managed Hadoop clusters. Our goals were to:

  • Allow Sprout engineers to better build, manage, and operate large data sets
  • Minimize the time investment from engineers to manually own and maintain the system
  • Cut unnecessary costs of over-provisioning due to cluster expansion
  • Provide better disaster recovery methods and reliability

As we evaluated alternatives to our existing big data system, we strove to find a solution that integrated easily with our current processing patterns and would relieve the operational toil that comes with manually managing a cluster.

Evaluating new data pattern alternatives

One solution our teams considered was data warehouses. Data warehouses act as a centralized store for data analysis and aggregation, but they more closely resemble traditional relational databases than HBase does: their data is structured and filtered, and they enforce a strict data model (e.g. a single row for a single object).

For our use case of storing and processing social messages with many versions of a message living side by side, the data warehouse model was inefficient. We were unable to adapt our existing model to data warehouses effectively, and performance was much slower than we anticipated. Reformatting our data to fit the data warehouse model would have required more rework than our timeline allowed.

Another solution we looked into was data lakehouses. Data lakehouses expand data warehouse concepts to allow for less structured data, cheaper storage and an extra layer of security around sensitive data. While data lakehouses offered more than data warehouses could, they were not as efficient as our existing HBase solution. When testing our merge-record pattern and our insert and deletion processing patterns, we were unable to achieve acceptable write latencies for our batch jobs.

Reducing overhead and upkeep with AWS EMR

Given what we learned about data warehouse and lakehouse solutions, we began to look into alternatives for running managed HBase. While we decided that our current use of HBase was effective for what we do at Sprout, we asked ourselves: “How can we run HBase better to lower our operational burden while still maintaining our major usage patterns?”

This is when we began to evaluate Amazon Elastic MapReduce (EMR), Amazon’s managed service that can run HBase. Evaluating EMR meant assessing its performance the same way we tested data warehouses and lakehouses, such as testing data ingestion to see whether it could meet our performance requirements. We also had to test data storage, high availability and disaster recovery to ensure that EMR suited our needs from an infrastructure and administrative perspective.

EMR’s features improved on our self-managed solution and enabled us to reuse our existing patterns for reading, writing and running jobs the same way we did with self-managed HBase. One of EMR’s biggest benefits is the EMR File System (EMRFS), which stores data in S3 rather than on the nodes themselves.
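
On EMR, pointing HBase at S3 is a matter of cluster configuration. The snippet below is a minimal sketch of the configuration classifications AWS documents for running HBase on S3; the bucket and path are placeholders.

```json
[
  {
    "Classification": "hbase-site",
    "Properties": {
      "hbase.rootdir": "s3://example-bucket/hbase-root"
    }
  },
  {
    "Classification": "hbase",
    "Properties": {
      "hbase.emr.storageMode": "s3"
    }
  }
]
```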

One challenge we found was that EMR had limited high-availability options, which restricted us to running multiple main nodes in a single availability zone, or one main node in multiple availability zones. We mitigated this risk by leveraging EMRFS, which provides additional fault tolerance for disaster recovery and decouples data storage from compute. By using EMR as our solution for HBase, we are able to improve our scalability and failure recovery, and minimize the manual intervention needed to maintain the clusters. Ultimately, we decided that EMR was the best fit for our needs.

We were able to test the migration process beforehand and execute it to move billions of records to the new EMR clusters without any customer downtime. The new clusters showed improved performance and reduced costs by nearly 40%. To read more about how moving to EMR helped reduce infrastructure costs and improve our performance, check out Sprout Social’s case study with AWS.

What we learned

The size and scope of this project gave us, the Infrastructure Database Reliability Engineering team, the opportunity to work cross-functionally with multiple engineering teams. While it was challenging, it proved to be an incredible example of the large scale projects we can tackle at Sprout as a collaborative engineering organization. Through this project, our Infrastructure team gained a deeper understanding of how Sprout’s data is used, stored and processed, and we are more equipped to help troubleshoot future issues. We have created a common knowledge base across multiple teams that can help empower us to build the next generation of customer features.

If you’re interested in what we’re building, join our team and apply for one of our open engineering roles today.