Organizations need a secure data pipeline to extract real-time analytics from workloads and deliver trusted data. But data pipelines are becoming increasingly complex to manage.
That’s why companies such as Booking.com, Capital One, Fidelity and CNN are re-engineering them using Apache Iceberg, an open table format, and Snowflake’s Iceberg Tables. The offering supports architectures such as data lakehouses, data lakes and data meshes, and it lets IT leaders simplify pipeline development so they can work with open data on their own terms and scale flexibly to fit business use cases.
“With Iceberg, we can broaden our use cases for Snowflake as our open data lakehouse for machine learning, AI, business intelligence and geospatial analysis — even for data stored externally,” said Thomas Davey, chief data officer for Booking.com.
Iceberg Tables, announced June 4 at the Snowflake Summit in San Francisco, comes on the heels of the recently announced Polaris Catalog, a vendor-neutral and fully open catalog implementation for Apache Iceberg. Polaris Catalog enables cross-engine interoperability, giving organizations more choice, flexibility and control over their data.
Organizations can get started running Polaris Catalog hosted in Snowflake’s AI Data Cloud or using containers within their own infrastructure.
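To make that cross-engine interoperability concrete, here is a minimal PySpark sketch, not an official Snowflake quick start, that points a Spark session at an Iceberg REST catalog endpoint such as Polaris Catalog and queries an Iceberg table. The catalog URI, credentials, warehouse and table names are placeholders, not documented values.

```python
# Minimal sketch: query an Iceberg table through an Iceberg REST catalog
# (Polaris Catalog implements this REST specification). All endpoint,
# credential and table values below are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-rest-demo")
    # Pull in the Iceberg runtime for Spark 3.5 / Scala 2.12.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    # Register a catalog named "polaris" backed by the Iceberg REST protocol.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "https://<your-polaris-endpoint>/api/catalog")  # placeholder
    .config("spark.sql.catalog.polaris.credential", "<client-id>:<client-secret>")           # placeholder
    .config("spark.sql.catalog.polaris.warehouse", "<catalog-name>")                         # placeholder
    .getOrCreate()
)

# Any engine that speaks the Iceberg REST spec can read the same table,
# which is what cross-engine interoperability means in practice.
spark.sql("SELECT * FROM polaris.analytics.page_views LIMIT 10").show()
```

Because the table sits in an open format behind an open catalog API, the same data can be queried from Snowflake, Spark or any other Iceberg-compatible engine without copying it.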
Here’s what IT leaders need to know about re-engineering their data pipelines, along with some expert advice.
CHECK OUT: How IT leaders are leveraging data analytics in their businesses.
Why Companies Are Replacing Existing Batch Pipelines
Fidelity has reimagined its data pipelines using Snowflake Marketplace, saving the company time and resources in data engineering. Its supported business units, including fixed income and data science, can now analyze data faster, and the firm is spending “more time on research and less on pipeline management,” said Balaram Keshri, vice president of architecture at Fidelity.
“We are also seeing the scalability and collaboration benefits with these partners,” he said.
With Snowflake managing its data, Fidelity has also significantly improved performance, so it can load, query and analyze data faster. In fact, the Snowflake Performance Index, which measures Snowflake’s impact, reports that it has “reduced organizations’ query duration by 27% since it started tracking this metric, and by 12% over the past 12 months,” according to a press release.
READ MORE: Extract real value from your enterprise data.
Capital One, reportedly the first U.S. bank to migrate its entire on-premises data center to the cloud, has also found success with its new data pipelines, thanks to Snowflake’s data sharing capabilities. The feature enables multiple analysts to access related data without affecting one another’s performance. Users can also categorize the data according to workload type.
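One common way to keep concurrent users from affecting one another in Snowflake is to route each workload type to its own virtual warehouse. The sketch below illustrates that pattern in Python with the snowflake-connector-python package; it is not Capital One’s implementation, and the account details, warehouse names and table names are hypothetical.

```python
# Minimal sketch: isolate workload types on separate Snowflake virtual
# warehouses so heavy ad hoc analysis doesn't compete with reporting jobs.
# Account, credentials, warehouse and object names are placeholders.
import snowflake.connector

WAREHOUSE_BY_WORKLOAD = {
    "ad_hoc_analysis": "ANALYST_WH",      # hypothetical warehouse names
    "scheduled_reports": "REPORTING_WH",
    "data_science": "DS_WH",
}

def run_query(workload_type: str, sql: str):
    """Run a query on the warehouse assigned to this workload type."""
    conn = snowflake.connector.connect(
        account="<account_identifier>",   # placeholder
        user="<user>",
        password="<password>",
        warehouse=WAREHOUSE_BY_WORKLOAD[workload_type],
        database="SHARED_DB",             # e.g., a database created from a data share
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

# Two teams querying the same shared data, each on its own compute.
run_query("ad_hoc_analysis", "SELECT COUNT(*) FROM transactions")
run_query("scheduled_reports", "SELECT COUNT(*) FROM transactions")
```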
“Snowflake is so flexible and efficient that you can quickly go from ‘data starved’ to ‘data drunk.’ To avoid that data avalanche and associated costs, we worked to put some controls in place,” Salim Syed, head of engineering for Capital One Software, wrote in a blog post.
CNN’s dramatic pipeline transformation also gave it accelerated access to analytics. Over the past year, the multinational news channel and website, owned by Warner Bros. Discovery, has shifted to using real-time data pipelines for workloads that support critical parts of its content delivery strategy. The goal is to shrink the horizon of actionable data “from hours to seconds” by replacing existing batch pipelines, noted JT Torrance, data engineer with Warner Bros. Discovery.
“We will move around 100 terabytes of data a day across about 600,000 queries from our various partners,” said Zach Lancaster, engineering manager at Warner Bros. Discovery. Now, with its scalable, newly managed pipeline, CNN can mine the data for core use cases and prioritize the workloads that drive the most business value.
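The batch-to-real-time shift CNN describes follows a familiar pattern: rather than waiting for a nightly export, a pipeline continuously ships only the records that have arrived since a high-water mark. The sketch below is a simplified, hypothetical illustration of that pattern, not CNN’s actual pipeline; fetch_events_since and load_to_warehouse are stand-in stubs.

```python
# Simplified sketch of incremental (near-real-time) ingestion replacing a
# nightly batch job. The source and sink functions are hypothetical stubs.
import time
from datetime import datetime, timezone

def fetch_events_since(watermark: datetime) -> list[dict]:
    """Stub: return source events with event_time newer than the watermark."""
    return []  # replace with a real source query or stream consumer

def load_to_warehouse(events: list[dict]) -> None:
    """Stub: micro-batch load into the analytics store."""
    pass

def run_incremental_pipeline(poll_seconds: int = 5) -> None:
    watermark = datetime.now(timezone.utc)
    while True:
        events = fetch_events_since(watermark)
        if events:
            load_to_warehouse(events)
            # Advance the high-water mark to the newest event just loaded.
            watermark = max(e["event_time"] for e in events)
        # Latency is now measured in seconds of polling interval,
        # not the hours between batch runs.
        time.sleep(poll_seconds)
```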
DIG DEEPER: What are some data governance strategies for AI success?
3 Steps to Transform Your Data Pipeline
As user-friendly as the Snowflake platform is, IT leaders still need a clear strategy in mind as they improve their data pipelines. For starters, “think about how you can bring your stakeholders on board. You want them to become the ultimate stewards of the process,” Lancaster said.
Second, revisit your use cases. “Platforms develop over the years, as does your business, so try to re-evaluate your use cases and dial back your system,” Torrance advised. This approach can help with cost optimization.
Third, “make sure you understand the ask of each request and how you expect to use it over time in your data pipeline,” Lancaster said.
If a company is going through the work of redesigning its data pipeline, the new pipeline needs to be cross-functional and serve the most central parts of the business. So, consider “machine-to-machine use cases,” said Lancaster, as these are important for interoperability within your entire tech stack.
Finally, remember that more intricate systems aren’t always better. “Think carefully. Just because I have a request, do I need to accomplish it? And does the added complexity add value to the business, or does it do a disservice to the stakeholder?” Lancaster said.