Explore every episode of the Data Engineering Podcast

Dive into the complete list of Data Engineering Podcast episodes. Each episode is cataloged with a detailed description, making it easy to search for and explore specific topics. Follow every episode of your favorite podcast and never miss relevant content.


Date - Title - Duration
03 Jul 2023 - How Data Engineering Teams Power Machine Learning With Feature Platforms (01:03:30)

Summary

Feature engineering is a crucial aspect of the machine learning workflow. To make it possible, a number of technical and procedural capabilities must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow by building and maintaining the systems that empower data scientists and ML engineers to create and manage their own features.
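
To make the self-serve workflow concrete, here is a minimal Python sketch of the kind of interface a feature platform exposes: register a feature definition once, then retrieve it consistently for offline training and online serving. The FeatureStore class and its methods are hypothetical stand-ins for illustration, not any specific product's API.

    from datetime import timedelta

    # Hypothetical, minimal feature-store-style interface (illustration only).
    class FeatureStore:
        def __init__(self):
            self.features = {}

        def register(self, name, query, ttl):
            # A feature is a named, versioned transformation over source data.
            self.features[name] = {"query": query, "ttl": ttl}

        def get_training_frame(self, names, entity_ids):
            # Offline path: point-in-time correct joins over historical data.
            ...

        def get_online_vector(self, names, entity_id):
            # Online path: low-latency lookup of the latest precomputed values.
            ...

    store = FeatureStore()
    store.register(
        name="user_7d_order_count",
        query="SELECT user_id, COUNT(*) FROM orders "
              "WHERE order_ts > now() - INTERVAL '7 days' GROUP BY user_id",
        ttl=timedelta(days=7),
    )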

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What is feature engineering, and why/to whom does it matter?
    • A topic that commonly comes up in relation to feature engineering is the importance of a feature store. What are the tradeoffs of making that a separate infrastructure/architecture component?
  • What is the overall lifecycle of a feature, from definition to deployment and maintenance?
    • How is this distinct from other forms of data pipeline development and delivery?
    • Who are the participants in that workflow?
  • What are the sharp edges/roadblocks that typically manifest in that lifecycle?
  • What are the interfaces that are needed for data scientists/ML engineers to be able to self-serve their feature management?
    • What is the role of the data engineer in supporting those interfaces?
    • What are the communication/collaboration channels that are necessary to make the overall process a success?
  • From an implementation/architecture perspective, what are the patterns that you have seen teams build around for feature development/serving?
  • What are the most interesting, innovative, or unexpected ways that you have seen feature platforms used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature engineering?
  • What are the resources that you find most helpful in understanding and designing feature platforms?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

19 Mar 2023 - Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed (00:51:38)

Summary

As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.
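
As a simplified illustration of one of the controls discussed below, the following Python sketch applies deterministic masking to a sensitive column before data reaches a downstream consumer; role-based access control, by contrast, would be enforced in the database or warehouse itself. All names here are invented for the example.

    import hashlib

    def mask_email(email: str) -> str:
        # Deterministic masking: the same input always yields the same token,
        # so joins and counts still work while the raw value stays hidden.
        digest = hashlib.sha256(email.lower().encode()).hexdigest()[:12]
        return f"user_{digest}@masked.invalid"

    def apply_masking(rows, sensitive_columns):
        # Rows as plain dicts; real systems push this into the query layer.
        for row in rows:
            for col in sensitive_columns:
                if row.get(col) is not None:
                    row[col] = mask_email(row[col])
        return rows

    print(apply_masking([{"user_id": 1, "email": "ada@example.com"}], ["email"]))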

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Join the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council today.
  • RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder
  • Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
  • Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?
  • How has the scope and complexity of implementing security controls on data systems changed in recent years?
    • In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?
  • What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?
    • How much of the problem is technical vs. procedural/organizational?
  • As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)
    • What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
  • What are some of the ways that data security and organizational productivity are at odds with each other?
    • What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?
  • What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?
    • How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?
    • How can education about the motivations for different security practices improve compliance and user experience?
  • What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?
  • What are the areas of data security that still need improvements?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

26 Feb 2025 - The Future of Data Engineering: AI, LLMs, and Automation (00:59:39)
Summary
In this episode of the Data Engineering Podcast Gleb Mezhanskiy, CEO and co-founder of Datafold, talks about the intersection of AI and data engineering. He discusses the challenges and opportunities of integrating AI into data engineering, particularly using large language models (LLMs) to enhance productivity and reduce manual toil. The conversation covers the potential of AI to transform data engineering tasks, such as text-to-SQL interfaces and creating semantic graphs to improve data accessibility, and explores practical applications of LLMs in automating code reviews, testing, and understanding data lineage.
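
As a rough sketch of the text-to-SQL pattern mentioned above, the snippet below shows the general shape: prompt a model with schema context, then validate the generated statement before executing it. call_llm is a hypothetical stand-in for whatever model client you use, and nothing here reflects Datafold's actual implementation.

    SCHEMA_CONTEXT = """
    orders(order_id INT, user_id INT, total NUMERIC, created_at TIMESTAMP)
    users(user_id INT, country TEXT)
    """

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real model client.
        raise NotImplementedError

    def text_to_sql(question: str) -> str:
        prompt = (
            "Given this schema:\n" + SCHEMA_CONTEXT
            + "\nWrite one read-only SQL query that answers: " + question
        )
        sql = call_llm(prompt).strip()
        # Guardrail: never execute a generated statement that mutates data.
        if not sql.lower().startswith("select"):
            raise ValueError("generated statement is not a read-only query")
        return sql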


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about the future of data engineering and the role of AI, LLMs, and automation in it
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Is the modern data stack dead?
  • Where does AI fit into the data stack?
  • The "buy our tool to ship AI" sales pitch, and how to evaluate it
  • Opportunities for LLMs in the data engineering workflow
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
29 Dec 2022 - Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI (00:59:21)

Summary

Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increase, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what AlignAI is and the story behind it?
  • What are the core problems that you are focused on addressing?
    • What are the tactical ways that you are working to solve those problems?
  • What are some of the common and avoidable ways that analytics/AI projects go wrong?
    • What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?
  • What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?
  • Can you describe the design and implementation of the AlignAI platform?
    • How have the goals and implementation of the product changed since you first started working on it?
  • What is the workflow at the individual and organizational level for businesses that are using AlignAI?
  • One of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work?
  • What are the most interesting, innovative, or unexpected ways that you have seen AlignAI used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI?
  • When is AlignAI the wrong choice?
  • What do you have planned for the future of AlignAI?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

04 Dec 2023 - Designing Data Transfer Systems That Scale (01:03:57)

Summary

The first step of any data pipeline is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
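
For readers unfamiliar with change-data-capture, here is a schematic Python sketch of the core loop: consume an ordered stream of row-level change events and apply them idempotently to a destination, checkpointing progress for crash recovery. The event shape and interfaces are illustrative assumptions, not DoubleCloud's implementation.

    # Each CDC event describes one row-level change, delivered in commit order.
    # Example shape (illustrative): {"op": "update", "table": "users",
    #   "key": {"user_id": 42}, "after": {"user_id": 42, "country": "DE"},
    #   "lsn": 1093}

    def apply_event(dest, evt):
        # Idempotent upsert/delete keyed on the primary key, so replays are safe.
        if evt["op"] in ("insert", "update"):
            dest.upsert(evt["table"], evt["key"], evt["after"])
        elif evt["op"] == "delete":
            dest.delete(evt["table"], evt["key"])

    def run(source, dest, checkpoint):
        # Resume from the last durable position after any crash or restart.
        for evt in source.read_from(checkpoint.last_lsn()):
            apply_event(dest, evt)
            checkpoint.save(evt["lsn"])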

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high bandwidth and low-latency change-data capture

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?
    • What were the shortcomings of other options in the ecosystem that led you to building a new system?
  • What was the design of your initial solution to the problem?
    • What are the sharp edges that you had to deal with to operate and use that initial implementation?
  • What were the limitations of the system as you started to scale it?
  • Can you describe the current architecture of your data transfer platform?
    • What are the capabilities and constraints that you are optimizing for?
  • As you move beyond the initial use case that started you down this path, what are the complexities involved in generalizing to add new functionality or integrate with additional platforms?
  • What are the most interesting, innovative, or unexpected ways that you have seen your data transfer service used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data transfer system?
  • When is DoubleCloud Data Transfer the wrong choice?
  • What do you have planned for the future of DoubleCloud Data Transfer?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Speaker - Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and for the last 4 years has been working on distributed systems with a focus on data delivery systems.

Sponsored By:

Support Data Engineering Podcast

13 Jul 2024 - The Role of Product Managers in Data-Centric Organizations (00:52:58)
Summary
In this episode Praveen Gujar, Director of Product at LinkedIn, talks about the intricacies of product management for data and analytical platforms. Praveen shares his journey from Amazon to Twitter and now LinkedIn, highlighting his extensive experience in building data products and platforms, digital advertising, AI, and cloud services. He discusses the evolving role of product managers in data-centric environments, emphasizing the importance of clean, reliable, and compliant data. Praveen also delves into the challenges of building scalable data platforms, the need for organizational and cultural alignment, and the critical role of product managers in bridging the gap between engineering and business teams. He provides insights into the complexities of platformization, the significance of long-term planning, and the necessity of having a strong relationship with engineering teams. The episode concludes with Praveen offering advice for aspiring product managers and discussing the future of data management in the context of AI and regulatory compliance.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Praveen Gujar about product management for data and analytical platforms
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Product management is typically thought of as being oriented toward customer facing functionality and features. What is involved in being a product manager for data systems?
  • Many data-oriented products that are customer facing require substantial technical capacity to serve those use cases. How does that influence the process of determining what features to provide/create?
  • Investment in technical capacity/platforms
  • Identifying groupings of features that can be served by a common platform investment
  • Managing organizational pressures between engineering, product, business, finance, etc.
  • What are the most interesting, innovative, or unexpected ways that you have seen "Data Products & Platforms @ Big-tech" used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on "Building Data Products & Platforms for Big-tech"?
  • When is "Data Products & Platforms @ Big-tech" the wrong choice?
  • What do you have planned for the future of "Data Products & Platforms @ Big-tech"?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
09 Oct 2023 - Using Data To Illuminate The Intentionally Opaque Insurance Industry (00:51:58)

Summary

The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what CoverageCat is and the story behind it?
  • What are the different sources of data that you work with?
    • What are the most challenging aspects of collecting that data?
    • Can you describe the formats and characteristics (3 Vs) of that data?
  • What are some of the ways that the operational model of insurance companies has contributed to the industry's opacity from a data perspective?
  • Can you describe how you have architected your data platform?
    • How have the design and goals changed since you first started working on it?
    • What are you optimizing for in your selection and implementation process?
  • What are the sharp edges/weak points that you worry about in your existing data flows?
    • How do you guard against those flaws in your day-to-day operations?
  • What are the most interesting, innovative, or unexpected ways that you have seen your data sets used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on insurance industry data?
  • When is a purely statistical view of insurance the wrong approach?
  • What do you have planned for the future of CoverageCat's data stack?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

09 Jul 2023 - Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling (01:12:55)

Summary

For business analytics, the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago, when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
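
To give a rough flavor of the entity-centric idea: instead of many narrow fact and dimension tables, you maintain one wide, periodically rebuilt table per core entity that folds in metrics and attributes from multiple domains. The SQL below (carried in a Python string for consistency with the other sketches in this list) is an invented example, not Max's own formulation.

    # Illustrative only: one wide row per entity per day, with metrics from
    # several domains folded in as columns instead of separate fact tables.
    ENTITY_CENTRIC_USER_MODEL = """
    CREATE TABLE user_daily AS
    SELECT
        u.user_id,
        d.ds,
        COUNT(o.order_id)    AS orders_1d,
        SUM(o.total)         AS revenue_1d,
        MAX(s.last_login_at) AS last_login_at
    FROM users u
    CROSS JOIN dates d
    LEFT JOIN orders o   ON o.user_id = u.user_id AND o.ds = d.ds
    LEFT JOIN sessions s ON s.user_id = u.user_id AND s.ds = d.ds
    GROUP BY u.user_id, d.ds
    """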

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what entity-centric modeling (ECM) is and the story behind it?

    • How does it compare to dimensional modeling strategies?
    • What are some of the other competing methods?
    • How does it compare to the activity schema approach?
  • What impact does this have on ML teams? (e.g. feature engineering)

  • What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. informatica vs. ETL scripts, etc.)

    • What is the impact on the underlying compute engine on the modeling strategies used?
  • What are some examples of data sources or problem domains for which this approach is well suited?

    • What are some cases where entity centric modeling techniques might be counterproductive?
  • What are the ways that the benefits of ECM manifest in use cases that are down-stream from the warehouse?

  • What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?

    • How does this work across business domains within a given organization (especially at "enterprise" scale)?
  • What are the most interesting, innovative, or unexpected ways that you have seen ECM used?

  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?

  • When is ECM the wrong choice?

  • What are your predictions for the future direction/adoption of ECM or other modeling techniques?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

25 Sep 2023 - Powering Vector Search With Real Time And Incremental Vector Indexes (00:59:16)

Summary

The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
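
To make the idea of a real-time vector index concrete, here is a deliberately naive Python sketch: vectors are appended incrementally as they arrive, and queries run a brute-force cosine-similarity scan. Production systems replace the scan with approximate-nearest-neighbor structures (e.g. HNSW) that support incremental updates; this only shows the conceptual shape.

    import numpy as np

    class NaiveVectorIndex:
        def __init__(self, dim):
            self.vectors = np.empty((0, dim), dtype=np.float32)
            self.ids = []

        def add(self, item_id, vector):
            # Incremental insert: normalize once so search is a dot product.
            v = np.asarray(vector, dtype=np.float32)
            self.vectors = np.vstack([self.vectors, v / np.linalg.norm(v)])
            self.ids.append(item_id)

        def search(self, query, k=5):
            q = np.asarray(query, dtype=np.float32)
            q = q / np.linalg.norm(q)
            scores = self.vectors @ q  # cosine similarity against every vector
            top = np.argsort(scores)[::-1][:k]
            return [(self.ids[i], float(scores[i])) for i in top]

    idx = NaiveVectorIndex(dim=4)
    idx.add("doc-1", [0.1, 0.9, 0.0, 0.2])
    idx.add("doc-2", [0.8, 0.1, 0.1, 0.0])
    print(idx.search([0.1, 1.0, 0.0, 0.1], k=1))  # doc-1 scores highest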

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
  • Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what vector search is and how it differs from other search technologies?
    • What are the technical challenges related to providing vector search?
    • What are the applications for vector search that merit the added complexity?
  • Vector databases have been gaining a lot of attention recently with the proliferation of LLM applications. Is a dedicated database technology required to support vector indexes/vector search queries?
    • What are the use cases for native vector data types that are separate from AI?
  • With the increasing usage of vectors for data and AI/ML applications, who do you typically see as the owner of that problem space? (e.g. data engineers, ML engineers, data scientists, etc.)
  • For teams who are investing in vector search, what are the architectural considerations that they need to be aware of?
    • How does it impact the data pipeline strategies/topologies used?
  • What are the complexities that need to be addressed when updating vector data in a real-time/streaming fashion?
    • How does that influence the client strategies that are querying that data?
  • What are the most interesting, innovative, or unexpected ways that you have seen vector search used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector search applications?
  • When is vector search the wrong choice?
  • What do you see as future potential applications for vector indexes/vector search?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. The Machine Learning Podcast helps you go from idea to production with machine learning. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

24 Dec 2023 - Troubleshooting Kafka In Production (01:14:44)

Summary

Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading to his work writing the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
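
As one concrete example of the durability trade-offs covered in this conversation, the settings below are the levers most often tuned to guard against data loss, shown as plain configuration dictionaries with comments. The values are illustrative starting points for reasoning, not universal recommendations.

    # Producer-side settings (key names follow the Java client; librdkafka
    # naming differs slightly in places).
    producer_config = {
        "acks": "all",               # wait for all in-sync replicas to ack
        "enable.idempotence": True,  # avoid duplicates on producer retries
    }

    # Topic/broker-side settings.
    topic_config = {
        "replication.factor": 3,                  # survive loss of 2 brokers
        "min.insync.replicas": 2,                 # with acks=all, 1 may be down
        "unclean.leader.election.enable": False,  # never elect a stale leader
    }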

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe your experiences with Kafka?
    • What are the operational challenges that you have had to overcome while working with Kafka?
    • What motivated you to write a book about how to manage Kafka in production?
  • There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?
    • In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?
  • When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?
    • What are the axes along which size/scale need to be determined?
  • The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?
    • Under what circumstances can data be lost?
  • What are the different failure conditions that cluster operators need to be aware of?
    • What are the monitoring strategies that are most helpful for identifying (proactively or reactively) those errors?
  • In the event of these different cluster errors, what are the strategies for mitigating and recovering from those failures?
  • When a cluster's usage expands beyond the original designed capacity, what are the options/procedures for expanding that capacity?
    • When a cluster is underutilized, how can it be scaled down to reduce cost?
  • What are the most interesting, innovative, or unexpected ways that you have seen Kafka used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with Kafka?
  • When is Kafka the wrong choice?
  • What are the changes that you would like to see in Kafka to make it easier to operate?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

03 Mar 2024 - When And How To Conduct An AI Program (00:46:25)

Summary

Artificial intelligence technologies promise to revolutionize business and produce new sources of value. In order to make those promises a reality there is a substantial amount of strategy and investment required. Colleen Tartow has worked across all stages of the data lifecycle, and in this episode she shares her hard-earned wisdom about how to conduct an AI program for your organization.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
  • Your host is Tobias Macey and today I'm interviewing Colleen Tartow about the questions to answer before and during the development of an AI program

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • When you say "AI Program", what are the organizational, technical, and strategic elements that it encompasses?
    • How does the idea of an "AI Program" differ from an "AI Product"?
    • What are some of the signals to watch for that indicate an objective for which AI is not a reasonable solution?
  • Who needs to be involved in the process of defining and developing that program?
    • What are the skills and systems that need to be in place to effectively execute on an AI program?
  • "AI" has grown to be an even more overloaded term than it already was. What are some of the useful clarifying/scoping questions to address when deciding the path to deployment for different definitions of "AI"?
  • Organizations can easily fall into the trap of green-lighting an AI project before they have done the work of ensuring they have the necessary data and the ability to process it. What are the steps to take to build confidence in the availability of the data?
    • Even if you are sure that you can get the data, what are the implementation pitfalls that teams should be wary of while building out the data flows for powering the AI system?
    • What are the key considerations for powering AI applications that are substantially different from analytical applications?
  • The ecosystem for ML/AI is a rapidly moving target. What are the foundational/fundamental principles that you need to design around to allow for future flexibility?
  • What are the most interesting, innovative, or unexpected ways that you have seen AI programs implemented?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on powering AI systems?
  • When is AI the wrong choice?
  • What do you have planned for the future of your work at VAST Data?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

11 Nov 2024 - An Opinionated Look At End-to-end Code Only Analytical Workflows With Bruin (00:56:11)
Summary
The challenges of integrating all of the tools in the modern data stack have led to a new generation of tools that focus on a fully integrated workflow. At the same time, there have been many approaches to how much of the workflow is driven by code vs. not. Burak Karakan is of the opinion that a fully integrated workflow that is driven entirely by code offers a beneficial and productive means of generating useful analytical outcomes. In this episode he shares how Bruin builds on those opinions and how you can use it to build your own analytics without having to cobble together a suite of tools with conflicting abstractions.
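
For a feel of what a code-only pipeline mixing SQL and Python can look like, here is a generic sketch: assets are declared in code with explicit dependencies so a runner can order and execute them. This is not Bruin's actual syntax or API, just the shape of the idea.

    # Hypothetical code-only pipeline: each asset is SQL or Python, with
    # dependencies declared explicitly so a runner can order execution.
    ASSETS = {
        "raw_orders": {
            "type": "ingest",
            "source": "postgres://example/orders",  # invented connection string
            "deps": [],
        },
        "orders_clean": {
            "type": "sql",
            "sql": "SELECT * FROM raw_orders WHERE total >= 0",
            "deps": ["raw_orders"],
        },
        "orders_scored": {
            "type": "python",
            # The runner would pass the upstream result in as a dataframe.
            "fn": lambda df: df.assign(is_large=df["total"] > 100),
            "deps": ["orders_clean"],
        },
    }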


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm interviewing Burak Karakan about the benefits of building code-only data systems
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Bruin is and the story behind it?
    • Who is your target audience?
  • There are numerous tools that address the ETL workflow for analytical data. What are the pain points that you are focused on for your target users?
  • How does a code-only approach to data pipelines help in addressing the pain points of analytical workflows?
    • How might it act as a limiting factor for organizational involvement?
  • Can you describe how Bruin is designed?
    • How have the design and scope of Bruin evolved since you first started working on it?
  • You call out the ability to mix SQL and Python for transformation pipelines. What are the components that allow for that functionality?
    • What are some of the ways that the combination of Python and SQL improves ergonomics of transformation workflows?
  • What are the key features of Bruin that help to streamline the efforts of organizations building analytical systems?
  • Can you describe the workflow of someone going from source data to warehouse and dashboard using Bruin and Ingestr?
  • What are the opportunities for contributions to Bruin and Ingestr to expand their capabilities?
  • What are the most interesting, innovative, or unexpected ways that you have seen Bruin and Ingestr used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bruin?
  • When is Bruin the wrong choice?
  • What do you have planned for the future of Bruin?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
30 Jun 2024 | Improve Data Quality Through Engineering Rigor And Business Engagement With Synq | 00:59:48
Summary
This episode features an insightful conversation with Petr Janda, the CEO and founder of Synq. Petr shares his journey from being an engineer to founding Synq, emphasizing the importance of treating data systems with the same rigor as engineering systems. He discusses the challenges and solutions in data reliability, including the need for transparency and ownership in data systems. Synq's platform helps data teams manage incidents, understand data dependencies, and ensure data quality by providing insights and automation capabilities. Petr emphasizes the need for a holistic approach to data reliability, integrating data systems into broader business processes. He highlights the role of data teams in modern organizations and how Synq is empowering them to achieve this.
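
As a rough illustration of the checks a data reliability platform automates, consider table freshness and load volume; the names and thresholds below are invented for this sketch and are not Synq's API.

    # Illustrative freshness and volume checks; names and thresholds
    # are invented for this sketch and are not Synq's API.
    from datetime import datetime, timedelta

    def is_fresh(last_loaded_at: datetime, now: datetime,
                 max_staleness: timedelta) -> bool:
        """True when the table was updated recently enough."""
        return now - last_loaded_at <= max_staleness

    def has_expected_volume(row_count: int, expected_min: int) -> bool:
        """True when a load produced at least the expected number of rows."""
        return row_count >= expected_min

    now = datetime(2024, 6, 30, 8, 0)
    alerts = []
    if not is_fresh(datetime(2024, 6, 29, 2, 0), now, timedelta(hours=24)):
        alerts.append("orders table is stale")
    if not has_expected_volume(row_count=420, expected_min=1000):
        alerts.append("orders load volume below threshold")
    print(alerts)  # both checks fail in this example

The value of a platform lies in running such checks continuously across every asset and routing the resulting incidents to the owners discussed in the episode.
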
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Petr Janda about Synq, a data reliability platform focused on leveling up data teams by supporting a culture of engineering rigor
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Synq is and the story behind it? 
    • Data observability/reliability is a category that grew rapidly over the past ~5 years and has several vendors focused on different elements of the problem. What are the capabilities that you saw as lacking in the ecosystem which you are looking to address?
  • Operational/infrastructure engineers have spent the past decade honing their approach to incident management and uptime commitments. How do those concepts map to the responsibilities and workflows of data teams? 
    • Tooling only plays a small part in SLAs and incident management. How does Synq help to support the cultural transformation that is necessary?
  • What does an on-call rotation for a data engineer/data platform engineer look like as compared with an application-focused team?
  • How does the focus on data assets/data products shift your approach to observability as compared to a table/pipeline centric approach?
  • With the focus on sharing ownership beyond the boundaries on the data team there is a strong correlation with data governance principles. How do you see organizations incorporating Synq into their approach to data governance/compliance?
  • Can you describe how Synq is designed/implemented? 
    • How have the scope and goals of the product changed since you first started working on it?
  • For a team who is onboarding onto Synq, what are the steps required to get it integrated into their technology stack and workflows?
  • What are the types of incidents/errors that you are able to identify and alert on? 
    • What does a typical incident/error resolution process look like with Synq?
  • What are the most interesting, innovative, or unexpected ways that you have seen Synq used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Synq?
  • When is Synq the wrong choice?
  • What do you have planned for the future of Synq?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
27 Feb 2023 | Building A Data Mesh Platform At PayPal | 00:46:54

Summary

There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.
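
Since data contracts were central to making the mesh work, here is a sketch of what one can contain, expressed as a plain Python structure. The field names are a simplified illustration of the idea, not the exact contract format used at PayPal.

    # A simplified data contract for one data product in a mesh.
    # The shape is illustrative, not the format used at PayPal.
    contract = {
        "dataset": "customer_payments",
        "owner": "payments-domain-team",
        "version": "1.2.0",
        "schema": [
            {"name": "payment_id", "type": "string", "required": True},
            {"name": "amount_usd", "type": "decimal(18,2)", "required": True},
            {"name": "created_at", "type": "timestamp", "required": True},
        ],
        "quality": [
            {"rule": "not_null", "column": "payment_id"},
            {"rule": "freshness", "max_delay_hours": 24},
        ],
        "sla": {"availability": "99.9%", "support": "business-hours"},
    }

The contract becomes the interface between domains: producers can change anything behind it, and consumers depend only on what it declares.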

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
  • Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the goals and scope of your work at PayPal to implement a data mesh?
    • What are the core problems that you were addressing with this project?
    • Is a data mesh ever "done"?
  • What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration?
  • What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh?
  • What are the technical systems that you are relying on to power the different data domains?
    • What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency?
  • What are the biggest challenges (technical and procedural) that you have encountered during your implementation?
  • How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.)
  • What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh?
  • When is a data mesh the wrong choice?
  • What do you have planned for the future of your data mesh at PayPal?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

21 Apr 2024 | Making Email Better With AI At Shortwave | 00:53:43

Summary

Generative AI has rapidly transformed everything in the technology sector. When Andrew Lee started work on Shortwave he was focused on making email more productive. When AI started gaining adoption he realized that he had even more potential for a transformative experience. In this episode he shares the technical challenges that he and his team have overcome in integrating AI into their product, as well as the benefits and features that it provides to their customers.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Andrew Lee about his work on Shortwave, an AI powered email client

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Shortwave is and the story behind it?
    • What is the core problem that you are addressing with Shortwave?
  • Email has been a central part of communication and business productivity for decades now. What are the overall themes that continue to be problematic?
  • What are the strengths that email maintains as a protocol and ecosystem?
  • From a product perspective, what are the data challenges that are posed by email?
  • Can you describe how you have architected the Shortwave platform?
    • How have the design and goals of the product changed since you started it?
    • What are the ways that the advent and evolution of language models have influenced your product roadmap?
  • How do you manage the personalization of the AI functionality in your system for each user/team?
  • For users and teams who are using Shortwave, how does it change their workflow and communication patterns?
  • Can you describe how I would use Shortwave for managing the workflow of evaluating, planning, and promoting my podcast episodes?
  • What are the most interesting, innovative, or unexpected ways that you have seen Shortwave used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shortwave?
  • When is Shortwave the wrong choice?
  • What do you have planned for the future of Shortwave?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

13 Nov 2023 | Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine | 01:07:52

Summary

Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product and the ways that it enhances the ability of humans to get their work done, and when the humans have to adapt to the tool.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine

Interview

  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what Tabnine is and the story behind it?
  • What are the individual and organizational motivations for using AI to generate code?
    • What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
    • What are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?
  • What are some of the primary ways that developers interact with Tabnine during their development workflow?
    • Are there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)
  • For natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)
  • Can you describe the structure and implementation of Tabnine?
    • Do you rely primarily on a single core model, or do you have multiple models with subspecialization?
    • How have the design and goals of the product changed since you first started working on it?
  • What are the biggest challenges in building a custom LLM for code?
    • What are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?
  • For users of Tabnine, how do you assess/monitor the accuracy of recommendations?
    • What are the feedback and reinforcement mechanisms for the model(s)?
  • What are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?
  • When is an AI developer assistant the wrong choice?
  • What do you have planned for the future of Tabnine?

Contact Info

Parting Question

  • From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0

Sponsored By:

Support Data Engineering Podcast

04 Feb 2024 | Tackling Real Time Streaming Data With SQL Using RisingWave | 00:56:55

Summary

Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
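
Because RisingWave speaks the PostgreSQL wire protocol, an ordinary Postgres driver is enough to define and query streaming materialized views. The connection details below follow the project's quickstart defaults and the DDL assumes an existing orders source, so treat the specifics as a sketch.

    # A sketch of querying RisingWave through a standard Postgres
    # driver. Connection defaults follow the quickstart docs; the
    # "orders" source is assumed to exist already.
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
    conn.autocommit = True
    cur = conn.cursor()

    # A materialized view is a continuously maintained streaming query.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS orders_per_day AS
        SELECT created_at::date AS day, COUNT(*) AS order_count
        FROM orders
        GROUP BY created_at::date
    """)

    # Selecting from the view returns its current, incrementally
    # updated result rather than recomputing from scratch.
    cur.execute("SELECT * FROM orders_per_day ORDER BY day")
    print(cur.fetchall())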

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what RisingWave is and the story behind it?
  • There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?
    • What are some of the platforms/architectures that teams are replacing with RisingWave?
  • What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
  • Can you describe how RisingWave is architected and implemented?
    • How have the design and goals/scope changed since you first started working on it?
    • What are the core design philosophies that you rely on to prioritize the ongoing development of the project?
  • What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
  • Can you describe a typical workflow for teams that are building on top of RisingWave?
    • What are the user/developer experience elements that you have prioritized most highly?
  • What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
  • What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
  • When is RisingWave the wrong choice?
  • What do you have planned for the future of RisingWave?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

22 Jan 2023 | Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI | 00:45:40

Summary

The most interesting and challenging bugs always happen in production, but recreating them is a constant challenge due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
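
As a toy example of the kind of transformation involved, deterministic pseudonymization keeps masked values consistent across tables so joins still line up. This is a hand-rolled sketch of the concept, not Tonic's API.

    # A toy deterministic masking function: the same input always maps
    # to the same masked value, so referential integrity across tables
    # is preserved. This is a sketch of the concept, not Tonic's API.
    import hashlib

    def pseudonymize_email(email: str, salt: str = "per-project-salt") -> str:
        digest = hashlib.sha256((salt + email.lower()).encode()).hexdigest()[:12]
        return f"user_{digest}@example.invalid"

    # Case-insensitive inputs collapse to one stable pseudonym.
    assert pseudonymize_email("Ada@example.com") == pseudonymize_email("ada@example.com")
    print(pseudonymize_email("ada@example.com"))

A production tool layers on schema awareness, subsetting, and type-appropriate generators for every column, which is exactly the hard part discussed in the episode.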

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
  • Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
  • Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Tonic is and the story behind it?
  • What are the core problems that you are trying to solve?
  • What are some of the ways that fake or obfuscated data is used in development and analytics workflows?
  • challenges of reliably subsetting data
    • impact of ORMs and bad habits developers get into with database modeling
  • Can you describe how Tonic is implemented?
    • What are the units of composition that you are building to allow for evolution and expansion of your product?
    • How have the design and goals of the platform evolved since you started working on it?
  • Can you describe some of the different workflows that customers build on top of your various tools?
  • What are the most interesting, innovative, or unexpected ways that you have seen Tonic used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic?
  • When is Tonic the wrong choice?
  • What do you have planned for the future of Tonic?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

04 Sep 2023 | Eliminate The Overhead In Your Data Integration With The Open Source dlt Library | 00:42:13

Summary

Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silos. The dlt project was created to eliminate overhead and bring data integration into your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
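
For a flavor of the library, here is a small pipeline close to dlt's documented basics; the inline data and the DuckDB destination are chosen for illustration.

    # A small dlt pipeline, close to the library's documented basics.
    # The inline data and DuckDB destination are illustrative.
    import dlt

    @dlt.resource(name="users", write_disposition="merge", primary_key="id")
    def users():
        # A real resource would page through an API here.
        yield [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

    pipeline = dlt.pipeline(
        pipeline_name="example_pipeline",
        destination="duckdb",
        dataset_name="raw",
    )
    print(pipeline.run(users()))

Schema inference, normalization of nested data, and incremental merges are handled by the library, which is the overhead elimination discussed here.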

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what dlt is and the story behind it?
    • What is the problem you want to solve with dlt?
    • Who is the target audience?
  • The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt?
  • Can you describe how dlt is implemented?
  • What are the benefits of building it in Python?
  • How have the design and goals of the project changed since you first started working on it?
  • How does that language choice influence the performance and scaling characteristics?
  • What problems do users solve with dlt?
  • What are the interfaces available for extending/customizing/integrating with dlt?
  • Can you talk through the process of adding a new source/destination?
  • What is the workflow for someone building a pipeline with dlt?
  • How does the experience scale when supporting multiple connections?
  • Given the limited scope of extract and load, and the composable design of dlt, it seems like a purpose-built companion to dbt (down to the naming). What are the benefits of using those tools in combination?
  • What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
  • When is dlt the wrong choice?
  • What do you have planned for the future of dlt?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

23 Jun 2024 | Stitching Together Enterprise Analytics With Microsoft Fabric | 00:53:23

Summary

Data lakehouse architectures have been gaining significant adoption. To accelerate adoption in the enterprise, Microsoft has created the Fabric platform, based on their OneLake architecture. In this episode Dipti Borkar shares her experiences working on the product team at Fabric and explains the various use cases for the Fabric service.
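
Inside a Fabric notebook, lakehouse tables in OneLake are exposed as Delta tables queryable with ordinary Spark; the session setup and table name below are assumptions for illustration.

    # A sketch of querying a lakehouse table with Spark SQL. Fabric
    # notebooks pre-provision the session; building it locally here is
    # only so the snippet is self-contained. The table name is assumed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sql("SELECT COUNT(*) AS order_count FROM lakehouse.orders").show()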

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Dipti Borkar about her work on Microsoft Fabric and performing analytics on data

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Microsoft Fabric is and the story behind it?
  • Data lakes in various forms have been gaining significant popularity as a unified interface to an organization's analytics. What are the motivating factors that you see for that trend?
  • Microsoft has been investing heavily in open source in recent years, and the Fabric platform relies on several open components. What are the benefits of layering on top of existing technologies rather than building a fully custom solution?
    • What are the elements of Fabric that were engineered specifically for the service?
    • What are the most interesting/complicated integration challenges?
  • How has your prior experience with Ahana and Presto informed your current work at Microsoft?
  • AI plays a substantial role in the product. What are the benefits of embedding Copilot into the data engine?
    • What are the challenges in terms of safety and reliability?
  • What are the most interesting, innovative, or unexpected ways that you have seen the Fabric platform used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data lakes generally, and Fabric specifically?
  • When is Fabric the wrong choice?
  • What do you have planned for the future of data lake analytics?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

12 May 2024 | Release Management For Data Platform Services And Logic | 00:20:09

Summary

Building a data platform is a substantial engineering endeavor. Once it is running, the next challenge is figuring out how to address release management for all of the different component parts. The services and systems need to be kept up to date, but so does the code that controls their behavior. In this episode your host Tobias Macey reflects on his current challenges in this area and some of the factors that contribute to the complexity of the problem.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I want to talk about my experiences managing the QA and release management process of my data platform

Interview

  • Introduction
  • As a team, our overall goal is to ensure that the production environment for our data platform is highly stable and reliable. This is the foundational element of establishing and maintaining trust with the consumers of our data. In order to support this effort, we need to ensure that only changes that have been tested and verified are promoted to production.
  • Our current challenge is one that plagues all data teams. We want to have an environment that mirrors our production environment that is available for testing, but it’s not feasible to maintain a complete duplicate of all of the production data. Compounding that challenge is the fact that each of the components of our data platform interact with data in slightly different ways and need different processes for ensuring that changes are being promoted safely.
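
One common mitigation is to rebuild the test environment from a deterministic sample of production instead of a full copy, so test runs stay reproducible without duplicating everything. The sketch below uses DuckDB as a stand-in warehouse with invented table names.

    # Rebuild staging from a stable ~5% hash sample of production.
    # DuckDB stands in for the warehouse; names are invented, and
    # hash functions differ between engines.
    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE production_orders AS "
                "SELECT range AS order_id FROM range(10000)")

    # Hashing the key (rather than random sampling) returns the same
    # rows on every rebuild, which keeps test results reproducible.
    con.execute("""
        CREATE TABLE staging_orders AS
        SELECT * FROM production_orders
        WHERE hash(order_id) % 100 < 5
    """)
    print(con.execute("SELECT COUNT(*) FROM staging_orders").fetchone())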

Contact Info

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

04 Aug 2024 | The Evolution of DataOps: Insights from DataKitchen's CEO | 00:53:30
Summary
In this episode of the Data Engineering Podcast, host Tobias Macey welcomes back Chris Bergh, CEO of DataKitchen, to discuss his ongoing mission to simplify the lives of data engineers. Chris explains the challenges faced by data engineers, such as constant system failures, the need for rapid changes, and high customer demands. Chris delves into the concept of DataOps, its evolution, and the misappropriation of related terms like data mesh and data observability. He emphasizes the importance of focusing on processes and systems rather than just tools to improve data engineering workflows. Chris also introduces DataKitchen's open-source tools, DataOps TestGen and DataOps Observability, designed to automate data quality validation and monitor data journeys in production.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Chris Bergh about his tireless quest to simplify the lives of data engineers
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what DataKitchen is and the story behind it?
  • You helped to define and popularize "DataOps", which then went through a journey of misappropriation similar to "DevOps", and has since faded in use. What is your view on the realities of "DataOps" today?
  • Out of the popularized wave of "DataOps" tools came subsequent trends in data observability, data reliability engineering, etc. How have those cycles influenced the way that you think about the work that you are doing at DataKitchen?
  • The data ecosystem went through a massive growth period over the past ~7 years, and we are now entering a cycle of consolidation. What are the fundamental shifts that we have gone through as an industry in the management and application of data?
  • What are the challenges that never went away?
  • You recently open sourced the dataops-testgen and dataops-observability tools. What are the outcomes that you are trying to produce with those projects?
  • What are the areas of overlap with existing tools and what are the unique capabilities that you are offering?
  • Can you talk through the technical implementation of your new observability and quality testing platform?
  • What does the onboarding and integration process look like?
  • Once a team has one or both tools set up, what are the typical points of interaction that they will have over the course of their workday?
  • What are the most interesting, innovative, or unexpected ways that you have seen dataops-observability/testgen used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on promoting DataOps?
  • What do you have planned for the future of your work at DataKitchen?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
25 Jun 2023 | Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh | 00:50:19

Summary

Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
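
For context, a SQLMesh transformation is a SQL file whose MODEL header declares metadata that the tool uses for scheduling and lineage. The example below follows the documented style, held in a string for illustration, but treat the details as a sketch.

    # The text of a SQLMesh model definition, held in a string for
    # illustration; in a project it lives in its own .sql file. The
    # MODEL header and @start_date/@end_date macros follow SQLMesh's
    # documented style, but treat the specifics as a sketch.
    DAILY_ORDERS_MODEL = """
    MODEL (
      name analytics.daily_orders,
      kind INCREMENTAL_BY_TIME_RANGE (time_column order_date)
    );

    SELECT
      order_date,
      COUNT(*) AS order_count
    FROM raw.orders
    WHERE order_date BETWEEN @start_date AND @end_date
    GROUP BY order_date
    """

Because the header and query are parsed rather than templated, the tool can derive column-level lineage and safe incremental runs from the model itself.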

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what SQLMesh is and the story behind it?
    • DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
  • What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
    • How do those rough edges impact the productivity and effectiveness of teams using those tools?
  • Can you describe how SQLMesh is implemented?
    • How have the design and goals evolved since you first started working on it?
  • What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
  • For teams who have already invested in dbt, what is the migration path from or integration with dbt?
  • You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
  • What do you see as the potential benefits of integration with e.g. data-diff?
  • What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
  • What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
  • When is SQLMesh the wrong choice?
  • What do you have planned for the future of SQLMesh?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

26 Nov 2024 | Bridging Code and UI in Data Orchestration with Kestra | 00:44:30
Summary
In this episode of the Data Engineering Podcast, Anna Geller talks about the integration of code and UI-driven interfaces for data orchestration. Anna defines data orchestration as automating the coordination of workflow nodes that interact with data across various business functions, discussing how it goes beyond ETL and analytics to enable real-time data processing across different internal systems. She explores the challenges of using existing scheduling tools for data-specific workflows, highlighting limitations and anti-patterns, and discusses Kestra's solution, a low-code orchestration platform that combines code-driven flexibility with UI-driven simplicity. Anna delves into Kestra's architectural design, API-first approach, and pluggable infrastructure, and shares insights on balancing UI and code-driven workflows, the challenges of open-core business models, and innovative user applications of Kestra's platform.
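
For a sense of the UI-plus-code duality, a Kestra flow is declarative YAML that the UI can render and edit in place. The snippet below holds one in a string; the task type follows Kestra's plugin naming scheme but should be treated as an assumption rather than a verified reference.

    # The shape of a Kestra flow, held in a string for illustration.
    # The task type follows Kestra's plugin naming scheme but should
    # be treated as an assumption, not a verified plugin list.
    HELLO_FLOW = """
    id: hello_pipeline
    namespace: company.team

    tasks:
      - id: say_hello
        type: io.kestra.plugin.scripts.python.Script
        script: |
          print("hello from kestra")
    """

Because the flow is plain data rather than imperative code, the same definition can be produced by hand, generated by the UI, or managed through the API.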


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, Youtube, or wherever you get your podcasts.
  • Your host is Tobias Macey and today I'm interviewing Anna Geller about incorporating both code and UI driven interfaces for data orchestration
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing a definition of what constitutes "data orchestration"?
  • There are many orchestration and scheduling systems that exist in other contexts (e.g. CI/CD systems, Kubernetes, etc.). Those are often adapted to data workflows because they already exist in the organizational context. What are the anti-patterns and limitations that approach introduces in data workflows?
    • What are the problems that exist in the opposite direction of using data orchestrators for CI/CD, etc.?
  • Data orchestrators have been around for decades, with many different generations and opinions about how and by whom they are used. What do you see as the main motivation for UI vs. code-driven workflows?
  • What are the benefits of combining code-driven and UI-driven capabilities in a single orchestrator?
    • What constraints does it necessitate to allow for interoperability between those modalities?
  • Data Orchestrators need to integrate with many external systems. How does Kestra approach building integrations and ensure governance for all their underlying configurations?
  • Managing workflows at scale across teams can be challenging in terms of providing structure and visibility of dependencies across workflows and teams. What features does Kestra offer so that all pipelines and teams stay organised?
  • What are the most interesting, innovative, or unexpected ways that you have seen Kestra used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Kestra?
  • When is Kestra the wrong choice?
  • What do you have planned for the future of Kestra?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

In this episode of the Data Engineering Podcast, host Tobias Macey interviews Anna Geller, a data engineer turned product manager, about the integration of code and UI-driven interfaces for data orchestration. Anna shares her journey from working with data during an internship at KPMG to her current role as a product lead at Kestra. She provides her insights into the concept of data orchestration, emphasizing its broader scope beyond just ETL and analytics, and discusses the challenges and anti-patterns that arise when using existing scheduling systems for data-specific workflows.

Anna explains the overlap between CI/CD, scheduling, and orchestration tools, and the limitations that occur when these tools are used for data workflows. She highlights the importance of visibility and governance at scale and the need for a dedicated orchestrator like Kestra. The conversation also delves into the challenges of using data orchestrators for non-data workflows and the benefits of combining code and UI-driven approaches.

Anna discusses Kestra's architecture, which supports both JDBC and Kafka backends, and its focus on API-first interactions. She explains how Kestra handles task granularity, inputs, and outputs, and the flexibility provided by its plugin system. The episode also explores Kestra's approach to data as assets, the target audience for Kestra, and how it bridges different workflows across organizational boundaries.
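
Because Kestra is API-first, every UI action corresponds to an HTTP call, which is also what keeps the code-driven and UI-driven modalities interoperable. As a rough sketch of that interaction model, the Python snippet below triggers a flow execution over REST; the endpoint path, namespace, and flow ID are illustrative assumptions rather than confirmed details of Kestra's API, so check the official reference before relying on them.

    import requests

    KESTRA_URL = "http://localhost:8080"  # assumed local Kestra instance

    def trigger_execution(namespace: str, flow_id: str) -> dict:
        # The endpoint path is an assumption modeled on Kestra's API-first
        # design, not a verified contract; consult the API docs before use.
        resp = requests.post(
            f"{KESTRA_URL}/api/v1/executions/{namespace}/{flow_id}",
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()

    execution = trigger_execution("company.team", "daily_ingest")
    print(execution.get("id"), execution.get("state"))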

The discussion touches on Kestra's open-core model, the challenges of balancing open-source and enterprise features, and the innovative ways Kestra is being applied. Anna shares insights into Kestra's local development experience, the lessons learned in building the product, and the upcoming features and projects that Kestra is excited to explore.
15 May 2023 | What Happens When The Abstractions Leak On Your Data | 00:26:42

Summary

All of the advancements in our technology are based on the principle of abstraction. Abstractions are valuable until they break down, which is inevitable. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked, along with some observations on how to deal with that situation in a data platform architecture.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow

Interview

  • Introduction
  • impact of community tech debt
    • hive metastore
    • new work being done but not widely adopted
  • tensions between automation and correctness
  • data type mapping (see the sketch after this list)
    • integer types
    • complex types
    • naming things (keys/column names from APIs to databases)
  • disaggregated databases - pros and cons
    • flexibility and cost control
    • not as much tooling invested vs. Snowflake/BigQuery/Redshift
  • data modeling
    • dimensional modeling vs. answering today's questions
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform?
  • When is ELT the wrong choice?
  • What do you have planned for the future of your data platform?
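
To make the data type mapping point above concrete: an ELT pipeline ends up maintaining some version of the mapping sketched below, where the source API reports loose JSON types, the warehouse wants precise ones, and the edge cases (integer width, complex types, illegal key names) are exactly where the abstraction leaks. The mapping and names here are illustrative assumptions, not taken from any particular tool.

    import re

    # Hypothetical mapping from JSON-ish source types to warehouse column
    # types. Integer width is the classic leak: JSON has one "number" type,
    # while the warehouse distinguishes INT32/INT64/DOUBLE, so a safe
    # default has to be chosen up front.
    TYPE_MAP = {
        "string": "VARCHAR",
        "number": "DOUBLE",   # lossy assumption: could be int or float
        "integer": "BIGINT",  # widest integer to avoid overflow on load
        "boolean": "BOOLEAN",
        "object": "JSON",     # complex types often fall back to a JSON blob
        "array": "JSON",
    }

    def to_column_name(api_key: str) -> str:
        # API keys are not valid SQL identifiers: lowercase them and replace
        # anything outside [a-z0-9_], so "Order-ID#" becomes "order_id_".
        return re.sub(r"[^a-z0-9_]", "_", api_key.lower())

    def map_field(api_key: str, json_type: str) -> tuple[str, str]:
        return to_column_name(api_key), TYPE_MAP.get(json_type, "VARCHAR")

    print(map_field("Order-ID#", "integer"))  # ('order_id_', 'BIGINT')
    print(map_field("lineItems", "array"))    # ('lineitems', 'JSON')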

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


06 Aug 2023 | Quantifying The Return On Investment For Your Data Team | 01:01:53

Summary

As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.
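
The naive arithmetic behind that question is simple; as the conversation explores, the hard part is agreeing on the inputs. The sketch below is a deliberately simplified, hypothetical model whose cost inputs mirror the ones raised later in the interview (infrastructure, payroll, time spent with other teams), with the value estimate left as the contested figure the business must supply.

    def data_team_roi(
        infrastructure_cost: float,
        payroll_cost: float,
        cross_team_hours_cost: float,
        estimated_value_delivered: float,
    ) -> float:
        """Naive ROI = (return - investment) / investment.

        All inputs are annualized dollar figures; the value estimate is
        the contested part and should be segmented per team or use case.
        """
        investment = infrastructure_cost + payroll_cost + cross_team_hours_cost
        return (estimated_value_delivered - investment) / investment

    # Example: $200k infra + $900k payroll + $100k of partner-team time,
    # against $1.8M of estimated value -> 0.5, i.e. a 50% return.
    print(f"{data_team_roi(200_000, 900_000, 100_000, 1_800_000):.0%}")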

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the typical motivations for measuring and tracking the ROI for a data team?
    • Who is responsible for collecting that information?
    • How is that information used and by whom?
  • What are some of the downsides/risks of tracking this metric? (law of unintended consequences)
  • What are the inputs to the number that constitutes the "investment" (infrastructure, payroll of employees on the team, time spent working with other teams)?
  • What are the aspects of data work and its impact on the business that complicate a calculation of the "return" that is generated?
  • How should teams think about measuring data team ROI?
  • What are some concrete ROI metrics data teams can use?
    • What level of detail is useful? What dimensions should be used for segmenting the calculations?
  • How can visibility into this ROI metric be best used to inform the priorities and project scopes of the team?
  • With so many tools in the modern data stack today, what is the role of technology in helping drive or measure this impact?
  • How do your respective solutions, Monte Carlo and dbt, help teams measure and scale data value?
  • With generative AI on the upswing of the hype cycle, what are the impacts that you see it having on data teams?
    • What are the unrealistic expectations that it will produce?
    • How can it speed up time to delivery?
  • What are the most interesting, innovative, or unexpected ways that you have seen data team ROI calculated and/or used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on measuring the ROI of data teams?
  • When is measuring ROI the wrong choice?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


24 Mar 2025 | Bringing AI Into The Inner Loop of Data Engineering With Ascend | 00:52:47
Summary
In this episode of the Data Engineering Podcast Sean Knapp, CEO of Ascend.io, explores the intersection of AI and data engineering. He discusses the evolution of data engineering and the role of AI in automating processes, alleviating burdens on data engineers, and enabling them to focus on complex tasks and innovation. The conversation covers the challenges and opportunities presented by AI, including the need for intelligent tooling and its potential to streamline data engineering processes. Sean and Tobias also delve into the impact of generative AI on data engineering, highlighting its ability to accelerate development, improve governance, and enhance productivity, while also noting the current limitations and future potential of AI in the field.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Your host is Tobias Macey and today I'm interviewing Sean Knapp about how Ascend is incorporating AI into their platform to help you keep up with the rapid rate of change
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Ascend is and the story behind it?
  • The last time we spoke was August of 2022. What are the most notable or interesting evolutions in your platform since then?
    • In that same time "AI" has taken up all of the oxygen in the data ecosystem. How has that impacted the ways that you and your customers think about their priorities?
  • The introduction of AI as an API has caused many organizations to try and leap-frog their data maturity journey and jump straight to building with advanced capabilities. How is that impacting the pressures and priorities felt by data teams?
  • At the same time that AI-focused product goals are straining data teams' capacities, AI also has the potential to act as an accelerator to their work. What are the roadblocks/speedbumps that are in the way of that capability?
  • Many data teams are incorporating AI tools into parts of their workflow, but it can be clunky and cumbersome. How are you thinking about the fundamental changes in how your platform works with AI at its center?
  • Can you describe the technical architecture that you have evolved toward that allows for AI to drive the experience rather than being a bolt-on?
    • What are the concrete impacts that these new capabilities have on teams who are using Ascend?
  • What are the most interesting, innovative, or unexpected ways that you have seen Ascend + AI used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on incorporating AI into the core of Ascend?
  • When is Ascend the wrong choice?
  • What do you have planned for the future of AI in Ascend?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
06 Oct 2024 | Build Your Data Transformations Faster And Safer With SDF | 00:42:36
Summary
In this episode of the Data Engineering Podcast Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.
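
As a general illustration of what static analysis of SQL means (a generic sketch built on the open source sqlglot parser, not SDF's implementation), a transformation tool can parse a query into an AST and inspect the tables and columns it references before anything runs against the warehouse:

    import sqlglot
    from sqlglot import exp

    query = "SELECT o.id, o.total * 1.1 AS total_with_tax FROM orders AS o"
    tree = sqlglot.parse_one(query)

    # Walk the AST to find every table and column the query depends on;
    # a type-aware tool can then validate these against known schemas
    # before the query is ever executed.
    tables = [t.name for t in tree.find_all(exp.Table)]
    columns = [c.sql() for c in tree.find_all(exp.Column)]

    print(tables)   # ['orders']
    print(columns)  # ['o.id', 'o.total']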
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what SDF is and the story behind it?
    • What's the story behind the name?
  • What problem are you solving with SDF?
    • dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the Venn diagram for features and functionality across SDF, dbt and SQLMesh?
  • Can you describe the design and implementation of SDF?
    • How have the scope and goals of the project changed since you first started working on it?
  • What does the development experience look like for a team working with SDF?
    • How does that differ between the open and paid versions of the product?
  • What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
  • One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
    • Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
  • What is your governing principle for what capabilities are in the open core and which go in the paid product?
  • What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
  • When is SDF the wrong choice?
  • What do you have planned for the future of SDF?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
14 Apr 2024 | Designing A Non-Relational Database Engine | 01:16:02

Summary

Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what constitutes a NoSQL database?
    • How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago?
  • What are the factors that convince teams to use a NoSQL vs. SQL database?
    • NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus?
  • How have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines?
  • When designing and building a database, what are the initial set of questions that need to be answered?
    • How many "core capabilities" can you reasonably design around before they conflict with each other?
  • How have you approached the evolution of RavenDB as you add new capabilities and mature the project?
    • What are some of the early decisions that had to be unwound to enable new capabilities?
  • If you were to start from scratch today, what database would you build?
  • What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB?
  • When is a NoSQL database/RavenDB the wrong choice?
  • What do you have planned for the future of RavenDB?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


27 Nov 2023 | Addressing The Challenges Of Component Integration In Data Platform Architectures | 00:29:43

Summary

Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis today to get started!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • data sharing
  • weight of history
    • existing integrations with dbt
    • switching cost for e.g. SQLMesh
    • de facto standard of Airflow
  • Single source of truth
    • permissions management across application layers
    • Database engine
    • Storage layer in a lakehouse
    • Presentation/access layer (BI)
    • Data flows
    • dbt -> table level lineage
    • orchestration engine -> pipeline flows
      • task based vs. asset based
    • Metadata platform as the logical place for horizontal view

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


08 Jan 2023 | Automate Your Pipeline Creation For Streaming Data Transformations With SQLake | 00:44:06

Summary

Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
  • Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the SQLake product is and the story behind it?
    • What is the core problem that you are trying to solve?
  • What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow?
  • What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)?
  • Can you describe the technical implementation of the SQLake feature?
  • What does the workflow look like for designing and deploying pipelines in SQLake?
  • What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?
    • SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?
  • What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales?
  • What are some of the edge cases that you have had to provide escape hatches for?
  • What are the most interesting, innovative, or unexpected ways that you have seen SQLake used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake?
  • When is SQLake the wrong choice?
  • What do you have planned for the future of SQLake?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


11 Feb 2023 | Let The Whole Team Participate In Data With The Quilt Versioned Data Hub | 00:52:02

Summary

Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the members of the business who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
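
To give a flavor of that collaboration model, Quilt's quilt3 Python client bundles files and metadata into a package and pushes it to S3 as a versioned unit. The sketch below follows the shape of the documented quilt3 workflow, but the package name, bucket, and file paths are placeholders, so treat it as an outline rather than a verified recipe.

    import quilt3

    # Build a package: logical keys in the package map to local files,
    # and arbitrary metadata can ride along with each entry.
    pkg = quilt3.Package()
    pkg.set("data/events.csv", "local/events.csv", meta={"source": "prod export"})
    pkg.set_meta({"owner": "analytics", "ticket": "DATA-123"})

    # Pushing creates a new immutable revision in the registry bucket,
    # so collaborators can pin to a hash or follow "latest".
    pkg.push(
        "analytics/events",              # placeholder package name
        registry="s3://example-bucket",  # placeholder registry bucket
        message="Nightly export with corrected timestamps",
    )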

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
  • Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Quilt is and the story behind it?
    • How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?
  • What are the main problems that users are trying to solve when they find Quilt?
    • What are some of the alternative approaches/products that they are coming from?
  • How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.?
  • Can you describe how Quilt is implemented?
  • What are the types of tools and systems that Quilt gets integrated with?
    • How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?
  • What is a typical workflow for a team that is using Quilt to manage their data?
  • What are the most interesting, innovative, or unexpected ways that you have seen Quilt used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt?
  • When is Quilt the wrong choice?
  • What do you have planned for the future of Quilt?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


14 Aug 2023 | Unpacking The Seven Principles Of Modern Data Pipelines | 00:47:03

Summary

Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by defining what you mean by a "modern" data pipeline?
  • At Rivery you published a white paper identifying seven principles of modern data pipelines:
    • Zero infrastructure management
    • ELT-first mindset
    • Speaks SQL and Python
    • Dynamic multi-storage layers
    • Reverse ETL & operational analytics
    • Full transparency
    • Faster time to value
  • What are the applications of data that you focused on while identifying these principles?
  • How do the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business?
  • What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow?
  • How do the technologies involved impact the organizational involvement with how data is applied throughout the business?
  • When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data?
  • What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles?
  • What are the cases where some/all of these principles are undesirable/impractical to implement?
  • What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


01 Sep 2024 | Enhancing Data Accessibility and Governance with Gravitino | 00:38:41
Summary
As data architectures become more elaborate and the number of applications of data increases, it becomes increasingly challenging to locate and access the underlying data. Gravitino was created to provide a single interface to locate and query your data. In this episode Junping Du explains how Gravitino works, the capabilities that it unlocks, and how it fits into your data platform.
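Gravitino exposes that single interface through a REST service (with Java and Python clients layered on top). As a rough sketch of the interaction model, the snippet below walks the metadata hierarchy of metalake, catalog, and schema over plain HTTP; the server address and endpoint paths are assumptions modeled on the project's documented hierarchy, so verify them against the official API reference.

    import requests

    BASE = "http://localhost:8090"  # assumed default Gravitino server address

    def get_json(path: str) -> dict:
        # Path shapes below are assumptions following Gravitino's REST
        # hierarchy (metalake -> catalog -> schema); check the docs.
        resp = requests.get(f"{BASE}{path}", timeout=30)
        resp.raise_for_status()
        return resp.json()

    # Walk the unified metadata tree one level at a time.
    print(get_json("/api/metalakes/demo_metalake/catalogs"))
    print(get_json("/api/metalakes/demo_metalake/catalogs/lakehouse/schemas"))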
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Your host is Tobias Macey and today I'm interviewing Junping Du about Gravitino, an open source metadata service for a unified view of all of your schemas
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Gravitino is and the story behind it?
  • What problems are you solving with Gravitino?
    • What are the methods that teams have relied on in the absence of Gravitino to address those use cases?
  • What led to the Hive Metastore being the default for so long?
    • What are the opportunities for innovation and new functionality in the metadata service?
  • The documentation suggests that Gravitino has overlap with a number of tool categories such as table schema (Hive metastore), metadata repository (Open Metadata), data federation (Trino/Alluxio). What are the capabilities that it can completely replace, and which will require other systems for more comprehensive functionality?
  • What are the capabilities that you are explicitly keeping out of scope for Gravitino?
  • Can you describe the technical architecture of Gravitino?
    • How have the design and scope evolved from when you first started working on it?
  • Can you describe how Gravitino integrates into an overall data platform?
    • In a typical day, what are the different ways that a data engineer or data analyst might interact with Gravitino?
  • One of the features that you highlight is centralized permissions management. Can you describe the access control model that you use for unifying across underlying sources?
  • What are the most interesting, innovative, or unexpected ways that you have seen Gravitino used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gravitino?
  • When is Gravitino the wrong choice?
  • What do you have planned for the future of Gravitino?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12 Apr 2025 | Simplifying Data Pipelines with Durable Execution | 00:39:49
Summary
In this episode of the Data Engineering Podcast Tobias Macey interviews Jeremy Edberg, CEO of DBOS, about durable execution and its impact on designing and implementing business logic for data systems. Jeremy explains how DBOS's serverless platform and orchestrator provide local resilience and reduce operational overhead, ensuring exactly-once execution in distributed systems through the use of the Transact library. He discusses the importance of version management in long-running workflows and how DBOS simplifies system design by reducing infrastructure needs like queues and CI pipelines, making it beneficial for data pipelines, AI workloads, and agentic AI.
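
To make the durable execution idea concrete, the toy sketch below checkpoints each step's result to a file so that re-running the workflow skips completed steps. This is a minimal illustration of the mechanism only: a real system such as DBOS checkpoints to Postgres and also handles arguments, versioning, and concurrency, and none of the names here come from the Transact library's API.

    import json
    from pathlib import Path

    CHECKPOINTS = Path("checkpoints.json")  # toy stand-in for the system database

    def _load() -> dict:
        return json.loads(CHECKPOINTS.read_text()) if CHECKPOINTS.exists() else {}

    def durable_step(name: str):
        # Toy durable-execution decorator: persist each step's result so a
        # re-run of the workflow skips completed steps instead of redoing them.
        def wrap(fn):
            def inner(*args):
                done = _load()
                if name in done:
                    return done[name]  # recovery path: replay from checkpoint
                result = fn(*args)
                done[name] = result
                CHECKPOINTS.write_text(json.dumps(done))
                return result
            return inner
        return wrap

    @durable_step("extract")
    def extract() -> list:
        print("extracting (side effects here run once, even across crashes)")
        return [1, 2, 3]

    @durable_step("load")
    def load(rows: list) -> int:
        print("loading")
        return len(rows)

    print(load(extract()))  # re-running the script replays from checkpoints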


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm interviewing Jeremy Edberg about durable execution and how it influences the design and implementation of business logic
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what DBOS is and the story behind it?
  • What is durable execution?
    • What are some of the notable ways that inclusion of durable execution in an application architecture changes the ways that the rest of the application is implemented? (e.g. error handling, logic flow, etc.)
  • Many data pipelines involve complex, multi-step workflows. How does DBOS simplify the creation and management of resilient data pipelines? 
  • How does durable execution impact the operational complexity of data management systems?
  • One of the complexities in durable execution is managing code/data changes to workflows while existing executions are still processing. What are some of the useful patterns for addressing that challenge and how does DBOS help?
  • Can you describe how DBOS is architected?
    • How have the design and goals of the system changed since you first started working on it?
  • What are the characteristics of Postgres that make it suitable for the persistence mechanism of DBOS?
  • What are the guiding principles that you rely on to determine the boundaries between the open source and commercial elements of DBOS?
  • What are the most interesting, innovative, or unexpected ways that you have seen DBOS used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on DBOS?
  • When is DBOS the wrong choice?
  • What do you have planned for the future of DBOS?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
31 Mar 2024 | Adding Anomaly Detection And Observability To Your dbt Projects Is Elementary | 00:50:44

Summary

Working with data is a complicated process, with numerous chances for something to go wrong. Identifying and accounting for those errors is a critical piece of building trust within the organization that your data is accurate and up to date. While there are numerous products available to provide that visibility, they all have different technologies and workflows that they focus on. To bring observability to dbt projects the team at Elementary embedded themselves into the workflow. In this episode Maayan Salom explores the approach that she has taken to bring observability, enhanced testing capabilities, and anomaly detection into every step of the dbt developer experience.
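
At its core, the anomaly detection idea is to collect a metric per run (row counts, freshness, null rates) and flag runs that deviate from the recent baseline. The sketch below is a generic z-score check of that kind; it illustrates the mechanism rather than Elementary's actual implementation or configuration.

    import statistics

    def is_anomalous(history: list[float], latest: float, threshold: float = 3.0) -> bool:
        """Flag `latest` if it sits more than `threshold` standard deviations
        from the mean of the recent history. A generic illustration of the
        kind of check observability tools automate per table metric."""
        if len(history) < 5:  # too little history to judge
            return False
        mean = statistics.mean(history)
        stdev = statistics.stdev(history)
        if stdev == 0:
            return latest != mean
        return abs(latest - mean) / stdev > threshold

    daily_row_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890]
    print(is_anomalous(daily_row_counts, 10_100))  # False: within normal range
    print(is_anomalous(daily_row_counts, 1_200))   # True: likely a broken load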

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
  • Your host is Tobias Macey and today I'm interviewing Maayan Salom about how to incorporate observability into a dbt-oriented workflow and how Elementary can help

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining what elements of observability are most relevant for dbt projects?
  • What are some of the common ad-hoc/DIY methods that teams develop to acquire those insights?
    • What are the challenges/shortcomings associated with those approaches?
  • Over the past ~3 years numerous data observability systems/products have been created. What are some of the ways that the specifics of dbt workflows are not covered by those generalized tools?
    • What are the insights that can be more easily generated by embedding into the dbt toolchain and development cycle?
  • Can you describe what Elementary is and how it is designed to enhance the development and maintenance work in dbt projects?
  • How is Elementary designed/implemented?
    • How have the scope and goals of the project changed since you started working on it?
    • What are the engineering challenges/frustrations that you have dealt with in the creation and evolution of Elementary?
  • Can you talk us through the setup and workflow for teams adopting Elementary in their dbt projects?
  • How does the incorporation of Elementary change the development habits of the teams who are using it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Elementary used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Elementary?
  • When is Elementary the wrong choice?
  • What do you have planned for the future of Elementary?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


10 Apr 2023 | An Exploration Of The Composable Customer Data Platform | 01:11:42

Summary

The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation, it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage, a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Darren Haken and Tejas Manohar about building a composable CDP and how you can start adopting it incrementally

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what you mean by a "composable CDP"?
    • What are some of the key ways that it differs from the ways that we think of a CDP today?
  • What are the problems that you were focused on addressing at Autotrader that are solved by a CDP?
  • One of the promises of the first generation CDP was an opinionated way to model your data so that non-technical teams could own this responsibility. What do you see as the risks/tradeoffs of moving CDP functionality into the same data stack as the rest of the organization?
    • What about companies that don't have the capacity to run a full data infrastructure?
  • Beyond the core technology of the data warehouse, what are the other evolutions/innovations that allow for a CDP experience to be built on top of the core data stack?
  • added burden on core data teams to generate event-driven data models
  • When iterating toward a CDP on top of the core investment of the infrastructure to feed and manage a data warehouse, what are the typical first steps?
    • What are some of the components in the ecosystem that help to speed up the time to adoption? (e.g. pre-built dbt packages for common transformations, etc.)
  • What are the most interesting, innovative, or unexpected ways that you have seen CDPs implemented?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDP related functionality?
  • When is a CDP (composable or monolithic) the wrong choice?
  • What do you have planned for the future of the CDP stack?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


26 Dec 2022 | An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch | 01:12:00

Summary

Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone, Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform

Interview

  • Introduction
  • How did you get involved in the area of data management?

  • Data platform building journey

    • Why are you building, who are the users/use cases
    • How to focus on doing what matters over cool tools
    • How to build a good UX
    • Anything surprising or did you discover anything you didn't expect at the start
    • How to build so it's modular and can be improved in the future
  • General build vs buy and vendor selection process

    • Obviously have a good BS detector - how can others build theirs
    • So many tools, where do you start - capability need, vendor suite offering, etc.
    • Anything surprising in doing much of this at once
    • How do you think about TCO in build versus buy
    • Any advice
  • Guest call out

    • Be brave, believe you are good enough to be on the show
    • Look at past episodes and don't pitch the same as what's been on recently
    • And vendors, be smart, work with your customers to come up with a good pitch for them as guests...

Tobias' advice and learnings from building out a data platform:

  • Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, ask whether the capability you want to use is an unloved feature or a main part of the product. If it's only a feature, will they give it the care and attention it needs?
  • Advice: lean heavily on open source. You can fix things yourself and better direct the community's work, rather than just filing a ticket with a vendor and hoping.
  • Learning: there is likely going to be some painful pieces missing, especially around metadata, as you build out your platform.
  • Advice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?).
  • Learning: be prepared for tools to connect with each other but for the connections to be less robust than you want. Again, be especially prepared for metadata challenges.
  • Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea...
  • Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions.
  • Learning: it's easy to put data in S3 but it can be painfully difficult to query it. There's a missing piece as to how to store it for easy querying, not just the metadata issues.
  • Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot.
  • Advice: look to create paved path / easy path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well.
  • Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay not to have the end platform built at launch; move forward and get something going.
  • Advice: "one of the perennial problems in technology is the bias towards speed and action without necessarily understanding the destination." Really consider the path and whether you are creating a scalable and maintainable solution instead of pushing for speed to deliver something.
  • Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream.
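
To make that buffer-layer advice concrete, here is a minimal Python sketch (all field and function names are hypothetical): downstream consumers depend only on the stable schema that the buffer emits, so an upstream rename only requires updating one mapping.

    # Hypothetical buffer layer: downstream code never touches raw upstream
    # field names, so a source schema change is absorbed in one place.
    RAW_TO_STABLE = {
        "usr_id": "user_id",
        "ts": "event_time",
        "amt": "order_total",
    }

    def to_stable_schema(raw_record: dict) -> dict:
        """Project a raw upstream record onto the stable downstream schema."""
        return {stable: raw_record[raw] for raw, stable in RAW_TO_STABLE.items()}

    # Usage: a vendor-specific record becomes a contract-shaped record.
    print(to_stable_schema({"usr_id": 42, "ts": "2024-01-01T00:00:00Z", "amt": 19.99}))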

Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt
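
As a rough illustration of how the orchestration layer ties those components together, here is a minimal Dagster sketch. The asset names and bodies are hypothetical stand-ins rather than Tobias' actual pipeline; in the real platform the assets would trigger an Airbyte sync or a dbt run queried through Trino.

    from dagster import asset, materialize

    @asset
    def raw_orders():
        # Stand-in for an ingestion step (e.g. an Airbyte sync landing in S3).
        return [{"order_id": 1, "total": 19.99}]

    @asset
    def orders_summary(raw_orders):
        # Stand-in for a transformation step (e.g. a dbt model run via Trino).
        return {"order_count": len(raw_orders)}

    if __name__ == "__main__":
        # Dagster resolves the dependency graph and materializes in order.
        materialize([raw_orders, orders_summary])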

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

07 Jan 2024 | Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel | 00:50:26

Summary

Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by summarizing your current areas of research and the motivations behind them?
  • What are the open questions today in technical scalability of data engines?
    • What are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems?
  • As you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems?
    • When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product?
  • What are the main sources of tension between technical scalability and user experience/ease of comprehension?
  • What are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities?
    • In what ways do they produce conflict, whether personally or technically?
  • What are the most interesting, innovative, or unexpected ways that you have seen your research used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems?
  • What is your heuristic for when a given research project needs to be terminated or productionized?
  • What do you have planned for the future of your academic research?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

27 Oct 2024 | Accelerate Migration Of Your Data Warehouse with Datafold's AI Powered Migration Agent | 00:48:50
Summary
Gleb Mezhanskiy, CEO and co-founder of Datafold, joins Tobias Macey to discuss the challenges and innovations in data migrations. Gleb shares his experiences building and scaling data platforms at companies like Autodesk and Lyft, and how these experiences inspired the creation of Datafold to address data quality issues across teams. He outlines the complexities of data migrations, including common pitfalls such as technical debt and the importance of achieving parity between old and new systems. Gleb also discusses Datafold's innovative use of AI and large language models (LLMs) to automate translation and reconciliation processes in data migrations, reducing the time and effort required for migrations.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about Datafold's experience bringing AI to bear on the problem of migrating your data stack
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the Data Migration Agent is and the story behind it?
    • What is the core problem that you are targeting with the agent?
  • What are the biggest time sinks in the process of database and tooling migration that teams run into?
  • Can you describe the architecture of your agent?
    • What was your selection and evaluation process for the LLM that you are using?
  • What were some of the main unknowns that you had to discover going into the project?
    • What are some of the evolutions in the ecosystem that occurred either during the development process or since your initial launch that have caused you to second-guess elements of the design?
  • In terms of SQL translation, there are libraries such as SQLGlot, and the work being done with SDF, that aim to address that through AST parsing and subsequent dialect generation (see the sketch after this list). What are the ways that approach is insufficient in the context of a platform migration?
  • How does the approach you are taking with the combination of data-diffing and automated translation help build confidence in the migration target?
  • What are the most interesting, innovative, or unexpected ways that you have seen the Data Migration Agent used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI powered migration assistant?
  • When is the data migration agent the wrong choice?
  • What do you have planned for the future of applications of AI at Datafold?
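For context on the AST-based approach referenced in the SQLGlot question above, here is a minimal example of deterministic dialect translation; the query is an arbitrary illustration.

    import sqlglot

    # Parse Snowflake SQL into an AST, then re-emit it in DuckDB's dialect.
    query = "SELECT DATEADD(day, 7, created_at) AS due_date FROM orders"
    print(sqlglot.transpile(query, read="snowflake", write="duckdb")[0])

Transpilers like this handle syntax deterministically; the episode's argument concerns everything they cannot see, such as semantic differences and verifying that the migrated tables actually match, which is where data-diffing comes in.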
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
19 May 2024 | Zenlytic Is Building You A Better Coworker With AI Agents | 00:54:19

Summary

The purpose of business intelligence systems is to allow anyone in the business to access and decode data to help them make informed decisions. Unfortunately this often turns into an exercise in frustration for everyone involved due to complex workflows and hard-to-understand dashboards. The team at Zenlytic have leaned on the promise of large language models to build an AI agent that lets you converse with your data. In this episode they share their journey through the fast-moving landscape of generative AI and unpack the difference between an AI chatbot and an AI agent.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Ryan Janssen and Paul Blankley about their experiences building AI powered agents for interacting with your data

Interview

  • Introduction
  • How did you get involved in data? In AI?
  • Can you describe what Zenlytic is and the role that AI is playing in your platform?
  • What have been the key stages in your AI journey?
    • What are some of the dead ends that you ran into along the path to where you are today?
    • What are some of the persistent challenges that you are facing?
  • So tell us more about data agents. Firstly, what are data agents and why do you think they're important?
  • How are data agents different from chatbots?
  • Are data agents harder to build? How do you make them work in production?
  • What other technical architectures have you had to develop to support the use of AI in Zenlytic?
  • How have you approached the work of customer education as you introduce this functionality?
  • What are some of the most interesting or erroneous misconceptions that you have heard about what the AI can and can't do?
  • How have you balanced accuracy/trustworthiness with user experience and flexibility in the conversational AI, given the potential for these models to create erroneous responses?
  • What are the most interesting, innovative, or unexpected ways that you have seen your AI agent used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an AI agent for business intelligence?
  • When is an AI agent the wrong choice?
  • What do you have planned for the future of AI in the Zenlytic product?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

28 Apr 2024 | Build Your Second Brain One Piece At A Time | 00:50:10
Summary
Generative AI promises to accelerate the productivity of human collaborators. Currently the primary way of working with these tools is through a conversational prompt, which is often cumbersome and unwieldy. In order to simplify the integration of AI capabilities into developer workflows Tsavo Knott helped create Pieces, a powerful collection of tools that complements the tools that developers already use. In this episode he explains the data collection and preparation process, the collection of model types and sizes that work together to power the experience, and how to incorporate it into your workflow to act as a second brain.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Tsavo Knott about Pieces, a personal AI toolkit to improve the efficiency of developers
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what Pieces is and the story behind it?
  • The past few months have seen an endless series of personalized AI tools launched. What are the features and focus of Pieces that might encourage someone to use it over the alternatives?
  • model selections
  • architecture of Pieces application
  • local vs. hybrid vs. online models
  • model update/delivery process
  • data preparation/serving for models in context of Pieces app
  • application of AI to developer workflows
  • types of workflows that people are building with pieces
  • What are the most interesting, innovative, or unexpected ways that you have seen Pieces used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pieces?
  • When is Pieces the wrong choice?
  • What do you have planned for the future of Pieces?
Contact Info
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
16 Apr 2023 | Building Self Serve Business Intelligence With AI And Semantic Modeling At Zenlytic | 00:49:19

Summary

Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means changes. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.
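
As a toy illustration of the semantic-layer idea (the format below is invented for this sketch and is not Zenlytic's actual model definition): the language model picks from governed metric and dimension definitions instead of free-handing SQL, which constrains it to answers the business has already blessed.

    # Hypothetical semantic model: governed definitions the LLM selects from.
    SEMANTIC_MODEL = {
        "table": "orders",
        "metrics": {"revenue": "SUM(order_total)"},
        "dimensions": {"month": "DATE_TRUNC('month', ordered_at)"},
    }

    def compile_query(metric: str, dimension: str) -> str:
        """Compile a (metric, dimension) choice into SQL against the model."""
        select_metric = SEMANTIC_MODEL["metrics"][metric]
        select_dim = SEMANTIC_MODEL["dimensions"][dimension]
        return (
            f"SELECT {select_dim} AS {dimension}, {select_metric} AS {metric} "
            f"FROM {SEMANTIC_MODEL['table']} GROUP BY 1"
        )

    # "What is revenue by month?" resolves to a governed query:
    print(compile_query("revenue", "month"))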

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Paul Blankley and Ryan Janssen about Zenlytic, a no-code business intelligence tool focused on emerging commerce brands

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Zenlytic is and the story behind it?
  • Business intelligence is a crowded market. What was your process for defining the problem you are focused on solving and the method to achieve that outcome?
  • Self-serve data exploration has been attempted in myriad ways over successive generations of BI and data platforms. What are the barriers that have been the most challenging to overcome in that effort?
    • What are the elements that are coming together now that give you confidence in being able to deliver on that?
  • Can you describe how Zenlytic is implemented?
    • What are the evolutions in the understanding and implementation of semantic layers that provide a sufficient substrate for operating on?
    • How have the recent breakthroughs in large language models (LLMs) improved your ability to build features in Zenlytic?
    • What is your process for adding domain semantics to the operational aspect of your LLM?
  • For someone using Zenlytic, what is the process for getting it set up and integrated with their data?
  • Once it is operational, can you describe some typical workflows for using Zenlytic in a business context?
    • Who are the target users?
    • What are the collaboration options available?
  • What are the most complex engineering/data challenges that you have had to address in building Zenlytic?
  • What are the most interesting, innovative, or unexpected ways that you have seen Zenlytic used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zenlytic?
  • When is Zenlytic the wrong choice?
  • What do you have planned for the future of Zenlytic?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

23 Dec 2024 | Building a Data Vision Board: A Guide to Strategic Planning | 00:49:59
Summary
In this episode of the Data Engineering Podcast Lior Barak shares his insights on developing a three-year strategic vision for data management. He discusses the importance of having a strategic plan for data, highlighting the need for data teams to focus on impact rather than just enablement. He introduces the concept of a "data vision board" and explains how it can help organizations outline their strategic vision by considering three key forces: regulation, stakeholders, and organizational goals. Lior emphasizes the importance of balancing short-term pressures with long-term strategic goals, quantifying the cost of data issues to prioritize effectively, and maintaining the strategic vision as a living document through regular reviews. He encourages data teams to shift from being enablers to impact creators and provides practical advice on implementing a data vision board, setting clear KPIs, and embracing a product mindset to create tangible business impacts through strategic data management.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. 
  • Your host is Tobias Macey and today I'm interviewing Lior Barak about how to develop your three year strategic vision for data
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an outline of the types of problems that occur as a result of not developing a strategic plan for an organization's data systems?
  • What is the format that you recommend for capturing that strategic vision?
    • What are the types of decisions and details that you believe should be included in a vision statement?
  • Why is a 3 year horizon beneficial? What does that scale of time encourage/discourage in the debate and decision-making process?
  • Who are the personas that should be included in the process of developing this strategy document?
  • Can you walk us through the steps and processes involved in developing the data vision board for an organization?
  • What are the time-frames or milestones that should lead to revisiting and revising the strategic objectives?
  • What are the most interesting, innovative, or unexpected ways that you have seen a data vision strategy used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data strategy development?
  • When is a data vision board the wrong choice?
  • What are some additional resources or practices that you recommend teams invest in as a supplement to this strategic vision exercise?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
15 Oct 2023 | Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable | 01:08:29

Summary

Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.
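
To give a flavor of what stream processing pipelines look like at the SQL level, here is a sketch using PyFlink. Decodable is built on Flink, but the table, column, and connector choices here are illustrative assumptions, not Decodable's interface.

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # A synthetic clickstream source with an event-time watermark.
    t_env.execute_sql("""
        CREATE TEMPORARY TABLE page_views (
            user_id BIGINT,
            event_time TIMESTAMP(3),
            WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
        ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
    """)

    # Continuously updated one-minute tumbling-window view counts.
    t_env.execute_sql("""
        SELECT window_start, COUNT(*) AS views
        FROM TABLE(TUMBLE(TABLE page_views, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
        GROUP BY window_start, window_end
    """).print()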

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
  • Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Decodable is and the story behind it?
    • What are the notable changes to the Decodable platform since we last spoke? (October 2021)
    • What are the industry shifts that have influenced the product direction?
  • What are the problems that customers are trying to solve when they come to Decodable?
  • When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
  • What are the developer experience challenges that are particular to working with streaming data?
    • How have you worked to address that in the Decodable platform and interfaces?
  • As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
  • What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
  • When is Decodable the wrong choice?
  • What do you have planned for the future of Decodable?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

28 Aug 2023 | Building An Internal Database As A Service Platform At Cloudflare | 01:01:10

Summary

Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low latency and high uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the different database workloads that you have at Cloudflare?
    • What are the different methods that you have used for managing database instances?
  • What are the requirements and constraints that you had to account for in designing your current system?
  • Why Postgres?
  • optimizations for Postgres
    • simplification from not supporting multiple engines
  • limitations in Postgres that make multi-tenancy challenging
  • scale of operation (data volume, request rate)
  • What are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform?
  • When is an internal database as a service the wrong choice?
  • What do you have planned for the future of Postgres hosting at Cloudflare?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

19 Dec 2022 | Making Sense Of The Technical And Organizational Considerations Of Data Contracts | 00:47:01

Summary

One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints to your data workflows.
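
To ground the idea before the interview: a data contract ultimately compiles down to machine-checkable assertions about data crossing a team boundary. Here is a minimal sketch in plain pandas (the table and rules are hypothetical; Great Expectations packages this same pattern as named, shareable expectation suites):

    import pandas as pd

    def check_users_contract(df: pd.DataFrame) -> list[str]:
        """Return the list of contract violations for a shared users table."""
        violations = []
        if df["user_id"].isna().any():
            violations.append("user_id must not be null")
        if not df["user_id"].is_unique:
            violations.append("user_id must be unique")
        if not df["email"].str.contains("@").all():
            violations.append("email must look like an address")
        return violations

    users = pd.DataFrame({
        "user_id": [1, 2, 2],
        "email": ["a@example.com", "b@example.com", "c@example.com"],
    })
    # A non-empty result is exactly where the blocking-vs-notification
    # policy question from the interview comes into play.
    print(check_users_contract(users))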

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Your host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your conception of a data contract is?
    • What are some of the ways that you have seen them implemented?
  • How has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization?
    • What does the negotiation process look like for identifying what needs to be included in a contract?
  • What are the interfaces/integration points where data contracts are most useful/necessary?
  • What are the discussions that need to happen when deciding when/whether a contract "violation" is a blocking action vs. issuing a notification?
  • At what level of detail/granularity are contracts most helpful?
  • At the technical level, what does the implementation/integration/deployment of a contract look like?
  • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/great expectations?
  • When are data contracts the wrong choice?
  • What do you have planned for the future of data contracts in great expectations?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

10 Mar 2024 | Version Your Data Lakehouse Like Your Software With Nessie | 00:40:55

Summary

Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git.
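
To make the Git analogy concrete, the workflow through Nessie's Spark SQL extensions looks roughly like the following sketch; the catalog, branch, and table names are examples, the session configuration is omitted, and the exact statement syntax should be confirmed against the Nessie documentation.

    from pyspark.sql import SparkSession

    # Assumes a session already configured with the Nessie catalog and
    # SQL extensions (configuration omitted for brevity).
    spark = SparkSession.builder.getOrCreate()

    # Branch the whole catalog, like `git checkout -b etl`.
    spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
    spark.sql("USE REFERENCE etl IN nessie")

    # Writes land on the `etl` branch and stay invisible to `main`...
    spark.sql("INSERT INTO nessie.sales.orders VALUES (1, 19.99)")

    # ...until the branch is validated and merged, like a Git merge to main.
    spark.sql("MERGE BRANCH etl INTO main IN nessie")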

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
  • Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'Reilly, "Apache Iceberg: The Definitive Guide", about Nessie, a Git-like versioned catalog for data lakes using Apache Iceberg

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Nessie is and the story behind it?
  • What are the core problems/complexities that Nessie is designed to solve?
  • The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case?
  • Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec?
  • How do the versioning capabilities compare to/augment the data versioning in Iceberg?
  • What are some of the sources of, and challenges in resolving, merge conflicts between table branches?
  • Can you describe the architecture of Nessie?
  • How have the design and goals of the project changed since it was first created?
  • What is involved in integrating Nessie into a given data stack?
  • For cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively?
  • How does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows?
  • What are the most interesting, innovative, or unexpected ways that you have seen Nessie used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie?
  • When is Nessie the wrong choice?
  • What have you heard is planned for the future of Nessie?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

03 Apr 2023 | Mapping The Data Infrastructure Landscape As A Venture Capitalist | 01:01:57

Summary

The data ecosystem has been building momentum for several years now. As a venture capital investor, Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more
  • Your host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the MAD landscape report is and the story behind it?
    • At a high level, what is your goal in the compilation and maintenance of your landscape document?
    • What are your guidelines for what to include in the landscape?
  • As the data landscape matures, how have you seen that influence the types of projects/companies that are founded?
    • What are the product categories that were only viable when capital was plentiful and easy to obtain?
    • What are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive?
  • The rapid growth and proliferation of data tools helped establish the "Modern Data Stack" as a de-facto architectural paradigm. As we move into this phase of contraction, what are your predictions for how the "Modern Data Stack" will evolve?
    • Is there a different architectural paradigm that you see as growing to take its place?
  • How have your presentation and the types of information that you collate in the MAD landscape evolved since you first started it?
  • What are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years?
  • What do you have planned for future iterations of the MAD landscape?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


24 Jul 2023 | Build Real Time Applications With Operational Simplicity Using Dozer | 00:40:43

Summary

Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data in reach of application engineers Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.
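
As a concrete point of reference for the "API generation" piece of that story, here is a minimal sketch of what querying an endpoint generated by a Dozer deployment might look like from Python. The host, path, and query payload shape are hypothetical illustrations, not Dozer's documented interface.

    import requests

    # Hypothetical REST endpoint exposed by a local Dozer deployment;
    # the path and the query payload shape are assumptions for illustration.
    resp = requests.post(
        "http://localhost:8080/api/orders/query",
        json={"$filter": {"status": "shipped"}, "$limit": 10},
    )
    resp.raise_for_status()
    for record in resp.json().get("records", []):
        print(record)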

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
  • Your host is Tobias Macey and today I'm interviewing Matteo Pelati about Dozer, an open source engine that includes data ingestion, transformation, and API generation for real-time sources

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Dozer is and the story behind it?
    • What was your decision process for building Dozer as open source?
  • As you note in the documentation, Dozer has overlap with a number of technologies that are aimed at different use cases. What was missing from each of them, and from the center of their Venn diagram, that prompted you to build Dozer?
  • In addition to working in an interesting technological cross-section, you are also targeting a disparate group of personas. Who are you building Dozer for and what were the motivations for that vision?
    • What are the different use cases that you are focused on supporting?
    • What are the features of Dozer that enable engineers to address those uses, and what makes it preferable to existing alternative approaches?
  • Can you describe how Dozer is implemented?
    • How have the design and goals of the platform changed since you first started working on it?
    • What are the architectural "-ilities" that you are trying to optimize for?
  • What is involved in getting Dozer deployed and integrated into an existing application/data infrastructure?
  • How can teams who are using Dozer extend/integrate with Dozer?
    • What does the development/deployment workflow look like for teams who are building on top of Dozer?
  • What is your governance model for Dozer and balancing the open source project against your business goals?
  • What are the most interesting, innovative, or unexpected ways that you have seen Dozer used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dozer?
  • When is Dozer the wrong choice?
  • What do you have planned for the future of Dozer?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


24 Mar 2024 | Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+ | 00:55:40

Summary

A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.
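
For readers unfamiliar with the software-defined asset model referenced throughout this conversation, a minimal sketch using the open source dagster package follows; the asset names and logic are illustrative only.

    from dagster import Definitions, asset

    @asset
    def raw_orders() -> list[dict]:
        # Stand-in for an extraction step against an API or database.
        return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 13.5}]

    @asset
    def order_total(raw_orders: list[dict]) -> float:
        # Dagster infers the dependency on raw_orders from the parameter name.
        return sum(order["amount"] for order in raw_orders)

    defs = Definitions(assets=[raw_orders, order_total])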

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the focus of Dagster+ is and the story behind it?
    • What problems are you trying to solve with Dagster+?
    • What are the notable enhancements beyond the Dagster Core project that this updated platform provides?
    • How is it different from the current Dagster Cloud product?
  • In the launch announcement you tease new capabilities that would be great to explore in turn:
    • Make data a team sport, enabling data teams across the organization
    • Deliver reliable, high quality data the organization can trust
    • Observe and manage data platform costs
    • Master the heterogeneous collection of technologies—both traditional and Modern Data Stack
  • What are the business/product goals that you are focused on improving with the launch of Dagster+?
  • What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+?
  • When is Dagster+ the wrong choice?
  • What do you have planned for the future of Dagster/Dagster Cloud/Dagster+?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


30 Mar 2025 | Overcoming Redis Limitations: The Dragonfly DB Approach | 00:43:58
Summary
In this episode of the Data Engineering Podcast Roman Gershman, CTO and founder of Dragonfly DB, explores the development and impact of high-speed in-memory databases. Roman shares his experience creating a more efficient alternative to Redis, focusing on performance gains, scalability, and cost efficiency, while addressing demanding scenarios such as high-throughput and low-latency workloads. He explains how Dragonfly DB solves operational complexities for users and delves into its technical aspects, including maintaining compatibility with Redis while innovating on memory efficiency. Roman discusses the importance of cost efficiency and operational simplicity in driving adoption and shares insights on the broader ecosystem of in-memory data stores, future directions like SSD tiering and vector search capabilities, and the lessons learned from building a new database engine.
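
Because Dragonfly maintains wire compatibility with Redis, existing client libraries connect unchanged. A minimal sketch, assuming a Dragonfly instance listening on the default Redis port:

    import redis

    # The standard redis-py client works as-is, since Dragonfly
    # speaks the Redis protocol.
    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    r.set("session:123", "active", ex=300)  # cache entry with a 5-minute TTL
    print(r.get("session:123"))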


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm interviewing Roman Gershman about building a high-speed in-memory database and the impact of the performance gains on data applications
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what DragonflyDB is and the story behind it?
  • What is the core problem/use case that is solved by making a "faster Redis"?
  • The other major player in the high performance key/value database space is Aerospike. What are the heuristics that an engineer should use to determine whether to use that vs. Dragonfly/Redis?
  • Common use cases for Redis involve application caches and queueing (e.g. Celery/RQ). What are some of the other applications that you have seen Redis/Dragonfly used for, particularly in data engineering use cases?
  • There is a piece of tribal wisdom that it takes 10 years for a database to iron out all of the kinks. At the same time, there have been substantial investments in commoditizing the underlying components of database engines. Can you describe how you approached the implementation of DragonflyDB to arrive at a functional and reliable system?
  • What are the architectural elements that contribute to the performance and scalability benefits of Dragonfly?
    • How have the design and goals of the system changed since you first started working on it?
  • For teams who migrate from Redis to Dragonfly, beyond the cost savings what are some of the ways that it changes the ways that they think about their overall system design?
  • What are the most interesting, innovative, or unexpected ways that you have seen Dragonfly used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on DragonflyDB?
  • When is DragonflyDB the wrong choice?
  • What do you have planned for the future of DragonflyDB?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
07 Apr 2024 | Establish A Single Source Of Truth For Your Data Consumers With A Semantic Layer | 00:56:23

Summary

Maintaining a single source of truth for your data is the biggest challenge in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. In order to enable this use case, while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
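
To make the "single point of access" idea tangible, here is a hedged sketch of querying a semantic layer through Cube's REST load endpoint from Python; the host, token, and measure/dimension names are illustrative assumptions.

    import json
    import requests

    # Hypothetical Cube deployment; member names are assumptions.
    url = "https://cube.example.com/cubejs-api/v1/load"
    query = {
        "measures": ["orders.total_revenue"],
        "dimensions": ["orders.status"],
    }
    resp = requests.get(
        url,
        params={"query": json.dumps(query)},
        headers={"Authorization": "REPLACE_WITH_API_TOKEN"},
    )
    resp.raise_for_status()
    print(resp.json()["data"])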

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the technical elements of what it means to have a "semantic layer"?
  • In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?
  • What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)
    • At what point does it become necessary/beneficial for a team to adopt such a service?
    • What are the challenges involved in retrofitting a semantic layer into a production data system (e.g. evolution of requirements/usage patterns, technical complexities, performance and cost optimization)?
  • What are the most interesting, innovative, or unexpected ways that you have seen Cube used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
  • When is Cube/a semantic layer the wrong choice?
  • What do you have planned for the future of Cube?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


29 May 2023 | A Roadmap To Bootstrapping The Data Team At Your Startup | 00:42:32

Summary

Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Ghalib Suleiman about challenges and strategies for building data teams in a startup

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your conception of the responsibilities of a data team?
  • What are some of the common fallacies that organizations fall prey to in their first efforts at building data capabilities?
    • Have you found it more practical to hire outside talent to build out the first data systems, or grow that talent internally?
    • What are some of the resources you have found most helpful in training/educating the early creators and consumers of data assets?
  • When there is no internal data talent to assist with hiring, what are some of the problems that manifest in the hiring process?
    • What are the concepts that the new hire needs to know?
    • How much does the hiring manager/interviewer need to know about those concepts to evaluate skill?
  • What are the most critical skills for a first hire to have to start generating valuable output?
  • As a solo data person, what are the uphill battles that they need to be prepared for in the organization?
    • What are the rabbit holes that they should beware of?
  • What are some of the tactical considerations that they should keep in mind?
  • What are the most interesting, innovative, or unexpected ways that you have seen initial data hires tackle startup challenges?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on starting and growing data teams?
  • When is it more practical to outsource the data work?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


01 Oct 2023 | Building ETL Pipelines With Generative AI | 00:51:37

Summary

Artificial intelligence applications require substantial high quality data, which is provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the various generative models it is being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights building ETL pipelines with the help of generative AI.
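
As a rough illustration of the workflow discussed here, the sketch below asks OpenAI's chat completions API to draft a transformation that a human then reviews; the model name and prompt are assumptions, and generated code should never ship without review and testing.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "Write a SQL statement that deduplicates rows in raw.orders "
        "by order_id, keeping the row with the latest updated_at."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    draft_sql = response.choices[0].message.content
    print(draft_sql)  # a human still reviews and tests this before deployment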

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES.
  • Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the different aspects/types of ETL that you are seeing generative AI applied to?
    • What kind of impact are you seeing in terms of time spent/quality of output/etc.?
  • What kinds of projects are most likely to benefit from the application of generative AI?
  • Can you describe what a typical workflow of using AI to build ETL workflows looks like?
    • What are some of the types of errors that you are likely to experience from the AI?
    • Once the pipeline is defined, what does the ongoing maintenance look like?
    • Is the AI required to operate within the pipeline in perpetuity?
  • For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?
  • What are the most interesting, innovative, or unexpected ways that you have seen generative AI used in ETL workflows?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on ETL and generative AI?
  • When is AI the wrong choice for ETL applications?
  • What are your predictions for future applications of AI in ETL and other data engineering practices?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


18 Dec 2023 | Adding An Easy Mode For The Modern Data Stack With 5X | 00:56:12

Summary

The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams in the position of spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understand the pain involved and the barriers to productivity and set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that they are doing to accelerate time to value.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5X are building to improve the user experience of the modern data stack.

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what 5x is and the story behind it?
    • We last spoke in March of 2022. What are the notable changes in the 5x business and product?
  • What are the notable shifts in the data ecosystem that have influenced your adoption and product direction?
    • What trends are you most focused on tracking as you plan the continued evolution of your offerings?
  • What are the points of friction that teams run into when trying to build their data platform?
  • Can you describe design of the system that you have built?
    • What are the strategies that you rely on to support adaptability and speed of onboarding for new integrations?
  • What are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers?
  • What is your process for selection of vendors to support?
    • How would you characterize your relationships with the vendors that you rely on?
  • For customers who have pre-existing investment in a portion of the data stack, what is your process for engaging with them to understand how best to support their goals?
  • What are the most interesting, innovative, or unexpected ways that you have seen 5X used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on 5X?
  • When is 5X the wrong choice?
  • What do you have planned for the future of 5X?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


31 Jul 2023 | Strategies For A Successful Data Platform Migration | 01:09:53

Summary

All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
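
One validation technique that recurs in migration projects is diffing data between the legacy and target systems before cutover. Below is a minimal sketch, assuming two DB-API cursors and glossing over type normalization differences between drivers.

    import hashlib

    def table_fingerprint(cursor, table: str) -> tuple[int, str]:
        # Order rows deterministically so both systems hash the same content.
        cursor.execute(f"SELECT * FROM {table} ORDER BY 1")
        digest = hashlib.sha256()
        count = 0
        for row in cursor:  # most DB-API drivers support cursor iteration
            digest.update(repr(tuple(row)).encode())
            count += 1
        return count, digest.hexdigest()

    # old_cur and new_cur would be cursors for the legacy and target systems;
    # equal fingerprints are strong evidence the table migrated intact.
    # assert table_fingerprint(old_cur, "orders") == table_fingerprint(new_cur, "orders")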

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial for your team!
  • Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
  • Is it possible to completely avoid having to invest in a migration?
  • What are the signals that point to the need for a migration?
    • What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)
    • What are some signals that a migration is not the right solution for a perceived problem?
  • Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
  • What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
  • What are some of the ways that a migration effort might fail?
  • What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
  • What are the opportunities for automation during the migration process?
  • What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
  • What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


28 Jul 2024 | Achieving Data Reliability: The Role of Data Contracts in Modern Data Management | 00:49:26
Summary
Data contracts are both an enforcement mechanism for data quality, and a promise to downstream consumers. In this episode Tom Baeyens returns to discuss the purpose and scope of data contracts, emphasizing their importance in achieving reliable analytical data and preventing issues before they arise. He explains how data contracts can be used to enforce guarantees and requirements, and how they fit into the broader context of data observability and quality monitoring. The discussion also covers the challenges and benefits of implementing data contracts, the organizational impact, and the potential for standardization in the field.
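
To ground the idea, the sketch below enforces a toy contract as a pre-publication check in plain Python. This illustrates the general pattern only; it is not Soda's actual contract syntax.

    # A toy contract: expected columns, their types, and a not-null guarantee.
    CONTRACT = {
        "columns": {"order_id": int, "amount": float},
        "not_null": ["order_id"],
    }

    def contract_violations(rows: list[dict]) -> list[str]:
        violations = []
        for i, row in enumerate(rows):
            for col, typ in CONTRACT["columns"].items():
                if col not in row:
                    violations.append(f"row {i}: missing column {col}")
                elif row[col] is not None and not isinstance(row[col], typ):
                    violations.append(f"row {i}: {col} is not {typ.__name__}")
            for col in CONTRACT["not_null"]:
                if row.get(col) is None:
                    violations.append(f"row {i}: {col} is null")
        return violations

    # Fail the pipeline step before bad data propagates downstream.
    assert not contract_violations([{"order_id": 1, "amount": 9.99}])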

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • At Outshift, the incubation engine from Cisco, they are driving innovation in AI, cloud, and quantum technologies with the powerful combination of enterprise strength and startup agility. Their latest innovation for the AI ecosystem is Motific, addressing a critical gap in going from prototype to production with generative AI. Motific is your vendor and model-agnostic platform for building safe, trustworthy, and cost-effective generative AI solutions in days instead of months. Motific provides easy integration with your organizational data, combined with advanced, customizable policy controls and observability to help ensure compliance throughout the entire process. Move beyond the constraints of traditional AI implementation and ensure your projects are launched quickly and with a firm foundation of trust and efficiency. Go to motific.ai today to learn more!
  • Your host is Tobias Macey and today I'm interviewing Tom Baeyens about using data contracts to build a clearer API for your data
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe the scope and purpose of data contracts in the context of this conversation?
  • In what way(s) do they differ from data quality/data observability?
  • Data contracts are also known as the API for data, can you elaborate on this?
  • What are the types of guarantees and requirements that you can enforce with these data contracts?
  • What are some examples of constraints or guarantees that cannot be represented in these contracts?
  • Are data contracts related to the shift-left movement?
  • The obvious application of data contracts are in the context of pipeline execution flows to prevent failing checks from propagating further in the data flow. What are some of the other ways that these contracts can be integrated into an organization's data ecosystem?
  • How did you approach the design of the syntax and implementation for Soda's data contracts?
  • Guarantees and constraints around data in different contexts have been implemented in numerous tools and systems. What are the areas of overlap in e.g. dbt, great expectations?
  • Are there any emerging standards or design patterns around data contracts/guarantees that will help encourage portability and integration across tooling/platform contexts?
  • What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts at Soda?
  • When are data contracts the wrong choice?
  • What do you have planned for the future of data contracts?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
06 Feb 2023 | Reflecting On The Past 6 Years Of Data Engineering | 00:32:21

Summary

This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years

Interview

  • Introduction
  • 6 years of running the Data Engineering Podcast
  • Around the first time that data engineering was discussed as a role
    • Followed on from hype about "data science"
  • Hadoop era
  • Streaming
  • Lambda and Kappa architectures
    • Not really referenced anymore
  • "Big Data" era of capture everything has shifted to focusing on data that presents value
    • Regulatory environment increases risk, better tools introduce more capability to understand what data is useful
  • Data catalogs
    • Amundsen and Alation
  • Orchestration engine
    • Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Flyte, etc.
    • Orchestration is now a part of most vertical tools
  • Cloud data warehouses
  • Data lakes
  • DataOps and MLOps
  • Data quality to data observability
  • Metadata for everything
    • Data catalog -> data discovery -> active metadata
  • Business intelligence
    • Read only reports to metric/semantic layers
    • Embedded analytics and data APIs
  • Rise of ELT
    • dbt
    • Corresponding introduction of reverse ETL
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast?
  • What do you have planned for the future of the podcast?

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


24 Apr 2023 | Realtime Data Applications Made Easier With Meroxa | 00:45:26

Summary

Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Meroxa is and the story behind it?
    • How have the focus and goals of the platform and company evolved over the past 2 years?
  • Who are the target customers for Meroxa?
    • What problems are they trying to solve when they come to your platform?
  • Applications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams?
    • What are some of the remaining blockers for teams who want to start using real-time data?
  • With the democratization of real-time data, what are the new categories of products and applications that are being unlocked?
    • How are organizations thinking about the potential value that those types of apps/services can provide?
  • With data flowing constantly, there are new challenges around oversight and accuracy. How does real-time data change the risk profile for applications that are consuming it?
    • What are some of the technical controls that are available for organizations that are risk-averse?
  • What skills do developers need to be able to effectively design, develop, and deploy real-time data applications?
    • How does this differ when talking about internal vs. consumer/end-user facing applications?
  • What are the most interesting, innovative, or unexpected ways that you have seen Meroxa used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa?
  • When is Meroxa the wrong choice?
  • What do you have planned for the future of Meroxa?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


05 May 2024 | Barking Up The Wrong GPTree: Building Better AI With A Cognitive Approach | 00:54:17
Summary
Artificial intelligence has dominated the headlines for several months due to the successes of large language models. This has prompted numerous debates about the possibility of, and timeline for, artificial general intelligence (AGI). Peter Voss has dedicated decades of his life to the pursuit of truly intelligent software through the approach of cognitive AI. In this episode he explains his approach to building AI in a more human-like fashion and the emphasis on learning rather than statistical prediction.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Peter Voss about what is involved in making your AI applications more "human"
Interview
  • Introduction
  • How did you get involved in machine learning?
  • Can you start by unpacking the idea of "human-like" AI? 
    • How does that contrast with the conception of "AGI"?
  • The applications and limitations of GPT/LLM models have been dominating the popular conversation around AI. How do you see that impacting the overall ecosystem of ML/AI applications and investment?
  • The fundamental/foundational challenge of every AI use case is sourcing appropriate data. What are the strategies that you have found useful to acquire, evaluate, and prepare data at an appropriate scale to build high quality models? 
  • What are the opportunities and limitations of causal modeling techniques for generalized AI models?
  • As AI systems gain more sophistication there is a challenge with establishing and maintaining trust. What are the risks involved in deploying more human-level AI systems and monitoring their reliability?
  • What are the practical/architectural methods necessary to build more cognitive AI systems? 
    • How would you characterize the ecosystem of tools/frameworks available for creating, evolving, and maintaining these applications?
  • What are the most interesting, innovative, or unexpected ways that you have seen cognitive AI applied?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while designing and developing cognitive AI systems?
  • When is cognitive AI the wrong choice?
  • What do you have planned for the future of cognitive AI applications at Aigo?
Contact Info
Parting Question
  • From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
25 Feb 2024 | Find Out About The Technology Behind The Latest PFAD In Analytical Database Development | 00:56:01

Summary

Building a database engine requires a substantial amount of engineering effort and time investment. Over the decades of research and development into building these software systems there are a number of common components that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, Datafusion, and Parquet to lay the foundation of the newest version of his time-series database.
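
Two of the four layers Paul describes are easy to demonstrate directly from Python: Arrow as the shared in-memory columnar format and Parquet as the durable columnar storage. A minimal sketch using pyarrow (the Flight transport and DataFusion query layers are omitted here):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Arrow tables are the in-memory columnar representation shared
    # across the FDAP components.
    table = pa.table({
        "time": pa.array([1, 2, 3], type=pa.int64()),
        "temperature": [21.5, 21.7, 21.6],
    })

    # Parquet provides the columnar on-disk format.
    pq.write_table(table, "measurements.parquet")
    assert pq.read_table("measurements.parquet").equals(table)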

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
  • Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?
    • This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?
  • Each of the architectural components is well engineered for its particular scope. What is the engineering work that is involved in building a cohesive platform from those components?
  • One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform?
  • Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack?
    • What are the elements of the overall product/user experience that you had to build to create a cohesive platform?
  • What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack?
  • What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community?
  • What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack?
  • When is the FDAP stack the wrong choice?
  • What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

30 Oct 2023 | Surveying The Market Of Database Products | 00:47:12

Summary

Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
  • Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the aspects of the database market that keep you interested as a VP of product?
    • How have your experiences at Elastic informed your current work at Clickhouse?
  • What are the main product categories for databases today?
    • What are the industry trends that have the most impact on the development and growth of different product categories?
    • Which categories do you see growing the fastest?
  • When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?
  • Transactional engines like Postgres, SQL Server, Oracle, etc. were long used as analytical databases as well. What is driving the broad adoption of columnar stores as a separate environment from transactional systems?
    • What are the inefficiencies/complexities that this introduces?
    • How can the database engine used for analytical systems work more closely with the transactional systems?
  • When building analytical systems there are numerous moving parts with intricate dependencies. What is the role of the database in simplifying observability of these applications?
  • What are the most interesting, innovative, or unexpected ways that you have seen Clickhouse used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on database products?
  • What are your predictions for the future of the database market?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

29 Jan 2024 | Build A Data Lake For Your Security Logs With Scanner | 01:02:38

Summary

Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Scanner is and the story behind it?
    • What were the shortcomings of other tools that are available in the ecosystem?
  • What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
  • A query engine is useless without data to analyze. What are the data acquisition paths/sources that you are designed to work with? (e.g. CloudTrail logs, app logs, etc.)
    • What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
  • Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
  • Can you describe the architecture of the Scanner platform?
    • What were the motivating constraints that led you to your current implementation?
    • How have the design and goals of the product changed since you first started working on it?
  • Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
  • What are the personas of the end-users for Scanner?
    • How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
  • For teams who are working with Scanner can you describe how it fits into their workflow?
  • What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
  • When is Scanner the wrong choice?
  • What do you have planned for the future of Scanner?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

13 Oct 2024 | The Role of Python in Shaping the Future of Data Platforms with DLT | 00:54:08
Summary
In this episode of the Data Engineering Podcast, Adrian Brudaru and Marcin Rudolf, co-founders of dltHub, delve into the principles guiding dlt's development, emphasizing its role as a library rather than a platform and its integration with lakehouse architectures and AI application frameworks. They explore the impact of the Python ecosystem's growth on dlt, highlighting integrations with high-performance libraries and the benefits of Arrow and DuckDB. The episode concludes with a discussion of the future of dlt, including plans for a portable data lake and the importance of interoperability in data management tools.
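For readers unfamiliar with the library, the following is a minimal sketch of a dlt pipeline, assuming dlt is installed with its DuckDB extra; the resource name and sample records are hypothetical stand-ins for a real API or database source:

```python
# A toy dlt pipeline: dlt infers and evolves the schema from the yielded
# records and loads them into a local DuckDB database.
import dlt

@dlt.resource(table_name="events")
def events():
    # stand-in for an API or database source
    yield [{"id": 1, "action": "login"}, {"id": 2, "action": "purchase"}]

pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",  # hypothetical names
    destination="duckdb",
    dataset_name="raw",
)
print(pipeline.run(events()))
```

The library-not-platform framing shows up here: everything runs in-process wherever your Python runs, with no separate service to operate.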
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • Your host is Tobias Macey and today I'm interviewing Adrian Brudaru and Marcin Rudolf, cofounders at dltHub, about the growth of dlt and the numerous ways that you can use it to address the complexities of data integration
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what dlt is and how it has evolved since we last spoke (September 2023)?
    • What are the core principles that guide your work on dlt and dlthub?
  • You have taken a very opinionated stance against managed extract/load services. What are the shortcomings of those platforms, and when would you argue in their favor?
  • The landscape of data movement has undergone some interesting changes over the past year. Most notably, the growth of PyAirbyte and the rapid shifts around the needs of generative AI stacks (vector stores, unstructured data processing, etc.). How has that informed your product development and positioning?
    • The Python ecosystem, and in particular data-oriented Python, has also undergone substantial evolution. What are the developments in the libraries and frameworks that you have been able to benefit from?
  • What are some of the notable investments that you have made in the developer experience for building dlt pipelines?
    • How have the interfaces for source/destination development improved?
  • You recently published a post about the idea of a portable data lake. What are the missing pieces that would make that possible, and what are the developments/technologies that put that idea within reach?
  • What is your strategy for building a sustainable product on top of dlt?
    • How does that strategy help to form a "virtuous cycle" of improving the open source foundation?
  • What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
  • When is dlt the wrong choice?
  • What do you have planned for the future of dlt/dlthub?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
06 Nov 2023 | Shining Some Light In The Black Box Of PostgreSQL Performance | 00:54:52

Summary

Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.
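One concrete starting point for that kind of visibility is the pg_stat_statements extension. The sketch below, using psycopg2 and a hypothetical connection string, lists the queries consuming the most cumulative execution time; note that the column names assume PostgreSQL 13 or newer, where total_time/mean_time were renamed:

```python
# Surface the top queries by total execution time via pg_stat_statements.
# Assumes the extension is preloaded (shared_preload_libraries) and created
# in this database; the DSN is hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=postgres")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT query, calls, total_exec_time, mean_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 10
        """
    )
    for query, calls, total_ms, mean_ms in cur.fetchall():
        print(f"{mean_ms:9.2f} ms avg  x{calls:7d}  {query[:70]}")
```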

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the different ways that database performance problems impact the business?
  • What are the most common contributors to performance issues?
  • What are the useful signals that indicate performance challenges in the database?
    • For a given symptom, what are the steps that you recommend for determining the proximate cause?
  • What are the potential negative impacts to be aware of when tuning the configuration of your database?
  • How does the database engine influence the methods used to identify and resolve performance challenges?
  • Most of the database engines that are in common use today have been around for decades. How have the lessons learned from running these systems over the years influenced the ways to think about designing new engines or evolving the ones we have today?
  • What are the most interesting, innovative, or unexpected ways that you have seen to address database performance?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on databases?
  • What are your goals for the future of database engines?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

08 Dec 2024 | An Exploration Of The Impediments To Reusable Data Pipelines | 00:51:32
Summary
In this episode of the Data Engineering Podcast, the inimitable Max Beauchemin talks about reusability in data pipelines. The conversation explores the "write everything twice" problem, where similar pipelines are built without code reuse, and discusses the challenges of managing different SQL dialects and relational databases. Max also touches on the evolving role of data engineers, drawing parallels with front-end engineering, and suggests that generative AI could facilitate knowledge capture and distribution in data engineering. He encourages the community to share reference implementations and templates to foster collaboration and innovation, and expresses hopes for a future where code reuse becomes more prevalent.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm joined again by Max Beauchemin to talk about the challenges of reusability in data pipelines
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing your current thesis on the opportunities and shortcomings of code and component reusability in the data context?
    • What are some ways that you think about what constitutes a "component" in this context?
  • The data ecosystem has arguably grown more varied and nuanced in recent years. At the same time, the number and maturity of tools has grown. What is your view on the current trend in productivity for data teams and practitioners?
  • What do you see as the core impediments to building more reusable and general-purpose solutions in data engineering?
    • How can we balance the actual needs of data consumers against their requests (whether well- or un-informed) to help increase our ability to better design our workflows for reuse?
  • In data engineering there are two broad approaches; code-focused or SQL-focused pipelines. In principle one would think that code-focused environments would have better composability. What are you seeing as the realities in your personal experience and what you hear from other teams?
  • When it comes to SQL dialects, dbt offers the option of Jinja macros, whereas SDF and SQLMesh offer automatic translation. There are also tools like PRQL and Malloy that aim to abstract away the underlying SQL. What are the tradeoffs across those options that help or hinder the portability of transformation logic? (see the transpilation sketch after this list)
  • Which layers of the data stack/steps in the data journey do you see the greatest opportunity for improving the creation of more broadly usable abstractions/reusable elements?
  • low/no code systems for code reuse
  • impact of LLMs on reusability/composition
  • impact of background on industry practices (e.g. DBAs, sysadmins, analysts vs. SWE, etc.)
  • polymorphic data models (e.g. activity schema)
  • What are the most interesting, innovative, or unexpected ways that you have seen teams address composability and reusability of data components?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-oriented tools and utilities?
  • What are your hopes and predictions for sharing of code and logic in the future of data engineering?
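As a small illustration of the automatic-translation option mentioned above, here is a sketch using sqlglot (the transpiler that SQLMesh builds on); the query is hypothetical:

```python
# One logical query, transpiled to two target dialects. sqlglot parses the
# SQL into an AST and re-renders it per engine, which is the mechanism
# behind tools that promise dialect portability.
import sqlglot

query = "SELECT user_id, COUNT(*) AS n FROM events GROUP BY 1 LIMIT 10"

print(sqlglot.transpile(query, read="duckdb", write="snowflake")[0])
print(sqlglot.transpile(query, read="duckdb", write="spark")[0])
```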
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10 Sep 2023 | An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem | 01:01:26

Summary

Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations, offering a single location for visibility and error handling so that data platform engineers can manage the complexity. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and how to apply it in your own environment.
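As a minimal illustration of what "controlling the execution and sequencing of interdependent operations" looks like in Dagster specifically, here is a sketch with two hypothetical assets; the orchestrator derives the run order from the dependency between them and wraps both in logging, retries, and lineage:

```python
# Two dependent assets: order_totals runs after raw_orders because it
# declares raw_orders as an input. Names and data are hypothetical.
from dagster import asset, materialize

@asset
def raw_orders():
    # stand-in for an extract step
    return [{"id": 1, "amount": 42}, {"id": 2, "amount": 7}]

@asset
def order_totals(raw_orders):
    return sum(o["amount"] for o in raw_orders)

if __name__ == "__main__":
    result = materialize([raw_orders, order_totals])
    print(result.success)
```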

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.)
  • What are the misconceptions about the applications of/need for/cost to implement data orchestration?
    • How do those challenges of customer education change across roles/personas?
  • Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine?
  • You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time?
  • One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine?
  • What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?
  • When is a data orchestrator the wrong choice?
  • What do you have planned for the future of orchestration with Dagster?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

11 Jun 2023 | Build Better Tests For Your dbt Projects With Datafold And data-diff | 00:48:22

Summary

Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and wrong, but then it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
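For a sense of what that validation looks like in practice, here is a minimal sketch using the open source data-diff package's Python API; the connection strings and table are hypothetical, and in a dbt workflow the staging side would typically be the table built from your pull request:

```python
# Compare the same table between production and a staging build and print
# the rows that differ: "-" marks rows only in prod, "+" rows only in staging.
from data_diff import connect_to_table, diff_tables

prod = connect_to_table("postgresql://prod-host/analytics", "orders", "order_id")
staging = connect_to_table("postgresql://staging-host/analytics", "orders", "order_id")

for sign, row in diff_tables(prod, staging):
    print(sign, row)
```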

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff)
  • What are the roadblocks to data testing/validation that you see teams run into most often?
    • How does the tooling used contribute to/help address those roadblocks?
  • What are some of the error conditions/failure modes that data-diff can help identify in a dbt project?
    • What are some examples of tests that need to be implemented by the engineer?
  • In your experience working with data teams, what typically constitutes the "staging area" for a dbt project? (e.g. separate warehouse, namespaced tables, snowflake data copies, lakefs, etc.)
  • Given a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests?
  • In application development there is the idea of the "testing pyramid", consisting of unit tests, integration tests, system tests, etc. What are the parallels to that in data projects?
    • What are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be?
  • Beyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.)
  • What are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers?
  • When is Datafold/data-diff the wrong choice for dbt projects?
  • What do you have planned for the future of Datafold?

Contact Info

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Special Guest: Gleb Mezhanskiy.

Sponsored By:

Support Data Engineering Podcast

27 May 2024 | Data Migration Strategies For Large Scale Systems | 01:00:00

Summary

Any software system that survives long enough will require some form of migration or evolution. When that system is responsible for the data layer the process becomes more challenging. Sriram Panyam has been involved in several projects that required migration of large volumes of data in high traffic environments. In this episode he shares some of the valuable lessons that he learned about how to make those projects successful.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
  • Your host is Tobias Macey and today I'm interviewing Sriram Panyam about his experiences conducting large scale data migrations and the useful strategies that he learned in the process

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by sharing some of your experiences with data migration projects?
    • As you have gone through successive migration projects, how has that influenced the ways that you think about architecting data systems?
  • How would you categorize the different types and motivations of migrations?
    • How does the motivation for a migration influence the ways that you plan for and execute that work?
  • Can you talk us through one or two specific projects that you have taken part in?
  • Part 1: The Triggers
    • Section 1: Technical Limitations triggering Data Migration
      • Scaling bottlenecks: Performance issues with databases, storage, or network infrastructure
      • Legacy compatibility: Difficulties integrating with modern tools and cloud platforms
      • System upgrades: The need to migrate data during major software changes (e.g., SQL Server version upgrade)
    • Section 2: Types of Migrations for Infrastructure Focus
      • Storage migration: Moving data between systems (HDD to SSD, SAN to NAS, etc.)
      • Data center migration: Physical relocation or consolidation of data centers
      • Virtualization migration: Moving from physical servers to virtual machines (or vice versa)
    • Section 3: Technical Decisions Driving Data Migrations
      • End-of-life support: Forced migration when older software or hardware is sunsetted
      • Security and compliance: Adopting new platforms with better security postures
      • Cost Optimization: Potential savings of cloud vs. on-premise data centers
  • Part 2: Challenges (and Anxieties)
    • Section 1: Technical Challenges
      • Data transformation challenges: Schema changes, complex data mappings
      • Network bandwidth and latency: Transferring large datasets efficiently
      • Performance testing and load balancing: Ensuring new systems can handle the workload
      • Live data consistency: Maintaining data integrity while updates occur in the source system
      • Minimizing Lag: Techniques to reduce delays in replicating changes to the new system
      • Change data capture: Identifying and tracking changes to the source system during migration
    • Section 2: Operational Challenges
      • Minimizing downtime: Strategies for service continuity during migration
      • Change management and rollback plans: Dealing with unexpected issues
      • Technical skills and resources: In-house expertise/data teams/external help
    • Section 3: Security & Compliance Challenges
      • Data encryption and protection: Methods for both in-transit and at-rest data
      • Meeting audit requirements: Documenting data lineage & the chain of custody
      • Managing access controls: Adjusting identity and role-based access to the new systems
  • Part 3: Patterns
    • Section 1: Infrastructure Migration Strategies
      • Lift and shift: Migrating as-is vs. modernization and re-architecting during the move
      • Phased vs. big bang approaches: Tradeoffs in risk vs. disruption
      • Tools and automation: Using specialized software to streamline the process
      • Dual writes: Managing updates to both old and new systems for a time (see the sketch after this list)
      • Change data capture (CDC) methods: Log-based vs. trigger-based approaches for tracking changes
      • Data validation & reconciliation: Ensuring consistency between source and target
    • Section 2: Maintaining Performance and Reliability
      • Disaster recovery planning: Failover mechanisms for the new environment
      • Monitoring and alerting: Proactively identifying and addressing issues
      • Capacity planning and forecasting growth to scale the new infrastructure
    • Section 3: Data Consistency and Replication
      • Replication tools - strategies and specialized tooling
      • Data synchronization techniques, eg Pros and cons of different methods (incremental vs. full)
      • Testing/Verification Strategies for validating data correctness in a live environment
      • Implication of large scale systems/environments
      • Comparison of interesting strategies:
        • DBLog, Debezium, Databus, GoldenGate, etc.
  • What are the most interesting, innovative, or unexpected approaches to data migrations that you have seen or participated in?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data migrations?
  • When is a migration the wrong choice?
  • What are the characteristics or features of data technologies and the overall ecosystem that can reduce the burden of data migration in the future?
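As a minimal sketch of the dual-write pattern referenced in the outline above (the legacy_db and new_db client objects are hypothetical), the key design point is that the shadow write must never fail the user-facing request; misses are recorded for later reconciliation:

```python
# Dual-write during a migration window: the legacy store remains the
# source of truth, the new store receives a best-effort shadow write.
def save_order(order, legacy_db, new_db, failures):
    legacy_db.insert("orders", order)  # authoritative write
    try:
        new_db.insert("orders", order)  # shadow write to the new system
    except Exception as exc:
        # record the miss so a reconciliation job can replay it later
        failures.append((order["id"], str(exc)))
```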

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

29 Dec 2022 | Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams | 00:58:46

Summary

With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long term improvements in your productivity that it provides.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
  • Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
  • Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience building data products at Starburst

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your definition of a "data product" is?
    • What are some of the different contexts in which the idea of a data product is applicable?
    • How do the parameters of a data product change across those different contexts/consumers?
  • What are some of the ways that you see the conversation around the purpose and practice of building data products getting overloaded by conflicting objectives?
  • What do you see as common challenges in data teams around how to approach product thinking in their day-to-day work?
  • What are some of the tactical ways that product-oriented work on data problems differs from what has become common practice in data teams?
  • What are some of the features that you are building at Starburst that contribute to the efforts of data teams to build full-featured product experiences for their data?
  • What are the most interesting, innovative, or unexpected ways that you have seen Starburst used in the context of data products?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working at Starburst?
  • When is a data product the wrong choice?
  • What do you have planned for the future of support for data product development at Starburst?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Sponsored By:

Support Data Engineering Podcast

22 Jan 2024 | Modern Customer Data Platform Principles | 01:01:33

Summary

Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
  • Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the role of the CDP is in the context of a businesses data ecosystem?
    • What are the core technical challenges associated with building and maintaining a CDP?
    • What are the organizational/business factors that contribute to the complexity of these systems?
  • The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years?
  • Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?
    • How have the architectural shifts changed the ways that organizations interact with their customer data?
  • How have the responsibilities shifted across different roles?
    • What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?
  • What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs?
  • When is a CDP the wrong choice?
  • What do you have planned for the future of ActionIQ?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


21 May 2023 | Keep Your Data Lake Fresh With Real Time Streams Using Estuary | 00:55:51

Summary

Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary has built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Estuary is and the story behind it?
  • Stream processing technologies have existed for around a decade. How would you characterize the current state of the ecosystem?
    • What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?
  • With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?
    • What is the comparative level of difficulty and support for these disparate paradigms?
  • What is the impact of continuous data flows on DAGs/orchestration of transforms?
  • What role do modern table formats have on the viability of real-time data lakes?
  • Can you describe the architecture of your Flow platform?
    • What are the core capabilities that you are optimizing for in its design?
  • What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems?
  • What does the workflow look like for a team using Estuary?
    • How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?
  • How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?
  • What are the most interesting, innovative, or unexpected ways that you have seen Estuary used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?
  • When is Estuary the wrong choice?
  • What do you have planned for the future of Estuary?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


03 Jan 2025 | Breaking Down Data Silos: AI and ML in Master Data Management | 00:57:30
Summary
In this episode of the Data Engineering Podcast Dan Bruckner, co-founder and CTO of Tamr, talks about the application of machine learning (ML) and artificial intelligence (AI) in master data management (MDM). Dan shares his journey from working at CERN to becoming a data expert and discusses the challenges of reconciling large-scale organizational data. He explains how data silos arise from independent teams and highlights the importance of combining traditional techniques with modern AI to address the nuances of data reconciliation. Dan emphasizes the transformative potential of large language models (LLMs) in creating more natural user experiences, improving trust in AI-driven data solutions, and simplifying complex data management processes. He also discusses the balance between using AI for complex data problems and the necessity of human oversight to ensure accuracy and trust.
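
To make the reconciliation problem concrete, here is a toy sketch of rule-based entity resolution: flagging pairs of customer records whose names are close in edit similarity. The records and threshold are invented for illustration, and real MDM systems, including the ML-driven approaches discussed in this episode, combine far richer signals than a single string distance.

```python
# Toy entity-resolution pass: surface pairs of customer records whose
# names are similar enough to be candidate duplicates. Illustrative
# only -- production MDM systems use many signals and learned models.
from difflib import SequenceMatcher

records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "Acme Corporation", "city": "Boston"},
    {"id": 3, "name": "Zenith Ltd", "city": "Austin"},
]

def name_similarity(a: str, b: str) -> float:
    # Ratio of matching characters, case-insensitive, in [0.0, 1.0]
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Compare every pair and report likely duplicates above a tuned threshold.
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = name_similarity(records[i]["name"], records[j]["name"])
        if score > 0.6:
            print(records[i]["id"], records[j]["id"], round(score, 2))
```

The pairwise loop is quadratic in the number of records, which hints at why scaling reconciliation, one of the questions below, is hard in practice.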


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world; like in their episode “The Secret Sauce Behind McDonald’s Data Strategy”, which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
  • Your host is Tobias Macey and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving an overview of the different ways that organizational data becomes unwieldy and needs to be consolidated and reconciled?
    • How does that reconciliation relate to the practice of "master data management"?
  • What are the scaling challenges with the current set of practices for reconciling data?
  • ML has been applied to data cleaning for a long time in the form of entity resolution, etc. How has the landscape evolved or matured in recent years?
    • What (if any) transformative capabilities do LLMs introduce?
  • What are the missing pieces/improvements that are necessary to make current AI systems usable out-of-the-box for data cleaning?
  • What are the strategic decisions that need to be addressed when implementing ML/AI techniques in the data cleaning/reconciliation process?
  • What are the risks involved in bringing ML to bear on data cleaning for inexperienced teams?
  • What are the most interesting, innovative, or unexpected ways that you have seen ML techniques used in data resolution?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on using ML/AI in master data management?
  • When is ML/AI the wrong choice for data cleaning/reconciliation?
  • What are your hopes/predictions for the future of ML/AI applications in MDM and data cleaning?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
18 Nov 2024 | Streaming Data Into The Lakehouse With Iceberg And Trino At Going | 00:39:49

In this episode, I had the pleasure of speaking with Ken Pickering, VP of Engineering at Going, about the intricacies of streaming data into a Trino and Iceberg lakehouse. Ken shared his journey from product engineering to becoming deeply involved in data-centric roles, highlighting his experiences in ecommerce and InsurTech. At Going, Ken leads the data platform team, focusing on finding travel deals for consumers, a task that involves handling massive volumes of flight data and event stream information.

Ken explained the dual approach of passive and active search strategies used by Going to manage the vast data landscape. Passive search involves aggregating data from global distribution systems, while active search is more transactional, querying specific flight prices. This approach helps Going sift through approximately 50 petabytes of data annually to identify the best travel deals.

We delved into the technical architecture supporting these operations, including the use of Confluent for data streaming, Starburst Galaxy for transformation, and Databricks for modeling. Ken emphasized the importance of an open lakehouse architecture, which allows for flexibility and scalability as the business grows.

Ken also discussed the composition of Going's engineering and data teams, highlighting the collaborative nature of their work and the reliance on vendor tooling to streamline operations. He shared insights into the challenges and strategies of managing data life cycles, ensuring data quality, and maintaining uptime for consumer-facing applications.

Throughout our conversation, Ken provided a glimpse into the future of Going's data architecture, including potential expansions into other travel modes and the integration of large language models for enhanced customer interaction. This episode offers a comprehensive look at the complexities and innovations in building a data-driven travel advisory service.

16 Jun 2024 | Being Data Driven At Stripe With Trino And Iceberg | 00:53:20

Summary

Stripe is a company that relies on data to power their products and business. To support that functionality they have invested in Trino and Iceberg for their analytical workloads. In this episode Kevin Liu shares some of the interesting features that they have built by combining those technologies, as well as the challenges that they face in supporting the myriad workloads that are thrown at this layer of their data platform.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Kevin Liu about his use of Trino and Iceberg for Stripe's data lakehouse

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what role Trino and Iceberg play in Stripe's data architecture?
    • What are the ways in which your job responsibilities intersect with Stripe's lakehouse infrastructure?
  • What were the requirements and selection criteria that led to the selection of that combination of technologies?
    • What are the other systems that feed into and rely on the Trino/Iceberg service?
  • What kinds of questions are you answering with table metadata? (see the sketch after this list)
    • What use cases/teams does that support?
  • What is the comparative utility of the Iceberg REST catalog?
  • What are the shortcomings of Trino and Iceberg?
  • What are the most interesting, innovative, or unexpected ways that you have seen Iceberg/Trino used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stripe's data infrastructure?
  • When is a lakehouse on Trino/Iceberg the wrong choice?
  • What do you have planned for the future of Trino and Iceberg at Stripe?
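
As a companion to the table-metadata question above, here is a minimal, hedged sketch of listing Iceberg snapshot metadata with the open source pyiceberg library. The catalog URI and the table name are hypothetical placeholders, and nothing here describes Stripe's internal setup.

```python
# Hedged sketch: inspecting Iceberg table snapshots via pyiceberg.
# The REST catalog URI and the table name "analytics.events" are
# made-up placeholders, not details from the episode.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", uri="http://localhost:8181")  # REST catalog
table = catalog.load_table("analytics.events")

# Each snapshot records a commit to the table: its id, wall-clock
# timestamp, and a summary of the operation (append, overwrite, etc.).
for snapshot in table.metadata.snapshots:
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.summary)
```

Metadata queries of this shape are one way to answer operational questions, such as write cadence or table freshness, without scanning any data files.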

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


04 Nov 2024Feldera: Bridging Batch and Streaming with Incremental Computation00:47:36
Summary
In this episode of the Data Engineering Podcast, the creators of Feldera talk about their incremental compute engine designed for continuous computation of data, machine learning, and AI workloads. The discussion covers the concept of incremental computation, the origins of Feldera, and its unique ability to handle both streaming and batch data seamlessly. The guests explore Feldera's architecture, applications in real-time machine learning and AI, and challenges in educating users about incremental computation. They also discuss the balance between open-source and enterprise offerings, and the broader implications of incremental computation for the future of data management, predicting a shift towards unified systems that handle both batch and streaming data efficiently.
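
For listeners new to the idea, the following is a minimal plain-Python sketch of what incremental computation means: maintaining a grouped count by applying weighted deltas instead of recomputing over the full input. Treating deletions as negative weights mirrors the Z-sets used in DBSP, but this is a conceptual illustration, not Feldera's API.

```python
# Minimal illustration of incremental computation: keep a COUNT(*)
# GROUP BY key view up to date by applying deltas, never rescanning
# the full input. Conceptual only -- not Feldera's actual interface.
from collections import defaultdict

class IncrementalCount:
    def __init__(self):
        self.counts = defaultdict(int)

    def apply_delta(self, key, weight):
        # weight > 0 is an insertion, weight < 0 a deletion/retraction
        self.counts[key] += weight
        if self.counts[key] == 0:
            del self.counts[key]  # drop keys whose net weight reaches zero

    def snapshot(self):
        return dict(self.counts)

view = IncrementalCount()
view.apply_delta("clicks", +1)
view.apply_delta("clicks", +1)
view.apply_delta("views", +1)
view.apply_delta("clicks", -1)  # a retraction arriving later in the stream
print(view.snapshot())          # {'clicks': 1, 'views': 1}
```

The same mechanism handles a batch (many deltas at once) and a stream (deltas arriving over time), which is the unification the episode explores.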

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
  • As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us you should listen to Data Citizens® Dialogues, the forward-thinking podcast from the folks at Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. They address questions around AI governance, data sharing, and working at global scale. In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. While data is shaping our world, Data Citizens Dialogues is shaping the conversation. Subscribe to Data Citizens Dialogues on Apple, Spotify, Youtube, or wherever you get your podcasts.
  • Your host is Tobias Macey and today I'm interviewing Leonid Ryzhyk, Lalith Suresh, and Mihai Budiu about Feldera, an incremental compute engine for continuous computation of data, ML, and AI workloads
Interview
  • Introduction
  • Can you describe what Feldera is and the story behind it?
  • DBSP (the theory behind Feldera) has won multiple awards from the database research community. Can you explain what it is and how it solves the incremental computation problem?
  • Depending on which angle you look at it, Feldera has attributes of data warehouses, federated query engines, and stream processors. What are the unique use cases that Feldera is designed to address?
    • In what situations would you replace another technology with Feldera?
    • When is it an additive technology?
  • Can you describe the architecture of Feldera?
    • How have the design and scope evolved since you first started working on it?
  • What are the state storage interfaces available in Feldera?
    • What are the opportunities for integrating with or building on top of open table formats like Iceberg, Lance, Hudi, etc.?
  • Can you describe a typical workflow for an engineer building with Feldera?
  • You advertise Feldera's utility in ML and AI use cases in addition to data management. What are the features that make it conducive to those applications?
  • What is your philosophy toward the community growth and engagement with the open source aspects of Feldera and how you're balancing that with sustainability of the project and business?
  • What are the most interesting, innovative, or unexpected ways that you have seen Feldera used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Feldera?
  • When is Feldera the wrong choice?
  • What do you have planned for the future of Feldera?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
08 Jul 2024 | Neon: A Serverless And Developer Friendly Postgres | 00:57:43
Summary
Postgres is one of the most widely respected and liked database engines ever. To make it even easier for developers to use, Nikita Shamgunov decided to make it serverless, so that it can scale from zero to infinity. In this episode he explains the engineering involved in making that possible, as well as the numerous details that he and his team are packing into the Neon service to make it even more attractive for anyone who wants to build on top of Postgres.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Nikita Shamgunov about his work on making Postgres a serverless database at Neon.
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Neon is and the story behind it?
    • The ecosystem around Postgres is large and varied. What are the pain points that you are trying to address with Neon? 
  • What does it mean for a database to be serverless?
    • What kinds of products and services are unlocked by making Postgres a serverless database?
  • How does your vision for Neon compare/contrast with what you know of PlanetScale?
  • Postgres is known for having a large ecosystem of plugins that add a lot of interesting and useful features, but the storage layer has not been as easily extensible historically. How have architectural changes in recent Postgres releases enabled your work on Neon?
  • What are the core pieces of engineering that you have had to complete to make Neon possible?
    • How have the design and goals of the project evolved since you first started working on it?
  • The separation of storage and compute is one of the most fundamental promises of the cloud. What new capabilities does that enable in Postgres?
    • How does the branching functionality change the ways that development teams are able to deliver and debug features?
  • Because the storage is now a networked system, what new performance/latency challenges does that introduce? How have you addressed them in Neon?
  • Anyone who has ever operated a Postgres instance has had to tackle the upgrade process. How does Neon address that process for end users?
  • The rampant growth of AI has touched almost every aspect of computing, and Postgres is no exception. How does the introduction of pgvector and semantic/similarity search functionality impact the adoption and usage patterns of Postgres/Neon? (see the sketch after this list)
    • What new challenges does that introduce for you as an operator and business owner?
  • What are the lessons that you learned from MemSQL/SingleStore that have been most helpful in your work at Neon?
  • What are the most interesting, innovative, or unexpected ways that you have seen Neon used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Neon?
  • When is Neon the wrong choice? Postgres?
  • What do you have planned for the future of Neon?
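
To ground the pgvector question above, here is a tiny numpy sketch of the nearest-neighbor search that pgvector performs inside Postgres (for example via its `<=>` cosine-distance operator). The documents and embedding values are invented, and a real deployment would use model-generated embeddings and an index rather than a brute-force scan.

```python
# Illustrative brute-force similarity search, mimicking what a pgvector
# query like `ORDER BY embedding <=> :query` (cosine distance) does
# inside Postgres. All vectors below are invented placeholders.
import numpy as np

documents = ["reset password", "invoice overdue", "cancel subscription"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.1, 0.2, 0.9],
])

query = np.array([0.85, 0.15, 0.05])  # pretend embedding of "forgot my password"

# Cosine similarity between the query and every stored embedding
sims = embeddings @ query / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
)
print(documents[int(np.argmax(sims))])  # -> "reset password"
```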
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
09 Jun 2024 | X-Ray Vision For Your Flink Stream Processing With Datorios | 00:42:22

Summary

Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose built observability improves the usefulness of Flink.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments today to subscribe. My thanks to the team at Code Comments for their support.
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Datorios is and the story behind it?
  • Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink?
    • How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink?
  • How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it?
  • How have the requirements of generative AI shifted the demand for streaming data systems?
    • What role does Flink play in the architecture of generative AI systems?
  • Can you describe how Datorios is implemented?
    • How has the design and goals of Datorios changed since you first started working on it?
  • How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms?
  • Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink?
  • What are the most interesting, innovative, or unexpected ways that you have seen Datorios used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios?
  • When is Datorios the wrong choice?
  • What do you have planned for the future of Datorios?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


17 Jul 2023Datapreneurs - How Todays Business Leaders Are Using Data To Define The Future00:54:45

Summary

Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Bob Muglia about his recent book about the idea of "Datapreneurs" and the role of data in the modern economy

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your concept of a "Datapreneur" is?
    • How is this distinct from the common idea of an entrepreneur?
  • What do you see as the key inflection points in data technologies and their impacts on business capabilities over the past ~30 years?
  • In your role as the CEO of Snowflake you had a front-row seat for the rise of the "modern data stack". What do you see as the main positive and negative impacts of that paradigm?
    • What are the key issues that are yet to be solved in that ecosystem?
  • For technologists who are thinking about launching new ventures, what are the key pieces of advice that you would like to share?
  • What do you see as the short/medium/long-term impact of AI on the technical, business, and societal arenas?
  • What are the most interesting, innovative, or unexpected ways that you have seen business leaders use data to drive their vision?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Datapreneurs book?
  • What are your key predictions for the future impact of data on the technical/economic/business landscapes?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


08 Mar 2025Accelerated Computing in Modern Data Centers With Datapelago00:55:36
Summary
In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
  • The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that unplanned and unsynchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
  • The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?
  • What was the motivating insight that led you to invest in the technology that powers Datapelago?
  • Can you describe the system design of Datapelago and how it integrates with existing data engines?
  • The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
  • What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
  • When is Datapelago the wrong choice?
  • What do you have planned for the future of Datapelago?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
23 Oct 2023 | Defining A Strategy For Your Data Products | 01:03:50

Summary

The primary application of data has moved beyond analytics. With the broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products being the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES.
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Your host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what is encompassed by the idea of a data product strategy?
    • Which roles in an organization need to be involved in the planning and implementation of that strategy?
  • What is the right order of operations: strategy -> platform design -> implementation/adoption, or platform implementation -> product strategy -> interface development?
  • How do you manage the grain of data in data products?
  • How should teams be organized to support product development and deployment?
  • How do you approach customer communications: what questions to ask, how to gather requirements, and how to help customers understand "the art of the possible"?
  • What are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on defining and implementing data product strategies?
  • When is a data product strategy overkill?
  • What are some additional resources that you recommend for listeners to direct their thinking and learning about data product strategy?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


16 Jan 2023 | Building Applications With Data As Code On The DataOS | 00:48:37

Summary

The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
  • Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
  • Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
  • Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what your mission at The Modern Data Company is and the story behind it?
  • Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?
    • Who is the target audience?
  • On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept?
    • What are the platform capabilities that are required to make it possible?
  • There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform?
  • Can you describe the technical architecture that powers your DataOS product?
    • What are the core principles that you are optimizing for in the design of your platform?
    • How have the design and goals of the system changed or evolved since you started working on DataOS?
  • Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS?
  • What are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS?
  • What are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.)
  • What are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization?
  • What are the most interesting, innovative, or unexpected ways that you have seen DataOS used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS?
  • When is DataOS the wrong choice?
  • What do you have planned for the future of DataOS?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


16 Mar 2025 | Astronomer's Role in the Airflow Ecosystem: A Deep Dive with Pete DeJoy | 00:51:41
Summary
In this episode of the Data Engineering Podcast Pete DeJoy, co-founder and product lead at Astronomer, talks about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3. Pete shares his journey into data engineering, discusses Astronomer's contributions to the Airflow project, and highlights the critical role of Airflow in powering operational data products. He covers the evolution of Airflow, its position in the data ecosystem, and the challenges faced by data engineers, including infrastructure management and observability. The conversation also touches on the upcoming Airflow 3 release, which introduces data awareness, architectural improvements, and multi-language support, and Astronomer's observability suite, Astro Observe, which provides insights and proactive recommendations for Airflow users.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm interviewing Pete DeJoy about building and managing Airflow pipelines on Astronomer and the upcoming improvements in Airflow 3
Interview
  • Introduction
  • Can you describe what Astronomer is and the story behind it?
  • How would you characterize the relationship between Airflow and Astronomer?
  • Astronomer just released its State of Airflow 2025 Report yesterday, and with over 5,000 respondents it is the largest data engineering survey ever. Can you talk a bit about the top-level findings in the report?
  • What about the overall growth of the Airflow project over time?
  • How have the focus and features of Astronomer changed since it was last featured on the show in 2017?
  • Astro Observe reached general availability in early February. What does the addition of pipeline observability mean for your customers?
  • What are other capabilities similar in scope to observability that Astronomer is looking at adding to the platform?
  • Why is Airflow so critical to providing an elevated observability (or cataloging, or similar) experience in a DataOps platform?
    • What are the notable evolutions in the Airflow project and ecosystem in that time?
  • What are the core improvements that are planned for Airflow 3.0?
  • What are the most interesting, innovative, or unexpected ways that you have seen Astro used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airflow and Astro?
  • What do you have planned for the future of Astro/Astronomer/Airflow?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
19 Dec 2022 | Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle | 01:05:29

Summary

The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed it can be easy to be overcome by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
  • Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
  • Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
  • Your host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What are the areas of the data ecosystem that you see the most turmoil and confusion?
  • The past couple of years have brought a lot of attention to the idea of the "modern data stack". How has that influenced the ways that your and your customers' teams think about what skills they need to be effective?
  • The other topic that is introducing a lot of confusion and uncertainty is the "data mesh". How has that changed the ways that teams think about who is involved in the technical and design conversations around data in an organization?
  • Now that we, as an industry, have reached a new generational inflection point in how data is generated, processed, and used, what are some of the foundational principles that have proven their worth?
    • What are some of the new lessons that are showing the greatest promise?
    • data modeling
    • data platform/infrastructure
    • data collaboration
    • data governance/security/privacy
  • How does your work at data.world support these foundational practices?
    • What are some of the ways that you work with your teams and customers to help them stay informed on industry practices?
    • What is your process for understanding the balance between hype and reality as you encounter new ideas/technologies?
  • What are some of the notable changes that have happened in the data.world product and market since I last had Bryon on the show in 2017?
  • What are the most interesting, innovative, or unexpected ways that you have seen data.world used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data.world?
  • When is data.world the wrong choice?
  • What do you have planned for the future of data.world?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

20 Nov 2023 | Unlocking Your dbt Projects With Practical Advice For Practitioners | 01:16:04

Summary

The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free! (A minimal sketch of that SQL-first workflow follows this list.)
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects
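
Materialize's pitch above centers on that familiar SQL interface. As a rough sketch of what it looks like in practice (not taken from the episode; the hostname and the orders table are placeholders, while the port and user are Materialize's documented defaults), you connect over the standard Postgres wire protocol and define a view that is maintained incrementally:

    # Hypothetical connection to a Materialize instance; host and the
    # "orders" table are placeholders, port/user are Materialize defaults.
    import psycopg2

    conn = psycopg2.connect(host="materialize.example.com", port=6875,
                            user="materialize", dbname="materialize")
    conn.autocommit = True
    with conn.cursor() as cur:
        # The view is kept up to date incrementally as new rows arrive,
        # instead of being recomputed on a batch schedule.
        cur.execute("""
            CREATE MATERIALIZED VIEW order_totals AS
            SELECT customer_id, sum(amount) AS total
            FROM orders
            GROUP BY customer_id
        """)
        cur.execute("SELECT * FROM order_totals LIMIT 5")
        print(cur.fetchall())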

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • What was your path to adoption of dbt?
    • What did you use prior to its existence?
    • When/why/how did you start using it?
  • What are some of the common challenges that teams experience when getting started with dbt?
    • How does prior experience in analytics and/or software engineering impact those outcomes?
  • You recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?
    • What new lessons did you learn about dbt in the process of writing the book?
  • The introduction of dbt is largely responsible for catalyzing the growth of "analytics engineering". As practitioners in the space, what do you see as the net result of that trend?
    • What are the lessons that we all need to invest in independent of the tool?
  • For someone starting a new dbt project today, can you talk through the decisions that will be most critical for ensuring future success?
  • As dbt projects scale, what are the elements of technical debt that are most likely to slow down engineers?
    • What are the capabilities in the dbt framework that can be used to mitigate the effects of that debt?
    • What tools or processes outside of dbt can help alleviate the incidental complexity of a large dbt project?
  • What are the most interesting, innovative, or unexpected ways that you have seen dbt used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with dbt? (as engineers and/or as authors)
  • What is on your personal wish-list for the future of dbt (or its competition)?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

17 Mar 2024 | Reconciling The Data In Your Databases With Datafold | 00:58:14

Summary

A significant portion of data workflows involve storing and processing information in database engines. Validating that the information is stored and processed correctly can be complex and time-consuming, especially when the source and destination speak different dialects of SQL. In this episode Gleb Mezhanskiy, founder and CEO of Datafold, discusses the different error conditions and solutions that you need to know about to ensure the accuracy of your data.
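
As a generic illustration of one technique in this space (a sketch of the general approach, not Datafold's implementation; the connection URLs and the orders table are hypothetical), a table can be reconciled across two engines by comparing row counts and cheap aggregate checksums in place, rather than exporting every row:

    # Compare aggregates computed in place on both engines; mismatches point
    # at lost, duplicated, or corrupted rows, or at dialect/type differences
    # such as timestamp precision.
    from sqlalchemy import create_engine, text

    source = create_engine("postgresql://user:pass@source-host/app")
    target = create_engine("postgresql://user:pass@target-host/warehouse")

    CHECKS = [
        "SELECT count(*) FROM orders",
        "SELECT sum(amount), min(created_at), max(created_at) FROM orders",
    ]

    for query in CHECKS:
        with source.connect() as s, target.connect() as t:
            src_row = s.execute(text(query)).fetchone()
            tgt_row = t.execute(text(query)).fetchone()
        status = "OK" if src_row == tgt_row else "MISMATCH"
        print(f"{status}: {query} -> {src_row} vs {tgt_row}")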

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council and use code dataengpod20 to register today!
  • Your host is Tobias Macey and today I'm welcoming back Gleb Mezhanskiy to talk about how to reconcile data in database environments

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining some of the situations where reconciling data between databases is needed?
  • What are examples of the error conditions that you are likely to run into when duplicating information between database engines?
    • When these errors do occur, what are some of the problems that they can cause?
  • When teams are replicating data between database engines, what are some of the common patterns for managing those flows?
    • How does that change between continual and one-time replication?
  • What are some of the steps involved in verifying the integrity of data replication between database engines?
  • If the source or destination isn't a traditional database engine (e.g. data lakehouse) how does that change the work involved in verifying the success of the replication?
  • What are the challenges of validating and reconciling data?
    • The sheer scale and cost of pulling data out, which forces comparisons to run in place
    • Performance: pushing databases to their limits, which is especially hard for OLTP and legacy engines
    • Cross-database compatibility
    • Data type differences across engines
  • What are the most interesting, innovative, or unexpected ways that you have seen Datafold/data-diff used in the context of cross-database validation?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
  • When is Datafold/data-diff the wrong choice?
  • What do you have planned for the future of Datafold?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

01 Dec 2024 | The Art of Database Selection and Evolution | 00:59:56
Summary
In this episode of the Data Engineering Podcast Sam Kleinman talks about the pivotal role of databases in software engineering. Sam shares his journey into the world of data and discusses the complexities of database selection, highlighting the trade-offs between different database architectures and how these choices affect system design, query performance, and the need for ETL processes. He emphasizes the importance of understanding specific requirements to choose the right database engine and warns against over-engineering solutions that can lead to increased complexity. Sam also touches on the tendency of engineers to move logic to the application layer due to skepticism about database longevity and advises teams to leverage database capabilities instead. Finally, he identifies a significant gap in data management tooling: the lack of easy-to-use testing tools for database interactions, highlighting the need for better testing paradigms to ensure reliability and reduce bugs in data-driven applications.
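
The embedded end of that spectrum is easy to demonstrate. As a minimal sketch (not from the episode; the file name and data are illustrative), an engine like DuckDB runs inside the application process, which removes operational overhead but ties capacity to a single node:

    import duckdb

    # One local file holds the whole database; there is no server to provision.
    con = duckdb.connect("analytics.duckdb")
    con.execute("CREATE TABLE IF NOT EXISTS events (user_id INTEGER, action TEXT)")
    con.execute("INSERT INTO events VALUES (1, 'login'), (1, 'purchase'), (2, 'login')")
    # Analytical SQL directly in-process, without ETL into a separate warehouse.
    print(con.execute(
        "SELECT action, count(*) AS n FROM events GROUP BY action ORDER BY n DESC"
    ).fetchall())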


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. 
  • Your host is Tobias Macey and today I'm interviewing Sam Kleinman about database tradeoffs across operating environments and axes of scale
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • The database engine you use has a substantial impact on how you architect your overall system. When starting a greenfield project, what do you see as the most important factor to consider when selecting a database?
  • points of friction introduced by database capabilities
  • embedded databases (e.g. SQLite, DuckDB, LanceDB): when to use them, and when they become a bottleneck
  • single-node database engines (e.g. Postgres, MySQL): when they are legitimately a problem
  • distributed databases (e.g. CockroachDB, PlanetScale, MongoDB)
  • polyglot storage vs. general-purpose/multimodal databases
  • federated queries, benefits and limitations 
    • ease of integration vs. variability of performance and access control

Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
19 Feb 2023 | The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse | 00:55:07

Summary

Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because of their complete ownership of your data, they constrain the possibilities of what data you can store and how it can be used. Projects like Apache Iceberg provide a viable alternative in the form of data lakehouses that offer the scalability and flexibility of data lakes, combined with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.
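
To make the shape of that workflow concrete, here is a rough sketch of reading an Iceberg table from Python with the pyiceberg client. The catalog endpoint and the table name are hypothetical, and any Iceberg-compatible catalog would work:

    from pyiceberg.catalog import load_catalog

    # Hypothetical REST catalog endpoint and table name.
    catalog = load_catalog("demo", **{"type": "rest", "uri": "http://localhost:8181"})
    table = catalog.load_table("analytics.page_views")

    # Iceberg tracks table snapshots, which is what enables time travel,
    # schema evolution, and safe concurrent writers on plain object storage.
    for entry in table.history():
        print(entry.snapshot_id, entry.timestamp_ms)

    # Filters are pushed down so data files can be pruned before reading.
    batch = table.scan(row_filter="event_date >= '2024-01-01'").to_arrow()
    print(batch.num_rows)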

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
  • Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
    • Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
  • What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
  • Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
    • Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
  • For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
    • How does that change based on the type of query/processing engine being used?
  • Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
  • What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
  • When is Iceberg/Tabular the wrong choice?
  • What do you have planned for the future of Iceberg/Tabular?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

21 Apr 2025 | Advanced Lakehouse Management With The LakeKeeper Iceberg REST Catalog | 00:57:13
Summary
In this episode of the Data Engineering Podcast Viktor Kessler, co-founder of Vakama, talks about the architectural patterns in the lakehouse enabled by a fast and feature-rich Iceberg catalog. Viktor shares his journey from data warehouses to developing the open-source project, LakeKeeper, an Apache Iceberg REST catalog written in Rust that facilitates building lakehouses with essential components like storage, compute, and catalog management. He discusses the importance of metadata in making data actionable, the evolution of data catalogs, and the challenges and innovations in the space, including integration with OpenFGA for fine-grained access control and managing data across formats and compute engines.
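
Because the REST catalog protocol is plain HTTP and JSON, many different engines can share a single catalog such as LakeKeeper. A rough sketch of talking to the standard Iceberg REST endpoints directly (the base URL and bearer token are hypothetical, and LakeKeeper's exact path prefix may differ):

    import requests

    BASE = "http://localhost:8181/catalog/v1"  # assumed local deployment
    headers = {"Authorization": "Bearer <token>"}

    # The config endpoint advertises the catalog's defaults and capabilities.
    print(requests.get(f"{BASE}/config", headers=headers).json())

    # List namespaces: every engine pointed at this endpoint sees the same
    # metadata, and the catalog can enforce permissions centrally.
    print(requests.get(f"{BASE}/namespaces", headers=headers).json())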

Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
  • Your host is Tobias Macey and today I'm interviewing Viktor Kessler about architectural patterns in the lakehouse that are unlocked by a fast and feature-rich Iceberg catalog
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what LakeKeeper is and the story behind it? 
    • What is the core of the problem that you are addressing?
  • There has been a lot of activity in the catalog space recently. What are the driving forces that have highlighted the need for a better metadata catalog in the data lake/distributed data ecosystem?
    • How would you characterize the feature sets/problem spaces that different entrants are focused on addressing?
  • Iceberg as a table format has gained a lot of attention and adoption across the data ecosystem. The REST catalog format has opened the door for numerous implementations. What are the opportunities for innovation and improving user experience in that space?
  • What is the role of the catalog in managing security and governance? (AuthZ, auditing, etc.)
    • What are the channels for propagating identity and permissions to compute engines? (how do you avoid head-scratching about permission denied situations)
  • Can you describe how LakeKeeper is implemented?
    • How have the design and goals of the project changed since you first started working on it?
  • For someone who has an existing set of Iceberg tables and catalog, what does the migration process look like?
  • What new workflows or capabilities does LakeKeeper enable for data teams using Iceberg tables across one or more compute frameworks?
  • What are the most interesting, innovative, or unexpected ways that you have seen LakeKeeper used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on LakeKeeper?
  • When is LakeKeeper the wrong choice?
  • What do you have planned for the future of LakeKeeper?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
18 Feb 2024 | Using Trino And Iceberg As The Foundation Of Your Data Lakehouse | 00:58:46

Summary

A data lakehouse is intended to combine the benefits of data lakes (cost effective, scalable storage and compute) and data warehouses (user friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offer the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
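
A minimal sketch of that combination from the client side (the cluster coordinates, catalog, and schema names are hypothetical): Trino speaks SQL to the application, while the table lives as Iceberg metadata plus Parquet files on object storage that other engines can read as well:

    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com", port=8080,
        user="analyst", catalog="iceberg", schema="analytics",
    )
    cur = conn.cursor()

    # DDL goes through Trino, but the resulting table is an ordinary
    # Iceberg table on object storage rather than warehouse-internal state.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS page_views (
            user_id BIGINT, url VARCHAR, viewed_at TIMESTAMP(6)
        ) WITH (format = 'PARQUET')
    """)
    cur.execute("SELECT count(*) FROM page_views")
    print(cur.fetchall())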

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Join us at the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council today.
  • Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • To start, can you share your definition of what constitutes a "Data Lakehouse"?
    • What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
    • What are the notable advancements in recent months/years that make them a more viable platform choice?
  • There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?
    • What are the key points of comparison for that combination in relation to other possible selections?
  • What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
    • What progress is being made (within or across the ecosystem) to address those sharp edges?
  • For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
  • What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?
  • What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?
  • When is a lakehouse the wrong choice?
  • What do you have planned for the future of Trino/Starburst?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

30 Jan 2023 | Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics | 00:50:44

Summary

Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is very labor- and time-intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
  • Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Omni Analytics is and the story behind it?
    • What are the core goals that you are trying to achieve with building Omni?
  • Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?
    • What are the technical and organizational anti-patterns that typically grow up around BI systems?
  • What are the elements that contribute to BI being such a difficult product to use effectively in an organization?
  • Can you describe how you have implemented the Omni platform?
    • How have the design/scope/goals of the product changed since you first started working on it?
  • What does the workflow for a team using Omni look like?
  • What are some of the developments in the broader ecosystem that have made your work possible?
  • What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?
  • What are the most interesting, innovative, or unexpected ways that you have seen Omni used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?
  • When is Omni the wrong choice?
  • What do you have planned for the future of Omni?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

17 Sep 2023 | Building Linked Data Products With JSON-LD | 01:01:31

Summary

A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
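
The core idea is compact enough to show inline. In this small illustration (the product data is made up), a JSON-LD document is ordinary JSON whose "@context" maps local keys onto shared vocabularies, so the semantic meaning travels with the record:

    import json

    record = {
        "@context": {
            "name": "http://schema.org/name",
            "manufacturer": "http://schema.org/manufacturer",
        },
        "@id": "https://example.com/products/42",
        "@type": "http://schema.org/Product",
        "name": "Widget",
        # Links are just references to other identifiers, which is how
        # records knit together into a graph within and across organizations.
        "manufacturer": {"@id": "https://example.com/orgs/acme"},
    }
    print(json.dumps(record, indent=2))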

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex to get a 30-day free trial of the Hex Team plan!
  • Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what the term "linked data product" means and some examples of when you might build one?
    • What is the overlap between knowledge graphs and "linked data products"?
  • What is JSON-LD?
    • What are the domains in which it is typically used?
    • How does it assist in developing linked data products?
  • What are the characteristics that distinguish a knowledge graph from a linked data product?
  • What are the layers/stages of applications and data that can/should incorporate JSON-LD as the representation for records and events?
    • What is the level of native support/compatibility that you see for JSON-LD in data systems?
  • What are the modeling exercises that are necessary to ensure useful and appropriate linkages of different records within and between products and organizations?
  • Can you describe the workflow for building autonomous linkages across data assets that are modelled as JSON-LD?
  • What are the most interesting, innovative, or unexpected ways that you have seen JSON-LD used for data workflows?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on linked data products?
  • When is JSON-LD the wrong choice?
  • What are the future directions that you would like to see for JSON-LD and linked data in the data ecosystem?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

11 Dec 2023 | Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack | 00:51:18

Summary

If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business-critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
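
As a generic illustration of the underlying idea (a textbook z-score check, not Anomstack's own algorithm; the metric values are synthetic), each new observation is scored against its recent history and flagged when it deviates too far:

    from statistics import mean, stdev

    def is_anomalous(history: list[float], value: float, threshold: float = 3.0) -> bool:
        """Flag values more than `threshold` standard deviations from the mean."""
        if len(history) < 2:
            return False
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > threshold

    daily_orders = [102.0, 98.0, 105.0, 99.0, 101.0, 97.0, 103.0]
    print(is_anomalous(daily_orders, 100.0))  # False: within the normal range
    print(is_anomalous(daily_orders, 12.0))   # True: worth an alert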

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains, even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro. That’s three free boards at dataengineeringpodcast.com/miro.
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Anomstack is and the story behind it?
    • What are your goals for this project?
    • What other tools/products might teams be evaluating while they consider Anomstack?
  • In the context of Anomstack, what constitutes a "metric"?
    • What are some examples of useful metrics that a data team might want to monitor?
  • You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project?
  • What are the core capabilities and constraints that you selected to provide the focus and architecture of the project?
  • Can you describe how Anomstack is implemented?
    • How have the design and goals of the project changed since you first started working on it?
  • What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform?
    • What are the sharp edges that are still present in the system?
  • What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack?
  • What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack?
  • When is Anomstack the wrong choice?
  • What do you have planned for the future of Anomstack?

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

23 Sep 2024 | Scaling Airbyte: Challenges and Milestones on the Road to 1.0 | 00:57:11
Summary
Airbyte is one of the most prominent platforms for data movement. Over the past 4 years they have invested heavily in solutions for scaling the self-hosted and cloud operations, as well as the quality and stability of their connectors. As a result of that hard work, they have declared their commitment to the future of the platform with a 1.0 release. In this episode Michel Tricot shares the highlights of their journey and the exciting new capabilities that are coming next.
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Your host is Tobias Macey and today I'm interviewing Michel Tricot about the journey to the 1.0 launch of Airbyte and what that means for the project
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Airbyte is and the story behind it?
  • What are some of the notable milestones that you have traversed on your path to the 1.0 release?
  • The ecosystem has gone through some significant shifts since you first launched Airbyte. How have trends such as generative AI, the rise and fall of the "modern data stack", and the shifts in investment impacted your overall product and business strategies?
  • What are some of the hard-won lessons that you have learned about the realities of data movement and integration?
    • What are some of the most interesting/challenging/surprising edge cases or performance bottlenecks that you have had to address?
  • What are the core architectural decisions that have proven to be effective?
    • How has the architecture had to change as you progressed to the 1.0 release?
  • A 1.0 version signals a degree of stability and commitment. Can you describe the decision process that you went through in committing to a 1.0 version?
  • What are the most interesting, innovative, or unexpected ways that you have seen Airbyte used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Airbyte?
  • When is Airbyte the wrong choice?
  • What do you have planned for the future of Airbyte after the 1.0 launch?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
16 Feb 2025 | Evolving Responsibilities in AI Data Management | 00:38:57
Summary
In this episode of the Data Engineering Podcast Bartosz Mikulski talks about preparing data for AI applications. Bartosz shares his journey from data engineering to MLOps and emphasizes the importance of data testing over software development in AI contexts. He discusses the types of data assets required for AI applications, including extensive test datasets, especially in generative AI, and explains the differences in data requirements for various AI application styles. The conversation also explores the skills data engineers need to transition into AI, such as familiarity with vector databases and new data modeling strategies, and highlights the challenges of evolving AI applications, including frequent reprocessing of data when changing chunking strategies or embedding models.
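
One of those chunking strategies is simple enough to sketch. This is an illustrative fixed-size window with overlap (the sizes are arbitrary; production systems often chunk on tokens or semantic boundaries), and it shows why changing the strategy forces reprocessing:

    def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
        """Split text into overlapping character windows of at most `size`."""
        if overlap >= size:
            raise ValueError("overlap must be smaller than chunk size")
        step = size - overlap
        return [text[start:start + size] for start in range(0, len(text), step)]

    document = "some long document text " * 200  # stand-in for a real document
    chunks = chunk_text(document, size=500, overlap=50)
    # Changing `size`, `overlap`, or the embedding model later means
    # re-chunking and re-embedding the entire corpus.
    print(len(chunks), len(chunks[0]))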


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Your host is Tobias Macey and today I'm interviewing Bartosz Mikulski about how to prepare data for use in AI applications
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining some of the main categories of data assets that are needed for AI applications?
    • How does the nature of the application change those requirements? (e.g. RAG app vs. agent, etc.)
  • How do the different assets map to the stages of the application lifecycle?
    • What are some of the common roles and divisions of responsibility that you see in the construction and operation of a "typical" AI application?
  • For data engineers who are used to data warehousing/BI, what are the skills that map to AI apps?
  • What are some of the data modeling patterns that are needed to support AI apps?
    • chunking strategies 
    • metadata management
  • What are the new categories of data that data engineers need to manage in the context of AI applications?
    • agent memory generation/evolution 
    • conversation history management
    • data collection for fine tuning
  • What are some of the notable evolutions in the space of AI applications and their patterns that have happened in the past ~1-2 years that relate to the responsibilities of data engineers?
  • What are some of the skills gaps that teams should be aware of and identify training opportunities for?
  • What are the most interesting, innovative, or unexpected ways that you have seen data teams address the needs of AI applications?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI applications and their reliance on data?
  • What are some of the emerging trends that you are paying particular attention to?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
21 Jul 2024 | How Generative AI Is Impacting Data Engineering Teams | 00:54:45
Summary
Generative AI has rapidly gained adoption for numerous use cases. To support those applications, organizational data platforms need to add new features and data teams have increased responsibility. In this episode Lior Gavish, co-founder of Monte Carlo, discusses the various ways that data teams are evolving to support AI powered features and how they are incorporating AI into their work.
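
To ground the vector-embedding discussion, here is a toy sketch of the retrieval step behind RAG (the vectors are fabricated; real systems use a learned embedding model and a vector database): documents and a query are embedded as vectors and ranked by cosine similarity:

    import math

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norms

    doc_vectors = {
        "refund policy": [0.9, 0.1, 0.0],
        "shipping times": [0.1, 0.8, 0.3],
        "api reference": [0.0, 0.2, 0.9],
    }
    query = [0.85, 0.15, 0.05]  # pretend embedding of "how do I get a refund?"

    ranked = sorted(doc_vectors, key=lambda d: cosine(doc_vectors[d], query), reverse=True)
    # Stale or low-quality vectors surface the wrong documents, which is why
    # embedding pipelines need the same quality guarantees as any other ETL.
    print(ranked[0])  # -> "refund policy"
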
Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Your host is Tobias Macey and today I'm interviewing Lior Gavish about the impact of AI on data engineers
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by clarifying what we are discussing when we say "AI"?
  • Previous generations of machine learning (e.g. deep learning, reinforcement learning, etc.) required new features in the data platform. What new demands is the current generation of AI introducing?
  • Generative AI also has the potential to be incorporated in the creation/execution of data pipelines. What are the risk/reward tradeoffs that you have seen in practice?
    • What are the areas where LLMs have proven useful/effective in data engineering?
  • Vector embeddings have rapidly become a ubiquitous data format as a result of the growth in retrieval augmented generation (RAG) for AI applications. What are the end-to-end operational requirements to support this use case effectively?
    • As with all data, the reliability and quality of the vectors will impact the viability of the AI application. What are the different failure modes, quality metrics, and error conditions that vectors are subject to? (A minimal validation sketch follows this list.)
  • As much as vectors, vector databases, RAG, etc. seem exotic and new, it is all ultimately shades of the same work that we have been doing for years. What are the areas of overlap in the work required for running the current generation of AI, and what are the areas where it diverges?
    • What new skills do data teams need to acquire to be effective in supporting AI applications?
  • What are the most interesting, innovative, or unexpected ways that you have seen AI impact data engineering teams?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with the current generation of AI?
  • When is AI the wrong choice?
  • What are your predictions for the future impact of AI on data engineering teams?
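To make the vector-quality concerns raised above concrete, here is a minimal sketch of basic embedding validation. The specific checks and the default dimension are illustrative assumptions, not metrics discussed in the episode.

```python
import numpy as np

def validate_embeddings(vectors: np.ndarray, expected_dim: int = 768) -> dict:
    """Basic quality checks over a batch of embeddings (one vector per row).

    The checks and the 768 default dimension are illustrative assumptions;
    a real pipeline would also track distribution drift over time.
    """
    norms = np.linalg.norm(vectors, axis=1)
    return {
        "dimension_ok": vectors.ndim == 2 and vectors.shape[1] == expected_dim,
        "no_nans": not np.isnan(vectors).any(),
        "no_infs": not np.isinf(vectors).any(),
        "no_zero_vectors": bool((norms > 1e-9).all()),
    }
```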
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11 Feb 2024 | Data Sharing Across Business And Platform Boundaries | 00:59:56

Summary

Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
  • Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
  • Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
  • What is the current state of the ecosystem for data sharing protocols/practices/platforms?
    • What are some of the main challenges/shortcomings that teams/organizations experience with these options?
  • What are the technical capabilities that need to be present for an effective data sharing solution?
    • How does that change as a function of the type of data? (e.g. tabular, image, etc.)
  • What are the requirements around governance and auditability of data access that need to be addressed when sharing data? (A minimal access-grant sketch follows this list.)
  • What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
  • Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
  • What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
  • When is Bobsled the wrong choice?
  • What do you have planned for the future of data sharing?
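To ground the governance and auditability questions above, here is a minimal sketch of a time-bounded, column-scoped sharing grant with an audit trail. All class and field names here are hypothetical and do not reflect Bobsled's actual API.

```python
# A minimal sketch of a governed data-sharing grant with an audit trail.
# Every name here is hypothetical; this is not Bobsled's API.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class ShareGrant:
    consumer: str         # identity of the receiving organization
    dataset: str          # logical dataset name, not a physical location
    columns: tuple        # explicit column allow-list
    expires_at: datetime  # grants are time-bounded, never open-ended

audit_log: list[dict] = []

def authorize(grant: ShareGrant, consumer: str, requested_columns: set) -> bool:
    """Check a request against the grant and record the decision."""
    allowed = (
        consumer == grant.consumer
        and requested_columns <= set(grant.columns)
        and datetime.utcnow() < grant.expires_at
    )
    audit_log.append({
        "consumer": consumer,
        "dataset": grant.dataset,
        "columns": sorted(requested_columns),
        "allowed": allowed,
        "at": datetime.utcnow().isoformat(),
    })
    return allowed
```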

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


18 Jun 2023 | How Column-Aware Development Tooling Yields Better Data Models | 00:46:20

Summary

Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
  • Your host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
    • There are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?
  • Can you describe what you mean by the term column-aware in the context of data modeling/data architecture?
    • What are the capabilities that need to be built into a tool for it to be effectively column-aware?
  • What are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?
  • Column-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)
  • What is the importance of embedding column-level lineage awareness into the transformation tool versus layering it on top with dedicated lineage/metadata tooling? (A minimal lineage sketch follows this list.)
  • What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?
  • When is column-aware modeling the wrong choice?
  • What are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column aware principles?
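To make the idea of column-level lineage concrete, here is a minimal sketch that resolves upstream sources and performs impact analysis over a hand-declared lineage map. Real column-aware tools derive this mapping by parsing SQL; the manual declaration here is an illustrative assumption.

```python
# A minimal sketch of column-level lineage: map each modeled column to
# its direct upstream sources, then resolve transitively. The example
# column names are hypothetical.
lineage: dict[str, set[str]] = {
    "orders.total_usd": {"raw_orders.amount", "raw_fx.usd_rate"},
    "orders.customer_id": {"raw_orders.customer_id"},
}

def upstream_columns(column: str) -> set[str]:
    """Recursively resolve every source column feeding a given column."""
    sources = lineage.get(column, set())
    resolved = set(sources)
    for src in sources:
        resolved |= upstream_columns(src)
    return resolved

def impacted_columns(changed: str) -> set[str]:
    """Impact analysis: which modeled columns depend on a changed column?"""
    return {col for col in lineage if changed in upstream_columns(col)}

# e.g. impacted_columns("raw_orders.amount") -> {"orders.total_usd"}
```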

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


20 Aug 2023 | Harnessing Generative AI For Creating Educational Content With Illumidesk | 00:54:52

Summary

Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data driven experience for learners.

Announcements

  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack
  • This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs into your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold
  • You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize today to get 2 weeks free!
  • Your host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI powered online learning platform

Interview

  • Introduction
  • How did you get involved in the area of data management?
  • Can you describe what Illumidesk is and the story behind it?
  • What are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences?
  • How are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware?
  • What are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors?
  • What are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports?
  • What are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content?
  • Can you describe how you have architected the Illumidesk platform?
  • How have the design and goals shifted since you first began working on it?
  • What are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities?
  • What are the failure modes of the content generation that you need to account for? (A minimal validation sketch follows this list.)
  • What are the most interesting, innovative, or unexpected ways that you have seen Illumidesk used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Illumidesk?
  • When is Illumidesk the wrong choice?
  • What do you have planned for the future of Illumidesk?
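To illustrate the kind of failure-mode handling mentioned above, here is a minimal sketch of validating generated lesson content before it is published. The checks and thresholds are illustrative assumptions, not Illumidesk's pipeline.

```python
# A minimal sketch of guarding against common failure modes of generated
# course content (empty output, truncation, missing required topics).
# The thresholds and check names are illustrative assumptions.
def validate_generated_lesson(text: str, required_terms: list[str]) -> list[str]:
    """Return a list of failure reasons; an empty list means the draft passes."""
    failures = []
    if not text.strip():
        failures.append("empty_output")
    elif len(text) < 200:  # suspiciously short for a lesson draft
        failures.append("likely_truncated")
    missing = [t for t in required_terms if t.lower() not in text.lower()]
    if missing:
        failures.append(f"missing_required_terms: {missing}")
    return failures
```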

Contact Info

Parting Question

  • From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
  • To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers

Links

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


Improve your understanding of Data Engineering Podcast with My Podcast Data

At My Podcast Data, we strive to provide in-depth, data-driven analyses. Whether you are a devoted listener, a podcast creator, or an advertiser, the detailed statistics and analyses we offer can help you better understand the performance and trends of Data Engineering Podcast. From episode frequency to shared links to RSS feed health, our goal is to give you the knowledge you need to stay up to date. Explore more shows and discover the data that drives the podcast industry.
© My Podcast Data