Medidata’s journey to a modern lakehouse architecture on AWS


This post was co-authored by Mike Araujo, Principal Engineer at Medidata Solutions.

The life sciences industry is transitioning from fragmented, standalone tools toward integrated, platform-based solutions. Medidata, a Dassault Systèmes company, is building a next-generation data platform that addresses the complex challenges of modern clinical research. In this post, we show you how Medidata created a unified, scalable, real-time data platform that serves thousands of clinical trials worldwide with AWS services, Apache Iceberg, and a modern lakehouse architecture.

Challenges with the legacy architecture

As the Medidata clinical data repository expanded, the team recognized the shortcomings of the legacy data solution in delivering quality data products to their customers across a growing portfolio of data offerings. Several core data tenets began to erode. The following diagram shows Medidata’s legacy extract, transform, and load (ETL) architecture.

Built on a series of scheduled batch jobs, the legacy system proved ill-equipped to provide a unified view of the data across the entire ecosystem. Batch jobs ran at different intervals, often requiring a generous scheduling buffer to make sure upstream jobs completed within the expected window. As data volume expanded, the jobs and their schedules continued to inflate, introducing a latency window between ingestion and processing for dependent consumers. Different consumers operating from various underlying data services further magnified the problem, because pipelines had to be repeatedly rebuilt across a variety of data source stacks.
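To see why chained batch schedules inflate latency, consider a minimal sketch (all job names, intervals, and buffer sizes below are hypothetical, not Medidata's actual schedules): each downstream job must wait out its upstream job's full run interval plus a safety buffer, so worst-case data freshness grows with every stage added to the chain.

```python
# Hypothetical illustration of chained batch-job latency.
# Each stage is (run_interval_hours, scheduling_buffer_hours); a record
# arriving just after a run starts waits the full interval plus the buffer
# at every stage before it reaches the final consumer.

def end_to_end_latency_hours(stages):
    """Worst-case hours from ingestion to availability for chained batch jobs."""
    return sum(interval + buffer for interval, buffer in stages)

# Three chained daily jobs (ingest, conform, publish), each padded
# with a 2-hour scheduling buffer.
pipeline = [(24, 2), (24, 2), (24, 2)]
print(end_to_end_latency_hours(pipeline))  # 78
```

Adding a fourth daily stage would push the worst case past 100 hours, which is the inflation pattern the paragraph above describes.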

The expanding portfolio of pipelines began to overwhelm existing maintenance operations. With more operations, the opportunity for failure expanded and recovery efforts grew more complicated. Existing observability systems were inundated with operational data, and identifying the root cause of data quality issues became a multi-day endeavor. Increases in data volume required scaling considerations across the entire data estate.

Furthermore, the proliferation of data pipelines and copies of the data across different technologies and storage systems necessitated expanding access controls with enhanced security features to make sure users could access only the subset of data they were permitted to see. Making sure access control changes were correctly propagated across all systems added a further layer of complexity for consumers and producers.

Solution overview

With the advent of Clinical Data Studio (Medidata’s unified data management and analytics solution for clinical trials) and Data Connect (Medidata’s data solution for acquiring, transforming, and exchanging electronic health record (EHR) data across healthcare organizations), Medidata introduced a new world of data discovery, analysis, and integration to the life sciences industry, powered by open source technologies and hosted on AWS. The following diagram illustrates the solution architecture.

Fragmented batch ETL jobs were replaced by real-time Apache Flink streaming pipelines (Flink is an open source, distributed engine for stateful processing), running on Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service. The Flink jobs write to Apache Kafka running in Amazon Managed Streaming for Apache Kafka (Amazon MSK), a streaming data service that manages Kafka infrastructure and operations, before landing the data in Iceberg tables backed by the AWS Glue Data Catalog, a centralized metadata repository for data assets. From this collection of Iceberg tables, a central, single source of truth is now accessible to a variety of consumers without additional downstream processing, alleviating the need for custom pipelines built to meet the requirements of each downstream consumer. Through these fundamental architectural changes, the Medidata team solved the issues presented by the legacy solution.
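The stream-to-log-to-table flow described above can be sketched in a few lines of plain Python. This is a conceptual model only (the function names and event shapes are hypothetical); in the real architecture the transform runs in Flink, the log is a Kafka topic on Amazon MSK, and the landed table is Iceberg in the AWS Glue Data Catalog.

```python
# Conceptual in-memory model of the streaming pipeline: a stateless
# transform ("Flink" stage) writes to an append-only log ("Kafka" stage),
# and a sink lands records in a single shared table ("Iceberg" stage).

def transform(event):
    # Hypothetical normalization of a raw clinical event.
    return {"trial_id": event["trial_id"], "value": event["value"] * 2}

log = []    # append-only topic
table = []  # the landed, queryable table

for raw in [{"trial_id": "t1", "value": 1}, {"trial_id": "t2", "value": 3}]:
    log.append(transform(raw))   # producer publishes to the topic
while log:
    table.append(log.pop(0))     # sink lands records in the table

# Every consumer reads the same table; no per-consumer pipeline is needed.
print([row["value"] for row in table])  # [2, 6]
```

The key property is that consumers at the end all read one table, which is what removes the per-consumer custom pipelines of the legacy design.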

Data availability and consistency

With the introduction of the Flink jobs and Iceberg tables, the team was able to deliver a consistent view of their data across the Medidata data experience. Pipeline latency was reduced from days to minutes, helping Medidata customers realize a 99% performance gain from the data ingestion to the data analytics layers. Because of Iceberg’s interoperability, Medidata users saw the same view of the data regardless of where they viewed it, minimizing the need for consumer-driven custom pipelines because Iceberg could plug into existing consumers.
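The days-to-minutes claim maps directly onto the roughly 99% figure. Using hypothetical numbers (the post does not give exact before/after values), a two-day batch delay replaced by a 25-minute streaming path works out as follows:

```python
# Hypothetical figures illustrating a days-to-minutes latency reduction.
before_min = 2 * 24 * 60   # 2-day batch delay = 2880 minutes
after_min = 25             # assumed streaming end-to-end latency

reduction = (before_min - after_min) / before_min
print(f"{reduction:.1%}")  # 99.1%
```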

Maintenance and durability

Iceberg’s interoperability provided a single copy of the data to satisfy their use cases, so the Medidata team could focus its observability and maintenance efforts on a five-times smaller set of operations than previously required. Observability was enhanced by tapping into the various metadata components and metrics exposed by Iceberg and the Data Catalog. Quality management transformed from cross-system traces and queries into a single evaluation of unified pipelines, with the added benefit of point-in-time data queries thanks to the Iceberg snapshot feature. Data volume increases are handled with out-of-the-box scaling supported by the entire infrastructure stack and by AWS Glue Iceberg optimization features, which include compaction, snapshot retention, and orphan file deletion. These provide a set-and-forget experience for solving a number of common Iceberg frustrations, such as the small file problem, orphan file retention, and query performance.
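The point-in-time queries mentioned above rely on Iceberg's snapshot model: every commit produces an immutable snapshot, and readers can scan the table as of any retained snapshot ID. The sketch below models that semantic in plain Python; it is a conceptual illustration, not the Iceberg API, and the class and method names are invented for this example.

```python
# Conceptual model of snapshot-based point-in-time reads (Iceberg-style
# semantics, not the Iceberg API): each commit creates an immutable
# snapshot, and scans can target the current or any earlier snapshot.

class SnapshotTable:
    def __init__(self):
        self.current_id = 0
        self.snapshots = {0: []}   # snapshot_id -> table state at commit

    def commit(self, rows):
        """Append rows and return the new snapshot ID."""
        new_state = self.snapshots[self.current_id] + rows
        self.current_id += 1
        self.snapshots[self.current_id] = new_state
        return self.current_id

    def scan(self, snapshot_id=None):
        """Read the table as of a snapshot (default: current)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        return self.snapshots[sid]

t = SnapshotTable()
s1 = t.commit(["subject_a"])
s2 = t.commit(["subject_b"])
print(t.scan())    # ['subject_a', 'subject_b']  current view
print(t.scan(s1))  # ['subject_a']               point-in-time view
```

In real Iceberg deployments, snapshot retention (one of the AWS Glue optimization features named above) bounds how far back such point-in-time reads can reach.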

Security

With Iceberg at the center of its solution architecture, the Medidata team no longer had to spend time building custom access control layers with enhanced security features at each data integration point. Iceberg on AWS centralizes the authorization layer using familiar systems such as AWS Identity and Access Management (IAM), providing a single, durable control for data access. The data also remains entirely within the Medidata virtual private cloud (VPC), further reducing the risk of unintended disclosures.

Conclusion

In this post, we demonstrated how a legacy universe of consumer-driven custom ETL pipelines can be replaced with a scalable, high-performance streaming lakehouse. By placing Iceberg on AWS at the center of your data operations, you can provide a single source of truth for your consumers.

To be taught extra about Iceberg on AWS, discuss with Optimizing Iceberg tables and Utilizing Apache Iceberg on AWS.


About the authors

Mike Araujo

Mike is a Principal Engineer at Medidata Solutions, working on building a next-generation data and AI platform for clinical data and trials. By using the power of open source technologies such as Apache Kafka, Apache Flink, and Apache Iceberg, Mike and his team have enabled the delivery of billions of clinical events and data transformations in near real time to downstream consumers, applications, and AI agents. His core skills focus on architecting and building big data and ETL solutions at scale, as well as their integration into agentic workflows.

Sandeep Adwankar

Sandeep is a Senior Product Manager at AWS who has driven feature launches across Amazon SageMaker, AWS Glue, and AWS Lake Formation. He has led initiatives in Amazon S3 Tables analytics, Iceberg compaction strategies, and AWS Glue Iceberg optimizations. His recent work focuses on generative AI and autonomous systems, including the AWS Glue Data Catalog Model Context Protocol and Amazon Bedrock structured knowledge bases. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that accelerate their business outcomes.

Ian Beatty

Ian is a Technical Account Manager at AWS, where he focuses on supporting independent software vendor (ISV) customers in the healthcare and life sciences (HCLS) and financial services industry (FSI) sectors. Based in the Rochester, NY area, Ian helps ISV customers navigate their cloud journey by maintaining resilient and optimized workloads on AWS. With over a decade of experience building on AWS since 2014, he brings deep technical expertise from his previous roles as an AWS Architect and DevSecOps team lead for SaaS ISVs before joining AWS more than 3 years ago.

Ashley Chen

Ashley is a Solutions Architect at AWS based in Washington, D.C. She helps independent software vendor (ISV) customers in the healthcare and life sciences industries, specializing in customer enablement, generative AI applications, and container workloads.