
    Dataframely: A polars-native data frame validation library

    By TechAiVerse | April 30, 2025

    At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this has often involved migrating data pipelines from pandas to polars in order to achieve significant performance gains.

    At the end of 2023, we started an effort to modernize a massive legacy codebase in one of our longest-running projects. While doing that, we realized that our existing data frame processing code had an integral flaw: column names, data types, value ranges, and other invariants — none of it was obvious just from reading the code.

    As a result, the typical approach for understanding a function’s behavior involved executing it on client infrastructure — the only place the actual data is available. Then, we would manually step through each pandas transformation to inspect the data before and after every change. Naturally, this is tedious, error-prone, and far from efficient.

    Once we’d rewritten a chain of transformations in polars, the absence of static type checking or runtime validation on data frame contents meant that bugs were hard to catch. To ensure correctness, we often had to run our entire pipeline end-to-end on large datasets – which required significant time and compute resources.

    Eventually, we realized that we needed a better way to describe, validate and reason about the content of the data frames in our data pipeline. We wanted to make invariants obvious while reading the code and actually enforce these invariants at runtime to ensure correctness.

    Data frame validation to the rescue

    A natural solution to this problem is a data frame validation library. Back in 2023, Python libraries already existed that allowed defining data frame schemas and verifying that data frames comply with these schemas, i.e., fulfill predefined expectations.

    In some projects, we had already been using pandera, a widely known open-source library, to validate pandas data frames. Unfortunately, back in 2023, pandera did not have any polars support and a notable polars-native alternative, namely patito, was still in its infancy and could not be considered production-ready.

    However, even today, we still encounter several limitations with pandera and patito for our use case. We concluded that these limitations are inherent to the libraries' scope and design and cannot easily be addressed by contributing to these projects – which we still actively do regardless (e.g., we maintain the conda-forge feedstock of pandera).

    Specifically, pandera and patito are missing support for

    • validation of interdependent data frames
    • soft validation including introspection of failures
    • test data generation from schemas
    • strict static type checking for data frames

    Introducing dataframely: A polars-native data frame validation library

    To remedy the shortcomings of these libraries, we developed dataframely. dataframely is a declarative data frame validation library with first-class support for polars data frames. Its purpose is to make data pipelines written in polars (1) more robust by ensuring that data meets expectations and (2) more readable by adding schema information to data frame type hints.

    Talk is cheap, so let's have a look at some code examples.

    Defining schemas

    To get started with dataframely, you first define a schema. At QuantCo, we are often dealing with insurance claims — for instance, we might create a schema for a data frame containing hospital invoices:

    from decimal import Decimal
    import dataframely as dy
    import polars as pl
    
    class InvoiceSchema(dy.Schema):
        invoice_id = dy.String(primary_key=True)
        admission_date = dy.Date(nullable=False)
        discharge_date = dy.Date(nullable=False)
        amount = dy.Decimal(nullable=False, min_exclusive=Decimal(0))
    
        @dy.rule()
        def discharge_after_admission() -> pl.Expr:
            # Cross-column rule: a patient cannot be discharged before admission.
            return pl.col("discharge_date") >= pl.col("admission_date")
    

    While we can describe the data frame in terms of its columns and their data types, we can also encode expectations on the column level as well as across columns. For example, we can designate one (or multiple) column(s) as primary key or define a custom validation rule that acts across columns.

    Validating a data frame

    Once we’ve defined a schema, we can pass a pl.DataFrame or pl.LazyFrame into its validate classmethod to validate that the contents match the schema definition. If we want to automatically coerce the column types to the types specified in the schema, we can pass cast=True.

    from datetime import date
    
    invoices = pl.DataFrame({
        "invoice_id": ["001", "002", "003"],
        "admission_date": [date(2025, 1, 1), date(2025, 1, 5), date(2025, 1, 1)],
        "discharge_date": [date(2025, 1, 4), date(2025, 1, 7), date(2025, 1, 1)],
        "amount": [1000.0, 200.0, 400.0]
    })
    
    validated: dy.DataFrame[InvoiceSchema] = InvoiceSchema.validate(invoices, cast=True)
    

    If any row in invoices is invalid, i.e., any rule defined on individual columns or the entire schema evaluates to False, a validation exception is raised. Otherwise, if all rows in invoices are valid, validate returns a validated data frame of type dy.DataFrame[InvoiceSchema].
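
    As a minimal sketch, consider an invoice frame that violates the discharge_after_admission rule; here we catch Exception broadly rather than naming dataframely's concrete validation exception class:

    # Sketch: this frame violates the cross-column rule because the discharge
    # date precedes the admission date, so validate raises.
    bad_invoices = pl.DataFrame({
        "invoice_id": ["004"],
        "admission_date": [date(2025, 2, 10)],
        "discharge_date": [date(2025, 2, 1)],
        "amount": [300.0],
    })
    
    try:
        InvoiceSchema.validate(bad_invoices, cast=True)
    except Exception as exc:  # broad catch; see the docs for the concrete exception type
        print(f"Validation failed: {exc}")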

    Importantly, dy.DataFrame[InvoiceSchema] is a pure typing construct and one still deals with a pl.DataFrame at runtime. This has the benefit that dataframely can be adopted in a gradual fashion where any dy.DataFrame[...] can easily be passed to a method that accepts a pl.DataFrame (and vice versa by using type: ignore comments).

    However, the biggest benefit is that the generic data frame type immediately tells the reader of the code what data they can expect to find in the data frame. This markedly improves the usefulness of mypy for data frame-based code: the type checker can now ensure that a data frame passed to a method fulfills certain preconditions with respect to its contents – without incurring a hidden performance hit at runtime.
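
    As a small illustration (the helper below is a hypothetical example, not part of dataframely), annotating a function with the schema-parametrized type documents the contract for readers and for mypy, while the body operates on an ordinary polars data frame:

    # Hypothetical helper for illustration: the annotation tells readers and mypy
    # that the input has already been validated against InvoiceSchema.
    def total_invoice_amount(invoices: dy.DataFrame[InvoiceSchema]) -> float:
        # At runtime this is a plain pl.DataFrame, so ordinary polars code applies.
        return float(invoices.get_column("amount").cast(pl.Float64).sum())
    
    total_invoice_amount(validated)        # accepted by mypy
    # total_invoice_amount(pl.DataFrame()) # rejected by mypy without a type: ignore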

    Validating groups of data frames

    Oftentimes, data frames (or rather “tables”) are interdependent and proper data validation requires consideration of multiple tables that share a common primary key. dataframely enables users to define “collections” for groups of data frames with validation rules on the collection level. To create a collection, we first introduce a second schema for diagnosis data frames:

    class DiagnosisSchema(dy.Schema):
        invoice_id = dy.String(primary_key=True)
        diagnosis_code = dy.String(primary_key=True, regex=r"[A-Z][0-9]{2,4}")
        is_main = dy.Bool(nullable=False)
    
        @dy.rule(group_by=["invoice_id"])
        def exactly_one_main_diagnosis() -> pl.Expr:
            return pl.col("is_main").sum() == 1
    

    We can then create a collection that bundles invoices and the diagnoses that belong to these invoices:

    # Introduce a collection for groups of schema-validated data frames
    class HospitalClaims(dy.Collection):
        invoices: dy.LazyFrame[InvoiceSchema]
        diagnoses: dy.LazyFrame[DiagnosisSchema]
    
        @dy.filter()
        def at_least_one_diagnosis_per_invoice(self) -> pl.LazyFrame:
            return self.invoices.join(
                self.diagnoses.select(pl.col("invoice_id").unique()),
                on="invoice_id",
                how="inner",
            )
    

    Notice how we can further define our expectations on the collection contents by adding validation across members of the collection using the @dy.filter decorator.

    If we call validate on the collection, a validation exception will be raised if any of the input data frames does not satisfy its schema definition or the filters on the collection result in the removal of at least one row across any of the input data frames.
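
    As a rough sketch of our own (the exact calling convention is described in the dataframely documentation; we assume here that the member frames are passed as a dictionary keyed by member name), validating the collection could look roughly like this:

    # Diagnoses that satisfy DiagnosisSchema and reference the invoices defined
    # above: every invoice has at least one diagnosis and exactly one main one.
    diagnoses = pl.DataFrame({
        "invoice_id": ["001", "001", "002", "003"],
        "diagnosis_code": ["A123", "B456", "C789", "D012"],
        "is_main": [True, False, True, True],
    })
    
    # Assumed calling convention: members passed by name, with cast=True as above.
    claims = HospitalClaims.validate(
        {"invoices": invoices, "diagnoses": diagnoses}, cast=True
    )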

    Soft-validation and validation failure introspection

    While calling validate is useful to ensure correctness, in production pipelines, we typically do not want to raise an exception at runtime. To this end, dataframely provides the filter method to perform “soft-validation” of schemas and collections. filter returns the rows that pass validation and an additional FailureInfo object to inspect invalid rows:

    good, failure = InvoiceSchema.filter(invoices, cast=True)
    
    # Inspect the reasons for the failed rows
    failure.counts()
    
    # Inspect the co-occurrences of validation failures
    failure.cooccurrence_counts()
    

    Since filter does not raise an exception, we can safely use it in our production code and log the invalid rows to inspect them later.
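
    In a production pipeline, that pattern might look roughly like the following sketch (our own example; process_invoices stands in for a hypothetical downstream step):

    import logging
    
    logger = logging.getLogger(__name__)
    
    good, failure = InvoiceSchema.filter(invoices, cast=True)
    # Record which rules failed and how often (see failure.counts() above),
    # then continue the pipeline with the valid rows only.
    logger.info("Invalid invoice rows per rule: %s", failure.counts())
    process_invoices(good)  # hypothetical downstream step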

    Additional features

    Throughout our journey with dataframely, we realized that defining schemas, and thus encoding expectations on data frame contents, has various benefits beyond running validation. For example, we can automatically derive the SQL schema of a table if we want to write a data frame to a database. Another possibility is to automatically generate sample data for unit testing that adheres to the schema, thus letting test authors focus on test content rather than verbosely creating data frames. To learn about all the possibilities, check out the API documentation.
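
    As a taste of the test-data angle, a unit test could be seeded roughly as follows (a sketch of our own; we assume a sample-style method on the schema, so refer to the API documentation for the exact name and arguments):

    # Sketch only (method name and arguments assumed): generate rows that satisfy
    # all column types and rules of InvoiceSchema for use as unit-test input.
    test_invoices = InvoiceSchema.sample(num_rows=100)
    
    # Amounts are strictly positive by schema, so this property must hold.
    assert float(test_invoices.get_column("amount").cast(pl.Float64).sum()) > 0.0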

    Experiences in practice

    Understanding the structure and content of data frames is crucial when working with tabular data — a core requirement for the highly robust data pipelines we aim to build at QuantCo. dataframely has already brought us closer to that goal: today, we are successfully using dataframely in the day-to-day work of multiple teams across several clients, for both analytical and production pipelines.

    Our data scientists and engineers love dataframely because

    • It improves the legibility, comprehensibility, and robustness of pipeline code
    • It increases code quality and confidence in code correctness through statically typed APIs and contracts
    • It enables code generation from data frame schema definitions (e.g., for SQL operations)
    • It allows introspecting pipeline failures more easily
    • It facilitates unit testing data pipelines through sample test data generation

    We’re excited to open-source dataframely and share it with the data engineering community. If you’re working with complex data pipelines and looking to improve reliability, productivity, and peace of mind, we think you’ll love it too.

    Check out dataframely on GitHub and let us know what you think!
