Pathways and Challenges Toward an Open Data Ecosystem

Source: Irving Wladawsky-Berger

“Our data is everywhere and powering everything,” noted “Pathways to Open Data,” a report by Linux Foundation Research published in March of 2025. “From marketing, to healthcare, to government services, to the emerging phenomenon of programming AI agents, organizations leverage data to be as efficient and effective as possible. However, data is often siloed within entities and any third-party data access requires overcoming significant technical, legal, economic, operational, and cultural obstacles that are multifactorial and at times may seem intractable. The increasing reliance on data calls for an assessment of these obstacles and how organizations can shift toward greater openness and sharing.”

The report defines open data as “data infrastructure that has the technical and legal requirements in place to make the data freely accessible for universal use, reuse, and redistribution.” Open data has its roots in open science and open knowledge in general, where non-personal and non-commercial data is freely published for the purpose of greater innovation, transparency, and collaboration. This culture of openness is strongest in public institutions, where the data collected is considered a public good without profit-generating opportunities, and where transparency of government and public-sector information is encouraged.

Many governments have developed open data initiatives to enable access to public information, such as the Public Sector Information Directive, adopted by the European Union in 2003, and Data.Gov, a portal launched by the Obama administration in May of 2009 with the goal of increasing transparency and accountability through open data. Initially launched with 47 data sets, Data.Gov now includes over 300,000 open data sets.

Beyond the public sector, the notion of open data becomes much more complex due to commercial and privacy considerations. In some sectors, like healthcare research, open data is a common good that will benefit everyone in society. However, if the data includes personally identifiable information, abiding by privacy regulations and data protection measures is paramount.

Moreover, in our emerging data economy, “data generation and collection has become a key component of the profit model, causing large corporations to build walled gardens around their data and controlling the flow of information,” the report adds. “The walled garden concept is felt across industries and sectors.”

To gather the necessary input, the report’s authors, Anna Hermansen and Paul Moritz Wiegmann, held a discussion session on Pathways to an Open Data Ecosystem at the 2024 World Open Innovation Conference, which took place in November at UC Berkeley. Session participants were asked a number of questions:

  • What are some challenges you face in access to and use of data?

  • How does your organization or project rely on data to innovate?

  • How does your organization make its data open to internal and external access?

  • How do you access relevant third-party data?

  • How have you incorporated technology to address your data needs?

  • Beyond technology, what policy or cultural changes have been implemented in your organization?

  • Based on your experience, how can we make data more open?

Participants included academics and practitioners from a variety of industry sectors. They shared relevant insights about barriers and opportunities for open data based on their research expertise and their practical experience. Let me summarize some of the report’s key findings.

Why does open data matter?

Access to data has long been a crucial part of business intelligence and innovation, especially so in our data-centric AI age. Data is needed to train AI models so they can make meaningful predictions. Data is not merely fuel for AI, but a determining factor in the overall system quality, and a way to help build AI systems capable of dealing with complex real-world problems.

The essence of open source is collaborative innovation, that is, working with people all over the world as a community to address important problems. This has long been the case with research and open source software (OSS) communities. Open protocols have led to the huge success of the Internet and World Wide Web, which are among the major enablers of collaborative innovation the world has ever known.

Similarly, sharing data, even among competitors, can help an organization innovate faster by developing a variety of products leveraging the open data available to all. Session participants mentioned that access to data is empowering. “Being able to drive certain outcomes using public and internal data makes the case for data openness and accessibility.” Triangulating third-party data with internal proprietary data “is essential to train large AI models, validate research, or discover market opportunities.” In addition, “Beyond the analytical value of a shared dataset, this activity also increases trust.” As one participant said, there’s an implication of trust among a group that’s openly sharing data that is rather compelling.

Unique challenges of open data compared to open source software

The Linux Foundation (LF), initially named the Open Source Development Labs (OSDL), was founded in 2000 by a small number of companies to support the continued development of Linux. Since then, open source software has been widely accepted around the world, as evidenced by the impressive growth and current scope of the LF.

“When considering the challenges faced by open data, it is important to consider the characteristics of data that make it unique as compared to other content, such as software,” said the LF Research report. The report referenced a recent blog post by Marc Prioleau, “The Unique Challenges of Open Data Projects,” in which he wrote:

“While open data projects can leverage the decades of experience inside the Linux Foundation in building open source software communities, the emerging field of open data presents novel considerations that deserve careful attention.”

His blog lists six characteristics that make open data different from open software:

  • The proprietary origins of data: Unlike open source software, where contributions often start as open from inception, open data usually begins with proprietary sources.

  • The patchwork of data licenses to navigate: While open source software projects typically operate under a single license, open data projects must navigate a patchwork of licenses.

  • The scale and cost of collecting, hosting, and maintaining data: Data is massive. Storage and compute costs can require a substantial infrastructure that may run into millions of dollars annually.

  • The workflows required for the ongoing production of data: Data requires a continuous production approach instead of incremental development, including regular updates to reflect real-world changes.

  • Assuring accuracy and quality of data: Data describes something real; unlike code, where competing versions can coexist, open data projects must resolve conflicting data to avoid disseminating inaccuracies.

  • Protecting personally identifiable information: Contributed data may contain personal identifying information that could pose significant ethical and legal risks.

Current challenges of open data

Data is rapidly becoming the backbone of modern technologies — powering everything from AI models to digital infrastructure. Consequently, the need for well-managed, accessible, and high-quality open data has never been more critical. Session participants brought up a number of current challenges to open data, including:

  • The cost/quality tradeoff: Is paid private data of higher quality than open data because it has been carefully curated, whereas open data may rely on volunteer contributions that are not always reliable or up to date?

  • The labor intensity of creating open data sets: Data curation is expensive due to the labor required to manage the data, from collection to maintenance to quality control.

  • Standardization: If data is not standardized, its reliability and usefulness suffer. Standardization is more complex for some industries than others. In engineering, data tends to be relatively concrete and easy to interpret. But in other industry sectors, such as healthcare, data is much harder to interpret, making data sharing and interoperability even more complex.

  • Data privacy: The potential to expose sensitive personal or business data made some participants hesitant to use open data as well as to open source their own data. Regulatory compliance is another serious concern.

  • Control over data: “Beyond the privacy and quality concerns that keep data closed, participants also expressed the possibility of losing a competitive advantage by sharing data.”

The future of open data: next steps

The report’s overriding conclusion is that “the entire walled garden approach needs to be dismantled with new governance mechanisms, decentralization, collaboration, and open source. Analysis of the discussion revealed three important themes to help reshape the data sharing landscape and shift the ecosystem toward greater openness.”

• Building open data infrastructure requires a reworking of current data collection & sharing processes. We currently have an asymmetric control dynamic, where large amounts of user data are gathered by large companies as a major part of their business model, without users giving permission or receiving feedback on how their data is used. This asymmetric power dynamic could be addressed by reconfiguring usage rights in order to increase visibility and transparency on the use of data, reduce fears around data openness and privacy, and “create an environment where sharing becomes more important than protecting data.” In addition, data ownership should be managed through licenses “such as the Community Data License Agreement (CDLA) which provides the legal framework to share data.”

• Open data requires supporting a cultural shift toward greater openness, and incentivizing collaboration around the pre-competitive data layers that are particularly useful for collaborating with others in an industry. “Building a value proposition to contribute data to a collaborative dataset is key for this to happen,” and

• Building open datasets requires the right kind of governance structure that balances a culture of collaboration and neutrality while still managing for checks and balances. Data governance must become a top priority for open source AI projects, where data workflows are managed responsibly with attention to quality and compliance. “Considering new forms of governance that support incentives to share data and bring the individual back into the process has the potential to transform the data landscape and encourage greater publishing of data for the benefit of all.”

The Pathways to Open Data discussion “revealed insights into how academics and practitioners are considering the tradeoffs between open and closed data and identified some realistic concerns and expectations for open databases,” said the LF Research report in conclusion. “Through the analysis and reporting of this session, we hope to shed light on the importance of open data and encourage those working with data to consider the ways that they can better collaborate on datasets, incentivize sharing, and reshape the culture of their organization to support greater openness. As policies and cultures shift with new technologies, new governments, and new economic concerns, it is crucial to establish an orientation of openness no matter the headwinds.”


Irving Wladawsky-Berger

Irving Wladawsky-Berger, PhD, is a Research Affiliate at MIT's Sloan School of Management and at Cybersecurity at MIT Sloan (CAMS), and a Fellow of the Initiative on the Digital Economy, of MIT Connection Science, and of the Stanford Digital Economy Lab.