Icebreaker One (IB1) aims to make trusted data more available to organisations and people by focusing on governance first. This is a good approach and a laudable goal, and I wish it was already in place!
As part of this, IB1 makes the case that connecting to data rather than collecting data is a better approach for working with disparate data sets. Part of the argument stems from the scale of the data being generated today: it is easier to leave the data with the organisation that generates it, hosts it and updates it, than to continuously take copies, diffs and updates of those datasets. Their approach is to "harmonise the rules, standards and governance". I can see that the work of ib1.org, and related projects like schema.org, can yield significant benefits, but until they are available for use at scale WikiSim is taking the inversion of responsibility approach used by sites like archive.org and perma.cc. Here the source website does not need to opt in, so if you want a permanent snapshot of a public webpage, data set or resource, you can take one without the source website owner needing to do anything. For now this is the only model for reliably preserving, sharing and working with trusted data.
3 main challenges to reach trusted data at scale
There are 3 main challenges that I see with the connect not collect model at present, 2 based around trust and 1 around performance:
Trust that the source data will remain available when you need it. If you are relying on a third party to host and serve data, what happens if they update their site and break their links, or go out of business, change their terms of service or simply decide to take the data down? The risk of link rot is significant.
Trust that the source data is unchanged. If you are connecting to a third party's data, how do you know that it has not been tampered with? This is particularly important for sensitive data, such as the data influencing policy and political decision making that WikiSim aims to support. A simple hash lets you check that the data you are accessing is exactly the same as when it was first recorded, but it does not prevent the source from changing the data later on, nor does it provide any insight into why the data changed or to what extent: was it a simple typo being corrected or a crucial figure being changed?
For both of these trust challenges, one potential solution is a kind of URL that guarantees that the resource it points to will always be available and immutable, byte for byte, errors and all. This is something that DOI.org gets close to and that both archive.org and perma.cc provide.
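As a minimal sketch of the hash check mentioned above, the snippet below records a SHA-256 digest when a snapshot is taken and compares against it later. The function names and the sample bytes are made up for illustration and are not WikiSim's actual implementation.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the raw bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """True only if the bytes match the digest recorded at snapshot time."""
    return sha256_hex(data) == recorded_digest


# Placeholder snapshot contents, invented for this example.
snapshot = b"month,value\n2020-01,1.00\n"
recorded = sha256_hex(snapshot)

# A single changed digit produces a different digest.
tampered = b"month,value\n2020-01,1.01\n"

print(is_unchanged(snapshot, recorded))
print(is_unchanged(tampered, recorded))
```

The check only says that something changed. It cannot say what changed or why, which is the limitation described above.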
WikiSim.org's current approach is to collect the data, and provide links to each permanent version of the data. If it was just a case of having the raw CSV or HTML then archive.org and perma.cc would be sufficient, but the other aims of WikiSim are to provide cleaned, efficient data to work with in the browser, and to do so even when the data are not available as URLs.
Example 1: this data set on the monthly foreign exchange rate of GBP to USD has been produced from an original CSV over 10 times the size. The original data also lacks a permanent URL and is instead accessed via JavaScript on the source page.
Example 2: another data set on WikiSim, of minimum wages, has been cleaned to reduce its size from 1 MB to 30 KB by removing redundant fields, and has a second wrapper page that makes it trivial to use this data in other WikiSim page calculations.
These two examples show that even if the source website goes down, or changes their data, the version stored on WikiSim.org remains available and unchanged. This is vital for both trust in the data and reliability for any other WikiSim pages that depend on that data set.
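The kind of cleaning described in Example 2 (dropping redundant fields to shrink a data set) could look something like the sketch below. The column names and rows are hypothetical placeholders, not the real minimum wage data, and this is not necessarily how the WikiSim data sets were produced.

```python
import csv
import io


def keep_columns(raw_csv: str, wanted: list[str]) -> str:
    """Return a CSV containing only the wanted columns, in that order."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    out = io.StringIO()
    # extrasaction="ignore" silently drops any column not listed in `wanted`.
    writer = csv.DictWriter(
        out, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


# Placeholder input with two redundant columns (hypothetical names).
raw = (
    "country,year,wage,source_note,last_updated\n"
    "A,2020,1.0,placeholder,2021-01-01\n"
    "B,2020,2.0,placeholder,2021-01-01\n"
)

cleaned = keep_columns(raw, ["country", "year", "wage"])
print(cleaned)
print(len(raw), "->", len(cleaned), "characters")
```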
As an aside, guaranteeing that a resource is provided in good faith, and is accurate and trustworthy in the sense that the data has not been fabricated or manipulated, is a different challenge entirely, and one that Tony Ageh & dot-public are aiming to address via governance.
The final challenge is performance. When a user views a WikiSim page they are not viewing a static HTML page but a live recalculation of the data. If that recalculation requires other data sets then those will be downloaded by the user's browser whilst the page renders. If those also have dependencies, these grandparent pages will be downloaded too, and so on. In the future there may be many users across the site accessing a range of sources. This increases the probability that the experience will degrade (become slower) if a popular page accidentally results in a DDoS of the source website, and the experience for the user and the utility of WikiSim will break entirely if a source website is down. Caching can help with this, but then you are back to a collect not connect model.
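To illustrate the dependency chain described above, here is a small sketch of loading a page and everything it depends on, with a cache so that each data set is requested at most once. The dependency graph, the page ids and the `fetch` callback are all hypothetical, and the real WikiSim loader may work quite differently.

```python
from typing import Callable

# Hypothetical dependency graph: page id -> ids of the data sets it needs.
DEPENDENCIES = {
    "page": ["wages", "fx"],
    "wages": ["fx"],
    "fx": [],
}


def load_with_dependencies(
    page_id: str,
    fetch: Callable[[str], None],
    cache: set[str],
) -> None:
    """Fetch page_id and everything it depends on, each at most once."""
    if page_id in cache:
        return  # already downloaded, so no further request to the source
    cache.add(page_id)
    fetch(page_id)
    for dep in DEPENDENCIES[page_id]:
        load_with_dependencies(dep, fetch, cache)


# Record which requests would be made instead of touching the network.
requests_made: list[str] = []
load_with_dependencies("page", requests_made.append, set())
print(requests_made)
```

Without the cache, "fx" would be requested twice here, once for each page that depends on it. With it, the source is hit once, but a copy of the data is now being held locally.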
Alternatives - GitHub.com?
As mentioned, perma.cc and archive.org provide a convenient way to snapshot URLs. However, as highlighted above, some data is only available by interacting with a webpage via its JavaScript, and some data also benefits from cleaning and/or reducing its size. So for WikiSim.org the only other relatively convenient alternative is to host datasets on GitHub or similar services. This has the advantage that the data can be version controlled, and GitHub provides permanent URLs for each version of a file. GitHub has servers with good uptime and bandwidth, so the DDoS risk is also minimal. GitHub also allows data to be served to any website due to its permissive CORS headers, which is how the first version of Energy Explorer gets its solar, wind and electrical demand data. Whilst it is convenient for individual users to publish data to GitHub, it is not clear how this scales to a global wiki platform where you want to allow anyone with valuable data to contribute it permanently into the public commons, and anyone else to edit and refine it.
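As a rough illustration of a per-version URL, the sketch below builds a raw-content address pinned to a specific commit rather than to a branch. It assumes the commonly used raw.githubusercontent.com layout of owner, repository, commit and path. The owner, repository, commit hash and file path are placeholders, and the URL is only as permanent as the repository and commit it points to.

```python
def github_raw_url(owner: str, repo: str, commit_sha: str, path: str) -> str:
    """Build a raw-content URL pinned to one commit rather than a branch."""
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{commit_sha}/{path}"


# Placeholder values, invented for this example.
url = github_raw_url(
    owner="example-owner",
    repo="example-data",
    commit_sha="0123456789abcdef0123456789abcdef01234567",
    path="data/wages.csv",
)
print(url)
```

Pinning to a commit hash rather than a branch name is what makes the content behind the URL fixed: a branch can move on to newer versions of the file, whereas a commit cannot.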
Other aspects
There are other important aspects that IB1 has already explored in depth and that I have not yet had the resources to consider fully with regard to WikiSim's approach to data. These include discoverability of data, and also licensing of data use for different purposes.
Conclusion
WikiSim.org's current approach of collecting data rather than connecting to it will remain for now, until permanent, immutable and performant datasets are guaranteed through other means. However, preparing WikiSim.org's governance so that it could be compatible with initiatives like IB1.org, and provide part of the foundational layer for trusted open data at scale, is important to explore further.
I'm grateful to Stefan Magdalinski for letting me know about Icebreaker One.
Cross posted from ajamesphillips.com