Learn more about the technical details of the Smart Data Sandbox

Synthetic Data overview

[Image: provisional SME smart data model]
[Image: provisional consumer smart data model]

The above images show the provisional models of the SME and consumer datasets available in the sandbox. The synthetic data will cover 5,000 individuals and 100 small businesses over a one-year period. Entrants must use at least one of the synthetic datasets provided in the sandbox as part of their solution, but teams are also welcome to bring any data they have the right to use into their secured area of the sandbox.

Key points to note regarding the datasets:

  • The arrows and linking variables give an indication of how the datasets can be linked.
  • The data will be split between several synthetic organisations; for example, banking transaction data will be held by one of four synthetic banks.
  • To simplify linking data from different sectors back to a consumer, we have included unique identifier variables. These wouldn’t normally exist in real data, but will allow participants to focus on innovation and bringing their idea to life, rather than common data handling tasks like record matching (a simple linking sketch follows this list). Of course, the identifiers can be ignored in favour of more realistic matching, for example by name, date of birth or address.
  • To further simplify data manipulation and aggregation, the datasets from different sectors have been standardised to a degree.
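
As a simple illustration of how the unique identifier variables can be used, the sketch below joins a consumer record to banking transactions with pandas. The column names (consumer_id, date_of_birth, bank, amount) are assumptions made for illustration only and do not reflect the actual sandbox schema.

    # Illustrative only: field names are assumed, not the sandbox schema.
    import pandas as pd

    consumers = pd.DataFrame({
        "consumer_id": ["C001", "C002"],
        "date_of_birth": ["1985-03-12", "1990-07-30"],
    })

    transactions = pd.DataFrame({
        "consumer_id": ["C001", "C001", "C002"],
        "bank": ["Bank A", "Bank A", "Bank C"],
        "amount": [-42.50, -12.00, -99.99],
    })

    # Join the two sector datasets on the shared unique identifier.
    linked = transactions.merge(consumers, on="consumer_id", how="left")
    print(linked)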

Any queries about the datasets should be directed to [email protected]

Details of the APIs that extend the data portability of Open Banking to new sectors

Data access via APIs has been at the heart of enabling major innovations. Open Banking was an early adopter of this approach, but the opportunity for benefits to consumers extends much further. The Smart Data Sandbox will provide access to APIs and underlying synthetic data across a range of sectors, including finance, transport and retail, to mirror what a Smart Data ecosystem might look like and to inspire innovation. The APIs will enable participants to securely access the data in ways not previously possible as they develop their solutions as part of the Smart Data Challenge.
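
The exact API specification will be published within the sandbox; the sketch below is a hypothetical illustration of what a request for synthetic banking data might look like. The base URL, path, parameter names and token handling are all assumptions, not the actual interface.

    # Hypothetical illustration: endpoint, path and auth are assumptions,
    # not the sandbox's published API specification.
    import requests

    BASE_URL = "https://sandbox.example.com/api/v1"  # placeholder URL
    TOKEN = "YOUR_ACCESS_TOKEN"                      # issued per team (assumed)

    def get_transactions(consumer_id):
        """Fetch synthetic banking transactions for one consumer."""
        response = requests.get(
            f"{BASE_URL}/banking/transactions",
            params={"consumer_id": consumer_id},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        print(get_transactions("C001"))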

Details on the process of generating synthetic data

The synthetic data for the sandbox is generated using Smart Data Foundry’s synthetic data platform, Aizle. Aizle uses agent-based simulation to create the synthetic data, combining deep domain knowledge and industry expertise with input and guidance from the Smart Data Challenge Prize advisory group. Aizle generates coherent, linked and interoperable synthetic datasets without relying on or needing real training data. The framework of thousands of input parameters and simulated behaviours is constantly updated, and produces high-fidelity, high-utility outputs for building new products, exploring machine learning models, and solving data governance and access problems, accelerating the pace of innovation.
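
For readers unfamiliar with the technique, the toy sketch below shows the general idea behind agent-based simulation: agents with simple behavioural parameters emit events over simulated time, and those events become the synthetic dataset. It is a conceptual illustration only and does not represent Aizle’s actual models, parameters or scale.

    # Toy agent-based simulation: each consumer agent emits a few spending
    # events per simulated week. Conceptual sketch only, not Aizle.
    import random

    random.seed(42)  # reproducible output

    class Consumer:
        def __init__(self, consumer_id, weekly_budget):
            self.consumer_id = consumer_id
            self.weekly_budget = weekly_budget

        def simulate_week(self, week):
            """Return a handful of synthetic transactions for one week."""
            n = random.randint(1, 5)
            return [
                {
                    "consumer_id": self.consumer_id,
                    "week": week,
                    "amount": round(random.uniform(1, self.weekly_budget / n), 2),
                }
                for _ in range(n)
            ]

    agents = [Consumer(f"C{i:03d}", random.uniform(50, 300)) for i in range(3)]
    transactions = [tx for week in range(4) for a in agents for tx in a.simulate_week(week)]
    print(len(transactions), "synthetic transactions generated")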

Information about the Sandbox platform architecture

The Smart Data Sandbox platform brings together technical tools and synthetic data, enabling participants to innovate and develop their solutions. APIs for cross-sector data will be available in a Data Marketplace. Each participant will be allocated a private and secure team zone on the platform where they will be able to access technical and developer tools, use the data, and develop their solution. The secure spaces, with full data access, will be egress-restricted environments available on the platform, whereas the APIs will expose a proportion of the datasets, which can be used on the platform or locally. Each team zone will also include a showcase space where teams can share their solution and other materials with the Smart Data Challenge Prize team and judges.

Technical Skills

Innovators will need many skills and qualities over the course of the Challenge Prize, such as curiosity, problem solving, communication, networking and even resilience, but some technical capabilities will help a team make full use of the sandbox and develop a prototype. Desirable technical skills include:

Data Handling:

  • Data Wrangling & Preprocessing: Understanding and cleaning data, handling missing values, data transformations (see the sketch after this list).
  • Data Visualisation: Communicating insights and trends through charts, dashboards, and other visuals.
  • Statistical Analysis: Using techniques like hypothesis testing, regression, and clustering to extract insights.
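
As a small worked example of the data handling skills above, the sketch below fills a missing value, standardises a categorical column and produces a simple aggregation with pandas. The column names are invented for illustration and do not reflect the sandbox datasets.

    # Minimal wrangling sketch with invented column names.
    import pandas as pd

    df = pd.DataFrame({
        "consumer_id": ["C001", "C002", "C003"],
        "monthly_spend": [420.5, None, 310.0],  # one missing value
        "region": ["north", "NORTH", "south"],
    })

    # Handle the missing value and standardise the categorical text.
    df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())
    df["region"] = df["region"].str.lower()

    # Simple aggregation: average spend by region.
    print(df.groupby("region")["monthly_spend"].mean())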

Software Engineering & Development:

  • Programming Languages: Proficiency in languages such as Python, R or SQL for data manipulation and model development, or in languages and frameworks such as Java, JavaScript, .NET or PHP for developing web applications.
  • Cloud Platforms: Familiarity with cloud platforms for data storage, processing, and deployment.
  • API Integration: Connecting different data sources and applications through APIs for efficient data flow.

Find out how to access the sandbox