Conducting Analysis in SDC
As a data analyst, you work within the USDOT Secure Data Commons (SDC). Through the SDC, you can:
- Share code and data with other analysts
- Upload your own datasets
- Export approved derived analyses
We'll provide you with a cloud-based workstation that comes with preloaded programming environments and software and gives you access to the data lake and data warehouse. The workstation also includes commercially available tools, so no local software or tool installation is needed!
Analytical Tools and Query Languages Supported
The SDC platform gives experienced analysts on-demand access to popular programming and statistical tool packages for cloud-based processing. Other, nonstandard software can be installed upon request, either for individual users or across user groups. For software that requires a special license, analysts may provide their own existing licenses.
Types of Datasets
The SDC platform provides a data lake of transportation-related structured, semi-structured, and unstructured datasets that are stored in raw, curated, and published formats. Each dataset has its own data agreement, based on the complexity and sensitivity of the data, and data providers approve access to specific data. Learn more about the specific dataset formats below:
Raw datasets hold unaltered data stored in their native, original "as-is" format. Uploads can be continuous through streaming sources (e.g., APIs or sensors) or arrive as one-time uploads from external sources. These data can be structured (databases, logs, financial data), semi-structured (HTML, XML, RDF, CSV), or unstructured (images, PDFs, Word documents). Raw data cannot be copied or exported.
Curated datasets are the product of data curation: the process of integrating raw data collected from various sources, then annotating and presenting the data so that their value is maintained and they remain available for reuse and preservation. For researchers and data scientists, curated datasets enable data discovery and retrieval and maintain data quality. During the curation process, data are transformed from unstructured and semi-structured formats to structured formats, and data deduplication, obfuscation, and cleansing processes are conducted. The result is high-quality data from which researchers can draw meaningful insights.
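To make the cleansing, obfuscation, and deduplication steps concrete, here is a minimal Python sketch. It is an illustration only and does not describe the SDC's actual curation pipeline: the `curate` function, the field names (`device_id`, `speed_mph`), and the salt value are all hypothetical.

```python
import hashlib


def curate(records, id_field="device_id", salt="example-salt"):
    """Cleanse, obfuscate, and deduplicate a list of dict records.

    All names here are hypothetical; this is not the SDC's real pipeline.
    """
    seen = set()
    curated = []
    for rec in records:
        # Cleansing: trim whitespace from string values and
        # skip rows that have no identifier.
        clean = {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        if not clean.get(id_field):
            continue

        # Obfuscation: replace the raw identifier with a truncated,
        # salted SHA-256 digest so the original value is not exposed.
        raw_id = str(clean[id_field])
        digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
        clean[id_field] = digest[:16]

        # Deduplication: keep only the first occurrence of each identical record.
        key = tuple(sorted(clean.items()))
        if key in seen:
            continue
        seen.add(key)
        curated.append(clean)
    return curated


raw = [
    {"device_id": " A123 ", "speed_mph": "54"},  # stray whitespace
    {"device_id": "A123", "speed_mph": "54"},    # duplicate once cleansed
    {"device_id": "", "speed_mph": "40"},        # missing identifier
    {"device_id": "B456", "speed_mph": "61"},
]
result = curate(raw)
```

In this sketch the four raw rows reduce to two curated rows, each carrying a hashed identifier in place of the original one. A real curation workflow would follow the rules set out in the relevant data agreement.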
Researchers create published datasets to disclose their research and allow other users to verify and reuse the data beyond their original purpose. Published datasets are a result of combining analyses on curated datasets in the SDC platform with other datasets or algorithms owned or created by a researcher or data scientist.
As a data analyst planning to do analysis in the SDC, use the steps below to get started.