This site contains archive materials from the FNIH OMOP Pilot Program. For updated content visit http://omop.org
This web page presents the Observational Medical Dataset Simulator (OSIM) Version 2
(updated April 11, 2012)
The initial Observational Medical Dataset Simulator was released in 2009 and used to generate datasets with millions of hypothetical patients with drug exposures, background conditions, and known adverse events for the purpose of benchmarking method performance. OSIM has provided large-scale datasets to methodologists and facilitated the establishment of the OMOP Cup Competition. It also advanced the OMOP Research Team's insights into the complex interdependencies between clinical observations in real data, and how those relationships may influence a method's behavior in identifying true associations and distinguishing them from false positive findings.
Based on these insights, continued research has resulted in the development of a second-generation simulated dataset procedure, known as OSIM2. OSIM2 represents an alternative design to accommodate additional complexities observed in real-world data, including advanced modeling of the correlations between drugs and conditions. OSIM2 allows for more direct comparisons between simulated data and real observational databases, and should enable more thorough evaluation of methods by allowing assessment of how they accommodate these complex interrelationships. At OMOP, OSIM2 is used to benchmark the performance of methods that estimate the strength of association between drug treatment and outcome.
OSIM2 source code, documentation, and databases are available for download:
Download of OSIM2 Datasets
We have generated 16 OSIM2 datasets that are now available for download. Each is a 10-million-person dataset modeled after the Thomson Reuters MarketScan® Lab Database (MSLR). One dataset has no signals injected; the other 15 have signals of differing size and type, combining five relative risks (1.25, 1.5, 2, 4, 10) with three risk types: acute onset (labeled 'any exposure', with events occurring within 30d of exposure start), insidious, and accumulative. MSLR, covering 2003 to 2009, represents a privately insured population, with administrative claims from inpatient, outpatient, and pharmacy services supplemented by laboratory results.
The datasets listed below are freely available for download through OMOP’s anonymous FTP server. For example, you can download OSIM2_10M_MSLR_MEDDRA_6, which has a set of signals injected at RR=1.5 with insidious onset (events occurring during exposure or within 30d afterwards).
| OSIM2 Dataset | Relative Risk of Injected Signals | Risk Type | Size |
|---|---|---|---|
| OSIM2_10M_MSLR_MEDDRA_0 | None | None | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_3 | 1.25 | Insidious | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_6 | 1.5 | Insidious | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_9 | 2 | Insidious | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_12 | 4 | Insidious | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_15 | 10 | Insidious | 3.8GB |
| OSIM2_10M_MSLR_MEDDRA_2 | 1.25 | Any Exposure | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_5 | 1.5 | Any Exposure | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_8 | 2 | Any Exposure | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_11 | 4 | Any Exposure | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_14 | 10 | Any Exposure | 3.6GB |
| OSIM2_10M_MSLR_MEDDRA_1 | 1.25 | Accumulative | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_4 | 1.5 | Accumulative | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_7 | 2 | Accumulative | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_10 | 4 | Accumulative | 3.5GB |
| OSIM2_10M_MSLR_MEDDRA_13 | 10 | Accumulative | 3.7GB |
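The numeric suffix of each dataset name in the table appears to follow a regular pattern: suffixes 1 to 15 step through the three risk types within each relative risk. The Python sketch below decodes a dataset name on that assumption, and also expresses the two risk windows that this page defines (acute onset and insidious). The function names `decode_dataset` and `risk_window` are hypothetical, the suffix pattern is inferred from the table rather than documented, and treating both ends of each window as inclusive is an assumption. The accumulative window is not defined on this page, so the sketch does not model it.

```python
import re
from datetime import date, timedelta

# Relative risks and risk types in the order implied by the table above.
RELATIVE_RISKS = [1.25, 1.5, 2.0, 4.0, 10.0]
RISK_TYPES = ["Accumulative", "Any Exposure", "Insidious"]


def decode_dataset(name):
    """Return (relative_risk, risk_type) for an OSIM2 dataset name.

    Returns (None, None) for the dataset without injected signals (suffix 0).
    The suffix-to-signal mapping is inferred from the table, not from
    official documentation.
    """
    match = re.fullmatch(r"OSIM2_10M_MSLR_MEDDRA_(\d+)", name)
    if match is None:
        raise ValueError(f"not an OSIM2 MSLR MedDRA dataset name: {name!r}")
    index = int(match.group(1))
    if index == 0:
        return None, None
    if not 1 <= index <= 15:
        raise ValueError(f"dataset suffix out of range: {index}")
    relative_risk = RELATIVE_RISKS[(index - 1) // 3]
    risk_type = RISK_TYPES[(index - 1) % 3]
    return relative_risk, risk_type


def risk_window(risk_type, exposure_start, exposure_end):
    """Return (window_start, window_end) for an exposure period.

    Assumes inclusive window boundaries; only the two windows defined
    on this page are supported.
    """
    if risk_type == "Any Exposure":
        # Acute onset: events within 30 days of exposure start.
        return exposure_start, exposure_start + timedelta(days=30)
    if risk_type == "Insidious":
        # Insidious: events during exposure or within 30 days afterwards.
        return exposure_start, exposure_end + timedelta(days=30)
    raise ValueError(f"risk window not defined here for: {risk_type!r}")


if __name__ == "__main__":
    print(decode_dataset("OSIM2_10M_MSLR_MEDDRA_6"))
    print(risk_window("Insidious", date(2005, 1, 1), date(2005, 3, 1)))
```

Decoding `OSIM2_10M_MSLR_MEDDRA_6` this way yields a relative risk of 1.5 with insidious onset, matching the example given before the table.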
Please note that the dataset files are very large. We have tested the OSIM2 dataset downloads using FileZilla and WS-FTP. FileZilla is a free, open source client that can be downloaded from: http://filezilla-project.org/download.php
To log in to the anonymous FTP server use the following credentials:
Server: 23.21.159.38
Login: anonymous
Password: blank
Our FTP server supports the SFTP protocol (port 22).
On the server, there are two main folders:
● MedDRA: All data in this folder use MedDRA based condition concepts.
○ Transition Matrices. Currently there are transition matrices available for the following databases: GE, MDCD, MDCR, MSLR
○ OSIM2 datasets. All 16 OSIM2 datasets are available in individual directories. These folders contain simulated data in Common Data Model (CDM) Version 2 format; OSIM2 is not available in CDM V3.
● SNOMED: All data in this folder use SNOMED-CT based condition concepts.
○ Transition Matrices. Currently there are transition matrices available for the following databases: CCAE, MDCD, MDCR, MSLR
○ IN THE FUTURE: OSIM2 data will be available in SNOMED format.
Please contact OMOP to share your experience with the OSIM2 datasets.