ClinicalDataGenerator

The configuration must be provided as a JSON object. Each column of the generated dataframe is configured by using the column name as a key in the JSON configuration. E.g.:

{   "id": "INTEGER", "description": "STRING" }

will result in:

id	description
444637	Ku awofel Idolaz
648937	il Itoxenag ob Av fi t

The value must be either the name of the desired datatype, or a JSON object with the following content:

```
"repeat": 2
```
or
```
"repeat": {"Min": 1, "Max": 2}
```
Specifies the number of times this element is repeated, either as a fixed count or as a range. The values of the other columns are repeated alongside each repetition.
```
"Columns": {"ColumnA": "INTEGER", "ColumnB": "STRING"}
```
Sub-columns of a repeating element, used to generate repeating elements in LONG format.
```
"type": "INTEGER"
```
Datatype of the element to generate. If a distribution is present, the type is inferred from the distribution's datatype.
```
"missingProbability": 0.2
```
Probability that an element in this column will be null.
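
To make the interplay of `repeat` and `Columns` more concrete, here is a small Python sketch of the behaviour described above: the sub-columns produce one row per repetition, and the values of the other columns are copied into each of those rows. It illustrates the description only and is not the generator's actual code; `expand_long` and the sample values are hypothetical.

```python
import random

def expand_long(parent, sub_rows):
    """Return one output row per sub-row, repeating the parent values each time."""
    return [{**parent, **sub} for sub in sub_rows]

rng = random.Random(1)  # fixed seed so the sketch is reproducible

parent = {"id": 444637}       # value of a non-repeating column
repeat = rng.randint(1, 2)    # mirrors "repeat": {"Min": 1, "Max": 2}
sub_rows = [
    {"ColumnA": rng.randint(0, 99), "ColumnB": f"entry {i}"}
    for i in range(repeat)
]

rows = expand_long(parent, sub_rows)
for row in rows:
    print(row)
```

Every printed row carries the same `id`, which is what "LONG format" means here.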

The following distributions are possible:

"UniformDistribution": {"min": "2010-01-01", "max": "2011-12-31"}

Sample the values for this column from a UniformDistribution.

"GaussianDistribution": {"min": "2010-01-01", "max": "2011-12-31", "mean": "2011-06-01", "sigma": "P20D"}

Sample the values for this column from a GaussianDistribution.

```
"ValueDistribution": {"male": 0.52, "female": 0.45, "other": 0.03}
```
Generate the given values with the given probability.
Please note that numbers and booleans must also be quoted, because JSON object keys are always strings. E.g. you have to use
```
{"1": 0.5, "2": 0.25, "3": 0.25}
```
instead of
```
{1: 0.5, 2: 0.25, 3: 0.25}
```
```
"FakerDistribution": "name.fullName"
```
or
```
"FakerDistribution": {"expression": "name.fullName", "locale": "de"}
```
Use the Java Faker library to generate a value. Trigger auto-completion using Ctrl+Space to see available options!
```
"XegerDistribution": "[A-Za-z0-9]{2,5}"
```
Use Xeger to generate strings matching a specific regular expression.
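
The following Python sketch assembles a column configuration that uses several of the distributions listed above and serializes it to JSON. It assumes that a distribution key sits in the same column object as `missingProbability` (the text above suggests this but does not state it outright). The column names are hypothetical.

```python
import json

# Hypothetical column names; the distribution entries are copied from the list above.
columns = {
    "id": "INTEGER",
    "name": {"FakerDistribution": {"expression": "name.fullName", "locale": "de"}},
    "gender": {"ValueDistribution": {"male": 0.52, "female": 0.45, "other": 0.03}},
    "enrolled": {
        "UniformDistribution": {"min": "2010-01-01", "max": "2011-12-31"},
        "missingProbability": 0.2,  # assumption: may be combined with a distribution
    },
    "code": {"XegerDistribution": "[A-Za-z0-9]{2,5}"},
}

config_json = json.dumps(columns, indent=2)
print(config_json)
```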

This tool produces artificial study data for a given Operational Data Model (ODM) file. This synthetic study data can be used, e.g., for the evaluation of other ODM-based tools. If you have no ODM file, you can use a prepared test file on the server instead.

ODM Basics

Operational Data Model (ODM) is a standard widely used by Electronic Data Capture systems. It defines the metadata of a study as well as the collected study data. The overall structure of an ODM file looks like this:

<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" FileType="Snapshot" FileOID="WHO5 ODM File"
			 CreationDateTime="2017-03-15T00:00:00" ODMVersion="1.3.2">
			<Study OID="StudyOID">
				<GlobalVariables>
					<StudyName>Sample Study</StudyName>
					<StudyDescription/>
					<ProtocolName>Sample Protocol</ProtocolName>
				</GlobalVariables>
		
				<MetaDataVersion OID="MetaDataOID" Name="WHO5MetaData">
					<!-- the study events, forms, item groups and items for the study data are defined here  -->
				</MetaDataVersion>
			</Study>
		
			<ClinicalData StudyOID="StudyOID" MetaDataVersionOID="MetaDataOID">
				<SubjectData SubjectKey="subject 1">
					<!-- Subject 1's study data -->
				</SubjectData>
				<SubjectData SubjectKey="subject 2">
					<!-- Subject 2's study data -->
				</SubjectData>
				<!-- more subject data -->
			</ClinicalData>
		</ODM>

This tool's purpose is to read the metadata-part of the file and generate values for a given number of subjects. A simple example of ODM metadata is given below:

<Study OID="StudyOID">
			<GlobalVariables><!-- --></GlobalVariables>
		
			<MetaDataVersion OID="MetaDataVersionOID" Name="MetaData">
				<Protocol>
					<StudyEventRef StudyEventOID="StudyEventOID" Mandatory="Yes"/>
				</Protocol>
		
				<StudyEventDef OID="StudyEventOID" Name="StudyEvent" Repeating="No" Type="Scheduled">
					<FormRef FormOID="FormOID" Mandatory="Yes"/>
				</StudyEventDef>
		
				<FormDef OID="FormOID" Name="example questionnaire" Repeating="No">
					<ItemGroupRef ItemGroupOID="ItemGroupOID" Mandatory="Yes"/>
				</FormDef>
		
				<ItemGroupDef OID="ItemGroupOID" Name="basic demographic information" Repeating="No">
					<ItemRef ItemOID="GenderItemOID" Mandatory="Yes"/>
				</ItemGroupDef>
		
				<ItemDef OID="GenderItemOID" Name="Item.1" DataType="string">
					<Question>
						<TranslatedText xml:lang="en">patient's gender</TranslatedText>
					</Question>
					<CodeListRef CodeListOID="CodeList.gender"/>
				</ItemDef>
		
				<CodeList OID="CodeList.gender" Name="Gender-CodeList" DataType="string">
					<CodeListItem CodedValue="male"><Decode><TranslatedText xml:lang="en">male</TranslatedText></Decode></CodeListItem>
					<CodeListItem CodedValue="female"><Decode><TranslatedText xml:lang="en">female</TranslatedText></Decode></CodeListItem>
					<CodeListItem CodedValue="other"><Decode><TranslatedText xml:lang="en">other</TranslatedText></Decode></CodeListItem>
					<CodeListItem CodedValue="unknown"><Decode><TranslatedText xml:lang="en">unknown</TranslatedText></Decode></CodeListItem>
				</CodeList>
			</MetaDataVersion>
		</Study>

Please note: An ODM file might define just clinical data, so please make sure your ODM file contains metadata. An ODM file may define multiple studies, but this is not common, so this option is currently not supported by the tool. If you have multiple studies, please copy them into individual ODM files.
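
Both conditions can be checked before uploading. The sketch below uses Python's standard `xml.etree.ElementTree` and the ODM 1.3 namespace from the example above. `check_odm` is a hypothetical helper, not part of the tool.

```python
import xml.etree.ElementTree as ET

ODM_NS = {"odm": "http://www.cdisc.org/ns/odm/v1.3"}

def check_odm(xml_text):
    """Return a list of problems that would prevent data generation."""
    root = ET.fromstring(xml_text)
    studies = root.findall("odm:Study", ODM_NS)
    problems = []
    if not studies:
        problems.append("no Study element found")
    elif len(studies) > 1:
        problems.append("more than one Study element; split the file first")
    for study in studies:
        if study.find("odm:MetaDataVersion", ODM_NS) is None:
            problems.append(f"Study {study.get('OID')!r} has no MetaDataVersion")
    return problems

# A file that only carries clinical data and no metadata
sample = """<ODM xmlns="http://www.cdisc.org/ns/odm/v1.3" ODMVersion="1.3.2">
  <ClinicalData StudyOID="StudyOID" MetaDataVersionOID="MetaDataOID"/>
</ODM>"""
print(check_odm(sample))  # ['no Study element found']
```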

Usage

To generate clinical data, simply upload your ODM file and set the number of patients. The tool reads the ODM file and automatically generates values for all items in the form. The result is stored in a .zip file that also contains the original file and the configuration. For the example metadata above, the generated data will look similar to this:

<ClinicalData StudyOID="StudyOID" MetaDataVersionOID="MetaDataVersionOID">
			<SubjectData SubjectKey="patient 0">
				<StudyEventData StudyEventOID="StudyEventOID">
					<FormData FormOID="FormOID">
						<ItemGroupData ItemGroupOID="ItemGroupOID">
							<ItemData Value="other" ItemOID="GenderItemOID"/>
						</ItemGroupData>
					</FormData>
				</StudyEventData>
			</SubjectData>
			<SubjectData SubjectKey="patient 1">
				<StudyEventData StudyEventOID="StudyEventOID">
					<FormData FormOID="FormOID">
						<ItemGroupData ItemGroupOID="ItemGroupOID">
							<ItemData Value="unknown" ItemOID="GenderItemOID"/>
						</ItemGroupData>
					</FormData>
				</StudyEventData>
			</SubjectData>
			<SubjectData SubjectKey="patient 2">
				<StudyEventData StudyEventOID="StudyEventOID">
					<FormData FormOID="FormOID">
						<ItemGroupData ItemGroupOID="ItemGroupOID">
							<ItemData Value="female" ItemOID="GenderItemOID"/>
						</ItemGroupData>
					</FormData>
				</StudyEventData>
			</SubjectData>
			<!-- [...] -->
		</ClinicalData>
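
To inspect such output, for instance to see how often each code was generated, a few lines of Python are enough. This is a sketch under assumptions: `count_item_values` is a hypothetical helper, and it compares local element names so that it works both on a bare fragment like the one above and on a namespaced ODM file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_item_values(xml_text, item_oid):
    """Count the generated values of one item across all subjects."""
    root = ET.fromstring(xml_text)
    return Counter(
        el.get("Value")
        for el in root.iter()
        # strip a possible "{namespace}" prefix before comparing the tag name
        if el.tag.rsplit("}", 1)[-1] == "ItemData" and el.get("ItemOID") == item_oid
    )

sample = """<ClinicalData StudyOID="StudyOID" MetaDataVersionOID="MetaDataVersionOID">
  <SubjectData SubjectKey="patient 0"><StudyEventData StudyEventOID="StudyEventOID">
    <FormData FormOID="FormOID"><ItemGroupData ItemGroupOID="ItemGroupOID">
      <ItemData Value="other" ItemOID="GenderItemOID"/>
    </ItemGroupData></FormData></StudyEventData></SubjectData>
  <SubjectData SubjectKey="patient 1"><StudyEventData StudyEventOID="StudyEventOID">
    <FormData FormOID="FormOID"><ItemGroupData ItemGroupOID="ItemGroupOID">
      <ItemData Value="unknown" ItemOID="GenderItemOID"/>
    </ItemGroupData></FormData></StudyEventData></SubjectData>
</ClinicalData>"""

counts = count_item_values(sample, "GenderItemOID")
print(counts)
```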

As you can see, the metadata in ODM is organized in multiple levels (StudyEvents, Forms, ItemGroups). Each element can be allowed to appear more than once per patient. A range for the number of allowed repetitions can be configured for each level in the GUI ("Repeat key settings") or in the JSON configuration file.

Study events, forms, item groups and items might be declared as non-mandatory. In this case, you can configure the probability with which the elements are omitted during generation, either per level using "Missing values settings" in the GUI or for individual elements using their OID in the JSON configuration file (see below).

DataTypes, Ranges and ValueDomains

ODM supports a lot of commonly used datatypes, e.g. strings, numbers and dates, but also more exotic data types like HexFloat or PartialDateTime. Currently, this tool supports only boolean, string, integer, float, double, date, time and datetime, as these are the most commonly used types.
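
To find out in advance whether a file uses datatypes outside this list, the `ItemDef` elements can be scanned as in the sketch below. The set of supported names is taken from the paragraph above. `unsupported_items` and the `DoseItemOID` item are hypothetical, and the lower-case comparison assumes the spelling used in `DataType="string"`.

```python
import xml.etree.ElementTree as ET

SUPPORTED_TYPES = {"boolean", "string", "integer", "float",
                   "double", "date", "time", "datetime"}

def local_name(tag):
    """Strip a possible '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def unsupported_items(xml_text):
    """Map the OID of every ItemDef with an unsupported DataType to that type."""
    root = ET.fromstring(xml_text)
    return {
        el.get("OID"): el.get("DataType")
        for el in root.iter()
        if local_name(el.tag) == "ItemDef"
        and el.get("DataType") not in SUPPORTED_TYPES
    }

metadata = """<MetaDataVersion OID="MetaDataVersionOID" Name="MetaData">
  <ItemDef OID="GenderItemOID" Name="Item.1" DataType="string"/>
  <ItemDef OID="DoseItemOID" Name="Item.3" DataType="hexFloat"/>
</MetaDataVersion>"""
print(unsupported_items(metadata))  # {'DoseItemOID': 'hexFloat'}
```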

If nothing is specified, the whole domain of a datatype is used. In a perfect world, the domain of allowed values is always reasonably constrained by RangeCheck elements:

<MetaDataVersion OID="MetaDataVersionOID" Name="MetaData">
			<!-- [...] -->
			<ItemDef OID="AgeItemOID" Name="Item.2" DataType="integer">
				<RangeCheck Comparator="GT" SoftHard="Hard">
					<CheckValue>0</CheckValue>
				</RangeCheck>
				<RangeCheck Comparator="LT" SoftHard="Hard">
					<CheckValue>120</CheckValue>
				</RangeCheck>
			</ItemDef>
		</MetaDataVersion>

In practice, this is often not the case. If you don't want to change the ODM file, you can either set the "Global value domains" parameter per datatype in the GUI, or configure distributions for individual items in the JSON configuration (see below) to generate useful values.
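
To see which range a pair of RangeCheck elements implies, the following Python sketch turns the GT/GE/LT/LE comparators of an integer item into inclusive bounds. It is an illustration only: `integer_bounds` is a hypothetical helper, it works on an un-namespaced fragment, and if several checks of the same kind are present the last one wins.

```python
import xml.etree.ElementTree as ET

def integer_bounds(item_def_xml):
    """Derive inclusive (low, high) bounds from GT/GE/LT/LE RangeChecks."""
    item = ET.fromstring(item_def_xml)
    low, high = None, None
    for check in item.iter("RangeCheck"):
        value = int(check.findtext("CheckValue"))
        comparator = check.get("Comparator")
        if comparator == "GT":
            low = value + 1      # strictly greater: first allowed integer
        elif comparator == "GE":
            low = value
        elif comparator == "LT":
            high = value - 1     # strictly less: last allowed integer
        elif comparator == "LE":
            high = value
    return low, high

item_def = """<ItemDef OID="AgeItemOID" Name="Item.2" DataType="integer">
  <RangeCheck Comparator="GT" SoftHard="Hard"><CheckValue>0</CheckValue></RangeCheck>
  <RangeCheck Comparator="LT" SoftHard="Hard"><CheckValue>120</CheckValue></RangeCheck>
</ItemDef>"""
print(integer_bounds(item_def))  # (1, 119)
```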

Individual element configuration

Besides the configuration on this page, you can configure individual elements by using the "probabilities" element in the JSON configuration file. You can set the probability of missing values for individual StudyEvents, Forms, ItemGroups and Items by using their OID. If elements are reused, you can select them inside a specific context using a chain of OIDs:

{
		  "probabilities": {
			"Items": {
			  <ItemOID>: {
				"ValueDistribution": {"1": 0.5, "2": 0.25, "3": 0.25},
				"missingProbability": 0.42
			  },
			  <ItemOID>: {
				"ValueDistribution": {"true": 0.75, "false": 0.25}
			  }
			},
			"ItemGroups": {
			  <ItemGroupOID>: {
				"missingProbability": 0.5,
				"repeat": {"minimum": 8, "maximum": 90},
				"Items": {
				  <ItemOID>: {
					"GaussianDistribution": {"mean": 3, "sigma": 1}
				  },
				  <ItemOID>: { /* ... */ }
				}
			  },
			  <ItemGroupOID>: { /* ... */ }
			},
			"Forms": {
			  <FormOID>: {
				"missingProbability": 0.5,
				"repeat": {"minimum": 8, "maximum": 90},
				"ItemGroups": {
				  <ItemGroupOID>: {
					"missingProbability": 0.5,
					"Items": {
					  <ItemOID>: {
						"UniformDistribution": {"min": 0.0, "max": 1.0}
					  }
					}
				  }
				}
			  }
			},
			"StudyEvents": {
			  <StudyEventOID>: {
				"missingProbability": 0.5,
				"repeat": {"minimum": 8, "maximum": 90},
				"Forms": {
				  <FormOID>: {
					"missingProbability": 0.5,
					"ItemGroups": {
					  <ItemGroupOID>: {
						"missingProbability": 0.4,
						"Items": {
						  <ItemOID>: {
							"GaussianDistribution": {"mean": "2010-01-01", "sigma": "P2DT3H4M"},
							"missingProbability": 0.5
						  },
						  <ItemOID>: {/* ... */}
						}
					  }
					}
				  },
				  <FormOID>: {/* ... */}
				}
			  },
			  <StudyEventOID>: {/* ... */}
			}
		  }
		}
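
Writing such nested structures by hand is error-prone, so it can help to build them programmatically. The sketch below creates the nesting for a chain of OIDs step by step. `set_for_chain` is a hypothetical helper; the level names and OIDs are taken from the examples above.

```python
import json

def set_for_chain(config, chain, settings):
    """Store settings under a chain of (level, OID) pairs, outermost first."""
    node = config.setdefault("probabilities", {})
    for level, oid in chain:
        node = node.setdefault(level, {}).setdefault(oid, {})
    node.update(settings)
    return config

config = {}

# Setting for the item wherever it is used
set_for_chain(config, [("Items", "GenderItemOID")], {"missingProbability": 0.42})

# Setting for the same item, but only inside a specific form and item group
set_for_chain(
    config,
    [("Forms", "FormOID"), ("ItemGroups", "ItemGroupOID"), ("Items", "GenderItemOID")],
    {"ValueDistribution": {"male": 0.5, "female": 0.5}},
)

print(json.dumps(config, indent=2))
```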

Distributions

To change the range of values generated for a specific item, you can use a distribution. Currently, we support three types of distribution: the uniform distribution and the Gaussian distribution, which are especially useful when dealing with numbers or dates, and the value distribution, which comes in handy when dealing with a set of fixed codes.

A uniform distribution is specified by its minimal and maximal value. Please remember that the range of allowed values of a specific item might already be constrained by RangeChecks and global value domains.

"UniformDistribution": {"min": "2017-01-01", "max": "2019-01-02"}

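For dates, a uniform distribution means that every day between the two bounds is equally likely. A minimal Python sketch of that idea follows. It is not the tool's implementation, `sample_uniform_date` is a hypothetical helper, and including both bounds is an assumption.

```python
import random
from datetime import date, timedelta

def sample_uniform_date(rng, minimum, maximum):
    """Pick a day uniformly between two ISO dates (both bounds included)."""
    low, high = date.fromisoformat(minimum), date.fromisoformat(maximum)
    return low + timedelta(days=rng.randint(0, (high - low).days))

rng = random.Random(42)  # fixed seed for reproducible output
samples = [sample_uniform_date(rng, "2017-01-01", "2019-01-02") for _ in range(5)]
for day in samples:
    print(day.isoformat())
```
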
The Gaussian distribution is specified by its mean and its standard deviation (sigma). Please note that the mean value must be inside the range of allowed values.

"GaussianDistribution": {"mean": 1.0, "sigma": 0.5}

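One way to combine a Gaussian distribution with a restricted range is rejection sampling: draw values and discard those outside the range. Whether the tool does this is not documented here, so the following Python sketch is only an assumption about one possible approach. `sample_gaussian` and the bounds 0.0 and 2.0 are hypothetical.

```python
import random

def sample_gaussian(rng, mean, sigma, low, high, max_tries=1000):
    """Draw from N(mean, sigma) and reject values outside [low, high]."""
    if not low <= mean <= high:
        raise ValueError("mean must lie inside the allowed range")
    for _ in range(max_tries):
        value = rng.gauss(mean, sigma)
        if low <= value <= high:
            return value
    raise RuntimeError("no sample inside the allowed range")

rng = random.Random(7)  # fixed seed for reproducible output
values = [sample_gaussian(rng, mean=1.0, sigma=0.5, low=0.0, high=2.0)
          for _ in range(5)]
print(values)
```
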
The value distribution allows you to specify a probability for each individual value:

"ValueDistribution": {"male": 0.45, "female": 0.45, "other": 0.05, "unknown": 0.05}

Because of limitations in the JSON format, you have to quote booleans and numbers. The value after the colon represents the probability, that the given element will be chosen. Please make sure that all probabilities of a value distribution sum up to 1.0.
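
Both rules, string keys and probabilities that sum to 1.0, are easy to check before the configuration is written. The following Python sketch does so. `to_value_distribution` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import json
import math

def to_value_distribution(weights):
    """Turn a Python mapping into a JSON-ready ValueDistribution."""
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return {
        # numbers and booleans become their JSON spelling: 1 -> "1", True -> "true"
        json.dumps(key) if isinstance(key, (bool, int, float)) else str(key): p
        for key, p in weights.items()
    }

print(to_value_distribution({1: 0.5, 2: 0.25, 3: 0.25}))
print(to_value_distribution({True: 0.75, False: 0.25}))
```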

Repeating elements

StudyEvents, Forms and ItemGroups can be specified as repeatable. In that case, the "Repeat key settings" apply. If you want to set these for individual elements, you can specify a minimal and maximal number of repetitions for an element by using its OID:

"repeat": {"minimum": 10, "maximum": 20}

Configuration File Cheat Sheet

{
		  "randomSeed": 23455 /* for consistent results, otherwise chosen randomly */,
		  "subjectCount" : 25,
		  "keepExistingData" : true,
		
		  "studyEventMissingProbability" : 1.0,
		  "formMissingProbability" : 1.0,
		  "itemGroupMissingProbability" : 1.0,
		  "itemMissingProbability" : 1.0,
		
		  "studyEventRepeatKeyInterval" : {"minimum" : 2, "maximum" : 4},
		  "formRepeatKeyInterval" : {"minimum" : 2, "maximum" : 4},
		  "itemGroupRepeatKeyInterval" : {"minimum" : 2, "maximum" : 4},
		
		  "globalValueDomains" : {
			"booleanValueDomain" : {
			  "minimum" : false, "includeMinimum" : true,
			  "maximum" : true, "includeMaximum" : true
			},
			"stringValueDomain" : {
			  "minimum" : 1, "includeMinimum" : true, /* refers to string length */
			  "maximum" : 64, "includeMaximum" : true
			},
			"integerValueDomain" : {
			  "minimum" : -2147483648, "includeMinimum" : true,
			  "maximum" : 2147483647, "includeMaximum" : true
			},
			"floatValueDomain" : {
			  "minimum" : 1.4E-45, "includeMinimum" : true,
			  "maximum" : 3.4028235E38, "includeMaximum" : true
			},
			"doubleValueDomain" : {
			  "minimum" : 4.9E-324, "includeMinimum" : true,
			  "maximum" : 1.7976931348623157E308, "includeMaximum" : true
			},
			"dateValueDomain" : {
			  "minimum" : [ 1919, 8, 14 ], "includeMinimum" : true,
			  "maximum" : "2119-08-14", "includeMaximum" : true
			},
			"timeValueDomain" : {
			  "minimum" : [ 0, 0 ], "includeMinimum" : true,
			  "maximum" : "23:59:59.9999999", "includeMaximum" : true
			},
			"dateTimeValueDomain" : {
			  "minimum" : [ 1919, 8, 14, 11, 26, 21, 723000000 ],
			  "includeMinimum" : true,
			  "maximum" : "2119-08-01T13:38:41", //CAVE: leading zeros at "08" and "01" are required!
			  "includeMaximum" : true
			}
		  },
		  "probabilities": { // configuration for specific elements by OID
			"StudyEvents": {
			  <StudyEventOID>: { //unknown OIDs will be ignored silently
				"missingProbability": 0.5,
				"repeat": {"minimum": 1, "maximum": 5},
				// missingProbability: probability that a non-mandatory StudyEvent will not be produced (default is 0.2 = 20%)
				"Forms": {
				  <FormOID>: {
					"missingProbability": 0.5, // Probability that a non-mandatory form will not be generated
					"repeat": {"minimum": 8, "maximum": 90},
					"ItemGroups": {
					  <ItemGroupOID>: {
						"missingProbability": 0.4,
						"repeat": {"minimum": 5, "maximum": 15},
						"Items": {
						  <ItemOID>: {
							"GaussianDistribution": { "mean": "2010-01-01", "sigma": "P2DT3H4M" }
							//P2DT3H4M = standard deviation of 2 days 3 hours 4 minutes (DATETIME)
							//PT3H2M1S = standard deviation of 3 hours 2 minutes 1 second (TIME)
							//P1Y2M3D = standard deviation of 1 year 2 months 3 days (DATE), see ISO 8601 for more info
						  }
						}
					  }
					}
				  },
				  <FormOID>: ...
				}
			  }
			},
			"Forms": {
			  <FormOID>: {
				"missingProbability": 0.5, // Setting inside a specific StudyEvent has precedence
				"repeat": {"minimum": 8, "maximum": 90},
				"ItemGroups": {
				  <ItemGroupOID>: {
					"missingProbability": 0.5,
					"Items": {
					  <ItemOID>: {
						"UniformDistribution": {"min": 0.0, "max": 1.0}
					  }
					}
				  }
				}
			  }
			},
			"ItemGroups": {
			  <ItemGroupOID>: {
				"missingProbability": 0.5, //Form-specific and StudyEvent+Form-specific settings have precedence
				"repeat": {"minimum": 8, "maximum": 90},
				"Items": {
				  <ItemOID>: {
					"GaussianDistribution": { "mean": 3, "sigma": 1 }
				  },
				  <ItemOID>: {
					"ValueDistribution": { "1": 0.5, "2": 0.25, "3": 0.25 },
					"missingProbability": 0.42
				  }
				  /* ... */
				}
			  }
			},
			"Items": {
			  <ItemOID>: {
				"ValueDistribution": {"true": 0.75, "false": 0.25} //key must always be a String (thx to JSON)
			  }
			  /* ... */
			}
		  }
		}
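
The comments in the cheat sheet describe a precedence: a setting inside a specific StudyEvent and Form beats a Form-specific one, which in turn beats the setting given for the ItemGroup OID alone. The Python sketch below spells out that lookup order for an ItemGroup's `missingProbability`. It reflects the comments only and is not the tool's code. `resolve_missing_probability` and the OIDs in the example are hypothetical.

```python
def resolve_missing_probability(probabilities, study_event, form, item_group, default):
    """Look up an ItemGroup's missingProbability, most specific context first."""
    candidates = [
        # StudyEvent + Form specific
        ["StudyEvents", study_event, "Forms", form, "ItemGroups", item_group],
        # Form specific
        ["Forms", form, "ItemGroups", item_group],
        # valid for the ItemGroup OID in any context
        ["ItemGroups", item_group],
    ]
    for path in candidates:
        node = probabilities
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and "missingProbability" in node:
            return node["missingProbability"]
    return default

probabilities = {
    "ItemGroups": {"IG.1": {"missingProbability": 0.5}},
    "Forms": {"F.1": {"ItemGroups": {"IG.1": {"missingProbability": 0.1}}}},
}
print(resolve_missing_probability(probabilities, "SE.1", "F.1", "IG.1", 0.2))  # 0.1
print(resolve_missing_probability(probabilities, "SE.1", "F.2", "IG.1", 0.2))  # 0.5
```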

You can also use the CSV generator of this tool programmatically by sending a request to the following endpoint.
Note: You can customize the fields of this request just as you would in the user interface.


POST /csv/api HTTP/1.1
Host: https://clinicaldatagenerator.uni-muenster.de
Content-Type: multipart/form-data;
Content-Disposition: form-data; name="json_conf"


{
	"rowCount": 25,
	"randomSeed": 8223,
	"globalValueDomains": {
		"booleanValueDomain": {
			"minimum": false,
			"maximum": true,
			"includeMinimum": true,
			"includeMaximum": true
		},
		"stringValueDomain": {
			"minimum": "1",
			"maximum": "64",
			"includeMinimum": true,
			"includeMaximum": true
		},
		"integerValueDomain": {
			"minimum": -2147483648,
			"maximum": 2147483647,
			"includeMinimum": true,
			"includeMaximum": true
		},
		"floatValueDomain": {
			"minimum": -1000000,
			"maximum": 1000000,
			"includeMinimum": true,
			"includeMaximum": true
		},
		"doubleValueDomain": {
			"minimum": -1000000,
			"maximum": 1000000,
			"includeMinimum": true,
			"includeMaximum": true
		},
		"dateValueDomain": {
			"minimum": "1924-07-23",
			"maximum": "2124-07-23",
			"includeMinimum": true,
			"includeMaximum": true
		},
		"timeValueDomain": {
			"minimum": "00:00:00",
			"maximum": "23:59:59",
			"includeMinimum": true,
			"includeMaximum": true
		},
		"dateTimeValueDomain": {
			"minimum": "1924-07-23T14:26:44",
			"maximum": "2124-07-23T14:26:44",
			"includeMinimum": true,
			"includeMaximum": true
		}
	},
	"outputFormat": "CSV",
	"csvSettings": {
		"delimiter": ",",
		"nullString": "N/A",
		"escape": "\\",
		"quoteMode": "MINIMAL",
		"quoteChar": "\""
	},
	"columns": {
		// Insert your column configuration here
	}
}
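
A request of this shape can be prepared with Python's standard library alone, as sketched below. The URL is assembled from the host and path shown above. Whether the server accepts a shortened configuration like this one, and what the response looks like, is not stated here, so the sketch only builds the request and leaves sending it to the caller. `build_request` is a hypothetical helper.

```python
import json
import urllib.request
import uuid

API_URL = "https://clinicaldatagenerator.uni-muenster.de/csv/api"

def build_request(config):
    """Wrap the JSON configuration in a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="json_conf"\r\n'
        "\r\n"
        f"{json.dumps(config)}\r\n"
        f"--{boundary}--\r\n"
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )

# Shortened configuration (assumption: omitted fields fall back to defaults)
config = {
    "rowCount": 25,
    "randomSeed": 8223,
    "outputFormat": "CSV",
    "columns": {"id": "INTEGER", "description": "STRING"},
}
request = build_request(config)
print(request.get_method(), request.full_url)

# To actually send it:
# with urllib.request.urlopen(request) as response:
#     payload = response.read()
```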

`Ctrl`+`Space`	Autocomplete
`Ctrl`+`Shift`+`K`	Delete Line
`Alt`+`Shift`+`F`	Format Document
`Ctrl`+`Shift`+`O`	Navigate to columns by name
