# The Parallel BerlinMOD Data Generator creates a BerlinMOD data set in parallel on a cluster.
# It can only run after Parallel Secondo has been correctly installed.
# The generator contains the following files:
# * A bash generator script named genParaBerlinMOD.sh
# * A Hadoop program named GenMOD.jar
# * A set of Secondo scripts, including:
#   - BerlinMOD_DataGenerator_map.SEC (generates data on the slaves in the Map stage)
#   - BerlinMOD_DataGenerator_reduce.SEC (generates data on the slaves in the Reduce stage)
#   - BerlinMOD_DataGenerator_master1.SEC (sets global parameters in the master database)
#   - BerlinMOD_DataGenerator_master2.SEC (collects the distributed data on the master at the end)
# All required files are shipped with the generator, along with this explanatory document.

##################################################################################
# Generate the data set on a single computer
##################################################################################
# The generator can also create the BerlinMOD data set sequentially on a single computer,
# by running BerlinMOD_DataGenerator_map.SEC and then BerlinMOD_DataGenerator_reduce.SEC.
# This can be done either by passing the -l argument to the bash generator script,
# or with the following steps:
# . Copy the two scripts to $SECONDO_BUILD_DIR/bin/.
# . Put the streets, homeRegions and workRegions data files in $SECONDO_BUILD_DIR/bin/.
# . Start SecondoTTYBDB and create a database.
# . Set SCALEFACTOR, for example: let SCALEFACTOR = 0.01.
# . Run BerlinMOD_DataGenerator_map.SEC, and then BerlinMOD_DataGenerator_reduce.SEC.
# . Close the database.
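# The manual steps above can be collected into a Secondo command file. The sketch below is
# only an illustration: the helper name write_generator_commands, the database name and the
# output file name are assumptions, and the commands assume the usual SecondoTTY syntax
# (create/open/close database, let, and @file to run a script). The explicit "open database"
# line is also an assumption, since the steps above only mention creating the database.

```shell
# Write the manual steps as a Secondo command file.
# $1: database name (hypothetical), $2: scale factor, $3: output file
write_generator_commands() {
  cat > "$3" <<EOF
create database $1;
open database $1;
let SCALEFACTOR = $2;
@BerlinMOD_DataGenerator_map.SEC
@BerlinMOD_DataGenerator_reduce.SEC
close database;
EOF
}

# Example: scale factor 0.01, as in the steps above.
write_generator_commands berlinmodtest 0.01 run_generator.sec
cat run_generator.sec
```

# Under these assumptions, the resulting file could be run from within SecondoTTYBDB with
# @run_generator.sec, once the scripts and data files are in $SECONDO_BUILD_DIR/bin/.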
# Note that the data set created by this generator differs from the one created by the normal BerlinMOD generator.
# However, it is identical to the data set created by this generator on a cluster with the same scale factor.

##################################################################################
# Use the generator script
##################################################################################
# Before running the bash script, the following prerequisites must be met:
# . Distribute the data files streets, homeRegions and workRegions to the cluster.
#   This can be done by simply putting the data files in $SECONDO_BUILD_DIR/bin/ and then running ps-secondo-buildMini -co.
# . Start Parallel Secondo.
# . Keep all files of the generator together, and run the generator on the master node of the cluster.

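# Before distributing the data files, it may help to confirm that they are in place.
# The following is a minimal sketch, not part of the generator: check_data_files is a
# hypothetical helper, and the file names are passed as arguments because the actual
# names on disk (for example, with an extension) are not stated here.

```shell
# Report which of the given files are missing from $SECONDO_BUILD_DIR/bin.
# Returns 0 if all files are present, 1 otherwise.
check_data_files() {
  local dir="${SECONDO_BUILD_DIR:?SECONDO_BUILD_DIR is not set}/bin"
  local missing=0
  local f
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (file names are assumptions):
#   check_data_files streets homeRegions workRegions && echo "all data files found"
```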
# The data set can then be created simply by running "genParaBerlinMOD.sh",
# which accepts four optional arguments:
# * -d : Sets the name of the created database
# * -s : Sets the scale factor of the data set
# * -p : Sets the simulated period in days
# * -l : Generates the data on a single computer, in the first Data Server of the local machine

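# To show how the arguments combine, the sketch below assembles a command line without
# running it. build_gen_cmd is a hypothetical helper, the database name and values are
# examples only, and treating -l as a bare switch without a value is an assumption.

```shell
# Assemble (but do not run) a genParaBerlinMOD.sh command line.
# $1: database name, $2: scale factor, $3: period in days, $4: "yes" for single-computer mode
# Empty arguments are skipped, so the script's own defaults would apply.
build_gen_cmd() {
  local db="$1" scale="$2" days="$3" local_mode="$4"
  local cmd="./genParaBerlinMOD.sh"
  [ -n "$db" ]    && cmd="$cmd -d $db"
  [ -n "$scale" ] && cmd="$cmd -s $scale"
  [ -n "$days" ]  && cmd="$cmd -p $days"
  # Assumption: -l is a bare switch that takes no value.
  [ "$local_mode" = "yes" ] && cmd="$cmd -l"
  echo "$cmd"
}

build_gen_cmd berlinmod 0.05 28 no   # ./genParaBerlinMOD.sh -d berlinmod -s 0.05 -p 28
build_gen_cmd "" 0.01 "" yes         # ./genParaBerlinMOD.sh -s 0.01 -l
```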
# In outline, the generator creates the data set with the following workflow:
# # Create a master database, setting global parameters and environment data sets.
# # Run the Hadoop program to generate the data in each slave Data Server.
# # Create a set of flist objects in the master database, to access the data created on the slaves.

##################################################################################
## The following notes are intended for advanced users only.
##################################################################################

## Differences between the Parallel Generator and the normal BerlinMOD Generator:
## * The operations that create, open and close the database are disabled.
## * The existence of the data files is checked; the files are also placed directly in the bin directory.
## * The (if, then, else, endif) commands are used. Although they cannot be used in the TTYCS interface,
##   they work fine in the parallel generation.
## * The SCALEFACTOR value is not set locally.
## * L_START and L_END are added to set the local P_NUMCARS and vehicle ids.
## * No global random seed is set; instead, the seed is set for each trip.
## * S_START and S_END are added to locate the local samples.
## * For the dataMtrip relation, a local dataMtrip1 is created temporarily in the Map tasks,
##   and the TripId is then adjusted globally in the Reduce stage.
## * A WORLD_BBOX is created for the local data in the Map stage,
##   and a global bounding box is computed at the end.
## * The export of the streets relation is disabled.
## * Data sets are exported locally to disk, by setting P_EXPORT_TYPE to "Block".
## * The bounding box of the data space is exported locally, according to the WORLD policy.
## * The generation policy for QueryLicences is changed,
##   so that it produces the same samples as when running on a single computer.
##
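## To illustrate the role of L_START and L_END, the sketch below splits a range of vehicle
## ids into contiguous blocks, one per task. It only shows one plausible scheme: the helper
## local_range, the even split and the example total of 200 vehicles are assumptions, not
## the formula used by the generator scripts.

```shell
# Print "L_START L_END" for one task.
# $1: total number of vehicles, $2: task number (1-based), $3: number of tasks
# Vehicle ids 1..$1 are split into contiguous, nearly equal blocks;
# the first ($1 mod $3) tasks receive one extra vehicle.
local_range() {
  local total="$1" task="$2" tasks="$3"
  local base=$((total / tasks))
  local extra=$((total % tasks))
  local start end
  if [ "$task" -le "$extra" ]; then
    start=$(( (task - 1) * (base + 1) + 1 ))
    end=$(( start + base ))
  else
    start=$(( extra * (base + 1) + (task - extra - 1) * base + 1 ))
    end=$(( start + base - 1 ))
  fi
  echo "$start $end"
}

# 200 vehicles over 3 tasks:
for task in 1 2 3; do
  local_range 200 "$task" 3
done
# prints:
# 1 67
# 68 134
# 135 200
```

## In such a scheme, the local P_NUMCARS of a task would be L_END - L_START + 1, and the
## blocks of all tasks together cover every vehicle id exactly once.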