secondo/Algebras/Hadoop/Parallel_BerlinMOD/generator/README

# The Parallel BerlinMOD Data Generator creates a BerlinMOD data set in parallel on a cluster.
# It can only run after Parallel Secondo has been correctly installed.
# The generator contains following files:
#	* Bash generator script, named genParaBerlinMOD.sh.
#	* A Hadoop program named GenMOD.jar
# 	* A set of Secondo scripts, include:
#		- BerlinMOD_DataGenerator_map.SEC		(Generate data on slaves in Map stage)
#		- BerlinMOD_DataGenerator_reduce.SEC	(Generate data on slaves in Reduce stage)
#		- BerlinMOD_DataGenerator_master1.SEC	(Set global parameters on the master database)
#		- BerlinMOD_DataGenerator_master2.SEC	(Collect distributed data on the master at last)
# All required files are prepared in the generator, along with this explanatory document.

##################################################################################
# Generate the data set on a single computer
##################################################################################
# The user can also use this generator to sequentially create
# the BerlinMOD data set in a single computer,
# by running BerlinMOD_DataGenerator_map.SEC and BerlinMOD_DataGenerator_reduce.SEC in order.
# It can be achieved either with the bash generate script by setting the argument -l,
# or with the following steps.
#	. Copy the two scripts to the $SECONDO_BUILD_DIR/bin/
#	. Prepare the streets, homeRegions and workRegions data files to $SECONDO_BUILD_DIR/bin/.
#	. Start SecondoTTYBDB, create a database.
#	. Set SCALEFACTOR, like: let SCALEFACTOR = 0.01.
#	. Run the BerlinMOD_DataGenerator_map.SEC, and then the BerlinMOD_DataGenerator_reduce.SEC script.
#	. Close the database.
# Note the data set created by this generator is different from the one that is created by the normal BerlinMOD generator.
# However, it is identical to the data set created with this generator on a cluster, by setting the same scale factor.

##################################################################################
# Use the generate script.
##################################################################################
# Before running the bash script, following prerequisites are needed:
#   . Distribute the data files streets, homeRegions and workRegions to the cluster.
#		This can be done by simply put the data files to $SECONDO_BUILD_DIR/bin/, and then run ps-secondo-buildMini -co.
# 	. Start Parallel Secondo
#	. Keep all files of the generator together, and run the generator on the master node of the cluster.

# The data set can be simply created with "genParaBerlinMOD.sh",
# which asks three optional arguments:
#	* -d : Sets the name of the created database
#	* -s : Sets the scale factor of the data set
#	* -p : Sets the simulated period in days
#	* -l : Generate the data on a single computer, in the first Data Server of the local machine.

# Basically, this generator creates the data set with the following workflow.
#	# Create a master database, setting global parameters and environment data sets.
#	# Run the Hadoop program to generate the data in each slave Data Server.
#	# Create a set of flist objects in the master database, to access the data created in slaves.

##################################################################################
## The following notes are prepared only for advanced users.
##################################################################################

## Differences between the Parallel Generator and the normal BerlinMOD Generator.
##	* Disable the creating, opening and closing database operations.
##	* Check the existent of data files, also they are directly put in the bin directory
##	* Use (if, then, else, endif) commands. Although they cannot be used in TTYCS interface,
##		but working fine in the parallel generation.
##	* Not locally set the SCALEFACTOR value
##	* Add L_START and L_END for setting local P_NUMCARS and vehicle Ids.
##	* Not set global random seed, but set the seed for each trip.
##	* Add S_START and S_END, to locate local samples.
##	* For dataMtrip relation, temporally create local dataMtrip1 in Map tasks,
##		and then globally adjust the TripId in the Reduce stage.
##	* Create a WORLD_BBOX for the local data in the Map stage,
##		and get a global bounding box at last.
##	* Disable the export for streets relation
##	* Locally export datasets to disks, by setting P_EXPORT_TYPE as "Block"
##	* Locally export the bounding box of the data space, according to the WORLD policy.
##	* Change the generation policy for QueryLicences,
##		in order to produce the same samples running on a single computer.
##