Module Mapred_toolkit.DSeq

module DSeq: sig .. end


The following operations are distributed over the task nodes.

It is here required that the underlying stores are files! It is not possible to distribute operations on notebooks.

type config 
val create_config : ?name:string ->
?task_files:string list ->
?bigblock_size:int ->
?map_tasks:int ->
?merge_limit:int ->
?split_limit:int ->
?partitions:int ->
?enhanced_mapping:int ->
?phases:Mapred_def.phases ->
?report:bool ->
?report_to:Netchannels.out_obj_channel ->
?keep_temp_files:bool -> Mapred_def.mapred_env -> config
The default values for the configurations are taken from mapred_env, and are effectively taken from the .conf files. It is possible, though, to override the values.
val get_rc : config -> Mapred_io.record_config
Returns the record config.

Another way for getting the record config is Mapred_def.get_rc.

type 'a result 
The result of a distributed operation
val get_result : 'a result -> 'a
Get the computed value (or an exception)
val stats : 'a result -> Mapred_stats.stats
Get statistics
val job_id : config -> string
Return the job ID
class type mapred_info = object .. end
Access to the job environment and the job configuration
val mapl : (mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'b Mapred_toolkit.Place.t ->
config ->
('b, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl m pl_in pl_out conf: Runs a map-only job. This means that the records from pl_in are piped through the function m, and the result is written into new files in pl_out.

The created files are also returned in the output sequences.

val mapl_sort_fold : mapl:(mapred_info -> 'a -> 'b list) Mapred_rfun.rfun ->
hash:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
cmp:(mapred_info -> 'b -> 'b -> int) Mapred_rfun.rfun ->
initfold:(mapred_info -> int -> 'c) Mapred_rfun.rfun ->
fold:(mapred_info -> 'c -> 'b -> 'c * 'd list)
Mapred_rfun.rfun ->
?finfold:(mapred_info -> 'c -> 'd list) Mapred_rfun.rfun ->
partition_of:(mapred_info -> 'b -> int) Mapred_rfun.rfun ->
?initcombine:(mapred_info -> 'e) Mapred_rfun.rfun ->
?combine:(mapred_info -> 'e -> 'b -> 'e * 'b list)
Mapred_rfun.rfun ->
?fincombine:(mapred_info -> 'e -> 'b list)
Mapred_rfun.rfun ->
'a Mapred_toolkit.Place.t ->
'd Mapred_toolkit.Place.t ->
config ->
'b Mapred_toolkit.Place.codec ->
('d, [ `W ]) Mapred_toolkit.Seq.seq list result
mapl_sort_fold <args> pl_in pl_out conf int_codec: This is map/reduce. The records from pl_in are mapped/sorted/reduced and finally written into new files in pl_out. There are a number of named arguments defining the job:

Note that the mapped elements are sorted by first comparing the integers returned by hash, and only if such integers are equal, the two elements are compared in detail by calling cmp. See Mapred_sorters for useful definitions of hash and cmp.

The int_codec is used for representing intermediate files (output of the map phase, and input/output of the shuffle phases).