Update java/src/com/twitter/pycascading/Util.java#3

Open
TaoLinVT wants to merge 2 commits into ianoc:casc2 from TaoLinVT:casc2

Conversation

@TaoLinVT

@TaoLinVT TaoLinVT commented Jan 2, 2013

Add gzip compression support (Also need to enable it in python code).

Owner


There's no if statement around any of this; are we changing the upstream behavior of pycascading?

@TaoLinVT
Author

TaoLinVT commented Jan 2, 2013

There are two reasons:

(1). My tests show that these properties have no impact unless the Python code enables compression. For example, changing the following

redis_index = test_flow.basic_sink(cascading.scheme.hadoop.TextLine(), \
                         outputs["redis_index"])

to:

redis_index = redis_builder.basic_sink(cascading.scheme.hadoop.TextLine(cascading.scheme.hadoop.TextLine.Compress.ENABLE), \
                         outputs["redis_index"])

Only the sink with compression enabled generates .gz files; all the others remain the same.

(2). Consider the use case in which one flow has multiple sinks. If one sink has a Compress.ENABLE scheme, then we have to enable those Java compression settings, and those properties will also be applied to the other sinks, even ones without a compress scheme. Because this cannot be avoided, there is no point in adding an if statement around the new statements.

Thanks,
Tao

@ianoc
Owner

ianoc commented Jan 2, 2013

It's very specific to GZIP, and a particular set of flags, to go into upstream. I'll look at being able to pass a set of Hadoop properties into the run option instead. Changing the upstream code again if we decide on a different codec seems like a bad idea. If it's passed into run, rather than into a sink, to control what compression is used, then it will affect all sinks as expected?

@TaoLinVT
Author

TaoLinVT commented Jan 2, 2013

OK, I see.

How about the following, which would let us switch to a different codec later just by adding entries to config:

    mapredOutputCompress = config.get("mapred.output.compress", "true")
    mapredCompressMapOut = config.get("mapred.compress.map.output", "true")
    mapredOutputCompressionType = config.get("mapred.output.compression.type", "BLOCK")
    mapredMapOutputCompressionCodec = config.get("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
    mapredOutputCompressionCodec = config.get("mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
    properties.setProperty("mapred.output.compress", mapredOutputCompress);
    properties.setProperty("mapred.compress.map.output", mapredCompressMapOut);
    properties.setProperty("mapred.output.compression.type", mapredOutputCompressionType);
    properties.setProperty("mapred.map.output.compression.codec", mapredMapOutputCompressionCodec);
    properties.setProperty("mapred.output.compression.codec", mapredOutputCompressionCodec);

Supported codec strings are defined by:
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/compress/package-summary.html
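The lookup-with-defaults pattern above can be sketched in plain Python (a hedged illustration only; `DEFAULTS` and `resolve_compression_options` are hypothetical names, and `config` is assumed to be a plain dict):

```python
# Sketch of the defaults-lookup pattern above, using a plain dict.
# Codec class names come from the Hadoop io.compress package.
DEFAULTS = {
    "mapred.output.compress": "true",
    "mapred.compress.map.output": "true",
    "mapred.output.compression.type": "BLOCK",
    "mapred.map.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
    "mapred.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
}

def resolve_compression_options(config):
    """Return each property from config, falling back to the gzip defaults."""
    return {key: config.get(key, default) for key, default in DEFAULTS.items()}

# A caller-supplied codec wins; every other property falls back to its default.
opts = resolve_compression_options(
    {"mapred.output.compression.codec": "org.apache.hadoop.io.compress.BZip2Codec"})
```

This makes the objection concrete: the gzip strings are baked in as defaults, so changing codec means editing the defaults rather than passing configuration.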

@ianoc
Owner

ianoc commented Feb 6, 2013

This seems to:

  1. Alter default behavior in potentially unexpected ways.
  2. Add very specific defaults (defaults should always preserve existing behavior unless there is good motivation to the contrary).
  3. Add a lot more code for specific Hadoop flags.

The config map is a String -> Object map, so we can have a case where one key maps to a String -> String map,

i.e. config.get("pycascading.hadoop.mapred.options") -> Map<String, String>

We then iterate through this string map, setting all the keys present. That leaves the existing behavior in pycascading as it was and lets us easily pass in a hash map to set any options (for compression or anything else in the future too).
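The proposed iteration can be sketched in Python (a sketch only; `apply_mapred_options` and `OPTIONS_KEY` are illustrative names, with `properties` modeled as a dict rather than a Java Properties object):

```python
# Sketch: apply only the options nested under the pycascading key.
# When the key is absent, properties are left exactly as they were.
OPTIONS_KEY = "pycascading.hadoop.mapred.options"

def apply_mapred_options(config, properties):
    """Copy key/value pairs from config[OPTIONS_KEY] into properties, if present."""
    for name, value in config.get(OPTIONS_KEY, {}).items():
        properties[name] = value
    return properties

props = apply_mapred_options(
    {OPTIONS_KEY: {"mapred.output.compress": "true"}}, {})
```

The design point is that no option name is hard-coded: compression, or any other Hadoop flag, rides in through the same map.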

Thoughts?

@TaoLinVT
Author

OK. Here is the plan:

pycascading.pipe.config = dict() is defined in python/pycascading/bootstrap.py (line 70), and it is passed to python/pycascading/tap.py (line 237) as the config parameter to Util.run().

So

  1. We check whether config has the "pycascading.hadoop.mapred.options" key; if yes, we get the map it defines and apply its key/value pairs. For example,

    // Add the following to Util.java before line 153
    String keyForConfigOptions = "pycascading.hadoop.mapred.options";
    if (config.containsKey(keyForConfigOptions)) {
        @SuppressWarnings("unchecked")
        Map<String, String> configOptions = (Map<String, String>) config.get(keyForConfigOptions);
        for (Map.Entry<String, String> option : configOptions.entrySet()) {
            properties.setProperty(option.getKey(), option.getValue());
        }
    }

  2. We also need to revise the python/pycascading/tap.py Flow.run API (line 228) so that we can pass the configuration to Util.run.

    def run(self, name="pycascading flow", num_reducers=50, min_split_size=0, config=None):
        """Start the Cascading job.

        We call this when we are done building the pipeline and explicitly want
        to start the flow process.
        """
        source_map = self.__get_active_sources()
        tails = [t.get_assembly() for t in self.tails]
        import pycascading.pipe
        key_for_config_options = "pycascading.hadoop.mapred.options"
        if config and key_for_config_options in config:
            pycascading.pipe.config[key_for_config_options] = config[key_for_config_options]
        Util.run(name, num_reducers, min_split_size, pycascading.pipe.config, source_map, \
                 self.sink_map, tails)
    
  3. As the last step in our grid script, we call flow.run() with a config dict containing the following key/value pair:

    "pycascading.hadoop.mapred.options": {
        "mapred.output.compress": "true",
        "mapred.compress.map.output": "true",
        "mapred.output.compression.type": "BLOCK",
        "mapred.map.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
        "mapred.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
    }
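The three steps can be simulated end to end in plain Python (a sketch under stated assumptions: `simulated_run` is a hypothetical stand-in for the proposed Flow.run -> Util.run path, not pycascading API, and the returned dict models the Hadoop job properties):

```python
# End-to-end sketch of the plan: the caller's config dict flows through
# run() into job properties; gzip settings appear only when requested.
OPTIONS_KEY = "pycascading.hadoop.mapred.options"

GZIP_CONFIG = {
    OPTIONS_KEY: {
        "mapred.output.compress": "true",
        "mapred.compress.map.output": "true",
        "mapred.output.compression.type": "BLOCK",
        "mapred.map.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
        "mapred.output.compression.codec": "org.apache.hadoop.io.compress.GzipCodec",
    }
}

def simulated_run(config=None):
    """Mimic the proposed Flow.run -> Util.run path: build the job properties."""
    properties = {}
    if config and OPTIONS_KEY in config:
        for name, value in config[OPTIONS_KEY].items():
            properties[name] = value
    return properties
```

Calling `simulated_run()` with no config leaves the properties empty, which is the point of the design: the default behavior is unchanged, and compression is strictly opt-in.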

…script and default behavior is the same as before.
@TaoLinVT
Author

Please see the new diff. Thank you! -- Tao
