Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have some working knowledge about Python but pretty new to Apache Beam. I have encountered an example from Apache Beam about a simple word count program. The snippet that I'm confused looks like this:

  pipeline_options = PipelineOptions(pipeline_args)
  pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
  with beam.Pipeline(options=pipeline_options) as p:

    # Read the text file[pattern] into a PCollection.
    lines = p | ReadFromText(known_args.input)

    # Count the occurrences of each word.
    counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r'[A-Za-z']+', x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

    # Format the counts into a PCollection of strings.
    def format_result(word_count):
      (word, count) = word_count
      return '%s: %s' % (word, count)

    output = counts | 'Format' >> beam.Map(format_result)

    # Write the output using a "Write" transform that has side effects.
    # pylint: disable=expression-not-assigned
    output | WriteToText(known_args.output)

The full version of the code is here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_minimal.py

I'm very confused by the "|" and ">>" operators used here. What do they mean here? Are they natively supported in Python?


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
344 views
Welcome To Ask or Share your Answers For Others

1 Answer

Since this code is written in Beam, the symbols you are talking about are native to Beam Pipeline. | is the pipeline symbol which indicates the pipeline being addressed to for the given operation: Like in your example, p is the source pipeline for lines = p | ReadFromText(known_args.input) and lines is the source pipeline for

counts = (
        lines
        | 'Split' >> (
            beam.FlatMap(lambda x: re.findall(r'[A-Za-z']+', x)).
            with_output_types(unicode))
        | 'PairWithOne' >> beam.Map(lambda x: (x, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum))

>> gives a name to a certain operation for ease of reading on the UI.

In your example, 'GroupAndSum' >> beam.CombinePerKey(sum)), GroupAndSum is the name of the combine operation and so on.

Read the documentation given by @Klaus D. in the comments for more clarity.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...