Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 1 addition & 2 deletions .agent/skills/adding-new-metadata/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,6 @@ You must ensure that when a DoFn processes an element and outputs a new element,
### Timers
If metadata needs to survive timer firings (e.g., knowing an `@OnTimer` fired because of a system drain), it must be added to Timer data structures. This is a bit of uncharted area which was only implemented for CausedByDrain metadata that comes from backend, not from persisted metadata. In order to persist all WindowedValue metadata across timer, more work has to be done, below are some pointers:
* `runners/core-java/src/main/java/org/apache/beam/runners/core/TimerInternals.java` and implementations (e.g., `WindmillTimerInternals.java` in Dataflow).
* `runners/samza/src/test/java/org/apache/beam/runners/samza/runtime/KeyedTimerData.java` (or generic `TimerData`).
* **Action:** Add the field to `TimerData`, next to `CausedByDrain`. Propagate it when setting the timer and expose it when the timer fires so it bubbles up.
* Eventually, metadata from Timer lands in WindowedValue, so it can be exposed to users. Keep field names, types, and getters similar to WindowedValue as much as possible, as common interface may be introduced eventually.

Expand Down Expand Up @@ -116,4 +115,4 @@ User needs to access the metadata in their `DoFn` (e.g., `@ProcessElement public
9. [ ] Update `ReduceFnRunner` and `OutputAndTimeBoundedSplittableProcessElementInvoker` for complex transform propagation.
10. [ ] If required by timers, update `TimerData` and `TimerInternals`.
11. [ ] If exposed to the user, update `DoFnSignatures` and `ByteBuddyDoFnInvokerFactory`.
12. [ ] Update other runners (Flink, Spark, Samza) to ensure they propagate the new `WindowedValue` fields correctly in their specific operators/runners.
12. [ ] Update other runners (Flink, Spark) to ensure they propagate the new `WindowedValue` fields correctly in their specific operators/runners.
1 change: 0 additions & 1 deletion .agent/skills/runners/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,6 @@ Runners execute Beam pipelines on distributed processing backends. Each runner t
| Dataflow | `runners/google-cloud-dataflow-java/` | Google Cloud Dataflow |
| Flink | `runners/flink/` | Apache Flink |
| Spark | `runners/spark/` | Apache Spark |
| Samza | `runners/samza/` | Apache Samza |
| Jet | `runners/jet/` | Hazelcast Jet |
| Twister2 | `runners/twister2/` | Twister2 |

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -76,7 +76,6 @@ categories:
- "PostCommit Python ValidatesRunner Dataflow"
- "PostCommit Python ValidatesRunner Spark"
- "PostCommit Python ValidatesRunner Flink"
- "PostCommit Python ValidatesRunner Samza"
- "Build python source distribution and wheels"
- "Java Tests"
- "PostCommit Java"
Expand Down Expand Up @@ -113,7 +112,6 @@ categories:
- "PreCommit Java Kafka IO Direct"
- "PostCommit Java Examples Direct"
- "PreCommit Java JDBC IO Direct"
- "PostCommit Java ValidatesRunner Samza"
- "PreCommit Java Mqtt IO Direct"
- "PreCommit Java Kinesis IO Direct"
- "PreCommit Java MongoDb IO Direct"
Expand All @@ -138,7 +136,6 @@ categories:
- "PreCommit Java Thrift IO Direct"
- "PreCommit Java Snowflake IO Direct"
- "PreCommit Java Solr IO Direct"
- "PostCommit Java PVR Samza"
- "PreCommit Java Tika IO Direct"
- "PostCommit Java SingleStoreIO IT"
- "PostCommit Java ValidatesRunner Direct"
Expand Down Expand Up @@ -209,7 +206,6 @@ categories:
- "PerformanceTests BigQueryIO Batch Java Json"
- "PerformanceTests SQLBigQueryIO Batch Java"
- "PerformanceTests XmlIOIT"
- "PostCommit XVR Samza"
- "PerformanceTests ManyFiles TextIOIT"
- "PerformanceTests XmlIOIT HDFS"
- "PerformanceTests ParquetIOIT"
Expand Down Expand Up @@ -291,8 +287,7 @@ categories:
tests:
- "PerformanceTests MongoDBIO IT"
- "PreCommit GoPortable"
- "PreCommit GoPrism"
- "PostCommit Go VR Samza"
- "PreCommit GoPrism"
- "PostCommit Go Dataflow ARM"
- "LoadTests Go CoGBK Dataflow Batch"
- "LoadTests Go Combine Dataflow Batch"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -390,103 +390,6 @@ python -m apache_beam.examples.wordcount --input /path/to/inputfile \
```
{{end}}

{{if (eq .Sdk "java" "go")}}
### Samza runner

The Apache Samza Runner can be used to execute Beam pipelines using Apache Samza. The Samza Runner executes Beam pipeline in a Samza application and can run locally. The application can further be built into a .tgz file, and deployed to a YARN cluster or Samza standalone cluster with Zookeeper.

The Samza Runner and Samza are suitable for large scale, stateful streaming jobs, and provide:

* First class support for local state (with RocksDB store). This allows fast state access for high frequency streaming jobs.
* Fault-tolerance with support for incremental checkpointing of state instead of full snapshots. This enables Samza to scale to applications with very large state.
* A fully asynchronous processing engine that makes remote calls efficient.
* Flexible deployment model for running the applications in any hosting environment with Zookeeper.
* Features like canaries, upgrades and rollbacks that support extremely large deployments with minimal downtime.

Additionally, you can read more about the Samza Runner [here](https://beam.apache.org/documentation/runners/samza/)

#### Run example
{{end}}

{{if (eq .Sdk "go")}}

Need import:
```
"github.com/apache/beam/sdks/v2/go/pkg/beam/runners/samza"
```

It is necessary to give an endpoint where the runner is raised with `--endpoint`:
```
$ go install github.com/apache/beam/sdks/v2/go/examples/wordcount
# As part of the initial setup, for non linux users - install package unix before run
$ go get -u golang.org/x/sys/unix
$ wordcount --input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://<your-gcs-bucket>/counts \
--runner samza \
--project your-gcp-project \
--region your-gcp-region \
--temp_location gs://<your-gcs-bucket>/tmp/ \
--staging_location gs://<your-gcs-bucket>/binaries/ \
--worker_harness_container_image=apache/beam_go_sdk:latest \
--endpoint=localhost:8081
```
{{end}}

{{if (eq .Sdk "java")}}
You can specify your dependency on the Samza Runner by adding the following to your `pom.xml`:

```
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-samza</artifactId>
<version>2.42.0</version>
<scope>runtime</scope>
</dependency>

<!-- Samza dependencies -->
<dependency>
<groupId>org.apache.samza</groupId>
<artifactId>samza-api</artifactId>
<version>${samza.version}</version>
</dependency>

<dependency>
<groupId>org.apache.samza</groupId>
<artifactId>samza-core_2.11</artifactId>
<version>${samza.version}</version>
</dependency>

<dependency>
<groupId>org.apache.samza</groupId>
<artifactId>samza-kafka_2.11</artifactId>
<version>${samza.version}</version>
<scope>runtime</scope>
</dependency>

<dependency>
<groupId>org.apache.samza</groupId>
<artifactId>samza-kv_2.11</artifactId>
<version>${samza.version}</version>
<scope>runtime</scope>
</dependency>

<dependency>
<groupId>org.apache.samza</groupId>
<artifactId>samza-kv-rocksdb_2.11</artifactId>
<version>${samza.version}</version>
<scope>runtime</scope>
</dependency>
```

Console:
```
$ mvn exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Psamza-runner \
-Dexec.args="--runner=SamzaRunner \
--inputFile=/path/to/input \
--output=/path/to/counts"
```

### Nemo runner

The Apache Nemo Runner can be used to execute Beam pipelines using Apache Nemo. The Nemo Runner can optimize Beam pipelines with the Nemo compiler through various optimization passes and execute them in a distributed fashion using the Nemo runtime. You can also deploy a self-contained application for local mode or run using resource managers like YARN or Mesos.
Expand Down
115 changes: 0 additions & 115 deletions release/src/main/scripts/jenkins_jobs.txt

This file was deleted.

Loading
Loading