chore(deps): update dependency pyspark to v4 #13387
Conversation
Hello @renovate-bot, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR is an automated dependency update generated by renovate-bot. Its primary purpose is to upgrade the pyspark dependency within the pubsublite/spark-connector module. Specifically, it updates the required version of pyspark[sql] from 3.5.5 to the new major version 4.0.0. This is a straightforward update to keep the project's dependencies current.
Highlights
- Dependency Update: The core change is updating the pyspark[sql] dependency to version 4.0.0.
- Automated PR: This pull request was automatically created by renovate-bot.
- Manual Merge Required: The PR is configured to require manual merging and is not set up for automerge.
Changelog
- pubsublite/spark-connector/requirements.txt
  - Updated the version constraint for pyspark[sql] from ==3.5.5 to ==4.0.0.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Spark version bumps,
From three-five-five to four,
New features await.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request updates pyspark from version 3.5.5 to 4.0.0. While keeping dependencies up-to-date is generally good practice, this is a major version upgrade for PySpark, which often introduces breaking changes and requires careful consideration.
The most significant concern with this upgrade to Spark 4.0 is its move to Scala 2.13 by default (from Scala 2.12 in Spark 3.x). This has direct implications for the binary compatibility of Spark connectors, such as the Pub/Sub Lite Spark connector JARs used in the examples and tests within this repository.
Thorough testing and verification are crucial before merging this change.
Summary of Findings
- Connector JAR Compatibility with Spark 4.0/Scala 2.13: Spark 4.0 uses Scala 2.13 by default. The existing Pub/Sub Lite Spark connector JARs (e.g., pubsublite-spark-sql-streaming-1.0.0) are likely built for Scala 2.12 and Spark 3.x, which will cause runtime failures. This needs to be verified, and compatible JARs must be used (see the SparkSession sketch after this list). This is a critical issue.
- Thorough Testing for PySpark 4.0 Breaking Changes: Major version upgrades like PySpark 3.x to 4.x can introduce subtle breaking changes or behavior modifications beyond connector compatibility (e.g., changes in default SQL configurations). All examples, and especially the integration tests in spark_streaming_test.py, need to be thoroughly re-validated. This is a high-severity concern.
- Missing newline at end of file: The file pubsublite/spark-connector/requirements.txt is missing a newline character at the end of the file. This is a minor stylistic issue and has not been commented on directly due to review settings focusing on medium severity and above.
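To make the first finding concrete, here is a minimal sketch of how a local PySpark 4.0 session could attach the connector JAR so that Scala binary compatibility can be checked directly. The Spark-4.0/Scala-2.13 artifact name in the path is a placeholder assumption; no such build is confirmed by this PR.

```python
# Minimal sketch: attaching the Pub/Sub Lite Spark connector JAR to a local
# PySpark 4.0 session. The "<scala-2.13-build>" artifact name below is
# HYPOTHETICAL -- the existing samples point at a Scala 2.12 build such as
# gs://pubsublite-spark/pubsublite-spark-sql-streaming-1.0.0-with-dependencies.jar.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pubsublite-connector-compat-check")
    .config(
        "spark.jars",
        "gs://pubsublite-spark/pubsublite-spark-sql-streaming-<scala-2.13-build>-with-dependencies.jar",
    )
    .getOrCreate()
)

print(spark.version)  # expect 4.0.0 after this dependency bump
```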
Merge Readiness
This pull request updates pyspark to a new major version (4.0.0). Due to the critical concern regarding the compatibility of the Pub/Sub Lite Spark connector JARs with Spark 4.0's default Scala 2.13, and the general risk of breaking changes in a major library upgrade, I recommend that this PR not be merged until these issues are thoroughly investigated and addressed.
Specifically, it's essential to:
- Confirm the availability and usage of Pub/Sub Lite Spark connector JARs compatible with Spark 4.0/Scala 2.13.
- Successfully run all integration tests in spark_streaming_test.py with the new PySpark version and compatible connectors (a smoke-test sketch follows this list).
- Review PySpark 4.0 release notes for other potential breaking changes affecting the examples.
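As a rough illustration of the kind of smoke test the second point calls for, the sketch below reads from a Pub/Sub Lite subscription the way the existing samples do, assuming a Spark-4.0-compatible connector JAR is already on the classpath (see the earlier SparkSession sketch). The subscription path is a placeholder.

```python
# Hedged smoke-test sketch: stream a few records through the connector to
# surface Scala binary-incompatibility errors early. The subscription path
# is a placeholder and must be replaced with a real resource.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pubsublite-smoke-test").getOrCreate()

sdf = (
    spark.readStream.format("pubsublite")
    .option(
        "pubsublite.subscription",
        "projects/PROJECT_NUMBER/locations/LOCATION/subscriptions/SUBSCRIPTION_ID",
    )
    .load()
)

# If the connector was compiled against Scala 2.12, a NoClassDefFoundError or
# NoSuchMethodError typically surfaces here at runtime rather than at import time.
query = sdf.writeStream.format("console").trigger(processingTime="1 second").start()
query.awaitTermination(30)
query.stop()
```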
As a reviewer, I am not authorized to approve pull requests. Please ensure these critical points are resolved and further review is conducted before considering merging.
pyspark[sql]==4.0.0
Upgrading to PySpark 4.0.0 is a significant change. Apache Spark 4.0 now uses Scala 2.13 by default, whereas Spark 3.x versions (like 3.5.5) primarily used Scala 2.12.
This raises a critical concern about the compatibility of the Pub/Sub Lite Spark connector JARs used in your examples and tests (spark_streaming_test.py):
gs://pubsublite-spark/pubsublite-spark-sql-streaming-1.0.0-with-dependencies.jar
gs://spark-lib/pubsublite/pubsublite-spark-sql-streaming-LATEST-with-dependencies.jar
These JARs, especially version 1.0.0, are likely compiled against Scala 2.12. For instance, the java-pubsublite-spark connector version 1.1.0 (released Feb 2024) specifies Spark 3.3.2 and Scala 2.12.
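One quick way to see which Scala line the installed PySpark itself targets is to inspect the bundled jars, whose names carry the Scala binary version. This is a small, hedged check that assumes the usual pip layout of the pyspark package.

```python
# Hedged check: pip-installed PySpark bundles its Spark jars under
# <site-packages>/pyspark/jars, and the Scala binary version appears in the
# artifact names (e.g. spark-core_2.12-3.5.5.jar vs. spark-core_2.13-4.0.0.jar).
import glob
import os

import pyspark

jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
core_jars = glob.glob(os.path.join(jars_dir, "spark-core_*.jar"))
print(core_jars)  # a _2.13 suffix is expected under PySpark 4.0.0
```

Any connector JAR on the classpath would need to match that Scala suffix to avoid binary incompatibilities.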
Could you please verify the following points?
- Are there versions of the Pub/Sub Lite Spark connector JARs available that are compiled for Spark 4.0 and Scala 2.13?
- Have the paths in spark_streaming_test.py been updated to use these compatible JARs if necessary?
- Have all tests in spark_streaming_test.py been executed successfully with PySpark 4.0 and the (potentially new) connector JARs? Without compatible connectors, these tests are very likely to fail due to binary incompatibilities between Scala versions.
Additionally, PySpark 4.0 may introduce other breaking changes or behavior modifications (e.g., the default change for spark.sql.legacy.respectNullabilityInTextDatasetConversion). It's important to review the Apache Spark 4.0.0 release notes for any other changes that might affect your examples and ensure they behave as expected.
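When re-validating the samples, legacy SQL flags called out during the review can be pinned explicitly so that any behavior difference is deliberate rather than silent. The sketch below uses the flag named in the comment above; its exact name and 4.0 default should be verified against the official Apache Spark 4.0.0 migration guide before relying on it.

```python
# Hedged sketch: pin a legacy SQL flag mentioned in this review while testing
# under PySpark 4.0, so behavior changes are explicit. Verify the flag name
# and its 4.0 default against the Spark migration guide; it is quoted here
# from the review comment, not independently confirmed.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pyspark4-migration-check")
    .config("spark.sql.legacy.respectNullabilityInTextDatasetConversion", "true")
    .getOrCreate()
)

print(spark.conf.get("spark.sql.legacy.respectNullabilityInTextDatasetConversion"))
```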
This PR contains the following updates:
- pyspark[sql]: ==3.5.5 -> ==4.0.0
Release Notes
apache/spark (pyspark)
v4.0.0
Compare Source
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Never, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.