fix: improved links (URLs) extraction for parse_node, resolves #822 #828

Merged (1 commit) on Nov 26, 2024

Levyathanus
Contributor

The _extract_urls method of parse_node was not extracting well-formed URLs, which caused problems when the results were passed to urljoin from urllib.parse (ref. issue #822). These changes try to parse URLs more precisely and cover: "absolute" URLs, with or without a scheme or a "www." prefix (e.g. "www.website.com/...", "http://www.website.com/...", "https://www.website.com/...", "website.com/...", "http://website.com/...", "https://website.com/..."), image URLs, and "relative" URLs (e.g. "/test/page.html", "/?test=test"), which are joined later with the source URL.
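To illustrate why the absolute/relative distinction matters for urljoin, here is a minimal sketch. It is not the code merged in this PR: normalize_link, its default_scheme parameter, and the example.com source URL are hypothetical names chosen for illustration. It assumes that a scheme-less link without a leading slash (such as "www.website.com/...") names a host, which would misclassify a bare relative path like "page.html".

```python
from urllib.parse import urljoin, urlparse


def normalize_link(link: str, source: str, default_scheme: str = "https") -> str:
    """Return an absolute URL for a link found on the page at `source`.

    Hypothetical helper for illustration only, not the PR's implementation.
    """
    link = link.strip()

    # Already absolute with an explicit scheme: leave untouched.
    if urlparse(link).scheme in ("http", "https"):
        return link

    # Leading slash: treat as relative and resolve against the source URL.
    if link.startswith("/"):
        return urljoin(source, link)

    # Assumption: anything else ("www.website.com/...", "website.com/...")
    # is a scheme-less absolute URL, so only a scheme is prepended.
    return f"{default_scheme}://{link}"


source = "https://example.com/blog/post"

# Passing a scheme-less absolute URL straight to urljoin treats it as a path:
naive = urljoin("https://example.com/blog/", "www.website.com/page")
# naive == "https://example.com/blog/www.website.com/page"

relative = normalize_link("/test/page.html", source)
query_only = normalize_link("/?test=test", source)
schemeless = normalize_link("www.website.com/page", source)
```

With this split, only links that are genuinely relative reach urljoin; scheme-less absolute links are completed separately instead of being appended to the source path.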

Collaborator

@VinciGit00 VinciGit00 left a comment

Hi, thank you for the contribution

@VinciGit00 VinciGit00 merged commit adddd64 into ScrapeGraphAI:pre/beta Nov 26, 2024
1 check passed

🎉 This PR is included in version 1.32.0-beta.3 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions bot commented Dec 5, 2024

🎉 This PR is included in version 1.33.0-beta.1 🎉

The release is available on:

Your semantic-release bot 📦🚀

github-actions bot commented Dec 5, 2024

🎉 This PR is included in version 1.33.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

2 participants