
Commit 2ccb608

Merge pull request #129 from VinciGit00/main
reallignement
2 parents c11331a + 35ae76f commit 2ccb608

61 files changed: +977 -611 lines changed

CHANGELOG.md

Lines changed: 62 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -1,17 +1,75 @@
1-
## [0.5.0-beta.8](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.7...v0.5.0-beta.8) (2024-05-02)
1+
## [0.6.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.2...v0.6.0) (2024-05-02)
22

33

44
### Features
55

6+
* added node and graph for CSV scraping ([4d542a8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4d542a88f7d949a5ba360dcd880716c8110a5d14))
67
* Allow end users to pass model instances for llm and embedding model ([b86aac2](https://github.com/VinciGit00/Scrapegraph-ai/commit/b86aac2188887642564a34d13d55d0fcff220ec1))
8+
* modified node name ([02d1af0](https://github.com/VinciGit00/Scrapegraph-ai/commit/02d1af006cb89bf860ee4f1186f582e2049a8e3d))
9+
10+
11+
### CI
12+
13+
* **release:** 0.5.0-beta.7 [skip ci] ([40b2a34](https://github.com/VinciGit00/Scrapegraph-ai/commit/40b2a346d57865ca21915ecaa658096c52a2cc6b))
14+
* **release:** 0.5.0-beta.8 [skip ci] ([c11331a](https://github.com/VinciGit00/Scrapegraph-ai/commit/c11331a26ac325dfcf489272442ceeed13225a39))
15+
16+
## [0.5.2](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.1...v0.5.2) (2024-05-02)
17+
18+
19+
### Bug Fixes
720

8-
## [0.5.0-beta.7](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.6...v0.5.0-beta.7) (2024-05-01)
21+
* bug on script_creator_graph.py ([4a3bc37](https://github.com/VinciGit00/Scrapegraph-ai/commit/4a3bc37f2fbb24953edd68f28234ff14302ac120))
22+
23+
## [0.5.1](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0...v0.5.1) (2024-05-02)
24+
25+
26+
### Bug Fixes
27+
28+
* examples and graphs ([5cf4e4f](https://github.com/VinciGit00/Scrapegraph-ai/commit/5cf4e4f92f024041c44211aebd2e3bdf73351a00))
29+
30+
31+
### Docs
32+
33+
* added venv suggestion ([ba2b24b](https://github.com/VinciGit00/Scrapegraph-ai/commit/ba2b24b4cd82d63f9235051eb0e95519c51fd639))
34+
* base and fetch node ([e981796](https://github.com/VinciGit00/Scrapegraph-ai/commit/e9817963c8e98e35662cc5a140b0348792d25307))
35+
* change contributing.md with new ci/cd workflow ([3e91a46](https://github.com/VinciGit00/Scrapegraph-ai/commit/3e91a46522ab1f6b2f733efd234b06df4687c695))
36+
* fixed basegraph docstring ([29427c2](https://github.com/VinciGit00/Scrapegraph-ai/commit/29427c233485816967c4ecd6c1951351be9b27ce))
37+
* graphs and helpers docstrings ([0631985](https://github.com/VinciGit00/Scrapegraph-ai/commit/0631985e6156bd21ec5317faff9e345c8aa7f88b))
38+
* refactor examples ([c11fc28](https://github.com/VinciGit00/Scrapegraph-ai/commit/c11fc288963e1a2818e451279a3bf53eb33e22be))
39+
* refactor models docstrings ([18c20eb](https://github.com/VinciGit00/Scrapegraph-ai/commit/18c20eb03de183a0311be5ffe21f53ec4edf1b87))
40+
* refactor nodes docstrings ([1409797](https://github.com/VinciGit00/Scrapegraph-ai/commit/140979747598210674131befadd786800c9fb5ec))
41+
* update utils docstrings ([cf038b3](https://github.com/VinciGit00/Scrapegraph-ai/commit/cf038b33eaae42f65d7d9c782b5729092b272dd0))
42+
43+
## [0.5.0](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.4.1...v0.5.0) (2024-04-30)
944

1045

1146
### Features
1247

13-
* added node and graph for CSV scraping ([4d542a8](https://github.com/VinciGit00/Scrapegraph-ai/commit/4d542a88f7d949a5ba360dcd880716c8110a5d14))
14-
* modified node name ([02d1af0](https://github.com/VinciGit00/Scrapegraph-ai/commit/02d1af006cb89bf860ee4f1186f582e2049a8e3d))
48+
* add cluade integration ([e0ffc83](https://github.com/VinciGit00/Scrapegraph-ai/commit/e0ffc838b06c0f024026a275fc7f7b4243ad5cf9))
49+
* add co-author ([719a353](https://github.com/VinciGit00/Scrapegraph-ai/commit/719a353410992cc96f46ec984a5d3ec372e71ad2))
50+
* **fetch:** added playwright support ([42ab0aa](https://github.com/VinciGit00/Scrapegraph-ai/commit/42ab0aa1d275b5798ab6fc9feea575fe59b6e767))
51+
* added verbose flag to suppress print statements ([2dd7817](https://github.com/VinciGit00/Scrapegraph-ai/commit/2dd7817cfb37cfbeb7e65b3a24655ab238f48026))
52+
* base groq + requirements + toml update with groq ([7dd5b1a](https://github.com/VinciGit00/Scrapegraph-ai/commit/7dd5b1a03327750ffa5b2fb647eda6359edd1fc2))
53+
* **refactor:** changed variable names ([8fba7e5](https://github.com/VinciGit00/Scrapegraph-ai/commit/8fba7e5490f916b325588443bba3fff5c0733c17))
54+
* **llm:** implemented groq model ([dbbf10f](https://github.com/VinciGit00/Scrapegraph-ai/commit/dbbf10fc77b34d99d64c6cd7f74524b6d8e57fa5))
55+
* updated requirements.txt ([d368725](https://github.com/VinciGit00/Scrapegraph-ai/commit/d36872518a6d234eba5f8b7ddca7da93797874b2))
56+
57+
58+
### Bug Fixes
59+
60+
* script generator and add new benchmarks ([e3d0194](https://github.com/VinciGit00/Scrapegraph-ai/commit/e3d0194dc93b20dc254fc48bba11559bf8a3a185))
61+
62+
63+
### CI
64+
65+
* **release:** 0.4.0-beta.3 [skip ci] ([d13321b](https://github.com/VinciGit00/Scrapegraph-ai/commit/d13321b2f86d98e2a3a0c563172ca0dd29cdf5fb))
66+
* **release:** 0.5.0-beta.1 [skip ci] ([450291f](https://github.com/VinciGit00/Scrapegraph-ai/commit/450291f52e48cd35b2b8cc50ff66f5336326fa25))
67+
* **release:** 0.5.0-beta.2 [skip ci] ([ff7d12f](https://github.com/VinciGit00/Scrapegraph-ai/commit/ff7d12f1389d8eed87e9f6b2fc8b099767a904a9))
68+
* **release:** 0.5.0-beta.3 [skip ci] ([7e81f7c](https://github.com/VinciGit00/Scrapegraph-ai/commit/7e81f7c03f79c43219743be52affabbaf0d66387))
69+
* **release:** 0.5.0-beta.4 [skip ci] ([14e56f6](https://github.com/VinciGit00/Scrapegraph-ai/commit/14e56f6ab1711a08e749edbda860d349db491dae))
70+
* **release:** 0.5.0-beta.5 [skip ci] ([5ac97e2](https://github.com/VinciGit00/Scrapegraph-ai/commit/5ac97e2fb321be40c9787fbf8cb53fa62cf0ce06))
71+
* **release:** 0.5.0-beta.6 [skip ci] ([9356124](https://github.com/VinciGit00/Scrapegraph-ai/commit/9356124ce39568e88f7d2965181579c4ff0a5752))
72+
1573

1674
## [0.5.0-beta.6](https://github.com/VinciGit00/Scrapegraph-ai/compare/v0.5.0-beta.5...v0.5.0-beta.6) (2024-04-30)
1775

CONTRIBUTING.md

Lines changed: 16 additions & 6 deletions
@@ -15,22 +15,31 @@ Thank you for your interest in contributing to **ScrapeGraphAI**! We welcome con
1515

1616
To get started with contributing, follow these steps:
1717

18-
1. Fork the repository on GitHub.
18+
1. Fork the repository on GitHub **(FROM pre/beta branch)**.
1919
2. Clone your forked repository to your local machine.
20-
3. Install the necessary dependencies.
20+
3. Install the necessary dependencies from requirements.txt or via pyproject.toml as you prefere :).
2121
4. Make your changes or additions.
2222
5. Test your changes thoroughly.
2323
6. Commit your changes with descriptive commit messages.
2424
7. Push your changes to your forked repository.
25-
8. Submit a pull request to the main repository.
25+
8. Submit a pull request to the pre/beta branch.
26+
27+
N.B All the pull request to the main branch will be rejected!
2628

2729
## Contributing Guidelines
2830

2931
Please adhere to the following guidelines when contributing to ScrapeGraphAI:
3032

3133
- Follow the code style and formatting guidelines specified in the [Code Style](#code-style) section.
32-
- Make sure your changes are well-documented and include any necessary updates to the project's documentation.
33-
- Write clear and concise commit messages that describe the purpose of your changes.
34+
- Make sure your changes are well-documented and include any necessary updates to the project's documentation and requirements if needed.
35+
- Write clear and concise commit messages that describe the purpose of your changes and the last commit before the pull request has to follow the following format:
36+
- `feat: Add new feature`
37+
- `fix: Correct issue with existing feature`
38+
- `docs: Update documentation`
39+
- `style: Improve formatting and style`
40+
- `refactor: Restructure code`
41+
- `test: Add or update tests`
42+
- `perf: Improve performance`
3443
- Be respectful and considerate towards other contributors and maintainers.
3544

3645
## Code Style
@@ -42,6 +51,7 @@ Please make sure to format your code accordingly before submitting a pull reques
4251
- [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/)
4352
- [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html)
4453
- [The Hitchhiker's Guide to Python](https://docs.python-guide.org/writing/style/)
54+
- [Pylint style of code for the documentation](https://pylint.pycqa.org/en/1.6.0/tutorial.html)
4555

4656
## Submitting a Pull Request
4757

@@ -53,7 +63,7 @@ To submit your changes for review, please follow these steps:
5363
4. Select your forked repository and the branch containing your changes.
5464
5. Provide a descriptive title and detailed description for your pull request.
5565
6. Reviewers will provide feedback and discuss any necessary changes.
56-
7. Once your pull request is approved, it will be merged into the main repository.
66+
7. Once your pull request is approved, it will be merged into the pre/beta branch.
5767

5868
## Reporting Issues
5969
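
The commit-message prefixes added to CONTRIBUTING.md in the hunk above (`feat:`, `fix:`, `docs:`, `style:`, `refactor:`, `test:`, `perf:`) can be checked locally before opening a pull request. The following is a minimal sketch, not part of the repository: the function name `is_valid_commit_message`, the regular expression, and the optional `(scope)` group are assumptions made for illustration.

```python
import re

# Prefixes copied from the list added to CONTRIBUTING.md in this commit.
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "perf")

# Hypothetical pattern: "<type>: <summary>".
# Assumption: an optional "(scope)" after the type is also accepted.
COMMIT_PATTERN = re.compile(
    r"^(?:" + "|".join(ALLOWED_TYPES) + r")(?:\([\w\-]+\))?: \S.*$"
)


def is_valid_commit_message(message: str) -> bool:
    """Return True when the first line of `message` follows the convention."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return COMMIT_PATTERN.match(first_line) is not None
```

Only the first line is inspected, so a longer body after the summary does not affect the result. A check like this could be called from a local Git hook, although the repository itself may enforce the convention differently.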

README.md

Lines changed: 3 additions & 0 deletions
@@ -27,6 +27,9 @@ you will also need to install Playwright for javascript-based scraping:
2727
```bash
2828
playwright install
2929
```
30+
31+
**Note**: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries 🐱
32+
3033
## 🔍 Demo
3134
Official streamlit demo:
3235

examples/groq/.env.example

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
1-
GROQ_APIKEY= "your groq key"
1+
GROQ_APIKEY= "your groq key"
2+
OPENAI_APIKEY="your openai api key"
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
"""
2+
Basic example of scraping pipeline using SmartScraper
3+
"""
4+
5+
import os
6+
from dotenv import load_dotenv
7+
from scrapegraphai.graphs import SmartScraperGraph
8+
from scrapegraphai.utils import prettify_exec_info
9+
10+
load_dotenv()
11+
12+
13+
# ************************************************
14+
# Define the configuration for the graph
15+
# ************************************************
16+
17+
groq_key = os.getenv("GROQ_APIKEY")
18+
openai_key = os.getenv("OPENAI_APIKEY")
19+
20+
graph_config = {
21+
"llm": {
22+
"model": "groq/gemma-7b-it",
23+
"api_key": groq_key,
24+
"temperature": 0
25+
},
26+
"embeddings": {
27+
"api_key": openai_key,
28+
"model": "gpt-3.5-turbo",
29+
},
30+
"headless": False
31+
}
32+
33+
# ************************************************
34+
# Create the SmartScraperGraph instance and run it
35+
# ************************************************
36+
37+
smart_scraper_graph = SmartScraperGraph(
38+
prompt="List me all the projects with their description.",
39+
# also accepts a string with the already downloaded HTML code
40+
source="https://perinim.github.io/projects/",
41+
config=graph_config
42+
)
43+
44+
result = smart_scraper_graph.run()
45+
print(result)
46+
47+
# ************************************************
48+
# Get graph execution info
49+
# ************************************************
50+
51+
graph_exec_info = smart_scraper_graph.get_execution_info()
52+
print(prettify_exec_info(graph_exec_info))

examples/openai/custom_graph_openai.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -40,6 +40,7 @@
4040
fetch_node = FetchNode(
4141
input="url | local_dir",
4242
output=["doc"],
43+
node_config={"headless": True, "verbose": True}
4344
)
4445
parse_node = ParseNode(
4546
input="doc",

examples/openai/scrape_plain_text_openai.py

Lines changed: 0 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -53,11 +53,3 @@
5353

5454
graph_exec_info = smart_scraper_graph.get_execution_info()
5555
print(prettify_exec_info(graph_exec_info))
56-
57-
58-
# ************************************************
59-
# Get graph execution info
60-
# ************************************************
61-
62-
graph_exec_info = smart_scraper_graph.get_execution_info()
63-
print(prettify_exec_info(graph_exec_info))

examples/openai/script_generator_openai.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
"api_key": openai_key,
2121
"model": "gpt-3.5-turbo",
2222
},
23-
"library": "beautifoulsoup"
23+
"library": "beautifulsoup"
2424
}
2525

2626
# ************************************************

examples/openai/xml_scraper_openai.py

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -23,13 +23,14 @@
2323
# Define the configuration for the graph
2424
# ************************************************
2525

26-
gemini_key = os.getenv("GOOGLE_APIKEY")
26+
openai_key = os.getenv("OPENAI_APIKEY")
2727

2828
graph_config = {
2929
"llm": {
30-
"api_key": gemini_key,
30+
"api_key": openai_key,
3131
"model": "gpt-3.5-turbo",
3232
},
33+
"verbose":False,
3334
}
3435

3536
# ************************************************

examples/single_node/fetch_node.py

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,9 @@
1212
robots_node = FetchNode(
1313
input="url | local_dir",
1414
output=["doc"],
15+
node_config={
16+
"headless": False
17+
}
1518
)
1619

1720
# ************************************************

examples/single_node/robot_node.py

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,9 @@
2626
robots_node = RobotsNode(
2727
input="url",
2828
output=["is_scrapable"],
29-
node_config={"llm": llm_model}
29+
node_config={"llm": llm_model,
30+
"headless": False
31+
}
3032
)
3133

3234
# ************************************************

pyproject.toml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,8 @@
11
[tool.poetry]
22
name = "scrapegraphai"
33

4-
version = "0.5.0b8"
4+
version = "0.6.0"
5+
56

67
description = "A web scraping library based on LangChain which uses LLM and direct graph logic to create scraping pipelines."
78
authors = [

scrapegraphai/builders/graph_builder.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
"""
2-
Module for making the graph building
2+
GraphBuilder Module
33
"""
44

55
from langchain_core.prompts import ChatPromptTemplate

scrapegraphai/graphs/__init__.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
"""
22
__init__.py file for graphs folder
33
"""
4+
45
from .base_graph import BaseGraph
56
from .smart_scraper_graph import SmartScraperGraph
67
from .speech_graph import SpeechGraph

scrapegraphai/graphs/abstract_graph.py

Lines changed: 51 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,7 @@
11
"""
2-
Module having abstract class for creating all the graphs
2+
AbstractGraph Module
33
"""
4+
45
from abc import ABC, abstractmethod
56
from typing import Optional
67
from ..models import OpenAI, Gemini, Ollama, AzureOpenAI, HuggingFace, Groq
@@ -9,13 +10,34 @@
910

1011
class AbstractGraph(ABC):
1112
"""
12-
Abstract class representing a generic graph-based tool.
13+
Scaffolding class for creating a graph representation and executing it.
14+
15+
Attributes:
16+
prompt (str): The prompt for the graph.
17+
source (str): The source of the graph.
18+
config (dict): Configuration parameters for the graph.
19+
llm_model: An instance of a language model client, configured for generating answers.
20+
embedder_model: An instance of an embedding model client, configured for generating embeddings.
21+
verbose (bool): A flag indicating whether to show print statements during execution.
22+
headless (bool): A flag indicating whether to run the graph in headless mode.
23+
24+
Args:
25+
prompt (str): The prompt for the graph.
26+
config (dict): Configuration parameters for the graph.
27+
source (str, optional): The source of the graph.
28+
29+
Example:
30+
>>> class MyGraph(AbstractGraph):
31+
... def _create_graph(self):
32+
... # Implementation of graph creation here
33+
... return graph
34+
...
35+
>>> my_graph = MyGraph("Example Graph", {"llm": {"model": "gpt-3.5-turbo"}}, "example_source")
36+
>>> result = my_graph.run()
1337
"""
1438

1539
def __init__(self, prompt: str, config: dict, source: Optional[str] = None):
16-
"""
17-
Initializes the AbstractGraph with a prompt, file source, and configuration.
18-
"""
40+
1941
self.prompt = prompt
2042
self.source = source
2143
self.config = config
@@ -32,6 +54,7 @@ def __init__(self, prompt: str, config: dict, source: Optional[str] = None):
3254
self.final_state = None
3355
self.execution_info = None
3456

57+
3558
def _set_model_token(self, llm):
3659

3760
if 'Azure' in str(type(llm)):
@@ -43,8 +66,18 @@ def _set_model_token(self, llm):
4366

4467
def _create_llm(self, llm_config: dict, chat=False) -> object:
4568
"""
46-
Creates an instance of the language model (OpenAI or Gemini) based on configuration.
69+
Create a large language model instance based on the configuration provided.
70+
71+
Args:
72+
llm_config (dict): Configuration parameters for the language model.
73+
74+
Returns:
75+
object: An instance of the language model client.
76+
77+
Raises:
78+
KeyError: If the model is not supported.
4779
"""
80+
4881
llm_defaults = {
4982
"temperature": 0,
5083
"streaming": False
@@ -119,16 +152,27 @@ def _create_llm(self, llm_config: dict, chat=False) -> object:
119152

120153
def get_state(self, key=None) -> dict:
121154
"""""
122-
Obtain the current state
155+
Get the final state of the graph.
156+
157+
Args:
158+
key (str, optional): The key of the final state to retrieve.
159+
160+
Returns:
161+
dict: The final state of the graph.
123162
"""
163+
124164
if key is not None:
125165
return self.final_state[key]
126166
return self.final_state
127167

128168
def get_execution_info(self):
129169
"""
130170
Returns the execution information of the graph.
171+
172+
Returns:
173+
dict: The execution information of the graph.
131174
"""
175+
132176
return self.execution_info
133177

134178
@abstractmethod
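
The `get_state` docstring in the hunk above describes an optional `key` argument. The sketch below isolates that behaviour in a hypothetical stand-in class (`FinalStateHolder` is not part of scrapegraphai); only the body of `get_state` is copied from the diff, and the sample state keys are invented for illustration.

```python
from typing import Any, Optional


class FinalStateHolder:
    """Hypothetical stand-in that mirrors only get_state() from the diff."""

    def __init__(self, final_state: Optional[dict] = None):
        # In AbstractGraph, final_state starts as None in __init__.
        self.final_state = final_state

    def get_state(self, key: Optional[str] = None) -> Any:
        # Same logic as the method shown in the diff.
        if key is not None:
            return self.final_state[key]
        return self.final_state


# Illustrative state; the keys are assumptions, not the library's real keys.
holder = FinalStateHolder({"answer": {"projects": []}, "doc": "<html></html>"})
whole_state = holder.get_state()          # the full state dictionary
answer_only = holder.get_state("answer")  # just the value stored under "answer"
```

Two consequences follow from the code as written: passing a key returns the stored value, which is not necessarily a `dict` despite the `-> dict` annotation, and calling `get_state("answer")` while `final_state` is still `None` raises a `TypeError`.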
