Description
```python
import json

from ollama import Client
from scrapegraphai.graphs import SmartScraperGraph

ollama_client = Client(host="http://localhost:11434")

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "model": "ollama/llama3.2",
        "temperature": 0.0,
        "format": "json",
        "model_tokens": 4096,
        "base_url": "http://localhost:11434",
    },
    "embeddings": {
        "model": "nomic-embed-text",
    },
}

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract me all the news from the website along with headlines",
    source="https://www.bbc.com/",
    config=graph_config
)

# Run the pipeline
result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))
```
Output

```
from langchain_community.callbacks.manager import get_openai_callback
You can use the langchain cli to automatically upgrade many imports. Please see documentation here https://python.langchain.com/docs/versions/v0_2/
from langchain.callbacks import get_openai_callback
Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024). Running this sequence through the model will result in indexing errors

{
    "headlines": [
        "Life is not easy - Haaland penalty miss sums up Man City crisis",
        "How a 1990s Swan Lake changed dance forever"
    ],
    "articles": [
        {
            "title": "BBC News",
            "url": "https://www.bbc.com/news/world-europe-63711133"
        },
        {
            "title": "Matthew Bourne on his male Swan Lake - the show that shook up the dance world",
            "url": "https://www.bbc.com/culture/article/20241126-matthew-bourne-on-his-male-swan-lake-the-show-that-shook-up-the-dance-world-forever"
        }
    ]
}
```
Even after setting `model_tokens: 4096`, it does not affect the model's maximum sequence length (1024). How can I increase it? Alternatively, how can I split the website content into chunks no larger than the max sequence length, so that I can scrape the whole website?
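To make the chunking idea concrete, here is a minimal sketch of splitting page text into overlapping chunks that fit a 1024-token window, approximating tokens by whitespace-separated words (the real tokenizer counts differently, and ScrapeGraphAI's internal chunking may not work this way). `chunk_text` and its parameters are hypothetical names for illustration:

```python
# Hypothetical sketch: split long page text into chunks of at most
# `max_tokens` words, with `overlap` words shared between neighbours
# so context is not cut mid-thought. Word count only approximates the
# model tokenizer's token count.
def chunk_text(text, max_tokens=1024, overlap=64):
    words = text.split()
    step = max_tokens - overlap  # advance by this many words per chunk
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
    return chunks

# Example: a 3000-word page yields 4 chunks, each within the 1024 limit
page = ("word " * 3000).strip()
chunks = chunk_text(page, max_tokens=1024, overlap=64)
```

Each chunk could then be sent through the scraping pipeline separately and the JSON results merged afterwards.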
PS: It would also be great to have an option to follow the extracted links and scrape the subsequent pages. Thanks!
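The crawling part could be approximated by hand while waiting for built-in support: extract the links from a fetched page and feed each absolute URL back into a new `SmartScraperGraph`. A minimal sketch using only the standard library (the `LinkCollector` class and the sample HTML are illustrative assumptions):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from all <a href=...> tags in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's base URL
                self.links.append(urljoin(self.base_url, href))

# Illustrative HTML snippet standing in for a fetched page
html = '<a href="/news/article-1">One</a> <a href="https://example.com/2">Two</a>'
collector = LinkCollector("https://www.bbc.com/")
collector.feed(html)
# collector.links now holds absolute URLs that could each become the
# `source` of a follow-up SmartScraperGraph run.
```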
Ubuntu 22.04 LTS
GPU: RTX 4070, 12 GB VRAM
RAM: 16 GB DDR5
Model: Ollama / Llama3.2:3B