Skip to content

I am trying to run SmartScraperGraph() using ollama with llama3.2 model but i am getting this warning that "Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024)." and the whole website is not being scraped. #853

Closed
@GODCREATOR333

Description

@GODCREATOR333

import json
from scrapegraphai.graphs import SmartScraperGraph
from ollama import Client

ollama_client = Client(host='http://localhost:11434')

Define the configuration for the scraping pipeline

graph_config = {
"llm": {
"model": "ollama/llama3.2",
"temperature": 0.0,
"format": "json",
"model_tokens": 4096,
"base_url": "http://localhost:11434",
},
"embeddings": {
"model": "nomic-embed-text",
},
}

Create the SmartScraperGraph instance

smart_scraper_graph = SmartScraperGraph(
prompt="Extract me all the news from the website along with headlines",
source="https://www.bbc.com/",
config=graph_config
)

Run the pipeline

result = smart_scraper_graph.run()
print(json.dumps(result, indent=4))

Output*********************************************

from langchain_community.callbacks.manager import get_openai_callback
You can use the langchain cli to automatically upgrade many imports. Please see documentation here https://python.langchain.com/docs/versions/v0_2/
from langchain.callbacks import get_openai_callback
Token indices sequence length is longer than the specified maximum sequence length for this model (7678 > 1024). Running this sequence through the model will result in indexing errors
{
"headlines": [
"Life is not easy - Haaland penalty miss sums up Man City crisis",
"How a 1990s Swan Lake changed dance forever"
],
"articles": [
{
"title": "BBC News",
"url": "https://www.bbc.com/news/world-europe-63711133"
},
{
"title": "Matthew Bourne on his male Swan Lake - the show that shook up the dance world",
"url": "https://www.bbc.com/culture/article/20241126-matthew-bourne-on-his-male-swan-lake-the-show-that-shook-up-the-dance-world-forever"
}
]
}

Even after specifying that model tokens = 4096 it is not effecting its maximum sequence length(1024). How can i increase it ? How can i chunk the website into size of its max_sequence_length so that i can scrape the whole website.

PS: Also having the option to further crawl the links and scrape subsequent websites would be great. Thanks

Ubuntu 22.04 LTS
GPU : RTX 4070 12GB VRAM
RAM : 16GB DDR5

Ollama/Llama3.2:3B model

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions