ai.smithery/arjunkmrm-scrapermcp_el
Extract and parse web pages into clean HTML, links, or Markdown. Handle dynamic, complex, or block…
Tools: 0 · Findings: 10 · Stars: — · Downloads: — · Last Scanned: Mar 19, 2026

Findings (10)
3 critical · 6 high · 0 medium · 1 low · 0 informational
critical · K8 · Cross-Boundary Credential Sharing · MCP05-privilege-escalation · AML.T0054
Pattern "(forward|pass|send|relay|proxy|propagate)[_\s-]?(token|credential|api[_\s-]?key|secret|password|auth)" matched in source_code: "proxy_password" (at position 1208)
Never forward, share, or embed credentials across trust boundaries. Use OAuth token exchange (RFC 8693) to create scoped, delegated tokens instead of passing original credentials. Never include credentials in tool responses. Required by ISO 27001 A.5.17 and OWASP ASI03.
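The token-exchange alternative named above can be sketched with the standard library. This is a hypothetical illustration of the RFC 8693 request body a server would POST to its authorization server's token endpoint; the scope value and function name are assumptions, not part of this server's code.

```python
import urllib.parse

def build_token_exchange_form(subject_token: str, scope: str) -> str:
    """URL-encoded body for an RFC 8693 token-exchange request.

    Instead of forwarding the caller's original credential downstream,
    the server trades it for a narrowly scoped, delegated token.
    """
    return urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": subject_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        # Request only the narrow scope the downstream call needs:
        "scope": scope,
    })
```

The resulting form would be POSTed to the authorization server's token endpoint; the original credential never leaves the trust boundary.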
critical · C1 · Command Injection · MCP03-command-injection · AML.T0054
Pattern "`[^`]+`" matched in source_code: "`",
            "{", "}", "[", "]", "'",
        ]:
            thor_mcp_clean_url = thor_mcp_clean_url.replace(char, "_")
        # Limit total filename length to 200 characters
        thor_mcp_max_length = 200 - len(thor_mcp_today) - 1  # Subtract date and separator length
        thor_mcp_htmlName = f"{thor_mcp_today}_{thor_mcp_clean_url[:thor_mcp_max_length]}"
        thor_mcp_filename = f"{thor_mcp_save_dir}/{thor_mcp_htmlName}.html"
        if os.path.exists(thor_mcp_filename):
            try:
                with open(thor_mcp_filename, "r", encoding="utf-8") as f:
                    thor_mcp_html = f.read()
                print(f"Read HTML from local cache: {thor_mcp_filename}")
            except IOError as e:
                raise ToolError(f"Failed to read cache file")
        else:
            thor_mcp_html = await scrape(url, thor_mcp_myProxyConfig)
            if not thor_mcp_html:
                raise ToolError(f"Web scraping failed, unable to get content")
            try:
                with open(thor_mcp_filename, "w", encoding="utf-8") as f:
                    f.write(thor_mcp_html)
                print(f"HTML saved to {thor_mcp_filename}")
            except IOError as e:
                raise ToolError(f"Failed to save HTML file")
    else:
        # When cache is disabled, still save HTML but use different directory and second-level timestamp filename
        thor_mcp_now = datetime.now().strftime("%Y%m%d%H%M%S")
        thor_mcp_save_dir = "html_temp"
        os.makedirs(thor_mcp_save_dir, exist_ok=True)
        # Clean special characters from URL and limit filename length
        thor_mcp_clean_url = url.split("//")[-1]
        for char in [
            "?", ",", "/", "\\", ":", "*", '"', "<", ">", "|",
            "%", "=", "&", "+", ";", "@", "#", "$", "^", "`" (at position 4949)
Replace exec()/execSync() with execFile() and pass arguments as an array, never as a string. Validate all inputs against an allowlist before use in any shell context. For subprocess.run, always pass a list and shell=False.
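In Python terms, the safe pattern above looks like the sketch below: arguments as a list, `shell=False`, and an allowlist check before anything reaches a subprocess. The tool name `mytool` and its subcommands are hypothetical placeholders, not taken from this server.

```python
import subprocess

ALLOWED_SUBCOMMANDS = {"fetch", "render"}  # explicit allowlist (hypothetical)

def run_tool(subcommand: str, target: str) -> str:
    """Run a hypothetical CLI safely: validated subcommand, argv-list form."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"subcommand not allowed: {subcommand!r}")
    # List form with shell=False passes `target` as a single argv entry;
    # backticks, semicolons, $(...) etc. are never interpreted by a shell.
    result = subprocess.run(
        ["mytool", subcommand, target],
        shell=False,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because no shell ever parses the arguments, a URL containing `` `rm -rf /` `` arrives downstream as inert text.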
critical · Q13 · MCP Bridge Package Supply Chain Attack · MCP10-supply-chain · AML.T0054
Pattern "(?:mcp|fastmcp|langchain-mcp|llama-index-mcp)(?:>=|~=|==)?(?!\d)" matched in source_code: "fastmcp" (at position 475)
MCP bridge packages (mcp-remote, mcp-proxy, @modelcontextprotocol/sdk, fastmcp) are high-value supply chain targets — CVE-2025-6514 (CVSS 9.6) in mcp-remote affected 437,000+ installs. Always pin exact versions (no ^ or ~ ranges). Use lockfiles (package-lock.json, pnpm-lock.yaml, uv.lock). Never run `npx mcp-remote` without version pinning. Verify package integrity with `npm audit` or `pip-audit` before deployment. Reference: CVE-2025-6514, OWASP ASI04.
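The pinning advice above amounts to a requirements file of exact versions. This is an illustrative fragment only — the version numbers are examples, not a vetted set; pin whatever releases you have actually audited.

```
# requirements.txt — exact pins, no ^ / ~ / >= ranges
fastmcp==2.10.1    # illustrative pin; use the latest audited release
aiohttp==3.10.11   # the scanned 3.9.0 carries known CVEs (finding D1)
mcp==1.9.4         # illustrative pin
```

Combined with a lockfile (e.g. uv.lock) and a pre-deployment `pip-audit` run, this closes the unpinned-range attack surface the finding describes.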
high · D1 · Known CVEs in Dependencies · MCP08-dependency-vuln
Dependency "aiohttp@3.9.0" has known CVEs:
Update dependencies to versions that patch known CVEs. Run 'npm audit fix' or 'pip-audit' to identify and resolve vulnerable dependencies.
high · D1 · Known CVEs in Dependencies · MCP08-dependency-vuln
Dependency "mcp@null" has known CVEs:
Update dependencies to versions that patch known CVEs. Run 'npm audit fix' or 'pip-audit' to identify and resolve vulnerable dependencies.
high · O8 · Timing-Based Covert Channel · MCP04-data-exfiltration · AML.T0057
Pattern "(?:delay|sleep|timeout|interval)\s*[:=]\s*(?:[^;]*(?:secret|token|password|credential|key|env))" matched in source_code: "timeout = ClientTimeout(total=120)
    # Create asynchronous HTTP client session
    async with aiohttp.ClientSession(
        headers=headers,
        timeout=timeout,
        connector=aiohttp.TCPConnector(),
        max_field_size=32768,
    ) as session:
        try:
            # Use session to initiate GET request
            async with session.get(
                url,  # Target URL
                proxy=proxy,  # Use proxy
                proxy_auth=proxy_auth,  # Use proxy authentication
                ssl=False,  # Disable SSL verification
            ) as response:
                # Check if response status code is 200 (success)
                if response.status == 200:
                    # Return response text content
                    return await response.text()
                else:
                    # Construct error message containing status code and URL
                    error_msg = f"Status code: {response.status}, URL: {url}"
                    # Throw retry exception
                    raise ScrapeRetryException(error_msg)
        except aiohttp.ClientError as e:
            error_msg = f"HTTP client error"
            raise ScrapeRetryException(error_msg)
        except asyncio.TimeoutError:
            error_msg = (
                f"Request timeout: 60 seconds"
            )
            raise ScrapeRetryException(error_msg)
        except Exception as e:
            error_msg = f"Unknown error:"
            raise ScrapeRetryException(error_msg)

async def scrape(url: str, myProxyConfig: ProxyConfig) -> str:
    """
    Web scraping method
    Parameters:
        url: URL address to scrape
        myProxyConfig: Proxy configuration object
    Returns:
        Returns web page content text on success, returns empty string on failure
    """
    try:
        result = await scrape_with_retry(url, myProxyConfig)
        return result
    except ScrapeRetryException:
        return ""

def clean_html(html: str) -> str:
    """Clean HTML string"""
    cleaner = Cleaner(
        scripts=True,
        kill_tags=["nav", "svg", "footer", "noscript", "script", "form"],
        style=True,
        remove_tags=[],
        safe_attrs=list(defs.safe_attrs) + ["idx"],
        inline_style=True,
        links=True,
        meta=False,
        embedded=True,
        frames=False,
        forms=False,
        annoying_tags=False,
        page_structure=False,
        javascript=True,
        comments=True,
    )
    return cleaner.clean_html(html)

def strip_html(thor_mcp_html: str) -> str:
    """Simplify HTML string, remove unnecessary elements, attributes and redundant content"""
    import re
    # Call clean_html function for initial cleaning (assuming the function is already defined externally)
    thor_mcp_cleaned_html = clean_html(thor_mcp_html)
    # Parse the cleaned HTML string into XML tree structure
    thor_mcp_html_tree = fromstring(thor_mcp_cleaned_html)
    # Traverse all elements in the HTML tree (including nested descendant elements)
    for thor_mcp_element in thor_mcp_html_tree.iter():
        # Remove style attribute (inline styles) from all elements
        if "style" in thor_mcp_element.attrib:
            del thor_mcp_element.attrib["style"]  # Use del statement to delete element attribute
        if (
            (
                not thor_mcp_element.attrib  # No attributes
                or (len(thor_mcp_element.attrib) == 1 and "idx" in thor_mcp_element.attrib)  # Or only contains idx attribute
            )
            and not thor_mcp_element.getchildren()  # No child elements
            and (not thor_mcp_element.text or not thor_mcp_element.text.strip())  # No text or blank text
            and (not thor_mcp_element.tail or not thor_mcp_element.tail.strip())  # No tail text or blank tail
        ):
            # Get parent element (may be None if it's the root element)
            thor_mcp_parent = thor_mcp_element.getparent()
            # Only remove if parent element exists
            if thor_mcp_parent is not None:
                # Remove current element from parent's tree structure
                thor_mcp_parent.remove(thor_mcp_element)
                # Convert processed XML tree back to HTML string
                return tostring(thor_mcp_html_tree, encoding='unicode')
    # Remove elements containing "footer" or "hidden" in class or id
    thor_mcp_xpath_query = (
        ".//*[contains(@class, 'footer') or contains(@id, 'footer') or "
        "contains(@class, 'hidden') or contains(@id, 'hidden')]"
    )
    # Use XPath query to find all elements that need to be removed
    thor_mcp_elements_to_remove = thor_mcp_html_tree.xpath(thor_mcp_xpath_query)
    # Traverse all elements that need to be removed
    for thor_mcp_element in thor_mcp_elements_to_remove:
        # Get the parent element of the current element
        thor_mcp_parent = thor_mcp_element.getparent()
        # Only perform removal if parent element exists
        if thor_mcp_parent is not None:
            # Remove current element from parent element
            thor_mcp_parent.remove(thor_mcp_element)
    # Reserialize HTML tree to string
    thor_mcp_stripped_html = tostring(thor_mcp_html_tree, encoding="unicode")
    # Replace multiple spaces with single space
    thor_mcp_stripped_html = re.sub(r"\s{2,}", " ", thor_mcp_stripped_html)
    # Replace consecutive newlines with empty string
    thor_mcp_stripped_html = re.sub(r"\n{2,}", "", thor_mcp_stripped_html)
    return thor_mcp_stripped_html

def extract_links_with_text(thor_mcp_html: str, thor_mcp_base_url: str | None = None) -> list[str]:
    """
    Extract links with display text from HTML
    Parameters:
        thor_mcp_html (str): Input HTML string
        thor_mcp_base_url (str | None): Base URL for converting relative URLs to absolute URLs
            If None, relative URLs remain unchanged
    Returns:
        list[str]: List of links in format [display text] URL
    """
    # Use lxml's fromstring function to parse HTML string into XML tree structure
    thor_mcp_html_tree = fromstring(thor_mcp_html)
    # Initialize empty list to store formatted links
    thor_mcp_links = []
    # Traverse all <a> tags containing href attribute (XPath selector)
    for thor_mcp_link in thor_mcp_html_tree.xpath("//a[@href]"):
        # Get value of href attribute (link target address)
        thor_mcp_href = thor_mcp_link.get("href")
        # Get all text content within the tag (including child tag text), and remove leading/trailing whitespace
        thor_mcp_text = thor_mcp_link.text_content().strip()
        # Only process when both href and text exist (filter empty links or empty text)
        if thor_mcp_href and thor_mcp_text:
            # Skip empty text or pure whitespace text (although strip() is used, prevent special whitespace characters)
            if not thor_mcp_text:
                continue
            # Skip in-page anchor links (starting with #)
            if thor_mcp_href.startswith("#"):
                continue
            # Skip JavaScript pseudo-links
            if thor_mcp_href.startswith("javascript:"):
                continue
            # Convert URL when base_url is provided and it's a relative path (starting with /)
            if thor_mcp_base_url and thor_mcp_href.startswith("/"):
                # Remove trailing slash from base_url to avoid double slash issue
                thor_mcp_base = thor_mcp_base_url.rstrip("/")
                # Concatenate into absolute URL
                thor_mcp_href = f"{thor_mcp_base}{thor_mcp_href}"
            # Add formatted link to result list: [text] URL
            thor_mcp_links.append(f"[{thor_mcp_text}] {thor_mcp_href}")
    # Return list of all qualified links
    return thor_mcp_links

def get_content(thor_mcp_content: str, thor_mcp_output_format: str) -> str:
    """
    Extract content from response and convert to appropriate format
    Parameters:
        thor_mcp_content: Response content string
        thor_mcp_output_format: Output format ("html", "links", or other formats converted to markdown)
    Returns:
        Formatted content string
    """
    if thor_mcp_output_format == "html":
        return thor_mcp_content
    if thor_mcp_output_format == "links":
        thor_mcp_links = extract_links_with_text(thor_mcp_content)
        return "\n".join(thor_mcp_links)
    # For other formats, simplify HTML content and convert to markdown
    thor_mcp_stripped_html = strip_html(thor_mcp_content)
    return markdownify(thor_mcp_stripped_html)

# Main program entry point (when running this script directly)
if __name__ == "__main__":  # If current script is the main program entry
    # Get the Starlette app and add CORS middleware
    app = mcp.streamable_http_app()  # Get streamable HTTP application instance
    # Add CORS middleware with proper header exposure for MCP session management
    app.add_middleware(  # Add CORS middleware to application
        CORSMiddleware,  # Use CORS middleware class
        allow_origins=["*"],  # Allow cross-origin requests from all origins (should be more restrictive in production)
        allow_credentials=True,  # Allow credentials (such as cookies)
        allow_methods=["GET", "POST", "OPTIONS"],  # Allowed HTTP methods
        allow_headers=["*"],  # Allow all request headers
        expose_headers=["mcp-session-id", "mcp-protocol-version"],  # Allow client to read MCP session ID and protocol version headers
        max_age=86400,  # Preflight request cache time (seconds)
    )
    # Use PORT environment variable
    port = int(os.environ.get("PORT", 8081))  # Get port number from env" (at position 9603)
Remove all code that calculates sleep/delay durations from application data, secrets, or any variable-length content. Tool response times should be constant or determined only by legitimate processing time. If rate limiting is needed, use fixed intervals not derived from data values. Monitor for anomalous response time patterns that could indicate timing-based exfiltration.
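The "fixed intervals not derived from data values" advice can be sketched as a small rate limiter whose schedule is a constant set at construction time, so response timing carries no information about payload content. The class name and interval value are illustrative assumptions.

```python
import asyncio
import time

class FixedIntervalLimiter:
    """Paces requests on a constant schedule independent of any data."""

    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s        # fixed; never computed from data
        self._next_slot = time.monotonic()  # next permitted start time

    async def wait(self) -> None:
        """Sleep until the next fixed slot, then claim it."""
        now = time.monotonic()
        if self._next_slot < now:
            self._next_slot = now           # don't accumulate idle backlog
        delay = self._next_slot - now
        self._next_slot += self.interval_s  # reserve the following slot
        await asyncio.sleep(delay)
```

Because `interval_s` is a constant rather than a value derived from secrets or content length, an observer measuring response times learns nothing beyond the fixed pace.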
high · Q14 · Concurrent MCP Server Race Condition · MCP07-insecure-config · T1068
Pattern "(?:read|write|modify|delete).*(?:file|path|directory)(?!.*(?:lock|mutex|semaphore|flock|atomic))" matched in source_code: "Read HTML from local cache: {thor_mcp_file" (at position 5649)
MCP servers sharing filesystem or database backends with other servers must implement proper concurrency controls. Use: (1) file locking (flock/lockfile) for filesystem operations, (2) database transactions for all read-modify-write sequences, (3) atomic file operations (O_EXCL, mkdtemp) instead of check-then-create, (4) lstat() to detect symlinks before following (CVE-2025-53109). Never assume exclusive access to shared resources — other MCP servers may be operating concurrently.
high · C3 · Server-Side Request Forgery (SSRF) · MCP04-data-exfiltration · AML.T0057
Pattern "session\.(?:get|post|put|delete|patch|request)\s*\([^)]*(?:req|request|input|param|params|args|url|uri|href|target|endpoint|host)" matched in source_code: "session.get(
        url,  # Target URL" (at position 9956)
Validate ALL user-supplied URLs before making HTTP requests:
1. Parse the URL and check the hostname against an explicit allowlist of permitted domains.
2. Block requests to RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
3. Block loopback (127.0.0.0/8), link-local (169.254.0.0/16), and IPv6 equivalents.
4. Block file:// and other non-http(s) protocols explicitly.
5. Disable automatic redirect following, or re-validate each redirect destination.
6. In cloud environments: block requests to IMDS endpoints (169.254.169.254,
metadata.google.internal) at both the application AND network layer.
Example (Node.js): Use the `ssrf-req-filter` package or implement URL validation
against an allowlist before calling fetch/axios/got.
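Steps 1–4 above can be sketched in Python with the standard library alone. The allowlist entries here are placeholders; a real deployment would also re-validate after each redirect (step 5) and enforce the same policy at the network layer (step 6).

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}  # placeholder allowlist

def validate_url(url: str) -> str:
    """Raise ValueError unless `url` passes scheme, allowlist and IP checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):      # block file:// etc.
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    host = parsed.hostname or ""
    if host not in ALLOWED_HOSTS:                   # explicit domain allowlist
        raise ValueError(f"host not in allowlist: {host!r}")
    # Resolve and reject private, loopback, link-local and reserved ranges
    for info in socket.getaddrinfo(host, None):
        addr = ipaddress.ip_address(info[4][0])
        if (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved):
            raise ValueError(f"{host} resolves to blocked address {addr}")
    return url
```

Resolving the hostname before checking also catches DNS entries that point an allowed-looking name at 127.0.0.1 or an RFC 1918 address.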
high · C7 · Wildcard CORS · MCP07-insecure-config
Pattern "allow_origins\s*=\s*\[\s*['"]\*['"]\s*\]" matched in source_code: "allow_origins=["*"]" (at position 18757)
Replace wildcard CORS with an explicit allowlist of permitted origins. Wildcard CORS allows any website to make requests to the MCP server.
low · F4 · MCP Spec Non-Compliance · MCP07-insecure-config
Server fails MCP spec compliance checks: required:server_name; required:server_version; required:protocol_version; recommended:tool_descriptions; recommended:parameter_descriptions
Follow the MCP specification for server metadata. Include server name, version, and protocol version. Provide descriptions for all tools and parameters.
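The missing fields map onto the MCP initialize handshake (`serverInfo`, `protocolVersion`) and onto tool definitions with description strings. The sketch below shows the shape; every value is a placeholder for this server's real metadata, and the tool name is hypothetical.

```python
# Server metadata covering the required fields from the finding
SERVER_INFO = {
    "serverInfo": {
        "name": "scraper-mcp",        # required: server_name
        "version": "1.0.0",           # required: server_version
    },
    "protocolVersion": "2025-03-26",  # required: protocol_version
}

# A tool definition covering the recommended description fields
TOOL_SCRAPE = {
    "name": "scrape_page",
    # recommended: tool_descriptions
    "description": "Fetch a URL and return cleaned HTML, links, or Markdown.",
    "inputSchema": {
        "type": "object",
        "properties": {
            # recommended: parameter_descriptions
            "url": {"type": "string", "description": "Page URL to fetch"},
            "output_format": {
                "type": "string",
                "enum": ["html", "links", "markdown"],
                "description": "Desired output format",
            },
        },
        "required": ["url"],
    },
}
```

Framework-level servers (e.g. FastMCP) usually populate these from the constructor arguments and tool docstrings, so filling those in is typically enough to pass the checks.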
Tools
No tools exposed by this server.
Security Category Deep Dive
Prompt Injection — prompt & context manipulation attacks
Maturity: 69 · Rules: 14 · Sub-Categories: 5 · Gaps: 1 · Implemented: 64% · Tests: 56 · Stories: 1
Injection via tool descriptions and parameter fields — 100% (3 rules)
GAP-001 · Prompt Injection Coverage Gap: missing detection coverage for emerging prompt injection attack variants not addressed by current rules
Hidden instructions via external content and tool responses — 100% (4 rules)
Context window saturation and prior-approval exploitation — 100% (2 rules)
Payload hiding via invisible chars, base64, schema fields — 100% (3 rules)
Injection via prompt templates and runtime tool output — 100% (2 rules)