ai.smithery/arjunkmrm-scrapermcp_el
Extract and parse web pages into clean HTML, links, or Markdown. Handle dynamic, complex, or block…
Tools: 0 · Findings: 10 · Stars: — · Downloads: — · Last Scanned: Mar 19, 2026

Findings (10)
3 critical · 6 high · 0 medium · 1 low · 0 informational
critical · K8 · Cross-Boundary Credential Sharing · MCP05-privilege-escalation · AML.T0054
Pattern "(forward|pass|send|relay|proxy|propagate)[_\s-]?(token|credential|api[_\s-]?key|secret|password|auth)" matched in source_code: "proxy_password" (at position 1208)
Never forward, share, or embed credentials across trust boundaries. Use OAuth token exchange (RFC 8693) to create scoped, delegated tokens instead of passing original credentials. Never include credentials in tool responses. Required by ISO 27001 A.5.17 and OWASP ASI03.
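The token-exchange alternative named above can be sketched with the standard library. This is a hypothetical illustration of the RFC 8693 request body a server would POST to its authorization server's token endpoint; the scope value and function name are assumptions, not part of this server's code.

```python
import urllib.parse

def build_token_exchange_form(subject_token: str, scope: str) -> str:
    """URL-encoded body for an RFC 8693 token-exchange request.

    Instead of forwarding the caller's original credential downstream,
    the server trades it for a narrowly scoped, delegated token.
    """
    return urllib.parse.urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
        "subject_token": subject_token,
        "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
        # Request only the narrow scope the downstream call needs:
        "scope": scope,
    })
```

The resulting form would be POSTed to the authorization server's token endpoint; the original credential never leaves the trust boundary.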
critical · C1 · Command Injection · MCP03-command-injection · AML.T0054
Pattern "`[^`]+`" matched in source_code: "`",
            "{", "}", "[", "]", "'",
        ]:
            thor_mcp_clean_url = thor_mcp_clean_url.replace(char, "_")
        # Limit total filename length to 200 characters
        thor_mcp_max_length = 200 - len(thor_mcp_today) - 1  # Subtract date and separator length
        thor_mcp_htmlName = f"{thor_mcp_today}_{thor_mcp_clean_url[:thor_mcp_max_length]}"
        thor_mcp_filename = f"{thor_mcp_save_dir}/{thor_mcp_htmlName}.html"
        if os.path.exists(thor_mcp_filename):
            try:
                with open(thor_mcp_filename, "r", encoding="utf-8") as f:
                    thor_mcp_html = f.read()
                print(f"Read HTML from local cache: {thor_mcp_filename}")
            except IOError as e:
                raise ToolError(f"Failed to read cache file")
        else:
            thor_mcp_html = await scrape(url, thor_mcp_myProxyConfig)
            if not thor_mcp_html:
                raise ToolError(f"Web scraping failed, unable to get content")
            try:
                with open(thor_mcp_filename, "w", encoding="utf-8") as f:
                    f.write(thor_mcp_html)
                print(f"HTML saved to {thor_mcp_filename}")
            except IOError as e:
                raise ToolError(f"Failed to save HTML file")
    else:
        # When cache is disabled, still save HTML but use different directory and second-level timestamp filename
        thor_mcp_now = datetime.now().strftime("%Y%m%d%H%M%S")
        thor_mcp_save_dir = "html_temp"
        os.makedirs(thor_mcp_save_dir, exist_ok=True)
        # Clean special characters from URL and limit filename length
        thor_mcp_clean_url = url.split("//")[-1]
        for char in [
            "?", ",", "/", "\\", ":", "*", '"', "<", ">", "|",
            "%", "=", "&", "+", ";", "@", "#", "$", "^", "`" (at position 4949)
Replace exec()/execSync() with execFile() and pass arguments as an array, never as a string. Validate all inputs against an allowlist before use in any shell context. For subprocess.run, always pass a list and shell=False.
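In Python terms, the safe pattern above looks like the sketch below: arguments as a list, `shell=False`, and an allowlist check before anything reaches a subprocess. The tool name `mytool` and its subcommands are hypothetical placeholders, not taken from this server.

```python
import subprocess

ALLOWED_SUBCOMMANDS = {"fetch", "render"}  # explicit allowlist (hypothetical)

def run_tool(subcommand: str, target: str) -> str:
    """Run a hypothetical CLI safely: validated subcommand, argv-list form."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"subcommand not allowed: {subcommand!r}")
    # List form with shell=False passes `target` as a single argv entry;
    # backticks, semicolons, $(...) etc. are never interpreted by a shell.
    result = subprocess.run(
        ["mytool", subcommand, target],
        shell=False,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Because no shell ever parses the arguments, a URL containing `` `rm -rf /` `` arrives downstream as inert text.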
critical · Q13 · MCP Bridge Package Supply Chain Attack · MCP10-supply-chain · AML.T0054
Pattern "(?:mcp|fastmcp|langchain-mcp|llama-index-mcp)(?:>=|~=|==)?(?!\d)" matched in source_code: "fastmcp" (at position 475)
MCP bridge packages (mcp-remote, mcp-proxy, @modelcontextprotocol/sdk, fastmcp) are high-value supply chain targets — CVE-2025-6514 (CVSS 9.6) in mcp-remote affected 437,000+ installs. Always pin exact versions (no ^ or ~ ranges). Use lockfiles (package-lock.json, pnpm-lock.yaml, uv.lock). Never run `npx mcp-remote` without version pinning. Verify package integrity with `npm audit` or `pip-audit` before deployment. Reference: CVE-2025-6514, OWASP ASI04.
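The pinning advice above amounts to a requirements file of exact versions. This is an illustrative fragment only — the version numbers are examples, not a vetted set; pin whatever releases you have actually audited.

```
# requirements.txt — exact pins, no ^ / ~ / >= ranges
fastmcp==2.10.1    # illustrative pin; use the latest audited release
aiohttp==3.10.11   # the scanned 3.9.0 carries known CVEs (finding D1)
mcp==1.9.4         # illustrative pin
```

Combined with a lockfile (e.g. uv.lock) and a pre-deployment `pip-audit` run, this closes the unpinned-range attack surface the finding describes.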
high · D1 · Known CVEs in Dependencies · MCP08-dependency-vuln
Dependency "aiohttp@3.9.0" has known CVEs:
Update dependencies to versions that patch known CVEs. Run 'npm audit fix' or 'pip-audit' to identify and resolve vulnerable dependencies.
high · D1 · Known CVEs in Dependencies · MCP08-dependency-vuln
Dependency "mcp@null" has known CVEs:
Update dependencies to versions that patch known CVEs. Run 'npm audit fix' or 'pip-audit' to identify and resolve vulnerable dependencies.
high · O8 · Timing-Based Covert Channel · MCP04-data-exfiltration · AML.T0057
Pattern "(?:delay|sleep|timeout|interval)\s*[:=]\s*(?:[^;]*(?:secret|token|password|credential|key|env))" matched in source_code: "timeout = ClientTimeout(total=120)
    # Create asynchronous HTTP client session
    async with aiohttp.ClientSession(
        headers=headers,
        timeout=timeout,
        connector=aiohttp.TCPConnector(),
        max_field_size=32768,
    ) as session:
        try:
            # Use session to initiate GET request
            async with session.get(
                url,  # Target URL
                proxy=proxy,  # Use proxy
                proxy_auth=proxy_auth,  # Use proxy authentication
                ssl=False,  # Disable SSL verification
            ) as response:
                # Check if response status code is 200 (success)
                if response.status == 200:
                    # Return response text content
                    return await response.text()
                else:
                    # Construct error message containing status code and URL
                    error_msg = f"Status code: {response.status}, URL: {url}"
                    # Throw retry exception
                    raise ScrapeRetryException(error_msg)
        except aiohttp.ClientError as e:
            error_msg = f"HTTP client error"
            raise ScrapeRetryException(error_msg)
        except asyncio.TimeoutError:
            error_msg = (
                f"Request timeout: 60 seconds"
            )
            raise ScrapeRetryException(error_msg)
        except Exception as e:
            error_msg = f"Unknown error:"
            raise ScrapeRetryException(error_msg)

async def scrape(url: str, myProxyConfig: ProxyConfig) -> str:
    """
    Web scraping method
    Parameters:
        url: URL address to scrape
        myProxyConfig: Proxy configuration object
    Returns:
        Returns web page content text on success, returns empty string on failure
    """
    try:
        result = await scrape_with_retry(url, myProxyConfig)
        return result
    except ScrapeRetryException:
        return ""

def clean_html(html: str) -> str:
    """Clean HTML string"""
    cleaner = Cleaner(
        scripts=True,
        kill_tags=["nav", "svg", "footer", "noscript", "script", "form"],
        style=True,
        remove_tags=[],
        safe_attrs=list(defs.safe_attrs) + ["idx"],
        inline_style=True,
        links=True,
        meta=False,
        embedded=True,
        frames=False,
        forms=False,
        annoying_tags=False,
        page_structure=False,
        javascript=True,
        comments=True,
    )
    return cleaner.clean_html(html)

def strip_html(thor_mcp_html: str) -> str:
    """Simplify HTML string, remove unnecessary elements, attributes and redundant content"""
    import re
    # Call clean_html function for initial cleaning (assuming the function is already defined externally)
    thor_mcp_cleaned_html = clean_html(thor_mcp_html)
    # Parse the cleaned HTML string into XML tree structure
    thor_mcp_html_tree = fromstring(thor_mcp_cleaned_html)
    # Traverse all elements in the HTML tree (including nested descendant elements)
    for thor_mcp_element in thor_mcp_html_tree.iter():
        # Remove style attribute (inline styles) from all elements
        if "style" in thor_mcp_element.attrib:
            del thor_mcp_element.attrib["style"]  # Use del statement to delete element attribute
        if (
            (
                not thor_mcp_element.attrib  # No attributes
                or (len(thor_mcp_element.attrib) == 1 and "idx" in thor_mcp_element.attrib)  # Or only contains idx attribute
            )
            and not thor_mcp_element.getchildren()  # No child elements
            and (not thor_mcp_element.text or not thor_mcp_element.text.strip())  # No text or blank text
            and (not thor_mcp_element.tail or not thor_mcp_element.tail.strip())  # No tail text or blank tail
        ):
            # Get parent element (may be None if it's the root element)
            thor_mcp_parent = thor_mcp_element.getparent()
            # Only remove if parent element exists
            if thor_mcp_parent is not None:
                # Remove current element from parent's tree structure
                thor_mcp_parent.remove(thor_mcp_element)
                # Convert processed XML tree back to HTML string
                return tostring(thor_mcp_html_tree, encoding='unicode')
    # Remove elements containing "footer" or "hidden" in class or id
    thor_mcp_xpath_query = (
        ".//*[contains(@class, 'footer') or contains(@id, 'footer') or "
        "contains(@class, 'hidden') or contains(@id, 'hidden')]"
    )
    # Use XPath query to find all elements that need to be removed
    thor_mcp_elements_to_remove = thor_mcp_html_tree.xpath(thor_mcp_xpath_query)
    # Traverse all elements that need to be removed
    for thor_mcp_element in thor_mcp_elements_to_remove:
        # Get the parent element of the current element
        thor_mcp_parent = thor_mcp_element.getparent()
        # Only perform removal if parent element exists
        if thor_mcp_parent is not None:
            # Remove current element from parent element
            thor_mcp_parent.remove(thor_mcp_element)
    # Reserialize HTML tree to string
    thor_mcp_stripped_html = tostring(thor_mcp_html_tree, encoding="unicode")
    # Replace multiple spaces with single space
    thor_mcp_stripped_html = re.sub(r"\s{2,}", " ", thor_mcp_stripped_html)
    # Replace consecutive newlines with empty string
    thor_mcp_stripped_html = re.sub(r"\n{2,}", "", thor_mcp_stripped_html)
    return thor_mcp_stripped_html

def extract_links_with_text(thor_mcp_html: str, thor_mcp_base_url: str | None = None) -> list[str]:
    """
    Extract links with display text from HTML
    Parameters:
        thor_mcp_html (str): Input HTML string
        thor_mcp_base_url (str | None): Base URL for converting relative URLs to absolute URLs
            If None, relative URLs remain unchanged
    Returns:
        list[str]: List of links in format [display text] URL
    """
    # Use lxml's fromstring function to parse HTML string into XML tree structure
    thor_mcp_html_tree = fromstring(thor_mcp_html)
    # Initialize empty list to store formatted links
    thor_mcp_links = []
    # Traverse all <a> tags containing href attribute (XPath selector)
    for thor_mcp_link in thor_mcp_html_tree.xpath("//a[@href]"):
        # Get value of href attribute (link target address)
        thor_mcp_href = thor_mcp_link.get("href")
        # Get all text content within the tag (including child tag text), and remove leading/trailing whitespace
        thor_mcp_text = thor_mcp_link.text_content().strip()
        # Only process when both href and text exist (filter empty links or empty text)
        if thor_mcp_href and thor_mcp_text:
            # Skip empty text or pure whitespace text (although strip() is used, prevent special whitespace characters)
            if not thor_mcp_text:
                continue
            # Skip in-page anchor links (starting with #)
            if thor_mcp_href.startswith("#"):
                continue
            # Skip JavaScript pseudo-links
            if thor_mcp_href.startswith("javascript:"):
                continue
            # Convert URL when base_url is provided and it's a relative path (starting with /)
            if thor_mcp_base_url and thor_mcp_href.startswith("/"):
                # Remove trailing slash from base_url to avoid double slash issue
                thor_mcp_base = thor_mcp_base_url.rstrip("/")
                # Concatenate into absolute URL
                thor_mcp_href = f"{thor_mcp_base}{thor_mcp_href}"
            # Add formatted link to result list: [text] URL
            thor_mcp_links.append(f"[{thor_mcp_text}] {thor_mcp_href}")
    # Return list of all qualified links
    return thor_mcp_links

def get_content(thor_mcp_content: str, thor_mcp_output_format: str) -> str:
    """
    Extract content from response and convert to appropriate format
    Parameters:
        thor_mcp_content: Response content string
        thor_mcp_output_format: Output format ("html", "links", or other formats converted to markdown)
    Returns:
        Formatted content string
    """
    if thor_mcp_output_format == "html":
        return thor_mcp_content
    if thor_mcp_output_format == "links":
        thor_mcp_links = extract_links_with_text(thor_mcp_content)
        return "\n".join(thor_mcp_links)
    # For other formats, simplify HTML content and convert to markdown
    thor_mcp_stripped_html = strip_html(thor_mcp_content)
    return markdownify(thor_mcp_stripped_html)

# Main program entry point (when running this script directly)
if __name__ == "__main__":  # If current script is the main program entry
    # Get the Starlette app and add CORS middleware
    app = mcp.streamable_http_app()  # Get streamable HTTP application instance
    # Add CORS middleware with proper header exposure for MCP session management
    app.add_middleware(  # Add CORS middleware to application
        CORSMiddleware,  # Use CORS middleware class
        allow_origins=["*"],  # Allow cross-origin requests from all origins (should be more restrictive in production)
        allow_credentials=True,  # Allow credentials (such as cookies)
        allow_methods=["GET", "POST", "OPTIONS"],  # Allowed HTTP methods
        allow_headers=["*"],  # Allow all request headers
        expose_headers=["mcp-session-id", "mcp-protocol-version"],  # Allow client to read MCP session ID and protocol version headers
        max_age=86400,  # Preflight request cache time (seconds)
    )
    # Use PORT environment variable
    port = int(os.environ.get("PORT", 8081))  # Get port number from env" (at position 9603)
Remove all code that calculates sleep/delay durations from application data, secrets, or any variable-length content. Tool response times should be constant or determined only by legitimate processing time. If rate limiting is needed, use fixed intervals not derived from data values. Monitor for anomalous response time patterns that could indicate timing-based exfiltration.
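The "fixed intervals not derived from data values" advice can be sketched as a small rate limiter whose schedule is a constant set at construction time, so response timing carries no information about payload content. The class name and interval value are illustrative assumptions.

```python
import asyncio
import time

class FixedIntervalLimiter:
    """Paces requests on a constant schedule independent of any data."""

    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s        # fixed; never computed from data
        self._next_slot = time.monotonic()  # next permitted start time

    async def wait(self) -> None:
        """Sleep until the next fixed slot, then claim it."""
        now = time.monotonic()
        if self._next_slot < now:
            self._next_slot = now           # don't accumulate idle backlog
        delay = self._next_slot - now
        self._next_slot += self.interval_s  # reserve the following slot
        await asyncio.sleep(delay)
```

Because `interval_s` is a constant rather than a value derived from secrets or content length, an observer measuring response times learns nothing beyond the fixed pace.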
high · Q14 · Concurrent MCP Server Race Condition · MCP07-insecure-config · T1068
Pattern "(?:read|write|modify|delete).*(?:file|path|directory)(?!.*(?:lock|mutex|semaphore|flock|atomic))" matched in source_code: "Read HTML from local cache: {thor_mcp_file" (at position 5649)
MCP servers sharing filesystem or database backends with other servers must implement proper concurrency controls. Use: (1) file locking (flock/lockfile) for filesystem operations, (2) database transactions for all read-modify-write sequences, (3) atomic file operations (O_EXCL, mkdtemp) instead of check-then-create, (4) lstat() to detect symlinks before following (CVE-2025-53109). Never assume exclusive access to shared resources — other MCP servers may be operating concurrently.
high · C3 · Server-Side Request Forgery (SSRF) · MCP04-data-exfiltration · AML.T0057
Pattern "session\.(?:get|post|put|delete|patch|request)\s*\([^)]*(?:req|request|input|param|params|args|url|uri|href|target|endpoint|host)" matched in source_code: "session.get(
        url,  # Target URL" (at position 9956)
Validate ALL user-supplied URLs before making HTTP requests:
1. Parse the URL and check the hostname against an explicit allowlist of permitted domains.
2. Block requests to RFC 1918 private ranges: 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16.
3. Block loopback (127.0.0.0/8), link-local (169.254.0.0/16), and IPv6 equivalents.
4. Block file:// and other non-http(s) protocols explicitly.
5. Disable automatic redirect following, or re-validate each redirect destination.
6. In cloud environments: block requests to IMDS endpoints (169.254.169.254,
metadata.google.internal) at both the application AND network layer.
Example (Node.js): Use the `ssrf-req-filter` package or implement URL validation
against an allowlist before calling fetch/axios/got.
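Steps 1–4 above can be sketched in Python with the standard library alone. The allowlist entries here are placeholders; a real deployment would also re-validate after each redirect (step 5) and enforce the same policy at the network layer (step 6).

```python
import ipaddress
import socket
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "www.example.com"}  # placeholder allowlist

def validate_url(url: str) -> str:
    """Raise ValueError unless `url` passes scheme, allowlist and IP checks."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):      # block file:// etc.
        raise ValueError(f"scheme not allowed: {parsed.scheme!r}")
    host = parsed.hostname or ""
    if host not in ALLOWED_HOSTS:                   # explicit domain allowlist
        raise ValueError(f"host not in allowlist: {host!r}")
    # Resolve and reject private, loopback, link-local and reserved ranges
    for info in socket.getaddrinfo(host, None):
        addr = ipaddress.ip_address(info[4][0])
        if (addr.is_private or addr.is_loopback
                or addr.is_link_local or addr.is_reserved):
            raise ValueError(f"{host} resolves to blocked address {addr}")
    return url
```

Resolving the hostname before checking also catches DNS entries that point an allowed-looking name at 127.0.0.1 or an RFC 1918 address.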
high · C7 · Wildcard CORS · MCP07-insecure-config
Pattern "allow_origins\s*=\s*\[\s*['"]\*['"]\s*\]" matched in source_code: "allow_origins=["*"]" (at position 18757)
Replace wildcard CORS with an explicit allowlist of permitted origins. Wildcard CORS allows any website to make requests to the MCP server.
low · F4 · MCP Spec Non-Compliance · MCP07-insecure-config
Server fails MCP spec compliance checks: required:server_name; required:server_version; required:protocol_version; recommended:tool_descriptions; recommended:parameter_descriptions
Follow the MCP specification for server metadata. Include server name, version, and protocol version. Provide descriptions for all tools and parameters.
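The missing fields map onto the MCP initialize handshake (`serverInfo`, `protocolVersion`) and onto tool definitions with description strings. The sketch below shows the shape; every value is a placeholder for this server's real metadata, and the tool name is hypothetical.

```python
# Server metadata covering the required fields from the finding
SERVER_INFO = {
    "serverInfo": {
        "name": "scraper-mcp",        # required: server_name
        "version": "1.0.0",           # required: server_version
    },
    "protocolVersion": "2025-03-26",  # required: protocol_version
}

# A tool definition covering the recommended description fields
TOOL_SCRAPE = {
    "name": "scrape_page",
    # recommended: tool_descriptions
    "description": "Fetch a URL and return cleaned HTML, links, or Markdown.",
    "inputSchema": {
        "type": "object",
        "properties": {
            # recommended: parameter_descriptions
            "url": {"type": "string", "description": "Page URL to fetch"},
            "output_format": {
                "type": "string",
                "enum": ["html", "links", "markdown"],
                "description": "Desired output format",
            },
        },
        "required": ["url"],
    },
}
```

Framework-level servers (e.g. FastMCP) usually populate these from the constructor arguments and tool docstrings, so filling those in is typically enough to pass the checks.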
Tools
No tools exposed by this server.
Security Category Deep Dive
Prompt Injection — prompt & context manipulation attacks
Maturity: 69 · Rules: 14 · Sub-Categories: 5 · Gaps: 1 · Implemented: 64% · Tests: 56 · Stories: 1
Injection via tool descriptions and parameter fields — 100% (3 rules)
GAP-001 · Prompt Injection Coverage Gap: missing detection coverage for emerging prompt injection attack variants not addressed by current rules
Hidden instructions via external content and tool responses — 100% (4 rules)
Context window saturation and prior-approval exploitation — 100% (2 rules)
Payload hiding via invisible chars, base64, schema fields — 100% (3 rules)
Injection via prompt templates and runtime tool output — 100% (2 rules)