Antheta.com is a work-in-progress web scraper for collecting many kinds of data. You can currently download the results in JSON format.
Antheta automatically attempts to scrape several types of data:
- General data
  - Title, description, favicon
  - Detected technologies (WordPress, jQuery, MailChimp, etc.)
  - Host (IP address)
- API endpoints
  - AJAX requests along with methods, parameters, headers and timestamps
- Site size
- Links (in 3 categories: outbound, inbound and navbar links)
  - Outbound links
  - Inbound links (with the same host)
  - Navbar links
- Socials (social links)
- Inline + linked
- Parses JSON scripts, e.g. schema.org
- Fonts (Google fonts found on site)
- Forms (form method, action and all form elements, plus a check for reCAPTCHA)
- Contacts (note that the tool is not intended for spam; use it instead to secure your own sites against bots)
  - Email addresses
    - Parses and tries to match multiple different regexes
    - Works with common obfuscators, e.g. Cloudflare (looks at dynamically loaded content rather than only the initial HTML)
  - Phone numbers
- Tables (table header + content)
- HTML (full page HTML code)
- IP-Addresses (parses IP-addresses in real-time)
- Most websites/apps built with Next.js (a React framework) have a __NEXT_DATA__ element that contains important information.
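
The three link categories listed above can be illustrated with a small sketch. It is not Antheta's implementation: the function name `classify_links`, the rule that "navbar" means "inside a `<nav>` element", and the choice to list navbar links in addition to their host-based category are all assumptions made for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags and notes whether each sits inside <nav>."""

    def __init__(self):
        super().__init__()
        self.nav_depth = 0
        self.links = []  # list of (href, inside_nav) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self.nav_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, self.nav_depth > 0))

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth > 0:
            self.nav_depth -= 1


def classify_links(html, page_url):
    """Hypothetical helper: split links into outbound, inbound and navbar groups."""
    page_host = urlparse(page_url).hostname
    collector = _LinkCollector()
    collector.feed(html)

    result = {"outbound": [], "inbound": [], "navbar": []}
    for href, in_nav in collector.links:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        # Assumption: "inbound" means the link points at the same host as the page.
        key = "inbound" if parsed.hostname == page_host else "outbound"
        result[key].append(absolute)
        if in_nav:
            result["navbar"].append(absolute)
    return result
```

Calling `classify_links(html, "https://example.com/")` returns a dictionary with the three lists, which could then be serialized to JSON like the rest of the scraped data.
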
The robot also has more capabilities that have not yet been implemented on the frontend:
- Click on elements or specific areas.
  - Useful for reCAPTCHAs if using different IPs.
- Navigate between pages.
- Fill out forms and inputs.
- Wait for certain elements to load or wait for a specific amount of time.
- Access & exit IFrames on the target site.
- Screenshot pages or specific areas.
- for manipulating the target site to your needs.
- Use cases are almost endless.
- Manipulate target site DOM.
- Save/access the content of XHR requests.
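
To make the capabilities above more concrete, here is a minimal sketch of how a sequence of such actions could be described and checked before being run. Antheta's real task format is not documented here, so everything in this block is hypothetical: the action names, their parameters, the `Step` class and the `validate_steps` function are invented purely for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical action names loosely mirroring the capabilities listed above.
# These are NOT Antheta's real API; they only illustrate the idea.
REQUIRED_PARAMS = {
    "navigate": {"url"},
    "click": {"selector"},
    "fill": {"selector", "value"},
    "wait_for": {"selector"},
    "wait": {"seconds"},
    "screenshot": {"path"},
}


@dataclass
class Step:
    action: str
    params: dict = field(default_factory=dict)


def validate_steps(steps):
    """Return a list of problems; an empty list means the plan is well-formed."""
    problems = []
    for index, step in enumerate(steps):
        required = REQUIRED_PARAMS.get(step.action)
        if required is None:
            problems.append(f"step {index}: unknown action {step.action!r}")
            continue
        missing = required - set(step.params)
        if missing:
            problems.append(f"step {index}: {step.action} is missing {sorted(missing)}")
    return problems
```

A plan such as "navigate to a login page, fill in two inputs, click submit, wait for the dashboard, take a screenshot" would then be a plain list of `Step` objects that can be validated up front and stored or sent as JSON.
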
The idea behind this web scraper is to provide a service that lets you automate almost any task online. For example, you can log in to a service and post something, or gather data from specific websites automatically and access it via Antheta's API.
You can also use CSS selectors to specify the data you wish to scrape. Example: ".container .links a->href" will get the "href" attribute from the matched element, along with other common attributes (class, id, name, html, text, etc.).
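
As an illustration of the "selector->attribute" notation in the example above, the following sketch splits such a string into its two parts. How Antheta actually parses it is not stated, so `parse_selector_spec` and `SelectorSpec` are hypothetical names, and treating a missing "->" as "no specific attribute requested" is an assumption.

```python
from typing import NamedTuple, Optional


class SelectorSpec(NamedTuple):
    selector: str             # the CSS selector part
    attribute: Optional[str]  # the attribute after "->", if any


def parse_selector_spec(spec: str) -> SelectorSpec:
    """Hypothetical parser for strings like '.container .links a->href'."""
    # Split on the LAST "->" so the selector part is kept whole.
    selector, arrow, attribute = spec.rpartition("->")
    if not arrow:
        # No "->" present: assume only a selector was given.
        return SelectorSpec(spec.strip(), None)
    selector, attribute = selector.strip(), attribute.strip()
    if not selector or not attribute:
        raise ValueError(f"malformed selector spec: {spec!r}")
    return SelectorSpec(selector, attribute)
```

For the string from the example, `parse_selector_spec(".container .links a->href")` yields the selector `.container .links a` and the attribute `href`.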
Read more about the types of data on Antheta.com.
Antheta.com will have a full dashboard that anyone can use to specify exactly what they wish to scrape. Here's what the upcoming Antheta Editor currently looks like:
Users will be able to create their own datasets with their own columns, which act somewhat like database tables. Say you want to gather emails from thousands of sites: you can create a dataset for it and later connect to it through our API or simply export the list.
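
The "dataset as a database table" idea can be sketched with Python's built-in `sqlite3` module. This is only an analogy under stated assumptions: Antheta's storage and API are not described here, and the helper names `create_dataset`, `add_row` and `export_json` are hypothetical.

```python
import json
import sqlite3


def _check_names(*names):
    """Reject anything that is not a plain identifier, since names go into SQL text."""
    for name in names:
        if not name.isidentifier():
            raise ValueError(f"invalid name: {name!r}")


def create_dataset(conn, name, columns):
    """Create a dataset (one table) with user-chosen text columns."""
    _check_names(name, *columns)
    column_sql = ", ".join(f'"{column}" TEXT' for column in columns)
    conn.execute(f'CREATE TABLE "{name}" ({column_sql})')


def add_row(conn, name, row):
    """Insert one row given as a {column: value} dictionary."""
    _check_names(name, *row)
    column_sql = ", ".join(f'"{column}"' for column in row)
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f'INSERT INTO "{name}" ({column_sql}) VALUES ({placeholders})',
        list(row.values()),
    )


def export_json(conn, name):
    """Export the whole dataset as a JSON array of objects."""
    _check_names(name)
    cursor = conn.execute(f'SELECT * FROM "{name}"')
    columns = [description[0] for description in cursor.description]
    return json.dumps([dict(zip(columns, values)) for values in cursor.fetchall()])
```

For the email example, one would create a dataset named `emails` with columns such as `site` and `email`, add a row per address found, and call `export_json` to get the list.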