Data Extraction

Introduction

Data extraction, also referred to as web scraping, is the process of retrieving data from websites. In this guide we focus specifically on this stage, using C++ as our primary programming language and passport data as the running example. Although C++ is not commonly employed for web scraping, owing to its complexity and relative lack of high-level libraries, it offers a degree of control and performance that can be advantageous in certain scenarios. For these tasks, libraries such as Gumbo (an HTML parsing library) and Boost (for networking) can be used. The extracted data is validated against an off-chain database structure administered by the Entrypoints; there the data is verified and then brought on-chain with a zero-knowledge proof.

Preliminary Steps

Before writing the code for data extraction, we need to understand the structure of the webpage we'll be scraping. This means identifying the HTML tags that contain the data we're interested in, which you can do by inspecting the page's source code (for example, with your browser's developer tools).
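
As a concrete (and purely hypothetical) illustration, the parsing example later in this guide assumes each data point is marked up as a span element carrying a title attribute, roughly like this:

<div class="passport-record">
  <span title="Jane Doe">Jane Doe</span>
  <span title="X1234567">X1234567</span>
</div>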

Web Request

To start the process, we need to send a web request to the URL of the webpage containing the passport data and receive the HTML content of the page. For this, we can use the Boost library's Beast and Asio modules, which provide facilities for network programming. Here is a basic example:

#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <boost/asio/connect.hpp>
#include <boost/asio/ip/tcp.hpp>
#include <string>

namespace beast = boost::beast;
namespace http = beast::http;
namespace net = boost::asio;
using tcp = net::ip::tcp;

// Fetch the body of http://<host><target> and return it as a string.
std::string getHTML(const std::string& host, const std::string& target)
{
    net::io_context ioc;
    tcp::resolver resolver{ioc};
    beast::tcp_stream stream{ioc};

    // Resolve the hostname and open a TCP connection on port 80.
    auto const results = resolver.resolve(host, "http");
    stream.connect(results);

    // Build and send an HTTP/1.1 GET request.
    http::request<http::string_body> req{http::verb::get, target, 11};
    req.set(http::field::host, host);
    req.set(http::field::user_agent, "C++ Web Scraper");
    http::write(stream, req);

    // Read the response into a dynamic buffer.
    beast::flat_buffer buffer;
    http::response<http::dynamic_body> res;
    http::read(stream, buffer, res);

    // Close the socket; passing an error_code keeps an already-closed
    // connection from throwing.
    beast::error_code ec;
    stream.socket().shutdown(tcp::socket::shutdown_both, ec);

    return beast::buffers_to_string(res.body().data());
}

In this example, we connect to the host, send a GET request, read the response, and return its body as a string.
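
Note that the function above speaks plain HTTP on port 80. Most real sites, particularly anything serving personal data, will only accept HTTPS, which Beast supports through Boost.Asio's SSL stream. The variant below is a sketch modeled on Beast's synchronous HTTPS client example; it assumes OpenSSL is available and linked:

#include <boost/beast/ssl.hpp>
#include <boost/asio/ssl/error.hpp>
#include <boost/asio/ssl/stream.hpp>

namespace ssl = net::ssl;

std::string getHTMLS(const std::string& host, const std::string& target)
{
    net::io_context ioc;
    ssl::context ctx{ssl::context::tlsv12_client};
    ctx.set_default_verify_paths();        // trust the system's CA store
    ctx.set_verify_mode(ssl::verify_peer); // verify the server certificate

    tcp::resolver resolver{ioc};
    beast::ssl_stream<beast::tcp_stream> stream{ioc, ctx};

    // Set SNI so virtual-hosted servers present the right certificate.
    if (!SSL_set_tlsext_host_name(stream.native_handle(), host.c_str())) {
        beast::error_code ec{static_cast<int>(::ERR_get_error()),
                             net::error::get_ssl_category()};
        throw beast::system_error{ec};
    }

    // Connect on port 443 and perform the TLS handshake.
    auto const results = resolver.resolve(host, "https");
    beast::get_lowest_layer(stream).connect(results);
    stream.handshake(ssl::stream_base::client);

    http::request<http::string_body> req{http::verb::get, target, 11};
    req.set(http::field::host, host);
    req.set(http::field::user_agent, "C++ Web Scraper");
    http::write(stream, req);

    beast::flat_buffer buffer;
    http::response<http::dynamic_body> res;
    http::read(stream, buffer, res);

    // Shut down TLS; many servers close abruptly, so ignore the error.
    beast::error_code ec;
    stream.shutdown(ec);

    return beast::buffers_to_string(res.body().data());
}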

HTML Parsing

Once we have the HTML content, we can parse it to extract the data we're interested in. Gumbo, an HTML parsing library written in C, can help us with this task. Here's an example:

#include <gumbo.h>
#include <iostream>
#include <string>

// Recursively walk the parse tree, printing the "title" attribute of
// every <span> element that carries one.
void search_for_titles(GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }

    GumboAttribute* title_attr;
    if (node->v.element.tag == GUMBO_TAG_SPAN &&
        (title_attr = gumbo_get_attribute(&node->v.element.attributes, "title"))) {
        // Found a span with a "title" attribute.
        std::cout << "Title: " << title_attr->value << std::endl;
    }

    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        search_for_titles(static_cast<GumboNode*>(children->data[i]));
    }
}

void parseHTML(const std::string& html) {
    GumboOutput* output = gumbo_parse(html.c_str());
    search_for_titles(output->root);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
}

In this example, we traverse the HTML tree recursively, looking for span elements that carry a title attribute, and print that attribute's value. Replace the tag and attribute names to match the actual markup of the page you're scraping.
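
If the data you need is an element's visible text rather than an attribute value, you can read it from the node's text children. Here is a minimal sketch, assuming the element holds a single plain-text child (as in <span>Jane Doe</span>):

// Return the text content of an element node, or "" if its first
// child is not a plain text node.
std::string element_text(const GumboNode* node) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return "";
    }
    const GumboVector* children = &node->v.element.children;
    if (children->length > 0) {
        const GumboNode* child = static_cast<GumboNode*>(children->data[0]);
        if (child->type == GUMBO_NODE_TEXT) {
            return child->v.text.text;  // raw text of the node
        }
    }
    return "";
}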

Combining the Steps

Finally, we can combine these two functions into a complete web scraping program:

int main() {
    try {
        // Hostname and path are placeholders; substitute the real source.
        std::string html = getHTML("www.example.com", "/passportdata");
        parseHTML(html);
    } catch (const std::exception& e) {
        std::cerr << "Scrape failed: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
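
From here, the extracted values would typically be gathered into a structured record before being handed to the off-chain validation step described in the introduction. The exact schema is defined by the database the Entrypoints administrate; the struct below is purely illustrative:

#include <string>

// Hypothetical record type; the real field set comes from the
// off-chain database schema.
struct PassportRecord {
    std::string full_name;
    std::string passport_number;
    std::string nationality;
    std::string expiry_date;
};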

Conclusion

To sum up, web scraping in C++ is entirely possible, though it involves more complexity and effort than languages like Python or JavaScript. It is crucial to handle exceptions and errors, and to adhere to each website's robots.txt policy and terms of service, to avoid potential legal complications. Keep in mind as well that web pages can change unexpectedly, so design your web scraper to degrade gracefully and to notify you when it can no longer extract the data.
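
One lightweight way to get that notification is to have the parser collect its matches and fail loudly when it finds none. Here is a sketch that reworks the earlier Gumbo traversal into a collecting variant (the function names and error message are placeholders):

#include <gumbo.h>
#include <stdexcept>
#include <string>
#include <vector>

// Collect the value of every "title" attribute on <span> elements
// instead of printing them, so the caller can check the result.
void collect_titles(GumboNode* node, std::vector<std::string>& out) {
    if (node->type != GUMBO_NODE_ELEMENT) {
        return;
    }
    GumboAttribute* attr;
    if (node->v.element.tag == GUMBO_TAG_SPAN &&
        (attr = gumbo_get_attribute(&node->v.element.attributes, "title"))) {
        out.push_back(attr->value);
    }
    GumboVector* children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i) {
        collect_titles(static_cast<GumboNode*>(children->data[i]), out);
    }
}

std::vector<std::string> extract_titles(const std::string& html) {
    GumboOutput* output = gumbo_parse(html.c_str());
    std::vector<std::string> titles;
    collect_titles(output->root, titles);
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    if (titles.empty()) {
        // Likely a page-layout change: surface the failure instead of
        // silently returning nothing.
        throw std::runtime_error("no passport data found in page");
    }
    return titles;
}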
