Web scraping, the process of using bots to extract particular content or a whole page from an external website, has a poor reputation in web development. Many search engines and websites treat it as a malicious attack.
If you scrape the same web page repeatedly, your IP address may be banned, either temporarily or permanently.
In that case, we can use Yahoo YQL (Yahoo Query Language) as a proxy to make a cross-domain request, which lets us scrape HTML web pages easily.
jQuery Cross-Domain Screen Scraping
First, we need a target web page to scrape HTML content from. For this example, I am using ‘example.com’, and I will scrape its <h1> heading tag.
1. Navigate to the Yahoo YQL Console, enter the following into the text box, select JSON, and click ‘Test’.
select * from html where url="http://example.com/"
Note: The above query will scrape the whole page of example.com.
3. Copy the REST Query (the third highlighted area in the image above) and paste it into Notepad for later use.
4. Now open Notepad, paste the code below into it, and save the file as HTML.
var yql = "https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fexample.com%2F%22&format=json&diagnostics=true&callback=";
document.getElementById('parse-data').innerHTML = data.query.results.body.div.h1;
1. The first highlighted line above is the ‘REST Query’ URL that we saved in step 3.
2. The second highlighted line extracts the heading tag from the JSON data. To learn more about JSON, see the W3Schools JSON tutorial.
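To show how the two highlighted lines fit together, here is a rough sketch in plain JavaScript with jQuery. It assumes that jQuery is already loaded on the page, that the page contains an element with the id 'parse-data', and that the YQL endpoint still answers with the JSON shape used above (query.results.body.div.h1). The helper names buildYqlUrl and extractHeading are my own, purely for illustration. The sketch also writes the result with textContent instead of innerHTML, so that scraped text is not interpreted as markup.

```javascript
// Sketch only: buildYqlUrl and extractHeading are hypothetical helper names,
// not part of jQuery or YQL.
var YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// URL-encode a YQL statement into the REST query form saved in step 3.
function buildYqlUrl(statement) {
  return YQL_ENDPOINT +
    "?q=" + encodeURIComponent(statement) +
    "&format=json&diagnostics=true&callback=";
}

// Walk data.query.results.body.div.h1 and return null (instead of throwing)
// if any level is missing or the final value is not a string.
function extractHeading(data) {
  var path = ["query", "results", "body", "div", "h1"];
  var node = data;
  for (var i = 0; i < path.length; i++) {
    if (node === null || typeof node !== "object") {
      return null;
    }
    node = node[path[i]];
  }
  return typeof node === "string" ? node : null;
}

var yql = buildYqlUrl('select * from html where url="http://example.com/"');

// Browser-only part: runs only when jQuery is available on the page.
if (typeof window !== "undefined" && typeof window.jQuery === "function") {
  window.jQuery.getJSON(yql, function (data) {
    var heading = extractHeading(data);
    var target = document.getElementById("parse-data");
    if (target) {
      target.textContent = heading === null ? "Heading not found" : heading;
    }
  });
}
```

Building the URL with encodeURIComponent gives the same string as the REST Query copied from the console, so you can swap in another page address without going back to the YQL Console. If the target page has a different structure, the path inside extractHeading has to be adjusted to match its JSON.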
Now open the saved HTML file in the Chrome browser, and you will be able to see the heading of example.com, as in the picture below.