Remember, if you’re not using Excel 2016 or later, then you’ll need to install the power query add-in.

Data to Extract

In this post we’re going to take a look at how we can pull data from a series of similar pages. I’m a big MMA fan, so the example we’re going to look at is getting a list of all UFC results from Wikipedia.

If you visit the Wikipedia page for UFC events there’s a table of Past Events. If you click on one of the events you’ll see a results table. If you look at a few more events, you’ll notice the structure is the exact same and they all have a results table. This is the data I want to get, but from all 400+ events listed in the past event section. If the number of pages was any larger, you might be better off using another tool like Python, but we’re going to be using Power Query.

Create a Query Function

First, we will create a query to extract the data on one page. We will then turn this into a function query where the input is an event page URL. This way we can apply the query to each URL in a list of all the URL’s.

Head to the Data tab in the ribbon and press the From Web button under the Get & Transform section. If you’re working with Excel 2013 or earlier via the add-in, then this will be found under the Power Query tab.

Enter the URL and press the Ok button.

Excel will connect with the page and the Navigator dialog box will open.

Rename the query to fGetWikiResults. This will be the name we call to use our query function later on.

Now we can edit our query to turn it into a query function. Go to the View tab and press the Advanced Editor button. This will allow us to edit the code that Excel has created to extract the data from this URL.

We will need to edit this code to the following. The parts that need to be added/changed are highlighted in red. Press the Done button when finished editing the query. This will turn our query into a parametrized query with the URL as an input.

You should see the data preview in the query editor has been replaced with a parameter input. We don’t need to enter anything here and we can just leave it blank.

We can then save our query function by going to the Home tab and pressing the Close & Load button.

You should now see the fGetWikiResults query function in the Queries & Connections window.

Get a List of URL’s

Now we will need to get our list of event page URL’s from the Past Events page. We could use power query to import this table but this would just pull in the text and not the underlying hyperlink. The best way to get the list of URL’s is to parse the source code from the page. You can view any webpage’s source code by pressing Ctrl + U from the Chrome browser. You’ll need to be fairly familiar with HTML to find what you’re looking for. The first couple lines of HTML we are interested in looks like this. I have highlighted the hyperlinks we’re interested in to demonstrate where they are. You can parse these out in another Excel workbook using some filters and basic text formula. We will also need to concatenate the starting part of the address (ie. https://en.wikipedia.org/wiki/UFC_217).

Once we have the full list of event URL’s, we can turn the list into an Excel Table using the Ctrl + T keyboard shortcut and name it URL_List.

Use the Query Function on our URL List

We are now ready to use the fGetWikiResults query function on our list of event URL’s.

Create a query based on the URL_List table. Select a cell in the table and go to the Data tab in the ribbon and press the From Table/Range button under the Get & Transform section.

Now we will add a custom column to the query. This is where we’ll invoke our fGetWikiResults query function. Go to the Add Column tab and press the Custom Column button.

Add a New column name to the custom column and then add the Custom column formula fGetWikiResults([URL]).

The new custom column will contain a Table for each URL and we will need to expand this table to see the results. Left click on the Results Data column filter icon seen in the column heading. Select Expand from the menu and press the OK button.

Some of the column headings were missing in our source data, so we can rename them. Double left click on the column heading to rename it.

We can now Close & Load the query and the results data will load into a new sheet. This will take a good few minutes so be patient. This is why you should probably start considering Python or similar tools if you have any more pages than this example.

How To Extract Data From Multiple Webpages With Power Query - 57