Copying large files from SharePoint Online

I recently worked on a project where we needed to copy some large files from a specified library in SharePoint Online. In that library, there were several layers of folders and many different types of files. My goal was to copy files that had a certain file extension and a file name that started with a specific string.

I began by comparing the capabilities of Azure Data Factory and Logic Apps. Logic Apps lets me start the process via a webhook. It also has a SharePoint Online connector that provides easy AAD authentication, and it offers an operation that retrieves the properties of files (path, file name) without retrieving the file contents. This makes it easy to loop through a list of files that is already filtered to only the ones I want.

Logic Apps process: 1) When an HTTP request is received, 2) Get files (properties only), 3) For each, 4a) Success Response, 4b) Failure Response
Logic App that retrieves file metadata from SharePoint and then performs an action for each file in the list

Azure Data Factory does have a SharePoint Online connector, but you can’t use it to copy files. To copy files from SharePoint in Data Factory, you must use the HTTP connector, which requires you to first make a web call to get a bearer token used to authenticate the source of the Copy activity.

While the ADF options work, I thought I would go with the Logic App because it felt cleaner and easier for others to understand. That is, until I tried to copy some large files and encountered the following error.

Http request failed as there is an error: 'Cannot write more bytes to the buffer than the configured maximum buffer size: 104857600.'.

It turned out there were some extremely large files to be copied from that SharePoint library. SharePoint has a default limit of 100 MB buffer size, and the Get File Content action doesn’t natively support chunking.

At that point, I would have had to figure out how to handle the chunking myself with HTTP calls. That made the Logic App no better than Data Factory for my purposes. And since I already had part of the larger process in Data Factory, I went back there to see how it handled chunking.

Copying a Subset of Files from SharePoint Online Using Data Factory

And it turns out you don’t have to do anything special in Data Factory to handle the chunking of the large files – it does that work for you.

So I set up a similar process to what I had in my Logic App in my ADF pipeline.

Lookup to get file properties with path to For Each loop
ADF pipeline that retrieves a list of files in a SharePoint library and performs an action on each file

First, I used a Lookup activity to get the list of files. When you do a lookup against a SharePoint Online dataset, you can use OData filters to limit the data returned by the lookup. In my case this was necessary because I only wanted 144 of the items out of a list of 4,300. Lookups can only return 5,000 rows, so I definitely needed to add filters to my query. Also, folders are returned as items, so if you want only files, you need to filter the folders out (either by content type or by file name).

Settings for a Lookup activity in ADF. It uses a SharePoint Online source dataset. The Use query setting is set to Query. The Query setting contains the text $filter=endswith(Name, '.bak') and startswith(Name, 'Dama'). First row only is not enabled.
Data Factory Lookup against a SharePoint Online list using an OData query with filters
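
If it helps to see what that filter does, here is a small Python sketch that applies the same logic locally to a made-up lookup result (the item names and paths are hypothetical; in the pipeline the filtering happens server-side via the OData query):

# Hypothetical lookup output; real items come from the SharePoint Online dataset.
items = [
    {"Name": "Dama_2021_full.bak", "Path": "/sites/site1/libraryname/folder1"},
    {"Name": "notes.txt", "Path": "/sites/site1/libraryname/folder1"},
    {"Name": "folder2", "Path": "/sites/site1/libraryname/folder1"},  # a folder row
]

# Same conditions as the OData query: name ends with '.bak' and starts with 'Dama'.
# Filtering on the extension also excludes folder rows, since folders have no '.bak' suffix.
wanted = [i for i in items if i["Name"].endswith(".bak") and i["Name"].startswith("Dama")]
print(wanted)  # only Dama_2021_full.bak survives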

The results of the lookup, returning one item per file, are fed into a For Each activity. Inside the For Each activity, I placed a Web activity to get the necessary bearer token and a Copy activity to copy the file to blob storage.

Data Factory web activity with a path to a copy activity
Inside the For Each activity is a Web activity to retrieve a bearer token, followed by a Copy activity that copies the files to blob storage.

To get a bearer token, you need a service principal with the correct permissions in SharePoint. Then you make a POST to https://accounts.accesscontrol.windows.net/{tenantID}/tokens/OAuth/2.

You need the following header: Content-Type: application/x-www-form-urlencoded.

In the body of the HTTP call, you must include: grant_type=client_credentials&client_id={clientid}@{tenantid}&client_secret={secretvalue}&resource=00000003-0000-0ff1-ce00-000000000000/{SharePoint tenant}.sharepoint.com@{tenantID}.
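
If you want to verify the token request outside of ADF first, here is a minimal Python sketch of the same call; the tenant ID, client ID, secret, and tenant name are placeholders you would substitute, and the Web activity in the pipeline does the equivalent:

import requests

# Placeholder values -- substitute your own tenant and app registration details.
tenant_id = "00000000-0000-0000-0000-000000000000"   # Azure AD tenant ID
client_id = "11111111-1111-1111-1111-111111111111"   # service principal (app) ID
client_secret = "<secret value>"                     # keep this in Key Vault in practice
sp_tenant = "mytenant"                               # the {SharePoint tenant} name

token_url = f"https://accounts.accesscontrol.windows.net/{tenant_id}/tokens/OAuth/2"
body = {
    "grant_type": "client_credentials",
    "client_id": f"{client_id}@{tenant_id}",
    "client_secret": client_secret,
    "resource": f"00000003-0000-0ff1-ce00-000000000000/{sp_tenant}.sharepoint.com@{tenant_id}",
}

# The body is form-encoded, matching the required Content-Type header.
response = requests.post(
    token_url,
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
response.raise_for_status()
access_token = response.json()["access_token"]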

Make sure to set Secure Output so your credentials aren’t written to the ADF logs. Preferably, you would store the client ID and client secret in Key Vault and use a web activity to look them up before getting the token.
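
Outside of ADF, the equivalent Key Vault lookup might look like the sketch below, using the Azure SDK for Python instead of a web activity (the vault URL and secret names are assumptions, not values from this project):

from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault and secret names -- adjust to match your own Key Vault.
vault_url = "https://my-key-vault.vault.azure.net"
secret_client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())

client_id = secret_client.get_secret("sharepoint-client-id").value
client_secret = secret_client.get_secret("sharepoint-client-secret").value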

You can then use the output from the web activity in your copy activity.

ADF Copy Activity Source Settings: Binary dataset. Path and Name are passed as parameters. The request method is GET. The additional header contains the auth token from the web activity.
Source settings for a Copy activity in ADF that copies files from SharePoint

The Additional Headers should be populated with the auth token as shown below:

Authorization: Bearer @{activity('WEB_GetToken').output.access_token}

Since I didn’t need to read the file contents, just copy the files to blob storage, I used a binary dataset. I parameterized the dataset and linked service so my For Each activity could feed it the path and name of each file. This is because the lookup provides the path separately from the file name (which includes the file extension).

Binary dataset with linked service pointing to the HTTP linked service and parameters for Path and Name. The value for Path is @dataset().Path. The value for Name is @dataset().Name. The relative URL is not populated.
Binary dataset using an HTTP linked service, used to retrieve files from SharePoint Online

The linked service uses the HTTP connector. It has two parameters added: path and name.

HTTP linked service with parameters for path and name. Authentication is anonymous.
HTTP linked service used to retrieve files in a SharePoint Online library in Data Factory

The base URL I used in the linked service is:

@{concat('https://{tenant name}.sharepoint.com/sites/{site name}/_api/web/GetFileByServerRelativeUrl(''',replace(linkedService().Path,' ','%20'),'/',replace(linkedService().Name,' ','%20'),''')/$value')}

If my file is located at https://mytenant.sharepoint.com/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV, the URL I need to retrieve the file is https://mytenant.sharepoint.com/sites/site1/_api/web/GetFileByServerRelativeUrl('/sites/site1/libraryname/folder1/folder2/folder3/myfile.CSV')/$value.
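
For reference, here is a minimal Python sketch of the request the Copy activity issues for each file, using the access token from the earlier sketch. The path and file name are hypothetical stand-ins for the Path and Name parameters from the lookup, and it writes to a local file instead of blob storage:

import requests

site_url = "https://mytenant.sharepoint.com/sites/site1"
path = "/sites/site1/libraryname/folder1/folder2/folder3"   # assumed Path value from the lookup
name = "myfile.CSV"                                         # assumed Name value from the lookup

# Mirror the linked service expression, including the space-to-%20 replacement.
file_url = (
    f"{site_url}/_api/web/GetFileByServerRelativeUrl("
    f"'{path.replace(' ', '%20')}/{name.replace(' ', '%20')}')/$value"
)

# Same Authorization header the Copy activity passes as an additional header.
headers = {"Authorization": f"Bearer {access_token}"}

# Stream the response so large files are read in chunks rather than buffered in memory
# (the chunking that ADF handles for you in the Copy activity).
with requests.get(file_url, headers=headers, stream=True) as resp:
    resp.raise_for_status()
    with open(name, "wb") as local_file:
        for chunk in resp.iter_content(chunk_size=8 * 1024 * 1024):
            local_file.write(chunk)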

So once I add my destination info to the Copy Activity, I can copy all the files retrieved from my lookup to blob storage.

A Few Tricky Things

Everyone’s SharePoint tenant is set up differently. Your URL will vary based on the configuration of your site collections/sites. When I tried this in my own tenant, I didn’t have to include the “/sites/{site name}” segment before “/_api/web…”, but I did in my client’s tenant when I was building this.

We granted permissions to the service principal at the site collection level to get this to work. If there was a way to grant narrower permissions, we couldn’t figure it out. Here is the XML we used when the SharePoint admin granted the permissions.

<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web"
    Right="Read" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web/list"
    Right="Read" />
</AppPermissionRequests>

Your auth token is valid for one hour. If you copy a bunch of files sequentially and the whole run takes longer than that, the copies that start after the token expires will fail to authenticate. That is why I moved my Web activity inside the For Each activity, so each file gets a fresh token. If you know it’s only a few files and they copy quickly, you can get one bearer token before the loop and use it for all the files.

Lastly, be sure to set retries and timeouts appropriately. Remember that the default timeout for an activity is 7 days, and that is probably not desirable.

It took a lot of trial and error for me to get this working, as I couldn’t find many blog posts or docs about this. Special thanks to Bill Fellows for helping me through it. I sincerely hope this blog post helps someone avoid my struggles.
