F# Api crawler using recursion or loop

Question

I am trying to build a simple API crawler that goes through paginated content and returns the aggregated result. This was my first attempt at writing a recursive function:

namespace MyProject

open FSharp.Data

type MyJson = JsonProvider<"./Resources/Example.json">

module ApiClient = 
    let rec getAllPagesRec baseUrl (acc: MyJson.ItemList[]) (page: int) = 
        async {
            let pageUrl = baseUrl + "&page=" + page.ToString()
            let! response = Http.AsyncRequestString(pageUrl)
            let content = MyJson.Parse(response)

            let mergedArray = Array.concat [ content.ItemList; acc ]

            if content.PageNumber < content.PageCount
            then return! getAllPagesRec baseUrl mergedArray (page+1)
            else return mergedArray
        }

However, I then read in the Microsoft docs that code like this is wasteful of memory and processor time, so I rewrote my function as a loop:

    let getAllPagesLoop baseUrl =
        async {
            let mutable continueLoop = true
            let mutable page = 1
            let mutable acc = [||]

            while continueLoop do
                let pageUrl = baseUrl + "&page=" + page.ToString()
                let! response = Http.AsyncRequestString(pageUrl)
                let content = MyJson.Parse(response)
                acc <- Array.concat [ content.ItemList; acc ]

                if content.PageNumber < content.PageCount
                then page <- page + 1
                else continueLoop <- false

            return acc
        }

The second function looks much more C#-y and it contains a lot of mutation, which seems to contradict the philosophy of F#. Is there any other way to "optimize" the first function? Maybe using the yield/yield! keywords? Or are both functions good enough?

Tags: f#

Solution


Your first piece of code is fine (with one caveat, noted below). What the Microsoft docs warn against is recursion where the recursive calls keep recomputing values that were already computed. Fibonacci is defined as

Fib n = Fib (n-1) + Fib (n-2)

So if you calculate Fib 5 you do

Fib 4 + Fib 3

But when you do Fib 4 you do

Fib 3 + Fib 2

Which means you're calculating Fib 3 again, and likewise Fib 2. For a large n there would be a lot of recalculation. What the docs are saying is that in these scenarios you want to cache (memoize) those results.
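
For illustration only, here is a generic memoization sketch (not part of your crawler; fibNaive and fibMemo are made-up names). Caching turns the exponential recursion into a linear one:

open System.Collections.Generic

// Naive version: Fib (n-1) and Fib (n-2) keep recomputing the same subproblems.
let rec fibNaive n =
    if n < 2 then n
    else fibNaive (n - 1) + fibNaive (n - 2)

// Memoized version: each Fib n is computed once and then looked up from the cache.
let fibMemo =
    let cache = Dictionary<int, int>()
    let rec fib n =
        if n < 2 then n
        else
            match cache.TryGetValue n with
            | true, cached -> cached
            | false, _ ->
                let result = fib (n - 1) + fib (n - 2)
                cache.[n] <- result
                result
    fib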

In your scenario, presumably page 1 has links to pages 10, 11, 12 but not back to page 1, and page 10 has links to pages 100, 101, 102 but not back to pages 1.., 10.., etc. Then there is no wasted computation, because you are not recalculating anything.

If this is not the case and there might be cyclic links, then you need to keep track of the visited pages: for example, pass along a list of visited pages, add each page to it as you visit it, and skip fetching a page if it is already in the visited list.
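
A rough sketch of that, building on your recursive version (getAllPagesSafe and the visited parameter are made-up names, and a Set<int> of page numbers stands in for the "visited list"):

// Sketch: same shape as getAllPagesRec, but threading a set of
// already-visited page numbers through the recursion to break cycles.
let rec getAllPagesSafe baseUrl (acc: MyJson.ItemList[]) (visited: Set<int>) (page: int) =
    async {
        if Set.contains page visited then
            return acc          // page already fetched; stop instead of looping forever
        else
            let pageUrl = baseUrl + "&page=" + page.ToString()
            let! response = Http.AsyncRequestString(pageUrl)
            let content = MyJson.Parse(response)
            let mergedArray = Array.concat [ content.ItemList; acc ]

            if content.PageNumber < content.PageCount
            then return! getAllPagesSafe baseUrl mergedArray (Set.add page visited) (page + 1)
            else return mergedArray
    }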

On another note, your code may not be taking advantage of parallelization. If you can get the number of pages in advance (maybe with a single call for the first page), you can parallelize the downloading of the pages like this:

let getPage baseUrl (page: int) : Async<MyJson.ItemList[]> =
    async {
        let pageUrl = baseUrl + "&page=" + page.ToString()
        let! response = Http.AsyncRequestString(pageUrl)
        let content = MyJson.Parse(response)

        return content.ItemList
    }

[| 1..nPages |]
|> Array.map (getPage baseUrl)
|> Async.Parallel
|> Async.RunSynchronously
|> Array.concat
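
If you don't know nPages up front, one way to wire this together (just a sketch reusing getPage above; getAllPagesParallel is a hypothetical name, and it assumes the PageCount and ItemList properties from your JSON sample) is to fetch the first page, read its PageCount, and then download the rest in parallel:

let getAllPagesParallel baseUrl =
    async {
        // One call to learn how many pages there are.
        let! response = Http.AsyncRequestString(baseUrl + "&page=1")
        let firstPage = MyJson.Parse(response)

        // Download the remaining pages in parallel.
        let! rest =
            [| 2 .. firstPage.PageCount |]
            |> Array.map (getPage baseUrl)
            |> Async.Parallel

        // Flatten the first page's items together with the rest.
        return Array.concat (Array.append [| firstPage.ItemList |] rest)
    }

Note that inside an async workflow you bind the parallel download with let! rather than calling Async.RunSynchronously, so the whole pipeline stays asynchronous.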
