F# API crawler using recursion or loop
Question
I am trying to build a simple API crawler, that goes through paginated content and returns the aggregate result. This was my first attempt on writing a recursive function:
namespace MyProject

open FSharp.Data

type MyJson = JsonProvider<"./Resources/Example.json">

module ApiClient =

    let rec getAllPagesRec baseUrl (acc: MyJson.ItemList[]) (page: int) =
        async {
            let pageUrl = baseUrl + "&page=" + page.ToString()
            let! response = Http.AsyncRequestString(pageUrl)
            let content = MyJson.Parse(response)
            let mergedArray = Array.concat [ content.ItemList; acc ]
            if content.PageNumber < content.PageCount
            then return! getAllPagesRec baseUrl mergedArray (page + 1)
            else return mergedArray
        }
However, I then read in the Microsoft docs that code like this is wasteful of memory and processor time, so I rewrote my function as a loop:
    let getAllPagesLoop baseUrl =
        async {
            let mutable continueLoop = true
            let mutable page = 1
            let mutable acc = [||]
            while continueLoop do
                let pageUrl = baseUrl + "&page=" + page.ToString()
                let! response = Http.AsyncRequestString(pageUrl)
                let content = MyJson.Parse(response)
                acc <- Array.concat [ content.ItemList; acc ]
                if content.PageNumber < content.PageCount
                then page <- page + 1
                else continueLoop <- false
            return acc
        }
The second function looks much more C#-like, and it contains a lot of mutation, which seems to contradict the philosophy of F#. Is there any other way to "optimize" the first function, maybe using the yield/yield! keywords? Or are both functions good enough?
Answer
Your first piece of code is fine. What the Microsoft docs warn against is recursion where the recursive calls recompute previously computed values. Fibonacci is defined as

    Fib(n) = Fib(n-1) + Fib(n-2)

So to calculate Fib 5 you compute

    Fib 4 + Fib 3

but computing Fib 4 in turn requires

    Fib 3 + Fib 2

which means you're calculating Fib 3 again, and likewise Fib 2. For a large n there would be lots of recalculations. What the docs are saying is that in these scenarios you want to cache (memoize) those results.
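To illustrate the caching idea, a minimal memoized Fibonacci in F# could look like this (a sketch for illustration, not part of the original question; the name fibMemo is my own):

```fsharp
open System.Collections.Generic

// Memoized Fibonacci: intermediate results are stored in a dictionary,
// so each Fib n is computed at most once instead of exponentially often.
let fibMemo =
    let cache = Dictionary<int, int64>()
    let rec fib n =
        if n < 2 then int64 n
        else
            match cache.TryGetValue n with
            | true, v -> v            // already computed, reuse it
            | false, _ ->
                let v = fib (n - 1) + fib (n - 2)
                cache.[n] <- v        // remember the result
                v
    fib
```

With the cache, fibMemo 90 runs in linear time, whereas the naive recursive definition would make an exponential number of calls.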
In your scenario, presumably page 1 links to pages 10, 11, 12 but not back to page 1, and page 10 links to pages 100, 101, 102 but not back to pages 1…, 10…, etc. In that case there is no wasted computation, because you never fetch the same page twice.
If that is not the case and there might be cyclic links, then you need to keep track of the visited pages: pass along a collection of visited pages, add each page to it as you visit it, and skip fetching any page that is already in the visited collection.
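A sketch of that visited-tracking idea, using an F# Set as the accumulator (the fetchLinks parameter is a hypothetical function standing in for the actual HTTP call):

```fsharp
// Crawl a graph of pages that may contain cycles.
// fetchLinks : int -> int list  -- fetches a page and returns the pages it links to.
// Returns the set of all pages visited.
let rec crawl (fetchLinks: int -> int list) (visited: Set<int>) (pending: int list) =
    match pending with
    | [] -> visited
    | page :: rest when Set.contains page visited ->
        crawl fetchLinks visited rest                      // already seen: skip, breaking cycles
    | page :: rest ->
        let links = fetchLinks page                        // visit the page once
        crawl fetchLinks (Set.add page visited) (links @ rest)
```

Because a page already in the set is never fetched again, the crawl terminates even when page A links to page B and page B links back to page A.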
On another note, your code is not taking advantage of parallelization. If you can get the number of pages in advance (perhaps with a single call for the first page), you can parallelize the downloading of the pages like this:
    let getPage baseUrl (page: int) : Async<MyJson.ItemList[]> =
        async {
            let pageUrl = baseUrl + "&page=" + page.ToString()
            let! response = Http.AsyncRequestString(pageUrl)
            let content = MyJson.Parse(response)
            return content.ItemList
        }

    [| 1..nPages |]
    |> Array.map (getPage baseUrl)
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Array.concat
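If the API rate-limits you, the fan-out can be throttled. Recent versions of FSharp.Core (4.7 and later, to my knowledge) provide an Async.Parallel overload with an optional maxDegreeOfParallelism argument; a sketch, assuming that overload is available:

```fsharp
// Limit concurrent downloads to 4 at a time instead of firing all pages at once.
let downloadAll baseUrl nPages =
    [| 1..nPages |]
    |> Array.map (getPage baseUrl)
    |> fun jobs -> Async.Parallel(jobs, maxDegreeOfParallelism = 4)
    |> Async.RunSynchronously
    |> Array.concat
```

This keeps the same aggregate result while being gentler on the server; tune the degree of parallelism to whatever the API tolerates.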