首页 > 解决方案 > 重构python函数的最佳方法

问题描述

我有一个混乱的函数,我正在努力重构以提高效率和可读性。我的 python 技能充其量只是初学者到中级,我想有一种更清洁的方法来完成这项任务。

下面的函数接受一个字符串,其中包含各种业务联系相关信息。信息用冒号分隔。公司名称始终是第一个字段,因此可以轻松提取,但其余“列”(冒号之间的数据)可能包含也可能不包含,并且顺序并不总是相同。

该函数有两个参数,1) 行数据(包含以下示例的字符串)和 2) 我希望返回的数据元素。

# Business Contact Information
def parseBusinessContactInformation(self,rowdata,element):

    ## Process Business Contact Information
    ## example rowdata = "Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com"
    ## example rowdata = "Business Name, LLC : Phone- 1234567890"
  
    businessName = None
    businessDba = None
    businessPhone = None
    businessEmail = None
    businessWebsite = None
    
    # Split rowdata on :
    contactData = rowdata.split(':')

    ## [0] - business name should always be present
    businessName = contactData[0].strip()
    
    ## [1] - doing_business_as or another field if not present
    if 1 < len(contactData) and re.search('email',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessEmail = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and re.search('phone',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessPhone = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and re.search('website',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessWebsite = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif 1 < len(contactData) and not re.search(r'(phone|email|website)',contactData[1].lower()):
        businessDba = contactData[1].strip()
    else:
        businessDba = self.dataNotAvailableMessage
    
    ## [2] - phone or email or website
    if 2 < len(contactData) and re.search('email',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessEmail = contactTemp[1].strip()
    elif 2 < len(contactData) and re.search('phone',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessPhone = contactTemp[1].strip()
    elif 2 < len(contactData) and re.search('website',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessWebsite = contactTemp[1].strip()
    
    ## [3] - phone or email or website
    if 3 < len(contactData) and re.search('email',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessEmail = contactTemp[1].strip()
    elif 3 < len(contactData) and re.search('phone',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessPhone = contactTemp[1].strip()
    elif 3 < len(contactData) and re.search('website',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessWebsite = contactTemp[1].strip()
    
    if element == "businessName":
        return businessName
    elif element == "businessDba":
        return businessDba
    elif element == "businessPhone":
        return businessPhone
    elif element == "businessEmail":
        return businessEmail
    elif element == "businessWebsite":
        return businessWebsite
    else:
        return self.dataNotAvailableMessage

我试图了解一种更好的方法来做到这一点。

标签: pythonrefactoringtext-parsing

解决方案


重构是一个累积的过程。您在Martin Fowler 和 Kent Beck的Refactoring中全面描述了该方法。

它的核心是一系列小的行为保持转换。(马丁·福勒,https://refactoring.com/

最重要的部分是:“小”和“行为保持”。“小”这个词是不言自明的,但是单元测试应该确保“行为保持”。

初步评论:我建议你坚持使用PEP 8 Style Guide

行为保持

用文档字符串 ( https://www.python.org/dev/peps/pep-0008/#id33 )替换您的评论。这非常有用,因为您在 docstring(又名doctests)中编写了一些单元测试。

class MyParser:
    dataNotAvailableMessage = "dataNotAvailableMessage"

    # Business Contact Information
    def parseBusinessContactInformation(self,rowdata,element):
        """Process Business Contact Information
        
        Examples:
        >>> p = MyParser()
        >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
        '1234567890'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
        'Business Name, LLC'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
        'Business DBA'
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
        True
        
        >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?") 
        'dataNotAvailableMessage'
        
        """

        ...
        
import doctest
doctest.testmod()            
  

您应该编写更多单元测试(使用https://docs.python.org/3/library/unittest.html以避免淹没文档字符串)来保护当前行为,但这是一个好的开始。

现在,一个小的转变:看看那些(el)if 1 < len(contactData) and ...线。您可以只测试一次长度:

if 1 < len(contactData):
    if re.search('email',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessEmail = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif re.search('phone',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessPhone = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif re.search('website',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessWebsite = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif not re.search(r'(phone|email|website)',contactData[1].lower()):
        businessDba = contactData[1].strip()
    else:
        businessDba = self.dataNotAvailableMessage
else:
    businessDba = self.dataNotAvailableMessage

您注意到倒数第二个else无法到达:您有phone,emailwebsite没有:

if 1 < len(contactData):
    if re.search('email',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessEmail = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif re.search('phone',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessPhone = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    elif re.search('website',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessWebsite = contactTemp[1].strip()
        businessDba = contactData[0].strip()
    else:
        businessDba = contactData[1].strip()
else:
    businessDba = self.dataNotAvailableMessage

对 [2] 和 [3] 执行相同操作:

if 2 < len(contactData):
    if re.search('email',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessEmail = contactTemp[1].strip()
    elif re.search('phone',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessPhone = contactTemp[1].strip()
    elif re.search('website',contactData[2].lower()):
        contactTemp = contactData[2].split('-')
        businessWebsite = contactTemp[1].strip()
    
if 3 < len(contactData):
    if re.search('email',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessEmail = contactTemp[1].strip()
    elif re.search('phone',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessPhone = contactTemp[1].strip()
    elif re.search('website',contactData[3].lower()):
        contactTemp = contactData[3].split('-')
        businessWebsite = contactTemp[1].strip()

现在你看到了一个清晰的模式。除了第一部分的作业businessDba,你显然有三倍的相同过程。首先,我们隔离businessDba第一部分中的赋值:

if 1 < len(contactData):
    if re.search('(email|phone|website)',contactData[1].lower()):
        businessDba = contactData[0].strip()
    else:
        businessDba = contactData[1].strip()
else:
    businessDba = self.dataNotAvailableMessage

进而:

if 1 < len(contactData):
    if re.search('email',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessEmail = contactTemp[1].strip()
    elif re.search('phone',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessPhone = contactTemp[1].strip()
    elif re.search('website',contactData[1].lower()):
        contactTemp = contactData[1].split('-')
        businessWebsite = contactTemp[1].strip()

在我们走得更远之前,我们可以删除该行

businessName = None
businessDba = None

自从businessName并且businessDba始终具有价值。并替换新行:

businessDba = contactData[0].strip()

经过:

businessDba = businessName

这使得回退变得明确。

现在,我们有三倍于相同的过程。循环是个好主意:

for i in range(1, 3):
    if i >= len(contactData):
        break
        
    if re.search('email',contactData[i].lower()):
        contactTemp = contactData[i].split('-')
        businessEmail = contactTemp[1].strip()
    elif re.search('phone',contactData[i].lower()):
        contactTemp = contactData[i].split('-')
        businessPhone = contactTemp[1].strip()
    elif re.search('website',contactData[i].lower()):
        contactTemp = contactData[i].split('-')
        businessWebsite = contactTemp[1].strip()

我们可以提取contactTemp = ,即使它并不总是有用:

for i in range(1, 3):
    if i >= len(contactData):
        break
    contactTemp = contactData[i].split('-')
        
    if re.search('email',contactData[i].lower()):
        businessEmail = contactTemp[1].strip()
    elif re.search('phone',contactData[i].lower()):
        businessPhone = contactTemp[1].strip()
    elif re.search('website',contactData[i].lower()):
        businessWebsite = contactTemp[1].strip()

这更好,但我发现最后一部分 ( if element == ...) 真的很麻烦:你测试element所有可能性。有人想要一本字典。对于一个小的转换,我们可以写:

d = {
    "businessName": businessName,
    "businessDba": businessDba,
    "businessPhone": businessPhone,
    "businessEmail": businessEmail,
    "businessWebsite": businessWebsite
}
return d.get(element, self.dataNotAvailableMessage)

现在,我们可以创建它并即时更新它,而不是在最后初始化 dict:

    d = {
        "businessPhone": None,
        "businessEmail": None,
        "businessWebsite": None
    }
    
    # Split rowdata on :
    contactData = rowdata.split(':')

    ## [0] - business name should always be present
    d["businessName"] = contactData[0].strip()

    if 1 < len(contactData):
        if re.search('(email|phone|website)',contactData[1].lower()):
            d["businessDba"] = d["businessName"]
        else:
            d["businessDba"] = contactData[1].strip()
    else:
        d["businessDba"] = self.dataNotAvailableMessage

    for i in range(1, 4):
        if i >= len(contactData):
            break
            
        contactTemp = contactData[i].split('-')
        if re.search('email',contactData[i].lower()):
            d["businessEmail"] = contactTemp[1].strip()
        elif re.search('phone',contactData[i].lower()):
            d["businessPhone" = contactTemp[1].strip()
        elif re.search('website',contactData[i].lower()):
            d["businessWebsite"] = contactTemp[1].strip()
    
    return d.get(element, self.dataNotAvailableMessage)

我对每个修改都进行了测试,它仍然有效,但它并不那么容易阅读。我们可以提取一个创建字典的函数:

def parseBusinessContactInformation(self, rowdata, element):
    d = self._parseBusinessContactInformation(rowdata)
    return d.get(element, self.dataNotAvailableMessage)

def _parseBusinessContactInformation(self, rowdata):
    ...

有一个小的行为改变

这还不错,但我们可以通过一些小的行为改变来改善这一点(我希望你能接受这个新行为!):

    for i in range(1, 4):
        if i >= len(contactData):
            break
            
        contactTemp = contactData[i].split('-')
        if len(contactTemp) > 1:
            d["business" + contactTemp[0].strip()] = contactTemp[1].strip()
        

行为改变是什么?简单地说,我们现在接受类似

>>> p = MyParser()
>>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
'Bar'

由于我们接受更多element的 s,我们应该改变循环range

    for i in range(1, len(contactData)):
        ...

是时候关注一个轻微的不一致了:为什么可以businessDba拥有self.dataNotAvailableMessage为不存在元素的情况而创建的值?我们应该使用None

    d = {
        "businessDba": None,
        ...
    }

并删除这两行:

    else:
        d["businessDba"] = self.dataNotAvailableMessage

那么这可以简化:

    if 1 < len(contactData):
        if "-" in contactData[1]:
            d["businessDba"] = d["businessName"]
        else:
            d["businessDba"] = contactData[1].strip()

这是代码:

def parseBusinessContactInformation(self,rowdata,element):
    """Process Business Contact Information
    
    Examples:
    >>> p = MyParser()
    >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
    '1234567890'
    
    >>> p.parseBusinessContactInformation("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
    'Business Name, LLC'
    
    >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
    'Business DBA'
    
    >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
    True
    
    >>> p.parseBusinessContactInformation("Business Name, LLC : Phone- 1234567890", "?") 
    'dataNotAvailableMessage'
    
    >>> p.parseBusinessContactInformation("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
    'Bar'
    
    """
    d = self._parseBusinessContactInformation(rowdata)
    return d.get(element, self.dataNotAvailableMessage)
  
def _parseBusinessContactInformation(self,rowdata):
    d = {
        "businessDba": None,
        "businessPhone": None,
        "businessEmail": None,
        "businessWebsite": None
    }
    
    # Split rowdata on :
    contactData = rowdata.split(':')

    ## [0] - business name should always be present
    d["businessName"] = contactData[0].strip()

    if 1 < len(contactData):
        if "-" in contactData[1]:
            d["businessDba"] = d["businessName"]
        else:
            d["businessDba"] = contactData[1].strip()

    for i in range(1, len(contactData)):
        contactTemp = contactData[i].split('-')
        if len(contactTemp) > 1:
            d["business" + contactTemp[0].strip()] = contactTemp[1].strip()

    return d

最后一点:切换到蛇案例,制作一个get和一个parse函数:parse返回一个字典,同时get返回一个值:

data_not_available_message = "dataNotAvailableMessage"

def get_business_contact_information(self, rowdata, element):
    """Process Business Contact Information
    
    Examples:
    >>> p = MyParser()
    >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessPhone")
    '1234567890'
    
    >>> p.get_business_contact_information("Business Name, LLC : Email- person@email.com : Phone- 1234567890 : Website- www.site.com", "businessName")
    'Business Name, LLC'
    
    >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Phone- 1234567890 : Website- www.site.com", "businessDba")
    'Business DBA'
    
    >>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "businessEmail") is None
    True
    
    >>> p.get_business_contact_information("Business Name, LLC : Phone- 1234567890", "?") 
    'dataNotAvailableMessage'
    
    >>> p.get_business_contact_information("Business Name, LLC : Business DBA : Foo- Bar", "businessFoo")
    'Bar'
    
    :param rowdata: ...
    :param element: ...
    :return: ...
    """
    d = self._parse_business_contact_information(rowdata)
    return d.get(element, self.data_not_available_message)

进行了一些外观上的更改,使其更加 Pythonic:

def parse_business_contact_information(self, rowdata):
    """Process Business Contact Information
    
    Examples:
    >>> p = MyParser()
    >>> p.parse_business_contact_information("Business Name, LLC : Business DBA : Email- person@email.com : Phone- 1234567890 : Website- www.site.com") == {
    ... 'businessDba': 'Business DBA', 'businessPhone': '1234567890', 'businessEmail': 'person@email.com', 
    ... 'businessWebsite': 'www.site.com', 'businessName': 'Business Name, LLC'}        
    True

    >>> p.parse_business_contact_information("Business Name, LLC : Phone- 1234567890") == {
    ... 'businessDba': 'Business Name, LLC', 'businessPhone': '1234567890', 'businessEmail': None, 
    ... 'businessWebsite': None, 'businessName': 'Business Name, LLC'}
    True
    
    :param rowdata: ...
    :return: ...
    """
    d = dict.fromkeys(("businessDba", "businessPhone", 
                       "businessEmail", "businessWebsite"))
    
    name, *others = rowdata.split(':') # destructuring assignment

    d["businessName"] = name.strip()
    if not others:
        return d
    
    if "-" in others[0]:
        d["businessDba"] = d["businessName"]
    else:
        d["businessDba"] = others[0].strip()
        others.pop(0) # consume others[0]

    for data in others:
        try:
            key, value = data.split('-', 1) # a- b-c => a, b-c
        except ValueError: # too many/not enough values to unpack
            print("Element {} should have a dash".format(data))
        else:
            d["business" + key.strip()] = value.strip()

    return d

代码并不完美,但它比以前更清晰,至少在我看来。

总结一下方法:

  1. 编写单元测试以保护行为;
  2. 进行小的转换以保留行为提高可读性。考虑你能做的,不要在这里关注性能;
  3. 继续,直到你有清楚的东西/当你转圈并进行不必要的修改时停止;
  4. 如有必要,提高性能。

推荐阅读