
How to convert Word documents to Excel tables in Python

Updated 2025-01-28. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

How do you convert a Word document into an Excel table in Python? This article walks through the analysis and a solution in detail, in the hope of giving readers who face this problem a simpler way to solve it.

Test reading the Word document

First, test reading the data on the first page of the Word document:

from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")
for i, paragraph in enumerate(doc.paragraphs[:55]):
    print(i, paragraph.text)

Match the question type, title, and specific options

Next, we need to match the question type, the question title, and the individual options. The rules can be found by observation:

Question types begin with a Chinese numeral (一, 二, 三, ...)

Titles begin with an ordinary (Arabic) number followed by a period

Options begin with a letter in parentheses
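The three rules above can be sketched as a small classifier. This is only an illustration: it assumes the "uppercase numbers" are Chinese numerals followed by "、", the function name `classify` is hypothetical, and the sample strings are invented rather than taken from the test paper.

```python
import re

# Assumed patterns: Chinese numeral + "、" for question types,
# digits + "." for titles, "(A)" to "(F)" for options.
QUESTION_TYPE = re.compile(r"[一二三四]、")
TITLE = re.compile(r"\d+\.")
OPTION = re.compile(r"\([A-F]\)")


def classify(line):
    """Return a label for one already-normalized line (hypothetical helper)."""
    if TITLE.match(line):
        return "title"
    if OPTION.match(line):
        return "option"
    if QUESTION_TYPE.match(line):
        return "question type"
    return "other"


# Invented sample lines, one per category.
samples = [
    "一、单项选择题(sample)",
    "7.(  )sample question text",
    "(A)first(B)second",
    "Notes for candidates",
]
labels = [classify(s) for s in samples]
print(labels)
```

Because `match` anchors at the start of the string, a letter in parentheses that appears in the middle of a title does not cause the title to be treated as an option line.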

Additional points to note:

The first few lines of the document also begin with an ordinary number followed by a period; these need to be excluded directly.

The title of question 7 and the options of question 19 contain some special whitespace characters that need to be removed.

Parentheses and periods both appear in half-width and full-width forms.

For the second point, inspect the whitespace characters in these two places:

doc.paragraphs[21].text

'7. (\xa0\xa0) is the first company to implement Six Sigma management.\xa0'

doc.paragraphs[49].text

'(A) Parameter design (B) constant design\u3000(C) variable design\u3000\u3000(D) system design'

They turn out to be \xa0 (no-break space) and \u3000 (ideographic space), respectively.
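A minimal sketch of the cleanup this implies, assuming it is acceptable to strip every whitespace character (reasonable for Chinese text, which has no spaces between words). The helper name `normalize_line` and the sample string are hypothetical, and `str.translate` is used here as one possible way to map full-width punctuation to half-width.

```python
import re

# Whitespace to strip. In Python 3 str patterns, \s already covers \u3000
# and \xa0; they are listed explicitly to make the intent visible.
BLANK_CHARS = re.compile(r"[\s\u3000\xa0]+")

# Full-width punctuation mapped to its half-width equivalent.
FULL_TO_HALF = str.maketrans({"（": "(", "）": ")", "．": "."})


def normalize_line(text):
    """Strip all whitespace, then convert full-width punctuation."""
    return BLANK_CHARS.sub("", text).translate(FULL_TO_HALF)


# Invented sample mixing full-width parentheses with both space characters.
raw = "（A）alpha\u3000（B）beta\u3000\u3000（C）gamma\xa0"
clean = normalize_line(raw)
print(repr(clean))
```

Normalizing first means the matching rules only need to handle the half-width forms.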

With the general approach worked out, here is the processing code:

import re
from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

# Whitespace to strip, including the ideographic space and the no-break space
black_char = re.compile(r"[\s\u3000\xa0]+")
# Question type: a Chinese numeral followed by "、", captured up to the first "("
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
# Title: an ordinary number followed by a period
title_rule = re.compile(r"\d+\.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Traverse the data starting from "一、single choice" in the Word document
for paragraph in doc.paragraphs[5:25]:
    # Remove whitespace, convert full-width characters to half-width,
    # and put two spaces between empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        print("title", line)
    elif option_rule.match(line):
        print("options", option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            print("question type", chinese_nums_match.group(1))

Save the matched data to a structured dictionary

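Next, I store the matched text as structured data in nested dictionaries: the outer dictionary maps each question type to an inner dictionary, which in turn maps each question title to its list of options. The design is as follows: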
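As a hedged illustration of that design, assuming a two-level mapping (question type to title to list of options), the structure might be built like this. All keys and option strings are invented sample data, not content from the test paper.

```python
from collections import OrderedDict

# Assumed shape: question type -> question title -> list of option strings.
question_type2data = OrderedDict()

# setdefault returns the existing inner mapping or creates it on first use.
title2options = question_type2data.setdefault("single choice", OrderedDict())
options = title2options.setdefault("1.(  )sample question text", [])
options.extend(["(A)first", "(B)second", "(C)third", "(D)fourth"])

title2options = question_type2data.setdefault("multiple choice", OrderedDict())
options = title2options.setdefault("21.(  )another sample question", [])
options.extend(["(A)first", "(B)second", "(C)third", "(D)fourth", "(E)fifth"])

for question_type, titles in question_type2data.items():
    for title, opts in titles.items():
        print(question_type, "|", title, "|", len(opts), "options")
```

Using `setdefault` keeps a reference to the current inner dictionary and the current option list, so option lines that span several paragraphs can be appended to the same question.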

Extend the code according to this design:

import re
from docx import Document
from collections import OrderedDict

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

black_char = re.compile(r"[\s\u3000\xa0]+")
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Holds the final structured data
question_type2data = OrderedDict()
# Traverse the data starting from "一、single choice" in the Word document
for paragraph in doc.paragraphs[5:]:
    # Remove whitespace, convert full-width characters to half-width,
    # and put two spaces between empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        # A new question: start an empty option list for it
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        # Option lines are appended to the most recent question
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            # A new question type: start a new inner dictionary
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

Traverse the structured dictionary and store the data

Then we iterate through the structured dictionary and save the data to a pandas DataFrame:

import pandas as pd

result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len
df = pd.DataFrame(
    result,
    columns=["question type", "title"]
    + [f"option {i}" for i in range(1, max_options_len + 1)],
)
# The question type can be simplified by removing the word "选择" ("choice")
df["question type"] = df["question type"].str.replace("选择", "")
df.head()

Results:

[Image: output of df.head()]
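To make the shape of the flattened table easier to see, here is a rough pandas-free sketch of the same flattening step. It pads shorter rows by hand, which, as far as I understand, pandas does on its own when a row has fewer values than there are columns. The nested dictionary is invented sample data.

```python
from collections import OrderedDict

# Invented sample in the shape: question type -> title -> options.
question_type2data = OrderedDict([
    ("single", OrderedDict([("1.(  )q one", ["(A)a", "(B)b"])])),
    ("multiple", OrderedDict([("2.(  )q two", ["(A)a", "(B)b", "(C)c"])])),
])

# One flat row per question: type, title, then each option.
rows = []
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        rows.append([question_type, title, *options])

# The widest row decides how many option columns are needed.
max_options_len = max(len(row) - 2 for row in rows)
header = ["question type", "title"] + [
    f"option {i}" for i in range(1, max_options_len + 1)
]

# Pad shorter rows with None so every row matches the header width.
padded = [row + [None] * (len(header) - len(row)) for row in rows]
print(header)
for row in padded:
    print(row)
```

Questions with fewer options than the maximum simply end up with empty trailing cells.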

Finally, save the result:

df.to_excel("result.xlsx", index=False)

Complete code

The final complete code:

import re
from collections import OrderedDict

import pandas as pd
from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

black_char = re.compile(r"[\s\u3000\xa0]+")
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Holds the final structured data
question_type2data = OrderedDict()
# Traverse the data starting from "一、single choice" in the Word document
for paragraph in doc.paragraphs[5:]:
    # Remove whitespace, convert full-width characters to half-width,
    # and put two spaces between empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

# Flatten the nested dictionary into one row per question
result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len
df = pd.DataFrame(
    result,
    columns=["question type", "title"]
    + [f"option {i}" for i in range(1, max_options_len + 1)],
)
# The question type can be simplified by removing the word "选择" ("choice")
df["question type"] = df["question type"].str.replace("选择", "")
df.to_excel("result.xlsx", index=False)

The resulting document:

That concludes this walkthrough of how to convert a Word document into an Excel table in Python. I hope it is of some help.
