2025-01-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
How can Word documents be converted into Excel tables in Python? This article walks through the analysis and a solution in detail, in the hope of giving readers who face this problem a simpler approach.
Testing reading the Word document
First, test reading the data on the first page of the Word document:
from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")
for i, paragraph in enumerate(doc.paragraphs[:55]):
    print(i, paragraph.text)
Match the question type, title, and specific options
What we need to do now is match the question type, the question title, and the specific options. The rules can be found by observation:
The question type begins with a Chinese numeral
The question title begins with an Arabic number followed by a period
Options begin with a parenthesized letter
❝
Additional points to note:
The first few lines of text also begin with an Arabic number followed by a period; they need to be excluded directly.
Some special whitespace characters need to be removed from the title of question 7 and the options of question 19.
Parentheses and periods both appear in half-width and full-width forms.
❞
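As an illustration of the third note, here is a minimal sketch of mapping full-width punctuation to half-width so that a single set of matching rules can cover both forms. The helper name `normalize_punctuation` and the sample string are assumptions for this example and do not come from the original code.

```python
# Hypothetical helper (not part of the original article): map full-width
# parentheses and periods to their half-width equivalents.
FULL_TO_HALF = str.maketrans({"（": "(", "）": ")", "．": "."})


def normalize_punctuation(text: str) -> str:
    """Return text with full-width parentheses and periods made half-width."""
    return text.translate(FULL_TO_HALF)


# Made-up sample line mixing full-width and half-width characters.
print(normalize_punctuation("7．（A）sample"))
```

Chained `str.replace` calls, as used later in the article, achieve the same result; a translation table simply keeps the mapping in one place.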
Regarding the second point, check the whitespace characters in these two places:
doc.paragraphs[21].text
'7. (\xa0\xa0) is the first company to implement Six Sigma management.\xa0'
doc.paragraphs[49].text
'(A) parameter design (B) constant design\u3000(C) variable design\u3000\u3000(D) system design'
They turn out to be \xa0 and \u3000, respectively.
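A minimal sketch of how such characters could be stripped with one regular expression. The variable name `blank_chars` and the sample string are assumptions for illustration. Note that the pattern also removes ordinary spaces, which is acceptable here because the text is re-matched by position, not by spacing.

```python
import re

# Character class covering ordinary whitespace plus the two special
# characters found above (\xa0 and \u3000). For str patterns, \s already
# matches both; they are listed explicitly to make the intent visible.
blank_chars = re.compile(r"[\s\u3000\xa0]+")

# Made-up sample resembling the paragraphs inspected above.
sample = "7. (\xa0\xa0) first company\u3000\xa0"
cleaned = blank_chars.sub("", sample)
print(repr(cleaned))
```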
With the general idea sorted out, here is the processing code:
import re

from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

# Whitespace to strip, including \u3000 and \xa0
black_char = re.compile(r"[\s\u3000\xa0]+")
# Question type: a Chinese numeral followed by "、", captured up to the first "("
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
# Question title: an Arabic number followed by a period
title_rule = re.compile(r"\d+\.")
# Options: a parenthesized letter
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Traverse the data starting from the "one, single choice" heading in the Word document
for paragraph in doc.paragraphs[5:25]:
    # Remove whitespace, change full-width characters to half-width,
    # and put two spaces inside empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        print("title", line)
    elif option_rule.match(line):
        print("options", option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            print("question type", chinese_nums_match.group(1))
Save the matched data to a structured dictionary
Now I intend to store the matched text as structured data in the form of a dictionary: the outer dictionary maps each question type to an inner dictionary, which in turn maps each question title to the list of its options.
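A minimal sketch of that nested shape, using made-up sample data (the question type, title, and option strings below are hypothetical and only illustrate the layout):

```python
from collections import OrderedDict

# Outer mapping: question type -> inner mapping.
question_type2data = OrderedDict()

# Inner mapping: question title -> list of option strings.
title2options = question_type2data.setdefault("single choice", OrderedDict())
options = title2options.setdefault("1.(  ) is an example question.", [])
options.extend(["(A) first", "(B) second"])

for question_type, titles in question_type2data.items():
    for title, opts in titles.items():
        print(question_type, title, opts)
```

Using `setdefault` means a question type or title is created the first time it is seen and reused afterwards, so options on later lines are appended to the most recent title.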
Complete the code according to this design:
import re
from collections import OrderedDict

from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

black_char = re.compile(r"[\s\u3000\xa0]+")
# Question type: a Chinese numeral followed by "、", captured up to the first "("
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Holds the final structured data
question_type2data = OrderedDict()
# Traverse the data starting from the "one, single choice" heading in the Word document
for paragraph in doc.paragraphs[5:]:
    # Remove whitespace, change full-width characters to half-width,
    # and put two spaces inside empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        # A new question: start an empty option list under the current question type
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

Traverse the structured dictionary and store the data
Then we iterate through the structured dictionary and save the data to a pandas DataFrame:
import pandas as pd

result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len
# One column per option, sized to the longest option list
df = pd.DataFrame(
    result,
    columns=["question type", "title"] + [f"option {i}" for i in range(1, max_options_len + 1)],
)
# The question type can be simplified by removing the words "选择题" ("choice question")
df["question type"] = df["question type"].str.replace("选择题", "")
df.head()
Results:
Finally, save the result:
df.to_excel("result.xlsx", index=False)

Complete code
The final complete code:
import re
from collections import OrderedDict

import pandas as pd
from docx import Document

doc = Document("No. 02 quality inspector Senior technician (level I) Theory Test Paper .docx")

black_char = re.compile(r"[\s\u3000\xa0]+")
# Question type: a Chinese numeral followed by "、", captured up to the first "("
chinese_nums_rule = re.compile(r"[一二三四]、(.+?)\(")
title_rule = re.compile(r"\d+\.")
option_rule = re.compile(r"\([ABCDEF]\)")
option_rule_search = re.compile(r"\([ABCDEF]\)[^(]+")

# Holds the final structured data
question_type2data = OrderedDict()
# Traverse the data starting from the "one, single choice" heading in the Word document
for paragraph in doc.paragraphs[5:]:
    # Remove whitespace, change full-width characters to half-width,
    # and put two spaces inside empty parentheses
    line = (
        black_char.sub("", paragraph.text)
        .replace("（", "(")
        .replace("）", ")")
        .replace("．", ".")
        .replace("()", "(  )")
    )
    # Skip blank lines
    if not line:
        continue
    if title_rule.match(line):
        options = title2options.setdefault(line, [])
    elif option_rule.match(line):
        options.extend(option_rule_search.findall(line))
    else:
        chinese_nums_match = chinese_nums_rule.match(line)
        if chinese_nums_match:
            question_type = chinese_nums_match.group(1)
            title2options = question_type2data.setdefault(question_type, OrderedDict())

result = []
max_options_len = 0
for question_type, title2options in question_type2data.items():
    for title, options in title2options.items():
        result.append([question_type, title, *options])
        options_len = len(options)
        if options_len > max_options_len:
            max_options_len = options_len
df = pd.DataFrame(
    result,
    columns=["question type", "title"] + [f"option {i}" for i in range(1, max_options_len + 1)],
)
# The question type can be simplified by removing the words "选择题" ("choice question")
df["question type"] = df["question type"].str.replace("选择题", "")
df.to_excel("result.xlsx", index=False)
The resulting document:
That covers how to convert Word documents into Excel tables in Python. I hope the content above is of some help.