Friday, 12 May 2017

How to work with Microsoft Word files in Python





  • Python can create and modify Word documents, which have the .docx file extension.
  • To work with Word documents, install the python-docx module:
  • pip install python-docx
  • After installing the module, import it with import docx, not import python-docx.
  • If you don’t have Word, LibreOffice Writer and OpenOffice Writer are both free alternative applications for Windows, OS X, and Linux that can be used to open .docx files.
  • Compared to plain text, .docx files have a lot of structure.
  • This structure is represented by three different data types in python-docx.
  • At the highest level, a Document object represents the entire document. 
  • The Document object contains a list of Paragraph objects for the paragraphs in the document.
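  • Each of these Paragraph objects contains a list of one or more Run objects. A run is a contiguous stretch of text that shares the same style, so a new run starts wherever the style changes.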


Reading Word Documents:

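# Enter these lines in the interactive shell; each expression displays its value.
import docx

doc = docx.Document('demo.docx')

len(doc.paragraphs)             # number of paragraphs in the document
doc.paragraphs[0].text          # text of the first paragraph
doc.paragraphs[1].text          # text of the second paragraph

len(doc.paragraphs[1].runs)     # number of runs in the second paragraph
doc.paragraphs[1].runs[0].text  # text of each run in turn
doc.paragraphs[1].runs[1].text
doc.paragraphs[1].runs[2].text
doc.paragraphs[1].runs[3].text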

Getting the Full Text from a .docx File:

readDocx.py

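import docx

def getText(filename):
    # Open the document and collect the text of every paragraph.
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    # Join the paragraphs with newlines into one string.
    return '\n'.join(fullText)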

  • The readDocx.py program can be imported like any other module.
  • Now if you just need the text from a Word document, you can enter the following:
import readDocx

print(readDocx.getText('demo.docx'))

Writing Word Documents:

import docx

doc = docx.Document()                # start a new, blank document
doc.add_paragraph('Hello world!')    # append a paragraph
doc.save('helloworld.docx')          # write it to disk


Create a Word document:

from docx import Document

d = Document()

d.add_heading('Hamlet')

d.add_heading('dramatis personae', 2)

d.add_paragraph('Hamlet, the Prince of Denmark')

d.save('hamlet.docx')

Read a Word document:

from docx import Document

document = Document('hamlet.docx')

# Print the text of every paragraph, headings included.
for para in document.paragraphs:
    print(para.text)


