Parsing XML tags and attributes with JavaScript

mmalmmizal 2024. 7. 17. 12:10

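The role of a compiler

A compiler translates source code into machine code, and the process runs through parsing -> optimization -> code generation -> linking. In the parsing stage, the source code passes through the lexical analysis programs below, one after another.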

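Lexical analysis programs

tokenizer : splits the character stream into tokens, that is, into basic lexical elements (meaningful units such as words, phrases, and strings)

Common token types

identifier : a name used to identify something
keyword : a predefined reserved word
separator : a character that separates other tokens
operator : a symbol that performs an operation
literal : a numeric, boolean, or string value
comment : a line or block comment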

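lexer : further analyzes the tokens in the token stream and classifies them into token categories according to the grammar rules

(Lexical analysis : the process of passing the tokens through the lexer and analyzing what each result means)

parser : analyzes the token sequence produced by the lexer and builds an abstract syntax tree (AST) or another internal representation

(Syntax analysis : checks that the data is valid while representing the lexically analyzed data as a structure)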

// Input
const word = "hello world!";

//Tokenizer
["const", "word", "=", "hello world!", ";"] 

//Lexer 
[
  { type: "Keyword", value: "const" },
  { type: "Identifier", value: "word" },
  { type: "Operator", value: "=" },
  { type: "String", value: "hello world!" },
  { type: "Punctuation", value: ";" }
]

//Parser
{
  type: "VariableDeclaration",
  kind: "const",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "word" },
      init: { type: "Literal", value: "hello world!", raw: "\"hello world!\"" }
    }
  ]
}
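
The tokenizer and lexer steps above can be sketched in a few lines. This is only a minimal illustration that handles the single statement from the example; the names `tokenize`, `lex`, and `KEYWORDS` are hypothetical and not part of any library. Unlike the tokenizer output shown above, this sketch keeps the quotes on the string lexeme so that the lexer can recognize it as a string.

```javascript
// Hypothetical helpers for illustration only; they cover just this tiny subset of JavaScript.
const KEYWORDS = new Set(["const", "let", "var"]);

// Tokenizer: split the character stream into raw lexemes.
function tokenize(source) {
  // Match a double-quoted string, an identifier or keyword, or one of the symbols = and ;
  const pattern = /"[^"]*"|[A-Za-z_$][\w$]*|[=;]/g;
  return source.match(pattern) ?? [];
}

// Lexer: attach a category to each lexeme.
function lex(lexemes) {
  return lexemes.map((lexeme) => {
    if (KEYWORDS.has(lexeme)) return { type: "Keyword", value: lexeme };
    if (lexeme.startsWith('"')) return { type: "String", value: lexeme.slice(1, -1) };
    if (lexeme === "=") return { type: "Operator", value: lexeme };
    if (lexeme === ";") return { type: "Punctuation", value: lexeme };
    return { type: "Identifier", value: lexeme };
  });
}

const tokens = lex(tokenize('const word = "hello world!";'));
console.log(tokens);
```

The printed array has the same five entries as the Lexer result above.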

AST (Abstract Syntax Tree)

A data structure that represents the analyzed syntax as a tree.

The result produced by the parser is generated in AST form.
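
Because an AST is just a tree of nested nodes, it can be processed with an ordinary recursive traversal. The sketch below walks the AST from the `const word` example and collects each node's type; `walk` is a hypothetical helper, and treating "any object with a string `type` property" as a node is an assumption made for this example.

```javascript
// The AST from the example above, written as a plain object.
const ast = {
  type: "VariableDeclaration",
  kind: "const",
  declarations: [
    {
      type: "VariableDeclarator",
      id: { type: "Identifier", name: "word" },
      init: { type: "Literal", value: "hello world!", raw: '"hello world!"' },
    },
  ],
};

// Hypothetical helper: visit every node depth-first, parents before children.
function walk(node, visit) {
  if (Array.isArray(node)) {
    node.forEach((child) => walk(child, visit));
  } else if (node !== null && typeof node === "object") {
    if (typeof node.type === "string") visit(node);
    Object.values(node).forEach((child) => walk(child, visit));
  }
}

const nodeTypes = [];
walk(ast, (node) => nodeTypes.push(node.type));
console.log(nodeTypes);
// [ 'VariableDeclaration', 'VariableDeclarator', 'Identifier', 'Literal' ]
```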

Parsing an XML file

Parsing uses the following classes.

class Element {
  constructor(tagName, attributes = []) {
    this.tagName = tagName; // tag name
    this.attributes = attributes; // list of attributes
  }
}

class Attribute {
  constructor(name, value) {
    this.name = name; // attribute name
    this.value = value; // attribute value
  }
}

class Data {
  constructor(prolog = [], elements = []) {
    this.prolog = prolog;
    this.elements = elements;
  }
}

1. Tokenization & Lexing

Regular expressions split the input into tokens, and the prolog and tag information is then converted into higher-level structures.

  const prologPattern = /<\?xml[^>]+\?>/;
  const tagPattern = /<(\w+)([^>]*)>/g;
  const attrPattern = /(\w+)=["']([^"']+)["']/g;
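
To see what each pattern captures, the sketch below runs the three expressions against a small input. The `sample` string is a made-up input for illustration. Note that `tagPattern` skips closing tags such as `</to>`, because `/` is not matched by `\w`, and that `attrPattern` does not match empty attribute values, because `[^"']+` requires at least one character.

```javascript
const prologPattern = /<\?xml[^>]+\?>/;
const tagPattern = /<(\w+)([^>]*)>/g;
const attrPattern = /(\w+)=["']([^"']+)["']/g;

// Hypothetical sample input.
const sample = '<?xml version="1.0" encoding="UTF-8"?><note id="1"><to>Kim</to></note>';

// The prolog pattern matches the whole XML declaration.
const prologMatch = sample.match(prologPattern);
console.log(prologMatch[0]); // <?xml version="1.0" encoding="UTF-8"?>

// Group 1 is the tag name, group 2 is the raw attribute string.
const tags = [...sample.matchAll(tagPattern)].map((m) => ({ name: m[1], rest: m[2].trim() }));
console.log(tags); // [ { name: 'note', rest: 'id="1"' }, { name: 'to', rest: '' } ]

// Group 1 is the attribute name, group 2 is the value without quotes.
const attrs = [...'id="1"'.matchAll(attrPattern)].map((m) => [m[1], m[2]]);
console.log(attrs); // [ [ 'id', '1' ] ]
```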

2. Parsing

Prolog

<?xml version="1.0" encoding="UTF-8"?>

The XML prolog holds the XML declaration, processing instructions, comments, and the document type declaration.

If the input contains an XML prolog, its attributes are extracted as objects.

  "prolog": [
    {
      "name": "version",
      "value": "1.0"
    },
    {
      "name": "encoding",
      "value": "UTF-8"
    }
  ]
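
The post does not show the prolog extraction code itself, so the following is only one possible sketch of it. It reuses `prologPattern`, `attrPattern`, and the `Attribute` class from above; the function name `parseProlog` is hypothetical.

```javascript
class Attribute {
  constructor(name, value) {
    this.name = name;
    this.value = value;
  }
}

const prologPattern = /<\?xml[^>]+\?>/;
const attrPattern = /(\w+)=["']([^"']+)["']/g;

// Hypothetical helper: return the prolog's attributes, or [] when there is no prolog.
function parseProlog(input) {
  const prologMatch = input.match(prologPattern);
  if (prologMatch === null) return [];
  // matchAll works on a copy of the regex, so attrPattern.lastIndex is left untouched.
  return [...prologMatch[0].matchAll(attrPattern)].map((m) => new Attribute(m[1], m[2]));
}

const prolog = parseProlog('<?xml version="1.0" encoding="UTF-8"?><root/>');
console.log(JSON.stringify(prolog, null, 2));
```

For this input the printed array contains the same `version` and `encoding` entries as the result above; an input without a declaration yields an empty array.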

Tags & attributes

  let match; // declared explicitly so the loop does not create an implicit global
  while ((match = tagPattern.exec(input)) !== null) {
    const tagName = match[1];
    const attributesString = match[2].trim();

    const attributes = [];

    let attrMatch;
    while ((attrMatch = attrPattern.exec(attributesString)) !== null) {
      attributes.push(new Attribute(attrMatch[1], attrMatch[2]));
    }

    data.elements.push(new Element(tagName, attributes));
  }

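Each call to exec matches tagPattern against the input and yields the tag name and the raw attribute string.

The attribute string is then matched against attrPattern, and each match is stored in the attributes array as a new Attribute object.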
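
Putting the classes, the patterns, and the loop together gives a self-contained version that can be run directly. This is a sketch: the wrapper function `parseTags` and the `input` string are assumptions, since the post does not show the original input.

```javascript
class Element {
  constructor(tagName, attributes = []) {
    this.tagName = tagName;
    this.attributes = attributes;
  }
}

class Attribute {
  constructor(name, value) {
    this.name = name;
    this.value = value;
  }
}

class Data {
  constructor(prolog = [], elements = []) {
    this.prolog = prolog;
    this.elements = elements;
  }
}

// Hypothetical wrapper around the loop shown above.
function parseTags(input) {
  const tagPattern = /<(\w+)([^>]*)>/g;
  const attrPattern = /(\w+)=["']([^"']+)["']/g;
  const data = new Data();

  let match;
  while ((match = tagPattern.exec(input)) !== null) {
    const attributes = [];
    let attrMatch;
    // exec returns null at the end of the string, which also resets lastIndex to 0.
    while ((attrMatch = attrPattern.exec(match[2].trim())) !== null) {
      attributes.push(new Attribute(attrMatch[1], attrMatch[2]));
    }
    data.elements.push(new Element(match[1], attributes));
  }
  return data;
}

// Hypothetical input; closing tags are ignored by tagPattern.
const input = '<HTML lang="ko"><BODY><FONT name="Seoul">text</FONT></BODY></HTML>';
const result = parseTags(input);
console.log(result.elements.map((element) => element.tagName)); // [ 'HTML', 'BODY', 'FONT' ]
```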

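Final parsed result

JSON.stringify() prints the result as a JSON-formatted string.

The replacer argument is null (nothing is filtered), and the space argument is 2, so each nesting level is indented by two spaces.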

function displayJSON() {
  console.log(JSON.stringify(data, null, 2));
}

{
  "prolog": [],
  "elements": [
    {
      "tagName": "HTML",
      "attributes": [
        {
          "name": "lang",
          "value": "ko"
        }
      ]
    },
    {
      "tagName": "BODY",
      "attributes": []
    },
    {
      "tagName": "FONT",
      "attributes": [
        {
          "name": "name",
          "value": "Seoul"
        }
      ]
    }
  ]
}
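
For comparison, here is a small sketch of how the second and third arguments of JSON.stringify change the output. The `element` object is made-up sample data in the same shape as the result above.

```javascript
// Sample data for illustration only.
const element = { tagName: "FONT", attributes: [{ name: "name", value: "Seoul" }] };

// No replacer and no space argument: everything is printed on one line.
const compact = JSON.stringify(element);
console.log(compact);

// Array replacer: an allowlist of property names to keep.
const tagOnly = JSON.stringify(element, ["tagName"], 2);
console.log(tagOnly);

// Function replacer: returning undefined drops that property.
const withoutAttributes = JSON.stringify(
  element,
  (key, value) => (key === "attributes" ? undefined : value),
  2
);
console.log(withoutAttributes);
```

Both replacer variants print only the `tagName` property, indented by two spaces.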

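XML and JSON: how each represents data

XML (Extensible Markup Language): a markup language for exchanging and storing data

Basic structure:

<tag attribute="value">content</tag>
- Consists of tags, attributes, and content
DTD and XML Schema: define the structure and data types of an XML document
- DTD (Document Type Definition): the older way of defining the structure of an XML document
- XML Schema: defines the structure and data types of an XML document in finer detail
XPath and XQuery: languages for selecting and querying XML data
- XPath: selects specific elements and attributes within an XML document
- XQuery: runs complex queries against an XML database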

JSON (JavaScript Object Notation): a lightweight data interchange format that is easy to read and makes data structures easy to express

Basic structure:

{ "key": "value", "array": [1, 2, 3], "object": { "nested_key": "nested_value" } }
Data types:
- String: "Hello"
- Number: 123
- Object: { "key": "value" }
- Array: [1, 2, 3]
- Boolean: true, false
- null: null
Advantages:
- Lightweight: simple, with little syntactic overhead
- Readable: easy for people to read and write
- Compatible: supported by most programming languages
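
The data types listed above map directly onto JavaScript values when a string is parsed with JSON.parse. The following sketch classifies each top-level value of a sample document; the `text` string and the helper `kindOf` are illustrative assumptions.

```javascript
// Sample JSON text covering five of the six JSON data types.
const text =
  '{ "key": "value", "array": [1, 2, 3], "object": { "nested_key": "nested_value" }, "flag": true, "nothing": null }';

const parsed = JSON.parse(text);

// Hypothetical helper: typeof reports "object" for arrays and null, so check those first.
function kindOf(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  return typeof value;
}

const kinds = Object.fromEntries(
  Object.entries(parsed).map(([key, value]) => [key, kindOf(value)])
);
console.log(kinds);
// { key: 'string', array: 'array', object: 'object', flag: 'boolean', nothing: 'null' }

// Values with no JSON representation, such as undefined, are dropped when serializing.
console.log(JSON.stringify({ kept: 1, dropped: undefined })); // {"kept":1}
```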

References

https://www.quora.com/Whats-the-difference-between-a-tokenizer-lexer-and-parser (What's the difference between a tokenizer, lexer and parser)

https://trumanfromkorea.tistory.com/79 (tokenizer, lexer, parser)